Welcome back to part 2 of my series, Personal Deep Dive: My Notes and Insights. If you found the first post useful, you’re in for more today. If you missed it, don’t worry — you can catch up here. Now, let’s get into the fascinating world of Artificial Intelligence.

Structured Output

Why Structured Output Matters for AI

When working with LLMs, one of the key considerations is how the model generates and outputs information. For any AI-driven system to be effective, especially when integrated into automation workflows, structured output becomes an absolute necessity. You might wonder, “Why is structure so important?” Well, let’s break it down.

In simple terms, structured output refers to the way AI organizes and formats its responses. Whether you’re using an LLM for data analysis, automation, or function calls, having outputs that follow a clear, predictable format ensures the information can be properly parsed and used by other systems. Without this structure, AI might generate responses that are too vague, incomplete, or incompatible with your application.

For instance, if you’re building a chatbot to handle customer service inquiries, the AI needs to return responses in a structured format — like JSON — so the system can easily integrate with the backend and trigger the appropriate actions (e.g., escalating a query or updating a customer record).

But how do we get this structured output? There are several ways, and I’ll walk you through the most popular and effective approaches.

Leveraging Function Tools for Structured Output

One of the most powerful capabilities of modern LLMs is function calling. This feature allows the model to recognize when specific functions should be invoked based on the user’s prompt. Function calling isn’t just about answering questions; it’s about crafting outputs that match the specific input requirements of an external system or workflow. For example, many LLMs, including GPT-3.5-turbo, GPT-4, Gemini-1.0, and Claude-3, offer function calling support, which allows them to generate responses as structured JSON.
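
To make this concrete, here's a minimal sketch using the OpenAI Python SDK (v1.x); the `create_ticket` tool and its schema are hypothetical examples I made up for illustration, not part of any library.

```python
# Minimal function-calling sketch (OpenAI Python SDK v1.x). The `create_ticket`
# tool below is a hypothetical example schema, not a real API.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "create_ticket",
        "description": "Create a customer-support ticket",
        "parameters": {
            "type": "object",
            "properties": {
                "customer_name": {"type": "string"},
                "issue": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "medium", "high"]},
            },
            "required": ["customer_name", "issue"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-turbo",  # any model with tool-calling support
    messages=[{"role": "user", "content": "Hi, I'm Asha and my last invoice is wrong."}],
    tools=tools,
)

# The model returns the arguments as a JSON string that matches the schema above.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```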

Using JSON Mode for Precise Output

If you’re using OpenAI’s API, there’s a handy feature called JSON mode that guarantees the model returns valid JSON. It doesn’t enforce a particular schema on its own, but it makes the response much easier to parse and integrate with other systems, as long as you prompt for the fields you want. You can read more about it here.
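
Here's a short sketch with the OpenAI Python SDK (v1.x); note that the prompt itself must mention JSON or the API rejects the request, and the field names below are just an illustration.

```python
# JSON mode sketch (OpenAI Python SDK v1.x). The keys requested in the system
# prompt are illustrative; JSON mode only guarantees syntactically valid JSON.
import json
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo",  # JSON mode needs a model that supports response_format
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Reply in JSON with the keys 'sentiment' and 'summary'."},
        {"role": "user", "content": "The delivery was late but support was helpful."},
    ],
)

data = json.loads(response.choices[0].message.content)
print(data["sentiment"], data["summary"])
```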

The Role of Pydantic in Structuring LLM Outputs

Now, let’s dive into some more technical territory. If you’re a developer, you might want to go a step further by validating the structure of the AI-generated output. This is where Pydantic, a Python library, comes in handy.

Pydantic helps you define clear data models with type hints. These models ensure that the data returned from the LLM is consistent and conforms to the expected structure. You can learn more about it here.
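
For example, here's a minimal sketch (Pydantic v2 API) that validates a raw LLM response against an illustrative `SupportTicket` model:

```python
# Validate a raw LLM response against a Pydantic model (v2 API).
from pydantic import BaseModel, ValidationError

class SupportTicket(BaseModel):
    customer_name: str
    issue: str
    priority: str = "medium"  # default used when the model omits the field

raw_output = '{"customer_name": "Asha", "issue": "Wrong invoice amount"}'  # from the LLM

try:
    ticket = SupportTicket.model_validate_json(raw_output)
    print(ticket.customer_name, ticket.priority)
except ValidationError as err:
    # In practice you might feed the error back to the model and re-prompt.
    print("LLM output did not match the schema:", err)
```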

Instructor: A User-Friendly Solution for Structured Data

For those looking for an even easier way to manage structured outputs, the Instructor library is a fantastic tool. Instructor is built on top of Pydantic and aims to simplify the process of getting structured data like JSON from LLMs.

With Instructor, you don’t have to manually manage data validation or retry logic. It automatically handles retries (using the Tenacity package) and can stream responses in real time. Plus, it integrates with many LLMs, including GPT-4, GPT-3.5, and open-source models like Mistral and Llama.
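
A rough sketch of the workflow, assuming the `instructor` and `openai` packages are installed (the `UserInfo` model is illustrative):

```python
# Instructor patches the OpenAI client so you can pass a Pydantic response_model.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class UserInfo(BaseModel):
    name: str
    age: int

client = instructor.from_openai(OpenAI())

user = client.chat.completions.create(
    model="gpt-4-turbo",
    response_model=UserInfo,  # Instructor validates (and retries) against this model
    messages=[{"role": "user", "content": "John Doe is 30 years old."}],
)
print(user.name, user.age)  # -> John Doe 30
```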

Advanced Techniques: Outlines and XGrammar

For those diving deeper into structured output, there are more advanced tools worth exploring.

One such technique is the use of Outlines, XGrammar, and BAML for efficient structured generation. Outlines uses regular expressions (regex) and finite state machines (FSMs) to generate structured JSON, making it ideal for high-performance LLM inference. XGrammar, an open-source library, takes this a step further by providing a portable backend that integrates seamlessly with LLMs, optimizing structured generation and reducing inference time.
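
As a rough idea of what Outlines looks like in practice (its API has shifted between releases, so treat this as indicative rather than exact):

```python
# Outlines builds a finite state machine from the schema and constrains decoding to it.
import outlines
from pydantic import BaseModel

class Character(BaseModel):
    name: str
    strength: int

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
generator = outlines.generate.json(model, Character)
character = generator("Invent a fantasy hero with a name and a strength score.")
print(character.name, character.strength)
```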

Below are some resources to learn more about structured output.

LLM Wrapper

While LLMs are incredibly powerful, the process of interacting with them, from making requests to handling their responses, can sometimes be cumbersome, especially if you’re not deeply familiar with the underlying technicalities.

This is where LLM wrappers come in. An LLM wrapper is essentially a library or framework that abstracts away the repetitive, lower-level details involved in interacting with an LLM. Think of it as a bridge that makes it easier and more efficient to communicate with these models by providing pre-built functionality and common abstractions. With a wrapper, you can focus on the big picture, while the wrapper handles the technical nuances.
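
To make the idea concrete, here is a tiny hand-rolled sketch of such a wrapper: one function that hides retries, backoff, and JSON parsing behind a single call (the model name and retry policy are arbitrary choices for illustration).

```python
# A deliberately small "wrapper": retries, backoff, and JSON parsing in one place.
import json
import time
from openai import OpenAI

client = OpenAI()

def ask_json(prompt: str, retries: int = 3) -> dict:
    """Ask the model for JSON and retry on transient failures or unparsable output."""
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4-turbo",
                response_format={"type": "json_object"},
                messages=[{"role": "user", "content": prompt + " Respond in JSON."}],
            )
            return json.loads(response.choices[0].message.content)
        except Exception:  # broad catch for brevity; narrow this in real code
            time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError("Model did not return valid JSON after retries")
```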

There are several great libraries available, each with its own strengths and use cases. Based on my experience, here are a few of the most popular ones:

| Library | Method | Description |
| --- | --- | --- |
| langchain | Prompting & function calling | Pydantic output parser as part of langchain |
| llama_index | Prompting & function calling | Pydantic program as part of llama_index |
| guidance | Constrained token sampling | Programming paradigm for constrained generation |
| outlines | Constrained token sampling | Constrained token sampling using CFGs |
| instructor | Function calling | Specify Pydantic models to define the structure of LLM outputs |
| marvin | Function calling | Toolbox of task-specific OpenAI API wrappers |
| spacy-llm | Prompting | spaCy plugin to add LLM responses to a pipeline |
| fructose | Function calling | LLM calls as strongly-typed functions |
| mirascope | Function calling | Prompting, chaining and structured information extraction |
| texttunnel | Function calling | Efficient async OpenAI API function calling |

AI Agents

After the advent of RAG (Retrieval-Augmented Generation), the spotlight is now turning toward AI Agents. If you’ve been keeping an eye on AI trends, you may have noticed the growing buzz around agents — and for good reason. These intelligent systems are capable of autonomous decision-making and executing tasks based on given goals, making them a crucial advancement in AI.

An AI agent is essentially an autonomous system that can perform tasks and make decisions based on a set of goals and inputs, typically using machine learning models and algorithms. Unlike traditional AI systems that might only answer questions or perform a single, predefined task, AI agents are designed to execute sequences of tasks, interact with other systems, learn from their environment, and adapt to new information or situations.
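
To ground that description, here is a minimal sketch of a tool-calling agent loop using the OpenAI Python SDK (v1.x); the `get_weather` tool is a hypothetical stand-in for a real action, and real agents add planning, memory, and error handling on top of this cycle.

```python
# Decide -> act -> observe -> repeat: the core loop behind most tool-using agents.
import json
from openai import OpenAI

client = OpenAI()

def get_weather(city: str) -> str:
    # Hypothetical tool; a real agent would call an actual weather API here.
    return f"Sunny and 22 C in {city}"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def run_agent(goal: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4-turbo", messages=messages, tools=TOOLS
        )
        msg = response.choices[0].message
        if not msg.tool_calls:       # no tool requested: the agent considers itself done
            return msg.content
        messages.append(msg)         # keep the assistant's tool request in the transcript
        for call in msg.tool_calls:  # run each requested tool and report the observation
            args = json.loads(call.function.arguments)
            result = get_weather(**args)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    return "Stopped: step limit reached"

print(run_agent("Should I pack an umbrella for Paris today?"))
```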

Here are some resources to help you learn about AI agents:

  1. CS 194/294-196 (LLM Agents) - Lecture 1, Denny Zhou
  2. AI Agents’ Secret Sauce
  3. 16 Months of Building AI Agents in 60 Minutes
  4. 18 Months of Building Autonomous AI Agents in 42 Minutes
  5. Using Agents to Build an Agent Company: Joao Moura
  6. Agentic Design Patterns Part 1
  7. Multi AI Agent Systems with crewAI
  8. Agents in Llamaindex
  9. Memory in Agent Systems

Advanced

  • AI Agent Mastery: Agent Architectures
    Series on mastering AI agents, covering:

    • Agent architectures
    • Agent frameworks
    • Evaluating agents
    • Handling agents stuck in loops
  • LLM Agents Learning Platform
    Comprehensive resource for learning about LLM agents with various topics, such as:

    • Foundation of LLMs
    • Reasoning, planning, and tool use
    • LLM agent infrastructure
    • Retrieval-augmented generation
    • Code generation and data science
    • Multimodal agents and robotics
    • Evaluation and benchmarking
    • Privacy, safety, and ethics
    • Human-agent interaction and personalization
    • Multi-agent collaboration

Evaluating AI

Evaluating AI models is an essential step in the AI development lifecycle. Without a proper evaluation process, it’s difficult to determine whether the AI is performing as expected, or if there are issues that need to be addressed. This is particularly crucial for Large Language Models (LLMs), where the complexity and scale of tasks can lead to unexpected results or suboptimal performance. Evaluating the outputs, behaviors, and predictions of these models helps ensure that they are doing the job correctly and allows you to spot areas where improvements are necessary.

Below are some of the concepts I learned while studying AI evaluation.

Types of Metrics Used for Evaluating LLMs

To evaluate large language models effectively, several types of metrics are used. These metrics can be broadly classified into intrinsic, extrinsic, and hybrid categories, each serving a specific purpose.

1. Intrinsic Metrics

Intrinsic metrics evaluate the internal workings of the model, focusing on its quality and natural language generation capabilities. These metrics help assess how well the model is functioning from a linguistic or statistical perspective.

  • Perplexity: Measures how well the model predicts the next word in a sequence. A lower perplexity score means the model is more confident and accurate in its predictions (see the short worked example after this list).

  • Fluency: Measures the coherence and naturalness of the generated text. It’s essentially an assessment of how human-like the AI’s output sounds.

  • BLEU (Bilingual Evaluation Understudy) Score: Commonly used for translation tasks, this metric evaluates the similarity between the generated text and a reference text. It’s a way of quantifying how well the model’s output matches a “correct” answer.
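
To make perplexity concrete, here's a tiny calculation from per-token log-probabilities (the numbers are made up for illustration):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities; lower is better."""
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)

# A model that assigns probability 0.5 to every token has perplexity 2.
print(perplexity([math.log(0.5)] * 4))  # -> 2.0
```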

2. Extrinsic Metrics

Extrinsic metrics focus on evaluating how well the model performs specific tasks, often related to business goals or real-world applications.

  • Accuracy: Measures the proportion of correct predictions or answers made by the model, typically used for classification tasks.

  • F1 Score: A combination of precision and recall, this score is valuable for tasks with imbalanced data. It strikes a balance between avoiding false positives and minimizing false negatives (a small worked example follows this list).

  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score: Used for summarization tasks, ROUGE measures the overlap between the generated summary and the reference summary in terms of recall, precision, and F1 score.
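
As a small worked example, here are precision, recall, and F1 computed by hand on toy binary labels (in practice you'd usually reach for scikit-learn):

```python
def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f1_score([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # precision 2/3, recall 2/3 -> F1 0.667
```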

3. Hybrid Metrics

Hybrid metrics combine both intrinsic and extrinsic measures to provide a more comprehensive evaluation of the model’s performance.

  • METEOR: Used primarily for machine translation tasks, METEOR considers not just the exact matches between the generated and reference text, but also synonyms and word order. This makes it more flexible than BLEU in evaluating quality (see the NLTK sketch after this list).

  • GEVAL: Another hybrid metric that evaluates the overall quality of the text generation while considering both syntactical and semantic aspects.
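
If you want to try METEOR quickly, NLTK ships an implementation; this sketch assumes `nltk` is installed and the `wordnet` corpus has been downloaded, and the inputs must be pre-tokenized.

```python
from nltk.translate.meteor_score import meteor_score

reference = "the cat sat on the mat".split()
hypothesis = "a cat was sitting on the mat".split()
# Unlike BLEU, METEOR also credits stem and synonym matches, not just exact n-gram overlap.
print(meteor_score([reference], hypothesis))
```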

Libraries and Platforms for LLM Evaluation

There are several powerful libraries and platforms that can help you evaluate your LLMs effectively. These tools assist in different aspects of evaluation, from basic performance metrics to more advanced tasks like hallucination detection, bias analysis, and prompt evaluation.

Here’s a list of useful tools and libraries for evaluating LLMs (a minimal LLM-as-Judge sketch follows the table):

| Tool/Library | Description | Use Case |
| --- | --- | --- |
| DeepEval | An open-source LLM evaluation framework that simplifies the process of evaluating model performance. | General LLM evaluation framework for various tasks. |
| SelfCheckGPT | A zero-resource tool for detecting hallucinations in generative LLMs. | Hallucination detection in LLM-generated text. |
| LLM-as-Judge | Uses an LLM as the evaluator to assess another LLM’s outputs. | Evaluating one model using another model (meta-evaluation). |
| Agent-as-a-Judge | Evaluates agents with agents, focusing on multi-agent system performance. | Multi-agent evaluation for complex systems. |
| Chainpoll | A flexible technique for LLM-based evaluation, powered by the Galileo platform. | LLM-based evaluation across different domains and tasks. |
| Prometheus | A family of language models specialized in evaluating other LLMs. | Cross-evaluation of multiple LLMs using specialized models. |
| EvalLM | A tool that helps prompt designers evaluate and compare generated outputs on user-defined criteria. | Prompt optimization and evaluation of LLM outputs. |
| ChainForge | An open-source visual programming environment for evaluating LLM robustness and model performance. | Testing prompt robustness and model stability. |
| SPADE | System for Prompt Analysis and Delta-based Evaluation. | Analyzing prompt effectiveness and adjusting for optimal results. |
| Giskard | A platform that helps assess bias, performance, and security issues in AI models. | Bias and security analysis for AI models in production. |
| TruLens | A tool that allows users to evaluate, iterate faster, and select the best LLM application. | Iterative evaluation of LLMs in production environments. |
| Quotient AI | An AI development and evaluation platform combining research and real-world applications. | Prototyping and fast evaluation of AI products in real-world settings. |
| Inspect AI | A framework for large language model evaluations developed by the UK AI Safety Institute. | Ethical AI evaluation and performance monitoring. |
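
To give a feel for the LLM-as-Judge row above, here's a bare-bones sketch where one model grades another model's answer against an improvised rubric (OpenAI SDK v1.x; the scale and prompt wording are illustrative, not a standard):

```python
import json
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str) -> dict:
    prompt = (
        "You are a strict evaluator. Score the answer to the question from 1 to 5 "
        "for correctness and helpfulness, and explain briefly.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        'Reply in JSON like {"score": 4, "reason": "..."}.'
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

print(judge("What is the capital of France?", "Paris is the capital of France."))
```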

Observability

After evaluation, observability plays a crucial role in ensuring AI systems keep working as intended. Without it, you may miss key issues such as performance drops, biased outputs, or unexpected behaviors.

Below are some tools I have explored that might help you get started; a minimal homegrown tracing sketch follows the table.

| Tool | Features | LLM Integration | Pros & Cons |
| --- | --- | --- | --- |
| Helicone | Playground, Datasets, Rate Limiting, Webhooks, Alerts, Cache | OpenAI, Azure, Anthropic | Comprehensive set of features, ideal for various use cases |
| Langfuse | Prompt Management, Trace Logging, User Tracking, Evaluation | Model-agnostic for tracing/observability; supports OpenAI, Azure OpenAI, Anthropic, AWS Bedrock, and Google Vertex for playground and evals | Simple to self-host at production scale on any cloud (only OSS dependencies), which makes it more scalable than many of the other options |
| Phoenix | Model Tracing, Model Evaluation | - | Suitable for deep model analysis but lacks transparency in hosting and integration details |
| OpenLIT | Prompt Management, Model Tracing, Cost Tracking | OpenAI, Azure, Anthropic | Helm chart available for easy deployment; flexible integrations |
| Langtrace | Prompt Management, Model Tracing, Cost Tracking | OpenAI, Azure, Anthropic | Open-source with strong tracking features, but limited integration options |
| Lunary | Prompt Management, Model Tracing, Cost Tracking, Multi-User Support | OpenAI | Offers multi-user support but limited integration with other LLMs |
| Arize | Model Tracing, Evaluation, Retrieval (RAG) Analysis, Datasets, Fine-Tuning Export, Annotations, Human Feedback, Experiments, Embedding Analysis, Data Export | OpenAI, Azure, Google Cloud | Excellent for Retrieval-Augmented Generation (RAG), including playground and evaluation features, but requires enterprise subscription |
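
Before adopting any of these platforms, a homegrown trace can get you surprisingly far. Here's a minimal sketch of a decorator that records latency and token usage for each LLM call (OpenAI SDK v1.x); real tools add dashboards, cost tracking, and alerting on top of this idea.

```python
import functools
import time
from openai import OpenAI

client = OpenAI()

def traced(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        response = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        usage = getattr(response, "usage", None)
        tokens = usage.total_tokens if usage else "n/a"
        # In production you would ship this to a logging/observability backend.
        print(f"{fn.__name__}: {elapsed:.2f}s, {tokens} tokens")
        return response
    return wrapper

@traced
def ask(prompt: str):
    return client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )

ask("Summarize why observability matters for LLM apps in one sentence.")
```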

Resources

My Hackathon Experience

After diving into the world of Large Language Models (LLMs) and getting comfortable with their capabilities, I decided to take the next step and work on some projects. While looking for interesting project ideas, I discovered an exciting opportunity: the Open Source AI Challenge with pgai and Ollama, a hackathon organized by Timescale on dev.to.

For this challenge, I built an AI-Powered Chat App for Websites, YouTube, Files, and Dev.to Articles. You can read more about it in this article. Unfortunately, I didn’t win the hackathon, but the experience was incredibly valuable.

Next, I participated in another hackathon, this time organized by AssemblyAI. In this one, I developed a tool called “Boost Your Sales Calls with Free AI-Powered Analysis & Practice.” While I didn’t win this one either, you can check out more details about the project here.

Though I didn’t win either hackathon, I gained practical experience and learned a lot from both projects. The process of building and refining these applications has significantly deepened my understanding of AI and how it can be applied in real-world scenarios.

Looking ahead, I’m excited to continue building more cool AI-driven projects. If you have any interesting project ideas, feel free to leave a comment below—I’m always open to new suggestions.

Finally, I’ve compiled all the AI-related code I’ve worked on in a GitHub repository. While I’ve done my best to clean it up, be aware that it might still have a few bugs here and there! You can check it out here.

That’s all for today! My next goal is to dive into advanced topics like fine-tuning models, retaining context, and much more. Once I’ve learned those concepts, I’ll be sure to share my progress in a future blog post. Until then, happy learning!
