Pydantic

Pydantic, a popular Python library with over 70 million downloads per month, offers a robust and developer-friendly way to structure your prompts and validate LLM output. Here’s how it works (a minimal sketch follows the list below):

  • Define data models with type hints: Pydantic allows you to create classes that represent the structure of your desired output using Python type hints.

  • Automatic validation and parsing: When you pass data to a Pydantic model, it automatically validates the data against the defined types and converts it to the appropriate Python objects.

  • Seamless integration with OpenAI function calling: Pydantic models can be easily converted to JSON schema, which can be used with OpenAI function calling to ensure that the LLM output conforms to your expectations.
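
A minimal sketch of these three points, assuming Pydantic v2; the Character schema and the raw strings are made up for illustration:

from pydantic import BaseModel, ValidationError

class Character(BaseModel):
    name: str
    age: int

# Raw text returned by an LLM (e.g. the arguments of a function call).
raw = '{"name": "John", "age": 30}'

character = Character.model_validate_json(raw)  # parse and validate in one step
print(character.age + 1)                        # fields are real Python types, not strings

print(Character.model_json_schema())            # the JSON schema to hand to function calling

try:
    Character.model_validate_json('{"name": "Paul", "age": "unknown"}')
except ValidationError as exc:
    print(exc)                                  # clear report of which field failed and why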

Instructor and Marvin: Simplifying Pydantic Integration

Libraries like Instructor and Marvin make it even easier to use Pydantic for structured prompting with LLMs:

  • Instructor focuses on OpenAI function calling and provides a simple way to define Pydantic models as response models for your API calls (see the sketch after this list).

  • Marvin is a more comprehensive framework that supports multiple LLMs and offers additional features, including prompt management and evaluation.
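
A minimal sketch of the Instructor pattern, assuming recent versions of the instructor and openai packages; the model name, prompt, and Character schema are placeholders:

import instructor
from openai import OpenAI
from pydantic import BaseModel

class Character(BaseModel):
    name: str
    age: int

# Patch the OpenAI client so chat completions accept a response_model.
client = instructor.from_openai(OpenAI())

character = client.chat.completions.create(
    model="gpt-4o",
    response_model=Character,  # Instructor parses and validates the reply into this model
    messages=[{"role": "user", "content": "Invent a fantasy character."}],
)
print(character.name, character.age)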

Outlines

Another approach is to generate structured JSON using regular expressions (regex) and finite state machines (FSMs). The Outlines library uses this technique both to guarantee schema-valid output and to make Large Language Model (LLM) inference faster. The steps below describe how it works; a toy sketch of steps 1 through 3 follows the list.

  • Step 1: Convert JSON Schema to a Regular Expression:
    • First, the JSON Schema is translated into a regular expression.
    • If a string generated by the LLM matches this regex, it’s valid according to the schema and can be parsed.
    • Example: The JSON schema for a character with a name and age is converted to the regex \{"name":("John"|"Paul"),"age":(20|30)\}.
  • Step 2: Translate the Regex into a Finite State Machine:
    • Regular expressions can be represented as FSMs.
    • Libraries like interegular can perform this translation.
    • The FSM represents all possible valid JSON strings that conform to the schema.
  • Step 3: Generate JSON from the FSM:
    • Starting from the initial state of the FSM, the algorithm generates one allowed character at random and transitions to the next state.
    • This process repeats until a final state is reached.
    • The generated string is guaranteed to be valid JSON.
  • Step 4: Token-Based FSM:
    • LLMs work with tokens, not individual characters.
    • The character-based FSM is transformed into a token-based FSM.
    • This is done by mapping FSM states to allowed token transitions.
  • Step 5: Coalescence (Optimization):
    • Tokenizers often create redundant paths in the FSM where different token sequences lead to the same output string.
    • Coalescence exploits this redundancy by merging these paths.
    • Instead of sampling one token at a time, the algorithm can append an entire forced substring in a single step, significantly speeding up generation.
    • Example: Instead of generating “n”, “a”, “m”, “e” separately, the algorithm can directly append “name”.
    • This can lead to a 5x speedup compared to traditional structured generation.
  • Considerations:
    • While coalescence improves speed, different token paths can lead to different LLM states and affect the probability distribution of subsequent tokens.
    • Care must be taken to avoid excluding more likely sequences during optimization.
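
A toy, character-level illustration of steps 1 through 3. The FSM below is hand-built for the example regex rather than derived from a JSON Schema with a library such as interegular, so it is a sketch of the idea, not Outlines’ actual implementation:

import random

# Hand-built character-level FSM for the example regex
# \{"name":("John"|"Paul"),"age":(20|30)\}
transitions = {}   # state -> {allowed next character: next state}
finals = set()     # accepting states
fresh = [1]        # counter for allocating new state ids (state 0 is the start)

def add_string(start, text):
    """Thread text through the FSM, reusing shared prefixes (a simple trie)."""
    state = start
    for ch in text:
        edges = transitions.setdefault(state, {})
        if ch not in edges:
            edges[ch] = fresh[0]
            fresh[0] += 1
        state = edges[ch]
    return state

# Steps 1-2: add every string the schema allows (2 names x 2 ages).
for name in ("John", "Paul"):
    for age in (20, 30):
        finals.add(add_string(0, f'{{"name":"{name}","age":{age}}}'))

# Step 3: walk the FSM, picking one allowed character at random per step.
state, out = 0, []
while state not in finals:
    ch, state = random.choice(list(transitions[state].items()))
    out.append(ch)

print("".join(out))   # always schema-valid, e.g. {"name":"Paul","age":30}

States with exactly one outgoing edge (most of the string above) are the stretches that coalescence appends in one step instead of sampling token by token.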

Tokenizers: A tokenizer is a fundamental component in Natural Language Processing (NLP) that breaks down text into smaller units called tokens. These tokens can be words, subwords, or even characters, depending on the tokenizer’s design. Tokenizers are essential for LLMs, as they convert text into a numerical representation that the model can process.
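
For example, with OpenAI’s tiktoken library (a sketch; the exact token IDs depend on the encoding):

import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("Hello, structured generation!")
print(ids)                              # a list of integer token ids
print([enc.decode([i]) for i in ids])   # the text fragment behind each id
print(enc.decode(ids))                  # round-trips back to the original text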

Logit Generators: LLMs, at their core, are probabilistic models. They don’t deterministically produce a single output but instead assign probabilities to different possible next tokens. These probabilities are represented as logits, which are the raw, unnormalized outputs from the model. A logit generator is the part of the LLM that calculates these logits, reflecting the model’s assessment of how likely each token is to appear next, given the preceding context.
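
In code, turning logits into a probability distribution is just a softmax; the logit values below are made up for illustration:

import math

# Hypothetical raw logits for three candidate next tokens.
logits = {"Hello": 2.1, "Hi": 1.1, "Hey": -0.3}

# Softmax: exponentiate and normalize so the values sum to 1.
total = sum(math.exp(v) for v in logits.values())
probs = {tok: math.exp(v) / total for tok, v in logits.items()}

print(probs)  # roughly {'Hello': 0.69, 'Hi': 0.25, 'Hey': 0.06}

A log probability (logprob), as returned by the OpenAI API below, is simply the natural log of such a probability.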

OpenAI exposes these log probabilities, along with a logit_bias parameter, through its chat completions API:

import OpenAI from "openai";

const openai = new OpenAI();

async function main() {
  const completion = await openai.chat.completions.create({
    messages: [{ role: "user", content: "Hello!" }],
    model: "gpt-4o",
    logprobs: true,                          // return the log probability of each output token
    top_logprobs: 2,                         // also return the 2 most likely alternatives per position
    logit_bias: { 2435: -100, 640: -100 },   // effectively block these two token ids
  });

  console.log(completion.choices[0]);
}

main();

// output (truncated)
{
.....
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today?"
      },
      "logprobs": {
        "content": [
          {
            "token": "Hello",
            "logprob": -0.31725305,
            "bytes": [72, 101, 108, 108, 111],
            "top_logprobs": [
              {
                "token": "Hello",
                "logprob": -0.31725305,
                "bytes": [72, 101, 108, 108, 111]
              },
              {
                "token": "Hi",
                "logprob": -1.3190403,
                "bytes": [72, 105]
              }
            ]
          },
    ..............     
}
  • We can use the logit_bias parameter to increase or decrease the likelihood of specific tokens appearing in the model’s output.
  • It accepts a JSON object that maps tokens (specified by their token ID in the GPT tokenizer) to an associated bias value from -100 to 100.
  • A bias value of -100 will likely block the token from being generated.
  • With the chat completions API, logprobs is a boolean: when true, the response includes the log probability of each chosen output token, as in the output above. top_logprobs (an integer from 0 to 20, requiring logprobs: true) controls how many of the most likely alternative tokens are returned at each position; top_logprobs: 2 in the example returns the two most likely candidates (a Python sketch of putting this together follows this list).
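
A sketch of the same idea in Python, assuming a tiktoken version that ships the o200k_base encoding used by gpt-4o; the word being suppressed is arbitrary:

import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("o200k_base")   # assumed to be the gpt-4o encoding

# Look up the id of the token we want to suppress. A word can map to several
# ids depending on leading whitespace and casing.
banned_id = enc.encode("Hello")[0]

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    logprobs=True,                  # include the logprob of each chosen token
    top_logprobs=2,                 # plus the 2 most likely alternatives
    logit_bias={banned_id: -100},   # -100 effectively bans the token
)
print(completion.choices[0].message.content)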

Jina AI

Resources