Eval
Types of LLM Evaluation Metrics
Intrinsic metrics: evaluate the quality of the model's language modeling and generated text directly, independent of any downstream task, such as perplexity and fluency.
- Perplexity: measures how well the model predicts a test dataset. Lower perplexity indicates better performance.
- Fluency: measures the coherence and naturalness of the generated text.
- BLEU (Bilingual Evaluation Understudy) Score: measures the n-gram overlap between the generated text and one or more reference texts.
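As a rough illustration of the perplexity bullet above, perplexity can be computed as the exponential of the average negative log-probability per token. The sketch below assumes you already have per-token natural-log probabilities from a model; the example values are made up.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the average negative log-probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical log-probabilities for a four-token test sequence.
confident_model = [math.log(0.5)] * 4   # each token predicted with p = 0.5
uncertain_model = [math.log(0.1)] * 4   # each token predicted with p = 0.1

print(perplexity(confident_model))  # lower perplexity, better predictions
print(perplexity(uncertain_model))  # higher perplexity, worse predictions
```

A model that assigns probability 0.5 to every token has a perplexity of about 2, which can be read as "as uncertain as a choice between two equally likely tokens".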
Extrinsic metrics: evaluate the model’s performance on specific tasks, such as question-answering and text classification.
- Accuracy: measures the proportion of correct predictions or answers.
- F1 Score: measures the balance between precision and recall.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score: measures the quality of generated summaries by their n-gram overlap with reference summaries.
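To make the accuracy and F1 bullets concrete, here is a minimal sketch for binary labels. The label lists are invented for illustration, and a real project would more likely use an existing metrics library.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = correct answer class, 0 = everything else.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

print(accuracy(y_true, y_pred))
print(f1_score(y_true, y_pred))
```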
Hybrid metrics: combine intrinsic and extrinsic metrics to provide a more comprehensive evaluation.
- METEOR (Metric for Evaluation of Translation with Explicit ORdering) Score: measures the similarity between generated and reference translations, taking into account the order of the words.
- G-Eval
G-Eval
G-Eval uses GPT-4 with a chain-of-thought (CoT) approach to generate detailed evaluation steps for NLG outputs, then scores the outputs against those steps.
How G-Eval Works
- Prompt: The evaluator LLM (GPT-4 in the paper) receives a task introduction and the evaluation criteria, for example coherence or consistency.
- Evaluation steps: The LLM generates a chain-of-thought list of detailed evaluation steps from those criteria.
- Scoring: The prompt, the generated steps, and the text to evaluate are passed to the LLM in a form-filling format, and it returns a rating for each criterion.
- Score computation: The final score weights each candidate rating by the probability the LLM assigns to it, which gives finer-grained scores than a single integer.
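The G-Eval paper describes the final score as a probability-weighted sum over the candidate ratings, rather than the single rating the model prints. A minimal sketch of that last step, assuming you have already obtained a probability for each rating token (the numbers below are invented):

```python
def weighted_score(rating_probs):
    """Expected rating: sum of rating * probability, normalised by total mass."""
    total = sum(rating_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(rating * p for rating, p in rating_probs.items()) / total


# Hypothetical probabilities the judge assigned to ratings 1 through 5.
probs = {1: 0.05, 2: 0.10, 3: 0.20, 4: 0.40, 5: 0.25}

print(weighted_score(probs))
```

Normalising by the total mass is a convenience for cases where only the top few token probabilities are available and they do not sum to exactly 1.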
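SelfCheckGPT
SelfCheckGPT detects hallucinations without an external reference: it samples several responses to the same prompt and checks whether they agree with the response under test. Variants:
- BERTScore: Compares each sentence of the generated text with the sampled responses using BERT embeddings.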
- Question-Answering (QA): Generates questions from the text and checks consistency in answers.
- N-gram Analysis: Uses statistical properties of n-grams for consistency checks.
- Natural Language Inference (NLI): Uses entailment and contradiction probabilities.
- LLM Prompting: Queries LLMs directly to check consistency.
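To give a feel for the n-gram variant, the sketch below builds a unigram model from the response plus a few sampled responses and scores each sentence by its average negative log-probability. This is a simplified stand-in written for illustration, not the SelfCheckGPT library's API; the tokenizer, smoothing, and example texts are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (an assumption; any tokenizer would do)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_inconsistency(sentence, response, samples):
    """Average negative log-probability of the sentence's tokens under a
    unigram model built from the response and the sampled responses.
    Higher values mean the sentence is less supported by the samples."""
    counts = Counter()
    for text in [response, *samples]:
        counts.update(tokenize(text))
    total = sum(counts.values())
    vocab = len(counts)
    tokens = tokenize(sentence)
    if not tokens:
        raise ValueError("sentence has no tokens")
    # Add-one smoothing keeps unseen tokens from producing log(0).
    nll = [-math.log((counts[t] + 1) / (total + vocab)) for t in tokens]
    return sum(nll) / len(nll)


# Hypothetical response under test and three sampled responses.
response = "Marie Curie won two Nobel Prizes. She was born in Berlin."
samples = [
    "Marie Curie won two Nobel Prizes and was born in Warsaw.",
    "Born in Warsaw, Marie Curie won two Nobel Prizes.",
    "Marie Curie, born in Warsaw, won two Nobel Prizes.",
]

supported = unigram_inconsistency("Marie Curie won two Nobel Prizes.", response, samples)
unsupported = unigram_inconsistency("She was born in Berlin.", response, samples)
print(supported, unsupported)
```

The sentence that the samples do not repeat gets the higher score, which is the signal the consistency check relies on.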
DeepEval
An open-source LLM evaluation framework that includes:
- G-Eval
- Summarization
- Answer Relevancy
- Faithfulness
- Contextual Recall
- Contextual Precision
- RAGAS
- Hallucination
- Toxicity
- Bias
- and more (see the DeepEval GitHub repository).
LLM-as-Judge
- Use pairwise comparisons: Instead of asking the LLM to score a single output on a Likert scale, present it with two options and ask it to select the better one. This tends to lead to more stable results.
- Control for position bias: The order of options presented can bias the LLM’s decision. To mitigate this, do each pairwise comparison twice, swapping the order of pairs each time. Just be sure to attribute wins to the right option after swapping!
- Allow for ties: In some cases, both options may be equally good. Thus, allow the LLM to declare a tie so it doesn’t have to arbitrarily pick a winner.
- Use Chain-of-Thought: Asking the LLM to explain its decision before giving a final answer can increase eval reliability. As a bonus, this lets you use a weaker but faster LLM and still achieve similar results. Because this part of the pipeline is typically run in batch, the extra latency from CoT isn’t a problem.
- Control for response length: LLMs tend to bias toward longer responses. To mitigate this, ensure response pairs are similar in length.
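The pairwise, position-swap, and tie rules above can be sketched as follows. The `judge` callable is a hypothetical wrapper around whatever LLM you use; it is assumed to return "first", "second", or "tie". The two stub judges exist only to make the example runnable.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> verdict.
Judge = Callable[[str, str, str], str]


def compare(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Run the comparison in both orders and map verdicts back to A and B."""
    first_pass = judge(question, answer_a, answer_b)    # A shown first
    second_pass = judge(question, answer_b, answer_a)   # B shown first
    verdict_1 = {"first": "A", "second": "B", "tie": "tie"}[first_pass]
    verdict_2 = {"first": "B", "second": "A", "tie": "tie"}[second_pass]
    # Count a win only when both orderings agree; otherwise call it a tie.
    return verdict_1 if verdict_1 == verdict_2 else "tie"


def always_first(question, first, second):
    """Stub judge with pure position bias."""
    return "first"


def prefers_longer(question, first, second):
    """Stub judge that picks the longer answer regardless of position."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


print(compare(always_first, "q", "short", "a longer answer"))
print(compare(prefers_longer, "q", "short", "a longer answer"))
```

Treating disagreement between the two orderings as a tie is one possible policy; another is to log such pairs for manual review.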
When asking an LLM for structured output, prefer YAML over JSON: it is less verbose and therefore consumes fewer tokens.
Metrics for N-Gram Matching
- BLEU: Compares the generated text with reference completions, scoring between 0 (no match) and 1 (perfect match).
- ROUGE-N: Measures n-gram overlap between generated text and references.
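A minimal sketch of n-gram matching, assuming whitespace tokenization and a single reference. It shows ROUGE-N recall and the clipped n-gram precision that BLEU builds on; full BLEU additionally combines several n-gram orders and applies a brevity penalty, which are omitted here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    if not ref:
        return 0.0
    overlap = sum((cand & ref).values())  # counts clipped to the minimum
    return overlap / sum(ref.values())


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that appear in the reference (BLEU-style)."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    if not cand:
        return 0.0
    overlap = sum((cand & ref).values())
    return overlap / sum(cand.values())


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"

print(rouge_n_recall(candidate, reference, n=1))
print(rouge_n_recall(candidate, reference, n=2))
print(clipped_precision(candidate, reference, n=1))
```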
Avoid asking the LLM judge for a bare score such as a rating out of 5: a number alone does not tell you what to do next. Ask instead for a critique that explains what is wrong and how the output could be improved. For more, see Creating a LLM-as-a-Judge That Drives Business Results.
Agent as Judge
Use an agent as the judge. Because the agent has access to tools, it can gather its own evidence, so it is expected to perform better than a plain LLM-as-judge.
Auto-Arena
Automating LLM Evaluations with Agent Peer-battles and Committee Discussions
The Auto-Arena framework consists of three stages: Question Generation, Multi-round Peer Battles, and Committee Discussions. The stages run sequentially and are fully simulated with LLM-powered agents that evaluate the responses.
ChainPoll
A High-Efficacy Method for LLM Hallucination Detection
The Correctness and Context Adherence metrics in the Galileo console are powered by ChainPoll-Correctness and ChainPoll-Adherence.
There is a ChainPoll-based metric for each type of hallucination:
- ChainPoll-Correctness uses ChainPoll to detect open-domain hallucination.
- ChainPoll-Adherence uses ChainPoll to detect closed-domain hallucination.
Steps
1. Ask gpt-3.5-turbo whether the completion contained hallucination(s), using a detailed and carefully engineered prompt.
2. Run step 1 multiple times, typically 5. (We use batch inference here for its speed and cost advantages.)
3. Divide the number of “yes” answers from step 2 by the total number of answers to produce a score between 0 and 1.
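The polling steps above can be sketched as follows. The `ask` callable is a hypothetical wrapper around the LLM call and is assumed to return only the final verdict text; with a chain-of-thought prompt you would first need to extract the verdict from the longer response. The canned judge is a stub so the example runs without a model.

```python
from typing import Callable


def chainpoll_score(ask: Callable[[str], str], prompt: str, n_polls: int = 5) -> float:
    """Ask the same question n_polls times and return the fraction of 'yes' verdicts."""
    if n_polls < 1:
        raise ValueError("n_polls must be at least 1")
    answers = [ask(prompt) for _ in range(n_polls)]
    yes_votes = sum(a.strip().lower().startswith("yes") for a in answers)
    return yes_votes / n_polls


def canned_judge(answers):
    """Stub standing in for the LLM: replays a fixed list of verdicts."""
    replies = iter(answers)
    return lambda prompt: next(replies)


# Hypothetical verdicts from five independent polls.
judge = canned_judge(["Yes", "no", "yes", "YES", "No"])
score = chainpoll_score(judge, "Does the completion contain hallucinations?")
print(score)
```

A score near 1 means most polls flagged a hallucination; where to set the alert threshold is a choice left to the user.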
I need you to verify the following statements for correctness using the ChainPoll method:
1. Break down the response into individual facts.
2. Verify each fact using reliable sources.
3. Identify any inconsistencies or errors.
4. Provide the correct information if any fact is incorrect.
Prometheus
Prometheus is a family of open-source language models specialized in evaluating other language models. By effectively simulating human judgments and proprietary LM-based evaluations, we aim to resolve the following issues:
- Fairness: Not relying on closed-source models for evaluations!
- Controllability: You don’t have to worry about GPT version updates or sending your private data to OpenAI by constructing internal evaluation pipelines.
- Affordability: If you already have GPUs, it is free to use!
Ragas
Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. RAG denotes a class of LLM applications that use external data to augment the LLM’s context.
EvalLM
Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria. It is a website where you can enter a prompt and evaluate its outputs using both predefined and user-defined criteria.
ChainForge
ChainForge is an open-source visual programming environment for prompt engineering, LLM evaluation, and experimentation.
SPADE
System for Prompt Analysis and Delta-Based Evaluation (SPADE): a method for automatically synthesizing data quality assertions that identify bad LLM outputs.
How it Works
- Prompt Tracking: Logs prompt changes over time.
- Prompt Changes Evaluation: Generates responses based on updated prompts.
- Automated Unit Test Generation: Creates unit tests for each prompt variation.
- Delta-Based Analysis: Compares outputs before and after prompt changes.
- Quality Assertion Creation: Forms assertions to detect bad outputs.
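As a rough illustration of the delta-to-assertion idea, the sketch below diffs two prompt versions with `difflib` and pairs the added instruction with a hand-written assertion. In SPADE the assertions are synthesized automatically; the manual mapping here, along with the prompts and the `assert_max_words` helper, is an assumption made to keep the example small.

```python
import difflib


def added_lines(old_prompt, new_prompt):
    """Return the lines that appear in the new prompt but not the old one."""
    diff = difflib.ndiff(old_prompt.splitlines(), new_prompt.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]


def assert_max_words(limit):
    """Hypothetical assertion factory: passes when the output is short enough."""
    def check(output):
        return len(output.split()) <= limit
    return check


# Two hypothetical versions of the same prompt.
old_prompt = "Summarize the ticket."
new_prompt = "Summarize the ticket.\nKeep the answer under 20 words."

delta = added_lines(old_prompt, new_prompt)
print(delta)

# Hand-written assertion matching the added instruction.
check_length = assert_max_words(20)
print(check_length("Customer cannot log in after password reset."))
print(check_length("word " * 30))
```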
Resources
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
- Evaluating the Effectiveness of LLM-Evaluators
Giskard
Giskard is an open-source Python library that automatically detects performance, bias, and security issues in AI applications. The library covers LLM-based applications such as RAG agents, all the way to traditional ML models for tabular data.
The Giskard LLM scan comprises two main types of detectors:
- Traditional detectors: exploit known techniques or heuristics to detect vulnerabilities. Example: LLMCharsInjectionDetector
- LLM-assisted detectors: use another LLM to probe the model under analysis. Example: LLMBasicSycophancyDetector
Snorkel
Train a custom model on your own data and use it for evaluation.
Steps
- Create golden dataset
- Encode acceptance criteria into custom quality model
- Slice your prompts to evaluate what matters
- Review fine-grained benchmarks
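The slicing step above can be illustrated with a small per-slice report. This is a generic sketch rather than Snorkel's API; the record fields `slice` and `passed` and the example data are assumptions.

```python
from collections import defaultdict


def pass_rate_by_slice(records):
    """Group evaluation records by slice and return the pass rate for each."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for record in records:
        totals[record["slice"]] += 1
        passes[record["slice"]] += int(record["passed"])
    return {name: passes[name] / totals[name] for name in totals}


# Hypothetical evaluation results tagged with the slice each prompt belongs to.
records = [
    {"slice": "billing", "passed": True},
    {"slice": "billing", "passed": True},
    {"slice": "refunds", "passed": True},
    {"slice": "refunds", "passed": False},
]

report = pass_rate_by_slice(records)
print(report)
```

An overall pass rate of 75% would hide the fact that one slice is at 50%, which is the motivation for reviewing fine-grained results.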
TruLens
RAG Triad of metrics
- Context Relevance → is the retrieved context relevant to the query?
- Answer Relevance → is the response relevant to the query?
- Groundedness → is response supported by the context?
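One way to use the three scores together is to report which leg of the triad falls below a threshold. The sketch below is not the TruLens API; the `TriadScores` container, the 0.5 threshold, and the example scores are assumptions, and the scores themselves would come from whatever feedback functions you run.

```python
from dataclasses import dataclass


@dataclass
class TriadScores:
    """Hypothetical container for the three RAG triad scores, each in [0, 1]."""
    context_relevance: float
    answer_relevance: float
    groundedness: float


def weak_legs(scores, threshold=0.5):
    """Return the names of the triad metrics that fall below the threshold."""
    legs = {
        "context_relevance": scores.context_relevance,
        "answer_relevance": scores.answer_relevance,
        "groundedness": scores.groundedness,
    }
    return [name for name, value in legs.items() if value < threshold]


# Hypothetical result: good retrieval and on-topic answer, but weak grounding.
example = TriadScores(context_relevance=0.9, answer_relevance=0.8, groundedness=0.3)
print(weak_legs(example))
```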
Quotient AI
Quotient AI automates manual evaluations starting from real data, incorporating human feedback.
- Context Relevance
- Chunk Relevance
- Faithfulness
- ROUGE-L
- BERT Sentence Similarity
- BERTScore
EleutherAI
EleutherAI's lm-evaluation-harness is a framework for few-shot evaluation of language models.
Arize Phoenix
Phoenix is an open-source observability library designed for experimentation, evaluation, and troubleshooting.
Braintrust
Braintrust is an end-to-end platform for building AI applications. It makes software development with large language models (LLMs) robust and iterative.
Tools
- Portkey
- TruLens: Website
- Inspect AI: GitHub
- Giskard: GitHub
- lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness
Resources
- A Survey on Hallucination in Large Language Models
- Evaluating the Effectiveness of LLM-Evaluators
- A framework for few-shot evaluation of language models.
- LLM Evaluation Skills Are Easy to Pick Up
- How to Cook Good AI Products with What You Already Have in your Data Warehouse
- Optimizing RAG Through an Evaluation-Based Methodology