RAG Evaluation
This session explores the intricacies of prompt engineering for large language models (LLMs), emphasizing its importance in optimizing LLM performance for specific tasks. Unlike traditional machine learning models, LLMs are evaluated with largely subjective metrics such as context relevance, answer faithfulness, and answer relevance.
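To make the metrics above concrete, here is a toy sketch that approximates two of them with simple token overlap. Real frameworks such as ARES use LLM judges rather than lexical overlap; this is only an illustration of what each metric measures, and all names and example texts are hypothetical.

```python
# Toy lexical proxies for two RAG evaluation metrics.
# Real evaluators use LLM judges; token overlap is just an illustration.

def _tokens(text: str) -> set[str]:
    """Lowercase word set, with trailing punctuation stripped."""
    return {w.strip(".,!?") for w in text.lower().split()}

def context_relevance(question: str, context: str) -> float:
    """Fraction of question tokens that appear in the retrieved context."""
    q = _tokens(question)
    return len(q & _tokens(context)) / len(q) if q else 0.0

def answer_faithfulness(answer: str, context: str) -> float:
    """Fraction of answer tokens grounded in the retrieved context."""
    a = _tokens(answer)
    return len(a & _tokens(context)) / len(a) if a else 0.0

question = "Who wrote Hamlet?"
context = "Hamlet is a tragedy written by William Shakespeare around 1600."
answer = "William Shakespeare wrote Hamlet."

print("context relevance:", context_relevance(question, context))
print("answer faithfulness:", answer_faithfulness(answer, context))
```

Note how the two metrics look at different pairs: relevance compares the question to the context, while faithfulness checks whether the answer is supported by the context.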
The session highlights frameworks like ARES, which uses smaller LLMs as evaluators, and LLMA, which focuses on instruction-following capabilities. Techniques such as chain-of-thought prompting, few-shot prompting, and retrieval-augmented generation are presented as effective strategies to guide LLMs in reasoning, learning from examples, and leveraging external information.
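The few-shot and chain-of-thought techniques mentioned above can be sketched as simple prompt assembly: worked examples that include explicit reasoning steps are prepended to the new question. The examples and template below are illustrative, not taken from the session.

```python
# Sketch: building a few-shot prompt whose examples demonstrate
# step-by-step reasoning (chain of thought). Illustrative only.

FEW_SHOT_EXAMPLES = [
    {"question": "Is 17 prime?",
     "reasoning": "17 has no divisors other than 1 and itself.",
     "answer": "yes"},
    {"question": "Is 21 prime?",
     "reasoning": "21 = 3 * 7, so it has divisors besides 1 and itself.",
     "answer": "no"},
]

def build_prompt(question: str) -> str:
    """Assemble examples plus the new question into one prompt string."""
    parts = []
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Q: {ex['question']}\n"
                     f"Reasoning: {ex['reasoning']}\n"
                     f"A: {ex['answer']}")
    # Ending on "Reasoning:" invites the model to think step by step
    # before committing to an answer.
    parts.append(f"Q: {question}\nReasoning:")
    return "\n\n".join(parts)

print(build_prompt("Is 29 prime?"))
```

In a retrieval-augmented setup, the retrieved passages would additionally be inserted ahead of the question so the model can ground its reasoning in external information.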
The session also introduces public datasets like KILT and SuperGLUE, along with task-specific datasets such as Natural Questions, HotpotQA, and FEVER. These datasets provide standardized benchmarks to test and refine prompt engineering approaches. The hosts emphasize the need for creativity and experimentation in this evolving field, blending technical expertise with an intuitive understanding of language and machine learning.
Required Reading and Listening
Listen to the podcast (transcription):
Presentations
Notebooks
- Response Evaluation compares the RAG system's responses to given questions against the ground-truth answers
- Example notebook RAG_Evaluation
- Example code in response_evaluation
- Retrieval Evaluation assesses the text-chunk retrieval and ranking of the RAG system
- Example notebook RAG_Retrieval_Evaluation
- Example code in retrieval_evaluation
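A minimal sketch of the two evaluation styles listed above, assuming ground-truth answers and relevant-chunk labels are available; the course notebooks may use different metrics or libraries, and all identifiers here are hypothetical.

```python
# Response evaluation: compare a generated answer to the ground truth.
# Retrieval evaluation: score the ranked list of retrieved chunks.

def exact_match(prediction: str, ground_truth: str) -> bool:
    """Response metric: case- and whitespace-normalized exact match."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(prediction) == norm(ground_truth)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Retrieval metric: share of the top-k chunks that are relevant."""
    return sum(1 for c in retrieved[:k] if c in relevant) / k

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """Retrieval metric: 1/rank of the first relevant chunk (0 if none)."""
    for i, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1 / i
    return 0.0

retrieved = ["chunk_b", "chunk_a", "chunk_d"]  # ranked retriever output
relevant = {"chunk_a", "chunk_c"}              # labeled relevant chunks

print(exact_match("Paris ", "paris"))            # True after normalization
print(precision_at_k(retrieved, relevant, k=3))  # one relevant chunk in top 3
print(reciprocal_rank(retrieved, relevant))      # first relevant at rank 2
```

Exact match is a strict response metric; in practice it is often complemented by token-level F1 or an LLM judge, since a correct answer can be phrased differently from the ground truth.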
Reading
Additional Resources
- ARES is a framework for evaluating Retrieval-Augmented Generation (RAG) models.
- G-Eval is a framework that uses LLMs with chain-of-thought (CoT) reasoning to evaluate LLM outputs against any custom criteria. The G-Eval metric is the most versatile metric deepeval offers and can cover almost any evaluation use case.
- RAGAS is a library that provides tools to evaluate Large Language Model (LLM) applications, with a particular focus on RAG pipelines.
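The G-Eval idea above can be sketched as follows: ask a judge LLM to reason step by step about a custom criterion and finish with a numeric score, then parse that score. The `call_llm` function below is a stub standing in for a real model call, and the prompt template is illustrative, not deepeval's actual implementation.

```python
# G-Eval-style sketch: LLM-as-judge with chain-of-thought and a parsed score.
# `call_llm` is a stub; a real setup would call an LLM API here.

def build_geval_prompt(criterion: str, question: str, answer: str) -> str:
    """Judge prompt asking for step-by-step reasoning, then a 1-5 score."""
    return (
        f"Evaluation criterion: {criterion}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n\n"
        "Think step by step about how well the answer meets the criterion,\n"
        "then finish with a line 'Score: <1-5>'."
    )

def call_llm(prompt: str) -> str:
    # Stub: returns a canned judge reply instead of querying a model.
    return "The answer names the correct author and is concise.\nScore: 5"

def parse_score(reply: str) -> int:
    """Extract the integer score from the judge's final 'Score:' line."""
    for line in reply.splitlines():
        if line.startswith("Score:"):
            return int(line.split(":")[1].strip())
    raise ValueError("no score found in judge reply")

prompt = build_geval_prompt(
    criterion="Factual correctness",
    question="Who wrote Hamlet?",
    answer="William Shakespeare.",
)
print(parse_score(call_llm(prompt)))  # 5, from the stubbed judge reply
```

Because the criterion is just text in the prompt, the same loop works for faithfulness, relevance, tone, or any other custom rubric, which is what makes this style of metric so flexible.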