Generative AI & LLMs

This session provides a comprehensive overview of artificial intelligence (AI), focusing on the evolution, concepts, and applications of Generative AI. The session explores topics such as the history of AI, different types of machine learning, neural networks, deep learning, large language models, and AI agents. It examines various generative models, including Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). The session emphasizes prompt engineering and transfer learning techniques for optimizing the performance of these models.

Slide deck posted on the iCollege class site.

Evaluating Large Language Models

Transcript

Speaker	Text
Alex	Welcome back to the Deep Dive. Today, uh, we’re diving deep into prompt engineering for LLMs, large language models, right? Exactly. You know, as data scientists with your machine learning background, you already know how powerful these LLMs can be. Yeah. But, you know, to really get them to do what we want for specific tasks, we need prompt engineering. Yeah, it’s like its own art, exactly. And that’s what we’ll unpack today: how to actually evaluate those prompts to get the absolute best performance out of LLMs for any given task.
Sam	Yeah, and it’s definitely not like evaluating traditional machine learning models.
Alex	Right, no, not at all. You know, those traditional models, you have clear metrics like accuracy or precision. With LLMs though, we’re dealing with these open-ended outputs, so evaluating them gets a little subjective.
Sam	Yeah, for sure. It’s not just right or wrong anymore. You’ve got to think about things like how well the LLM understands the context,
Alex	yeah, and how faithful its response is to what we gave it.
Sam	Exactly. Like, is it making things up? And then, of course, how relevant that answer is to our original prompt.
Alex	OK, so let’s break this down. Let’s say we’re working on text summarization. How would we evaluate different prompts to get good summaries from the LLM?
Sam	Well, first we’d have to define what good even means for our summaries. Like, are we looking for a really concise overview, a detailed analysis, or something in between?
Alex	Let’s say we want concise summaries that just get the key takeaways.
Sam	All right, so we could try a few different prompts to get those, right? Like, we could start simple with something like “Summarize this text,” or get a little more specific, like “Give me the key takeaways,” or even “In three sentences, what are the most important points?”
Alex	OK, interesting. So three different prompts, but all trying to get the same thing. How do we know which one actually works best?
Sam	That’s where those specific metrics for prompt evaluation come in. There’s been some interesting research on this. Yeah, like one system I’ve been looking at is ARES. ARES? Yeah, it stands for the Automated RAG Evaluation System. Oh right, RAG is retrieval augmented generation. It’s basically combining information retrieval with LLMs, super relevant to prompt engineering because a lot of the time we want our LLM to use external info to give a complete answer.
Alex	OK, so ARES is designed to evaluate prompts for those RAG systems. What’s different about how it evaluates things?
Sam	So ARES really focuses on three key metrics: context relevance, answer faithfulness, and answer relevance. Got you.
Alex	So break those down for me.
Sam	Sure. So context relevance is all about whether the information pulled in is actually relevant to the question. Makes sense. Answer faithfulness is about whether the generated response, the answer, is grounded in that information, not just making stuff up. And then answer relevance is about if that answer is actually relevant to the prompt, even if it is faithful to the information it was given.
Alex	OK, so it’s looking at different angles of the LLM’s response. How does it actually calculate those metrics though?
Sam	This is the cool part. It uses LLM judges.
Alex	LLM judges?
Sam	Yeah, basically smaller LLMs trained to do the evaluation. So they analyze the retrieved context, the generated answer, and the original prompt, and score each of those three metrics.
Alex	Wow. So we’re using LLMs to evaluate other LLMs.
Sam	Pretty meta, right? And it’s been shown to be really effective.
Alex	That’s wild. So walk me through how this works in practice.
Sam	Sure. So ARES has a three-stage process. First, it creates a synthetic data set of question-answer pairs from a bunch of documents.
Alex	So like a little practice world for the judges.
Sam	Yeah, exactly. Then in stage two, it trains those LLM judges using this data set. It’s got separate judges for each of the metrics and trains them using contrastive learning.
Alex	I’ve heard of that. Refresh my memory on contrastive learning though.
Sam	Of course. It’s basically where the model gets pairs of examples, some similar and some different, and has to learn to tell them apart. Oh right, yeah, yeah. So in this case, the judges are learning what makes a good answer based on those metrics by comparing good and bad examples.
Alex	Interesting. So they’re learning from examples to then judge other examples.
Sam	Exactly. And then the last stage is scoring. And for that, ARES uses a technique called prediction-powered inference, or PPI. PPI, catchy. Yeah, well, it uses a little bit of human-annotated data to give us some confidence intervals for the score.
Alex	So instead of one score, we get a range that tells us more about how the prompt is doing.
Sam	Yep. And that’s a big advantage of ARES. It not only automates prompt evaluation, but also gives us a more reliable estimate of how those prompts are working.
Alex	ARES sounds super powerful for evaluating prompts in these RAG systems, but are there other frameworks out there specifically for different LLMs or tasks?
Sam	There are a few, yeah. One that comes to mind is LLMBar. It’s focused on evaluating LLMs based on how well they follow instructions.
Alex	Instruction following. So basically how well they do what the prompt actually tells them to do.
Sam	Precisely. LLMBar is great for checking out different prompt engineering strategies like chain of thought prompting. Chain of thought? Yeah, where you’re basically guiding the LLM to think step by step.
Alex	OK, so it seems like choosing the right evaluation framework really depends on what kind of LLM you’re using and what you want it to do.
Sam	Totally. And beyond these frameworks, there’s another really important element for evaluating and benchmarking: public data sets.
Alex	Oh right. Having these standardized data sets lets us compare different approaches and see what really works best overall.
Sam	Exactly. And there are some amazing ones out there like KILT and SuperGLUE. They’re incredibly diverse, covering all sorts of queries, documents, and answer types.
Alex	So it’s like having a comprehensive testing ground for your prompt engineering skills. You got it. KILT and SuperGLUE sound pretty intense. Any specific data sets within those that stand out to you?
Sam	Oh yeah, definitely. For you, given your machine learning background and all, I think data sets like Natural Questions, HotpotQA, and FEVER might be right up your alley.
Alex	Interesting. Let’s hear about those.
Sam	All right, so Natural Questions. That one’s all about question answering systems. It’s built with real Google search queries and their answers from Wikipedia.
Alex	Sounds like a real world challenge.
Sam	It is. And then there’s HotpotQA. That one’s focused on multi-hop question answering.
Alex	Multi-hop. So the LLM has to gather info from multiple sources to answer.
Sam	Exactly. Really puts the LLM’s reasoning and comprehension to the test.
Alex	Makes sense. What about FEVER?
Sam	Ah, FEVER, that stands for Fact Extraction and VERification. Catchy. I know, right? It’s all about training LLMs to figure out if a claim is actually true. So it’s a great data set for testing both information retrieval and the LLM’s reasoning abilities.
Alex	So we’ve got KILT and SuperGLUE as big benchmarks and then more focused ones like Natural Questions, HotpotQA, and FEVER within them.
Sam	Exactly. And the great thing about these public data sets is that they give you a standardized environment to experiment, learn, and improve your prompt engineering.
Alex	You can compare your work with others and see what’s really working well in the field. Exactly. This has been a fantastic overview of how we can evaluate prompt engineering. We talked about the need for these specific metrics, the roles of frameworks like ARES and LLMBar, and of course the importance of public data sets. And we’re just scratching the surface here. Prompt engineering is constantly evolving with new techniques and best practices.
Sam	Yeah, it’s a really exciting time to be working in this area.
Alex	It really is. But for now we’re going to take a quick break. We’ll be right back to dive into some specific prompt engineering techniques that can really boost LLM performance. Sounds good. Welcome back to the Deep Dive. We’ve been talking about how important prompt engineering is for getting the most out of LLMs,
Sam	and we saw that evaluating these prompts, it’s not exactly a walk in the park.
Alex	Yeah, definitely not like traditional machine learning where you just look at accuracy,
Sam	right? It’s way more nuanced because LLM output is so open-ended.
Alex	Exactly. We have to consider the context, how faithful the response is to the info we gave it, and how relevant the answer is to the prompt.
Sam	Yeah, all those things matter,
Alex	and that’s where systems like ARES come in, right, with those LLM judges trained to evaluate those aspects.
Sam	Yeah, it’s pretty amazing how we can use LLMs to evaluate other LLMs.
Alex	Definitely mind bending. But you mentioned before that ARES is mainly for evaluating RAG systems. What about other LLMs and tasks?
Sam	Well, if you’re focused on how well the LLM follows instructions, there’s LLMBar.
Alex	Oh, OK, so LLMBar would be better for evaluating prompts for things like writing creative content or translating languages.
Sam	Exactly. It can tell us how well different prompting techniques work for following instructions and problem solving, like chain of thought prompting.
Alex	We touched on that earlier. Why is chain of thought prompting so effective?
Sam	So with chain of thought prompting, you’re encouraging the LLM to think step by step. It basically makes the reasoning process more transparent and structured. We’re guiding the LLM to think more like a human would by breaking the problem down.
Alex	So it’s like giving the LLM a roadmap so it doesn’t go off track.
Sam	Yeah, exactly. And it’s been really successful for tasks that involve logic, problem solving, even common sense reasoning.
Alex	Wow. So it’s not just about the instructions themselves, but how we structure them to help the LLM think.
Sam	You got it. It’s about understanding the cognitive processes and mirroring them in our prompts.
Alex	So prompt engineering needs a deep understanding of both the task and how LLMs work.
Sam	Absolutely. It’s this balance of human intuition and what the machine can do.
Alex	Aside from chain of thought prompting, are there other techniques data scientists should know?
Sam	Oh yeah, definitely. There’s few shot prompting, which is pretty widely used.
Alex	Remind me how that one works again.
Sam	So with few shot prompting, you give the LLM a few examples of the output you want before asking it to generate its own.
Alex	Oh right, so it’s like showing it a model answer.
Sam	Exactly. Even just a handful of examples can really improve the quality and relevance of what it generates.
Alex	It’s amazing that such a small amount of data can have such a big effect.
Sam	It is. And the cool thing about few shot prompting is that it’s so flexible. Oh yeah? Yeah, you can experiment with different types and numbers of examples. You can even combine it with other techniques like chain of thought.
Alex	So it’s adaptable to different tasks and LLMs.
Sam	Exactly. And then there’s another technique we talked about briefly earlier, retrieval augmented prompting.
Alex	Right? Giving the LLM extra info from external sources like a knowledge base.
Sam	Yeah, it’s all about giving the LLM the right context to give us more comprehensive and insightful answers,
Alex	like giving it a huge library to pull from.
Sam	Exactly. And with retrieval augmentation, we can even personalize prompts, giving users info specific to them.
Alex	Oh wow. So like a chatbot that can access your browsing history to give you personalized recommendations.
Sam	Exactly. It’s really pushing the boundaries of what’s possible with LLMs.
Alex	The possibilities are pretty much endless, from personalized education to targeted ads, it seems like. So we’ve got chain of thought for reasoning, few shot for quick learning, and retrieval augmentation for context and personalization. That’s a lot of tools. It is. How does a data scientist even know where to begin with all these options?
Sam	Well, the best approach really depends on the LLM you’re using, the task, and even the data you have.
Alex	So it comes down to experimenting and figuring out what works best for each situation.
Sam	Yeah, prompt engineering is all about trying things out, seeing the results, refining your prompts, and doing it all over again.
Alex	And being creative, right?
Sam	Oh, absolutely. Creativity is super important in this field.
Alex	Sounds like prompt engineering is as much an art as it is a science.
Sam	It really is. And the better we understand how LLMs process information and generate language, the better our prompt engineering will get.
Alex	This has been a fascinating look into prompt engineering. We talked about the challenges of evaluation, picking the right framework, and all those powerful techniques,
Sam	and we’re only just getting started. There’s still so much to explore with LLMs.
Alex	Well, I’ll have to save that for another deep dive, but now let’s shift gears and talk about public data sets and their role in evaluating and benchmarking all these different approaches. Welcome back. We’ve been exploring the world of prompt engineering, talking about frameworks, techniques, all that good stuff.
Sam	It’s a field that needs a good balance of being analytical and creative.
Alex	Totally agree. We’ve covered how to choose the right evaluation framework and those prompting techniques for your task, but there’s another important piece we need to talk about: public data sets for benchmarking.
Sam	Oh yeah, those are essential. They let us compare different approaches to prompt engineering and see what works best across different tasks and LLMs.
Alex	It’s like a standardized test for your prompt engineering skills. Exactly. Earlier we talked about KILT and SuperGLUE, those big benchmarks with a wide range of data sets. Let’s dive into some specific data sets that might be interesting for our listeners.
Sam	OK, sure. We mentioned Natural Questions, HotpotQA, and FEVER. They’re all great for evaluating prompt engineering for specific types of tasks.
Alex	Let’s go over those again. Natural Questions focuses on question answering systems, right?
Sam	Yep. Uses real Google search queries and their answers from Wikipedia. Provides a nice realistic challenge for those systems.
Alex	And HotpotQA is all about multi-hop question answering, where the LLM has to get info from multiple sources,
Sam	right? It tests how well the LLM can reason, understand, and put together info from different parts of a text.
Alex	FEVER focuses on fact verification.
Sam	Yep. FEVER challenges the LLM to figure out if a claim is true. It has to find info and then actually judge how accurate and reliable it is.
Alex	So these data sets are really useful for seeing how well our prompt engineering techniques are doing in specific areas.
Sam	Definitely. And it shows how important it is to choose the right data set for your task. Like if you’re building a chatbot for customer support, you’d probably want a data set that focuses on conversations and question answering.
Alex	That makes sense. So for data scientists getting into prompt engineering, what’s the key takeaway about using these public data sets?
Sam	Experiment. Try different data sets, different prompting techniques, different evaluation metrics. The more you experiment, the more you learn what works and what doesn’t.
Alex	And you might discover something totally new along the way.
Sam	Exactly. This field is all about innovating and discovering new things.
Alex	We’re just scratching the surface of what LLMs can do, and prompt engineering is leading the way for sure. This has been a great deep dive into the world of prompt engineering. We talked about evaluation frameworks, different prompting techniques, the importance of public data sets. It’s been a great overview, and it’s clear that prompt engineering isn’t just a technical skill, it’s an art that needs creativity, intuition, and a deep understanding of both language and machine learning.
Sam	Couldn’t have said it better myself.
Alex	To our listeners, keep exploring this amazing field, experiment, push the boundaries, see what you can do with LLMs. The future of this tech is in your hands.
Sam	Keep learning, keep innovating, and we’ll see you next time on the Deep Dive.
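
The scoring stage discussed in the transcript, where a small amount of human-annotated data turns judge scores into a confidence interval, can be illustrated with a minimal sketch of prediction-powered inference for a mean. This is not the ARES implementation. The function name `ppi_mean_ci`, the 0/1 scores, and the normal approximation for the interval are all assumptions made for illustration.

```python
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(judge_unlabeled, judge_labeled, human_labeled, alpha=0.05):
    """Hypothetical helper: PPI-style estimate of a mean score with a normal-approximation CI.

    judge_unlabeled: judge scores on examples that have no human label
    judge_labeled:   judge scores on the small human-annotated set
    human_labeled:   human scores on that same annotated set (same order)
    """
    if len(judge_labeled) != len(human_labeled):
        raise ValueError("judge_labeled and human_labeled must have the same length")

    # How far the judge is off, per example, on the human-annotated set.
    errors = [j - h for j, h in zip(judge_labeled, human_labeled)]

    # Judge mean on the large set, corrected by the judge's average error.
    estimate = fmean(judge_unlabeled) - fmean(errors)

    # Uncertainty comes from both the large unlabeled set and the small labeled set.
    std_err = (
        variance(judge_unlabeled) / len(judge_unlabeled)
        + variance(errors) / len(errors)
    ) ** 0.5

    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, estimate - z * std_err, estimate + z * std_err


# Toy data (made up): 1 = the judge or human marked the answer as faithful.
estimate, lower, upper = ppi_mean_ci(
    judge_unlabeled=[1, 1, 0, 1, 0, 1, 1, 1],
    judge_labeled=[1, 1, 0, 1],
    human_labeled=[1, 0, 0, 1],
)
print(f"estimate={estimate:.2f}, interval=({lower:.2f}, {upper:.2f})")
```

With only a handful of human labels the interval is wide. The point of the sketch is the shape of the calculation: the judge supplies volume, and the human labels correct its bias and set the width of the range.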

Highlights

This session emphasizes several key insights about Generative AI:

Generative AI is a rapidly evolving field with the potential to revolutionize numerous industries. Generative AI models, trained on vast datasets, can produce novel and realistic content, including text, images, videos, and audio. Applications range from content creation and drug discovery to personalized recommendations and even generating creative text like poems or code.
Large Language Models (LLMs) are foundational to Generative AI. LLMs are trained on massive text datasets using self-supervised learning techniques, enabling them to understand and generate human-like text. Architectures like transformers, incorporating attention mechanisms, facilitate capturing long-range dependencies in text.
Fine-tuning and alignment are crucial steps in developing effective and responsible LLMs. Fine-tuning tailors pre-trained models to specific tasks, while alignment techniques, often involving human feedback, ensure model outputs adhere to human values and ethical standards.
Prompt engineering plays a vital role in harnessing the capabilities of LLMs. The quality of prompts directly impacts the relevance and accuracy of model outputs. Effective prompts provide clear instructions, context, and desired output format, guiding the model towards producing desired results.
Transfer learning is a powerful technique that leverages pre-trained models to expedite and enhance the development of new models. This approach applies knowledge gained from one task to another related task, reducing training time and improving performance. Transfer learning finds applications in both computer vision, such as image classification and object detection, and natural language processing, such as sentiment analysis and language translation.
The development of AI agents, powered by LLMs and augmented with external knowledge, holds significant potential for creating intelligent systems capable of autonomous task execution and human-like interaction. These agents can access external databases, adapt to dynamic environments, and continuously learn, making them versatile for applications ranging from personal assistants to components of complex autonomous systems.
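
The prompt components named in the highlights (clear instructions, context, desired output format) and the techniques from the transcript (few shot examples, a step-by-step cue) can be combined in a small template function. The sketch below is one possible layout, not a recommended standard. The function name `build_prompt`, the section labels, and the sample strings are hypothetical.

```python
def build_prompt(instruction, context="", examples=(), output_format="", step_by_step=False):
    """Hypothetical helper: assemble a prompt string from optional parts."""
    parts = [instruction.strip()]

    # Retrieved or user-supplied context the model should rely on.
    if context:
        parts.append("Context:\n" + context.strip())

    # Few shot examples: (input, desired output) pairs shown before the real task.
    for source, target in examples:
        parts.append(
            "Example input:\n" + source.strip() + "\nExample output:\n" + target.strip()
        )

    if output_format:
        parts.append("Output format: " + output_format.strip())

    # Chain of thought style cue, added only when requested.
    if step_by_step:
        parts.append("Think through the problem step by step before giving the final answer.")

    return "\n\n".join(parts)


prompt = build_prompt(
    instruction="In three sentences, what are the most important points of the text below?",
    context="(text to summarize goes here)",
    examples=[("A short sample passage.", "A one-line sample summary.")],
    output_format="three plain sentences",
)
print(prompt)
```

Keeping each component as a separate argument makes it easy to vary one piece at a time, for example the number of examples or the presence of the step-by-step cue, when comparing prompts against each other.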

Suggested Reading
