Generative AI & LLMs
This session provides a comprehensive overview of artificial intelligence (AI), focusing on the evolution, concepts, and applications of Generative AI. The session explores topics such as the history of AI, different types of machine learning, neural networks, deep learning, large language models, and AI agents. It examines various generative models, including Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). The session emphasizes prompt engineering and transfer learning techniques for optimizing the performance of these models.
Slide deck posted on the iCollege class site.
| Speaker | Text |
|---|---|
| Alex | Welcome back to the Deep Dive. Today we’re diving deep into prompt engineering for LLMs, large language models, right? Exactly. You know, as data scientists with your machine learning background, you already know how powerful these LLMs can be. Yeah. But to really get them to do what we want for specific tasks, we need prompt engineering. Yeah, it’s like its own art. Exactly. And that’s what we’ll unpack today: how to actually evaluate those prompts to get the absolute best performance out of LLMs for any given task. Yeah, and |
| Sam | it’s definitely not like evaluating traditional machine learning models. |
| Alex | Right, no, not at all. You know, those traditional models, you have clear metrics like accuracy or precision. With LLMs though, we’re dealing with these open-ended outputs, so evaluating them gets a little subjective. |
| Sam | Yeah, for sure. It’s not just right or wrong anymore. You’ve got to think about things like how well the LLM understands the context, |
| Alex | yeah, and how faithful its response is to what we gave it. |
| Sam | Exactly. Like, is it making things up? And then, of course, how relevant that answer is to our original prompt. |
| Alex | OK, so let’s break this down. Let’s say we’re working on text summarization. How would we evaluate different prompts to get good summaries from the LLM? |
| Sam | Well, first we’d have to define what good even means for our summaries. Like, are we looking for a really concise overview, a detailed analysis, or something in between? |
| Alex | Let’s say we want concise summaries that just get the key takeaways. |
| Sam | All right, so we could try a few different prompts to get those right. Like, we could start simple with something like “summarize this text,” or get a little more specific, like “give me the key takeaways,” or even “in 3 sentences, what are the most important points?” |
| Alex | OK, interesting. So 3 different prompts, but all trying to get the same thing. How do we know which one actually works best? |
| Sam | That’s where those specific metrics for prompt evaluation come in. There’s been some interesting research on this. Yeah, like one system I’ve been looking at is ARES. ARES? Yeah, it stands for the Automated RAG Evaluation System. Oh right, RAG is retrieval-augmented generation. It’s basically combining information retrieval with LLMs, super relevant to prompt engineering because a lot of the time we want our LLM to use external info to give a complete answer. |
| Alex | OK, so ARES is designed to evaluate prompts for those RAG systems. What’s different about how it evaluates things? |
| Sam | So ARES really focuses on three key metrics: context relevance, answer faithfulness, and answer relevance. Got you. |
| Alex | So break those down for me. |
| Sam | Sure. So context relevance is all about whether the information pulled in is actually relevant to the question. Makes sense. Answer faithfulness is about whether the generated response, the answer, is grounded in that information, not just making stuff up. And then answer relevance is about if that answer is actually relevant to the prompt, even if it is faithful to the information it was given. OK, |
| Alex | so it’s looking at different angles of the LLM’s response. How does it actually calculate those metrics though? |
| Sam | This is the cool part. It uses LLM judges. LLM |
| Alex | judges? |
| Sam | Yeah, basically smaller LLMs trained to do the evaluation. So they analyze the retrieved context, the generated answer, and the original prompt, and score each of those three metrics. Wow. |
| Alex | So we’re using LLMs to evaluate other LLMs? |
| Sam | pretty meta, right? And it’s been shown to be really effective. |
| Alex | That’s wild. So walk me through how this works in practice. |
| Sam | Sure. So ARES has a three-stage process. First, it creates a synthetic data set of question-answer pairs from a bunch of documents. So |
| Alex | like a little practice world for the judges. |
| Sam | Yeah, exactly. Then in stage two, it trains those LLM judges using this data set. It’s got separate judges for each of the metrics and trains them using contrastive learning. |
| Alex | I’ve heard of that. Refresh my memory on contrastive learning though. |
| Sam | Of course. It’s basically where the model gets pairs of examples, some similar and some different, and has to learn to tell them apart. Oh right, yeah, yeah. So in this case, the judges are learning what makes a good answer based on those metrics by comparing good and bad examples. |
| Alex | Interesting. So they’re learning from examples to then judge other examples. |
| Sam | Exactly. And then the last stage is scoring. And for that, ARES uses a technique called prediction-powered inference, or PPI. PPI, catchy. Yeah, well, it uses a little bit of human-annotated data to give us some confidence intervals for the |
| Alex | score. So instead of one score, we get a range that tells us more about how the prompt is doing. Yep. |
| Sam | And that’s a big advantage of ARES. It not only automates prompt evaluation, but also gives us a more reliable estimate of how those prompts are working. |
| Alex | ARES sounds super powerful for evaluating prompts in these RAG systems, but are there other frameworks out there specifically for different LLMs or tasks? |
| Sam | There are a few, yeah. One that comes to mind is LLMBar. It’s focused on evaluating LLMs based on how well they follow instructions. |
| Alex | Instruction following. So basically how well they do what the prompt actually tells them to do. |
| Sam | Precisely. LLMBar is great for checking out different prompt engineering strategies like chain of thought prompting. Chain of thought? Yeah, where you’re basically guiding the LLM to think step by step. OK, |
| Alex | so it seems like choosing the right evaluation framework really depends on what kind of LLM you’re using and what you want it to do. |
| Sam | Totally. And beyond these frameworks, there’s another really important element for evaluating and benchmarking: public data sets. Oh |
| Alex | right. Having these standardized data sets lets us compare different approaches and see what really works best overall. Exactly. |
| Sam | And there are some amazing ones out there like KILT and SuperGLUE. They’re incredibly diverse, covering all sorts of queries, documents, and answer types. |
| Alex | So it’s like having a comprehensive testing ground for your prompt engineering skills. You got it. KILT and SuperGLUE sound pretty intense. Any specific data sets within those that stand out to you? Oh |
| Sam | yeah, definitely. For you, given your machine learning background and all, I think data sets like Natural Questions, HotpotQA, and FEVER might be right up your alley. |
| Alex | Interesting. Let’s hear about those. |
| Sam | All right, so Natural Questions. That one’s all about question answering systems. It’s built with real Google search queries and their answers from Wikipedia. |
| Alex | Sounds like a real world challenge. |
| Sam | It is. And then there’s HotpotQA. That one’s focused on multi-hop question answering. |
| Alex | Multi-hop. So the LLM has to gather info from multiple sources to answer. |
| Sam | Exactly. Really puts the LLM’s reasoning and comprehension to the test. |
| Alex | Makes sense. What about FEVER? |
| Sam | Ah, FEVER. That stands for Fact Extraction and VERification. Catchy. I know, right? It’s all about training LLMs to figure out if a claim is actually true. So it’s a great data set for testing both information retrieval and the LLM’s reasoning abilities. |
| Alex | So we’ve got KILT and SuperGLUE as big benchmarks and then more focused ones like Natural Questions, HotpotQA, and FEVER within them. |
| Sam | Exactly. And the great thing about these public data sets is that they give you a standardized environment to experiment, learn, and improve your prompt engineering. |
| Alex | You can compare your work with others and see what’s really working well in the field. Exactly. This has been a fantastic overview of how we can evaluate prompt engineering. We talked about the need for these specific metrics, the roles of frameworks like ARES and LLMBar, and of course the importance of public data sets. And we’re just scratching the surface here. Prompt engineering is constantly evolving with new techniques and best practices. Yeah, |
| Sam | it’s a really exciting time to be working in this area. |
| Alex | It really is. But for now we’re going to take a quick break. We’ll be right back to dive into some specific prompt engineering techniques that can really boost LLM performance. Sounds good. Welcome back to the Deep Dive. We’ve been talking about how important prompt engineering is for getting the most out of LLMs, and |
| Sam | we saw that evaluating these prompts, it’s not exactly a walk in the park. |
| Alex | Yeah, definitely not like traditional machine learning where you just look at accuracy, |
| Sam | right? It’s way more nuanced because LLM output is so open-ended. |
| Alex | Exactly. We have to consider the context, how faithful the response is to the info we gave it, and how relevant the answer is to the prompt. Yeah, |
| Sam | all those things matter, |
| Alex | and that’s where systems like ARES come in, right, with those LLM judges trained to evaluate those aspects. |
| Sam | Yeah, it’s pretty amazing how we can use LLMs to evaluate other LLMs. |
| Alex | Definitely mind-bending. But you mentioned before that ARES is mainly for evaluating RAG systems. What about other LLMs and tasks? |
| Sam | Well, if you’re focused on how well the LLM follows instructions, there’s LLMBar. |
| Alex | Oh, OK, so LLMBar would be better for evaluating prompts for things like writing creative content or translating languages. |
| Sam | Exactly. It can tell us how well different prompting techniques work for following instructions and problem solving, like chain of thought prompting. |
| Alex | We touched on that earlier. Why is chain of thought prompting so effective? |
| Sam | So with chain of thought prompting, you’re encouraging the LLM to think step by step. It basically makes the reasoning process more transparent and structured. We’re guiding the LLM to think more like a human would by breaking the problem down. |
| Alex | So it’s like giving the LLM a roadmap so it doesn’t go off track. |
| Sam | Yeah, exactly. And it’s been really successful for tasks that involve logic, problem solving, even common sense reasoning. |
| Alex | Wow. So it’s not just about the instructions themselves, but how we structure them to help the LLM |
| Sam | think. You got it. It’s about understanding the cognitive processes and mirroring them in our prompts. |
| Alex | So prompt engineering needs a deep understanding of both the task and how LLMs work. |
| Sam | Absolutely. It’s this balance of human intuition and what the machine can do. |
| Alex | Aside from chain of thought prompting, are there other techniques data scientists should know? Oh yeah, |
| Sam | definitely. There’s few shot prompting, which is pretty widely used. |
| Alex | Remind me how that one works again. So with |
| Sam | few shot prompting, you give the LLM a few examples of the output you want before asking it to generate its own. Oh |
| Alex | right, so it’s like showing it a model answer. |
| Sam | Exactly. Even just a handful of examples can really improve the quality and relevance of what it generates. |
| Alex | It’s amazing that such a small amount of data can have such a big effect. |
| Sam | It is. And the cool thing about few shot prompting is that it’s so flexible. Oh yeah, yeah, you can experiment with different types and numbers of examples. You can even combine it with other techniques like chain of thought, |
| Alex | so it’s adaptable to different tasks and LLMs. |
| Sam | Exactly. And then there’s another technique we talked about briefly earlier, retrieval augmented |
| Alex | prompting, right? Giving the LLM extra info from external sources like a knowledge base. Yeah, |
| Sam | it’s all about giving the LLM the right context to give us more comprehensive and insightful answers, like |
| Alex | giving it a huge library to pull from. |
| Sam | Exactly. And with retrieval augmentation, we can even personalize prompts, giving users info specific to them. Oh wow. |
| Alex | So like a chatbot that can access your browsing history to give you personalized recommendations. |
| Sam | Exactly. It’s really pushing the boundaries of what’s possible with LLMs. |
| Alex | The possibilities are pretty much endless, from personalized education to targeted ads, it seems like. So we’ve got chain of thought for reasoning, few shot for quick learning, and retrieval augmentation for context and personalization. That’s a lot of tools. It is. How does a data scientist even know where to begin with all these options? Well, |
| Sam | the best approach really depends on the LLM you’re using, the task, and even the data you have. |
| Alex | So it comes down to experimenting and figuring out what works best for each situation. |
| Sam | Yeah, prompt engineering is all about trying things out, seeing the results, refining your prompts, and doing it all over again and |
| Alex | being creative, right? |
| Sam | Oh, absolutely. Creativity is super important in this field. |
| Alex | Sounds like prompt engineering is as much an art as it is a science. |
| Sam | It really is. And the better we understand how LLMs process information and generate language, the better our prompt engineering will get. |
| Alex | This has been a fascinating look into prompt engineering. We talked about the challenges of evaluation, picking the right framework, and all those powerful techniques, and |
| Sam | we’re only just getting started. There’s still so much to explore with |
| Alex | LLMs. Well, I’ll have to save that for another deep dive, but now let’s shift gears and talk about public data sets and their role in evaluating and benchmarking all these different approaches. Welcome back. We’ve been exploring the world of prompt engineering, talking about frameworks, techniques, all that good stuff. |
| Sam | It’s a field that needs a good balance of being analytical and creative. |
| Alex | Totally agree. We’ve covered how to choose the right evaluation framework and those prompting techniques for your task, but there’s another important piece we need to talk about. Public data sets for benchmarking. |
| Sam | Oh yeah, those are essential. They let us compare different approaches to prompt engineering and see what works best across different tasks and LLMs. |
| Alex | It’s like a standardized test for your prompt engineering skills. Exactly. Earlier we talked about KILT and SuperGLUE, those big benchmarks with a wide range of data sets. Let’s dive into some specific data sets that might be interesting for our listeners. |
| Sam | OK, sure. We mentioned Natural Questions, HotpotQA, and FEVER. They’re all great for evaluating prompt engineering for specific types of tasks. |
| Alex | Let’s go over those again. Natural Questions focuses on question answering systems, right? Yep. |
| Sam | Uses real Google search queries and their answers from Wikipedia. Provides a nice realistic challenge for those systems. And HotpotQA |
| Alex | is all about multi-hop question answering, where the LLM has to get info from multiple sources, |
| Sam | right? It tests how well the LLM can reason, understand, and put together info from different parts of a text. |
| Alex | FEVER focuses on fact verification. Yep. |
| Sam | FEVER challenges the LLM to figure out if a claim is true. It has to find info and then actually judge how accurate and reliable it is. |
| Alex | So these data sets are really useful for seeing how well our prompt engineering techniques are doing in specific areas. |
| Sam | Definitely. And it shows how important it is to choose the right data set for your task. Like if you’re building a chatbot for customer support, you’d probably want a data set that focuses on conversations and question answering. |
| Alex | That makes sense. So for data scientists getting into prompt engineering, what’s the key takeaway about using these public data sets? |
| Sam | Experiment. Try different data sets, different prompting techniques, different evaluation metrics. The more you experiment, the more you learn what works and what doesn’t. |
| Alex | And you might discover something totally new along the way. |
| Sam | Exactly. This field is all about innovating and discovering new things. |
| Alex | We’re just scratching the surface of what LLMs can do, and prompt engineering is leading the way for sure. This has been a great deep dive into the world of prompt engineering. We talked about evaluation frameworks, different prompting techniques, the importance of public data sets. It’s been a great overview, and it’s clear that prompt engineering isn’t just a technical skill, it’s an art that needs creativity, intuition, and a deep understanding of both language and machine learning. |
| Sam | Couldn’t have said it better |
| Alex | myself. To our listeners, keep exploring this amazing field, experiment, push the boundaries, see what you can do with LLMs. The future of this tech is in your hands. |
| Sam | Keep learning, keep innovating, and we’ll see you next time on the Deep Dive. |
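
The scoring stage described in the discussion, prediction-powered inference, can be sketched for its simplest case: estimating a mean score with a confidence interval. This is a minimal illustration under stated assumptions, not the ARES implementation. The function name `ppi_mean_ci`, the made-up toy scores, and the normal-approximation 95% interval are all choices made here for illustration. The idea is to take the judge's average score on a large unlabeled set and correct it by the judge's average error on a small human-annotated set.

```python
import math
from statistics import NormalDist, mean, variance


def ppi_mean_ci(judge_unlabeled, judge_labeled, human_labels, alpha=0.05):
    """Prediction-powered estimate of a mean score with a confidence interval.

    judge_unlabeled: judge scores on the large set with no human labels
    judge_labeled:   judge scores on the small human-annotated set
    human_labels:    human scores for that same small set
    """
    if len(judge_labeled) != len(human_labels):
        raise ValueError("judge_labeled and human_labels must align")
    if len(judge_unlabeled) < 2 or len(human_labels) < 2:
        raise ValueError("need at least two items in each set")

    # Rectifier: how far the judge is off, measured where humans labeled.
    rectifier = [y - f for y, f in zip(human_labels, judge_labeled)]

    # Judge average on the big set, shifted by the measured judge error.
    estimate = mean(judge_unlabeled) + mean(rectifier)

    # Uncertainty comes from both the big set and the small labeled set.
    std_err = math.sqrt(
        variance(judge_unlabeled) / len(judge_unlabeled)
        + variance(rectifier) / len(rectifier)
    )
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * std_err, estimate, estimate + z * std_err


# Toy data (made up): 1 = scored as faithful, 0 = not faithful.
judge_unlabeled = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
judge_labeled = [1, 0, 1, 1]
human_labels = [1, 0, 0, 1]

low, estimate, high = ppi_mean_ci(judge_unlabeled, judge_labeled, human_labels)
print(f"faithfulness ~ {estimate:.2f} (95% CI {low:.2f} to {high:.2f})")
```

With so few labeled items the interval is wide and can extend past the 0 to 1 range. More human annotations shrink the second variance term, which is why a small labeled set still adds value on top of automated judging.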
Suggested Reading
- Textbook Hands-On Large Language Models by Jay Alammar and Maarten Grootendorst
- Textbook Fundamentals of Machine Learning for Predictive Data Analytics by John D. Kelleher, Brian Mac Namee and Aoife D’Arcy
Highlights
This session emphasizes several key insights about Generative AI:
Generative AI is a rapidly evolving field with the potential to revolutionize numerous industries. Generative AI models, trained on vast datasets, can produce novel and realistic content, including text, images, videos, and audio. Applications range from content creation and drug discovery to personalized recommendations and even generating creative text like poems or code.
Large Language Models (LLMs) are foundational to Generative AI. LLMs are trained on massive text datasets using self-supervised learning techniques, enabling them to understand and generate human-like text. Architectures like transformers, incorporating attention mechanisms, facilitate capturing long-range dependencies in text.
Fine-tuning and alignment are crucial steps in developing effective and responsible LLMs. Fine-tuning tailors pre-trained models to specific tasks, while alignment techniques, often involving human feedback, ensure model outputs adhere to human values and ethical standards.
Prompt engineering plays a vital role in harnessing the capabilities of LLMs. The quality of prompts directly impacts the relevance and accuracy of model outputs. Effective prompts provide clear instructions, context, and desired output format, guiding the model towards producing desired results.
Transfer learning is a powerful technique that leverages pre-trained models to expedite and enhance the development of new models. This approach applies knowledge gained from one task to another related task, reducing training time and improving performance. Transfer learning finds applications in both computer vision, such as image classification and object detection, and natural language processing, such as sentiment analysis and language translation.
The development of AI agents, powered by LLMs and augmented with external knowledge, holds significant potential for creating intelligent systems capable of autonomous task execution and human-like interaction. These agents can access external databases, adapt to dynamic environments, and continuously learn, making them versatile for applications ranging from personal assistants to components of complex autonomous systems.
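
The prompt engineering highlight above names three ingredients of an effective prompt (clear instructions, context, and desired output format), and the discussion adds few-shot examples and chain of thought. A small prompt builder can make these concrete. This is a sketch only: `build_prompt`, its parameters, and the section labels are hypothetical names chosen for illustration, and no particular LLM API is assumed. The function simply returns a string that could be sent to any model.

```python
def build_prompt(instruction, context="", examples=(), output_format="",
                 step_by_step=False):
    """Assemble a prompt from optional parts, skipping any that are empty."""
    parts = [f"Instruction: {instruction}"]

    if context:
        # Retrieved or user-supplied background the model should rely on.
        parts.append(f"Context:\n{context}")

    # Few-shot prompting: show input/output pairs before the real request.
    for i, (example_input, example_output) in enumerate(examples, start=1):
        parts.append(f"Example {i}\nInput: {example_input}\nOutput: {example_output}")

    if output_format:
        parts.append(f"Output format: {output_format}")

    if step_by_step:
        # Chain of thought prompting: ask for intermediate reasoning.
        parts.append("Think through the problem step by step before answering.")

    return "\n\n".join(parts)


# The three candidate summarization instructions from the discussion.
candidates = [
    "Summarize this text.",
    "Give me the key takeaways.",
    "In 3 sentences, what are the most important points?",
]

article = "(placeholder text to summarize)"
prompts = [build_prompt(c, context=article) for c in candidates]

for prompt in prompts:
    print(prompt)
    print("-" * 40)
```

Keeping each ingredient as a separate argument makes it easy to vary one at a time, for example adding examples or the step-by-step cue, and then compare the resulting outputs with whatever evaluation metric fits the task.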