Introduction
This module covers essential concepts and techniques for creating effective prompts for generative AI models. The way you write your prompt to an LLM matters: a carefully crafted prompt can produce a better quality of response. But what exactly do terms like prompt and prompt engineering mean? And how do I improve the prompt input that I send to the LLM? These are the questions we'll try to answer in this chapter and the next.
Generative AI is capable of creating new content (e.g., text, images, audio, code) in response to user requests. It achieves this using Large Language Models (LLMs) like OpenAI's GPT ("Generative Pre-trained Transformer") series, which are trained on natural language and code.
Users can now interact with these models using familiar paradigms like chat, without needing any technical expertise or training. The models are prompt-based - users send a text input (prompt) and get back the AI response (completion). They can then "chat with the AI" iteratively, in multi-turn conversations, refining their prompt until the response matches their expectations.
"Prompts" now become the primary programming interface for generative AI apps, telling the models what to do and influencing the quality of returned responses. "Prompt Engineering" is a fast-growing field of study that focuses on the design and optimization of prompts to deliver consistent and quality responses at scale.
Learning Goals
In this lesson, we learn what prompt engineering is, why it matters, and how we can craft more effective prompts for a given model and application objective. We'll cover core concepts and best practices for prompt engineering - and learn about an interactive Jupyter Notebook "sandbox" environment where we can see these concepts applied to real examples.
By the end of this lesson we will be able to:
- Explain what prompt engineering is and why it matters.
- Describe the components of a prompt and how they are used.
- Learn best practices and techniques for prompt engineering.
- Apply learned techniques to real examples, using an OpenAI endpoint.
Key Terms
- Prompt Engineering: The practice of designing and refining inputs to guide AI models toward producing desired outputs.
- Tokenization: The process of converting text into smaller units, called tokens, that a model can understand and process.
- Instruction-Tuned LLMs: Large Language Models (LLMs) that have been fine-tuned with specific instructions to improve their response accuracy and relevance.
Learning Sandbox
Prompt engineering is currently more art than science. The best way to improve our intuition for it is to practice more and adopt a trial-and-error approach that combines application domain expertise with recommended techniques and model-specific optimizations.
The Jupyter Notebook accompanying this lesson provides a sandbox environment where you can try out what you learn - as you go or as part of the code challenge at the end. To execute the exercises, you will need:
- An Azure OpenAI API key - the service endpoint for a deployed LLM.
- A Python Runtime - in which the Notebook can be executed.
- Local Env Variables - complete the SETUP steps now to get ready.
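To get a sense of what that setup looks like in code, here is a minimal sketch. It assumes the `openai` Python SDK (v1+) and illustrative environment variable names (`AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_DEPLOYMENT`); use whatever names and API version your SETUP steps actually define.

```python
# Minimal sketch: send one prompt to an Azure OpenAI chat deployment.
# Env var names below are illustrative - match them to your own SETUP steps.
import os
from openai import AzureOpenAI  # openai SDK v1+

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01",  # example API version; check your service docs
)

response = client.chat.completions.create(
    model=os.environ["AZURE_OPENAI_DEPLOYMENT"],  # your deployed model name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```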
The notebook comes with starter exercises - but you are encouraged to add your own Markdown (description) and Code (prompt requests) sections to try out more examples or ideas - and build your intuition for prompt design.
Our Startup
Now, let's talk about how this topic relates to our startup mission to bring AI innovation to education. We want to build AI-powered applications for personalized learning - so let's think about how different users of our application might "design" prompts:
- Administrators might ask the AI to analyze curriculum data to identify gaps in coverage. The AI can summarize results or visualize them with code.
- Educators might ask the AI to generate a lesson plan for a target audience and topic. The AI can build the personalized plan in a specified format.
- Students might ask the AI to tutor them in a difficult subject. The AI can now guide students with lessons, hints & examples tailored to their level.
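To make the "design" idea concrete, here is one hypothetical way an educator-facing feature might assemble a lesson-plan prompt from a template. The wording, function name, and parameters are illustrative, not a prescribed format.

```python
# Illustrative prompt template for the "educator" scenario.
# The template text and parameters are examples, not a prescribed format.
def lesson_plan_prompt(topic: str, audience: str, duration_minutes: int) -> str:
    return (
        f"You are a curriculum assistant. Create a {duration_minutes}-minute "
        f"lesson plan on '{topic}' for {audience}. "
        "Structure the output with sections for: learning objectives, "
        "materials, activities, and a short assessment."
    )

print(lesson_plan_prompt("photosynthesis", "8th grade students", 45))
```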
That's just the tip of the iceberg. Check out Prompts For Education - an open-source prompts library curated by education experts - to get a broader sense of the possibilities! Try running some of those prompts in the sandbox or using the OpenAI Playground to see what happens!
What is Prompt Engineering?
We started this lesson by defining Prompt Engineering as the process of designing and optimizing text inputs (prompts) to deliver consistent and quality responses (completions) for a given application objective and model. We can think of this as a 2-step process:
- designing the initial prompt for a given model and objective
- refining the prompt iteratively to improve the quality of the response
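As a rough sketch of that loop, you might start with a bare prompt, inspect the completion, and then refine the prompt with more context and constraints. The helper below reuses the Azure OpenAI `client` from the setup sketch earlier in this lesson; the prompts themselves are purely illustrative.

```python
# Sketch of the design-then-refine loop, reusing `client` (and the deployment
# name env var) from the setup example above. Prompts are illustrative.
import os

def complete(prompt: str) -> str:
    response = client.chat.completions.create(
        model=os.environ["AZURE_OPENAI_DEPLOYMENT"],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

draft = complete("Write a quiz about fractions.")  # step 1: initial design
refined = complete(
    "Write a 5-question multiple-choice quiz about adding fractions "
    "for 4th graders. Include an answer key at the end."  # step 2: refine with audience, format, scope
)
print(draft, refined, sep="\n---\n")
```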
This is necessarily a trial-and-error process that requires user intuition and effort to get optimal results. So why is it important? To answer that question, we first need to understand three concepts:
- Tokenization = how the model "sees" the prompt
- Base LLMs = how the foundation model "processes" a prompt
- Instruction-Tuned LLMs = how the model can now see "tasks"
Tokenization
An LLM sees prompts as a sequence of tokens where different models (or versions of a model) can tokenize the same prompt in different ways. Since LLMs are trained on tokens (and not on raw text), the way prompts get tokenized has a direct impact on the quality of the generated response.
To get an intuition for how tokenization works, try tools like the OpenAI Tokenizer shown below. Copy in your prompt - and see how that gets converted into tokens, paying attention to how whitespace characters and punctuation marks are handled. Note that this example shows an older LLM (GPT-3) - so trying this with a newer model may produce a different result.
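You can also explore tokenization programmatically. The sketch below uses the `tiktoken` library with the `cl100k_base` encoding (used by GPT-3.5/GPT-4 era models); other models or versions may use different encodings and split the same text differently.

```python
# Sketch: inspect how a prompt is split into tokens with tiktoken.
# cl100k_base is the encoding used by GPT-3.5/GPT-4 era models; other
# models (or versions) may tokenize the same text differently.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
prompt = "Jupiter is the fifth planet from the Sun."
token_ids = encoding.encode(prompt)

print(f"{len(token_ids)} tokens")
for token_id in token_ids:
    # decode each token individually to see where the boundaries fall
    print(token_id, repr(encoding.decode([token_id])))
```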
Concept: Foundation Models
Once a prompt is tokenized, the primary function of the "Base LLM" (or foundation model) is to predict the next token in that sequence. Since LLMs are trained on massive text datasets, they have a good sense of the statistical relationships between tokens and can make that prediction with some confidence. Note that they don't understand the meaning of the words in the prompt or token; they just see a pattern they can "complete" with their next prediction. They can continue predicting the sequence until terminated by user intervention or some pre-established condition.
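To build intuition for "predicting the next token from statistical patterns", here is a deliberately tiny toy model: it counts which token follows which in a small corpus and always picks the most frequent continuation. Real LLMs learn far richer neural representations over vast data, but the core idea of "continue the pattern" is similar.

```python
# Toy illustration of next-token prediction: a bigram frequency table built
# from a tiny "training corpus". Real LLMs learn far richer statistics, but
# the idea of continuing a pattern is the same.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# count how often each token follows each other token
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(token: str) -> str:
    return following[token].most_common(1)[0][0]

# "complete" a prompt by repeatedly predicting the next token
sequence = ["the"]
for _ in range(4):
    sequence.append(predict_next(sequence[-1]))
print(" ".join(sequence))  # e.g. "the cat sat on the"
```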
Want to see how prompt-based completion works? Enter a prompt of your choice into the Azure OpenAI Studio Chat Playground with the default settings. The system is configured to treat prompts as requests for information - so you should see a completion that satisfies this context.
But what if the user wanted to see something specific that met some criteria or task objective? This is where instruction-tuned LLMs come into the picture.
Concept: Instruction Tuned LLMs
An Instruction Tuned LLM starts with the foundation model and fine-tunes it with examples or input/output pairs (e.g., multi-turn "messages") that contain clear instructions - and responses from the AI that attempt to follow those instructions.
This uses techniques like Reinforcement Learning from Human Feedback (RLHF) that train the model to follow instructions and learn from feedback, so that it produces responses better suited to practical applications and more relevant to user objectives.
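As a hedged illustration, instruction-tuning data is often represented as conversations that pair an instruction with a desired response. The records sketched below show the general shape of such data in a JSONL-style layout; this is illustrative only, not any specific vendor's fine-tuning schema.

```python
# Illustrative shape of instruction-tuning examples: an instruction plus the
# desired response, wrapped in a chat-style "messages" structure.
# This mirrors the general idea, not any specific vendor's training schema.
import json

training_examples = [
    {
        "messages": [
            {"role": "user", "content": "Summarize: Photosynthesis converts light energy into chemical energy in plants."},
            {"role": "assistant", "content": "Plants use photosynthesis to turn light into chemical energy."},
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "Translate to French: Good morning, class."},
            {"role": "assistant", "content": "Bonjour, la classe."},
        ]
    },
]

# write as JSONL, one example per line - a common layout for fine-tuning data
with open("instruction_tuning_examples.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")
```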
Types of LLMs
There are two primary types of LLMs:
- Base LLMs
- Instruction tuned LLMs
Related Read on Instruction Tuned LLMs: Top Open-Sourced Large Language Models Revolutionizing Conversational AI
Base LLMs
These models are trained on massive amounts of text data, often from the internet or other sources. Their primary function is to predict the next word in a given context. For example, when prompted with “What is the capital of France?” a base LLM might complete the sentence with “What is the capital of India?”. GPT-3 and BLOOM are a few examples of such base large language models.
Instruction Tuned LLMs
These models are designed to follow instructions more accurately. They begin with a base LLM and are fine-tuned with input-output pairs that include instructions and attempts to follow those instructions. Reinforcement Learning from Human Feedback (RLHF) is often employed to refine the model further, making it better at being helpful, honest, and harmless. As a result, instruction tuned LLMs are less likely to generate problematic text and are more suitable for practical applications.
Returning to the previous example, for the prompt “What is the capital of France?” an instruction tuned model would respond with “Paris” or “Paris is the capital of France”.
Examples of instruction tuned LLMs include OpenAI’s ChatGPT and Codex, Open Assistant, and others, which have been extensively used in applications ranging from chatbots to content generation.
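Here is a hedged sketch of that contrast using the Hugging Face `transformers` library: a base model simply continues the text, while an instruction-tuned model treats the prompt as a question to answer. The model choices (`gpt2` as a base model, `google/flan-t5-small` as an instruction-tuned one) are illustrative examples of each type, and exact outputs will vary.

```python
# Sketch: contrast a base model with an instruction-tuned model on the same
# prompt, using the Hugging Face transformers library. Model choices here
# (gpt2, flan-t5-small) are illustrative examples of each type.
from transformers import pipeline

prompt = "What is the capital of France?"

# Base model: continues the text pattern, which may just be more question-like text.
base = pipeline("text-generation", model="gpt2")
print(base(prompt, max_new_tokens=20)[0]["generated_text"])

# Instruction-tuned model: treats the prompt as a task and tries to answer it.
tuned = pipeline("text2text-generation", model="google/flan-t5-small")
print(tuned(prompt)[0]["generated_text"])
```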
LLM Types for Dummies: A Simple Analogy
To better understand the concept of LLMs and their types, let’s use an analogy that compares them to a fresh college graduate. Base LLMs can be likened to a recent college graduate who has read extensively and accumulated a wealth of knowledge and insights on various topics. Just like this graduate, base LLMs have been trained on massive amounts of text data and can generate relevant responses based on the context they are given. However, they might not always be precise or focused on specific instructions.
Instruction Tuned LLMs, on the other hand, can be compared to the same college graduate who has now decided on a focused career objective, such as becoming a Python Developer at a company. This graduate has gone through additional training to hone their skills in their chosen field and has received guidance from senior professionals to become more proficient at their job. Similarly, instruction tuned LLMs are fine-tuned with input-output pairs that include specific instructions and attempts to follow those instructions. The model is further refined using Reinforcement Learning from Human Feedback (RLHF), making it better equipped to follow instructions accurately and generate more relevant and helpful outputs.
Why do we need Prompt Engineering?
Now that we know how prompts are processed by LLMs, let's talk about why we need prompt engineering. The answer lies in the fact that current LLMs pose a number of challenges that make reliable and consistent completions harder to achieve without investing effort in prompt construction and optimization. For instance:
- Model responses are stochastic. The same prompt will likely produce different responses with different models or model versions. And it may even produce different results with the same model at different times. Prompt engineering techniques can help us minimize these variations by providing better guardrails.
- Models can fabricate responses. Models are pre-trained with large but finite datasets, meaning they lack knowledge about concepts outside that training scope. As a result, they can produce completions that are inaccurate, imaginary, or directly contradictory to known facts. Prompt engineering techniques help users identify and mitigate such fabrications e.g., by asking AI for citations or reasoning.
- Model capabilities will vary. Newer models or model generations will have richer capabilities but also bring unique quirks and tradeoffs in cost and complexity. Prompt engineering can help us develop best practices and workflows that abstract away differences and adapt to model-specific requirements in scalable, seamless ways.
Let's see this in action in the OpenAI or Azure OpenAI Playground:
- Use the same prompt with different LLM deployments (e.g., OpenAI, Azure OpenAI, Hugging Face) - did you see the variations?
- Use the same prompt repeatedly with the same LLM deployment (e.g., Azure OpenAI playground) - how did these variations differ?
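You can run a similar experiment from code. The sketch below sends the same prompt several times with a non-zero temperature and prints each completion so you can compare the variation; it reuses the Azure OpenAI `client` from the setup sketch earlier in this lesson, and the prompt is just an example.

```python
# Sketch: send the same prompt several times and compare the completions.
# Reuses the Azure OpenAI `client` from the setup sketch earlier in the lesson.
import os

prompt = "Complete this sentence in one line: The sky is"

for i in range(3):
    response = client.chat.completions.create(
        model=os.environ["AZURE_OPENAI_DEPLOYMENT"],
        temperature=1.0,  # non-zero temperature -> stochastic sampling, so responses vary
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"Run {i + 1}: {response.choices[0].message.content}")
```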
Fabrications Example
In this course, we use the term "fabrication" to reference the phenomenon where LLMs sometimes generate factually incorrect information due to limitations in their training or other constraints. You may also have heard this referred to as "hallucinations" in popular articles or research papers. However, we strongly recommend using "fabrication" as the term so we don't accidentally anthropomorphize the behavior by attributing a human-like trait to a machine-driven outcome. This also reinforces Responsible AI guidelines from a terminology perspective, removing terms that may also be considered offensive or non-inclusive in some contexts.
Want to get a sense of how fabrications work? Think of a prompt that instructs the AI to generate content for a non-existent topic (to ensure it is not found in the training dataset). For example - I tried this prompt:
Prompt: generate a lesson plan on the Martian War of 2076.
A web search showed me that there were fictional accounts (e.g., television series or books) of Martian wars - but none in 2076. Common sense also tells us that 2076 is in the future and thus cannot be associated with a real event.
So what happens when we run this prompt with different LLM providers?
Response 1: OpenAI Playground (GPT-3.5)

Response 2: Azure OpenAI Playground (GPT-3.5)

Response 3: Hugging Face Chat Playground (Llama-2)
As expected, each model (or model version) produces slightly different responses thanks to stochastic behavior and variations in model capability. For instance, one model targets an 8th grade audience while another assumes a high-school student. But all three models generated responses that could convince an uninformed user that the event was real.
Prompt engineering techniques like metaprompting and temperature configuration may reduce model fabrications to some extent. New prompt engineering architectures also incorporate new tools and techniques seamlessly into the prompt flow, to mitigate or reduce some of these effects.
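As a hedged sketch of two of those levers, the example below adds a system message (a simple metaprompt) asking the model to decline when it cannot verify a claim, and lowers the temperature to 0 to reduce variability. Again it reuses the Azure OpenAI `client` from the setup sketch; these settings can reduce, but not eliminate, fabricated content.

```python
# Sketch: two mitigation levers - a metaprompt (system message) and a low
# temperature. These reduce, but do not eliminate, fabricated content.
# Reuses `client` from the setup sketch earlier in the lesson.
import os

response = client.chat.completions.create(
    model=os.environ["AZURE_OPENAI_DEPLOYMENT"],
    temperature=0,  # low temperature -> less variation, less creative drift
    messages=[
        {
            "role": "system",
            "content": (
                "You are a careful teaching assistant. If you cannot verify "
                "that an event or source is real, say so explicitly instead "
                "of inventing details."
            ),
        },
        {"role": "user", "content": "Generate a lesson plan on the Martian War of 2076."},
    ],
)
print(response.choices[0].message.content)
```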
Case Study: GitHub Copilot
Let's wrap this section by getting a sense of how prompt engineering is used in real-world solutions by looking at one Case Study: GitHub Copilot.
GitHub Copilot is your "AI Pair Programmer" - it converts text prompts into code completions and is integrated into your development environment (e.g., Visual Studio Code) for a seamless user experience. As documented in the series of blog posts below, the earliest version was based on the OpenAI Codex model - with engineers quickly realizing the need to fine-tune the model and develop better prompt engineering techniques to improve code quality. In July 2023, they debuted an improved AI model that goes beyond Codex for even faster suggestions.
Read the posts in order, to follow their learning journey.
- May 2023 | GitHub Copilot is Getting Better at Understanding Your Code
- May 2023 | Inside GitHub: Working with the LLMs behind GitHub Copilot.
- Jun 2023 | How to write better prompts for GitHub Copilot.
- Jul 2023 | GitHub Copilot goes beyond Codex with improved AI model
- Jul 2023 | A Developer's Guide to Prompt Engineering and LLMs
- Sep 2023 | How to build an enterprise LLM app: Lessons from GitHub Copilot
You can also browse their Engineering blog for more posts like this one that show how these models and techniques are applied to drive real-world applications.