The Prompt Report: A Systematic Survey of Prompting Techniques


1 University of Maryland    2 OpenAI    3 Stanford    4 Microsoft    5 Vanderbilt    6 Princeton    7 Texas State University    8 Icahn School of Medicine    9 ASST Brianza    10 Mount Sinai Beth Israel    11 Instituto de Telecomunicações    12 University of Massachusetts Amherst
  *Equal Contribution
  sschulho@umd.edu   milie@umd.edu   resnik@umd.edu

In this paper, we conduct a systematic literature review of all Generative AI (GenAI) prompting techniques (prefix prompts only). We combine human and machine efforts to process 4,797 records from arXiv, Semantic Scholar, and ACL, extracting 1,565 relevant papers through the PRISMA review process. From this dataset we present 58 text-based techniques, complemented by an extensive collection of multimodal and multilingual techniques. Our goal is to provide a robust directory of prompting techniques that can be easily understood and implemented. We also review agents as an extension of prompting, along with methods for evaluating outputs and for designing prompts with safety and security in mind. Lastly, we apply prompting techniques in two case studies.

Abstract

Generative Artificial Intelligence systems are increasingly being deployed across industry and research settings. Developers and end users interact with these systems through prompting, or prompt engineering. While prompting is a widespread and highly researched concept, conflicting terminology and a poor ontological understanding of what constitutes a prompt persist due to the area's nascency. This paper establishes a structured understanding of prompts by assembling a taxonomy of prompting techniques and analyzing their use. We present a comprehensive vocabulary of 33 terms, a taxonomy of 58 text-only prompting techniques, and 40 techniques for other modalities. We further present a meta-analysis of the entire literature on natural language prefix prompting.

The PRISMA Review Process

During paper collection, we followed a systematic review process grounded in the PRISMA method. We first scraped arXiv, Semantic Scholar, and ACL using a keyword search; our keyword list comprised 44 terms, each closely related to prompting and prompt engineering. We then deduplicated our dataset based on paper titles, conducted extensive human and AI review for relevance, and automatically removed unrelated papers by checking paper bodies for the term "prompt".
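To make the automated portion of this filtering concrete, the sketch below shows how the deduplication and "prompt"-keyword checks could be implemented. The record fields ("title", "body") and the normalization rules are illustrative assumptions, not the exact pipeline used for the review.

# Illustrative sketch of the automated filtering stage of the PRISMA pipeline.
# Field names ("title", "body") and normalization rules are assumptions,
# not the exact implementation used in the paper.

def deduplicate_by_title(records):
    """Keep the first record for each normalized title."""
    seen, unique = set(), []
    for rec in records:
        key = rec["title"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def keep_if_mentions_prompt(records):
    """Drop papers whose body never mentions the term 'prompt'."""
    return [rec for rec in records if "prompt" in rec["body"].lower()]

def automated_filter(records):
    return keep_if_mentions_prompt(deduplicate_by_title(records))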


The PRISMA review process. We accumulate 4,247 unique records from which we extract 1,565 relevant records.

A Taxonomy of Prompting Techniques

We present a comprehensive taxonomy of prompting techniques: methods for instructing Large Language Models (LLMs) to complete tasks. We divide prompting techniques into three categories: text-based, multilingual, and multimodal. Multilingual techniques are used to prompt LLMs in non-English settings. Multimodal techniques are used when working with non-textual modalities such as images and audio.


All text-based prompting techniques from our dataset.


All multilingual prompting techniques.


All multimodal prompting techniques.

Prompt Exploration and Advice

We discuss various prompting terms, including prompt engineering, answer engineering, and few-shot prompting.


The prompt engineering process consists of three repeated steps: 1) performing inference on a dataset, 2) evaluating performance, and 3) modifying the prompt template. Note that the extractor is used to extract a final response from the LLM output (e.g., "This phrase is positive" maps to "positive").
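Read as pseudocode, this loop is easy to automate around any model. The sketch below is a minimal, hedged illustration of the inference-evaluate-modify cycle; the callables (llm, extract_answer, accuracy, revise_template) are placeholders supplied by the practitioner, not components defined in the paper.

# Hedged sketch of the three-step prompt engineering loop from the figure.
def engineer_prompt(template, dataset, llm, extract_answer, accuracy,
                    revise_template, rounds=5):
    """Repeat the inference -> evaluation -> modification cycle `rounds` times."""
    history = []
    for _ in range(rounds):
        # 1) Perform inference on the dataset with the current template.
        outputs = [llm(template.format(input=x)) for x, _ in dataset]
        # 2) Evaluate performance; the extractor maps raw LLM text to a final label.
        predictions = [extract_answer(output) for output in outputs]
        score = accuracy(predictions, [label for _, label in dataset])
        history.append((template, score))
        # 3) Modify the prompt template (a human edit or an automated optimizer).
        template = revise_template(template, score)
    return max(history, key=lambda pair: pair[1])  # best template and its score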


An annotated LLM output for a labeling task, showing the three design decisions of answer engineering: the choice of answer shape, space, and extractor. Since this is an output from a classification task, the answer shape could be restricted to a single token and the answer space to one of two tokens ("positive" or "negative"), though both are unrestricted in this image.
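For the binary sentiment example in this figure, the extractor can be as simple as a regular expression that restricts the answer space to the two labels. The sketch below is illustrative, not the paper's exact extractor:

import re

# Illustrative answer extractor for a binary sentiment labeling task.
# Answer shape: a single label; answer space: {"positive", "negative"}.
def extract_label(llm_output):
    """Return "positive" or "negative" if present in the output, else None."""
    match = re.search(r"\b(positive|negative)\b", llm_output.lower())
    return match.group(1) if match else None  # None marks an unparseable output

# e.g. extract_label("This phrase is positive") -> "positive"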


We highlight six main design decisions when crafting few-shot prompts. *Please note that recommendations here do not generalize to all tasks; in some cases, each of them could hurt performance.
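As a concrete illustration of several of these decisions (exemplar count, label balance, exemplar ordering, and a consistent format), a hypothetical few-shot prompt for binary sentiment classification might look like the following; the instruction wording and exemplars are invented for illustration:

# A hypothetical few-shot prompt illustrating common design decisions:
# a handful of exemplars, balanced labels, a consistent "Input/Label" format,
# and the test instance in the same format with the label left blank.
FEW_SHOT_PROMPT = """\
Classify the sentiment of each phrase as positive or negative.

Input: I love this movie.
Label: positive

Input: The service was terrible.
Label: negative

Input: What a wonderful surprise!
Label: positive

Input: I regret buying this.
Label: negative

Input: {test_input}
Label:"""

print(FEW_SHOT_PROMPT.format(test_input="The food was cold and bland."))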

Case Study: MMLU Benchmarking

In our first case study, we benchmark six distinct prompting techniques on the MMLU benchmark. We also explore the impact of formatting on results, finding performance variation between two different prompt formats for each prompting technique.


Accuracy values are shown for each prompting technique. Purple error bars illustrate the minimum and maximum for each technique, since each was run with different phrasings (except SC) and formats.
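The two formats could, for example, differ only in how the question and answer options are laid out. The templates below are hypothetical stand-ins meant to show the kind of surface variation involved, not the exact formats benchmarked in the study:

# Two hypothetical prompt formats for the same MMLU-style question.
# Small formatting changes like these can be enough to shift accuracy.
QUESTION = "Which planet is closest to the Sun?"
OPTIONS = ["Venus", "Mercury", "Earth", "Mars"]

FORMAT_A = "Question: {q}\nOptions: " + ", ".join(
    f"({chr(65 + i)}) {option}" for i, option in enumerate(OPTIONS)
) + "\nAnswer:"

FORMAT_B = "PROBLEM::{q}, OPTIONS::" + "; ".join(
    f"{chr(65 + i)}: {option}" for i, option in enumerate(OPTIONS)
) + ", ANSWER::"

print(FORMAT_A.format(q=QUESTION))
print(FORMAT_B.format(q=QUESTION))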

Case Study: Labelling for Suicide Crisis Syndrome (SCS)

In the second case study, we apply prompting techniques to the task of labelling Reddit posts as indicative of Suicide Crisis Syndrome (SCS). Through this case study, we aim to provide an example of the prompt engineering process in the context of a real-world problem. We utilize the University of Maryland Reddit Suicidality Dataset and an expert prompt engineer, documenting the process by which they boost the F1 score from 0 to 0.53.
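Because the case study tracks F1 on a binary label (indicative of SCS or not), each prompt iteration can be scored with a standard F1 computation, as in the minimal scikit-learn sketch below; the gold labels and predictions shown are invented for illustration:

from sklearn.metrics import f1_score

# Gold labels and model predictions for a handful of posts (1 = SCS, 0 = not).
# These values are invented for illustration only.
gold = [1, 0, 1, 1, 0, 0, 1, 0]
preds = [1, 0, 0, 1, 0, 1, 1, 0]

print(f"F1 = {f1_score(gold, preds):.2f}")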


Entrapment Scores of different prompting techniques.


Entrapment Scores of different prompting techniques, graphed as the prompt engineer develops them over time.


An automated prompt optimization framework (DSPy) was able to outperform our human prompt engineer.

The Prompt Report Dataset

Our systematic review of all prompting techniques is based on the dataset of 1,565 relevant papers we collected. Below is a preview of the dataset; certain columns, such as 'abstract', have been excluded. The full dataset is available on Hugging Face, including the complete CSV file and all paper PDFs.
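To work with the released data programmatically, the CSV can be pulled from Hugging Face and loaded with pandas. The repository id and file name below are placeholders; substitute the actual values from the project's Hugging Face page:

import pandas as pd
from huggingface_hub import hf_hub_download

# Placeholder repo id and file name; use the actual ones from the project's
# Hugging Face page.
csv_path = hf_hub_download(
    repo_id="<org>/<prompt-report-dataset>",  # hypothetical id
    filename="papers.csv",                    # hypothetical file name
    repo_type="dataset",
)
papers = pd.read_csv(csv_path)
print(len(papers), "papers;", list(papers.columns)[:5])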



We conducted several analyses of the dataset, which can be found in the paper, including an analysis of citation counts for different GenAI models, prompting techniques, and datasets.
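In spirit, the usage counts in the figures below can be reproduced by counting which papers mention each model or dataset name. The sketch below uses an abbreviated, illustrative name list and a simple substring match, which is a simplification of the paper's actual analysis:

from collections import Counter

# Rough sketch of counting how many papers mention each model name.
# The name list and the matching rule are illustrative simplifications.
MODEL_NAMES = ["gpt-3", "gpt-4", "llama", "palm", "claude", "bloom"]

def count_model_mentions(paper_texts):
    counts = Counter()
    for text in paper_texts:
        lowered = text.lower()
        for name in MODEL_NAMES:
            if name in lowered:
                counts[name] += 1  # count papers, not total occurrences
    return counts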


How often different models were used (cited) by papers in our dataset.


How often different benchmarking datasets were used (cited) by papers in our dataset.


How often papers within our dataset were cited by other papers within our dataset.

BibTeX

@misc{schulhoff2024prompt,
      title={The Prompt Report: A Systematic Survey of Prompting Techniques}, 
      author={Sander Schulhoff and Michael Ilie and Nishant Balepur and Konstantine Kahadze and Amanda Liu and Chenglei Si and Yinheng Li and Aayush Gupta and HyoJung Han and Sevien Schulhoff and Pranav Sandeep Dulepet and Saurav Vidyadhara and Dayeon Ki and Sweta Agrawal and Chau Pham and Gerson Kroiz and Feileen Li and Hudson Tao and Ashay Srivastava and Hevander Da Costa and Saloni Gupta and Megan L. Rogers and Inna Goncearenco and Giuseppe Sarli and Igor Galynker and Denis Peskoff and Marine Carpuat and Jules White and Shyamal Anadkat and Alexander Hoyle and Philip Resnik},
      year={2024},
      eprint={2406.06608},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}