Prompting large language models (LLMs) has become a game-changer for adapting them to new tasks with human-designed instructions. These models have an impressive ability to learn in context, allowing them to excel as few-shot learners. However, LLM predictions are highly sensitive to, and biased by, design choices such as the prompt template, the label space, and the demonstration examples. This leads to unexpected performance drops and creates obstacles to robust LLM applications.
To tackle this issue, various calibration methods have been developed to mitigate these biases and recover LLM performance. While these methods offer partial solutions (such as contextual calibration and domain-context calibration), there has been no unified analysis that clearly distinguishes each approach’s characteristics, strengths, and weaknesses. With this in mind, our new research paper, “Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering,” conducts a systematic analysis of existing calibration methods, offering a holistic view and highlighting their failure cases.
Inspired by our analysis, we introduce Batch Calibration (BC), a simple and intuitive method that addresses the limitations of previous approaches. BC mitigates bias by analyzing a batch of inputs, unifying different approaches and accurately estimating the contextual bias. BC requires no labeled data or extra content-free inputs, making it zero-shot and inference-only, with negligible additional computational cost. We tested the effectiveness of BC with PaLM 2 and CLIP models and achieved state-of-the-art performance across more than 10 natural language understanding and image classification tasks, surpassing previous calibration baselines.
Our motivation for this research was to establish practical guidelines for in-context learning calibration. We delved into the limitations of current methods and identified contextual bias as a significant issue: uncalibrated in-context learning tends to unfairly favor certain classes, for example predicting the most frequent label or the label of the last demonstration example. We also found that while calibration methods with non-linear decision boundaries are theoretically more flexible, they are prone to overfitting and instability in complex multi-class tasks, whereas linear decision boundaries prove more robust and generalizable. Moreover, relying on content-free inputs to estimate the bias isn’t always optimal and can itself introduce additional bias, depending on the task.
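To make the last point concrete, here is a minimal sketch of how a content-free approach like contextual calibration corrects the model’s scores, as we understand it: the per-class scores on a content-free input such as “N/A” serve as the bias estimate and are subtracted in log space (equivalent, up to normalization, to dividing the probabilities). The function name and the use of log probabilities are our own illustrative assumptions, not the exact formulation of any particular method.

```python
import numpy as np

def content_free_calibrate(test_log_probs: np.ndarray,
                           content_free_log_probs: np.ndarray) -> np.ndarray:
    """Illustrative content-free calibration (hypothetical helper).

    test_log_probs: shape (num_classes,), scores for a single test input.
    content_free_log_probs: shape (num_classes,), scores the model assigns
    to the label classes when given a content-free input such as "N/A".
    Subtracting the content-free scores in log space shifts the decision
    boundary linearly.
    """
    return test_log_probs - content_free_log_probs
```

The weakness noted above is visible here: the correction is only as good as the content-free estimate, and if the content-free input is itself biased for the task at hand, that bias is transferred to the calibrated prediction.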
Drawing from these insights, we designed BC to be a zero-shot, inference-only, and generalizable calibration technique with negligible computational cost. We believe accurately estimating the contextual bias is crucial for effective calibration. Instead of relying on content-free inputs, BC estimates the bias for each class in a content-based manner, by averaging the model’s output scores for that class across a batch of unlabeled inputs. Calibrated predictions are then obtained by centering the log-probability distribution on the estimated per-class mean, i.e., subtracting the estimated bias from each class’s score. BC can estimate the contextual bias once all test samples have been seen, or process outputs on-the-fly using a running estimate. It acts as a modular, adaptable layer on top of the model’s outputs that produces calibrated scores.
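In code, the core of BC can be sketched in a few lines. The sketch below assumes the model exposes per-class log probabilities for each input in the batch; the array shapes and helper names are our own illustrative choices, not the paper’s reference implementation.

```python
import numpy as np

def batch_calibrate(log_probs: np.ndarray) -> np.ndarray:
    """Batch Calibration sketch over a batch of per-class scores.

    log_probs: shape (batch_size, num_classes), the model's log probability
    for each label given each (unlabeled) input in the batch.
    """
    # Content-based estimate of the contextual bias: the per-class mean score.
    contextual_bias = log_probs.mean(axis=0, keepdims=True)
    # Center every sample's scores on that estimate.
    return log_probs - contextual_bias


class RunningBatchCalibration:
    """On-the-fly variant: keep a running per-class mean instead of waiting
    for the whole test set, mirroring the dynamic processing described above."""

    def __init__(self, num_classes: int):
        self.mean = np.zeros(num_classes)
        self.count = 0

    def calibrate(self, log_probs: np.ndarray) -> np.ndarray:
        # Update the running estimate with this sample, then subtract it.
        # Early estimates are noisy and stabilize as more samples arrive.
        self.count += 1
        self.mean += (log_probs - self.mean) / self.count
        return log_probs - self.mean
```

Predictions are then taken as the argmax of the calibrated scores; no labels, content-free inputs, or extra forward passes are needed beyond the ones already used for inference.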
Our experiments cover diverse natural language and image classification tasks, including standard benchmarks such as GLUE, SuperGLUE, SVHN, EuroSAT, and CLEVR. We evaluated PaLM 2 at three sizes (PaLM 2-S, PaLM 2-M, and PaLM 2-L) as well as CLIP ViT-B/16. BC consistently outperforms uncalibrated in-context learning, delivering performance gains of 8% and 6% on the small and large PaLM 2 variants, respectively. It surpasses the state-of-the-art prototypical calibration baseline by 6% on PaLM 2-S and outperforms the competitive contextual calibration baseline by an additional 3% on average on PaLM 2-L. Unlike previous baselines, which exhibit varying degrees of success across tasks, BC is a cost-effective technique that delivers stable performance improvements on all of them.
We also assessed BC’s performance while varying the number of in-context learning shots from 0 to 4, and BC consistently outperformed all baseline methods. As the number of shots increased, BC showed improved stability and overall performance. Additionally, when we visualized the decision boundaries, BC consistently recovered sensible boundaries, whereas existing calibration methods showed both success and failure cases.
We then analyzed the robustness of BC with respect to prompt engineering design choices that significantly impact LLM performance. We found that BC is more robust to the choice and ordering of in-context examples. Its performance remains consistent when prompt templates are altered, and it even recovers performance when unconventional label spaces (such as emoji pairs) are used. In short, BC makes prompt engineering easier and more efficient by handling prompt variations gracefully.
Furthermore, we explored the impact of batch size on BC’s performance. Unlike other methods that require hundreds of unlabeled samples to stabilize, BC proved to be remarkably sample-efficient, achieving strong performance with just around 10 unlabeled samples.
In conclusion, our research revisited previous calibration methods, analyzing their failure cases and deficiencies. We introduced Batch Calibration (BC), a zero-shot, inference-only technique that simplifies prompt engineering while providing state-of-the-art performance in both language and vision-language settings. BC significantly improves the robustness and efficiency of LLM predictions, making them more reliable across a variety of tasks. So, if you’re looking for a reliable calibration method, BC is your go-to solution.