Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation

Xunjian Yin*, Xu Zhang*, Jie Ruan, Xiaojun Wan

ACL 2024 · Main Conference

TL;DR

We introduce the concept of knowledge boundary to evaluate LLMs beyond fixed prompts, using projected gradient descent with semantic constraints (PGDC) to find optimal prompts for each piece of knowledge -- revealing what models truly know versus what they can only access through lucky phrasing.

In recent years, substantial advancements have been made in the development of large language models, achieving remarkable performance across diverse tasks. However, evaluating LLMs with fixed questions is unreliable due to prompt sensitivity. We introduce the concept of knowledge boundary to encompass both prompt-agnostic knowledge (accessible regardless of phrasing) and prompt-sensitive knowledge (accessible only with specific wordings). We propose PGDC (projected gradient descent with semantic constraints) to identify optimal prompts for each knowledge piece. Our loss function combines answer generation probability, semantic constraint via L2 distance between hidden representations, and regularization penalizing embeddings far from discrete tokens. Experiments across multiple models and datasets demonstrate that knowledge boundary evaluation provides a more reliable and robust measure of what models truly know, independent of superficial prompt variations.

Knowledge Boundary: Three Classes of Knowledge
The three classes of knowledge under our framework: Prompt-Agnostic Knowledge (answerable regardless of phrasing), Prompt-Sensitive Knowledge (accessible only with specific wordings), and Unanswerable Knowledge. Knowledge boundary captures both the first two categories.

The Prompt Sensitivity Problem

Ask a language model a question, and it might get it right. Rephrase the same question, and suddenly it fails. This prompt sensitivity is one of the most frustrating aspects of working with LLMs -- a model's apparent knowledge depends not just on what you ask, but on how you phrase it.

This creates a fundamental problem for evaluation. When we test a model's knowledge with a fixed set of questions, are we measuring what the model knows, or just how well the prompts happen to match what it learned? The distinction matters enormously.

Introducing Knowledge Boundary

We propose a new concept: the knowledge boundary. Rather than asking whether a model can answer a specific prompt, we ask: what is the full range of prompts for which this knowledge is accessible? The boundary encompasses both prompt-agnostic knowledge (accessible regardless of phrasing) and prompt-sensitive knowledge (accessible only with specific wordings).

PGDC Framework Overview
The PGDC framework: starting from labeled prompts, the algorithm performs gradient descent toward target answers while constraining the search to remain within the semantic space of the original question.

A More Robust Evaluation

Traditional benchmarks evaluate models on fixed question sets. If the model knows the answer but the exact phrasing happens to trigger a failure, the benchmark marks it wrong. If the model gets lucky with a phrasing, it marks it right. Neither result reflects true knowledge.

Knowledge boundary evaluation searches for the optimal prompt for each piece of knowledge. This gives a more reliable measure of what the model actually knows, independent of superficial prompt variations.

Finding the Boundary with PGDC

How do you find the optimal prompt for a given piece of knowledge? We developed a projected gradient descent algorithm with semantic constraints. The algorithm searches the space of possible prompts, maximizing the model's ability to retrieve the correct answer while staying semantically equivalent to the original question.

1. Continuous Relaxation

The algorithm operates in continuous embedding space rather than discrete token space, allowing gradient-based optimization over prompt representations.
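The idea can be illustrated with a minimal NumPy sketch (not the authors' implementation; the toy vocabulary, dimensions, and objective are all hypothetical): once a discrete prompt is relaxed into a matrix of continuous embedding vectors, plain gradient steps become possible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a 4-token vocabulary with 3-dim embeddings.
vocab_emb = rng.normal(size=(4, 3))

# Discrete prompt "token 2, token 0" relaxed to continuous vectors
# (a copy of the embedding rows, now treated as free parameters).
prompt = vocab_emb[[2, 0]].copy()

# Toy objective: pull the prompt toward some target representation.
target = rng.normal(size=(2, 3))

def loss(x):
    return 0.5 * np.sum((x - target) ** 2)

# Because the prompt lives in continuous space, we can take ordinary
# gradient steps -- impossible over discrete token indices.
lr = 0.1
losses = [loss(prompt)]
for _ in range(50):
    grad = prompt - target          # analytic gradient of the toy loss
    prompt -= lr * grad
    losses.append(loss(prompt))
# losses decreases monotonically under these gradient steps
```

A real instantiation would replace the toy quadratic with a loss computed by the language model, but the mechanics of optimizing embeddings directly are the same.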

2. Gradient Descent with Constraints

The loss function combines answer generation probability, semantic constraint via L2 distance between hidden representations, and a regularization term penalizing embeddings far from discrete tokens.
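The three-term structure can be sketched as follows (a toy stand-in, assuming the answer-probability term is supplied as a precomputed NLL; the function name, array shapes, and default weights are illustrative, not the paper's values):

```python
import numpy as np

def pgdc_loss(prompt_emb, answer_nll, orig_hidden, cur_hidden,
              vocab_emb, lam1=1.0, lam2=0.1):
    """Toy stand-in for Phi(X) = L(X, A) + lam1*R(X, Q) + lam2*delta(X).

    answer_nll  -- L: negative log-likelihood of the target answer
                   (passed in as a number here; a real model computes it).
    R           -- semantic constraint: L2 distance between the hidden
                   representations of the original and current prompts.
    delta       -- regularizer: distance from each prompt embedding to
                   its nearest discrete token embedding.
    """
    R = np.linalg.norm(cur_hidden - orig_hidden)
    # Distance from each position's embedding to the closest vocab row.
    dists = np.linalg.norm(
        prompt_emb[:, None, :] - vocab_emb[None, :, :], axis=-1)
    delta = dists.min(axis=1).sum()
    return answer_nll + lam1 * R + lam2 * delta

# Sanity check: a prompt sitting exactly on token embeddings, with an
# unchanged hidden state, pays no penalty -- loss reduces to the NLL.
vocab = np.eye(3)
h = np.zeros(4)
val = pgdc_loss(vocab[[0, 2]], answer_nll=2.0,
                orig_hidden=h, cur_hidden=h, vocab_emb=vocab)
```

Moving the embeddings off the token lattice, or drifting the hidden representation away from the original question, increases the loss through the δ and R terms respectively.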

3. Proximal Projection

A conditional threshold-based projection transforms optimized embeddings back to discrete text when the distance to the nearest token falls below threshold c, yielding a valid natural-language prompt.
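A minimal sketch of such a conditional projection (assumed behavior: snap an embedding to its nearest token only when close enough; the function name and the placeholder id -1 are illustrative):

```python
import numpy as np

def proximal_project(prompt_emb, vocab_emb, c=0.5):
    """Snap each continuous embedding to its nearest token embedding,
    but only when that nearest distance falls below the threshold c
    (a hypothetical value); otherwise the position keeps optimizing in
    continuous space. Returns the (possibly partially) projected
    embeddings and recovered token ids (-1 where no token is near)."""
    out = prompt_emb.copy()
    ids = np.full(len(prompt_emb), -1)
    for i, e in enumerate(prompt_emb):
        d = np.linalg.norm(vocab_emb - e, axis=1)
        j = int(d.argmin())
        if d[j] < c:
            out[i] = vocab_emb[j]
            ids[i] = j
    return out, ids
```

Once every position has been assigned a token id, the embedding matrix decodes back to a valid natural-language prompt.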

Optimization Details

PGDC uses the Adam optimizer with a learning rate of 1e-2 and an exponential scheduler, running for a maximum of 25 iterations per prompt. The loss function is: Φ(X) = L(X, A) + λ1R(X, Q) + λ2δ(X), where L measures answer generation probability using a sliding-window method, R enforces semantic similarity, and δ regularizes toward valid tokens.
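The outer loop can be sketched with the stated hyperparameters (learning rate 1e-2, at most 25 iterations); the hand-rolled Adam step, the toy quadratic objective, and the decay factor `gamma` are illustrative assumptions, not values from the paper:

```python
import numpy as np

def adam_step(x, g, m, v, t, lr, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias correction (standard formulation)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return x - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy stand-in for the prompt embeddings and the PGDC objective.
x = np.array([3.0, -2.0])
target = np.zeros(2)
m = np.zeros_like(x)
v = np.zeros_like(x)

base_lr, gamma = 1e-2, 0.95       # lr from the text; gamma is assumed
for t in range(1, 26):            # at most 25 iterations per prompt
    grad = x - target             # gradient of 0.5 * ||x - target||^2
    lr = base_lr * gamma ** (t - 1)   # exponential lr schedule
    x, m, v = adam_step(x, grad, m, v, t, lr)
```

In the real algorithm the gradient comes from backpropagating Φ(X) through the model, and the proximal projection step interleaves with these updates.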

Knowledge Boundaries Across Domains
Radar chart showing knowledge boundaries across six domains (natural sciences, medical, computer science, social sciences, humanities, and others) for five models on MMLU, revealing how different models have different knowledge profiles.

Knowledge Boundary Results (Accuracy %)

Dataset   Model    Original   PGDC
PaRaRel   LLaMA2   --         71.36
KAssess   LLaMA2   --         69.84
CFACT     LLaMA2   --          3.41
AlCuna    LLaMA2   --          0.00

Human Evaluation: Semantic Preservation Rate

Model    Preservation Rate
GPT-2    80.5%
GPT-J    85.1%
LLaMA2   83.3%
Vicuna   86.2%

What We Learned

Our experiments revealed that knowledge boundaries vary dramatically across models and domains. Some knowledge is robustly accessible; other knowledge hangs by a thread, retrievable only with very specific prompts. Understanding these boundaries helps us build more reliable systems and design better evaluation protocols.

Knowledge boundary offers a new lens on model capabilities -- one that sees past the noise of prompt sensitivity to measure genuine understanding.

Citation

@inproceedings{yin-etal-2024-benchmarking,
    title = "Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation",
    author = "Yin, Xunjian and Zhang, Xu and Ruan, Jie and Wan, Xiaojun",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics",
    year = "2024",
    url = "https://aclanthology.org/2024.acl-long.124/",
    pages = "2270--2286"
}