Benchmarking Knowledge Boundary for Large Language Model: A Different Perspective on Model Evaluation
ACL 2024 · Main Conference
We introduce the concept of knowledge boundary to evaluate LLMs beyond fixed prompts, using projected gradient descent with semantic constraints (PGDC) to find optimal prompts for each piece of knowledge -- revealing what models truly know versus what they can only access through lucky phrasing.
In recent years, substantial advancements have been made in the development of large language models, achieving remarkable performance across diverse tasks. However, evaluating LLMs with fixed questions is unreliable due to prompt sensitivity. We introduce the concept of knowledge boundary to encompass both prompt-agnostic knowledge (accessible regardless of phrasing) and prompt-sensitive knowledge (accessible only with specific wordings). We propose PGDC (projected gradient descent with semantic constraints) to identify optimal prompts for each knowledge piece. Our loss function combines answer generation probability, semantic constraint via L2 distance between hidden representations, and regularization penalizing embeddings far from discrete tokens. Experiments across multiple models and datasets demonstrate that knowledge boundary evaluation provides a more reliable and robust measure of what models truly know, independent of superficial prompt variations.
The Prompt Sensitivity Problem
Ask a language model a question, and it might get it right. Rephrase the same question, and suddenly it fails. This prompt sensitivity is one of the most frustrating aspects of working with LLMs -- a model's apparent knowledge depends not just on what you ask, but on how you phrase it.
This creates a fundamental problem for evaluation. When we test a model's knowledge with a fixed set of questions, are we measuring what the model knows, or just how well the prompts happen to match what it learned? The distinction matters enormously.
Introducing Knowledge Boundary
We propose a new concept: the knowledge boundary. Rather than asking whether a model can answer a specific prompt, we ask: what is the full range of prompts for which this knowledge is accessible? The boundary encompasses both prompt-agnostic knowledge (accessible regardless of phrasing) and prompt-sensitive knowledge (accessible only with specific wordings).
A More Robust Evaluation
Traditional benchmarks evaluate models on fixed question sets. If the model knows the answer but the exact phrasing happens to trigger a failure, the benchmark marks it wrong. If the model gets lucky with a phrasing, it marks it right. Neither result reflects true knowledge.
Knowledge boundary evaluation searches for the optimal prompt for each piece of knowledge. This gives a more reliable measure of what the model actually knows, independent of superficial prompt variations.
Finding the Boundary with PGDC
How do you find the optimal prompt for a given piece of knowledge? We developed a projected gradient descent algorithm with semantic constraints. The algorithm searches the space of possible prompts, maximizing the model's ability to retrieve the correct answer while staying semantically equivalent to the original question.
Continuous Relaxation
The algorithm operates in continuous embedding space rather than discrete token space, allowing gradient-based optimization over prompt representations.
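A minimal sketch of this relaxation (the embedding table, sizes, and token ids here are illustrative stand-ins, not the paper's code):

```python
import torch

# Toy embedding table standing in for the model's input embeddings
# (vocab_size and d_model are illustrative).
vocab_size, d_model = 100, 16
emb_table = torch.randn(vocab_size, d_model)

# Start from the discrete prompt, then optimize its embeddings directly.
prompt_ids = torch.tensor([5, 17, 42])
X = emb_table[prompt_ids].clone().requires_grad_(True)

# Any differentiable objective now yields gradients w.r.t. X itself,
# sidestepping the non-differentiable token vocabulary.
loss = X.pow(2).sum()  # stand-in objective
loss.backward()
```

Because `X` lives in embedding space, standard gradient-based optimizers apply directly; mapping back to discrete tokens is deferred to the projection step.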
Gradient Descent with Constraints
The loss function combines answer generation probability, semantic constraint via L2 distance between hidden representations, and a regularization term penalizing embeddings far from discrete tokens.
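One way to write these three terms as code (the function name, shapes, and λ values are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def pgdc_loss(answer_logits, answer_ids, h_x, h_q, X, emb_table,
              lam1=1.0, lam2=0.1):
    # L(X, A): negative log-likelihood of the gold answer tokens.
    nll = F.cross_entropy(answer_logits, answer_ids)
    # R(X, Q): L2 distance between hidden representations of the
    # optimized prompt and the original question, keeping them
    # semantically close.
    sem = torch.linalg.vector_norm(h_x - h_q)
    # delta(X): distance from each soft embedding to its nearest
    # vocabulary embedding, pulling X toward valid discrete tokens.
    reg = torch.cdist(X, emb_table).min(dim=1).values.sum()
    return nll + lam1 * sem + lam2 * reg
```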
Proximal Projection
A conditional threshold-based projection transforms optimized embeddings back to discrete text when the distance to the nearest token falls below threshold c, yielding a valid natural-language prompt.
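A toy sketch of the threshold-based projection (pure NumPy; the helper name and example data are assumptions):

```python
import numpy as np

def proximal_project(X, emb_table, c=0.5):
    """Snap each soft embedding to its nearest vocabulary embedding
    whenever that distance falls below the threshold c; embeddings
    still far from every token are left in continuous space."""
    projected, token_ids = [], []
    for x in X:
        dists = np.linalg.norm(emb_table - x, axis=1)
        j = int(dists.argmin())
        if dists[j] < c:
            projected.append(emb_table[j])   # valid discrete token
            token_ids.append(j)
        else:
            projected.append(x)              # keep optimizing this slot
            token_ids.append(None)
    return np.stack(projected), token_ids

# A point close to token 1 snaps to it; a distant point stays soft.
table = np.eye(3)
soft = np.array([[0.1, 0.95, 0.0],   # near token 1
                 [5.0, 5.0, 5.0]])   # near nothing
proj, ids = proximal_project(soft, table)
```

Slots that resolve to a token id form the recovered natural-language prompt; the conditional threshold prevents forcing a far-away embedding onto an arbitrary token.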
Optimization Details
PGDC uses the Adam optimizer with a learning rate of 1e-2 and an exponential scheduler, running for a maximum of 25 iterations per prompt. The loss function is: Φ(X) = L(X,A) + λ1R(X,Q) + λ2δ(X), where L measures answer generation probability using a sliding-window method, R enforces semantic similarity, and δ regularizes toward valid tokens.
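The optimization loop might look like the sketch below, with a toy quadratic standing in for Φ(X); the learning rate and 25-iteration cap come from the text above, while the decay factor and tensor shapes are assumptions:

```python
import torch

X = torch.randn(3, 16, requires_grad=True)   # soft prompt embeddings
opt = torch.optim.Adam([X], lr=1e-2)         # lr from the paper
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)  # gamma assumed

losses = []
for step in range(25):                        # max 25 PGDC iterations
    opt.zero_grad()
    phi = X.pow(2).sum()                      # toy surrogate for Phi(X)
    phi.backward()
    opt.step()
    sched.step()
    losses.append(float(phi))
# In PGDC proper, the proximal projection step then transforms the
# optimized embeddings back into a discrete natural-language prompt.
```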
Knowledge Boundary Results (Accuracy %)
| Dataset | Model | Original | PGDC |
|---|---|---|---|
| PaRaRel | LLaMA2 | -- | 71.36 |
| KAssess | LLaMA2 | -- | 69.84 |
| CFACT | LLaMA2 | -- | 3.41 |
| AlCuna | LLaMA2 | -- | 0.00 |
Human Evaluation: Semantic Preservation Rate
| Model | Preservation Rate |
|---|---|
| GPT-2 | 80.5% |
| GPT-J | 85.1% |
| LLaMA2 | 83.3% |
| Vicuna | 86.2% |
What We Learned
Our experiments revealed that knowledge boundaries vary dramatically across models and domains. Some knowledge is robustly accessible; other knowledge hangs by a thread, retrievable only with very specific prompts. Understanding these boundaries helps us build more reliable systems and design better evaluation protocols.
Knowledge boundary offers a new lens on model capabilities -- one that sees past the noise of prompt sensitivity to measure genuine understanding.
Key Contributions
- New Framework: Knowledge boundary as a prompt-robust evaluation concept, distinguishing prompt-agnostic from prompt-sensitive knowledge
- PGDC Algorithm: Projected gradient descent with semantic constraints to find optimal prompts in continuous embedding space
- Better Evaluation: More reliable assessment of model knowledge that is independent of specific prompt phrasing
- Cross-Domain Insights: Revealing how knowledge boundaries differ across domains and models, exposing the gap between apparent and actual model knowledge