Benchmarking Knowledge Boundary for Large Language Model: A Different Perspective on Model Evaluation
ACL 2024 · Main Conference
We introduce the concept of knowledge boundary to evaluate LLMs beyond fixed prompts, using projected gradient descent with semantic constraints (PGDC) to find optimal prompts for each piece of knowledge -- revealing what models truly know versus what they can only access through lucky phrasing.
In recent years, substantial advancements have been made in the development of large language models, achieving remarkable performance across diverse tasks. However, evaluating LLMs with fixed questions is unreliable due to prompt sensitivity. We introduce the concept of knowledge boundary to encompass both prompt-agnostic knowledge (accessible regardless of phrasing) and prompt-sensitive knowledge (accessible only with specific wordings). We propose PGDC (projected gradient descent with semantic constraints) to identify optimal prompts for each knowledge piece. Our loss function combines answer generation probability, semantic constraint via L2 distance between hidden representations, and regularization penalizing embeddings far from discrete tokens. Experiments across multiple models and datasets demonstrate that knowledge boundary evaluation provides a more reliable and robust measure of what models truly know, independent of superficial prompt variations.
The Prompt Sensitivity Problem
Ask a language model a question, and it might get it right. Rephrase the same question, and suddenly it fails. This prompt sensitivity is one of the most frustrating aspects of working with LLMs -- a model's apparent knowledge depends not just on what you ask, but on how you phrase it.
This creates a fundamental problem for evaluation. When we test a model's knowledge with a fixed set of questions, are we measuring what the model knows, or just how well the prompts happen to match what it learned? The distinction matters enormously.
Introducing Knowledge Boundary
We propose a new concept: the knowledge boundary. Rather than asking whether a model can answer a specific prompt, we ask: what is the full range of prompts for which this knowledge is accessible? The boundary encompasses both prompt-agnostic knowledge (accessible regardless of phrasing) and prompt-sensitive knowledge (accessible only with specific wordings).
A More Robust Evaluation
Traditional benchmarks evaluate models on fixed question sets. If the model knows the answer but the exact phrasing happens to trigger a failure, the benchmark marks it wrong. If the model gets lucky with a phrasing, it marks it right. Neither result reflects true knowledge.
Knowledge boundary evaluation searches for the optimal prompt for each piece of knowledge. This gives a more reliable measure of what the model actually knows, independent of superficial prompt variations.
Finding the Boundary with PGDC
How do you find the optimal prompt for a given piece of knowledge? We developed a projected gradient descent algorithm with semantic constraints. The algorithm searches the space of possible prompts, maximizing the model's ability to retrieve the correct answer while staying semantically equivalent to the original question.
Continuous Relaxation
The algorithm operates in continuous embedding space rather than discrete token space, allowing gradient-based optimization over prompt representations.
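A minimal sketch of this relaxation (the embedding table, sizes, and token ids here are illustrative stand-ins, not the paper's code):

```python
import torch

# Toy embedding table standing in for the model's input embeddings
# (vocab_size and d_model are illustrative).
vocab_size, d_model = 100, 16
emb_table = torch.randn(vocab_size, d_model)

# Start from the discrete prompt, then optimize its embeddings directly.
prompt_ids = torch.tensor([5, 17, 42])
X = emb_table[prompt_ids].clone().requires_grad_(True)

# Any differentiable objective now yields gradients w.r.t. X itself,
# sidestepping the non-differentiable token vocabulary.
loss = X.pow(2).sum()  # stand-in objective
loss.backward()
```

Because `X` lives in embedding space, standard gradient-based optimizers apply directly; mapping back to discrete tokens is deferred to the projection step.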
Gradient Descent with Constraints
The loss function combines answer generation probability, semantic constraint via L2 distance between hidden representations, and a regularization term penalizing embeddings far from discrete tokens.
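One way to write these three terms as code (the function name, shapes, and λ values are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def pgdc_loss(answer_logits, answer_ids, h_x, h_q, X, emb_table,
              lam1=1.0, lam2=0.1):
    # L(X, A): negative log-likelihood of the gold answer tokens.
    nll = F.cross_entropy(answer_logits, answer_ids)
    # R(X, Q): L2 distance between hidden representations of the
    # optimized prompt and the original question, keeping them
    # semantically close.
    sem = torch.linalg.vector_norm(h_x - h_q)
    # delta(X): distance from each soft embedding to its nearest
    # vocabulary embedding, pulling X toward valid discrete tokens.
    reg = torch.cdist(X, emb_table).min(dim=1).values.sum()
    return nll + lam1 * sem + lam2 * reg
```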
Proximal Projection
A conditional threshold-based projection transforms optimized embeddings back to discrete text when the distance to the nearest token falls below threshold c, yielding a valid natural-language prompt.
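A toy sketch of the threshold-based projection (pure NumPy; the helper name and example data are assumptions):

```python
import numpy as np

def proximal_project(X, emb_table, c=0.5):
    """Snap each soft embedding to its nearest vocabulary embedding
    whenever that distance falls below the threshold c; embeddings
    still far from every token are left in continuous space."""
    projected, token_ids = [], []
    for x in X:
        dists = np.linalg.norm(emb_table - x, axis=1)
        j = int(dists.argmin())
        if dists[j] < c:
            projected.append(emb_table[j])   # valid discrete token
            token_ids.append(j)
        else:
            projected.append(x)              # keep optimizing this slot
            token_ids.append(None)
    return np.stack(projected), token_ids

# A point close to token 1 snaps to it; a distant point stays soft.
table = np.eye(3)
soft = np.array([[0.1, 0.95, 0.0],   # near token 1
                 [5.0, 5.0, 5.0]])   # near nothing
proj, ids = proximal_project(soft, table)
```

Slots that resolve to a token id form the recovered natural-language prompt; the conditional threshold prevents forcing a far-away embedding onto an arbitrary token.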
Optimization Details
PGDC uses the Adam optimizer with a learning rate of 1e-2 and an exponential scheduler, running for a maximum of 25 iterations per prompt. The loss function is: Φ(X) = L(X,A) + λ1R(X,Q) + λ2δ(X), where L measures answer generation probability using a sliding-window method, R enforces semantic similarity, and δ regularizes toward valid tokens.
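The optimization loop might look like the sketch below, with a toy quadratic standing in for Φ(X); the learning rate and 25-iteration cap come from the text above, while the decay factor and tensor shapes are assumptions:

```python
import torch

X = torch.randn(3, 16, requires_grad=True)   # soft prompt embeddings
opt = torch.optim.Adam([X], lr=1e-2)         # lr from the paper
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)  # gamma assumed

losses = []
for step in range(25):                        # max 25 PGDC iterations
    opt.zero_grad()
    phi = X.pow(2).sum()                      # toy surrogate for Phi(X)
    phi.backward()
    opt.step()
    sched.step()
    losses.append(float(phi))
# In PGDC proper, the proximal projection step then transforms the
# optimized embeddings back into a discrete natural-language prompt.
```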
Knowledge Boundary Results (Accuracy %)
| Dataset | Model | Original | PGDC |
|---|---|---|---|
| PaRaRel | LLaMA2 | -- | 71.36 |
| KAssess | LLaMA2 | -- | 69.84 |
| CFACT | LLaMA2 | -- | 3.41 |
| AlCuna | LLaMA2 | -- | 0.00 |
Human Evaluation: Semantic Preservation Rate
| Model | Preservation Rate |
|---|---|
| GPT-2 | 80.5% |
| GPT-J | 85.1% |
| LLaMA2 | 83.3% |
| Vicuna | 86.2% |
What We Learned
Our experiments revealed that knowledge boundaries vary dramatically across models and domains. Some knowledge is robustly accessible; other knowledge hangs by a thread, retrievable only with very specific prompts. Understanding these boundaries helps us build more reliable systems and design better evaluation protocols.
Knowledge boundary offers a new lens on model capabilities -- one that sees past the noise of prompt sensitivity to measure genuine understanding.
Key Contributions
- New Framework: Knowledge boundary as a prompt-robust evaluation concept, distinguishing prompt-agnostic from prompt-sensitive knowledge
- PGDC Algorithm: Projected gradient descent with semantic constraints to find optimal prompts in continuous embedding space
- Better Evaluation: More reliable assessment of model knowledge that is independent of specific prompt phrasing
- Cross-Domain Insights: Revealing how knowledge boundaries differ across domains and models, exposing the gap between apparent and actual model knowledge