ALCUNA: Large Language Models Meet New Knowledge

Xunjian Yin*, Baizhou Huang*, Xiaojun Wan

EMNLP 2023

TL;DR

We create artificial entities guaranteed absent from training data and build ALCUNA, a benchmark that cleanly tests whether LLMs can understand, differentiate, and reason about genuinely new knowledge -- revealing that models struggle significantly when they cannot fall back on memorization.

With the rapid development of NLP, large language models (LLMs) excel on a wide range of benchmarks. However, existing benchmarks often test knowledge a model may have seen during training, so they cannot adequately measure an LLM's ability to handle new knowledge -- an increasingly important capability. We propose KnowGen, an approach that generates new knowledge by altering the attributes and relationships of existing entities, yielding artificial entities guaranteed to be absent from LLMs' training data. With KnowGen, we introduce ALCUNA, a benchmark that assesses LLMs' abilities in knowledge understanding, differentiation, and association. We benchmark several LLMs and find that their performance in the face of new knowledge is not satisfactory, particularly when reasoning between new and internal knowledge. We also explore how entity similarity and context affect LLMs' understanding of new knowledge.

Overview of ALCUNA. KnowGen creates artificial entities by modifying attributes and relationships of existing ones, producing knowledge guaranteed to be absent from any training corpus. The benchmark then probes three dimensions: understanding, differentiation, and association.

The Benchmark Contamination Problem

How do we know if a language model truly understands, or if it has just memorized the answers? This question becomes urgent when we realize that many benchmarks test knowledge the model has seen during training. High scores might reflect recall rather than reasoning.

The world also keeps generating new knowledge -- new people, new companies, new discoveries. Models trained on past data cannot have this information, yet real-world applications constantly encounter it. How do models handle knowledge they have never seen?

Creating Artificial Entities

We developed KnowGen, an approach that generates genuinely new knowledge by creating artificial entities. These are not real people or places -- they are fabricated with coherent attributes and relationships, but they are guaranteed to be absent from any training data.

By testing models on these artificial entities, we can cleanly separate understanding from memorization. A model that performs well on ALCUNA must be reasoning from the provided context, not pulling answers from its training data.

1. Entity Selection: Select an existing real-world entity from a knowledge graph as the seed, along with its attributes and relationships.

2. Attribute Modification: Systematically alter the entity's attributes (name, properties, connections) to create a new artificial entity that is coherent yet entirely novel.

3. Knowledge Generation: Generate contextual descriptions and relational knowledge about the artificial entity, forming a complete knowledge profile.

4. Benchmark Construction: Create evaluation tasks across three dimensions -- understanding, differentiation, and association -- using the artificial entities and their contexts.
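The pipeline above can be sketched in a few lines. This is a simplified illustration of the heredity-and-variation idea (the `Entity` class, the naming scheme, and the "borrow half the shared attributes from a sibling" heuristic are our assumptions, not the paper's exact procedure):

```python
import random
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    attributes: dict  # attribute name -> value
    relations: dict   # relation name -> target entity name

def knowgen(seed: Entity, sibling: Entity, rng: random.Random) -> Entity:
    """Sketch of KnowGen-style fabrication: inherit most attributes from the
    seed, vary some by borrowing from a similar sibling, and rename the
    result so it matches no real-world entity."""
    attrs = dict(seed.attributes)
    # Variation: overwrite a random subset of shared attributes with the
    # sibling's values, keeping the profile internally coherent.
    shared = [k for k in attrs if k in sibling.attributes]
    for key in rng.sample(shared, k=max(1, len(shared) // 2)):
        attrs[key] = sibling.attributes[key]
    # Heredity: relations are carried over; the name is made novel
    # (placeholder naming scheme for illustration).
    new_name = f"{seed.name}-X{rng.randint(100, 999)}"
    return Entity(new_name, attrs, dict(seed.relations))
```

Because the fabricated entity inherits a real taxonomy position but carries a novel name and a perturbed attribute profile, questions about it cannot be answered from memorized facts alone.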

The ALCUNA Benchmark

ALCUNA tests three core capabilities when models face new knowledge: knowledge understanding (can the model grasp the new information?), knowledge differentiation (can it distinguish new entities from similar known ones?), and knowledge association (can it reason about relationships between new and existing knowledge?).
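An evaluation item along these lines pairs a context describing the new entity with a question targeting one of the three dimensions. The field names and prompt template below are illustrative, not the released data format:

```python
# Hypothetical shape of an ALCUNA-style evaluation item.
task = {
    "dimension": "association",  # understanding | differentiation | association
    "context": "Frog-X142 is a newly described amphibian whose diet is worms.",
    "question": "Does Frog-X142 belong to the same class as the common toad?",
    "answer": "yes",
}

def build_prompt(item: dict) -> str:
    """Place the new-entity context before the question, so the model must
    reason from the prompt rather than from memorized knowledge."""
    return f"Context: {item['context']}\nQuestion: {item['question']}\nAnswer:"
```

Association items like this one are the hardest case: answering requires combining the in-context fact (a new amphibian) with internal knowledge (what class toads belong to).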

Results

Our experiments revealed that LLMs struggle significantly with new knowledge. They conflate artificial entities with similar real ones, fail to reason about relationships involving new entities, and show brittle performance when contexts require integrating new and internal knowledge.

Key Findings Across Evaluation Dimensions

Capability      | Challenge                                    | Observation
--------------- | -------------------------------------------- | ----------------------------------------------------------
Understanding   | Absorbing new knowledge from context         | Models struggle to properly integrate novel information
Differentiation | Distinguishing new from known entities       | Entity similarity causes significant interference
Association     | Reasoning between new and internal knowledge | Particularly weak; models are biased toward training data

The Troubling Pattern

The capabilities that shine on standard benchmarks do not transfer well to genuinely novel scenarios: once memorization is off the table, performance drops across all three dimensions.

Understanding the Failures

We dug deeper into what causes these failures. Entity similarity matters -- models confuse new entities with existing ones that share surface features. Context structure matters -- certain ways of presenting new information help models more than others. But fundamentally, models seem biased toward their training knowledge, even when context clearly provides different information.

Key Findings

Impact of Entity Similarity

When artificial entities are more similar to real-world ones in surface form, model performance degrades further. This suggests that LLMs rely heavily on pattern matching against memorized entities rather than genuinely processing the provided context -- a fundamental limitation for handling truly novel information.
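One simple way to operationalize "surface similarity" is overlap between attribute profiles. The Jaccard proxy below is our illustration, not the paper's exact measure:

```python
def attribute_jaccard(a: dict, b: dict) -> float:
    """Surface similarity between two entities as the Jaccard overlap of
    their (attribute, value) pairs; 0.0 for two empty profiles."""
    pa, pb = set(a.items()), set(b.items())
    return len(pa & pb) / len(pa | pb) if pa | pb else 0.0
```

Binning artificial entities by a score like this against their nearest real neighbor would let one plot accuracy against similarity; the paper's finding is that accuracy falls as that similarity rises.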

A Call for Caution

ALCUNA serves as both a benchmark and a warning. As we deploy language models in ever-expanding domains, they will constantly encounter knowledge beyond their training. Our results suggest that the impressive performance on standard benchmarks may not transfer to these real-world scenarios.

We hope this work motivates research into models that can better handle the genuinely new -- not just recall what they have already learned.

Citation

@inproceedings{yin-etal-2023-alcuna,
    title = "{ALCUNA}: Large Language Models Meet New Knowledge",
    author = "Yin, Xunjian and Huang, Baizhou and Wan, Xiaojun",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    year = "2023",
    url = "https://aclanthology.org/2023.emnlp-main.87/",
    pages = "1397--1414"
}