ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning

Xiangru Tang, Tianyu Hu, Muyang Ye, Yanjun Shao, Xunjian Yin, Siru Ouyang, Wangchunshu Zhou, Pan Lu, Zhuosheng Zhang, Yilun Zhao, Arman Cohan, Mark Gerstein

ICLR 2025

TL;DR

ChemAgent equips LLMs with a dynamic, self-updating library that accumulates chemical knowledge through problem-solving experience. Without any fine-tuning, it achieves up to 46% improvement over baseline GPT-4 on chemistry reasoning benchmarks.

Chemical reasoning usually involves complex, multi-step processes that demand precise calculations, where even minor errors can lead to cascading failures. Furthermore, large language models (LLMs) encounter difficulties handling domain-specific formulas, executing reasoning steps accurately, and integrating code effectively when tackling chemical reasoning tasks. To address these challenges, we present ChemAgent, a novel framework designed to improve the performance of LLMs through a dynamic, self-updating library. This library is developed by decomposing chemical tasks into sub-tasks and compiling these sub-tasks into a structured collection that can be referenced for future queries. Then, when presented with a new problem, ChemAgent retrieves and refines pertinent information from the library, which we call memory, facilitating effective task decomposition and the generation of solutions. Our method designs three types of memory and a library-enhanced reasoning component, enabling LLMs to improve over time through experience. Experimental results on four chemical reasoning datasets from SciBench demonstrate that ChemAgent achieves performance gains of up to 46% (GPT-4), significantly outperforming existing methods. Our findings suggest substantial potential for future applications, including tasks such as drug discovery and materials science.

ChemAgent Framework Overview
Overview of the ChemAgent framework. The agent decomposes chemistry problems into sub-tasks, stores successful reasoning patterns in a self-updating library with three types of memory, and retrieves relevant knowledge when encountering new problems.

The Chemistry Challenge

Chemistry presents a unique challenge for language models. It requires not just general reasoning ability, but deep domain knowledge -- reaction mechanisms, molecular properties, thermodynamic principles, and precise numerical calculations. Even minor errors in a multi-step chemistry derivation can cascade into completely wrong answers.

The traditional solutions -- fine-tuning on chemistry datasets or using static knowledge bases -- have clear limitations. Fine-tuning is expensive and does not scale across domains. Static knowledge bases cannot adapt to new problems or learn from experience. What if a model could build its own chemistry expertise through practice?

Core Idea

Instead of encoding all chemistry knowledge during training, ChemAgent lets the model accumulate expertise through experience -- decomposing problems into reusable sub-tasks and storing successful patterns in a structured, self-updating library. This mirrors how human chemists develop expertise: through practice, reflection, and building personal knowledge repositories.

How It Works: The Self-Updating Library Pipeline

ChemAgent operates through a structured pipeline that decomposes, stores, retrieves, and reasons over chemistry problems.

1. Task Decomposition

Given a chemistry problem, ChemAgent breaks it down into smaller, well-defined sub-tasks. Each sub-task corresponds to a specific chemical operation -- computing molar mass, balancing an equation, applying a thermodynamic formula, etc.
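The decomposition step can be pictured with a toy sketch. The `SubTask` structure and rule-based `decompose` below are illustrative stand-ins (not ChemAgent's actual API); in the real system an LLM performs this decomposition.

```python
from dataclasses import dataclass

# Hypothetical sketch of sub-task decomposition; names and rules are
# illustrative, not ChemAgent's actual implementation.

@dataclass
class SubTask:
    description: str
    operation: str  # e.g. "molar_mass", "balance_equation", "apply_formula"

def decompose(problem: str) -> list[SubTask]:
    """Toy rule-based stand-in for the LLM-driven decomposition step."""
    steps = []
    if "grams" in problem or "mass" in problem:
        steps.append(SubTask("Compute molar mass of each species", "molar_mass"))
    if "->" in problem or "reacts" in problem:
        steps.append(SubTask("Balance the chemical equation", "balance_equation"))
    steps.append(SubTask("Apply the governing formula and solve", "apply_formula"))
    return steps

subtasks = decompose("How many grams of H2O form when 4 g of H2 reacts with O2?")
print([s.operation for s in subtasks])
# → ['molar_mass', 'balance_equation', 'apply_formula']
```

Each resulting sub-task is small enough to be solved, validated, and stored independently, which is what makes the library reusable.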

2. Three-Type Memory Construction

Successfully solved sub-tasks are stored as three types of memory: knowledge memory (chemical facts and formulas), experience memory (step-by-step reasoning traces), and tool memory (reusable code functions). Together, these form a structured, growing library.
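A minimal sketch of the three memory types, assuming a simple container with one slot per type (the field names and `store` method are assumptions for illustration, not the paper's data model):

```python
from dataclasses import dataclass, field

# Illustrative three-type memory container; structure is an assumption.

@dataclass
class Library:
    knowledge: dict = field(default_factory=dict)   # chemical facts and formulas
    experience: list = field(default_factory=list)  # step-by-step reasoning traces
    tools: dict = field(default_factory=dict)       # reusable code functions

    def store(self, fact_key, fact, trace, tool_name, tool_fn):
        """Record a successfully solved sub-task across all three memories."""
        self.knowledge[fact_key] = fact
        self.experience.append(trace)
        self.tools[tool_name] = tool_fn

lib = Library()
lib.store(
    fact_key="ideal_gas_law",
    fact="PV = nRT, with R = 8.314 J/(mol*K)",
    trace=["Identify knowns P, V, T", "Solve n = PV / (RT)"],
    tool_name="moles_from_pvt",
    tool_fn=lambda p, v, t: p * v / (8.314 * t),
)
# Reuse the stored tool: ~1 mol of ideal gas in 22.4 L at STP.
print(round(lib.tools["moles_from_pvt"](101325.0, 0.0224, 273.15), 3))
# → 0.999
```

Storing the same solved sub-task in three complementary forms is what lets later problems reuse it as a fact, as a worked example, or as executable code.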

3. Library-Enhanced Retrieval

When encountering a new problem, ChemAgent retrieves the most relevant memories from the library. Retrieval is targeted: the agent identifies which sub-tasks are needed and pulls matching knowledge, experience, and tools.
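The targeted-lookup idea can be sketched with a simple lexical similarity ranking. ChemAgent's actual retrieval is richer; the Jaccard scoring below is only a minimal stand-in to illustrate ranking memories against a new query:

```python
# Minimal retrieval sketch: rank stored memories by token overlap with the
# new problem. Scoring scheme is an assumption, not ChemAgent's method.

def score(query: str, memory_text: str) -> float:
    """Jaccard similarity between the token sets of query and memory."""
    q, m = set(query.lower().split()), set(memory_text.lower().split())
    return len(q & m) / max(len(q | m), 1)

def retrieve(query, memories, k=2):
    """Return the k memories most similar to the query."""
    return sorted(memories, key=lambda m: score(query, m), reverse=True)[:k]

memories = [
    "ideal gas law relates pressure volume and temperature",
    "molar mass is the mass of one mole of a substance",
    "enthalpy change of a reaction from Hess's law",
]
top = retrieve("find the pressure of a gas given volume and temperature", memories)
print(top[0])
# → ideal gas law relates pressure volume and temperature
```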

4. Reasoning with Refinement

Using retrieved memories as context, the agent reasons through the problem. If the solution fails validation, the agent refines its approach and updates the library -- ensuring continuous improvement with each interaction.
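The validate-and-refine loop can be sketched as follows, under the assumption that a validator can check a candidate answer (e.g. unit or sanity checks). `solve` and `validate` are toy stand-ins for the LLM call and the verification step:

```python
# Sketch of the validate-and-refine loop; all functions are illustrative.

def solve(problem, hint=None):
    # Toy solver: wrong on the first pass, corrected once hinted.
    return 22.4 if hint else 2.24

def validate(answer):
    # Sanity check: molar volume of an ideal gas at STP is ~22.4 L/mol.
    return abs(answer - 22.4) < 0.1

def reason_with_refinement(problem, max_rounds=3):
    hint = None
    for _ in range(max_rounds):
        answer = solve(problem, hint)
        if validate(answer):
            return answer  # success: the pattern would be stored in the library
        hint = "check unit conversion"  # refine the approach and retry
    return None

print(reason_with_refinement("Molar volume of an ideal gas at STP?"))
# → 22.4
```

The key point is the feedback edge: a failed validation does not end the episode but produces a refined attempt, and successful attempts flow back into the library.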

Results on SciBench

ChemAgent was evaluated on four chemistry reasoning datasets from SciBench -- atkins and matter (physical chemistry) and chemmc and quan (quantum chemistry). The results demonstrate substantial gains across all settings.

Performance comparison on SciBench chemistry datasets (accuracy %)

Method             atkins  chemmc  quan   matter  Average
GPT-4 (direct)       8.5    18.7    5.1    18.7    12.8
GPT-4 + CoT         11.9    25.0    6.8    22.0    16.4
GPT-4 + Python      13.6    25.0    6.8    25.3    17.7
GPT-4 + ReAct       11.9    18.8    8.5    22.0    15.3
ChemAgent (GPT-4)   23.7    31.3   13.6    33.0    25.4

Key Findings

ChemAgent (GPT-4) raises average accuracy from 12.8% (direct prompting) to 25.4% and outperforms the strongest prompting baseline (GPT-4 + Python, 17.7% average) on all four datasets. These gains require no fine-tuning: they come entirely from the self-updating library and memory retrieval at inference time.

Why It Matters

ChemAgent represents a new paradigm for domain adaptation in LLMs. Instead of trying to encode all possible domain knowledge during training, we let models accumulate expertise through experience. The three-memory architecture -- knowledge, experience, and tools -- provides complementary forms of support that together enable robust multi-step reasoning.

This approach is highly practical: it can be applied to any LLM without retraining, and the self-updating library provides transparent, interpretable reasoning traces. The implications extend beyond chemistry to any domain requiring specialized multi-step reasoning -- medicine, materials science, drug discovery, and engineering.

Beyond Chemistry

The self-updating library mechanism is domain-agnostic. While we demonstrate it on chemistry, the same approach -- decompose tasks, build reusable memory, retrieve and reason -- can be applied to any domain where LLMs struggle with specialized, multi-step problem solving.

Citation

@inproceedings{tang2025chemagent,
  title={ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning},
  author={Tang, Xiangru and Hu, Tianyu and Ye, Muyang and Shao, Yanjun and Yin, Xunjian and Ouyang, Siru and Zhou, Wangchunshu and Lu, Pan and Zhang, Zhuosheng and Zhao, Yilun and Cohan, Arman and Gerstein, Mark},
  booktitle={International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=VQAW04w08X}
}