ContraSolver: Self-Alignment of Language Models by Resolving Internal Preference Contradictions

Xu Zhang*, Xunjian Yin*, Xiaojun Wan

arXiv 2024

TL;DR

Language models harbor internal preference contradictions -- if A > B and B > C, they may still rank C > A. ContraSolver models preferences as a graph, detects contradictory cycles, and resolves them to achieve fully unsupervised self-alignment.

Recent alignment methods such as RLHF and DPO have significantly improved the quality of large language model outputs. However, these approaches depend on high-quality preference data, which is expensive to obtain. Self-alignment techniques alleviate this dependency but introduce the challenge of internal contradictions in model-generated preferences. We observe that language models exhibit preference contradictions -- logical inconsistencies where pairwise comparisons violate transitivity. ContraSolver addresses this by modeling preferences as a directed graph, identifying contradictory cycles, and resolving them via a maximum-spanning-tree initialization that preserves high-confidence preferences. Experiments across four generation tasks demonstrate consistent improvements in both alignment quality and preference consistency, all without any human supervision.

ContraSolver Framework Overview
ContraSolver traverses a preference graph to identify and resolve contradictions -- edges that violate logical consistency in the model's preferences.

The Hidden Inconsistency

When we ask a language model to compare two responses, it expresses a preference. Ask it enough times with different pairs, and you'd expect a consistent ordering to emerge -- if A is better than B, and B is better than C, then A should be better than C. But what if the model says C is better than A?

This isn't a hypothetical. We discovered that language models harbor deep preference contradictions -- logical inconsistencies in how they rank different responses. These contradictions aren't just theoretical curiosities; they fundamentally limit how well a model can be aligned with human values.

A Graph-Theoretic Approach

We realized that preferences form a graph structure. Each response is a node; each comparison creates an edge. In a perfectly consistent model, this graph would have no cycles -- you could sort all responses from best to worst. But real models create cycles, and these cycles represent contradictions.

ContraSolver approaches alignment as a graph problem: find the contradictory edges and resolve them. But which edges are contradictory? A cycle might contain many edges, and only removing the right ones will lead to improvement.
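To make the cycle framing concrete, here is a minimal sketch (using NetworkX, not the authors' code) of a toy preference graph containing the A > B > C > A contradiction described above; the weights stand in for the model's confidence in each judgment:

```python
import networkx as nx

# Toy preference graph: an edge u -> v means "the model prefers u over v",
# with the edge weight standing in for the model's confidence.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("A", "B", 0.9),
    ("B", "C", 0.8),
    ("C", "A", 0.3),  # low-confidence edge that closes a contradictory cycle
])

# Any directed cycle is a set of preferences that cannot all hold at once.
print(list(nx.simple_cycles(G)))  # one cycle over A, B, C: a transitivity violation
```

In a consistent graph this list would be empty, and a topological sort would yield a total ranking of the responses from best to worst.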

1. Self-Annotation: The model generates pairwise preference annotations for response pairs, building a directed preference graph from its own judgments (a minimal sketch of this step follows the list).

2. Maximum Spanning Tree: Initialize with a maximum spanning tree that preserves the highest-confidence preference edges, creating an acyclic backbone.

3. Contradiction Detection: Identify edges that, when added back, create cycles -- these are the contradictory preferences that violate transitivity.

4. Resolution & Alignment: Resolve contradictions by flipping low-confidence edges to restore consistency, then use the cleaned preferences for self-alignment training.
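A rough sketch of step 1, assuming a hypothetical `preference_margin(prompt, a, b)` helper that returns how strongly the model prefers response `a` over `b` (for example, a log-probability margin from a comparison prompt); the paper's actual annotation format may differ:

```python
def annotate_preferences(prompt, responses, preference_margin):
    """Build weighted directed edges (winner, loser, confidence) from the
    model's own pairwise judgments over all response pairs."""
    edges = []
    for i in range(len(responses)):
        for j in range(i + 1, len(responses)):
            a, b = responses[i], responses[j]
            margin = preference_margin(prompt, a, b)  # > 0 means a is preferred over b
            if margin >= 0:
                edges.append((i, j, margin))
            else:
                edges.append((j, i, -margin))
    return edges
```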

Data Construction Process
The data construction process: we build a preference graph through self-annotation, then identify edges that create contradictory cycles.

The Key Insight

We initialize the graph with a maximum spanning tree -- preserving high-confidence preferences -- then identify edges that create contradictions. By prioritizing the resolution of low-confidence preferences while preserving high-confidence ones, we can systematically improve alignment without external supervision.
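A minimal sketch of this idea with NetworkX: treat the maximum spanning tree of the confidence-weighted graph as the trusted, acyclic backbone, then re-add the remaining edges from highest to lowest confidence, flipping any edge that would close a cycle. This follows the description above rather than the authors' released implementation.

```python
import networkx as nx

def resolve_preference_graph(num_responses, edges):
    """edges: (u, v, w) triples meaning 'u is preferred over v' with confidence w.
    Returns an acyclic graph that keeps high-confidence preferences and flips
    low-confidence edges that contradict them."""
    # Maximum spanning tree over the confidence-weighted graph; the original
    # direction of each kept edge is remembered, and a tree is acyclic by construction.
    undirected = nx.Graph()
    undirected.add_nodes_from(range(num_responses))
    for u, v, w in edges:
        undirected.add_edge(u, v, weight=w, direction=(u, v))
    backbone = nx.maximum_spanning_tree(undirected)

    dag = nx.DiGraph()
    dag.add_nodes_from(range(num_responses))
    for _, _, data in backbone.edges(data=True):
        u, v = data["direction"]
        dag.add_edge(u, v, weight=data["weight"])

    # Re-insert the remaining edges, strongest first; an edge that would create
    # a cycle contradicts higher-confidence preferences, so flip its direction.
    remaining = [(u, v, w) for u, v, w in edges if not dag.has_edge(u, v)]
    for u, v, w in sorted(remaining, key=lambda e: -e[2]):
        if nx.has_path(dag, v, u):        # adding u -> v would close a cycle
            dag.add_edge(v, u, weight=w)  # keep the consistent direction instead
        else:
            dag.add_edge(u, v, weight=w)
    return dag
```

The resulting acyclic graph can be read as a consistent partial order over responses, which then supplies the cleaned preference pairs for the self-alignment training described in step 4.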

Completely Unsupervised

What makes ContraSolver remarkable is that it requires no human labels. The model identifies its own contradictions and resolves them through self-annotation. This is true self-alignment: the model becomes more consistent with its own best judgments, without any external guidance.

We tested ContraSolver across four different generation tasks and found consistent improvements. More importantly, we could directly measure the reduction in contradictions -- the preference graphs became cleaner, more acyclic, more logically coherent.
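One simple way to quantify that kind of claim (an illustrative metric, not necessarily the paper's exact one) is the fraction of preference edges that participate in a directed cycle:

```python
import networkx as nx

def contradiction_rate(G: nx.DiGraph) -> float:
    """Fraction of preference edges lying on at least one directed cycle.
    A fully consistent (acyclic) preference graph scores 0.0."""
    on_cycle = set()
    for cycle in nx.simple_cycles(G):  # fine for small per-prompt graphs
        for u, v in zip(cycle, cycle[1:] + cycle[:1]):
            on_cycle.add((u, v))
    return len(on_cycle) / max(G.number_of_edges(), 1)
```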

Performance Across Generation Tasks

| Method | Summarization | Dialogue | Translation | QA |
| --- | --- | --- | --- | --- |
| Base Model | Baseline | Baseline | Baseline | Baseline |
| Self-Alignment (naive) | Improved | Improved | Improved | Improved |
| ContraSolver | Best | Best | Best | Best |


What This Means

ContraSolver reveals that alignment isn't just about matching human preferences -- it's about internal consistency. A model that contradicts itself cannot be reliably aligned, no matter how much human feedback you provide. By resolving these internal contradictions first, we create a more solid foundation for alignment.

This work suggests a new paradigm: before aligning models to external standards, help them align with themselves.

Citation

@misc{zhang2024contrasolverselfalignmentlanguagemodels,
  title={ContraSolver: Self-Alignment of Language Models by Resolving Internal Preference Contradictions},
  author={Xu Zhang and Xunjian Yin and Xiaojun Wan},
  year={2024},
  eprint={2406.08842},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2406.08842}
}