ContraSolver: Self-Alignment of Language Models by Resolving Internal Preference Contradictions
arXiv, 2024
Language models harbor internal preference contradictions -- if A > B and B > C, they may still rank C > A. ContraSolver models preferences as a graph, detects contradictory cycles, and resolves them to achieve fully unsupervised self-alignment.
Recent alignment methods such as RLHF and DPO have significantly improved the quality of large language model outputs. However, these approaches depend on high-quality preference data, which is expensive to obtain. Self-alignment techniques alleviate this dependency but introduce the challenge of internal contradictions in model-generated preferences. We observe that language models exhibit preference contradictions -- logical inconsistencies where pairwise comparisons violate transitivity. ContraSolver addresses this by modeling preferences as a directed graph, identifying contradictory cycles, and resolving them via maximum spanning tree initialization that preserves high-confidence preferences. Experiments across four generation tasks demonstrate consistent improvements in both alignment quality and preference consistency, all without any human supervision.
The Hidden Inconsistency
When we ask a language model to compare two responses, it expresses a preference. Ask it enough times with different pairs, and you'd expect a consistent ordering to emerge -- if A is better than B, and B is better than C, then A should be better than C. But what if the model says C is better than A?
This isn't a hypothetical. We discovered that language models harbor deep preference contradictions -- logical inconsistencies in how they rank different responses. These contradictions aren't just theoretical curiosities; they fundamentally limit how well a model can be aligned with human values.
A Graph-Theoretic Approach
We realized that preferences form a graph structure. Each response is a node; each comparison adds a directed edge from the preferred response to the less-preferred one. In a perfectly consistent model, this graph would be acyclic -- you could topologically sort all responses from best to worst. But real models create cycles, and these cycles represent contradictions.
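As a concrete sketch (not the paper's code), a preference graph can be built from `(winner, loser)` judgments and searched for a contradictory cycle with a depth-first traversal; a back edge to a node still on the DFS stack closes a cycle:

```python
# Minimal sketch: detect one contradictory cycle in a directed
# preference graph. Edges are (winner, loser) pairs; node names
# and edge lists here are illustrative.
from collections import defaultdict

def find_cycle(edges):
    """Return one directed cycle as a list of nodes, or None if acyclic."""
    graph = defaultdict(list)
    nodes = set()
    for winner, loser in edges:
        graph[winner].append(loser)
        nodes.update((winner, loser))

    color = {n: 0 for n in nodes}   # 0 = unvisited, 1 = on stack, 2 = done
    parent = {}

    def dfs(u):
        color[u] = 1
        for v in graph[u]:
            if color[v] == 1:            # back edge closes a cycle
                cycle = [u]
                while cycle[-1] != v:    # walk parents back to v
                    cycle.append(parent[cycle[-1]])
                return list(reversed(cycle))
            if color[v] == 0:
                parent[v] = u
                found = dfs(v)
                if found:
                    return found
        color[u] = 2
        return None

    for n in nodes:
        if color[n] == 0:
            found = dfs(n)
            if found:
                return found
    return None
```

On the A > B > C > A example from above, `find_cycle` returns the three-node cycle; with the consistent edge A > C instead, it returns `None`.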
ContraSolver approaches alignment as a graph problem: find the contradictory edges and resolve them. But which edges are contradictory? A cycle might contain many edges, and only removing the right ones will lead to improvement.
Self-Annotation
The model generates pairwise preference annotations for response pairs, building a directed preference graph from its own judgments.
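A minimal sketch of this step, where `query_model` is a hypothetical stand-in for whatever LLM call you use (assumed here to return the preferred letter plus a confidence score -- the actual prompting and confidence extraction in the paper may differ):

```python
# Sketch of self-annotation: the model compares every pair of its own
# responses, producing directed preference edges with confidences.
# `query_model` is a hypothetical callable: prompt -> ("A" or "B", confidence).

def build_comparison_prompt(instruction, resp_a, resp_b):
    return (
        f"Instruction: {instruction}\n"
        f"Response A: {resp_a}\n"
        f"Response B: {resp_b}\n"
        "Which response is better? Answer 'A' or 'B'."
    )

def annotate_pairs(instruction, responses, query_model):
    """Return preference edges as (winner_idx, loser_idx, confidence)."""
    edges = []
    for i in range(len(responses)):
        for j in range(i + 1, len(responses)):
            prompt = build_comparison_prompt(instruction, responses[i], responses[j])
            choice, conf = query_model(prompt)          # e.g. ("A", 0.83)
            winner, loser = (i, j) if choice == "A" else (j, i)
            edges.append((winner, loser, conf))
    return edges
```

For n responses this yields n(n-1)/2 edges, which together form the directed preference graph.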
Maximum Spanning Tree
Initialize with a maximum spanning tree that preserves the highest-confidence preference edges, creating an acyclic backbone.
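One way to sketch this initialization is a Kruskal-style greedy pass: sort edges by confidence, keep each edge that joins two previously unconnected components. The `(winner, loser, confidence)` edge format is an assumption carried over from the annotation step; the paper's exact construction may differ.

```python
# Greedy maximum-spanning-tree sketch over preference edges.
# Keeps the highest-confidence edges that do not connect two
# already-connected nodes, yielding an acyclic backbone.

def max_spanning_tree(edges):
    parent = {}                      # union-find forest

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    tree = []
    for winner, loser, conf in sorted(edges, key=lambda e: -e[2]):
        ra, rb = find(winner), find(loser)
        if ra != rb:                 # edge bridges two components: keep it
            parent[ra] = rb
            tree.append((winner, loser, conf))
    return tree
```

On the contradictory triangle A > B (0.9), B > C (0.8), C > A (0.7), the backbone keeps the two high-confidence edges and the weakest edge is left out for later scrutiny.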
Contradiction Detection
Identify edges that, when added back, create cycles -- these are the contradictory preferences that violate transitivity.
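A sketch of this check, under the same assumed edge format: an edge winner → loser is contradictory with respect to the current acyclic backbone exactly when the loser can already reach the winner, so adding the edge would close a cycle.

```python
# Sketch of contradiction detection. Backbone and remaining edges
# are (winner, loser, confidence) triples; the backbone is assumed acyclic.
from collections import defaultdict

def reaches(graph, src, dst):
    """Iterative DFS: is dst reachable from src?"""
    stack, seen = [src], set()
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        if u not in seen:
            seen.add(u)
            stack.extend(graph[u])
    return False

def split_contradictions(backbone, remaining):
    """Partition leftover edges into consistent and contradictory sets."""
    graph = defaultdict(list)
    for winner, loser, _ in backbone:
        graph[winner].append(loser)

    consistent, contradictory = [], []
    for winner, loser, conf in remaining:
        if reaches(graph, loser, winner):     # would close a cycle
            contradictory.append((winner, loser, conf))
        else:
            consistent.append((winner, loser, conf))
            graph[winner].append(loser)       # grow the DAG
    return consistent, contradictory
```

Because consistent edges are re-added as they are accepted, the graph stays acyclic throughout and later edges are checked against everything kept so far.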
Resolution & Alignment
Resolve contradictions by flipping low-confidence edges to restore consistency, then use the cleaned preferences for self-alignment training.
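The resolution step can be sketched as flipping each contradictory edge so it agrees with the high-confidence backbone, then exporting the cleaned graph as chosen/rejected training pairs. The DPO-style output format below is an illustration, not the paper's exact data schema.

```python
# Sketch of resolution: flip contradictory (low-confidence) edges,
# then turn the cleaned preference graph into training pairs.
# Edges are (winner_idx, loser_idx, confidence); `responses` maps
# indices back to response texts.

def resolve_and_export(backbone, contradictory, responses):
    # Reversing each contradictory edge restores consistency with
    # the acyclic backbone, since the reversed edge follows an
    # existing path instead of closing a cycle.
    cleaned = backbone + [(l, w, c) for (w, l, c) in contradictory]

    # Emit (chosen, rejected) pairs for preference-based fine-tuning.
    return [
        {"chosen": responses[w], "rejected": responses[l], "weight": c}
        for (w, l, c) in cleaned
    ]
```

The `weight` field carries the original annotation confidence along, in case the training objective wants to down-weight formerly contradictory pairs.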
The Key Insight
We initialize the graph with a maximum spanning tree -- preserving high-confidence preferences -- then identify edges that create contradictions. By prioritizing the resolution of low-confidence preferences while preserving high-confidence ones, we can systematically improve alignment without external supervision.
Completely Unsupervised
What makes ContraSolver remarkable is that it requires no human labels. The model identifies its own contradictions and resolves them through self-annotation. This is true self-alignment: the model becomes more consistent with its own best judgments, without any external guidance.
We tested ContraSolver across four different generation tasks and found consistent improvements. More importantly, we could directly measure the reduction in contradictions -- the preference graphs became cleaner, closer to acyclic, and more logically coherent.
Performance Across Generation Tasks
| Method | Summarization | Dialogue | Translation | QA |
|---|---|---|---|---|
| Base Model | Baseline | Baseline | Baseline | Baseline |
| Self-Alignment (naive) | Improved | Improved | Improved | Improved |
| ContraSolver | Best | Best | Best | Best |
Results
- Consistent Improvement: Performance gains across four diverse generation tasks
- Measurable Consistency: Quantifiable reduction in preference graph contradictions
- Zero Human Supervision: Completely unsupervised self-alignment
- Model Agnostic: Works across different LLM architectures
What This Means
ContraSolver reveals that alignment isn't just about matching human preferences -- it's about internal consistency. A model that contradicts itself cannot be reliably aligned, no matter how much human feedback you provide. By resolving these internal contradictions first, we create a more solid foundation for alignment.
This work suggests a new paradigm: before aligning models to external standards, help them align with themselves.