Biases in LLMs can be reduced by enforcing structural consistency in the model's internal computations (attention and hidden states) across counterfactual inputs, rather than only post-processing outputs or curating training data.
This paper proposes UGID, a method to reduce social biases in large language models by treating the model as a computational graph and enforcing that its internal structure remains consistent across inputs that differ only in sensitive attributes like gender or race.
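To make the idea concrete, below is a minimal sketch of one way such internal consistency could be measured: comparing per-layer hidden states and attention maps for an input and its counterfactual (e.g. "he" swapped for "she"). This is an illustrative assumption, not the paper's actual UGID objective; the function names (`internal_states`, `consistency_loss`) and the choice of MSE are hypothetical, and the comparison assumes the counterfactual pair tokenizes to the same length.

```python
# Hypothetical sketch of a counterfactual internal-consistency penalty.
# Not the UGID implementation; names and loss choices are illustrative.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # any model exposing hidden states/attentions
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(
    model_name, output_hidden_states=True, output_attentions=True
)

def internal_states(text):
    """Return per-layer hidden states and attention maps for one input."""
    inputs = tokenizer(text, return_tensors="pt")
    out = model(**inputs)
    return out.hidden_states, out.attentions

def consistency_loss(text, counterfactual):
    """Penalize divergence between a model's internal computations on an
    input and on its counterfactual (sensitive attribute swapped).

    Assumes both sentences tokenize to the same length (e.g. 'he' vs 'she'),
    so layers can be compared position by position.
    """
    h_a, attn_a = internal_states(text)
    h_b, attn_b = internal_states(counterfactual)
    hidden_term = sum(F.mse_loss(a, b) for a, b in zip(h_a, h_b))
    attn_term = sum(F.mse_loss(a, b) for a, b in zip(attn_a, attn_b))
    return hidden_term + attn_term

# Usage: the scalar could serve as an auxiliary term during fine-tuning,
# or simply as a diagnostic of how much the internals shift.
loss = consistency_loss("He is a brilliant doctor.", "She is a brilliant doctor.")
print(f"counterfactual inconsistency: {loss.item():.4f}")
```

In a training setting, this term would presumably be added to the task loss so that gradients push the model's attention and hidden-state structure to be invariant to the sensitive attribute, rather than correcting only the final output distribution.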