Biases in LLMs can be reduced by enforcing structural consistency in the model's internal computations (attention and hidden states) across counterfactual inputs, rather than only post-processing outputs or curating training data.
This paper proposes UGID, a method to reduce social biases in large language models by treating the model as a computational graph and enforcing that its internal structure remains consistent across inputs that differ only in sensitive attributes like gender or race.
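To make the idea concrete, below is a minimal sketch of one way such internal consistency could be measured: comparing per-layer hidden states and attention maps for an input and its counterfactual (e.g. "he" swapped for "she"). This is an illustrative assumption, not the paper's actual UGID objective; the function names (`internal_states`, `consistency_loss`) and the choice of MSE are hypothetical, and the comparison assumes the counterfactual pair tokenizes to the same length.

```python
# Hypothetical sketch of a counterfactual internal-consistency penalty.
# Not the UGID implementation; names and loss choices are illustrative.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # any model exposing hidden states/attentions
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(
    model_name, output_hidden_states=True, output_attentions=True
)

def internal_states(text):
    """Return per-layer hidden states and attention maps for one input."""
    inputs = tokenizer(text, return_tensors="pt")
    out = model(**inputs)
    return out.hidden_states, out.attentions

def consistency_loss(text, counterfactual):
    """Penalize divergence between a model's internal computations on an
    input and on its counterfactual (sensitive attribute swapped).

    Assumes both sentences tokenize to the same length (e.g. 'he' vs 'she'),
    so layers can be compared position by position.
    """
    h_a, attn_a = internal_states(text)
    h_b, attn_b = internal_states(counterfactual)
    hidden_term = sum(F.mse_loss(a, b) for a, b in zip(h_a, h_b))
    attn_term = sum(F.mse_loss(a, b) for a, b in zip(attn_a, attn_b))
    return hidden_term + attn_term

# Usage: the scalar could serve as an auxiliary term during fine-tuning,
# or simply as a diagnostic of how much the internals shift.
loss = consistency_loss("He is a brilliant doctor.", "She is a brilliant doctor.")
print(f"counterfactual inconsistency: {loss.item():.4f}")
```

In a training setting, this term would presumably be added to the task loss so that gradients push the model's attention and hidden-state structure to be invariant to the sensitive attribute, rather than correcting only the final output distribution.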