ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

Siqi Sun, Ben Peng Wu, Mali Jin, Peizhen Bai, Hanpei Zhang et al. | March 13, 2026 | arXiv

Key Takeaway

Chain-of-thought reasoning substantially reduces hallucinations when LLMs analyze long, complex documents, a critical capability for compliance and legal applications where accuracy is non-negotiable.

Summary

ESG-Bench is a benchmark dataset for testing how well AI models understand long corporate ESG (environmental, social, governance) reports and avoid making up false information. The dataset contains real ESG reports paired with human-verified question-answer pairs, letting researchers measure when models hallucinate versus when they accurately extract facts.
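To make the measurement idea concrete, here is a minimal sketch of how a hallucination rate could be computed over question-answer pairs. It is not the benchmark's actual schema or metric, which this summary does not describe: the `QAItem` fields, the substring-match check in `is_supported`, and the two sample records are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class QAItem:
    """One evaluation record (hypothetical schema, not ESG-Bench's own)."""
    question: str
    reference: str   # human-verified answer
    prediction: str  # model output to be judged


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so the comparison ignores formatting."""
    return " ".join(text.lower().split())


def is_supported(item: QAItem) -> bool:
    """Crude proxy (an assumption): the prediction counts as supported
    if it contains the verified answer as a substring."""
    return normalize(item.reference) in normalize(item.prediction)


def hallucination_rate(items: list[QAItem]) -> float:
    """Fraction of predictions not supported by the verified answer."""
    if not items:
        return 0.0
    unsupported = sum(1 for item in items if not is_supported(item))
    return unsupported / len(items)


# Made-up records for illustration only; not taken from any real report.
items = [
    QAItem(
        question="What were Scope 1 emissions in 2023?",
        reference="12,400 tCO2e",
        prediction="Scope 1 emissions were 12,400 tCO2e.",
    ),
    QAItem(
        question="Who chairs the sustainability committee?",
        reference="Not stated in the report",
        prediction="The committee is chaired by the CFO.",
    ),
]

rate = hallucination_rate(items)
print(f"hallucination rate: {rate:.2f}")
```

Substring matching is only a stand-in here. A real evaluation would more likely rely on human judgment or a stronger automatic comparison, since a correct answer can be phrased in many ways.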

evaluation, safety

Key Terms

hallucination, chain-of-thought, long-context-handling, benchmark, fine-tuning