bias-in-large-language-models-origin,-evaluation,-and-mitigation

![rw-book-cover](https://readwise-assets.s3.amazonaws.com/static/images/article4.6bc1851654a0.png)

## Metadata
- Author: [[Yufei Guo; Muzhe Guo; Juntao Su; Zhou Yang; Mengqiu Zhu; Hongfei Li; Mengyang Qiu; Shuo Shuo Liu]]
- Full Title:: Bias in Large Language Models: Origin, Evaluation, and Mitigation
- Category:: #🗞️Articles
- URL:: https://readwise.io/reader/document_raw_content/463263791
- Read date:: [[2026-06-07]]

## Highlights
> LLMs are deployed in critical decision-making environments such as healthcare diagnostics (Rajkomar et al., 2018), legal judgments (Angwin et al., 2016), and hiring processes (Chen et al., 2018) ([View Highlight](https://read.readwise.io/read/01kth2cp55n28tb4k15h3bch72))

> Intrinsic biases originate from the training data, as well as the architecture and underlying assumptions made during model design (Sun et al., 2019). ([View Highlight](https://read.readwise.io/read/01kth2fhfv792w2r790dqbecnh))

> Extrinsic biases, on the other hand, emerge during the application of LLMs in real-world tasks.
> These biases are often more subtle as they manifest in the model outputs during specific tasks, such as sentiment analysis, content moderation, or automated decision-making systems ([View Highlight](https://read.readwise.io/read/01kth2fydy35yg7rftca3k90qa))

> Language models, especially LLMs that are trained on a large corpus, usually have intrinsic bias issues, since the training corpora often contain societal biases that are built into the model (Pagano et al., 2023; Ray, 2023; Goldfarb-Tarrant, 2024; Goldman and Tsotsos, 2024). ([View Highlight](https://read.readwise.io/read/01kth2xd24hevs90q9qxd90jcc))

> over-representativeness and under-representativeness ([View Highlight](https://read.readwise.io/read/01kth30322h06zgjtspmkb8z5w))

> men might be over-represented in datasets about leadership or science, while women may be more frequently mentioned in caregiving roles (UNESCO and IRCAI, 2024), ([View Highlight](https://read.readwise.io/read/01kth30apwqx9bjr2t1fspqk16))

> Spatial and temporal bias: LLMs trained predominantly on a corpus from certain countries or geographic locations may absorb the cultural norms and values, hence building biases into the underlying LLMs. ([View Highlight](https://read.readwise.io/read/01kth30gv9sd5s62ts9c4b2qtc))

> it is nearly impossible to fully eliminate inappropriate content given the vast scale of the training corpus. ([View Highlight](https://read.readwise.io/read/01kth3a86msdj6wvqm4bxjetvj))

> For example, gender-neutral pronouns may be associated with one gender due to the patterns in the training dataset (Kotek et al., 2023; Dwivedi et al., 2023). ([View Highlight](https://read.readwise.io/read/01kth3bpw2n5gs5h0xk6t5r7x9))

> split less frequently occurring words into smaller units. This splitting policy can result in fragmented representations of under-represented entities, names, or terminology, particularly affecting minority languages or groups. ([View Highlight](https://read.readwise.io/read/01kth3dpw9wm721dh55kc6e28a))

## New highlights added [[2026-06-08]]
> LLMs trained on datasets with disproportionate representation of certain demographic groups are prone to exhibit biases that favor those groups Simmons and Hare (2023); Wang et al. (2024); Gorti et al. (2024) ([View Highlight](https://read.readwise.io/read/01kthttp5jfberwgcxyphd0vjf))

> human annotators label or categorize training data. Since annotators bring their own biases and perspectives, their subjective decisions can inadvertently introduce skewed or biased annotations, ([View Highlight](https://read.readwise.io/read/01kthvzgw737vp8573qf64ddpp))
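
Note to self on the tokenization highlight (splitting less frequent words into smaller units): the sketch below is a toy illustration, not the paper's method and not any real tokenizer's API. It assumes a greedy longest-match segmenter and a made-up vocabulary (`TOY_VOCAB`, `tokenize` are hypothetical names) in which a frequent name is stored whole while a rarer name is not. Real subword tokenizers (BPE, WordPiece, and similar) learn their vocabularies from corpus statistics, but the fragmentation effect is similar in spirit.

```python
def tokenize(word, vocab):
    """Greedy longest-match segmentation (a simplification of subword tokenizers).

    At each position, take the longest substring found in `vocab`;
    if nothing matches, emit a single character and move on.
    """
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest candidate first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # no vocab entry: fall back to one character
            i += 1
    return pieces


# Hypothetical vocabulary: the frequent name is a single entry,
# the rarer name is only covered by short fragments.
TOY_VOCAB = {"michael", "john", "son", "ch", "im", "an", "da"}

frequent = tokenize("michael", TOY_VOCAB)     # ['michael']
rare = tokenize("chimamanda", TOY_VOCAB)      # ['ch', 'im', 'a', 'm', 'an', 'da']

print(len(frequent), frequent)
print(len(rare), rare)
```

The frequent name maps to one unit while the rarer name is spread over six, which is the kind of fragmented representation the highlight describes for under-represented names and terminology.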