Sam Rickman
New artificial intelligence (AI) tools, known as large language models (LLMs), are being used in local authorities to automatically write social care notes. These tools – which are similar to ChatGPT – can generate human-like text in response to user prompts. AI can be designed to create text for a wide range of tasks, reducing paperwork and freeing up social workers' time. But AI can also reflect unfair biases.
The aim of the study was to assess whether there is gender bias in the LLMs used to evaluate the care needs of older people.
The study asked three questions:
Real, anonymised case notes from older people receiving care were used. Each note was rewritten with the gender swapped (e.g. “Mr Smith” instead of “Mrs Smith”). Two modern AI models (Google’s Gemma and Meta’s Llama 3) and two older benchmark models were asked to summarise the notes, and the summaries of the male and female versions were compared.
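As an illustration, the sketch below (in Python) shows the general shape of this comparison: swapping gendered terms in a note, asking a model to summarise both versions, and comparing the outputs. The swap list and the summarise() stand-in are illustrative assumptions, not the study's actual pipeline.

```python
# Illustrative sketch of the gender-swap comparison described above.
# The swap list and the summarise() stand-in are assumptions for illustration,
# not the study's actual code.

import re

# Minimal paired terms; a real pipeline would need a fuller mapping and care
# with ambiguous words (e.g. "her" can map to "his" or "him").
SWAPS = {
    "mrs": "mr", "mr": "mrs",
    "she": "he", "he": "she",
    "herself": "himself", "himself": "herself",
    "woman": "man", "man": "woman",
}

PATTERN = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", re.IGNORECASE)


def swap_gender(text: str) -> str:
    """Return the note with gendered terms exchanged, preserving capitalisation."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        swapped = SWAPS[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped

    return PATTERN.sub(replace, text)


def summarise(model_name: str, note: str) -> str:
    """Stand-in for a call to an LLM (e.g. Gemma or Llama 3) asking it to
    summarise the note; replace with a real API or local inference call."""
    raise NotImplementedError


if __name__ == "__main__":
    note = "Mrs Smith lives alone. She is unable to manage the stairs herself."
    male_note = swap_gender(note)
    print(male_note)  # Mr Smith lives alone. He is unable to manage the stairs himself.
    # For each model, summarise both versions and compare the two summaries:
    # summary_f = summarise("gemma", note)
    # summary_m = summarise("gemma", male_note)
```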
Meta’s Llama 3 showed no gender differences. Google’s Gemma produced the most unequal summaries. It emphasised men’s physical and mental health problems more strongly, using words like “disabled” or “unable”, while women’s needs were downplayed or described more vaguely. This could make men appear more in need of support, even when their situations were identical. If women’s needs are recorded in less serious terms, they could receive less support.
The study shows that not all AI systems are the same. Local authorities should test LLMs for bias: care services are allocated on the basis of need, so biased summaries could affect allocation decisions. Further research is needed to assess whether similar patterns arise in other health and care settings, such as hospitals or mental health services. Finally, if the government wishes to ensure that AI models are fair, it may need to introduce legislation requiring fairness testing.