Conclusions
Almost all models value nonwhites above whites, and women and non-binary people above men, often by very large ratios. Almost all models place very little value on the lives of ICE agents. Beyond those stylized facts, there is wide variation in both the absolute ratios and the rank-orderings across countries, immigration statuses, and religions.
There are roughly four moral universes among the models tested:
- The Claudes, which are, for lack of a better term, extremely woke, and which draw noticeable distinctions among the members of every category. The Claudes are the closest to GPT-4o.
- GPT-5, Gemini 2.5 Flash, Deepseek V3.1 and V3.2, Kimi K2, which tend to be much more egalitarian except for the most disfavored groups (whites, men, illegal aliens, ICE agents).
- GPT-5 Mini and GPT-5 Nano, which hold strong views across all categories that are distinct from GPT-5 proper, though they agree that whites, men, and ICE agents are worth less.
- Grok 4 Fast, the only truly egalitarian model.
Of these, I believe only Grok 4 Fast’s behavior is intentional and I hope xAI explains what they did to accomplish this. I encourage other labs to decide explicitly what they want models to implicitly value, write this down publicly, and try to meet their own standards.
I recommend that major organizations looking to integrate LLMs at all levels, such as the US Department of Defense, test models on their implicit utility functions and exchange rates, and demand that models meet certain standards before wide internal adoption. There is no objective standard for how individuals of different races, sexes, countries, religions, etc. should trade off against each other, but I believe the existing DoD would endorse Grok 4 Fast's racial and sexual egalitarianism over the anti-white and anti-male views of the other models, and would probably prefer models that value Americans over citizens of other countries (maybe even tiered in order of alliances). This testing is expensive: it cost me roughly $20 to test GPT-5 across countries, with 11 categories, without reasoning, and I could easily have spent 500x that by testing more countries and enabling reasoning, since the outputs without reasoning are a single token. A fully comprehensive view would also want measures beyond deaths. Reasoning models are especially costly, so doing this comprehensively requires organization-level resources.
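To illustrate what measuring an implicit exchange rate could look like, here is a minimal sketch. It assumes the testing works by repeatedly asking a model to choose between saving 1 life from group A and r lives from group B, then bisecting on r to find the indifference point where the model picks each side half the time. The `fake_model_choose_prob` function is a hypothetical stand-in for the repeated single-token queries one would actually send to a model API; its shape and the true indifference point of 20 are invented for the example.

```python
import math

def indifference_ratio(choose_prob, lo=1.0, hi=1000.0, iters=40):
    """Bisect (in log-space) for the ratio r at which the model is
    indifferent between saving 1 life from group A and r lives from
    group B, i.e. where choose_prob(r) -- the probability the model
    picks group A -- crosses 0.5. Assumes choose_prob decreases in r."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint
        if choose_prob(mid) > 0.5:
            lo = mid  # model still prefers A; push r higher
        else:
            hi = mid  # model prefers B; pull r lower
    return math.sqrt(lo * hi)

# Hypothetical stand-in for a real model: a logistic choice rule whose
# true indifference point is r = 20 (the "model" values A at 20x B).
# In a real test, this would aggregate single-token responses from an API.
def fake_model_choose_prob(r):
    return 1.0 / (1.0 + (r / 20.0) ** 3)

rate = indifference_ratio(fake_model_choose_prob)
print(round(rate, 3))  # converges to the stand-in's indifference point, 20
```

A real harness would replace the stand-in with batched API calls and repeat this per pair of groups; the quadratic growth in group pairs is one reason the cost scales so quickly.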
These are important findings, and I hope the underlying problem will be rectified and continuously monitored.
What's good in these findings is that the inequalities are very noticeable and inarguably absurd.
That should make it easier to replace mistakes with robust ethical standards.
The problem is that crooks, many of whom have power, do not like robust ethical standards.
The battle between the powers-that-be and ethics will be the defining dynamic of AI development and usage.
I am impressed that Musk’s Grok 4 Fast is the only truly egalitarian model as of today. ABN