Code Arena

Compare the performance of AI models on agentic coding tasks involving multi-step reasoning and tool use.

Last Updated

Jan 23, 2026

Total Votes

118,204

Total Models

35

Rank	Rank Spread	Model	Score	Confidence Interval	Votes	Organization	License
1	1◄─►1	claude-opus-4-5-20251101-thinking-32k	1504	+10/-10	7,543	Anthropic	Proprietary
2	2◄─►5	gpt-5.2-high	1475	+16/-16	1,691	OpenAI	Proprietary
3	2◄─►5	claude-opus-4-5-20251101	1467	+9/-9	7,900	Anthropic	Proprietary
4	2◄─►6	gemini-3-pro	1462	+8/-8	14,043	Google	Proprietary
5	2◄─►6	gemini-3-flash	1454	+9/-9	8,389	Google	Proprietary
6	4◄─►6	glm-4.7	1445	+10/-10	5,650	Z.ai	MIT
7	7◄─►10	minimax-m2.1-preview	1414	+9/-9	7,201	MiniMax	MIT
8	7◄─►10	gemini-3-flash (thinking-minimal)	1412	+10/-10	5,430	Google	Proprietary
9	7◄─►15	gpt-5.2	1399	+15/-15	1,632	OpenAI	Proprietary
10	7◄─►15	gpt-5-medium	1397	+12/-12	3,929	OpenAI	Proprietary
11	9◄─►15	gpt-5.1-medium	1392	+9/-9	6,594	OpenAI	Proprietary
12	9◄─►15	claude-opus-4-1-20250805	1392	+8/-8	9,124	Anthropic	Proprietary
13	9◄─►15	claude-sonnet-4-5-20250929-thinking-32k	1390	+8/-8	11,001	Anthropic	Proprietary
14	9◄─►15	claude-sonnet-4-5-20250929	1386	+8/-8	12,662	Anthropic	Proprietary
15	9◄─►16	deepseek-v3.2-thinking	1377	+11/-11	3,552	DeepSeek	MIT
16	15◄─►19	glm-4.6	1358	+8/-8	8,890	Z.ai	MIT
17	16◄─►19	gpt-5.1	1355	+8/-8	9,917	OpenAI	Proprietary
18	16◄─►20	mimo-v2-flash (non-thinking)	1351	+10/-10	3,943	Xiaomi	MIT
19	16◄─►21	gpt-5.2-codex	1344	+13/-13	2,500	OpenAI	Proprietary
20	18◄─►21	gpt-5.1-codex	1334	+9/-9	6,661	OpenAI	Proprietary
21	19◄─►21	kimi-k2-thinking-turbo	1333	+8/-8	9,556	Moonshot	Modified MIT
22	22◄─►23	minimax-m2	1316	+8/-8	8,997	MiniMax	Apache 2.0
23	22◄─►26	deepseek-v3.2	1299	+10/-10	4,581	DeepSeek	MIT
24	23◄─►26	claude-haiku-4-5-20251001	1298	+8/-8	10,767	Anthropic	Proprietary
25	23◄─►26	deepseek-v3.2-exp	1289	+10/-10	5,133	DeepSeek	MIT
26	23◄─►26	qwen3-coder-480b-a35b-instruct	1287	+8/-8	10,516	Alibaba	Apache 2.0
27	27◄─►29	KAT-Coder-Pro-V1	1262	+15/-15	1,956	KwaiKAT	Proprietary
28	27◄─►30	gpt-5.1-codex-mini	1247	+17/-17	1,538	OpenAI	Proprietary
29	27◄─►30	grok-4-1-fast-reasoning	1240	+11/-11	5,127	xAI	Proprietary
30	28◄─►32	mistral-large-3	1225	+20/-20	1,037	Mistral	Apache 2.0
31	30◄─►32	gemini-2.5-pro	1209	+13/-13	3,454	Google	Proprietary
32	30◄─►32	grok-4.1-thinking	1208	+19/-19	1,266	xAI	Proprietary
33	33◄─►34	grok-4-fast-reasoning	1156	+22/-22	970	xAI	Proprietary
34	33◄─►35	grok-code-fast-1	1143	+21/-21	1,017	xAI	Proprietary
35	34◄─►35	devstral-medium-2507	1101	+22/-22	1,020	Mistral	Proprietary
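The Rank Spread column appears to follow from the scores and their intervals: a model's best possible rank is one plus the number of models whose whole interval lies above its own, and its worst possible rank is one plus the number of models whose interval reaches above its lower bound. The Python sketch below checks that reading against the first eight rows of the table. The rule itself is an assumption inferred from the published numbers, not a documented formula, and the names `Row` and `rank_spread` are hypothetical. Displayed scores are rounded, so borderline cases may differ from the site.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: int
    ci: int  # symmetric half-width, e.g. 10 for "+10/-10"


# First eight rows of the table above (score and interval only).
ROWS = [
    Row("claude-opus-4-5-20251101-thinking-32k", 1504, 10),
    Row("gpt-5.2-high", 1475, 16),
    Row("claude-opus-4-5-20251101", 1467, 9),
    Row("gemini-3-pro", 1462, 8),
    Row("gemini-3-flash", 1454, 9),
    Row("glm-4.7", 1445, 10),
    Row("minimax-m2.1-preview", 1414, 9),
    Row("gemini-3-flash (thinking-minimal)", 1412, 10),
]


def rank_spread(rows):
    """Return {model: (best_rank, worst_rank)} under the assumed interval rule."""
    spreads = {}
    for r in rows:
        lower, upper = r.score - r.ci, r.score + r.ci
        others = [o for o in rows if o is not r]
        # Models whose entire interval sits above ours are certainly ahead.
        best = 1 + sum(1 for o in others if o.score - o.ci > upper)
        # Models whose interval reaches above our lower bound could be ahead.
        worst = 1 + sum(1 for o in others if o.score + o.ci > lower)
        spreads[r.model] = (best, worst)
    return spreads


if __name__ == "__main__":
    for model, (best, worst) in rank_spread(ROWS).items():
        print(f"{model}: {best} to {worst}")
```

With only eight rows loaded, the worst-rank bound for rows 7 and 8 is understated because models further down the table also overlap them. The first six rows do not overlap anything below row 8, so their computed spreads can be compared directly with the table.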

Remove Style Control Leaderboard Plots

Battle Count for Each Combination of Models (without Ties)

Fraction of Model A Wins for All Non-tied A vs. B Battles

Confidence Intervals on Model Strength (via Bootstrapping)

Average Win Rate Against All Other Models (Uniform Sampling and No Ties)