This is a benchmark for evaluating LLMs' performance on Turkish translations of the AIME (American Invitational Mathematics Examination). The dataset consists of 60 questions: 30 from AIME 2024 and 30 from AIME 2025.
Thanks to ytu-ce-cosmos for providing the datasets: aime24-tr and aime25-tr.
Format: Each model was prompted in this format:

SYSTEM_PROMPT = "Sen bir matematik olimpiyatı uzmanısın. Soruyu adım adım çöz. Cevabını en sonda \boxed{} içinde ver."
(English: "You are a mathematics olympiad expert. Solve the question step by step. Give your answer at the end inside \boxed{}.")

USER_PROMPT = {question}
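The prompting above can be sketched as follows, assuming an OpenAI-style chat message list (the `build_messages` helper and its shape are illustrative, not the benchmark's actual harness code):

```python
# Minimal sketch of how each question might be packaged for a chat model.
# The message schema follows the common OpenAI-style convention; this is an
# assumption, not the benchmark's verbatim implementation.
SYSTEM_PROMPT = (
    "Sen bir matematik olimpiyatı uzmanısın. Soruyu adım adım çöz. "
    "Cevabını en sonda \\boxed{} içinde ver."
)

def build_messages(question: str) -> list[dict]:
    """Pair the fixed system prompt with one AIME question as the user turn."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
```

Each of the 60 questions is sent independently with the same system prompt, so runs are directly comparable across models.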
Answers were checked with a lenient regex: an output is marked correct if the last number it contains equals the reference answer. For example, if the correct answer is 4683, outputs like "Answer is 4683" or "4683 bu sorunun doğru cevabıdır." ("4683 is the correct answer to this question.") are both accepted, so models don't lose points for formatting mistakes. Since every question has a definite integer answer, each question is scored binary correct/incorrect.
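The "last number wins" check described above can be sketched like this (a minimal illustration; the benchmark's actual regex may differ):

```python
import re

def is_correct(model_output: str, gold: str) -> bool:
    """Mark an output correct if the last integer it contains equals gold.

    This deliberately ignores surrounding text, so "Answer is 4683",
    "4683 bu sorunun doğru cevabıdır.", and "\\boxed{4683}" all pass
    when the gold answer is 4683.
    """
    numbers = re.findall(r"\d+", model_output)  # all digit runs, in order
    return bool(numbers) and numbers[-1] == gold
```

Note that this scheme penalizes outputs that append anything numeric after the final answer (e.g. a numbered footnote), which is the usual trade-off of last-number extraction.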
| Rank | Model | Total (/60) | AIME24 (/30) | AIME25 (/30) |
|---|---|---|---|---|
| 1 | Qwen/Qwen3.5-27B | 46 | 25 | 21 |
| 2 | Qwen/Qwen3.5-35B-A3B | 43 | 25 | 18 |
| 3 | Qwen/Qwen3.5-9B | 38 | 21 | 17 |
| 4 | Qwen/Qwen3.5-4B | 32 | 18 | 14 |
| 5 | google/gemma-3-27b-it | 10 | 5 | 5 |
| 6 | google/gemma-3-12b-it | 8 | 5 | 3 |
| 7 | Qwen/Qwen3.5-2B | 2 | 1 | 1 |
| 8 | google/gemma-3-4b-it | 2 | 2 | 0 |
| 9 | google/gemma-3-1b-it | 0 | 0 | 0 |