(Yicai) June 20 -- Seven large language models, including OpenAI’s GPT-4o, were recently made to ‘sit’ China’s notoriously difficult college entrance exam. They did relatively well in the English and Chinese language tests, but every one of them failed the math paper.
GPT-4o, as well as open-source models developed by China’s Alibaba Group Holding, 01.AI, Zhipu AI, and Shanghai Artificial Intelligence Laboratory, and by France’s Mistral AI, were put to the test by OpenCompass, the Shanghai AI Lab’s evaluation system.
China’s tough college entrance exams are a good way of gauging LLMs’ intelligence, the Shanghai AI Lab said. The papers were all marked manually, and the examiners were not told that they had been completed by machines. The exams contained both objective and subjective questions, it added.
Alibaba’s Qwen 2-72B was the top performer, scoring 303 points out of a possible 420 across the three subjects, according to the results published by OpenCompass yesterday. It was followed by US firm OpenAI’s GPT-4o with 296 and the Shanghai AI Lab’s InternLM 2.0 with 295.5. Mistral AI’s LLM came last with 185.
Each one failed the math test, however. InternLM 2.0 achieved the highest score, at just 75 points out of 150. GPT-4o was second with 73.
The examiners found that the generative AI models’ answers to subjective math questions were illogical and confused. Sometimes the answer was correct even though the reasoning behind it was wrong. The LLMs are able to memorize formulas well, but they have trouble explaining how they solved the problems.
This shows that LLMs have much room to improve their math skills, Lin Dahua, a scientist at the Shanghai AI Lab, told Yicai. Math involves complex reasoning, which is a key ability if LLMs are to be used in finance and other vital areas.
The AI models performed well in modern Chinese, but there was a big gap in their knowledge of classical Chinese. Qwen scored highest in the Chinese paper with 124 out of 150 points, while GPT-4o excelled in English with 109 out of 120 points.
In English, most humans who take the test lose points for not writing enough, but the AI models tended to have points deducted for exceeding the word limit.
Editor: Kim Taylor