LLM leaderboard and evaluations

October 29, 2024

There are a variety of LLM leaderboards for reference. On the open side, these include Hugging Face's "Open LLM Leaderboard" (https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) and LMSYS's "Chatbot Arena LLM Leaderboard" (https://lmarena.ai/?leaderboard). On the closed side, Scale's "Leaderboard" is relatively new and seems to be gaining traction (https://scale.com/leaderboard). You can filter by different benchmarks, all of which include math components.

If you're choosing between frontier models, I'd just use the platform with the best UX for your use case. In general, frontier models all perform well in math (and show little sign of overfitting on the evaluation benchmarks - https://arxiv.org/html/2405.00332v3), including OpenAI's o1-preview, Anthropic's Claude 3.5 Sonnet, Google's Gemini 1.5 Pro, and Meta's Llama 3.1 405B Instruct.
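If you want to sanity-check a couple of models yourself rather than rely on a leaderboard, a minimal sketch like the one below sends the same math prompt to two providers via their official Python SDKs. The model IDs and prompt are just assumptions - swap in whatever you have access to, and make sure OPENAI_API_KEY and ANTHROPIC_API_KEY are set.

```python
# Sketch: send the same math prompt to two frontier models and compare answers.
# Assumes the official `openai` and `anthropic` Python SDKs are installed and
# that OPENAI_API_KEY / ANTHROPIC_API_KEY are set in the environment.
from openai import OpenAI
from anthropic import Anthropic

PROMPT = "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"

openai_client = OpenAI()
openai_reply = openai_client.chat.completions.create(
    model="o1-preview",  # assumed model ID
    messages=[{"role": "user", "content": PROMPT}],
)
print("OpenAI:", openai_reply.choices[0].message.content)

anthropic_client = Anthropic()
anthropic_reply = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20241022",  # assumed model ID
    max_tokens=1024,
    messages=[{"role": "user", "content": PROMPT}],
)
print("Anthropic:", anthropic_reply.content[0].text)
```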

Both OpenAI (https://openai.com/index/improvements-to-data-analysis-in-chatgpt/) and Anthropic (more recently, https://www.anthropic.com/news/analysis-tool) have data analysis tools built in, and Hugging Face (https://huggingface.co/chat/) has the same for open models.
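Those built-in tools live in the chat UIs, but if you want a similar workflow programmatically, a rough sketch using the OpenAI Assistants API with the code_interpreter tool (beta at the time of writing) could look like this. The model ID and prompt are placeholders, not a recommendation.

```python
# Sketch: run a data-analysis style request through the OpenAI Assistants API
# with the code_interpreter tool enabled (beta API; assumes the `openai` SDK
# and OPENAI_API_KEY are set up). Model ID and prompt are placeholders.
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    model="gpt-4o",  # assumed model ID
    instructions="You are a data analyst. Show your working.",
    tools=[{"type": "code_interpreter"}],
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Compute the mean and standard deviation of [3, 7, 7, 19].",
)

# Create a run, block until it finishes, then print the assistant's latest reply.
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, assistant_id=assistant.id
)
if run.status == "completed":
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    print(messages.data[0].content[0].text.value)
```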

There are also open-source tools that let you compare multiple frontier models at once, which gives you a sense check on the variety of responses - https://github.com/nat/openplayground?tab=readme-ov-file.

If you're choosing between OpenAI models, then o1-preview is leading with its advanced reasoning. You can compare different OpenAI models and parameter configurations here - https://platform.openai.com/playground/chat?models=gpt-4o-mini-2024-07-18.
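If you'd rather script those comparisons than click through the playground, here's a minimal sketch of sweeping one parameter (temperature) against a single model and prompt. The model ID is the one from the playground link above; the prompt is a placeholder.

```python
# Sketch: compare the same prompt across a few temperature settings.
# Assumes the `openai` Python SDK and OPENAI_API_KEY; the model ID is taken
# from the playground link above.
from openai import OpenAI

client = OpenAI()
PROMPT = "Solve for x: 2x + 6 = 20. Explain each step briefly."

for temperature in (0.0, 0.7, 1.2):
    reply = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=temperature,
    )
    print(f"--- temperature={temperature} ---")
    print(reply.choices[0].message.content)
```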
