1. Introduction: The Python Blind Spot

LiveCodeBench (LCB) has become the gold standard for evaluating LLM code generation. It is contamination-aware, continuously updated, and widely trusted by the research community. But it has one glaring limitation: it only tests Python. In the real world, software is written in dozens of languages. A model that aces Python might flounder in JavaScript, Rust, or Go. This blind spot is exactly what Multi-LCB addresses.

Multi-LCB extends LiveCodeBench to 12 programming languages, giving researchers and engineers a rigorous, multi-language evaluation suite. The findings are sobering: Python overfitting is real, and most models show dramatic performance drops outside their primary training language. If you are selecting a coding model for a polyglot codebase, Python-only numbers are not enough.

2. What Is Multi-LCB?

Multi-LCB is an open-source benchmark that extends the original LiveCodeBench to 12 programming languages: Python, C++, Java, Go, JavaScript, TypeScript, C#, Rust, Ruby, PHP, Kotlin, and Scala.

Key design decisions set it apart:

  • STDIN/STDOUT format — All problems are converted from LeetCode-style functional APIs to a unified STDIN/STDOUT interface. This eliminates language-specific boilerplate and makes evaluation reproducible across any language.
  • Auto-tracking — Multi-LCB is designed to automatically stay in sync with future LCB updates. When new problems are added to LCB, the pipeline converts them automatically.
  • Academic acceptance — The paper has been accepted at ICLR 2026.
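
To make the STDIN/STDOUT conversion concrete, here is a minimal sketch of how the arguments of a LeetCode-style function call could be flattened into plain text. The encoding used here (a length line before a 1D array, a "rows cols" line before a 2D array) and the names encode_value and to_stdin are assumptions for illustration only; the format Multi-LCB actually emits may differ.

```python
def encode_value(value) -> str:
    """Encode one argument as text lines (hypothetical format, not Multi-LCB's)."""
    if isinstance(value, list):
        if value and all(isinstance(row, list) for row in value):
            # 2D array: "rows cols" header, then one row per line
            lines = [f"{len(value)} {len(value[0])}"]
            lines += [" ".join(str(x) for x in row) for row in value]
            return "\n".join(lines)
        # 1D array: length header, then all elements on one line
        return f"{len(value)}\n" + " ".join(str(x) for x in value)
    # Scalar: printed as-is
    return str(value)


def to_stdin(args: list) -> str:
    """Join the encoded arguments into a single STDIN payload."""
    return "\n".join(encode_value(a) for a in args) + "\n"


# A call such as two_sum([2, 7, 11, 15], 9) becomes three lines of input.
print(to_stdin([[2, 7, 11, 15], 9]), end="")
```

Because the payload is plain text, a solution in any of the 12 languages only needs to read lines and parse integers, with no language-specific harness around a function signature.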

The benchmark currently covers 1,401 problems from LCB (May 2024 – May 2025), transformed and validated across all 12 target languages.

3. How Does It Work?

The evaluation pipeline consists of five stages:

  1. Dataset loading — Fetches the latest LCB dataset from Hugging Face.
  2. Conversion — Transforms LeetCode functional test cases into STDIN/STDOUT format, handling scalar, 1D array, and 2D array input types.
  3. Zero-shot prompting — Each language gets a language-specific system message. No few-shot examples are provided, ensuring the model relies on its inherent coding ability.
  4. Execution — Code runs in isolated sandbox containers (6-second wall time, 4 GB RAM, no network access).
  5. Scoring — Pass@1 averaged over 10 independent runs for statistical reliability.
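
The scoring stage can be sketched as follows, assuming each sampled solution has already been judged as pass or fail. With n samples per problem, Pass@1 for a problem is the fraction of samples that pass, and the reported score is the mean over problems. The function name and the shape of the results dictionary are assumptions for illustration, not the benchmark's own code.

```python
from statistics import mean


def pass_at_1(results: dict[str, list[bool]]) -> float:
    """Mean over problems of the fraction of sampled runs that passed.

    results maps a problem id to one pass/fail outcome per sampled run.
    """
    return mean(sum(runs) / len(runs) for runs in results.values())


# Toy data: two problems, 10 samples each (illustrative, not real results).
toy = {
    "problem_a": [True] * 10,
    "problem_b": [True] * 3 + [False] * 7,
}
print(f"Pass@1 = {pass_at_1(toy):.2%}")
```

Averaging over 10 samples smooths out sampling noise from the non-zero temperature, which a single run per problem would not.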

To run Multi-LCB locally:

conda activate multi_lcb_env
python -m lcb_runner.runner.main \
    --model "VLLMAsync" \
    --local_model_path Qwen/Qwen2.5-Coder-3B-Instruct \
    --temperature 0.2 --top_p 0.95 --n 10 \
    --plangs "all" \
    --cot_code_execution

You can replace --plangs "all" with a comma-separated subset like --plangs "python,javascript" to target specific languages.

4. Models Evaluated

The paper evaluates 24 LLMs ranging from 7B to 685B parameters, covering a diverse set of architectures and training methodologies:

  • Qwen3 family — 8B, 32B, 30B-A3B (MoE), 235B-A22B, including thinking/instruction-tuned variants
  • DeepSeek-R1-0528 — The latest reasoning-focused release
  • GPT-OSS-120B — At both Medium and Low compute budgets
  • OpenReasoning-Nemotron-32B — A reasoning-augmented model
  • OlympicCoder and OpenCoder families — Open-source coding models
  • Reasoning-augmented variants — Compared directly against instruction-tuned counterparts to isolate the impact of chain-of-thought reasoning

Each model was evaluated 10 times per problem per language, yielding over 4 million execution runs in total.
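
The total follows from the counts quoted in this post, assuming every model is run on all 1,401 problems in all 12 languages:

```python
# Counts taken from the post; full coverage of every combination is assumed.
models, languages, problems, samples = 24, 12, 1401, 10

total_runs = models * languages * problems * samples
print(f"{total_runs:,}")  # 4,034,880
```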

5. Key Findings

5.1 Python Overfitting Is Real

The most striking finding is that models which excel at Python often collapse in other languages. Consider OpenReasoning-Nemotron-32B*: it achieves 64.4% Pass@1 on Python but drops to 10.8% on JavaScript and a mere 2.8% on Rust. This is not an isolated case. Across the board, Python performance consistently overstates cross-lingual competence.

The gap is especially pronounced for reasoning-augmented models, which appear to overfit to Python-specific patterns during chain-of-thought fine-tuning.
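
To put the size of that gap in perspective, the short calculation below uses only the three scores quoted above. The helper name python_gap is made up for this example.

```python
def python_gap(python_score: float, other_score: float) -> tuple[float, float]:
    """Return (absolute drop in points, fraction of the Python score retained)."""
    return python_score - other_score, other_score / python_score


for lang, score in {"JavaScript": 10.8, "Rust": 2.8}.items():
    drop, retained = python_gap(64.4, score)
    print(f"{lang}: -{drop:.1f} points, keeps {retained:.1%} of the Python score")
```

On these numbers the model keeps roughly a sixth of its Python score on JavaScript and under a twentieth on Rust.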

5.2 Performance Gradient Across Languages

The mean Pass@1 scores across all models reveal a clear hierarchy. Python leads at 48.2%, about four points ahead of Java and C++, while Scala brings up the rear at under 29%:

Language   | Mean Pass@1 | Best Model Score
Python     | 48.2%       | 74.0% (Qwen3-235B-Thinking)
Java       | ~44%        | 73.9%
C++        | ~44%        | 75.8%
C#         | ~38%        | 66.5%
Ruby       | ~38%        | 70.2%
PHP        | ~36%        | 69.0%
Go         | ~36%        | 69.9%
Rust       | ~36%        | 70.5%
Kotlin     | ~35%        | 71.0%
JavaScript | ~34%        | 70.5%
TypeScript | ~33%        | 70.3%
Scala      | <29%        | 62.3%

Only the Python mean is reported exactly; the other means are approximate. Nine of the 12 languages listed have mean scores below 40%, indicating massive room for improvement in multi-language code generation.

5.3 Top-10 Models Overall

Ranked by average Pass@1 across all 12 languages:

  1. GPT-OSS-120B* (Medium): 67.8%
  2. Qwen3-235B-A22B-Thinking-2507*: 64.0%
  3. DeepSeek-R1-0528*: 63.1%
  4. GPT-OSS-20B* (Medium): 59.8%
  5. Qwen3-30B-A3B-Thinking-2507*: 53.2%
  6. GPT-OSS-120B* (Low): 53.1%
  7. Qwen3-235B-A22B*: 48.9%
  8. Qwen3-32B*: 48.6%
  9. Qwen3-30B-A3B*: 45.5%
  10. GPT-OSS-20B* (Low): 43.3%

GPT-OSS-120B at Medium compute leads the pack, followed by Qwen3-235B-Thinking and DeepSeek-R1. Notably, compute budget (Medium vs Low) matters as much as model size: GPT-OSS-120B drops from 67.8% to 53.1% at Low compute, which puts it below the much smaller GPT-OSS-20B at Medium (59.8%).

5.4 Language-Specific Contamination

The paper includes a time-wise analysis that compares model performance on problems released before and after each model’s training cutoff date. The results confirm that residual contamination is present: scores are systematically higher on pre-cutoff problems. Many models exhibit sharp, step-like drops when evaluation crosses their training cutoff, indicating memorization of specific solutions rather than genuine language understanding.

This finding underscores the importance of contamination-aware benchmarks like Multi-LCB, which tracks problem release dates and enables rigorous time-based analysis.
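
A time-based split of this kind can be sketched in a few lines. The record layout (one release date and one pass/fail flag per problem), the function name, and the toy data are all assumptions for illustration; they are not results or code from the paper.

```python
from datetime import date


def split_by_cutoff(records, cutoff):
    """Return (pass rate before the cutoff, pass rate on or after it).

    records is an iterable of (release_date, passed) pairs for one model.
    """
    before = [passed for released, passed in records if released < cutoff]
    after = [passed for released, passed in records if released >= cutoff]

    def rate(xs):
        # NaN signals that no problems fell on this side of the cutoff.
        return sum(xs) / len(xs) if xs else float("nan")

    return rate(before), rate(after)


# Toy data only: four problems and a hypothetical cutoff of 1 October 2024.
toy_records = [
    (date(2024, 6, 1), True),
    (date(2024, 7, 1), True),
    (date(2024, 11, 1), True),
    (date(2024, 12, 1), False),
]
pre, post = split_by_cutoff(toy_records, date(2024, 10, 1))
print(f"pre-cutoff: {pre:.0%}, post-cutoff: {post:.0%}")
```

A large gap between the two rates for the same model is the signal the time-wise analysis looks for; running the split per language shows whether the effect is stronger in some languages than others.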

6. How to Use Multi-LCB

Multi-LCB is fully open-source and accessible to anyone:

  • License: CC BY-NC 4.0 — free for academic and non-commercial use.
  • GitHub: github.com/Multi-LCB/Multi-LCB
  • Leaderboard: multi-lcb.github.io
  • Dataset: Available on Hugging Face via the official repository.
  • Inference: Requires conda with SGLang or vLLM for model serving. Supports both local and API-based models.
  • Language selection: Run on specific languages with --plangs "python,java,rust" or evaluate all 12 at once with --plangs "all".

The pipeline is designed for easy integration into existing evaluation workflows. If you already use LiveCodeBench, switching to Multi-LCB requires minimal changes.

7. Why This Matters

Multi-LCB matters for three reasons:

  1. Python-only scores are misleading. OpenReasoning-Nemotron-32B scores 64.4% on Python yet falls outside the top 10 on the multi-language average. Relying solely on Python benchmarks leads to poor model selection decisions.
  2. Most languages are underserved. Nine of the 12 languages in Multi-LCB have mean Pass@1 below 40%. Even the single best model tops out near 70% on most languages and reaches only 62.3% on Scala. There is enormous headroom for improvement.
  3. The benchmark grows with LCB. Because Multi-LCB auto-tracks LCB updates, it will remain relevant as new problems are added. This means it can serve as a long-term standard for multi-language code evaluation.

For teams building or selecting coding AI, Multi-LCB provides the language-specific signal needed to make informed decisions.

8. Conclusion

Multi-LCB is a rigorous, contamination-aware, multi-language code benchmark that exposes a critical truth: strong Python performance does not guarantee cross-lingual competence. The benchmark extends LiveCodeBench to 12 languages, evaluates 24 models on 1,401 problems, and reveals dramatic performance gaps that Python-only evaluations miss.

Whether you are a researcher developing the next generation of code LLMs or an engineer selecting a model for a polyglot codebase, Multi-LCB gives you the data you need. The findings are clear — the field has a long way to go before AI coding assistants can truly claim multi-language proficiency.


Paper: Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages (ICLR 2026)

GitHub: github.com/Multi-LCB/Multi-LCB

Leaderboard: multi-lcb.github.io
