1. Introduction
Looped Transformers scale test-time computation by reusing a shared block of layers to iteratively refine hidden representations without increasing parameter count. However, traditional looped architectures suffer from a linear scalability penalty: each additional loop proportionally increases latency and KV-cache memory, making large loop counts prohibitively expensive in practice.
Parallel Loop Transformers (PLT) solve this through Cross-Loop Position Offsets (CLP) and Shared-KV Gated Sliding-Window Attention (G-SWA). CLP shifts hidden state positions at each loop boundary to break sequential dependencies, enabling parallel execution. G-SWA maintains a constant KV-cache footprint regardless of loop count by reusing keys and values across iterations. Together, these mechanisms flatten the cost curve of looping—but this raises a key question: how many loops should we actually use?
This article presents LoopCoder-v2, a family of 7B-parameter PLT code generation models trained from scratch on 18 trillion tokens with loop counts R ∈ {1, 2, 3, 4}. Across ten benchmarks spanning code generation, reasoning, agentic software engineering, and tool use, we find a consistent result: performance peaks at R = 2, and additional loops actively harm results.
2. Background — Parallel Loop Transformers
Conventional looped Transformers process iterations sequentially, with each loop depending on the complete output of the previous one. This forces O(R) runtime and linearly growing KV-cache memory, which quickly becomes untenable for long-context tasks like repository-level code editing. PLT breaks this dependency chain via two innovations: CLP shifts hidden states right by one token position at each loop boundary, allowing all loops to compute in parallel (O(R) → O(1) wall-clock time); G-SWA reuses keys and values across loops with a gated sliding-window mechanism, keeping the cache footprint constant regardless of R. These mechanisms transform loop count from a cost constraint into an architectural design parameter.
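The two mechanisms can be illustrated with a toy sketch. It follows one reading of the description above and is not the PLT implementation: the zero-vector padding at position 0, the helper names `clp_shift` and `kv_cache_slots`, and the simple "one cached position per token per loop" cost model are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def clp_shift(hidden: List[Vector], pad_value: float = 0.0) -> List[Vector]:
    """Shift per-token hidden states right by one position.

    Position t of the result holds the state that was at position t - 1.
    Position 0 is filled with a padding vector (an assumption of this sketch).
    """
    if not hidden:
        return []
    pad = [pad_value] * len(hidden[0])
    return [pad] + [list(v) for v in hidden[:-1]]


def kv_cache_slots(seq_len: int, loops: int, shared: bool) -> int:
    """Count cached key/value positions per layer under a simplified model.

    Without sharing, every loop keeps its own keys and values, so the count
    grows linearly with the loop count. With sharing, it stays at seq_len.
    """
    return seq_len if shared else seq_len * loops


# Toy loop-1 outputs for three tokens, two dimensions each.
loop1_out = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
loop2_in = clp_shift(loop1_out)
print(loop2_in)  # [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]

print(kv_cache_slots(4096, loops=4, shared=False))  # 16384
print(kv_cache_slots(4096, loops=4, shared=True))   # 4096
```

After the shift, the loop-2 input at position t depends only on the loop-1 output at position t - 1, which already exists when token t is being processed. That is the property that lets the loops run side by side instead of one after another.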
3. Gain–Cost Framework
We formalize loop selection with a gain–cost framework. Each additional loop r offers a refinement gain, gain(r): another opportunity to adjust hidden representations and improve predictions. Each loop also incurs an intrinsic cost, Ω(r), from the CLP-induced positional mismatch; this cost stays approximately constant across loops. The net benefit of loop r is net(r) = gain(r) − Ω(r). While gain(r) exceeds Ω(r), adding the loop helps; once gain(r) falls below the roughly fixed cost, adding it makes performance regress. The framework therefore predicts a non-monotonic relationship between loop count and quality, with the optimum at the last loop whose marginal gain still exceeds the offset cost.
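A small numeric sketch shows how the framework produces an interior optimum. The gain values and the constant Ω below are hypothetical numbers chosen for illustration, not quantities measured for LoopCoder-v2, and the assumption that per-loop net effects simply add up is a simplification of this sketch.

```python
from typing import Dict, List, Tuple


def cumulative_net(
    gains: Dict[int, float], omega: float, max_loops: int
) -> List[Tuple[int, float]]:
    """Return (R, cumulative net benefit over the no-loop baseline) for R = 1..max_loops."""
    total = 0.0
    curve = [(1, total)]  # R = 1 is the baseline: no extra loop, no offset cost
    for r in range(2, max_loops + 1):
        total += gains[r] - omega  # net(r) = gain(r) - Omega(r)
        curve.append((r, total))
    return curve


def best_loop_count(curve: List[Tuple[int, float]]) -> int:
    """Pick the loop count with the highest cumulative net benefit."""
    return max(curve, key=lambda point: point[1])[0]


# Hypothetical values: a large gain at loop 2 that decays quickly afterwards.
hypothetical_gains = {2: 8.0, 3: 0.5, 4: 0.25}
hypothetical_omega = 1.5

curve = cumulative_net(hypothetical_gains, hypothetical_omega, max_loops=4)
print(curve)                   # [(1, 0.0), (2, 6.5), (3, 5.5), (4, 4.25)]
print(best_loop_count(curve))  # 2
```

With these numbers, loop 2 is the last loop whose gain exceeds the offset cost, so the curve peaks at R = 2 and declines afterwards.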
4. Experimental Setup
LoopCoder-v2 is a family of 7B-parameter PLT coder models trained from scratch with the following configuration:
- Architecture: PLT with CLP and G-SWA, 7B parameters.
- Training data: 18 trillion tokens at 1:1 text-to-code ratio, 100+ programming languages.
- Variants: R = 1 (standard transformer baseline), R = 2, R = 3, and R = 4 loops, all with identical training data and protocols.
- Evaluation: 10 suites including SWE-bench Verified, Multi-SWE, LiveCodeBench, Terminal-Bench, BFCL, and five additional coding benchmarks.
5. Results — Loop-Count Sweep
Table 1 presents the main results across key benchmarks and the average over all 10 evaluation suites.
| Model (7B) | SWE-bench Verified | Multi-SWE | LiveCodeBench | Avg (10 benchmarks) |
|---|---|---|---|---|
| No-loop (R=1) | 43.0 | 14.0 | 27.4 | 38.0 |
| LoopCoder-v2 (R=2) | **64.4** | **31.0** | **35.4** | **46.5** |
| LoopCoder-v2 (R=3) | 27.6 | 11.0 | 28.6 | 36.9 |
| LoopCoder-v2 (R=4) | 22.4 | 9.3 | 24.5 | 34.3 |

Table 1: Benchmark results across loop-count variants. Bold values indicate the best performance in each column.
The results show a strongly non-monotonic pattern. R=2 delivers broad gains over the baseline: SWE-bench Verified improves from 43.0 to 64.4, Multi-SWE from 14.0 to 31.0, LiveCodeBench from 27.4 to 35.4, and the 10-benchmark average from 38.0 to 46.5. Critically, both variants with R ≥ 3 fall below the R=1 baseline on the 10-benchmark average (36.9 and 34.3 versus 38.0), and they trail it on every column of Table 1 except LiveCodeBench at R=3 (28.6 versus 27.4). At R=4, the average drops to 34.3, 12.2 points below the R=2 peak.
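The arithmetic behind these comparisons can be checked directly from Table 1. The snippet below only re-enters the published numbers; the dictionary layout and helper name are illustrative choices.

```python
# Scores copied from Table 1, keyed by loop count R.
table1 = {
    1: {"SWE-bench Verified": 43.0, "Multi-SWE": 14.0, "LiveCodeBench": 27.4, "Avg": 38.0},
    2: {"SWE-bench Verified": 64.4, "Multi-SWE": 31.0, "LiveCodeBench": 35.4, "Avg": 46.5},
    3: {"SWE-bench Verified": 27.6, "Multi-SWE": 11.0, "LiveCodeBench": 28.6, "Avg": 36.9},
    4: {"SWE-bench Verified": 22.4, "Multi-SWE": 9.3, "LiveCodeBench": 24.5, "Avg": 34.3},
}


def delta(r_a: int, r_b: int, column: str) -> float:
    """Score of variant r_a minus score of variant r_b, rounded to one decimal."""
    return round(table1[r_a][column] - table1[r_b][column], 1)


# Loop count with the best score in each column.
best_r = {col: max(table1, key=lambda r: table1[r][col]) for col in table1[1]}

# Cells where a variant with R >= 3 still beats the R = 1 baseline.
above_baseline = [
    (r, col) for r in (3, 4) for col in table1[r] if table1[r][col] > table1[1][col]
]

print(best_r)                              # R = 2 wins every column
print(delta(2, 1, "SWE-bench Verified"))   # 21.4
print(delta(2, 4, "Avg"))                  # 12.2
print(above_baseline)                      # [(3, 'LiveCodeBench')]
```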
To contextualize: at 64.4 on SWE-bench Verified, the 7B LoopCoder-v2 (R=2) surpasses Qwen3-235B (45.2) and approaches far larger models, including Qwen3-Coder-480B (67.0) and Kimi-K2 (69.2). This demonstrates that well-tuned test-time computation scaling via looping can be remarkably parameter-efficient.
6. Diagnostic Analysis — Why Only Two Loops?
Multiple diagnostic analyses converge on the same explanation. Hidden state dynamics show that loop 2 accounts for the majority of representational change; from loop 3 onward, updates become incremental rather than transformative. Attention pattern analysis confirms that the most significant rerouting occurs in loop 2, shifting from exploratory to exploitative patterns, after which attention stabilizes. KL divergence between output distributions shows the largest shift at loop 2, with the shift at loop 3 an order of magnitude smaller. Effective rank of hidden representations, a proxy for representational diversity, peaks at loop 2 and narrows thereafter, indicating degeneracy in later loops.
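Two of these diagnostics are simple to compute. The sketch below is a minimal version under stated assumptions: the article does not say which effective-rank definition it uses, so the code takes the common entropy-based one (the exponential of the Shannon entropy of the normalized singular values), and it expects the singular values to be computed elsewhere. The example distributions are made up.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Terms with p_i = 0 contribute nothing; q_i is floored at eps to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def effective_rank(singular_values: Sequence[float]) -> float:
    """Entropy-based effective rank: exp(H) of the normalized singular values."""
    total = sum(singular_values)
    if total <= 0.0:
        raise ValueError("singular values must have a positive sum")
    probs = [s / total for s in singular_values if s > 0.0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# Made-up next-token distributions after successive loops.
after_loop1 = [0.25, 0.25, 0.25, 0.25]
after_loop2 = [0.70, 0.10, 0.10, 0.10]
print(kl_divergence(after_loop2, after_loop1))  # positive: the distribution moved

# A flat spectrum uses every direction; a peaked one is close to rank 1.
print(effective_rank([1.0, 1.0, 1.0, 1.0]))
print(effective_rank([1.0, 0.0, 0.0, 0.0]))
```

A flat spectrum gives an effective rank equal to the number of singular values, while a spectrum dominated by one value gives an effective rank near 1, which is the sense in which a falling effective rank signals lost representational diversity.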
The gain–cost framework explains this consistency: the CLP offset cost Ω(r) remains fixed per loop, but the refinement gain gain(r) drops sharply after loop 2. Beyond R=2, the diminishing gain cannot overcome the fixed positional mismatch, producing a negative net effect. This is why additional loops are not just wasteful but actively harmful.
7. Implications
For PLT-based models, R=2 is the empirically validated optimum—summarized as "only loop once" beyond the initial pass. Loop count should be treated as a hyperparameter requiring careful tuning rather than assuming monotonic improvement. The non-monotonic effect suggests static uniform looping is suboptimal; future work should explore instance-conditioned loop allocation, where the model dynamically determines loop count based on token-level or sequence-level difficulty, allocating more computation to hard examples while avoiding the over-refinement penalty observed with static R > 2 configurations.
8. Getting Started with LoopCoder-v2
LoopCoder-v2 is available on HuggingFace Hub at Multilingual-Multimodal-NLP/LoopCoder-V2. The R=2 variant is recommended for best performance.
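```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Multilingual-Multimodal-NLP/LoopCoder-V2"

# trust_remote_code=True lets transformers run the model code shipped in the
# repository, which is how the PLT-specific components are loaded.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto"
)

# Tokenize the prompt and move the tensors to the model's device.
prompt = "Write a Python function that returns the n-th Fibonacci number."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate up to 256 new tokens and decode them back to text.
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```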
The model supports zero-shot code generation, instruction following for software engineering tasks, and fine-tuning for domain-specific applications. Load with trust_remote_code=True to enable PLT-specific architectural components.
9. Conclusion
LoopCoder-v2 establishes that Parallel Loop Transformers make test-time computation scaling practical, but that more loops are not better. Through systematic evaluation of a 7B PLT family trained on 18 trillion tokens, we show that performance peaks at exactly two loops: every score in Table 1 drops from the R=2 peak when a third or fourth loop is added, and the 10-benchmark average falls below the no-loop baseline. The gain–cost framework explains this: loop 2 delivers the principal refinement across hidden states, attention patterns, and output distributions, while fixed CLP offset costs dominate as later loops yield diminishing returns. The effective rank of representations peaks at loop 2 and narrows thereafter, confirming the loss of representational diversity with excessive looping. At 64.4 on SWE-bench Verified, LoopCoder-v2 (R=2) sets a new state of the art for 7B-scale code models, surpassing models 30× its size and approaching the largest open coding models. The central lesson: in PLT architectures, loop once, and only once, beyond the initial pass.