DMax: Aggressive Parallel Decoding for dLLMs

Source

Raw Markdown: paper_dmax-2026.md
PDF: paper_dmax-2026.pdf
Preprint: arXiv 2604.08302v3
Official code: czg1225/DMax
Official models: DMax-16B, DMax-Math-16B, DMax-Coder-16B
Official training data: math trajectories, code trajectories
Gonzo ML discussion: Telegram post 5420 (local extract stored at papers/dmax-2026/telegram-post-gonzo-ml-5420.md)
Gonzo-linked review: ArXivIQ review
Podcast pointer: Gonzo ML Podcasts 3727
Local official-artifact metadata: papers/dmax-2026/official_artifacts_metadata.json

Status And Credibility

arXiv lists DMax as cs.LG / cs.AI, first submitted on 2026-04-09 and revised as v3 on 2026-05-15. The paper is a current 2026 preprint from National University of Singapore authors Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, and Xinchao Wang. It is not a peer-reviewed venue result at ingest time.

Credibility is strong enough to treat this as an important ingest: the paper has an official Apache-2.0 GitHub repository, public Hugging Face model checkpoints, public synthetic math/code trajectory datasets, and concrete benchmark/throughput numbers. The Gonzo post listed the model as N/A, but public metadata retrieved on 2026-06-20 shows three official Hugging Face checkpoints: Zigeng/DMax-16B, Zigeng/DMax-Math-16B, and Zigeng/DMax-Coder-16B.

Core Claim

DMax argues that masked diffusion language models lose quality under aggressive parallel decoding because early mask-to-token predictions become hard, irreversible context. The paper’s fix is to make decoded positions revisable: train the model on its own predicted noisy sequences, then decode through soft embedding-space states instead of immediately committing every selected token.

The result is a diffusion language-model recipe that increases tokens per forward pass (TPF) while keeping accuracy close to the LLaDA-2.0-mini baseline on math and code benchmarks.

Mechanism

DMax has two coupled pieces:

On-Policy Uniform Training (OPUT) fine-tunes a pretrained masked diffusion language model on both masked inputs and its own sampled predictions. To build the noisy predicted sequence, each masked position is filled with a token sampled from the model’s current predictive distribution, while unmasked positions are kept as they are:

$$x_{t}^{(p), i} = \begin{cases} x_{t}^{(m), i}, & x_{t}^{(m), i} \neq \text{[MASK]} \\ \hat{x}^{i}, \quad \hat{x}^{i} \sim p_{θ}(\cdot \mid x_{t}^{(m)}), & x_{t}^{(m), i} = \text{[MASK]} \end{cases}$$

The model then learns to recover the clean sequence from both $x_{t}^{(m)}$ and $x_{t}^{(p)}$, so inference-time self-generated mistakes become part of the training distribution.
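The construction of the noisy predicted sequence can be illustrated with a toy sketch. This is not the official DMax training code: `MASK_ID`, `build_on_policy_sequence`, and the hand-written probability rows are hypothetical stand-ins, and a real run would take the per-position distributions from a forward pass of the diffusion model.

```python
import random

MASK_ID = 0  # hypothetical id for the [MASK] token


def build_on_policy_sequence(masked_ids, probs, rng):
    """Return x_t^(p): visible tokens are copied, [MASK] slots are sampled.

    masked_ids: list of token ids, the masked input x_t^(m)
    probs:      per-position probability lists standing in for
                p_theta(. | x_t^(m)); a real model would produce these
    """
    noisy = []
    for token_id, dist in zip(masked_ids, probs):
        if token_id == MASK_ID:
            # Sample from the model's own distribution (on-policy noise).
            sampled = rng.choices(range(len(dist)), weights=dist, k=1)[0]
            noisy.append(sampled)
        else:
            # Positions that were never masked stay untouched.
            noisy.append(token_id)
    return noisy


rng = random.Random(0)
x_m = [3, MASK_ID, 2, MASK_ID]
# Toy distributions over a 5-token vocabulary; index 0 ([MASK]) gets zero mass.
toy_probs = [
    [0.0, 0.1, 0.1, 0.7, 0.1],
    [0.0, 0.6, 0.2, 0.1, 0.1],
    [0.0, 0.1, 0.7, 0.1, 0.1],
    [0.0, 0.25, 0.25, 0.25, 0.25],
]
x_p = build_on_policy_sequence(x_m, toy_probs, rng)
print(x_p)
```

Because the sampled tokens can be wrong, training on `x_p` exposes the model to the kind of errors it will see in its own context at inference time.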

Soft Parallel Decoding (SPD) keeps decoded positions as hybrid embeddings before final commitment. If a position’s previous top-1 token is $y_{j}$ with confidence $π_{j}$, DMax feeds a normalized interpolation between the token embedding and the mask embedding:

$$\tilde{h}_{j} = π_{j} \, e(y_{j}) + (1 - π_{j}) \, e_{\text{mask}}$$

This keeps uncertainty visible to later denoising steps instead of turning every early token choice into fixed context.
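A minimal sketch of the interpolation, using plain Python lists as toy embeddings. The function name `hybrid_embedding` and the three-dimensional vectors are illustrative assumptions; the sketch implements only the convex combination shown in the formula and omits any additional normalization the official implementation may apply.

```python
def hybrid_embedding(token_emb, mask_emb, confidence):
    """Blend a token embedding with the mask embedding by confidence.

    Implements pi * e(y) + (1 - pi) * e_mask element-wise.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return [
        confidence * t + (1.0 - confidence) * m
        for t, m in zip(token_emb, mask_emb)
    ]


e_token = [1.0, 0.0, 2.0]  # toy e(y_j)
e_mask = [0.0, 1.0, 0.0]   # toy e_mask

confident = hybrid_embedding(e_token, e_mask, 0.9)  # close to the token
unsure = hybrid_embedding(e_token, e_mask, 0.2)     # close to the mask
print(confident, unsure)
```

At confidence 1 the input is the ordinary token embedding and at confidence 0 it is the mask embedding, so a low-confidence position still looks mostly masked to the next denoising step.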

flowchart LR
  Masked[Masked block] --> Predict[Parallel token predictions]
  Predict --> OPUT[OPUT: train on model-sampled noisy states]
  OPUT --> SelfCorrect[Self-corrective dLLM]
  SelfCorrect --> SPD[SPD: token-mask hybrid embeddings]
  SPD --> Revise[Iterative self-revision]
  Revise --> Commit[Commit stable block]

Evidence

DMax is built on LLaDA-2.0-mini. The paper trains two specialized variants, DMax-Math and DMax-Coder, by full-parameter fine-tuning on 8 H200 GPUs. The training data is self-distilled: LLaDA-2.0-mini generates responses for public math and code prompts, yielding about 0.7M math samples and 1.0M code samples. Evaluation runs through dInFer on 2 H200 GPUs.

Key reported numbers:

| Benchmark | Baseline LLaDA-2.0-mini (TPF / accuracy) | DMax variant (TPF / accuracy) | Accuracy change |
| --- | --- | --- | --- |
| GSM8K | 2.04 TPF / 92.6% | 5.48 TPF / 92.1% | -0.5 pp |
| MATH500 | 2.58 TPF / 75.8% | 5.94 TPF / 75.4% | -0.4 pp |
| HumanEval-Instruct | 4.38 TPF / 84.2% | 7.36 TPF / 83.5% | -0.7 pp |
| MBPP-Instruct | 2.71 TPF / 80.6% | 5.86 TPF / 79.2% | -1.4 pp |

The paper summarizes this as raising average TPF from about 2.8 to 6.2 while preserving accuracy, and reports an average throughput of 1,338 tokens per second (TPS) at batch size 1 on two H200 GPUs. Ablations support the coupling between OPUT and SPD: OPUT alone improves self-revision, SPD improves robustness under highly parallel decoding, and applying SPD to the original model without OPUT fine-tuning leads to collapse.
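The per-benchmark tradeoff in the table can be recomputed directly. The sketch below only re-derives ratios and differences from the four rows listed above; the variable names are illustrative, and the paper's own averages may cover a different benchmark set than these four rows.

```python
# (baseline TPF, baseline accuracy %, DMax TPF, DMax accuracy %), from the table
results = {
    "GSM8K": (2.04, 92.6, 5.48, 92.1),
    "MATH500": (2.58, 75.8, 5.94, 75.4),
    "HumanEval-Instruct": (4.38, 84.2, 7.36, 83.5),
    "MBPP-Instruct": (2.71, 80.6, 5.86, 79.2),
}


def summarize(rows):
    """Return {benchmark: (TPF ratio, accuracy change in percentage points)}."""
    return {
        name: (dmax_tpf / base_tpf, round(dmax_acc - base_acc, 1))
        for name, (base_tpf, base_acc, dmax_tpf, dmax_acc) in rows.items()
    }


summary = summarize(results)
for name, (ratio, delta_pp) in summary.items():
    print(f"{name}: {ratio:.2f}x TPF, {delta_pp:+.1f} pp")
```

On these rows the TPF ratio ranges from about 1.68x (HumanEval-Instruct, where the baseline already decodes more than four tokens per forward pass) to about 2.69x (GSM8K), with accuracy changes between -0.4 and -1.4 percentage points.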

Relevance To This Wiki

DMax is upstream language-model evidence, not a time-series model. Its value for the wiki is the decoding and post-training pattern: when generation proceeds in blocks, intermediate predictions should remain revisable until uncertainty has collapsed enough to commit.
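One way to picture "revisable until uncertainty has collapsed" is a toy commit rule for a block. This is a hypothetical illustration, not DMax's actual convergence criterion: the function `should_commit`, the stability check, and the `threshold` value of 0.9 are all assumptions made for the sketch.

```python
def should_commit(prev_tokens, curr_tokens, confidences, threshold=0.9):
    """Toy rule: commit a block only when it is both stable and confident.

    prev_tokens / curr_tokens: top-1 predictions from two consecutive steps
    confidences:               top-1 probabilities for the current step
    threshold:                 hypothetical confidence cutoff
    """
    stable = list(prev_tokens) == list(curr_tokens)
    confident = all(c >= threshold for c in confidences)
    return stable and confident


# Step 2 still revises position 1, so the block stays soft.
early = should_commit([5, 7, 9], [5, 8, 9], [0.95, 0.60, 0.97])
# Step 3 repeats step 2's tokens with high confidence, so the block commits.
late = should_commit([5, 8, 9], [5, 8, 9], [0.96, 0.93, 0.98])
print(early, late)
```

A numeric time-series analogue would need its own notion of stability, for example a tolerance on value changes rather than exact token equality.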

For time-series and world-model work, the closest analogy is horizon generation or candidate rollout under a serving budget. A foundation time-series model might need to generate many future steps, plausible scenarios, edits, or action-conditioned rollouts without paying fully serial decoding cost. DMax suggests that aggressive parallelism can be made safer when the model is explicitly trained to correct its own intermediate states and when the intermediate representation preserves confidence information.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Dynamic compute under a serving budget | adjacent | DMax reports a strong TPF/TPS versus accuracy tradeoff for diffusion language generation and exposes decoding thresholds/convergence criteria as inference knobs. | Needs numeric time-series, trajectory, or event-stream generation tests under matched wall-clock latency, memory, and horizon-quality budgets. |
| Multi-modal future distributions / generation | adjacent | Diffusion-style generation and self-revising denoising are relevant mechanisms for sample-path generation. | The paper evaluates text math/code outputs, not calibrated multiple numeric futures, edits, or action-conditioned rollouts. |
| Training objective for self-correction | adjacent | OPUT trains on model-sampled noisy states, directly targeting train-inference mismatch and self-generated error recovery. | A TSFM analogue must show that on-policy corrupted states preserve rare regimes, dense numeric values, exogenous variables, and interventions rather than only average trajectory plausibility. |
| Benchmark hygiene | warning | Throughput is reported on H200 hardware and benchmarks are GSM8K/MATH500/ASDIV/HumanEval/MBPP. | Do not translate language TPF gains into TSFM readiness without hardware-aware serving tests and domain-specific accuracy, calibration, and downstream utility metrics. |
| Native multivariate encoding and control | insufficient evidence | No multivariate numeric channels, action history, control inputs, interventions, or counterfactuals are modeled. | Needs action-conditioned multivariate time-series experiments. |

Limitations

Evidence is confined to diffusion language models built from LLaDA-2.0-mini; it does not test time-series, robotics trajectories, telemetry, event streams, or action-conditioned world models.
The main recipe uses full-parameter fine-tuning on 8 H200 GPUs and evaluation on 2 H200 GPUs, so the serving claim should be read with hardware and implementation context.
The training data is self-distilled from the base model’s own generations. That is useful for self-correction, but it can also preserve base-model blind spots unless audited separately.
Soft embeddings depend on confidence values being meaningful enough to guide interpolation. Poor calibration or distribution shift could make the soft state misleading.
The public checkpoints require trust_remote_code=True in the model-card examples, so production use needs ordinary remote-code security review.

Links Into The Wiki

Open Questions

Can OPUT-style on-policy corrupted-state training become a general recipe for time-series generators that must revise tentative future trajectories before committing them?
What is the time-series analogue of TPF: generated samples per forward pass, channel-time cells per update, scenario horizon per denoising step, or downstream utility per wall-clock second?
Does soft embedding-space revision preserve rare numeric detail better than hard token/patch commitment in long-horizon generation?
Can confidence-based convergence criteria become calibrated uncertainty or abstention signals for operational time-series systems?
How should self-distilled trajectory data be audited so it improves self-correction without amplifying base-model shortcuts or erasing rare regimes?
