Canvas Calendar Agent — Training Pipeline Walkthrough
End-to-end recipe for producing the v3.0 Canvas Calendar Agent: from raw Canvas LMS contributions to released SFT + DPO models, with the agentic harness in kleinpanic/CS3704-Canvas-Project. Hardware: NVIDIA DGX Spark (Grace-Blackwell GB10 SoC, 122 GiB unified memory, SM121).
This document supplements .planning/ROADMAP.md and per-phase plans by giving the operational recipe an outside reader can follow. References to academic foundations (DPO, KTO, etc.) and to the artifacts and scripts that implement each step.
1. Architecture in one paragraph
A 2.7B-parameter Gemma-4-E2B-IT base is fine-tuned with Supervised Fine-Tuning on Canvas-session trajectories that already contain native Gemma-4 tool-call markers, then aligned with Direct Preference Optimization (Rafailov et al., 2023) on 1,071 prompt/chosen/rejected triples labeled at temperature 0 by a Gemma-4-31B-IT-NVFP4 teacher running on slot 0 via vLLM. The resulting checkpoint is consumed by canvas_sdk.CanvasAgent, which loops Gemma4Backend → tool_parser → REGISTRY.dispatch → format_tool_result → Gemma4Backend until the model emits a final answer with no tool call.
2. Stack and tooling
| Layer | Stack | Notes |
|---|---|---|
| Runtime | DGX Spark (GB10 SoC, 122 GiB UMA) | Single-host, no distributed training |
| Inference (teacher) | vLLM 0.18 + NVFP4 modelopt quant | vllm/vllm-openai:gemma4-cu130 image |
| Forge router | spark-ai-v2 stack | OpenAI-compatible at :18080, handles auth, slot management |
| Training | TRL 1.1 + Transformers 4.47+ + PyTorch 2.10 (NGC 25.11-py3) | Container-based, docker-compose.training.yaml |
| Data CLI | `canvas-data` (this repo) | merge / generate-pairs / label / audit / split-for-release / gen-kto-large / gen-benchmark-large |
| Train CLI | `canvas-train` (this repo) | `--method {sft, lora, qlora, dpo, ipo, apo-zero, sppo, nca, kto, all}` |
| Anon | piiranha (CPU-only) + regex pre-pass | Spacy NER + custom CRN patterns |
| Release | `canvas-release` | GGUF Q2_K..Q8_0 (6 quants × 9 methods) |
| Agentic harness | `canvas_sdk.CanvasAgent` (Canvas-Project repo) | OpenAI-compatible wrapper + tool dispatcher |
3. Data pipeline
3.1 Collection — multi-contributor
Each contributor runs the Canvas TUI's share_my_canvas.py extractor, which produces a JSONL snapshot of their courses, assignments, modules, and trajectory recordings. Examples in this dataset:
- Williammm23.jsonl (William; 34 courses, 909 assignments)
- kleinpanic.jsonl, Jada-001.jsonl, etc.
These land in data/collab/*.jsonl.
3.2 Anonymization — two-pass (CRN + PII)
Pass 1 — CRN: canvas-data merge --inputs data/collab/*.jsonl --out data/merged.jsonl
src/dataset/pipeline.py:detect_crn matches \b[A-Z]{2,5}_\d{4}_\d+_\d{6}\b (Virginia Tech CRN form CS_3704_21936_202601). anonymize_crn builds a stable registry mapping each raw CRN → @COURSEn. The merge step also dedups assignments by normalized (title, course) key.
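The CRN pass can be sketched as follows. This is a minimal illustration of the registry behavior described above (stable raw CRN → `@COURSEn` mapping); the function bodies are illustrative, not the project's actual `detect_crn`/`anonymize_crn` code.

```python
import re

# Virginia Tech CRN form, e.g. CS_3704_21936_202601 (same pattern as pipeline.py)
CRN_RE = re.compile(r"\b[A-Z]{2,5}_\d{4}_\d+_\d{6}\b")

def anonymize_crns(text: str, registry: dict[str, str]) -> str:
    """Replace each raw CRN with a stable @COURSEn placeholder.

    The registry is shared across calls so the same CRN always maps
    to the same placeholder, which keeps cross-row references intact.
    """
    def repl(m: re.Match) -> str:
        crn = m.group(0)
        if crn not in registry:
            registry[crn] = f"@COURSE{len(registry) + 1}"
        return registry[crn]
    return CRN_RE.sub(repl, text)

registry: dict[str, str] = {}
out = anonymize_crns("Enroll in CS_3704_21936_202601 and MATH_2534_10001_202601", registry)
```

The stable registry is the key property: re-running the pass over another file with the same registry reuses existing placeholders rather than minting new ones.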
Pass 2 — PII (PERSON, LOC, FAC, emails, phones): piiranha (CPU-only).
The Docker container at src/docker/Dockerfile.anon-piiranha runs en_core_web_sm + custom regex on content/text/final_answer fields:
# CRITICAL: NO --gpus all (causes OOM with vLLM running)
docker run --rm \
-i --memory=6g --memory-swap=6g \
-v $PWD/data:/data \
pii-anon < data/sft_trajectory_v7_train.jsonl > data/sft_trajectory_v7_train_clean.jsonl
Email and phone regex must run as a third pass (piiranha defaults miss asenger@vt.edu, 540-231-3788):
import re  # email/phone scrub; third pass, applied after piiranha

text = re.sub(r"[a-zA-Z0-9._%+-]+@vt\.edu", "@PROF_EMAIL", text)
text = re.sub(r"\b(540|703|804)[-.\s]\d{3}[-.\s]\d{4}\b", "@PHONE", text)
Audit gate: canvas-data audit --train data/train.jsonl --test data/test.jsonl exits non-zero on any unmasked CRN. CI enforces this.
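The audit gate's core check is a scan for any unmasked CRN. A minimal sketch (the `audit_lines` helper is hypothetical; the real CLI wires this into `sys.exit` with a non-zero status):

```python
import re

CRN_RE = re.compile(r"\b[A-Z]{2,5}_\d{4}_\d+_\d{6}\b")

def audit_lines(lines: list[str]) -> list[int]:
    """Return 1-based line numbers that still contain an unmasked CRN."""
    return [i for i, line in enumerate(lines, 1) if CRN_RE.search(line)]

hits = audit_lines([
    '{"text": "study for @COURSE1"}',          # properly masked
    '{"text": "CS_3704_21936_202601 exam"}',   # leak: would fail CI
])
# A non-empty hit list translates to a non-zero exit code in the real CLI.
```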
3.3 Trajectory SFT data
canvas-data trajectory --sessions data/sessions/*.jsonl --out data/sft_trajectory_v7_train.jsonl
Each row is {"messages": [...]} where assistant turns include native Gemma-4 tool-call delimiters: <|tool_call>call:tool.name{arg:value}<tool_call|>. The format is a custom variant — NOT JSON. See src/finetune/utils/tool_parser.py:_TOOL_CALL_RE for the canonical parser.
181 trajectory rows (post-anon), 46 held out for test.
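A sketch of what parsing the custom delimiter format involves. The regex and the colon-separated (non-JSON) argument handling here are illustrative; the canonical pattern is `_TOOL_CALL_RE` in `src/finetune/utils/tool_parser.py`.

```python
import re

# Illustrative regex for <|tool_call>call:tool.name{arg:value}<tool_call|>;
# the canonical one lives in src/finetune/utils/tool_parser.py.
TOOL_CALL_RE = re.compile(
    r"<\|tool_call>call:(?P<name>[\w.]+)\{(?P<args>.*?)\}<tool_call\|>",
    re.DOTALL,
)

def parse_tool_calls(text: str) -> list[tuple[str, dict[str, str]]]:
    calls = []
    for m in TOOL_CALL_RE.finditer(text):
        # Args are comma-separated key:value pairs — NOT JSON.
        args = dict(
            part.split(":", 1) for part in m.group("args").split(",") if ":" in part
        )
        calls.append((m.group("name"), args))
    return calls

calls = parse_tool_calls("<|tool_call>call:calendar.add{title:Exam,day:Mon}<tool_call|>")
```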
3.4 Preference pair generation + 3-vote labeling
canvas-data generate-pairs --corpus data/merged.jsonl --out data/pairs.jsonl --max-pairs 2500
Pairs are sampled from itertools.combinations, then labeled by canvas-data label, which sends each {prompt, item_a, item_b} through the Gemma-4-31B-IT-NVFP4 teacher at temperature 0, three times. A pair is kept iff all three votes agree (unanimous 3/3); otherwise it is discarded. This yields the v7 corpus: 1,842 of 3,000 candidate pairs labeled, 1,071 of them in data/v7/preference_train.jsonl after dedup + split.
canvas-data label data/preference_train_v7.jsonl \
--out data/preference_train_v7_labeled.jsonl \
--endpoint http://localhost:18080/v1/chat/completions \
--model gemma4 --workers 2 # 2 to avoid vLLM 504 timeout
The teacher is enforced to be Gemma-4 by _validate_teacher() (commit 76eb8bf) — --model gemma4 MUST resolve to nvidia/Gemma-4-31B-IT-NVFP4 via the forge router.
3.5 Split for release
canvas-data split-for-release --out-dir data/v7 --pref data/preference_train_v7_labeled.jsonl --kto data/kto_train_v7.jsonl --sft-train data/sft_trajectory_v7_train_clean.jsonl
Produces:
- data/v7/trajectory_train.jsonl (181 rows) → SFT
- data/v7/preference_train.jsonl (1,071 rows, item-disjoint from test) → DPO family
- data/v7/kto_train.jsonl (146 rows) → KTO
- data/v7/{trajectory,preference,kto}_test.jsonl → held-out eval
4. SFT — supervised fine-tuning
Goal: Teach Gemma-4-E2B the Canvas-agent system prompt, tool-call delimiter format, and the kind of structured plan we want it to emit (Cepeda-spaced study blocks, exam brackets, rescheduling, etc.).
Math: Standard cross-entropy on the assistant turns only. TRL 1.1 enforces this via the assistant_only_loss=True flag in SFTConfig, which requires the chat template to wrap model turns in {% generation %}...{% endgeneration %} markers. Our _patch_gemma4_chat_template() in src/finetune/main.py:140 injects those markers into the upstream Gemma-4 template.
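What "injecting markers" means concretely: the assistant-turn body of the Jinja chat template gets wrapped so TRL can mask the loss to those tokens. The template text below is a toy stand-in, not the real Gemma-4 template; only the `{% generation %}`/`{% endgeneration %}` keywords are the actual TRL convention.

```python
# Toy stand-in for the patching step in _patch_gemma4_chat_template():
# wrap the assistant-turn body so assistant_only_loss can find it.
template = (
    "{% for m in messages %}"
    "{% if m['role'] == 'assistant' %}{{ m['content'] }}{% endif %}"
    "{% endfor %}"
)
patched = template.replace(
    "{% if m['role'] == 'assistant' %}{{ m['content'] }}{% endif %}",
    "{% if m['role'] == 'assistant' %}{% generation %}{{ m['content'] }}"
    "{% endgeneration %}{% endif %}",
)
```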
Hyperparameters (from src/finetune/main.py:_run_sft):
| Field | Value |
|---|---|
| Base model | google/gemma-4-E2B-it |
| Precision | bf16 throughout |
| Epochs | 1 |
| Per-device batch size | 1 |
| Gradient accumulation | 8 (effective batch = 8) |
| Learning rate | 2e-5 |
| Optimizer | paged_adamw_8bit |
| Max seq length | 4096 |
| Output | checkpoints/v7-sft/model.safetensors (~10.2 GB) |
Run:
CANVAS_TRAIN_METHOD=sft docker compose \
-f docker-compose.training.yaml \
-p cs3704-sft \
run --rm --build --entrypoint "" \
train canvas-train --method sft
Wall time: ~10–15 min on GB10. Evaluation: canvas-data audit --pref data/v7/preference_test.jsonl plus visual inspection that 3 sample trajectories round-trip cleanly through tool_parser.
5. DPO — Direct Preference Optimization
5.1 Background — what DPO is
Primary reference: Rafailov, Sharma, Mitchell, Ermon, Manning, Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023 (Outstanding Main Track Runner-Up). arXiv:2305.18290 [cs.LG], 29 May 2023, last revised 13 Dec 2023 (v3). https://arxiv.org/abs/2305.18290
What problem the paper solves
Prior to DPO, aligning an LM with human preferences required a 3-stage RLHF pipeline (Christiano et al. 2017; Ziegler et al. 2019; Stiennon et al. 2020; Ouyang et al. 2022): (1) supervised fine-tuning on demonstrations, (2) train a separate reward model on preference data, (3) PPO-optimize the policy against the reward model with a KL constraint to the SFT model. This is operationally complex (rollouts, KL regularization, instability) and requires loading two extra networks during RL.
DPO observes that under the Bradley-Terry preference model and the standard RLHF objective:
max_π E_x~D, y~π(·|x) [r_φ(x, y)] − β · D_KL[π(·|x) ‖ π_ref(·|x)]
the closed-form optimum is
π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp((1/β) r(x,y)) (Eq. 4)
so any reward function can be re-expressed in terms of its optimal policy and the reference: r(x,y) = β log π(y|x)/π_ref(y|x) + β log Z(x). Substituting this into the Bradley-Terry preference probability P(y_w ≻ y_l | x) = σ(r(x, y_w) − r(x, y_l)), the partition function Z(x) cancels and the preference probability depends only on the policy's log-ratios.
This yields the DPO loss (Eq. 7 in the paper):
L_DPO(π_θ; π_ref) = −E_(x,y_w,y_l)~D [
log σ( β · log π_θ(y_w|x)/π_ref(y_w|x)
− β · log π_θ(y_l|x)/π_ref(y_l|x) )
]
Where:
- x is the prompt
- y_w is the preferred (chosen) response, y_l is the rejected response
- π_θ is the policy being trained (initialized from SFT)
- π_ref is the reference policy (frozen, identical to π_θ at init)
- β ∈ (0, ∞) controls deviation from π_ref. Higher β = stay closer to SFT
- σ is the logistic function
Equivalently, define the implicit reward r̂(x,y) = β · log π_θ(y|x) / π_ref(y|x); the loss is then binary cross-entropy on the margin r̂(x, y_w) − r̂(x, y_l). The gradient of L_DPO is:
∇_θ L_DPO = −β · E_(x,y_w,y_l)~D [
σ( r̂(x, y_l) − r̂(x, y_w) ) · ( ∇_θ log π_θ(y_w|x) − ∇_θ log π_θ(y_l|x) )
]
The first factor σ(...) is high when the policy ranks pairs wrong relative to π_ref, automatically up-weighting hard examples (Section 4 of the paper).
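Eq. 7 for a single pair can be written out in plain Python. This is a minimal sketch: real trainers batch this and the log-probs are per-sequence sums of token log-probs; the numbers below are arbitrary.

```python
import math

def dpo_sigmoid_loss(pi_w: float, pi_l: float,
                     ref_w: float, ref_l: float,
                     beta: float = 0.1) -> float:
    """Eq. 7 for one (chosen, rejected) pair.

    pi_w/pi_l are log pi_theta(y_w|x) / log pi_theta(y_l|x);
    ref_w/ref_l the same under the frozen reference.
    """
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    # -log sigma(margin): binary cross-entropy on the implicit-reward margin
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At init the policy equals the reference, so both log-ratios are 0
# and the loss starts at ln 2 ~ 0.693 regardless of the raw log-probs.
loss0 = dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0)
```

Raising the chosen response's log-prob relative to the reference (or lowering the rejected one) widens the margin and drives the loss below ln 2, which is exactly the training dynamic section 5.5 reports.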
Why it works in practice
The paper's experiments (controlled sentiment generation, summarization on Reddit TL;DR, Anthropic Helpful-Harmless dialogue) show DPO matches or beats PPO-RLHF with no reward model training, no rollout sampling, no value head, and no KL penalty hyperparameter. β subsumes the KL coefficient. Stability: DPO loss is convex-ish (binary cross-entropy on a logit), unlike PPO which has the entropy regularizer + clipping + advantage normalization tricks.
Practical recipe (Section 6.1 of the paper)
- Start from a strong SFT model (their π_SFT).
- Use the SFT model as both the initial policy and the reference. Snapshot π_ref at training start; do not update it.
- Sweep β ∈ {0.01, 0.1, 0.3, 0.5, 1.0}; β = 0.1 was the sweet spot for HH/summarization, β = 0.5 for IMDB sentiment.
- Train for 1 epoch over the preference dataset (matches RLHF rollout budget).
- Learning rate ~1e-6 to 5e-6 with linear warmup; this project uses 5e-6.
- Effective batch size 32–64 in their experiments; we use 8 because our dataset is only 1,071 pairs.
The TRL implementation in this project uses loss_type="sigmoid" (the original DPO loss); related variants (IPO, APO-zero, SPPO, NCA) just swap the loss function while reusing the same machinery.
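The variant mapping from section 5.2's loss_map, reproduced as a plain dict (names from the source; keys are the `canvas-train --method` values, values are TRL `DPOConfig.loss_type` strings):

```python
# canvas-train --method name  ->  TRL DPOConfig loss_type
LOSS_MAP = {
    "dpo": "sigmoid",        # original Rafailov et al. loss (Eq. 7)
    "ipo": "ipo",
    "apo-zero": "apo_zero",
    "sppo": "sppo_hard",
    "nca": "nca_pair",
}
```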
5.1b Cross-audit: implementation vs. arXiv:2305.18290
This project's DPO config (src/finetune/main.py:_run_dpo_family, lines 250–260) was audited against the paper's recommendations on 2026-05-06. Findings:
| Hyperparameter | Paper recommendation | Our value | Verdict |
|---|---|---|---|
| Loss function | sigmoid (Eq. 7) | loss_type="sigmoid" | ✓ matches |
| β (implicit-reward / KL temperature) | 0.1 default for HH/summarization (§6.1, Table 4) | beta=0.1 | ✓ matches |
| Reference policy | SFT model, frozen, identical to π_θ at init (§4) | ref_model=None + precompute_ref_log_probs=True + sync_ref_model=False — TRL snapshots the policy at init and freezes it | ✓ matches semantically |
| Initial policy | SFT model | Loaded from checkpoints/v7-sft/ (E2B SFT'd on trajectory data) | ✓ matches |
| Epochs | 1 (§6.1) | num_train_epochs=1 | ✓ matches |
| Learning rate | 1e-6 to 5e-6 typical | learning_rate=5e-6 | ✓ within range (high end) |
| Effective batch size | 32–64 (paper) | 1 × 8 grad-accum = 8 | ⚠ smaller than paper but appropriate for 1,071 pairs |
| Precision | FP16 in paper | bf16=True (preferred over fp16 on Ampere/Hopper-class hardware) | ✓ equivalent or better |
| Optimizer | AdamW (paper) | paged_adamw_8bit (memory-efficient AdamW) | ✓ same first-order behavior, lower memory |
| Scheduler | Linear warmup | TRL default linear with warmup_ratio (default 0.1) | ✓ matches |
| KL coefficient | None — β subsumes it (§4) | None | ✓ matches |
| Gradient clipping | 1.0 in paper | TRL default 1.0 | ✓ matches |
Notable deviations and justifications:
- Reference model representation. The paper (§4) describes π_ref as a separate network. TRL's `precompute_ref_log_probs=True` + `ref_model=None` is functionally equivalent: the reference log-probs `log π_ref(y|x)` are computed in a single pre-training pass over the dataset using the policy's initial weights, cached, and consumed at training time. This avoids holding two model copies in memory simultaneously (critical on a 122 GiB UMA system already running vLLM). The reference is mathematically frozen because it was computed from a fixed snapshot, matching the paper's specification.
- Effective batch size of 8 vs. 32–64 in the paper. Our dataset is 1,071 pairs (the paper's HH dataset has ~170k). At effective batch 8 we still get ~134 gradient updates over the single epoch, so the difference from the paper is in gradient noise scale, not training budget; a larger batch would shrink the already-small number of updates per epoch.
- `bf16` over `fp16`. bfloat16 has the same dynamic range as fp32 with reduced mantissa precision, avoiding the overflow that fp16 sometimes hits during DPO log-prob computation. Not a semantic deviation.
- `paged_adamw_8bit` (bitsandbytes) instead of plain AdamW. The 8-bit quantization affects optimizer state only, not gradients or weights. Empirically equivalent first-order convergence with ~75% lower optimizer memory. Required to fit the 2.7B E2B + cached ref log-probs + activations in our memory budget.
Conclusion: The implementation faithfully reproduces the DPO loss and training recipe from arXiv:2305.18290 §4–§6.1. No semantic deviations from the paper's specification.
5.1c Why DPO over PPO-RLHF for this project
Per the paper's own ablations (Section 6.2, Figure 3): on Anthropic HH-RLHF, DPO with β=0.1 reaches higher win-rate against the SFT baseline than PPO with a tuned reward model, while requiring no rollout sampling, no separate reward model, no KL coefficient sweeping, and no value head.
For our scale (E2B = 2.7B params, 1,071 preference pairs, single GB10 SoC), this is decisive:
- PPO would need a reward model: another ~2.7B forward+backward pass during training, doubling GPU memory.
- PPO needs rollout sampling: roughly 10× longer wall-clock per epoch.
- PPO requires KL-coefficient tuning: each candidate coefficient needs a separate full training run.
- DPO finishes in ~10 minutes wall-clock; PPO at this scale typically takes 1–2 hours.
References cited by the paper that we also build on:
- Bradley & Terry, Rank analysis of incomplete block designs, Biometrika 1952 — preference model.
- Christiano et al., Deep RL from human preferences, NeurIPS 2017 — original RLHF.
- Stiennon et al., Learning to summarize with human feedback, NeurIPS 2020 — TL;DR dataset.
- Ouyang et al., Training language models to follow instructions with human feedback, NeurIPS 2022 — InstructGPT.
- Ziegler et al., Fine-tuning language models from human preferences, arXiv:1909.08593, 2019 — early RLHF.
5.2 Setup choices for this project
| Choice | Value | Justification |
|---|---|---|
| Reference model | SFT checkpoint (E2B), frozen via `precompute_ref_log_probs=True` + `sync_ref_model=False` | Standard DPO recipe (Rafailov §3). Using the 31B teacher as ref was rejected: a stronger ref destabilizes gradients and consumes ~25 GiB more UMA. |
| β | 0.1 | TRL default; matches DPO paper Table 4 settings for low-data regimes |
| Loss type | `sigmoid` (Rafailov original) | Variants in loss_map: dpo→sigmoid, ipo→ipo, apo-zero→apo_zero, sppo→sppo_hard, nca→nca_pair |
| Padding side | left | DPO loss is on response tokens; left-padding keeps the response contiguous at the end |
| Precompute ref logprobs | True | One-shot pass over the dataset to cache `π_ref(y\|x)` before training |
| Optimizer | `paged_adamw_8bit` | 8-bit Adam halves optimizer memory; critical for fitting policy + cached ref logprobs in 122 GiB UMA |
max_prompt_length and max_length were removed from DPOConfig in TRL ≥0.12; truncation is now handled by the tokenizer. This bit us at run time on 2026-05-06 and was patched in src/finetune/main.py:250.
5.3 Pre-training preparation
DPO trains on data/v7/preference_train.jsonl (1,071 rows) starting from checkpoints/v7-sft/model.safetensors. Each row has fields pair_id, prompt, pair_type, item_a_id, item_b_id, chosen, rejected. The TRL DPOTrainer consumes the standard prompt/chosen/rejected schema directly.
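The schema mapping is a simple projection. A minimal sketch (`to_trl_row` is illustrative, not a project function — the bookkeeping columns are simply not part of what DPOTrainer reads):

```python
def to_trl_row(row: dict) -> dict:
    """Keep only the columns the TRL DPOTrainer consumes; bookkeeping
    fields (pair_id, pair_type, item_a_id, item_b_id) are dropped."""
    return {k: row[k] for k in ("prompt", "chosen", "rejected")}

raw = {
    "pair_id": 7, "pair_type": "deadline",
    "item_a_id": "a41", "item_b_id": "b17",
    "prompt": "Which assignment should I start first?",
    "chosen": "Start the project milestone; it is due sooner.",
    "rejected": "Start the weekly quiz.",
}
trl_row = to_trl_row(raw)
```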
5.4 Hyperparameters
DPOConfig(
output_dir="checkpoints/v7-dpo",
loss_type="sigmoid",
beta=0.1,
num_train_epochs=1,
per_device_train_batch_size=1,
gradient_accumulation_steps=8, # effective batch = 8
    learning_rate=5e-6,               # 4x lower than SFT's 2e-5
bf16=True,
logging_steps=10,
save_strategy="epoch",
sync_ref_model=False, # ref stays frozen at SFT
seed=42,
precompute_ref_log_probs=True, # cache π_ref pass once
optim="paged_adamw_8bit",
)
5.5 Run
CANVAS_TRAIN_METHOD=dpo docker compose \
-f docker-compose.training.yaml \
-p cs3704-dpo \
run --rm --build --entrypoint "" \
train canvas-train --method dpo
Phases observed at run time (E2B, 1071 rows):
| Phase | Duration | What it does |
|---|---|---|
| Image build (cached) | ~10 s | Editable install of project + responses |
| Container start | ~5 s | NGC pytorch:25.11 base spins up |
| Weight load | ~1.5 s | 1951 safetensor shards from checkpoints/v7-sft/ |
| Tokenize dataset | ~2 s | 1071 rows |
| Compute reference log probs | ~4 min 15 s | One forward pass per row at 4 it/s (the bottleneck — single GPU, batch 1) |
| Train (134 effective steps) | ~5–6 min | grad-accum 8 over 1071 examples, bf16 |
| Save checkpoint | ~5 s | checkpoints/v7-dpo/model.safetensors |
Total wall: ~10 min. Loss should drop from ~0.69 (ln 2, the starting value when policy and reference are identical) to ~0.4–0.5.
6. KTO — Kahneman-Tversky Optimization (alternative path)
Reference: Ethayarajh et al. KTO: Model Alignment as Prospect-Theoretic Optimization. ICML 2024. arXiv:2402.01306.
KTO is included for diversity but DPO is the headline result for v3.0. KTO trains on scalar desirability rather than paired preferences ({x, y, label∈{desirable, undesirable}}). Useful when paired preferences are unavailable. We use 146 rows (122 desirable, 24 undesirable) generated by canvas-data gen-kto-large --per-tool 20.
The TRL KTOTrainer uses ref_model=None because KTO's per-example loss is not pairwise — TRL handles the implicit reference internally via running statistics. Hyperparameters mirror DPO except loss_type="kto" and desirable_weight=1.0, undesirable_weight=1.0.
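A rough per-example sketch of the prospect-theoretic shape of the KTO loss: desirable examples are pushed to raise their policy/reference log-ratio above a reference point, undesirable ones to fall below it. This simplifies the paper — `z_ref` stands in for the batch-level reference point TRL estimates from running statistics, and normalization details are omitted.

```python
import math

def sigma(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def kto_value_loss(log_ratio: float, desirable: bool, z_ref: float = 0.0,
                   beta: float = 0.1, w_d: float = 1.0, w_u: float = 1.0) -> float:
    """Simplified per-example KTO loss (sketch, not TRL's exact form).

    log_ratio = log pi_theta(y|x) - log pi_ref(y|x); z_ref is the
    reference point (estimated from batch statistics in TRL).
    """
    if desirable:
        return w_d * (1.0 - sigma(beta * (log_ratio - z_ref)))
    return w_u * (1.0 - sigma(beta * (z_ref - log_ratio)))
```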
7. Evaluation
canvas-data audit --train data/v7/trajectory_train.jsonl --test data/v7/trajectory_test.jsonl runs the static checks (anon coverage, transitivity, item-disjoint train/test). Beyond that, the human evaluation suite in tests/test_realistic_use.py exercises 6 calendar scenarios end-to-end through the agent harness — semester planning, Cepeda spacing, multi-exam scheduling, illness rescheduling, etc. Each scenario asserts that the model emits ≥1 valid tool call sequence and ends with a non-empty final answer.
canvas-train --method dpo writes checkpoints/v7-dpo/trainer_state.json with the loss curve and gradient norms. A successful run shows train_loss trending steadily downward across the 134 steps.
8. Agentic harness — canvas_sdk.CanvasAgent
Lives in the Canvas-Project repo (kleinpanic/CS3704-Canvas-Project, branch main). The flow:
User text
│
▼
CanvasAgent.run()
│ build messages with system prompt + 18 tool schemas
▼
Gemma4Backend.chat() ──HTTP──▶ vLLM @ :18080
│ ◀── raw text including <|tool_call>...<tool_call|>
▼
tool_parser.parse_tool_calls()
│
▼
agent_tools.dispatch(name, args) ──▶ Canvas API / Calendar / Study
│ ◀── result dict
▼
format_tool_result() → inject as <|tool_response> message
│
└──▶ loop until no tool calls (max 8 turns) → final answer
The harness is published in PR #98 (merged) plus the formatting fix in PR #100 (merged). Demo script: scripts/demo_agent.py. Documentation: docs-site/agent-demo.md. The harness imports nothing from the training repo — it only needs httpx and the regex-based tool_parser (ported as-is from src/finetune/utils/tool_parser.py).
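The diagram above reduces to a short loop. A minimal sketch with stand-in callables (all five function parameters are hypothetical stand-ins for the real SDK components named in the diagram; only the loop structure and the 8-turn cap follow the source):

```python
def run_agent(user_text, backend_chat, parse_tool_calls, dispatch,
              format_tool_result, system_prompt, max_turns=8):
    """Sketch of the CanvasAgent loop: chat -> parse -> dispatch -> inject
    tool result -> chat again, until a reply contains no tool call."""
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": user_text}]
    for _ in range(max_turns):
        reply = backend_chat(messages)
        messages.append({"role": "assistant", "content": reply})
        calls = parse_tool_calls(reply)
        if not calls:
            return reply  # no tool call => final answer
        for name, args in calls:
            result = dispatch(name, args)
            messages.append({"role": "tool",
                             "content": format_tool_result(name, result)})
    return messages[-1]["content"]  # turn budget exhausted

# Simulated backend: first reply issues a tool call, second is final.
replies = iter(["<|tool_call>call:calendar.add{title:Exam}<tool_call|>", "Done."])
called = []
out = run_agent(
    "add my exam to the calendar",
    backend_chat=lambda msgs: next(replies),
    parse_tool_calls=lambda t: [("calendar.add", {"title": "Exam"})] if "tool_call" in t else [],
    dispatch=lambda name, args: called.append(name) or {"ok": True},
    format_tool_result=lambda name, res: f"{name} -> {res}",
    system_prompt="You are the Canvas calendar agent.",
)
```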
9. Release artifacts (Phase 18)
GGUF export via canvas-release produces 6 quantizations × 9 methods = 54 GGUF files for v7:
- Quants: Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0
- Methods: sft, dpo, kto, lora, qlora, apo-zero, nca, ipo, sppo

Old formats Q4_0, Q4_K_S, Q5_0, Q5_K_S, F16, and BF16 are deferred to a future milestone (superseded by K-quants or redundant with raw weights).
Two HuggingFace dataset repos + 9 model repos, each with model card + GGUF assets + Zenodo DOI. All gated by release_gate.py which blocks publication until RELEASE-LOG.md has every rigor check ticked.
10. References
- DPO: Rafailov et al., Direct Preference Optimization: Your Language Model is Secretly a Reward Model, NeurIPS 2023. arXiv:2305.18290.
- IPO: Azar et al., A General Theoretical Paradigm to Understand Learning from Human Preferences, AISTATS 2024. arXiv:2310.12036.
- KTO: Ethayarajh et al., KTO: Model Alignment as Prospect-Theoretic Optimization, ICML 2024. arXiv:2402.01306.
- APO-zero: D'Oosterlinck et al., Anchored Preference Optimization, 2024. arXiv:2408.06266.
- SPPO: Wu et al., Self-Play Preference Optimization for Language Model Alignment, 2024. arXiv:2405.00675.
- NCA: Chen et al., Noise Contrastive Alignment, 2024. arXiv:2402.05369.
- TRL: Hugging Face TRL library, `DPOTrainer` / `SFTTrainer` / `KTOTrainer` implementations. https://github.com/huggingface/trl
- Gemma-4: Google DeepMind, Gemma 4 technical report, 2026.
- Cepeda spaced repetition: Cepeda et al., Distributed practice in verbal recall tasks: A review and quantitative synthesis, Psychological Bulletin 2006.
11. Operational gotchas (learned the hard way)
- Forge load takes a model name, not a path. `forge load nvidia/Gemma-4-31B-IT-NVFP4` works; `forge load /srv/.../GemmaSuper` does not. See `.planning/research/FORGE-CLI-REFERENCE.md`.
- `forge ps` is the only truth. `curl :18080/health` only proves the proxy is up — vLLM may have crashed.
- Piiranha is CPU-only. `docker run --memory=6g` (NO `--gpus all`) — running piiranha on GPU alongside vLLM caused two OOM events on 2026-05-05.
- GPU util 0.60 lock. Both `slot0.env` files set `GPU_MEMORY_UTIL=0.60` to leave headroom for concurrent training.
- Docker `CANVAS_TRAIN_METHOD` shell expansion is broken with `ENTRYPOINT ["canvas-train"]`. Use `--entrypoint ""` and pass `canvas-train --method X` as command args.
- TRL `max_prompt_length` was removed in 0.12+. `DPOConfig` no longer accepts it.
- DPO labeling: 2 workers max with vLLM under contention. 4 workers + concurrent `gen-kto-large` + bench gives 504 Gateway Timeouts.
- 3-vote unanimity is strict. A single vote failure (`None`) → discard. This is intentional; the alternative (2/3 majority) admits noisier labels and a degraded DPO outcome.