How the Model Is Being Trained
v7-DPO is a fine-tuned Gemma-4-E2B-IT targeting 18-tool function calling across Canvas LMS, calendar scheduling, and study planning. This page documents the training methodology in two states: what's shipped today (the v7-DPO checkpoint on HF Hub) and what's planned for v7-2 (the rebuild that addresses the v7-broken retrospective). Sections marked shipped describe deployed artifacts; sections marked planned describe the locked design that has not yet executed. [source: MILESTONE-v3.0-AUDIT.md, ROADMAP.md]
Current state vs planned
Shipped:
- v7-DPO checkpoint on HF Hub (kleinpanic93/canvas-calendar-agent-v7-dpo) — the only v7 model artifact live
- Canvas SDK + Chrome extension (this repo)
- HF Space demo (mock tools, no Canvas creds)

Planned (v7-2):
- Phase 1 — SFT trajectory rebuild (per-tool quotas, item-disjoint, PII scrub) — NOT STARTED
- Phase 2 — DPO 16-axis preference dataset (NO judge model) — NOT STARTED
- Phase 3 — KTO binary-feedback dataset — NOT STARTED
- Phase 4 — Benchmark expansion to 1000+ items — NOT STARTED
- Phase 5 — Training matrix (sequential, β = 0.3, eval_dataset) — NOT STARTED
- Phase 6 — Release — NOT STARTED

The 9-method matrix shipped its infrastructure, but the model was empirically a no-op for tool calling: v7-dpo predictions were byte-identical to v7-sft on 23/25 bench prompts.
Reference: MILESTONE-v3.0-AUDIT.md
End-to-end training pipeline
Raw Data → Curation → Guardrails → Fine-Tuning → HF Space
Figure annotations: 16-axis (preference pairs), β = 0.3 (planned), 1000+ (benchmark items, planned).
The Base Model
The base model uses Google's MatFormer architecture — a nested transformer where the same weights serve multiple effective parameter counts. At the E2B (2.7B) extraction point the model is multimodal (text + vision tokens) and instruction-tuned. Fine-tuning therefore inherits the instruction-following alignment without catastrophic forgetting of general capabilities.
🤗 google/gemma-4-e2b-it on HuggingFace
Supervised Fine-Tuning (SFT) — phase 1 — not started
Phase 1 — teaches the model what a correct tool call looks like
SFT will train on multi-turn tool-call trajectories (post-anonymization, item-disjoint train/test splits). Each trajectory is a full user-turn → model-turn sequence where the model outputs one or more native Gemma-4 tool calls followed by a synthesized answer. The model learns correct syntax, correct tool selection, and correct argument structure from direct imitation. [source: ROADMAP.md L39]
Native Gemma-4 tool-call format
<|tool_call>call:canvas.get_assignments{course_id: "CS3704", days_ahead: 7}<tool_call|>
The <|tool_call> / <tool_call|> sentinel tokens are part of Gemma-4's native vocabulary. The parser mirrors the Python SDK's tool_parser.py exactly — same regex, same round-trip guarantee.
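A minimal sketch of an envelope parser matching the shape shown above — the authoritative implementation is the SDK's tool_parser.py; the regex and function name here are illustrative:

```python
import re

# Matches one native Gemma-4 tool-call envelope and captures the tool
# name plus the raw argument body. Illustrative pattern; the SDK's
# tool_parser.py is the source of truth for the exact regex.
TOOL_CALL_RE = re.compile(
    r"<\|tool_call>call:(?P<name>[\w.]+)\{(?P<args>.*?)\}<tool_call\|>",
    re.DOTALL,
)

def extract_tool_calls(text: str) -> list[tuple[str, str]]:
    """Return (tool_name, raw_args) for every envelope in a model turn."""
    return [(m["name"], m["args"]) for m in TOOL_CALL_RE.finditer(text)]

calls = extract_tool_calls(
    '<|tool_call>call:canvas.get_assignments'
    '{course_id: "CS3704", days_ahead: 7}<tool_call|>'
)
assert calls == [("canvas.get_assignments", 'course_id: "CS3704", days_ahead: 7')]
```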
CV ≤ 0.3 prevents the model from over-indexing on high-frequency tools (e.g. canvas.get_assignments) at the cost of rare ones (e.g. reranker.priority_hint). Enforced by guardrail G4 (see the sketch below).
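A minimal sketch of the quota check, assuming a flat list of tool names extracted from the dataset — the function name and CLI wiring are illustrative, not the repo's actual code:

```python
from collections import Counter
import statistics
import sys

def check_tool_balance(tool_calls: list[str], registry: set[str],
                       min_count: int = 30, max_cv: float = 0.3) -> None:
    """G4-style balance check (illustrative): every registered tool must
    appear >= min_count times, and the coefficient of variation
    (stdev / mean) of per-tool counts must be <= max_cv."""
    counts = Counter(tool_calls)
    per_tool = [counts.get(tool, 0) for tool in sorted(registry)]
    cv = statistics.pstdev(per_tool) / statistics.mean(per_tool)
    if min(per_tool) < min_count or cv > max_cv:
        print(f"G4 FAIL: min per-tool count {min(per_tool)}, CV {cv:.2f}")
        sys.exit(1)  # fail-stop: no training job starts
```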
Note on the prior “181 trajectories” figure: that was v7-broken's post-shrinkage count (952 trajectory turns → 181 used after pipeline filtering) — exactly the bug phase 1 fixes. The new phase 1 enforces a shrinkage budget via guardrail G6. [source: ROADMAP.md L39, PRE-EXECUTION-GUARDRAILS.md G6]
Direct Preference Optimization (DPO) — phase 2 — not started
Phase 2 — teaches the model what a better tool call looks like
DPO (Rafailov et al. 2023, arXiv:2305.18290) replaces the RLHF reward model with a closed-form loss derived directly from the Bradley-Terry preference model. Given a preference pair (chosen, rejected), DPO increases the log-probability of the chosen completion relative to the rejected one, weighted by how much the current policy has already diverged from the reference.
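The closed-form loss from the paper, with π_θ the policy, π_ref the frozen reference, σ the logistic function, and (y_w, y_l) a chosen/rejected pair:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```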
Reference model role & β selection
The reference model is the SFT checkpoint — it acts as a KL-divergence anchor, not a teacher. Phase 5 locks β = 0.3 (not the TRL default 0.1): for small-N preference datasets (N ≈ 1000–4000 pairs) the higher β acts as a stronger regularizer, preventing the policy from destroying its SFT capabilities to aggressively optimize a small preference set. v7-broken used β = 0.1; the v7-2 rewrite raises it to 0.3 along with precompute_ref_log_probs=True. [source: DPO-RESEARCH-SYNTHESIS.md L58, L90]
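A minimal sketch of the locked phase-5 DPO settings in TRL — β = 0.3 and precompute_ref_log_probs are from the design; every other value is an illustrative placeholder:

```python
from trl import DPOConfig, DPOTrainer

# Locked by the v7-2 design: beta=0.3 and precomputed reference
# log-probs. All other hyperparameters below are placeholders.
config = DPOConfig(
    output_dir="out/v7-2-dpo",        # placeholder path
    beta=0.3,                          # KL regularizer (was 0.1 in v7-broken)
    precompute_ref_log_probs=True,     # cache the SFT reference's log-probs up front
)

trainer = DPOTrainer(
    model=model,                # policy, initialized from the SFT checkpoint
    ref_model=ref_model,        # the SFT checkpoint itself (KL anchor, not a teacher)
    args=config,
    train_dataset=train_ds,     # (prompt, chosen, rejected) pairs
    eval_dataset=eval_ds,       # required: an empty eval set is a guardrail violation
    processing_class=tokenizer, # named `tokenizer=` in older TRL versions
)
trainer.train()
```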
Preference pairs (16-axis programmatic)
8 binary axes × 2 (right/wrong) = 16 pair categories. Each pair shows the model what “good” looks like vs a SPECIFIC failure mode. Generation is programmatic (mutation-based) — NO judge model. Chosen / rejected are true by construction. [source: DATASET-LIFECYCLE.md L79-L127]
| # | Axis | Right (chosen) | Wrong (rejected) | Train |
|---|---|---|---|---|
| 1 | Tool selection | correct tool name from SDK registry | hallucinated tool name (e.g. canvas.list_assignments) | 200 |
| 2 | Argument keys | required args present, correct names | missing required args, or invented keys | 200 |
| 3 | Argument values | sane types/ranges (e.g. course_id: 4264) | wrong types or unparseable (e.g. course_id: "all") | 200 |
| 4 | Parsimony / cardinality | 1 minimal call resolves the user’s question | 4+ chained redundant calls (the v7-broken over-call pattern) | 600 |
| 5 | Timing / ordering | calls in correct dependency order (e.g. find_free_blocks before create_event) | wrong order (e.g. create_event before find_free_blocks) | 200 |
| 6 | Format / envelope | native Gemma-4 <|tool_call>...<tool_call|> format | natural-language description (“I will call canvas...”) or malformed envelope | 200 |
| 7 | Domain fit | tool matches user’s actual question (calendar query → calendar tool) | tool family mismatch (calendar query → canvas tool) | 200 |
| 8 | Termination | model stops after sufficient info gathered + composes final answer | model keeps calling tools indefinitely (the “random shit in random quick succession” failure) | 600 |
| + | Mixed-axis | chosen/rejected differ on more than one axis | how the SFT model actually fails in practice | ~1200 (800–1600 range) |
| | Total (target) | | | 3200–4000 train / 400–500 test |
Why 16 axes (not 4)? v7-broken's harness reduced over-calling from 16 calls per response to 2–3 via runtime caps, but the underlying model still didn't UNDERSTAND why fewer is better. With 16 axes, every failure mode the model exhibits in production has at least 200 training pairs explicitly teaching against it. No judge model — chosen and rejected are true by construction (mutation-based generation from SFT trajectories that pass G1/G4/G7/G8). [source: DATASET-LIFECYCLE.md#expanded-16-category-pair-taxonomy]
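A minimal sketch of mutation-based pair generation for axis 1 (tool selection), assuming a passing SFT model turn as input — the helper names and substitution table are illustrative, not the repo's actual API:

```python
import re

# Real -> hallucinated tool-name substitutions (illustrative examples;
# a real generator would derive plausible fakes from the SDK registry).
HALLUCINATED = {
    "canvas.get_assignments": "canvas.list_assignments",
    "calendar.find_free_blocks": "calendar.get_free_blocks",
}

def make_axis1_pair(prompt: str, chosen_turn: str) -> dict | None:
    """Mutate a correct model turn into a rejected one by swapping the
    real tool name for a hallucinated variant. The pair is labeled
    correctly by construction — no judge model involved."""
    m = re.search(r"call:([\w.]+)", chosen_turn)
    if m is None or m.group(1) not in HALLUCINATED:
        return None
    rejected = chosen_turn.replace(m.group(1), HALLUCINATED[m.group(1)], 1)
    return {"prompt": prompt, "chosen": chosen_turn,
            "rejected": rejected, "axis": 1}
```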
9-Method Training Matrix — phase 5 — not started
A comparison across 9 training objectives — not an ablation: each method family eats a different dataset schema (see table below). Phase 5 runs them sequentially with per-method gates. [source: ROADMAP.md L47-L61, MILESTONE-v3.0-AUDIT.md L54-L66]
| Method family | Eats which dataset | Schema |
|---|---|---|
| SFT, LoRA, QLoRA | Phase 1 (SFT trajectory) | (messages) conversational |
| DPO, IPO, APO-Zero, SPPO, NCA | Phase 2 (preference pairs, 16-axis) | (prompt, chosen, rejected, tools) TRL conversational |
| KTO | Phase 3 (binary feedback) | (prompt, completion, label) unpaired |
| All 9 evaluated against | Phase 4 (benchmark) | (prompt, expected_tools[], expected_args[]) |
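Illustrative single records for the three schemas — field contents are made up; only the shapes come from the table above:

```python
# Phase 1 — conversational (messages): SFT / LoRA / QLoRA
sft_row = {"messages": [
    {"role": "user", "content": "What's due in CS3704 this week?"},
    {"role": "assistant", "content":
        '<|tool_call>call:canvas.get_assignments'
        '{course_id: "CS3704", days_ahead: 7}<tool_call|>'},
]}

# Phase 2 — preference pair: DPO / IPO / APO-Zero / SPPO / NCA
dpo_row = {
    "prompt": "What's due in CS3704 this week?",
    "chosen": '<|tool_call>call:canvas.get_assignments{...}<tool_call|>',
    "rejected": '<|tool_call>call:canvas.list_assignments{...}<tool_call|>',
    "tools": ["canvas.get_assignments"],
}

# Phase 3 — unpaired binary feedback: KTO
kto_row = {
    "prompt": "What's due in CS3704 this week?",
    "completion": '<|tool_call>call:canvas.get_assignments{...}<tool_call|>',
    "label": True,  # desirable
}
```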
| Method | Dataset format | Notes |
|---|---|---|
| SFT baseline | Phase 1 trajectory data (messages) | Direct imitation of correct tool-call sequences. Reference checkpoint for all DPO variants. |
| LoRA | Phase 1 trajectory data | PEFT wrapper over SFT. Low-rank adapter only; base weights frozen. Faster iteration. |
| QLoRA | Phase 1 trajectory data | LoRA + 4-bit NF4 quantization. Reduces VRAM by ~60%. Quality delta vs LoRA is the comparison question. |
| DPO | Phase 2 preference pairs (prompt, chosen, rejected, tools) | Rafailov 2023. β = 0.3 (small-N regularizer; was 0.1 in v7-broken). Reference = SFT checkpoint. Primary production method. |
| IPO | Phase 2 preference pairs | Identity Preference Optimization (Azar et al.). Removes the Bradley-Terry assumption; theoretically better on near-equal pairs. |
| APO-Zero | Phase 2 preference pairs | Anchored Preference Optimization with zero anchor. Stabilizes training when the reference model is weak. |
| SPPO | Phase 2 preference pairs | Self-Play Preference Optimization. Iterative; each round re-ranks pairs using the current policy. |
| NCA | Phase 2 preference pairs | Noise-Contrastive Alignment. Treats rejected as noise; contrastive loss formulation. |
| KTO | Phase 3 unpaired binary (prompt, completion, label) | Kahneman-Tversky Optimization (Ethayarajh et al.). Works without paired comparisons. Phase 3 enforces N ≥ 1000 (G12) to avoid v7-broken's grad_norm = 290 instability at N = 146. |
Pre-execution Guardrails (G1–G13)
Automated fail-stop CLI checks that run before every training job
Each guardrail encodes a SPECIFIC failure mode discovered in v7-broken or in cross-AI audits. Every check is deterministic and exits non-zero on violation — no training job starts if any guardrail fails. canvas-train --check runs all 13 in sequence. Each card lists WHAT the gate checks and WHY it exists (the v7-broken motivation). [source: PRE-EXECUTION-GUARDRAILS.md]
G1 — What: ≥ 80% of chosen AND rejected strings must contain <|tool_call>.
Why: v7-broken had 0/1071 rows containing tool calls — DPO trained on the wrong objective entirely (urgency-prioritization rationales instead of tool calls).
G2 — What: the DPO/KTO/IPO/etc. trainer must receive a non-empty eval_dataset, item-disjoint from train.
Why: v7-broken passed only train_dataset; reported “eval” metrics were train-set artifacts — silent overfitting invisible from the loss curves.
G3 — What: the bench harness against base Gemma-4 must score ≥ 20% before testing fine-tunes.
Why: v7-broken's bench.py was missing the system prompt and used skip_special_tokens=True, which stripped the tool envelope. Base scored 0% — a harness artifact, not a model failure.
G4 — What: every tool appears ≥ 30 times AND the coefficient of variation across the 18 tools is ≤ 0.3.
Why: v7-broken: calendar.create_event = 218 vs canvas.list_announcements = 8; CV = 1.27. The model couldn't learn rare tools.
G5 — What: 1-step dry run: grad_norm < 1.0, no NaN / Inf in activations or logits.
Why: v7-broken KTO: grad_norm = 290, logits at e+08 scale. Tightened from 5.0 → 1.0 in the Gemini cross-AI audit.
G6 — What: each pipeline stage retains ≥ 70% of its input rows, else an explicit --allow-shrinkage <reason> is required.
Why: v7-broken: 3000 raw → 1849 labeled (39% loss) → 1071 final (42% loss); 952 trajectory turns → 181 used. No audit, no warning.
G7 — What: every tool name in the dataset must exist in the SDK registry.
Why: v7-broken: the model emitted the hallucinated canvas.list_assignments (real name: canvas.get_assignments) — the dataset itself contained the wrong name.
G8 — What: every row round-trips through _tool_call_to_gemma4 → tool_parser.parse_gemma4_output with 100% fidelity.
Why: the v7-broken serializer used Python str(v); the parser expects JSON. 29 trajectory + 30 KTO rows failed round-trip silently.
G9 — What: tests construct the trainer, run a mini-train, and assert reward margins move in the correct direction per pair type.
Why: v7-broken's test_training_config.py froze source-text strings — it would pass while DPO trained on the wrong objective forever.
G10 — What: docker build + docker run canvas-train --help exits 0 in CI.
Why: v7-broken had ENTRYPOINT ["canvas-train"] + command: sh -c "..." — the two collided and argparse rejected at runtime.
G11 — What: < 1% of rows exceed max_length = 1024 after tokenization.
Why: Gemini cross-AI audit WARN: TRL silently truncates — multi-call trajectories with tool results easily exceed 1024 tokens.
G12 — What: KTO ≥ 1000 rows, DPO ≥ 1000 rows, etc. The job aborts before training if N is below threshold.
Why: Gemini cross-AI audit WARN: KTO blew up partly because N = 146 was too small — numerical instability is near-certain at that scale.
G13 — What: no single tool appears more than 3× in any one trajectory's assistant turns.
Why: v7-broken AUDIT: ~25% of multi-call trajectories had redundant calls (canvas.get_todo 7× in one response). Excessive repetition trains the model to loop instead of terminating.
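A minimal sketch of one gate's fail-stop shape, using G1 as the example — the real checks live behind canvas-train --check; the function below is illustrative:

```python
import sys

SENTINEL = "<|tool_call>"

def check_g1(rows: list[dict], threshold: float = 0.80) -> None:
    """G1: at least 80% of chosen AND rejected strings must contain the
    native tool-call sentinel; exit non-zero otherwise (fail-stop)."""
    for side in ("chosen", "rejected"):
        hits = sum(SENTINEL in row[side] for row in rows)
        frac = hits / len(rows)
        if frac < threshold:
            print(f"G1 FAIL: only {frac:.1%} of '{side}' rows contain {SENTINEL}")
            sys.exit(1)
    print("G1 PASS")
```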
Bench Harness — phase 4 — not started
Offline evaluation protocol used to select the production checkpoint
Design targets — phase 4 (BIER-style benchmark expansion) is NOT STARTED. The numbers below are targets, not measurements. [source: ROADMAP.md L45]
Metrics
- Tool selection accuracy — did the model call the correct tool as its first action? Measured per tool, per method.
- Argument accuracy — ToolBench-style partial match on required arguments (exact for IDs, fuzzy for natural-language fields).
- Wilson 95% CI — each per-tool accuracy estimate comes with a Wilson score confidence interval (see the sketch after this list). Small per-tool samples (< 40) are flagged as low-confidence.
- Parsimony rate — fraction of correct responses that use the minimum-cardinality tool set. DPO should improve this vs the SFT baseline.
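A minimal sketch of the Wilson score interval for k correct calls out of n attempts on one tool (standard formula; the < 40 low-confidence flag comes from the list above):

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for k successes in n trials."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

low, high = wilson_ci(k=24, n=30)   # e.g. 80% accuracy on a 30-sample tool
low_confidence = 30 < 40            # flagged: fewer than 40 samples
```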
Checkpoint selection
The production checkpoint is the epoch that maximizes macro-average tool selection accuracy on the eval split, subject to a constraint that no individual tool falls below 40% (tool-floor constraint). This prevents selecting a checkpoint that is strong on common tools but has regressed on rare ones.
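A minimal sketch of the selection rule, assuming per-epoch, per-tool accuracies are already computed — the data shapes here are illustrative:

```python
def select_checkpoint(per_epoch: dict[int, dict[str, float]],
                      tool_floor: float = 0.40) -> int | None:
    """Pick the epoch with the best macro-average tool-selection accuracy,
    skipping any epoch where some individual tool falls below the floor."""
    best_epoch, best_macro = None, -1.0
    for epoch, per_tool in per_epoch.items():
        if min(per_tool.values()) < tool_floor:
            continue  # tool-floor constraint: a rare tool regressed
        macro = sum(per_tool.values()) / len(per_tool)
        if macro > best_macro:
            best_epoch, best_macro = epoch, macro
    return best_epoch
```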
PII Handling
How student-data PII is scrubbed before training
PII is scrubbed via Piiranha, a HuggingFace-hosted PII detection model run inside an isolated worker container before the train / test split. Anonymization is enforced as a phase-1 acceptance gate — a trajectory cannot enter the dataset if Piiranha flags any token in it as personal information.
Implementation
- Model: iiiorg/piiranha-v1-detect-personal-information
- Worker: src/dataset/anon_worker_piiranha.py
- Container: src/docker/Dockerfile.anon-piiranha
- Source repo: kleinpanic/CS3704-DPO-SSOT (training pipeline)
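A minimal sketch of the acceptance gate's shape, using the transformers token-classification pipeline — the actual worker is src/dataset/anon_worker_piiranha.py; this is an illustration, not that code:

```python
from transformers import pipeline

# Token-classification pipeline over the Piiranha detector.
detector = pipeline(
    "token-classification",
    model="iiiorg/piiranha-v1-detect-personal-information",
    aggregation_strategy="simple",
)

def passes_pii_gate(trajectory_text: str) -> bool:
    """Phase-1 acceptance gate: reject the trajectory if Piiranha flags
    ANY span as personal information."""
    return len(detector(trajectory_text)) == 0
```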
The HF Space demo never touches real student data — it ships with mocked Canvas tools and no Canvas credentials. PII handling is only relevant to the training pipeline. [source: anon_worker_piiranha.py, Dockerfile.anon-piiranha]