Deployed: v7-DPO  •  HF Model

How the Model Is Being Trained

v7-DPO is a fine-tuned Gemma-4-E2B-IT targeting 18-tool function calling across Canvas LMS, calendar scheduling, and study planning. This page documents the training methodology in two states: what's shipped today (the v7-DPO checkpoint on HF Hub) and what's planned for v7-2 (the rebuild that addresses the v7-broken retrospective). Sections marked shipped describe deployed artifacts; sections marked planned describe the locked design that has not yet executed. [source: MILESTONE-v3.0-AUDIT.md, ROADMAP.md]

Current state vs planned

shipped today
  • v7-DPO checkpoint on HF Hub (kleinpanic93/canvas-calendar-agent-v7-dpo) — only v7 model artifact live
  • Canvas SDK + Chrome extension (this repo)
  • HF Space demo (mock tools, no Canvas creds)
in flight (v7-2)
  • Phase 1 — SFT trajectory rebuild (per-tool quotas, item-disjoint, PII scrub) — NOT STARTED
  • Phase 2 — DPO 16-axis preference dataset (NO judge model) — NOT STARTED
  • Phase 3 — KTO binary-feedback dataset — NOT STARTED
  • Phase 4 — Benchmark expansion to 1000+ prompts — NOT STARTED
  • Phase 5 — Training matrix (sequential, β = 0.3, eval_dataset) — NOT STARTED
  • Phase 6 — Release — NOT STARTED
archived (v7-broken)
  • 9-method matrix shipped infra, but the model was empirically a no-op for tool calling: v7-dpo predictions were 23/25 byte-identical to v7-sft on the bench.
  • Reference: MILESTONE-v3.0-AUDIT.md
[source: ROADMAP.md L32-L45]

End-to-end training pipeline

  01  Canvas LMS raw data
  02  Trajectory curation
  03  SFT (supervised fine-tuning), gated by the G1–G13 guardrails
  04  DPO training on 16-axis preference pairs, β = 0.3 (planned)
  05  Bench eval, 1000+ prompts (planned)
  v7  Deploy to HF Space

The Base Model

Gemma-4-E2B-IT base

Google's MatFormer architecture — a nested transformer where the same weights serve multiple effective parameter counts. At the E2B (2.7B) extraction point the model is multimodal (text + vision tokens) and instruction-tuned. Fine-tuning inherits the instruction-following alignment without catastrophic forgetting of general capabilities.

🤗 google/gemma-4-e2b-it on HuggingFace
  • 2.7B params
  • 18 tools
  • MatFormer architecture
  • multimodal (text + vision)

Supervised Fine-Tuning (SFT) phase 1 — not started

Phase 1 — teaches the model what a correct tool call looks like

SFT will train on multi-turn tool-call trajectories (post-anonymization, item-disjoint train/test splits). Each trajectory is a full user-turn → model-turn sequence where the model outputs one or more native Gemma-4 tool calls followed by a synthesized answer. The model learns correct syntax, correct tool selection, and correct argument structure from direct imitation. [source: ROADMAP.md L39]

Native Gemma-4 tool-call format

<|tool_call>call:canvas.get_assignments{course_id: "CS3704", days_ahead: 7}<tool_call|>

The <|tool_call> / <tool_call|> sentinel tokens are part of Gemma-4's native vocabulary. The parser mirrors the Python SDK's tool_parser.py exactly — same regex, same round-trip guarantee.
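Illustratively, the envelope can be recognized with a small regex. This is a minimal sketch only — the actual regex in tool_parser.py is not shown in this doc, so the pattern below is an assumption based on the example above:

```python
import re

# Hypothetical pattern for the <|tool_call>call:name{args}<tool_call|> envelope;
# the real tool_parser.py may differ.
TOOL_CALL_RE = re.compile(
    r"<\|tool_call>call:(?P<name>[\w.]+)\{(?P<args>.*?)\}<tool_call\|>",
    re.DOTALL,
)

def parse_tool_calls(text):
    """Return (tool_name, raw_args) for every envelope found in `text`."""
    return [(m.group("name"), m.group("args")) for m in TOOL_CALL_RE.finditer(text)]

out = '<|tool_call>call:canvas.get_assignments{course_id: "CS3704", days_ahead: 7}<tool_call|>'
calls = parse_tool_calls(out)
# calls == [("canvas.get_assignments", 'course_id: "CS3704", days_ahead: 7')]
```

A round-trip guarantee then means serializing `(name, args)` back through the same template reproduces the input byte-for-byte (the property gate G8 checks).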

Dataset shape (target)
  • Trajectories (train): 400–500 (TBD post-phase-1)
  • Trajectories (test): 80–100 (TBD post-phase-1)
  • Tools covered: 18 (all)
  • Min per tool: ≥ 30
  • Distribution CV: ≤ 0.3
Distribution constraint

CV ≤ 0.3 prevents the model from over-indexing on high-frequency tools (e.g. canvas.get_assignments) at the cost of rare ones (e.g. reranker.priority_hint). Enforced by guardrail G4.
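The constraint is cheap to check. A minimal sketch (the real guardrail CLI is not shown here; `distribution_ok` is an illustrative name):

```python
import statistics

def distribution_ok(tool_counts, min_per_tool=30, max_cv=0.3):
    """G4-style check: every tool appears >= min_per_tool times AND the
    coefficient of variation (population stdev / mean) across tools <= max_cv."""
    counts = list(tool_counts.values())
    mean = statistics.mean(counts)
    cv = statistics.pstdev(counts) / mean
    return min(counts) >= min_per_tool and cv <= max_cv
```

With v7-broken's skew (calendar.create_event = 218 vs canvas.list_announcements = 8) this fails on both conditions; a near-uniform distribution passes.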

Note on the prior “181 trajectories” figure: that was v7-broken's post-shrinkage count (952 trajectory turns → 181 used after pipeline filtering) — exactly the bug phase 1 is fixing. New phase 1 enforces shrinkage budget via guardrail G6. [source: ROADMAP.md L39, PRE-EXECUTION-GUARDRAILS.md G6]

Direct Preference Optimization (DPO) phase 2 — not started

Phase 2 — teaches the model what a better tool call looks like

DPO (Rafailov et al. 2023, arXiv:2305.18290) replaces the RLHF reward model with a closed-form loss derived directly from the Bradley-Terry preference model. Given a preference pair (chosen, rejected), DPO increases the log-probability of the chosen completion relative to the rejected one, weighted by how much the current policy has already diverged from the reference.
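A minimal sketch of the per-pair objective on summed completion log-probabilities (pure Python; β defaults to the project's planned 0.3, discussed below — this is the published DPO loss, not the project's training code):

```python
import math

def dpo_pair_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.3):
    """Per-pair DPO loss from summed log-probs of each completion.

    margin > 0 means the policy has shifted toward the chosen completion
    relative to the frozen reference; loss = -log(sigmoid(beta * margin-terms)).
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return loss, margin
```

When the policy equals the reference the margin is 0 and the loss is log 2; raising the chosen log-prob and lowering the rejected one drives the loss toward 0.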

Reference model role & β selection

The reference model is the SFT checkpoint — it acts as a KL-divergence anchor, not a teacher. Phase 5 locks β = 0.3 (not the TRL default 0.1): for small-N preference datasets (N ≈ 1000–4000 pairs) the higher β acts as a stronger regularizer, preventing the policy from destroying its SFT capabilities to aggressively optimize a small preference set. v7-broken used β = 0.1; the v7-2 rewrite raises it to 0.3 along with precompute_ref_log_probs=True. [source: DPO-RESEARCH-SYNTHESIS.md L58, L90]
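As a configuration sketch — argument names follow TRL's DPOConfig; the project's actual training scripts are not shown in this doc, so everything except beta and precompute_ref_log_probs is illustrative:

```python
from trl import DPOConfig

# Hypothetical phase-5 settings; only beta=0.3 and precompute_ref_log_probs=True
# are stated in this doc.
args = DPOConfig(
    output_dir="v7-2-dpo",
    beta=0.3,                       # stronger KL regularizer for small-N pairs
    precompute_ref_log_probs=True,  # cache reference log-probs up front
)
# trainer = DPOTrainer(model=model, ref_model=sft_checkpoint, args=args,
#                      train_dataset=train_pairs, eval_dataset=test_pairs)  # G2: eval required
```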

Preference pairs (16-axis programmatic)

8 binary axes × 2 (right/wrong) = 16 pair categories. Each pair shows the model what “good” looks like vs a SPECIFIC failure mode. Generation is programmatic (mutation-based), with NO judge model. Chosen / rejected are true by construction. [source: DATASET-LIFECYCLE.md L79-L127]

  1. Tool selection (200 train pairs). Chosen: correct tool name from SDK registry. Rejected: hallucinated tool name (e.g. canvas.list_assignments).
  2. Argument keys (200). Chosen: required args present, correct names. Rejected: missing required args, or invented keys.
  3. Argument values (200). Chosen: sane types/ranges (e.g. course_id: 4264). Rejected: wrong types or unparseable (e.g. course_id: "all").
  4. Parsimony / cardinality (600). Chosen: 1 minimal call resolves the user's question. Rejected: 4+ chained redundant calls (the v7-broken over-call pattern).
  5. Timing / ordering (200). Chosen: calls in correct dependency order (e.g. find_free_blocks before create_event). Rejected: wrong order (e.g. create_event before find_free_blocks).
  6. Format / envelope (200). Chosen: native Gemma-4 <|tool_call>...<tool_call|> format. Rejected: natural-language description (“I will call canvas...”) or malformed envelope.
  7. Domain fit (200). Chosen: tool matches the user's actual question (calendar query → calendar tool). Rejected: tool family mismatch (calendar query → canvas tool).
  8. Termination (600). Chosen: model stops after sufficient info is gathered and composes a final answer. Rejected: model keeps calling tools indefinitely (the “random shit in random quick succession” failure).
  Plus ~1200 mixed-axis pairs (800–1600; chosen/rejected differ on more than one axis, which mirrors how the SFT model fails in practice).
  Total (target): 3200–4000 train / 400–500 test

Why 16 axes (not 4)? v7-broken's harness reduced over-calling 16→2–3 via runtime caps, but the underlying model still didn't UNDERSTAND why fewer is better. With 16 axes, every failure mode the model exhibits in production has at least 200 training pairs explicitly teaching against it. No judge model — chosen and rejected are true by construction (mutation-based generation from SFT trajectories that pass G1/G4/G7/G8). [source: DATASET-LIFECYCLE.md#expanded-16-category-pair-taxonomy]
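For example, an axis-1 (tool selection) pair can be generated by mutating a valid trajectory's tool name. A sketch under the assumption that a per-axis mutation table is maintained (the real generator lives in the dataset pipeline and is not shown here):

```python
# Hypothetical mutation table; canvas.list_assignments is the actual
# hallucination v7-broken produced for canvas.get_assignments.
TOOL_NAME_MUTATIONS = {
    "canvas.get_assignments": "canvas.list_assignments",
}

def make_axis1_pair(prompt, chosen, tool_name):
    """Rejected = chosen with the tool name swapped for a hallucinated one,
    so the preference label is true by construction (no judge model)."""
    wrong_name = TOOL_NAME_MUTATIONS[tool_name]
    return {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": chosen.replace(tool_name, wrong_name),
    }
```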

9-Method Training Matrix phase 5 — not started

A comparison across 9 training objectives — not an ablation: each method family eats a different dataset schema (see table below). Phase 5 runs them sequentially with per-method gates. [source: ROADMAP.md L47-L61, MILESTONE-v3.0-AUDIT.md L54-L66]

Method family → dataset schema (per-family, NOT shared)
  • SFT, LoRA, QLoRA → Phase 1 (SFT trajectory): (messages), conversational
  • DPO, IPO, APO-Zero, SPPO, NCA → Phase 2 (preference pairs, 16-axis): (prompt, chosen, rejected, tools), TRL conversational
  • KTO → Phase 3 (binary feedback): (prompt, completion, label), unpaired
  • All 9 evaluated against → Phase 4 (benchmark): (prompt, expected_tools[], expected_args[])
  • SFT baseline · Phase 1 trajectory data, (messages). Direct imitation of correct tool-call sequences. Reference checkpoint for all DPO variants.
  • LoRA · Phase 1 trajectory data. PEFT wrapper over SFT. Low-rank adapter only; base weights frozen. Faster iteration.
  • QLoRA · Phase 1 trajectory data. LoRA + 4-bit NF4 quantization. Reduces VRAM by ~60%. Quality delta vs LoRA is the comparison question.
  • DPO · Phase 2 preference pairs, (prompt, chosen, rejected, tools). Rafailov 2023. β = 0.3 (small-N regularizer; was 0.1 in v7-broken). Reference = SFT checkpoint. Primary production method.
  • IPO · Phase 2 preference pairs. Identity Preference Optimization (Azar et al.). Removes the Bradley-Terry assumption; theoretically better on near-equal pairs.
  • APO-Zero · Phase 2 preference pairs. Anchored Preference Optimization with zero anchor. Stabilizes training when the reference model is weak.
  • SPPO · Phase 2 preference pairs. Self-Play Preference Optimization. Iterative; each round re-ranks pairs using the current policy.
  • NCA · Phase 2 preference pairs. Noise-Contrastive Alignment. Treats rejected as noise; contrastive loss formulation.
  • KTO · Phase 3 unpaired binary, (prompt, completion, label). Kahneman-Tversky Optimization (Ethayarajh et al.). Works without paired comparisons. Phase 3 enforces N ≥ 1000 (G12) to avoid v7-broken's grad_norm=290 instability at N=146.
Phase 5 is sequential per-method execution with per-method gates (G1–G13 + behavioral reward-margin checks). Methods that no-op on tool calling (the v7-broken DPO failure mode) are rejected by gate G9 before they reach the bench. [source: ROADMAP.md L47-L48]

Pre-execution Guardrails (G1–G13)

Automated fail-stop CLI checks that run before every training job

Each guardrail encodes a SPECIFIC failure mode discovered in v7-broken or in cross-AI audits. Every check is deterministic and exits non-zero on violation — no training job starts if any guardrail fails. canvas-train --check runs all 13 in sequence. Each card lists WHAT the gate checks and WHY it exists (the v7-broken motivation). [source: PRE-EXECUTION-GUARDRAILS.md]

G1 Tool-call token coverage

What: ≥ 80% of chosen AND rejected strings must contain <|tool_call>.

Why: v7-broken had 0/1071 rows containing tool calls — DPO trained on the wrong objective entirely (urgency-prioritization rationales instead of tool calls).
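A minimal sketch of the check (illustrative function name; the real gate lives in the guardrail CLI):

```python
def g1_coverage(rows, threshold=0.80):
    """G1 sketch: both chosen AND rejected must contain the native envelope
    in >= 80% of rows. Returns True when the gate passes."""
    def frac(field):
        return sum("<|tool_call>" in r[field] for r in rows) / len(rows)
    return frac("chosen") >= threshold and frac("rejected") >= threshold
```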

G2 Eval dataset required & disjoint

What: DPO/KTO/IPO/etc trainer must receive a non-empty eval_dataset, item-disjoint from train.

Why: v7-broken passed only train_dataset; reported “eval” metrics were train-set artifacts — silent overfit invisible from the loss curves.

G3 Base model bench sanity ≥ 20%

What: Bench harness against base Gemma-4 must score ≥ 20% before testing fine-tunes.

Why: v7-broken's bench.py was missing the system prompt and used skip_special_tokens=True, which stripped the tool envelope. Base scored 0% — a harness artifact, not a model failure.

G4 Tool distribution ≥ 30, CV ≤ 0.3

What: Every tool appears ≥ 30 times AND coefficient-of-variation across the 18 tools ≤ 0.3.

Why: v7-broken: calendar.create_event = 218 vs canvas.list_announcements = 8; CV = 1.27. The model couldn't learn rare tools.

G5 Numerical stability: grad_norm < 1.0

What: 1-step dry run: grad_norm < 1.0, no NaN / Inf in activations or logits.

Why: v7-broken KTO: grad_norm = 290, logits at e+08 scale. Tightened from 5.0 → 1.0 in the Gemini cross-AI audit.

G6 Pipeline shrinkage < 30% per stage

What: Each pipeline stage retains ≥ 70% of its input rows, else require explicit --allow-shrinkage <reason>.

Why: v7-broken: 3000 raw → 1849 labeled (39% loss) → 1071 final (42% loss); 952 trajectory turns → 181 used. No audit, no warning.
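The audit the gate performs is a simple per-stage ratio. A sketch (illustrative names; the `--allow-shrinkage` override path is omitted):

```python
def g6_violations(stage_rows, max_loss=0.30):
    """G6 sketch: flag any pipeline stage that drops more than 30% of its
    input rows. `stage_rows` is an ordered list of (stage_name, row_count)."""
    bad = []
    for (_, n_in), (name, n_out) in zip(stage_rows, stage_rows[1:]):
        loss = 1 - n_out / n_in
        if loss > max_loss:
            bad.append((name, round(loss, 2)))
    return bad
```

The v7-broken pipeline (3000 raw → 1849 labeled → 1071 final) trips the gate at both stages.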

G7 Tool registry check (no hallucinated names)

What: Every tool name in the dataset must exist in the SDK registry.

Why: v7-broken: model emitted hallucinated canvas.list_assignments (real name: canvas.get_assignments) — the dataset itself contained the wrong name.

G8 Serializer ↔ parser round-trip

What: Every row round-trips through _tool_call_to_gemma4 → tool_parser.parse_gemma4_output with 100% fidelity.

Why: v7-broken serializer used Python str(v); parser expects JSON. 29 trajectory + 30 KTO rows failed round-trip silently.

G9 Behavioral test (real reward margins)

What: Tests construct trainer, run a mini-train, assert reward margins move in the correct direction per pair type.

Why: v7-broken's test_training_config.py froze source-text strings — it would pass while DPO trained on the wrong objective forever.

G10 Docker container smoke test

What: docker build + docker run canvas-train --help exits 0 in CI.

Why: v7-broken had ENTRYPOINT ["canvas-train"] + command: sh -c "..." — the two collided and argparse rejected at runtime.

G11 Sequence-length budget

What: < 1% of rows exceed max_length = 1024 after tokenization.

Why: Gemini cross-AI audit WARN: TRL silently truncates — multi-call trajectories with tool results easily exceed 1024 tokens.

G12 Hard minimum N per method

What: KTO ≥ 1000, DPO ≥ 1000, etc. Job aborts before training if N is below threshold.

Why: Gemini cross-AI audit WARN: KTO blew up partly because N = 146 was too small — numerical instability is near-certain at that scale.

G13 Trajectory tool repetition ≤ 3×

What: No single tool appears more than 3× in any one trajectory's assistant turns.

Why: v7-broken AUDIT: ~25% of multi-call trajectories had redundant calls (canvas.get_todo 7× in one response). Excessive repetition trains the model to loop instead of terminating.
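The repetition cap reduces to a counter over a trajectory's tool calls. A sketch (illustrative function name):

```python
from collections import Counter

def g13_ok(trajectory_tool_calls, max_repeats=3):
    """G13 sketch: no tool may appear more than max_repeats times across a
    single trajectory's assistant turns."""
    counts = Counter(trajectory_tool_calls)
    return all(n <= max_repeats for n in counts.values())
```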

Bench Harness phase 4 — not started

Offline evaluation protocol used to select the production checkpoint

Design targets — phase 4 (BIER-style benchmark expansion) is NOT STARTED. The numbers below are targets, not measurements. [source: ROADMAP.md L45]

  • Evaluation prompts (target): 1000+ (TBD post-phase-4)
  • Tools covered: 0
  • Each tool as expected_tools[0]: 0
  • Wilson CI per result: 0

Metrics

  • Tool selection accuracy — did the model call the correct tool as its first action? Measured per tool, per method.
  • Argument accuracy — ToolBench-style partial match on required arguments (exact for IDs, fuzzy for natural-language fields).
  • Wilson 95% CI — each per-tool accuracy estimate comes with a Wilson score confidence interval. Small per-tool samples (< 40) are flagged as low-confidence.
  • Parsimony rate — fraction of correct responses that use the minimum-cardinality tool set. DPO should improve this vs SFT baseline.
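The Wilson interval in the metric list is the standard score interval; a minimal sketch:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)
```

At 8/10 correct the interval is roughly (0.49, 0.94) — wide enough to justify flagging per-tool samples under 40 as low-confidence.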

Checkpoint selection

The production checkpoint is the epoch that maximizes macro-average tool selection accuracy on the eval split, subject to a constraint that no individual tool falls below 40% (tool-floor constraint). This prevents selecting a checkpoint that is strong on common tools but has regressed on rare ones.
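The selection rule above reduces to a constrained argmax. A sketch (illustrative names; the real selection script is not shown in this doc):

```python
def select_checkpoint(per_epoch_tool_acc, floor=0.40):
    """Pick the epoch maximizing macro-average tool selection accuracy,
    skipping any epoch where some tool falls below the 40% floor.

    per_epoch_tool_acc maps epoch -> {tool_name: accuracy}.
    """
    best_epoch, best_macro = None, -1.0
    for epoch, accs in per_epoch_tool_acc.items():
        if min(accs.values()) < floor:
            continue  # tool-floor constraint: regressed on a rare tool
        macro = sum(accs.values()) / len(accs)
        if macro > best_macro:
            best_epoch, best_macro = epoch, macro
    return best_epoch
```

An epoch with the highest macro average is still rejected if any single tool (say reranker.priority_hint) dips under the floor.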

PII Handling

How student-data PII is scrubbed before training

PII is scrubbed via Piiranha, a HuggingFace-hosted PII detection model run inside an isolated worker container before the train / test split. Anonymization is enforced as a phase-1 acceptance gate — a trajectory cannot enter the dataset if Piiranha flags any token in it as personal information.

Implementation

The HF Space demo never touches real student data — it ships with mocked Canvas tools and no Canvas credentials. PII handling is only relevant to the training pipeline. [source: anon_worker_piiranha.py, Dockerfile.anon-piiranha]