Background
Cloud LLMs are impractical for field service in hospitals and clinics — no access to closed networks, regulatory constraints on data transmission, per-token costs. We needed AI-powered maintenance decisions that run entirely on-device.
Goal
Build a fine-tuned Small Language Model that takes a medical ventilator subsystem reading and produces a structured JSON maintenance plan — offline, on a phone.
Approach
Fine-tune Microsoft's Phi-3-mini (3.8B) with LoRA adapters on a domain knowledge base of 53 ventilator facts. Train on a Mac mini. Deploy via llama.cpp on iOS.
Outcome
54% → 89% decision accuracy over 8 iterations. 2.2GB model, ~12 tokens/sec on iPhone. Total training cost: €0.55 in electricity.
Domain Knowledge Base: 53 facts, 4 subsystems, 6 fact types
Dataset Generation: 1,512 scenario combinations; 600 train, 100 eval / 100 test
LoRA Fine-Tuning: Phi-3 3.8B; rank 16, 0.5% of params; ~4h / iteration
↓ iterate on training data ↓
Evaluation: decision accuracy, fact grounding, schema validity, safety ordering
Export Pipeline: MLX → GGUF, Q4_K_M quantization, 2.2GB final model
Mobile App: iPhone / Metal GPU, ~12 tokens/sec, Plan mode
| Constraint | Cloud LLM | On-Device SLM |
|---|---|---|
| Connectivity | Required | Not needed |
| Regulatory | Data transmitted to third-party servers | Nothing leaves the device |
| Response time | Network round-trip | < 1 second |
| Per-query cost | Per-token billing | Zero after deployment |
| Vendor dependency | Locked to API provider | Manufacturer owns the model |
| Subsystem | Example of facts |
|---|---|
| Electronics Controller | Signal integrity thresholds, firmware checksum deviation, JTAG programming requirements |
| Pneumatics Assembly | Flow output thresholds, valve kit and torque wrench requirements, depressurization procedures |
| Sensor Array | Measurement zone boundaries, reading tolerance limits |
| Power Supply Unit | Power state constraints, stability requirements |
Five possible decisions

| Decision | When |
|---|---|
| Emergency shutdown | Critical failure, immediate stop |
| Escalate | Needs specialist or patient transfer |
| Full replacement | Subsystem swap, specific tool kits |
| Temporary fix | Stabilize within policy limits (2–8h) |
| Monitor | Within tolerance, schedule follow-up |
Each combination of subsystem, measurement, environment, and constraints maps to exactly one correct decision. No ambiguity. That’s what makes this domain work for fine-tuning — you can programmatically verify whether the model got it right.
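Because every scenario maps to exactly one correct decision, the ground truth itself can be computed in plain code, which is what makes model answers verifiable by string comparison. A minimal sketch; the thresholds, field names, and rules here are illustrative, not the project's actual policy:

```python
# Illustrative ground-truth labeler: every scenario maps to exactly one
# decision, so the label can be computed (and verified) programmatically.
# Thresholds and field names are hypothetical, not the project's real policy.

def label(scenario: dict) -> str:
    m = scenario["measurement_pct"]       # subsystem output, % of nominal
    tools = scenario["tools"]             # "full" | "basic" | "none"
    if m < 40:
        return "emergency_shutdown"       # critical failure, immediate stop
    if m < 70:
        # below the replacement threshold: needs a swap if tools allow it
        return "full_replacement" if tools == "full" else "escalate"
    if m < 90:
        return "temporary_fix"            # stabilize within policy limits
    return "monitor"                      # within tolerance, follow up

# Verifying a model answer then reduces to an equality check:
assert label({"measurement_pct": 65, "tools": "full"}) == "full_replacement"
assert label({"measurement_pct": 65, "tools": "none"}) == "escalate"
```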
Step 3 — Foundation Decisions
Three decisions shaped everything that followed. Each one closed off alternatives and committed us to a path.
What is a Small Language Model?
SLMs use the same transformer architecture as frontier LLMs, but at 1–4 billion parameters instead of 100B+. The trade-off: less general knowledge, but enough capacity for focused, domain-specific tasks. Small enough to run on a phone.
Base Model: Phi-3 Mini (3.8B)
We evaluated four candidates:
| Model | Size | Strength | Limitation |
|---|---|---|---|
| Phi-3-mini | 3.8B | Best reasoning per param, strong JSON | Slightly larger |
| Llama 3.2 | 3B | Official mobile support | Weaker at structured output |
| Gemma 2 | 2B | Smallest viable | Struggles with branching logic |
| Ministral 3B | 3B | Good general performance | License restrictions for commercial use |
Phi-3 is trained on synthetic “textbook-quality” reasoning data. In practice, this meant it needed fewer training examples to learn our domain’s branching logic — a significant advantage when your entire dataset is 600 samples.
Fine-Tuning over RAG
Two standard approaches for domain-specific LLMs. We chose fine-tuning.
RAG retrieves context at runtime — the model stays general and you maintain a vector database of documents. Good when knowledge changes often. Requires a retrieval system on the device.
Fine-tuning teaches the model domain-specific reasoning patterns and output structure. Facts are still provided at inference time — what changes is how the model processes them. No runtime infrastructure beyond the model itself.
For our use case, we needed deterministic JSON output, not document retrieval. An important nuance: fine-tuning here doesn’t bake the facts into the model — the knowledge base is still passed in at inference time as context. What the model learns is how to reason over facts: which thresholds matter, how to chain constraints, when to escalate vs. replace. That means the fact KB can be updated to a degree without retraining — add a new threshold, adjust a policy limit — as long as the reasoning patterns the model learned still apply.
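Concretely, "facts passed in at inference time" means the prompt carries the knowledge base while the fine-tuned model supplies only the reasoning. A sketch of the prompt assembly; the layout and fact wording are illustrative, not the project's actual template:

```python
# Sketch of inference-time prompt assembly: the knowledge base travels with
# the request, so facts can be updated without retraining the adapter.
# Prompt layout and fact strings are illustrative.

FACTS = {
    "pneumatics": [
        "Temporary fixes require flow output above 70% of nominal.",
        "Valve replacement requires valve kit VK-300 and torque wrench TW-8.",
    ],
}

def build_prompt(subsystem: str, measurement: str) -> str:
    facts = "\n".join(f"- {f}" for f in FACTS[subsystem])
    return (
        f"Facts for {subsystem}:\n{facts}\n\n"
        f"Reading: {measurement}\n"
        "Respond with a JSON maintenance plan."
    )

prompt = build_prompt("pneumatics", "flow output 72%")
```

Editing a threshold string in `FACTS` changes the model's inputs immediately; only a shift in the reasoning patterns themselves would require retraining.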
Fine-tuning was the obvious choice for this case. RAG would be the better fit when you don’t need a custom output structure and the reasoning patterns are simple enough for a base model to handle without specialization — think Q&A over a document corpus, not multi-step constraint evaluation with structured JSON output.
LoRA: Modify 0.5% of the Model
LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices — about 0.5% of total parameters. The resulting adapter is ~30MB, versus 2.2GB for the base model. Adapters can be hot-swapped at runtime in ~50ms without reloading the base model, which opens up possibilities beyond single-domain deployment: different specialized adapters per task, A/B testing between adapter versions with a feedback mechanism, or shipping adapter updates over the air while the base model stays cached on-device.
| Parameter | Value | Why |
|---|---|---|
| Rank | 16 | Started at 8, but the model couldn't learn branching logic ("if measurement < 70% AND tools = full → replace"). Rank 16 gave enough capacity for multi-step conditionals. |
| Dropout | 0.05 | Prevents memorization. With only 600 training samples, overfitting is a real risk. |
| Learning rate | Cosine 1e-5 → 1e-6 | Start with bigger steps, refine gradually. Warmup of 50 steps to avoid destabilizing pretrained weights. |
| Target modules | q_proj, k_proj, v_proj, o_proj | Standard attention projection matrices — where reasoning patterns live. |
Result: a ~30MB adapter on top of the 2.2GB base model.
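The ~0.5% and ~30MB figures can be sanity-checked with back-of-the-envelope arithmetic. Assuming Phi-3-mini's hidden size of 3072, 32 layers, and square q/k/v/o projections (an approximation; the real k/v shapes may differ), the numbers land in the same ballpark:

```python
# Back-of-the-envelope LoRA adapter size for Phi-3-mini.
# Assumptions (approximate): hidden size 3072, 32 layers, square
# 3072x3072 q/k/v/o projections; the real k/v shapes may differ.
rank, hidden, layers, matrices = 16, 3072, 32, 4

per_matrix = rank * (hidden + hidden)        # A (d x r) plus B (r x d)
adapter_params = per_matrix * matrices * layers
ratio = adapter_params / 3.8e9               # fraction of base parameters
size_mb = adapter_params * 2 / 1e6           # fp16, 2 bytes per weight

print(f"{adapter_params/1e6:.1f}M params, {ratio:.2%} of base, ~{size_mb:.0f}MB")
# → 12.6M params, 0.33% of base, ~25MB
```

The gap between this estimate and the quoted ~0.5% / ~30MB comes from the rough shape assumptions and adapter metadata; the order of magnitude is what matters.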
What the hyperparameters actually control
The table above is the “what”. Here’s the “why it matters” for each knob:
Rank controls adapter capacity — how complex the patterns it can learn.
Rank 4–8: learns vocabulary and formatting (surface patterns)
Rank 16: learns branching logic — threshold comparisons, multi-step conditionals. Our sweet spot.
Rank 64+: approaches full fine-tuning capacity, but overfits on small datasets
The rank 8 → 16 jump was part of the v3 → v5 accuracy leap. At rank 8, the model could learn what words to output but couldn’t learn when to output them — it failed on “if measurement < 70% AND tools = full → full_replacement” because that multi-step conditional needed more adapter capacity than rank 8 provided.
Dropout is regularization — randomly disabling 5% of the adapter’s connections on each training step forces the model to not rely on any single pathway.
0.0: overfits — memorizes training examples, fails on new scenarios
0.2+: underfits — can’t learn enough
0.05: conservative, appropriate for a well-structured task with only 600 samples
Cosine decay (1e-5 → 1e-6) controls how big each weight update is. The sculpting analogy: start with a chisel, finish with sandpaper.
Start at 1e-5: large enough steps to make meaningful progress early
Cosine curve: smooth gradual reduction (not an abrupt drop)
End at 1e-6: very small refinements — fine-tuning the fine-tuning
Warmup (50 steps from 1e-7): pick up the chisel carefully — don’t destabilize the pretrained weights with aggressive early updates
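The warmup-plus-cosine schedule above can be written out in a few lines. A sketch; `total_steps` is illustrative, since the project's real step count isn't stated:

```python
import math

# Cosine decay with linear warmup, matching the settings described above:
# warm up over 50 steps from 1e-7 to 1e-5, then decay smoothly to 1e-6.
# total_steps is a placeholder, not the project's actual step count.

def lr_at(step, total_steps=1000, warmup=50,
          lr_start=1e-7, lr_peak=1e-5, lr_end=1e-6):
    if step < warmup:                     # "pick up the chisel carefully"
        return lr_start + (lr_peak - lr_start) * step / warmup
    # cosine curve from the peak down to lr_end over the remaining steps
    progress = (step - warmup) / (total_steps - warmup)
    return lr_end + 0.5 * (lr_peak - lr_end) * (1 + math.cos(math.pi * progress))
```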
Step 4 — Dataset: Programmatic, Not Hand-Written
We didn’t write training examples by hand. We generated them from the full scenario space.
4 subsystems × 12 failure modes × 3 environments × 3 tool levels × 2 power states = 1,512 possible combinations
From that space: 600 training, 100 evaluation, 100 test — with strict separation. No scenario appears in more than one split.
The distractor strategy
Each training sample includes the relevant facts plus 7 distractor facts — real facts from the knowledge base that are irrelevant to the current scenario. The model has to learn which facts matter and which to ignore.
This is critical. In production, the model receives all facts for a subsystem (11–13 facts), not a curated subset. If it only trained on relevant facts, it would never learn to filter.
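A sketch of the distractor sampling; the fact strings are placeholders for the 53 real facts:

```python
import random

# Distractor sampling sketch: each training sample carries its relevant
# facts plus 7 real-but-irrelevant facts drawn from the same knowledge base.
# Fact strings are placeholders for the project's 53 real facts.

def build_fact_context(relevant, knowledge_base, n_distractors=7, seed=0):
    rng = random.Random(seed)
    pool = [f for f in knowledge_base if f not in relevant]
    distractors = rng.sample(pool, n_distractors)
    facts = relevant + distractors
    rng.shuffle(facts)          # don't let position give the answer away
    return facts

kb = [f"fact-{i}" for i in range(53)]
ctx = build_fact_context(["fact-3", "fact-17"], kb)
# ctx: the 2 relevant facts hidden among 7 distractors, order shuffled
```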
How we measured success
Four metrics, each catching a different failure mode:
| Metric | What it catches |
|---|---|
| Schema validity | Is the JSON well-formed and complete? |
| Decision accuracy | Did the model pick the correct one of the five decisions? |
| Fact grounding | Did it reference only provided facts — no hallucinated thresholds? |
| Safety ordering | Are safety checks (patient isolation, depressurization) listed before repair actions? |
Schema validity hit 99–100% from the earliest iterations — Phi-3’s strength at structured output paid off immediately. Decision accuracy was the hard problem.
Step 5 — Training: Eight Iterations from 54% to 89%
The progression tells a story. Each version taught us something about how SLMs learn — and how they fail.
| Version | Accuracy | Change |
|---|---|---|
| v2 | — | Fixed measurement format, LoRA setup |
| v3 | 54% | Baseline — model guesses when uncertain |
| v5 | 83% | Chain-of-thought + rank 8→16 (+29) |
| v6 | 77% | Pink Elephant Effect (−6) |
| v7 | 85% | Removed negation, positive anchors |
| v8 | 89% | Priority rules + constraint verification |
v5: Chain-of-thought — the biggest single jump
Before v5, the training data went straight from scenario to decision. The model saw inputs and outputs but not the reasoning path between them.
At v5, we added explicit reasoning steps to every training sample: check the measurement zone, verify constraints, evaluate available actions, then decide. We also bumped LoRA rank from 8 to 16 because the model needed more adapter capacity for multi-step logic.
The result — 54% to 83% — was the largest improvement in the entire project. Chain-of-thought isn’t just a prompting trick for production use. It fundamentally changes what a fine-tuned model can learn.
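The shape of a v5-style training target, with the reasoning externalized before the JSON. Step wording and the plan schema here are illustrative:

```python
import json

# v5-style training target: explicit reasoning steps first, JSON plan last.
# Step wording and the plan schema are illustrative, not the actual text.

def make_target(steps: list[str], plan: dict) -> str:
    reasoning = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    return f"Reasoning:\n{reasoning}\n\n{json.dumps(plan)}"

target = make_target(
    ["Flow output 72% is above the 70% temporary-fix threshold.",
     "Required tools (valve kit VK-300, torque wrench TW-8) are available.",
     "Power is stable, so a temporary fix is permitted."],
    {"decision": "temporary_fix"},
)
```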
v6 → v7: The Pink Elephant Effect
This was the most instructive mistake.
At 83%, we analyzed the errors. The model sometimes confused similar decisions — calling something “escalate” when it should be “full replacement”, or vice versa. Logical response: add disambiguation to the training data. Make it explicit.
We added text like:
“This is NOT an escalation case. This is NOT a shutdown or replacement case.”
Accuracy dropped from 83% to 77%. Ten new errors appeared.
The mechanism: transformer attention activates on tokens regardless of negation context. When the training data says “not escalate”, the attention mechanism lights up on “escalate”. The model is now thinking about escalation — weighing it, considering it — even though the text says not to. Autoregressive models generate token by token. They can’t “un-think” a concept once it’s been activated.
The fix at v7: remove every “not X” phrase. Replace with positive-only anchors:
“Confirmed: this is a temporary fix case. Flow output is at 72%, above the 70% threshold for temporary fixes. All required tools (valve kit VK-300, torque wrench TW-8) are available. Power is stable.”
Result: 77% → 85%.
The principle applies beyond fine-tuning. If you’re writing system prompts, few-shot examples, or RAG instructions — describe what something is. Describing what it isn’t can produce the opposite of what you intend. The Pink Elephant Effect is weaker in large models (they have more capacity to process negation), but it never fully goes away.
v8: Data quality is the only lever that matters
v8 reached 89% with the same hyperparameters as v5. No changes to rank, dropout, learning rate, or training duration. The only change: better training text — explicit priority rules, constraint verification before action selection.
From v5 to v8, every accuracy improvement came from rewriting training data. The hyperparameters were “good enough” at v5 and stayed good enough. If your fine-tuned model has plateaued, look at your data before touching your config.
The remaining hard boundary
Even at 89%, there’s a clear pattern in the errors. The hardest confusion: escalate vs. full replacement. Both triggered by the same measurement thresholds, differentiated by a single variable — whether the right tools are available.
6 of 11 remaining errors are this exact confusion. The next lever: contrastive training pairs where only tools availability differs, forcing the model to learn that one variable is the deciding factor.
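Such contrastive pairs are straightforward to generate programmatically. A sketch, with hypothetical field names:

```python
# Contrastive-pair sketch for the escalate vs. full_replacement confusion:
# two samples identical except for tool availability, so the model has to
# learn that this one variable decides the outcome. Field names illustrative.

def contrastive_pair(base_scenario: dict) -> list[dict]:
    with_tools = {**base_scenario, "tools": "full",
                  "decision": "full_replacement"}
    without_tools = {**base_scenario, "tools": "none",
                     "decision": "escalate"}
    return [with_tools, without_tools]

pair = contrastive_pair({"subsystem": "pneumatics", "measurement_pct": 60})
# the two samples differ only in "tools" and, consequently, "decision"
```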
Step 6 — From Training Weights to a Phone
Two separate challenges: converting the model to a mobile-friendly format, and building the iOS integration.
Export: the silent failure
The model trains in MLX format (Apple’s ML framework). iPhones run llama.cpp with GGUF format. The conversion pipeline:
MLX LoRA → Fused 4-bit → Dequant bf16 → GGUF f16 → GGUF Q4_K_M (2.2GB)
The trap: MLX uses its own 4-bit quantization format. llama.cpp’s converter doesn’t understand it. If you skip the dequantization step, you get a GGUF file that loads without errors, runs inference, and produces garbage. No warning. No crash. Just a model that confidently outputs nonsense.
The fix is one flag (--dequantize). Finding out that the flag was needed cost hours. The lesson: always validate exports with your evaluation pipeline, never by visual inspection.
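That validation can be automated as an export gate: rerun the eval set through the converted model and compare against the pre-export accuracy. A sketch; `generate` is a stand-in for actual GGUF inference via llama.cpp, and the names and thresholds are ours:

```python
# Export gate sketch: a silently broken conversion loads and runs fine but
# shows up as a large accuracy drop on the eval set. generate() stands in
# for real GGUF inference; names and thresholds are illustrative.

def export_accuracy(generate, eval_set):
    correct = sum(generate(s["prompt"]) == s["decision"] for s in eval_set)
    return correct / len(eval_set)

def check_export(generate, eval_set, baseline=0.89, tolerance=0.02):
    # pass only if the exported model stays close to the pre-export number
    return export_accuracy(generate, eval_set) >= baseline - tolerance
```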
iOS: four libraries tried, one wrapper written
The llama.cpp Swift ecosystem is not mature. We tried four wrappers:
| Library | What happened |
|---|---|
| SwiftLlama | Outdated tokenizer, runtime crashes |
| LocalLLMClient | C++ headers broke iOS builds |
| llama-cpp-swift | Missing Package.swift for the dependency |
| mattt/llama.swift | Compiled — but raw C API only |
The last one worked, but it only exposes low-level C functions: llama_model_load_from_file(), llama_decode(), llama_sampler_sample(). No Swift-native interface.
We wrote a 738-line wrapper on top that handles:
- Model lifecycle (load, unload, memory management)
- Token streaming via Swift's AsyncThrowingStream
- KV cache management for multi-turn conversations
- Memory pressure handling: auto-unloads the model on low-memory warnings
- Thermal monitoring: blocks inference if the device overheats
- Swift 6 strict concurrency (Sendable compliance)
- Metal GPU acceleration on physical devices
The app’s core feature is Plan mode: technician selects a subsystem, inputs a measurement, and gets a structured JSON maintenance plan rendered on-device. A basic chat interface exists for testing model inference directly, but free-form conversational interaction is a future extension, not a shipping feature.
At this ecosystem maturity level, the right strategy is thin bindings and your own wrapper. Opinionated libraries break on every llama.cpp update.
The Numbers
- 89% decision accuracy
- 98% fact grounding
- 99%+ schema validity
- 2.2GB model size (Q4_K_M)
- ~12 tokens/sec on iPhone
- ~50ms adapter swap time
Cost
| Phase | Hours |
|---|---|
| Training (8 iterations) | ~26h |
| Evaluation inference | ~9h |
| Export (4 GGUF exports) | ~1.3h |
| Data generation & misc | ~1h |
| Total compute | ~37h |
Hardware: Mac mini M2 Pro, ~40W sustained. Total energy: ~1.8 kWh. Total cost: €0.55.
An A100 GPU would have been ~10x faster: the full project in 3–4 hours instead of 37. But the bottleneck was never waiting for training to finish. It was analyzing evaluation results, understanding why the model confused escalation with replacement, and rewriting training data. The Mac mini fit naturally into an iteration cycle: start training, analyze the previous run's results, write better data for the next one.
What We Learned
Seven things we’d tell ourselves at the start:
1. Training data quality is the main lever that moves.
From v5 to v8, hyperparameters were frozen. Every accuracy gain came from rewriting training text.

2. Don't use negation in training data — or anywhere a model reads.
The Pink Elephant Effect: saying "not X" makes the model think about X. Describe what things are. Avoid describing what they aren't. This applies to fine-tuning, system prompts, and RAG alike.

3. Chain-of-thought transforms fine-tuning, not just prompting.
Externalizing reasoning in training data (+29% accuracy) was a bigger lever than any hyperparameter change. The model doesn't just learn to mimic the output — it learns the reasoning path.

4. SLMs are production-viable for structured, domain-specific tasks.
A 3.8B model on a phone makes medical equipment decisions at 89% accuracy with 98% fact grounding. You don't need a 100B+ model for everything.

5. Budget time for mobile integration — the ecosystem is immature.
Four out of four Swift llama.cpp wrappers failed. Plan to write your own integration layer. This will change as the ecosystem matures, but as of early 2026, it's still build-your-own territory.

6. Validate model exports with your eval pipeline, never by eye.
Silent format conversion failures produce models that load, run, and output confident garbage. The only reliable test is your evaluation suite.

7. You don't need a GPU cluster.
Two weeks on a Mac mini, €0.55 in electricity, 89% accuracy. The bottleneck is thinking about your data, not waiting for compute.

