Background
Cloud LLMs are impractical for field service in hospitals and clinics — no access to closed networks, regulatory constraints on data transmission, per-token costs. We needed AI-powered maintenance decisions that run entirely on-device.
Goal
Build a fine-tuned Small Language Model that takes a medical ventilator subsystem reading and produces a structured JSON maintenance plan — offline, on a phone.
Approach
Fine-tune Microsoft's Phi-3-mini (3.8B) with LoRA adapters on a domain knowledge base of 53 ventilator facts. Train on a Mac mini. Deploy via llama.cpp on iOS.
Outcome
54% → 89% decision accuracy over 8 iterations. 2.2GB model, ~12 tokens/sec on iPhone. Total training cost: €0.55 in electricity.
Domain Knowledge Base: 53 facts, 4 subsystems, 6 fact types
Dataset Generation: 1,512 scenario combinations; 600 train, 100 eval / 100 test
LoRA Fine-Tuning: Phi-3 3.8B; rank 16, 0.5% of params; ~4h / iteration
↓ iterate on training data ↓
Evaluation: decision accuracy, fact grounding, schema validity, safety ordering
Export Pipeline: MLX → GGUF, Q4_K_M quantization, 2.2GB final model
Mobile App: iPhone / Metal GPU, ~12 tokens/sec, Plan mode
| Constraint | Cloud LLM | On-Device SLM |
|---|---|---|
| Connectivity | Required | Not needed |
| Regulatory | Data transmitted to third-party servers | Nothing leaves the device |
| Response time | Network round-trip | < 1 second |
| Per-query cost | Per-token billing | Zero after deployment |
| Vendor dependency | Locked to API provider | Manufacturer owns the model |
| Subsystem | Example of facts |
|---|---|
| Electronics Controller | Signal integrity thresholds, firmware checksum deviation, JTAG programming requirements |
| Pneumatics Assembly | Flow output thresholds, valve kit and torque wrench requirements, depressurization procedures |
| Sensor Array | Measurement zone boundaries, reading tolerance limits |
| Power Supply Unit | Power state constraints, stability requirements |
Five possible decisions

| Decision | When |
|---|---|
| Emergency shutdown | Critical failure, immediate stop |
| Escalate | Needs specialist or patient transfer |
| Full replacement | Subsystem swap, specific tool kits |
| Temporary fix | Stabilize within policy limits (2–8h) |
| Monitor | Within tolerance, schedule follow-up |
Each combination of subsystem, measurement, environment, and constraints maps to exactly one correct decision. No ambiguity. That’s what makes this domain work for fine-tuning — you can programmatically verify whether the model got it right.
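Because every scenario maps to exactly one correct decision, the ground truth itself can be computed in plain code, which is what makes model answers verifiable by string comparison. A minimal sketch; the thresholds, field names, and rules here are illustrative, not the project's actual policy:

```python
# Illustrative ground-truth labeler: every scenario maps to exactly one
# decision, so the label can be computed (and verified) programmatically.
# Thresholds and field names are hypothetical, not the project's real policy.

def label(scenario: dict) -> str:
    m = scenario["measurement_pct"]       # subsystem output, % of nominal
    tools = scenario["tools"]             # "full" | "basic" | "none"
    if m < 40:
        return "emergency_shutdown"       # critical failure, immediate stop
    if m < 70:
        # below the replacement threshold: needs a swap if tools allow it
        return "full_replacement" if tools == "full" else "escalate"
    if m < 90:
        return "temporary_fix"            # stabilize within policy limits
    return "monitor"                      # within tolerance, follow up

# Verifying a model answer then reduces to an equality check:
assert label({"measurement_pct": 65, "tools": "full"}) == "full_replacement"
assert label({"measurement_pct": 65, "tools": "none"}) == "escalate"
```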
Step 3 — Foundation Decisions
Three decisions shaped everything that followed. Each one closed off alternatives and committed us to a path.
What is a Small Language Model?
SLMs use the same transformer architecture as frontier LLMs, but at 1–4 billion parameters instead of 100B+. The trade-off: less general knowledge, but enough capacity for focused, domain-specific tasks. Small enough to run on a phone.
Base Model: Phi-3 Mini (3.8B)
We evaluated four candidates:
| Model | Size | Strength | Limitation |
|---|---|---|---|
| Phi-3-mini | 3.8B | Best reasoning per param, strong JSON | Slightly larger |
| Llama 3.2 | 3B | Official mobile support | Weaker at structured output |
| Gemma 2 | 2B | Smallest viable | Struggles with branching logic |
| Ministral 3B | 3B | Good general performance | License restrictions for commercial use |
Phi-3 is trained on synthetic “textbook-quality” reasoning data. In practice, this meant it needed fewer training examples to learn our domain’s branching logic — a significant advantage when your entire dataset is 600 samples.
Fine-Tuning over RAG
Two standard approaches for domain-specific LLMs. We chose fine-tuning.
RAG retrieves context at runtime — the model stays general and you maintain a vector database of documents. Good when knowledge changes often. Requires a retrieval system on the device.
Fine-tuning teaches the model domain-specific reasoning patterns and output structure. Facts are still provided at inference time — what changes is how the model processes them. No runtime infrastructure beyond the model itself.
For our use case, we needed deterministic JSON output, not document retrieval. An important nuance: fine-tuning here doesn’t bake the facts into the model — the knowledge base is still passed in at inference time as context. What the model learns is how to reason over facts: which thresholds matter, how to chain constraints, when to escalate vs. replace. That means the fact KB can be updated to a degree without retraining — add a new threshold, adjust a policy limit — as long as the reasoning patterns the model learned still apply.
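Concretely, "facts passed in at inference time" means the prompt carries the knowledge base while the fine-tuned model supplies only the reasoning. A sketch of the prompt assembly; the layout and fact wording are illustrative, not the project's actual template:

```python
# Sketch of inference-time prompt assembly: the knowledge base travels with
# the request, so facts can be updated without retraining the adapter.
# Prompt layout and fact strings are illustrative.

FACTS = {
    "pneumatics": [
        "Temporary fixes require flow output above 70% of nominal.",
        "Valve replacement requires valve kit VK-300 and torque wrench TW-8.",
    ],
}

def build_prompt(subsystem: str, measurement: str) -> str:
    facts = "\n".join(f"- {f}" for f in FACTS[subsystem])
    return (
        f"Facts for {subsystem}:\n{facts}\n\n"
        f"Reading: {measurement}\n"
        "Respond with a JSON maintenance plan."
    )

prompt = build_prompt("pneumatics", "flow output 72%")
```

Editing a threshold string in `FACTS` changes the model's inputs immediately; only a shift in the reasoning patterns themselves would require retraining.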
Fine-tuning was the obvious choice for this case. RAG would be the better fit when you don’t need a custom output structure and the reasoning patterns are simple enough for a base model to handle without specialization — think Q&A over a document corpus, not multi-step constraint evaluation with structured JSON output.
LoRA: Modify 0.5% of the Model
LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices — about 0.5% of total parameters. The resulting adapter is ~30MB, versus 2.2GB for the base model. Adapters can be hot-swapped at runtime in ~50ms without reloading the base model, which opens up possibilities beyond single-domain deployment: different specialized adapters per task, A/B testing between adapter versions with a feedback mechanism, or shipping adapter updates over the air while the base model stays cached on-device.
| Parameter | Value | Why |
|---|---|---|
| Rank | 16 | Started at 8, but the model couldn't learn branching logic ("if measurement < 70% AND tools = full → replace"). Rank 16 gave enough capacity for multi-step conditionals. |
| Dropout | 0.05 | Prevents memorization. With only 600 training samples, overfitting is a real risk. |
| Learning rate | Cosine 1e-5 → 1e-6 | Start with bigger steps, refine gradually. Warmup of 50 steps to avoid destabilizing pretrained weights. |
| Target modules | q_proj, k_proj, v_proj, o_proj | Standard attention projection matrices — where reasoning patterns live. |
Result: a ~30MB adapter on top of the 2.2GB base model.
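The ~0.5% and ~30MB figures can be sanity-checked with back-of-the-envelope arithmetic. Assuming Phi-3-mini's hidden size of 3072, 32 layers, and square q/k/v/o projections (an approximation; the real k/v shapes may differ), the numbers land in the same ballpark:

```python
# Back-of-the-envelope LoRA adapter size for Phi-3-mini.
# Assumptions (approximate): hidden size 3072, 32 layers, square
# 3072x3072 q/k/v/o projections; the real k/v shapes may differ.
rank, hidden, layers, matrices = 16, 3072, 32, 4

per_matrix = rank * (hidden + hidden)        # A (d x r) plus B (r x d)
adapter_params = per_matrix * matrices * layers
ratio = adapter_params / 3.8e9               # fraction of base parameters
size_mb = adapter_params * 2 / 1e6           # fp16, 2 bytes per weight

print(f"{adapter_params/1e6:.1f}M params, {ratio:.2%} of base, ~{size_mb:.0f}MB")
# → 12.6M params, 0.33% of base, ~25MB
```

The gap between this estimate and the quoted ~0.5% / ~30MB comes from the rough shape assumptions and adapter metadata; the order of magnitude is what matters.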
What the hyperparameters actually control
The table above is the “what”. Here’s the “why it matters” for each knob:
Rank controls adapter capacity — how complex the patterns it can learn.
Rank 4–8: learns vocabulary and formatting (surface patterns)
Rank 16: learns branching logic — threshold comparisons, multi-step conditionals. Our sweet spot.
Rank 64+: approaches full fine-tuning capacity, but overfits on small datasets
The rank 8 → 16 jump was part of the v3 → v5 accuracy leap. At rank 8, the model could learn what words to output but couldn’t learn when to output them — it failed on “if measurement < 70% AND tools = full → full_replacement” because that multi-step conditional needed more adapter capacity than rank 8 provided.
Dropout is regularization — randomly disabling 5% of the adapter’s connections on each training step forces the model to not rely on any single pathway.
0.0: overfits — memorizes training examples, fails on new scenarios
0.2+: underfits — can’t learn enough
0.05: conservative, appropriate for a well-structured task with only 600 samples
Cosine decay (1e-5 → 1e-6) controls how big each weight update is. The sculpting analogy: start with a chisel, finish with sandpaper.
Start at 1e-5: large enough steps to make meaningful progress early
Cosine curve: smooth gradual reduction (not an abrupt drop)
End at 1e-6: very small refinements — fine-tuning the fine-tuning
Warmup (50 steps from 1e-7): pick up the chisel carefully — don’t destabilize the pretrained weights with aggressive early updates
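The warmup-plus-cosine schedule above can be written out in a few lines. A sketch; `total_steps` is illustrative, since the project's real step count isn't stated:

```python
import math

# Cosine decay with linear warmup, matching the settings described above:
# warm up over 50 steps from 1e-7 to 1e-5, then decay smoothly to 1e-6.
# total_steps is a placeholder, not the project's actual step count.

def lr_at(step, total_steps=1000, warmup=50,
          lr_start=1e-7, lr_peak=1e-5, lr_end=1e-6):
    if step < warmup:                     # "pick up the chisel carefully"
        return lr_start + (lr_peak - lr_start) * step / warmup
    # cosine curve from the peak down to lr_end over the remaining steps
    progress = (step - warmup) / (total_steps - warmup)
    return lr_end + 0.5 * (lr_peak - lr_end) * (1 + math.cos(math.pi * progress))
```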
Step 4 — Dataset: Programmatic, Not Hand-Written
We didn’t write training examples by hand. We generated them from the full scenario space.
4 subsystems × 12 failure modes × 3 environments × 3 tool levels × 2 power states = 1,512 possible combinations
From that space: 600 training, 100 evaluation, 100 test — with strict separation. No scenario appears in more than one split.
The distractor strategy
Each training sample includes the relevant facts plus 7 distractor facts — real facts from the knowledge base that are irrelevant to the current scenario. The model has to learn which facts matter and which to ignore.
This is critical. In production, the model receives all facts for a subsystem (11–13 facts), not a curated subset. If it only trained on relevant facts, it would never learn to filter.
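A sketch of the distractor sampling; the fact strings are placeholders for the 53 real facts:

```python
import random

# Distractor sampling sketch: each training sample carries its relevant
# facts plus 7 real-but-irrelevant facts drawn from the same knowledge base.
# Fact strings are placeholders for the project's 53 real facts.

def build_fact_context(relevant, knowledge_base, n_distractors=7, seed=0):
    rng = random.Random(seed)
    pool = [f for f in knowledge_base if f not in relevant]
    distractors = rng.sample(pool, n_distractors)
    facts = relevant + distractors
    rng.shuffle(facts)          # don't let position give the answer away
    return facts

kb = [f"fact-{i}" for i in range(53)]
ctx = build_fact_context(["fact-3", "fact-17"], kb)
# ctx: the 2 relevant facts hidden among 7 distractors, order shuffled
```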
How we measured success
Four metrics, each catching a different failure mode:
| Metric | What it catches |
|---|---|
| Schema validity | Is the JSON well-formed and complete? |
| Decision accuracy | Did the model pick the correct one of the five decisions? |
| Fact grounding | Did it reference only provided facts — no hallucinated thresholds? |
| Safety ordering | Are safety checks (patient isolation, depressurization) listed before repair actions? |
Schema validity hit 99–100% from the earliest iterations — Phi-3’s strength at structured output paid off immediately. Decision accuracy was the hard problem.
Step 5 — Training: Eight Iterations from 54% to 89%
The progression tells a story. Each version taught us something about how SLMs learn — and how they fail.
| Version | Accuracy | Change |
|---|---|---|
| v2 | — | Fixed measurement format, LoRA setup |
| v3 | 54% | Baseline — model guesses when uncertain |
| v5 | 83% | Chain-of-thought + rank 8→16 (+29) |
| v6 | 77% | Pink Elephant Effect (−6) |
| v7 | 85% | Removed negation, positive anchors |
| v8 | 89% | Priority rules + constraint verification |
v5: Chain-of-thought — the biggest single jump
Before v5, the training data went straight from scenario to decision. The model saw inputs and outputs but not the reasoning path between them.
At v5, we added explicit reasoning steps to every training sample: check the measurement zone, verify constraints, evaluate available actions, then decide. We also bumped LoRA rank from 8 to 16 because the model needed more adapter capacity for multi-step logic.
The result — 54% to 83% — was the largest improvement in the entire project. Chain-of-thought isn’t just a prompting trick for production use. It fundamentally changes what a fine-tuned model can learn.
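The shape of a v5-style training target, with the reasoning externalized before the JSON. Step wording and the plan schema here are illustrative:

```python
import json

# v5-style training target: explicit reasoning steps first, JSON plan last.
# Step wording and the plan schema are illustrative, not the actual text.

def make_target(steps: list[str], plan: dict) -> str:
    reasoning = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    return f"Reasoning:\n{reasoning}\n\n{json.dumps(plan)}"

target = make_target(
    ["Flow output 72% is above the 70% temporary-fix threshold.",
     "Required tools (valve kit VK-300, torque wrench TW-8) are available.",
     "Power is stable, so a temporary fix is permitted."],
    {"decision": "temporary_fix"},
)
```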
v6 → v7: The Pink Elephant Effect
This was the most instructive mistake.
At 83%, we analyzed the errors. The model sometimes confused similar decisions — calling something “escalate” when it should be “full replacement”, or vice versa. Logical response: add disambiguation to the training data. Make it explicit.
We added text like:
“This is NOT an escalation case. This is NOT a shutdown or replacement case.”
Accuracy dropped from 83% to 77%. Ten new errors appeared.
The mechanism: transformer attention activates on tokens regardless of negation context. When the training data says “not escalate”, the attention mechanism lights up on “escalate”. The model is now thinking about escalation — weighing it, considering it — even though the text says not to. Autoregressive models generate token by token. They can’t “un-think” a concept once it’s been activated.
The fix at v7: remove every “not X” phrase. Replace with positive-only anchors:
“Confirmed: this is a temporary fix case. Flow output is at 72%, above the 70% threshold for temporary fixes. All required tools (valve kit VK-300, torque wrench TW-8) are available. Power is stable.”
Result: 77% → 85%.
The principle applies beyond fine-tuning. If you’re writing system prompts, few-shot examples, or RAG instructions — describe what something is. Describing what it isn’t can produce the opposite of what you intend. The Pink Elephant Effect is weaker in large models (they have more capacity to process negation), but it never fully goes away.
v8: Data quality is the only lever that matters
v8 reached 89% with the same hyperparameters as v5. No changes to rank, dropout, learning rate, or training duration. The only change: better training text — explicit priority rules, constraint verification before action selection.
From v5 to v8, every accuracy improvement came from rewriting training data. The hyperparameters were “good enough” at v5 and stayed good enough. If your fine-tuned model has plateaued, look at your data before touching your config.
The remaining hard boundary
Even at 89%, there’s a clear pattern in the errors. The hardest confusion: escalate vs. full replacement. Both triggered by the same measurement thresholds, differentiated by a single variable — whether the right tools are available.
6 of 11 remaining errors are this exact confusion. The next lever: contrastive training pairs where only tools availability differs, forcing the model to learn that one variable is the deciding factor.
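Such contrastive pairs are straightforward to generate programmatically. A sketch, with hypothetical field names:

```python
# Contrastive-pair sketch for the escalate vs. full_replacement confusion:
# two samples identical except for tool availability, so the model has to
# learn that this one variable decides the outcome. Field names illustrative.

def contrastive_pair(base_scenario: dict) -> list[dict]:
    with_tools = {**base_scenario, "tools": "full",
                  "decision": "full_replacement"}
    without_tools = {**base_scenario, "tools": "none",
                     "decision": "escalate"}
    return [with_tools, without_tools]

pair = contrastive_pair({"subsystem": "pneumatics", "measurement_pct": 60})
# the two samples differ only in "tools" and, consequently, "decision"
```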
Step 6 — From Training Weights to a Phone
Two separate challenges: converting the model to a mobile-friendly format, and building the iOS integration.
Export: the silent failure
The model trains in MLX format (Apple’s ML framework). iPhones run llama.cpp with GGUF format. The conversion pipeline:
MLX LoRA → Fused 4-bit → Dequant bf16 → GGUF f16 → GGUF Q4_K_M (2.2GB)
The trap: MLX uses its own 4-bit quantization format. llama.cpp’s converter doesn’t understand it. If you skip the dequantization step, you get a GGUF file that loads without errors, runs inference, and produces garbage. No warning. No crash. Just a model that confidently outputs nonsense.
The fix is one flag (--dequantize). Finding out that the flag was needed cost hours. The lesson: always validate exports with your evaluation pipeline, never by visual inspection.
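That validation can be automated as an export gate: rerun the eval set through the converted model and compare against the pre-export accuracy. A sketch; `generate` is a stand-in for actual GGUF inference via llama.cpp, and the names and thresholds are ours:

```python
# Export gate sketch: a silently broken conversion loads and runs fine but
# shows up as a large accuracy drop on the eval set. generate() stands in
# for real GGUF inference; names and thresholds are illustrative.

def export_accuracy(generate, eval_set):
    correct = sum(generate(s["prompt"]) == s["decision"] for s in eval_set)
    return correct / len(eval_set)

def check_export(generate, eval_set, baseline=0.89, tolerance=0.02):
    # pass only if the exported model stays close to the pre-export number
    return export_accuracy(generate, eval_set) >= baseline - tolerance
```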
iOS: four libraries tried, one wrapper written
The llama.cpp Swift ecosystem is not mature. We tried four wrappers:
| Library | What happened |
|---|---|
| SwiftLlama | Outdated tokenizer, runtime crashes |
| LocalLLMClient | C++ headers broke iOS builds |
| llama-cpp-swift | Missing Package.swift for the dependency |
| mattt/llama.swift | Compiled — but raw C API only |
The last one worked, but it only exposes low-level C functions: llama_model_load_from_file(), llama_decode(), llama_sampler_sample(). No Swift-native interface.
We wrote a 738-line wrapper on top that handles:
- Model lifecycle (load, unload, memory management)
- Token streaming via Swift's AsyncThrowingStream
- KV cache management for multi-turn conversations
- Memory pressure handling: auto-unloads the model on low-memory warnings
- Thermal monitoring: blocks inference if the device overheats
- Swift 6 strict concurrency (Sendable compliance)
- Metal GPU acceleration on physical devices
The app’s core feature is Plan mode: technician selects a subsystem, inputs a measurement, and gets a structured JSON maintenance plan rendered on-device. A basic chat interface exists for testing model inference directly, but free-form conversational interaction is a future extension, not a shipping feature.
At this ecosystem maturity level, the right strategy is thin bindings and your own wrapper. Opinionated libraries break on every llama.cpp update.
The Numbers
- 89% decision accuracy
- 98% fact grounding
- 99%+ schema validity
- 2.2GB model size (Q4_K_M)
- ~12 tokens/sec on iPhone
- ~50ms adapter swap time
Cost
| Phase | Hours |
|---|---|
| Training (8 iterations) | ~26h |
| Evaluation inference | ~9h |
| Export (4 GGUF exports) | ~1.3h |
| Data generation & misc | ~1h |
| Total compute | ~37h |
Hardware: Mac mini M2 Pro, ~40W sustained. Total energy: ~1.8 kWh. Total cost: €0.55.
An A100 GPU would have been ~10x faster: the full project in 3–4 hours instead of 37. But the bottleneck was never waiting for training to finish. It was analyzing evaluation results, understanding why the model confused escalation with replacement, and rewriting training data. The Mac mini fit naturally into an iteration cycle: start training, analyze the previous run's results, write better data for the next one.
What We Learned
Seven things we’d tell ourselves at the start:
1. Training data quality is the main lever that moves.
From v5 to v8, hyperparameters were frozen. Every accuracy gain came from rewriting training text.

2. Don't use negation in training data — or anywhere a model reads.
The Pink Elephant Effect: saying "not X" makes the model think about X. Describe what things are. Avoid describing what they aren't. This applies to fine-tuning, system prompts, and RAG alike.

3. Chain-of-thought transforms fine-tuning, not just prompting.
Externalizing reasoning in training data (+29% accuracy) was a bigger lever than any hyperparameter change. The model doesn't just learn to mimic the output — it learns the reasoning path.

4. SLMs are production-viable for structured, domain-specific tasks.
A 3.8B model on a phone makes medical equipment decisions at 89% accuracy with 98% fact grounding. You don't need a 100B+ model for everything.

5. Budget time for mobile integration — the ecosystem is immature.
Four out of four Swift llama.cpp wrappers failed. Plan to write your own integration layer. This will change as the ecosystem matures, but as of early 2026, it's still build-your-own territory.

6. Validate model exports with your eval pipeline, never by eye.
Silent format conversion failures produce models that load, run, and output confident garbage. The only reliable test is your evaluation suite.

7. You don't need a GPU cluster.
Two weeks on a Mac mini, €0.55 in electricity, 89% accuracy. The bottleneck is thinking about your data, not waiting for compute.

