Shipping a CLI on FoundationModels
I recently built scg — a Swift CLI tool that generates commit messages using Apple’s on-device generative model. No API keys, no cloud calls, no cost per token. Just your Mac, your diff, and a local model. The experience taught me a lot about what it’s like to build real tools on top of Apple’s FoundationModels framework, and where on-device inference stands today.
Why On-Device?
Commit messages are one of those things that sit in an awkward spot. Writing them well takes real attention — you’re summarizing intent, not just describing changes. But the cost of getting them wrong is low enough that reaching for a cloud API feels like overkill. You don’t want to pay per-token for commit messages. You don’t want to manage API keys. And for many developers and organizations, sending your diffs to an external service is a non-starter for privacy reasons.
Apple’s FoundationModels framework, shipping with macOS Tahoe (26), occupies an interesting niche: it gives you a generative model that runs entirely on-device, with no network calls. For a tool like scg, this is ideal. The model is always available (no cold starts from a remote server), it’s free, and your code never leaves your machine.
Working with FoundationModels
The API surface is straightforward. You create a LanguageModelSession, give it system instructions, and ask it to respond. Here’s roughly what the core generation looks like in scg:
import FoundationModels

let session = LanguageModelSession(
    model: SystemLanguageModel.default,
    instructions: systemPrompt
)

let response = try await session.respond(
    generating: CommitDraft.self,
    options: generationOptions
) {
    userPrompt
}
One of the nicest features is structured output. You can define a Swift type annotated with the @Generable macro, and the framework will coerce the model’s output into that shape:
@Generable(description: "A git commit message with subject and optional body.")
struct CommitDraft: Hashable, Codable, Sendable {
    @Guide(description: "Subject line (max 50 chars). Imperative mood.")
    var subject: String

    @Guide(description: "Optional body explaining WHY the change was made.")
    var body: String?
}
The @Guide annotations serve as inline instructions to the model — they shape the output without you needing to repeat formatting rules in the prompt itself. This is a genuinely pleasant developer experience. You define a Swift struct with annotations, and the framework handles the structured extraction. No regex parsing, no JSON schema negotiation, no hoping the model remembers your formatting instructions.
There’s one practical hurdle worth noting: Terminal apps using FoundationModels require Full Disk Access to be enabled for your terminal emulator. The first time you run scg, you may also get an Apple Intelligence access prompt. These are one-time setup steps, but they’re the kind of thing that can trip up first-time users.
4,096 Tokens Is All You Get
Here’s the first thing that hits you. Cloud models like Claude give you 200k tokens of context. Apple’s on-device model gives you roughly 4,096. That’s the entire prompt: system instructions, metadata, and diff content. All of it.
For a small commit touching two or three files, this is fine. For a refactor touching twenty files? You’re looking at maybe 6 lines of diff per file — if you’re lucky. For a large feature branch? Forget about fitting the diff at all. You’re in triage mode, deciding which files the model even gets to see.
This isn’t a theoretical constraint. It’s the central engineering challenge of building on FoundationModels. Every design decision in scg flows from this limitation.
You Don’t Have a Tokenizer
It gets worse. Apple doesn’t expose a tokenizer for the on-device model. You can’t count tokens before sending a prompt. You’re estimating. In scg, the heuristic is crude: 4 characters ≈ 1 token. That’s it. No BPE, no vocabulary lookup, just character division.
static func tokenEstimate(forCharacterCount count: Int) -> Int {
    count / 4
}
This means you’re budgeting blind. The PromptBatchPlanner reserves 15% headroom on top of the 4,096 ceiling to absorb estimation error. If you underestimate and exceed the context window, the framework throws exceededContextWindowSize at runtime — and you only find out after you’ve already built and sent the prompt. The only way to validate your estimates against reality is to run Apple’s Instruments profiler and manually compare.
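The headroom math is simple enough to sketch. This is my own reconstruction, not scg’s actual code — the names (`TokenBudget`, `usableTokens`, `fits`) are made up — but the numbers (4,096-token window, 4 characters per token, 15% headroom) come from the description above:

```swift
// A sketch of blind token budgeting: a 4,096-token window, a crude
// 4-chars-per-token estimate, and 15% headroom to absorb estimation error.
struct TokenBudget {
    static let contextWindow = 4096
    static let headroomFraction = 0.15

    // Crude estimate: 4 characters ≈ 1 token. No BPE, no vocabulary lookup.
    static func tokenEstimate(forCharacterCount count: Int) -> Int {
        count / 4
    }

    // Tokens available for prompt content after reserving headroom.
    static var usableTokens: Int {
        Int(Double(contextWindow) * (1.0 - headroomFraction))
    }

    // Does this prompt plausibly fit? (We only find out for sure at runtime.)
    static func fits(_ prompt: String) -> Bool {
        tokenEstimate(forCharacterCount: prompt.count) <= usableTokens
    }
}
```

The point of the headroom is that the estimate is wrong in both directions; reserving 15% trades some usable context for fewer `exceededContextWindowSize` failures.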
The Model Copies When Confused
One behavior I didn’t expect: when given both a high-level summary and detailed diff content, the model tends to copy the summary verbatim instead of analyzing the diffs. I discovered this when implementing a two-pass analysis for large changesets. The first pass generates an overview from file metadata. The second pass processes batches of diffs. My initial approach passed the overview summary to each batch for context. The result: every batch produced nearly identical output, parroting the overview instead of reading the diff.
The fix was counterintuitive — strip the summary text from batch prompts entirely and only pass minimal metadata (the category label and key file paths). The model does better when it has less context, as long as that context is the right context. With a cloud model, more context generally helps. With this model, more context can actively mislead.
Quality Is Inconsistent
The model sometimes nails it. A clean, concise commit message that captures the intent of the change. Other times, the output is generic (“Update files”), slightly wrong, or weirdly formatted. The variance is high enough that scg includes retry logic with exponential backoff — requests can time out (the model sometimes stalls), and retries use progressively longer timeouts (30s → 60s → 120s).
var maxAttempts: Int = 3
var requestTimeout: TimeInterval = 30

// Each retry doubles the timeout
let attemptTimeout = configuration.requestTimeout * pow(2.0, Double(attempt - 1))
This variance is why scg is explicitly human-in-the-loop. The model generates a draft, and then you review it. You can accept, edit, or regenerate. Editing a 70% correct draft is still faster than writing from scratch. But you’d never want to ship this output unreviewed into an automated pipeline — at least not today.
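Assembled into a loop, the retry logic looks roughly like this. It’s a hypothetical sketch, not scg’s actual implementation: `withRetries` and `GenerationError` are my own names, and the real call into the model session is replaced by a generic `generate` closure that receives the per-attempt timeout:

```swift
import Foundation

// Sketch of retry-with-exponential-backoff around a generation call.
struct RetryConfiguration {
    var maxAttempts: Int = 3
    var requestTimeout: TimeInterval = 30
}

enum GenerationError: Error { case timedOut }

func withRetries<T>(
    _ configuration: RetryConfiguration,
    generate: (TimeInterval) throws -> T
) throws -> T {
    var lastError: Error = GenerationError.timedOut
    for attempt in 1...configuration.maxAttempts {
        // Each retry doubles the timeout: 30s → 60s → 120s.
        let attemptTimeout = configuration.requestTimeout * pow(2.0, Double(attempt - 1))
        do {
            return try generate(attemptTimeout)
        } catch {
            lastError = error
        }
    }
    throw lastError
}
```

The lengthening timeouts matter because a stalled generation and a slow generation look identical from the outside; giving later attempts more time distinguishes the two.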
Fighting for Every Token
The constraints above forced me to build an entire subsystem just for fitting useful content into 4,096 tokens. It’s easily the most complex part of the codebase, and the part I’m least proud of needing.
Aggressive compaction. The prompt builder starts with up to 50 lines of diff per file across 12 files. If that blows the budget, it iteratively reduces — fewer lines per file, then fewer files — until the estimate fits. The loop has hard minimums (3 files, 6 lines each) below which it just ships the prompt and hopes.
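That reduction loop can be sketched as follows. The function name and the exact shrink step are mine; the starting limits (50 lines × 12 files) and hard minimums (6 lines × 3 files) are from the description above, and the token estimator is injected as a closure since the real one is only a heuristic anyway:

```swift
// Sketch of iterative prompt compaction: shrink lines-per-file, then
// file count, until the token estimate fits the budget or we hit floors.
func compactionPlan(
    estimatedTokens: (_ linesPerFile: Int, _ fileCount: Int) -> Int,
    budget: Int
) -> (linesPerFile: Int, fileCount: Int) {
    var linesPerFile = 50
    var fileCount = 12
    while estimatedTokens(linesPerFile, fileCount) > budget {
        if linesPerFile > 6 {
            linesPerFile = max(6, linesPerFile - 10)  // trim depth first
        } else if fileCount > 3 {
            fileCount -= 1                            // then trim breadth
        } else {
            break  // at the floor: ship the prompt and hope
        }
    }
    return (linesPerFile, fileCount)
}
```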
Smart snippet extraction. Instead of taking the first N lines of a diff (which is mostly unchanged context), scg prioritizes actual change lines (+/-) and hunk headers (@@). Runs of unchanged context in the middle get collapsed to ... markers. Every line that isn’t a change is a line wasted.
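A simplified sketch of that extraction, assuming the diff arrives as an array of lines (the real implementation surely handles more edge cases): keep hunk headers and +/- lines, and collapse each run of unchanged context into a single `...` marker.

```swift
// Keep change lines (+/-) and hunk headers (@@); collapse runs of
// unchanged context lines into a single "..." placeholder.
func extractSnippet(from diffLines: [String]) -> [String] {
    var result: [String] = []
    var inContextRun = false
    for line in diffLines {
        let isChange = line.hasPrefix("+") || line.hasPrefix("-")
        let isHunkHeader = line.hasPrefix("@@")
        if isChange || isHunkHeader {
            result.append(line)
            inContextRun = false
        } else if !inContextRun {
            result.append("...")  // first line of a context run
            inContextRun = true   // subsequent context lines are dropped
        }
    }
    return result
}
```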
File importance scoring. When allocating scarce token budget across files, not all files are equal. Type definitions score +30. Protocol conformances score +25. Test files score -15. Generated files score -50. Lock files score -40. The scorer determines which files get full snippets and which get metadata-only summaries.
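A sketch of the scorer using the weights quoted above. The weights are from the description; the detection heuristics (substring and path-suffix checks) are my own stand-ins for whatever scg actually does:

```swift
import Foundation

// Score a file's diff for token-budget allocation. Positive scores get
// full snippets; negative scores get metadata-only summaries.
func importanceScore(path: String, diff: String) -> Int {
    var score = 0
    // High-signal changes: type definitions, protocol conformances.
    if diff.contains("struct ") || diff.contains("class ") || diff.contains("enum ") {
        score += 30
    }
    if diff.contains("extension ") {  // crude stand-in for conformance detection
        score += 25
    }
    // Low-signal files: tests, generated code, lock files.
    if path.hasSuffix("Tests.swift") { score -= 15 }
    if path.contains(".generated.") { score -= 50 }
    if path.hasSuffix(".lock") || path.hasSuffix("Package.resolved") { score -= 40 }
    return score
}
```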
Semantic hints to compensate for lost context. Since the model can’t see full diffs, scg programmatically detects what kind of changes each file contains — imports, type definition, protocol conformance — and surfaces those as structured labels. With a cloud model, you’d trust it to notice. With a smaller model, you have to tell it explicitly.
- Sources/Core/LLMClient.swift [modified; staged; +45/-12]
  changes: imports, type definition
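Detection along these lines can be sketched by scanning the added lines of a diff. The heuristics here are my own simplification of whatever scg does; the label strings match the example output above:

```swift
import Foundation

// Derive compact change-kind labels ("imports, type definition") from
// the added (+) lines of a file's diff.
func changeHints(for diff: String) -> String {
    let added = diff.split(separator: "\n")
        .filter { $0.hasPrefix("+") }
        .map(String.init)
    var hints: [String] = []
    if added.contains(where: { $0.contains("import ") }) {
        hints.append("imports")
    }
    if added.contains(where: { $0.contains("struct ") || $0.contains("class ") || $0.contains("enum ") }) {
        hints.append("type definition")
    }
    if added.contains(where: { $0.contains("extension ") }) {
        hints.append("protocol conformance")
    }
    return hints.joined(separator: ", ")
}
```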
Batch processing. When a changeset is too large for a single prompt, the PromptBatchPlanner splits files into batches. Each batch generates a partial draft in its own LanguageModelSession, and a combination prompt merges the results. Related files are grouped semantically — test files stay with their source counterparts — so the model gets coherent context even across batches.
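The grouping step can be sketched like this. It’s a hypothetical reconstruction, not the real PromptBatchPlanner: files are grouped by base name so `FooTests.swift` travels with `Foo.swift`, then groups are packed into batches without being split.

```swift
import Foundation

// Group test files with their source counterparts, then pack groups
// into batches of at most batchSize files without splitting a group.
func planBatches(files: [String], batchSize: Int) -> [[String]] {
    // "Tests/FooTests.swift" and "Sources/Foo.swift" share a base name.
    func baseName(_ path: String) -> String {
        let name = path.split(separator: "/").last.map(String.init) ?? path
        return name.replacingOccurrences(of: "Tests.swift", with: ".swift")
    }
    var groups: [String: [String]] = [:]
    var order: [String] = []
    for file in files {
        let key = baseName(file)
        if groups[key] == nil { order.append(key) }
        groups[key, default: []].append(file)
    }
    var batches: [[String]] = []
    var current: [String] = []
    for key in order {
        let group = groups[key]!
        if !current.isEmpty && current.count + group.count > batchSize {
            batches.append(current)  // flush before the group would overflow
            current = []
        }
        current.append(contentsOf: group)
    }
    if !current.isEmpty { batches.append(current) }
    return batches
}
```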
Prompt Engineering for a Smaller Model
Prompting an on-device model is a different discipline. With Claude, I can write a conversational prompt, include lots of context, and trust the model to figure it out. With a smaller model, every token of context needs to earn its place.
The system prompt in scg is deliberately terse:
Write a git commit message for the code changes shown.
Subject: max 50 chars, imperative mood ("Add X" not "Added X"), describe WHAT changed.
Body: optional, explain WHY if helpful.
Read the diff carefully. Lines with + are additions, - are deletions.
No preamble, no role-playing, no “you are an expert.” Smaller models benefit from directness. The @Guide annotations on the @Generable struct pull their weight here too — they shape the output at the schema level without burning prompt tokens.
Two flags that turned out to be surprisingly important:
Function context (--function-context, on by default) includes entire functions containing changes rather than just the changed lines with minimal surrounding context. It costs more tokens but gives the model enough semantic context to describe what a change does rather than just where it occurs. The tradeoff is worth it for all but the largest diffs.
Rename detection (--detect-renames, on by default) prevents file moves from showing up as a deletion and an addition. Without it, the model sees two unrelated file changes and can’t describe the rename. With it, the model gets renamed from old/path.swift and produces accurate output. A small detail, but it’s the kind of thing that separates a useful tool from a frustrating one.
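Both flags map onto real git diff options: `--function-context` (`-W`) includes the whole enclosing function in each hunk, and `--find-renames` (`-M`) turns delete/add pairs into rename entries. A sketch of how a tool might assemble the underlying git arguments (the `DiffOptions` struct is a stand-in, not scg’s actual configuration type):

```swift
// Map tool-level options onto git diff arguments. --function-context
// and --find-renames are real git options; this struct is illustrative.
struct DiffOptions {
    var functionContext = true  // -W: include whole enclosing functions
    var detectRenames = true    // -M: report renames instead of delete+add
}

func gitDiffArguments(_ options: DiffOptions, staged: Bool) -> [String] {
    var args = ["diff"]
    if staged { args.append("--cached") }
    if options.functionContext { args.append("--function-context") }
    if options.detectRenames { args.append("--find-renames") }
    return args
}
```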
What This Means for Swift Developers
Apple shipping a generative model as a system framework is a meaningful shift. FoundationModels isn’t trying to compete with frontier cloud models. It’s filling a different niche: tasks where privacy matters, where latency matters, where you don’t want to manage infrastructure, and where “good enough” output is genuinely good enough.
Commit message generation is one use case. But the same pattern — structured local inference with a tight feedback loop — applies to plenty of developer tools: code review suggestions, documentation drafts, refactoring summaries, changelog generation. These are all tasks where the cost of a cloud API call outweighs the benefit, but where AI assistance is still valuable.
The constraints are real. You won’t be summarizing entire codebases or doing deep multi-file reasoning with 4k tokens. But the constraints also force better engineering: smarter context selection, structured prompts, and information density over verbosity. Those are good habits regardless of what model you’re targeting.
If you want to try scg, it’s available on GitHub and installable via Homebrew:
brew install Iron-Ham/swift-commit-gen/scg
You’ll need macOS Tahoe (26) with Apple Intelligence enabled, and Xcode 26. Give it a spin on your next commit — and let me know what you think.