
How Our Agents Test Their Own iOS Changes

6 min read

The Verification Gap

AI coding agents can write iOS code, run the build, and check if it compiles. But compiling isn’t the same as working. A view might compile perfectly while rendering a blank screen. A button might exist in the view hierarchy but be invisible because its frame is zero.

You close this gap by running the app and looking at it. Agents can’t do that, or at least they couldn’t until recently. They’d write code, confirm it builds, maybe run unit tests, and hand it back to you with “looks good” confidence that was entirely unearned.

We built a skill that closes this gap. The agent builds the app, launches it on a simulator, observes the result, and produces a structured pass/fail report. It doesn’t tap buttons or navigate flows; it looks at what’s on screen and evaluates whether it matches the intent.

The MCP Bridge

This is built on XcodeBuildMCP, an MCP server that exposes Xcode build and simulator operations as tool calls. It gives the agent access to operations it couldn’t perform through shell commands alone:

  • build_run_sim: Build the app, install it on a simulator, and launch it
  • screenshot: Capture the current simulator screen as an image the agent can see
  • snapshot_ui: Return a structured representation of the view hierarchy, including every view’s type, frame, accessibility label, and traits
  • record_sim_video: Start and stop MP4 recording on the simulator
  • start_sim_log_cap / stop_sim_log_cap: Capture app logs during a time window

The agent calls these the same way it calls Read or Bash.
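Getting those tools in front of the agent is a one-time configuration step. A hedged sketch of the registration (the package name and invocation are assumptions; check the XcodeBuildMCP README for the exact command):

```shell
# One-time setup: register XcodeBuildMCP with Claude Code.
# Package name and flags below are illustrative assumptions.
claude mcp add XcodeBuildMCP -- npx -y xcodebuildmcp@latest
```

After registration, the `mcp__XcodeBuildMCP__*` tools show up alongside the built-in ones.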

The Workflow

The skill is a Claude Code skill file with five steps. The agent follows it whenever you say “verify this change” or “does this work on simulator.”
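For orientation, a minimal sketch of what such a skill file might look like (the path, name, and wording are illustrative, not the actual skill):

```markdown
---
name: ios-verify
description: Build, launch, and visually verify an iOS change on the simulator
---

1. Read the diff and propose pass/fail criteria; confirm them with the user.
2. Build with `make build`, then set XcodeBuildMCP session defaults.
3. Record video; screenshot, snapshot the hierarchy, and capture logs per criterion.
4. Flag interactive elements that are missing accessibility labels.
5. Produce a structured pass/fail report with evidence paths.
```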

Step 1: Define Criteria

Before building anything, the agent reads the diff and proposes what “working” means for this specific change. If you modified a compose view, the criteria might be: “compose screen renders with To/CC/BCC fields visible, keyboard appears when tapping the body field, attachment button is present.” If you fixed a crash on empty inbox, the criteria might be: “app launches without crashing, empty state view is displayed.”

The agent presents these criteria to you for confirmation. It shouldn’t be the one deciding what counts as success for a change it just made. You provide the ground truth.

Step 2: Build and Launch

The agent builds using the Makefile (not XcodeBuildMCP) because the Makefile has the correct flags:

make build

Then it configures XcodeBuildMCP. This is where the worktree-aware simulator setup matters: `make simulator-setup` returns the right simulator for the current environment, and the agent passes it along with the workspace path, scheme, and bundle ID:

mcp__XcodeBuildMCP__session_set_defaults
  workspacePath: /path/to/MyApp.xcworkspace
  scheme: "MyApp-Prod"
  simulatorName: "MyApp-a1b2c3d4"  # worktree-specific clone
  bundleId: "com.example.myapp"

Then build_run_sim boots the simulator, installs the app, and launches it.
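The worktree-specific clone name just needs to be stable per worktree so parallel checkouts never fight over one simulator. A minimal sketch of the idea (the hashing choice, device name, and app name are illustrative; the real `make simulator-setup` target is project-specific):

```shell
#!/bin/sh
# Illustrative: derive a stable per-worktree simulator name.
# `cksum` stands in for whatever hash the real target uses.
worktree="${1:-$PWD}"
suffix=$(printf '%s' "$worktree" | cksum | cut -d' ' -f1)
sim_name="MyApp-${suffix}"
echo "$sim_name"

# First run only (macOS): clone a base device under that name, e.g.
#   xcrun simctl clone "iPhone 16" "$sim_name"
```

The same worktree path always maps to the same simulator, and two worktrees map to different ones.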

Step 3: Observe

The agent starts video recording, then works through each criterion in a loop:

  1. Screenshot the current screen
  2. Snapshot the view hierarchy to verify elements exist and are positioned correctly
  3. Capture logs to check for errors that aren’t visible in the UI

The view hierarchy snapshot is the most useful tool here. The agent can verify that a button has the right label, that a list has the expected number of rows, or that a view exists at all, without needing to interact with the app.

mcp__XcodeBuildMCP__snapshot_ui

→ Returns:
  UIWindow
    NavigationStack
      InboxView
        List (24 rows)
          ThreadCell "Meeting tomorrow" — accessibilityLabel: "Meeting tomorrow, from Jane, 2 minutes ago"
          ThreadCell "Build failed" — accessibilityLabel: "Build failed, from CI Bot, 15 minutes ago"
          ...
        FloatingActionButton — accessibilityLabel: "Compose"

The agent reads this like it reads code. It can check whether the compose button exists, whether the inbox has rows, whether the labels are meaningful. All of it gets evaluated against the criteria from Step 1.

For criteria that need a different screen, the agent notes what it can observe and flags that manual navigation is still needed.
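Each criterion ultimately reduces to a predicate over the snapshot text. A toy sketch of one such check (the snapshot string is a hand-written stand-in for `snapshot_ui` output; the agent evaluates this by reading the tool result directly, not by shelling out):

```shell
#!/bin/sh
# Toy criterion check against a saved view-hierarchy snapshot.
snapshot='InboxView
  List (24 rows)
  FloatingActionButton accessibilityLabel: "Compose"'

if printf '%s\n' "$snapshot" | grep -q 'FloatingActionButton.*accessibilityLabel: "Compose"'; then
  echo "PASS: compose button present with label"
else
  echo "FAIL: compose button missing or unlabeled"
fi
```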

Step 4: Accessibility Audit

This happens as a side effect of Step 3, not as a separate pass. While inspecting the view hierarchy for each criterion, the agent checks for interactive elements (buttons, toggles, links, text fields) that are missing `accessibilityLabel`. A tappable button with no label means VoiceOver reads the raw image name or nothing at all.

Missing accessibility labels on interactive elements fail the verification. This is intentional: accessibility is a first-class requirement, not a nice-to-have. An agent that verifies a feature works but misses that half the buttons are inaccessible hasn’t done its job.

If the accessibility issues are unrelated to the current change (pre-existing gaps), they’re reported separately as “file an issue” recommendations rather than failing the verification.
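Conceptually the audit is just a filter over the same hierarchy dump: interactive element types with no label attached. A hypothetical sketch (the sample elements are invented for illustration):

```shell
#!/bin/sh
# Hypothetical audit: list interactive elements lacking an
# accessibilityLabel in a hierarchy dump (sample data is invented).
hierarchy='Button "send_icon"
Button "reply" accessibilityLabel: "Reply"
Toggle "cc_bcc"
TextField accessibilityLabel: "Subject"'

unlabeled=$(printf '%s\n' "$hierarchy" \
  | grep -E '^(Button|Toggle|Link|TextField)' \
  | grep -v 'accessibilityLabel:')

if [ -n "$unlabeled" ]; then
  printf 'FAIL: unlabeled interactive elements:\n%s\n' "$unlabeled"
else
  echo "PASS: all interactive elements labeled"
fi
```

Here the unlabeled `send_icon` button and `cc_bcc` toggle would be flagged; the labeled reply button and subject field pass.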

Step 5: Report

The agent stops video recording and produces a structured report:

## Verification Report

**Change:** Add CC/BCC toggle to compose view
**Verdict:** PASS

### Test Criteria Results
| # | Criterion | Result | Notes |
|---|-----------|--------|-------|
| 1 | Compose screen renders with To field | PASS | Screenshot 1 |
| 2 | CC/BCC fields appear on toggle | PASS | Screenshot 2 |
| 3 | Keyboard appears on body tap | PASS | Screenshot 3 |

### Video Recording
Saved to: `/tmp/ios-verify-compose-ccbcc-2026-03-19.mp4`

### Accessibility Issues
- FloatingActionButton: missing accessibilityLabel (pre-existing)

### Recommendations
- File issue for FloatingActionButton a11y label

The video is ephemeral (/tmp/) but useful for attaching to a PR. The screenshots serve as quick-reference evidence.

What This Doesn’t Do

  • This is not a replacement for UI automation. Full end-to-end flows (tap this, navigate there, assert that) require XCUITest or Maestro. This skill only observes static screen state, which is simpler and more reliable.
  • This is not a replacement for unit tests. The skill verifies that the UI renders correctly. It doesn’t test business logic, state transitions, or error handling.
  • This requires a buildable project. If `make build` fails, the whole workflow fails. The CLI tooling foundation needs to be solid first.

Why It Matters

Without this skill, the agent writes code, confirms it compiles, and hands it to you. You’re the quality gate. With it, the agent goes further: it launches the app, looks at the result, checks accessibility, and hands you a report with screenshots and video. You’re still the final reviewer, but you’re reviewing evidence rather than running the app yourself for the first time.

The accessibility audit is the part I didn’t expect to be so valuable. Before this skill, accessibility was something we checked when we remembered to. Now every verification run checks it automatically. We catch far more a11y issues early, and none of it required adding a separate step to anyone’s workflow.