How Our Agents Test Their Own iOS Changes
The Verification Gap
AI coding agents can write iOS code, run the build, and check if it compiles. But compiling isn’t the same as working. A view might compile perfectly while rendering a blank screen. A button might exist in the view hierarchy but be invisible because its frame is zero.
You close this gap by running the app and looking at it. Agents can’t do that, or at least they couldn’t until recently. They’d write code, confirm it builds, maybe run unit tests, and hand it back to you with “looks good” confidence that was entirely unearned.
We built a skill that closes this gap. The agent builds the app, launches it on a simulator, observes the result, and produces a structured pass/fail report. It doesn’t tap buttons or navigate flows; it looks at what’s on screen and evaluates whether it matches the intent.
The MCP Bridge
This is built on XcodeBuildMCP, an MCP server that exposes Xcode build and simulator operations as tool calls. It gives the agent access to operations it couldn’t perform through shell commands alone:
- `build_run_sim`: Build the app, install it on a simulator, and launch it
- `screenshot`: Capture the current simulator screen as an image the agent can see
- `snapshot_ui`: Return a structured representation of the view hierarchy, including every view's type, frame, accessibility label, and traits
- `record_sim_video`: Start and stop MP4 recording on the simulator
- `start_sim_log_cap`/`stop_sim_log_cap`: Capture app logs during a time window
The agent calls these the same way it calls Read or Bash.
The Workflow
The skill is a Claude Code skill file with five steps. The agent follows it whenever you say “verify this change” or “does this work on simulator.”
Step 1: Define Criteria
Before building anything, the agent reads the diff and proposes what “working” means for this specific change. If you modified a compose view, the criteria might be: “compose screen renders with To/CC/BCC fields visible, keyboard appears when tapping the body field, attachment button is present.” If you fixed a crash on empty inbox, the criteria might be: “app launches without crashing, empty state view is displayed.”
The agent presents these criteria to you for confirmation. It shouldn’t be the one deciding what counts as success for a change it just made. You provide the ground truth.
Step 2: Build and Launch
The agent builds using the Makefile (not XcodeBuildMCP) because the Makefile has the correct flags:
make build
Then it configures XcodeBuildMCP. This is where the worktree-aware simulator setup matters: make simulator-setup returns the right simulator for the current environment, and the agent passes it along with the workspace path, scheme, and bundle ID:
mcp__XcodeBuildMCP__session_set_defaults
workspacePath: /path/to/MyApp.xcworkspace
scheme: "MyApp-Prod"
simulatorName: "MyApp-a1b2c3d4" # worktree-specific clone
bundleId: "com.example.myapp"
Then build_run_sim boots the simulator, installs the app, and launches it.
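For intuition, the boot/install/launch sequence corresponds roughly to three `simctl` invocations. A minimal sketch in Python, assuming a prebuilt `.app` bundle; the simulator name, app path, and bundle ID below are illustrative, and the real `build_run_sim` tool also handles the build:

```python
import subprocess

def simctl_steps(sim: str, app_path: str, bundle_id: str) -> list[list[str]]:
    """The simctl invocations that boot a simulator, install an
    app bundle onto it, and launch the app by bundle ID."""
    return [
        ["xcrun", "simctl", "boot", sim],
        ["xcrun", "simctl", "install", sim, app_path],
        ["xcrun", "simctl", "launch", sim, bundle_id],
    ]

def boot_install_launch(sim: str, app_path: str, bundle_id: str) -> None:
    for cmd in simctl_steps(sim, app_path, bundle_id):
        subprocess.run(cmd, check=True)

# Illustrative values (requires Xcode tooling and a built app):
# boot_install_launch("MyApp-a1b2c3d4", "build/MyApp.app", "com.example.myapp")
```

Keeping the command list separate from execution makes the sequence easy to log or dry-run before touching the simulator.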
Step 3: Observe
The agent starts video recording, then works through each criterion in a loop:
- Screenshot the current screen
- Snapshot the view hierarchy to verify elements exist and are positioned correctly
- Capture logs to check for errors that aren’t visible in the UI
The view hierarchy snapshot is the most useful tool here. The agent can verify that a button has the right label, that a list has the expected number of rows, or that a view exists at all, without needing to interact with the app.
mcp__XcodeBuildMCP__snapshot_ui
→ Returns:
UIWindow
  NavigationStack
    InboxView
      List (24 rows)
        ThreadCell "Meeting tomorrow" — accessibilityLabel: "Meeting tomorrow, from Jane, 2 minutes ago"
        ThreadCell "Build failed" — accessibilityLabel: "Build failed, from CI Bot, 15 minutes ago"
        ...
      FloatingActionButton — accessibilityLabel: "Compose"
The agent reads this like it reads code. It can check whether the compose button exists, whether the inbox has rows, whether the labels are meaningful. All of it gets evaluated against the criteria from Step 1.
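Checking a criterion against that snapshot is essentially a tree search. A minimal sketch, assuming a nested-dict representation of the hierarchy; the real `snapshot_ui` output format may differ:

```python
def walk(node):
    """Yield every node in the view hierarchy, depth-first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)

def find(tree, kind=None, label=None):
    """Return the first node matching the given view type and/or
    accessibility label, or None if nothing matches."""
    for node in walk(tree):
        if kind is not None and node.get("type") != kind:
            continue
        if label is not None and node.get("accessibilityLabel") != label:
            continue
        return node
    return None

# Hypothetical snapshot mirroring the example output above
tree = {
    "type": "UIWindow",
    "children": [
        {"type": "List", "rows": 24, "children": [
            {"type": "ThreadCell",
             "accessibilityLabel": "Meeting tomorrow, from Jane, 2 minutes ago"},
        ]},
        {"type": "FloatingActionButton", "accessibilityLabel": "Compose"},
    ],
}

# Criteria: the compose button exists, and the inbox has rows
assert find(tree, kind="FloatingActionButton", label="Compose") is not None
assert find(tree, kind="List")["rows"] > 0
```

Each Step 1 criterion reduces to one or two such lookups, which is why static observation covers so much without any tapping.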
For criteria that need a different screen, the agent notes what it can observe and flags that manual navigation is still needed.
Step 4: Accessibility Audit
This happens as a side effect of Step 3, not as a separate pass. While inspecting the view hierarchy for each criterion, the agent checks for interactive elements (buttons, toggles, links, text fields) that are missing accessibilityLabel. A tappable button with no label means VoiceOver reads the raw image name or nothing at all.
Missing accessibility labels on interactive elements fail the verification. This is intentional: accessibility is a first-class requirement, not a nice-to-have. An agent that verifies a feature works but misses that half the buttons are inaccessible hasn’t done its job.
If the accessibility issues are unrelated to the current change (pre-existing gaps), they’re reported separately as “file an issue” recommendations rather than failing the verification.
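The audit itself is a filter over the same hierarchy snapshot. A sketch under the same assumed dict shape; the set of interactive element types is illustrative:

```python
# Hypothetical set of interactive element types to audit
INTERACTIVE = {"Button", "Toggle", "Link", "TextField", "FloatingActionButton"}

def missing_a11y_labels(tree):
    """Collect interactive elements that lack an accessibilityLabel,
    walking the hierarchy iteratively."""
    flagged, stack = [], [tree]
    while stack:
        node = stack.pop()
        if node.get("type") in INTERACTIVE and not node.get("accessibilityLabel"):
            flagged.append(node["type"])
        stack.extend(node.get("children", []))
    return flagged

tree = {
    "type": "UIWindow",
    "children": [
        {"type": "Button", "accessibilityLabel": "Send"},
        {"type": "FloatingActionButton"},  # no label: gets flagged
    ],
}
assert missing_a11y_labels(tree) == ["FloatingActionButton"]
```

Because the hierarchy is already in hand from Step 3, this check is effectively free, which is what makes running it on every verification viable.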
Step 5: Report
The agent stops video recording and produces a structured report:
## Verification Report
**Change:** Add CC/BCC toggle to compose view
**Verdict:** PASS
### Test Criteria Results
| # | Criterion | Result | Notes |
|---|-----------|--------|-------|
| 1 | Compose screen renders with To field | PASS | Screenshot 1 |
| 2 | CC/BCC fields appear on toggle | PASS | Screenshot 2 |
| 3 | Keyboard appears on body tap | PASS | Screenshot 3 |
### Video Recording
Saved to: `/tmp/ios-verify-compose-ccbcc-2026-03-19.mp4`
### Accessibility Issues
- FloatingActionButton: missing accessibilityLabel (pre-existing)
### Recommendations
- File issue for FloatingActionButton a11y label
The video is ephemeral (it lives in `/tmp/`) but useful for attaching to a PR. The screenshots serve as quick-reference evidence.
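Rendering the criteria table from per-criterion results is plain string formatting. A hypothetical sketch (the helper name and tuple shape are my own, not part of the skill):

```python
def criteria_table(results):
    """Render (criterion, passed, note) tuples as a markdown table
    matching the report format above."""
    lines = [
        "| # | Criterion | Result | Notes |",
        "|---|-----------|--------|-------|",
    ]
    for i, (criterion, passed, note) in enumerate(results, start=1):
        verdict = "PASS" if passed else "FAIL"
        lines.append(f"| {i} | {criterion} | {verdict} | {note} |")
    return "\n".join(lines)

print(criteria_table([
    ("Compose screen renders with To field", True, "Screenshot 1"),
    ("CC/BCC fields appear on toggle", True, "Screenshot 2"),
]))
```

Keeping the report machine-generated means every run produces the same shape, so reviewers can skim straight to the FAIL rows.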
What This Doesn’t Do
- This is not a replacement for UI automation. Full end-to-end flows (tap this, navigate there, assert that) require XCUITest or Maestro. This skill only observes static screen state, which is simpler and more reliable.
- This is not a replacement for unit tests. The skill verifies that the UI renders correctly. It doesn’t test business logic, state transitions, or error handling.
- This requires a buildable project. If `make build` fails, the whole workflow fails. The CLI tooling foundation needs to be solid first.
Why It Matters
Without this skill, the agent writes code, confirms it compiles, and hands it to you. You’re the quality gate. With it, the agent goes further: it launches the app, looks at the result, checks accessibility, and hands you a report with screenshots and video. You’re still the final reviewer, but you’re reviewing evidence rather than running the app yourself for the first time.
The accessibility audit is the part I didn’t expect to be so valuable. Before this skill, accessibility was something we checked when we remembered to. Now every verification run checks it automatically. We catch far more a11y issues early, and none of it required adding a separate step to anyone’s workflow.