We tested 6 AI models on mobile tapping. One scored 94%. One scored 21%

Spatial accuracy is the most common failure point for mobile AI agents. UI-TapBench measures exactly this across 570 annotated screenshots from 20 real production apps.

94.51%

Drizz accuracy

21.72%

GPT-5.1 accuracy

97.18

Drizz F1 score

570

Annotated screenshots

Benchmark results

Six models. One task. A clear hierarchy

All models evaluated zero-shot on identical prompts. A tap is correct if predicted coordinates fall within the ground-truth bounding box. Sorted by accuracy.

94.51%

Drizz tap accuracy — highest across all six models evaluated

Drizz · Accuracy

21.72%

GPT-5.1 accuracy on the same dataset — a model expected to perform far higher

GPT-5.1 · Accuracy

2.7×

Gap between the best and worst F1 scores. Spatial precision is the differentiator

97.18 vs 35.68 · F1

Model	Accuracy	Precision	Recall	F1 Score	Latency p90
Drizz #1	94.51%	96.22%	98.16%	97.18	4.81s
Qwen 3.5-27b	92.98%	94.98%	97.61%	96.28	5.92s
Gemini Pro	89.84%	91.28%	98.28%	94.65	12.27s
Gemini Flash	81.44%	83.78%	96.67%	89.77	8.19s
GPT-5.2 Low	44.83%	45.71%	95.88%	61.91	5.0s
GPT-5.1 Low	21.72%	23.35%	75.61%	35.68	5.55s

p90 = 90th percentile inference time. Zero-shot. Apache 2.0 dataset.
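
The F1 column can be cross-checked against the precision and recall columns. The sketch below assumes F1 is the standard harmonic mean of precision and recall; the function name `f1_score` and the `rows` dictionary are illustrative, with values copied from the table above. Small differences in the last decimal place are expected, since the table's precision and recall are themselves rounded.

```python
def f1_score(precision_pct: float, recall_pct: float) -> float:
    """Harmonic mean of precision and recall, both given in percent."""
    if precision_pct + recall_pct == 0:
        return 0.0
    return 2 * precision_pct * recall_pct / (precision_pct + recall_pct)


# (precision %, recall %, reported F1) copied from the results table.
rows = {
    "Drizz": (96.22, 98.16, 97.18),
    "Qwen 3.5-27b": (94.98, 97.61, 96.28),
    "Gemini Pro": (91.28, 98.28, 94.65),
    "Gemini Flash": (83.78, 96.67, 89.77),
    "GPT-5.2 Low": (45.71, 95.88, 61.91),
    "GPT-5.1 Low": (23.35, 75.61, 35.68),
}

for model, (precision, recall, reported) in rows.items():
    computed = f1_score(precision, recall)
    print(f"{model}: computed F1 {computed:.2f}, reported {reported}")
```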

The formatted PDF. Forward-ready.

Charts, tables, production math, and citation — in a single file built for sharing with your VP or engineering lead. The full report is freely available on this page. The PDF is the shareable version.

Production impact

Why 1.5 percentage points of accuracy is not a small number

Drizz leads Qwen by 1.53 percentage points. In a CI/CD pipeline running 50 test cases with 20 taps each, 20 times a day, that gap creates roughly 300 extra failures every day.

Criteria	Drizz · 94.51% accuracy	Qwen 3.5-27b · 92.98% accuracy
Errors per 1,000 taps	~55	~70
Test cases affected / run	~5–6	
Daily failures (20 CI runs)	1,100	1,400
15-step flow success rate	42.8%	33.0%
Latency p90	4.81s	5.92s
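
The error-budget figures above follow from simple arithmetic on per-tap accuracy. The sketch below is one way to reproduce them; it assumes every tap fails independently at the model's reported error rate, and the function names and default pipeline shape (50 test cases, 20 taps each, 20 CI runs a day) are taken from the text rather than from any Drizz tooling.

```python
def errors_per_1000_taps(accuracy_pct: float) -> float:
    """Expected wrong taps per 1,000 taps at the given accuracy."""
    return (100.0 - accuracy_pct) * 10.0


def daily_failed_taps(
    accuracy_pct: float,
    test_cases: int = 50,
    taps_per_case: int = 20,
    runs_per_day: int = 20,
) -> float:
    """Expected wrong taps per day for the pipeline described in the text."""
    taps_per_day = test_cases * taps_per_case * runs_per_day
    return taps_per_day * (100.0 - accuracy_pct) / 100.0


drizz_daily = daily_failed_taps(94.51)
qwen_daily = daily_failed_taps(92.98)

print(f"Drizz: {errors_per_1000_taps(94.51):.1f} errors per 1,000 taps, "
      f"{drizz_daily:.0f} failed taps per day")
print(f"Qwen:  {errors_per_1000_taps(92.98):.1f} errors per 1,000 taps, "
      f"{qwen_daily:.0f} failed taps per day")
print(f"Daily gap: {qwen_daily - drizz_daily:.0f} failed taps")
```

Rounded to the nearest hundred, these match the 1,100 and 1,400 daily failures quoted for the two models and the gap of about 300.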

Cascading failure math

Mobile test automation is sequential. A wrong tap on step 3 of a 15-step flow doesn't just fail step 3: it corrupts app state through steps 4–15. If each tap succeeds independently, the chance of finishing a flow without error is the per-tap accuracy raised to the number of steps. At 94.51% per-tap accuracy, Drizz completes a 15-step flow without error 42.8% of the time. Qwen at 92.98% drops to 33.0%. A 1.53-point accuracy gap translates to a roughly 30% higher relative probability of end-to-end success.
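
The compounding effect can be sketched in a few lines. This is a minimal model, not the benchmark's own code: it assumes every step is a single tap, taps succeed independently, and accuracy is the same at every step. The function name `flow_success_rate` is illustrative. Under those assumptions the result lands within about a percentage point of the quoted figures (roughly 43% and 34%).

```python
def flow_success_rate(per_tap_accuracy: float, steps: int = 15) -> float:
    """Probability that every tap in a sequential flow is correct.

    Assumes independent taps with identical per-tap accuracy (0.0 to 1.0).
    """
    if not 0.0 <= per_tap_accuracy <= 1.0:
        raise ValueError("per_tap_accuracy must be between 0 and 1")
    return per_tap_accuracy ** steps


drizz = flow_success_rate(0.9451)
qwen = flow_success_rate(0.9298)

print(f"Drizz 15-step success: {drizz:.1%}")
print(f"Qwen 15-step success:  {qwen:.1%}")
print(f"Relative advantage:    {drizz / qwen - 1:.0%}")
```

The same function shows how quickly longer flows degrade: raising `steps` from 15 to 30 squares each success rate.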

Analysis

Why GPT-5.1 scores 21% on a task it should pass

This is not a language failure. GPT-5.1 understands "tap the second option in the list" perfectly. It fails to map that instruction to the correct pixel on a dense mobile screen.

Dense layouts require precise counting, not comprehension

Mobile UIs pack dozens of tappable elements into small screens. "The third item" requires iterating over visually similar rows and landing on the exact bounding box, not recognising a region.

General training doesn't calibrate for pixel-level spatial tasks

LMMs trained on diverse image understanding tasks lack fine-grained spatial calibration for mobile UI interaction. Domain-specific fine-tuning on production app layouts is required.

High recall, catastrophic precision

GPT-5.1 scores 75.61% recall but only 23.35% precision. It taps something, just the wrong thing. A false positive in mobile automation is worse than a miss: it corrupts state and cascades failures forward.

The fastest models failed hardest

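GPT-5.2 runs at 5.0s p90 and GPT-5.1 at 5.55s. Both are close to Drizz's 4.81s and faster than Qwen (5.92s) and both Gemini models, yet they post the two lowest accuracy scores. A fast wrong tap fails the test, moves on, and leaves corrupt app state poisoning every downstream step.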

Methodology

How we ran it. Reproduce it yourself.

We are both benchmark creator and top scorer. Full reproducibility — open dataset, disclosed constants, identical conditions for every model — is our only credibility.

Evaluation conditions

Each model received: one mobile screenshot + one natural language instruction. Prompted to return (x, y) coordinates. A tap is marked correct if predicted coordinates fall within the ground-truth bounding box. Zero-shot. No few-shot examples, no chain-of-thought, no model-specific prompt engineering. Identical prompts across all six models.
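
The correctness rule above can be sketched as a point-in-box check. This is an illustration, not the benchmark's scoring script: the `BoundingBox` field names, the pixel coordinate system with a top-left origin, and the choice to count box edges as inside are all assumptions.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class BoundingBox:
    # Hypothetical layout: pixel coordinates, origin at the top-left corner.
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Edges count as inside; this is an assumption, not a documented rule.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def tap_accuracy(
    taps: Sequence[Tuple[float, float]], boxes: Sequence[BoundingBox]
) -> float:
    """Fraction of predicted taps that land inside their ground-truth box."""
    if len(taps) != len(boxes):
        raise ValueError("need exactly one ground-truth box per predicted tap")
    if not taps:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(taps, boxes))
    return hits / len(taps)


# Two illustrative samples: the first tap lands in its box, the second misses.
boxes = [BoundingBox(100, 200, 300, 260), BoundingBox(40, 900, 1040, 980)]
taps = [(180.0, 230.0), (500.0, 1000.0)]
print(tap_accuracy(taps, boxes))
```

The sketch covers accuracy only. How precision and recall are counted is not specified here, so they are left out.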

Scope

Tap-level spatial precision only. Not end-to-end task completion, multi-step reasoning, or stateful interaction.

Sample Size

570 samples. Sufficient to separate production-viable models from non-viable ones. Dataset expansion planned for v2.

Platform

Android screenshots only. iOS, iPadOS, and web mobile layouts are not included in this release.

Language

English-language interfaces only. RTL layouts and CJK interfaces not evaluated.

Self-Evaluation

Drizz is benchmark creator and top scorer. Mitigated by open dataset + full annotation release under Apache 2.0.

Prompting

Zero-shot only. Performance may differ with few-shot examples or model-specific optimisations.

Open dataset

570 screenshots. 20 apps.
Apache 2.0.

Run your own models. Challenge the scores. Publish your results. That's why we open-sourced it.

Food & Travel

Uber Eats, DoorDash, Instacart, Uber, Airbnb, Booking.com

6 Apps

Social & Entertainment

WhatsApp, Telegram, Reddit, Spotify, YouTube, Netflix

6 Apps

Finance, Health & Work

PayPal, Robinhood, Notion, Todoist, Duolingo, Coursera, MyFitnessPal, Strava

8 Apps
