Drizz raises $2.7M in seed funding • Featured on Forbes

We tested 6 AI models on mobile tapping. One scored 94%. One scored 21%

Spatial accuracy is the most common failure point for mobile AI agents. UI-TapBench measures exactly this, across 570 annotated screenshots from 20 real production apps.

Drizz accuracy: 94.51%
GPT-5.1 accuracy: 21.72%
Drizz F1 score: 97.18
Annotated screenshots: 570

Six models. One task. A clear hierarchy

All models evaluated zero-shot on identical prompts. A tap is correct if predicted coordinates fall within the ground-truth bounding box. Sorted by accuracy.

94.51%
Drizz tap accuracy, the highest across all six models evaluated

21.72%
GPT-5.1 accuracy on the same dataset, from a model expected to perform far higher

2.7×
Gap between the best and worst F1 scores (97.18 vs 35.68). On accuracy the gap is over 4× (94.51% vs 21.72%). Spatial precision is the differentiator.

Model          Accuracy   Precision   Recall    F1 Score   Latency p90
Drizz (#1)     94.51%     96.22%      98.16%    97.18      4.81s
Qwen 3.5-27b   92.98%     94.98%      97.61%    96.28      5.92s
Gemini Pro     89.84%     91.28%      98.28%    94.65      12.27s
Gemini Flash   81.44%     83.78%      96.67%    89.77      8.19s
GPT-5.2 Low    44.83%     45.71%      95.88%    61.91      5.0s
GPT-5.1 Low    21.72%     23.35%      75.61%    35.68      5.55s
p90 = 90th percentile inference time. Zero-shot. Apache 2.0 dataset.
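
The F1 column can be cross-checked from the precision and recall columns. The sketch below assumes F1 is the standard harmonic mean of precision and recall; the page does not spell out its formula, so treat this as a sanity check and not as the benchmark's own scoring code. The function name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the results table, converted to fractions.
results = {
    "Drizz": (0.9622, 0.9816),
    "Qwen 3.5-27b": (0.9498, 0.9761),
    "Gemini Pro": (0.9128, 0.9828),
    "Gemini Flash": (0.8378, 0.9667),
    "GPT-5.2 Low": (0.4571, 0.9588),
    "GPT-5.1 Low": (0.2335, 0.7561),
}

for model, (precision, recall) in results.items():
    print(f"{model}: F1 = {100 * f1_score(precision, recall):.2f}")
```

Because the inputs are already rounded to two decimals, the recomputed values can differ from the published F1 scores by about 0.01.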

The formatted PDF. Forward-ready.

Charts, tables, production math, and citation, in a single file built for sharing with your VP or engineering lead. The full report is freely available on this page. The PDF is the shareable version.


Why a 1.5-point accuracy gap is not a small number

Drizz leads Qwen by 1.53 percentage points. In a CI/CD pipeline running 50 test cases with 20 taps each, 20 times a day, that gap creates roughly 300 extra failed taps every day.
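
The daily figure follows from simple arithmetic. The sketch below assumes 50 test cases of 20 taps each (1,000 taps per run) and 20 CI runs per day, the run count used in the comparison table; the function name and defaults are illustrative.

```python
def failed_taps_per_day(accuracy: float,
                        test_cases: int = 50,
                        taps_per_case: int = 20,
                        runs_per_day: int = 20) -> float:
    """Expected number of wrong taps per day at a given per-tap accuracy."""
    taps_per_run = test_cases * taps_per_case  # 1,000 taps with the defaults
    return (1 - accuracy) * taps_per_run * runs_per_day


drizz = failed_taps_per_day(0.9451)
qwen = failed_taps_per_day(0.9298)
print(f"Drizz: {drizz:.0f} failed taps/day")
print(f"Qwen:  {qwen:.0f} failed taps/day")
print(f"Gap:   {qwen - drizz:.0f} extra failed taps/day")
```

Unrounded, this gives about 1,098 and 1,404 failed taps per day, a gap of about 306. Rounding the per-1,000 error counts to 55 and 70 first yields the 1,100, 1,400, and 300 quoted on this page.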

Criteria                       Drizz (94.51% accuracy)   Qwen 3.5-27b (92.98% accuracy)
Errors per 1,000 taps          55                        70
Test cases affected / run      ~5–6                      ~7
Daily failures (20 CI runs)    1,100                     1,400
15-step flow success rate      42.9%                     33.6%
Latency p90                    4.81s                     5.92s

Cascading failure math
Mobile test automation is sequential. A wrong tap on step 3 of a 15-step flow doesn't just fail step 3; it corrupts app state through steps 4–15. At 94.51% per-tap accuracy, and treating each tap as independent, Drizz completes a 15-step flow without error 42.9% of the time (0.9451^15). Qwen at 92.98% drops to 33.6%. A 1.53-point accuracy gap translates to a roughly 28% higher probability of end-to-end success.
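
The compounding effect can be reproduced in a few lines. This is a minimal sketch that assumes every tap succeeds independently with the same per-tap accuracy, so a flow of n steps succeeds with probability accuracy^n; real flows may violate that assumption, and the function name is illustrative.

```python
def flow_success_rate(per_tap_accuracy: float, steps: int = 15) -> float:
    """Probability that every tap in a flow is correct, assuming independent taps."""
    return per_tap_accuracy ** steps


drizz = flow_success_rate(0.9451)
qwen = flow_success_rate(0.9298)

print(f"Drizz 15-step success: {drizz:.1%}")
print(f"Qwen 15-step success:  {qwen:.1%}")
print(f"Relative advantage:    {drizz / qwen - 1:.0%}")
```

Changing `steps` shows how quickly a small per-tap gap widens as flows get longer.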

Why GPT-5.1 scores 21% on a task it should pass

This is not a language failure. GPT-5.1 understands "tap the second option in the list" perfectly. It fails to map that instruction to the correct pixel on a dense mobile screen.

Dense layouts require precise counting, not comprehension

Mobile UIs pack dozens of tappable elements into small screens. "The third item" requires iterating over visually similar rows and landing on the exact bounding box, not just recognising a region.

General training doesn't calibrate for pixel-level spatial tasks

LMMs trained on diverse image understanding tasks lack fine-grained spatial calibration for mobile UI interaction. Domain-specific fine-tuning on production app layouts is required.

High recall, catastrophic precision

GPT-5.1 scores 75.61% recall but only 23.35% precision. It taps something, just the wrong thing. A false positive in mobile automation is worse than a miss: it corrupts state and cascades failures forward.

Fast models still failed hardest

GPT-5.2 runs at 5.0s p90 and GPT-5.1 at 5.55s. They are the second- and third-fastest models in the table, just behind Drizz's 4.81s, yet they score lowest on accuracy. A fast wrong tap fails the test, moves on, and leaves corrupt app state poisoning every downstream step.
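
For readers less familiar with p90, the sketch below shows one common way to compute it, the nearest-rank method. The page does not say which percentile method was used, so this choice is an assumption, and the latency samples are hypothetical values made up for illustration, not benchmark data.

```python
def p90(latencies_s: list[float]) -> float:
    """90th percentile by the nearest-rank method: the smallest value such that
    at least 90% of the samples are less than or equal to it."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    # Integer form of ceil(0.9 * n), giving a 1-based rank.
    rank = (9 * len(ordered) + 9) // 10
    return ordered[rank - 1]


# Hypothetical per-request latencies in seconds (not from the benchmark).
sample = [3.9, 4.1, 4.2, 4.4, 4.5, 4.6, 4.7, 4.8, 5.0, 9.3]
print(f"p90 = {p90(sample)}s")
```

In this made-up sample the single 9.3s outlier does not move the p90, which is why tail percentiles are often preferred over the maximum when comparing models.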

How we ran it. Reproduce it yourself.

We are both benchmark creator and top scorer. Full reproducibility (open dataset, disclosed constants, identical conditions for every model) is our only credibility.

Evaluation conditions
Each model received one mobile screenshot and one natural-language instruction, and was prompted to return (x, y) coordinates. A tap is marked correct if the predicted coordinates fall within the ground-truth bounding box. Zero-shot: no few-shot examples, no chain-of-thought, no model-specific prompt engineering. Identical prompts across all six models.
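
The correctness rule above amounts to a point-in-rectangle test. The sketch below is one way to express it, under assumptions the page does not confirm: boxes are stored as (left, top, right, bottom) in screenshot pixels, and taps on the box edge count as correct. `BoundingBox` and `is_tap_correct` are hypothetical names, not part of the released dataset tooling.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """Ground-truth box in screenshot pixels (assumed layout)."""
    left: float
    top: float
    right: float
    bottom: float


def is_tap_correct(x: float, y: float, box: BoundingBox) -> bool:
    """True if the predicted tap (x, y) lies inside the box, edges included."""
    return box.left <= x <= box.right and box.top <= y <= box.bottom


# Hypothetical example: a button occupying a 200x80 px region.
button = BoundingBox(left=100, top=400, right=300, bottom=480)
print(is_tap_correct(210, 440, button))  # inside the box
print(is_tap_correct(210, 500, button))  # below the box
```

Whether edge pixels count as hits, and whether coordinates are absolute or normalised, should be checked against the dataset's annotation format before reusing this.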

Scope

Tap-level spatial precision only. Not end-to-end task completion, multi-step reasoning, or stateful interaction.

Sample Size

570 samples. Sufficient to differentiate production-viable models from non-viable ones. Dataset expansion planned for v2.

Platform

Android screenshots only. iOS, iPadOS, and web mobile layouts are not included in this release.

Language

English-language interfaces only. RTL layouts and CJK interfaces not evaluated.

Self-Evaluation

Drizz is benchmark creator and top scorer. Mitigated by open dataset + full annotation release under Apache 2.0.

Prompting

Zero-shot only. Performance may differ with few-shot examples or model-specific optimisations.

570 screenshots. 20 apps.
Apache 2.0.

Run your own models. Challenge the scores. Publish your results. That's why we open-sourced it.

Food & Travel

Uber Eats, DoorDash, Instacart, Uber, Airbnb, Booking.com

6 Apps

Social & Entertainment

WhatsApp, Telegram, Reddit, Spotify, YouTube, Netflix

6 Apps

Finance, Health & Work

PayPal, Robinhood, Notion, Todoist, Duolingo, Coursera, MyFitnessPal, Strava

8 Apps
