When you snap a receipt in Starlog, a small green checkmark appears about a second and a half later. The merchant name, line items, total, tax, tip, and category are filled in for you. You tap done, and the receipt is filed.
Getting that checkmark to land in 1.4 seconds — instead of the 4.8 seconds it took two years ago — is the longest‑running engineering project in the company. It involved three full rewrites of the parsing pipeline, one regrettable detour, and an embarrassingly simple insight that, in retrospect, should have been the starting point.
This post is the play‑by‑play. If you’re building something that turns photos into structured data, parts of this will save you a year.
01. Why parse time matters more than parse accuracy
The first thing to know about a receipt app is that nobody wants to use a receipt app. People log a receipt the moment they get one — at the counter, on the sidewalk, half‑in their pocket. If the app makes them wait, they put their phone away and the receipt gets lost. The data we never see is more valuable than the data we parse perfectly.
So we ranked our metrics. Time‑to‑checkmark beat parse F1 score. Tap‑to‑done beat field‑level accuracy. We will gladly mis‑parse the tax line if it means the user finishes the log and walks away.
The metric that matters
A perfectly parsed receipt that arrives 4 seconds late is worth less than a roughly parsed one that arrives in 1. Users won’t wait, and lost receipts can’t be re‑parsed.
02. Rewrite #1: cloud OCR was the obvious answer, until it wasn’t
The first version of snap‑to‑log was a thin client over a cloud OCR provider. Send the image, get back a JSON of text blocks, run a few regexes for prices, ship it. We had a working pipeline in ten days. It was usable. It was also 4.8 seconds on the median connection — most of which was spent waiting for the round trip and the OCR job itself.
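For a sense of how thin v1 was, it had roughly this shape. This is a sketch, not the provider’s real API: the endpoint URL, response schema, and the `parseReceiptV1` name are all illustrative.

```ts
// v1, roughly: upload the photo, wait for cloud OCR, regex the text.
// Endpoint and response shape are illustrative, not the actual provider API.
interface OcrBlock {
  text: string;
}

const PRICE = /\$?\d+\.\d{2}/;

async function parseReceiptV1(photo: Blob): Promise<string[]> {
  // Most of the 4.8 seconds went to this round trip and the OCR job behind it.
  const res = await fetch("https://ocr.example.com/recognize", {
    method: "POST",
    body: photo,
  });
  const { blocks }: { blocks: OcrBlock[] } = await res.json();

  // "A few regexes for prices" was, more or less, the entire parser.
  return blocks.map(b => b.text).filter(t => PRICE.test(t));
}
```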
We could have shaved a second by upgrading the API tier and another by pre‑warming connections. We did neither. The upper bound on a network round trip on a real cellular connection in a coffee shop is, charitably, brutal. To get under two seconds, the pipeline had to be on the device.
03. Rewrite #2: the LSTM detour
The second version ran a bundled OCR model on‑device. We picked a sequence‑to‑sequence LSTM trained on receipts, dropped it behind a Core ML wrapper, and — after about three months of pain — got the median to 2.98 seconds. Better, but still not there.
Worse, the failure mode was bad. When the LSTM got confused — wrinkled paper, faded ink, a curly fryer chit from a diner — it would confidently return wrong totals. Confidently wrong is the worst possible state for a tax app. Users lost trust in five minutes and never came back.
“A model that’s confidently wrong about a $42 tax line is more dangerous than a model that’s slow.” — from the v2 retrospective, March 2025
We tried distillation. We tried beam search. We tried more training data. The numbers crept up half a percentage point at a time, and the on‑device model kept getting bigger. At one point our app was 217 MB. We were optimising the wrong thing.
04. Rewrite #3: receipts are not photos of words
The insight that broke the problem open is, in hindsight, almost embarrassing.
A receipt is not a photograph of arbitrary text. It is a highly structured object with strong priors: a header (merchant), a body (line items in tabular alignment), a footer (subtotal, tax, total). The text within those zones almost always uses a monospaced or near‑monospaced font. Prices line up to the right. Totals are usually the largest number.
Once we modelled the receipt as a layout, not a page of words, the work split cleanly into three stages that could each be fast and small (the seams are sketched just after this list):
- Layout detector — finds the header / body / footer regions in the image. 3.2 MB.
- Line‑item extractor — reads the tabular zone, returning (name, qty, price) rows. 6.1 MB.
- Total resolver — finds the largest aligned dollar value in the footer. Mostly heuristic, no model. 0 MB.
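To make the seams concrete, the stage boundaries look roughly like this. This is a signature sketch for the post, not our internal API: every name and type below (`detectLayout`, `extractLineItems`, `ocrRegion`, `readMerchant`, `Region`) is illustrative.

```ts
// Illustrative stage boundaries for the three-part pipeline.
// All names and types here are assumptions for the post, not Starlog's real API.
interface Region { x: number; y: number; w: number; h: number }

interface ReceiptLayout {
  header: Region;  // merchant block
  body: Region;    // tabular line items
  footer: Region;  // subtotal, tax, total
}

interface LineItem { name: string; qty: number; price: number }
interface TextBlock { text: string }

interface ParsedReceipt {
  merchant: string;
  items: LineItem[];
  total: number | null;
}

// Stage 1: 3.2 MB model. Where is everything?
declare function detectLayout(image: ImageData): ReceiptLayout;
// Stage 2: 6.1 MB model. Read only the tabular zone.
declare function extractLineItems(image: ImageData, body: Region): LineItem[];
// OCR over a cropped region, feeding the footer heuristic.
declare function ocrRegion(image: ImageData, region: Region): TextBlock[];
declare function readMerchant(image: ImageData, header: Region): string;
// Stage 3: no model. See resolveTotal below.
declare function resolveTotal(footer: TextBlock[]): number | null;

function parseReceipt(image: ImageData): ParsedReceipt {
  const layout = detectLayout(image);
  return {
    merchant: readMerchant(image, layout.header),
    items: extractLineItems(image, layout.body),
    total: resolveTotal(ocrRegion(image, layout.footer)),
  };
}
```

The point of the split is that no single network has to learn layout detection, text reading, and total‑finding at once, which is what lets each piece stay small.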
The total resolver is the cheap one, and it is also the most accurate component in the system. It turns out that “find the biggest number with a dollar sign in the bottom third of the image” is correct about 99.4% of the time. We had spent a year teaching a 180‑MB model to do something a 30‑line function does better.
```ts
// resolve_total.ts — the embarrassingly small fix

// Minimal shape of an OCR block, as used here.
interface TextBlock {
  text: string;
}

// Pull the first price-looking number out of a block,
// e.g. "TOTAL $42.00" -> 42.
function parseAmount(text: string): number {
  const match = text.match(/\d+\.\d{2}/);
  return match ? parseFloat(match[0]) : 0;
}

function resolveTotal(footer: TextBlock[]): number | null {
  const candidates = footer
    .filter(b => /\$?\d+\.\d{2}/.test(b.text))
    .map(b => ({ ...b, value: parseAmount(b.text) }))
    .filter(b => b.value > 0);

  if (!candidates.length) return null;

  // the largest dollar value in the footer is almost always the total
  return candidates.sort((a, b) => b.value - a.value)[0].value;
  // 99.4% accurate on a holdout of 12,000 receipts. ✓
}
```
05. What we shipped, and what it cost
The current pipeline runs entirely on‑device. Model footprint is 9.3 MB combined. The median parse is 1.42 seconds; p95 is 2.1 seconds. Field‑level accuracy is up across the board, but the number we cared most about — completion rate — is up 38%.
The cost was honest. Two years. Three full rewrites. A four‑month detour where we genuinely thought we’d cracked it and we hadn’t. Several engineers, including me, now suspicious of any sentence beginning with “we’ll just throw a model at it.”
06. What’s next
Two things on the roadmap:
- Multi‑receipt detection. Snap a fan of three receipts from a business dinner; we should split them automatically. Working code, no ship date yet.
- Email receipts. Same pipeline applied to HTML emails from Stripe, Square, Toast, and the long tail of independent merchants. Out in beta next month.
If you want to see the pipeline in action, snap a receipt in Starlog and watch the bottom of the screen. The little green check is two years of work. We hope it feels like nothing.
— Priya, on behalf of the Vision team. Questions or war stories? Find me at priya@starlog.app.
Priya leads the team that turns photos of receipts into structured data. Before Starlog, she worked on document understanding at a large search company and shipped a tax-prep model that no one ever heard of.