When you snap a receipt in Starlog, a small green checkmark appears about a second and a half later. The merchant name, line items, total, tax, tip, and category are filled in for you. You tap done, and the receipt is filed.
Getting that checkmark to land in 1.4 seconds — instead of the 4.8 seconds it took two years ago — is the longest‑running engineering project in the company. It involved three full rewrites of the parsing pipeline, one regrettable detour, and an embarrassingly simple insight that, in retrospect, should have been the starting point.
This post is the play‑by‑play. If you’re building something that turns photos into structured data, parts of this will save you a year.
01. Why parse time matters more than parse accuracy
The first thing to know about a receipt app is that nobody wants to use a receipt app. People log a receipt the moment they get one — at the counter, on the sidewalk, half‑in their pocket. If the app makes them wait, they put their phone away and the receipt gets lost. The data we never see is more valuable than the data we parse perfectly.
So we ranked our metrics. Time‑to‑checkmark beat parse F1 score. Tap‑to‑done beat field‑level accuracy. We will gladly mis‑parse the tax line if it means the user finishes the log and walks away.
The metric that matters
A perfectly parsed receipt that arrives 4 seconds late is worth less than a roughly parsed one that arrives in 1. Users won’t wait, and lost receipts can’t be re‑parsed.
02. Rewrite #1: cloud OCR was the obvious answer, until it wasn’t
The first version of snap‑to‑log was a thin client over a cloud OCR provider. Send the image, get back a JSON of text blocks, run a few regexes for prices, ship it. We had a working pipeline in ten days. It was usable. It was also 4.8 seconds on the median connection — most of which was spent waiting for the round trip and the OCR job itself.
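For a sense of how thin v1 was, it had roughly this shape. This is a sketch, not the provider’s real API: the endpoint URL, response schema, and the `parseReceiptV1` name are all illustrative.

```ts
// v1, roughly: upload the photo, wait for cloud OCR, regex the text.
// Endpoint and response shape are illustrative, not the actual provider API.
interface OcrBlock {
  text: string;
}

const PRICE = /\$?\d+\.\d{2}/;

async function parseReceiptV1(photo: Blob): Promise<string[]> {
  // Most of the 4.8 seconds went to this round trip and the OCR job behind it.
  const res = await fetch("https://ocr.example.com/recognize", {
    method: "POST",
    body: photo,
  });
  const { blocks }: { blocks: OcrBlock[] } = await res.json();

  // "A few regexes for prices" was, more or less, the entire parser.
  return blocks.map(b => b.text).filter(t => PRICE.test(t));
}
```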
We could have shaved a second by upgrading the API tier and another by pre‑warming connections. We did neither. The upper bound on a network round trip on a real cellular connection in a coffee shop is, charitably, brutal. To get under two seconds, the pipeline had to be on the device.
03. Rewrite #2: the LSTM detour
The second version ran a bundled OCR model on‑device. We picked a sequence‑to‑sequence LSTM trained on receipts, dropped it behind a Core ML wrapper, and — after about three months of pain — got the median to 2.98 seconds. Better, but still not there.
Worse, the failure mode was bad. When the LSTM got confused — wrinkled paper, faded ink, a curly fryer chit from a diner — it would confidently return wrong totals. Confidently wrong is the worst possible state for a tax app. Users lost trust in five minutes and never came back.
“A model that’s confidently wrong about a $42 tax line is more dangerous than a model that’s slow.” — from the v2 retrospective, March 2025
We tried distillation. We tried beam search. We tried more training data. The numbers crept up half a percentage point at a time, and the on‑device model kept getting bigger. At one point our app was 217 MB. We were optimising the wrong thing.
04. Rewrite #3: receipts are not photos of words
The insight that broke the problem open is, in hindsight, almost embarrassing.
A receipt is not a photograph of arbitrary text. It is a highly structured object with strong priors: a header (merchant), a body (line items in tabular alignment), a footer (subtotal, tax, total). The text within those zones almost always uses a monospaced or near‑monospaced font. Prices line up to the right. Totals are usually the largest number.
Once we modelled the receipt as a layout, not a page of words, the work split cleanly into three stages that could each be fast and small (the seams are sketched just after this list):
- Layout detector — finds the header / body / footer regions in the image. 3.2 MB.
- Line‑item extractor — reads the tabular zone, returning (name, qty, price) rows. 6.1 MB.
- Total resolver — finds the largest aligned dollar value in the footer. Mostly heuristic, no model. 0 MB.
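To make the seams concrete, the stage boundaries look roughly like this. This is a signature sketch for the post, not our internal API: every name and type below (`detectLayout`, `extractLineItems`, `ocrRegion`, `readMerchant`, `Region`) is illustrative.

```ts
// Illustrative stage boundaries for the three-part pipeline.
// All names and types here are assumptions for the post, not Starlog's real API.
interface Region { x: number; y: number; w: number; h: number }

interface ReceiptLayout {
  header: Region;  // merchant block
  body: Region;    // tabular line items
  footer: Region;  // subtotal, tax, total
}

interface LineItem { name: string; qty: number; price: number }
interface TextBlock { text: string }

interface ParsedReceipt {
  merchant: string;
  items: LineItem[];
  total: number | null;
}

// Stage 1: 3.2 MB model. Where is everything?
declare function detectLayout(image: ImageData): ReceiptLayout;
// Stage 2: 6.1 MB model. Read only the tabular zone.
declare function extractLineItems(image: ImageData, body: Region): LineItem[];
// OCR over a cropped region, feeding the footer heuristic.
declare function ocrRegion(image: ImageData, region: Region): TextBlock[];
declare function readMerchant(image: ImageData, header: Region): string;
// Stage 3: no model. See resolveTotal below.
declare function resolveTotal(footer: TextBlock[]): number | null;

function parseReceipt(image: ImageData): ParsedReceipt {
  const layout = detectLayout(image);
  return {
    merchant: readMerchant(image, layout.header),
    items: extractLineItems(image, layout.body),
    total: resolveTotal(ocrRegion(image, layout.footer)),
  };
}
```

The point of the split is that no single network has to learn layout detection, text reading, and total‑finding at once, which is what lets each piece stay small.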
The total resolver is the cheap one, and it is also the most accurate component in the system. It turns out that “find the biggest number with a dollar sign in the bottom third of the image” is correct about 99.4% of the time. We had spent a year teaching a 180‑MB model to do something a 30‑line function does better.
```ts
// resolve_total.ts — the embarrassingly small fix

// Minimal shape of an OCR block, as used here.
interface TextBlock {
  text: string;
}

// Pull the first price-looking number out of a block,
// e.g. "TOTAL $42.00" -> 42.
function parseAmount(text: string): number {
  const match = text.match(/\d+\.\d{2}/);
  return match ? parseFloat(match[0]) : 0;
}

function resolveTotal(footer: TextBlock[]): number | null {
  const candidates = footer
    .filter(b => /\$?\d+\.\d{2}/.test(b.text))
    .map(b => ({ ...b, value: parseAmount(b.text) }))
    .filter(b => b.value > 0);

  if (!candidates.length) return null;

  // the largest dollar value in the footer is almost always the total
  return candidates.sort((a, b) => b.value - a.value)[0].value;
  // 99.4% accurate on a holdout of 12,000 receipts. ✓
}
```
05. What we shipped, and what it cost
The current pipeline runs entirely on‑device. Model footprint is 9.3 MB combined. The median parse is 1.42 seconds; p95 is 2.1 seconds. Field‑level accuracy is up across the board, but the number we cared most about — completion rate — is up 38%.
The cost was honest. Two years. Three full rewrites. A four‑month detour where we genuinely thought we’d cracked it and we hadn’t. Several engineers, including me, now suspicious of any sentence beginning with “we’ll just throw a model at it.”
06. What’s next
Two things on the roadmap:
- Multi‑receipt detection. Snap a fan of three receipts from a business dinner; we should split them automatically. Working code, no ship date yet.
- Email receipts. Same pipeline applied to HTML emails from Stripe, Square, Toast, and the long tail of independent merchants. Out in beta next month.
If you want to see the pipeline in action, snap a receipt in Starlog and watch the bottom of the screen. The little green check is two years of work. We hope it feels like nothing.
— Priya, on behalf of the Vision team. Questions or war stories? Find me at priya@starlog.app.
Priya leads the team that turns photos of receipts into structured data. Before Starlog, she worked on document understanding at a large search company and shipped a tax-prep model that no one ever heard of.