Engineering

How receipt OCR actually works, from photo to data.

Photograph a receipt and the numbers type themselves — most people never ask how. The plain-English version: the two jobs hiding behind one tap, and where each one stumbles.

Vivek Reddy
founder
Jun 11, 2026 · 6 min read

Search "receipt OCR" and you'll find a hundred apps promising the same small magic: photograph a receipt, and the numbers type themselves. It works often enough that most people never ask how — right up until the day it reads a total as 5.40 on a slip that plainly says 54.00, and they suddenly want to know what just happened in there. This is the plain-English version of what goes on between the photo and the data: the two jobs hiding behind one tap, why the hard part isn't the part you'd guess, and what all of it means for how you should actually use the thing.

OCR is older and narrower than you think

OCR — optical character recognition — is the decades-old technology that turns a picture of text into actual, machine-readable characters. Point it at a scanned page and it hands back the words as text you could copy and paste. That's it. That's the whole job of classic OCR: pixels in, characters out.

Notice what it doesn't do. It reads the text on a receipt — the header, the line items, the tax, the total, the "thank you, come again" — as one undifferentiated stream of characters. It has no idea which number is the total, which line is the merchant, or which of the three dates on the slip is the one that matters. It can read; it can't understand. And on a receipt, the reading was never the hard part.

"Receipt OCR" is really two jobs

So when an app says "receipt OCR," it's almost always describing two different jobs stacked on top of each other:

  1. Recognition: the classic OCR step. Find the text in the image and convert it to characters. On a clean photo, this is the mature, largely solved part.
  2. Extraction: sometimes called parsing or "document understanding." Take that wall of recognized text and work out which piece is the merchant, which is the total, and which is the date. This is the part that's genuinely hard, because a receipt has no fixed layout: every point-of-sale system prints them differently.

The second job is where the real work, and the real errors, live. Recognizing the characters "5", "4", ".", "0", "0" is easy. Deciding that that number is the total — and not the subtotal, the tax, the cash tendered, or the change — is the part that takes judgment.
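
To make that concrete, here is a minimal sketch in Python of why "just grab the number" doesn't work. The receipt text, the `AMOUNT` pattern, and the "largest number wins" rule are all made up for illustration; no real app is claimed to work this way.

```python
import re
from decimal import Decimal

# Hypothetical recognized text, as the recognition step might return it.
RECEIPT_TEXT = """\
CORNER MARKET
Coffee beans 12.50
Oat milk 4.00
Subtotal 16.50
Tax 1.32
Total 17.82
Cash 20.00
Change 2.18
"""

# Assumed money shape: digits, a dot, exactly two decimals.
AMOUNT = re.compile(r"\d+\.\d{2}")


def all_amounts(text: str) -> list[Decimal]:
    """Return every money-shaped number in the text, in reading order."""
    return [Decimal(match) for match in AMOUNT.findall(text)]


def naive_total(text: str) -> Decimal:
    """Guess the total by taking the largest number (a deliberately weak rule)."""
    return max(all_amounts(text))


print(len(all_amounts(RECEIPT_TEXT)))  # how many candidates compete for one field
print(naive_total(RECEIPT_TEXT))       # the weak rule's pick
```

On this sample the text holds seven money-shaped numbers, and the weak rule picks 20.00, the cash handed over, rather than the 17.82 printed beside "Total". Every character was read correctly; the mistake is entirely in the choosing.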

The pipeline, step by step

Put together, a receipt moving through OCR passes through roughly four stages:

1. Image cleanup. Before anything reads the text, the image usually gets straightened, cropped to the receipt, and adjusted for contrast, so that a crooked, shadowed photo is normalized into something closer to a flat scan. (This is also why the photo you take matters so much: garbage in, garbage out, which is why there's a separate short guide to giving this step something to work with.)
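
As a rough illustration of the contrast part only, here is a sketch of a linear contrast stretch on a grayscale image held as plain nested lists. Real pipelines also straighten and crop, and typically use an imaging library; the function name and the sample pixel values are assumptions for this example.

```python
def stretch_contrast(pixels: list[list[int]]) -> list[list[int]]:
    """Rescale grayscale values so the darkest pixel maps to 0 and the brightest to 255."""
    flat = [value for row in pixels for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A perfectly flat image has no contrast to stretch; return an unchanged copy.
        return [row[:] for row in pixels]
    span = hi - lo
    # Integer arithmetic (rounding down) keeps the result deterministic.
    return [[(value - lo) * 255 // span for value in row] for row in pixels]


# A tiny, made-up 2x3 patch from a dim, low-contrast photo.
dim_patch = [
    [100, 110, 120],
    [130, 140, 150],
]
print(stretch_contrast(dim_patch))
```

The washed-out range of 100 to 150 is spread across the full 0 to 255 scale, which is the kind of head start that makes faint thermal print easier for the next stage to read.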

2. Text detection and recognition. The system finds the regions of the image that hold text, then recognizes the characters in each. The output is the full text of the receipt, usually with rough position information — this string of characters sat here on the page.
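
One plausible shape for that output, sketched in Python: each recognized line carries its text plus a bounding box. The `TextLine` type, its field names, and the pixel values are hypothetical and not taken from any particular OCR library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextLine:
    """One recognized line of text and where it sat in the image (pixels)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def reading_order(lines: list[TextLine]) -> list[TextLine]:
    """Sort lines top to bottom, then left to right.

    This is a simplification: real receipts need tolerance for lines that
    share a row but differ by a few pixels in their top coordinate.
    """
    return sorted(lines, key=lambda line: (line.top, line.left))


# Detection does not promise any particular order, so the sample is shuffled.
detected = [
    TextLine("Total 17.82", left=20, top=300, width=200, height=24),
    TextLine("CORNER MARKET", left=40, top=10, width=260, height=40),
    TextLine("Coffee beans 12.50", left=20, top=120, width=240, height=24),
]

full_text = "\n".join(line.text for line in reading_order(detected))
print(full_text)
```

Keeping the boxes alongside the text matters because the next stage leans on position as much as on the words themselves.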

3. Field extraction. Now the understanding step. Using the text and where it sat (a large string near the top is probably the merchant; a number beside the word "total" near the bottom is probably the amount), the system pulls out the structured fields. Modern extractors lean on a mix of patterns, layout cues, and, increasingly, machine-learning models trained on large numbers of receipts.
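
The two heuristics in that paragraph can be sketched as a toy rule-based extractor. Everything here is an assumption made for illustration: the `TextLine` type, the label list, the "top third" cutoff, and the sample receipt. Production extractors are considerably more involved, and this is not a description of any specific app.

```python
import re
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class TextLine:
    text: str
    top: int     # distance from the top edge of the image, in pixels
    height: int  # rough proxy for font size


AMOUNT = re.compile(r"\d+\.\d{2}")
# \b keeps "Subtotal" from matching as "total".
TOTAL_LABEL = re.compile(r"\b(total|amount due|balance|to pay)\b", re.IGNORECASE)


def extract_merchant(lines: list[TextLine]) -> Optional[str]:
    """Guess the merchant: the tallest line in the top third of the receipt."""
    if not lines:
        return None
    page_bottom = max(line.top + line.height for line in lines)
    header = [line for line in lines if line.top < page_bottom / 3]
    if not header:
        return None
    return max(header, key=lambda line: line.height).text


def extract_amount(lines: list[TextLine]) -> Optional[Decimal]:
    """Guess the amount: the last number on the lowest line with a total-like label."""
    labeled = [
        line for line in lines
        if TOTAL_LABEL.search(line.text) and AMOUNT.search(line.text)
    ]
    if not labeled:
        return None
    lowest = max(labeled, key=lambda line: line.top)
    return Decimal(AMOUNT.findall(lowest.text)[-1])


receipt = [
    TextLine("CORNER MARKET", top=10, height=40),
    TextLine("12 High Street", top=60, height=20),
    TextLine("Coffee beans 12.50", top=120, height=24),
    TextLine("Subtotal 16.50", top=240, height=24),
    TextLine("Tax 1.32", top=270, height=24),
    TextLine("Total 17.82", top=300, height=24),
    TextLine("Cash 20.00", top=330, height=24),
    TextLine("Change 2.18", top=360, height=24),
]

print(extract_merchant(receipt))
print(extract_amount(receipt))
```

Rules like these work on the receipts they were written for and fail quietly on the rest (an unlabeled total, a logo instead of a text header), which is one reason extractors increasingly add learned models on top.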

4. Structured data out. What you finally get is no longer a picture or a wall of text. It's data: merchant, amount, sometimes a date. That's the thing an app can file, total, and categorize — the difference between a folder of images and an actual expense record.
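
A sketch of what that record might look like once it exists, and why it is more useful than an image. The `Expense` type, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass
class Expense:
    merchant: str
    amount: Decimal
    purchased_on: Optional[date] = None  # the date is not always extracted


def total_by_merchant(expenses: list[Expense]) -> dict[str, Decimal]:
    """Sum amounts per merchant, something a folder of photos cannot do."""
    totals: dict[str, Decimal] = {}
    for expense in expenses:
        totals[expense.merchant] = totals.get(expense.merchant, Decimal("0")) + expense.amount
    return totals


expenses = [
    Expense("Corner Market", Decimal("17.82"), date(2026, 6, 1)),
    Expense("Corner Market", Decimal("4.00")),
    Expense("Hardware Co", Decimal("54.00"), date(2026, 6, 3)),
]

print(total_by_merchant(expenses))
```

Using `Decimal` rather than floats is a deliberate choice here: money sums should not pick up binary rounding noise.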

Why the hard part is the hard part

It's worth sitting with stage 3, because it explains every receipt-OCR frustration you've ever had. A receipt is a genuinely awful document to parse:

  • There's no standard layout. The total might be bottom-right, mid-page, or labeled "amount due," "balance," "to pay" — or nothing at all.
  • There are many numbers that look like the total — subtotal, tax, each line item, cash given, change. Picking the right one is a judgment call, not a lookup.
  • Dates are ambiguous. Is 03/04 the 4th of March or the 3rd of April? Is the date on the slip the transaction date or the print date?
  • The paper is faded, crumpled, thermal, or handwritten — so stage 2 often hands stage 3 imperfect text to reason over in the first place.
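
The date ambiguity is easy to demonstrate. The sketch below tries both day-first and month-first readings of a two-part date and keeps whichever are valid; the function name is made up, and the year is supplied by the caller as an assumption, since the sample string does not carry one.

```python
from datetime import date, datetime


def date_readings(raw: str, year: int) -> set[date]:
    """Return every valid calendar date a "NN/NN" string could mean in a given year."""
    readings = set()
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):  # day-first, then month-first
        try:
            readings.add(datetime.strptime(f"{raw}/{year}", fmt).date())
        except ValueError:
            # This reading is not a real date (for example, month 25); skip it.
            pass
    return readings


print(sorted(date_readings("03/04", 2026)))  # two valid readings
print(sorted(date_readings("25/12", 2026)))  # only one reading survives
```

"03/04" yields both 4 March and 3 April, and nothing in the string itself says which is right. An extractor has to lean on outside hints such as locale or the merchant's country, or leave the field for a person to set.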

This is why two apps running on the same photo can give different answers, and why "the OCR was wrong" almost always means the extraction step picked the wrong field, not that recognition misread a character. (Exactly where this breaks is the subject of a separate post.)

How Starlog does it

Worth being concrete about our own implementation, because it follows directly from everything above. Starlog runs OCR on your device, using Google's ML Kit, and it's deliberately modest about what it claims: it reads the store name and the amount off the receipt and pre-fills those two fields. The date, the category, and anything else are yours to set — and every field, including the two it filled, stays editable.

That's a design choice, not a limitation we're papering over. Extraction is the step that makes confident mistakes, so we pre-fill the fields OCR is most reliable on, keep a human in the loop for the rest, and make correcting any of it a single tap. And the original receipt image is always kept — filed into your own Google Drive — so the data is a convenience and the image is the record. If a number is ever in doubt, the source is right there to settle it.

What this means for you

Three practical takeaways fall out of understanding how this works:

  • Help stages 1 and 2. A clean, flat, well-lit photo is the single biggest thing in your control. Give recognition good text and extraction has a fighting chance. The seven-habit version is in a separate guide.
  • Always glance at stage 3. Because extraction is where the confident errors hide, a two-second check of the pulled amount against the slip catches the one that would otherwise surface months later, when the numbers won't reconcile.
  • Keep the original. OCR is a time-saver, not a system of record. The data feeds your totals and your export to the accountant; the image is what holds up if anyone ever asks.

The short version

Receipt OCR is two jobs wearing one name: reading the text, which is largely solved, and understanding which text means what, which isn't. The reading rarely fails on a decent photo; the understanding is where the work and the mistakes live. Knowing that is most of the trick — it tells you to take a good photo, glance at the result, and never throw away the original. Do that, and OCR does what it's genuinely good at: making the typing disappear.
