Multilingual menus: where machine translation gets it wrong, and how we catch it

Machine translation is good enough for prose and dangerous for menus. Dishes are full of proper nouns, regional terms and cultural context that NMT systems mis-handle in specific, repeatable ways. Here is what we have seen go wrong, why, and the review gate ORDR builds in.

By Francesco Boffa

Co-founder

Wednesday, 8 May 2024 5 min read


You can run a paragraph of English prose through Google Translate into French, German, Italian, Spanish or Japanese, and the output is broadly competent. Run a restaurant menu through the same engine and you can get into trouble in ways that prose translation does not warn you about.

This is the long version of why menus are hard for machine translation and what we build to catch the failures before they reach a customer.

The problem, in three categories

Proper nouns that look like ordinary words. “Carbonara”, “Caesar”, “Wellington”, “Pavlova”, “Cumberland”, “Mornay”, “Florentine”, “Bourguignon”. To a neural machine translation (NMT) system these read as ordinary words, so they get translated into other words. We have seen “Caesar salad” rendered into Spanish as a salad attributed to the Roman emperor, and “Wellington steak” rendered into German as something pertaining to rubber boots. These are real outputs, not jokes.

The underlying problem is one that named-entity recognition research has been working on for two decades: NMT models have to decide whether a proper noun should be preserved verbatim, transliterated, or translated, and the decision is fragile for food terms because the same word is sometimes a name and sometimes a common noun. “Florentine” is a kind of biscuit AND a kind of egg dish AND a kind of person. Without the wider context of a paragraph of prose, the model often picks the wrong sense.

Regional and dialectal food terms. “Bap”, “barm”, “stottie”, “muffin” (the British one), “biscuit” (the British one vs the American one), “chips” (the British one vs the American one), “lemonade” (the British one vs the American one). All of these have specific, regionally bounded referents that NMT systems flatten. A British “lemonade” rendered into Spanish as limonada is fine; rendered as the cloudy, still American kind, it would be wrong. The model does not know which kind of lemonade your menu means without surrounding context, and menus do not provide that context.

Culturally-loaded terms that have no equivalent. “Sunday roast”, “afternoon tea”, “ploughman’s”, “high tea”, “elevenses”. You can translate the words; you cannot translate the institution. The right answer is usually to leave the term in the source language and add a footnote, but NMT will dutifully render “Sunday roast” as almuerzo dominical asado and call it a day.

Two papers worth reading on the underlying NMT limitations: Koehn and Knowles, “Six Challenges for Neural Machine Translation” (2017) is dated but still on point about domain shift and rare words. Kocmi et al., “Findings of the 2023 Conference on Machine Translation (WMT23): LLMs Are Here but Not Quite There Yet” is the most recent state-of-the-art benchmark and shows where the systems still struggle.

What we built in front of the translation API

ORDR’s translation pipeline has three stages.

Stage one: machine translation as the draft. When a venue adds a new menu in English and wants Spanish, we call Google Cloud Translation and produce a draft in Spanish for every item. The draft is never published automatically. It sits in a Pending review state.

Stage two: a glossary of preserved terms. This step runs before the machine translation call: the menu’s items are pre-processed against a venue-managed glossary, and any glossary term found is wrapped in non-translatable tags. Google’s API supports this for HTML input via its notranslate markup, and DeepL offers comparable control through its glossary feature. So “Pavlova” stays “Pavlova” in every target language; “Carbonara” stays “Carbonara”; the venue’s own dish names (“Soho Smashed Burger”, “Roy’s Steak Tartare”) stay as themselves. The glossary is auto-populated from a small list of well-known dish proper nouns and is editable per venue.
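
The glossary pre-processing in stage two could be sketched roughly as follows. This is an illustration, not ORDR's code: the function name protect_terms is hypothetical, and the sketch assumes the item text is sent to the translation API as HTML, where Google Cloud Translation leaves the content of a span with class "notranslate" untranslated.

```python
import html
import re


def protect_terms(text: str, glossary: list[str]) -> str:
    """Return `text` as HTML with every glossary term wrapped in a
    notranslate span. Hypothetical helper, for illustration only."""
    if not glossary:
        return html.escape(text, quote=False)

    # Longest terms first, so "Beef Wellington" wins over "Wellington".
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in terms) + r")(?!\w)",
        re.IGNORECASE,
    )

    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the untouched text between matches, then wrap the match.
        parts.append(html.escape(text[last:match.start()], quote=False))
        protected = html.escape(match.group(0), quote=False)
        parts.append(f'<span class="notranslate">{protected}</span>')
        last = match.end()
    parts.append(html.escape(text[last:], quote=False))
    return "".join(parts)
```

Matching longest terms first is a deliberate choice here: a venue that lists both “Wellington” and “Beef Wellington” presumably wants the full dish name kept together rather than only its second word.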

Stage three: a human review gate. A native speaker — usually the venue manager, sometimes a server who speaks the target language — sees the draft, fixes the awkward bits, and approves. Only then does the translated menu go live to customers. The gate is the single most important piece. It costs a venue ten or fifteen minutes per new language, once.
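
One way to picture the gate is as a small state machine in which publishing is simply impossible until a reviewer has approved the draft. The sketch below is a minimal illustration under that assumption; the class, field, and function names are hypothetical and do not describe ORDR's actual data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING_REVIEW = "pending_review"
    APPROVED = "approved"


@dataclass
class MenuTranslation:
    """A machine-translated draft for one target language (hypothetical)."""
    language: str
    items: dict[str, str]  # source text -> translated text
    status: Status = Status.PENDING_REVIEW
    reviewer: Optional[str] = None

    def edit(self, source: str, corrected: str) -> None:
        # The reviewer fixes an awkward rendering in place.
        self.items[source] = corrected

    def approve(self, reviewer: str) -> None:
        if not reviewer:
            raise ValueError("an approval needs a named reviewer")
        self.reviewer = reviewer
        self.status = Status.APPROVED


def publish(translation: MenuTranslation) -> dict[str, str]:
    """Return the customer-facing items, refusing unreviewed drafts."""
    if translation.status is not Status.APPROVED:
        raise PermissionError("translation has not been reviewed")
    return dict(translation.items)
```

Because every new draft starts in the pending state by default, the safe behaviour does not depend on anyone remembering to set a flag.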

Specific things we have caught

Spanish translations rendering “still water” as agua quieta (quiet water) rather than agua sin gas (still water in the bottled sense). The native speaker fixed it.
Italian translations rendering “side of chips” as un lato di patatine (a side, in the geometric sense). The native speaker fixed it.
A German translation referring to a “Wellington” as a boot. Glossary now preserves “Beef Wellington” verbatim.
A Japanese translation that helpfully attempted to phonetically transliterate the venue’s name into katakana, despite the venue’s name being a deliberate stylisation that needed to stay in Latin script.

None of these are catastrophes individually. A pattern of them, untreated, is the difference between a translation a customer trusts and one they laugh at.
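
Some of these cases can be flagged for the reviewer automatically. As a hedged sketch (the function name and the German draft string below are made up for illustration, and this is not a description of ORDR's implementation), a check can list every glossary term that appears in the source item but no longer appears verbatim in the draft:

```python
import re


def missing_preserved_terms(source: str, draft: str, glossary: list[str]) -> list[str]:
    """Return glossary terms present in `source` but absent from `draft`.

    Hypothetical helper: a non-empty result means a term that should have
    been preserved was translated or transliterated anyway.
    """
    missing = []
    draft_folded = draft.casefold()
    for term in glossary:
        # Whole-word, case-insensitive match in the source text.
        in_source = re.search(
            r"(?<!\w)" + re.escape(term) + r"(?!\w)", source, re.IGNORECASE
        )
        if in_source and term.casefold() not in draft_folded:
            missing.append(term)
    return missing


# Invented example draft in which the dish name was translated literally.
flagged = missing_preserved_terms(
    "Beef Wellington with chips",
    "Rindfleisch-Gummistiefel mit Pommes",
    ["Beef Wellington", "Pavlova"],
)
```

A check like this only catches broken glossary terms; renderings such as agua quieta still need the human reviewer.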

Why we do not rely on the bleeding edge

Newer large-language-model approaches (models from Anthropic and OpenAI, Google’s Gemini, Meta’s Llama) are demonstrably better at preserving named entities and cultural context. Comparisons published by the Stanford CRFM show meaningful improvements on the well-known mistranslation benchmarks. We have piloted a Gemini-backed translation path internally and the quality is better than NMT for the harder cases.

But the cost and latency change. NMT via Google’s translation API is sub-100ms per call and roughly $20 per million characters. A frontier LLM call is 1–3 seconds and roughly twenty times the cost. For a 200-item menu translated one item per call, that is the difference between a one-shot translation that takes about 30 seconds and one that takes 10 minutes, at meaningful cost.
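
The arithmetic behind those figures can be checked with a few lines. The sketch assumes one sequential API call per menu item and takes the upper end of each latency range; the average item length of 120 characters is an assumption for illustration, not a figure from our data.

```python
ITEMS = 200

NMT_LATENCY_S = 0.1          # "sub-100ms", taken at its upper bound
LLM_LATENCY_S = 3.0          # upper end of "1-3 seconds"

NMT_USD_PER_M_CHARS = 20.0   # roughly $20 per million characters
LLM_COST_MULTIPLIER = 20     # "roughly twenty times the cost"
AVG_CHARS_PER_ITEM = 120     # assumed average item length


def sequential_seconds(items: int, per_call_s: float) -> float:
    """Total wall-clock time if items are translated one call at a time."""
    return items * per_call_s


def cost_usd(items: int, chars_per_item: int, usd_per_million: float) -> float:
    """Character-based cost of translating the whole menu once."""
    return items * chars_per_item * usd_per_million / 1_000_000


nmt_seconds = sequential_seconds(ITEMS, NMT_LATENCY_S)
llm_seconds = sequential_seconds(ITEMS, LLM_LATENCY_S)
nmt_cost = cost_usd(ITEMS, AVG_CHARS_PER_ITEM, NMT_USD_PER_M_CHARS)
llm_cost = nmt_cost * LLM_COST_MULTIPLIER
```

Under these assumptions the NMT path spends about 20 seconds in API calls, in line with the roughly 30 seconds quoted once overhead is allowed for, and the LLM path spends 600 seconds, which is the 10 minutes. The cost per language comes to about $0.48 for NMT and about $9.60 for the LLM path. Batching or parallel calls would shrink the latency gap but not the price gap.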

For now, NMT plus a glossary plus a human review gate gives us the right cost / quality / speed envelope. We will revisit when frontier models converge on NMT pricing — likely sooner than we expect.

What ORDR does about this

ORDR ships machine translation for menus with a glossary system for preserved terms and a mandatory human-review gate before any translated menu goes live. The full feature documentation is in the help centre. If you run a venue in a market with non-English-speaking customers, the gate is what makes the difference between competent translation and embarrassing translation — and we will not let you publish the bad version.
