Nine Languages in One Prompt: Multilingual Product Discovery in SEA
Southeast Asia runs in nine working languages at once. Indexing product data so a user's agent can search in any of them turns out to need a specific shape — and a few non-obvious decisions.
Singapore is a good microcosm of Southeast Asia's linguistic reality. A single user population — 5.9 million people — speaks English, Mandarin, Malay, and Tamil natively, in daily code-switching blends. A Singaporean user asking their agent for product recommendations might type in English today, Mandarin tomorrow, and a Singlish mix the day after. A shopping infrastructure that works for that user needs to understand all of those as the same intent.
Scale up to the region: 650 million people, nine working languages, dozens of local dialects. Add Japanese and Korean, which are not SEA languages but which are disproportionately represented in SEA's consumer and creator populations (Japan and Korea are the top two inbound tourist sources for most SEA cities; K-beauty and J-beauty categories are dominant in each country's beauty market). A commerce layer that serves the region in 2026 has to work in all of these languages or it is leaving significant volume on the table.
This post walks through the design choices that actually matter when building that indexing layer — drawn from running the real thing in production.
1. The nine languages that matter
| Language | Script | Native speakers (approx, SEA context) |
|---|---|---|
| English | Latin | Universal L2; L1 for Singapore, Philippines. |
| Simplified Chinese (中文) | CJK | Large diaspora populations across SEA; official in Singapore; dominant script for cross-border commerce with mainland China. |
| Traditional Chinese (繁體) | CJK | Heritage script used in Hong Kong, Taiwan, older diaspora communities. |
| Malay (Bahasa Melayu) | Latin | Official in Malaysia, Brunei, Singapore; 290M speakers with Indonesian. |
| Indonesian (Bahasa Indonesia) | Latin | Official in Indonesia (~270M); highly mutually intelligible with Malay. |
| Vietnamese (tiếng Việt) | Latin (diacritic-heavy) | Official in Vietnam (~95M). |
| Thai (ภาษาไทย) | Thai | Official in Thailand (~69M); no whitespace word boundaries. |
| Japanese (日本語) | CJK + hiragana + katakana | Used by SEA residents and inbound shoppers from Japan; katakana for foreign brand names. |
| Korean (한국어) | Hangul | Used by SEA residents and inbound shoppers from Korea. |
2. The hard parts
Three characteristics of these languages make classic English-centric search indexing fail in important ways.
a. CJK and Thai have no whitespace word boundaries
Chinese, Japanese, and Thai don't separate words with spaces. "新加坡美妆品牌" is one run of characters; the semantic units are "新加坡" (Singapore) + "美妆" (beauty/cosmetics) + "品牌" (brand). A search index that tokenizes on whitespace will index the whole string as a single token, which matches nothing. You need either a dedicated tokenizer (MeCab, jieba, ICU) or a simpler approach that works for commerce-sized vocabulary: n-gram windows.
In practice we generate 2-character n-grams over CJK and Thai text. "新加坡美妆" becomes the set {新加, 加坡, 坡美, 美妆}. A query of "新加坡" generates {新加, 加坡}; we check overlap with the indexed set. It's a bag-of-bigrams approach: it works well for short queries and ages well as vocabulary grows. It's not as sharp as a proper tokenizer, but the operational simplicity is significant and the match quality is much better than whitespace tokenization.
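The bigram scheme above is small enough to sketch in full. This is an illustrative version (the function names are ours, not from any library): generate 2-character windows over a string, then count query/index overlap.

```typescript
// Generate the set of 2-character n-grams ("bigrams") over a string.
// Array.from iterates by code point, so surrogate-pair characters are safe.
function bigrams(text: string): Set<string> {
  const grams = new Set<string>();
  const chars = Array.from(text);
  for (let i = 0; i < chars.length - 1; i++) {
    grams.add(chars[i] + chars[i + 1]);
  }
  return grams;
}

// Count how many of the query's bigrams appear in the indexed set.
function bigramOverlap(query: string, indexed: Set<string>): number {
  let hits = 0;
  for (const g of bigrams(query)) {
    if (indexed.has(g)) hits++;
  }
  return hits;
}
```

Indexing "新加坡美妆" yields {新加, 加坡, 坡美, 美妆}; the query "新加坡" contributes {新加, 加坡} and overlaps on both.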
b. Foreign brand names are localized differently in each script
Xiaomi in Chinese is 小米 (xiǎo mǐ, literally "little rice"). In Korean it's 샤오미 (phonetic). In Japanese it's シャオミ (katakana phonetic). These are not translations of each other; they are parallel localizations. A user in Seoul typing 샤오미 is looking for the same company as a user in Shanghai typing 小米, and an index that doesn't know this will fail both.
The pragmatic move is to maintain a brand alias table: for each canonical brand, a small list of common localizations in each script. We hand-curate this for the 11 brands we currently index — it's less than 10 lines of data per brand. For brands without a heritage localization (airpaz, WPS, FusionHome), the alias list is empty and we fall back to Latin-script match.
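The alias table is plain data. A minimal sketch of its shape, using only the localizations mentioned above (the real table covers all 11 indexed brands; `matchesBrand` is our illustrative helper, not a production API):

```typescript
// Hand-curated brand alias table: canonical brand -> localized names.
// An empty list means no heritage localization exists; we fall back to
// Latin-script matching on the canonical name.
const brandAliases: Record<string, string[]> = {
  xiaomi: ["小米", "샤오미", "シャオミ"], // zh / ko / ja localizations
  airpaz: [], // Latin-script only
};

function matchesBrand(query: string, brand: string): boolean {
  if (query.toLowerCase().includes(brand)) return true; // Latin-script fallback
  return (brandAliases[brand] ?? []).some((alias) => query.includes(alias));
}
```

A Seoul user's 샤오미 and a Shanghai user's 小米 both resolve to the same canonical brand through this table.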
c. Category vocabulary varies by language and by locale
The English category "beauty" has multiple reasonable translations in each language — and the same-language translations themselves vary by market. In Bahasa Malaysia, "kecantikan" is formal; in Bahasa Indonesia, "kecantikan" is also used but "kosmetik" is common in everyday speech. In Japanese, 化粧品 is a noun (cosmetics) while スキンケア is a sub-category (skincare), both often used where an English speaker would just say "beauty."
We build a category synonym table for each of our seven categories across all nine languages — roughly 30-40 synonym tokens per category. It's not that hard to maintain: categories change slowly, we update when we notice a miss, and the table compresses nicely in JSON.
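One slice of that synonym table, using only the tokens discussed above (the real table spans seven categories and nine languages; the lookup helper is an assumption of ours):

```typescript
// Category synonym table: canonical category -> localized/colloquial tokens.
// Illustrative entries only — drawn from the "beauty" examples in the text.
const categorySynonyms: Record<string, string[]> = {
  beauty: ["beauty", "kecantikan", "kosmetik", "化粧品", "スキンケア", "美妆"],
};

// Reverse lookup: which canonical category does a query token name?
function categoryFor(token: string): string | undefined {
  for (const [category, synonyms] of Object.entries(categorySynonyms)) {
    if (synonyms.includes(token)) return category;
  }
  return undefined;
}
```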
3. The indexing architecture we ended up with
The operational model is:
- Each brand record carries a `keywords` array — the union of brand aliases, category synonyms, and region aliases across all nine language variants. A typical brand has 40-80 keywords.
- At search time, we tokenize the query. For Latin-script queries: split on whitespace, drop tokens shorter than 2 characters. For CJK/Thai queries (detected by Unicode range): generate 2-character n-grams across the query.
- Score each candidate brand by: exact keyword match (high weight), substring keyword match (medium), token-in-keyword match (low). Add classic bonuses for brand-name match, headline match, category match.
- Return top-k sorted by score, filtered by region and category if supplied.
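The scoring step reduces to a few lines. A hedged sketch of the tiered matcher — the weights here are illustrative placeholders, not the production values:

```typescript
type Brand = { name: string; keywords: string[] };

// Tiered keyword scoring: exact match > keyword-contains-token (substring)
// > token-contains-keyword. Weights are illustrative only.
function scoreBrand(tokens: string[], brand: Brand): number {
  let score = 0;
  for (const t of tokens) {
    for (const kw of brand.keywords) {
      if (t === kw) score += 10;           // exact keyword match (high)
      else if (kw.includes(t)) score += 4; // substring keyword match (medium)
      else if (t.includes(kw)) score += 1; // token-in-keyword match (low)
    }
  }
  return score;
}
```

Bonuses for brand-name, headline, and category matches, plus the region/category filter and top-k sort, layer on top of this base score.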
No ML ranker. No embedding store. No fuzzy phoneme matching. Just typed data + a few hundred lines of ranking code. It fits entirely inside a single MCP tool handler running on an edge Worker, with sub-100ms latency from any SEA region.
4. What we got right and what we got wrong
Got right: the n-gram-on-CJK approach. For a vocabulary our size (~70 keywords per brand across 11 brands), bigrams are absolutely sufficient, and they avoid the complexity and bundle-size of a real tokenizer in an edge runtime.
Got right: keeping brand aliases and category synonyms as hand-curated data, not LLM-generated. LLMs generate plausible-looking but wrong localizations; we got burned twice in internal testing. Hand-curating 10 lines per brand is an order of magnitude cheaper than cleaning up LLM mistakes.
Got wrong initially: treating English as the default. Our first version did English tokenization first, then checked language, then re-tokenized. We now detect script first (it's a one-character-class regex) and dispatch to the right tokenizer. Faster and simpler.
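The script-first dispatch really is one character class. A sketch under our assumptions (Unicode property escapes need an ES2018+ runtime; Korean takes the whitespace path because Hangul text is space-delimited):

```typescript
// Detect scripts without whitespace word boundaries: Han, kana, Thai.
const cjkOrThai = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Thai}]/u;

function tokenize(query: string): string[] {
  if (cjkOrThai.test(query)) {
    // n-gram path: 2-character windows over the de-spaced query
    const chars = Array.from(query.replace(/\s+/g, ""));
    const grams: string[] = [];
    for (let i = 0; i < chars.length - 1; i++) grams.push(chars[i] + chars[i + 1]);
    return grams;
  }
  // Latin-script path: whitespace split, drop tokens shorter than 2 chars
  return query.split(/\s+/).filter((t) => t.length >= 2);
}
```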
Got wrong initially: not weighting brand-alias matches high enough. A user typing 小米 gets a much stronger signal than a user typing 电子 (electronics) — the former names a specific brand, the latter names a category of 50 brands. Our scoring now treats an exact brand-alias match as near-parity with an exact brand-name match.
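As a config fragment, the fix looks something like this (the numbers are ours, purely illustrative — the point is the ordering, with an exact alias hit near parity with the brand name and far above a category hit):

```typescript
// Illustrative scoring weights after the fix. 小米 names one brand;
// 电子 (electronics) names a category of ~50 brands, so it scores far lower.
const WEIGHTS = {
  brandNameExact: 100,
  brandAliasExact: 95, // near-parity with the canonical brand name
  categorySynonym: 20,
};
```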
5. Validation, end-to-end
We ran 20 test queries across 8 languages as part of the initial build-out. Each query expresses the same underlying consumer intent ("beauty in Singapore", "Xiaomi phone", "sneakers Singapore", etc.) in a different language. All 20 resolved to the expected top-1 brand. That's the bar we shipped at — the multilingual layer has to work across all nine variants or the region-first commerce thesis doesn't land.
You can watch this end-to-end in the live demo: it runs four real MCP calls in four different languages — Chinese, Thai, Japanese, English — against the production endpoint. What the demo renders is what any OpenUI-based chat client with our MCP installed would render for a user's query.
6. What's next
Three things we are adding in the next version:
- Language-specific CTAs. A Korean-speaking user should see Korean text on the click-through card, not "Visit Sephora." A lightweight i18n layer — we already have the translations; they just need wiring into the tool output.
- Romanized Cantonese and Hokkien. The SEA Chinese diaspora doesn't always type in Mandarin, and romanizations vary by community. Low priority, but we've noticed the gap.
- Arabic and Vietnamese tokenization tweaks. Arabic isn't in scope today but will likely join when we cover the Middle East. Vietnamese has diacritic ambiguity that our current tokenizer doesn't fully handle (tiếng vs tieng).
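The Vietnamese diacritic fix is likely a one-function change. A sketch of the folding step we'd add, assuming standard Unicode NFD decomposition (đ/Đ have no canonical decomposition, so they're mapped explicitly):

```typescript
// Fold Vietnamese diacritics so "tiếng" and "tieng" index to the same token:
// decompose to NFD, strip combining marks, then map đ/Đ by hand.
function foldDiacritics(text: string): string {
  return text
    .normalize("NFD")
    .replace(/\p{Mark}/gu, "") // drop combining accents/tone marks
    .replace(/đ/g, "d")
    .replace(/Đ/g, "D");
}
```

We'd fold both the indexed keywords and the query, so either spelling matches either form.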
None of this is theoretical. The multilingual index is already the core differentiator we've shipped — the surface where agents asking in Thai, Indonesian, or Japanese get answers that actually match the user's intent, without anyone having to translate the query first.