Minimum word and address lengths in what3words: code points vs characters

As you’ll know, every what3words address consists of three words separated by dots (like ///word1.word2.word3). But when it comes to measuring how long those words (and addresses) are, there’s a bit of a twist: we count Unicode code points, not just the letters you see on screen. In this post, we’ll explore the minimum lengths of words and addresses in what3words, why we measure them in code points, and what exactly the difference is between code points and characters.

Minimum word length in what3words
what3words addresses are designed to avoid very short words. In fact, each language has a rule for the minimum length a word can be. For example, in the English word list, the shortest words are four letters long. Many other languages have similar minimums of three or four letters.

Why the minimum length? It’s partly to prevent confusion. By avoiding very short words, what3words makes speech-to-text recognition of addresses more reliable (meaning “table branched spoon” couldn’t be mis-transcribed as “table branch to spoon” if the voice recogniser is aware of the four-character minimum word length in English). The minimum length rule gives each word a bit of “heft” so it stands out clearly when spoken or read. It’s important to note that these lengths are counted in Unicode code points. That means we’re counting the underlying units of text (more on that below), not just the number of letters as you might intuitively think.

We balance the need for clarity and “heft” with a desire to create user-friendly products; we know that short words are generally easy to remember and quicker to say or type. In the Latin script and other alphabets (such as Cyrillic and Greek), where a letter generally corresponds to a single Unicode code point, a four-letter word balances the requirements well.

However, other writing scripts work differently. In the Devanagari script used to write Hindi, for example, consonants come with an inherent vowel that does not need to be typed: while the spoken word is no shorter, the number of Unicode code points can be significantly reduced. Many Asian languages are written in scripts with a similar construction. In these languages, shorter words (often down to two Unicode code points) may be used.

Furthermore, for Chinese, Japanese, Korean and Amharic, where every syllable generally maps to a single Unicode code point, two-character words are used in what3words addresses.

Minimum address length (don’t forget the dots!)
Now, what about a full what3words address? Since it is three words separated by dots, the minimum address length depends on the minimum word length. An address will be shortest when each of its three words is the shortest possible in that language. And, importantly, when counting an address’s length, we include those two dots, but we do not count the /// prefix. For example, consider the address ///raft.robe.post. Here, each word is four letters long (the minimum in an English what3words address). How many Unicode code points is that address? Let’s break it down:

Words: “raft”, “robe” and “post” – that’s 12 code points (four for each word)
Separators: two “.” dots – that’s 2 code points (the dot is a single code point)
Total (words + dots): 12 + 2 = 14 code points for the portion raft.robe.post

Therefore, raft.robe.post is 14 code points in length.
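
The same tally can be reproduced in a few lines of Python, where len() on a string counts code points. This is only a minimal sketch: the helper name address_length is made up for illustration and is not part of any what3words library.

```python
def address_length(address: str) -> int:
    """Count the code points in an address, ignoring a leading /// prefix.

    Hypothetical helper for illustration only.
    """
    body = address.strip()
    if body.startswith("///"):
        body = body[3:]  # the /// prefix is not counted
    return len(body)     # len() on a Python str counts code points


words = "raft.robe.post".split(".")
print([len(word) for word in words])        # [4, 4, 4] -> 12 code points of words
print(address_length("///raft.robe.post"))  # 14 (12 for the words + 2 for the dots)
```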

So what’s the shortest possible what3words address? It would be one from a language that allows the shortest words. If a language allowed 2 code-point words, a minimal address could be 2+2+2 (words) + 2 (dots) = 8 code points total.
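
The arithmetic generalises to any language: three words plus two dots. The sketch below applies it to a few minimum and maximum word lengths taken from the table at the end of this post; the function name address_length_bounds is an assumption made for the example.

```python
from typing import Tuple

WORDS_PER_ADDRESS = 3
DOT_SEPARATORS = 2  # each dot is a single code point


def address_length_bounds(min_word: int, max_word: int) -> Tuple[int, int]:
    """Return (shortest, longest) address length in code points, excluding ///."""
    shortest = WORDS_PER_ADDRESS * min_word + DOT_SEPARATORS
    longest = WORDS_PER_ADDRESS * max_word + DOT_SEPARATORS
    return shortest, longest


print(address_length_bounds(4, 18))  # English: (14, 56)
print(address_length_bounds(2, 7))   # Amharic: (8, 23)
print(address_length_bounds(2, 4))   # Chinese: (8, 14)
```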

Unicode code points vs characters: understanding the difference
You’ve seen us talk a lot about Unicode code points instead of just “letters” or “characters”. What exactly does that mean, and why does it matter? This part is a little technical, but it’s really important. A Unicode code point is essentially a number that represents a single symbol in the Unicode standard. Think of Unicode as a giant catalogue of almost every character in every writing system, each with its own number. A code point is usually written like U+XXXX (where XXXX is a hex number). For example, the Latin script capital letter “A” is Unicode code point U+0041. However, the Cyrillic script capital letter “А” – which looks identical to its Latin script counterpart – is Unicode code point U+0410, because it’s important to differentiate between writing scripts.
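
To see this for yourself, Python's built-in ord() returns the code point number behind a character. The short sketch below formats it in the usual U+XXXX style; code_point_label is a hypothetical helper name chosen for this example.

```python
import unicodedata


def code_point_label(ch: str) -> str:
    """Format a single character's code point as U+XXXX."""
    return f"U+{ord(ch):04X}"


latin_a = "\u0041"     # Latin capital letter A
cyrillic_a = "\u0410"  # Cyrillic capital letter A, visually identical

print(code_point_label(latin_a), unicodedata.name(latin_a))
print(code_point_label(cyrillic_a), unicodedata.name(cyrillic_a))
print(latin_a == cyrillic_a)  # False: same shape, different code points
```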

Now, you might assume each character you type corresponds to one code point. Often that’s true – for basic letters like A to Z or あ or 中, there’s a single code point for each. But Unicode is flexible: some displayed characters are actually a combination of multiple code points. These combinations typically involve something called combining marks – special code points that attach to a base letter to modify it. In simpler terms, you can have what looks like one character on your screen, but under the hood it’s made up of two or more code points. This is where counting characters gets tricky, and why what3words uses code points as the unit for length. Let’s look at two examples to make this clear:

Devanagari script example: as mentioned earlier, consonants in languages such as Hindi come with their own inherent vowel. The Hindi consonant “क” (U+0915, “ka”) is automatically pronounced with an ‘a’ vowel. If the word requires a ‘ku’ sound (as in the word “kuchh”, meaning “something”), the person typing must add the ‘u’ vowel sign after the “क”. This is “ु” (U+0941); when it is shown on its own, the dotted circle represents the consonant the vowel will attach itself to. The result is कु (“ku”). And while this looks like a single character, it is in fact two Unicode code points.
Thai script example: in Thai, consonants have inherent vowels (just like Hindi), but the writing script also includes tone markers that have their own Unicode code point and often stack above a consonant. For instance, combining the Thai code points “ร” (U+0E23, equivalent to the consonant “r”), “ู” (U+0E39, the vowel “u”) and “้” (U+0E49, a falling tone) creates “รู้” – which still looks like it’s one character wide to the naked eye!
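
Both examples can be checked with a short Python sketch. The strings are written with escape sequences so the individual code points stay visible; the describe helper is hypothetical and simply lists what each string is made of.

```python
import unicodedata


def describe(text: str) -> list:
    """List each code point in text as (U+XXXX label, Unicode category)."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


ku = "\u0915\u0941"        # Devanagari: consonant ka + vowel sign u
ru = "\u0E23\u0E39\u0E49"  # Thai: consonant + vowel + tone mark

print(len(ku), len(ru))  # 2 3
print(describe(ku))      # the vowel sign has category Mn (non-spacing mark)
print(describe(ru))
```

Each string renders as a single stacked cluster on screen, yet len() reports every code point separately, which is the count what3words uses.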

It’s important to note that in Latin script languages that use diacritics, a letter such as “á” is a single Unicode code point – in this case, U+00E1. It is also technically possible to construct the same visual character using two Unicode code points: the base letter “a” (U+0061) followed by a combining acute accent (U+0301). However, this decomposed sequence, while valid Unicode, is not how what3words words are constructed.
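
If your input might arrive in the decomposed form (for example, from some keyboards or file systems), one option is to normalise it before counting. The sketch below uses Python's standard unicodedata.normalize with the NFC form, which composes “a” + U+0301 into U+00E1. Whether what3words itself normalises input is not stated here, so treat nfc_length as an illustrative, hypothetical helper rather than official behaviour.

```python
import unicodedata

precomposed = "\u00E1"  # á as a single code point
decomposed = "a\u0301"  # a + combining acute accent

print(len(precomposed), len(decomposed))  # 1 2
print(precomposed == decomposed)          # False, although they look the same


def nfc_length(text: str) -> int:
    """Count code points after composing characters with NFC (assumed approach)."""
    return len(unicodedata.normalize("NFC", text))


print(nfc_length(decomposed))  # 1
```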

By using Unicode code points, what3words measures text length in a consistent way across all languages and scripts. Every code point counts as one, whether it’s a standalone letter or a combining mark.

Why measure in code points?
Choosing code points as the measure for word length ensures that what3words users have the best words possible in their addresses, regardless of the language they speak. If we counted characters instead of code points, we might disproportionately affect user experience in some languages, because that would eliminate too many short words. It also means that a per-language rule such as “minimum three code points per word” lets an LLM decide deterministically whether a string could be a what3words address, provided it knows the minimum and maximum number of code points allowed in that language.

From a developer’s perspective, this means that when you validate or process a what3words address, you should count the length in code points (e.g., using a Unicode-aware length function) rather than just counting characters or bytes. It’s a little detail that prevents bugs when addresses contain unusual letters. And from a user’s perspective, you don’t really notice any of this – you just get words that are long enough to be distinct and unambiguous.
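
As a rough illustration of that advice, here is a minimal Python sketch of a length check. The WORD_LIMITS dictionary holds a small subset of the values from the table at the end of this post; the function name is_plausible_length and the overall approach are assumptions for the example, not the what3words API. It checks lengths only and says nothing about whether the words exist in a word list.

```python
# language code -> (minimum, maximum) word length in code points (subset of the table)
WORD_LIMITS = {"en": (4, 18), "hi": (2, 17), "zh": (2, 4)}


def is_plausible_length(address: str, language: str) -> bool:
    """Check that an address has three words within the language's length limits."""
    body = address[3:] if address.startswith("///") else address
    words = body.split(".")
    if len(words) != 3:
        return False
    shortest, longest = WORD_LIMITS[language]
    # len() on a str counts code points, not bytes or on-screen characters
    return all(shortest <= len(word) <= longest for word in words)


print(is_plausible_length("///raft.robe.post", "en"))  # True
print(is_plausible_length("cat.robe.post", "en"))      # False: "cat" is too short

ku = "\u0915\u0941"            # two code points that render as one cluster
print(len(ku))                  # 2
print(len(ku.encode("utf-8")))  # 6 bytes, so counting bytes would mislead
```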

Conclusion
In summary, the shortest words in what3words are defined with careful consideration of Unicode code points. Every what3words word must meet a language-specific minimum length, and when three words form an address, we count the two dot separators in the length, but we don’t count the /// prefix. This approach keeps what3words addresses consistent and easy for software to validate. So next time you see a what3words address, you’ll know that there’s a hidden Unicode calculation behind those words. And whether you’re a developer ensuring your code handles addresses properly, or an AI parsing text, remember: when counting the length of a what3words address, count the code points, not just the characters.

Language Name	ISO code	what3words API language code	what3words Locale code	Script	Minimum word length (codepoints)	Maximum word length (codepoints)	Minimum what3words address length (codepoints), excluding /// prefix	Maximum what3words address length (codepoints), excluding /// prefix
Afrikaans	af	af		Latin	4	16	14	50
Amharic	am	am		Ethiopic	2	7	8	23
Arabic	ar	ar		Arabic	3	12	11	38
Bahasa Indonesia	id	id		Latin	4	18	14	56
Bahasa Malaysia	ms	ms		Latin	4	17	14	53
Bengali	bn	bn		Bengali	3	20	11	62
Bosnian	bs	oo	oo_cy	Cyrillic	4	14	14	44
Bosnian	bs	oo	oo_la	Latin	4	14	14	44
Bulgarian	bg	bg		Cyrillic	4	16	14	50
Catalan	ca	ca		Latin	4	15	14	47
Chinese	zh	zh	zh_si	Han (Simplified)	2	4	8	14
Chinese	zh	zh	zh_tr	Han (Traditional)	2	4	8	14
Croatian	hr	oo	oo_cy	Cyrillic	4	14	14	44
Croatian	hr	oo	oo_la	Latin	4	14	14	44
Czech	cs	cs		Latin	4	18	14	56
Danish	da	da		Latin	4	21	14	65
Dutch	nl	nl		Latin	4	18	14	56
English	en	en		Latin	4	18	14	56
Estonian	et	et		Latin	4	15	14	47
Finnish	fi	fi		Latin	4	17	14	53
French	fr	fr		Latin	3	20	11	62
German	de	de		Latin	3	19	11	59
Greek	el	el		Greek	4	18	14	56
Gujarati	gu	gu		Gujarati	3	17	11	53
Hebrew	he	he		Hebrew	3	17	11	53
Hindi	hi	hi		Devanagari	2	17	8	53
Hungarian	hu	hu		Latin	4	16	14	50
isiXhosa	xh	xh		Latin	4	14	14	44
isiZulu	zu	zu		Latin	4	14	14	44
Italian	it	it		Latin	4	19	14	59
Japanese	ja	ja		Hiragana	2	8	8	26
Kannada	kn	kn		Kannada	3	18	11	56
Kazakh	kk	kk	kk_cy	Cyrillic	4	15	14	47
Kazakh	kk	kk	kk_la	Latin	4	15	14	47
Khmer	km	km		Khmer	3	15	11	47
Korean	ko	ko		Hangul	2	5	8	17
Lao	lo	lo		Lao	2	19	8	59
Malayalam	ml	ml		Malayalam	3	19	11	59
Marathi	mr	mr		Devanagari	2	18	8	56
Mongolian	mn	mn	mn_cy	Cyrillic	4	17	14	53
Mongolian	mn	mn	mn_la	Latin	4	17	14	53
Montenegrin	me	oo	oo_cy	Cyrillic	4	14	14	44
Montenegrin	me	oo	oo_la	Latin	4	14	14	44
Nepali	ne	ne		Devanagari	2	18	8	56
Norwegian	no	no		Latin	4	21	14	65
Odia	or	or		Oriya (Odia)	3	14	11	44
Persian	fa	fa		Arabic	2	12	8	38
Polish	pl	pl		Latin	4	17	14	53
Portuguese	pt	pt		Latin	4	19	14	59
Punjabi	pa	pa		Gurmukhi	3	15	11	47
Romanian	ro	ro		Latin	4	16	14	50
Russian	ru	ru		Cyrillic	4	18	14	56
Serbian	sr	oo	oo_cy	Cyrillic	4	14	14	44
Serbian	sr	oo	oo_la	Latin	4	14	14	44
Sinhala	si	si		Sinhala	3	15	11	47
Slovak	sk	sk		Latin	4	14	14	44
Slovene	sl	sl		Latin	4	15	14	47
Spanish	es	es		Latin	4	19	14	59
Swahili	sw	sw		Latin	4	15	14	47
Swedish	sv	sv		Latin	3	18	11	56
Tamil	ta	ta		Tamil	3	18	11	56
Telugu	te	te		Telugu	3	19	11	59
Thai	th	th		Thai	3	15	11	47
Turkish	tr	tr		Latin	4	17	14	53
Ukrainian	uk	uk		Cyrillic	4	15	14	47
Urdu	ur	ur		Arabic	2	15	8	47
Vietnamese	vi	vi		Latin	4	18	14	56
Welsh	cy	cy		Latin	4	16	14	50

Notes:

*The API groups Bosnian, Croatian, Montenegrin and Serbian under language code oo

*oo_cy covers the Cyrillic variant of BCMS and exists for Bosnian, Montenegrin and Serbian use‑cases.