04/08/2025

As you’ll know, every what3words address consists of three words separated by dots (like ///word1.word2.word3). But when it comes to measuring how long those words (and addresses) are, there’s a bit of a twist: we count Unicode code points, not just the letters you see on screen. In this blog, we’ll explore the minimum lengths of words and addresses in what3words, why we measure them in code points, and what exactly the difference is between code points and characters.

Minimum word length in what3words
what3words addresses are designed to avoid very short words. In fact, each language has a rule for the minimum length a word can be. For example, in the English word list, the shortest words are four letters long. Many other languages have similar minimums of three or four letters.

Why the minimum length? It’s partly to prevent confusion. By avoiding very short words, what3words makes speech-to-text recognition of addresses more reliable (meaning “table branched spoon” couldn’t be mis-transcribed as “table branch to spoon” if the voice recogniser is aware of the four-character minimum word length in English). The minimum length rule gives each word a bit of “heft” so it stands out clearly when spoken or read. It’s important to note that these lengths are counted in Unicode code points. That means we’re counting the underlying units of text (more on that below), not just the number of letters as you might intuitively think.
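The speech-recognition point above can be sketched as a simple filter. This is only an illustration, assuming a hypothetical recogniser that returns several candidate transcriptions as lists of words; the function and variable names are made up for this example and are not part of any what3words product.

```python
# Hypothetical sketch: discard candidate transcriptions containing a word
# shorter than the language's minimum (4 code points for English).
MIN_WORD_LENGTH_EN = 4


def plausible_candidates(candidates, min_len=MIN_WORD_LENGTH_EN):
    """Keep only candidates whose words all meet the minimum length."""
    return [
        words for words in candidates
        if all(len(word) >= min_len for word in words)
    ]


candidates = [
    ["table", "branched", "spoon"],
    ["table", "branch", "to", "spoon"],  # "to" is only 2 code points
]
print(plausible_candidates(candidates))
# [['table', 'branched', 'spoon']]
```

The second candidate is rejected purely because "to" falls below the minimum, which is the effect described above.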

We balance the need for clarity and “heft” with a desire to create user-friendly products; we know that short words are generally easy to remember and quicker to say or type. In the Latin script and other alphabets (such as Cyrillic and Greek), where a letter generally corresponds to a single Unicode code point, a four-letter word balances the requirements well.

However, other writing scripts work differently. In the Devanagari script used to write Hindi, consonants come with an inherent vowel that does not need to be typed: while the spoken word is no shorter, the number of code points can be significantly reduced. Many Asian languages are written in scripts with a similar construction. In these languages, shorter words (often down to two code points) may be used.

Furthermore, for Chinese, Japanese, Korean and Amharic, where every syllable generally maps to a single code point, two-character words are used in what3words addresses.

Minimum address length (don’t forget the dots!)
Now, what about a full what3words address? Since it is three words separated by dots, the minimum address length depends on the minimum word length. An address will be shortest when each of its three words is the shortest possible in that language. And, importantly, when counting an address’s length, we include those two dots, but we do not count the /// prefix. For example, consider the address ///raft.robe.post. Here, each word is four letters long (the minimum in an English what3words address). How many code points is that address? Let’s break it down:

  • Words: “raft”, “robe” and “post” – that’s 12 code points (four for each word)
  • Separators: two “.” dots – that’s 2 code points (the dot is a single code point)
  • Total (words + dots): 12 + 2 = 14 code points for the portion raft.robe.post

Therefore, raft.robe.post is 14 code points in length.
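The same count can be reproduced in a few lines of Python, where `len()` on a string returns the number of code points. This is a minimal sketch; `address_length` is an illustrative helper name, not a what3words API function.

```python
def address_length(address: str) -> int:
    """Count the code points in an address, ignoring a leading /// prefix."""
    if address.startswith("///"):
        address = address[3:]
    # len() on a Python str counts Unicode code points, dots included.
    return len(address)


words = "raft.robe.post".split(".")
print(sum(len(word) for word in words))      # 12 (the three words)
print(address_length("///raft.robe.post"))   # 14 (words plus two dots)
```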

So what’s the shortest possible what3words address? It would be one from a language that allows the shortest words. If a language allowed 2 code-point words, a minimal address could be 2+2+2 (words) + 2 (dots) = 8 code points total.
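Put as arithmetic, an address built from three words of equal length n is 3n + 2 code points long. The sketch below applies that to the minimum and maximum word lengths of two languages, using values copied from the table at the end of this article; the function name is illustrative only.

```python
DOTS = 2  # two "." separators, one code point each


def address_length_bounds(min_word: int, max_word: int) -> tuple:
    """Shortest and longest address length for given word-length limits."""
    return 3 * min_word + DOTS, 3 * max_word + DOTS


print(address_length_bounds(4, 18))  # English: (14, 56)
print(address_length_bounds(2, 17))  # Hindi: (8, 53)
```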

Unicode code points vs characters: understanding the difference
You’ve seen us talk a lot about Unicode code points instead of just “letters” or “characters”. What exactly does that mean, and why does it matter? This part is a little technical, but it’s really important. A code point is essentially a number that represents a single symbol in the Unicode standard. Think of Unicode as a giant catalogue of almost every character in every writing system, each with its own number. A code point is usually written like U+XXXX (where XXXX is a hexadecimal number). For example, the Latin script capital letter “A” is code point U+0041. However, the Cyrillic script capital letter “А” – which looks identical to its Latin script counterpart – is code point U+0410, because it’s important to differentiate between writing scripts.
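To see the two look-alike letters side by side, the short sketch below prints each one's code point and official Unicode name using Python's standard `unicodedata` module.

```python
import unicodedata

latin_a = "A"          # Latin capital letter A
cyrillic_a = "\u0410"  # Cyrillic capital letter A, visually identical

for ch in (latin_a, cyrillic_a):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0041 LATIN CAPITAL LETTER A
# U+0410 CYRILLIC CAPITAL LETTER A

print(latin_a == cyrillic_a)  # False: different code points
```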

Now, you might assume each character you type corresponds to one code point. Often that’s true – for basic letters like A to Z or あ or 中, there’s a single code point for each. But Unicode is flexible: some displayed characters are actually a combination of multiple code points. These combinations typically involve something called combining marks – special code points that attach to a base letter to modify it. In simpler terms, you can have what looks like one character on your screen, but under the hood it’s made up of two or more code points. This is where counting characters gets tricky, and why what3words uses code points as the unit for length. Let’s look at two examples to make this clear:

  • Devanagari script example: as mentioned earlier, consonants in languages such as Hindi come with their own inherent vowel. The Hindi consonant “क” (U+0915, “ka”) is automatically pronounced with an ‘a’ vowel. If the word requires a ‘ku’ sound (such as the word “kuchh”, meaning “something”), the person typing must add the ‘u’ vowel after the “क”. This is “ु” (U+0941) – the dotted circle represents the consonant the vowel will attach itself to. The result is कु (“ku”). And while this looks like a single character, it is in fact two code points.
  • Thai script example: in Thai, consonants have inherent vowels (just like Hindi), but the writing script also includes tone markers that have their own code points, and often stack above a consonant. For instance, combining the Thai code points “ร” (U+0E23, equivalent to the consonant “r”), “ู” (U+0E39, the vowel “u”) and “้” (U+0E49, a falling tone) creates “รู้” – which still looks like it’s one character wide to the naked eye!
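Both examples can be checked in Python. The sketch below builds each string from the code points named above and lists what it contains; `code_points` is an illustrative helper, not a library function.

```python
import unicodedata


def code_points(text: str) -> list:
    """Return each code point in text as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


ku = "\u0915\u0941"        # Devanagari: consonant ka + vowel sign u
ru = "\u0E23\u0E39\u0E49"  # Thai: consonant r + vowel u + tone mark

print(len(ku))  # 2
print(len(ru))  # 3
for entry in code_points(ku) + code_points(ru):
    print(entry)
```

Each string renders as a single visual unit, yet `len()` reports two and three code points respectively.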

It’s important to note that in Latin script languages that use diacritics, a letter such as “á” is a single code point – in this case, U+00E1. It is also technically possible to construct the same visual character using two code points: the base letter “a” (U+0061) followed by a combining acute accent (U+0301). However, this decomposed form is not how what3words words are constructed.
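The two spellings of “á” can be compared directly. The sketch below also shows Unicode normalisation with Python's `unicodedata.normalize`; applying NFC to user input before counting is an assumption made for this example, not something this article says what3words does.

```python
import unicodedata

precomposed = "\u00E1"  # á as a single code point
decomposed = "a\u0301"  # a + combining acute accent

print(len(precomposed), len(decomposed))  # 1 2
print(precomposed == decomposed)          # False, although they look alike

# NFC composes base letter + combining mark into the single code point;
# NFD does the reverse.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(len(unicodedata.normalize("NFD", precomposed)))           # 2
```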

By using Unicode code points, what3words measures text length in a consistent way across all languages and scripts. Every code point counts as one, whether it’s a standalone letter or a combining mark.

Why measure in code points?
Choosing code points as the measure for word length ensures that what3words users have the best words possible in their addresses, regardless of the language they speak. If we counted characters instead of code points, we might disproportionately affect user experience in some languages where that would eliminate too many short words. It also ensures that a rule like “minimum three code points per word” lets an LLM decide deterministically whether a piece of text could be a what3words address, once it knows the minimum and maximum code points allowed in a specific language.
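A deterministic length check of this kind might look like the sketch below. The bounds are copied from the table at the end of this article for a handful of languages; the dictionary and function names are hypothetical. It only tests shape and length, so a passing string is not guaranteed to be a real what3words address.

```python
# (min, max) word length in code points, taken from the table in this article.
WORD_LENGTH_BOUNDS = {
    "en": (4, 18),
    "fr": (3, 20),
    "hi": (2, 17),
    "zh": (2, 4),
}


def could_be_address(text: str, language: str) -> bool:
    """Length-only check: three dot-separated words within the bounds."""
    shortest, longest = WORD_LENGTH_BOUNDS[language]
    if text.startswith("///"):
        text = text[3:]
    words = text.split(".")
    return len(words) == 3 and all(
        shortest <= len(word) <= longest for word in words
    )


print(could_be_address("///raft.robe.post", "en"))      # True
print(could_be_address("cat.robe.post", "en"))          # False: "cat" < 4
print(could_be_address("table.branch.to.spoon", "en"))  # False: four words
```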

From a developer’s perspective, this means that when you validate or process a what3words address, you should count the length in code points (e.g., using a Unicode-aware length function) rather than just counting characters or bytes. It’s a little detail that prevents bugs when addresses contain unusual letters. And from a user’s perspective, you don’t really notice any of this – you just get words that are long enough to be distinct and unambiguous.
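As a rough illustration of why the counting unit matters, the sketch below compares code points with UTF-8 bytes for three strings. The Hindi string is the word “kuchh” mentioned earlier, used here only as sample text; `describe` is an illustrative helper.

```python
def describe(text: str) -> dict:
    """Report the length of text in code points and in UTF-8 bytes."""
    return {
        "code_points": len(text),                 # what what3words counts
        "utf8_bytes": len(text.encode("utf-8")),  # storage size, not length
    }


print(describe("raft.robe.post"))      # {'code_points': 14, 'utf8_bytes': 14}
print(describe("\u0915\u0941\u091B"))  # {'code_points': 3, 'utf8_bytes': 9}
print(describe("\u00E1"))              # {'code_points': 1, 'utf8_bytes': 2}
```

For plain ASCII the two counts agree, which is why byte-based length checks can appear to work until an address in another script arrives.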

Conclusion
In summary, the shortest words in what3words are defined with careful consideration of Unicode code points. Every what3words word must meet a language-specific minimum length, and when three words form an address, we count the two dot separators in the length, but we don’t count the /// prefix. This approach keeps what3words addresses consistent and easy for software to validate. So next time you see a what3words address, you’ll know that there’s a hidden Unicode calculation behind those words. And whether you’re a developer ensuring your code handles addresses properly, or an AI parsing text, remember: when counting the length of a what3words address, count the code points, not just the characters.

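| Language name | ISO code | what3words API language code | what3words locale code | Script | Minimum word length (code points) | Maximum word length (code points) | Minimum what3words address length (code points), excluding /// prefix | Maximum what3words address length (code points), excluding /// prefix |
|---|---|---|---|---|---|---|---|---|
| Afrikaans | af | af | | Latin | 4 | 16 | 14 | 50 |
| Amharic | am | am | | Ethiopic | 2 | 7 | 8 | 23 |
| Arabic | ar | ar | | Arabic | 3 | 12 | 11 | 38 |
| Bahasa Indonesia | id | id | | Latin | 4 | 18 | 14 | 56 |
| Bahasa Malaysia | ms | ms | | Latin | 4 | 17 | 14 | 53 |
| Bengali | bn | bn | | Bengali | 3 | 20 | 11 | 62 |
| Bosnian | bs | oo | oo_cy | Cyrillic | 4 | 14 | 14 | 44 |
| Bosnian | bs | oo | oo_la | Latin | 4 | 14 | 14 | 44 |
| Bulgarian | bg | bg | | Cyrillic | 4 | 16 | 14 | 50 |
| Catalan | ca | ca | | Latin | 4 | 15 | 14 | 47 |
| Chinese | zh | zh | zh_si | Han (Simplified) | 2 | 4 | 8 | 14 |
| Chinese | zh | zh | zh_tr | Han (Traditional) | 2 | 4 | 8 | 14 |
| Croatian | hr | oo | oo_cy | Cyrillic | 4 | 14 | 14 | 44 |
| Croatian | hr | oo | oo_la | Latin | 4 | 14 | 14 | 44 |
| Czech | cs | cs | | Latin | 4 | 18 | 14 | 56 |
| Danish | da | da | | Latin | 4 | 21 | 14 | 65 |
| Dutch | nl | nl | | Latin | 4 | 18 | 14 | 56 |
| English | en | en | | Latin | 4 | 18 | 14 | 56 |
| Estonian | et | et | | Latin | 4 | 15 | 14 | 47 |
| Finnish | fi | fi | | Latin | 4 | 17 | 14 | 53 |
| French | fr | fr | | Latin | 3 | 20 | 11 | 62 |
| German | de | de | | Latin | 3 | 19 | 11 | 59 |
| Greek | el | el | | Greek | 4 | 18 | 14 | 56 |
| Gujarati | gu | gu | | Gujarati | 3 | 17 | 11 | 53 |
| Hebrew | he | he | | Hebrew | 3 | 17 | 11 | 53 |
| Hindi | hi | hi | | Devanagari | 2 | 17 | 8 | 53 |
| Hungarian | hu | hu | | Latin | 4 | 16 | 14 | 50 |
| isiXhosa | xh | xh | | Latin | 4 | 14 | 14 | 44 |
| isiZulu | zu | zu | | Latin | 4 | 14 | 14 | 44 |
| Italian | it | it | | Latin | 4 | 19 | 14 | 59 |
| Japanese | ja | ja | | Hiragana | 2 | 8 | 8 | 26 |
| Kannada | kn | kn | | Kannada | 3 | 18 | 11 | 56 |
| Kazakh | kk | kk | kk_cy | Cyrillic | 4 | 15 | 14 | 47 |
| Kazakh | kk | kk | kk_la | Latin | 4 | 15 | 14 | 47 |
| Khmer | km | km | | Khmer | 3 | 15 | 11 | 47 |
| Korean | ko | ko | | Hangul | 2 | 5 | 8 | 17 |
| Lao | lo | lo | | Lao | 2 | 19 | 8 | 59 |
| Malayalam | ml | ml | | Malayalam | 3 | 19 | 11 | 59 |
| Marathi | mr | mr | | Devanagari | 2 | 18 | 8 | 56 |
| Mongolian | mn | mn | mn_cy | Cyrillic | 4 | 17 | 14 | 53 |
| Mongolian | mn | mn | mn_la | Latin | 4 | 17 | 14 | 53 |
| Montenegrin | me | oo | oo_cy | Cyrillic | 4 | 14 | 14 | 44 |
| Montenegrin | me | oo | oo_la | Latin | 4 | 14 | 14 | 44 |
| Nepali | ne | ne | | Devanagari | 2 | 18 | 8 | 56 |
| Norwegian | no | no | | Latin | 4 | 21 | 14 | 65 |
| Odia | or | or | | Oriya (Odia) | 3 | 14 | 11 | 44 |
| Persian | fa | fa | | Arabic | 2 | 12 | 8 | 38 |
| Polish | pl | pl | | Latin | 4 | 17 | 14 | 53 |
| Portuguese | pt | pt | | Latin | 4 | 19 | 14 | 59 |
| Punjabi | pa | pa | | Gurmukhi | 3 | 15 | 11 | 47 |
| Romanian | ro | ro | | Latin | 4 | 16 | 14 | 50 |
| Russian | ru | ru | | Cyrillic | 4 | 18 | 14 | 56 |
| Serbian | sr | oo | oo_cy | Cyrillic | 4 | 14 | 14 | 44 |
| Serbian | sr | oo | oo_la | Latin | 4 | 14 | 14 | 44 |
| Sinhala | si | si | | Sinhala | 3 | 15 | 11 | 47 |
| Slovak | sk | sk | | Latin | 4 | 14 | 14 | 44 |
| Slovene | sl | sl | | Latin | 4 | 15 | 14 | 47 |
| Spanish | es | es | | Latin | 4 | 19 | 14 | 59 |
| Swahili | sw | sw | | Latin | 4 | 15 | 14 | 47 |
| Swedish | sv | sv | | Latin | 3 | 18 | 11 | 56 |
| Tamil | ta | ta | | Tamil | 3 | 18 | 11 | 56 |
| Telugu | te | te | | Telugu | 3 | 19 | 11 | 59 |
| Thai | th | th | | Thai | 3 | 15 | 11 | 47 |
| Turkish | tr | tr | | Latin | 4 | 17 | 14 | 53 |
| Ukrainian | uk | uk | | Cyrillic | 4 | 15 | 14 | 47 |
| Urdu | ur | ur | | Arabic | 2 | 15 | 8 | 47 |
| Vietnamese | vi | vi | | Latin | 4 | 18 | 14 | 56 |
| Welsh | cy | cy | | Latin | 4 | 16 | 14 | 50 |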

Notes:

*The API groups Bosnian, Croatian, Montenegrin and Serbian under language code oo

*oo_cy covers the Cyrillic variant of BCMS and exists for Bosnian, Montenegrin and Serbian use‑cases.