The broken “b”
I downloaded two hundred thousand words from twenty-one languages. I wanted to know which letter pairs people actually read most often: not which pairs designers kern, but which pairs show up in real text. Simple question. One Python script, public data from Leipzig, zero dependencies. I put it on GitHub as open source.
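For a sense of what such a count involves, here is a minimal sketch. It is not the script from the repository: the function name, the toy word list, and the choice to weight each pair by its word's frequency are assumptions made for illustration.

```python
from collections import Counter


def count_bigrams(word_freqs):
    """Count adjacent letter pairs, weighting each pair by its word's frequency.

    word_freqs maps a word to how often it occurs (assumed input shape).
    """
    pairs = Counter()
    for word, freq in word_freqs.items():
        # zip(word, word[1:]) yields every adjacent pair of characters
        for left, right in zip(word, word[1:]):
            pairs[left + right] += freq
    return pairs


# Toy input, not real corpus data
sample = {"arbeit": 5, "aber": 3}
pairs = count_bigrams(sample)
print(pairs.most_common(1))  # [('be', 8)]
```

Weighting by word frequency is what separates "pairs people read" from "pairs that exist in a dictionary": a pair that sits in one very common word can outrank a pair spread over many rare ones.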
Then I looked deeper and something was off. The German dictionary had its letter “b” eaten. Not randomly: systematically. “Überdeckt” stored as “Üerdeckt”. “Arbeitskräfte” as “Areitskräfte”. “Selbstverständlich” as “Selstverständlich”. The prefixes be-, über-, ab-, the backbone of German morphology, were all mangled. Not my fault. That’s how the original file from the Leipzig Corpora looked. Someone somewhere exported a corrupted file and nobody caught it.
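One way to catch this kind of damage is to compare letter shares between two editions of the same word list and flag any letter whose share collapses. The sketch below is an assumption about how such a check could look, not the check I actually ran; the function names and the 0.5 threshold are made up for illustration.

```python
from collections import Counter


def letter_shares(words):
    """Return each letter's share of all letters in the word list."""
    counts = Counter(ch for w in words for ch in w.lower() if ch.isalpha())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def suspicious_letters(suspect, reference, ratio=0.5):
    """Letters whose share in `suspect` is below `ratio` times their share in `reference`."""
    a = letter_shares(suspect)
    b = letter_shares(reference)
    return sorted(ch for ch, share in b.items() if a.get(ch, 0.0) < share * ratio)


# The three examples from the text, in their broken and intact forms
broken = ["Üerdeckt", "Areitskräfte", "Selstverständlich"]
clean = ["Überdeckt", "Arbeitskräfte", "Selbstverständlich"]
flagged = suspicious_letters(broken, clean)
print(flagged)  # ['b']
```

On a full word list the threshold would need tuning, since rare letters fluctuate between editions even when nothing is broken.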
I downloaded a different edition of the corpus, a clean one, and reran everything. Along the way a few things came up that I didn’t expect. 252 letter pairs cover over 82% of text. Nearly 70% of all unique pairs contain diacritics, but they carry just 12% of frequency. English “wins” only 6 bigrams out of 2,528, while Croatian dominates the top 100. Not because it matters more, but because its frequency distribution across the corpus is shaped differently.
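Both kinds of number (how much text the top pairs cover, and how pairs with diacritics split between unique count and frequency) come from simple sums over the pair counts. A hedged sketch follows, with hypothetical function names and toy counts rather than the real results. It assumes a "diacritic" is any character that decomposes into a base letter plus a combining mark under Unicode NFD; letters such as “đ” or “ł” do not decompose that way and would need an explicit list.

```python
import unicodedata


def coverage(pair_counts, top_n):
    """Share of total frequency carried by the top_n most frequent pairs."""
    total = sum(pair_counts.values())
    top = sorted(pair_counts.values(), reverse=True)[:top_n]
    return sum(top) / total if total else 0.0


def has_diacritic(text):
    """True if any character decomposes into base letter + combining mark (NFD)."""
    return any(unicodedata.combining(ch) for ch in unicodedata.normalize("NFD", text))


def diacritic_shares(pair_counts):
    """Return (share of unique pairs, share of frequency) for pairs with diacritics."""
    if not pair_counts:
        return 0.0, 0.0
    marked = {p: n for p, n in pair_counts.items() if has_diacritic(p)}
    unique_share = len(marked) / len(pair_counts)
    freq_share = sum(marked.values()) / sum(pair_counts.values())
    return unique_share, freq_share


# Toy counts, not real corpus data
counts = {"en": 50, "er": 30, "äh": 15, "ść": 5}
print(coverage(counts, 2))
print(diacritic_shares(counts))
```

Even in the toy data the pattern from the text shows up: half of the unique pairs carry diacritics, yet they account for a much smaller share of total frequency.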
Data has its quirks. You need to know them to trust it.
Repository: github.com/zeroszescszesc/066-kerning-pairs