Hermiona (part one)
Existing autokerning tools measure the white space between letters, compare it to a reference, equalize. They all run on a single formula. The problem isn’t that they can’t see shapes — they can. The problem is they apply one rule to every pair. They don’t learn from examples. They don’t know that the same space needs a different correction depending on the font, the context, what the designer is trying to achieve. The formula is rigid. Kerning decisions are not.
I wanted to try something different. Instead of writing rules — train a model on hundreds of fonts. Render every pair. Extract from the image what can be measured — edge shape, gap size, ink density, letter openness. And let the network find patterns on its own. No metadata. No knowledge of whether the font is serif or sans. Pure vision.
My wife came up with the name. Hermiona. Because she was the biggest brain.
The network is tiny. Two layers of 256 neurons, one of 64, one output. 97 thousand parameters. For comparison — GPT-4 reportedly has over a trillion. It’s not the same league, but it’s not the same problem either. GPT learns language from billions of sentences. Hermiona learns one thing: she looks at two letters next to each other and says how much to move them. She doesn’t need a billion parameters for that. She needs good eyes.
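The whole architecture is small enough to write out. A minimal NumPy sketch under the layer sizes from the text, assuming the 57 pixel-derived input features described below; the names, the ReLU nonlinearity, and the initialization are my illustrative choices, not the project's actual code:

```python
import numpy as np

# Layer sizes from the text: 57 input features -> 256 -> 256 -> 64 -> 1.
sizes = [57, 256, 256, 64, 1]

# Parameter count: one weight matrix plus one bias vector per layer.
n_params = sum(a * b + b for a, b in zip(sizes, sizes[1:]))
print(n_params)  # 97153 — the "97 thousand parameters"

def forward(x, weights, biases):
    """One kerning prediction: x is the 57-dim feature vector for a pair."""
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = x @ W + b
        if i < len(weights) - 1:     # hidden layers get a nonlinearity
            x = np.maximum(x, 0.0)   # ReLU (an assumption; the text doesn't say)
    return x[0]                      # a single number: how much to move the pair

rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.05, (a, b)) for a, b in zip(sizes, sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
print(forward(rng.normal(size=57), weights, biases))
```

With 57 inputs the count lands almost exactly on the 97 thousand the text mentions, which is a nice consistency check on the described layer sizes.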
I had one requirement: it must train on a Mac Mini 2011 with eight gigabytes of RAM. So that anyone could train their own model on whatever they have. Maybe that’s stubbornness. Hard to tell.
I started with fifteen horizontal rays per glyph. Top to bottom, each one measures where the ink ends. That gives you a shape profile — the contour of the edge facing the neighbor. On top of that, gap statistics: how much white at the narrowest point, at the widest, on average. Weights tuned so the middle of the letter counts more than the top and bottom — because that’s where the eye looks. 45 features total. Then openness — how much white space on the edge of the letter. C has a lot of open space on the right side. O is closed. Centers of gravity, width-to-height ratios. 57 features. Every one extracted from pixels.
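The ray idea can be sketched on a toy rasterized glyph. A simplified version, assuming binary bitmaps where True means ink; the function names, the bitmap shapes, and the middle-heavy weighting curve are all illustrative stand-ins, not the actual extractor:

```python
import numpy as np

def edge_profile(bitmap, n_rays=15, right_edge=True):
    """Cast horizontal rays top to bottom and record where the ink ends.

    bitmap: 2-D bool array, True = ink. For each ray, returns the white
    distance from the glyph box edge to the outermost ink pixel — the
    contour of the side facing the neighboring letter.
    """
    h, w = bitmap.shape
    rows = np.linspace(0, h - 1, n_rays).astype(int)  # 15 evenly spaced rays
    profile = []
    for r in rows:
        ink = np.flatnonzero(bitmap[r])
        if ink.size == 0:
            profile.append(w)                      # empty ray: all white
        elif right_edge:
            profile.append(w - 1 - ink.max())      # white right of the ink
        else:
            profile.append(int(ink.min()))         # white left of the ink
    return np.array(profile, dtype=float)

def gap_stats(left_glyph, right_glyph):
    """Per-ray gap between two glyphs side by side, summarized with
    weights that favor the middle rays — where the eye looks."""
    gap = (edge_profile(left_glyph, right_edge=True)
           + edge_profile(right_glyph, right_edge=False))
    weight = 1.0 - np.abs(np.linspace(-1, 1, len(gap))) * 0.5  # illustrative curve
    return {
        "min": gap.min(),
        "max": gap.max(),
        "weighted_mean": float(np.average(gap, weights=weight)),
    }

# Crude 15x10 bitmaps: an 'l'-like bar next to a 'C'-like open shape.
l_glyph = np.zeros((15, 10), bool); l_glyph[:, 2:4] = True
c_glyph = np.zeros((15, 10), bool); c_glyph[:, 1:3] = True; c_glyph[[0, 14], 3:8] = True
print(gap_stats(l_glyph, c_glyph))
```

On real glyphs, the 15 per-ray distances plus summary statistics like these make up part of the ray-based feature set the text describes.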
First tests on a dozen fonts — average error below fourteen units. I thought I had gold.
Added more fonts. Error went up. More fonts. Even worse. Instead of improving, the model was getting worse. This is the moment when most people make the network bigger. More layers, more neurons, heavier architectures. I couldn’t — I had the Mac Mini 2011 as a hard limit.
I started digging into the data instead of the architecture. And I found a ceiling. The same pair — l and apostrophe — had kerning ranging from minus one hundred ninety-one to plus two hundred twenty across different fonts. Nearly identical shapes. Median deviation per pair: thirty units. Every designer kerns their own way. The model can’t jump over that.
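The ceiling is easy to see once you tabulate a single pair across fonts. A sketch of that check with made-up kerning values standing in for real font data (only the quoted -191 and +220 endpoints come from the text; the median here reflects the fake numbers, not the real thirty-unit figure):

```python
import statistics

# Hypothetical per-font kerning for one pair, in font units.
# Real data showed 'l' + apostrophe spanning roughly -191 to +220.
kerns_by_font = {
    "FontA": -191, "FontB": -40, "FontC": 0, "FontD": 35,
    "FontE": 60, "FontF": 120, "FontG": 220,
}

values = list(kerns_by_font.values())
mean = statistics.fmean(values)

# Median absolute deviation from the pair's cross-font mean: how far a
# typical designer's choice sits from the average for this exact pair.
mad = statistics.median(abs(v - mean) for v in values)
print(f"range: {min(values)}..{max(values)}, median deviation: {mad:.0f}")
```

No model that predicts one absolute number per pair can beat this spread: the "right answer" genuinely differs from designer to designer.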
The solution turned out to be simple. Instead of teaching the model absolute values — minus eighty in one font is not the same as minus eighty in another — I teach it deviations from the average of a given font. “This pair should be tighter than your average by this much”. One change. The biggest jump.
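The normalization itself is one line of arithmetic per row. A sketch, assuming training data comes as (font, pair, kern) rows; the function and variable names are mine, not the project's:

```python
from collections import defaultdict

def normalize_targets(rows):
    """Replace absolute kerning with deviation from each font's own average.

    rows: iterable of (font_name, pair, kern_units). Returns the rows with
    the target re-expressed as "tighter/looser than this font's average",
    plus the per-font means needed to convert predictions back.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for font, _pair, kern in rows:
        sums[font][0] += kern
        sums[font][1] += 1
    font_mean = {f: s / n for f, (s, n) in sums.items()}

    normalized = [(font, pair, kern - font_mean[font])
                  for font, pair, kern in rows]
    return normalized, font_mean

# Two fonts with very different overall tightness end up with identical
# relative targets — which is exactly the point:
rows = [("A", "lT", -80), ("A", "AV", -40), ("B", "lT", -20), ("B", "AV", 20)]
norm, means = normalize_targets(rows)
print(means)  # {'A': -60.0, 'B': 0.0}
print(norm)
```

At prediction time the model outputs a deviation and the target font's own mean is added back, so "minus eighty" in a tight font and "minus twenty" in a loose one can map to the same learned signal.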
Then small things. A mechanism that randomly switches off some neurons during training — so the network doesn’t memorize data by heart but learns to generalize. Gradually slowing the learning pace over a thousand epochs — so at the end the model refines details instead of jumping past them. Half a unit here and there.
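The two small things have standard names: dropout and learning-rate decay. A minimal sketch of both over the thousand epochs mentioned; the rates, the cosine schedule shape, and the inverted-dropout scaling are illustrative assumptions (the text gives no hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout(activations, p=0.2, training=True):
    """Randomly switch off a fraction p of neurons during training so the
    network can't memorize; at inference everything stays on. The rescale
    keeps the expected activation the same (inverted dropout)."""
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

def learning_rate(epoch, total_epochs=1000, lr_start=1e-2, lr_end=1e-4):
    """Cosine decay from lr_start to lr_end: big steps early, tiny
    detail-refining steps at the end of the thousand epochs."""
    t = epoch / total_epochs
    return lr_end + 0.5 * (lr_start - lr_end) * (1 + np.cos(np.pi * t))

print(learning_rate(0))     # full pace at the start
print(learning_rate(1000))  # a hundredth of that at the end
```

Neither trick adds a single parameter to the model, which is what makes them affordable under an 8 GB memory ceiling.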
Final result: average model error is 16.7 units on a 1000 UPM grid — the standard grid fonts are designed on. To put that in perspective: at fourteen-point text, 16.7 units works out to about 0.23 points, roughly 0.08 millimeters on paper. Thinner than a line drawn with a pencil. The eye doesn’t catch it during normal reading.
But most kerning in running text is subtle — corrections in the plus-minus twenty unit range. Big shifts and heavy tightening are for headlines and display. In that everyday zone the model is more accurate: error drops to twelve units.
456 fonts. 70 thousand samples. A model that fits on a thumb drive and trains on a laptop.
That’s the architecture. What it means in practice — next time.