Hermiona (part two)
Sixteen point seven. That was the error after the first round. Not bad. But I knew it wasn’t the end — the model had trained on seventy thousand samples from four hundred fifty-nine fonts. Not enough. I started adding fonts. Got up to seven hundred sixteen. And that’s when it turned out I’d been throwing away eighty-two percent of the data the whole time.
The bug sat in one function. When the model analyzed a pair of letters, it measured the edge profile — where the ink ends at each height. For letters with straight sides — H, I, M, N — the profile didn’t extend beyond the bounding box. All distances came out zero or less. The function saw nothing but zeros and rejected the pair as defective. Eighty-two percent of pairs disappeared silently. The model was learning from scraps.
The fix was trivial. Instead of rejecting — write zeros. The pair is tight, that’s information, not an error. Three lines of code. Suddenly instead of seventy thousand samples I had two hundred seventy-eight thousand. Four times more.
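In sketch form, the change amounts to this. The function name, the profile format, and the minimal-rows check are illustrative, not the project's actual code:

```python
import numpy as np

def pair_features(left_profile, right_profile, min_rows=8):
    """Build one training sample from two edge profiles.

    left_profile / right_profile: distances (in font units) from the
    bounding-box edge to the ink, sampled at a fixed number of heights.
    Names and shapes are placeholders for the real extraction code.
    """
    left = np.asarray(left_profile, dtype=np.float32)
    right = np.asarray(right_profile, dtype=np.float32)

    # Old behaviour (the bug): an all-zero profile looked like "no data",
    # so the pair was silently rejected.
    # if not left.any() or not right.any():
    #     return None

    # Fixed behaviour: zeros are valid measurements. A straight-sided
    # letter (H, I, M, N) simply touches its bounding box at every height.
    if left.size < min_rows or right.size < min_rows:
        return None  # reject only genuinely malformed profiles

    return np.concatenate([left, right])
```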
While I was at it, I filtered out diacritical duplicates. Fonts contain kerning separately for T+a, T+à, T+á, T+â — but 066.KERN uses kerning classes, so class A covers all accented variants. Those pairs are noise. Eleven percent of data in the bin, but the right eleven percent.
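A minimal sketch of that filter, using Unicode decomposition to collapse accented letters onto their base forms; the pair format and function names are my placeholders:

```python
import unicodedata

def base_letter(ch: str) -> str:
    """Strip combining marks: 'à', 'á', 'â' all collapse to 'a'."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def drop_diacritic_duplicates(pairs):
    """Keep one sample per base-letter pair within one font.

    `pairs` is assumed to be an iterable of (left_char, right_char, value)
    tuples; the real extraction format may differ.
    """
    seen = set()
    for left, right, value in pairs:
        key = (base_letter(left), base_letter(right))
        if key in seen:
            continue  # T+à, T+á, T+â duplicate T+a under class kerning
        seen.add(key)
        yield left, right, value
```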
First sweep on the new data — four hundred thirty-two configurations, seventy hours on the M4. I tested everything at once: three network sizes, three dropout levels, three regularization weights, two batch sizes, with and without layer normalization, four loss functions. Full grid. Brute force.
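The grid itself is nothing clever. A sketch of what it might look like, with placeholder values; only the counts (3 × 3 × 3 × 2 × 2 × 4 = 432) come from the actual sweep:

```python
from itertools import product

# Six axes of the sweep; the concrete values are illustrative.
hidden_sizes  = [(256, 256), (512, 512), (1024, 1024)]
dropouts      = [0.1, 0.2, 0.3]
weight_decays = [1e-5, 1e-4, 1e-3]
batch_sizes   = [256, 512]
layer_norm    = [False, True]
losses        = ["mse", "mae", "huber", "log_cosh"]

grid = list(product(hidden_sizes, dropouts, weight_decays,
                    batch_sizes, layer_norm, losses))
assert len(grid) == 432

for config in grid:
    # train_and_evaluate(*config)  # placeholder for the actual training run
    pass
```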
The loss function is how the model measures its own mistakes during training. Until then I’d been using the simplest one — mean squared error. Every mistake squared, big errors hurt disproportionately. I tried an alternative: Huber loss. Up to a certain threshold it works like a square, beyond that it goes linear — doesn’t panic at large deviations. Handles noise in the data better. Gave about one unit down.
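The difference fits in a few lines. A sketch of the Huber formula in NumPy, with a placeholder threshold:

```python
import numpy as np

def huber(residual, delta=1.0):
    """Huber loss: quadratic up to `delta`, linear beyond it.

    `delta` here is just a placeholder; the sweep would pick its own value.
    """
    r = np.abs(residual)
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear)

# A 10-unit miss: squared error panics (0.5 * 10**2 = 50.0),
# Huber grows only linearly (9.5 with delta=1).
print(huber(np.array([0.5, 2.0, 10.0])))  # [0.125 1.5 9.5]
```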
Layer normalization helped too. It evens out the signal inside the network, stabilizes training. Consistent improvement across all configurations. Result after the first sweep: 15.5. Progress, but not a revolution. The revolution came from a direction I didn’t expect. Bigger networks.
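For reference, a sketch of where layer normalization sits inside one hidden block, assuming PyTorch; the exact ordering and the block structure are illustrative, not the project's model definition:

```python
import torch.nn as nn

def hidden_block(n_in, n_out, p_drop):
    """One hidden block of the kerning network (a sketch).

    LayerNorm sits between the linear layer and the activation and rescales
    each sample's activations to zero mean and unit variance, which is what
    keeps training stable across configurations.
    """
    return nn.Sequential(
        nn.Linear(n_in, n_out),
        nn.LayerNorm(n_out),
        nn.ReLU(),
        nn.Dropout(p_drop),
    )
```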
In part one I wrote that increasing the network from 56 to 97 thousand parameters made no difference. But that model was learning from seventy thousand samples. Now I had two hundred seventy-eight thousand. The proportions changed. A small network didn’t have the capacity to absorb four times more data.
I started scaling. Two layers of 512 instead of 256 — error dropped to 12.7. Two of 768 — 11.97. Two of 1024 — 11.59. With 1536 neurons in the first layer — 11.39. Each jump required more training time. A thousand epochs at the start, two thousand, three, five. An epoch is one pass through all the data — the model sees every sample, calculates the error, adjusts its weights. At five thousand epochs with a large network, a single configuration ran for six hours. One got stuck at thirty-one.
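Concretely, the widest configuration and its training loop look roughly like this, assuming PyTorch; the input size, the 1024-wide second layer, the optimizer, and the learning rate are my assumptions, not the winning settings:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

n_features = 128  # placeholder; the real feature vector size isn't stated

# 1536 neurons in the first hidden layer, as in the best run;
# the second layer width is an assumption.
model = nn.Sequential(
    nn.Linear(n_features, 1536), nn.LayerNorm(1536), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(1536, 1024), nn.LayerNorm(1024), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(1024, 1),  # one output: the kerning value in font units
)

def train(model, dataset, epochs=5000, batch_size=512, lr=1e-3):
    """One epoch = one pass over the whole dataset: forward pass,
    error, backward pass, weight update. Values are placeholders."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.HuberLoss()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
    return model
```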
The trend was clear and unsettling at the same time. A wider network gave better results. The curve wasn’t flattening. But the cost was rising — the bigger the network, the more it memorized training data instead of learning to generalize. Test error kept dropping, but the gap between training and test kept growing. That’s called overfitting. The model becomes an expert on the data it’s seen and clueless about anything new. Dropout helped — the same mechanism as in part one, randomly switching off neurons so the network can’t rely on a single path. But with large networks it’s a balancing act. Too much dropout — the model doesn’t learn. Too little — it memorizes.
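Watching that gap is straightforward. A sketch of the measurement, assuming PyTorch; a test error that keeps falling while the difference between the two numbers keeps widening is exactly the failure mode described above:

```python
import torch

def split_errors(model, train_loader, test_loader, loss_fn):
    """Mean loss on the training split and the test split (a sketch)."""
    def mean_loss(loader):
        model.eval()  # also disables dropout for a fair measurement
        total, count = 0.0, 0
        with torch.no_grad():
            for x, y in loader:
                total += loss_fn(model(x), y).item() * len(x)
                count += len(x)
        return total / count
    return mean_loss(train_loader), mean_loss(test_loader)
```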
Over two hundred hours of training across all sweeps. Six rounds of optimization. From 16.7 to 11.39. Five point three units down. Thirty-two percent error reduction. What mattered most? Data. Four times more samples was the biggest jump. Then network size — small to large, about four units total. The rest — loss function, normalization, dropout tuning — a bit here and there.
In fourteen-point text, 11.4 units is 0.16 millimeters. Thinner than a sheet of paper. The model trains on a Mac Mini M4, fits on a thumb drive, and kerns with precision that’s hard to tell apart from manual work. Those are the numbers. What’s next — conversion to Core ML and integration with 066.KERN. Hermiona is about to start working.
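The conversion itself should be short. A sketch of what it might look like with coremltools, assuming the trained PyTorch model from the sketches above; the input name and file name are placeholders:

```python
import torch
import coremltools as ct

# `model` and `n_features` are assumed to come from the training sketch above.
model.eval()
example = torch.zeros(1, n_features)
traced = torch.jit.trace(model, example)  # freeze the network as TorchScript

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="pair_features", shape=example.shape)],
    convert_to="mlprogram",
)
mlmodel.save("Hermiona.mlpackage")  # hypothetical output name
```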