The Compounding Founder

The Taste Bottleneck

AI can generate faster than you can judge. That is the actual problem.

Eduardo
Mar 02, 2026

Last Tuesday at 2am, I opened my app and scrolled through the screens I had shipped that month. Thirty-six of them. AI wrote most of the code. Every screen worked. Nothing crashed. The buttons went where buttons go.

I felt nothing.

Not bad. Not good. Just — flat. The spacing was fine. The typography was fine. The colors were fine. Fine is the word you use when you cannot explain what is wrong but you know something is.

I closed the laptop and went to bed. In the morning I opened it again. The screens were still fine.


The conventional argument about AI and quality goes like this: AI is a tool, and like any tool, the output reflects the operator. If the design is mediocre, that is a skill gap. Learn design. Get better at prompting. Ship with more intention.

The problem is that this argument assumes the bottleneck is generation. It is not.

I can generate a screen in four minutes. A good prompt, a decent component library, and Claude will produce something functional before my coffee cools. That part works. That part has never been easier.

The bottleneck is judgment. Specifically, my judgment, applied thirty-six times, at a quality that does not degrade when I am tired, distracted, or moving fast.

I am one person. The app does not care.


Here is what I mean by judgment.

When I look at a screen and something feels off, I am actually running five evaluations simultaneously. I had never separated them before. It took sitting down and forcing myself to articulate what my eyes were doing.

Spatial rigor. Is every margin, every padding, every gap a multiple of four? The 4px grid is not a preference. It is what makes things feel aligned without the user knowing why. I ran a grep across all thirty-six screens. Eleven had spacing values like 15, 18, or 22 — values that are not on the grid. Eleven of thirty-six. Almost a third were subtly misaligned, and I had not caught it because each screen, in isolation, looked close enough.
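
If you want the shape of that check, here is a minimal sketch. It assumes a SwiftUI codebase and a naive regex over padding, spacing, and frame literals — the source root, the pattern, and the call sites it expects are illustrative, not the exact grep I ran.

```swift
import Foundation

// Illustrative off-grid spacing scan. Assumes SwiftUI-style call sites like
// .padding(15), spacing: 18, .frame(width: 22); real codebases will vary.
let gridStep = 4
let pattern = /(padding|spacing|frame|offset)\s*[:(][^)\n]*?\b(\d+)\b/

let enumerator = FileManager.default.enumerator(atPath: "Sources")! // hypothetical source root

for case let path as String in enumerator where path.hasSuffix(".swift") {
    guard let source = try? String(contentsOfFile: "Sources/" + path, encoding: .utf8) else { continue }
    for (lineNumber, line) in source.components(separatedBy: .newlines).enumerated() {
        for match in line.matches(of: pattern) {
            let value = Int(match.output.2) ?? 0
            if value % gridStep != 0 {
                // Flag values like 15, 18, or 22 that are not on the 4px grid.
                print("\(path):\(lineNumber + 1): off-grid value \(value)")
            }
        }
    }
}
```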

Typographic hierarchy. One screen, one clear entry point. You squint at the screen and you should immediately know where to look first. Fifteen of the twenty-six style learning screens had no display-weight typography at all. Every heading was the same weight as the body text. The visual hierarchy was not wrong — it was absent.

Restraint. For every border, every badge, every icon — does removing it break the task? I started counting visual elements. Most screens had three to five things that existed for decoration, not function. Removing them did not hurt comprehension. It helped.

Motion. This one was the worst. Nineteen of twenty-six screens had no entrance animation at all. Static content appearing from nothing, like a light switch instead of a door opening. The screens that did have motion used hardcoded durations and cubic curves instead of spring physics. Nothing felt alive.
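
For reference, this is roughly what the missing pattern looks like — a sketch, assuming SwiftUI, not my actual components. Content enters with a spring and a small per-item stagger instead of switching on from nothing; the response, damping, and stagger numbers are illustrative.

```swift
import SwiftUI

// Spring physics instead of a hardcoded duration and cubic curve,
// plus a small per-item stagger on entrance.
struct StaggeredEntrance: View {
    let items = ["Tops", "Bottoms", "Shoes"]   // placeholder content
    @State private var hasAppeared = false

    var body: some View {
        VStack(alignment: .leading, spacing: 12) {
            ForEach(Array(items.enumerated()), id: \.offset) { index, item in
                Text(item)
                    .opacity(hasAppeared ? 1 : 0)
                    .offset(y: hasAppeared ? 0 : 16)
                    .animation(
                        .spring(response: 0.45, dampingFraction: 0.85)
                            .delay(Double(index) * 0.05),   // the stagger
                        value: hasAppeared
                    )
            }
        }
        .onAppear { hasAppeared = true }   // a door opening, not a light switch
    }
}
```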

Information ergonomics. Can I reach the primary action with my thumb? Are there fewer than seven interactive elements in the initial viewport? Basic stuff. The kind of thing you check once and then forget to check on screen number twenty-three.

Five evaluations. Each measurable. Each something I could check by reading the code, not by looking at a screenshot. Each something I had been doing unconsciously — and therefore inconsistently.


So I wrote the rubric down.

Not as guidelines. Not as principles. As a scoring system. Each dimension, one to five. Weighted: spatial and typographic at 25% each, restraint at 20%, motion and ergonomics at 15% each. Calibrated against my own gut.

The calibration was the hard part.

Left alone, AI will score its own work 4.5 out of 5 every time. “The spacing is clean and the typography is well-structured.” It says this the way a student who did not read the book still tries to sound confident in the essay. Technically defensible. Actually meaningless.

I defined what a 3 looks like: it works, nothing is broken, it feels generic. That is the expected starting point for AI-generated UI. Not a failure. The baseline. Most of what ships in this industry is a 3.

I defined what a 5 looks like: this could be a screenshot in Apple’s Human Interface Guidelines. I have not seen my code produce a 5 yet. That is fine. It means the scale is honest.

And I added one rule: when in doubt between two scores, pick the lower one. It is easier to discover a screen is better than expected than to ship something that needed more work.
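
The scoring shell itself is small. A sketch — the dimension names, weights, and calibration anchors are the ones described above; the types, the rounding, and the example values are illustrative.

```swift
/// One screen's rubric scores, each on a 1...5 scale.
/// Calibration anchors: 3 = works but generic (the AI baseline),
/// 5 = could be a screenshot in Apple's Human Interface Guidelines.
struct ScreenScore {
    var spatial: Double        // weight 0.25
    var typographic: Double    // weight 0.25
    var restraint: Double      // weight 0.20
    var motion: Double         // weight 0.15
    var ergonomics: Double     // weight 0.15

    var weighted: Double {
        0.25 * spatial +
        0.25 * typographic +
        0.20 * restraint +
        0.15 * motion +
        0.15 * ergonomics
    }
}

// The tie-break rule lives outside the math: a dimension that could be a 3
// or a 4 gets recorded as the 3 before it ever reaches this struct.
let example = ScreenScore(spatial: 3, typographic: 3, restraint: 3, motion: 2, ergonomics: 3)
print(example.weighted)  // 2.85 — generic everywhere, dead on motion, below the baseline of 3
```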


Then I pointed the system at my own app.

The wardrobe feature — ten screens, the part of the app I had spent the most time on — averaged 3.4 out of 5. Two screens were ready to ship. Seven needed polish. One was at the threshold.

The style learning feature — twenty-six screens, the part I thought was in good shape — also averaged 3.4. But the distribution was worse. Three screens shipping. Seventeen needing polish. Six needing complete reworks. Six.

The weakest dimension across both features: motion. Average score of 2.0 to 2.5. Almost every screen was missing entrance animations, haptic feedback, and stagger patterns. The code was architecturally sound and visually dead.

I sat with that for a while.


The conventional response here would be to frame this as a success story. I found the problems. I built the system. Now I can fix them efficiently. Quality will compound.

That is partly true. The system does compound. Run the audit after fixing six screens, and the new ones score higher because the patterns are in place. Run it again a month later, and regressions get caught immediately. The bar does not lower. It ratchets.

But I want to be honest about what this actually reveals.

I shipped thirty-six screens. I was proud of them. An automated system I built in a weekend told me that six of them need to be rewritten, twenty-four need meaningful polish, and the best ones — the ones I had spent the most time caring about — are still only 3.8 out of 5. Not bad. Not great. Close enough to good that I had convinced myself they were good.

The system did not tell me anything I did not already know. It told me things I had been choosing not to see.


The pattern underneath this is not about design.

Any judgment you make repeatedly can be decomposed into measurable dimensions. You can score those dimensions. You can calibrate the scores against your own taste. You can teach a system to apply them. And then the system will tell you the truth even when you would rather not hear it.

The uncomfortable question is whether the rubric captures taste or merely approximates it. Whether a system that scores 3.4 out of 5 is telling me the truth or telling me a useful lie.

I think the answer is both. The rubric does not replace judgment. It makes judgment auditable. It turns “this feels off” into “the spatial score on screen fourteen is 2.5 because of three off-grid padding values on lines 47, 82, and 114.” That is not taste. That is measurement.

But measurement is what scales. Taste does not.


Design was the first domain. API quality is next. Content quality after that. Each one follows the same five-step process, and each one starts with the same hard question: what am I actually evaluating when I say “this is good”?

The process works because it is a forcing function, not a scoring matrix. It forces you to articulate what you have been doing unconsciously. The articulation changes how you see.

The Philosophy Behind the Skills

Apple does not publish a rubric. But if you study their apps long enough — not the guidelines document, the actual apps — you notice they encode five things into every screen.
