The Compounding Founder

Is Your App Idea Actually Worth Building?

Eduardo — Sun, 31 May 2026 22:40:53 GMT

If you’ve got an app idea, you probably also have a quiet worry sitting next to it.

What if I build the whole thing and nobody wants it?

That worry is the smartest thing about you. Most people ignore it, build for six months, launch to silence, and decide they’re “just not a tech person.” You’re not going to do that. We’re going to answer the worry first, this week, in an afternoon.

Here’s the good news I wish someone had told me sooner: you do not need to be technical to launch a real consumer app. You need to be the person who makes good decisions and points the work in the right direction. The typing can be done by AI agents now. The deciding is the job. And the first decision is the one that saves you the most pain: is this idea worth building at all?

Where this sits in the whole journey

This series walks the entire path from “I have an idea” to “my app is live and earning,” in six steps:

1. Validate - is the idea worth building? (you are here)

2. Design - make something you’re proud of, without a designer

3. Build - ship it with AI agents directing the code

4. Ship - get it on the App Store

5. Money - get people to actually pay

6. Growth - your first 1,000 users

One step per post. No skipping. Today is step one, and step one is the cheapest insurance you’ll ever buy.

Validation is not asking your friends

Here’s the trap. You describe your idea to friends, they say “oh that’s cool, I’d use that,” and you feel great. That feedback is worthless. Your friends are being kind. Kind isn’t the same as a download, and a download isn’t the same as a payment.

Real validation answers four questions with evidence, not opinions:

1. Are people already looking for this? (demand exists)

2. Are the existing apps letting them down? (there’s a gap)

3. Is the gap something people complain about out loud? (real pain)

4. Would someone pay to make the pain go away? (real money)

You can check all four from your couch, on your phone, for free, before you build anything. Let me show you how the first one works, then I’ll hand you the full step-by-step playbook for all four.

The free version: read the one-star reviews

Open the App Store. Search the words a real person would use to describe your idea, not your clever product name, the *problem*. If your idea is “an app that helps new dog owners with training,” search “puppy training.”

Look at the top few apps. Two things tell you almost everything:

- How many ratings they have. Thousands of ratings means real demand exists. Zero apps, or only tiny ones, can mean no demand or a wide-open gap. You’ll learn to tell the difference in the playbook.

- What the one-star and two-star reviews say. This is the gold. People who are angry enough to leave a bad review will tell you, in their own words, exactly what’s missing. “Great idea but it crashes.” “Why is everything locked behind a subscription?” “Too complicated, I just wanted X.” Every complaint is a feature you could build, in the customer’s own language.

Spend twenty minutes reading bad reviews of the apps closest to your idea. Write down every complaint you see more than once. That list is the beginning of your product, and it’s written in words real buyers actually use.

That’s the overview, and it’s genuinely enough to start today. But “read some reviews” isn’t a plan. The full step-by-step playbook below turns these four questions into an afternoon you can actually follow — the exact searches, a free way to gauge how big the market is, the name-collision check that stops you naming your app something that’s already taken, and a one-page worksheet that gives you a clear go / no-go answer by tonight.

Your business isn't queryable (and your agents know it)

Eduardo — Sun, 12 Apr 2026 02:18:21 GMT

During my masters program it wasn’t uncommon to spend a week evaluating a single framework. TOGAF for enterprise architecture. ITIL for service management. COSO for internal controls. RACI for decision rights. Lean for process improvement. BIZBOK for the gaps between all of them.

They gave me vocabulary. Mental models. A sense of where the edges were. Useful, the way a map of a city is useful before you’ve walked any of the streets.

Then I started running businesses with agents. And I realized the map was drawn for pedestrians.

Every one of those frameworks has the same hidden assumption baked in: humans are doing the work, humans have bounded attention, and the framework exists to coordinate humans across that bound. RACI tells you who’s responsible because, without it, two people will assume the other one did it. ITIL prescribes change advisory boards because humans make bad decisions under pressure and a slow, structured process catches errors. TOGAF produces architecture diagrams because the people who need to act on a decision aren’t in the room when it’s made.

Take away the human bandwidth constraint and most of what a business framework does becomes ceremony.

Agents don’t forget what was decided last Tuesday, as long as it’s in their context window. They can query 47 tables before you finish reading the meeting agenda, but they’ll hallucinate a number if the table doesn’t exist. They have no persistent memory across sessions. Their intelligence is jagged: superhuman at some tasks, confidently wrong about basic things.

The honest framing:

Humans are bad at attention and memory across time. Agents are bad at persistent learning and consistency across sessions. A framework that only compensates for human bandwidth limits is the wrong framework. But agents still need structure — just structure that’s queryable rather than coordinative.

The ceremony isn’t what makes the frameworks valuable. The data model underneath is.

A few weeks before I started drafting this post, I read Jack Dorsey and Roelof Botha’s piece for Sequoia, published in March 2026. The core argument landed hard: hierarchy isn’t an ideology, it’s an information routing protocol. It was invented to move decisions through organizations that couldn’t move information fast enough. Middle management existed to pre-compute decisions at scale. Every reorg since, from matrix structures to Holacracy to Spotify squads, rearranged the routing without replacing what was doing the routing.

The copilot framing gets this wrong too: putting AI inside the existing structure improves throughput without touching the architecture.

Their answer is a company world model: atomic capabilities, an intelligence layer that composes them, and a small number of human roles. No product manager decides what to build. The intelligence layer recognizes the moment and composes what’s needed.

But the piece stops at the architectural sketch. How does the world model stay accurate over time? What happens when declared state drifts from observed state? Who’s accountable when the model is wrong? There’s no answer for drift detection, no audit trail, no enforcement mechanism. The vision is clear. The operating manual doesn’t exist.

So I built one.

The four things left standing

So I started asking a different question about every framework: strip away the meetings, the diagrams, the certifications, and the coordination rituals. What’s left?

Almost always the same four things.

Tables. Every framework secretly assumes certain entities and relationships: services, incidents, and change requests (ITIL); decisions and responsible parties (RACI); controls and the risks they’re meant to catch (COSO). Those assumptions are tables. The entities are rows. The relationships are foreign keys.

Constraints. Every framework has things it forbids. ITIL says a change can’t deploy without an owner. COSO says a control can’t be owned by the person it governs. RACI says every decision needs exactly one accountable party. Those prohibitions are NOT NULL columns, CHECK constraints, and pre-action validators. They’re not policies written in a manual. They’re schema. And when an agent tries to violate one, the database rejects the write before it lands. The enforcement is mechanical, not supervisory.

Loops. Every framework has a measure-decide-act cycle. Lean’s Plan-Do-Check-Act. ITIL’s incident-to-resolution flow. OKR check-ins. DMAIC’s Define-Measure-Analyze-Improve-Control. Each of those cycles is a scheduled agent reading from an event queue, detecting something, and writing a response. The meeting where humans do this is the slow version of the loop. The database trigger and the scheduled job are the fast version.

Queries. Every framework teaches humans to ask certain questions. “Who owns this decision?” “What’s the current status of this incident?” “Is this OKR on track?” Those questions are saved views. They’re SQL. They don’t need to live in a meeting agenda. They run at 7:15 AM and the results are waiting for you when you open your laptop.

The test I use: if a framework’s value can’t be expressed as tables, constraints, scheduled jobs, and saved queries, it’s ceremony. It’s not something an agent can use.

Capability modeling: the first translation

The framework that survives this filter most cleanly is capability modeling, and it’s worth understanding why.

A capability is a stable description of something a business can do. “Publish weekly content” is a capability. How it gets delivered changes: the skill changes, the operator changes, the tools change. The capability stays the same because it describes the function, not the implementation. In my model, a capability is what you get when a skill meets an operator. The skill is the how. The operator is the who. The capability is the what.

This matters because a single capability can have mixed operators: human decides the surface, agent researches, agent distributes, human audits. The capability is stable. The operator assignment shifts as trust and reliability change.

In an agent-native business, capability modeling is the spine. Every skill maps to a capability it delivers. Every OKR targets a capability. Every KPI tree measures one. Every decision affects one. When something breaks, you query which capabilities are affected. When you’re adding a new agent, you check which capabilities are already covered. The map isn’t a diagram. It’s a table with a parent_id column for hierarchy and a junction table linking skills to capabilities.

The consulting exercise becomes a schema. The PowerPoint becomes a query. And the constraint layer rejects a capability record without a verified skill link. The enforcement is structural, not a review meeting.

If you use Claude Code, Cursor, Codex, etc…., you already do this for your codebase. CLAUDE.md/AGENTS.md tells agents what the repo is, what conventions to follow, where to find things. Your business needs the same declaration: what it can do, what it’s trying to achieve, what rules govern it, and how to tell when it’s drifting. Capabilities are the spine of that declaration. Goals measure the spine. Constraints protect it.

OKRs as queries

The second framework worth examining closely is OKRs, because the agent-native version of it is almost unrecognizable from the original.

In an agent-native business, a key result is a SQL query that returns a number. Not a description of what the number measures. Not a link to a spreadsheet someone is updating. An actual query, stored in a column called query, that runs on a schedule and writes its output to current_value. The OKR check-in isn’t a meeting where someone reports progress. It’s a scheduled job that ran at 7:15 AM and already wrote the result. Progress is always real. It’s never self-reported.

“Reach 500 free subscribers by July 15th” becomes a row in an objectives table and a row in a key_results table where the query column is something like SELECT COUNT(*) FROM subscribers WHERE tier = 'free'. A scheduled job runs it every morning. The current value is always accurate. If the key result falls behind, an anomaly detector writes a gap. The gap routes to the operator. No meeting required.

You’re removing the gap between declared progress and observed progress. In a human OKR system, those two things drift apart because humans self-report. Here, the only progress that exists is the progress the query can measure. That constraint forces you to be specific about what you’re trying to accomplish.

“Grow the audience” can’t be a key result because you can’t write a query for it. “500 free subscribers by July 15th” can. The discipline the agent requires is more honest than the discipline the quarterly meeting requires.

The three states

Three states matter in any agent-native business, and conflating them is the most common mistake. Most agent systems only track one. That gap is what breaks things.

Two weeks after we declared The Compounding Founder’s publishing cadence as weekly, we had two gaps in the observed record. The config file still said weekly. Nobody flagged it. That’s what having only declared state looks like.

Reference state is what should exist according to some standard or template. If you’ve adopted a framework, the reference state is what that framework says a healthy implementation looks like. If you’re using an industry benchmark for open rates, the reference state is the benchmark.

Declared state is what you claim exists. Your config file. Your business declaration. What you told the agent at startup.

Observed state is what evidence shows actually happened. Metrics. Agent logs. Decisions with recorded outcomes. Briefings with timestamps.

Drift between these three states is the signal. When your declared state says you publish weekly and your observed state shows two gaps in the last six weeks, that’s drift. When your reference state says every decision needs an accountable owner and your observed state shows twelve decisions with no owner logged, that’s drift. The job of the retrospective loop isn’t to run the business. It’s to find the drift and surface it.

Most agent systems only track declared state. They have the config, the prompt, the system message, and they treat it as the truth. No observed state to compare against. No reference state to know what good looks like. When things go wrong, there’s no structural way to find out. A human checks in and notices. Or doesn’t notice for a while.

Building all three states, and treating the drift between them as first-class information, is what separates a business that agents can run from one they can only assist.

Build the spine first

This doesn’t require building everything at once. I’ve walked you through two translations in this post: capability modeling and OKRs as queries. The full set I work with is seven (also DMAIC retrospective loops, KPI trees, decision rights, three lines of defense, and ITIL services). They aren’t a mandate. They’re a menu. Capability modeling is the only one that’s mandatory from the start, because it’s the spine everything else hangs from. The others come in as the business scales and the need becomes real.

A business in its first month needs capability modeling and maybe a simple OKR query for its one key metric. It doesn’t need a full ITIL incident management system or formal three-lines-of-defense segregation of duties. Those come later, when agents are making enough consequential decisions that oversight needs to be structural.

The dependency order matters too. OKRs and KPI trees both depend on having capability modeling in place, because both need something stable to target and measure. Decision rights depend on having a clear capability map because rights are scoped to capabilities. Three lines of defense depend on decision rights because you can’t segregate duties you haven’t defined. DMAIC depends on all of it, because the retrospective loop needs something to learn from.

Spine first. Diagnostics. Controls. Then the learning loop.

Task-capable, business-blind

The point of all of this isn’t to rebuild your business in Postgres. A Postgres schema and an MCP server are not what I’m recommending here.

The point is to ask the question that most businesses using agents haven’t asked: what would your business look like if it were a queryable, auditable entity instead of a collection of documents and meetings and conversations that live partly in the heads of the human operators?

Right now, your agents are task-capable and business-blind. They can write the email, draft the post, run the analysis. They can’t tell you which decision from six weeks ago is causing the problem today. They can’t surface the drift between what you said you’d do and what actually happened. They can’t find the gap before it becomes a crisis. Not because they lack intelligence. Because there’s nothing to query.

Start with the spine. Write down what your business can do in stable terms. Not the tools you use. Not the agents you run. Not the current process. The capabilities. That’s one column in a spreadsheet or five lines in a markdown file. Everything in this post hangs from that list.

Dorsey and Botha described the destination: a company organized as an intelligence rather than a hierarchy. This framework is an attempt at the operating manual.

In the next post, I’ll show you what happens when you actually build this. A Postgres brain, an MCP server, and the morning an operator agent found six problems at 7:15 AM without being asked. That’s not a debugging story. That’s a control plane working.

You Did Not Delegate to Your Agent. You Abandoned It.

Eduardo — Tue, 07 Apr 2026 03:45:14 GMT

Priya runs a fourteen-person design studio in Toronto. Last Tuesday at 9:14 PM she forwarded an email to her operations manager with one line: “Can you handle this?” The email contained a client request for a revised project timeline, a budget question, and an implicit complaint about response times. Her operations manager responded the next morning. He answered the budget question, missed the timeline request, and did not register the complaint at all.

Priya did not have a trust problem. She had a specification problem. She gave a competent person a task with no structure, no context about what mattered, and no definition of what “handled” meant. He did what any reasonable person would do: he answered the easiest question and moved on.

This is exactly what most founders do with AI agents. They call it delegation. It is actually abandonment.

The Conventional Argument

The conversation around AI delegation follows a predictable script. On one side: agents are not ready. They hallucinate. They lack judgment. You cannot hand them anything consequential because they will get it wrong and you will not catch it in time. On the other side: agents are already better than junior employees. Just set them loose and review the output.

Both sides share the same assumption. Delegation is primarily a question of capability. Can the agent do the work? Is it smart enough? Reliable enough?

This framing is not wrong. It is incomplete.

The Dismantle

Think about the last time you delegated something important to a human. Not a task — a function. Accounting. Customer support. Content production. Social media.

You did not hand them the job and walk away. You gave them:

a definition of what good looks like
the metrics you would use to know if it was working
the boundaries of what they could decide without asking
the context they would need to make those decisions well
a cadence for checking in

When the function failed, you almost never fired the person first. You fixed the specification first. You realized you had not told them what you actually meant by “handle this.”

AI agents fail for the same reason. Not because they lack capability. Because the founder never built the operating surface the agent needs to do the job.

The first five posts of The Compounding Founder were written by hand between February and March 2026. They followed no documented system — the voice existed in my head, the structure was improvised per post, and the distribution happened when I remembered to open Substack The posts were good. They were also unrepeatable. If I got sick for a week, the newsletter stopped. If I changed one thing about the writing process, there was no artifact to update — just a vague sense that the last post felt different from the one before it.

In the first week of April, I built the specification layer: a voice system with seven traits and thirty banned words, a content structure template, a distribution config, an asset strategy, and a scorecard. Six artifacts where there had been zero. Then I handed the content function to an agent.

The specification took about four hours to build. The agent produced a full draft, seven X posts, two LinkedIn posts, seven Substack Notes, and a paid deliverable template in a single run. The draft scored 88% on the voice system — above the 85% threshold for human review, below the 90% target. The gap was data density: not enough hard numbers. That is a specific, improvable failure. Not “the agent is not ready.” The specification was not complete enough.

The Core Idea

Delegation without specification is not delegation. It is abandonment with a feedback delay.

1. The Specification Stack

Every function you delegate, to a human or an agent, requires four layers of specification. Miss one and the output degrades predictably.

Layer 1: Identity. What is this function? What business does it serve? Who is the audience? This is not a mission statement. It is the context that prevents drift. Without it, the agent optimizes locally, each individual output looks fine, but the collection has no coherence.

Layer 2: Standards. What does good look like? Not “high quality” — that means nothing. Specific, auditable criteria. For a newsletter, this means: voice rules, banned vocabulary, structural templates, sentence-level patterns. For customer support, this means: response time targets, escalation rules, tone guidelines, resolution definitions.

Layer 3: Boundaries. What can the agent decide? What requires a human? This is where most founders fail. They either restrict everything (making the agent useless) or restrict nothing (making the agent dangerous). The answer is a decision map: these categories are autonomous, these require review, these are human-only.

Layer 4: Feedback. How does the agent know if it is working? Not “the human checks sometimes.” A structured evaluation loop where output is scored against the standards from Layer 2, and the scores route back into the next cycle.

Miss Layer 1 and you get drift. Miss Layer 2 and you get mediocrity. Miss Layer 3 and you get either paralysis or chaos. Miss Layer 4 and you get all three, slowly, without noticing.

2. Why “Just Review the Output” Fails

The most common delegation pattern I see: give the agent the work, review everything it produces, fix what is wrong. This feels responsible. It is actually the worst of both worlds.

You have not saved time because you are reviewing everything. You have not improved quality because your corrections do not feed back into the system. And you have trained yourself to distrust the agent, which means you review more carefully over time, not less. The cost of the agent stays constant. The cost of your attention increases.

Review-everything is not delegation. It is supervision cosplaying as leverage.

The alternative is to build the evaluation into the system. Define the rubric. Score the output before it reaches you. Route only the failures to human review.

This is how any review-heavy function improves. The first operator run of The Compounding Founder included a manual evaluation step: the agent scanned its own draft for banned words, checked structural patterns, and scored voice trait compliance one by one. That evaluation took the full output of the run. By the time the taxonomist agent within the agent control plane saw the failure report, a programmatic content scorer had been scoped, built, and activated, designed to do in seconds what the manual scan did in paragraphs. The review bottleneck did not disappear because the agent got smarter. It disappeared because the evaluation became an artifact, not a task.

3. The Artifact Test

Here is a practical test for whether you have actually delegated a function or just abandoned it.

Open the folder, document, or system where the agent does its work. Count the artifacts. Not the outputs — the inputs. The things you created to tell the agent how to operate.

If the count is zero or one, you have not delegated. You have assigned a task and hoped.

A properly delegated function should have:

an identity document (what this function is, who it serves)
a scorecard (what metrics matter, what the targets are)
a standards artifact (what good looks like, specifically)
a structure artifact (the default patterns and templates)
a boundary map (what the agent decides, what it escalates)
a feedback loop (how output is evaluated and how evaluations route back)

Six artifacts minimum. Most founders I talk to have zero. They have a prompt and a prayer.

The number of artifacts is not bureaucracy. It is the surface area of your specification. Less surface area means more drift, more review, more frustration, and eventually more abandonment disguised as a conclusion that “agents are not ready.”

4. Delegation as Architecture

The uncomfortable implication is that delegation is not a soft skill. It is architecture.

When you delegate accounting to a bookkeeper, you give them a chart of accounts, a set of categorization rules, access to the bank feeds, and a monthly review cadence. You do not give them your bank login and say “handle the money.” The chart of accounts is an artifact. The categorization rules are an artifact. The review cadence is a feedback loop. You built an operating surface without thinking about it because accounting has had centuries to develop the specification layer.

AI agent delegation is new enough that the specification layer does not exist yet for most functions. So founders skip it and blame the agent when the output is wrong.

Building the specification layer feels slow. It feels like overhead. It feels like the opposite of the speed advantage agents are supposed to provide. But the math is simple. Two hours building a voice system saves twenty hours of post-by-post corrections over the next quarter. One hour defining a decision boundary map prevents the three-day crisis when the agent makes the wrong call on something you never told it was sensitive.

The founders who get the most from agents are not the ones with the best prompts. They are the ones who spent the most time on the specification layer before the agent started working.

5. What This Looks Like in Practice

I’ve been publishing The Compounding Founder for just a couple months, five posts, all written by hand, each one taking a full day from outline to distribution. In the first week of April, I stopped writing and started specifying. I built the voice system, the content structure, the distribution config, the asset strategy, and the scorecard. Six artifacts. Four hours of specification work.

Then I handed the content function to an agent.

This post is the output of that handoff. An agent read the specification layer, selected the topic based on gap analysis of the first five posts, wrote the draft, produced the share assets, scored itself against the voice system, and filed a failure report listing ten things it could not do.

That sentence should create discomfort. It creates discomfort here too. But the discomfort is the point. The question is not whether an agent can write a newsletter post. The question is whether the specification layer is precise enough that the output meets the standard and whether the evaluation loop is honest enough to catch it when it does not.

The first draft scored 88% on voice compliance. The gap was data density, not enough hard numbers that hit the chest. That is not “the agent failed.” That is “the specification needs one more layer.” The system improves by improving the specification, not by hoping for a better model.

If this changed how you think about delegation, reply with your current setup — I will tell you which artifacts are missing. If you are not subscribed, this is what we do here every week: one operating principle, one framework, one deliverable that makes the idea real.

The insight is free. The specification stack I use to delegate an entire business function to an agent, including the six artifacts, the decision boundary template, and the evaluation rubric, is below.

Your AI Agent Is Building 5-Star Experiences. That's the Problem.

Eduardo — Sat, 28 Mar 2026 12:10:21 GMT

María opens Figma at 6 AM in her apartment in Medellín. She has four hours before her contract shift starts. She is building a wardrobe app, solo, no team, no funding. She types a prompt into her AI coding agent: “Create a screen where users can browse their closet.” The agent returns a grid of thumbnails. Rounded corners. A search bar at the top. A filter icon. It works. It is also identical to every other closet screen on every other app that has ever existed.

María stares at it. She knows something is wrong but cannot name it. The screen does what she asked. It does not do what her users need.

She is building a 5-star experience. Functional. Forgettable. Dead on arrival.

The conventional argument

The current AI-agent discourse goes like this: agents make you faster. You describe what you want. The agent builds it. Ship it. Move on. Repeat. The thesis is volume, more features, more screens, more iterations per hour than any solo founder could produce manually.

The problem is that speed toward mediocrity is still mediocrity. It just arrives sooner.

Every AI coding agent on the market, Cursor, Claude Code, Codex, Copilot, defaults to the same pattern: functional, generic, forgettable. You ask for a dashboard, you get a dashboard. You ask for an onboarding flow, you get an onboarding flow. The shapes are correct. The structure is sound. And no user will ever text a friend about it.

This is the 5-star trap. It looks like progress. It compiles. It deploys. And it compounds into a product that feels like every other product.

Where the framework comes from

In 2015, Brian Chesky sat down with Reid Hoffman for what would become one of the most cited episodes of the Masters of Scale podcast. Hoffman asked Chesky how Airbnb thinks about experience design. Chesky described an exercise he runs with his team.

Start at one star. The experience is broken, the host does not show up, the door is locked, nobody answers the phone. Move to three stars. You get in. The place is fine. Nothing memorable. Five stars. The place is clean, the bed is comfortable, there is a welcome note. This is where most companies stop.

Then Chesky pushed further. What is a six-star experience? The host leaves a handwritten note with restaurant recommendations tailored to your taste. Seven stars? A welcome basket with your favorite snacks, how did they know? Eight stars? Elon Musk greets you at the airport. Nine? A parade. Ten? You arrive and the Beatles are there to play a concert for you.

Eleven stars is absurd. It is deliberately impossible. And that is the point.

The exercise is not about building the 11-star version. It is about thinking from the 11 and working backwards to what is actually shippable. Because when you start at 5 and try to push to 6, you add features. When you start at 11 and pull back to 8, you rethink the entire experience.

The gap between those two approaches is the gap between a product people use and a product people remember.

The emotional job

Every interface has two jobs. The functional job is obvious, browse clothes, schedule a meeting, track expenses. The emotional job is invisible and more important.

The functional job of a wardrobe app is: show me my clothes. The emotional job is: make me feel like I know what I am doing when I get dressed.

The functional job of a dashboard is: display metrics. The emotional job is: make me feel like I understand my business right now, in this moment, without digging.

AI agents only see the functional job. They cannot see the emotional one. They do not know that the user arriving at your closet screen feels overwhelmed, not curious. They do not know that the person opening your dashboard at 7 AM is anxious, not analytical.

This is why default AI output lands at 5 stars. The machine solves the functional job perfectly and ignores the emotional job entirely.

What 11-star thinking does to your agent workflow

Here is what changes when you make this framework the foundation of every interaction with your AI agents.

You stop prompting for features. You start prompting for feelings.

Instead of: “Build a screen where users browse their closet.”

You write: “This interface transforms ‘I have no idea what to wear’ into ‘I look incredible and I barely tried’ by making outfit selection feel like a stylist handed you three perfect options. The user should feel relief within 3 seconds of landing on this screen.”

The output from that prompt is categorically different. Not because the agent suddenly has taste, it does not. But because you gave it the emotional specification that determines every layout decision, every copy choice, every animation.

You map the trajectory before you write a single prompt.

Before touching the agent, write out the star levels for the specific experience you are building:

1 star: The screen loads. The user sees a blank grid. No guidance. They close the app.
3 stars: Clothes appear in a grid. No organization. The user scrolls endlessly. They find something by accident.
5 stars: Clothes are categorized. There is a search bar. Filters work. The user finds what they want in 30 seconds. Forgettable.
7 stars: The app suggests three outfits based on the weather and the user’s calendar. The user smiles. Small surprise.
9 stars: The app knows the user has a job interview tomorrow. It surfaces the outfit that got compliments last time it was worn. It accounts for the weather, the dress code, and what is clean.
11 stars: The user opens the app and a stylist has already laid out tomorrow’s outfit on their bed. Not a screen. The physical clothes. On the actual bed.

Obviously you cannot ship 11. But you can ship 9, an experience that anticipates rather than reacts. And you would never have designed that 9-star version by starting at 5 and adding features.

You give your agent a design conviction, not a style preference.

Most people prompt their agents with aesthetic requests: “Make it clean and modern.” This produces the visual equivalent of elevator music. Pleasant. Unmemorable. Interchangeable.

A design conviction is a single sentence that forces every decision:

“Dense information, zero noise — Bloomberg terminal meets Notion.”

“A hand-written letter from your smartest friend.”

“The UI equivalent of a perfectly tailored black suit.”

When you give an agent a conviction instead of a style, the output has a point of view. The typography choice follows from the conviction. The spacing follows. The color palette follows. The micro-copy follows. Everything coheres because everything serves the same sentence.

The 5-star tells

Here is how you know your AI agent is building at 5 stars. Every one of these is a default pattern that agents reach for when you do not intervene:

Generic hero sections with “Welcome to [Product]” and a gradient background
Centered spinners with no context. The user stares at a circle and wonders if the app is broken.
“Something went wrong” as an error message. No specificity. No fix. No warmth.
Gray placeholder rectangles as empty states. The user sees nothing and learns nothing.
Purple-to-blue gradients. The unofficial color scheme of AI-generated interfaces. You have seen it a thousand times. So has everyone else.
Uniform card grids with no visual hierarchy. Everything is the same size, the same weight, the same importance. Nothing leads. Nothing recedes.

These are not bugs. They are the natural output of an agent that was asked for a functional solution and delivered one. The problem is not the agent. The problem is the spec.

Making it operational

This is not a philosophy you apply once. It is a filter you run on everything.

Every time you open a chat with your coding agent, ask three questions before you type:

What does the user feel right now, before they touch this?
What should they feel after?
What star level am I about to accept?

If you cannot answer those, you are not ready to prompt. You will accept whatever the agent gives you, and it will give you 5 stars.

Here is the operating rule I use: never ship the first output. The first output is always the 5-star version. It is the default. Treat it as a sketch, not a deliverable. Push the agent to 7 by naming the emotional gap. Push to 8 by adding the design conviction. Get close to 9 by specifying the sad paths, what happens when the screen is empty, when the data fails to load, when the user has 1,000 items instead of 10.

The agents are fast enough that three rounds of refinement still takes less time than one round of manual coding. Use that speed to raise the floor, not to ship the default.

Why this compounds

A 5-star experience retains users at the baseline rate. They use it when they need it. They forget about it when they do not.

An 8-star experience creates a moment the user did not expect. They text someone. They leave a review that says something specific instead of “works great.” They open the app when they do not strictly need to, because it made them feel something.

That difference, between functional and felt, is the difference between a product that grows linearly through acquisition spend and a product that compounds through word of mouth. The 11-star framework is not about perfection. It is about building the habit of imagining the impossible version first, then working backwards to the version that is shippable and still makes someone pause.

Every week you ship 5-star output, you are training yourself, and your agents, to accept the default. Every week you push to 8, you are compounding craft. And craft, unlike features, does not depreciate.

Brian Chesky did not invent a design methodology. He invented a question: What would the impossible version look like? The answer to that question, trimmed back to reality, is always better than the answer to “What is the functional version?”

María in Medellín is still staring at her closet screen. The grid loads. The thumbnails are fine. She deletes the prompt and types a new one. This time she starts with how the user should feel.

The grid looks different now.

The principle is free. The skill file that makes your AI agent build at 8 stars by default, the full prompt engineering framework, the star-mapping template, and the design system it enforces, is below.

For paid subscribers: the full 11-Star UX/UI Experience Builder skill, the implementation checklist, and the design system architecture below.

Your harness is too small

Eduardo — Fri, 13 Mar 2026 17:21:37 GMT

There is a scene in Succession where Matteson asks Roman what he is worst at. Roman deflects — “who, me?” — and Matteson answers the question himself. “Success doesn’t really interest me anymore,” he says. “Analysis plus capital plus execution — anyone can do that.” Then he leans in: “Failure — that’s a secret. Just as much failure as possible, as fast as possible. Burn that shit out. That’s interesting.” He says it like a man who has already solved the easy problem and moved on.

LLMs just made success cheaper. Given contextualized data, a model can research a problem, analyze it, execute on it, and iterate until the outcome occurs. The formula Matteson found trivial is now available to anyone who can state a goal clearly. Which means Matteson was right about what becomes interesting next. The failure mode. The thing that still burns. The constraint that did not disappear when the inputs got cheaper. That constraint is context rot.

Context rot is not a metaphor. Humans set ambitious goals. They know the best practices. And then time passes, attention moves, the context decays. Not because they stopped caring. Because context does not persist reliably in humans. We forget. We drift. We apply yesterday’s understanding to today’s problem and call it experience.

Here is what I find both ironic and almost serendipitous: we built our artificial intelligence with a version of the same flaw. Not identical. Human context rot is organizational, cultural, the slow drift of priorities over months. AI context decay is technical, session-scoped, attention fading across a long prompt. Different failure modes. But the same symptom: the model loses the thread. We replicated our own cognitive limitation in the tools we made to augment us. The interesting part is that the fix for the AI problem (injecting context deliberately at the start of every session) also forces you to confront the human problem. You cannot write program.md without deciding what you are actually trying to do. The act of building the injection mechanism disciplines the goal.

But unlike humans, you can deterministically re-inject context into an AI. Every session. Mechanically. Karpathy’s AutoResearch project, released this month, makes the mechanism concrete. There is a file called program.md. It contains the goal, the constraints, the research direction. The agent reads it at the start of every run. You do not touch the Python. You program the program.md. The agent re-grounds itself to your intent on every single loop, not because it remembers, but because the context is injected fresh each time. That is the fix. Not memory. Injection. The Socratic question: are we still moving toward what we said we were trying to do? It gets asked mechanically, before anything executes. Humans cannot do this reliably. A well-designed system can. That is what the harness needs to be built around.

The funnel is not the only option

What changed is not that context became more available. Organizations have always produced enormous amounts of context – metrics, feedback, goals, decisions, postmortems. What changed is the cost of routing it to the moment it matters.

That cost is now near zero.

An agent can pull last week’s conversion data and attach it to a feature spec before a line of code is written. It can surface the three support tickets that describe this exact problem in user language, not product language. It can check whether the goal that originally motivated this work still exists – or got quietly deprioritized while everyone was looking at their tickets.

This is not a vision. This is plumbing. Every piece of that context already exists somewhere in the stack. The question is whether your harness routes it, or lets it stay siloed.

Most harnesses let it stay siloed.

The harness that only covers the development lifecycle is not a small problem. It is a fundamental mismatch between what is now possible and what most builders have built.

How organizations will run

The organizations that figure this out first will not look like they have better tools. They will look like they think differently – like they make fewer decisions based on stale context, like they ship things that actually move the numbers they were trying to move, like they notice problems before the metrics force them to.

What they will actually have is a harness that extends beyond code.

I use the word builder deliberately. Not developer. Because the person making product decisions in a two-person startup, or a solo founder shipping an app between Slack messages, or a non-technical operator wiring together AI agents – that person is building. Their harness should not assume they write code. Their harness should assume they have goals.

That distinction changes everything about what the harness needs to know.

A developer needs autocomplete and debugging.

A builder needs that – plus: Is this the right thing to build? What happens to the number if we do not ship it? Who asked for it, in what words, and how recently?

The sidecar that watches everything

I am building this for myself with Clueless Clothing as a guinea pig. Not from theory. From watching how much context disappears between the moment I understand what the business needs and the moment I sit down to build something.

The architecture I landed on is not just a wider harness. It is a harness with a governing layer above it – an agent that observes what is actually happening, analyzes it against what I said I was trying to do, and surfaces improvements for me to review and act on.

The observe-analyze-action loop is not new. Every retrospective, every postmortem, every weekly metrics review is some version of it. What is new is that it does not have to be periodic anymore. It does not have to wait for the meeting.

But here is the part that took me a while to see clearly.

The sidecar does not improve processes. It improves the processes that move your specific goal.

That sounds like a minor distinction. It is not. If your goal is to grow to $50k MRR, the sidecar optimizes toward revenue-moving leverage points – shipping frequency, conversion bottlenecks, activation rate. If your goal is to build the most efficient engineering team, the sidecar optimizes toward a completely different set of processes. The same codebase, the same activity, the same signals – and a different set of improvements surfaces.

The goal is the lens. Without it, you are just generating observations. But this is also where the thesis gets uncomfortable. The sidecar only fights context rot on the goal you gave it. If the goal is wrong, if you are faithfully optimizing toward $50k MRR when the actual opportunity is a different product entirely, the system will not tell you. It will optimize diligently toward the wrong destination. AutoResearch is instructive here, but not in the way I originally thought. The agents own train.py entirely. They modify it, iterate it, discard what does not work. But program.md, the file that contains the research direction, the goal, the constraints, stays human. Deliberately. The design enforces it. Which means the question of whether you are optimizing toward the right thing never gets delegated. The next version of the sidecar is not one that updates your goal automatically. It is one that makes the cost of ignoring a wrong goal high enough that you actually revisit it.

The IDE moment, one level up

This has happened before in a smaller domain.

Before the IDE existed, developers used separate tools for editing, compiling, and debugging. Each tool did its job. The cost was not in any single tool – it was in the constant switching, the context that leaked at every seam, the mental overhead of holding state across systems that did not talk to each other.

Then someone merged them. The productivity gain was not from better features. It was from eliminating the seams.

We are at that moment again, one level up.

The separate systems now are not editor, compiler, and debugger. They are product management, development, analytics, marketing, support, and documentation. Each one has its own tool, its own data model, its own workflow. The seams between them leak context every day – the same context that used to require a meeting to route, that now could be routed automatically.

This is what an Integrated Business Environment actually is. Not a better IDE. Not an IDE with more integrations bolted on. A different assumption about what the harness is for – not just executing work, but continuously orienting work toward the goal, based on what is actually happening.

The IDE helped developers write better code.

The IBE helps builders make better decisions about what to build next.

The architecture

I am building this in layers. Not because I love architecture diagrams, but because the tooling landscape is moving too fast to couple everything together. The best AI coding tool right now will not be the best in four months. If your business rules are tangled with your adapter logic, every tool switch means starting over.

Five layers:

Core: the rules that should still make sense in a different repo, a different agent, a different stack entirely. Think of it as the constitution. It does not care what language you write in.

Adapter: environment-specific translations. This is the disposable layer. When the tool changes, rewrite the adapter. Leave everything else alone.

Stack: technology-specific defaults. React Native looks different from Rails. This layer knows that.

Overlay: the layer most harnesses do not have. Product rules. Business context. Metrics. Goals. User segments. The actual reason any of this work is happening.

Split: not a layer but a discipline. Any file that mixes concerns gets split over time. This is what keeps the architecture honest as it evolves.

The overlay layer is where the IBE lives. It is also where I spend most of my time now. That is the right place to spend it.

I am open-sourcing this structure. Not because it is finished, but because the pattern is clear enough to share before it is polished.

One thing worth saying plainly before you go build it: this is not a software idea.

Look at what AutoResearch actually does to understand why. The primitive Karpathy’s agents operate on is not software. It is experiments. Train for five minutes. Check if val_bpb improved. Keep or discard. Repeat. The code is almost incidental. The agent rewrites it, yes, but that is not the work. The work is the experiment loop oriented toward a measurable goal. Software is just what happens to be the medium. A restaurant optimizing table turns has a different medium. A consulting firm improving proposal win rates has a different medium. The loop is the same. The medium changes. The harness needs to know the difference.

The IDE framing is a useful on-ramp. It is not the destination. The destination is any operation, any goal, any set of processes, with a governing layer that fights context rot and keeps everything oriented toward what you said you were trying to do.

There is one more place where the harness is too small. Not the tooling. Not the lifecycle. The goal itself.

I have been running $50k MRR as a target because it feels concrete. Measurable. Safe to write in a file. But $50k MRR is not a goal. It is a metric that might serve a goal, if I chose the right one. The actual goal, the one worth writing in program.md, is something like: enough margin to be present. To travel when it matters. To not make decisions from fear at 11pm. Or read more. Or be more present with the people who matter. The harness does not require the goal to be economic. It requires the goal to be real. But if you give it the metric instead of the goal, you will hit the number and wonder why the thing you were actually trying to do did not happen. The sidecar optimizes toward what you said. Not toward what you meant. That gap is yours to close, before you write anything down.

Matteson was right. Success got cheap. The failure mode is what stayed interesting. Context rot is the failure mode. The only question is whether you build a harness that fights it, or one that lets it win quietly while you keep shipping.

Next: The Athenians did not need productivity software. They had something better: a society that treated economic output as infrastructure, not identity. The citizen class was freed from subsistence not so they could optimize, but so they could think, argue, govern, and live. We are building the infrastructure that could do the same thing. We just have not figured out what we are supposed to do once it works. The Greeks knew. Next piece is about what they got right, and what it means for how we should be spending the hours the agents are buying us.

What is actually in your program.md right now, and is it a goal, or a metric?

The Setup That Changes How You Work With Claude

Eduardo — Sat, 07 Mar 2026 18:47:20 GMT

I did not set out to automate my social media content.

I opened a Claude cowork session in January with one goal written down: grow paid subscribers for the Clueless Clothing app. That was it. No prompt about social media. No instruction to connect any accounts. I just told Claude what I was trying to build and started working backwards from there.

What happened over the next two sessions is what I want to show you.

Claude started asking me questions. Not the questions I expected — not “what do you want to post?” — but questions about where my subscribers were coming from, what signals I had, what data it would need to actually help me move that number. It asked me to provide it access to sources I had not thought to connect. It mapped a growth loop I had not drawn yet. By the end of session two, it had proposed automating my social distribution as a subscriber acquisition channel — and built the workflow to do it.

I did not tell it to do that. It got there because I started with the goal.

That is goal-first setup. And it is the one thing no Claude guide I found told me to do first.

The Conventional Argument

Every guide I found ran the same sequence. Pick a model. Write better prompts. Use Projects for memory. Maybe add a custom system prompt. The advice is not wrong. It is just aimed at the wrong question.

Those guides assume you want Claude to write faster. What I wanted was for Claude to think with me — to hold context across sessions, to remember what I had already decided, to get closer over time instead of resetting.

That is a different tool. Same software. Different setup. One step separates them, and I did not find it in any guide.

The Dismantle

Here is what was happening before I rebuilt my setup: I was optimizing the prompt before I had clarity on the problem.

Each session, I opened a blank thread and started from scratch. Claude had no idea what we had worked through the session before. No idea what the constraint was. No idea what a bad recommendation looked like in my context. I was treating a thinking partner like a search engine with better grammar.

The result was clean output that did not accumulate. Every session produced something. Nothing compounded. I was doing the connective work manually — in my head, in a doc somewhere, in the friction between sessions — and calling it a workflow.

I stopped when I realized I was spending more time re-orienting Claude than I was spending on the actual work.

The Core Idea

Start with the goal, not the prompt. That one inversion turns Claude from a session tool into a compounding asset.

What I Changed

1. I wrote the goal before I opened Claude

Before I rebuilt my setup, I wrote one paragraph offline — in Notes, not in Claude — that answered three things: what am I trying to accomplish in the next 60 to 90 days, what does done actually look like, and what do I know that Claude does not.

For the subscriber growth goal: I needed paid subscribers for the Clueless Clothing app. Done meant a number — a specific acquisition target with a date attached. What Claude did not know: my list size, my current conversion rate, where my traffic was coming from, and the fact that social distribution was an untapped channel I had not systematized.

That paragraph took eight minutes to write. It changed what Claude asked me in the first session. It stopped generating output and started asking what it needed to actually help.

2. I built Projects around outcomes, not topics

My original Projects were organized like a filing cabinet. Writing. Strategy. Research. Useful for finding things. Useless for compounding work.

I rebuilt them around outcomes. App subscriber growth. Product roadmap for Q2. Distribution systems. Every conversation inside an outcome-scoped Project is already aimed at something that ends. Claude does not have to guess what matters. The sessions inherit direction automatically.

The subscriber growth Project is where Claude eventually proposed the social automation — because every session inside it was pointed at the same number. The proposal did not come from a prompt. It came from context accumulating until Claude had enough to make a connection I had not made myself.

3. I wrote instructions that named constraints, not just context

The old version of my Project instruction: I’m the founder of Clueless Clothing, a mobile app. Help me grow.

The new version: I’m the solo founder of Clueless Clothing, a mobile app. My goal is growing paid subscribers. My constraint is time — I have no team, so any system I build has to run without me. If a recommendation requires manual execution more than once a week, flag it. I want compounding distribution, not one-off campaigns.

The second version changed what Claude recommended. It stopped suggesting tactics that required daily attention. It started building systems. The social automation came out of that constraint — Claude understood that a workflow I had to run by hand every day was not a solution.

4. I loaded the artifact, not the summary

Before the first serious session on a Project, I paste in the actual document — the analytics export, the campaign brief, the current conversion data. Not a summary. The source.

Summaries introduce my interpretation. My interpretation introduces drift. Three sessions in, I am working from Claude’s model of my model of the situation. Loading the source document removes one layer of inference. The output quality difference is not subtle.

5. I close the loop once a week

Friday afternoon, five minutes in the active Project. One line: what changed this week. One line: whether the goal shifted. One line: what Claude got wrong that I corrected.

This is not journaling. It is calibration. Claude does not update its model of the project automatically. I have to push the new information in. Five minutes a week prevents months of accumulated drift. Without it, the Project becomes a transcript of what you used to be working on.

What Happened After

By the second week of running goal-first setup, the sessions were noticeably shorter. Not because Claude had gotten better. Because the Project already held the context. I was not re-anchoring. I was working.

The social automation workflow Claude proposed is now running. It pulls content signals, formats posts, and queues distribution — without me prompting it to do any of that on a given day. I did not ask for a social media automation tool. I asked Claude to help me grow subscribers. It got there on its own because the goal was loaded, the constraints were named, and the context had been accumulating for two weeks.

That is the thing no guide I found described. The Project does not just remember. It reasons forward. The work you do in session one changes what session eight looks like. That is different from a good prompt. That is a different tool.

The principle is free. The system I use to enforce it — the goal-framing worksheet, the Project instruction template, the Clueless Clothing worked example, and the /retro skill that keeps the whole thing from decaying — is below.

The Taste Bottleneck

Eduardo — Mon, 02 Mar 2026 04:35:28 GMT

Last Tuesday at 2am, I opened my app and scrolled through the screens I had shipped that month. Thirty-six of them. AI wrote most of the code. Every screen worked. Nothing crashed. The buttons went where buttons go.

I felt nothing.

Not bad. Not good. Just — flat. The spacing was fine. The typography was fine. The colors were fine. Fine is the word you use when you cannot explain what is wrong but you know something is.

I closed the laptop and went to bed. In the morning I opened it again. The screens were still fine.

The conventional argument about AI and quality goes like this: AI is a tool, and like any tool, the output reflects the operator. If the design is mediocre, that is a skill gap. Learn design. Get better at prompting. Ship with more intention.

The problem is that this argument assumes the bottleneck is generation. It is not.

I can generate a screen in four minutes. A good prompt, a decent component library, and Claude will produce something functional before my coffee cools. That part works. That part has never been easier.

The bottleneck is judgment. Specifically, my judgment, applied thirty-six times, at a quality that does not degrade when I am tired, distracted, or moving fast.

I am one person. The app does not care.

Here is what I mean by judgment.

When I look at a screen and something feels off, I am actually running five evaluations simultaneously. I had never separated them before. It took sitting down and forcing myself to articulate what my eyes were doing.

Spatial rigor. Is every margin, every padding, every gap a multiple of four? The 4px grid is not a preference. It is what makes things feel aligned without the user knowing why. I ran a grep across all thirty-six screens. Eleven had spacing values like 15, 18, or 22 — values that are not on the grid. Eleven of thirty-six. Almost a third were subtly misaligned, and I had not caught it because each screen, in isolation, looked close enough.

Typographic hierarchy. One screen, one clear entry point. You squint at the screen and you should immediately know where to look first. Fifteen of the twenty-six style learning screens had no display-weight typography at all. Every heading was the same weight as the body text. The visual hierarchy was not wrong — it was absent.

Restraint. For every border, every badge, every icon — does removing it break the task? I started counting visual elements. Most screens had three to five things that existed for decoration, not function. Removing them did not hurt comprehension. It helped.

Motion. This one was the worst. Nineteen of twenty-six screens had no entrance animation at all. Static content appearing from nothing, like a light switch instead of a door opening. The screens that did have motion used hardcoded durations and cubic curves instead of spring physics. Nothing felt alive.

Information ergonomics. Can I reach the primary action with my thumb? Are there fewer than seven interactive elements in the initial viewport? Basic stuff. The kind of thing you check once and then forget to check on screen number twenty-three.

Five evaluations. Each measurable. Each something I could check by reading the code, not by looking at a screenshot. Each something I had been doing unconsciously — and therefore inconsistently.

So I wrote the rubric down.

Not as guidelines. Not as principles. As a scoring system. Each dimension, one to five. Weighted: spatial and typographic at 25% each, restraint at 20%, motion and ergonomics at 15% each. Calibrated against my own gut.

The calibration was the hard part.

Left alone, AI will score its own work 4.5 out of 5 every time. “The spacing is clean and the typography is well-structured.” It says this the way a student who did not read the book still tries to sound confident in the essay. Technically defensible. Actually meaningless.

I defined what a 3 looks like: it works, nothing is broken, it feels generic. That is the expected starting point for AI-generated UI. Not a failure. The baseline. Most of what ships in this industry is a 3.

I defined what a 5 looks like: this could be a screenshot in Apple’s Human Interface Guidelines. I have not seen my code produce a 5 yet. That is fine. It means the scale is honest.

And I added one rule: when in doubt between two scores, pick the lower one. It is easier to discover a screen is better than expected than to ship something that needed more work.

Then I pointed the system at my own app.

The wardrobe feature — ten screens, the part of the app I had spent the most time on — averaged 3.4 out of 5. Two screens were ready to ship. Seven needed polish. One was at the threshold.

The style learning feature — twenty-six screens, the part I thought was in good shape — also averaged 3.4. But the distribution was worse. Three screens shipping. Seventeen needing polish. Six needing complete reworks. Six.

The weakest dimension across both features: motion. Average score of 2.0 to 2.5. Almost every screen was missing entrance animations, haptic feedback, and stagger patterns. The code was architecturally sound and visually dead.

I sat with that for a while.

The conventional response here would be to frame this as a success story. I found the problems. I built the system. Now I can fix them efficiently. Quality will compound.

That is partly true. The system does compound. Run the audit after fixing six screens, and the new ones score higher because the patterns are in place. Run it again a month later, and regressions get caught immediately. The bar does not lower. It ratchets.

But I want to be honest about what this actually reveals.

I shipped thirty-six screens. I was proud of them. An automated system I built in a weekend told me that six of them need to be rewritten, seventeen need meaningful polish, and the best ones — the ones I had spent the most time caring about — are still only 3.8 out of 5. Not bad. Not great. Close enough to good that I had convinced myself they were good.

The system did not tell me anything I did not already know. It told me things I had been choosing not to see.

The pattern underneath this is not about design.

Any judgment you make repeatedly can be decomposed into measurable dimensions. You can score those dimensions. You can calibrate the scores against your own taste. You can teach a system to apply them. And then the system will tell you the truth even when you would rather not hear it.

The uncomfortable question is whether the rubric captures taste or merely approximates it. Whether a system that scores 3.4 out of 5 is telling me the truth or telling me a useful lie.

I think the answer is both. The rubric does not replace judgment. It makes judgment auditable. It turns “this feels off” into “the spatial score on screen fourteen is 2.5 because of three off-grid padding values on lines 47, 82, and 114.” That is not taste. That is measurement.

But measurement is what scales. Taste does not.

Design was the first domain. API quality is next. Content quality after that. Each one follows the same five-step process, and each one starts with the same hard question: what am I actually evaluating when I say “this is good”?

The process works because it is a forcing function, not a scoring matrix. It forces you to articulate what you have been doing unconsciously. The articulation changes how you see.

The Philosophy Behind the Skills

Apple does not publish a rubric. But if you study their apps long enough — not the guidelines document, the actual apps — you notice they encode five things into every screen.

Options Are Cheap. Conviction Is Rare.

Eduardo — Thu, 26 Feb 2026 20:41:47 GMT

Marco has twelve browser tabs open.

He has been building his app for fourteen months. Shipped it in October. Four hundred users. He has been thinking about the paywall for three weeks, which means he has been not thinking about the paywall, which means he has been producing tabs.

Soft paywall. Hard paywall. Feature gating. Usage gating. Seven-day trial. Fourteen-day trial. Freemium with limits. Freemium without. The $9.99 tier. The $14.99 tier. The “most popular” badge on the middle option.

He has a spreadsheet. It has conversion rate benchmarks. It has twelve rows.

He is not stuck because he lacks information.

He is stuck because he has too much of it.

This is the new problem. Not the blank page. The full one.

AI can generate options. It cannot generate conviction.

Building vs Crafting

Building is making something that works.

Crafting is making something that feels inevitable to the right person.

Building outputs features.

Crafting outputs belief.

And belief is what turns:

a product into a habit
a landing page into a conversion
a tool into “this feels like me”

When building was expensive, scarcity forced decisions. You could not afford twelve options. You picked one and lived with it.

Now the cost of “plausible” has collapsed.

And the discipline of choosing has not kept up.

Taste (Defined Without the Vibes)

Taste has a reputation problem. It gets treated like a personality trait — something you either have or you do not.

But taste is not a mood.

It is not an aesthetic.

It is not “I like this.”

Taste is a set of skills that let you navigate abundance without drowning in it.

Here is a definition that holds up under pressure:

Taste is the ability to consistently choose what matters.

It shows up as:

Signal detection: noticing what is real and what is noise
Coherence: decisions that reinforce a point of view
Editing: removing “almost right” things that dilute the whole
Timing: knowing when “good enough” is actually wrong
Empathy: understanding what the user is trying to become

A punchier frame:

Taste is compression. Direction is selection. Empathy is calibration.

Taste is compression

You take a messy reality and compress it into something simple without being simplistic.

You are not adding clarity. You are removing confusion.

Direction is selection

Direction is the ability to say: “We are doing this, not that.”

Not because you are stubborn.

Because you have a point of view, and you are willing to be misunderstood by the wrong audience to be unforgettable to the right one.

Empathy is calibration

Empathy is not “being nice.”

It is the skill of mapping what people desire, what they fear, and what they will actually do on a Tuesday morning when they are tired and their alarm just went off.

Empathy lets you calibrate craft to the human on the other side.

Why This Matters Now

AI does not just speed you up.

It multiplies the number of plausible paths you could take.

That abundance sounds like freedom.

But it creates a quiet failure mode:

You keep moving.

You keep generating.

You keep shipping.

And you stop committing.

Options become a substitute for decisions.

That is why crafting gets harder.

Not because we lost tools.

Because we gained too many.

Where This Gets Painful: Paywalls

Paywalls are where “options vs conviction” stops being abstract.

The conventional wisdom says: run tests, track conversion, optimize the flow. The data will tell you what works.

This is not wrong. It is incomplete.

Because monetization is not just a pricing tactic.

It is product philosophy.

It teaches users what the product is.

It shapes how safe they feel exploring it.

It decides whether the relationship starts with trust or pressure.

And in the long run, trust compounds harder than conversion rate.

I have run this calculation myself. I was building Clueless Clothing, an AI wardrobe app — a product where the user shows up every morning already tired, already making dozens of small decisions before they even open the app. I had the same spreadsheet Marco has. Every paywall option was technically defensible. None of them felt obviously right. The question that eventually cut through the tabs was not “which one converts best?”

It was: What kind of relationship am I building?

A trust-first answer tends to sound like:

“We will not trick you.”
“We will not punish curiosity.”
“We will not create regret.”

Because the best monetization is not extracting value.

It is aligning value.

It is the user thinking: “I’m paying because this helps me. And it feels fair.”

That feeling cannot be generated by more data.

It requires conviction.

So how do you actually build it? I have been using a five-step framework I call the Conviction Loop for every decision that matters, features, onboarding, positioning, paywalls. It is not a scoring matrix. It is a forcing function. And it comes with a worksheet you can copy into any decision you are staring at right now.

For paid subscribers: the full Conviction Loop framework, the copy-paste worksheet, and a worked example using a real paywall decision below.

Coming soon

Eduardo — Thu, 26 Feb 2026 14:34:44 GMT

Most founder content tells you what worked.

This one tells you what it actually feels like while it is happening, not after the outcome made it look obvious.

I am Eduardo. I am building Clueless Clothing, an AI wardrobe app that helps people love the clothes they already own. Solo. No team. Fourteen months in. Still going.

The Compounding Founder is where I think out loud about the decisions that do not have clean answers.

Not the retrospective. The middle.

What it feels like to stare at twelve paywall options and know that data will not tell you which one is yours. What it means to build craft into a product when AI can generate every plausible alternative in four seconds. What it costs to commit to something when the cost of hedging has never been lower.

Every post either includes a tool, a skill file, or a framework you can actually use, or it does not get published. I do not write about building. I write from inside it.

Paid subscribers get the working files: including every SKILL.md, build artifacts, prompts, etc… If I am using it then you get it also.

Free subscribers get the essay.

The calculation I run every morning is simple: if you are going to build something, you should be able to decide something. That is what this is about.

→ Subscribe if you are in the middle of it too.

Subscribe now