A few months ago I started leading a team to explore AI and how it can improve the experience and productivity of developers at my company. I held weekly meetings, tried experiments with my team, and after a few weeks I had a problem: I couldn't say for sure if or how the AI was making us more productive.
The engineers were giving conflicting signals. Some were excited. Others said the AI felt unpredictable: it would nail something obvious, then fail on anything that touched our codebase conventions, and they would spend more time fixing the output than they would have spent writing the code themselves. We wanted to show management that the investment was worth it, but we didn't have a good metric.
What everyone else is measuring
You've probably seen headlines like "At Company X, all our code is written by AI" or "AI made our engineers 50% more productive." They rarely explain what that means or how they measured it.
Tool vendors measure lines of code generated and acceptance rate. An engineer can accept every suggestion and then discard all of them, so these numbers say almost nothing about the quality of the actual output. A productivity percentage sounds better, but it is hard to measure and varies between developers, and even for the same person on similar tasks. It is not reproducible, and it does not tell you where to invest to improve.
A better question
Instead of measuring speed, I started asking something more concrete: over the lifecycle of a feature, in how many areas can AI reliably get the job done?
Building a feature involves many different areas: understanding requirements, system design, front-end components, back-end logic, testing, and implementation. For each one, the question is simple: is this covered?
- Covered: A workflow exists, validated against actual patterns in your codebase, and it has been used successfully by at least two people on the team. The developer reviews the result, makes minor adjustments and submits it.
- Partial: AI helps but cannot perform the task alone. Something is documented, but the workflow is incomplete or only one person has validated it. Useful, but closer to pair programming than delegation.
- Uncovered: AI operates blindly, relying on general training knowledge with no grounding in your specific codebase; it may or may not read your code thoroughly before starting. Most of the time, the developer ends up directing it constantly, like explaining a task to an intern instead of just doing it themselves.
With this framework, new areas start as Partial: one person documents and tests a workflow. When a second person can use it without making modifications, it becomes Covered. The criteria can be adjusted and, over time, observability can be added to make promotions more data-driven.
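The three statuses and the promotion rule can be written down as a small data model. The following is only a sketch under assumptions: the names `Area`, `Status`, `MIN_VALIDATORS`, `status`, and `coverage_report`, as well as the example areas and developer names, are hypothetical and not part of any real tool. It treats "validated" as "used successfully without modifying the workflow" and names the lowest tier `UNCOVERED`.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    COVERED = "covered"
    PARTIAL = "partial"
    UNCOVERED = "uncovered"


# Assumption: the "at least two people" threshold, kept adjustable.
MIN_VALIDATORS = 2


@dataclass
class Area:
    name: str
    workflow_documented: bool = False
    # People who used the workflow successfully without modifying it.
    validated_by: set[str] = field(default_factory=set)


def status(area: Area) -> Status:
    """Classify one area of the feature lifecycle."""
    if not area.workflow_documented:
        return Status.UNCOVERED
    if len(area.validated_by) >= MIN_VALIDATORS:
        return Status.COVERED
    return Status.PARTIAL


def coverage_report(areas: list[Area]) -> dict[Status, list[str]]:
    """Group area names by status, preserving input order."""
    report: dict[Status, list[str]] = {s: [] for s in Status}
    for area in areas:
        report[status(area)].append(area.name)
    return report


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    areas = [
        Area("forms", workflow_documented=True, validated_by={"dev_a", "dev_b"}),
        Area("testing", workflow_documented=True, validated_by={"dev_a"}),
        Area("system design"),
    ]
    for s, names in coverage_report(areas).items():
        print(f"{s.value}: {', '.join(names) or '-'}")
```

Keeping the threshold in one constant reflects the point that the criteria can be adjusted; swapping the `validated_by` set for usage data would be one way to make promotions more data-driven.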
Coverage is infrastructure, not intuition
The reason this metric holds up where productivity percentages don't is that coverage describes your team's system, not a feeling about the overall capability of the AI.
When I say forms are covered on my team, I mean there is a specific workflow that clarifies requirements, delegates to a subagent that follows the exact patterns of our codebase, and then runs a second subagent that validates the result against a checklist and fixes errors. Two developers with no particular prompting experience can each create a form and get consistently good results, not because AI is magically good with forms, but because the workflow is documented and the standards are codified.
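The shape of that workflow (clarify, generate, then review against a checklist and fix) can be sketched as plain function composition. This is an illustration under assumptions, not the team's actual implementation: `run_workflow`, `failed_items`, `Checklist`, and the `max_fix_rounds` limit are hypothetical names, and the clarify, generate, and fix steps are passed in as ordinary callables standing in for whatever agent tooling is really used.

```python
from typing import Callable

# Assumption: a checklist maps an item name to a predicate over the draft.
Checklist = dict[str, Callable[[str], bool]]


def failed_items(draft: str, checklist: Checklist) -> list[str]:
    """Return the names of checklist items the draft does not satisfy."""
    return [name for name, passes in checklist.items() if not passes(draft)]


def run_workflow(
    request: str,
    clarify: Callable[[str], str],
    generate: Callable[[str], str],
    fix: Callable[[str, list[str]], str],
    checklist: Checklist,
    max_fix_rounds: int = 2,
) -> tuple[str, list[str]]:
    """Run clarify -> generate -> review/fix and return (draft, open failures)."""
    spec = clarify(request)    # step 1: pin down the requirements
    draft = generate(spec)     # step 2: generation subagent
    failures = failed_items(draft, checklist)
    rounds = 0
    # Step 3: review subagent validates and fixes, with a bounded retry count.
    while failures and rounds < max_fix_rounds:
        draft = fix(draft, failures)
        failures = failed_items(draft, checklist)
        rounds += 1
    return draft, failures


if __name__ == "__main__":
    # Stub steps, for illustration only.
    draft, failures = run_workflow(
        request="  signup form  ",
        clarify=str.strip,
        generate=lambda spec: f"<form>{spec}</form>",
        fix=lambda d, _failed: d.replace("<form>", "<form><label>Email</label>"),
        checklist={"has label": lambda d: "<label" in d},
    )
    print(draft, failures)
```

Whatever failures remain after the last round are returned rather than hidden, which matches the idea that the developer still reviews the result.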
If an area is Partial or Uncovered today, the path forward is clear: document the patterns, create the workflow, and get teammates to validate it through actual use. There is no need to wait for smarter models.
Three areas of our codebase
Forms (covered). A dedicated skill, a subagent trained on our patterns, and a review subagent that detects and fixes problems. Several people have used it on production forms. The developer reviews the result.