Why Telemetry Can’t Reveal AI’s True Impact on Developer Productivity

What Works Instead

Abstract

  • Companies are adopting AI tools to assist software development — but evaluating their impact is challenging.
  • The default approach is telemetry: tracking IDE activity, code churn, AI suggestions, PRs, etc.
  • But telemetry, even when richly instrumented, captures physical interactions, not the cognitive effort that defines developer productivity.
  • A landmark 2025 study by METR revealed a deep disconnect: developers felt more productive with AI, but actual time-to-delivery increased — and telemetry couldn’t explain why.
  • To measure what matters, we need a Knowledge-Centric Perspective — focused not on physical behavior, but on invisible knowledge discovery.

1. Introduction

As AI tools increasingly co-author software, companies attempt to measure AI's impact on developer productivity through telemetry (e.g., IDE interactions, keystrokes, commits). However, telemetry captures physical actions, not cognitive effort, and that leads to misleading conclusions. For example, a developer might spend hours thinking through a design or resolving ambiguity, yet no telemetry event can capture this crucial value-adding work.

Even perfect telemetry would mostly watch hands (IDE clicks, keystrokes, diffs), while software development is fundamentally knowledge work: the cognitive effort required to close the gap between what a team knows and what it must know to deliver value. If you want leading indicators of predictable delivery, you must measure knowledge flow, not keyboard flow. That is exactly what KEDE (Knowledge Discovery Efficiency) and the broader Knowledge-Centric Perspective are designed to do.

2. Why telemetry — including PRs and AI usage logs — fails to capture AI’s true impact

Most companies can’t afford — financially, legally, ethically, or culturally — to run telemetry on every developer machine.

2.1 Cost & practicality

  • Massive data volume & low signal-to-noise: IDE events, cursor moves, tab switches, prompts, and diffs create terabytes of mostly meaningless noise you will then pay to store, model, and govern.
  • Tooling heterogeneity: Multiple IDEs, terminals, editors, AI tools, shells, and custom scripts make full coverage infeasible.
  • Shadow channels: A lot of real knowledge work happens in docs, whiteboards, meetings, Slack, PR discussions, and the developer’s head—outside IDE telemetry’s reach.

2.2 Legal, privacy & compliance

  • GDPR, works councils, and unionization risk (especially in the EU): pervasive behavioral monitoring is often unacceptable and typically requires negotiations, DPIAs, opt-ins, data minimization, transparent processing purposes, retention limits, and more.
  • Litigation discovery risk: Keeping granular behavioral logs can backfire—these logs become discoverable evidence.
  • Security posture: Keylogging-like telemetry can violate internal security policies and trigger employee pushback.

2.3 Organizational & cultural damage

  • Erodes trust & autonomy: Creates defensive behavior, metric gaming, and learned helplessness.
  • Optimizes for what’s easy to count: Lines of code, PR counts, time-in-IDE—all manual labor proxies, not knowledge creation.
  • False precision: High-resolution but low-validity data seduces leadership into confident but wrong conclusions.

2.4 Conceptual mismatch

  • Telemetry tracks physical activity, not knowledge transformation and cognitive effort.
  • Telemetry cannot tell why a developer is idle or how much cognitive work has been done during design or debugging.
  • Two developers with identical telemetry may produce vastly different outcomes depending on prior knowledge and problem complexity.
  • The bottleneck in modern software delivery is cognitive control and knowledge gaps, not typing speed or tab count.
  • Developers use primarily their cognitive abilities, which no telemetry can gauge.

3. Case Study: The METR Early-2025 AI Developer Study

A landmark randomized controlled trial of AI in real open-source work revealed the paradox of telemetry vs. experience. This is the clearest demonstration that telemetry cannot reveal AI’s productivity impact.

Key findings from the study:

  • Before the study, developers expected AI to save them 24% of their time.
  • After the tasks, they felt it had saved them 20%.
  • The measured completion times showed the opposite: AI had slowed them down by 19%.

This mismatch highlights two important lessons:

  1. Self-reports and perceptions are unreliable when assessing AI productivity.
  2. Telemetry—while useful—still can’t explain why developers slowed down.

3.1 What METR measured

  • 16 experienced OSS developers performed 246 real tasks across familiar repos.
  • Tasks were randomly assigned to AI-allowed or AI-disallowed conditions.
  • Before starting, developers forecasted AI would speed them up by 24%.
  • After finishing, they still believed AI sped them up by 20%.
  • In reality: using AI increased time-to-completion by 19% (see the worked example below).
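
To make the gap between perception and measurement concrete, here is a minimal worked example. The baseline task time, and the way the percentages are applied to it, are hypothetical and chosen only to mirror the study's headline figures; they are not METR's data or estimation method.

```python
# Illustrative only: a hypothetical baseline task time, scaled to mirror the
# study's headline percentages. Not METR's actual data or estimator.

baseline_minutes = 100.0                      # hypothetical time without AI
actual_with_ai = baseline_minutes * 1.19      # measured: 19% slower with AI
perceived_with_ai = baseline_minutes * 0.80   # felt: 20% faster with AI

def percent_change(before: float, after: float) -> float:
    """Percent change in completion time relative to the no-AI baseline."""
    return (after - before) / before * 100

measured = percent_change(baseline_minutes, actual_with_ai)      # +19%
perceived = percent_change(baseline_minutes, perceived_with_ai)  # -20%

print(f"Measured change:  {measured:+.0f}%")
print(f"Perceived change: {perceived:+.0f}%")
print(f"Perception gap:   {measured - perceived:.0f} percentage points")  # 39
```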

3.2 How they measured it

METR used developer self-reported total time, including:

  • Pre-PR implementation
  • Post-PR review iterations

They did collect detailed telemetry, including:

  • Screen recordings
  • Cursor IDE analytics
  • AI suggestion acceptance rates
  • Number and duration of PR reviews

Yet even with all that, the telemetry alone couldn't explain the productivity reversal; it could only show that more time was spent prompting the AI, waiting for its output, and fixing what it produced.

3.3 Why it matters

  • Even a high-integrity, high-resolution study couldn’t rely on telemetry alone.
  • The telemetry observed physical interactions, but missed the mental model shifts, false assumptions, and effort to recover from misleading suggestions.

3.4 Why AI confounds traditional metrics

AI inflates superficial metrics (more code suggestions, more lines touched). At the same time, developers spend more time:

  • Prompting AI
  • Reviewing AI output
  • Cleaning up incorrect or subpar completions

The cost is invisible unless you track knowledge transformations, not keystroke counts.

3.5 The real bottleneck: knowledge work, not manual work

While telemetry may include PR creation, review time, suggestion acceptance, and detailed interaction logs, these artifacts only show how a developer interacted. They don't reveal how much new knowledge the developer had to acquire, or whether AI helped bridge that gap faster or slower.

Developers are not assembly-line workers producing lines of code. They are cognitive agents discovering, validating, and applying knowledge under uncertainty.

Knowledge work is the act of bridging the gap between:

  • What the developer/team currently knows
  • What they must know to deliver value

That gap cannot be captured by PRs, AI prompt logs, or cursor tracking.

4. The Knowledge-Centric Perspective (KCP): Measuring what actually matters

Core premise: Treat knowledge as the fuel that powers software delivery. Measure how much knowledge is missing, how fast it’s discovered, how much is lost to rework, and how well team capability matches work complexity.

If software development is fundamentally about reducing uncertainty and discovering knowledge, then our metrics must reflect that. By applying the Knowledge-Centric perspective, we derive universally applicable metrics:

  • Predictability: Ability to meet expected knowledge discovery rates within a timeframe.
  • Knowledge Discovery Efficiency (KEDE): Balance between individual capability and work complexity.
  • Rework (Information Loss Rate): The ratio of lost information (errors, discarded work) to total perceived missing knowledge.
  • Collaboration: How efficiently teams close knowledge gaps as group size changes.
  • Cognitive Load: Number of alternative solutions considered before settling on one.
  • Happiness (Flow State): Developer engagement when capability and challenge are balanced.
  • Productivity: Ratio of delivered value to the knowledge discovered—high productivity = minimal new knowledge needed for high-value outcomes.

These metrics serve as leading indicators, tracking the quality of the software development process by focusing on its core nature: the acquisition and application of knowledge.
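
To make the metric definitions above more tangible, here is a minimal sketch of the kind of bookkeeping involved. The formulas are deliberately simplified stand-ins rather than the exact KEDE definitions, and every input value, field name, and function in the snippet is a hypothetical assumption introduced for illustration.

```python
from dataclasses import dataclass

# Simplified, hypothetical stand-ins for the Knowledge-Centric metrics above.
# All inputs are rough estimates (knowledge measured in bits); the exact KEDE
# formulas are not reproduced here.

@dataclass
class WorkItem:
    missing_knowledge_bits: float  # gap between what was known and what was needed
    discovered_bits: float         # knowledge actually discovered while delivering
    lost_bits: float               # knowledge lost to errors, rework, discarded work
    delivered_value: float         # business value of the outcome (arbitrary units)

def kede(capability_bits_per_day: float, item: WorkItem, days: float) -> float:
    """Simplified efficiency: how closely capability matched the knowledge demanded."""
    demanded_per_day = item.missing_knowledge_bits / days
    return min(capability_bits_per_day, demanded_per_day) / max(capability_bits_per_day, demanded_per_day)

def rework_ratio(item: WorkItem) -> float:
    """Information loss rate: lost knowledge relative to total perceived missing knowledge."""
    return item.lost_bits / item.missing_knowledge_bits

def productivity(item: WorkItem) -> float:
    """Delivered value per bit of newly discovered knowledge."""
    return item.delivered_value / item.discovered_bits

item = WorkItem(missing_knowledge_bits=400, discovered_bits=450, lost_bits=50, delivered_value=10)
print(f"KEDE (simplified): {kede(capability_bits_per_day=60, item=item, days=5):.2f}")  # 0.75
print(f"Rework ratio:      {rework_ratio(item):.2f}")                                   # ~0.12
print(f"Productivity:      {productivity(item):.3f} value units per bit")               # ~0.022
```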

5. How to use KEDE to measure AI’s true value

First, identify knowledge-intensive tasks (exploratory development, unfamiliar domains, debugging).

Then apply KEDE to track:

  • How much knowledge is newly discovered vs reused
  • How rework changes with or without AI
  • Whether AI reduces or increases cognitive load

Unlike telemetry, KEDE lets you compare (see the sketch after this list):

  • Different teams using AI in different ways
  • Same team across AI and non-AI periods
  • Tasks with different levels of ambiguity or novelty
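
As a sketch of the kind of before-and-after comparison meant here, the snippet below contrasts weekly knowledge-discovery-efficiency scores for the same team during a non-AI period and an AI period. The scores and the simple mean-difference summary are invented for illustration; a real analysis would use KEDE values computed from the team's actual work and a proper statistical test.

```python
from statistics import mean, stdev

# Hypothetical weekly knowledge-discovery-efficiency scores (0..1) for one team,
# invented for illustration only.
pre_ai_weeks  = [0.41, 0.38, 0.44, 0.40, 0.43, 0.39]   # before AI adoption
with_ai_weeks = [0.47, 0.35, 0.52, 0.44, 0.49, 0.41]   # after AI adoption

def summarize(label: str, scores: list[float]) -> None:
    """Print a simple summary of one period's scores."""
    print(f"{label}: mean={mean(scores):.2f}, stdev={stdev(scores):.2f}")

summarize("Without AI", pre_ai_weeks)
summarize("With AI   ", with_ai_weeks)

# Crude effect estimate: change in average efficiency between the two periods.
delta = mean(with_ai_weeks) - mean(pre_ai_weeks)
print(f"Change in mean efficiency: {delta:+.2f}")
```

The same structure works for the other comparisons: swap the two lists for two teams using AI differently, or for batches of tasks with different levels of novelty.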

6. How KEDE complements (not replaces) existing frameworks

In contrast to the Knowledge-Centric perspective, traditional Flow metrics focus on outputs and lagging indicators such as feature velocity, throughput, and lead times.

The Flow perspective treats software development as a flow of tangible entities (commits, story points, User Stories, Features, Defects, Pull Requests, Incidents) from input stages through output to the final user outcome.

Flow metrics offer a measurable, logistics-like view of software development, highlighting its efficiency and productivity as if it were a manufacturing system.

  • SPACE/DORA: Great lagging indicators of delivery performance; KEDE provides leading indicators rooted in why work is (un)predictable.
  • Value Stream Mapping: KEDE quantifies knowledge bottlenecks inside the stream.
  • Cognitive Load (Team Topologies): KEDE lets you see the bit demand vs bit capacity mismatch numerically.

We believe that, when combined with Flow Metrics, Knowledge-Centric Metrics provide a comprehensive, holistic view of the software development process, augmenting the insights that Flow Metrics already provide.

The beauty of this approach is that Knowledge-Centric and Flow Metrics don't collide; they complement each other. Each offers a distinct standpoint, and together they give a fuller picture of the software development process.

7. Anticipating objections

“Bits sound theoretical — can we trust the numbers?”
You don’t need perfect bit counts; you need consistent estimators to compare within your org over time and between teams/domains. Think of it as relative entropy tracking, not Platonic truth.
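
To illustrate why consistency matters more than absolute accuracy, here is a small sketch assuming two hypothetical estimators of missing knowledge that disagree on the absolute bit counts but each apply their own scale consistently: they produce the same team ranking and the same relative trend, which is all that within-organization comparison requires.

```python
# Two hypothetical estimators of "missing knowledge" (in bits) that disagree on
# absolute scale but are each applied consistently across teams.
true_gaps = {"Team A": 120.0, "Team B": 300.0, "Team C": 90.0}

estimator_low  = {team: bits * 0.8 for team, bits in true_gaps.items()}  # underestimates by 20%
estimator_high = {team: bits * 1.5 for team, bits in true_gaps.items()}  # overestimates by 50%

def ranking(estimates: dict[str, float]) -> list[str]:
    """Teams ordered from smallest to largest estimated knowledge gap."""
    return sorted(estimates, key=estimates.get)

print(ranking(estimator_low))   # ['Team C', 'Team A', 'Team B']
print(ranking(estimator_high))  # ['Team C', 'Team A', 'Team B'] -- same ordering

# Relative change over time is also preserved under a consistent scale factor.
before, after = 200.0, 150.0    # the true gap shrinks by 25%
for scale in (0.8, 1.5):
    change = (after * scale - before * scale) / (before * scale) * 100
    print(f"scale {scale}: measured change = {change:.0f}%")  # -25% in both cases
```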

“Telemetry is still useful, right?”
Yes, for tooling DX, CI latency, test flakiness, build times, local environment pain, and so on. Just don't mistake hand telemetry for mind telemetry.

“Can KEDE be gamed?”
Much less so than output or activity metrics, because it’s anchored to closing uncertainty, not producing volume.

8. Conclusion

If software delivery is powered by knowledge, then knowledge — not physical motion — must be what we measure. If AI transforms how knowledge is discovered and applied in development, then evaluating its productivity impact requires metrics that reflect that transformation.

Telemetry can tell you how your developers move their hands; it fails because it focuses on how fast they type, not how effectively they think. KEDE tells you how your organization learns. Together with the related Knowledge-Centric metrics, it offers a scientifically grounded, universal way to track the real engine of software development: knowledge acquisition. Use them to manage predictability, rework, collaboration, cognitive load, and flow before delivery failures show up on your dashboards.

Dimitar Bakardzhiev
