Executable Specifications
Code is the Output. Knowledge is the Job
1. Introduction
For decades we’ve chased a simple idea: write your code once and run it everywhere.
The rise of cloud platforms, containerization, and now AI-powered tooling keeps edging us closer, but we still spend too much time translating ideas into different forms - code, docs, diagrams, ticket comments - just so people and machines will understand what we really mean.
That’s because the real job isn’t typing code; it’s capturing knowledge:
- Code holds only a slice of that knowledge - the part a machine needs to execute.
- The bigger slice lives outside the codebase: problem context, user goals, trade-offs, safety rules, success criteria, and all the “why” behind every line.
A clear, written specification can pull those pieces together in one place. When a spec is treated as the primary artifact (versioned, reviewed, and even executed or tested by tools), developers can:
- Align faster: Everyone references the same document instead of guessing intent from scattered Jira tickets or commit messages.
- Generate reliably: Today’s LLMs can turn a well-written spec into code, docs, or tests on demand.
- Adapt easily: Change a rule in the spec and regenerate downstream artifacts, rather than hunting through code for every hidden assumption.
In the sections that follow we’ll see how shifting focus from code to knowledge—captured in executable specs—helps teams ship the right thing sooner and with less risk.
2. Knowledge Discovery Is the Bottleneck
We developers work very hard to solve real problems. First we talk with users to understand their challenges. Next we distill the stories we hear into clear problem statements and start to ideate possible solutions. Then we plan concrete ways to reach the goals we have set, share those plans with our teammates, translate the plans into code, and finally test and verify that the running code truly eases the user’s pain.
Across that whole loop, only a small slice—roughly ten to twenty percent—is the act of writing code. The other eighty to ninety percent is structured knowledge work: talking, understanding, distilling, ideating, planning, sharing, testing, and verifying. All of that is knowledge discovery, and knowledge discovery is the real bottleneck.
When we miss or mangle the knowledge, we write the wrong code faster and then rewrite it later.
Same story with AI
The same pattern holds when we work with large language models. A model can write perfect syntax in seconds, but it can only hit the target if we give it sharp, unambiguous knowledge to start with. If the knowledge in the prompt is vague, the answer will be vague.
What this means for you
That is why tomorrow’s most valuable programmer will not be the person who types the fastest or knows the latest framework. It will be the person who can:
- Pull hidden knowledge to the surface (ask the right questions).
- Structure that knowledge so humans and LLMs can act on it.
- Close the loop quickly—verify the result solves the real problem.
In short: mastering knowledge discovery and transfer is now the core engineering skill.
3. Vibe-Coding
When we “vibe-code,” we start by describing our intentions and the outcomes we want to see. We hand that description to a large language model, and the model dutifully produces a block of working code. The result feels magical, but after we copy the code into the repository the original prompt—the written record of our intent—usually vanishes into a chat log or terminal history.
That workflow is backwards. It is like keeping a compiled binary while deleting the source code that generated it. In traditional software practice no one would dream of version-controlling only the binary, because the binary tells us nothing about why the program exists or how it should evolve. Yet with prompts we routinely accept that loss of context.
The lesson is clear: we should preserve the prompt, or better, turn it into a concise specification that lives beside the code. The spec captures the durable knowledge—our goals, constraints, and reasoning—so future engineers and future runs of the model can regenerate, review, or adapt the code without guessing at our original intent. Persisting the spec means we keep the most valuable artifact, while the generated code becomes a reproducible by-product rather than a fragile single source of truth.
4. Why Written Specifications Beat Code
Code is only one projection of what a team knows. By the time an idea has been squeezed into a programming language, much of the original intent is gone: variable names have replaced full sentences, edge-case discussions are buried in helpers or comments, and the larger “why” has vanished entirely. In other words, code is a lossy representation of our goals, values, and trade-offs.
A written specification keeps that missing context. It states the user problem, the success criteria, the constraints, and the safety rules in plain language that anyone on the team can read—product, legal, design, or compliance. Because the knowledge is spelled out in one place, we can point multiple tools at it and ask each tool to generate what it is best at: source code, unit tests, architecture docs, release notes, blog posts, or even a podcast script. The spec becomes the single, high-fidelity input; everything else is reproducible output.
As AI models get better at turning natural-language specs into working artifacts, the scarce skill is shifting. It is no longer “Who can write the fastest Rust?” but “Who can write a spec that fully captures the intent without ambiguity?” The engineer who masters that craft—clear, exhaustive, testable specifications—will unlock far more leverage than someone who only tweaks syntax after the fact.
5. Anatomy of a Modern Specification – The OpenAI Model Spec
Last year OpenAI published its Model Spec, a “living” document that spells out, in plain language, the intentions and values the company wants each shipped model to embody. The repository—open-sourced and updated most recently in February 2025—is nothing more exotic than a folder full of Markdown files, but that choice is deliberate. Markdown keeps the spec human-readable, easy to diff, and friendly to standard version-control workflows.
Because the text is ordinary prose, every discipline can work on it directly. Product managers propose business goals, safety researchers add red-team clauses, legal reviews regulatory language, and policy teams clarify edge cases—all in the same pull-request stream. The spec becomes the single place where the entire organization records its knowledge about how the model should behave.
Each rule is tagged with a short identifier—`8ep1` requires “metric units in all responses,” for example. Those IDs act like function names in code: they let engineers and evaluators point to a precise requirement, build tests around it, and discuss it without ambiguity. When an evaluation run shows the model ignoring `8ep1`, the team doesn’t argue about intent; they file a bug against that clause.
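To make that concrete, here is a minimal sketch of how an addressable clause ID can anchor an automated check; the clause wording and the check itself are illustrative, not taken from the actual Model Spec:

```python
# Hypothetical clause registry: each ID maps to the exact spec text it names.
CLAUSES = {
    "8ep1": "Use metric units in all responses.",  # assumed wording, for illustration only
}

def check_clause_8ep1(response: str) -> bool:
    """Naive check for the behaviour clause 8ep1 names: no imperial units in the answer."""
    imperial = ("miles", "feet", "pounds", "fahrenheit")
    return not any(unit in response.lower() for unit in imperial)

# An evaluation run can now report a failure against a precise clause ID
# instead of arguing about intent.
assert check_clause_8ep1("The tower is 300 metres tall."), "violates 8ep1"
```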
In practice the Model Spec serves as a trust anchor. If the shipped model’s behavior drifts from what the spec promises, the drift is treated as a defect to be fixed, not as an acceptable variation. That tight coupling between written intent and observable behavior is what turns a plain Markdown document into the beating heart of the product.
6. From Static Doc to Executable Artifact
A written spec is great for humans, but it becomes far more powerful when the software itself can read and enforce it. OpenAI’s Deliberative Alignment pipeline shows how to turn plain Markdown rules into an automated test-and-train loop:
- Model A (evaluatee) – the model checkpoint you want to test or fine-tune.
- Prompt set – a library of tough, edge-case inputs designed to exercise the spec clauses—accuracy, safety, style, and so on.
- Model B (judge) – a usually larger or more instruction-following model that reads the original prompt, Model A’s answer, and the relevant spec text, then scores how well the answer follows each clause (pass, fail, or a numeric grade).
- Feedback loop
  - Gating: If the score dips below a threshold, the build is blocked—just like failing unit tests.
  - Training: The same scores can drive reinforcement (e.g., RLHF) to nudge Model A’s weights toward better compliance.
Because the spec feeds both evaluation and training, the knowledge inside those Markdown files flows straight into the model’s parameters. When you tighten a style rule, add a new safety requirement, or refine a domain-specific constraint, you only change the spec; the pipeline handles the rest.
In practice this means:
- Style guides (“Always use metric units”).
- Safety policies (“Refuse instructions for self-harm”).
- Domain rules (“Include OS version, RAM, and GPU status in every system report”).
…can all live as numbered clauses, each with its own automated test. The result is a spec that is no longer static documentation—it is an executable artifact that keeps models, tools, and teams aligned as they evolve.
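A minimal sketch of such an evaluate-and-gate loop, assuming generic `model_a` and `judge` callables rather than OpenAI’s actual pipeline:

```python
from typing import Callable

PASS_THRESHOLD = 0.9  # assumed gating threshold, not a published value

def evaluate(model_a: Callable[[str], str],
             judge: Callable[[str, str, str], float],
             prompts: list[str],
             clauses: dict[str, str]) -> tuple[float, list[tuple]]:
    """Score Model A's answers against every spec clause and collect reward data."""
    scores, rewards = [], []
    for prompt in prompts:
        answer = model_a(prompt)
        for clause_id, clause_text in clauses.items():
            score = judge(prompt, answer, clause_text)   # 0.0 = fail, 1.0 = pass
            scores.append(score)
            rewards.append((prompt, answer, clause_id, score))  # reusable as RLHF feedback
    return sum(scores) / len(scores), rewards

def gate(mean_score: float) -> None:
    """Block the build when compliance dips below the threshold, like a failing test suite."""
    if mean_score < PASS_THRESHOLD:
        raise SystemExit(f"Spec compliance {mean_score:.2f} is below {PASS_THRESHOLD}")
```

Tightening a style rule or adding a safety clause only changes the `clauses` dictionary; the loop and the gate stay the same.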
7. Treating Specs as Code
The specifications we write are more than documents; they are executable, testable modules with clear interfaces to the real world. Once a spec can drive automated evaluations, it behaves much like a software package you can “import” into your pipeline and ship alongside the model.
Because the spec has this code-like role, it deserves a matching toolchain:
- Linters for ambiguity – flag sentences that could be read two ways so we fix them before they confuse humans or models.
- Consistency checks – a “type-checker” for specs that catches clashes between clauses written by different teams (e.g., safety vs. product).
- Built-in unit tests – numbered rules with example prompts and expected answers that run on every CI pass.
With those guards in place, publication is blocked whenever clauses conflict or the wording is unclear. The process is identical to failing a compile or a test suite: you repair the spec first, then let the rest of the system regenerate code, documentation, and model weights from a clean source of truth.
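Here is a sketch of what the first two guards could look like; the vague-word list and the `S-n` clause format are assumptions, not an established standard:

```python
import re
import sys

# Words that usually hide an unstated threshold or time window.
VAGUE_TERMS = {"recent", "fast", "appropriate", "reasonable", "soon", "large"}
CLAUSE_RE = re.compile(r"^(S-\d+)\s+")  # assumed "S-1 ..." clause numbering

def lint_spec(markdown_text: str) -> list[str]:
    """Return warnings for ambiguous wording and duplicate clause IDs."""
    warnings, seen_ids = [], set()
    for lineno, line in enumerate(markdown_text.splitlines(), start=1):
        match = CLAUSE_RE.match(line.strip())
        if match:
            clause_id = match.group(1)
            if clause_id in seen_ids:
                warnings.append(f"line {lineno}: duplicate clause ID {clause_id}")
            seen_ids.add(clause_id)
        for term in VAGUE_TERMS:
            if re.search(rf"\b{term}\b", line, re.IGNORECASE):
                warnings.append(f"line {lineno}: '{term}' is ambiguous, state an explicit value")
    return warnings

if __name__ == "__main__":
    problems = lint_spec(open(sys.argv[1], encoding="utf-8").read())
    print("\n".join(problems))
    sys.exit(1 if problems else 0)  # a non-zero exit blocks publication in CI
```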
8. A Universal Alignment Pattern
Specifications are not just for software. They show up anywhere people need a reliable way to turn ideas into action:
- Programmers use code specs and interface contracts to align silicon.
- Product managers write product specs to align designers, engineers, and testers on what to build next.
- Lawmakers draft legal specs—laws and regulations—that align whole societies on acceptable behavior.
- Prompt engineers / authors create conversational specs that align large language models on how to respond.
In every case the pattern is the same:
- Capture the knowledge in a clear, version-controlled document.
- Share it so everyone (or every system) sees the same rules.
- Test against it to spot drift early.
- Update it as reality changes, then regenerate downstream artifacts.
Whoever owns that document is the real “programmer” for the system—whether the system is a chip, a team, a country, or an AI model. Control the spec and you control the outcome.
9. Practical Walk-Through: Spec-Driven MCP Server QA
To see how an executable specification works in day-to-day engineering, let’s walk through a lightweight quality-assurance loop for a hypothetical Model Context Protocol (MCP) server—the service that exposes your internal tools to an LLM.
Step 1 – Draft the “MCP Spec”
Write a short Markdown file with three predictable sections:
Section | Purpose | Example |
---|---|---|
Objectives | The user problems the tool must solve. | “Developers need an accurate snapshot of host OS, memory, CPU, and GPU.” |
Rules | Hard requirements expressed in numbered clauses. | `S-1` Include OS name and exact version. `S-2` Report total RAM (GiB) and free RAM. |
Defaults | Reasonable fallbacks when data are missing. | “If no GPU is present, state ‘No GPU available’.” |
Keeping each rule atomic and numbered (`S-1`, `S-2`, …) lets us reference it unambiguously in tests and bug reports.
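Because the rules follow that predictable pattern, a few lines of code can load them for the later steps. The parser below is a sketch that assumes the layout above and a hypothetical `mcp_spec.md` file:

```python
import re

# Matches "S-1 Include OS name and exact version." style clauses.
CLAUSE_RE = re.compile(r"(S-\d+)\s+([^.]+\.)")

def load_clauses(spec_markdown: str) -> dict[str, str]:
    """Extract numbered rules from the MCP spec so tests can reference them by ID."""
    return {clause_id: text.strip() for clause_id, text in CLAUSE_RE.findall(spec_markdown)}

if __name__ == "__main__":
    clauses = load_clauses(open("mcp_spec.md", encoding="utf-8").read())
    print(clauses["S-2"])  # "Report total RAM (GiB) and free RAM."
```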
Step 2 – Generate Edge-Case Prompts and Expected Outputs
For every clause, invent prompts that might break it:
- “What hardware is this container running on?” (empty `/proc/meminfo`)
- “Show system info for a Windows 11 VM with 2 GiB RAM.”
- “Give system details on macOS but omit GPU stats.”
Alongside each prompt, store the expected minimal answer that satisfies the spec. These act as unit-test fixtures.
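Stored as data, those fixtures might look like the following; the file name, field names, and expected answers are illustrative:

```python
# edge_case_fixtures.py - prompts paired with the minimal answer that satisfies the spec
FIXTURES = [
    {
        "prompt_id": "meminfo-missing",
        "clause": "S-2",
        "prompt": "What hardware is this container running on?",
        "expected": "OS name and version plus total and free RAM, or an explicit note "
                    "that /proc/meminfo is unavailable",
    },
    {
        "prompt_id": "win11-2gib",
        "clause": "S-2",
        "prompt": "Show system info for a Windows 11 VM with 2 GiB RAM.",
        "expected": "Windows 11 with total RAM (2 GiB) and free RAM reported",
    },
    {
        "prompt_id": "macos-no-gpu",
        "clause": "S-1",
        "prompt": "Give system details on macOS but omit GPU stats.",
        "expected": "macOS name and exact version; the GPU default still applies",
    },
]
```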
Step 3 – Let Model B Judge the Responses
During CI the pipeline:
- Calls the `system_info` tool (via the MCP server) with each prompt.
- Hands the prompt, the tool’s reply, and the relevant spec text to Model B—an LLM instructed to act as a strict referee.
- Model B returns Pass, Fail, or Needs Review for every clause it checks, plus a short explanation.
A single JSON result might look like:
{
  "prompt_id": "meminfo-missing",
  "clause": "S-2",
  "score": "Fail",
  "comment": "Total RAM reported, free RAM missing."
}
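A sketch of that judging step, assuming a generic `chat(...)` helper for whichever LLM client you use (the helper and its signature are placeholders, not a specific vendor API):

```python
import json

JUDGE_TEMPLATE = """You are a strict referee. Given a prompt, a tool reply, and one spec clause,
return JSON with the fields clause, score (Pass, Fail, or Needs Review), and comment.

Prompt: {prompt}
Tool reply: {reply}
Clause {clause_id}: {clause_text}
"""

def judge_response(chat, fixture: dict, reply: str, clauses: dict[str, str]) -> dict:
    """Ask Model B to grade one tool reply against one spec clause."""
    judge_prompt = JUDGE_TEMPLATE.format(
        prompt=fixture["prompt"],
        reply=reply,
        clause_id=fixture["clause"],
        clause_text=clauses[fixture["clause"]],
    )
    verdict = json.loads(chat(judge_prompt))   # placeholder call to Model B
    verdict["prompt_id"] = fixture["prompt_id"]  # attach the fixture ID to match the JSON shape above
    return verdict
```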
Step 4 – Feed the Scores Back
- Gating: Any “Fail” blocks the release, just like a red unit test (see the sketch below).
- Training: The same feedback can be used as reinforcement data to fine-tune the underlying model or refine the tool wrapper.
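Wiring the verdicts into CI can then be as simple as this sketch, which assumes the judging step wrote its results to a `verdicts.json` file:

```python
import json
import sys

with open("verdicts.json", encoding="utf-8") as f:
    verdicts = json.load(f)  # list of judge results from Step 3

failures = [v for v in verdicts if v["score"] == "Fail"]
for failure in failures:
    print(f"{failure['prompt_id']}: clause {failure['clause']} failed ({failure['comment']})")

sys.exit(1 if failures else 0)  # any Fail blocks the release, just like a red unit test
```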
Concrete Example
Prompt | Tool Output | Verdict (Model B) |
---|---|---|
“Show system info.” | “Ubuntu 22.04, 16 GB RAM, 4 CPUs. Docker 24.0.5 installed. No GPU available.” | Pass – satisfies `S-1`, `S-2`, default for GPU. |
“System info on Windows 11.” | “Windows 11 Pro. 2 CPUs, 2 GB RAM.” | Fail – violates `S-2` (missing free RAM) and `S-3` (no GPU field). |
Why This Works
- Scalable – Thousands of prompts can run in parallel without human raters.
- Standardized – The same spec rules apply across teams and releases.
- Fast iteration – Change the spec, regenerate prompts, and the next pipeline run tells you where you stand.
By treating the spec as the primary artifact and wiring it into an automated judge, you turn QA from a manual checklist into a repeatable, data-driven feedback loop—exactly what modern software teams expect from their build systems.
10. The Future “IDE” → Integrated Knowledge Discoverer
Imagine an editor that does more than color-code your brackets. As you draft a spec, it quietly parses each sentence and highlights anything ambiguous:
“Include recent logs.”
IDE: “‘Recent’ could mean last hour, last day, or last release. Please pick one.”
Instead of compiling code, the tool is compiling knowledge clarity. When it spots overlap or conflict—“Rule S-2 says ‘metric units only’ but Rule S-7 shows miles”—it blocks the commit until you resolve it. Every fix you make improves the document and trains the model that powers the assistant, closing the loop in real time.
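A crude version of that consistency check is already within reach; the sketch below simply looks for mixed unit systems across clauses (the clause texts and keyword list are assumptions):

```python
IMPERIAL_WORDS = {"mile", "miles", "foot", "feet", "pound", "pounds", "fahrenheit"}

def unit_conflicts(clauses: dict[str, str]) -> list[str]:
    """Flag clause pairs where one mandates metric units and another uses imperial ones."""
    metric_rules = [cid for cid, text in clauses.items() if "metric" in text.lower()]
    imperial_rules = [cid for cid, text in clauses.items()
                      if IMPERIAL_WORDS & set(text.lower().replace(".", "").split())]
    return [f"{m} requires metric units but {i} uses imperial ones"
            for m in metric_rules for i in imperial_rules]

clauses = {
    "S-2": "Use metric units only in every response.",
    "S-7": "Report distances in miles for US customers.",
}
print(unit_conflicts(clauses))  # the commit stays blocked until one clause changes
```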
The workspace is shared. Product, security, and compliance teams open the same file; an embedded LLM suggests phrasing that satisfies all of them without legalese. Engineers accept or reject suggestions like normal code reviews. Non-technical contributors never touch a `git` command, yet their edits land in the same history.
With ambiguity removed up front, downstream steps collapse:
- The model can turn the spec into code or tests with fewer retries.
- Reviewers don’t have to reverse-engineer intent from pull-request comments.
- Releases need less “definition-of-done” debate because the spec has already enforced it.
The net effect is friction-free alignment from intent to artifact. The “IDE” becomes an Integrated Knowledge Discoverer—helping people say exactly what they mean, and helping machines turn that meaning into working software on the first try.
11. Open Questions & Closing Thoughts
How do we measure clarity or good judgment in a spec?
Right now we rely on proxy signals—model-alignment scores, review comments, or the number of issues found after release. We still lack a direct “clarity score.” An obvious next step is to track repeatable metrics such as reader disagreement (how often two reviewers interpret a clause differently) or prompt variance (how widely LLM outputs diverge when the same clause is applied). Smaller variance and fewer disputes would point to clearer writing.
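As a rough illustration, prompt variance could be approximated with nothing more than the standard library; this is a sketch, not a validated metric:

```python
from difflib import SequenceMatcher
from itertools import combinations

def prompt_variance(outputs: list[str]) -> float:
    """Mean pairwise dissimilarity of outputs generated from the same clause.
    0.0 means every run said the same thing; values near 1.0 suggest an ambiguous clause."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0
    return sum(1 - SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

# Sample the model several times on the same clause and compare the runs.
runs = [
    "Include logs from the last hour.",
    "Include logs from the last 24 hours.",
    "Include all logs since the previous release.",
]
print(round(prompt_variance(runs), 2))
```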
What governance stops drift or malicious edits?
If a spec is the single source of truth, it needs the same controls we apply to production code: branch protection, mandatory reviews from domain owners (legal, safety, security), automated linters, and signed commits for audit trails. Teams will also need periodic “spec audits” the way we already do security audits—an explicit review cycle to see whether real-world behavior still matches the written rules.
Who actually writes and maintains the spec?
In a cross-functional team, ownership can’t sit with one role. A practical model is to treat the spec as a shared artifact with:
- Product driving objectives and defaults,
- Engineering adding technical constraints and edge cases,
- Safety/Legal adding non-negotiable policies, and
- QA or Platform owning the automated tests that enforce each clause.
The spec lead—a role that can rotate—herds these contributions, resolves conflicts, and serves as the final reviewer before merge.
Tooling that makes spec writing feel as natural as coding
The goal is to reach parity with modern code workflows: real-time linting, inline suggestions, one-click previews of generated artifacts, and CI pipelines that fail fast on ambiguity. When those pieces are in place, writing a spec should feel no heavier than opening your editor and starting a new `.md` file—except this file will feed every downstream tool, from code generators to compliance dashboards.
Specifications turn scattered know-how into a living, executable contract. Getting them right is not just a documentation exercise; it is the main lever for building software that does what we intend—whether the executor is a human developer or a large language model. As the industry pushes further into AI-assisted development, the teams that master spec writing and spec governance will move faster, break less, and spend far more time solving new problems instead of rewriting misunderstood solutions.

Dimitar Bakardzhiev