Promises Your Agents Can Keep

Robert von Massow

Jun 11, 2026 · AI & ML · 9 min read

When the compiler can’t save you

Coding agents hallucinate APIs. They invent methods, guess at field names, and confidently call endpoints that were never there. Inside your own codebase this is mostly a non-event — the compiler complains, the linter lights up, and you fix it before it ever runs. There’s a safety net, and it’s automatic.

The moment your code calls another service, that net disappears. No compiler knows the shape of someone else’s API. So when an agent writes a client against a remote service and invents a field that doesn’t exist, nothing stops it. The mock the agent tests against is built from its own assumptions about the remote service — essentially the same guess, written down twice. Tests pass. The build is green. The thing that was never true is now validated by a test that was also never true.
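To make "the same guess, written down twice" concrete, here is a minimal Python sketch. Everything in it is hypothetical: the order service, the field names `customer_id` and `customerId`, and the helper functions are invented for illustration and not taken from any real API.

```python
# Hypothetical scenario: the real service returns "customer_id",
# but the agent guessed "customerId" and wrote both client and mock.

def parse_order(payload: dict) -> str:
    """Client code written by the agent, using its guessed field name."""
    return payload["customerId"]

def mock_order_service() -> dict:
    """Mock written by the same agent, from the same guess."""
    return {"customerId": "c-42"}

def real_order_service() -> dict:
    """What the (hypothetical) real service actually returns."""
    return {"customer_id": "c-42"}

def client_works_against(service) -> bool:
    """Run the client against a service and report whether it succeeds."""
    try:
        return parse_order(service()) == "c-42"
    except KeyError:
        return False

mock_result = client_works_against(mock_order_service)  # the test suite sees this
real_result = client_works_against(real_order_service)  # production sees this
print(mock_result, real_result)
```

The mock agrees with the client because both came from the same assumption, so the green test says nothing about the real service.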

And increasingly, the agent writing that broken client isn’t yours. It belongs to another team. When they build against your service, it’s their agents doing the work — and nothing you own tells those agents what your API actually looks like. The contract only ever existed in a prompt, and prompts don’t get persisted, versioned, or checked against later. (When the spec is missing, the agent just fills the gap — confidently, and not always correctly.)

You find out in production. If the service gets hammered, you find out fast — errors in your metrics, an alert, an incident. If it’s called once a week, you find out the hard way: a dormant break detonates in the middle of a nightly processing chain, pages someone at 3am, and drops them into a failure they have no context for. They don’t know why the change was made, what it was supposed to do, or how to fix it. Getting someone out of bed is expensive enough on its own.

This is the seam where agentic engineering quietly breaks: not inside a service, but between services, where no compiler is watching. (I’ve written about the broader version of this problem — how agentic engineering pushes specifications out of artifacts and into scattered conversations — in From Developers to Operators.) Which leaves a producer with two questions. When another team’s agents build against my service, how do I hand them my API automatically, so they’re building against the truth instead of a guess? And how do I evolve that API without silently breaking everyone who depends on it?

A promise you can actually keep: the API contract

Both questions have the same root: whatever the other team’s agent reads has to be the right kind of artifact in the first place.

Documentation describes what already exists. A spec describes what should be true — before you build it. (My colleague Jan has written about why that distinction matters, tracing it back through IDLs, Protocol Buffers, and OpenAPI.)

That distinction is easy to wave away until it bites you. Documentation is written after the fact and drifts the moment the code changes; it’s a description, not a constraint. A spec is the thing you build against and check against. It’s the difference between “here’s roughly what this service does” and “here’s what this service promises, and here’s how you’ll know if that promise breaks.” When another team’s agent builds against your service, a description is exactly what you don’t want it reading. It needs a constraint.

When you build a service other services call, you make a promise about its interface. Usually that promise is implicit — it lives in your head, in the documentation, in a prompt, in whatever the calling team assumed last Tuesday. pinky-promise makes it a real promise: explicit, versioned, and enforced.

The idea is simple. When you design a service, you declare its API surface as part of the brainstorming — not as a documentation chore afterward, but as part of deciding what the service is. That contract gets published to a shared registry. Consumers pin to a version and have their code checked against it.

The registry is just a git repository you control. No external service, nothing to sign up for, no SaaS sitting between your teams. But “just a git repo” undersells what it becomes: it’s the shared spec registry your services publish to and your consumers build against, the place the contract finally lives instead of evaporating in a chat session. Producers push their API specs; consumers build against the published version they have pinned. Two teams’ agents coordinate through the registry without ever talking to each other directly: asynchronous communication, with the contract as the medium.

That’s the whole shift. The promise stops being something you remember and starts being something the system can check.

What it does

Here’s what that checking actually looks like — and it splits cleanly by which side you’re on. You adopt the half that fits you; the registry is the only thing you share.

If you own a service, you’re a producer:

Declare your API surface during brainstorming, as part of designing the service — not as a documentation chore afterward.
Publish a versioned spec to the registry automatically when the branch is complete.
Get every change classified as major, minor, or patch, with breaking ones caught before they’re planned, not after they’re deployed.
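As a rough illustration of the classification step, here is a small Python sketch. The rules are an assumption on my part, following common semantic-versioning practice for response schemas (a removed or retyped field is major, an added field is minor, anything else is patch); the function name `classify_change` and the schema format are hypothetical and not pinky-promise's actual implementation.

```python
def classify_change(old: dict[str, str], new: dict[str, str]) -> str:
    """Classify a change between two response schemas (field name -> type)."""
    removed = old.keys() - new.keys()
    retyped = {f for f in old.keys() & new.keys() if old[f] != new[f]}
    added = new.keys() - old.keys()
    if removed or retyped:
        return "major"  # existing consumers may break
    if added:
        return "minor"  # backwards-compatible addition
    return "patch"      # no change to the surface

v1 = {"id": "string", "total": "number"}
added_field = classify_change(v1, {"id": "string", "total": "number", "currency": "string"})
removed_field = classify_change(v1, {"id": "string"})
unchanged = classify_change(v1, dict(v1))
print(added_field, removed_field, unchanged)
```

A real classifier would also have to look at request parameters, where adding a required field is breaking rather than additive.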

If you call a service, you’re a consumer:

Pin to a specific version and have every call validated against the published spec — at planning, implementation, and code review.
Import a remote or third-party spec (OpenAPI, gRPC, GraphQL) with /api-spec-import, so even an API you don’t own — Stripe, Twilio, an internal platform service — can’t be hallucinated against.
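The consumer side can be sketched in a few lines of Python. This is a toy under stated assumptions: `REGISTRY` is an in-memory stand-in for the git registry, and the service name, versions, fields, and the `unknown_fields` helper are all hypothetical.

```python
# Hypothetical stand-in for the registry: (service, version) -> published response fields.
REGISTRY = {
    ("orders", "1.2.0"): {"id", "total", "customer_id"},
    ("orders", "2.0.0"): {"id", "total"},
}

# The consumer's pin: which published version its code is checked against.
PINNED = {"orders": "1.2.0"}

def unknown_fields(service: str, fields_used: set[str]) -> set[str]:
    """Return the fields the consumer reads that the pinned spec does not publish."""
    spec_fields = REGISTRY[(service, PINNED[service])]
    return fields_used - spec_fields

valid_call = unknown_fields("orders", {"id", "customer_id"})
hallucinated_call = unknown_fields("orders", {"id", "customerId"})
print(valid_call, hallucinated_call)
```

An empty result means every field the consumer touches exists in the pinned spec; anything else is a field that was guessed rather than published.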

And one feature both sides share — the bridge between them:

Generate pact tests from the contract. The consumer proves its calls match what it pinned; the producer proves its service still honors what it published. Both check against the same contract, from opposite ends.
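Here is a minimal sketch of that two-sided check in plain Python. It deliberately does not use the Pact library; the contract format, the `/orders/42` endpoint, and all function names are hypothetical, chosen only to show the shape of the idea.

```python
# Hypothetical contract shared by both sides.
CONTRACT = {
    "request": {"method": "GET", "path": "/orders/42"},
    "response": {
        "status": 200,
        "body_fields": {"id": str, "total": float},
        "example": {"id": "42", "total": 9.5},
    },
}

def stub_from_contract(contract):
    """Build a fake provider that answers exactly as the contract says."""
    req, resp = contract["request"], contract["response"]
    def stub(method, path):
        assert (method, path) == (req["method"], req["path"]), "request not in contract"
        return resp["status"], resp["example"]
    return stub

def consumer_get_total(send) -> float:
    """Consumer code under test: reads only fields the contract publishes."""
    status, body = send("GET", "/orders/42")
    return body["total"] if status == 200 else 0.0

def provider_handler(method, path):
    """Hypothetical real provider; extra fields beyond the contract are allowed."""
    return 200, {"id": "42", "total": 19.99, "currency": "EUR"}

def provider_honors(contract, handler) -> bool:
    """Replay the contract's request against the provider and check the response shape."""
    req, resp = contract["request"], contract["response"]
    status, body = handler(req["method"], req["path"])
    return status == resp["status"] and all(
        isinstance(body.get(name), typ) for name, typ in resp["body_fields"].items()
    )

consumer_ok = consumer_get_total(stub_from_contract(CONTRACT)) == 9.5
provider_ok = provider_honors(CONTRACT, provider_handler)
print(consumer_ok, provider_ok)
```

Neither side ever calls the other. The consumer runs against a stub derived from the contract, and the provider is verified by replaying the contract's request, so both results depend only on the contract and the code.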

Deterministic checks for AI agents: API contract testing with pact

Start with those pact tests, because they’re the part you can lean on hardest. They’re deterministic: same input, same result, no model in the loop — the same check, run the same way, every time.

Be honest about what that buys you, though. Pact tests don’t verify in the strict sense — they don’t explore the whole input space, so they’re not a proof. They’re a deterministic check, not a guarantee. Which raises the obvious question: if no single check is conclusive, what makes the whole thing trustworthy?

Borrow the Swiss cheese model from safety engineering — our Swiss army knife for thinking about safety in probabilistic environments. Every layer of defense has holes. No single layer catches everything. You stay safe by stacking layers so that no hole lines up with the next — a failure has to pass through all of them at once to reach production. Leave it to the Swiss to hand us every tool we need.

Pact is one slice. The agent-driven checks — validating your calls at planning, implementation, and review — are another. And it’s worth being honest about that slice too: it’s gathering evidence, not proof. The more checks, the more evidence, the more trust — the same reason building with superpowers earns more of my confidence than vibe-coding against a plain assistant. But an agent can report a false negative and wave through a real break, or a false positive and flag something fine. It’s a strong early signal. It’s still a signal, with holes of its own.

The point is that the two slices are made of different material. The mistake would be to stack a second model-driven check on top of the first — another slice with its holes in the same places, letting the same failures through. Pact’s holes are somewhere else entirely, because it fails for entirely different reasons: it doesn’t depend on how the model behaves on any given day. You’re not collecting more of the same evidence. You’re widening the error surface you actually mitigate, by checking the contract two independent ways.
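A back-of-the-envelope calculation shows why the material matters. The miss rates below are invented for illustration, and the two functions model idealized extremes (fully independent layers versus layers whose holes coincide exactly); real checks sit somewhere in between.

```python
import math

def escape_rate_independent(miss_rates):
    """Layers fail for unrelated reasons: a break must slip past each one separately."""
    return math.prod(miss_rates)

def escape_rate_same_holes(miss_rates):
    """Layers share their blind spots: whatever slips past the tightest layer slips past all."""
    return min(miss_rates)

# Illustrative assumption: each check misses 10% of real breaks.
layers = [0.1, 0.1]
independent = escape_rate_independent(layers)
overlapping = escape_rate_same_holes(layers)
print(f"independent layers: {independent:.2%}, overlapping layers: {overlapping:.2%}")
```

Under these assumptions, two independent checks let through roughly one break in a hundred, while two checks with identical blind spots still let through one in ten. The second model-driven check adds cost without adding coverage.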

I’ve argued before that with agents, tests stop being a safety net and become the control system. Pact is where that gets concrete: a deterministic floor under a probabilistic process, holding the contract to account no matter how the model behaves.

You still own the outcome

But checking is not the same as owning.

A contract doesn’t write your code, and it doesn’t take responsibility for it. You can delegate execution to an agent — you can’t delegate the decision to trust the result. That decision is still yours, and it’s still where accountability lives.

What a contract changes is whether that decision is reviewable. When the promise between two services is explicit, versioned, and checked from both ends, the seam between teams stops being a place where assumptions quietly drift and becomes something you can actually inspect, reason about, and own. That’s what oversight needs: not a guarantee that nothing breaks, but visibility into the places it could. pinky-promise doesn’t remove the operator from the loop. It gives the operator something solid to hold on to.

Where it’s heading next:

Generating MCP servers for your service on the fly, in the service’s own language.
Deeper authentication and authorization support — inferring auth flows from the binding and automating credential provisioning, so consumers wire up access with as little manual configuration as possible.
Asynchronous and event-driven APIs alongside schema evolution, in the spirit of a schema registry like Confluent’s — so contracts cover evolving message schemas on a stream, not just request/response. We’ve applied the same contract-layer idea to data warehouses, where schema changes routinely break downstream consumers.
Richer breaking-change analysis and migration guidance when a consumer’s pinned version falls behind.

These are active directions, not dated commitments; what ships and when will follow what proves useful in practice, your feedback, and your contributions. So if one of these matters to you, that’s where open source earns its name — open an issue or jump in, and help decide what lands sooner.

pinky-promise is built on top of superpowers and lives here: github.com/superluminar-io/pinky-promise. If your services talk to each other — and increasingly, if your agents do — give it a try, and tell me where it breaks.

About the author

Robert von Massow

With over a decade of hands-on AWS experience and certifications spanning Developer to Security Specialty, Robert works as a Cloud Consultant at superluminar. Here he shares stories and insights from his work — from serious AWS challenges to playful experiments and everything in between.