About us

Categories

Tags

Follow us on: X, LinkedIn

Initiated and Officially Supported by Tensormesh

OpenAI API Is the New IPv4

By

Junchen Jiang

A new system stack is quietly taking shape around LLM serving. What makes it interesting is not just how quickly it is evolving, but how familiar the shape of that evolution looks if you’ve spent time studying large-scale systems like the internet or web infrastructure. These systems were never cleanly designed from first principles; they emerged under pressure—from developers building applications, from infrastructure teams chasing efficiency, and from companies trying to interoperate in fragmented ecosystems. Over time, structure appeared.

The same thing is now happening in LLM systems. What started as a simple inference call has turned into a layered architecture, and within that architecture, a familiar pattern is emerging: a narrow waist.

Diagram illustrating the concept of 'The New Narrow Waist of LLM Systems' showing a flow from traditional network protocols to modern applications, including an OpenAI-compatible API.

Left: Internet protocol hourglass, adapted from Steve Deering’s 2001 IETF presentation Watching the Waist of the Protocol Hourglass.

The claim here is straightforward: the OpenAI-style API is becoming that narrow waist. This comes with real advantages—but also structural limitations that are already shaping what is and isn’t possible in the ecosystem.

Note: “OpenAI API” and “OpenAI-compatible API” are used interchangeably here to refer to the broader OpenAI-compatible interface used across providers and open-source inference engines.

New Layering

Just a few years ago, LLM inference barely resembled a system. You would send a prompt to something like a HuggingFace Transformer and receive a response. There was no meaningful notion of layering, coordination, or separation of concerns. Everything collapsed into a single step.

A comparison of LLM applications in 2023 and 2026, showcasing the transition from HuggingFace Transformer to a more complex application stack including an agent, OpenAI API, inference orchestrator, AI-native data store, and an inference engine.

That simplicity is gone. Today, what you see depends entirely on where you stand. Infrastructure engineers talk about GPU kernels, KV cache layouts, batching strategies, and memory scheduling. Application developers describe agents, RAG pipelines, tool execution, and long-term memory. Both perspectives are correct, but neither captures the full picture.

What has emerged instead is a real stack with distinct layers and boundaries. The transition is captured clearly in the figure, which contrasts a single transformer call in 2023 with a structured LLM stack by 2026.

At a high level, the stack now looks like this:

  • Applications at the top assemble agents and model calls into user-facing systems.
  • Beneath them sits an API layer—most commonly the OpenAI API—which standardizes interaction.
  • Below that, an inference orchestrator handles routing, batching, and scheduling.
  • Underneath, an AI-native data layer manages tokens, KV cache, and intermediate state.
  • At the bottom, the inference engine executes models directly on hardware.

What used to be a single function call has become an ecosystem. Once you step back, the resemblance to earlier systems becomes hard to ignore.

The same figure also aligns LLM inference with IP networking and web/video stacks. The pattern repeats almost exactly: an expressive application layer at the top, a standardized interface in the middle, coordination and caching layers beneath, and a hardware-bound execution layer at the bottom.

This is not accidental convergence. It is what happens when systems scale.

Diagram comparing LLM stack and Internet/Web structure, highlighting the 'narrow waist' concept with OpenAI API in the center.

Good news: OpenAI API is becoming the New IPv4

In this emerging stack, the OpenAI API has effectively become the standard interface between applications and inference backends. This is not just about OpenAI as a provider; the entire ecosystem has converged on API compatibility as a coordination mechanism. GPT, Claude, Gemini, Together, Fireworks, vLLM, SGLang—nearly every provider and open source inference engine  implements some version of the same interface.

Diagram illustrating the relationship between Applications, the OpenAI API, and Inference providers, highlighting the API's role as a narrow waist enabling independent evolution yet limiting cross-boundary optimizations.

That convergence creates a powerful network effect. Applications are written once and can run across providers. Providers can plug into an existing ecosystem of applications. The result is portability on one side and adoption on the other.

This is exactly what a narrow waist is supposed to do.

Once the interface stabilizes, it decouples the layers above and below it. Application developers can build increasingly complex systems—agents, multi-step pipelines, retrieval workflows—without needing to care about how inference is implemented underneath. That abstraction is not just convenient; it dramatically lowers the cost of experimentation. Switching providers or combining them no longer requires rethinking the entire system.

At the same time, infrastructure teams can optimize aggressively below the API boundary. Improvements in batching, scheduling, KV cache management, or hardware utilization can be deployed without breaking applications. This separation of concerns allows both layers to move quickly and independently, even though they are developed by completely different actors.

The analogy to IPv4 is precise. A simple, stable interface enabled the internet to scale globally, while network operators competed and innovated underneath. The interface itself remained minimal, but that minimalism is exactly what allowed everything around it to grow. 

The OpenAI API is playing the same role: it standardizes interaction, creates portability, and accelerates ecosystem growth.

Bad news: OpenAI API is becoming the New IPv4!

The same properties that make a narrow waist powerful also make it rigid.

Over time, the interface becomes increasingly difficult to change—even in small ways. The problem is not just inertia; it is structural. Applications depend on the interface for portability, and providers depend on it for compatibility. Any change risks breaking that equilibrium.

More importantly, the API exposes almost nothing about application intent. From the backend’s perspective, everything reduces to input tokens and output tokens. It cannot tell which parts of the input are reusable, which outputs are transient, or how the application trades off latency against cost.

This is exactly what defines a narrow waist: it hides complexity to enable interoperability, but in doing so, it also discards information that could have been useful.

The comparison to IPv4 becomes more than an analogy here. IPv4 succeeded precisely because it was simple and widely adopted, but that same success made it extremely difficult to evolve. Even clearly superior alternatives like IPv6 have taken decades to roll out, and adoption remains incomplete.

The OpenAI API is beginning to show similar characteristics. Attempts to introduce richer semantics—such as caching hints or request-level metadata—struggle not because they lack value, but because they break compatibility with the established standard.

This is how narrow waists become entrenched: not by design, but by success.

What’s its implication?

The most important optimizations in modern LLM systems depend on information that lives above the API boundary, and that information is currently lost.

An application often knows that large parts of a prompt are reused across requests, which would enable aggressive prefix caching. An agent may construct a multi-step execution graph with dependencies that could inform scheduling decisions. A user interaction may tolerate higher latency in exchange for lower cost—or require the opposite.

None of this survives the API boundary. Everything is flattened into independent token sequences.

These are not edge cases; they are central to efficiency. Yet the interface forces the backend to infer structure that the application already knows.

This mismatch shows up clearly in systems like Parrot. Modern LLM applications are not single requests—they are programs composed of retrieval, multiple model calls, tool execution, and iterative logic. But the API reduces them to isolated calls, losing the structure entirely. Parrot reverses this by representing applications as dataflow graphs, allowing the system to apply familiar techniques like dependency tracking, global scheduling, and reuse of intermediate results.

What is striking is not the novelty of these techniques, but the magnitude of the gains. When structure is visible, performance improves by large factors, not marginal percentages. That strongly suggests the bottleneck is not in kernels or hardware, but in the interface itself.

The same pattern appears elsewhere. KV cache reuse is fundamentally limited because the API provides no way to indicate which parts of a prompt are stable. Scheduling systems operate on queues of requests without knowing which belong to the same workflow or share context. Memory systems cannot anticipate future usage because the API has no notion of it.

The networking analogy becomes concrete here. In IP networks, routers operate on packets without understanding applications. That abstraction enabled scale, but it also constrained what could be optimized inside the network. The OpenAI API is beginning to impose a similar constraint on LLM systems.

This becomes even more pronounced in graph-structured applications like agents and RAG pipelines. These systems contain internal dependencies, branches, and reusable intermediate states. If the serving system could see that structure, it could overlap execution, precompute results, and eliminate redundant work. Today, none of that can be expressed.

The result is a growing gap: applications are becoming programs, but the interface still assumes stateless requests.

Final thoughts

The LLM stack has already evolved into multiple layers with distinct responsibilities. Applications orchestrate complex workflows at the top, while infrastructure layers handle execution, memory, and scheduling at scale below.

The OpenAI API sits in between, but it behaves as if that complexity does not exist.

This is the core misalignment. The interface was designed for a world where inference is a stateless request-response service, while applications have already moved to a world where inference is embedded inside larger programs.

Why hasn’t this changed? For the same reason IPv4 persisted. Once an interface becomes standard, changing it requires coordinated movement across an entire ecosystem. Applications depend on it, providers implement it, and inertia builds quickly.

So the system evolves around the interface instead. Applications grow more complex above it, infrastructure becomes more optimized below it, and the boundary itself remains largely fixed.

That does not mean it will never change. Narrow waists are stable, but not immutable. In networking, pressure eventually forced changes like HTTPS and IPSec—not because they were elegant, but because they became necessary. A similar dynamic could emerge here if the gap between application needs and API expressiveness continues to widen.

If that happens, the evolution will likely be slow and uneven. It may take the form of incremental extensions, partial redesigns, or entirely new abstractions that coexist with the current API before replacing it.

Until then, the trajectory is predictable. Innovation will continue rapidly at the top of the stack and at the bottom, but the interface between them may stagnate. When that happens, entire classes of optimizations—especially those that require coordination across layers—will remain confined to research and internal systems, not because they do not work, but because they do not fit through the interface.

That is exactly what happened in IP networking.

And it may already be happening in LLM systems.

One response to “OpenAI API Is the New IPv4”

  1. Great thoughts, Junchen, thanks a lot for sharing them. The only thing I would add is since OpenAI’s API is owned by OpenAI, they are the only ones that can decide to make it evolve with a chance to be followed. That’s a concern and I wonder how we could make this layer evolve to something better.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Discover more from LMCache Blog

Subscribe now to keep reading and get access to the full archive.

Continue reading