The Next 700 Stateless Services

Why statelessness is no longer a meaningful concept for most services and what tradeoffs the standard design makes

In today's public cloud age, few distributed services use local disks directly. Instead, they delegate their data durability needs to some infrastructure that ultimately does, usually a distributed SQL database or message broker. Practically all cloud services are, as a result, technically stateless (although definitions vary), which is widely considered a best practice. But why? The property is so pervasive as to be meaningless, with the result that statelessness has become an architectural participation award.

Most services still need to manipulate state to be able to function. They just don't store it themselves. Nevertheless, so-called stateless services follow the same general structure, which is useful to understand. It is a reasonable and sometimes adequate baseline for a state management design.

But it does make some tradeoffs. Ignore them at your peril.

The anatomy of a stateless service

The standard stateless service - apart from not storing state directly - is purely request-driven with no internal node-to-node communication and, therefore, horizontally scalable. It is divided into three layers:

  1. Frontend. The frontend layer implements the contract and performs authentication/authorization, rate limiting, etc.

  2. Business. The business layer transforms valid requests into high-level database operations per the service's semantics and business logic.

  3. Database. The database layer contains the low-level SQL statements (or otherwise), transaction control, and utilities necessary to execute the high-level operations.
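The three layers can be sketched in a few lines. This is a minimal illustration only, not a reference implementation: it assumes an in-process SQLite database stands in for the external database, and the names Database, Business, Frontend, the usual table, the MENU set, and the API-key lookup are all hypothetical.

```python
import sqlite3
from typing import Dict, Optional


class Database:
    """Database layer: low-level SQL and transaction control."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS usual ("
                "customer TEXT PRIMARY KEY, pizza TEXT NOT NULL)"
            )

    def upsert_usual(self, customer: str, pizza: str) -> None:
        # "with conn" commits on success and rolls back on an exception.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO usual (customer, pizza) VALUES (?, ?)",
                (customer, pizza),
            )

    def get_usual(self, customer: str) -> Optional[str]:
        row = self.conn.execute(
            "SELECT pizza FROM usual WHERE customer = ?", (customer,)
        ).fetchone()
        return row[0] if row else None


class Business:
    """Business layer: turns a valid request into one high-level operation."""

    MENU = {"margherita", "pepperoni"}  # hypothetical menu

    def __init__(self, db: Database) -> None:
        self.db = db

    def set_usual(self, customer: str, pizza: str) -> None:
        if pizza not in self.MENU:
            raise ValueError("unknown pizza: " + pizza)
        self.db.upsert_usual(customer, pizza)


class Frontend:
    """Frontend layer: contract and authentication (rate limiting omitted)."""

    def __init__(self, business: Business, api_keys: Dict[str, str]) -> None:
        self.business = business
        self.api_keys = api_keys  # hypothetical API key -> customer mapping

    def handle(self, request: dict) -> dict:
        customer = self.api_keys.get(request.get("api_key"))
        if customer is None:
            return {"status": 401}
        try:
            self.business.set_usual(customer, request["pizza"])
        except (KeyError, ValueError):
            return {"status": 400}
        return {"status": 200}
```

Note that no layer keeps anything between requests. Any node running this code can serve any request, which is where the horizontal scalability comes from.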

The stateless design comes in two flavors:

  1. Strict. Each request is turned into a single transactional operation by the business layer. A more descriptive term would be a database wrapper service. CRUD services are typically strict.

  2. Loose. Each request involves multiple transactional operations, or the business layer obtains state non-transactionally, such as from an in-memory cache or an external service. These services typically perform some sort of coordination. Always exciting.

We'll focus on the strict flavor here.

Tradeoffs

Any design has its pros and cons. A design that works in one case may not be suitable in another. Statelessness is no different.

The apparent tradeoff of the strict stateless design is to trade latency for simplicity and scalability, i.e., the strict stateless design is straightforward and horizontally scalable but at the cost of relatively slow request processing. Why slow? Any mutable state needed must be read and processed as part of the database transaction to ensure atomicity, which requires a round trip between the service and the database for each step. That adds up quickly.
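To make the round trips concrete, here is a rough sketch of a read-then-write transaction. It assumes an in-memory SQLite database, so nothing actually crosses a network; the hypothetical CountingConnection wrapper simply counts statements, each of which would be one round trip against a remote database. The table names and the place_usual_order function are likewise invented for illustration.

```python
import sqlite3


class CountingConnection:
    """Counts statements; each stands in for one service-to-database round trip."""

    def __init__(self, conn):
        self.conn = conn
        self.round_trips = 0

    def execute(self, sql, params=()):
        self.round_trips += 1
        return self.conn.execute(sql, params)


def make_db():
    # isolation_level=None puts sqlite3 in autocommit mode, so the explicit
    # BEGIN/COMMIT statements below control the transaction.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE usual (customer TEXT PRIMARY KEY, pizza TEXT)")
    conn.execute("CREATE TABLE orders (customer TEXT, pizza TEXT)")
    conn.execute("INSERT INTO usual VALUES ('alice', 'pepperoni')")
    return CountingConnection(conn)


def place_usual_order(db, customer):
    """Read the mutable state and write the order in one transaction."""
    db.execute("BEGIN IMMEDIATE")
    try:
        row = db.execute(
            "SELECT pizza FROM usual WHERE customer = ?", (customer,)
        ).fetchone()
        if row is None:
            raise LookupError("no usual on file for " + customer)
        db.execute(
            "INSERT INTO orders (customer, pizza) VALUES (?, ?)",
            (customer, row[0]),
        )
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise
    return row[0]


db = make_db()
print(place_usual_order(db, "alice"), db.round_trips)  # pepperoni 4
```

Even this trivial operation takes four sequential statements (begin, read, write, commit), and the service waits on each one before it can issue the next. Some databases and drivers can batch or pipeline statements, but a write that depends on the result of a read cannot be sent before that result arrives.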

If the resulting latency is acceptable, the gained simplicity and scalability make for a desirable tradeoff. Strict stateless designs are widely used for a reason.

But that analysis is too naive in practice.

The actual tradeoff is usually less clear-cut, partly because the stateless service is not the whole system. As discussed in a previous article, The Overuse Of Microservices, microservices are prone to creating complexity elsewhere. Therefore, a meaningful tradeoff analysis must include such consequences:

  1. Scope. The single-transaction limitation inherent in the strict design narrows the scope, and we naturally end up with a microservice, which brings its own problems. For example, it is easy to accidentally introduce atomicity and correctness issues, especially if the service is internal (where we're typically less cautious) or used by multiple other services with different needs.

  2. Scale. The stateless service itself is horizontally scalable, but that is only part of the story. The database must scale, too. Although it can do a lot of the heavy lifting, the service has to play by its rules and limitations. For example, all transactions are fast on a small dataset, where you can get away with numerous queries, writes, and recomputations in every transaction. Not so on large, busy datasets, where table layout, indices, query optimization, etc., become essential concerns. There is a lot of common knowledge available, but with any non-trivial dependency, at scale, you will need to become an expert in that technology.

  3. Agency. The request-driven design is not always adequate. Sometimes actions need to be taken outside the lifetime of a request, such as periodic checks, expiration, and similar background processing. Such actions do not fit nicely into the stateless design. They have to be initiated by an external trigger, such as a global cron job (which is semantically and operationally part of the service) or coordinated internally, which deviates from the purity of the stateless design. Either option adds complexity.

  4. Latency. What if the resulting latency is not acceptable? The window of acceptability can change over time, and it's painfully common to be trapped with design tradeoffs that are no longer desirable. Sometimes the database operations can be optimized, but if not, the service needs to move work out of slow transactions or enter the treacherous waters of caching. Tricky.
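As an illustration of the Agency item above, one way to keep background work within the request-driven model is to expose it as an ordinary operation that an external trigger invokes. The sketch below is an assumption-laden example, not a recommendation: the orders table, its status and created_at columns, and the expire_stale_orders function are all hypothetical, and the cron-style trigger that would call it is not shown.

```python
import sqlite3


def expire_stale_orders(conn, now, max_age_seconds):
    """Expire pending orders older than max_age_seconds in one transaction.

    The update is idempotent, so a retried or duplicated trigger is harmless.
    Returns the number of orders expired by this call.
    """
    cutoff = now - max_age_seconds
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute(
            "UPDATE orders SET status = 'expired' "
            "WHERE status = 'pending' AND created_at <= ?",
            (cutoff,),
        )
    return cur.rowcount


def make_orders_db():
    # Hypothetical schema and sample rows, for demonstration only.
    conn = sqlite3.connect(":memory:")
    with conn:
        conn.execute(
            "CREATE TABLE orders ("
            "id INTEGER PRIMARY KEY, status TEXT, created_at INTEGER)"
        )
        conn.executemany(
            "INSERT INTO orders (status, created_at) VALUES (?, ?)",
            [("pending", 0), ("pending", 100), ("delivered", 0)],
        )
    return conn


conn = make_orders_db()
print(expire_stale_orders(conn, now=130, max_age_seconds=60))  # 1
print(expire_stale_orders(conn, now=130, max_age_seconds=60))  # 0
```

The function itself stays strict, but the trigger that calls it is now a separate piece to deploy, monitor, and reason about, which is the added complexity the item describes.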

State at "The Usual Pizza"

"The Usual Pizza" is an online (fictional) pizzeria that caters to repeat customers, introduced in The Overuse Of Microservices. Their signature feature is the 1-click reorder, where customers can reorder "the usual" in a single request. Here, we'll focus on how the state is managed.

The system is implemented as three stateless microservices:

  1. Request calls Config to update the address and "the usual," if needed, and then places the order with the Order service. It is loose because it makes multiple non-transactional calls to other services for each request.

  2. Config is a stereotypical database wrapper service that stores the address and "the usual" selection. It is strict because it uses a single transaction for each request.

  3. Order retrieves the address and "the usual" selection from the Config service and then records the order in an external database. It is loose because it makes a non-transactional call to Config before its transaction for each request.

The original point of the example was to show that its atomicity problems could be avoided if combined into a single, strict stateless service that does all the work in a single transaction.
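A toy simulation shows how atomicity can slip away in the three-service layout. This is only a sketch under assumptions: the services are plain in-memory Python objects, the class and method names are invented, and the between_calls hook is a hypothetical device for injecting whatever happens in the gap between the two calls. The specific failure scenarios are illustrative and are not taken from the earlier article.

```python
class ConfigService:
    """Strict: each call stands in for a single transaction."""

    def __init__(self):
        self.usual = {}

    def set_usual(self, customer, pizza):
        self.usual[customer] = pizza

    def get_usual(self, customer):
        return self.usual[customer]


class OrderService:
    """Loose: reads from Config outside of its own transaction."""

    def __init__(self, config):
        self.config = config
        self.orders = []

    def place(self, customer):
        pizza = self.config.get_usual(customer)  # non-transactional read
        self.orders.append((customer, pizza))    # stands in for the DB write
        return pizza


class RequestService:
    """Loose: coordinates two calls with nothing tying them together."""

    def __init__(self, config, order):
        self.config = config
        self.order = order

    def reorder(self, customer, pizza, between_calls=None):
        self.config.set_usual(customer, pizza)
        if between_calls is not None:
            between_calls()  # whatever happens in the gap between the calls
        return self.order.place(customer)


config = ConfigService()
order = OrderService(config)
request = RequestService(config, order)

# A concurrent update lands between the two calls.
ordered = request.reorder(
    "alice",
    "pepperoni",
    between_calls=lambda: config.set_usual("alice", "margherita"),
)
print(ordered)  # margherita
```

The customer asked for pepperoni and the order was recorded as margherita. If the hook raises instead, simulating a crash, the configuration change persists while no order is placed. Neither outcome is possible when the update and the order are written in the same transaction.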

But a few more aspects stand out:

  • Request is a pure coordination service that does not directly manipulate state. It is a wrapper service, but with no transactional control and yet stuck with the same pitfalls. The $1M question: is such a service pulling its weight? The same question applies to functional or transformation services that read from one place and write to another.

  • Order is a loose service, but even if it were strict and read the address and "the usual" in its transaction, the overall system would still suffer from potential atomicity problems due to Request's configuration updates. Scope matters.

Config is the more interesting case. It is a by-the-book strict stateless database wrapper service. What is not to like? It is simple and scalable with no obvious shortcomings.

Unless.

Unless one considers the system as a whole, not just each service in isolation. That is the great con of stateless microservices. It is challenging to spot a dubious design when the problems it causes are elsewhere.

Simplicity revisited

Strict stateless services have their place. When they work, they work very well. But the standard stateless design, like microservices, suffers from overuse. Together they fuel a counter-productive design practice: systems are uncritically broken into microservices that each fit the standard stateless design. This is backward. The result is an endless parade of seemingly reasonable database wrappers and coordination microservices that ultimately make up complex and fragile systems.

So much for simplicity.

But there is hope. First, identify and consider the actual tradeoffs and then make a pragmatic decision instead of blindly following the standard design, where the implied simplicity and scalability do not always materialize. Atomicity and correctness problems with state still exist; they are simply handled elsewhere.

And then there is latency. It is the Achilles' heel of the strict stateless design and why the loose flavor, particularly caching, is so seductive. However, there is a fine line between loose and incorrect when non-transactional state or (gasp!) side effects reign. It is easy to end up on the wrong side with a too-clever optimization.