There's no such thing as simultaneous


A thought experiment:

Imagine you have two computers sitting next to each other, about one foot apart, each displaying a clock. The clock has nanosecond precision and you, a superhuman watching two supermonitors, are able to see that the nanoseconds are in sync. As you sit there, looking at the screens, you see that the clocks change simultaneously.

Now move one foot to the left.

The speed of light is, by amazing coincidence, about one foot per nanosecond. It takes a significant fraction of a nanosecond longer for the light from the screen on your right to reach you than it takes the light from the screen on your left.
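
The "foot per nanosecond" figure is easy to check from the defined value of the speed of light:

```python
# Speed of light in a vacuum (exact, by definition of the metre).
C_M_PER_S = 299_792_458
FOOT_IN_M = 0.3048  # international foot, exact by definition

# Time for light to cross one foot, in nanoseconds.
travel_ns = FOOT_IN_M / C_M_PER_S * 1e9
print(f"{travel_ns:.3f} ns")  # 1.017 ns
```

So "one foot per nanosecond" is off by under two percent.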

The clocks you had so carefully synced are no longer in sync. The nanosecond ticks are no longer simultaneous. The clocks didn't change, but to you, the observer, the one on the right is now slow.

In our universe, with its finite speed of light, "simultaneous" is relative. Whether two events appear to happen at the "same time" is a matter of perspective.

Why does this matter to programmers?

Modern computers operate at speeds where the speed of light is a factor. Electrical signals propagate at 50–99% of the speed of light, and even so, the travel times along different circuit paths can result in meaningful clock skew. Across networks, there's no hope.

In modern application development, we almost always have multiple copies of an application running, as well as databases, caches, and other dependencies. No signal will ever affect them all at the same time. New versions of code will start running at different times. Different database connections will get updated snapshots at different times—especially if there is replication involved. Pushing updates to in-memory state has some signal and propagation time.

A lot of the time, we can safely ignore this. If I change a feature flag and within a few milliseconds—a few million nanoseconds—everything reflects the change, that is good enough. There is not a meaningful difference between a trailing handful of requests that don't get the new state because they were already in flight and requests that started between the time the change was made and it propagated to the process that's handling them.
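
One common source of that propagation window is in-process caching of flag values. A minimal sketch, with a hypothetical in-memory store standing in for a real flag service:

```python
import time


class CachedFlag:
    """Re-reads a flag from its backing store at most once per
    `ttl` seconds, so a change can take up to `ttl` seconds to
    reach this process."""

    def __init__(self, read_source, ttl: float = 5.0):
        self._read = read_source
        self._ttl = ttl
        self._value = read_source()
        self._fetched_at = time.monotonic()

    def get(self):
        if time.monotonic() - self._fetched_at >= self._ttl:
            self._value = self._read()
            self._fetched_at = time.monotonic()
        return self._value


store = {"new_checkout": False}  # stand-in for a flag service
flag = CachedFlag(lambda: store["new_checkout"], ttl=3600)

store["new_checkout"] = True  # the flag flips...
print(flag.get())             # ...but this process still sees False
```

For most flags, a few seconds of staleness like this is exactly the "good enough" described above.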

Certain changes, however, can cause problems if they seem to occur in the wrong order.

SQL database migrations are a common example. Changing a database schema during a deploy usually means multiple seconds, at least, between when the new schema takes effect and new code that uses that schema is running.

If you are adding a column, new code that reads or writes that column will fail for those few seconds if the code is deployed before the column exists. And if the column lands first but is required and has no default, old code inserting rows will fail instead. On the other hand, if you are dropping a column, code that reads that column will fail for a few seconds unless that code is removed first. The order matters, and the correct order depends on what is changing.
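
Both failure modes are easy to reproduce with sqlite3 (the table and column names here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# New code, old schema: reading a column that doesn't exist yet.
read_error = None
try:
    conn.execute("SELECT name, email FROM users")
except sqlite3.OperationalError as exc:
    read_error = exc
print("new code, old schema:", read_error)

# Old code, new schema: the column landed first as NOT NULL with
# no default, and code that doesn't know about it can't insert.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER NOT NULL)"
)
write_error = None
try:
    conn.execute("INSERT INTO orders (id) VALUES (1)")
except sqlite3.IntegrityError as exc:
    write_error = exc
print("old code, new schema:", write_error)
```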

Some applications deploy UIs and APIs together. If an API changes before the UI is ready to handle that change—maybe because someone's browser has just loaded the old version—the user can experience errors. On the other hand, if the UI expects a new API endpoint to be present and the request is handled by a server that is still running the old code, the user can also experience errors.

Low traffic applications can get away with ignoring these few seconds. But if you have a consistent and high enough volume of traffic, the deploy will be accompanied by a burst of errors.

That burst of errors is downtime. It negatively impacts service-level indicators and the customer experience.

Planning for zero-downtime

Zero-downtime deploys require a change of perspective: rather than assuming changes occur across the system simultaneously, assume that they happen a few minutes—or days—apart. Then plan them in sequence.

(Even if these seem obvious as technical solutions—and they won't to everyone—they require planning a few distinct steps. Not all teams are happy to see something that looked like one task turn into two or more. Whether it's worth breaking these down is a matter of planning and prioritizing. As always: it depends. And getting agreement among people is the job.)

For adding a database column, this might look like:

  1. Add the column, making it nullable or giving it a default.
  2. Add code that uses this column.
  3. If necessary, backfill and remove the default.
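
Sketched with sqlite3 (a hypothetical `users.email` column; a real migration would go through your migration tool):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('ada')")

# Step 1: add the column as nullable, so old code keeps working.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

# Step 2 happens in the application: deploy code that reads and
# writes users.email, tolerating NULL in rows it hasn't touched.
conn.execute(
    "INSERT INTO users (name, email) VALUES ('grace', 'g@example.com')"
)

# Step 3: backfill once the new code is running everywhere.
conn.execute("UPDATE users SET email = '' WHERE email IS NULL")

print(conn.execute("SELECT name, email FROM users ORDER BY id").fetchall())
```

At no point in this sequence does either the old or the new code see a schema it can't handle.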

For removing a database column, maybe it's:

  1. If necessary, change code to handle null values in the column.
  2. Make the column nullable, or add a default.
  3. Remove code that reads/writes this column.
  4. Remove the column.
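
Step 1 of that sequence, in application code, might look something like this (the `nickname` column and row shape are hypothetical):

```python
def display_name(row: dict) -> str:
    # The nickname column is on its way out: tolerate rows where
    # it is already NULL, or where the column is gone entirely.
    nickname = row.get("nickname")
    return nickname if nickname else row["name"]


print(display_name({"name": "Ada", "nickname": "Countess"}))  # Countess
print(display_name({"name": "Ada", "nickname": None}))        # Ada
print(display_name({"name": "Ada"}))                          # Ada
```

Code like this keeps working through steps 2–4, no matter which state the schema is in when a given request arrives.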

For adding a new API, it might be:

  1. Add the new API.
  2. Add the UI that uses it.

Or possibly:

  1. Add UI code that tries the new API and falls back to the old behavior.
  2. Add the API.
  3. Remove the fallback in the UI.
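
The fallback in step 1 can be sketched like this (the endpoint paths and the injected `fetch` callable are made up for illustration):

```python
class NotFoundError(Exception):
    """Raised by `fetch` when the server returns a 404."""


def get_profile(fetch, user_id: int):
    # Try the new endpoint first; if this request happened to land
    # on a server still running the old code, fall back.
    try:
        return fetch(f"/api/v2/profiles/{user_id}")
    except NotFoundError:
        return fetch(f"/api/v1/users/{user_id}")


# An "old server" that only knows the v1 route:
def old_server(path: str):
    if path.startswith("/api/v1/"):
        return {"id": 7, "name": "ada"}
    raise NotFoundError(path)


print(get_profile(old_server, 7))  # served by the v1 fallback
```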

Or removing an API:

  1. Remove client code that depends on the API.
  2. Remove the API.

Some patterns are a little more forgiving. If you have one service that publishes to a message bus and another service that reads from it, you can stop publishing whenever and the reader will continue to handle all of the (i.e. zero) messages it receives. Or remove the reader and the publisher can keep posting into the void.
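
A queue makes that decoupling concrete. A minimal in-process stand-in for a real message bus:

```python
from queue import Empty, Queue

bus: Queue = Queue()  # stand-in for a real message bus


def drain(bus: Queue) -> int:
    """Handle whatever is waiting; an empty bus is not an error."""
    handled = 0
    while True:
        try:
            bus.get_nowait()
            handled += 1
        except Empty:
            return handled


bus.put("event-1")
bus.put("event-2")
print(drain(bus))  # 2

# The publisher has been removed; the reader idles harmlessly.
print(drain(bus))  # 0
```

Neither side needs to know, or care, when the other was deployed or removed.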

The challenge is the planning

Changes can always be made zero-downtime, with enough planning and steps. The bigger challenge will always be making the case that taking the time to do a zero-downtime change is worth it. Tools like Service-Level Objectives can help make these decisions. If your team is not accustomed to making these trade-offs, expect to spend some time selling them—and sometimes they won't be worth it.