The Live Migration Problem

January 2023


Upgrading an existing system in-place is much harder than writing a new one.

In my experience, the bulk of the brainpower poured into server-side web development does not go into fixes, features, and optimizations. It goes into a poorly-understood black hole that I call live migration hell: it's not enough to write code that works—you have to write code that is compatible with the previous version of itself. Seemingly trivial changes get amplified into months-long deployment plans to safely upgrade the running production system in-place.

Live Migrations and Their Pitfalls, by Example

Some coworkers of mine recently set out to make an innocuous optimization to their server code:

To save on disk space, empty lists will no longer be included in the serialized JSON data. That is, the data {"name": "Foo", "tags": []} should instead be stored as {"name": "Foo"}.

This optimization required only a two-line change to the server code. That change was made months ago, but it still has not been deployed to the running service. Even though the new code in source control passes all tests and works perfectly, we cannot safely deploy it.
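For illustration, the write-path change might look something like this sketch (the record shape and the function name are made up, not the real code):

def to_json_dict(record):
    data = {"name": record.name}
    # The two-line change: only include "tags" when the list is non-empty.
    if record.tags:
        data["tags"] = record.tags
    return data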

It's easy to see why:

[Sequence diagram: Developer, SCM, Node A, Node B, DB, Client. The developer commits the change and it is deployed to Node A. A client write goes through Node A, which stores {"name": "Foo"} in the DB and returns OK. A client read then goes through Node B, which fetches {"name": "Foo"} from the DB.]

At this point, Node B crashes because it does not know how to read data without the "tags" property.
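The crash is easy to picture if the old read path assumes the "tags" key is always present, as in this sketch (again, made-up names):

def read_record(data):
    # Old code: every stored record is expected to have a "tags" key,
    # so reading {"name": "Foo"} raises a KeyError here.
    return {"name": data["name"], "tags": data["tags"]}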

The data written by the new code isn't compatible with the code still running on other nodes. Both the code and the data have to change, making this one of the simplest examples of a live migration. The devs have to migrate the data from one format to another and the code from a version that understands the old format to a version that understands the new format. The whole process has to be live; the devs can't take the service down to perform the change.

Remarkably little has been written about live migrations. Once I understood the general shape of them, I started seeing them everywhere, and I started wishing for structured solutions.

State-of-the-Art "Solutions" are Awkward, Time-Consuming, and Error-Prone

Today, everyone I know of approaches live migrations using two-phase deployments:

  1. Write and test a "compatibility" version that can read records with or without the "tags" property (sketched just after this list).
  2. Deploy the compatibility version to all nodes.
  3. Implement the actual change so that newly written records do not include the "tags" property.
  4. Deploy the final version.
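The compatibility read path in step 1 is tiny; it only has to treat a missing "tags" key as an empty list. Something like this sketch:

def read_record(data):
    # Compatibility version: accept records written with or without "tags".
    return {"name": data["name"], "tags": data.get("tags", [])}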

This approach works, but it's substantially more work. The two-line change has become a carefully staggered deployment plan requiring close oversight. The devs have to commit the right changes to source control at the right times, and some of their work has to wait on deployments.

Worse, the two-phase deployment is not the whole story.

Bake Times

In practice, the developer will probably have to wait days or weeks between steps 2 and 4. Otherwise, rollbacks aren't safe!

[Sequence diagram: Developer, SCM, Node A, DB, Client. The developer commits the compatibility code and deploys it, then commits the new code and deploys it. A client write through Node A stores {"name": "Foo"} in the DB and returns OK. Node A is then rolled back, and a client read fetches {"name": "Foo"} from the DB.]

At this point, Node A crashes because it has forgotten how to read data without the "tags" property.

To solve the problem, we have to promise not to roll back too many versions. We can only make that promise if we are completely confident that the compatibility version works, leading to a much slower workflow:

  1. Write and test a "compatibility" version that can read records with or without the "tags" property.
  2. Deploy the compatibility version to all nodes.
  3. NEW: Bake for two weeks.
  4. Implement the actual change so that new records do not include the "tags" property.
  5. Deploy the final version.

Introducing a bake time between the two deployments reduces the chances that we will ever have to roll back all the way to the original version.

However, bake times are incredibly annoying. They impose a minimum length of time to complete the work; the feature cannot be finished before the bake time is complete. They also introduce an obligatory context switch; the developer cannot do all the work in one sitting and must return to the project days later to finish up.

Data Cleanup

So far, the optimization is mostly useless. All the existing data is still wasting space. Only new data is being written in the optimized format!

Fixing that problem leads to an even longer workflow:

  1. Write and test a "compatibility" version that can read records with or without the "tags" property.
  2. Deploy the compatibility version to all nodes.
  3. Bake for two weeks.
  4. Implement the actual change so that new records do not include the "tags" property.
  5. NEW: Implement a background job that optimizes the existing data by removing unnecessary "tags" fields.
  6. Deploy the optimized version and background job.

Rewriting existing data while the service is still handling client requests can be a nightmare. Even sensible-looking implementations of the rewriter background job can cause data loss! For example, suppose we implement the rewriter in the obvious way:

# The obvious rewriter: walk every record and strip empty "tags" lists in place.
for name in db.all_names():
    record = db.get(name)
    if record.tags == []:
        db.put(remove_tags(record))

This can interleave with client requests to unexpectedly overwrite client data. In this scenario, the rewriter deletes the "x" tag right after the client adds it:

[Sequence diagram: Rewriter, DB, Node A, Client. The rewriter calls db.all_names() and gets ["Foo"], then calls db.get("Foo") and gets {"name": "Foo", "tags": []}. Meanwhile, the client writes {"name": "Foo", "tags": ["x"]} through Node A and gets OK. The rewriter then calls db.put({"name": "Foo"}), and a later client read of "Foo" returns {"name": "Foo"}.]

Oops! The client was expecting to see {"name": "Foo", "tags": ["x"]}, but got {"name": "Foo"} instead.

This simplistic approach can also miss records. In this scenario, the rewriter fails to remove the "tags" field from a concurrently-created record:

[Sequence diagram: Operator, Rewriter, DB, Node A, Client. The operator starts the rewriter, which calls db.all_names() and gets an empty list. Meanwhile, a client creates {"name": "Foo", "tags": []} through Node A and gets OK. The rewriter reports "All done!" without ever having seen the new record.]

Such bugs can of course be detected and prevented at the cost of a great deal more engineering, but I'm not satisfied with any solution in which such a simple change entails such a herculean effort.
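For the record, the lost-update bug above is typically prevented by making the rewrite conditional on the record not having changed since it was read. Here is a sketch assuming the database offers some kind of compare-and-swap put (the version field and the put_if_version method are made up):

for name in db.all_names():
    record = db.get(name)
    if record.tags == []:
        # Only rewrite the record if nobody has modified it since we read it;
        # if the conditional put fails, skip the record or retry.
        db.put_if_version(remove_tags(record), expected_version=record.version)

The missed-record problem needs its own fix, such as re-scanning until a pass finds nothing left to rewrite, and all of this is still extra machinery to build, test, and eventually delete.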

Code Cleanup

The usual two-phase deployment story is a lie. In reality, there are almost always extra deployments.

After all our effort, the code now includes lots of unnecessary cruft: it has pathways to read two different kinds of records and it has a background job that will be irrelevant once it finishes. Eventually, the devs will also need to remove the cruft:

  1. Write and test a "compatibility" version that can read records with or without the "tags" property.
  2. Deploy the compatibility version to all nodes.
  3. Bake for two weeks.
  4. Implement the actual change so that new records do not include the "tags" property.
  5. Implement a background job that optimizes the existing data by removing unnecessary "tags" fields.
  6. Deploy the optimized version and background job.
  7. NEW: Wait for the background job to finish.
  8. NEW: Clean up the compatibility code.
  9. NEW: Clean up the background job code.
  10. NEW: Deploy the cleaned-up version.

Thus our two-phase deployment becomes a three-phase deployment. Furthermore, it involves yet another long wait! The developer can't remove all the unnecessary code until the background job has finished.

Feature Flags

Doing a full deployment can be time consuming. Feature flags offer faster rollout and rollback at the cost of extra work:

  1. Write and test a "compatibility" version that can read records with or without the "tags" property.
  2. Implement the actual change so that new records do not include the "tags" property.
  3. NEW: Implement a feature flag to control whether the data is written in the old or optimized format.
  4. Deploy the compatibility version to all nodes.
  5. Bake for two weeks.
  6. NEW: Flip the feature flag.
  7. Implement a background job that optimizes the existing data by removing unnecessary "tags" fields.
  8. Deploy the background job.
  9. Wait for the background job to finish.
  10. NEW: Remove the feature flag.
  11. Clean up the compatibility code.
  12. Clean up the background job code.
  13. Deploy the final version.

A new feature flag is yet more development work to implement and test, and yet more work to remove once it is no longer necessary.
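For concreteness, the flag-gated write path might look something like this sketch (the flag client and the flag name are made up):

def to_json_dict(record, flags):
    data = {"name": record.name}
    if flags.enabled("omit-empty-tags"):
        # Optimized format: only store "tags" when the list is non-empty.
        if record.tags:
            data["tags"] = record.tags
    else:
        # Old format: always store "tags", even when it is empty.
        data["tags"] = record.tags
    return data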

Musings

I don't accept that live migrations are fundamentally hard. With a bit of insight, maybe we can tackle the difficulties (or at least lessen them). At a minimum, I think we would want two things:

  1. All the development work of implementing, reviewing, and merging code should be doable in one sitting, without waiting on long bake times or background jobs.
  2. The migration itself should be testable (or even better: verifiable).

My thoughts on this are rapidly evolving, but here are a few musings.

Code the Plan (The Past Matters, the Future Matters)

The plan for our example live migration is carried out by people. People are responsible for taking the right actions at the right times, and perhaps they shouldn't be. Perhaps the plan should be written in code. Unlike plans carried out by people, code can be tested under thousands of different scenarios per second to reveal problems.

The kind of code that carries out such a plan would be very different from the kind of code that web developers are writing today. After all, our plan includes actions like "implement this change" and "deploy this code".

The fact is that in web development, history matters. Overwriting the code in your SCM's main branch is not the job—the job is to maintain and improve the production system while it serves clients. Therefore, the plan for that maintenance and improvement ought to have just as privileged a position as the production code itself. Some actor executed some sequence of steps to get the production system to where it is today, and that sequence of steps should be spelled out in code—along with the upcoming steps that have not yet been executed.
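I don't yet know what that code should look like, but to make the musing slightly more concrete, here is a very rough sketch of the example migration's plan expressed as data for some hypothetical migration engine to execute and, more importantly, to test (every name here is made up):

from dataclasses import dataclass

@dataclass
class Deploy:
    version: str

@dataclass
class Bake:
    days: int

@dataclass
class RunJobToCompletion:
    job: str

# The entire history of the migration, past steps and future steps,
# spelled out in one reviewable, testable place.
PLAN = [
    Deploy("reader-accepts-missing-tags"),    # compatibility version
    Bake(days=14),                            # rollback safety window
    Deploy("writer-omits-empty-tags"),        # the actual optimization
    RunJobToCompletion("strip-empty-tags"),   # rewrite existing records
    Deploy("remove-compat-and-rewriter"),     # code cleanup
]

The point is not the syntax; it's that the whole sequence, waits included, becomes an artifact that tools can simulate and check.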

Insights from Distributed Transactions

Live migrations can be seen as distributed transactions over two databases: the production data and the production code. Your data might be stored in memory, in flat files, or in a true database server, and your code might be stored as executables on disk or as Docker containers in a container engine, but all of these are just ways to store data.

Two-phase deployments are a little like two-phase commits: the first deployment is a "prepare" step and the second deployment is a "commit" step.

Viewed this way, the migration plan becomes a series of transactions over the combined data/code database. Perhaps that insight gives some structure to what a programmed migration plan might look like?
