At this point, Node B crashes because it does not know how to read data without the "tags" property.
The new code isn't compatible with the existing data. Both the code and data have to change, making this one of the simplest examples of a live migration. The devs have to migrate the data from one data format to another and the code from a version that understands the old format to a version that understands the new format. The whole process has to be live; the devs can't take the service down to perform the change.
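To make the two formats concrete, here is the shape of the records before and after the change, inferred from the examples later in this piece:

# Record shapes inferred from the scenario; the exact schema is not
# spelled out in the original example.
old_record = {"name": "Foo", "tags": []}  # original format: "tags" always present
new_record = {"name": "Foo"}              # optimized format: empty "tags" omitted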
Remarkably little has been written about live migrations. Once I understood the general shape of them, I started seeing them everywhere, and I started wishing for structured solutions.
Today, everyone I know of approaches live migrations using two-phase deployments:
"tags"
property."tags"
property.This approach works, but it's substantially more work. The two-line change has become a carefully staggered deployment plan requiring careful oversight. The devs have commit the right changes to source control at the right times, and some of their work has to be blocked by deployments.
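As a sketch, the compatibility version's read path might normalize records so the rest of the code never cares which format it received. The helper below is illustrative, not from the original, and assumes records are plain dicts:

def read_record(raw: dict) -> dict:
    # Old records always carry "tags"; optimized records omit the property
    # when the list is empty. Normalize so callers never have to care
    # which format they were handed.
    if "tags" not in raw:
        raw = {**raw, "tags": []}
    return raw

read_record({"name": "Foo"})              # -> {"name": "Foo", "tags": []}
read_record({"name": "Foo", "tags": []})  # -> unchanged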
Worse, the two-phase deployment is not the whole story.
In practice, the developer will probably have to wait days or weeks between steps 2 and 4. Otherwise, rollbacks aren't safe!
At this point, Node A crashes because it has forgotten how to read data without the "tags" property.
To solve the problem, we have to promise not to roll back too many versions. We can only make that promise if we are completely confident that the compatibility version works, leading to a much slower workflow:
"tags"
property."tags"
property.Introducing a bake time between steps 2 and 4 reduces the chances that we might have to roll back to the original version.
However, bake times are incredibly annoying. They impose a minimum length of time to complete the work; the feature cannot be finished before the bake time is complete. They also introduce an obligatory context switch; the developer cannot do all the work in one sitting and must return to the project days later to finish up.
So far, the optimization is mostly useless. All the existing data is still wasting space. Only new data is being written in the optimized format!
Fixing that problem leads to an even longer workflow:
"tags"
property."tags"
property."tags"
fields.Rewriting existing data while handling client requests can be a nightmare. Sensible implementations for the rewriter background job can cause data loss! For example, suppose we implement the rewriter in the obvious way:
for name in db.all_names():          # walk every record in the database
    record = db.get(name)
    if record.tags == []:            # old format, storing an empty list...
        db.put(remove_tags(record))  # ...so write it back without "tags"
This can interleave with client requests to unexpectedly overwrite client data. In this scenario, the rewriter deletes the "x" tag right after the client adds it.
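One possible interleaving, reconstructed as a timeline (using the same hypothetical db API as the rewriter above):

# rewriter: record = db.get("Foo")       # sees {"name": "Foo", "tags": []}
# client:   db.put({"name": "Foo", "tags": ["x"]})   # client adds the "x" tag
# rewriter: db.put(remove_tags(record))  # stale write: {"name": "Foo"}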
Oops! The client was expecting to see {"name": "Foo", "tags": ["x"]}, but got {"name": "Foo"} instead.
This simplistic approach can also miss records. In this scenario, the rewriter fails to remove the "tags" field from a concurrently-created record.
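Again as a reconstructed timeline:

# rewriter: names = db.all_names()               # "Bar" does not exist yet
# client:   db.put({"name": "Bar", "tags": []})  # written by a node still
#                                                # emitting the old format
# rewriter: scan completes without ever visiting "Bar"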
Such bugs can of course be detected and prevented at the cost of a great deal more engineering—but I'm not satisfied with any solution that requires such a simple change to entail such a herculean effort.
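To give a feel for that extra engineering, here is one common remedy sketched out: optimistic concurrency plus repeated passes. The compare_and_put primitive is hypothetical; it stands in for whatever conditional write the datastore actually offers.

def rewrite_all():
    # Keep scanning until a full pass finds nothing left in the old format.
    while True:
        dirty = False
        for name in db.all_names():
            record = db.get(name)
            if record.tags == []:
                dirty = True
                # Succeeds only if the record is unchanged since the read,
                # so a concurrent client write can no longer be clobbered.
                db.compare_and_put(name, expected=record, new=remove_tags(record))
        if not dirty:
            return  # a clean pass: every old-format record has been rewritten

The rescan also picks up records created mid-pass, but only once every node has stopped writing the old format; reasoning of exactly this kind is what turns the "obvious" job into a project.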
The usual two-phase deployment story is a lie. In reality, there are almost always extra deployments.
After all our effort, the code now includes lots of unnecessary cruft: it has pathways to read two different kinds of records and it has a background job that will be irrelevant once it finishes. Eventually, the devs will also need to remove the cruft:
"tags"
property."tags"
property."tags"
fields.Thus our two-phase deployment becomes a three-phase deployment. Furthermore, it involves yet another long wait! The developer can't remove all the unnecessary code until the background job has finished.
Doing a full deployment can be time-consuming. Feature flags offer faster rollout and rollback at the cost of extra work:

1. Implement the compatibility version that can read records both with and without the "tags" property.
2. Deploy the compatibility version to every node and let it bake.
3. Implement the final version that, behind a feature flag, writes records without the "tags" property.
4. Deploy the final version, then enable the flag; disabling the flag takes the place of a rollback.
5. Run the background job that removes the empty "tags" fields.
6. Wait for the background job to finish, then remove the flag, the old read pathway, and the background job in one last deployment.

Implementing and testing a new feature flag is yet more development work up front, and removing the flag once it is no longer necessary is more work still.
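For instance, the flagged write path might look like the sketch below, where flags.is_enabled stands in for whichever feature-flag client is in use (hypothetical, like the other helpers here):

def write_record(record):
    # With the flag on, empty "tags" are stripped before writing. Turning
    # the flag off restores the old behaviour instantly, with no deployment.
    if flags.is_enabled("omit-empty-tags") and record.tags == []:
        record = remove_tags(record)
    db.put(record)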
I don't accept that live migrations are fundamentally hard. With a bit of insight, maybe we can tackle the difficulties (or at least lessen them). At a minimum, I think we would want two things:

1. Safety: no crashed nodes, no lost writes, and no unsafe rollbacks, even while clients are hammering the system.
2. Economy: a two-line change should cost something close to two lines of work, not weeks of staged commits, bake times, and babysat deployments.
My thoughts on this are rapidly evolving, but here are a few musings.
The plan for our example live migration is carried out by people. People are responsible for taking the right actions at the right times, and perhaps they shouldn't be. Perhaps the plan should be written in code. Unlike plans carried out by people, code can be tested under thousands of different scenarios per second to reveal problems.
The kind of code that carries out such a plan would be very different from the kind of code that web developers are writing today. After all, our plan includes actions like "implement this change" and "deploy this code".
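As a very rough sketch (every name below is made up), such a plan might be expressed as plain data, so that a test harness could execute it against thousands of simulated interleavings of deployments, rollbacks, and client requests:

from dataclasses import dataclass

@dataclass
class Deploy:
    version: str   # which build to roll out to every node

@dataclass
class Bake:
    days: int      # minimum wait before the next step is allowed

@dataclass
class RunJob:
    name: str      # kick off a background job, e.g. the rewriter

@dataclass
class AwaitJob:
    name: str      # block until the named job has finished

plan = [
    Deploy("compat"),            # reads both formats, still writes "tags"
    Bake(days=14),               # make rollback to "compat" the worst case
    Deploy("final"),             # stops writing empty "tags"
    RunJob("strip-empty-tags"),
    AwaitJob("strip-empty-tags"),
    Deploy("cleanup"),           # delete the old read path and the job
]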
The fact is that in web development, history matters. Overwriting the code in your SCM's main branch is not the job—the job is to maintain and improve the production system while it serves clients. Therefore, the plan for that maintenance and improvement ought to have just as privileged a position as the production code itself. Some actor executed some sequence of steps to get the production system to where it is today, and that sequence of steps should be spelled out in code—along with the upcoming steps that have not yet been executed.
Live migrations can be seen as distributed transactions over two databases: the production data and the production code. Your data might be stored in memory or flat files or a true database server, and your code might be stored as executables on disk or docker containers in a container engine, but all of these are just ways to store data.
Two-phase deployments are a little like two-phase commits: the first deployment is a "prepare" step and the second deployment is a "commit" step.
Viewed this way, the migration plan can be seen as a series of transactions over the combined data/code database. Perhaps that insight gives some structure to what a programmed migration plan might look like?
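Reusing the step types from the earlier sketch, one purely illustrative way to picture it: group the same steps into transactions, each of which leaves the combined code/data system in a consistent state.

plan = [
    # Transaction 1 ("prepare"): afterwards, every node tolerates both formats.
    [Deploy("compat"), Bake(days=14)],
    # Transaction 2 ("commit"): afterwards, the new format is canonical.
    [Deploy("final"), RunJob("strip-empty-tags"), AwaitJob("strip-empty-tags")],
    # Transaction 3 (cleanup): drop whatever the commit made obsolete.
    [Deploy("cleanup")],
]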