You Can't Not Crash (or: Segfaults Shouldn't be Scary)

October 2021

Even the power of full formal verification can't prevent your programs from crashing. This is about learning to live with it.

I used to think that if we were just careful enough while programming, we could prevent our programs from ever crashing. Today I have a very different perspective: crashes are normal and unavoidable. Programmers can and should learn to live with them.

If you learn to live with crashes, many scary things become less scary:

Segfaults and other uncaught exceptions are fine (if perhaps not ideal).
Power outages are fine (if perhaps not ideal).
It's always safe to terminate your program, even if it's in the middle of some work.

Some sources of crashes that you will never be able to prevent

Memory overcommit. Modern 64-bit operating systems almost never reject allocation requests. In other words, malloc() almost always returns a non-null pointer, and the malloc'd memory is only really allocated when it is first used. But if the system doesn't have enough memory available at that point, it is already too late to reject the allocation! Instead, the operating system simply kills the program trying to use its lazily-allocated memory. In fact, the operating system can choose to terminate some other program in order to make memory available. That means your program can crash on almost any read or write to memory, or when the system is low on memory because of someone else's program.

Power outages. Maybe in the distant future we can make operating systems with stronger, better guarantees about resource usage. We could eliminate memory overcommit entirely. Even if we did, programs would still crash when the power goes out.

Human intervention. Even if you make your entire program state durable so that it can survive a power outage, your program will still run in a world with humans. The thing about humans is that their plans can change; a program they started yesterday might not be relevant to them today. Modern systems have very powerful mechanisms like SIGKILL to allow humans to change their plans. No matter how strongly you feel about your program (or your fellow humans), I would argue that this flexibility is good and important.
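To see what SIGKILL means for cleanup code, here is a minimal Python sketch. It starts a child process, kills it, and checks whether the child's "finally" block got a chance to run. The child script, the marker file name, and the function name are all made up for illustration, and the sketch assumes a POSIX system, where Popen.kill() sends SIGKILL.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical child program: it promises to create a marker file during
# cleanup, then sleeps for a while.
CHILD_SOURCE = r"""
import sys, time
marker = sys.argv[1]
try:
    print("ready", flush=True)
    time.sleep(60)
finally:
    # Cleanup code. SIGKILL gives the process no chance to run this.
    open(marker, "w").close()
"""


def kill_child_and_check_cleanup():
    """Return (child return code, whether the child's cleanup ran)."""
    with tempfile.TemporaryDirectory() as directory:
        marker = os.path.join(directory, "cleanup-ran")
        with subprocess.Popen(
            [sys.executable, "-c", CHILD_SOURCE, marker],
            stdout=subprocess.PIPE,
        ) as child:
            child.stdout.readline()  # wait until the child is inside its try block
            child.kill()             # SIGKILL on POSIX; it cannot be caught
        # Leaving the inner `with` waits for the child to exit.
        return child.returncode, os.path.exists(marker)


if __name__ == "__main__":
    returncode, cleanup_ran = kill_child_and_check_cleanup()
    print("return code:", returncode, "cleanup ran:", cleanup_ran)
```

The point of the sketch is that "finally" blocks, destructors, and exit handlers are conveniences, not guarantees. Anything that must be true after your program stops has to already be true on disk before it stops.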

Living with crashes

Recognize that every instruction in your program may crash, erase your program's memory, and send you back to the program start.

Therefore, all effects of your program (what files it writes, what network activity it generates) are important, and you need to be able to cope with all possible intermediate states.

Think in terms of taking safe actions. Is every write() in your program safe?
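One common way to make a file update a safe action is to write the new contents to a temporary file and then rename it over the old one. The sketch below shows that pattern in Python. The function name atomic_write is hypothetical, and the approach assumes a POSIX filesystem where renaming within one directory is atomic and a directory can be opened and fsync'd.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at `path` with `data` (bytes).

    A crash at any point should leave either the old contents or the new
    contents at `path`, never a half-written mix.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (and so the same
    # filesystem) as the target, otherwise the rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)       # the single step that publishes the data
    except BaseException:
        # Best-effort cleanup. After a real crash a stale temp file may be
        # left behind, which is harmless because nothing reads it.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Sync the directory so the rename itself survives a power outage
    # (assumption: POSIX; opening a directory like this fails on Windows).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

With this shape there is only one intermediate state to reason about: a leftover temporary file next to an intact original. Compare that with an in-place write(), where a crash can leave any prefix of the new data mixed with the old.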

For file I/O, I like SQLite and other embedded databases because they let you bundle many writes into an atomic transaction, greatly reducing the number of potentially-unsafe actions in your program.
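As a rough illustration of bundling writes, here is a sketch using Python's built-in sqlite3 module. The accounts table, the function names, and the fail_midway flag are invented for this example. The flag raises an exception to stand in for a crash between two writes. A real crash would not raise anything, and SQLite would instead roll the unfinished transaction back from its journal the next time the database is opened.

```python
import sqlite3


def open_db(path):
    """Open the database and make sure the (hypothetical) schema exists."""
    conn = sqlite3.connect(path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS accounts ("
            "name TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
        )
    return conn


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move `amount` from src to dst as one atomic transaction."""
    # `with conn` commits if the block finishes and rolls back if it raises,
    # so the two UPDATEs become visible together or not at all.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        if fail_midway:
            raise RuntimeError("simulated crash between the two writes")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )


if __name__ == "__main__":
    conn = open_db(":memory:")
    with conn:
        conn.executemany(
            "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)]
        )
    try:
        transfer(conn, "alice", "bob", 30, fail_midway=True)
    except RuntimeError:
        pass
    # The failed transfer left no trace: alice was not debited.
    print(dict(conn.execute("SELECT name, balance FROM accounts")))
```

Without the transaction, the program would have two separate writes and a dangerous state in between where the money has left one account and not arrived in the other. With it, there is one action to reason about: the commit.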
