How to Durably Write a File on POSIX Systems

July 2021

It is remarkably difficult to create a file and also be certain it will survive a power outage.

Updated 2021/11/10 to include a few more links, clarify some points, and add "Appendix: Other Languages".

I'll start at the end; here's the recipe in C (error handling omitted for clarity):

#include <fcntl.h>
#include <unistd.h>

// Create the file and make sure its directory entry is durable.
int err;
int dir_fd = open("dir", O_RDONLY);
int file_fd = openat(dir_fd, "file", O_WRONLY | O_CREAT, 0600);
err = fsync(dir_fd);
err = close(dir_fd);

// Write the file contents.
// ... write(file_fd, ...); ...
err = fsync(file_fd);
err = close(file_fd);

The rest of this article will explore why you can't simplify the algorithm any further.
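
For reference, here is a minimal sketch of the same recipe with the error handling filled in. The names durable_create and write_all are hypothetical helpers, not part of any library. The sketch also keeps dir_fd open until the end instead of closing it right after its fsync, which is an assumption of convenience and should not change the durability argument.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write all len bytes, resuming after short writes
 * and retrying when interrupted by a signal. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len) {
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: durably create dir/name holding buf[0..len).
 * Returns 0 on success, -1 if any step failed. */
int durable_create(const char *dir, const char *name,
                   const char *buf, size_t len) {
    assert(dir != NULL && name != NULL);
    int result = -1;

    int dir_fd = open(dir, O_RDONLY);
    if (dir_fd < 0) return -1;

    /* Create the file relative to dir_fd, after the directory is open. */
    int file_fd = openat(dir_fd, name, O_WRONLY | O_CREAT, 0600);
    if (file_fd < 0) goto out_dir;

    /* Make the new directory entry durable. */
    if (fsync(dir_fd) != 0) goto out_file;

    /* Write the contents and make them durable. No fsync retry on failure. */
    if (write_all(file_fd, buf, len) != 0) goto out_file;
    if (fsync(file_fd) != 0) goto out_file;
    result = 0;

out_file:
    if (close(file_fd) != 0) result = -1;
out_dir:
    if (close(dir_fd) != 0) result = -1;
    return result;
}
```

A caller would use it as durable_create("dir", "file", data, data_len) and treat any nonzero return as "the file may not be durable".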

What do the `fsync` calls do, and why are there two of them?

The POSIX operations are poorly named. For performance, most of them do not actually interact with the disk. Instead, they make a change in memory immediately and schedule the actual disk operations to happen later:

O_CREAT should have been O_SCHEDULE_CREAT
write should have been schedule_write
close should have been close_without_flushing_writes

Note that scheduled operations can be reordered by your operating system! They do not necessarily happen in the order they appear in your program.

The fsync call synchronizes the kernel's in-memory state with the real state of the disk for a particular file. Synchronization means waiting for previously-scheduled operations to finish. Operations that fail are undone in memory, and if any operation failed, fsync will report an error.

There are two fsync calls because there are two filesystem entities involved in creating a new file: the file itself, and the directory that contains the file. Omitting fsync(dir_fd) could result in a state where the file does not exist, even though its contents were written somewhere:

1.

   MEMORY:  "dir" empty             "file" empty
   DISK:    "dir" empty             "file" empty

2. open("file", O_WRONLY | O_CREAT)

   MEMORY:  "dir" contains "file"   "file" empty
   DISK:    "dir" empty             "file" empty

3. write(file_fd, ...)

   MEMORY:  "dir" contains "file"   "file" contains contents
   DISK:    "dir" empty             "file" empty

4. fsync(file_fd)

   MEMORY:  "dir" contains "file"   "file" contains contents
   DISK:    "dir" empty             "file" contains contents

*** POWER OUTAGE!
*** MEMORY IS LOST

5.

   MEMORY:  "dir" empty             "file" contains contents
   DISK:    "dir" empty             "file" contains contents

Even though the file contents reached disk, the file itself isn't reachable. To ensure that the file contains the right contents and is reachable, we need two fsync calls.

Gotchas

You must create the file after opening the directory and before fsync-ing it. You cannot deinterleave these operations.

I have seen several libraries try to do this:

// DO NOT USE
int file_fd = open("dir/file", O_WRONLY | O_CREAT, 0600);
int dir_fd = open("dir", O_RDONLY);
err = fsync(dir_fd);
err = close(dir_fd);

This can go wrong if:

1. open creates the file.
2. The kernel begins flushing the new directory entry to disk.
3. The flush fails! Having nowhere to put the error, and being forced to make progress, the kernel quietly logs the error somewhere and keeps going.
4. open("dir", O_RDONLY) opens the directory.
5. fsync(dir_fd)/close(dir_fd) reports no errors, since there have been no errors since the directory was opened.

Replacing openat with normal open is only correct under certain assumptions.

Using openat ensures that the file is created in the directory referenced by dir_fd, even if some other process moves that directory while our code is running.

I believe it also improves the odds that disk failures will be caught by fsync(dir_fd). It is a "hint" to the kernel that we are doing a metadata operation affecting dir_fd and errors involving that operation should be reported to our program via dir_fd.

You cannot retry a failed fsync.

This won't get you better durability:

// DO NOT USE
do {
    err = fsync(fd);
} while (err != 0);

This is because fsync only reports whether there has been an error since the last error report. A retried fsync does not retry the lost writes.
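
If you do want to retry, the writes themselves have to be issued again, not just the fsync. The following is a rough sketch of that idea, assuming the application still holds the full contents in memory and the file starts at offset 0. The name rewrite_and_sync and its max_attempts parameter are hypothetical. Whether a second attempt can succeed at all depends on why the first one failed, so many programs simply treat a failed fsync as fatal instead.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write buf[0..len) at offset 0 and fsync.
 * On failure, each new attempt re-issues every write before calling
 * fsync again, rather than calling fsync alone in a loop.
 * Returns 0 on success, -1 if all attempts failed. */
int rewrite_and_sync(int fd, const char *buf, size_t len, int max_attempts) {
    assert(fd >= 0 && max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        size_t done = 0;
        while (done < len) {
            ssize_t n = pwrite(fd, buf + done, len - done, (off_t)done);
            if (n < 0) {
                if (errno == EINTR) continue; /* interrupted: same write again */
                break;                        /* real error: abandon this attempt */
            }
            done += (size_t)n;
        }
        if (done == len && fsync(fd) == 0) {
            return 0;
        }
    }
    return -1;
}
```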

Appendix: Atomicity

The top algorithm achieves durability in the sense that if it completes successfully, then the data will survive a power outage. However, the algorithm is not atomic: if there is a power outage while the code is running, there are many possible outcomes other than "the file exists" and "the file does not exist". In the worst case, the file may exist and be completely filled with garbage:

1. write(file_fd, ...) is called.
2. The kernel expands the file to fit the new bytes (the file is presently filled with garbage data).
3. A power outage happens.
4. Upon restart, the file is filled with garbage.

By using a temporary file, we can leverage the atomic rename operation to achieve atomicity:

// Create a temporary file and write its contents
int file_fd = open("dir/file.tmp", O_WRONLY | O_CREAT, 0600);
// ... write(file_fd, ...); ...
err = fsync(file_fd);
err = close(file_fd);

// Durably move the file into place
int dir_fd = open("dir", O_RDONLY);
err = renameat(dir_fd, "file.tmp", dir_fd, "file");
err = fsync(dir_fd);
err = close(dir_fd);

This final snippet may leave dir/file.tmp in an arbitrary state after a power outage, but it always results in a correct dir/file or no dir/file.

Note that the renameat must happen between open and fsync for the same reason openat does in the original algorithm.
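
Putting the temporary-file approach into one function with error handling might look like the sketch below. The name atomic_replace is hypothetical. Two choices here are assumptions beyond the snippet above: the temporary file is created with openat relative to dir_fd so that it is guaranteed to live in the same directory as the target, and O_TRUNC is added so that a stale temporary file left by an earlier crash does not contribute leftover bytes.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>      /* renameat */
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write buf[0..len) to dir/tmp_name, then durably
 * rename it to dir/name. Returns 0 on success, -1 if any step failed. */
int atomic_replace(const char *dir, const char *tmp_name, const char *name,
                   const char *buf, size_t len) {
    assert(dir != NULL && tmp_name != NULL && name != NULL);
    int result = -1;
    int synced = 0;
    size_t done = 0;

    int dir_fd = open(dir, O_RDONLY);
    if (dir_fd < 0) return -1;

    /* Assumption: truncate any stale temporary file from a previous run. */
    int file_fd = openat(dir_fd, tmp_name, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (file_fd < 0) goto out;

    while (done < len) {
        ssize_t n = write(file_fd, buf + done, len - done);
        if (n < 0) {
            if (errno == EINTR) continue;
            break;
        }
        done += (size_t)n;
    }
    /* The contents must be durable before the rename makes them visible. */
    synced = (done == len) && fsync(file_fd) == 0;
    if (close(file_fd) != 0 || !synced) goto out;

    /* The rename happens between opening and fsync-ing the directory. */
    if (renameat(dir_fd, tmp_name, dir_fd, name) != 0) goto out;
    if (fsync(dir_fd) != 0) goto out;
    result = 0;

out:
    if (close(dir_fd) != 0) result = -1;
    return result;
}
```

On failure the temporary file may be left behind; the sketch does not try to clean it up.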

Appendix: Variants of `fsync`

On Linux, you may replace fsync(file_fd) with fdatasync(file_fd) if you do not care about the file's permissions or modification timestamp.

Using fdatasync might be slightly faster since it flushes only the file contents, not the file metadata. Note that you should still fsync the directory since the manual for fdatasync does not specify whether it works for directories. (In fact, it explicitly states that to ensure an entry in a directory has reached disk, "an explicit fsync() on a file descriptor for the directory is also needed".)

Caveat: if you changed the file size, fdatasync will flush the file metadata just like fsync; the file size is part of its metadata but is necessary to reach the data later. Thus, your performance gain might be less than you expect.
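
The case where fdatasync is most likely to help is overwriting bytes inside a file whose size does not change. Here is a small sketch of that case. The name overwrite_and_datasync is hypothetical, and the fallback to fsync on non-Linux systems is an assumption made so the sketch compiles elsewhere.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: overwrite len bytes at the given offset, then flush
 * the file data. Intended for ranges that lie inside the existing file, so
 * the file size (and thus the metadata needed to reach the data) is unchanged.
 * Returns 0 on success, -1 on failure. */
int overwrite_and_datasync(int fd, const char *buf, size_t len, off_t offset) {
    assert(fd >= 0 && offset >= 0);
    size_t done = 0;
    while (done < len) {
        ssize_t n = pwrite(fd, buf + done, len - done, offset + (off_t)done);
        if (n < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        done += (size_t)n;
    }
#ifdef __linux__
    return fdatasync(fd);   /* data only; timestamps may not be flushed */
#else
    return fsync(fd);       /* assumption: fall back to a full fsync */
#endif
}
```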

On macOS, you can replace fsync(fd) with fcntl(fd, F_FULLFSYNC) for added durability.

This works for both the file and directory. Note that F_FULLFSYNC is a heavy hammer: it will likely instruct the disk to flush its entire internal buffer, not just for the file in question. That means your application will probably pay for all other writes on the system in addition to its own.

By contrast, fsync only instructs the kernel to flush its in-memory buffer to the disk. After that point, the disk has its own buffer.

To be cross-platform:

#include <fcntl.h>
#include <unistd.h>

#ifdef F_FULLFSYNC
err = fcntl(fd, F_FULLFSYNC);
#else
err = fsync(fd);
#endif

Caveat: some drives protect their internal buffers from power outages using capacitors. With high-quality hardware, the benefit of F_FULLFSYNC is unclear. Even so, I recommend using it unless you are very familiar with the guarantees of the hardware you're using.

Appendix: Other Languages

You might want to use these ideas in a program written in some language other than C. Depending on the language, this can be quite difficult:

High-level languages and libraries hide the details about what system calls are being executed, making it difficult to tell if your program is correctly achieving durability.
Some useful operations like openat may not be available at all. For instance, this is true today in Java (2021/11/10). It is also true today in Rust, although there are third-party libraries that give you access to openat.

Fortunately, most languages have some way to include "native" C code, giving you access to everything described above. Additionally, some high-level languages like Python offer access to low-level APIs if you dig for them (see e.g. the dir_fd parameter in Python's os library).

Other Resources
