How to Durably Write a File on POSIX Systems

July 2021

It is remarkably difficult to create a file and also be certain it will survive a power outage.

I'll start at the end; here's the recipe in C (error handling omitted for clarity):

#include <fcntl.h>
#include <unistd.h>

// Create the file and make sure its directory entry is durable.
int dir_fd = open("dir", O_RDONLY);
int file_fd = openat(dir_fd, "file", O_WRONLY | O_CREAT, 0600);
int err = fsync(dir_fd);
err = close(dir_fd);

// Write the file contents.
// ... write(file_fd, ...); ...
err = fsync(file_fd);
err = close(file_fd);
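
With error handling restored, the recipe might look like the sketch below. The names durable_create and write_all are hypothetical helpers introduced for illustration; they are not part of any library. The sketch assumes the whole payload is already in memory and that any failure should abort the operation.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write all len bytes, retrying short writes and EINTR. */
static int write_all(int fd, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: durably create dir/name holding buf[0..len).
 * Returns 0 on success, or -1 with errno set by the first failing call. */
int durable_create(const char *dir, const char *name,
                   const void *buf, size_t len) {
    int dir_fd = -1, file_fd = -1, rc, saved;

    /* Create the file and make sure its directory entry is durable. */
    dir_fd = open(dir, O_RDONLY);
    if (dir_fd < 0) goto fail;
    file_fd = openat(dir_fd, name, O_WRONLY | O_CREAT, 0600);
    if (file_fd < 0) goto fail;
    if (fsync(dir_fd) != 0) goto fail;
    rc = close(dir_fd);
    dir_fd = -1;
    if (rc != 0) goto fail;

    /* Write the file contents and flush them exactly once. */
    if (write_all(file_fd, buf, len) != 0) goto fail;
    if (fsync(file_fd) != 0) goto fail;
    rc = close(file_fd);
    file_fd = -1;
    if (rc != 0) goto fail;
    return 0;

fail:
    /* Preserve the original errno across the cleanup calls. */
    saved = errno;
    if (file_fd >= 0) close(file_fd);
    if (dir_fd >= 0) close(dir_fd);
    errno = saved;
    return -1;
}
```

A caller would use it as durable_create("dir", "file", data, data_len) and treat any nonzero return as "the file is not known to be durable".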

The rest of this article will explore why you can't simplify the algorithm any further.

What do the fsync calls do, and why are there two of them?

The POSIX operations are poorly named. For performance, most of them do not actually interact with the disk. Instead, they make a change in memory immediately and schedule the actual disk operations to happen later.

The fsync call synchronizes the kernel's in-memory state with the real state of the disk for a particular file. Synchronization means waiting for previously-scheduled operations to finish. Operations that fail are undone in memory, and if there was a failed operation fsync will report an error.

There are two fsync calls because there are two filesystem entities involved in creating a new file: the file itself, and the directory that contains the file. Omitting fsync(dir_fd) could result in a state where the file does not exist, even though its contents were written somewhere:

1.

   MEMORY:  "dir" empty             "file" empty
   DISK:    "dir" empty             "file" empty

2. open("file", O_WRONLY | O_CREAT)

   MEMORY:  "dir" contains "file"   "file" empty
   DISK:    "dir" empty             "file" empty

3. write(file_fd, ...)

   MEMORY:  "dir" contains "file"   "file" contains contents
   DISK:    "dir" empty             "file" empty

4. fsync(file_fd)

   MEMORY:  "dir" contains "file"   "file" contains contents
   DISK:    "dir" empty             "file" contains contents

*** POWER OUTAGE!
*** MEMORY IS LOST

5.

   MEMORY:  "dir" empty             "file" contains contents
   DISK:    "dir" empty             "file" contains contents

Even though the file contents reached disk, the file itself isn't reachable. To ensure that the file contains the right contents and is reachable, we need two fsync calls.

Gotchas

You must open the file between open-ing and fsync-ing the directory. You cannot deinterleave these operations.

I have seen several libraries try to do this:

// DO NOT USE
int file_fd = open("dir/file", O_WRONLY | O_CREAT, 0600);
int dir_fd = open("dir", O_RDONLY);
err = fsync(dir_fd);
err = close(dir_fd);

This can go wrong if:

  1. open creates the file.
  2. The kernel begins flushing the new directory entry to disk.
  3. The flush fails! Having nowhere to put the error, and being forced to make progress, the kernel quietly logs the error somewhere and keeps going.
  4. open("dir", O_RDONLY) returns dir_fd.
  5. fsync(dir_fd)/close(dir_fd) reports no errors, since there have been no errors since the directory was opened.

Replacing openat with normal open is only correct under certain assumptions.

Using openat ensures that the file is created in the directory referenced by dir_fd, even if some other process moves that directory while our code is running.

I believe it also improves the odds that disk failures will be caught by fsync(dir_fd). It is a "hint" to the kernel that we are doing a metadata operation affecting dir_fd.
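
The first property can be checked with a small experiment. The sketch below is illustrative only: openat_follows_moved_dir is a hypothetical name, and the rename call stands in for "some other process" moving the directory. It opens a directory, renames it, and then confirms that openat still creates the file inside the renamed directory.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if a file created with openat lands in the
 * directory that dir_fd refers to even after that directory is renamed,
 * 0 if it does not, and -1 if the setup fails. */
int openat_follows_moved_dir(const char *old_path, const char *new_path) {
    char path[4096];
    int dir_fd, file_fd, found;

    if (mkdir(old_path, 0700) != 0) return -1;
    dir_fd = open(old_path, O_RDONLY);
    if (dir_fd < 0) return -1;

    /* Simulate another process moving the directory out from under us. */
    if (rename(old_path, new_path) != 0) {
        close(dir_fd);
        return -1;
    }

    /* The path "old_path/file" no longer resolves, but dir_fd still
     * refers to the same directory. */
    file_fd = openat(dir_fd, "file", O_WRONLY | O_CREAT, 0600);
    close(dir_fd);
    if (file_fd < 0) return -1;
    close(file_fd);

    snprintf(path, sizeof path, "%s/file", new_path);
    found = (access(path, F_OK) == 0);

    /* Clean up the demo files. */
    unlink(path);
    rmdir(new_path);
    return found;
}
```

A plain open("dir/file", ...) issued after the rename would instead fail or, if a new "dir" had been created in the meantime, create the file somewhere else.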

You cannot retry a failed fsync.

This won't get you better durability:

// DO NOT USE
do {
    err = fsync(fd);
} while (err != 0);

This is because fsync only reports whether there has been an error since the last error report. A retried fsync does not retry the lost writes.

Appendix: Atomicity

The top algorithm achieves durability in the sense that if it completes successfully, then the data will survive a power outage. However, the algorithm is not atomic: if there is a power outage while the code is running, there are many possible outcomes other than "the file exists" and "the file does not exist". In the worst case, the file may exist and be completely filled with garbage:

  1. write(file_fd, ...) called
  2. the kernel expands the file to fit the new bytes (the file is presently filled with garbage data)
  3. a power outage happens
  4. upon restart, the file is filled with garbage

By using a temporary file, we can leverage the atomic rename operation to achieve atomicity:

// Create a temporary file and write its contents
int file_fd = open("dir/file.tmp", O_WRONLY | O_CREAT, 0600);
// ... write(file_fd, ...); ...
err = fsync(file_fd);
err = close(file_fd);

// Durably move the file into place
int dir_fd = open("dir", O_RDONLY);
err = renameat(dir_fd, "file.tmp", dir_fd, "file");
err = fsync(dir_fd);
err = close(dir_fd);

This final snippet may leave dir/file.tmp in an arbitrary state after a power outage, but it always results in a correct dir/file or no dir/file.

Note that the renameat must happen between open and fsync for the same reason openat does in the original algorithm.
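
Put together with error handling, the temporary-file approach might look like the following sketch. The names atomic_durable_write and write_all are hypothetical. Two choices here are my assumptions rather than part of the snippet above: the directory is opened first so the temporary file can be created with openat, and O_TRUNC is added so that a stale temporary file left by an earlier crash is overwritten.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write all len bytes, retrying short writes and EINTR. */
static int write_all(int fd, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: make dir/name hold buf[0..len), going through
 * dir/tmp_name. Returns 0 on success, -1 with errno set on failure. */
int atomic_durable_write(const char *dir, const char *name,
                         const char *tmp_name,
                         const void *buf, size_t len) {
    int dir_fd = -1, file_fd = -1, rc, saved;

    dir_fd = open(dir, O_RDONLY);
    if (dir_fd < 0) goto fail;

    /* Create the temporary file and make its contents durable. */
    file_fd = openat(dir_fd, tmp_name, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (file_fd < 0) goto fail;
    if (write_all(file_fd, buf, len) != 0) goto fail;
    if (fsync(file_fd) != 0) goto fail;
    rc = close(file_fd);
    file_fd = -1;
    if (rc != 0) goto fail;

    /* Durably move the file into place. */
    if (renameat(dir_fd, tmp_name, dir_fd, name) != 0) goto fail;
    if (fsync(dir_fd) != 0) goto fail;
    rc = close(dir_fd);
    dir_fd = -1;
    if (rc != 0) goto fail;
    return 0;

fail:
    /* Preserve the original errno across the cleanup calls. */
    saved = errno;
    if (file_fd >= 0) close(file_fd);
    if (dir_fd >= 0) close(dir_fd);
    errno = saved;
    return -1;
}
```

On failure the temporary file is deliberately left behind; whether to unlink it is a policy decision this sketch does not make.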

Appendix: Variants of fsync

On Linux, you may replace fsync(file_fd) with fdatasync(file_fd) if you do not care about the file's permissions or modification timestamp.

Using fdatasync might be slightly faster since it flushes only the file contents, not the file metadata. Note that you should still fsync the directory since the manual for fdatasync does not specify whether it works for directories. (In fact, it explicitly states that to ensure an entry in a directory has reached disk, "an explicit fsync() on a file descriptor for the directory is also needed".)

Caveat: if you changed the file size, fdatasync will flush the file metadata just like fsync; the file size is part of its metadata but is necessary to reach the data later. Thus, your performance gain might be less than you expect.

On macOS, you can replace fsync(fd) with fcntl(fd, F_FULLFSYNC) for added durability.

This works for both the file and directory. Note that F_FULLFSYNC is a heavy hammer: it will likely instruct the disk to flush its entire internal buffer, not just for the file in question. That means your application will probably pay for all other writes on the system in addition to its own.

By contrast, fsync only instructs the kernel to flush its in-memory buffer to the disk. After that point, the disk has its own buffer.

To be cross-platform:

#include <fcntl.h>
#include <unistd.h>

#ifdef F_FULLFSYNC
err = fcntl(fd, F_FULLFSYNC);
#else
err = fsync(fd);
#endif

Caveat: some drives protect their internal buffers from power outages using capacitors. With high-quality hardware, the benefit of F_FULLFSYNC is unclear. Even so, I recommend using it unless you are very familiar with the guarantees of the hardware you're using.
