It is remarkably difficult to create a file and also be certain it will survive a power outage.
Updated 2021/11/10 to include a few more links, clarify some points, and add "Appendix: Other Languages".
I'll start at the end; here's the recipe in C (error handling omitted for clarity):
#include <fcntl.h>
#include <unistd.h>
// Create the file and make sure its directory entry is durable.
int dir_fd = open("dir", O_RDONLY);
int file_fd = openat(dir_fd, "file", O_WRONLY | O_CREAT, 0600);
int err = fsync(dir_fd);
err = close(dir_fd);
// Write the file contents.
// ... write(file_fd, ...); ...
err = fsync(file_fd);
err = close(file_fd);
The rest of this article will explore why you can't simplify the algorithm any further.
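For reference, here is a rough sketch of the same recipe with the error handling put back in. The function name durable_create and its parameters are hypothetical and not part of any library; the loop around write is an addition, needed because write may accept fewer bytes than requested.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: durably create dir_path/name holding len bytes of data.
// Returns 0 on success and -1 on any failure.
int durable_create(const char *dir_path, const char *name,
                   const void *data, size_t len) {
    const char *p = data;

    int dir_fd = open(dir_path, O_RDONLY);
    if (dir_fd < 0) return -1;

    // Create the file relative to dir_fd, then make its directory entry durable.
    int file_fd = openat(dir_fd, name, O_WRONLY | O_CREAT, 0600);
    if (file_fd < 0) goto fail_dir;
    if (fsync(dir_fd) != 0) goto fail_both;
    if (close(dir_fd) != 0) { close(file_fd); return -1; }

    // write() may be interrupted or write only part of the buffer, so loop.
    while (len > 0) {
        ssize_t n = write(file_fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue;
            close(file_fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    // Make the file contents durable. A failure here is final (no retry).
    if (fsync(file_fd) != 0) { close(file_fd); return -1; }
    return close(file_fd);

fail_both:
    close(file_fd);
fail_dir:
    close(dir_fd);
    return -1;
}
```

A caller would use it as durable_create("dir", "file", buf, buf_len) and treat any -1 as "the file may or may not exist, and its contents are unknown".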
What do the fsync calls do, and why are there two of them?
The POSIX operations are poorly named. For performance, most of them do not actually interact with the disk. Instead, they make a change in memory immediately and schedule the actual disk operations to happen later:
O_CREAT should have been O_SCHEDULE_CREAT
write should have been schedule_write
close should have been close_without_flushing_writes
Note that scheduled operations can be reordered by your operating system! They do not necessarily happen in the order they appear in your program.
The fsync call synchronizes the kernel's in-memory state with the real state of the disk for a particular file. Synchronization means waiting for previously scheduled operations to finish. Operations that fail are undone in memory, and if there was a failed operation fsync will report an error.
There are two fsync calls because there are two filesystem entities involved in creating a new file: the file itself, and the directory that contains the file. Omitting fsync(dir_fd) could result in a state where the file does not exist, even though its contents were written somewhere:
1.
MEMORY: "dir" empty "file" empty
DISK: "dir" empty "file" empty
2. open("file", O_WRONLY | O_CREAT)
MEMORY: "dir" contains "file" "file" empty
DISK: "dir" empty "file" empty
3. write(file_fd, ...)
MEMORY: "dir" contains "file" "file" contains contents
DISK: "dir" empty "file" empty
4. fsync(file_fd)
MEMORY: "dir" contains "file" "file" contains contents
DISK: "dir" empty "file" contains contents
*** POWER OUTAGE!
*** MEMORY IS LOST
5.
MEMORY: "dir" empty "file" contains contents
DISK: "dir" empty "file" contains contents
Even though the file contents reached disk, the file itself isn't reachable. To ensure that the file contains the right contents and is reachable, we need two fsync calls.
You must open the file between open-ing and fsync-ing the directory. You cannot deinterleave these operations.
I have seen several libraries try to do this:
// DO NOT USE
int file_fd = open("dir/file", O_WRONLY | O_CREAT, 0600);
int dir_fd = open("dir", O_RDONLY);
err = fsync(dir_fd);
err = close(dir_fd);
This can go wrong if:
1. open creates the file.
2. The scheduled write of the new directory entry fails in the background.
3. open("dir", O_RDONLY) opens the directory.
4. fsync(dir_fd)/close(dir_fd) reports no errors, since there have been no errors since the directory was opened.
Replacing openat with normal open is only correct under certain assumptions.
Using openat ensures that the file is created in the directory referenced by dir_fd, even if some other process moves that directory while our code is running.
I believe it also improves the odds that disk failures will be caught by fsync(dir_fd). It is a "hint" to the kernel that we are doing a metadata operation affecting dir_fd and errors involving that operation should be reported to our program via dir_fd.
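To make the first point concrete, here is a small sketch. The helper name create_in_dir is hypothetical. It creates a file relative to an already-open directory descriptor and flushes that same descriptor, so the result does not depend on what path the directory currently has.

```c
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: create `name` inside whatever directory dir_fd refers to,
// then flush that directory. Returns the new file descriptor, or -1 on failure.
int create_in_dir(int dir_fd, const char *name) {
    // The lookup starts from dir_fd, not from a path string, so a concurrent
    // rename of the directory cannot redirect the create somewhere else.
    int file_fd = openat(dir_fd, name, O_WRONLY | O_CREAT, 0600);
    if (file_fd < 0) return -1;

    // Flush the same descriptor the create went through.
    if (fsync(dir_fd) != 0) {
        close(file_fd);
        return -1;
    }
    return file_fd;
}
```

If the directory is renamed after dir_fd is opened but before create_in_dir runs, the new file appears under the directory's new name. A plain open("dir/file", ...) at that point would fail or create the file in a different directory that happens to be named "dir".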
You cannot retry a failed fsync.
This won't get you better durability:
// DO NOT USE
do {
err = fsync(fd);
} while (err != 0);
This is because fsync only reports whether there has been an error since the last error report. A retried fsync does not retry the lost writes.
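One way to act on this, sketched below with a hypothetical helper named write_and_sync_once, is to call fsync exactly once and treat a failure as final. What the caller does next (for example, rewriting everything to a fresh file) is an assumption about the application, not something fsync provides.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: write all len bytes to fd, then fsync exactly once.
// Returns 0 on success. On -1 the caller should assume the data is NOT durable
// and redo the writes from scratch rather than call fsync again.
int write_and_sync_once(int fd, const void *data, size_t len) {
    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            // An interrupted write can be reissued because we still hold the
            // data. That is different from retrying fsync.
            if (errno == EINTR) continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    // A single fsync. If it fails, some earlier writes may have been dropped,
    // and a second fsync would not bring them back.
    return fsync(fd) == 0 ? 0 : -1;
}
```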
The top algorithm achieves durability in the sense that if it completes successfully, then the data will survive a power outage. However, the algorithm is not atomic: if there is a power outage while the code is running, there are many possible outcomes other than "the file exists" and "the file does not exist". In the worst case, the file may exist and be completely filled with garbage:
write(file_fd, ...) called
By using a temporary file, we can leverage the atomic rename operation to achieve atomicity:
// Create a temporary file and write its contents
int file_fd = open("dir/file.tmp", O_WRONLY | O_CREAT, 0600);
// ... write(file_fd, ...); ...
err = fsync(file_fd);
err = close(file_fd);
// Durably move the file into place
int dir_fd = open("dir", O_RDONLY);
err = renameat(dir_fd, "file.tmp", dir_fd, "file");
err = fsync(dir_fd);
err = close(dir_fd);
This final snippet may leave dir/file.tmp in an arbitrary state after a power outage, but it always results in a correct dir/file or no dir/file.
Note that the renameat must happen between open and fsync for the same reason openat does in the original algorithm.
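Putting the temporary-file approach together with error handling might look like the sketch below. The function name atomic_write and its tmp_name parameter are hypothetical. As a variation on the snippet above, it opens the directory first and creates the temporary file with openat, so that every step goes through the same dir_fd; the renameat still happens between the directory's open and its fsync.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: atomically and durably replace dir_path/name.
// tmp_name is a scratch name inside the same directory.
// Returns 0 on success and -1 on any failure.
int atomic_write(const char *dir_path, const char *name, const char *tmp_name,
                 const void *data, size_t len) {
    const char *p = data;

    int dir_fd = open(dir_path, O_RDONLY);
    if (dir_fd < 0) return -1;

    // O_TRUNC discards leftovers from an earlier, interrupted attempt.
    int file_fd = openat(dir_fd, tmp_name, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (file_fd < 0) goto fail_dir;

    while (len > 0) {
        ssize_t n = write(file_fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue;
            goto fail_file;
        }
        p += n;
        len -= (size_t)n;
    }

    // The contents must be durable before the rename makes them visible.
    if (fsync(file_fd) != 0) goto fail_file;
    if (close(file_fd) != 0) goto fail_dir;

    // Move the file into place, then make the rename itself durable.
    if (renameat(dir_fd, tmp_name, dir_fd, name) != 0) goto fail_dir;
    if (fsync(dir_fd) != 0) goto fail_dir;
    return close(dir_fd);

fail_file:
    close(file_fd);
fail_dir:
    close(dir_fd);
    return -1;
}
```

On failure the temporary file may be left behind, which matches the guarantee described above: the scratch file can be in any state, but the destination is either the old version, the new version, or absent.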
fsync
On Linux, you may replace fsync(file_fd) with fdatasync(file_fd) if you do not care about the file's permissions or modification timestamp.
Using fdatasync might be slightly faster since it flushes only the file contents, not the file metadata. Note that you should still fsync the directory since the manual for fdatasync does not specify whether it works for directories. (In fact, it explicitly states that to ensure an entry in a directory has reached disk, "an explicit fsync() on a file descriptor for the directory is also needed".)
Caveat: if you changed the file size, fdatasync will flush the file metadata just like fsync; the file size is part of its metadata but is necessary to reach the data later. Thus, your performance gain might be less than you expect.
On macOS, you can replace fsync(fd) with fcntl(fd, F_FULLFSYNC) for added durability.
This works for both the file and directory. Note that F_FULLFSYNC is a heavy hammer: it will likely instruct the disk to flush its entire internal buffer, not just for the file in question. That means your application will probably pay for all other writes on the system in addition to its own.
By contrast, fsync only instructs the kernel to flush its in-memory buffer to the disk. After that point, the disk has its own buffer.
To be cross-platform:
#include <fcntl.h>
#include <unistd.h>
#ifdef F_FULLFSYNC
err = fcntl(fd, F_FULLFSYNC);
#else
err = fsync(fd);
#endif
Caveat: some drives protect their internal buffers from power outages using capacitors. With high-quality hardware, the benefit of F_FULLFSYNC is unclear. Even so, I recommend using it unless you are very familiar with the guarantees of the hardware you're using.
You might want to use these ideas in a program written in some language other than C. Depending on the language, this can be quite difficult:
openat may not be available at all. For instance, this is true today in Java (2021/11/10). It is also true today in Rust, although there are third-party libraries that give you access to openat.
Fortunately, most languages have some way to include "native" C code, giving you access to everything described above. Additionally, some high-level languages like Python offer access to low-level APIs if you dig for them (see e.g. the dir_fd parameter in Python's os library).