Computer programs are largely deterministic. Programming languages and the machine code which they compile down to both have defined semantics, and when repeatedly run against the same initial state a program will do the exact same thing in lock step. Non-determinism can leak in and cause repeated runs of a program to behave differently in two ways:

Programs can read additional inputs from the environment, for example from a file or network socket. Different runs can read different data, and the program will behave differently.
Multi-threaded programs can be affected by different thread interleavings. If the OS kernel schedules threads differently then those threads can read and write data in different orders, and the program will behave differently.

If we record how this non-determinism affects the program’s behavior, we can replay the program from that recording and ensure it behaves the same way. Replay’s recorder is based on this idea. In this blog post we’ll describe how Replay records and replays these two types of non-determinism.

Environmental Inputs

The environment that a program runs in can be described in terms of its interface with that program. This interface defines the recording boundary: when replaying, everything the program did within the boundary will be re-executed, and the side effects from everything outside the boundary will be reproduced using the recording.

The interface which Replay uses for the recording boundary is the API between an executable and the system libraries it is dynamically linked to. When recording, all data produced by calls into these system libraries is captured and added to the recording. When replaying, instead of calling into these system libraries the data from the recording is replayed instead.

To understand this better, let’s consider an example. The C program below reads the first line from a file “hello.txt” and then prints it. We want to record what this does so that we can replay it later, reading and printing the same data even if the file doesn’t exist anymore.

#include <stdio.h>int main(int argc, char** argv) {
  FILE* fp = fopen(“hello.txt”, “r”);  char buf[256];
  fgets(buf, sizeof(buf), fp);
  puts(buf);  fclose(fp);
  return 0;
}

When this is compiled to an executable file, that executable does not define fopen, fgets, or any of the other functions used to interact with files. Instead, when that executable is loaded the operating system’s dynamic linker loads system libraries which define these functions, and updates references in the loaded executable so that they will call those functions in the system library. Each executable and library has tables describing the functions it exports and imports, which the linker reads to perform these updates.

https://miro.medium.com/v2/resize:fit:700/0*MFXMtPEWrijgcbva

Replay’s recorder acts very much like the operating system’s dynamic linker. After an executable starts, if it wants to record its behavior it loads the recorder, which is a dynamically linked library. The recorder can read the executable’s import table, and after being loaded it changes outgoing references from the executable so that they will call functions defined by the recorder itself. These recorder functions are shims between the executable and the system libraries: when called, they collect the call’s inputs, call the original system library function with those inputs, collect the call’s outputs, and then return those outputs to the calling executable. The collected data is added to the recording, and the executable resumes execution, with no difference in behavior due to the presence of the recording shims.

Replaying works in a similar way. The executable which was used to make the recording is started, but instead of linking it to system libraries its imported references are updated to call functions defined by the replayer. When called, these functions look for a call with the same inputs in the recording, read the associated outputs from the recording, and return those outputs to the executable.

In the example above, when fgets is called while recording the data which the system’s fgets wrote to its buffer argument will be saved in the recording. When fgets is called while replaying then that buffer data will be read from the recording and written to the call’s buffer parameter. The system library’s fgets is outside the recording boundary and is not called when replaying, but because everything it did was saved while recording and restored while replaying, the executable will behave the same after the call and print the same data that it did when recording.

https://miro.medium.com/v2/resize:fit:700/0*P5LG-ABmgl1PqlPW

Recording at the system library call layer provides several benefits that make it the most practical choice for handling an assortment of operating systems and interpreted language runtimes:

Since this interface is used by the system’s dynamic linker, it is easy to operate on (e.g. insert shims) without having to change code in the program being recorded. The executable file formats are different on each major operating system, but they are well documented, and long running projects like Wine and Darling use similar linker based techniques to run software in new environments.
The system library call API is quite stable, as non-additive changes made between OS versions will break old software.
Different programs running on an operating system all use the same system libraries, with the same APIs. While the programs can call different sets of functions within those libraries, there is a lot of overlap which simplifies the process of recording a new runtime.

Operating on library calls is enough to record some 99% or more of a program’s interactions with its environment, but it isn’t enough for everything. Other interactions like reading from memory shared with another process, or getting non-deterministic information directly from the CPU (such as the rdtsc instruction), will not go through a library call.