From 4de4ad6d0e959ef9a53a2e760ecf7b404b62fb0f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Thomas=20K=C3=B6ppe?= Date: Wed, 2 Oct 2024 19:36:43 +0100 Subject: [PATCH] [docs/advanced] A document about deadlock potential with C++ statics --- docs/advanced/deadlock.md | 391 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 391 insertions(+) create mode 100644 docs/advanced/deadlock.md diff --git a/docs/advanced/deadlock.md b/docs/advanced/deadlock.md new file mode 100644 index 000000000..b878ccb85 --- /dev/null +++ b/docs/advanced/deadlock.md @@ -0,0 +1,391 @@ +# Double locking, deadlocking, GIL + +[TOC] + +## Introduction + +### Overview + +In concurrent programming with locks, *deadlocks* can arise when more than one +mutex is locked at the same time, and careful attention has to be paid to lock +ordering to avoid this. Here we will look at a common situation that occurs in +native extensions for CPython written in C++. + +### Deadlocks + +A deadlock can occur when more than one thread attempts to lock more than one +mutex, and two of the threads lock two of the mutexes in different orders. For +example, consider mutexes `mu1` and `mu2`, and threads T1 and T2, executing: + + | T1 | T2 +--- | ------------------- | ------------------- +1 | `mu1.lock()`{.good} | `mu2.lock()`{.good} +2 | `mu2.lock()`{.bad} | `mu1.lock()`{.bad} +3 | `/* work */` | `/* work */` +4 | `mu2.unlock()` | `mu1.unlock()` +5 | `mu1.unlock()` | `mu2.unlock()` + +Now if T1 manages to lock `mu1` and T2 manages to lock `mu2` (as indicated in +green), then both threads will block while trying to lock the respective other +mutex (as indicated in red), but they are also unable to release the mutex that +they have locked (step 5). + +**The problem** is that it is possible for one thread to attempt to lock `mu1` +and then `mu2`, and for another thread to attempt to lock `mu2` and then `mu1`. +Note that it does not matter if either mutex is unlocked at any intermediate +point; what matters is only the order of any attempt to *lock* the mutexes. For +example, the following, more complex series of operations is just as prone to +deadlock: + + | T1 | T2 +--- | ------------------- | ------------------- +1 | `mu1.lock()`{.good} | `mu1.lock()`{.good} +2 | waiting for T2 | `mu2.lock()`{.good} +3 | waiting for T2 | `/* work */` +3 | waiting for T2 | `mu1.unlock()` +3 | `mu2.lock()`{.bad} | `/* work */` +3 | `/* work */` | `mu1.lock()`{.bad} +3 | `/* work */` | `/* work */` +4 | `mu2.unlock()` | `mu1.unlock()` +5 | `mu1.unlock()` | `mu2.unlock()` + +When the mutexes involved in a locking sequence are known at compile-time, then +avoiding deadlocks is “merely” a matter of arranging the lock +operations carefully so as to only occur in one single, fixed order. However, it +is also possible for mutexes to only be determined at runtime. A typical example +of this is a database where each row has its own mutex. An operation that +modifies two rows in a single transaction (e.g. “transferring an amount +from one account to another”) must lock two row mutexes, but the locking +order cannot be established at compile time. In this case, a dynamic +“deadlock avoidance algorithm” is needed. (In C++, `std::lock` +provides such an algorithm. An algorithm might use a non-blocking `try_lock` +operation on a mutex, which can either succeed or fail to lock the mutex, but +returns without blocking.) + +Conceptually, one could also consider it a deadlock if _the same_ thread +attempts to lock a mutex that it has already locked (e.g. when some locked +operation accidentally recurses into itself): `mu.lock();`{.good} +`mu.lock();`{.bad} However, this is a slightly separate issue: Typical mutexes +are either of _recursive_ or _non-recursive_ kind. A recursive mutex allows +repeated locking and requires balanced unlocking. A non-recursive mutex can be +implemented more efficiently, and/but for efficiency reasons does not actually +guarantee a deadlock on second lock. Instead, the API simply forbids such use, +making it a precondition that the thread not already hold the mutex, with +undefined behaviour on violation. + +### “Once” initialization + +A common programming problem is to have an operation happen precisely once, even +if requested concurrently. While it is clear that we need to track in some +shared state somewhere whether the operation has already happened, it is worth +noting that this state only ever transitions, once, from `false` to `true`. This +is considerably simpler than a general shared state that can change values +arbitrarily. Next, we also need a mechanism for all but one thread to block +until the initialization has completed, which we can provide with a mutex. The +simplest solution just always locks the mutex: + +```c++ +// The "once" mechanism: +constinit absl::Mutex mu(absl::kConstInit); +constinit bool init_done = false; + +// The operation of interest: +void f(); + +void InitOnceNaive() { + absl::MutexLock lock(&mu); + if (!init_done) { + f(); + init_done = true; + } +} +``` + +This works, but the efficiency-minded reader will observe that once the +operation has completed, all future lock contention on the mutex is +unnecessary. This leads to the (in)famous “double-locking” +algorithm, which was historically hard to write correctly. The idea is to check +the boolean *before* locking the mutex, and avoid locking if the operation has +already completed. However, accessing shared state concurrently when at least +one access is a write is prone to causing a data race and needs to be done +according to an appropriate concurrent programming model. In C++ we use atomic +variables: + +```c++ +// The "once" mechanism: +constinit absl::Mutex mu(absl::kConstInit); +constinit std::atomic init_done = false; + +// The operation of interest: +void f(); + +void InitOnceWithFastPath() { + if (!init_done.load(std::memory_order_acquire)) { + absl::MutexLock lock(&mu); + if (!init_done.load(std::memory_order_relaxed)) { + f(); + init_done.store(true, std::memory_order_release); + } + } +} +``` + +Checking the flag now happens without holding the mutex lock, and if the +operation has already completed, we return immediately. After locking the mutex, +we need to check the flag again, since multiple threads can reach this point. + +*Atomic details.* Since the atomic flag variable is accessed concurrently, we +have to think about the memory order of the accesses. There are two separate +cases: The first, outer check outside the mutex lock, and the second, inner +check under the lock. The outer check and the flag update form an +acquire/release pair: *if* the load sees the value `true` (which must have been +written by the store operation), then it also sees everything that happened +before the store, namely the operation `f()`. By contrast, the inner check can +use relaxed memory ordering, since in that case the mutex operations provide the +necessary ordering: if the inner load sees the value `true`, it happened after +the `lock()`, which happened after the `unlock()`, which happened after the +store. + +The C++ standard library, and Abseil, provide a ready-made solution of this +algorithm called `std::call_once`/`absl::call_once`. (The interface is the same, +but the Abseil implementation is possibly better.) + +```c++ +// The "once" mechanism: +constinit absl::once_flag init_flag; + +// The operation of interest: +void f(); + +void InitOnceWithCallOnce() { + absl::call_once(once_flag, f); +} +``` + +Even though conceptually this is performing the same algorithm, this +implementation has some considerable advantages: The `once_flag` type is a small +and trivial, integer-like type and is trivially destructible. Not only does it +take up less space than a mutex, it also generates less code since it does not +have to run a destructor, which would need to be added to the program's global +destructor list. + +The final clou comes with the C++ semantics of a `static` variable declared at +block scope: According to [[stmt.dcl]](https://eel.is/c++draft/stmt.dcl#3): + +> Dynamic initialization of a block variable with static storage duration or +> thread storage duration is performed the first time control passes through its +> declaration; such a variable is considered initialized upon the completion of +> its initialization. [...] If control enters the declaration concurrently while +> the variable is being initialized, the concurrent execution shall wait for +> completion of the initialization. + +This is saying that the initialization of a local, `static` variable precisely +has the “once” semantics that we have been discussing. We can +therefore write the above example as follows: + +```c++ +// The operation of interest: +void f(); + +void InitOnceWithStatic() { + static int unused = (f(), 0); +} +``` + +This approach is by far the simplest and easiest, but the big difference is that +the mutex (or mutex-like object) in this implementation is no longer visible or +in the user’s control. This is perfectly fine if the initializer is +simple, but if the initializer itself attempts to lock any other mutex +(including by initializing another static variable!), then we have no control +over the lock ordering! + +Finally, you may have noticed the `constinit`s around the earlier code. Both +`constinit` and `constexpr` specifiers on a declaration mean that the variable +is *constant-initialized*, which means that no initialization is performed at +runtime (the initial value is already known at compile time). This in turn means +that a static variable guard mutex may not be needed, and static initialization +never blocks. The difference between the two is that a `constexpr`-specified +variable is also `const`, and a variable cannot be `constexpr` if it has a +non-trivial destructor. Such a destructor also means that the guard mutex is +needed after all, since the destructor must be registered to run at exit, +conditionally on initialization having happened. + +## Python, CPython, GIL + +With CPython, a Python program can call into native code. To this end, the +native code registers callback functions with the Python runtime via the CPython +API. In order to ensure that the internal state of the Python runtime remains +consistent, there is a single, shared mutex called the “global interpreter +lock”, or GIL for short. Upon entry of one of the user-provided callback +functions, the GIL is locked (or “held”), so that no other mutations +of the Python runtime state can occur until the native callback returns. + +Many native extensions do not interact with the Python runtime for at least some +part of them, and so it is common for native extensions to _release_ the GIL, do +some work, and then reacquire the GIL before returning. Similarly, when code is +generally not holding the GIL but needs to interact with the runtime briefly, it +will first reacquire the GIL. The GIL is reentrant, and constructions to acquire +and subsequently release the GIL are common, and often don't worry about whether +the GIL is already held. + +If the native code is written in C++ and contains local, `static` variables, +then we are now dealing with at least _two_ mutexes: the static variable guard +mutex, and the GIL from CPython. + +A common problem in such code is an operation with “only once” +semantics that also ends up requiring the GIL to be held at some point. As per +the above description of “once”-style techniques, one might find a +static variable: + +```c++ +// CPython callback, assumes that the GIL is held on entry. +PyObject* InvokeWidget(PyObject* self) { + static PyObject* impl = CreateWidget(); + return PyObject_CallOneArg(impl, self); +} +``` + +This seems reasonable, but bear in mind that there are two mutexes (the "guard +mutex" and "the GIL"), and we must think about the lock order. Otherwise, if the +callback is called from multiple threads, a deadlock may ensue. + +Let us consider what we can see here: On entry, the GIL is already locked, and +we are locking the guard mutex. This is one lock order. Inside the initializer +`CreateWidget`, with both mutexes already locked, the function can freely access +the Python runtime. + +However, it is entirely possible that `CreateWidget` will want to release the +GIL at one point and reacquire it later: + +```c++ +// Assumes that the GIL is held on entry. +// Ensures that the GIL is held on exit. +PyObject* CreateWidget() { + // ... + Py_BEGIN_ALLOW_THREADS // releases GIL + // expensive work, not accessing the Python runtime + Py_END_ALLOW_THREADS // acquires GIL, #! + // ... + return result; +} +``` + +Now we have a second lock order: the guard mutex is locked, and then the GIL is +locked (at `#!`). To see how this deadlocks, consider threads T1 and T2 both +having the runtime attempt to call `InvokeWidget`. T1 locks the GIL and +proceeds, locking the guard mutex and calling `CreateWidget`; T2 is blocked +waiting for the GIL. Then T1 releases the GIL to do “expensive +work”, and T2 awakes and locks the GIL. Now T2 is blocked trying to +acquire the guard mutex, but T1 is blocked reacquiring the GIL (at `#!`). + +In other words: if we want to support “once-called” functions that +can arbitrarily release and reacquire the GIL, as is very common, then the only +lock order that we can ensure is: guard mutex first, GIL second. + +To implement this, we must rewrite our code. Naively, we could always release +the GIL before a `static` variable with blocking initializer: + +```c++ +// CPython callback, assumes that the GIL is held on entry. +PyObject* InvokeWidget(PyObject* self) { + Py_BEGIN_ALLOW_THREADS // releases GIL + static PyObject* impl = CreateWidget(); + Py_END_ALLOW_THREADS // acquires GIL + + return PyObject_CallOneArg(impl, self); +} +``` + +But similar to the `InitOnceNaive` example above, this code cycles the GIL +(possibly descheduling the thread) even when the static variable has already +been initialized. If we want to avoid this, we need to abandon the use of a +static variable, since we do not control the guard mutex well enough. Instead, +we use an operation whose mutex locking is under our control, such as +`call_once`. For example: + +```c++ +// CPython callback, assumes that the GIL is held on entry. +PyObject* InvokeWidget(PyObject* self) { + static constinit PyObject* impl = nullptr; + static constinit std::atomic init_done = false; + static constinit absl::once_flag init_flag; + + if (!init_done.load(std::memory_order_acquire)) { + Py_BEGIN_ALLOW_THREADS // releases GIL + absl::call_once(init_flag, [&]() { + PyGILState_STATE s = PyGILState_Ensure(); // acquires GIL + impl = CreateWidget(); + PyGILState_Release(s); // releases GIL + init_done.store(true, std::memory_order_release); + }); + Py_END_ALLOW_THREADS // acquires GIL + } + + return PyObject_CallOneArg(impl, self); +} +``` + +The lock order is now always guard mutex first, GIL second. Unfortunately we +have to duplicate the “double-checked done flag”, effectively +leading to triple checking, because the flag state inside the `absl::once_flag` +is not accessible to the user. In other words, we cannot ask `init_flag` whether +it has been used yet. + +However, we can perform one last, minor optimisation: since we assume that the +GIL is held on entry, and again when the initializing operation returns, the GIL +actually serializes access to our done flag variable, which therefore does not +need to be atomic. (The difference to the previous, atomic code may be small, +depending on the architecture. For example, on x86-64, acquire/release on a bool +is nearly free ([demo](https://godbolt.org/z/P9vYWf4fE)).) + +```c++ +// CPython callback, assumes that the GIL is held on entry, and indeed anywhere +// directly in this function (i.e. the GIL can be released inside CreateWidget, +// but must be reaqcuired when that call returns). +PyObject* InvokeWidget(PyObject* self) { + static constinit PyObject* impl = nullptr; + static constinit bool init_done = false; // guarded by GIL + static constinit absl::once_flag init_flag; + + if (!init_done) { + Py_BEGIN_ALLOW_THREADS // releases GIL + // (multiple threads may enter here) + absl::call_once(init_flag, [&]() { + // (only one thread enters here) + PyGILState_STATE s = PyGILState_Ensure(); // acquires GIL + impl = CreateWidget(); + init_done = true; // (GIL is held) + PyGILState_Release(s); // releases GIL + }); + + Py_END_ALLOW_THREADS // acquires GIL + } + + return PyObject_CallOneArg(impl, self); +} +``` + +## Debugging tips + +* Build with symbols. +* Ctrl-C sends `SIGINT`, Ctrl-\\ + sends `SIGQUIT`. Both have their uses. +* Useful `gdb` commands: + * `py-bt` prints a Python backtrace if you are in a Python frame. + * `thread apply all bt 10` prints the top-10 frames for each thread. A + full backtrace can be prohibitively expensive, and the top few frames + are often good enough. + * `p PyGILState_Check()` shows whether a thread is holding the GIL. For + all threads, run `thread apply all p PyGILState_Check()` to find out + which thread is holding the GIL. + * The `static` variable guard mutex is accessed with functions like + `cxa_guard_acquire` (though this depends on ABI details and can vary). + The guard mutex itself contains information about which thread is + currently holding it. + +## Links + +* Article on + [double-checked locking](https://preshing.com/20130930/double-checked-locking-is-fixed-in-cpp11/) +* [The Deadlock Empire](https://deadlockempire.github.io/), hands-on exercises + to construct deadlocks