mirror of
https://github.com/pybind/pybind11.git
synced 2024-11-21 12:45:11 +00:00
docs/advanced A document about deadlock potential with C++ statics (#5394)
* [docs/advanced] A document about deadlock potential with C++ statics
* [docs/advanced] Refer to deadlock.md from misc.rst
* [docs/advanced] Fix tables in deadlock.md
* Use :ref:`deadlock-reference-label`
* Revert "Use :ref:`deadlock-reference-label`"
This reverts commit e5734d275f
.
* Add simple references to docs/advanced/deadlock.md filename. (Maybe someone can work on clickable links later.)
---------
Co-authored-by: Ralf W. Grosse-Kunstleve <rgrossekunst@nvidia.com>
This commit is contained in:
parent
56e69a20a5
commit
af67e87393
391
docs/advanced/deadlock.md
Normal file
391
docs/advanced/deadlock.md
Normal file
@ -0,0 +1,391 @@
|
||||
# Double locking, deadlocking, GIL
|
||||
|
||||
[TOC]
|
||||
|
||||
## Introduction
|
||||
|
||||
### Overview
|
||||
|
||||
In concurrent programming with locks, *deadlocks* can arise when more than one
|
||||
mutex is locked at the same time, and careful attention has to be paid to lock
|
||||
ordering to avoid this. Here we will look at a common situation that occurs in
|
||||
native extensions for CPython written in C++.
|
||||
|
||||
### Deadlocks
|
||||
|
||||
A deadlock can occur when more than one thread attempts to lock more than one
|
||||
mutex, and two of the threads lock two of the mutexes in different orders. For
|
||||
example, consider mutexes `mu1` and `mu2`, and threads T1 and T2, executing:
|
||||
|
||||
| | T1 | T2 |
|
||||
|--- | ------------------- | -------------------|
|
||||
|1 | `mu1.lock()`{.good} | `mu2.lock()`{.good}|
|
||||
|2 | `mu2.lock()`{.bad} | `mu1.lock()`{.bad} |
|
||||
|3 | `/* work */` | `/* work */` |
|
||||
|4 | `mu2.unlock()` | `mu1.unlock()` |
|
||||
|5 | `mu1.unlock()` | `mu2.unlock()` |
|
||||
|
||||
Now if T1 manages to lock `mu1` and T2 manages to lock `mu2` (as indicated in
|
||||
green), then both threads will block while trying to lock the respective other
|
||||
mutex (as indicated in red), but they are also unable to release the mutex that
|
||||
they have locked (step 5).
|
||||
|
||||
**The problem** is that it is possible for one thread to attempt to lock `mu1`
|
||||
and then `mu2`, and for another thread to attempt to lock `mu2` and then `mu1`.
|
||||
Note that it does not matter if either mutex is unlocked at any intermediate
|
||||
point; what matters is only the order of any attempt to *lock* the mutexes. For
|
||||
example, the following, more complex series of operations is just as prone to
|
||||
deadlock:
|
||||
|
||||
| | T1 | T2 |
|
||||
|--- | ------------------- | -------------------|
|
||||
|1 | `mu1.lock()`{.good} | `mu1.lock()`{.good}|
|
||||
|2 | waiting for T2 | `mu2.lock()`{.good}|
|
||||
|3 | waiting for T2 | `/* work */` |
|
||||
|3 | waiting for T2 | `mu1.unlock()` |
|
||||
|3 | `mu2.lock()`{.bad} | `/* work */` |
|
||||
|3 | `/* work */` | `mu1.lock()`{.bad} |
|
||||
|3 | `/* work */` | `/* work */` |
|
||||
|4 | `mu2.unlock()` | `mu1.unlock()` |
|
||||
|5 | `mu1.unlock()` | `mu2.unlock()` |
|
||||
|
||||
When the mutexes involved in a locking sequence are known at compile-time, then
|
||||
avoiding deadlocks is “merely” a matter of arranging the lock
|
||||
operations carefully so as to only occur in one single, fixed order. However, it
|
||||
is also possible for mutexes to only be determined at runtime. A typical example
|
||||
of this is a database where each row has its own mutex. An operation that
|
||||
modifies two rows in a single transaction (e.g. “transferring an amount
|
||||
from one account to another”) must lock two row mutexes, but the locking
|
||||
order cannot be established at compile time. In this case, a dynamic
|
||||
“deadlock avoidance algorithm” is needed. (In C++, `std::lock`
|
||||
provides such an algorithm. An algorithm might use a non-blocking `try_lock`
|
||||
operation on a mutex, which can either succeed or fail to lock the mutex, but
|
||||
returns without blocking.)
|
||||
|
||||
Conceptually, one could also consider it a deadlock if _the same_ thread
|
||||
attempts to lock a mutex that it has already locked (e.g. when some locked
|
||||
operation accidentally recurses into itself): `mu.lock();`{.good}
|
||||
`mu.lock();`{.bad} However, this is a slightly separate issue: Typical mutexes
|
||||
are either of _recursive_ or _non-recursive_ kind. A recursive mutex allows
|
||||
repeated locking and requires balanced unlocking. A non-recursive mutex can be
|
||||
implemented more efficiently, and/but for efficiency reasons does not actually
|
||||
guarantee a deadlock on second lock. Instead, the API simply forbids such use,
|
||||
making it a precondition that the thread not already hold the mutex, with
|
||||
undefined behaviour on violation.
|
||||
|
||||
### “Once” initialization
|
||||
|
||||
A common programming problem is to have an operation happen precisely once, even
|
||||
if requested concurrently. While it is clear that we need to track in some
|
||||
shared state somewhere whether the operation has already happened, it is worth
|
||||
noting that this state only ever transitions, once, from `false` to `true`. This
|
||||
is considerably simpler than a general shared state that can change values
|
||||
arbitrarily. Next, we also need a mechanism for all but one thread to block
|
||||
until the initialization has completed, which we can provide with a mutex. The
|
||||
simplest solution just always locks the mutex:
|
||||
|
||||
```c++
|
||||
// The "once" mechanism:
|
||||
constinit absl::Mutex mu(absl::kConstInit);
|
||||
constinit bool init_done = false;
|
||||
|
||||
// The operation of interest:
|
||||
void f();
|
||||
|
||||
void InitOnceNaive() {
|
||||
absl::MutexLock lock(&mu);
|
||||
if (!init_done) {
|
||||
f();
|
||||
init_done = true;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
This works, but the efficiency-minded reader will observe that once the
|
||||
operation has completed, all future lock contention on the mutex is
|
||||
unnecessary. This leads to the (in)famous “double-locking”
|
||||
algorithm, which was historically hard to write correctly. The idea is to check
|
||||
the boolean *before* locking the mutex, and avoid locking if the operation has
|
||||
already completed. However, accessing shared state concurrently when at least
|
||||
one access is a write is prone to causing a data race and needs to be done
|
||||
according to an appropriate concurrent programming model. In C++ we use atomic
|
||||
variables:
|
||||
|
||||
```c++
|
||||
// The "once" mechanism:
|
||||
constinit absl::Mutex mu(absl::kConstInit);
|
||||
constinit std::atomic<bool> init_done = false;
|
||||
|
||||
// The operation of interest:
|
||||
void f();
|
||||
|
||||
void InitOnceWithFastPath() {
|
||||
if (!init_done.load(std::memory_order_acquire)) {
|
||||
absl::MutexLock lock(&mu);
|
||||
if (!init_done.load(std::memory_order_relaxed)) {
|
||||
f();
|
||||
init_done.store(true, std::memory_order_release);
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Checking the flag now happens without holding the mutex lock, and if the
|
||||
operation has already completed, we return immediately. After locking the mutex,
|
||||
we need to check the flag again, since multiple threads can reach this point.
|
||||
|
||||
*Atomic details.* Since the atomic flag variable is accessed concurrently, we
|
||||
have to think about the memory order of the accesses. There are two separate
|
||||
cases: The first, outer check outside the mutex lock, and the second, inner
|
||||
check under the lock. The outer check and the flag update form an
|
||||
acquire/release pair: *if* the load sees the value `true` (which must have been
|
||||
written by the store operation), then it also sees everything that happened
|
||||
before the store, namely the operation `f()`. By contrast, the inner check can
|
||||
use relaxed memory ordering, since in that case the mutex operations provide the
|
||||
necessary ordering: if the inner load sees the value `true`, it happened after
|
||||
the `lock()`, which happened after the `unlock()`, which happened after the
|
||||
store.
|
||||
|
||||
The C++ standard library, and Abseil, provide a ready-made solution of this
|
||||
algorithm called `std::call_once`/`absl::call_once`. (The interface is the same,
|
||||
but the Abseil implementation is possibly better.)
|
||||
|
||||
```c++
|
||||
// The "once" mechanism:
|
||||
constinit absl::once_flag init_flag;
|
||||
|
||||
// The operation of interest:
|
||||
void f();
|
||||
|
||||
void InitOnceWithCallOnce() {
|
||||
absl::call_once(once_flag, f);
|
||||
}
|
||||
```
|
||||
|
||||
Even though conceptually this is performing the same algorithm, this
|
||||
implementation has some considerable advantages: The `once_flag` type is a small
|
||||
and trivial, integer-like type and is trivially destructible. Not only does it
|
||||
take up less space than a mutex, it also generates less code since it does not
|
||||
have to run a destructor, which would need to be added to the program's global
|
||||
destructor list.
|
||||
|
||||
The final clou comes with the C++ semantics of a `static` variable declared at
|
||||
block scope: According to [[stmt.dcl]](https://eel.is/c++draft/stmt.dcl#3):
|
||||
|
||||
> Dynamic initialization of a block variable with static storage duration or
|
||||
> thread storage duration is performed the first time control passes through its
|
||||
> declaration; such a variable is considered initialized upon the completion of
|
||||
> its initialization. [...] If control enters the declaration concurrently while
|
||||
> the variable is being initialized, the concurrent execution shall wait for
|
||||
> completion of the initialization.
|
||||
|
||||
This is saying that the initialization of a local, `static` variable precisely
|
||||
has the “once” semantics that we have been discussing. We can
|
||||
therefore write the above example as follows:
|
||||
|
||||
```c++
|
||||
// The operation of interest:
|
||||
void f();
|
||||
|
||||
void InitOnceWithStatic() {
|
||||
static int unused = (f(), 0);
|
||||
}
|
||||
```
|
||||
|
||||
This approach is by far the simplest and easiest, but the big difference is that
|
||||
the mutex (or mutex-like object) in this implementation is no longer visible or
|
||||
in the user’s control. This is perfectly fine if the initializer is
|
||||
simple, but if the initializer itself attempts to lock any other mutex
|
||||
(including by initializing another static variable!), then we have no control
|
||||
over the lock ordering!
|
||||
|
||||
Finally, you may have noticed the `constinit`s around the earlier code. Both
|
||||
`constinit` and `constexpr` specifiers on a declaration mean that the variable
|
||||
is *constant-initialized*, which means that no initialization is performed at
|
||||
runtime (the initial value is already known at compile time). This in turn means
|
||||
that a static variable guard mutex may not be needed, and static initialization
|
||||
never blocks. The difference between the two is that a `constexpr`-specified
|
||||
variable is also `const`, and a variable cannot be `constexpr` if it has a
|
||||
non-trivial destructor. Such a destructor also means that the guard mutex is
|
||||
needed after all, since the destructor must be registered to run at exit,
|
||||
conditionally on initialization having happened.
|
||||
|
||||
## Python, CPython, GIL
|
||||
|
||||
With CPython, a Python program can call into native code. To this end, the
|
||||
native code registers callback functions with the Python runtime via the CPython
|
||||
API. In order to ensure that the internal state of the Python runtime remains
|
||||
consistent, there is a single, shared mutex called the “global interpreter
|
||||
lock”, or GIL for short. Upon entry of one of the user-provided callback
|
||||
functions, the GIL is locked (or “held”), so that no other mutations
|
||||
of the Python runtime state can occur until the native callback returns.
|
||||
|
||||
Many native extensions do not interact with the Python runtime for at least some
|
||||
part of them, and so it is common for native extensions to _release_ the GIL, do
|
||||
some work, and then reacquire the GIL before returning. Similarly, when code is
|
||||
generally not holding the GIL but needs to interact with the runtime briefly, it
|
||||
will first reacquire the GIL. The GIL is reentrant, and constructions to acquire
|
||||
and subsequently release the GIL are common, and often don't worry about whether
|
||||
the GIL is already held.
|
||||
|
||||
If the native code is written in C++ and contains local, `static` variables,
|
||||
then we are now dealing with at least _two_ mutexes: the static variable guard
|
||||
mutex, and the GIL from CPython.
|
||||
|
||||
A common problem in such code is an operation with “only once”
|
||||
semantics that also ends up requiring the GIL to be held at some point. As per
|
||||
the above description of “once”-style techniques, one might find a
|
||||
static variable:
|
||||
|
||||
```c++
|
||||
// CPython callback, assumes that the GIL is held on entry.
|
||||
PyObject* InvokeWidget(PyObject* self) {
|
||||
static PyObject* impl = CreateWidget();
|
||||
return PyObject_CallOneArg(impl, self);
|
||||
}
|
||||
```
|
||||
|
||||
This seems reasonable, but bear in mind that there are two mutexes (the "guard
|
||||
mutex" and "the GIL"), and we must think about the lock order. Otherwise, if the
|
||||
callback is called from multiple threads, a deadlock may ensue.
|
||||
|
||||
Let us consider what we can see here: On entry, the GIL is already locked, and
|
||||
we are locking the guard mutex. This is one lock order. Inside the initializer
|
||||
`CreateWidget`, with both mutexes already locked, the function can freely access
|
||||
the Python runtime.
|
||||
|
||||
However, it is entirely possible that `CreateWidget` will want to release the
|
||||
GIL at one point and reacquire it later:
|
||||
|
||||
```c++
|
||||
// Assumes that the GIL is held on entry.
|
||||
// Ensures that the GIL is held on exit.
|
||||
PyObject* CreateWidget() {
|
||||
// ...
|
||||
Py_BEGIN_ALLOW_THREADS // releases GIL
|
||||
// expensive work, not accessing the Python runtime
|
||||
Py_END_ALLOW_THREADS // acquires GIL, #!
|
||||
// ...
|
||||
return result;
|
||||
}
|
||||
```
|
||||
|
||||
Now we have a second lock order: the guard mutex is locked, and then the GIL is
|
||||
locked (at `#!`). To see how this deadlocks, consider threads T1 and T2 both
|
||||
having the runtime attempt to call `InvokeWidget`. T1 locks the GIL and
|
||||
proceeds, locking the guard mutex and calling `CreateWidget`; T2 is blocked
|
||||
waiting for the GIL. Then T1 releases the GIL to do “expensive
|
||||
work”, and T2 awakes and locks the GIL. Now T2 is blocked trying to
|
||||
acquire the guard mutex, but T1 is blocked reacquiring the GIL (at `#!`).
|
||||
|
||||
In other words: if we want to support “once-called” functions that
|
||||
can arbitrarily release and reacquire the GIL, as is very common, then the only
|
||||
lock order that we can ensure is: guard mutex first, GIL second.
|
||||
|
||||
To implement this, we must rewrite our code. Naively, we could always release
|
||||
the GIL before a `static` variable with blocking initializer:
|
||||
|
||||
```c++
|
||||
// CPython callback, assumes that the GIL is held on entry.
|
||||
PyObject* InvokeWidget(PyObject* self) {
|
||||
Py_BEGIN_ALLOW_THREADS // releases GIL
|
||||
static PyObject* impl = CreateWidget();
|
||||
Py_END_ALLOW_THREADS // acquires GIL
|
||||
|
||||
return PyObject_CallOneArg(impl, self);
|
||||
}
|
||||
```
|
||||
|
||||
But similar to the `InitOnceNaive` example above, this code cycles the GIL
|
||||
(possibly descheduling the thread) even when the static variable has already
|
||||
been initialized. If we want to avoid this, we need to abandon the use of a
|
||||
static variable, since we do not control the guard mutex well enough. Instead,
|
||||
we use an operation whose mutex locking is under our control, such as
|
||||
`call_once`. For example:
|
||||
|
||||
```c++
|
||||
// CPython callback, assumes that the GIL is held on entry.
|
||||
PyObject* InvokeWidget(PyObject* self) {
|
||||
static constinit PyObject* impl = nullptr;
|
||||
static constinit std::atomic<bool> init_done = false;
|
||||
static constinit absl::once_flag init_flag;
|
||||
|
||||
if (!init_done.load(std::memory_order_acquire)) {
|
||||
Py_BEGIN_ALLOW_THREADS // releases GIL
|
||||
absl::call_once(init_flag, [&]() {
|
||||
PyGILState_STATE s = PyGILState_Ensure(); // acquires GIL
|
||||
impl = CreateWidget();
|
||||
PyGILState_Release(s); // releases GIL
|
||||
init_done.store(true, std::memory_order_release);
|
||||
});
|
||||
Py_END_ALLOW_THREADS // acquires GIL
|
||||
}
|
||||
|
||||
return PyObject_CallOneArg(impl, self);
|
||||
}
|
||||
```
|
||||
|
||||
The lock order is now always guard mutex first, GIL second. Unfortunately we
|
||||
have to duplicate the “double-checked done flag”, effectively
|
||||
leading to triple checking, because the flag state inside the `absl::once_flag`
|
||||
is not accessible to the user. In other words, we cannot ask `init_flag` whether
|
||||
it has been used yet.
|
||||
|
||||
However, we can perform one last, minor optimisation: since we assume that the
|
||||
GIL is held on entry, and again when the initializing operation returns, the GIL
|
||||
actually serializes access to our done flag variable, which therefore does not
|
||||
need to be atomic. (The difference to the previous, atomic code may be small,
|
||||
depending on the architecture. For example, on x86-64, acquire/release on a bool
|
||||
is nearly free ([demo](https://godbolt.org/z/P9vYWf4fE)).)
|
||||
|
||||
```c++
|
||||
// CPython callback, assumes that the GIL is held on entry, and indeed anywhere
|
||||
// directly in this function (i.e. the GIL can be released inside CreateWidget,
|
||||
// but must be reaqcuired when that call returns).
|
||||
PyObject* InvokeWidget(PyObject* self) {
|
||||
static constinit PyObject* impl = nullptr;
|
||||
static constinit bool init_done = false; // guarded by GIL
|
||||
static constinit absl::once_flag init_flag;
|
||||
|
||||
if (!init_done) {
|
||||
Py_BEGIN_ALLOW_THREADS // releases GIL
|
||||
// (multiple threads may enter here)
|
||||
absl::call_once(init_flag, [&]() {
|
||||
// (only one thread enters here)
|
||||
PyGILState_STATE s = PyGILState_Ensure(); // acquires GIL
|
||||
impl = CreateWidget();
|
||||
init_done = true; // (GIL is held)
|
||||
PyGILState_Release(s); // releases GIL
|
||||
});
|
||||
|
||||
Py_END_ALLOW_THREADS // acquires GIL
|
||||
}
|
||||
|
||||
return PyObject_CallOneArg(impl, self);
|
||||
}
|
||||
```
|
||||
|
||||
## Debugging tips
|
||||
|
||||
* Build with symbols.
|
||||
* <kbd>Ctrl</kbd>-<kbd>C</kbd> sends `SIGINT`, <kbd>Ctrl</kbd>-<kbd>\\</kbd>
|
||||
sends `SIGQUIT`. Both have their uses.
|
||||
* Useful `gdb` commands:
|
||||
* `py-bt` prints a Python backtrace if you are in a Python frame.
|
||||
* `thread apply all bt 10` prints the top-10 frames for each thread. A
|
||||
full backtrace can be prohibitively expensive, and the top few frames
|
||||
are often good enough.
|
||||
* `p PyGILState_Check()` shows whether a thread is holding the GIL. For
|
||||
all threads, run `thread apply all p PyGILState_Check()` to find out
|
||||
which thread is holding the GIL.
|
||||
* The `static` variable guard mutex is accessed with functions like
|
||||
`cxa_guard_acquire` (though this depends on ABI details and can vary).
|
||||
The guard mutex itself contains information about which thread is
|
||||
currently holding it.
|
||||
|
||||
## Links
|
||||
|
||||
* Article on
|
||||
[double-checked locking](https://preshing.com/20130930/double-checked-locking-is-fixed-in-cpp11/)
|
||||
* [The Deadlock Empire](https://deadlockempire.github.io/), hands-on exercises
|
||||
to construct deadlocks
|
@ -62,7 +62,11 @@ will acquire the GIL before calling the Python callback. Similarly, the
|
||||
back into Python.
|
||||
|
||||
When writing C++ code that is called from other C++ code, if that code accesses
|
||||
Python state, it must explicitly acquire and release the GIL.
|
||||
Python state, it must explicitly acquire and release the GIL. A separate
|
||||
document on deadlocks [#f8]_ elaborates on a particularly subtle interaction
|
||||
with C++'s block-scope static variable initializer guard mutexes.
|
||||
|
||||
.. [#f8] See docs/advanced/deadlock.md
|
||||
|
||||
The classes :class:`gil_scoped_release` and :class:`gil_scoped_acquire` can be
|
||||
used to acquire and release the global interpreter lock in the body of a C++
|
||||
@ -142,6 +146,9 @@ following checklist.
|
||||
destructors can sometimes get invoked in weird and unexpected circumstances as a result
|
||||
of exceptions.
|
||||
|
||||
- C++ static block-scope variable initialization that calls back into Python can
|
||||
cause deadlocks; see [#f8]_ for a detailed discussion.
|
||||
|
||||
- You should try running your code in a debug build. That will enable additional assertions
|
||||
within pybind11 that will throw exceptions on certain GIL handling errors
|
||||
(reference counting operations).
|
||||
|
@ -46,6 +46,8 @@ PYBIND11_NAMESPACE_BEGIN(PYBIND11_NAMESPACE)
|
||||
// get processed only when it is the main thread's turn again and it is running
|
||||
// normal Python code. However, this will be unnoticeable for quick call-once
|
||||
// functions, which is usually the case.
|
||||
//
|
||||
// For in-depth background, see docs/advanced/deadlock.md
|
||||
template <typename T>
|
||||
class gil_safe_call_once_and_store {
|
||||
public:
|
||||
|
Loading…
Reference in New Issue
Block a user