PoC: detach execution context scheduler from running thread during blocking syscall #15871
Draft: ysbaddaden wants to merge 10 commits into crystal-lang:master from ysbaddaden:poc/execution-context-detach-thread-during-syscall
ysbaddaden commented Jun 3, 2025
A global pool of threads to start new threads from and return threads to, so that we don't start and stop threads all the time and can wake an existing thread instead of creating a new one from scratch. The thread pool still eventually shuts down a thread once a configurable keepalive is reached, but takes extra measures to never shut down the main thread, which would invalidate the program's main fiber stack (segfaults).
Marks the scheduler as running a blocking syscall. The monitor thread now ticks every 10ms to check whether any scheduler in any ST or MT context is blocked on a syscall, and if so tries to detach the scheduler from the thread. On success the scheduler is moved to another thread, taken from the thread pool. The fiber doing the blocking syscall will still be blocked, but other fibers may be resumed by the scheduler. When the blocking syscall returns, the thread tries to unmark the scheduler as running a blocking syscall. On success it simply returns. On failure, it enqueues the current fiber back into its execution context and checks itself back into the thread pool.
We can't join on a thread because it may not terminate anymore. We must call the `#wait` method of the isolated context to know when the isolated fiber has terminated. The method is conveniently aliased as `#join` so we don't need to handle the type in most cases.
Some syscalls can block the current thread in certain circumstances, for example:
- `open(2)` when opening a fifo, pipe or character device, until another end is connected (from another thread or process);
- `getaddrinfo(3)` until a DNS response (or error, or timeout) is received.

Move scheduler blocked on syscall
This proof of concept introduces a mechanism to declare the scheduler as "doing a syscall". The monitor thread (SYSMON) can detect this on its next iteration and will then try to move the scheduler to another thread, so that only the fiber doing the syscall stays blocked and the other fibers can be resumed.
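The marking handshake described above can be sketched as a tiny state machine. This is an illustration in Python, not the PR's Crystal code: the state names, `AtomicCell`, `Scheduler.syscall` and `monitor_tick` are all hypothetical, and a lock stands in for real atomic instructions.

```python
import threading

RUNNING, SYSCALL, DETACHED = 0, 1, 2  # hypothetical state names


class AtomicCell:
    """Stand-in for a hardware atomic: a lock makes store/CAS indivisible."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def store(self, value):
        with self._lock:
            self._value = value

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


class Scheduler:
    def __init__(self):
        self.state = AtomicCell(RUNNING)

    def syscall(self, blocking_call):
        """Run blocking_call; report whether the scheduler stayed on this thread."""
        self.state.store(SYSCALL)  # mark: one atomic store before the call
        result = blocking_call()
        # unmark: one CAS after the call; it fails if the monitor won the race
        kept = self.state.compare_and_set(SYSCALL, RUNNING)
        # kept == False: the caller would re-enqueue the fiber and return
        # this thread to the pool (not modelled here).
        return result, kept


def monitor_tick(scheduler):
    """One monitor iteration: try to claim a scheduler marked as in a syscall."""
    return scheduler.state.compare_and_set(SYSCALL, DETACHED)


sched = Scheduler()
fast = sched.syscall(lambda: "fd")  # no tick during the call
slow = sched.syscall(lambda: (monitor_tick(sched), "fd")[1])  # tick lands mid-call
print(fast, slow)  # ('fd', True) ('fd', False)
```

Whichever side wins the CAS owns the scheduler afterwards, which is why the thread can tell on return whether it must give the fiber back.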
Usually, the syscall should return before the monitor thread notices (for example when opening a regular file), so the performance impact is one atomic STORE plus one atomic CAS per syscall. At worst, a thread will be blocked for 10ms (the sysmon tick interval). For example, the updated spec that opens a fifo file takes ~11ms to complete.
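The blocking behaviour that the fifo spec exercises can be reproduced outside Crystal. The following is a minimal Python sketch for Unix systems, where `time_fifo_open` is a hypothetical helper: it measures how long `open` on a fifo stalls the calling thread until a writer connects.

```python
import os
import tempfile
import threading
import time


def time_fifo_open(delay):
    """Open a FIFO for reading while a writer connects `delay` seconds later.

    Returns how long the reader's open() call was blocked, in seconds.
    """
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "demo.fifo")
    os.mkfifo(path)

    def writer():
        time.sleep(delay)
        fd = os.open(path, os.O_WRONLY)  # connecting the other end unblocks the reader
        os.close(fd)

    thread = threading.Thread(target=writer)
    thread.start()
    start = time.monotonic()
    fd = os.open(path, os.O_RDONLY)  # blocks this thread until a writer opens the FIFO
    elapsed = time.monotonic() - start
    os.close(fd)
    thread.join()
    os.remove(path)
    os.rmdir(directory)
    return elapsed
```

Calling `time_fifo_open(0.2)` should report roughly 0.2 seconds: the reader thread can do nothing else for that whole time, which is the situation the PoC tries to limit to a single fiber.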
It works for the MT execution contexts and the ST context. It doesn't invalidate the ST guarantee that fibers in the context will never run in parallel: the blocked fiber is blocked on a syscall and will be re-enqueued immediately after the syscall has completed. Also, the syscalls don't invoke callbacks that would execute Crystal code, so AFAICT fibers still won't run in parallel (please correct me if I'm wrong).
Thread Pool
This PoC also introduces a pool of threads. It changes the behavior of threads: we no longer start a thread to run a specific scheduler run loop. Instead, each thread now has its own inner loop that basically switches to a scheduler loop, then switches back to its inner loop to sleep.
The benefit of the global thread pool is that threads are kept around instead of being created and thrown away. If you regularly spawn an isolated fiber, it will likely keep reusing the same thread(s). Threads still eventually shut down after some (configurable) inactive time, except for the main thread (we need to keep the main fiber alive).
A potential evolution would park MT threads into the thread pool, instead of keeping them tied to the MT context, so they can be reused by any context that needs parallelism, or to boot a new isolated fiber or ST context.
Extracted to #15885.
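The park-and-reuse behaviour can be sketched as follows. This is an illustration in Python, not the PR's implementation: `ThreadPool`, `checkout` and `_inner_loop` are hypothetical names, a plain callable stands in for a scheduler run loop, and the special handling of the main thread is not modelled.

```python
import queue
import threading


class ThreadPool:
    """Hypothetical pool: idle threads park on a queue and exit after `keepalive` seconds."""

    def __init__(self, keepalive):
        self.keepalive = keepalive
        self._jobs = queue.Queue()
        self._lock = threading.Lock()
        self._idle = 0    # parked threads not yet reserved for a job
        self.started = 0  # total number of threads ever started

    def checkout(self, job):
        """Hand `job` to a parked thread if one exists, else start a new thread."""
        with self._lock:
            if self._idle > 0:
                self._idle -= 1  # reserve a parked thread for this job
            else:
                self.started += 1
                threading.Thread(target=self._inner_loop, daemon=True).start()
        self._jobs.put(job)

    def _inner_loop(self):
        while True:
            try:
                job = self._jobs.get(timeout=self.keepalive)
            except queue.Empty:
                with self._lock:
                    if self._idle > 0:
                        self._idle -= 1  # nobody reserved us: safe to shut down
                        return
                continue  # a job is reserved for a parked thread: keep waiting
            job()  # e.g. a scheduler run loop
            with self._lock:
                self._idle += 1  # park again instead of terminating
```

Two consecutive `checkout` calls, with the first job finished and its thread parked before the second call, run on the same OS thread and `started` stays at 1. Once the keepalive elapses with no work, the thread returns from its inner loop.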
NOTES
- The isolated context expects to block, so the `#syscall(&)` method is a NOOP there.
- There are probably other blocking syscalls that we might want to consider. For example, reading from STDIN on Windows could be greatly simplified.
- Another example is `flock`, which is currently retried every 100ms when it doesn't block the current thread. We might want to be able to actively detach a scheduler when calling `#syscall(&)`, so we could try once (non-blocking) then, on failure, detach the scheduler and try again (blocking) without waiting for SYSMON to notice 🤔
- The PR contains multiple commits that may be extracted into individual pull requests (at least the first one). Each commit is focused on one task, and it might be easier to read them one by one.
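The `flock` idea from the notes above (try once non-blocking, then detach and block) could look roughly like this. It is a Python sketch for Unix systems only: `lock_exclusive` is a hypothetical helper and `detach_scheduler` is a hypothetical hook standing in for "actively detach the scheduler from this thread". No such API is claimed to exist in the PR.

```python
import fcntl


def lock_exclusive(fd, detach_scheduler=lambda: None):
    """Try a non-blocking flock first; fall back to a blocking one on contention.

    Returns True if the fast (non-blocking) attempt acquired the lock,
    False if the blocking fallback was needed.
    """
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        # The lock is held elsewhere. Hypothetical step: hand the scheduler
        # to another thread first, so only this fiber waits.
        detach_scheduler()
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks the current thread
        return False
```

In the uncontended case this costs a single syscall and never involves the monitor thread. Under contention the scheduler would be handed off immediately, rather than after the next SYSMON tick or a 100ms retry.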
POTENTIAL ISSUE
I got one segfault in a GC call nested in a libxml2 callback in one early run of the std specs (with `-Dpreview_mt -Dexecution_context`), but I couldn't reproduce it after fixing different issues in the PR.
Maybe it was a fluke (because of the bugs), or maybe it was just a regular MT issue with libxml2, or maybe sysmon moved the scheduler from the main thread to another thread then resumed a fiber doing something in libxml2, and the global thread-local state couldn't be found?
This is the already known MT issue we have with libxml2. What's new is that the segfault might start happening in a ST environment 😢
Closes #15768.