Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix synchronization issues on the GC scheduler. #53355

Merged
merged 3 commits into from
Feb 22, 2024
Merged

Conversation

gbaraldi
Copy link
Member

Me and @vtjnash stared at this code for a while and it seems that making n_threads_marking the sole synchronization is the way forward, and to remove the fast path because being protected by the lock is quite important

@gbaraldi gbaraldi requested review from d-netto, vtjnash and vchuravy and removed request for vtjnash February 15, 2024 21:32
Copy link
Member

@d-netto d-netto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this would be quite correct.

I'm afraid this might also be vulnerable to the issue described in #53350.

In particular, note that's possible for a thread to be context switched out in this line, and whenever it continues execution, a large enough number of cycles may have been passed and the master thread may have already set gc_all_tls_states = NULL here, so that when the straggler tries to access gc_all_tls_states in gc_count_work_in_queue it will just read NULL.

@vchuravy vchuravy removed their request for review February 15, 2024 22:00
@gbaraldi
Copy link
Member Author

If the master thread has exited, we can never reach that part of the code, the value has to be 0 in n_marking for it to exit.

Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm. Note for the commit message we think we fixed an extra bug in sweep also

src/gc.c Outdated Show resolved Hide resolved
@d-netto
Copy link
Member

d-netto commented Feb 16, 2024

Segfault here seems related to the PR.

@KristofferC KristofferC added the backport 1.11 Change should be backported to release-1.11 label Feb 16, 2024
@gbaraldi
Copy link
Member Author

gbaraldi commented Feb 16, 2024

The segfault in question happened because
gc_setmark_pool_ (o=0x7f4e30af8008, page=0x0, mark_mode=<optimized out>, ptls=0x7f4e34977710) got a null page, now that probably is due to memory corruption I imagine? Which begs the question, why can page_metadata return null but we never check it. We should probably at least add an assertion there.
Or is returning null valid?

@d-netto
Copy link
Member

d-netto commented Feb 16, 2024

We maintain a side byte-map that holds information about the state of the page (GC_PAGE_ALLOCATED , GC_PAGE_LAZILY_FREED, etc).

If page_metadata is returning NULL, it likely means that the byte in the byte-map for that given page is set to something other than GC_PAGE_ALLOCATED.

I.e. we may never have allocated this page or may have freed it prematurely.

src/gc.c Outdated Show resolved Hide resolved
src/julia_threads.h Outdated Show resolved Hide resolved
src/julia_threads.h Outdated Show resolved Hide resolved
@d-netto
Copy link
Member

d-netto commented Feb 17, 2024

The part which removes the fast path in which we check if (n_threads_marking == 0) outside of the lock is relevant to 1.10.

We should backport it.

@@ -1676,7 +1677,7 @@ void gc_sweep_wake_all(jl_ptls_t ptls, jl_gc_padded_page_stack_t *new_gc_allocd_
void gc_sweep_wait_for_all(void)
{
jl_atomic_store(&gc_allocd_scratch, NULL);
while (jl_atomic_load_relaxed(&gc_n_threads_sweeping) != 0) {
while (jl_atomic_load_acquire(&gc_n_threads_sweeping) != 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@d-netto We also changed this just because of reading the code but not from an observed failure. This acquires the writes on the mutator threads, so that the next stage of using gc_allocd_scratch here will acquire all of previous writes from the threads that did a write-release of gc_n_threads_sweeping. I don't think this would likely effect TSO systems, but we theorized that on a weaker ordering system, those last remaining writes might not have become visible yet on this thread, leading to a slightly stale free-page list pointer, resulting in some dropped pages from future sweeping.

@d-netto
Copy link
Member

d-netto commented Feb 22, 2024

Merge?

@KristofferC
Copy link
Member

The part which removes the fast path in which we check if (n_threads_marking == 0) outside of the lock is relevant to 1.10.

Is this, or can it be made into a separate commit?

@gbaraldi
Copy link
Member Author

Shouldn't all changes to gc_should_mark and related ones be backported?

@vtjnash vtjnash merged commit a96726b into master Feb 22, 2024
7 checks passed
@vtjnash vtjnash deleted the gb/gc-scheduler branch February 22, 2024 17:24
KristofferC pushed a commit that referenced this pull request Feb 26, 2024
This aims to slightly simplify the synchronization by making
`n_threads_marking` the sole memory location of relevance for it, it
also removes the fast path, because being protected by the lock is
quite important so that the observed gc state arrays are valid.

Fixes: #53350
Fixes: #52757
Maybe fixes: #53026
Co-authored-by: Jameson Nash <vtjnash@gmail.com>

(cherry picked from commit a96726b)
@KristofferC KristofferC mentioned this pull request Feb 26, 2024
28 tasks
@sgaure sgaure mentioned this pull request Feb 27, 2024
KristofferC added a commit that referenced this pull request Mar 1, 2024
Backported PRs:
- [x] #53361 <!-- 🤖 [master] Bump the SparseArrays stdlib from c9f7293
to cb602d7 -->
- [x] #53300 <!-- allow external absint to hold custom data in
`codeinst.inferred` -->
- [x] #53342 <!-- Add `Base.wrap` to docs -->
- [x] #53372 <!-- Silence warnings in `test/file.jl` -->
- [x] #53357 <!-- 🤖 [master] Bump the Pkg stdlib from 6dd0e7c9e to
76070d295 -->
- [x] #53373 <!-- fix sysimage-native-code=no option with pkgimages -->
- [x] #53333 <!-- More consistent return value for annotations, take 2
-->
- [x] #53354 <!-- fix code coverage bug in tail position and `else` -->
- [x] #53407 <!-- fix sysimage-native-code=yes option -->
- [x] #53388 <!-- Fix documentation: thread pool of main thread -->
- [x] #53355 <!-- Fix synchronization issues on the GC scheduler. -->
- [x] #53429 <!-- Subtype: skip slow-path in `local_∀_∃_subtype` if
inputs contain no ∃ typevar. -->
- [x] #53437 <!-- Add debug variant of loader_trampolines.o -->
- [x] #53284 <!-- Add annotate! method for AnnotatedIOBuffer -->
- [x] #53466 <!-- [MozillaCACerts_jll] Update to v2023-12-12 -->
- [x] #53467 <!-- [LibGit2_jll] Update to v1.7.2 -->
- [x] #53326 <!-- RFC: when loading code for internal purposes, load
stdlib files directly, bypassing DEPOT_PATH, LOAD_PATH, and stale checks
-->
- [x] #53332
- [x] #53320 <!-- Add `Sys.isreadable, Sys.iswriteable`, update `ispath`
-->
- [x] #53476

Contains multiple commits, manual intervention needed:
- [ ] #53285 <!-- Add update mechanism for Terminfo, and common
user-alias data -->

Non-merged PRs with backport label:
- [ ] #53424 <!-- yet more atomics & cache-line fixes on work-stealing
queue -->
- [ ] #53408 <!-- task splitting: change additive accumulation to
multiplicative -->
- [ ] #53403 <!-- Move parallel precompilation to Base -->
- [ ] #53402 <!-- Add `jl_getaffinity` and `jl_setaffinity` -->
- [ ] #53391 <!-- Default to the medium code model in x86 linux -->
- [ ] #53125 <!-- coverage: count coverage where explicitly requested by
inference only -->
- [ ] #52694 <!-- Reinstate similar for AbstractQ for backward
compatibility -->
@KristofferC KristofferC removed the backport 1.11 Change should be backported to release-1.11 label Mar 1, 2024
tecosaur pushed a commit to tecosaur/julia that referenced this pull request Mar 4, 2024
This aims to slightly simplify the synchronization by making
`n_threads_marking` the sole memory location of relevance for it, it
also removes the fast path, because being protected by the lock is
quite important so that the observed gc state arrays are valid.

Fixes: JuliaLang#53350
Fixes: JuliaLang#52757
Maybe fixes: JuliaLang#53026
Co-authored-by: Jameson Nash <vtjnash@gmail.com>
mkitti pushed a commit to mkitti/julia that referenced this pull request Mar 7, 2024
This aims to slightly simplify the synchronization by making
`n_threads_marking` the sole memory location of relevance for it, it
also removes the fast path, because being protected by the lock is
quite important so that the observed gc state arrays are valid.

Fixes: JuliaLang#53350
Fixes: JuliaLang#52757
Maybe fixes: JuliaLang#53026
Co-authored-by: Jameson Nash <vtjnash@gmail.com>
KristofferC pushed a commit that referenced this pull request Mar 13, 2024
Cherry-pick the parts of #53355
which are relevant to the 1.10 GC.
d-netto added a commit to RelationalAI/julia that referenced this pull request Mar 14, 2024
Cherry-pick the parts of JuliaLang#53355
which are relevant to the 1.10 GC.
Drvi pushed a commit to RelationalAI/julia that referenced this pull request Mar 15, 2024
Cherry-pick the parts of JuliaLang#53355
which are relevant to the 1.10 GC.
d-netto added a commit to RelationalAI/julia that referenced this pull request Mar 21, 2024
Cherry-pick the parts of JuliaLang#53355
which are relevant to the 1.10 GC.
d-netto added a commit to RelationalAI/julia that referenced this pull request Mar 21, 2024
Cherry-pick the parts of JuliaLang#53355
which are relevant to the 1.10 GC.
d-netto added a commit that referenced this pull request Mar 30, 2024
Drvi pushed a commit to RelationalAI/julia that referenced this pull request Jun 7, 2024
Cherry-pick the parts of JuliaLang#53355
which are relevant to the 1.10 GC.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants