
Move mutable globals into a central structure #7722

Closed: steven-johnson wants to merge 11 commits

Conversation

steven-johnson (Contributor)

This is a stab at an alternative to #7717, per discussion there. It doesn't necessarily have all the mutable globals moved here, but at least the ones we discussed. Better? Worse? What do you think?

Makefile Outdated
@@ -676,7 +677,7 @@ HEADER_FILES = \
CodeGen_PyTorch.h \
CodeGen_Targets.h \
CodeGen_WebGPU_Dev.h \
CompilerLogger.h \
Member

Unintended change?

Contributor Author

Oops


Globals the_real_globals;

size_t compute_stack_size() {
Member

I think the compiler stack size doesn't belong in this struct. It's not mutable state - it's configuration.

Contributor Author

It can be changed by set_stack_size() at runtime.

Member

But it's not mutated as part of a compilation of a Halide pipeline. It's libHalide configuration state rather than Generator state. Someone might feasibly want to set it with code outside of a loop that calls compile_multitarget a bunch of times.

Contributor Author

Fair. Will change.

std::atomic<int> unique_name_counters[num_unique_name_counters];

// The stack for tic/toc utilities
std::vector<TickStackEntry> tick_stack;
Member

There's no point putting this in the struct either. It's not mutable compiler state that it makes sense to reset. In fact resetting it could result in toc() trying to pop an empty vector.

Contributor Author

Fair.

Contributor Author

Thinking more on this -- you're not wrong, but if a multithreaded caller used this facility, we could end up with similarly mismatched vectors. Since this is mostly for transitory debugging use, maybe it's no big deal, but it is a minor footgun for any code that wants to drive Halide from multiple threads.

@steven-johnson (Contributor Author)

PTAL -- sorry for the force-push :-/

@steven-johnson (Contributor Author)

Oooh, this is interesting: it looks like a real issue, presumably caused by an interaction between this change and the recently-landed #7715. Investigating:

[50/1082] Generating camera_pipe.h, camera_pipe.o
FAILED: camera_pipe/camera_pipe.h camera_pipe/camera_pipe.o /home/halidenightly/build_bot/worker/halide-testbranch-main-llvm16-x86-64-linux-cmake/halide-build/apps/camera_pipe/camera_pipe.h /home/halidenightly/build_bot/worker/halide-testbranch-main-llvm16-x86-64-linux-cmake/halide-build/apps/camera_pipe/camera_pipe.o 
cd /home/halidenightly/build_bot/worker/halide-testbranch-main-llvm16-x86-64-linux-cmake/halide-build/apps/camera_pipe && /home/halidenightly/build_bot/worker/halide-testbranch-main-llvm16-x86-64-linux-cmake/halide-build/apps/camera_pipe/camera_pipe.generator -n camera_pipe -d 0 -g camera_pipe -f camera_pipe -e c_header,object -o . target=host-no_runtime
Unhandled exception: Error: In schedule for f7, can't split v0 into v0 and v0 because the new Vars have the same name.
Vars: v0 v1 __outermost

@steven-johnson (Contributor Author)

OK, there's a subtle bug here that I'm not sure how to resolve. The issue is this:

  • Declare a Var at global scope, without an explicit name (e.g. Var x;), so that it gets a unique name (e.g. "v0").
  • When we begin codegen, we call Globals::reset() to put all the unique-name counters back into their ground state.
  • Now the Generator does somefunc.vectorize(x, ...); internally, this creates a Var tmp; and uses it to create a split.
  • Unfortunately, since we just reset all the unique-name counters in the globals, Var tmp; can re-use the name "v0", and hilarity ensues (see the sketch below).
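
As a minimal standalone sketch of the collision (hypothetical counter and helper names, not Halide's actual implementation):

```cpp
#include <atomic>
#include <iostream>
#include <string>

std::atomic<int> var_counter{0};

// Produces "v0", "v1", ... much like an anonymous Var picking a unique name.
std::string unique_var_name() {
    return "v" + std::to_string(var_counter++);
}

// Stand-in for what Globals::reset() effectively does to the counters.
void reset_counters() {
    var_counter = 0;
}

// "Var x;" at global scope: grabs "v0" during static initialization.
const std::string global_x = unique_var_name();

int main() {
    reset_counters();                     // start of codegen: counters rewound
    std::string tmp = unique_var_name();  // "Var tmp;" inside vectorize(): "v0" again
    std::cout << (tmp == global_x ? "collision!" : "ok") << "\n";  // prints "collision!"
}
```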

Thinking out loud, possible ways to fix this might be:

  • Capture the state of the Globals after all static initializers have run, then use that as the future 'base state' for reset() calls. But...
  • In C++ we can force static initializer A to run before static initializer B (via lazy initialization of A as a static at local scope), but we can't force it to run after everything else.
  • In the case of Generators, we are usually in control of the main() entry point (via generator_main), so we could record the state of the Globals at that time and use it for future reset calls. But...
  • That's only safe for command-line usage; theoretically, someone can call (e.g.) compile_multitarget() directly from their own code, so this would be a subtle lurking bug. So...
  • What if, instead, the first time we call Globals::reset(), we record the current state and use that as the baseline for all future reset calls? This would likely work, but...
  • It could capture additional non-static state as well (e.g. if someone isn't driving things from the command line and creates Var instances on their own stack frame). But I think this might be OK? It isn't the 100% pristine case that would be ideal, but it really just means that we'd have a few more unique names than strictly necessary. And, more importantly, it would still guarantee that the initial state is identical across all subtargets, which is the goal here.

It's not pretty, but I think it would work... am I missing something?
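
For the last option, a rough sketch of "capture the baseline on the first reset()" might look like this (hypothetical names, plain ints instead of atomics for brevity):

```cpp
struct Globals {
    int unique_name_counters[26] = {};
    // ... other mutable compiler state ...
};

Globals globals;

void reset_globals() {
    // Captured exactly once, on the first call; C++ guarantees thread-safe
    // initialization of function-local statics.
    static const Globals baseline = globals;
    // Every call (including the first) rolls back to that baseline rather
    // than to all-zeros, so names handed out before the first reset() are
    // never reissued.
    globals = baseline;
}
```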

@abadams (Member) commented Aug 1, 2023

Perhaps the interface should be the ability to push/pop state, instead of the ability to reset state. Popping would roll it back to just before the matching push.
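
Roughly, with hypothetical names (not code from this PR), that interface might look like:

```cpp
#include <vector>

struct Globals {
    int unique_name_counters[26] = {};
};

Globals globals;
std::vector<Globals> saved_states;

// push() snapshots the current state; pop() rolls everything back to just
// before the matching push.
void push_globals() {
    saved_states.push_back(globals);
}

void pop_globals() {
    globals = saved_states.back();
    saved_states.pop_back();
}
```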

@steven-johnson (Contributor Author)

Perhaps the interface should be the ability to push/pop state, instead of the ability to reset state. Popping would roll it back to just before the matching push.

...let's discuss this design offline before I do any more work on it and come up with a concrete goal.

@steven-johnson (Contributor Author)

Despite my comment above, I couldn't resist, so I restructured this to make it an entirely opaque API in order to allow save/restore as you requested.

One possible issue here re: thread safety: all the atomic<int> stuff should be fine as-is, but... adding save/reset/restore implies that these should also be thread-safe (somehow), which would mean mutex-protecting all the counters to ensure they aren't modified while a save/reset/restore is happening. We could do that, but (1) yuck, and (2) it's probably a nontrivial performance hit at compile time vs. what we do now. I suspect the right answer is to dodge it entirely by declaring that the reset() and restore() functions are NOT thread-safe...

@steven-johnson (Contributor Author)

Failures are unrelated, but the changes since you approved are pretty major; PTAL before I consider landing.

@@ -640,6 +641,9 @@ HTML_TEMPLATE_FILES = \
# Don't include anything here that includes llvm headers.
# Also *don't* include anything that's only used internally (eg SpirvIR.h).
# Keep this list sorted in alphabetical order.
#
# Note that CompilerGlobals.h is deliberately omitted, as we don't
Member

This comment seems redundant with the one above. There are lots of .h files not in this list.

@steven-johnson (Contributor Author)

One possible issue here re: thread safety

Thinking more on this... if we want this to be properly thread-safe, I think what we need to do is:

  • Instead of using std::atomic for protection, use a single mutex for the entire Globals instance. (For the usual single-threaded case, I suspect this wouldn't be a meaningful performance hit, since an uncontested mutex should be pretty efficient, but we'd need to profile to see.)
  • Any piece of code that does a save/reset/restore of the Globals needs to hold this mutex during that entire cycle, for obvious reasons. This sounds draconian, but in practice the only place we obviously need to do this is inside the body of compile_to_multitarget, which would be a nontrivial bottleneck anyway (sketched below).
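
As a rough sketch of that locking discipline (hypothetical names; a std::recursive_mutex is assumed here so the long-held lock and the per-name locks can coexist on the same thread):

```cpp
#include <algorithm>
#include <iterator>
#include <mutex>
#include <string>

struct Globals {
    std::recursive_mutex mutex;
    int unique_name_counters[26] = {};
    int saved_counters[26] = {};

    std::string unique_name(char prefix) {
        std::lock_guard<std::recursive_mutex> lock(mutex);
        int &c = unique_name_counters[prefix - 'a'];
        return std::string(1, prefix) + std::to_string(c++);
    }

    void save() {
        std::lock_guard<std::recursive_mutex> lock(mutex);
        std::copy(std::begin(unique_name_counters), std::end(unique_name_counters),
                  std::begin(saved_counters));
    }

    void restore() {
        std::lock_guard<std::recursive_mutex> lock(mutex);
        std::copy(std::begin(saved_counters), std::end(saved_counters),
                  std::begin(unique_name_counters));
    }
};

Globals globals;

void compile_all_subtargets() {
    // Hold the lock across the whole save/.../restore cycle so no other
    // thread can observe or perturb the counters in between.
    std::lock_guard<std::recursive_mutex> lock(globals.mutex);
    globals.save();
    // ... lower and codegen each subtarget, calling globals.unique_name() ...
    globals.restore();
}
```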

@steven-johnson (Contributor Author)

Monday morning review ping

@abadams (Member) commented Aug 16, 2023

IIRC at the dev meeting we discussed an API where instead of reset() and restore() we just have get and set, and then the scoped widget does a get in its constructor and a set in the destructor, so that the state is restored to whatever it was before the scoped code ran. I believe the difference between that and this is that get() is non-mutating, whereas reset() is.

The advantage of get()/set() over reset()/restore() is that the semantics are simpler, and they can be nested without accidentally reusing names, though I'm not sure what the point of nesting it would be.
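
A minimal sketch of that scoped widget (hypothetical names, not actual Halide code) might look like:

```cpp
struct Globals {
    int unique_name_counters[26] = {};
};

Globals globals;

Globals get_globals() { return globals; }            // non-mutating snapshot
void set_globals(const Globals &g) { globals = g; }

// Copies the global state in its constructor and writes that copy back in its
// destructor, so the state returns to whatever it was before the scope ran.
class ScopedGlobals {
    Globals saved;
public:
    ScopedGlobals() : saved(get_globals()) {}
    ~ScopedGlobals() { set_globals(saved); }
};

void compile_one_subtarget() {
    ScopedGlobals scope;  // restored on scope exit, even on exception
    // ... build IR, lower, codegen ...
}
```

Nesting two ScopedGlobals instances restores each level to the state it saw on entry, which is what makes the semantics simpler than reset()/restore().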

None of this addresses multi-threading though. I'm not sure what to do about that. If a set()/restore() happens on one thread while another thread is mid-compilation, we're going to have a bad time...

That seems like a show-stopper, because I already know of one case in the wild where different generators are being run on different threads, and this change would break it. Adding a mutex would just serialize the whole thing. I can't immediately think of a solution.

@steven-johnson (Contributor Author)

I already know of one case in the wild where different generators are being run on different threads

Are we sure that that situation is actually robust and not getting lucky? (Have we tried running it with TSAN?)

and this change would break it

It wouldn't break it if we added a mutex to serialize the whole thing... but it would clearly impact performance to some extent. How much, I have no idea -- might be interesting to make a fully-mutexed version out of this and actually try it in that scenario to see how much effect we're talking about. (As the saying goes, speed doesn't matter unless the answer is correct...)

@abadams (Member) commented Aug 16, 2023

There's no cross-talk between the threads involved except for accessing the same set of atomic counters, so I'm pretty sure TSAN would be clean. Serializing the whole thing would remove the point of threading it in the first place.

@steven-johnson (Contributor Author)

Serializing the whole thing would remove the point of threading it in the first place.

Well, no; it seems to me that the mutex contention would only occur when we request unique names (or random-var seeds, but those are used so infrequently that they're unlikely to matter). And yes, our code requests unique names a lot, but a lot of other work happens between those requests. It's not obvious to me that the contention would destroy the point of threading, but testing would give us data pro or con.

@abadams (Member) commented Aug 16, 2023

You'd have to hold the mutex for the entirety of lowering, not just when asking for unique names, because you need to prevent another thread from calling restore() while you're in the process of lowering, or you'll get non-unique temporary names.

@steven-johnson (Contributor Author)

You'd have to hold the mutex for the entirety of lowering

Hmm, not necessarily -- you could get away with only holding the mutex while a save is active (and release it upon restore) -- but I'm not sure that would help much in practice, because the main place this would apply is compile_to_multitarget, which does a lot of lowering internally.

OK, so, separate thought: what about using thread-local storage?
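
A minimal sketch of the thread-local idea (hypothetical names, not part of this PR):

```cpp
#include <string>

struct Globals {
    int unique_name_counters[26] = {};
};

// One instance per thread; no atomics needed, and a reset() or restore() on
// one thread cannot perturb names being generated on another.
thread_local Globals tls_globals;

std::string unique_name(char prefix) {
    int &c = tls_globals.unique_name_counters[prefix - 'a'];
    return std::string(1, prefix) + std::to_string(c++);
}
```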

@abadams (Member) commented Aug 16, 2023

That would add the constraint that we can't use multi-threading internally in lowering (which we don't currently do), and you can't construct Halide IR for a single pipeline using multiple threads (which I doubt anyone does).

Not sure if repeatable naming is worth those constraints.

Another option might be to add the functionality but not turn any of it on by default, and if people want to save/restore state in their own generator drivers where they're familiar with what they do and don't do, they can.

In the case you had where you wanted to reset the random number counters, that could just call reset() at the top of generate() for the affected pipeline.

@steven-johnson (Contributor Author)

That would add the constraint that we can't use multi-threading internally in lowering (which we don't currently do), and you can't construct Halide IR for a single pipeline using multiple threads (which I doubt anyone does).
Not sure if repeatable naming is worth those constraints.

Sigh... maybe not, but it would sure make life easier when debugging.

Another option might be to add the functionality but not turn any of it on by default, and if people want to save/restore state in their own generator drivers where they're familiar with what they do and don't do, they can.

...maybe? Not sure how to refactor this so that making it conditional wouldn't be ugly, though.

In the case you had where you wanted to reset the random number counters, that could just call reset() at the top of generate() for the affected pipeline.

Fair... I guess I'll go resurrect my original PR that basically just did that as a stopgap.

I already know of one case in the wild where different generators are being run on different threads

Can you expand a little bit on the use case here and what specifically is happening (and why), assuming it's not proprietary information? (It might help me wrap my head around the problem space better.)

@steven-johnson (Contributor Author)

This whole thing makes me a little sad... we both know that it's really hard to get things right the first time, but in hindsight, it seems like having Halide objects belong to a Context of some sort would probably have been a win, despite the possible extra overhead involved. Ah, well, hindsight.

steven-johnson marked this pull request as a draft on August 24, 2023 at 22:15.