Fiber support #106

jkriegshauser · 2020-09-03T19:10:16Z

Our task system uses Fibers. Since Tracy appears to get its own thread information, this makes profiling with fibers problematic. Ideally we'd like to show a fiber as a thread. For this to work, we would need a way to override the current Thread ID that Tracy uses.

A few ideas:

- Additional TracyCZone functions that take a "context identifier" that could either be a thread ID or a fiber ID.
- A TracyCSetThreadIDOverride function that overrides the current thread's ID with a given fiber ID until called again with 0.

wolfpld · 2020-09-03T19:24:28Z

I don't use fibers. You will need to explain to me how fibers operate and how they relate to threads. I'd like to know exactly what we're dealing with here before deciding how to proceed further.

jkriegshauser · 2020-09-03T21:22:13Z

Thanks for the reply. Fibers are like threads in that they have their own stack, registers, etc. But different in that the OS does not schedule them. Instead, they're more like co-routines (lua, python, etc.) where a thread must explicitly switch to them, and explicitly switch away from them.

When a task is queued to our task system, a Fiber is allocated for it. Once the task is finished, the Fiber is then returned to a pool waiting for a new task.

The task threads basically do the following:

1. Wait for a Fiber to be available.
2. Switch context to the Fiber. The Fiber runs the task until either of the following happens:
   a. The Fiber needs to sleep or wait on a resource. A function is called so that the Thread switches off of the Fiber and puts it into a waiting queue.
   b. Or the task is finished. A function is called which makes the Thread switch off of the Fiber and puts it in the free pool.

Currently with Tracy, if a Thread needs to do 2.a. we end all of the current zones but remember them for later. When a Thread resumes a waiting Fiber, we begin again all of the zones that we remembered. This makes it look like all of the functions finished, and were called again later, and all of that time spent waiting isn't represented visually (which isn't ideal).
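The workaround described above can be sketched roughly as follows. This is only an illustration: `ZoneLog` and `FiberZoneStack` are hypothetical names standing in for the profiler backend and the per-fiber bookkeeping, not Tracy's API or the task system's actual code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a profiler backend: it only records what it is told.
struct ZoneLog {
    std::vector<std::string> events;
    void begin(const std::string& name) { events.push_back("begin " + name); }
    void end(const std::string& name)   { events.push_back("end " + name); }
};

// Zones currently open on one fiber, outermost first.
class FiberZoneStack {
public:
    explicit FiberZoneStack(ZoneLog& log) : log_(log) {}

    void enter(const std::string& name) {
        open_.push_back(name);
        log_.begin(name);
    }

    void leave() {
        assert(!open_.empty());
        log_.end(open_.back());
        open_.pop_back();
    }

    // The fiber is about to wait: end every open zone, innermost first,
    // but remember the names so they can be reopened later.
    void suspend() {
        for (auto it = open_.rbegin(); it != open_.rend(); ++it) log_.end(*it);
    }

    // The fiber resumes (possibly on another thread): reopen outermost first.
    void resume() {
        for (const auto& name : open_) log_.begin(name);
    }

private:
    ZoneLog& log_;
    std::vector<std::string> open_;
};
```

With zones "task" and "load" open, a `suspend()` followed by `resume()` logs `end load`, `end task`, `begin task`, `begin load`, which is why the profiler shows the functions as finished and called again, with the waiting time missing in between.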

We'd like to be able to treat Fibers like Threads so that if a Thread isn't running a Fiber we can keep all of the zones pending and show that the Fiber is waiting.

wolfpld · 2020-09-03T21:43:54Z

Thanks for the explanation, now I get the gist of it. I don't think fibers can be aliased to threads, due to the following reasons:

1. Possible conflicts between identifiers (a minor issue).
2. Call stack samples are mapped to threads, so you'd get no information about performance of code running in fibers.
3. You lose information about CPU core usage (threads with fiber payload can migrate or be suspended by the scheduler), which can be valuable information, e.g. if you have a hyperthreaded CPU.

What would be needed instead is:

1. Some way to manually track creation and destruction of fibers (and represent them as thread-like tracks in the profiler).
2. Ability to mark when a fiber starts and stops executing on a thread.

This should be enough for proper support.
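To make the second requirement concrete, here is a minimal sketch of the data such start/stop marks could produce: one thread-like track per fiber, made of intervals during which the fiber ran on some thread. `FiberTracks` and `RunInterval` are hypothetical names used for illustration, not Tracy's internals.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One stretch of time during which a fiber was running on a thread.
struct RunInterval {
    uint32_t thread;  // thread that executed the fiber
    int64_t  begin;   // time of the "start on thread" mark
    int64_t  end;     // time of the "stop on thread" mark, -1 while running
};

class FiberTracks {
public:
    // Mark that `fiber` starts executing on `thread` at time `now`.
    void start(const std::string& fiber, uint32_t thread, int64_t now) {
        auto& track = tracks_[fiber];
        assert(track.empty() || track.back().end >= 0);  // not already running
        track.push_back({thread, now, -1});
    }

    // Mark that `fiber` stops executing at time `now`.
    void stop(const std::string& fiber, int64_t now) {
        auto& track = tracks_.at(fiber);
        assert(!track.empty() && track.back().end < 0);  // must be running
        track.back().end = now;
    }

    const std::vector<RunInterval>& track(const std::string& fiber) const {
        return tracks_.at(fiber);
    }

    // Time the fiber spent suspended between consecutive closed intervals.
    int64_t waiting_time(const std::string& fiber) const {
        const auto& t = tracks_.at(fiber);
        int64_t waited = 0;
        for (std::size_t i = 1; i < t.size(); ++i) {
            if (t[i - 1].end >= 0) waited += t[i].begin - t[i - 1].end;
        }
        return waited;
    }

private:
    std::map<std::string, std::vector<RunInterval>> tracks_;
};
```

The gaps between intervals are exactly the waiting time that the end-and-reopen workaround cannot show, and each interval still records which thread did the work.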

Can you provide an example application, which would replicate your task system, with some mock jobs that represent your usage patterns?

jkriegshauser · 2020-09-03T22:00:29Z

I agree with what you state is needed. Since fibers can be created or destroyed at any point, and are basically just memory until switched to, point no. 1 isn't necessarily needed. Our task system already has an event notification for starting/stopping a fiber on a thread, so your point no. 2 would be ideal.

On Windows, Fibers are supported as part of the OS (i.e. CreateFiber, etc.), but on Linux we use boost fibers.

I'll see if I can produce a simple sample app at some point.

wolfpld · 2020-10-18T20:58:16Z

There has been some progress on this. The interface and needed changes will be minimal, e.g.:

void SwitchToFiber(Fiber *const fiber) {
	TracyFiberStart(fiber->m_name);
	boost_context::jump_fcontext(&m_context, fiber->m_context, fiber->m_arg);
	TracyFiberEnd;
}

With fiber->m_name being a unique string, as described in the manual.

However, there were also unforeseen consequences of these changes. For fiber tracking to work, zone collection within fibers will have to be serialized. I have to think about how to make it work efficiently.

jkriegshauser · 2020-10-19T16:04:41Z

Awesome, thanks for the update. Let me know when you have a release that I can test with. I still haven't had any time to make a sample application.

Minor point: On Windows we're using SwitchToFiber instead of boost, but it should work the same way.

expipiplus1 · 2020-12-02T03:30:40Z

This would also be very useful for supporting Haskell threads! I think that the proposed API of

Fiber created/Fiber destroyed
Fiber start on thread/Fiber end on thread

would work very well.

simonvanbernem · 2021-07-12T17:28:46Z

Just wanted to ask what the current state of this issue is, since I accidentally opened a duplicate before finding this one.

The only thing I'd add is that "execution context" or something like that is probably a better terminology for the API functions than "fibers", because this can be used for not just fibers, but also coroutines, job systems and schedulers for example.

wolfpld · 2021-07-12T17:49:47Z

This is pretty much blocked by the lack of a reliable job-scheduler-type-of-thing. The examples I was provided were of some help, but ultimately I got too tired of having to deal with CMake bullshittery, or of having to figure out the hackeries involved in how the production libraries work.

So, I need something simple that I can reason about. I need to be able to know when the fiber (execution context) execution is started and when it is stopped (by the fiber controller library).

RichieSams/FiberTaskingLib#126 seems to provide some kind of support for what I need, but again, half of that PR is some unrelated variable type changes, which makes me not want to take a look at what this does.

simonvanbernem · 2021-07-12T19:09:45Z

Wait, so we are waiting on an example application? Just some application that uses fibers/coroutines and creates a bit of load, that you can instrument with Tracy, so that you can test the feature and iron out the bugs?

If that's the problem, I should be able to just throw together some dummy application in a cpp file over the weekend.

Or in the next couple hours tbh.

wolfpld · 2021-07-12T19:18:11Z

Basically, yes. I would prefer something that doesn't necessarily use fibers, but rather simulates their usage. The simpler the better.

(Previously I have encountered races, which were hard to trigger and debug. At the same time I had some synchro issues with Vulkan to figure out at work. Things added up.)

simonvanbernem · 2021-07-12T20:43:33Z

@wolfpld here you go. I used real fibers though, since I don't really know how one would simulate their behavior without the real thing. They swap the registers and stack and so on, so there is no way to do this in code trivially.

This went surprisingly well. I have never used fibers on Windows before, but the API is actually pretty nice (good job Windows!) :D.

The application has some workers that pretend to do networking. Let me know if you need anything.

wolfpld · 2021-07-12T20:55:26Z

Thanks, it seems to be simple enough. I'll see what I can do with this.

In the meantime, can you prepare a multithreaded version, with concurrent execution of tasks?

simonvanbernem · 2021-07-12T21:38:55Z

I committed the multi-threaded version where each thread tries to take fibers from a global pool to execute.

I actually messed up the terminology in the first version: Fibers are now jobs, and threads are workers.

simonvanbernem · 2021-07-13T05:58:56Z

Btw, I'd also be willing to test the feature as soon as you have a working version up and running. I have an application at hand that I want to profile that uses co-routines.

nagisa · 2021-08-10T11:32:34Z

I've been thinking about this for a while recently, as there are other general programming patterns that Tracy doesn't support well, and fibers just happen to be one of them. Pipelined processing of data is perhaps the one that interests me the most: here zones can span multiple threads as well, though for somewhat more straightforward reasons. I think it may be worth thinking about decoupling zone data from the thread as an execution context. Perhaps just giving the user the ability to tell Tracy what they consider the "thread" (or in this case a "task") to be for each zone, message, etc. could be a viable solution that does not require much effort to implement? As far as fibers are concerned in particular, each fiber could become a "thread" in Tracy's current visualization (and the user could store the identifier that they share between calls to Tracy as a fiber-local variable or something).

Sampling profiler would still have to work on a per-thread basis, however, but I don't think that's avoidable in the general case.

Here's an example visualization that I made which demonstrates what the visualization could look like. I used colour coding for the threads, but I don't think it's strictly necessary:

wolfpld · 2021-08-10T11:56:14Z

> I think it may be worth thinking about decoupling zone data from the thread as an execution context. Perhaps just giving the user the ability to tell Tracy what they consider the "thread" (or in this case a "task") to be for each zone, message, etc. could be a viable solution that does not require much effort to implement? As far as fibers are concerned in particular, each fiber could become a "thread" in Tracy's current visualization (and the user could store the identifier that they share between calls to Tracy as a fiber-local variable or something).

This is exactly how fiber (task/job/parallel whatever) support will be implemented. And it requires some effort :)

> Here's an example visualization that I made which demonstrates what the visualization could look like. I used colour coding for the threads, but I don't think it's strictly necessary:

Here's what I have in mind:

simonvanbernem · 2021-08-10T18:26:47Z

That is so awesome, I'm glad we got things going again on this!

simonvanbernem · 2021-08-10T18:34:29Z

I was also thinking:

Maybe there should be a job-context-category? If my application uses e.g. a job system with fibers, but I also want to track some different kind of pipelined data processing like @nagisa suggested, then I'd use the same API setting the job-context for the zones. But I have essentially two sets of zones with job-contexts. It would be nice to be able to name one of the groups "fibers" and the other "pipelined data" or something like that, and mark the zones as belonging to one of those groups. And then having the ability to toggle their visibility like threads.

That is a theoretical thing ATM, I don't have a specific use case in mind, just throwing it out there. Not sure if the need for this would actually come up in practice very often.

roeyb1 · 2021-08-26T19:24:13Z

This looks awesome, can't wait to get to try it out!

slembcke · 2021-10-07T20:21:36Z

Oh, this would be lovely. I was just looking into using Tracy with fiber/jobs today.

wolfpld · 2021-10-10T00:37:31Z

Traditionally, each thread in Tracy writes its events to a separate queue that doesn't need to be synchronized or locked in the process. The per-thread async queues are then sent to the server in a random order. This works great as long as events in question are relevant only to a single thread.

Things become problematic when there are interactions between threads. Sometimes the solution is simple, for example when multiple threads produce values on the same plot. All that's needed here to have a coherent view is to sort the plot values by their timestamp.
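The plot case can be sketched in a few lines. This is an illustration under assumed names (`PlotValue`, `merge_plot`), not Tracy's implementation: values arrive in per-thread batches in arbitrary order, and sorting by timestamp is enough to get a coherent view.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One sample of a plot, as produced by some thread.
struct PlotValue {
    int64_t time;
    double  value;
};

// Combine the per-thread batches and order them by timestamp.
// stable_sort keeps the arrival order for equal timestamps, which is
// acceptable for plots because tied points do not need an exact order.
std::vector<PlotValue> merge_plot(const std::vector<std::vector<PlotValue>>& per_thread) {
    std::vector<PlotValue> all;
    for (const auto& batch : per_thread) {
        all.insert(all.end(), batch.begin(), batch.end());
    }
    std::stable_sort(all.begin(), all.end(),
                     [](const PlotValue& a, const PlotValue& b) { return a.time < b.time; });
    return all;
}
```

The same trick does not carry over to events whose exact relative order matters, because two events with the same timestamp cannot be ordered by sorting alone.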

Old versions of Tracy did the same with lock events. There was much work put into reconstructing the lock timeline, when eventually some past lock events did arrive from a forgotten thread. While it seemed to be mostly working, it never really could. With lock events you need to know the exact ordering, and any two (or more) events can have the same timestamp, which makes it impossible to know which one truly happened first. The advent of multicore not only makes this more apparent due to a larger number of threads running at the same time, but also by making the timestamp readings more granular, due to difficulties at the hardware level. Providing a consistent clock across the system is not an easy task when you have many cores, many on-chip dies, or even multiple CPU sockets. The software solution here is to serialize all lock events, which is not ideal, as now you have a lock, and you have contention, and things are not running as smoothly as before. But you can't do this in any other way.

The same is true for fibers, coroutines or any other such technique. Zone events, previously isolated to a single thread, can now hop from one to another and you need to know the exact order across all threads. Hence the need for serialization of even more events. In practice you won't be able to say "this function is only used by fibers, so serialize this, and not the other parts of the code". Your assumption would break sooner or later and you would suddenly be in a very sad place. So, all zones have to be serialized, even the ones that are isolated within a single thread.
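A minimal sketch of what serialization means here, assuming a plain mutex-protected vector rather than Tracy's actual queue: every producer appends under one lock, so the position in the queue is a total order even when timestamps are equal. `SerialQueue`, `Event` and `run_producers` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

struct Event {
    uint32_t thread;  // producer id
    uint32_t index;   // per-producer counter
    int64_t  time;    // timestamp; may be identical across threads
};

// One queue shared by all threads. The lock makes the append order a
// total order, at the cost of contention between producers.
class SerialQueue {
public:
    void push(const Event& e) {
        std::lock_guard<std::mutex> lock(mutex_);
        events_.push_back(e);
    }

    std::vector<Event> drain() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<Event> out;
        out.swap(events_);
        return out;
    }

private:
    std::mutex mutex_;
    std::vector<Event> events_;
};

// Every event deliberately carries the same timestamp (0), so only the
// queue position can tell which one came first.
std::vector<Event> run_producers(uint32_t threads, uint32_t per_thread) {
    SerialQueue queue;
    std::vector<std::thread> workers;
    for (uint32_t t = 0; t < threads; ++t) {
        workers.emplace_back([&queue, t, per_thread] {
            for (uint32_t i = 0; i < per_thread; ++i) queue.push({t, i, 0});
        });
    }
    for (auto& w : workers) w.join();
    return queue.drain();
}
```

In the drained queue each producer's events appear in the order it emitted them, and events from different producers have one unambiguous interleaving.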

This is why fiber support will need to be explicitly enabled by adding a define.

It may be interesting to know how much impact this serialization may have on execution times. Well, it of course depends on many factors, which basically boil down to how much queue contention there is at any given time. The raytracer example is an extremely pessimistic case, because you would never be measuring 30 threads generating 150 million zones in total in a time span of one second. But that's the application I have data for.

With the async per-thread queues the application needs 1.7 seconds to execute (the one second figure above is true, as it excludes the initialization and shutdown routines) and transfers 731 MB of data to the server. Below you can find a histogram of a short-lived function which is executed 50 million times.

When zones are stored in a synchronized queue, the application finishes in one minute and 41 seconds, and needs to send a bit over 2 gigabytes of data. This increase in data size is due to much more frequent thread context switching (caused by interleaving of events), which requires sending context switch notification, and which also invalidates the thread time delta, forcing transfer of a full timestamp, instead of the nicely compressible mostly-zeros time difference from the previous event. The histogram for the same function as above looks dramatically different.
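The size effect can be illustrated with a toy encoding. This is a made-up format for the sake of the example, not Tracy's wire protocol: an event on the same thread as the previous one costs a variable-length time delta, while a change of thread costs a marker byte, a thread id and a full 8-byte timestamp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Event {
    uint32_t thread;
    int64_t  time;
};

// Bytes needed for a LEB128-style variable-length integer.
std::size_t varint_size(uint64_t v) {
    std::size_t n = 1;
    while (v >= 0x80) {
        v >>= 7;
        ++n;
    }
    return n;
}

// Size of the event stream under the toy encoding described above.
std::size_t encoded_size(const std::vector<Event>& events) {
    std::size_t bytes = 0;
    bool first = true;
    uint32_t prev_thread = 0;
    int64_t prev_time = 0;
    for (const Event& e : events) {
        if (first || e.thread != prev_thread) {
            // Thread switch: marker + thread id + full timestamp.
            bytes += 1 + sizeof(uint32_t) + sizeof(int64_t);
        } else {
            // Same thread: only the (small) difference from the previous event.
            assert(e.time >= prev_time);
            bytes += varint_size(static_cast<uint64_t>(e.time - prev_time));
        }
        first = false;
        prev_thread = e.thread;
        prev_time = e.time;
    }
    return bytes;
}
```

Under these assumptions, eight events from two threads take 32 bytes when each thread's events are sent as one run, and 104 bytes when the same events are interleaved one by one.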

All the extra time is of course spent waiting for the lock to become available, as you would expect in case of high contention. I will repeat that this is not your typical use case.

wolfpld · 2021-10-10T15:11:43Z

You can now test the serialization of events by checking out the current master branch and adding the TRACY_FIBERS define to your build settings. Note that this does not enable fiber support yet. Now would be a good time to check whether there are any problems with this new functionality, and what performance impact it has, across a wide variety of code bases.

The affected areas are:

Zone begin and end functionality,
Setting zone text, name, color, value with separate macros and not through the source location data,
Sending text messages,
Capturing zone call stacks,
Receiving crash reports.

Each of the available APIs (C++, C, Lua) should be supported.

jkriegshauser · 2021-10-14T22:42:36Z

Awesome, thanks @wolfpld. Is there a new API to call to notify Tracy that a thread is switching contexts?

wolfpld · 2021-10-14T22:44:00Z

> Note that this does not enable fiber support yet.

jkriegshauser · 2021-10-14T23:01:28Z

> Awesome, thanks @wolfpld. Is there a new API to call to notify Tracy that a thread is switching contexts?

> Note that this does not enable fiber support yet.

The reason why I asked is because such a function could be used to switch which ExplicitProducer is being used to one specific to the Fiber; this ensures that all of the fiber events stay strongly ordered regardless of which thread is running the fiber, and maintains the blisteringly fast speed of ConcurrentQueue. I recently did something similar when switching our internal chrome-tracing profiler from a global ringbuffer with horrible contention to moodycamel::ConcurrentQueue, similar to what Tracy is using. Essentially we have two events: notifyFiberStart(uint64_t fiberId) and notifyFiberStop(uint64_t fiberId). Right before we call jump_fcontext() or SwitchToFiber() we call notifyFiberStop() for the old fiber ID (or we skip calling it if not running a fiber), followed by notifyFiberStart() with the new fiber ID (or we skip calling it if switching back to a thread context without running another fiber).
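A rough sketch of that idea, with plain vectors standing in for the per-producer queues. Only the names notifyFiberStart and notifyFiberStop come from the description above; the rest (`sketch::emit`, the registry, the thread-local pointer) is assumed for illustration and is neither moodycamel's nor Tracy's API.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

namespace sketch {

using EventQueue = std::vector<std::string>;

// Registry of per-fiber queues. Pointers to unordered_map elements stay
// valid when other fibers are added later.
std::mutex g_registryMutex;
std::unordered_map<uint64_t, EventQueue> g_fiberQueues;

// Each thread has its own queue, plus a pointer to the queue in use.
thread_local EventQueue t_threadQueue;
thread_local EventQueue* t_active = nullptr;  // nullptr: use the thread queue

// Called right before switching to the fiber: route events to its queue.
void notifyFiberStart(uint64_t fiberId) {
    std::lock_guard<std::mutex> lock(g_registryMutex);
    t_active = &g_fiberQueues[fiberId];
}

// Called when leaving the fiber: fall back to the thread's own queue.
void notifyFiberStop(uint64_t /*fiberId*/) {
    t_active = nullptr;
}

// Events go to whichever queue is active; no lock on this path.
// Assumes a fiber runs on at most one thread at a time and that the
// scheduler synchronizes the hand-off between threads.
void emit(const std::string& event) {
    EventQueue& queue = t_active ? *t_active : t_threadQueue;
    queue.push_back(event);
}

}  // namespace sketch
```

The lock is taken only on a fiber switch, not per event, and each fiber's events end up in one queue in emission order no matter which threads ran it.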

wolfpld · 2021-10-15T09:51:37Z

I guess I have not considered such an approach, because it would require some hackery on the concurrentqueue side. It certainly makes sense to do things in such a way in the end, but to minimize the number of moving parts which can break, the serialization approach will be used for the time being.

Right now the path is: serialize zones, and then implement fiber-to-thread mapping. These are two separate tasks which you can reason about without needing to think about the impact of the other one. With the concurrentqueue approach you describe, it would only make sense to implement everything in one go.

wolfpld · 2021-11-03T18:12:24Z

There is now a minimal implementation of fiber data collection on the master branch. To enable it, define TRACY_FIBERS and add markup only to the job execution dispatch, e.g.:

void schedule_job(Job_Data* job_data) {
    TracyFiberEnter( job_data->name );
    SwitchToFiber( job_data->fiber );
    TracyFiberLeave;
}

job_data->name should be a const char* string with an address-unique fiber name. Do not add markup in the yield function, which returns control back to the dispatcher:

void job_yield(Job_Data* job_data) {
    SwitchToFiber(worker_data->base_fiber);
}

Make sure that zones are able to complete, e.g. by adding a separate scope:

void job_main(Job_Data* job_data) {
    {
        ZoneScoped;
        job_data->has_job_started = true;
        // ...
        job_data->is_job_done = true;
    }
    job_yield(job_data);
}

Should there be only the function scope, the zone destructor would never be called, as control would never return to job_main() after the last call to job_yield().
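The scoping rule can be seen without any fibers. In the sketch below, `ScopedMark` is a hypothetical stand-in for a scoped zone and `fake_yield()` stands in for job_yield(); unlike the real yield, it returns, which is the only reason the function-scope variant ever logs its end.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

std::vector<std::string> g_log;

// Stand-in for a scoped zone: logs on construction and on destruction.
struct ScopedMark {
    explicit ScopedMark(std::string n) : name(std::move(n)) {
        g_log.push_back("begin " + name);
    }
    ~ScopedMark() { g_log.push_back("end " + name); }
    std::string name;
};

// Stand-in for job_yield(). A real yield switches to another fiber and,
// after the job's last yield, control never comes back.
void fake_yield() { g_log.push_back("yield"); }

// Zone in its own scope: the destructor runs before the yield.
void job_with_inner_scope() {
    {
        ScopedMark zone("job");
        // ... work ...
    }
    fake_yield();
}

// Zone at function scope: the destructor runs only after the yield
// returns, so with a real final yield the zone would never end.
void job_with_function_scope() {
    ScopedMark zone("job");
    // ... work ...
    fake_yield();
}
```

The first variant logs `begin job`, `end job`, `yield`; the second logs `begin job`, `yield`, `end job`, and with a real fiber switch that last entry would be missing.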

wolfpld · 2021-11-04T16:53:02Z

The requirement to go job -> scheduler -> job has been relaxed in 4c77413.

UplinkCoder · 2021-11-04T20:44:15Z

This is huge! I am very happy with how well it works.

UplinkCoder · 2021-11-04T20:51:11Z

Being able to filter the message stream on fibers would be nice. It seems the UI is there, but it doesn't work properly.

wolfpld · 2021-11-06T20:42:00Z

A different approach for internal processing was applied, which should fix issues with messages, or crashes as reported on Discord by @Xenonic. Fibers should now be considered ready to use.

Fiber activity regions are now displayed using context switch data, as was previously described at #106 (comment). This activity data is not integrated with the running thread context switch data. Such functionality may be added later, but it is unlikely to work during a live capture, and will require a save-load of the trace.

Worker threads won't automatically indicate when they are running in the fiber context. This can be easily added on the client, just as another zone.

wolfpld · 2022-07-03T10:34:51Z

Closing, as this is now implemented. Performance improvements will arrive at a later time and won't be tracked with this ticket.

expipiplus1 mentioned this issue Dec 1, 2020

Alternate thread presentation mode for tracy export ethercrow/opentelemetry-haskell#40

Closed

avoroshilov mentioned this issue Dec 22, 2020

Flow events support #149

Open

AWoloszyn mentioned this issue Dec 31, 2020

RFC: how to integrate Tracy (and other custom tracing) for tracing marl google/marl#127

Open

wolfpld added the enhancement New feature or request label Jan 15, 2021

adepke mentioned this issue Jan 18, 2021

Cross-thread fibers cause profiler halt adepke/Jobs#10

Open

simonvanbernem mentioned this issue Jul 12, 2021

Add an "asynchronous zones" feature #241

Closed

wolfpld mentioned this issue Aug 26, 2021

Switch tracy contexts between threads #254

Closed

wolfpld mentioned this issue Nov 21, 2021

Implement a bare-bones C API for graphics profiling #283

Merged

wolfpld closed this as completed Jul 3, 2022
