
Add budgets to Actor queue flushing #9022

Closed
wants to merge 9 commits

Conversation

@arindam1993 (Contributor) commented Nov 25, 2019

This PR tries to achieve a middle ground between the issues reported in #8998 and #8913.
It does this with two main changes:

  • The Actor now processes messages in batches, based on time or size budgets, whichever is reached first (a sketch follows this list).
  • On the main thread, the ThrottledInvoker uses requestAnimationFrame instead of MessageChannel in order to reduce latency for animations.
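As a rough sketch, the budgeted flush could look something like this (the constant values and the processTask callback are illustrative, not the exact PR code):

```js
// Illustrative budgets; the PR tunes the real values per thread.
const TIME_BUDGET_MS = 1; // max time spent flushing per frame
const TASK_BUDGET = 5;    // max tasks flushed per frame

class TaskQueue {
    constructor(processTask, scheduleNextFlush) {
        this.tasks = [];
        this.processTask = processTask;
        this.scheduleNextFlush = scheduleNextFlush;
    }

    add(task) {
        this.tasks.push(task);
        this.scheduleNextFlush();
    }

    // Flush until either budget is exhausted, then yield, so the
    // remainder can still be cancelled before the next frame runs it.
    flush() {
        const start = performance.now();
        let processed = 0;
        while (this.tasks.length > 0 &&
               processed < TASK_BUDGET &&
               performance.now() - start < TIME_BUDGET_MS) {
            this.processTask(this.tasks.shift());
            processed++;
        }
        if (this.tasks.length > 0) this.scheduleNextFlush();
    }
}
```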

I've tried to adjust the batch sizes on the main-thread and worker sides to strike a balance between being able to pre-empt and cancel tasks and maintaining enough throughput for smooth animation.

Testing with IE11, which needs pre-empting the most, the number of requests sent out for the tile-cancellation debug page went down to ~60 from ~120 on 1.5.1-beta. But it's still higher than the ~25 on 1.5.0.
Animation performance on Chrome is roughly the same for me.

@msbarry, I would really appreciate it if you could take a look at this as well!

Launch Checklist

  • briefly describe the changes in this PR

@arindam1993 self-assigned this Nov 25, 2019
@msbarry (Contributor) left a comment

Thanks, this looks like a good compromise! I'm seeing 15-20 updates per second with Chrome on OS X for animate.html. If I bump the main-thread time budget to 2 and the task budget to 20 as well, then it's back up to 30-40 updates per second, which is what I was getting with my original fix.
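In code terms, that tuning would look something like the following (MAIN_THREAD_TIME_BUDGET appears in the diff below; the task-budget constant name is assumed for illustration):

```js
// Doubling the time budget and raising the task budget restored
// ~30-40 updates/sec on Chrome in this experiment (names assumed).
const MAIN_THREAD_TIME_BUDGET = 2;
const MAIN_THREAD_TASK_BUDGET = 20;
```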

```js
this._processQueueTop();
taskCtr++;
}
// We've reached our budget for this frame, defer processingo the rest for the netx frame, this lets
```

some typos here

@mourner (Member) left a comment

Great improvement, but still a bummer to have a 2.5x regression for tiles stuck in a queue compared to the previous version.

One random idea that you already considered perhaps, but still thought I'd mention — can we differentiate the way we process the queue not based on arbitrary frame budgets, but based on a heuristic that somehow categorizes sources as either heavy / rarely updated (e.g. basemap vector tiles) or lightweight / frequently updated (e.g. animating GeoJSON)? That way we could have optimal performance in both cases if we can reliably categorize between the two. Perhaps by tracking how often the source is updated and switching dynamically?
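As a rough illustration of what that dynamic tracking could look like (an editorial sketch; the window and threshold values are made up):

```js
// Classify a source as lightweight/frequently-updated if it has
// received several updates within a recent time window.
const UPDATE_WINDOW_MS = 1000;
const FREQUENT_UPDATE_THRESHOLD = 3;

class SourceUpdateTracker {
    constructor() {
        this.updateTimes = [];
    }

    recordUpdate() {
        const now = performance.now();
        this.updateTimes.push(now);
        // Drop timestamps that have fallen out of the window.
        while (this.updateTimes.length &&
               now - this.updateTimes[0] > UPDATE_WINDOW_MS) {
            this.updateTimes.shift();
        }
    }

    // Frequently-updated sources (e.g. animating GeoJSON) could bypass
    // the budgeted queue; heavy, rarely-updated ones would keep using it.
    isFrequentlyUpdated() {
        return this.updateTimes.length >= FREQUENT_UPDATE_THRESHOLD;
    }
}
```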

@msbarry (Contributor) commented Nov 26, 2019

Also, I wonder if it's necessary to have the task budget at all? If we just have a time budget, then a lot of fast tasks can happen during animation, but it will still throttle heavy operations. If I change the task budget to unlimited and both time budgets to 2, then I'm back up to ~30 updates per second.

@kkaefer (Member) left a comment

How did you determine that having both a time and a task budget achieves the best compromise?

The original issue fixed in #8633 was that tile requests sent to a worker couldn't be cancelled due to tasks being invoked in serial order. This is primarily a concern when loading vector tiles. The animation performance regression reported in #8701 doesn't use vector tile loading; instead it uses GeoJSON sources with repeated setData. The GeoJSON source has its own mechanism for deduplicating updates (self-sent coalesce messages), similar to how native does it. Over in the original PR, I wrote

I opted for this model instead of a more generic priority queue because we don't have any use cases for a priority queue other than canceling tasks.

However, given that we actually do seem to have differing requirements for worker processing, a priority queue may actually make more sense. Alternatively, we could mark some tasks with an immediate flag which means they get executed immediately instead of put into the queue. This means that tasks may be executed out of order, but if we always set the immediate flag for all requests associated with a particular source (e.g. GeoJSON sources), this may be fine.
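A rough sketch of what that immediate flag could look like on the receiving side (the message shape and names here are simplified, not the actual Actor API):

```js
class Actor {
    constructor(processTask, scheduleFlush) {
        this.tasks = [];
        this.processTask = processTask;
        this.scheduleFlush = scheduleFlush; // e.g. a ThrottledInvoker trigger
    }

    receive(message) {
        if (message.immediate) {
            // Bypass the queue entirely, accepting that this message may
            // run out of order relative to queued ones (e.g. frequent
            // GeoJSON setData updates that shouldn't wait behind tile parsing).
            this.processTask(message);
        } else {
            // Queue it, so it can still be cancelled before it runs.
            this.tasks.push(message);
            this.scheduleFlush();
        }
    }
}
```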

```js
import {serialize, deserialize} from './web_worker_transfer';
import ThrottledInvoker from './throttled_invoker';

import type {Transferable} from '../types/transferable';
import type {Cancelable} from '../types/cancelable';

// Upper limit on the time, in ms, that the actor allocates to flushing the task queue per frame
const MAIN_THREAD_TIME_BUDGET = 1;
```

We're measuring time in integer milliseconds, so a limit of 1ms is very coarse.
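If finer granularity were needed, performance.now() returns fractional milliseconds where supported (a sketch; browsers may coarsen the clock for security reasons):

```js
// Prefer performance.now(), which returns fractional milliseconds,
// over Date.now(), which only has whole-millisecond resolution.
const now = (typeof performance !== 'undefined' && performance.now) ?
    () => performance.now() :
    () => Date.now();

const start = now();
// ... flush some tasks ...
const elapsedMs = now() - start; // fractional where supported
```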

@mourner (Member) commented Nov 26, 2019

if we always set the immediate flag for all requests associated with a particular source (e.g. GeoJSON sources)

I thought about type-based categorization too, but GeoJSON sources aren't always used for the light-data, fast-updates use case; e.g. the GeoJSON might be pretty heavy while never updating, making it behave like vector tiles.

@arindam1993 (Contributor, Author)

Great improvement, but still a bummer to have a 2.5x regression for tiles stuck in a queue compared to the previous version.

One random idea that you already considered perhaps, but still thought I'd mention — can we differentiate the way we process the queue not based on arbitrary frame budgets, but based on a heuristic that somehow categorizes sources as either heavy / rarely updated (e.g. basemap vector tiles) or lightweight / frequently updated (e.g. animating GeoJSON)? That way we could have optimal performance in both cases if we can reliably categorize between the two. Perhaps by tracking how often the source is updated and switching dynamically?

Honestly, animation performance in IE11 is terrible by default, so the best heuristic might be detecting the underlying platform; all modern browsers can cancel fetch in the worker anyway.
But dynamically turning up the task budget when the map is animating, i.e. when a source is being updated frequently, is also viable; I don't know if that's too much complexity compared to browser detection.
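For the detection route, feature-detecting cancellable requests is likely safer than UA sniffing; a rough sketch (the budget constant and values are illustrative):

```js
// Browsers with AbortController can cancel an in-flight fetch from the
// worker, so the queue can flush aggressively; without it (e.g. IE11),
// keep a small task budget so queued requests can still be pre-empted.
const canAbortFetch = typeof AbortController !== 'undefined';

const WORKER_TASK_BUDGET = canAbortFetch ? Infinity : 5;
```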

Also, I wonder if it's necessary to have the task budget at all? If we just have a time budget, then a lot of fast tasks can happen during animation but it will throttle heavy operations. If I change task budget to unlimited and both time budgets to 2, then I'm back up to ~30 updates per second.

The task budget is for IE11 cancellation; in my testing, no modern browser (Chrome/Firefox) ever hit that limit, since it's measuring time for a synchronous loop, which is incredibly fast anyway. By removing the task budget, modern browsers always end up processing everything in the queue every frame, which results in great animation performance, but since there is no deferral anymore, the ability to pre-empt tasks is lost in IE11.

How did you determine that both a time and a task budget achieves the best compromise?

I started out with a time budget only, but I found that modern browsers never hit it, while IE11 hit it rather inconsistently and wasn't leaving enough tasks in the queue to allow for any decent amount of request cancellation. I then added the task budget and tuned it until I got to a compromise between request cancellation and animation updates. The time budget mostly prevents some main-thread hitching on IE11, as far as I can tell.

The original issue fixed in #8633 was that tile requests sent to a worker couldn't be cancelled due to tasks being invoked in serial order. This is primarily a concern when loading vector tiles. The animation performance regression reported in #8701 doesn't use vector tile loading; instead it uses GeoJSON sources with repeated setData. The GeoJSON source has its own mechanism for deduplicating updates (self-sent coalesce messages), similar to how native does it. Over in the original PR, I wrote

I opted for this model instead of a more generic priority queue because we don't have any use cases for a priority queue other than canceling tasks.

However, given that we actually do seem to have differing requirements for worker processing, a priority queue may actually make more sense. Alternatively, we could mark some tasks with an immediate flag which means they get executed immediately instead of put into the queue. This means that tasks may be executed out of order, but if we always set the immediate flag for all requests associated with a particular source (e.g. GeoJSON sources), this may be fine.

@ansis (Contributor) commented Nov 27, 2019

I think there is a lot we could do to improve queuing, cancellation, and prioritization, but I think we should focus more narrowly on fixing the regression (if it exists) so that we can get the beta out.

From the numbers here it's not clear that something regressed in IE.

Testing with IE11, which needs pre-empting the most, the number of requests sent out for the tile-cancellation debug page went down to ~60 from ~120 on 1.5.1-beta. But it's still higher than the ~25 on 1.5.0.

The only way to cancel a request before it starts is by delaying it. What these numbers suggest to me is that we are delaying requests less. This is a good thing?

#8633 tried to fix the problem that requests that were already cancelled were initiated anyway, because the message saying they were cancelled hadn't been opened yet. That was happening because workers do large tasks that block for long chunks of time. My guess is that that is not what is happening now, and that the delay in 1.5.0 was much longer than the time it took to open all the waiting messages.

Zooming at the wrong speed in any browser will trigger a lot of requests and cancellations.

If IE was striking a better balance in 1.5.0, it seems it was semi-random rather than something intentionally tuned to delay requests during rapid zooming.

If we need to fix the rapid zooming / cancellation problem now, I think we should look into:

@ansis mentioned this pull request Nov 27, 2019
@arindam1993 (Contributor, Author)

Closing since #9031 fixes this!
