Add budgets to Actor queue flushing #9022

arindam1993 · 2019-11-25T21:55:46Z

This PR tries to achieve some middleground between the issues reported in #8998 and #8913.
It tries to do this with 2 main changes:

The Actor on processes messages in batches, based on time or size budgets, whichever is reached first.
On the main thread the ThrottledInvoker uses requestAnimationFrame instead of MessageChannel in order to reduce latency for animations.

I've tried to adjust the batch sizes on the main-thread and worker side so we can have a balance between being able to pre-empt and cancel tasks, while maintaining enough throughput for smooth animation.

Testing with IE11, which needs pre-empting the most, the number of requests sent out for the tile-cancellation debug page went down to ~60 from the ~120 on 1.5.1-beta. But its still higher than the ~25 on 1.5.0.
Animation performance on Chrome is roughly the same for me.

@msbarry , would really appreciate if you could take a look at this as well!

Launch Checklist

briefly describe the changes in this PR

msbarry

Thanks, this looks like a good compromise! I'm seeing 15-20 updates per second with chrome on osx for animate.html. If I bump the main thread time budget to 2 and task budget to 20 as well, then it's back up to 30-40 updates per second which is what I was getting with my original fix.

msbarry · 2019-11-26T00:55:31Z

src/util/actor.js

+            this._processQueueTop();
+            taskCtr++;
+        }
+        // We've reached our budget for this frame, defer processingo the rest for the netx frame, this lets


some typos here

mourner

Great improvement, but still a bummer to have a 2.5x regression for tiles stuck in a queue compared to the previous version.

One random idea that you already considered perhaps, but still thought I'd mention — can we differentiate the way we process the queue not based on arbitrary frame budgets, but based on a heuristic that somehow categorizes sources as either heavy / rarely updated (e.g. basemap vector tiles) or lightweight / frequently updated (e.g. animating GeoJSON)? That way we could have optimal performance in both cases if we can reliably categorize between the two. Perhaps by tracking how often the source is updated and switching dynamically?

msbarry · 2019-11-26T13:42:11Z

Also, I wonder if it's necessary to have the task budget at all? If we just have a time budget, then a lot of fast tasks can happen during animation but it will throttle heavy operations. If I change task budget to unlimited and both time budgets to 2, then I'm back up to ~30 updates per second.

kkaefer

How did you determine that both a time and a task budget achieves the best compromise?

The original issue fixed in #8633 was that tile requests sent to a worker couldn't be cancelled due to tasks being invoked in serial order. This is primarily a concern when loading vector tiles. The animation performance regression reported in #8701 doesn't use vector tile loading; instead it uses GeoJSON sources with repeated setData. The GeoJSON source has its own mechanism for deduplicating updates (self-sent coalesce messages), similar to how native does it. Over in the original PR, I wrote

I opted for this model instead of a more generic priority queue because we don't have any use cases for a priority queue other than canceling tasks.

However, given that we actually do seem to have differing requirements for worker processing, a priority queue may actually make more sense. Alternatively, we could mark some tasks with an immediate flag which means they get executed immediately instead of put into the queue. This means that tasks may be executed out of order, but if we always set the immediate flag for all requests associated with a particular source (e.g. GeoJSON sources), this may be fine.

kkaefer · 2019-11-26T13:39:14Z

src/util/actor.js

 import {serialize, deserialize} from './web_worker_transfer';
 import ThrottledInvoker from './throttled_invoker';

 import type {Transferable} from '../types/transferable';
 import type {Cancelable} from '../types/cancelable';

+// Upper limit on time in ms, the actor allocates towards flushing the task queue per frame
+const MAIN_THREAD_TIME_BUDGET = 1;


We're measuring time in integer milliseconds, so a limit of 1ms very coarse.

mourner · 2019-11-26T14:48:45Z

if we always set the immediate flag for all requests associated with a particular source (e.g. GeoJSON sources)

I thought about type-based categorization too, but GeoJSON sources aren't always used with light-data-fast-updates use case — e.g. the GeoJSON might be pretty heavy while never updating, making it similar in behavior to vector tiles.

arindam1993 · 2019-11-26T19:53:48Z

Great improvement, but still a bummer to have a 2.5x regression for tiles stuck in a queue compared to the previous version.

One random idea that you already considered perhaps, but still thought I'd mention — can we differentiate the way we process the queue not based on arbitrary frame budgets, but based on a heuristic that somehow categorizes sources as either heavy / rarely updated (e.g. basemap vector tiles) or lightweight / frequently updated (e.g. animating GeoJSON)? That way we could have optimal performance in both cases if we can reliably categorize between the two. Perhaps by tracking how often the source is updated and switching dynamically?

Honestly, animation performance in IE11 is terrible by default anyways, so the best heuristic might be detecting the underlying platform, all modern browsers can cancel fetch in the worker anyways.
But dynamically turning up the task-budget when the map is animating i,e when a source is being updated frequently is also viable, idk if thats too much complexity vs the browser detection.

Also, I wonder if it's necessary to have the task budget at all? If we just have a time budget, then a lot of fast tasks can happen during animation but it will throttle heavy operations. If I change task budget to unlimited and both time budgets to 2, then I'm back up to ~30 updates per second.

The task budget is for IE11 cancellation, in my testing no modern browser (Chrome/ Firefox) ever hit that limit, since its measuring time for a synchronous loop which is incredibly fast anyways. By removing the task budget modern browsers always end up processing everything in the queue every frame, which results in great animation performance, but since there is no deferral anymore the ability to pre-empt tasks is lost in IE11.

How did you determine that both a time and a task budget achieves the best compromise?

I started out with a time budget only, but I found out modern browsers never hit it ,but IE11 did hit it rather inconsistently but it wasn't leaving enough tasks in the queue to allow for any decent amount of request cancellation. I then added the task budget and tuned it until I was able to get to a compromise between request cancellation and animation updates. The tiome budget mostly prevents some main thread hitching on IE11 afaict.

The original issue fixed in #8633 was that tile requests sent to a worker couldn't be cancelled due to tasks being invoked in serial order. This is primarily a concern when loading vector tiles. The animation performance regression reported in #8701 doesn't use vector tile loading; instead it uses GeoJSON sources with repeated setData. The GeoJSON source has its own mechanism for deduplicating updates (self-sent coalesce messages), similar to how native does it. Over in the original PR, I wrote

I opted for this model instead of a more generic priority queue because we don't have any use cases for a priority queue other than canceling tasks.

However, given that we actually do seem to have differing requirements for worker processing, a priority queue may actually make more sense. Alternatively, we could mark some tasks with an immediate flag which means they get executed immediately instead of put into the queue. This means that tasks may be executed out of order, but if we always set the immediate flag for all requests associated with a particular source (e.g. GeoJSON sources), this may be fine.

ansis · 2019-11-27T05:20:42Z

I think there is a lot we could do to improve queuing, cancellation, prioritization but I think we should focus more narrowly on fixing the regression (if it exists) so that we can get the beta out.

From the numbers here it's not clear that something regressed in IE.

Testing with IE11, which needs pre-empting the most, the number of requests sent out for the tile-cancellation debug page went down to ~60 from the ~120 on 1.5.1-beta. But its still higher than the ~25 on 1.5.0.

The only way you cancel a request before it starts is by delaying it. What these numbers suggest to me is that we are delaying requests less. This is a good thing?

#8633 tried to fix the problem that requests that were already cancelled were initiated anyways because the message that they weren't cancelled hadn't been opened yet. That was happening because workers do large tasks that delay for long chunks of time. My guess is that that is not what is happening now. My guess is that the delay in 1.5.0 was much longer than the time it took to open all the waiting messages.

Zooming at the wrong speed in any browser will trigger a lot of requests and cancellations.

If IE was striking a better balance in 1.5.0 it seems it was semi-random rather than something intentionally tuned to delay requests during rapid zooming.

If we need to fix the rapid zooming / cancellation problem now I think we should look into:

some direct increasing delay for tile requests during rapid zooming (eg a setTimeout before the tile request gets sent to the worker)
or limiting the number of concurrent requests: experimental branch here https://github.com/mapbox/mapbox-gl-js/tree/ie-alt-fix

arindam1993 · 2019-11-27T19:47:35Z

Closing since #9031 fixes this!

Arindam Bose added 2 commits November 22, 2019 21:21

Using rAF based throttled queue on main thread

2f4d84f

Cleanup logic and add bugets to process()

d2405ce

arindam1993 requested a review from kkaefer November 25, 2019 21:55

arindam1993 self-assigned this Nov 25, 2019

Arindam Bose added 5 commits November 25, 2019 13:57

Fix lint errors

04314f9

Fix lint error in Actor

c377bb3

Fix globalScope check

ede3ca1

Add per tick budget to worker side as well

fa6a0ec

Fix lint errors

7959ae0

msbarry reviewed Nov 26, 2019

View reviewed changes

Arindam Bose added 2 commits November 25, 2019 17:33

Fix some comment typos

ff8f2a9

Cleanup comment

eb2a581

mourner reviewed Nov 26, 2019

View reviewed changes

kkaefer reviewed Nov 26, 2019

View reviewed changes

ansis mentioned this pull request Nov 27, 2019

re-add queuing for getResource calls #9031

Merged

3 tasks

arindam1993 closed this Nov 27, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add budgets to Actor queue flushing #9022

Add budgets to Actor queue flushing #9022

arindam1993 commented Nov 25, 2019 •

edited

Loading

msbarry left a comment

msbarry Nov 26, 2019

mourner left a comment

msbarry commented Nov 26, 2019 •

edited

Loading

kkaefer left a comment

kkaefer Nov 26, 2019

mourner commented Nov 26, 2019

arindam1993 commented Nov 26, 2019

ansis commented Nov 27, 2019

arindam1993 commented Nov 27, 2019

Add budgets to Actor queue flushing #9022

Add budgets to Actor queue flushing #9022

Conversation

arindam1993 commented Nov 25, 2019 • edited Loading

Launch Checklist

msbarry left a comment

Choose a reason for hiding this comment

msbarry Nov 26, 2019

Choose a reason for hiding this comment

mourner left a comment

Choose a reason for hiding this comment

msbarry commented Nov 26, 2019 • edited Loading

kkaefer left a comment

Choose a reason for hiding this comment

kkaefer Nov 26, 2019

Choose a reason for hiding this comment

mourner commented Nov 26, 2019

arindam1993 commented Nov 26, 2019

ansis commented Nov 27, 2019

arindam1993 commented Nov 27, 2019

arindam1993 commented Nov 25, 2019 •

edited

Loading

msbarry commented Nov 26, 2019 •

edited

Loading