Distinct operator performance issues. #2009

bafolts · 2016-10-06T16:14:28Z

RxJS version:
5.0.0-rc1

Code to reproduce:
Create a distinct observable with 100k unique items. It gets progressively slower because distinct internally uses a serial search to check whether each item has been seen.

var rx = require("rxjs/Rx");

var s = new rx.Subject();

s.distinct()
    .count()
    .subscribe((n) => {
        console.log(n);
    });

for (var i = 0; i < 100000; i++) {
    s.next(i);
}

s.complete();

Expected behavior:
This code should finish under 1 second.

Actual behavior:
This code takes minutes to run.

Additional information:
The internals of distinct need to use a set to determine whether an item has already been observed. The serial search will not scale. I stumbled upon this issue by not fully understanding how distinct worked.
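
To make the scaling difference concrete, here is a minimal sketch in plain JavaScript over arrays rather than RxJS internals. The names `distinctLinear` and `distinctWithSet` are hypothetical and not part of RxJS. They only illustrate why a serial search costs on the order of n^2 comparisons for n unique values, while a Set lookup avoids rescanning everything seen so far.

```javascript
// Hypothetical helpers for illustration only; this is not RxJS code.

// Serial search: each incoming value is compared against every value
// seen so far, so n unique values need roughly n^2 / 2 comparisons.
function distinctLinear(values) {
  const seen = [];
  for (const value of values) {
    let found = false;
    for (let i = 0; i < seen.length; i++) {
      if (seen[i] === value) {
        found = true;
        break;
      }
    }
    if (!found) {
      seen.push(value);
    }
  }
  return seen;
}

// Set lookup: membership is checked by the runtime's Set, which does not
// rescan all earlier values. Note that Set treats NaN as equal to NaN,
// unlike the === comparison above.
function distinctWithSet(values) {
  const seen = new Set();
  const result = [];
  for (const value of values) {
    if (!seen.has(value)) {
      seen.add(value);
      result.push(value);
    }
  }
  return result;
}
```

Both functions return the same result for ordinary values; only the cost of the membership check differs.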


jadbox · 2016-10-14T15:47:14Z

"RxJS version: most recent" Please be specific: Is it RC1?

bafolts · 2016-10-17T18:24:23Z

https://github.com/ReactiveX/rxjs/blob/master/src/operator/distinct.ts#L68-L73

Whatever version this is, it uses a for loop to search the previously seen values, which leads to the poor performance.

benlesh · 2016-10-17T20:56:45Z

@bafolts Yeah, that is a problem. I'm surprised it made it this long without being pointed out. The distinct operator was added during a phase where we were trying to get functional parity with v3. However, that operator even changed since then for v4. I think we can add the keySelector and probably entirely drop the comparer. I'm not sure who wants to check for distinct in that way anyhow.

I should have a PR for this later today.

We'll probably:

Add keySelector
Drop distinctKey operator entirely.
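
A rough sketch of how a key selector could cover the distinctKey use case, written over a plain array rather than as an RxJS operator. `distinctBy` is a hypothetical name, and defaulting to the value itself when no selector is given is an assumption, not a description of what the PR does.

```javascript
// Hypothetical helper for illustration only; this is not the RxJS API.
// keySelector maps each value to the key used for the uniqueness check.
function distinctBy(values, keySelector) {
  // Assumption: with no selector, the value itself is the key.
  const select = keySelector || ((value) => value);
  const seenKeys = new Set();
  const result = [];
  for (const value of values) {
    const key = select(value);
    if (!seenKeys.has(key)) {
      seenKeys.add(key);
      result.push(value); // the original value is kept, not the key
    }
  }
  return result;
}

const users = [
  { id: 1, name: "a" },
  { id: 2, name: "b" },
  { id: 1, name: "c" },
];

// Selecting a property by function plays the role of a distinctKey("id") call.
const uniqueUsers = distinctBy(users, (user) => user.id);
```

Here `uniqueUsers` holds the first object seen for each `id`, so the third entry is dropped.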

benlesh · 2016-10-17T20:57:14Z

cc/ @mattpodwysocki @staltz @trxcllnt @jayphelps

jayphelps · 2016-10-18T04:23:12Z

My completely unscientific testing suggests to me that we should indeed utilize a set of some kind, ideally using an ES6 Set in modern browsers but falling back to an array with an indexOf check for older platforms like IE9-10.

Again, not scientific but I feel pretty safe in my belief that Sets are super fast compared to any approach that uses Arrays--at least in Chrome. And it makes sense too cause the underlying runtime can implement a true HashSet instead of needing to loop through all the items.

next_Set 25.91 ms
next_indexOf 2775.84 ms
next_loop 4547.62 ms

The downside to using this would be that we could no longer accept the same compare function we do today, because that relies on arbitrary comparison of each individual item aka a loop. compare = (prev, next) => prev !== next.

In some distant future when we only support IE11+ we could accept a compare function that is provided the Set and then expected to do something with it to decide if the value is distinct, like the default (set, value) => set.has(value) which would allow you to iterate over each item in the Set if you really wanted to.

Would love to hear others' thoughts on this and if they have alternative, IE9+ supported solutions. Without proof, I feel that 99% of people using distinct do not need a compare callback, and if true I'd rather the perf of the default implementation be literally 100x better and drop that feature.
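
The Set-with-fallback idea can be sketched roughly as follows. `createSeenStore` and `distinctValues` are hypothetical names, the optional `SetImpl` parameter exists only so the fallback path can be exercised on a modern runtime, and the code sticks to ES5 syntax on the assumption that it would have to run on IE9.

```javascript
// Hypothetical helpers for illustration only; this is not RxJS code.

// Returns a store with has/add. Uses a native Set when one is available,
// otherwise an array with an indexOf check. Pass null to force the fallback.
function createSeenStore(SetImpl) {
  if (SetImpl === undefined && typeof Set === "function") {
    SetImpl = Set;
  }
  if (typeof SetImpl === "function") {
    var set = new SetImpl();
    return {
      has: function (value) { return set.has(value); },
      add: function (value) { set.add(value); }
    };
  }
  // Fallback: still a linear scan per lookup, just done by indexOf.
  // indexOf uses strict equality, so NaN is never found here,
  // whereas a Set does match NaN.
  var items = [];
  return {
    has: function (value) { return items.indexOf(value) !== -1; },
    add: function (value) { items.push(value); }
  };
}

// Keeps the first occurrence of each value, using whichever store is given.
function distinctValues(values, store) {
  var result = [];
  for (var i = 0; i < values.length; i++) {
    if (!store.has(values[i])) {
      store.add(values[i]);
      result.push(values[i]);
    }
  }
  return result;
}
```

Calling `distinctValues(values, createSeenStore())` takes the Set path where available, and `distinctValues(values, createSeenStore(null))` takes the array path.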

bafolts · 2016-10-18T14:33:14Z

A set is definitely the way to go. Instead of a compare method, we could use a method that generated the hash key for the item. If the hash key had to be a string then IE9 could use generated Object properties as a set. I bet people may use the compare callback to select a property to compare on an object. They could use pluck but then they may lose their original object. With a compare callback they can get distinct objects without losing their original object. If users could provide a function for how to hash their Object I think they could convert their current compare methods fine. If a string return type was enforced we should also be fine for all browsers. IE9 users would have to provide the callback function.

Observable.fromEvent(document, "mousemove")
    .distinct((event: MouseEvent) => {
        // Proposed key function: two events with the same coordinates share a key.
        return `${event.clientX}:${event.clientY}`;
    });

Internally IE9 can now use the poor man's hash from days of yore {} and Object.prototype.hasOwnProperty.
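
A minimal sketch of that object-as-hash approach, assuming the key function always returns a string. `distinctByStringKey` is a hypothetical name, and the "$" prefix is an addition of this sketch to keep keys such as "__proto__" from colliding with special property names.

```javascript
// Hypothetical helper for illustration only; this is not RxJS code.
// Uses a plain object as a set of string keys, so it needs no ES6 Set.
function distinctByStringKey(values, keySelector) {
  var seen = {};
  var hasOwn = Object.prototype.hasOwnProperty;
  var result = [];
  for (var i = 0; i < values.length; i++) {
    // Prefix the key so that strings like "__proto__" are stored as
    // ordinary own properties.
    var key = "$" + String(keySelector(values[i]));
    if (!hasOwn.call(seen, key)) {
      seen[key] = true;
      result.push(values[i]); // keep the original object, not the key
    }
  }
  return result;
}

var points = [
  { clientX: 1, clientY: 2 },
  { clientX: 1, clientY: 2 },
  { clientX: 3, clientY: 4 }
];

var uniquePoints = distinctByStringKey(points, function (point) {
  return point.clientX + ":" + point.clientY;
});
```

`uniquePoints` keeps the first object for each coordinate pair, so the original objects survive, which is the advantage over pluck mentioned above.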

lock · 2018-06-06T22:41:02Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

benlesh self-assigned this Oct 17, 2016

jayphelps closed this as completed in 89612b2 Oct 26, 2016

lock bot locked as resolved and limited conversation to collaborators Jun 6, 2018
