Replace Math.random to prevent duplicate event IDs #499
Comments
Are you running the event fingerprint enrichment? Could you reshare the above table with the event fingerprint output as well? Trying to understand what type of duplicates these are.
For this dataset, which is a bit older, we did not have that enrichment enabled (if it even was available back then). I'll check whether we have it running on the current SP stack. But hashing the whole event does not make a difference: if time, IP address and user agent are different, it will most definitely result in a different hash!
Thanks @christoph-buente !
Looking at the user agent, I came to suspect that the actual implementation of the uuid algorithm falls back to Math.random() in certain browsers. Can this be confirmed? That would be a bug in the underlying uuid npm module, right?
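For reference, the pattern in question looks roughly like the sketch below (a simplified illustration of the browser RNG fallback, not the uuid module's exact code):

```javascript
// Sketch of a browser RNG with a Math.random() fallback, in the spirit
// of uuid's rng-browser.js. Returns 16 bytes for building a v4 UUID.
function rng() {
  const cryptoObj =
    (typeof crypto !== 'undefined' && crypto) ||
    (typeof msCrypto !== 'undefined' && msCrypto) || // old IE prefix
    null;

  if (cryptoObj && cryptoObj.getRandomValues) {
    // Preferred path: a cryptographically strong source
    const bytes = new Uint8Array(16);
    cryptoObj.getRandomValues(bytes);
    return bytes;
  }

  // Fallback path: Math.random(), which some bots and old browsers
  // implement poorly; this is where the duplicate event_ids come from
  const bytes = new Array(16);
  for (let i = 0; i < 16; i++) {
    bytes[i] = (Math.random() * 256) | 0;
  }
  return bytes;
}
```

The fallback produces bytes that look fine in isolation; the problem only shows up across many independent clients whose Math.random() streams repeat or share state.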
Hi @christoph-buente. This is an issue we have written about before: http://discourse.snowplowanalytics.com/t/de-deduplicating-events-in-hadoop-and-redshift-tutorial/248 (in case the solutions outlined there are useful).
@bogaert Thanks Chris, I read that article. But I hoped there was a chance to fix the issue at the source instead of working around it further down the data pipeline.
@christoph-buente That makes sense. We'll look into it (cc @jbeemster).
Bump 😄
Hi @christoph-buente. It was good meeting in Berlin! This is something we'll keep investigating, but in the meantime we're also working on an improved deduplication step in the pipeline: https://github.com/snowplow/snowplow/issues?q=is%3Aopen+is%3Aissue+milestone%3A%22R8x+%5BHAD%5D+Synthetic+dedupe%22
From @falschparker82: Hi Snowplowers, we've got heavy problems here due to some bots (especially Googlebot, googleweblight and others) apparently using a broken version of Math.random(), leading to duplicated event_ids. This gets especially annoying because of the huge cartesian joins that result from heavy context use. I could track the problem down to the uuid library used by the JavaScript tracker: https://github.com/kelektiv/node-uuid/blob/master/lib/rng-browser.js#L17 - which happens to eventually fall back to Math.random(). Unfortunately, the implementation of Math.random() is broken in some environments (see http://stackoverflow.com/a/24224089/1281376). I'd like to propose the following solution for this problem:
1a) Make it possible to seed the RNG explicitly.
1b) Alternatively - or as a fallback to 1a) - generate the seed vector from the following entropy sources:
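A minimal sketch of the idea in 1b): seed a small deterministic PRNG from several entropy sources instead of relying on Math.random() alone. The mulberry32 generator and the particular sources below (timestamp, high-resolution clock, user agent) are illustrative choices of mine, not part of the original proposal:

```javascript
// mulberry32: a tiny, well-known 32-bit PRNG. Deterministic given a seed,
// so the quality of the output depends entirely on seed diversity.
function mulberry32(seed) {
  return function () {
    seed = (seed + 0x6D2B79F5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), seed | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // in [0, 1)
  };
}

// Mix several per-client entropy sources into a seed, so two clients
// with broken Math.random() still diverge from each other.
function entropySeed() {
  let seed = Date.now();
  if (typeof performance !== 'undefined' && performance.now) {
    seed ^= Math.floor(performance.now() * 1000); // sub-ms clock
  }
  if (typeof navigator !== 'undefined' && navigator.userAgent) {
    // Cheap string hash of the user agent
    for (let i = 0; i < navigator.userAgent.length; i++) {
      seed = (seed * 31 + navigator.userAgent.charCodeAt(i)) | 0;
    }
  }
  return seed;
}

const rand = mulberry32(entropySeed());
```

This does not make the IDs cryptographically strong; it only reduces the chance that two independent clients walk the same random sequence.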
Thoughts? Right now I've got a little too much on my plate, so feel free to grab it - but if we happen to do it ourselves eventually, we'd love to upstream it in case you're interested.
Scheduling into 2.8.0 as this is important.
re-hi! (I was a little confused at first, since the issue link was cross-project, so #499 ended up linking to the wrong issue on snowplow/snowplow.) As said, so much for the thoughts; I can't do it myself right now due to lack of time, unfortunately. Just a quick clarification regarding:
I was actually thinking about templating the tag itself when rendering a web page, not the Javascript tracker. So more something along the lines of (thinking in PHP here for simplicity / demonstration purposes):
Addendum: I'd be happy to review though - a few implementation hints:
Any update on this one? Can I help?
Hey @falschparker82 - we pushed it back because we are assigning a new (to Snowplow trackers) engineer to the 2.8.0 release and this is a complex ticket. To bring it back into 2.8.0 again, I think we would need a PR from you...
Unfortunately, we're super low on frontend capacity right now, and JavaScript isn't exactly my home turf. I'll see if we can make a slot for it somewhere in the next weeks, but no high hopes right now. Will touch base in a month at the latest.
@alexanderdean do you have capacity now to tackle that issue? It did not disappear, unfortunately. 😭
Potentially @christoph-buente - we should know within the next month how much bandwidth we have for JS Tracker work...
Much needed; any news on this issue?
For the record, I've put a comment on the
Thanks @anagytherealone !
Is there any more news on this issue? What's the plan? How are things progressing? We've noticed the same thing: clashes in snowplow-generated ids. We've also seen clashes across ids.

There seems to be a general feeling here that with a good enough random number generator you'll get an essentially unique UUID. While this is true when using a single stream of random numbers, it is not true when you have several streams of random numbers not sharing state. Since there are many, many trackers running in parallel, any guarantee on the uniqueness of the ids falls over and is at the whim of how the generators are seeded.

Bloating the codebase with a more sophisticated random number generator is overkill. Random number generators like the Mersenne Twister are designed to be robust to a whole suite of statistical tests, the goal being that the generated numbers can be used in simulations. For a UUID, all that's needed is a long period; even a basic random number generator would suffice.

In my opinion, the correct approach is to simply hash a whole bunch of context relating to the event.
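The hashing approach suggested above could look like the sketch below. FNV-1a is an arbitrary cheap hash chosen for illustration, and the field names are illustrative stand-ins, not Snowplow's exact atomic schema:

```javascript
// FNV-1a: a fast, non-cryptographic 32-bit string hash
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0).toString(16);
}

// Derive an identifier from event context rather than pure randomness,
// so two genuinely different events cannot share it even if their
// RNG-generated ids collide. Field names here are hypothetical.
function eventFingerprint(event) {
  return fnv1a([
    event.collector_tstamp,
    event.user_ipaddress,
    event.useragent,
    event.page_url
  ].join('\u0000')); // separator avoids field-boundary ambiguity
}
```

A 32-bit hash is too small on its own for millions of events; in practice the fingerprint would be combined with the random id (or a wider hash used), but the principle is the same.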
Thanks for your comments @smalory
Great; thanks for the info Paul!
I am investigating a similar issue. Has anyone tested whether crypto.getRandomValues() works better for googleweblight or the googlebot? We also see an issue when the user agent is simply 'Mozilla 5.0' or contains adsbot.
Time for an update on this one given v3 is now available. Unfortunately, little has changed at the moment: the issue arises due to falling back to Math.random() when crypto.getRandomValues() isn't available. As we continued to support older browsers with v3, this fallback has stayed in place. Until we drop support for IE9 and IE10, we can't do too much to fix this (unless anyone has any ideas that don't add too much bloat?).

As for dropping IE9 and IE10, they have a depressingly high level of events (particularly IE9) in the data we assessed. I'd hope we can drop them when v3 is mature and move to v4; those that want IE9 and IE10 support can then stay on the mature v3. For the time being, deduplicating the events downstream is the best solution (as described earlier in this issue).
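Downstream deduplication, as recommended here, can be as simple as dropping events whose (event_id, fingerprint) pair has already been seen. A sketch, with `fingerprint` standing in for any stable hash of the event body:

```javascript
// Keep the first occurrence of each (event_id, fingerprint) pair and
// drop later ones. Identical payloads with the same id are duplicates;
// distinct payloads sharing an id (RNG collisions) are kept.
function dedupe(events, fingerprint) {
  const seen = new Set();
  return events.filter(function (e) {
    const key = e.event_id + ':' + fingerprint(e);
    if (seen.has(key)) {
      return false; // duplicate, drop
    }
    seen.add(key);
    return true;
  });
}
```

In a real pipeline this runs as a batch or streaming step (e.g. the synthetic dedupe linked earlier in this thread), not in application code, but the keying logic is the same.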
Hi @paulboocock, in light of your last comment from almost 1.5 years ago: is there nowadays still a need to support IE9 and IE10 (and even IE11)? Are there already plans for the mentioned v4? Alternatively, would it be an idea to offer the "older browsers" support as a separate NPM package that can be installed like the plugins, for people who need to support those older browsers?
Hi @schmkr, thanks for raising this question.
Yes, we have plans for v4. Indicatively, we plan to release it before the end of the year.
Hi @AlexBenny, since version 3.9 was just released: are you still sticking to the plan that the Math.random issue will be addressed in version 4.0?
According to the definition of a version 4 UUID, the event_id is supposed to be unique for every event. Unfortunately, we see the same event_id more than once in our collected data.
Our example data set has 13,406,731 events, of which 13,235,475 have unique event_ids. This means that about 1.3% of events have a duplicated event_id. These events may come in at completely different times and from different IPs. As an example, there are 15 different events with event_id = "f2d4f27d-077a-4086-8cce-5789b10fbfda". See the graphics below.
We even have a larger data set with more recent JS tracker versions (2.5.x and 2.6.x), and there the duplicate ratio is even worse, ~1.8%.
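These rates are far beyond what chance collisions could explain. A back-of-the-envelope check using the birthday approximation (expected colliding pairs ≈ n²/2¹²³, since a version 4 UUID carries 122 random bits):

```javascript
// With a sound RNG, the expected number of colliding UUID pairs among
// n draws is roughly n^2 / 2^123 (birthday approximation).
const n = 13406731; // events in the dataset described above

const expectedCollisions = (n * n) / Math.pow(2, 123);
// Works out to roughly 1.7e-23, i.e. effectively zero. The ~171k
// duplicates actually observed therefore cannot come from a sound RNG.
```

In other words, the observed duplication is not a statistical fluke but a symptom of broken or shared-state Math.random() streams, as discussed throughout this issue.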