Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uniform distribution in helpers.arrayElements results #1765

Closed
mjomble opened this issue Jan 22, 2023 · 7 comments · Fixed by #1770
Closed

Uniform distribution in helpers.arrayElements results #1765

mjomble opened this issue Jan 22, 2023 · 7 comments · Fixed by #1770
Assignees
Labels
c: bug Something isn't working m: helpers Something is referring to the helpers module p: 1-normal Nothing urgent s: accepted Accepted feature / Confirmed bug
Milestone

Comments

@mjomble
Copy link

mjomble commented Jan 22, 2023

Clear and concise description of the problem

I want faker.helpers.arrayElements to always return each element with the same probability.
However, in certain situations, some indexes are picked far more often than others.

For example, when sampling 10 elements from an array with 1000 elements, indexes ending in 9 are are picked ~20 times more often than indexes ending in 1. And when sampling 9 or fewer elements, indexes ending in 1 seem to be never picked at all.

Suggested solution

The root cause of the problem is that arrayElements picks array indices using faker.number.float() here with the default precision of 0.01
This works fine with an array of 100 elements, but as the length grows, anomalies begin to appear.

The simplest solution would be to replace this.faker.number.float({ max: 0.99 }) with Math.random(). This would, however, break some deterministic test cases.

Alternative

We could also use something like this.faker.number.float({ max: 0.999999999, precision: 0.000000001 }) but I'm not sure what the best number of digits is.
The precision could also be conceivably derived from the length of the given array.

If I get some suggestions from maintainers, I may be able to submit a PR.

Additional context

Code to reproduce issue:

const { faker } = require('@faker-js/faker/locale/en')

const arrayLength = 1000
const ids = Array(arrayLength)

for (let i = 0; i < arrayLength; i++) {
    ids[i] = i
}

const countByMod = new Map()

for (let i = 0; i < 1000; i++) {
    for (const id of faker.helpers.arrayElements(ids, 10)) {
        const mod = id % 10
        const count = countByMod.get(mod) ?? 0
        countByMod.set(mod, count + 1)
    }
}

console.log('nines:', countByMod.get(9), 'ones:', countByMod.get(1))

Some outputs:

nines: 2793 ones: 116
nines: 2760 ones: 126
nines: 2755 ones: 107

Both should be much closer to 1000, which is the case when the same code is run on a fixed version of arrayElements:

nines: 990 ones: 1098
nines: 1008 ones: 1013
nines: 1023 ones: 943
@mjomble mjomble added the s: pending triage Pending Triage label Jan 22, 2023
@Shinigami92
Copy link
Member

The simplest solution would be to replace faker.number.float({ max: 0.99 }) with Math.random(). This would, however, break some deterministic test cases.

As you found already out, this is a no-go 🙅


Is this also affecting other parts of faker and we need to change something underlying? Or is this only the precision passed inside the arrayElement function?

If possible we should define a test to cover the probability with a max delta-error 🤔

@mjomble
Copy link
Author

mjomble commented Jan 22, 2023

Or is this only the precision passed inside the arrayElement function?

The problem can be fixed by passing in a different precision from arrayElements, doesn't seem to affect anything else.

@ST-DDT
Copy link
Member

ST-DDT commented Jan 22, 2023

The float imprecision should already be covered by #1675

And the uneven distribution should already be fixed in v8.0.0-alpha.0 because helpers.arrayElement -> number.int -> mersenne without any precision limitations.
Or am I missing something?

@ST-DDT
Copy link
Member

ST-DDT commented Jan 22, 2023

We should probably rewrite this line to make it more readable.

index = Math.floor((i + 1) * this.faker.number.float({ max: 0.99 }));

@matthewmayer
Copy link
Contributor

Note i deliberately kept the precision of 0.01 to avoid too much churn on the snapshot tests when i implemented #1675, but yes after that is merged, that line should be rewritten to take advantage of the higher precision floats now available.

@ST-DDT ST-DDT added c: bug Something isn't working p: 1-normal Nothing urgent s: accepted Accepted feature / Confirmed bug m: helpers Something is referring to the helpers module and removed s: pending triage Pending Triage labels Jan 23, 2023
@ST-DDT ST-DDT moved this to Todo in Faker Roadmap Jan 23, 2023
@ST-DDT
Copy link
Member

ST-DDT commented Jan 23, 2023

@mjomble I originally wanted wanted to let you write a fix for this, but then spend a lot of time trying to understand the issue/code, that I ended up writing the PR myself. Sorry about that.

Will be fixed in #1770.


While working on this issue I also found #1771

If you would like to contribute that, then please a comment there.

@mjomble
Copy link
Author

mjomble commented Jan 23, 2023

@mjomble I originally wanted wanted to let you write a fix for this, but then spend a lot of time trying to understand the issue/code, that I ended up writing the PR myself. Sorry about that.

Totally fine by me 😄 Thanks for the fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c: bug Something isn't working m: helpers Something is referring to the helpers module p: 1-normal Nothing urgent s: accepted Accepted feature / Confirmed bug
Projects
No open projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants