faker.datatype.number often gives duplicates #2355
Comments
I used the following code:
import { faker } from "@faker-js/faker";
const set = new Set();
const limit = 10_000_000;
for (let i = 0; i < limit; i++) {
set.add(faker.number.int());
//if (i % 100_000 === 0) {
// console.log(i);
//}
}
console.log("=====================================");
console.log("Runs: ", limit);
console.log("Values:", set.size);
console.log("Ratio: ", set.size / limit);
Which returned the following values:
Which is 12 duplicates per 10k generated values.
I used your code above to count the number of errors:
console.warn = () => {};
import { faker } from "@faker-js/faker";
const limit = 5_000_000;
let counter = 0;
for (let i = 0; i < limit; i++) {
try {
faker.helpers.unique(
() => faker.number.int({ max: Number.MAX_SAFE_INTEGER }),
undefined,
{ maxTime: 100 }
);
} catch (e) {
counter++;
}
//if (i % 100_000 === 0) {
// console.log(i);
//}
}
console.log("=====================================");
console.log("Runs: ", limit);
console.log("Errors:", counter);
console.log("Ratio: ", counter / limit);
And I didn't get any errors in 5M executions, although I didn't run it on a slim CI machine. So I assume it is caused by a slow CI pipeline host or a GC pause.
This is the only workaround that I can currently think of, apart from implementing unique yourself. Have you tried using Faker v8?
We aren't happy with the current implementation either.
Faker uses mersenne to generate reproducible values, |
Please also read #1785 for more details and potentially you can already switch now to one of the new unique implementations out there, or even just copy-paste my suggestion and alter it to your own needs. |
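For anyone going down the "implement unique yourself" route, a minimal sketch of a retry-bounded wrapper could look like the following. It assumes uniqueness is tracked in a plain Set and that exhaustion is signalled by a retry count rather than a time budget. `createUnique`, `maxRetries` and the error message are hypothetical names chosen for illustration and are not part of Faker's API.

```javascript
// Hypothetical helper, not part of @faker-js/faker.
// Wraps a generator so that each returned value is distinct from all previous ones.
function createUnique(generate, { maxRetries = 50 } = {}) {
  const seen = new Set();
  return function next() {
    // One initial attempt plus up to maxRetries retries; no wall-clock limit,
    // so a slow host or GC pause cannot trigger a failure on its own.
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      const value = generate();
      if (!seen.has(value)) {
        seen.add(value);
        return value;
      }
    }
    throw new Error(`No unique value found after ${maxRetries} retries`);
  };
}

// Usage with a stand-in generator that yields 0, 0, 1, 1, 2, 2, ...
let callCount = 0;
const nextId = createUnique(() => Math.floor(callCount++ / 2));
```

In a real test suite the stand-in generator would be replaced by something like `() => faker.number.int()`, and the returned function would be shared across the whole run so the Set sees every generated value.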
It does seem like faker.number.int returns more duplicates than expected for large max. If every int was equally likely to be selected, you would expect fewer duplicates as the max increased. However, for max above 10^10 there doesn't seem to be an increase in the runs needed before the first repeated value; it occurs after approx. 80,000 values on average. Note that Number.MAX_SAFE_INTEGER is of the magnitude 10^15.
const { faker } = require("@faker-js/faker");
const attempts = 100
for (let pow = 2; pow < 15; pow++) {
const max = 10 ** pow;
let total = 0
for (let attempt = 0; attempt < attempts; attempt++) {
total += timeToFirstDuplicate(max)
}
let average = Math.round(total / attempts)
console.log("First duplicate for faker.number.int({max:10^" + pow + "}) occurs after avg " + average)
}
function timeToFirstDuplicate(max) {
const set = new Set();
let duplicates = false
let count = 0
do {
const j = faker.number.int({
max
});
if (set.has(j)) {
duplicates = true
} else {
set.add(j)
count++
}
} while (!duplicates)
return count
}
Perhaps the way we have mersenne set up means the number of possible outputs is less than expected? |
This makes sense, as the mersenne twister implementation returns around 2^32 separate values, which is approx. 4.3 × 10^9 and so several orders of magnitude less than Number.MAX_SAFE_INTEGER |
If you draw a random number uniformly from [1…N] with replacement, the expected number of draws before you get a repeat is approx. sqrt(π·N/2).
Apologies for the mathematical diversion :D |
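To put numbers on this, here is a small sketch that evaluates the standard birthday-problem approximation sqrt(π·N/2) for two candidate output-space sizes. The function name `expectedDrawsUntilRepeat` is made up for illustration, and the assumption is that every one of the N values is equally likely.

```javascript
// Birthday-problem approximation: expected number of uniform draws from
// n equally likely values until the first repeated value shows up.
function expectedDrawsUntilRepeat(n) {
  return Math.sqrt((Math.PI * n) / 2);
}

// 2^32 distinct outputs: roughly 82,000 draws, in line with the ~80,000 measured above.
const for32Bit = expectedDrawsUntilRepeat(2 ** 32);
// 2^53 distinct outputs: on the order of 10^8 draws.
const for53Bit = expectedDrawsUntilRepeat(2 ** 53);

console.log(Math.round(for32Bit), Math.round(for53Bit));
```

The 2^32 figure matching the measured average is consistent with the generator only having about 2^32 distinct outputs, whatever max is requested.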
Sent from my phone: what about multiplying faker.number.int() with faker.number.int({ min: 1, max: 25000 }) 👀
We could move this multiplication also into our source to adjust it internally 🤔 But first let's hear the ideas from the others |
A little loss in precision is to be expected in algorithmic randomness. Also, we could check whether we can use a different access method from our twister. AFAICT there is one with higher precision, or at least it says so. |
I tried switching from And the number of duplicates went from ~
Test-Code
import MersenneTwister19937 from "./twister";
const twister = new MersenneTwister19937();
twister.initGenrand(Math.random() * Number.MAX_SAFE_INTEGER);
const limit = 8_300_000; // JS object key limit!?
const data: Record<number, number> = {};
for (let i = 0; i < limit; i++) {
const random = twister.genrandReal2();
// const random = twister.genrandRes53();
data[random] = (data[random] || 0) + 1;
if (i % 100_000 === 0) {
console.log("Runs: ", i);
}
}
console.log("Runs: ", limit);
console.log("=====================================");
console.log("Calculating...");
const duplicateCount = Object.values(data).reduce((a, b) => a + b - 1, 0);
const duplicateMax = Object.values(data).reduce((a, b) => Math.max(a, b), 0) - 1;
console.log("=====================================");
console.log("Runs: ", limit);
console.log("Duplicates:", duplicateCount);
console.log("Ratio: ", duplicateCount / limit);
console.log("Max: ", duplicateMax);
console.log("====================================="); |
I'll have a look at the bit magic in the implementation when I have more time. |
I ran 500kkk invocations for faker.number.float() and it returns [0,1), whereas all the other methods return [0,max]. It looks like #1675 might have contributed to this. |
Is [0,1) or [0,1] supposed to be the correct behavior for number.float()? IMO [0,1] is the correct version. |
faker.number.int() only returns even values when using very large max values. |
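A possible explanation for the even-only values, sketched under two assumptions that are not confirmed in this thread: that the underlying float is k / 2^32 for some 32-bit integer k, and that the integer is derived as floor(real * (max + 1)). `scaleToInt` is a hypothetical stand-in, not Faker's actual code.

```javascript
// Assumption: the generator yields k / 2^32 for an integer k in [0, 2^32),
// and the final integer is computed as floor(real * (max + 1)).
function scaleToInt(k, max) {
  const real = k / 2 ** 32; // only 32 bits of resolution
  return Math.floor(real * (max + 1));
}

// With max = Number.MAX_SAFE_INTEGER, max + 1 is 2^53, so every result is
// k * 2^21: a multiple of 2,097,152 and therefore always even.
const samples = [1, 3, 12345, 2 ** 32 - 1].map((k) =>
  scaleToInt(k, Number.MAX_SAFE_INTEGER)
);
const allEven = samples.every((value) => value % 2 === 0);

console.log(samples, allEven);
```

Under those assumptions the low 21 bits of the result are always zero, which would also explain why only about 2^32 distinct values ever appear.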
I think [0, 1) is the correct one. |
I created #2357 to start working on a fix. |
All our other methods return |
With the new PR we could introduce a new option |
I don't like this. Just define what the function does and be done. |
If someone needs that they should open a new feature request then. |
Team Decision: We will change mersenne to 53-bit precision to reduce the duplicates and create a separate PR for the float issues. |
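For illustration, 53-bit precision can be obtained from a 32-bit generator by combining two consecutive outputs, as the `genrand_res53` routine in the Mersenne Twister reference code does. The sketch below follows that recipe; `toRes53` and the `xorshift32` stand-in source are hypothetical names used only to keep the example self-contained, and this is not necessarily how the eventual fix is written.

```javascript
// Combine two 32-bit outputs into a float in [0, 1) with 53 bits of resolution,
// following the genrand_res53 recipe from the Mersenne Twister reference code.
function toRes53(next32) {
  const a = next32() >>> 5; // keep the upper 27 bits of the first output
  const b = next32() >>> 6; // keep the upper 26 bits of the second output
  return (a * 67108864 + b) / 9007199254740992; // (a * 2^26 + b) / 2^53
}

// Stand-in 32-bit source (xorshift32), used only to make the sketch runnable.
// In Faker the source would be the twister's 32-bit integer output instead.
function xorshift32(seed) {
  let x = seed >>> 0 || 1; // state must be non-zero
  return () => {
    x ^= x << 13;
    x ^= x >>> 17;
    x ^= x << 5;
    return (x >>>= 0); // normalise to an unsigned 32-bit integer
  };
}

const source = xorshift32(2355);
console.log(toRes53(source), toRes53(source));
```

Each float consumes two 32-bit outputs, so sequences generated from an existing seed would change, which is worth keeping in mind for anyone relying on reproducible values.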
Pre-Checks
Describe the bug
fakeId is called 50-100 times in total on the offending test runs.
Analysis
unique is also sus on CI environments. Namely, this block: https://github.com/faker-js/faker/blob/next/src/modules/helpers/unique.ts#L114-L122
now - startTime will be greater than maxTime, resulting in the thrown error
Suggestion:
maxTime.
unique should rely on maxRetries only
Minimal reproduction code
Additional Context
No response
Environment Info
Which module system do you use?
Used Package Manager
npm