How much is too big? #1396

EVAUTOAI · 2024-12-16T04:46:38Z

According to https://docs.evidentlyai.com/support/f.a.q., it's recommended to use sampling for large datasets.
Can you please help me understand what is a "large" dataset that would require sampling?

Like how many (rows * columns) would cause issues?

elenasamuylova · 2024-12-16T16:24:52Z

Hi @EVAUTOAI, there is no fixed answer here, for two reasons:

- Evidently can evaluate hundreds of different metrics, each with its own computational footprint (e.g., there are metrics like "text content drift" that train a whole machine learning model on your data vs. more straightforward metrics that compute the mean value in a column). You can also combine multiple metrics in the same report.
- The computation happens in memory, so the limit depends on your infrastructure.

So the simple answer is: if your computation takes longer than you want, or fails to complete at all, you may consider sampling. Sampling also often makes sense for metrics like data distribution drift.
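As a rough illustration of the sampling advice above, here is a minimal sketch using plain pandas (not an Evidently API). The helper names `sample_if_large` and `memory_mb`, the `max_rows` threshold, and the toy `reference` frame are all assumptions made for the example; the right threshold depends on your metrics and your infrastructure.

```python
import pandas as pd


def sample_if_large(df: pd.DataFrame, max_rows: int = 100_000, seed: int = 0) -> pd.DataFrame:
    """Return df unchanged if it has at most max_rows rows, else a uniform random sample.

    max_rows is a hypothetical threshold, not an Evidently recommendation.
    """
    if len(df) <= max_rows:
        return df
    # Uniform sampling without replacement; a fixed seed keeps runs reproducible.
    return df.sample(n=max_rows, random_state=seed)


def memory_mb(df: pd.DataFrame) -> float:
    """Approximate in-memory size of the DataFrame in megabytes."""
    # deep=True also counts the contents of object (e.g. string) columns.
    return df.memory_usage(deep=True).sum() / 1e6


# Toy data standing in for a real reference dataset.
reference = pd.DataFrame({"feature": range(1_000_000)})
reference_sample = sample_if_large(reference, max_rows=50_000)

print(len(reference), "rows,", round(memory_mb(reference), 1), "MB before sampling")
print(len(reference_sample), "rows,", round(memory_mb(reference_sample), 1), "MB after sampling")
```

You would then pass the sampled frames to the report instead of the full ones. Comparing `memory_mb` against the RAM available on your machine gives a first hint of whether a run is likely to fit, though the actual peak usage depends on which metrics you compute.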

EVAUTOAI · 2024-12-19T04:12:12Z

Thank you @elenasamuylova !

Are there any recommendations, examples, or mappings relating memory size to dataset size?
