

Question about KDF performance for reading from CSV file #970

Open
hasanelfalakiy opened this issue Nov 26, 2024 · 5 comments
Labels
question Further information is requested

Comments

@hasanelfalakiy

Before I use DataFrame, I'd like to know: has anyone tested how fast DataFrame is at reading and managing data from CSV files?
Please comment.

@zaleslaw zaleslaw added the question Further information is requested label Nov 26, 2024
@zaleslaw zaleslaw changed the title from "ask" to "Question about KDF performance for reading from CSV file" Nov 26, 2024
@zaleslaw
Collaborator

@Jolanrensen I remember you tested in one of your PRs the new CSV reader implementation, could you please share some details?

@Jolanrensen
Collaborator

Indeed! While the focus of DataFrame is on type safety and ease of use, we do consider performance. DF 0.15 will have a new experimental module, "dataframe-csv", which is built on the much faster Deephaven-csv library. The old implementation uses Apache Commons CSV (which had a habit of running out of memory on large CSV files).

I made a small benchmark to compare the two implementations. It covers a small (384 B), a medium (19.8 MB), and a large (784.5 MB) CSV file: #903 (comment).
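A minimal timing sketch along those lines, for anyone who wants to compare the two readers on their own files (the file name is hypothetical, and the block assumes both the dataframe and dataframe-csv artifacts are on the classpath):

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.readCSV // old reader (Apache Commons CSV)
import org.jetbrains.kotlinx.dataframe.io.readCsv // new reader (Deephaven-csv, dataframe-csv module)
import kotlin.system.measureTimeMillis

fun main() {
    val path = "data.csv" // hypothetical input file
    val oldMs = measureTimeMillis { DataFrame.readCSV(path) }
    val newMs = measureTimeMillis { DataFrame.readCsv(path) }
    println("readCSV: $oldMs ms, readCsv: $newMs ms")
}
```

If you try this, run each reader a few times and discard the first iteration, since JIT warm-up skews one-shot JVM timings.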

Note that DataFrame reads all data into JVM heap memory. This is how it achieves the aforementioned ease of use and type safety, so if your CSV is too large you can still hit memory limits, but increasing the heap size can help.
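As a sketch of what "increasing your memory size" can look like in practice: if the application is run through Gradle's application plugin, the maximum JVM heap can be raised with a JVM argument (the 8 GB value below is purely illustrative, not a recommendation from this thread):

```kotlin
// build.gradle.kts — illustrative config sketch
application {
    // Raise the maximum JVM heap so large CSV files fit in memory.
    applicationDefaultJvmArgs = listOf("-Xmx8g")
}
```

The same flag can be passed directly when launching any JVM process (`java -Xmx8g ...`).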

Hope that answers your question :)

@hasanelfalakiy
Author


Interesting. I have also seen on Medium that someone tested the speed of reading CSV files with Deephaven, and Deephaven indeed excels because of its speed. When will DF 0.15 be released? I plan to use it to read tabular correction-term files in my project, such as the VSOP2000 coefficient correction terms; for the Sun and Moon data alone there are 86 thousand rows of correction terms in total.

@Jolanrensen
Collaborator

Jolanrensen commented Nov 26, 2024


We won't achieve the full speed of Deephaven (as described on Medium), since we work with boxed lists in memory, but it will at least be a lot faster than before :).

We plan to have a release candidate for 0.15 out this week. If nothing comes up, the full release will be soon thereafter.

If you cannot wait and want to try it already, we always publish dev versions from our master branch: https://central.sonatype.com/artifact/org.jetbrains.kotlinx/dataframe/0.15.0-dev-5148. Make sure to also add the new experimental dataframe-csv module (https://central.sonatype.com/artifact/org.jetbrains.kotlinx/dataframe-csv/0.15.0-dev-5148) and use DataFrame.readCsv() instead of the old DataFrame.readCSV() suite of functions. Feedback is always welcome :)
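Pulling the pieces above together, a Gradle Kotlin DSL setup for trying the dev version might look like this (coordinates taken from the links above; treat this as a sketch, not official install docs):

```kotlin
// build.gradle.kts
dependencies {
    implementation("org.jetbrains.kotlinx:dataframe:0.15.0-dev-5148")
    // New experimental CSV module built on Deephaven-csv
    implementation("org.jetbrains.kotlinx:dataframe-csv:0.15.0-dev-5148")
}
```

With that in place, reading a file goes through the new function (the file name here is hypothetical):

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.readCsv

val df = DataFrame.readCsv("corrections.csv") // new reader; replaces DataFrame.readCSV(...)
```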

@hasanelfalakiy
Author


Okay, thank you. I'll try it.


3 participants