-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace stdlib CSV reader with simpler detector #553
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, thank you. I had this on todo for so long but was reluctant because CSV reading is really hairy issue.
I have 2 comments about it:
- please move the code in new package called
csv
in this directory - new detector should behave like the stdlib detector. I changed the tests and also added a fuzzing test against stdlib reader in 8d272ec. Please merge this fuzzing into the PR.
Edit: v1.4.4 and next release include performance improvements, mostly related to memory allocs. It might help but yeah... my profiling as well shows that CSV and NDJSON allocate a lot.
Edit: a benchmark between sv
and svStdlib
to show the improvement would be great.
I'll push the refactors shortly Benchmark test codefunc BenchmarkDetectVsSv(b *testing.B) {
fh, err := os.Open("random_data.csv")
if err != nil {
b.Fatalf("failed to open file: %+v", err)
}
contents, err := io.ReadAll(fh)
if err != nil {
b.Fatalf("failed to read file: %+v", err)
}
for _, limit := range []uint32{0, 100, 1000} {
b.Run(fmt.Sprintf("sv(limit=%d)", limit), func(b *testing.B) {
sv(contents, ',', limit)
})
b.Run(fmt.Sprintf("Detect(limit=%d)", limit), func(b *testing.B) {
Detect(contents, ',', limit)
})
}
}
So The file under test was:
Edit: please note that LazyQuotes is not implemented in this PR, thus, my first attempt to incorporate that feature lead to lost of pain and performance loss. |
After running through the fuzz test, I've found a lot of edge cases. It essentially comes down to the |
@gabriel-vasile I've done some initial work to get the prototype working with fuzz tests, but honestly it's not worth the gains (less memory performance improvement, I might shelve this for a while, but will noodle on it in the background. One thought I had: what if using the "simple" (more performant detector) that did not support LazyQuotes as a configuration option? I also haven't characterized just switching this flag off when creating the stdlib reader. |
@wagoodman I took inspiration from json package and I added a pool of The code is here: 5f825db The benchmark shows -77% allocated bytes:
It's not ideal, there are still allocations done, but it cuts the bulk of allocated bytes. Let me know if you see improvements in stereoscope, but this commit is probably going to master. A CSV detector that avoids allocs completely would be more than welcome, but that's a lot of work. |
There is evidence that using the stdlib csv reader can be resource intensive from a memory perspective:
We're seeing evidence of this in stereoscope:
Since we are not in need of the full CSV reader functionality, this PR drops usage of the CSV reader and adds a CSV detector in its place. This yields a drastic performance improvement memory-wise (not inuse memory, total allocated memory):