This repository has been archived by the owner on Oct 23, 2020. It is now read-only.

CSV Streaming consumes CPU resources (with Feedback from Experienced Clojure Dev) #222

Open
jonathanwcrane opened this issue Sep 5, 2014 · 5 comments

@jonathanwcrane

So I was hanging out on the geohashing IRC channel and got to chatting with a guy who, it turns out, is a Clojure developer. I described some of the issues we were having; he took a look at the code, then actually cloned the repo, installed it on a Vagrant box, ran a profiler, and gave me some feedback. This issue is an attempt to catalog his input.

@jonathanwcrane
Author

One thing to consider is whether the rows passed to stream-slice-query-csv are a seq that is only lazily generated. That can create the illusion that stream-slice-query-csv is doing a lot of work.

Calling doall on rows (line 426) might make performance worse overall (since it would prevent streaming) but it could allow you to determine the true run time of stream-slice-query-csv.

(Could the rows themselves be lazy seqs hiding delayed computation?)
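To make the laziness point concrete, here is a minimal sketch, not taken from the project's code. `slow-row`, `lazy-rows`, and `consume` are hypothetical stand-ins: `slow-row` simulates expensive per-row work, and `consume` plays the role of the CSV writer. Building the seq costs nothing; whoever walks it first pays for the row work.

```clojure
(def calls (atom 0))

(defn slow-row
  "Hypothetical stand-in for expensive per-row work (e.g. a DB fetch)."
  [n]
  (swap! calls inc)
  (Thread/sleep 1)
  {:id n})

(defn lazy-rows
  "Returns immediately: map is lazy, so slow-row has not run yet."
  [n]
  (map slow-row (range n)))

(defn consume
  "Cheap consumer standing in for the CSV writer; it only counts rows."
  [rows]
  (reduce (fn [acc _] (inc acc)) 0 rows))

(def rows (lazy-rows 50))
;; Nothing has been realized yet, so slow-row has been called 0 times.
(def calls-before @calls)

;; Timing the consumer on the lazy seq charges it for all the row work.
(time (consume rows))

;; After doall, the same consumer is timed on already-realized data.
(def realized (doall (lazy-rows 50)))
(time (consume realized))
```

The first `time` call should report roughly the cost of fifty `slow-row` calls, while the second should be close to zero, even though `consume` did the same thing both times.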

Careful about giving the JVM too much heap space unless you're using a good concurrent GC; bigger heaps mean longer GC pauses...

@jonathanwcrane
Author

Hooeee, vagrant takes a long time to provision!
Ugh, *something* in your dependency set uses version ranges.
It just downloaded every known version of Clojure.
Leiningen says "Consider using [codox-md "0.2.0" :exclusions [org.clojure/clojure]]"

@jonathanwcrane
Author

And in an email:

I did find out some more, by the way. In stream-slice-query-csv I
suspected that the "data" value contained pending computation. I found
that that fn took about 450 ms to run in total with the following query:

http://localhost:3000/data/census/slice/population_raw.csv?%24select=region%2C+division%2C+state%2C+sex%2C+origin%2C+race%2C+age%2C+population_2010%2C+population_estimate_2010%2C+population_estimate_2011&%24where=&%24group=&%24orderBy=&%24limit=4000&%24offset=0

I experimented with adding (time (doall data)) as the first line of that
function and then a second call to time around the actual streaming.
The doall took about 350 ms and the streaming took about 100 ms.

100 ms still seems a little high, but the takeaway here is that the CSV
generation itself is not the big timesuck. I didn't dig in to find out
what computation is wrapped up in that lazy seq, but that seems to be
what is taking the lion's share of the execution time.
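The two-phase measurement described in the email could be sketched roughly as follows. This is an illustration under assumptions, not the project's actual function: `stream-csv-timed`, `row->csv-line`, and the `timed` macro are hypothetical names, rows are assumed to be plain vectors, and the CSV formatting is deliberately naive (no quoting or escaping).

```clojure
(require '[clojure.string :as str])

(defmacro timed
  "Evaluates body and returns [result elapsed-ms]."
  [& body]
  `(let [start#  (System/nanoTime)
         result# (do ~@body)]
     [result# (/ (- (System/nanoTime) start#) 1e6)]))

(defn row->csv-line
  "Naive CSV formatting (no quoting or escaping); illustrative only."
  [row]
  (str/join "," row))

(defn stream-csv-timed
  "Hypothetical stand-in for stream-slice-query-csv. Phase 1 forces the
  (possibly lazy) data; phase 2 writes the realized rows to out."
  [data ^java.io.Writer out]
  (let [[rows realize-ms] (timed (doall data))
        [_ stream-ms]     (timed (doseq [row rows]
                                   (.write out (str (row->csv-line row) "\n"))))]
    {:realize-ms realize-ms :stream-ms stream-ms}))

;; Usage: the lazy input simulates slow row production with a 1 ms sleep.
(stream-csv-timed (map (fn [n] (Thread/sleep 1) [n (* n n)]) (range 20))
                  (java.io.StringWriter.))
```

Note that forcing the data up front defeats streaming, so this is only useful as temporary instrumentation to see which phase dominates.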

@timmc

timmc commented Sep 5, 2014

Guy here. :-) The vagrant provisioning time was unrelated to the dependency range; the latter only contributed the tiniest amount of time compared to the full time to provision. My concern there is about stability, since it means you might one day deploy and suddenly you have a different version of Clojure. This is a general misfeature of Maven. Here's some more information about version ranges:

https://github.com/technomancy/leiningen/wiki/Repeatability#version-ranges
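As a rough illustration of the fix, a `project.clj` along these lines would pin Clojure explicitly and apply the exclusion Leiningen suggested earlier in this thread. The project name, the pinned Clojure version, and the placement of codox-md under `:dependencies` are all assumptions made for the sketch; the real project file may declare it elsewhere (for example in a profile).

```clojure
(defproject example-app "0.1.0-SNAPSHOT"            ; hypothetical project name
  :dependencies [[org.clojure/clojure "1.6.0"]       ; pinned version (assumed)
                 ;; exclusion copied from Leiningen's suggestion above
                 [codox-md "0.2.0" :exclusions [org.clojure/clojure]]]
  ;; ask Leiningen to abort on version ranges and version conflicts
  :pedantic? :abort)
```

With `:pedantic? :abort` set, a version range creeping back in through a transitive dependency should fail the build rather than silently change what gets deployed.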

@m3brown
Contributor

m3brown commented Sep 8, 2014

@timmc, thanks for the feedback. Regarding (time (doall data)), I did exactly that a month or two ago and reached a similar conclusion. I don't have my notes handy, but I believe I found that for larger queries (~20s total time), the time ratio was more like 95% doall and 5% streaming. Additionally, while in the doall state, I recall the CPU spending more time blocked than processing.

I was not able to dig further into why data takes so long. My assumption (based on the observations above) was that it was simply the time required for the monger library to transfer the data. In hindsight, it is certainly worth looking into, as there could be another bottleneck.
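One way to check the "blocked rather than processing" impression is to compare wall-clock time with CPU time for the thread doing the work. The sketch below uses the standard JVM `ThreadMXBean`; the `wall-vs-cpu` helper is a hypothetical name, and the sleep merely stands in for waiting on the database. It assumes the JVM supports per-thread CPU timing and that the work runs on the calling thread.

```clojure
(import '(java.lang.management ManagementFactory))

(defn wall-vs-cpu
  "Calls f on the current thread and returns elapsed wall-clock and CPU
  time in milliseconds. A large gap suggests the thread was waiting."
  [f]
  (let [^java.lang.management.ThreadMXBean mx (ManagementFactory/getThreadMXBean)
        cpu0  (.getCurrentThreadCpuTime mx)
        wall0 (System/nanoTime)
        _     (f)
        wall  (- (System/nanoTime) wall0)
        cpu   (- (.getCurrentThreadCpuTime mx) cpu0)]
    {:wall-ms (/ wall 1e6) :cpu-ms (/ cpu 1e6)}))

;; Sleeping stands in for blocking on I/O: wall time is high, CPU time is low.
(wall-vs-cpu #(Thread/sleep 100))
```

Wrapping something like `#(doall data)` this way would show whether the realization phase is mostly waiting (CPU time far below wall time) or mostly computing (the two roughly equal).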

@jonathanwcrane jonathanwcrane changed the title Feedback from Experienced Clojure Dev CSV Streaming consumes CPU resources (with Feedback from Experienced Clojure Dev) Sep 11, 2014