
Optimization and Performance Tuning


Taken from this Google Groups thread, written by @jrideout:

First, DC's niche is "small" data sets that can fit on a browser client and have multiple dimensions you want to compare interactively. There are many use cases that do not fall into this category. Work with larger data sets can still make use of DC, but it will require some customization of the environment, if not of DC itself.

Because it seems like I can't go past 16,105 rows...

You can probably handle more than that with some optimization. I would certainly remove unused columns and optimize the formats of others. DC has two areas of resource constraint:

  1. Client memory. You can have a data set as large as will fit in client memory. On my laptop, I've had 500MB data sets work fine, but I wouldn't want to distribute something like that widely.

  2. The complexity of the DOM. The number of nodes created by the visualization will probably be a more visible bottleneck than the data set size. I've had complex visualizations of 20MB data sets perform worse than simpler ones over 100MB data sets.

On #1 above, the limiting factor is the JavaScript memory footprint, not the size of the CSV. If you are using d3.csv to load your data, remember it will convert

h1, h2
r1c1, r1c2
r2c1, r2c2

into

[{h1: "r1c1", h2: "r1c2"}, {h1: "r2c1", h2: "r2c2"}]

Notice the redundant h1 and h2 keys. They make it easier to reference columns in your code, but they do bloat the memory footprint. As an alternative you can use a matrix-style array of arrays, such as [["r1c1","r1c2"],["r2c1","r2c2"]], and reference columns as d[0] rather than d.h1, but that approach is much more prone to developer error.
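For example, with d3 v3 you could fetch the raw text and parse it into arrays with d3.csv.parseRows rather than letting d3.csv build an object per row. A minimal sketch (the file name and column indices are illustrative):

    // Load the CSV as arrays-of-arrays to avoid repeating the header keys in every record.
    d3.text("transactions.csv", function (error, text) {
      if (error) throw error;

      var rows = d3.csv.parseRows(text);
      var header = rows.shift();   // ["h1", "h2", ...] is kept once, not once per row

      var ndx = crossfilter(rows);

      // Columns are now referenced by index rather than by name.
      var h1Dim = ndx.dimension(function (d) { return d[0]; });
      var h1Group = h1Dim.group();
      // ... build charts as usual
    });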

Another thing to consider is partial aggregation of the raw data. Let's say you have something like this:

timestamp, account, amount
1385133212, 1, 10
1385133222, 2, 14
1385133232, 1, 12

you could aggregate against the smallest timeframe that you will visualize, say by day, and still keep the transaction count:

day, account, amount, count 
1122, 1, 22, 2
1122, 2, 14, 1

Notice that I also dropped the year, assuming it isn't needed.
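Ideally this aggregation happens before the file ever reaches the browser, but the same logic in JavaScript looks roughly like the sketch below (rawRecords stands for the already-loaded raw array, field names are illustrative, and d3.time.day is d3 v3's day interval):

    // Collapse raw transactions into one record per (day, account),
    // keeping the summed amount and the transaction count.
    function aggregateByDay(raw) {
      var buckets = {};
      raw.forEach(function (d) {
        var day = d3.time.day(new Date(d.timestamp * 1000)).getTime(); // truncate to the day
        var key = day + "|" + d.account;
        var b = buckets[key] ||
                (buckets[key] = {day: day, account: d.account, amount: 0, count: 0});
        b.amount += +d.amount;
        b.count += 1;
      });
      return Object.keys(buckets).map(function (k) { return buckets[k]; });
    }

    // Feed crossfilter the smaller, pre-aggregated records instead of the raw ones.
    var ndx = crossfilter(aggregateByDay(rawRecords));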

Other optimizations include rounding numbers (0.1232423425 could become 0.12) or doing in-memory joins at read time for data that is shared by multiple records.
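Rounding, type coercion, and dropping columns can all happen in d3.csv's optional row accessor. A sketch (d3 v3 signature d3.csv(url, accessor, callback); the column names are illustrative):

    d3.csv("transactions.csv", function (d) {
      return {
        day: +d.day,                                // coerce strings to numbers
        account: +d.account,
        amount: Math.round(+d.amount * 100) / 100   // keep two decimal places
      };                                            // columns not copied here are dropped
    }, function (error, rows) {
      if (error) throw error;
      var ndx = crossfilter(rows);
      // ...
    });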

Depending on your data, these kinds of things can make a big impact on the memory footprint. Also useful are limits on the number of records in the data set. You can have a non-crossfilter filter that dynamically loads or removes data along some dimension (such as time). This adds more complexity to the UI, and it is very hard to manage if you also want to visualize aggregates of the data that hasn't been loaded, but it can be necessary in some cases. For instance, you can have a range chart that loads a by-day count summary for a long timeframe (say a year) but only allows zooming in to timeframes smaller than 30 days, dynamically loading the detailed data by day, no more than 30 days at a time.
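A rough sketch of that kind of external loading, assuming the data has been split into one file per day (the URL scheme and loadDay name are hypothetical; crossfilter's add() and dc.redrawAll() are real, and recent crossfilter versions also have a remove() that drops the records matching the current filters):

    function loadDay(ndx, day) {
      d3.csv("data/by-day/" + day + ".csv", function (error, rows) {
        if (error) throw error;
        ndx.add(rows);    // append the new day's records to the existing crossfilter
        dc.redrawAll();   // refresh every registered chart
      });
    }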

As far as DOM complexity goes, favor multiple visualizations that each look at a small number of aggregates rather than fewer charts with larger aggregate ranges. Square solved this problem in Cubism (http://square.github.io/cubism/) by:

A) aggressively limiting the data on screen by having a fixed display size and removing old data, and

B) using canvas rather than SVG to limit DOM node interactions.

You could use technique A with DC; B is interesting to explore as a future enhancement.

And last, you also need to consider the load time of the data set. Here you are limited by:

  1. the network latency for the duration of the file transfer,

  2. the parse time it takes to transform the data, and

  3. the time it takes to perform the initial aggregations.

#3 is a somewhat fixed cost. Optimizations that limit the number of rows and avoid unneeded calculations can help. For averages, for example, calculate only the total and count in the reduce functions and do the division in a chart accessor rather than redundantly in the reduce.
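For example, with crossfilter's three-argument reduce and a dc.js value accessor (the dimension and chart names are illustrative):

    // Keep only the running total and count in the group...
    var avgGroup = accountDim.group().reduce(
      function (p, v) { p.total += +v.amount; p.count++; return p; },  // add
      function (p, v) { p.total -= +v.amount; p.count--; return p; },  // remove
      function ()     { return {total: 0, count: 0}; }                 // initial
    );

    // ...and do the division once, when the chart reads the value.
    chart
      .group(avgGroup)
      .valueAccessor(function (p) {
        return p.value.count ? p.value.total / p.value.count : 0;
      });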

#1 and #2 have tradeoffs, but generally limiting #1 is going to be the focus, as even on the client, CPU and RAM access are generally much faster than network transfer. That is one reason to prefer TSV or CSV over JSON datasets. Often the CSV is more concise, but it then has a parsing cost. You can limit this cost by parsing as the file is transferred, using something like oboe.js or the approach I've taken here: https://gist.github.com/jrideout/7111894/raw/ed0eeb28c87b572e2b441dfc371036f36c0a3745/index.html
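One generic way to parse while the file is still transferring (a sketch, not necessarily the approach in the gist above): read xhr.responseText on progress events and hand only the complete lines received so far to d3.csv.parseRows.

    function streamCsv(url, onRows, onDone) {
      var xhr = new XMLHttpRequest();
      var offset = 0;                            // how far into responseText we've parsed

      function drain(final) {
        var text = xhr.responseText;
        var end = final ? text.length : text.lastIndexOf("\n");  // avoid splitting a partial line
        if (end > offset) {
          onRows(d3.csv.parseRows(text.slice(offset, end)));
          offset = final ? end : end + 1;
        }
      }

      xhr.onprogress = function () { drain(false); };
      xhr.onload = function () { drain(true); if (onDone) onDone(); };
      xhr.open("GET", url, true);
      xhr.send();
    }

    // Usage sketch (header-row handling omitted): append each batch, render once at the end.
    streamCsv("transactions.csv", function (rows) { ndx.add(rows); }, dc.renderAll);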

I should also note that it would be wise to profile things first.