
Data queues, prefetching and multi-source #1773

Closed · wants to merge 1 commit

Conversation

cypof (Member) commented Jan 21, 2015

I split the work on data_layer from #1148. It was initially written to provide enough bandwidth to feed multiple GPUs and to fix performance issues caused by thread creation/destruction on each batch. Over time a few other things got in. In particular, at Flickr we are experimenting with different class ratios by reading from multiple sources: e.g. each dataset can be set up to contain one class, and the probability of each source then defines the class ratios at runtime. Features:

  • Reading from multiple sources, in case one network location or disk cannot feed the solvers. Each source can hold only a shard, in which case probabilities need to be balanced by shard size, or a copy of the same dataset read from a random offset. The latter might change SGD behavior a bit, since some examples can be seen multiple times before the second epoch, but over time coverage should be the same.
  • Probabilities on sources, e.g. to change the ratio of positive/negative when doing binary classification.
  • One loading thread per database, even if multiple solvers are running. This accommodates single-threaded DBs like LevelDB and ensures sequential access, which is usually faster. In almost all cases one thread is enough for loading, as it does nothing else. There is still a transform thread for each solver, as today.
  • No thread creation/deletion per batch. It is inefficient and causes problems with components that rely on thread-local caching. We also had problems with memory pinning and virtual memory. Cf. @thatguymike.
  • Prefetch asynchronously to each GPU on a separate CUDA stream, so that the batch is already on the GPU when the solver needs it.
  • Prefetch a configurable number of batches in host memory to smooth out bandwidth glitches; in particular, if data is loaded over a network it might make sense to configure a large prefetch queue. A rough sketch of the queue and source sampling follows this list.
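Below is a minimal sketch of two of the pieces described above, a bounded host-memory prefetch queue fed by a single loading thread and probability-weighted source selection, written against plain C++11. It is not the actual implementation in this PR; the names (`BatchQueue`, `Source`, `pick_source`) and the example source names are hypothetical placeholders, and the per-GPU CUDA-stream copy is omitted.

```cpp
// Sketch only: bounded prefetch queue + weighted source sampling.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <random>
#include <string>
#include <thread>
#include <vector>

// Bounded blocking queue holding prefetched batches in host memory.
template <typename T>
class BatchQueue {
 public:
  explicit BatchQueue(size_t capacity) : capacity_(capacity) {}
  void push(T item) {
    std::unique_lock<std::mutex> lock(mu_);
    not_full_.wait(lock, [&] { return q_.size() < capacity_; });
    q_.push(std::move(item));
    not_empty_.notify_one();
  }
  T pop() {
    std::unique_lock<std::mutex> lock(mu_);
    not_empty_.wait(lock, [&] { return !q_.empty(); });
    T item = std::move(q_.front());
    q_.pop();
    not_full_.notify_one();
    return item;
  }
 private:
  size_t capacity_;
  std::queue<T> q_;
  std::mutex mu_;
  std::condition_variable not_empty_, not_full_;
};

// A data source with a relative sampling probability (e.g. one class per source).
struct Source {
  std::string name;
  double probability;  // relative weight; need not sum to 1
};

// Pick a source index according to the configured probabilities.
size_t pick_source(const std::vector<Source>& sources, std::mt19937& rng) {
  std::vector<double> weights;
  for (const auto& s : sources) weights.push_back(s.probability);
  std::discrete_distribution<size_t> dist(weights.begin(), weights.end());
  return dist(rng);
}

int main() {
  std::vector<Source> sources = {{"positives_db", 0.5}, {"negatives_db", 0.5}};
  BatchQueue<std::string> prefetch(4);  // e.g. keep 4 batches prefetched in host memory

  // One persistent loading thread: draws a source per batch and reads from it
  // sequentially; never created/destroyed per batch.
  std::thread loader([&] {
    std::mt19937 rng(0);
    for (int i = 0; i < 8; ++i) {
      size_t s = pick_source(sources, rng);
      prefetch.push(sources[s].name + " batch " + std::to_string(i));
    }
  });

  // Consumer, standing in for a solver's transform thread.
  for (int i = 0; i < 8; ++i) std::cout << prefetch.pop() << "\n";
  loader.join();
  return 0;
}
```

The queue is bounded so the loader blocks once the configured number of batches has been prefetched, rather than growing host memory without limit; the capacity corresponds to the configurable prefetch depth mentioned in the last bullet.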

cypof mentioned this pull request Jan 21, 2015
shelhamer (Member)

@cypof thanks for all the data pipeline improvements. Just a heads-up: this'll likely need a rebase after #1748.

cypof (Member, Author) commented Jan 22, 2015

Deleted my branch by mistake; copied the PR to #1775.
