Hardware Limitations and Scaling #147
@@ -1,2 +1,44 @@
tag citation
Zhou2015_deep_sea doi:10.1038/nmeth.3547
Bengio2015_prec arXiv:1412.7024
Bergstra2011_hyper url:https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf
Bergstra2012_random url:http://dl.acm.org/citation.cfm?id=2188395
Caruana2014_need arXiv:1312.6184
Chen2015_hashing arXiv:1504.04788
Chen2016_gene_expr doi:10.1093/bioinformatics/btw074
Coates2013_cots_hpc url:http://www.jmlr.org/proceedings/papers/v28/coates13.html
CudNN arXiv:1410.0759
Dean2012_nips_downpour url:http://research.google.com/archive/large_deep_networks_nips2012.html
Dogwild url:https://papers.nips.cc/paper/5717-taming-the-wild-a-unified-analysis-of-hogwild-style-algorithms.pdf
Edwards2015_growing_pains doi:10.1145/2771283
Elephas url:https://github.com/maxpumperla/elephas
Gerstein2016_scaling doi:10.1186/s13059-016-0917-0
Gomezb2016_automatic arXiv:1610.02415
Graphlab doi:10.14778/2212351.2212354
Gupta2015_prec arXiv:1502.02551
Hadjas2015_cct arXiv:1504.04343
Hinton2015_dark_knowledge arXiv:1503.02531
Hinton2015_dk arXiv:1503.02531v1
Hubara2016_qnn arXiv:1609.07061
Krizhevsky2013_nips_cnn url:https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
Krizhevsky2014_weird_trick arXiv:1404.5997
Lacey2016_dl_fpga arXiv:1602.04283
Li2014_minibatch doi:10.1145/2623330.2623612
Mapreduce doi:10.1145/1327452.1327492
Meng2016_mllib arXiv:1505.06807
Moritz2015_sparknet arXiv:1511.06051
NIH2016_genome_cost url:https://www.genome.gov/27565109/the-cost-of-sequencing-a-human-genome/
RAD2010_view_cc doi:10.1145/1721654.1721672
Raina2009_gpu doi:10.1145/1553374.1553486
Sa2015_buckwild arXiv:1506.06438
Schatz2010_dna_cloud doi:10.1038/nbt0710-691
Schmidhuber2014_dnn_overview doi:10.1016/j.neunet.2014.09.003
Seide2014_parallel doi:10.1109/ICASSP.2014.6853593
Spark doi:10.1145/2934664
Stein2010_cloud doi:10.1186/gb-2010-11-5-207
Su2015_gpu arXiv:1507.01239
Sun2016_ensemble arXiv:1606.00575
TensorFlow url:http://download.tensorflow.org/paper/whitepaper2015.pdf
Vanhoucke2011_nips_cpu url:https://research.google.com/pubs/pub37631.html
Wang2016_protein_contact doi:10.1101/073239
Yasushi2016_cgbvs_dnn doi:10.1002/minf.201600045

@@ -49,6 +49,84 @@ with only a couple GPUs.*

*Some of this is also outlined in the Categorize section. We can decide where
it best fits.*

Efficiently scaling deep learning is challenging, and there is a high
computational cost (e.g., time, memory, energy) associated with training neural
networks and using them for classification. As such, neural networks
have only recently found widespread use [@tag:Schmidhuber2014_dnn_overview].

Many have sought to curb the costs of deep learning, with methods ranging from
the very applied (e.g., reduced numerical precision [@tag:Gupta2015_prec
@tag:Bengio2015_prec @tag:Sa2015_buckwild @tag:Hubara2016_qnn]) to the exotic
and theoretic (e.g., training small networks to mimic large networks and
ensembles [@tag:Caruana2014_need @tag:Hinton2015_dark_knowledge]). The largest
gains in efficiency have come from computation with graphics processing units
(GPUs) [@tag:Raina2009_gpu @tag:Vanhoucke2011_nips_cpu @tag:Seide2014_parallel
@tag:Hadjas2015_cct @tag:Edwards2015_growing_pains
@tag:Schmidhuber2014_dnn_overview], which excel at the matrix and vector
operations so central to deep learning. The massively parallel nature of GPUs
allows additional optimizations, such as accelerated mini-batch gradient
descent [@tag:Vanhoucke2011_nips_cpu @tag:Seide2014_parallel @tag:Su2015_gpu
@tag:Li2014_minibatch]. However, GPUs also have a limited quantity of memory,
making it difficult to implement networks of significant size and
complexity on a single GPU or machine [@tag:Raina2009_gpu
@tag:Krizhevsky2013_nips_cnn]. This restriction has sometimes forced
computational biologists to use workarounds or to limit the size of an
analysis. For example, Chen et al. [@tag:Chen2016_gene_expr] aimed to infer the
expression level of all genes with a single neural network, but due to
memory restrictions they randomly partitioned genes into two halves and
analyzed each separately. In other cases, researchers limited the size
of their neural network [@tag:Wang2016_protein_contact
@tag:Gomezb2016_automatic]. Some have also chosen to use slower
CPU implementations rather than sacrifice network size or performance
[@tag:Yasushi2016_cgbvs_dnn].

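As a rough illustration of why memory becomes a bottleneck and why reduced
numerical precision helps, the sketch below estimates parameter storage for a
hypothetical fully connected network. The layer sizes are invented for
illustration, and training typically requires several times more memory than
the parameters alone (gradients, optimizer state, and activations).

```python
# Back-of-the-envelope estimate of parameter memory for a hypothetical
# fully connected network; training needs several times more memory than
# this (gradients, optimizer state, activations).
layer_sizes = [20000, 10000, 10000, 5000, 2000]  # invented layer widths

def parameter_count(sizes):
    """Weights plus biases for each consecutive pair of layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

params = parameter_count(layer_sizes)
for name, bytes_per_value in [("float32", 4), ("float16", 2)]:
    gib = params * bytes_per_value / 2**30
    print(f"{params:,} parameters at {name}: {gib:.2f} GiB")
```
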
Steady improvements in GPU hardware may alleviate this issue somewhat, but it
is not clear whether they can occur quickly enough to keep up with the growing
amount of available biological data or increasing network sizes. Much has
been done to minimize the memory
requirements of neural networks [@tag:CudNN @tag:Caruana2014_need
@tag:Gupta2015_prec @tag:Bengio2015_prec @tag:Sa2015_buckwild
@tag:Chen2015_hashing @tag:Hubara2016_qnn], but there is also growing
interest in specialized hardware, such as field-programmable gate arrays
(FPGAs) [@tag:Edwards2015_growing_pains @tag:Lacey2016_dl_fpga] and
application-specific integrated circuits (ASICs). Specialized hardware promises
improvements in deep learning at reduced time, energy, and memory
[@tag:Edwards2015_growing_pains]. Naturally, there is less software available
for highly specialized hardware [@tag:Lacey2016_dl_fpga], and it could be a
difficult investment for those not solely interested in deep learning. However,
it is likely that such options will find increased support as they become a
more popular platform for deep learning and general computation.

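As one concrete example of the memory-reduction methods cited above, the
hashing approach [@tag:Chen2015_hashing] replaces a large weight matrix with a
small pool of shared parameters indexed by a hash of each weight's position.
The sketch below is only a minimal illustration of that idea, with a fixed
random mapping standing in for a true hash function; it is not the authors'
implementation.

```python
import numpy as np

def hashed_dense(x, shared_weights, n_out, seed=0):
    """Forward pass of a fully connected layer whose virtual (n_in x n_out)
    weight matrix is backed by a small pool of shared parameters. A fixed
    random mapping stands in for the hash function that assigns each weight
    position to a shared parameter."""
    n_in = x.shape[-1]
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, shared_weights.size, size=(n_in, n_out))
    # A real implementation would avoid materializing the virtual matrix;
    # it is built here only for clarity.
    virtual_w = shared_weights[idx]
    return x @ virtual_w

# Hypothetical sizes: a 4096 x 4096 layer (~16.8M virtual weights) backed by
# only 100,000 shared parameters.
rng = np.random.default_rng(1)
shared = 0.01 * rng.standard_normal(100_000)
x = rng.standard_normal((8, 4096))
print(hashed_dense(x, shared, n_out=4096).shape)  # (8, 4096)
```
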
Distributed computing is a general solution to intense computational
requirements and has enabled many large-scale deep learning efforts. Early
approaches to distributed computation [@tag:Mapreduce @tag:Graphlab] were
not suitable for deep learning [@tag:Dean2012_nips_downpour],
but significant progress has been made. There
now exist a number of algorithms [@tag:Dean2012_nips_downpour @tag:Dogwild
@tag:Sa2015_buckwild], tools [@tag:Moritz2015_sparknet @tag:Meng2016_mllib
@tag:TensorFlow], and high-level libraries [@tag:Keras @tag:Elephas] for deep
learning in a distributed environment, and it is possible to train very complex
networks with limited infrastructure [@tag:Coates2013_cots_hpc]. Besides
handling very large networks, distributed or parallelized approaches offer
other advantages, such as improved ensembling [@tag:Sun2016_ensemble] or
accelerated hyperparameter optimization [@tag:Bergstra2011_hyper
@tag:Bergstra2012_random].

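To make the data-parallel idea concrete, the following is a minimal,
framework-free sketch of synchronous training in which several workers compute
gradients on separate shards of a mini-batch and the averaged gradient updates
a shared copy of the parameters. The tools cited above layer asynchrony,
communication strategies, and fault tolerance on top of patterns like this,
and the linear model here is only a stand-in for a real network.

```python
import numpy as np

def gradient(w, X, y):
    """Mean-squared-error gradient for a linear model; a stand-in for the
    backpropagation step of a real network."""
    return 2 * X.T @ (X @ w - y) / len(y)

def synchronous_step(w, shards, lr=0.01):
    """One data-parallel update: each 'worker' computes a gradient on its own
    shard and the averaged gradient is applied to the shared parameters."""
    grads = [gradient(w, X, y) for X, y in shards]  # would run in parallel
    return w - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 50))
true_w = rng.standard_normal(50)
y = X @ true_w + 0.1 * rng.standard_normal(1024)

w = np.zeros(50)
n_workers = 4
for _ in range(200):
    batch = rng.choice(len(y), size=256, replace=False)
    shards = [(X[s], y[s]) for s in np.array_split(batch, n_workers)]
    w = synchronous_step(w, shards)
print(np.linalg.norm(w - true_w))  # small residual error after training
```
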
Cloud computing, which has already seen adoption in genomics
[@tag:Schatz2010_dna_cloud], could facilitate sharing of the large
datasets common to biology [@tag:Gerstein2016_scaling @tag:Stein2010_cloud],
and may be key to scaling deep learning. Cloud computing affords researchers
significant flexibility, and enables the use of specialized hardware (e.g.,
FPGAs, ASICs, GPUs) without major upfront investment. With such flexibility, it
could be easier to address the different challenges associated with the
multitudinous layers and architectures available
[@tag:Krizhevsky2014_weird_trick]. Though many are reluctant to store sensitive
data (e.g., patient electronic health records) in the cloud,
secure/regulation-compliant cloud services do exist [@tag:RAD2010_view_cc].

*TODO: Write the transition once more of the Discussion section has been
fleshed out.*

**Review comment:** The transition could include some commentary on how this
relates to our guiding question. Will hardware issues slow deep learning from
making progress on the problems we have discussed? Do requirements for
specialized hardware (GPUs, FPGAs, etc.) or costs of using cloud resources
create a barrier to entry that will slow progress because fewer groups can
participate? Conversely, if some of these hardware challenges are resolved, do
we expect an acceleration of biomedical results?

**Reply:** I agree with these thoughts. Would you like to write them into the
text? One more comment is that the NVIDIA seeding program is great for getting
a group started, but in our case (and I suspect others) it is addictive. Once
you have methods working on one GPU, you quickly want to buy more or move to
the cloud.

### Code, data, and model sharing

*Reproducibility is important for science to progress. In the context of deep

**Review comment:** #104 provides a concrete example that could be referenced
here: their ability to learn a great feature representation was limited by the
number of compounds they could train with on a single GPU. Or maybe they did
use a few GPUs but need even more to scale.

**Reply:** I am hesitant to include this in the paragraph on distribution, as
it is unclear to me what hardware they are currently using, and they have not
yet shown if a distributed/multi-GPU implementation will influence their
performance. That being said, I think this would go well with the previous
paragraph on GPUs, as the memory limitations seemed to be an issue for them.

**Reply:** Point taken, we don't need to speculate on what they're doing
regarding distributed GPUs.