Hardware Limitations and Scaling #147

Merged 5 commits on Dec 21, 2016
Changes from 2 commits
40 changes: 40 additions & 0 deletions references/tags.tsv
@@ -1,2 +1,42 @@
tag citation
Zhou2015_deep_sea doi:10.1038/nmeth.3547
Bengio2015_prec arXiv:1412.7024
Bergstra2011_hyper url:https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf
Bergstra2012_random url:http://dl.acm.org/citation.cfm?id=2188395
Caruana2014_need arXiv:1312.6184
Chen2015_hashing arXiv:1504.04788
Chen2016_gene_expr doi:10.1093/bioinformatics/btw074
Coates2013_cots_hpc http://www.jmlr.org/proceedings/papers/v28/coates13.html
Collaborator: Add url:

CudNN arXiv:1410.0759
Dean2012_nips_downpour url:http://research.google.com/archive/large_deep_networks_nips2012.html
Dogwild url:https://papers.nips.cc/paper/5717-taming-the-wild-a-unified-analysis-of-hogwild-style-algorithms.pdf
Edwards2015_growing_pains doi:10.1145/2771283
Elephas url:https://github.com/maxpumperla/elephas
Gerstein2016_scaling doi:10.1186/s13059-016-0917-0
Graphlab doi:10.14778/2212351.2212354
Gupta2015_prec arXiv:1502.02551
Hadjas2015_cct arXiv:1504.04343
Hinton2015_dark_knowledge arXiv:1503.02531
Hinton2015_dk arXiv:1503.02531v1
Hubara2016_qnn arXiv:1609.07061
Krizhevsky2013_nips_cnn https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
Collaborator: Add url:

Krizhevsky2014_weird_trick arXiv:1404.5997
Lacey2016_dl_fpga arXiv:1602.04283
Li2014_minibatch doi:10.1145/2623330.2623612
Mapreduce doi:10.1145/1327452.1327492
Meng2016_mllib arXiv:1505.06807
Moritz2015_sparknet	arXiv:1511.06051
NIH2016_genome_cost url:https://www.genome.gov/27565109/the-cost-of-sequencing-a-human-genome/
RAD2010_view_cc doi:10.1145/1721654.1721672
Raina2009_gpu doi:10.1145/1553374.1553486
Sa2015_buckwild arXiv:1506.06438
Schatz2010_dna_cloud doi:10.1038/nbt0710-691
Schmidhuber2014_dnn_overview doi:10.1016/j.neunet.2014.09.003
Seide2014_parallel doi:10.1109/ICASSP.2014.6853593
Spark doi:10.1145/2934664
Stein2010_cloud doi:10.1186/gb-2010-11-5-207
Su2015_gpu arXiv:1507.01239
Sun2016_ensemble arXiv:1606.00575
TensorFlow url:http://download.tensorflow.org/paper/whitepaper2015.pdf
Vanhoucke2011_nips_cpu url:https://research.google.com/pubs/pub37631.html
Yasushi2016_cgbvs_dnn doi:10.1002/minf.201600045
80 changes: 80 additions & 0 deletions sections/06_discussion.md
@@ -49,6 +49,86 @@ with only a couple GPUs.*
*Some of this is also outlined in the Categorize section. We can decide where
it best fits.*

Efficiently scaling deep learning is challenging, and there is a high
computational cost (e.g., time, memory, energy) associated with training neural
networks and using them for classification. For these reasons, neural networks
Collaborator: Is it both training them and using them, or primarily training, that is resource intensive?

Collaborator (author): Fair point. Though the accumulated run-time cost may eclipse the design/training time (e.g., for a popular and widely used tool), the latter is not distributed across users and thus poses a more direct challenge for computational biologists.

have only recently found widespread use [@tag:Schmidhuber2014_dnn_overview].
For biologists, such problems are further complicated by the immense size of
most biological datasets.
Collaborator: I would argue many of the datasets used in our biomedical examples are not that large, especially compared to some image or speech datasets. Should we highlight more specific examples of problems where the size of the data on disk is a limiting factor (perhaps imaging, or applications with (epi)genomic input data)?

Collaborator (author): The disk space problem is an issue, but not the only one. I agree that its current placement might sideline discussion of other motivating issues (e.g., "wide" datasets). Perhaps I should leave this sentence out entirely?

Collaborator: I think it could be dropped; there are many other interesting points being made here already.


*TODO: Perhaps a visual illustrating the changes over time in the cost of
storage
(i.e. hard drive space), genome sequencing, and processing power. Additionally,
Collaborator: GPU cores and RAM would also be relevant here. I wouldn't want to have to compile that data ourselves; maybe a different deep learning review has done so recently? Wikipedia has it all but isn't a great source.

Collaborator (author):
  • I agree, the GPU cores and RAM would definitely be good to have. I will see if I can find a review that includes such information.
  • I have already compiled a list of ImageNET classification task winners with accuracy, along with citations for the models where available. However, I am still unsure about how to quantify the increasing complexity of the winning networks, as there is clearly a distinction to be made between types of layers, and I don't know how useful parameter counts would be.
  • The data in [@tag:Stein2010_cloud] is missing recent years. This is not really a problem, and I can find more data regarding storage costs if that ends up being something we want to have.

Collaborator: I agree parameter counts don't tell the whole story; some important advances worked well precisely because they reduced parameter counts. If it is hard to capture the ImageNet history without going into a lot of detail, should we drop it?

plotting the accuracy, depth, and training time of ImageNet winners could be
illustrative. See [@tag:Stein2010_cloud] for an example of calculating storage
costs, and [@tag:NIH2016_genome_cost] for data on the cost of sequencing.*

Many have sought to curb the costs of deep learning, with methods ranging from
the very applied (e.g., reduced numerical precision [@tag:Gupta2015_prec
@tag:Bengio2015_prec @tag:Sa2015_buckwild @tag:Hubara2016_qnn]) to the exotic
and theoretical (e.g., training small networks to mimic large networks and
ensembles [@tag:Caruana2014_need @tag:Hinton2015_dark_knowledge]). The largest
gains in efficiency have come from computation with graphics processing units
(GPUs) [@tag:Raina2009_gpu @tag:Vanhoucke2011_nips_cpu @tag:Seide2014_parallel
@tag:Hadjas2015_cct @tag:Edwards2015_growing_pains
@tag:Schmidhuber2014_dnn_overview], which excel at the matrix and vector
operations so central to deep learning. The massively parallel nature of GPUs
allows additional optimizations, such as accelerated mini-batch gradient
descent [@tag:Vanhoucke2011_nips_cpu @tag:Seide2014_parallel @tag:Su2015_gpu
@tag:Li2014_minibatch]. However, GPUs also have a limited amount of memory,
which makes it difficult to train and use networks of significant size and
complexity on a single GPU or machine [@tag:Raina2009_gpu
@tag:Krizhevsky2013_nips_cnn]. This restriction has sometimes stymied the use
of deep learning for computational biology [@tag:Chen2016_gene_expr], though
Collaborator: Perhaps change to "... for computational biology or limited network size" as in #91

some have chosen to use slower CPU implementations rather than sacrifice
network size or performance [@tag:Yasushi2016_cgbvs_dnn].
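
To make the reduced-precision idea concrete, here is a minimal, hypothetical
sketch (in numpy, not taken from any of the cited implementations) that runs
the same toy layer in 32-bit and 16-bit floating point and compares memory
footprint and numerical error. The layer sizes and data are arbitrary, and real
reduced-precision training schemes are considerably more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy fully connected layer: y = max(0, xW + b), with made-up sizes.
x = rng.standard_normal((64, 256)).astype(np.float32)   # mini-batch of inputs
W = (0.01 * rng.standard_normal((256, 128))).astype(np.float32)
b = np.zeros(128, dtype=np.float32)

def forward(x, W, b, dtype):
    """Run the layer with inputs and weights cast to the requested precision."""
    xq, Wq, bq = x.astype(dtype), W.astype(dtype), b.astype(dtype)
    return np.maximum(xq @ Wq + bq, 0)

y32 = forward(x, W, b, np.float32)
y16 = forward(x, W, b, np.float16)

# Half precision stores each value in 2 bytes instead of 4, roughly halving
# memory use, at the cost of a small error relative to single precision.
print("weight memory (float32):", W.nbytes, "bytes")
print("weight memory (float16):", W.astype(np.float16).nbytes, "bytes")
print("max abs difference in outputs:", np.abs(y32 - y16.astype(np.float32)).max())
```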

Steady improvements in GPU hardware may alleviate this issue somewhat, but it
is not clear whether they will come quickly enough to keep up with the
increasing amount of available biological data. Much has been done to minimize
the memory
Collaborator: Amount of data and network size?

requirements of neural networks [@tag:CudNN @tag:Caruana2014_need
@tag:Gupta2015_prec @tag:Bengio2015_prec @tag:Sa2015_buckwild
@tag:Chen2015_hashing @tag:Hubara2016_qnn], but there is also growing
interest in specialized hardware, such as field-programmable gate arrays
(FPGAs) [@tag:Edwards2015_growing_pains Lacey2016_dl_fpga] and
Collaborator: Missing @tag

application-specific integrated circuits (ASICs). Specialized hardware promises
improvements in deep learning at reduced cost in time, energy, and memory
[@tag:Edwards2015_growing_pains]. Naturally, there is less software available
for highly specialized hardware [@tag:Lacey2016_dl_fpga], and it could be a
difficult investment for those not solely interested in deep learning. However,
such options are likely to find increased support as they become a more popular
platform for deep learning and general computation.
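
As a rough illustration of one of the memory-reduction strategies cited above,
the random weight sharing ("hashing trick") of [@tag:Chen2015_hashing], the
hypothetical numpy sketch below stores only a small vector of shared parameters
and reconstructs a much larger virtual weight matrix from it via a hash
function. The hash function, layer sizes, and bucket count are invented for
illustration; a real implementation would also gather shared weights on the fly
rather than materialize the full matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

n_in, n_out = 256, 128      # virtual (full) layer dimensions
n_buckets = 1024            # number of real parameters actually stored

# The only stored parameters: a small vector of shared weights.
shared = (0.01 * rng.standard_normal(n_buckets)).astype(np.float32)

def bucket(i, j):
    """Toy stand-in for the hash mapping virtual weight (i, j) to a bucket."""
    return (i * 2654435761 + j * 40503) % n_buckets

# Index of the shared weight backing each entry of the virtual matrix.
idx = np.fromfunction(bucket, (n_in, n_out), dtype=np.int64)
W_virtual = shared[idx]     # materialized here only to keep the sketch short

x = rng.standard_normal((8, n_in)).astype(np.float32)
y = x @ W_virtual           # forward pass using ~32x fewer stored parameters

print("stored parameters:", n_buckets)
print("virtual parameters:", n_in * n_out)
print("output shape:", y.shape)
```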

Distributed computing is a general solution to intense computational
requirements, and has enabled many large-scale deep learning efforts. Early
approaches to distributed computation were not suitable for deep learning
[@tag:Dean2012_nips_downpour], but significant progress has been made. There
Collaborator: I haven't read this. Why is it not suitable?

Collaborator (author): I was citing the discussion provided in @tag:Dean2012_nips_downpour, rather than Downpour itself. Updated with clarification.

now exist a number of algorithms [@tag:Dean2012_nips_downpour @tag:Dogwild
@tag:Sa2015_buckwild], tools [@tag:Moritz2015_sparknet @tag:Meng2016_mllib
@tag:TensorFlow], and high-level libraries [@tag:Keras @tag:Elephas] for deep
learning in a distributed environment, and it is possible to train very complex
networks with limited infrastructure [@tag:Coates2013_cots_hpc]. Besides
handling very large networks, distributed or parallelized approaches offer
other advantages, such as improved ensembling [@tag:Sun2016_ensemble] or
accelerated hyperparameter optimization [@tag:Bergstra2011_hyper
@tag:Bergstra2012_random].
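
To sketch the data-parallel pattern that underlies many of these tools, the
hypothetical example below splits a toy least-squares problem across several
simulated workers, has each compute a gradient on its own data shard, and
averages the gradients for a synchronous update. The worker count, model, and
data are made up; real systems distribute this work across machines and often
relax synchrony, as in the Downpour and Hogwild-style algorithms cited above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear regression problem: y = X w_true + noise
n_samples, n_features, n_workers = 4096, 32, 4
X = rng.standard_normal((n_samples, n_features))
w_true = rng.standard_normal(n_features)
y = X @ w_true + 0.01 * rng.standard_normal(n_samples)

# Each worker holds one shard of the data (data parallelism).
X_shards = np.array_split(X, n_workers)
y_shards = np.array_split(y, n_workers)

def local_gradient(w, Xs, ys):
    """Mean-squared-error gradient computed on one worker's shard."""
    return 2.0 * Xs.T @ (Xs @ w - ys) / len(ys)

w = np.zeros(n_features)
lr = 0.1
for step in range(200):
    # In a real system these gradients are computed concurrently and combined
    # by a parameter server or an all-reduce; here we simply average them.
    grads = [local_gradient(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
    w -= lr * np.mean(grads, axis=0)

print("parameter error:", np.linalg.norm(w - w_true))
```
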
Collaborator: #104 provides a concrete example that could be referenced here:

> In addition, we also expect to obtain better generalization by training a larger deep autoencoder with more data. The chemical structures of close to one hundred million chemical compounds are known, and could be used to train a single unified embedding of known chemistry. Software packages that use multiple graphical processing units are being applied to this task

Their ability to learn a great feature representation was limited by the number of compounds they could train with on a single GPU. Or maybe they did use a few GPUs but need even more to scale.

Collaborator (author): I am hesitant to include this in the paragraph on distribution, as it is unclear to me what hardware they are currently using, and they have not yet shown whether a distributed/multi-GPU implementation will influence their performance. That being said, I think this would go well with the previous paragraph on GPUs, as the memory limitations seemed to be an issue for them.

Collaborator: Point taken; we don't need to speculate on what they're doing regarding distributed GPUs.


Cloud computing, which has already seen adoption in genomics
[@tag:Schatz2010_dna_cloud], could facilitate easier sharing of the large
datasets common to biology [@tag:Gerstein2016_scaling @tag:Stein2010_cloud],
and may be key to scaling deep learning. Cloud computing affords researchers
significant flexibility and enables the use of specialized hardware (e.g.,
FPGAs, ASICs, GPUs) without major upfront investment. With such flexibility, it
could be easier to address the different challenges associated with the
multitudinous layers and architectures available
[@tag:Krizhevsky2014_weird_trick]. Though many are reluctant to store sensitive
data (e.g., patient electronic health records) in the cloud,
secure/regulation-compliant cloud services do exist [@tag:RAD2010_view_cc].

*TODO: Write the transition once more of the Discussion section has been
Collaborator: The transition could include some commentary on how this relates to our guiding question. Will hardware issues keep deep learning from making progress on the problems we have discussed? Do requirements for specialized hardware (GPUs, FPGAs, etc.) or the costs of using cloud resources create a barrier to entry that will slow progress because fewer groups can participate?

Conversely, if some of these hardware challenges are resolved, do we expect an acceleration of biomedical results?

Collaborator (author):

  • Regarding cloud computing, I think it enables broader use in the short term. This is largely due to the number of small grants available for cloud credit (e.g., https://aws.amazon.com/grants/), which are relatively simple to apply for. Additionally, there seems to be significant support for it (https://datascience.nih.gov/commons). Perhaps a bigger issue would be the stability of such a model; it may be unwise to predicate the advancement of deep learning on access to cloud computing.
  • I suspect that problems with interpretation, model sharing, and reproducibility are just as limiting as hardware. Fortunately, these should be things we can fix right now without significant hardware investment. Even without cloud resources, it should be easy for any lab to acquire a GPU (https://developer.nvidia.com/academic_gpu_seeding), and begin working on these issues immediately.
  • It would seem that solving hardware challenges would lead to an acceleration of research, at least based on what we've seen in the reviewed papers.

Collaborator: I agree with these thoughts. Would you like to write them into the text? One more comment: the NVIDIA seeding program is great for getting a group started, but in our case (and I suspect others') it is addictive. Once you have methods working on one GPU, you quickly want to buy more or move to the cloud.

fleshed out.*

### Code, data, and model sharing

*Reproducibility is important for science to progress. In the context of deep