Add use cases #2

Merged
337 changes: 335 additions & 2 deletions index.bs
@@ -6,7 +6,7 @@ Status: CG-DRAFT
Group: webml
URL: https://webmachinelearning.github.io/webnn/
Editor: Your Name, Your Company http://example.com/your-company, your-email@example.com, http://example.com/your-personal-website
Abstract: A dedicated API for neural network inference hardware acceleration.
Abstract: This document describes a dedicated low-level API for neural network inference hardware acceleration.
Repository: https://github.com/webmachinelearning/webnn
</pre>

@@ -18,5 +18,338 @@ Introduction here.
Use cases {#usecases}
=====================

Use cases here.
## High-Level Use Cases ## {#usecases-highlevel}

This section illustrates application-level use cases for neural network
inference hardware acceleration. All applications in those use cases can be
built on top of pre-trained deep neural network (DNN) models.

### Person Detection ### {#usecase-person-detection}

A user opens a web-based video conferencing application, but temporarily
leaves her room. The application watches whether she is in front of her PC by
using object detection (for example, approaches such as [[SSD]] or [[YOLO]]
that use a single DNN) to detect regions in a camera input frame that contain
persons.

When she comes back, the application automatically detects her and notifies
other online users that she is active now.

### Semantic Segmentation ### {#usecase-segmentation}

A user joins a teleconference via a web-based video conferencing application at
her desk since no meeting room in her office is available. During the
teleconference, she does not want her room and the people in the background to
be visible. To protect the privacy of the other people and the surroundings,
the application runs a machine learning model such as [[DeepLabv3+]] or
[[MaskR-CNN]] to semantically split an image into segments and replaces the
segments that represent other people and the background with another picture.

### Skeleton Detection ### {#usecase-skeleton-detection}

A web-based video conferencing application tracks the pose of the user's
skeleton by running a machine learning model that allows for real-time human
pose estimation, such as [[PoseNet]], to recognize her gestures and body
language. When she raises her hand, her microphone is automatically unmuted and
she can start speaking in the teleconference.

### Face Recognition ### {#usecase-face-recognition}

Multiple people in a conference room join an online meeting using a web-based
video conferencing application. The application detects the faces of the
participants by using object detection (for example, approaches such as
[[SSD]]) and checks whether each face was present at the previous meeting by
running a machine learning model such as [[FaceNet]], which verifies whether
two face images belong to the same person.

### Facial Landmark Detection ### {#usecase-facial-landmarks}

A user wants to find new glasses that fit her well in an online glasses store.
The store offers a web-based try-on simulator that runs a machine learning
model such as Face Alignment Network [[FAN]] to detect facial landmarks such as
the eyes, nose, and mouth. When she chooses a pair of glasses, the simulator
renders the selected glasses at the detected position of her eyes in her facial
image.

### Style Transfer ### {#usecase-style-transfer}

A user is looking for cosmetics in an online store and wondering which color
would suit her face. The store shows sample facial makeup images for its
cosmetics, and offers a makeup simulator that runs a machine learning model
such as [[ContextualLoss]] or [[PairedCycleGAN]] to transfer the makeup style
of the sample image to her facial image. With the simulator, she can check how
the selected makeup looks on her face.

### Super Resolution ### {#usecase-super-resolution}

A web-based video conferencing application is receiving a video stream from its
peer, but the resolution of the video drops due to network congestion. To
prevent degradation of the perceived video quality, the application runs a
machine learning model for super-resolution, such as [[SRGAN]], to generate
higher-resolution video frames.

### Image Captioning ### {#usecase-image-captioning}

For better accessibility, a web-based presentation application provides
automatic image captioning by running a machine learning model such as
[[im2txt]], which predicts explanatory captions for the presentation slides.

### Machine Translation ### {#usecase-translation}

Multiple people from various countries are talking via a web-based real-time
text chat application. The application translates their conversation by using a
machine learning model such as [[GNMT]] or [[OpenNMT]], which translates each
message into a different language.

### Emotion Analysis ### {#usecase-emotion-analysis}

A user is talking to her friend via a web-based real-time text chat application,
and she is wondering how the friend feels because she cannot see the friend's
face. The application analyses the friend's emotion by using a machine learning
model such as [[DeepMoji]], which infers emotion from input texts, and displays
an emoji that represents the estimated emotion.

### Video Summarization ### {#usecase-video-summalization}

A web-based video conferencing application records received video streams, and
it needs to reduce the amount of recorded video data to be stored. The
application generates a short version of the recorded video by using a machine
learning model for video summarization such as
[[Video-Summarization-with-LSTM]].

## Low-Level Use Cases ## {#usecases-lowlevel}

This section collects API-level use cases for a dedicated low-level API for
neural network inference hardware acceleration. It is expected that machine
learning frameworks will be key consumers of the Web Neural Network API (WebNN
API) and that the low-level details exposed through the WebNN API will be
abstracted away from typical web developers. However, it is also expected that
web developers with specific interest and competence in machine learning will
want to interface with the WebNN API directly instead of through a higher-level
ML framework.

### Custom Layer ### {#usecase-custom-layer}

A web application developer wants to run a DNN model on the WebNN API. However,
she has found that some activation functions, such as [[LeakyReLU]] and
[[ELU]], are not included in the WebNN API. To address this issue, she
constructs custom layers for the additional activation functions on top of the
WebNN API. Note that the scope of custom layers may include convolution,
normalization, etc. as well as activation.
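
As a purely hypothetical sketch (this draft does not yet define the WebNN API
surface), the example below assumes a graph-builder-style interface with
element-wise constant, max, min, mul, and add operations, and composes a
LeakyReLU activation from them using leakyRelu(x) = max(x, 0) + alpha * min(x, 0);
all of the names are placeholders.

<pre highlight="js">
// Hypothetical sketch only: "builder" and its constant/max/min/mul/add
// operations are placeholder names, not operations defined by this draft.
function leakyRelu(builder, x, alpha) {
  const zero = builder.constant(0);
  const slope = builder.constant(alpha);
  // leakyRelu(x) = max(x, 0) + alpha * min(x, 0)
  return builder.add(builder.max(x, zero),
                     builder.mul(slope, builder.min(x, zero)));
}
</pre>

An [[ELU]] activation could be composed in a similar way from an exponential
primitive.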

### Network Concatenation ### {#usecase-network-concat}
@huningxin (Contributor) commented on Nov 21, 2018:
As mentioned in the TPAC F2F meeting, this looks like a training use case. As training is out of current charter's scope, would it be better to add this in the future?

@tomoyukilabs (Contributor, Author) replied on Nov 21, 2018:
Correct, but not limited to training. Possible detailed examples are:

  • The web app downloads convolutional layer weights of MobileNetV1/V2 from CDN and weights of fully-connected layers made by transfer learning from her own web site
  • The web app downloads complete weights of MobileNetV1/V2, and then partially update fully-connected layers later by downloading fine-tuned weights

Anyway, the current description seems to suggest the use case of training, as you pointed out. I'll update those sentences so that they clearly indicate a use case of client-side partial update based on fine tuning or transfer learning.

A contributor replied:
Thanks for the clarification! It would be great if you can update the description accordingly.


A web application uses a DNN model whose data for the upper convolutional
layers and the lower fully-connected layers are stored in separate files,
because the fully-connected layers are periodically updated by fine tuning on
the server side.

Therefore, the application initially downloads both partial model files and
concatenates them into a single model. When the model is updated, the
application downloads the fine-tuned part of the model and replaces only the
fully-connected layers with it.
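
As a purely hypothetical sketch (this draft does not yet define the WebNN API
surface), the example below assumes the application fetches the two partial
weight files separately and wires the convolutional part and the
fully-connected part together through a graph-builder-style interface; the file
URLs and the buildConvLayers and buildFcLayers helpers are placeholders.

<pre highlight="js">
// Hypothetical sketch only: the graph-builder interface, buildConvLayers and
// buildFcLayers are placeholders, not part of this draft.
async function loadConcatenatedModel(builder) {
  // The convolutional weights are stable, while the fully-connected weights
  // are re-downloaded whenever server-side fine tuning produces an update.
  const [convWeights, fcWeights] = await Promise.all([
    fetch('models/conv_layers.bin').then(r => r.arrayBuffer()),
    fetch('models/fc_layers.bin').then(r => r.arrayBuffer())
  ]);
  const input = builder.input('input', [1, 224, 224, 3]);
  const features = buildConvLayers(builder, input, convWeights);
  const output = buildFcLayers(builder, features, fcWeights);
  return builder.build({output: output});
}
</pre>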

### Performance Adaptation ### {#usecase-perf-adapt}

A web application developer is concerned about the performance of her DNN model
on mobile devices. She has confirmed that it may run too slowly on mobile
devices that do not have GPU acceleration. To address this issue, her web
application queries the WebNN API to confirm whether acceleration is available,
so that it can display a warning on devices without acceleration.

After several weeks, she has developed a tiny DNN model that can run even on a
CPU. To accommodate CPU execution, she modifies the application so that it
loads the tiny model on CPU-only devices.
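
As a purely hypothetical sketch (this draft does not yet define a capability
query or device selection API), the example below assumes the application can
request a GPU-accelerated context and fall back to the tiny model when only CPU
execution is available; navigator.ml.createContext, the deviceType option, the
warning helper, and the model URLs are placeholder names.

<pre highlight="js">
// Hypothetical sketch only: navigator.ml.createContext and the deviceType
// option are assumed names, not part of this draft.
async function selectModel() {
  try {
    const context = await navigator.ml.createContext({deviceType: 'gpu'});
    return {context, modelUrl: 'models/full_model.bin'};
  } catch (e) {
    // No GPU acceleration is available: warn the user and fall back to the
    // tiny model that is fast enough for CPU-only execution.
    showNoAccelerationWarning();  // placeholder application function
    const context = await navigator.ml.createContext({deviceType: 'cpu'});
    return {context, modelUrl: 'models/tiny_model.bin'};
  }
}
</pre>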

<pre class="biblio">
{
"SSD": {
"href": "https://arxiv.org/abs/1512.02325",
"title": "SSD: Single Shot MultiBox Detector",
"authors": [
"Wei Liu",
"Dragomir Anguelov",
"Dumitru Erhan",
"Christian Szegedy",
"Scott Reed",
"Cheng-Yang Fu",
"Alexander C. Berg"
],
"date": "December 2016"
},
"YOLO": {
"href": "https://arxiv.org/abs/1506.02640",
"title": "You Only Look Once: Unified, Real-Time Object Detection",
"authors": [
"Joseph Redmon",
"Santosh Divvala,",
"Ross Girshick",
"Ali Farhadi"
],
"date": "May 2016"
},
"DeepLabv3+": {
"href": "https://arxiv.org/abs/1802.02611",
"title": "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation",
"authors": [
"Liang-Chieh Chen",
"Yukun Zhu",
"George Papandreou",
"Florian Schroff",
"Hartwig Adam"
],
"date": "August 2018"
},
"MaskR-CNN": {
"href": "https://arxiv.org/abs/1703.06870",
"title": "Mask R-CNN",
"authors": [
"Kaiming He",
"Georgia Gkioxari",
"Piotr Dollár",
"Ross Girshick"
],
"date": "January 2018"
},
"PoseNet": {
"href": "https://medium.com/tensorflow/real-time-human-pose-estimation-in-the-browser-with-tensorflow-js-7dd0bc881cd5",
"title": "Real-time Human Pose Estimation in the Browser with TensorFlow.js",
"authors": [
"Dan Oved"
],
"date": "May 2018"
},
"FaceNet": {
"href": "https://arxiv.org/abs/1503.03832",
"title": "FaceNet: A Unified Embedding for Face Recognition and Clustering",
"authors": [
"Florian Schroff",
"Dmitry Kalenichenko",
"James Philbin"
],
"date": "June 2015"
},
"FAN": {
"href": "https://arxiv.org/abs/1703.07332",
"title": "How far are we from solving the 2D & 3D Face Alignment problem? (and a dataset of 230,000 3D facial landmarks)",
"authors": [
"Adrian Bulat",
"Georgios Tzimiropoulos"
],
"date": "September 2017"
},
"ContextualLoss": {
"href": "https://arxiv.org/abs/1803.02077",
"title": "The Contextual Loss for Image Transformation with Non-Aligned Data",
"authors": [
"Roey Mechrez",
"Itamar Talmi",
"Lihi Zelnik-Manor"
],
"date": "July 2018"
},
"PairedCycleGAN": {
"href": "http://openaccess.thecvf.com/content_cvpr_2018/html/Chang_PairedCycleGAN_Asymmetric_Style_CVPR_2018_paper.html",
"title": "PairedCycleGAN: Asymmetric Style Transfer for Applying and Removing Makeup",
"authors": [
"Huiwen Chang",
"Jingwan Lu",
"Fisher Yu",
"Adam Finkelstein"
],
"date": "June 2018"
},
"SRGAN": {
"href": "https://arxiv.org/abs/1609.04802",
"title": "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network",
"authors": [
"Christian Ledig",
"Lucas Theis",
"Ferenc Huszar",
"Jose Caballero",
"Andrew Cunningham",
"Alejandro Acosta",
"Andrew Aitken",
"Alykhan Tejani",
"Johannes Totz",
"Zehan Wang",
"Wenzhe Shi"
],
"date": "May 2017"
},
"im2txt": {
"href": "https://arxiv.org/abs/1609.06647",
"title": "Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge",
"authors": [
"Oriol Vinyals",
"Alexander Toshev",
"Samy Bengio",
"Dumitru Erhan"
],
"date": "September 2016"
},
"GNMT": {
"href": "https://github.com/tensorflow/nmt",
"title": "Neural Machine Translation (seq2seq) Tutorial",
"authors": [
"Minh-Thang Luong",
"Eugene Brevdo",
"Rui Zhao"
],
"date": "May 2017"
},
"OpenNMT": {
"href": "https://arxiv.org/abs/1701.02810",
"title": "OpenNMT: Open-Source Toolkit for Neural Machine Translation",
"authors": [
"Guillaume Klein",
"Yoon Kim",
"Yuntian Deng",
"Jean Senellart",
"Alexander M. Rush"
],
"date": "March 2017"
},
"DeepMoji": {
"href": "https://arxiv.org/abs/1708.00524",
"title": "Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm",
"authors": [
"Bjarke Felbo",
"Alan Mislove",
"Anders Søgaard",
"Iyad Rahwan",
"Sune Lehmann"
],
"date": "October 2017"
},
"Video-Summarization-with-LSTM": {
"href": "http://www-scf.usc.edu/~zhan355/ke_eccv2016.pdf",
"title": "Video summarization with long short-term memory",
"authors": [
"Ke Zhang",
"Wei-Lun Chao",
"Fei Sha",
"Kristen Grauman"
],
"date": "October 2016"
},
"LeakyReLU": {
"href": "https://pdfs.semanticscholar.org/367f/2c63a6f6a10b3b64b8729d601e69337ee3cc.pdf",
"title": "Rectifier Nonlinearities Improve Neural Network Acoustic Models",
"authors": [
"Andrew L. Maas",
"Awni Y. Hannun",
"Andrew Y. Ng"
],
"date": "June 2013"
},
"ELU": {
"href": "https://arxiv.org/abs/1511.07289",
"title": "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)",
"authors": [
"Djork-Arné Clevert",
"Thomas Unterthiner",
"Sepp Hochreiter"
],
"date": "February 2016"
}
}
</pre>