
Commit 15a5cbc

sethjuarez authored and subramen committed
corrected rebase errors
1 parent acf287d commit 15a5cbc

10 files changed: +571 -703 lines changed

beginner_source/quickstart/autograd_quickstart_tutorial.py

Lines changed: 0 additions & 16 deletions
This file was deleted.
Lines changed: 140 additions & 113 deletions
@@ -1,14 +1,14 @@
 """
-Automatic Differentiation with AutoGrad
+Automatic Differentiation with ``torch.autograd``
 =======================================
 
 When training neural networks, the most frequently used algorithm is
 **back propagation**. In this algorithm, parameters (model weights) are
 adjusted according to the **gradient** of the loss function with respect
 to the given parameter.
 
-To compute those gradients, PyTorch has a built-in mechanism called
-**AutoGrad**. It supports automatic computation of gradient for any
+To compute those gradients, PyTorch has a built-in differentiation engine
+called ``torch.autograd``. It supports automatic computation of gradient for any
 computational graph.
 
 Consider the simplest one-layer neural network, with input ``x``,
@@ -17,12 +17,12 @@
 """
 
 import torch
-x = torch.ones(5) # input tensor
-y = torch.zeros(3) # expected output
-w = torch.randn(5,3,requires_grad=True)
-b = torch.randn(3,requires_grad=True)
-z = torch.matmul(x,w)+b
-loss = torch.nn.functional.binary_cross_entropy_with_logits(z,y)
+x = torch.ones(5)  # input tensor
+y = torch.zeros(3)  # expected output
+w = torch.randn(5, 3, requires_grad=True)
+b = torch.randn(3, requires_grad=True)
+z = torch.matmul(x, w)+b
+loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
 
 
 ######################################################################
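A quick check of the snippet this hunk updates (a minimal sketch, not part of the commit; it assumes only the public ``torch`` API): ``z`` and ``loss`` are built from ``w`` and ``b``, which have ``requires_grad=True``, so both inherit gradient tracking:

    import torch

    x = torch.ones(5)   # input tensor
    y = torch.zeros(3)  # expected output
    w = torch.randn(5, 3, requires_grad=True)
    b = torch.randn(3, requires_grad=True)
    z = torch.matmul(x, w) + b
    loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

    # both are derived from tensors with requires_grad=True
    print(z.requires_grad, loss.requires_grad)  # True True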
@@ -51,22 +51,22 @@
 # the backward propagation function is stored in ``grad_fn`` property of a
 # tensor. You can find more information of ``Function`` `in
 # documentation <https://pytorch.org/docs/stable/autograd.html#function>`__.
-# 
+#
 
-print(z.grad_fn,loss.grad_fn,sep='\n')
+print(z.grad_fn, loss.grad_fn, sep='\n')
 
 ######################################################################
 # Computing Gradients
 # -------------------
-# 
+#
 # To optimize weights of parameters in the neural network, we need to
 # compute the derivatives of our loss function with respect to parameters,
 # namely, we need :math:`\frac{\partial loss}{\partial w}` and
 # :math:`\frac{\partial loss}{\partial b}` under some fixed values of
 # ``x`` and ``y``. To compute those derivatives, we call
 # ``loss.backward()``, and then retrieve the values from ``w.grad`` and
 # ``b.grad``:
-# 
+#
 
 loss.backward()
 print(w.grad)
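The gradients this hunk computes can be verified in closed form (a sketch outside the diff, reusing the setup above): with the default ``mean`` reduction over the 3 outputs, d(loss)/dz = (sigmoid(z) - y) / 3, and since ``z = x @ w + b``, ``b.grad`` equals that vector while ``w.grad`` is its outer product with ``x``:

    import torch

    x = torch.ones(5)
    y = torch.zeros(3)
    w = torch.randn(5, 3, requires_grad=True)
    b = torch.randn(3, requires_grad=True)
    z = torch.matmul(x, w) + b
    loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
    loss.backward()

    dz = (torch.sigmoid(z.detach()) - y) / 3            # d(loss)/dz under 'mean' reduction
    print(torch.allclose(b.grad, dz))                   # True
    print(torch.allclose(w.grad, x.unsqueeze(1) * dz))  # True: outer product of x and dz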
@@ -83,169 +83,196 @@
 # ``backward`` once on a given graph, for performance reasons. If we need
 # to do several ``backward`` calls on the same graph, we need to pass
 # ``retain_graph=True`` to the ``backward`` call.
-# 
-
-
-######################################################################
-# Tensor Gradients and Jacobian Products
-# --------------------------------------
-#
-# In many cases, we have a scalar loss function, and we need to compute
-# the gradient with respect to some parameters. However, there are cases
-# when the output function is an arbitrary tensor. In this case, PyTorch
-# allows you to compute so-called **Jacobian product**, and not the actual
-# gradient.
-#
-# For a vector function :math:`\vec{y}=f(\vec{x})`, where
-# :math:`\vec{x}=\langle x_1,\dots,x_n\rangle` and
-# :math:`\vec{y}=\langle y_1,\dots,y_m\rangle`, a gradient of
-# :math:`\vec{y}` with respect to :math:`\vec{x}` is given by **Jacobian
-# matrix**:
-#
-# .. math::
-#
-#
-#    \begin{align}J=\left(\begin{array}{ccc}
-#      \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
-#      \vdots & \ddots & \vdots\\
-#      \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
-#      \end{array}\right)\end{align}
-#
-# Instead of computing the Jacobian matrix itself, PyTorch allows you to
-# compute **Jacobian Product** :math:`v^T\cdot J` for a given input vector
-# :math:`v=(v_1 \dots v_m)`. This is achieved by calling ``backward`` with
-# :math:`v` as an argument. The size of :math:`v` should be the same as
-# the size of the original tensor, with respect to which we want to
-# compute the product:
-#
-
-inp = torch.eye(5,requires_grad=True)
-out = (inp+1).pow(2)
-out.backward(torch.ones_like(inp),retain_graph=True)
-print("First call\n",inp.grad)
-out.backward(torch.ones_like(inp),retain_graph=True)
-print("\nSecond call\n",inp.grad)
-inp.grad.zero_()
-out.backward(torch.ones_like(inp),retain_graph=True)
-print("\nCall after zeroing gradients\n",inp.grad)
-
-
-######################################################################
-# Notice that when we call ``backward`` for the second time with the same
-# argument, the value of the gradient is different. This happens because
-# when doing ``backward`` propagation, PyTorch **accumulates the
-# gradients**, i.e. the value of computed gradients is added to the
-# ``grad`` property of all leaf nodes of computational graph. If you want
-# to compute the proper gradients, you need to zero out the ``grad``
-# property before. In real-life training an *optimizer* helps us to do
-# this.
-
-######################################################################
-# .. note:: Previously we were calling ``backward()`` function without
-#           parameters. This is essentially equivalent to calling
-#           ``backward(torch.tensor(1.0))``, which is a useful way to compute the
-#           gradients in case of a scalar-valued function, such as loss during
-#           neural network training.
-#
+#
 
 
 ######################################################################
 # Disabling Gradient Tracking
 # ---------------------------
-# 
+#
 # By default, all tensors with ``requires_grad=True`` are tracking their
 # computational history and support gradient computation. However, there
 # are some cases when we do not need to do that, for example, when we have
 # trained the model and just want to apply it to some input data, i.e. we
 # only want to do *forward* computations through the network. We can stop
 # tracking computations by surrounding our computation code with
 # ``with torch.no_grad()`` block:
-# 
+#
 
-z = torch.matmul(x,w)+b
+z = torch.matmul(x, w)+b
 print(z.requires_grad)
 
 with torch.no_grad():
-    z = torch.matmul(x,w)+b
+    z = torch.matmul(x, w)+b
 print(z.requires_grad)
 
 
 ######################################################################
 # Another way to achieve the same result is to use the ``detach()`` method
 # on the tensor:
-# 
+#
 
-z = torch.matmul(x,w)+b
+z = torch.matmul(x, w)+b
 z_det = z.detach()
 print(z_det.requires_grad)
 
-
 ######################################################################
-# All forward-pass computations on tensors that do not track gradients
-# would be more efficient.
-#
+# There are several reasons you might want to disable gradient tracking:
+#   - To mark some parameters in your neural network as **frozen parameters**. This is
+#     a very common scenario for
+#     `finetuning a pretrained network <https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html>`__
+#   - To **speed up computations** when you are only doing forward pass, because computations on tensors that do
+#     not track gradients would be more efficient.
 
 
 ######################################################################
 # Example of Gradient Descent
 # ---------------------------
-# 
+#
 # Let's use the AutoGrad functionality to minimize a simple function of
 # two variables :math:`f(x_1,x_2)=(x_1-3)^2+(x_2+2)^2`. We will use the
 # ``x`` tensor to represent the coordinates of a point. To do the gradient
 # descent, we start with some initial value :math:`x^{(0)}=(0,0)`, and
 # compute each consecutive step using:
-# 
+#
 # .. math::
-# 
-# 
+#
+#
 #    x^{(n+1)} = x^{(n)} - \eta\nabla f
-# 
+#
 # Here :math:`\eta` is so-called **learning rate** (we will call it ``lr``
 # in our code), and
 # :math:`\nabla f = (\frac{\partial f}{\partial x_1},\frac{\partial f}{\partial x_2})`
 # is the gradient of :math:`f`.
-# 
+#
 # We will start by defining the initial value of ``x`` and the function
 # ``f``:
-# 
+#
+
+x = torch.zeros(2, requires_grad=True)
+def f(x): return (x-torch.tensor([3, -2])).pow(2).sum()
+
 
-x = torch.zeros(2,requires_grad=True)
-f = lambda x : (x-torch.tensor([3,-2])).pow(2).sum()
 lr = 0.1
 
 
 ######################################################################
 # For the gradient descent, let's do 15 iterations. On each iteration, we
 # will update the coordinate tensor ``x`` and print its coordinates to
 # make sure that we are approaching the minimum:
-# 
+#
 
 for i in range(15):
     y = f(x)
     y.backward()
     gr = x.grad
     x.data.add_(-lr*gr)
     x.grad.zero_()
-    print("Step {}: x[0]={}, x[1]={}".format(i,x[0],x[1]))
+    print("Step {}: x[0]={}, x[1]={}".format(i, x[0], x[1]))
 
 
 ######################################################################
 # As you can see, we have obtained the values close to the optimal point
-# :math:`(3,-2)`. Training a neural network is in fact a very similar
+# :math:`(3,-2)`. `Training a neural network <optimization_tutorial.html>`_ is in fact a very similar
 # process, we will need to do a number of iterations to minimize the value
 # of **loss function**.
-#
-# Next: Learn more about `how to use AutoGrad to train a neural network model <optimization_tutorial.html>`_.
-#
-
-##################################################################
-# Pytorch Quickstart Topics
-# -----------------
-#| `Tensors <tensor_tutorial.html>`_
-#| `DataSets and DataLoaders <data_quickstart_tutorial.html>`_
-#| `Transforms <transforms_tutorial.html>`_
-#| `Build Model <build_model_tutorial.html>`_
-#| `Optimization Loop <optimization_tutorial.html>`_
-#| `AutoGrad <autograd_tutorial.html>`_
-#| `Save, Load and Run Model <save_load_run_tutorial.html>`_
+
+######################################################################
+# More on Computational Graphs
+# ----------------------------
+# Conceptually, autograd keeps a record of data (tensors) & all executed
+# operations (along with the resulting new tensors) in a directed acyclic
+# graph (DAG) consisting of
+# `Function <https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function>`__
+# objects. In this DAG, leaves are the input tensors, roots are the output
+# tensors. By tracing this graph from roots to leaves, you can
+# automatically compute the gradients using the chain rule.
+#
+# In a forward pass, autograd does two things simultaneously:
+#
+# - run the requested operation to compute a resulting tensor, and
+# - maintain the operation's *gradient function* in the DAG.
+#
+# The backward pass kicks off when ``.backward()`` is called on the DAG
+# root. ``autograd`` then:
+#
+# - computes the gradients from each ``.grad_fn``,
+# - accumulates them in the respective tensor's ``.grad`` attribute, and
+# - using the chain rule, propagates all the way to the leaf tensors.
+#
+# .. note::
+#    **DAGs are dynamic in PyTorch**
+#    An important thing to note is that the graph is recreated from scratch; after each
+#    ``.backward()`` call, autograd starts populating a new graph. This is
+#    exactly what allows you to use control flow statements in your model;
+#    you can change the shape, size and operations at every iteration if
+#    needed.
+
+######################################################################
+# Optional Reading: Tensor Gradients and Jacobian Products
+# ---------------------------------------------------------
+#
+# In many cases, we have a scalar loss function, and we need to compute
+# the gradient with respect to some parameters. However, there are cases
+# when the output function is an arbitrary tensor. In this case, PyTorch
+# allows you to compute so-called **Jacobian product**, and not the actual
+# gradient.
+#
+# For a vector function :math:`\vec{y}=f(\vec{x})`, where
+# :math:`\vec{x}=\langle x_1,\dots,x_n\rangle` and
+# :math:`\vec{y}=\langle y_1,\dots,y_m\rangle`, a gradient of
+# :math:`\vec{y}` with respect to :math:`\vec{x}` is given by **Jacobian
+# matrix**:
+#
+# .. math::
+#
+#
+#    \begin{align}J=\left(\begin{array}{ccc}
+#      \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
+#      \vdots & \ddots & \vdots\\
+#      \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
+#      \end{array}\right)\end{align}
+#
+# Instead of computing the Jacobian matrix itself, PyTorch allows you to
+# compute **Jacobian Product** :math:`v^T\cdot J` for a given input vector
+# :math:`v=(v_1 \dots v_m)`. This is achieved by calling ``backward`` with
+# :math:`v` as an argument. The size of :math:`v` should be the same as
+# the size of the original tensor, with respect to which we want to
+# compute the product:
+#
+
+inp = torch.eye(5, requires_grad=True)
+out = (inp+1).pow(2)
+out.backward(torch.ones_like(inp), retain_graph=True)
+print("First call\n", inp.grad)
+out.backward(torch.ones_like(inp), retain_graph=True)
+print("\nSecond call\n", inp.grad)
+inp.grad.zero_()
+out.backward(torch.ones_like(inp), retain_graph=True)
+print("\nCall after zeroing gradients\n", inp.grad)
+
+
+######################################################################
+# Notice that when we call ``backward`` for the second time with the same
+# argument, the value of the gradient is different. This happens because
+# when doing ``backward`` propagation, PyTorch **accumulates the
+# gradients**, i.e. the value of computed gradients is added to the
+# ``grad`` property of all leaf nodes of computational graph. If you want
+# to compute the proper gradients, you need to zero out the ``grad``
+# property before. In real-life training an *optimizer* helps us to do
+# this.
+
+######################################################################
+# .. note:: Previously we were calling ``backward()`` function without
+#           parameters. This is essentially equivalent to calling
+#           ``backward(torch.tensor(1.0))``, which is a useful way to compute the
+#           gradients in case of a scalar-valued function, such as loss during
+#           neural network training.
+#
+
+
+######################################################################
+# Next: Learn more about `how to use automatic differentiation to train a neural network model <optimization_tutorial.html>`_.
+#
+# .. include:: /beginner_source/quickstart/qs_toc.txt
+#
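Two of the additions above invite small illustrations. First, the "frozen parameters" bullet under Disabling Gradient Tracking: a minimal sketch (not part of this commit; the tiny two-layer model is hypothetical) of freezing everything except the layer you want to finetune:

    import torch
    from torch import nn

    # hypothetical stand-in for a pretrained network
    model = nn.Sequential(nn.Linear(5, 4), nn.ReLU(), nn.Linear(4, 3))

    # freeze all parameters, then re-enable gradients only for the last layer
    for p in model.parameters():
        p.requires_grad_(False)
    for p in model[2].parameters():
        p.requires_grad_(True)

    print([p.requires_grad for p in model.parameters()])
    # [False, False, True, True]: only the final Linear's weight and bias will train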

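Second, the "DAGs are dynamic in PyTorch" note: because a fresh graph is recorded on every forward pass, data-dependent control flow is legal. A sketch (again not from the commit) where the recorded operations change between iterations:

    import torch

    x = torch.randn(3, requires_grad=True)
    for step in range(2):
        # a different graph is recorded on each pass
        y = (2 * x).sum() if step == 0 else (x ** 3).sum()
        y.backward()
        print(x.grad)  # step 0: all 2.0; step 1: 3 * x**2
        x.grad.zero_()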