"""
Automatic Differentiation with ``torch.autograd``
=================================================

When training neural networks, the most frequently used algorithm is
**back propagation**. In this algorithm, parameters (model weights) are
adjusted according to the **gradient** of the loss function with respect
to the given parameter.

To compute those gradients, PyTorch has a built-in differentiation engine
called ``torch.autograd``. It supports automatic computation of gradients
for any computational graph.

Consider the simplest one-layer neural network, with input ``x``,
parameters ``w`` and ``b``, and some loss function. It can be defined in
PyTorch in the following manner:
"""

import torch

x = torch.ones(5)   # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)


######################################################################
# A reference to the backward propagation function is stored in the
# ``grad_fn`` property of a tensor. You can find more information about
# ``Function`` `in the documentation <https://pytorch.org/docs/stable/autograd.html#function>`__.
#

print(z.grad_fn, loss.grad_fn, sep='\n')

######################################################################
# Computing Gradients
# -------------------
#
# To optimize the weights of parameters in the neural network, we need to
# compute the derivatives of our loss function with respect to the parameters,
# namely, we need :math:`\frac{\partial loss}{\partial w}` and
# :math:`\frac{\partial loss}{\partial b}` under some fixed values of
# ``x`` and ``y``. To compute those derivatives, we call
# ``loss.backward()``, and then retrieve the values from ``w.grad`` and
# ``b.grad``:
#

loss.backward()
print(w.grad)
print(b.grad)


######################################################################
# Note that we can only perform gradient calculations using
# ``backward`` once on a given graph, for performance reasons. If we need
# to do several ``backward`` calls on the same graph, we need to pass
# ``retain_graph=True`` to the ``backward`` call.
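
######################################################################
# As a minimal sketch (this tiny graph is our own illustration, not part
# of the example above), calling ``backward`` twice only succeeds when the
# first call keeps the graph alive:

t = torch.tensor(2.0, requires_grad=True)
s = t * t
s.backward(retain_graph=True)  # keep intermediate buffers for another pass
s.backward()                   # second call now works; gradients accumulate
print(t.grad)                  # tensor(8.) -- 4.0 contributed by each call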


######################################################################
# Disabling Gradient Tracking
# ---------------------------
#
# By default, all tensors with ``requires_grad=True`` are tracking their
# computational history and support gradient computation. However, there
# are some cases when we do not need to do that, for example, when we have
# trained the model and just want to apply it to some input data, i.e. we
# only want to do *forward* computations through the network. We can stop
# tracking computations by surrounding our computation code with a
# ``with torch.no_grad():`` block:
#

z = torch.matmul(x, w) + b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w) + b
print(z.requires_grad)


######################################################################
# Another way to achieve the same result is to use the ``detach()`` method
# on the tensor:
#

z = torch.matmul(x, w) + b
z_det = z.detach()
print(z_det.requires_grad)

######################################################################
# There are several reasons you might want to disable gradient tracking:
#
# - To mark some parameters in your neural network as **frozen parameters**.
#   This is a very common scenario for
#   `finetuning a pretrained network <https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html>`__.
# - To **speed up computations** when you are only doing the forward pass,
#   because computations on tensors that do not track gradients are more
#   efficient. A short sketch of freezing parameters follows below.
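
######################################################################
# Here is a minimal sketch of freezing parameters (the ``Linear`` layer
# sizes are arbitrary, chosen just for illustration):

layer = torch.nn.Linear(5, 3)
for param in layer.parameters():
    param.requires_grad = False  # freeze: autograd will no longer track these
print(any(p.requires_grad for p in layer.parameters()))  # False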


######################################################################
# Example of Gradient Descent
# ---------------------------
#
# Let's use the autograd functionality to minimize a simple function of
# two variables, :math:`f(x_1,x_2)=(x_1-3)^2+(x_2+2)^2`. We will use the
# ``x`` tensor to represent the coordinates of a point. To do gradient
# descent, we start with some initial value :math:`x^{(0)}=(0,0)`, and
# compute each consecutive step using:
#
# .. math::
#
#    x^{(n+1)} = x^{(n)} - \eta\nabla f
#
# Here :math:`\eta` is the so-called **learning rate** (we will call it
# ``lr`` in our code), and
# :math:`\nabla f = (\frac{\partial f}{\partial x_1},\frac{\partial f}{\partial x_2})`
# is the gradient of :math:`f`.
#
# We will start by defining the initial value of ``x`` and the function
# ``f``:
#

x = torch.zeros(2, requires_grad=True)


def f(x):
    return (x - torch.tensor([3, -2])).pow(2).sum()


lr = 0.1


######################################################################
# For the gradient descent, let's do 15 iterations. On each iteration, we
# will update the coordinate tensor ``x`` and print its coordinates to
# make sure that we are approaching the minimum:
#

for i in range(15):
    y = f(x)
    y.backward()           # compute the gradient of f at the current point
    gr = x.grad
    x.data.add_(-lr * gr)  # step against the gradient, outside of tracking
    x.grad.zero_()         # reset accumulated gradients before the next step
    print("Step {}: x[0]={}, x[1]={}".format(i, x[0], x[1]))
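
######################################################################
# As a side note (our addition, not part of the original walkthrough), the
# same update can be written without touching ``.data`` by suspending
# gradient tracking during the update step:

x2 = torch.zeros(2, requires_grad=True)
for _ in range(15):
    f(x2).backward()
    with torch.no_grad():  # the update itself must not be recorded in the graph
        x2 -= lr * x2.grad
    x2.grad.zero_()
print(x2)                  # close to tensor([ 3., -2.])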


######################################################################
# As you can see, we have obtained values close to the optimal point
# :math:`(3,-2)`. `Training a neural network <optimization_tutorial.html>`_ is in fact
# a very similar process: we will need to do a number of iterations to
# minimize the value of the **loss function**.

######################################################################
# More on Computational Graphs
# ----------------------------
#
# Conceptually, autograd keeps a record of data (tensors) and all executed
# operations (along with the resulting new tensors) in a directed acyclic
# graph (DAG) consisting of
# `Function <https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function>`__
# objects. In this DAG, leaves are the input tensors and roots are the
# output tensors. By tracing this graph from roots to leaves, you can
# automatically compute the gradients using the chain rule.
#
# In a forward pass, autograd does two things simultaneously:
#
# - run the requested operation to compute a resulting tensor, and
# - maintain the operation's *gradient function* in the DAG.
#
# The backward pass kicks off when ``.backward()`` is called on the DAG
# root. ``autograd`` then:
#
# - computes the gradients from each ``.grad_fn``,
# - accumulates them in the respective tensor's ``.grad`` attribute, and
# - using the chain rule, propagates all the way to the leaf tensors.
#
# .. note::
#    **DAGs are dynamic in PyTorch**
#
#    An important thing to note is that the graph is recreated from scratch
#    after each ``.backward()`` call; autograd starts populating a new graph.
#    This is exactly what allows you to use control flow statements in your
#    model; you can change the shape, size and operations at every iteration
#    if needed.
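
######################################################################
# As a tiny sketch of this DAG (our own illustration), you can follow
# ``grad_fn.next_functions`` from a root tensor back toward the leaves:

a = torch.tensor(1.0, requires_grad=True)
root = a * 2 + 1                    # builds a two-operation DAG
print(root.grad_fn)                 # the root's gradient function (an Add node)
print(root.grad_fn.next_functions)  # leads back toward the Mul node and the leaf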


######################################################################
# Optional Reading: Tensor Gradients and Jacobian Products
# --------------------------------------------------------
#
# In many cases, we have a scalar loss function, and we need to compute
# the gradient with respect to some parameters. However, there are cases
# when the output function is an arbitrary tensor. In this case, PyTorch
# allows you to compute a so-called **Jacobian product**, and not the actual
# gradient.
#
# For a vector function :math:`\vec{y}=f(\vec{x})`, where
# :math:`\vec{x}=\langle x_1,\dots,x_n\rangle` and
# :math:`\vec{y}=\langle y_1,\dots,y_m\rangle`, the gradient of
# :math:`\vec{y}` with respect to :math:`\vec{x}` is given by the **Jacobian
# matrix**:
#
# .. math::
#
#    \begin{align}J=\left(\begin{array}{ccc}
#       \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
#       \vdots & \ddots & \vdots\\
#       \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
#    \end{array}\right)\end{align}
#
# Instead of computing the Jacobian matrix itself, PyTorch allows you to
# compute the **Jacobian product** :math:`v^T\cdot J` for a given input vector
# :math:`v=(v_1 \dots v_m)`. This is achieved by calling ``backward`` with
# :math:`v` as an argument. The size of :math:`v` should be the same as
# the size of the output tensor on which we call ``backward``:
#

inp = torch.eye(5, requires_grad=True)
out = (inp + 1).pow(2)
out.backward(torch.ones_like(inp), retain_graph=True)  # v^T . J with v = all ones
print("First call\n", inp.grad)
out.backward(torch.ones_like(inp), retain_graph=True)  # gradients accumulate
print("\nSecond call\n", inp.grad)
inp.grad.zero_()                                       # reset the accumulated gradients
out.backward(torch.ones_like(inp), retain_graph=True)
print("\nCall after zeroing gradients\n", inp.grad)


######################################################################
# Notice that when we call ``backward`` for the second time with the same
# argument, the value of the gradient is different. This happens because
# when doing ``backward`` propagation, PyTorch **accumulates the
# gradients**, i.e. the value of computed gradients is added to the
# ``grad`` property of all leaf nodes of the computational graph. If you
# want to compute the proper gradients, you need to zero out the ``grad``
# property beforehand. In real-life training an *optimizer* helps us to do
# this.

######################################################################
# .. note:: Previously we were calling the ``backward()`` function without
#           parameters. This is essentially equivalent to calling
#           ``backward(torch.tensor(1.0))``, which is a useful way to compute
#           the gradients in the case of a scalar-valued function, such as the
#           loss during neural network training.
#
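
######################################################################
# As a quick sketch of that equivalence (our own check, not part of the
# original text):

u = torch.tensor(3.0, requires_grad=True)
(u * u).backward(torch.tensor(1.0))  # same as calling (u * u).backward()
print(u.grad)                        # tensor(6.)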


######################################################################
# Next: Learn more about `how to use automatic differentiation to train a neural network model <optimization_tutorial.html>`_.
#
# .. include:: /beginner_source/quickstart/qs_toc.txt
#