April 4, 2020, 2:58 PM
What three built-in PyTorch functions do:
What does the backward() function do? - autograd - PyTorch Forums
- loss.backward() computes dloss/dx for every parameter x which has requires_grad=True. These are accumulated into x.grad for every parameter x. In pseudo-code:
- x.grad += dloss/dx
- optimizer.step() updates the value of x using the gradient x.grad. For example, the SGD optimizer performs:
- x += -lr * x.grad
- optimizer.zero_grad() clears x.grad for every parameter x in the optimizer. It’s important to call this before loss.backward(), otherwise you’ll accumulate the gradients from multiple passes.
- If you have multiple losses (loss1, loss2) you can sum them and then call backward() once:
- loss3 = loss1 + loss2
- loss3.backward()
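The behavior described in the bullets above can be checked on a single scalar. This is a minimal sketch, not taken from the forum thread: the tensor x, the toy losses x ** 2 and 3 * x, and the learning rate 0.1 are all assumptions chosen so the gradients are easy to verify by hand.

```python
import torch

# Hypothetical scalar parameter, used only for illustration.
x = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(x^2)/dx = 2x = 4 is written into x.grad.
(x ** 2).backward()
grad_after_one_pass = x.grad.item()

# Second backward pass without clearing: the new gradient is added on top.
(x ** 2).backward()
grad_after_two_passes = x.grad.item()

# Clear the gradient by hand; optimizer.zero_grad() plays this role
# for the parameters an optimizer manages.
x.grad.zero_()

# Two losses summed, then a single backward() call.
loss1 = x ** 2   # d/dx = 2x = 4
loss2 = 3 * x    # d/dx = 3
(loss1 + loss2).backward()
grad_of_sum = x.grad.item()

# One plain SGD step: x += -lr * x.grad.
opt = torch.optim.SGD([x], lr=0.1)
opt.step()

print(grad_after_one_pass, grad_after_two_passes, grad_of_sum, x.item())
```

The doubled gradient after the second pass is the accumulation that zero_grad() is meant to prevent.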
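Important
- Each training iteration is just a code implementation of a mathematical result. The code carries out only the conclusion; it does not show the derivation you would work through on paper.
- During each training iteration, try asking yourself one question:
- Why does x += -lr * x.grad bring w closer and closer to the correct result?
- I think it is more accurate to write the x here as w, i.e. w += -lr * w.grad: the independent variable of the loss function should be the weights w, not the sample x, since the samples are known quantities.
- First, recognize that this is a conclusion derived mathematically. The derivation is that of gradient descent (SGD is its stochastic, mini-batch variant): the loss is f(w), and a first-order Taylor expansion of f around the current w yields the update rule w += -lr * w.grad.
- This question has to be asked explicitly; otherwise you end up putting the cart before the horse.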
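The Taylor-expansion argument mentioned above can be written out as follows. This is a sketch of the standard first-order reasoning, not a derivation from the original notes: the symbols \eta (standing for lr) and \Delta w (the change applied to w) are introduced here for illustration, and the approximation assumes \eta is small enough for higher-order terms to be negligible.

```latex
% First-order Taylor expansion of the loss f around the current weights w
f(w + \Delta w) \approx f(w) + \nabla f(w)^{\top} \Delta w

% Choose the step opposite to the gradient, scaled by the learning rate \eta
\Delta w = -\eta \, \nabla f(w)

% Substituting shows the loss cannot increase, to first order
f\bigl(w - \eta \, \nabla f(w)\bigr) \approx f(w) - \eta \, \lVert \nabla f(w) \rVert^{2} \le f(w)
```

In code, \nabla f(w) is what loss.backward() stores in w.grad, so the step \Delta w = -\eta \nabla f(w) is exactly w += -lr * w.grad.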
import numpy as np
import torch

# Assuming we know that the desired function is a polynomial of 2nd degree, we
# allocate a vector of size 3 to hold the coefficients and initialize it with
# random noise.
w = torch.randn(3, 1, requires_grad=True)

# We use the Adam optimizer with learning rate set to 0.1 to minimize the loss.
opt = torch.optim.Adam([w], lr=0.1)

def model(x):
    # We define yhat to be our estimate of y.
    # f has one row per sample with the features [x^2, x, 1].
    f = torch.stack([x * x, x, torch.ones_like(x)], 1)
    yhat = torch.squeeze(f @ w, 1)
    return yhat

def compute_loss(y, yhat):
    # The loss is defined to be the mean squared error distance between our
    # estimate of y and its true value.
    loss = torch.nn.functional.mse_loss(yhat, y)
    return loss

def generate_data():
    # Generate some training data based on the true function y = 5x^2 + 3.
    x = torch.rand(100) * 20 - 10
    y = 5 * x * x + 3
    return x, y

def train_step():
    x, y = generate_data()
    yhat = model(x)
    loss = compute_loss(y, yhat)
    opt.zero_grad()   # clear gradients left over from the previous step
    loss.backward()   # accumulate dloss/dw into w.grad
    opt.step()        # update w using w.grad

for _ in range(1000):
    train_step()

# The learned coefficients should approach [5, 0, 3].
print(w.detach().numpy())
- Why does x += -lr * x.grad bring w closer and closer to the correct result?
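One way to probe this question numerically is to apply the update by hand, without an optimizer, and watch the loss. The sketch below is an illustration under stated assumptions rather than part of the original example: the toy loss (w - 3)^2, the starting value 0, the step size lr = 0.1, and the 30 iterations are all chosen here so that the effect is easy to see.

```python
import torch

lr = 0.1                                   # assumed step size for this toy problem
w = torch.tensor(0.0, requires_grad=True)  # hypothetical scalar weight
losses = []

for _ in range(30):
    loss = (w - 3.0) ** 2     # toy loss f(w), minimized at w = 3
    losses.append(loss.item())
    loss.backward()           # w.grad = df/dw = 2 * (w - 3)
    with torch.no_grad():     # the update itself must not be tracked by autograd
        w += -lr * w.grad     # the update rule discussed in the notes
    w.grad.zero_()            # clear the gradient before the next pass

print(losses[0], losses[-1], w.item())
```

Each recorded loss is smaller than the previous one and w moves toward 3, which is what the first-order Taylor argument predicts for a sufficiently small learning rate.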