How PyTorch updates the parameters W in each training iteration

April 4, 2020, 2:58 PM

What three built-in PyTorch functions do:

What does the backward() function do? - autograd - PyTorch Forums

loss.backward() computes dloss/dx for every parameter x which has requires_grad=True. These are accumulated into x.grad for every parameter x. In pseudo-code:
- x.grad += dloss/dx
optimizer.step() updates the value of x using the gradient x.grad. For example, the SGD optimizer performs:
- x += -lr * x.grad
optimizer.zero_grad() clears x.grad for every parameter x in the optimizer. It’s important to call this before loss.backward(), otherwise you’ll accumulate the gradients from multiple passes.
If you have multiple losses (loss1, loss2) you can sum them and then call backward() once:
- loss3 = loss1 + loss2
- loss3.backward()
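
The behaviour of the three calls described above can be checked on a toy example. This is a minimal sketch, not part of the forum answer: the two-element parameter w, the quadratic loss_fn, and the learning rate 0.1 are all made up for illustration.

```python
import torch

# Illustrative parameter and toy loss (assumptions, not from the forum post).
w = torch.tensor([1.0, 2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

def loss_fn(w):
    return (w ** 2).sum()          # dloss/dw = 2 * w

# Two backward passes without zero_grad(): the gradients accumulate in w.grad.
loss_fn(w).backward()
loss_fn(w).backward()
accumulated = w.grad.clone()       # 2 * (2 * w) = [4., 8.]

# zero_grad() clears w.grad, so the next backward() starts from scratch.
opt.zero_grad()
loss_fn(w).backward()
fresh = w.grad.clone()             # 2 * w = [2., 4.]

# With plain SGD, step() applies w += -lr * w.grad.
expected = w.detach() - 0.1 * fresh
opt.step()
updated = w.detach().clone()       # [0.8, 1.6]
```

Comparing accumulated with fresh shows why zero_grad() is normally called before each backward(): without it, the second pass adds its gradient on top of the first.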

Important

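The code in each training iteration is only an implementation of a mathematical result. It applies the conclusion of the math; it does not show the derivation you would work through on paper.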

In each training iteration, try asking yourself one question:

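Why does x += -lr * x.grad bring W closer and closer to the correct result?
- I think the x here is more accurately written as w, i.e. w += -lr * w.grad. The independent variable of the loss function is w, not the sample x; the samples are known quantities.
- First, recognize that this update is a mathematically derived result. The derivation is gradient descent (SGD, stochastic gradient descent, is the variant that estimates the gradient from a sample or mini-batch). The loss is f(w); taking a first-order Taylor expansion of f around the current w yields the update rule w += -lr * w.grad.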
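
The Taylor-expansion argument can be sketched as follows. This is a first-order sketch rather than a full proof: the symbols \eta (standing for lr) and \Delta w are notation introduced here, and the approximation assumes \eta is small enough that higher-order terms can be ignored.

```latex
% First-order Taylor expansion of the loss f around the current w
f(w + \Delta w) \approx f(w) + \nabla f(w)^{\top} \Delta w

% Step in the direction opposite to the gradient, with learning rate \eta > 0
\Delta w = -\eta \, \nabla f(w)

% Substituting: to first order, the loss does not increase
f(w + \Delta w) \approx f(w) - \eta \, \lVert \nabla f(w) \rVert^{2} \le f(w)
```

In code, \nabla f(w) is w.grad and \eta is lr, which gives exactly w += -lr * w.grad.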

This question has to be asked explicitly; otherwise you end up putting the cart before the horse.
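
The update rule can also be checked numerically without any framework. The sketch below assumes a hypothetical one-dimensional loss f(w) = (w - 3)^2 with a hand-written derivative; the function names, the starting point, and the learning rate are chosen only for illustration.

```python
def f(w):
    return (w - 3.0) ** 2          # toy loss with its minimum at w = 3

def grad_f(w):
    return 2.0 * (w - 3.0)         # derivative of f, worked out by hand

lr = 0.1
w = 0.0
losses = [f(w)]
for _ in range(50):
    w += -lr * grad_f(w)           # the same update rule as in the text
    losses.append(f(w))
```

Each step shrinks the distance to the minimum by a factor of 1 - 2 * lr = 0.8, so the recorded losses decrease at every iteration and w ends up very close to 3.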

import numpy as np
import torch

# Assuming we know that the desired function is a polynomial of 2nd degree, we
# allocate a vector of size 3 to hold the coefficients and initialize it with
# random noise.
w = torch.randn([3, 1], requires_grad=True)

# We use the Adam optimizer with learning rate set to 0.1 to minimize the loss.
opt = torch.optim.Adam([w], 0.1)

def model(x):
    # We define yhat to be our estimate of y.
    f = torch.stack([x * x, x, torch.ones_like(x)], 1)
    yhat = torch.squeeze(f @ w, 1)
    return yhat

def compute_loss(y, yhat):
    # The loss is defined to be the mean squared error distance between our
    # estimate of y and its true value. 
    loss = torch.nn.functional.mse_loss(yhat, y)
    return loss

def generate_data():
    # Generate some training data based on the true function
    x = torch.rand(100) * 20 - 10
    y = 5 * x * x + 3
    return x, y

def train_step():
    x, y = generate_data()

    yhat = model(x)
    loss = compute_loss(y, yhat)

    opt.zero_grad()
    loss.backward()
    opt.step()

for _ in range(1000):
    train_step()

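# The data are generated from y = 5 * x^2 + 3, so w should move toward
# the true coefficients [5, 0, 3] as training proceeds.
print(w.detach().numpy())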