线性回归实验

线性回归

回归问题是非常常见的一类问题，目的是寻找变量之间的关系。比如要从数据中寻找房屋面积与价格的关系，年龄和身高的关系，气体压力和体积的关系等等。而机器学习要做的正是要让机器自己来学习这些关系，并为对未知的情况做出预测。

$h_{\theta}(x)= \theta_{0} + \theta_{1} x$ $\pmb{\theta}$ $\theta_{1}$ $\theta_{0}$ $x$ $h$ 表示算法得到的映射。

代价函数的表示

$J(\pmb{\theta}) = \frac{1}{2n}\sum_{i=1}^{n} \Big( h_{\theta}(x^{(i)}) - y^{(i)} \Big)^2$ $h_{\theta}(x^{(i)})$ $J(\pmb{\theta}) = \frac{1}{2n} \sum_{i=1}^{n} \Big( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \Big)^2$ 误差函数的值越小，则代表算法拟合结果与真实结果越接近。

梯度下降

$\theta$ $\theta_i$ $\theta_i = \theta_i - \alpha\frac{\partial }{\partial \theta_i}J(\pmb{\theta})$

$\alpha$ $\frac{\partial J}{\partial \theta_{0} } = \frac{1}{n} \sum_{i=1}^{n} \Big( h_{\theta}(x^{(i)}) - y^{(i)} \Big) \cdot 1,$ $\frac{\partial J}{\partial \theta_{1} } = \frac{1}{n} \sum_{i=1}^{n} \Big( h_{\theta}(x^{(i)}) - y^{(i)} \Big) \cdot x^{(i)}.$

代码实现

现在让我们开始动手实现，首先让我们回顾一下numpy和matplotlib：


xxxxxxxxxx
import numpy as np
import matplotlib.pyplot as plt


def warm_up_exercise():
    A = None
    A = np.eye(5,5)
    return A


# 当你的实现正确时，下面会输出一个单位矩阵：
print(warm_up_exercise())

运行时长: 403毫秒

结束时间: 2022-09-21 09:37:05


[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]

你需要实现绘制数据集中图像的函数，当你的实现|正确时，你应该会得到如下的图像：


def plot_data(x, y):
    """绘制给定数据x与y的图像"""
    plt.figure()
    # ====================== 你的代码 ==========================
    # 绘制x与y的图像
    # 使用 matplotlib.pyplt 的命令 plot, xlabel, ylabel 等。
    # 提示：可以使用 'rx' 选项使数据点显示为红色的 "x"，
    #       使用 "markersize=8, markeredgewidth=2" 使标记更大

    # 给制数据

    # 设置y轴标题为 'Profit in $10,000s'
    
    # 设置x轴标题为 'Population of City in 10,000s'
    
    # =========================================================
    plt.xlabel("Population of City in 10,000s")
    plt.ylabel("Profit in $10,000s")
    # 数据载入，由","分割，且按列解包
    # numpy.loadtxt参见https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html
    x, y = np.loadtxt("data/data5984/PRML_LR_data.txt", delimiter=',', unpack=True)
    # 横纵坐标标签
    plt.xlabel("Population of City in 10,000s")
    plt.ylabel("Profit in $10,000s")
    # matplotlib.pyplot.plot 参见 https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html
    # 加入 "markersize=8, markeredgewidth=2" 使标记更大
    plt.plot(x, y, "rx", markersize = 8, markeredgewidth = 2)
    
# 让我们测试一下你的实现是否正确
# 从txt中加载数据
print('Plotting Data ...\n')
data = np.loadtxt('./data/data5984/PRML_LR_data.txt', delimiter=',')
x, y = data[:, 0], data[:, 1]

# 绘图
plot_data(x, y)
plt.show()

运行时长: 190毫秒

结束时间: 2022-09-21 09:49:45

现在运用所学的知识，对上述数据利用线性回归进行拟合。首先我们对要学习的参数和数据做一个准备:


# Add a column of ones to x
m = len(y)
X = np.ones((m, 2))
X[:, 1] = data[:, 0]

# initialize fitting parameters
theta = np.zeros((2, 1))

# Some gradient descent settings
iterations = 1500
alpha = 0.01

计算初始误差函数的值，你需要实现误差函数的计算：


def compute_cost(X, y, theta):
    """计算线性回归的代价。"""
    J = 0.0
    # ====================== 你的代码 ==========================
    # 计算给定 theta 参数下线性回归的代价
    # 请将正确的代价赋值给 J
    # predictions = np.dot(X, theta)
    # sqrErrors = np.multiply((predictions - y), (predictions - y)) 
    # J = np.sum(sqrErrors) / (2 * m)
    # J /= m
    J = 1/(2*m) * sum((y - (theta[1] * X[:,1] + theta[0])) **2)
    # =========================================================
    return J

# compute and display initial cost
# Expected value 32.07
J0 = compute_cost(X, y, theta)
print(J0)

现在你验证了代价计算的正确性，接下来就需要实现最核心的部分：梯度下降。在实现这一部分之前，确定你理解了上述各种变量及其表示。你需要完成梯度下降的核心代码部分：


def gradient_descent(X, y, theta, alpha, num_iters):
    """执行梯度下降算法来学习参数 theta。"""
    m = len(y)
    J_history = np.zeros((num_iters,))
    theta_history = np.zeros((2, iterations))

    for iter in range(num_iters):
        # ====================== 你的代码 ==========================
        # 计算给定 theta 参数下线性回归的梯度，实现梯度下降算法
        # =========================================================
        # 将各次迭代后的代价进行记录
        # print(X.shape)
        # print(theta.shape)
        # print(np.dot(X, theta).shape)
        # print(y.shape)
        # print(np.dot((np.dot(X, theta).T - y), X).T.shape)
        theta -= (alpha / m) * np.dot((np.dot(X, theta).T - y), X).T

        J_history[iter] = compute_cost(X, y, theta)
        theta_history[0, iter] = theta[0]
        theta_history[1, iter] = theta[1]

    return theta, J_history, theta_history

# run gradient descent
# Expected value: theta = [-3.630291, 1.166362]
theta, J_history, theta_history = gradient_descent(X, y, theta,
                                    alpha, iterations)
print(theta)
print(J_history.shape)
print(theta_history.shape)
plt.figure()
plt.plot(range(iterations),J_history)
plt.xlabel("iterations")
plt.ylabel(r'$J(\theta)$')
plt.show()

为了验证梯度下降方法实现的正确性，你需要把学习的到的直线绘制出来，确定你的实现是否正确。前面你已经绘制了数据集中的点，现在你需要在点的基础上绘制一条直线，如果你的实现正确，那么得到的图像应该是如下这样：拟合结果现在你已经正确实现了线性回归，你可能会对误差函数的优化过程比较好奇。为了更好地理解这个过程，你可以将损失函数的图像绘制出来。为此你需要将需要优化的参数的各个取值时误差函数的取值在图像上绘制出来，以下代码需要你进行填写。


def plot_visualize_cost(X, y, theta_best):
    """可视化代价函数"""

    # 生成参数网格
    theta0_vals = np.linspace(-10, 10, 101)
    theta1_vals = np.linspace(-1, 4, 101)
    t = np.zeros((2, 1))
    J_vals = np.zeros((101, 101))
    for i in range(101):
        for j in range(101):
            # =============== 你的代码 ===================
            # 加入代码，计算 J_vals 的值
            J_vals[i][j] = compute_cost(X,y,[theta0_vals[i],theta1_vals[j]])
            # ===========================================

    plt.figure()
    plt.contour(theta0_vals, theta1_vals, J_vals,
                levels=np.logspace(-2, 3, 21))
    plt.plot(theta_best[0], theta_best[1], 'rx',
             markersize=8, markeredgewidth=2)
    plt.xlabel(r'$\theta_0$')
    plt.ylabel(r'$\theta_1$')
    plt.title(r'$J(\theta)$')
   

plot_visualize_cost(X, y, theta)
plt.show()

$J(\theta)$ 的图像以及代价的变化过程，可以加深你的理解。在梯度下降的迭代中，我们设置终止条件为完成了固定的迭代次数，但是在迭代次数完成时，由于学习率等参数的设置，可能得到的参数并不是使得代价最低的值。你可以通过观察代价函数的变化过程，想办法调整学习率等参数或者改进程序，使得参数的取值为搜索到的最优结果。

请利用历史信息，改造plot_visual_cost函数，在等高线图上绘制线性回归模型的优化迭代过程。


def plot_visual_history(X, y, theta_histroy):
    """可视化线性回归模型迭代优化过程"""

    plt.figure()
    # =============== 你的代码 ===================
    # 生成参数网格
    theta0_vals = np.linspace(-10, 10, 101)
    theta1_vals = np.linspace(-1, 4, 101)
    t = np.zeros((2, 1))
    J_vals = np.zeros((101, 101))
    for i in range(101):
        for j in range(101):
            J_vals[i][j] = compute_cost(X,y,[theta0_vals[i],theta1_vals[j]])
    plt.figure()
    plt.contour(theta0_vals, theta1_vals, J_vals,
                levels=np.logspace(-2, 3, 21))

    for i in range(0, iterations, 100):
        plt.plot(theta_history[0, i], theta_history[1, i],  'rx',
             markersize=8, markeredgewidth=1)
    # ===========================================
    plt.xlabel(r'$\theta_0$')
    plt.ylabel(r'$\theta_1$')
    plt.title(r'$J(\theta)$')
   

plot_visual_history(X, y, theta_history)
plt.show()

进阶

在实现中，你可能采取了像上面公式中给出的结果一样逐个样本计算代价函数，或者在梯度下降的更新时也采用了逐个样本计算的方式。但事实上，你可以采用numpy的矩阵函数一次性计算所有样本的代价函数。可以采用矩阵乘法(np.matmul())求和等方式（np.sum（））。利用你学到的线性代数知识，将其实现更改一下吧。