When predicting continuous values, regression is the usual approach.

Regression comes in two common forms, linear regression and logistic regression; this post covers the simpler of the two, linear regression.

Linear regression assumes there is a function h(x) = θ_1*x_1 + θ_2*x_2 + ... + θ_n*x_n + b (where x_i is the i-th feature of a data point) around which all of the data points cluster, so that the function can be used to make predictions.
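To make the hypothesis concrete, here is a minimal sketch; every value below is made up purely for illustration:

import numpy as np

theta = np.array([0.5, -1.2, 2.0])  # one weight per feature
b = 0.7                             # bias term
x = np.array([1.0, 4.0, 2.0])       # one data point with three features

h = theta.dot(x) + b                # h(x) = sum_i theta_i * x_i + b
print(h)                            # ~0.4 for these made-up numbers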

Gradient descent, in the basic form used here, is an optimization method with a fixed learning rate (learning_rate): on each iteration it adjusts every parameter by a step proportional to the learning rate and the corresponding gradient, moving the parameters toward the values that best fit the data.

The cost function we use is J(θ) = 1/(2m) * Σ_{i=1..m} (h(x^(i)) - y^(i))^2, where m is the number of data points; it measures the discrepancy between the prediction function and the actual data. The goal of iteration is to minimize it.
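Written directly in NumPy, J is only a couple of lines. This is just a sketch with made-up predictions and targets, to pin down the formula:

import numpy as np

predictions = np.array([1.1, 1.9, 3.2])  # h(x^(i)) for each data point
targets = np.array([1.0, 2.0, 3.0])      # y^(i)
m = len(targets)

cost = np.sum((predictions - targets) ** 2) / (2 * m)
print(cost)  # ~0.01: squared errors 0.01 + 0.01 + 0.04, divided by 2m = 6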

The iteration step (the figure originally borrowed here showed the standard gradient-descent update rule) is, for every parameter θ_j simultaneously:

θ_j := θ_j - α * 1/m * Σ_{i=1..m} (h(x^(i)) - y^(i)) * x_j^(i)

where α is the learning rate, and the x_j^(i) factor is simply 1 for the bias b.
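In NumPy form, one step of that rule can be sketched as follows. This folds the bias into θ by appending a constant-1 column to X; the values are illustrative, and it is separate from the loop-based implementation below:

import numpy as np

X = np.array([[0., 1.], [1., 1.], [2., 1.]])  # one feature plus a constant 1 for the bias
y = np.array([0., 1., 2.])
theta = np.zeros(2)  # [weight, bias]
alpha = 0.1          # learning rate
m = len(y)

errors = X.dot(theta) - y       # h(x^(i)) - y^(i) for every data point
gradient = X.T.dot(errors) / m  # 1/m * sum(error * x_j) for each j
theta -= alpha * gradient       # simultaneous update of all parameters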

Here is a simple implementation of the whole process; it can stop iterating based on either a maximum iteration count or the change in cost between iterations:

import numpy as np
import random


class XYLinearRegression:
    def __init__(self, x_arr, y_arr, learning_rate=0.01, min_cost=0, max_iter=100):
        self.x_arr = x_arr
        self.y_arr = y_arr
        self.h_num = len(x_arr[0])  # number of weight parameters (one per feature)
        self.learning_rate = learning_rate
        self.min_cost = min_cost
        self.max_iter = max_iter
        self.min_max_pair = [(min(i), max(i)) for i in np.array(x_arr).transpose()]
        # start each weight at a random value inside its feature's range;
        # the bias term is appended as the last entry of theta, starting at 0
        self.theta = [random.uniform(pair[0], pair[1]) for pair in self.min_max_pair]
        self.theta.append(0)
        self.data_len = len(x_arr)

    def linear_regression(self):
        last_cost = float('inf')  # so the convergence check passes on the first pass
        n_iter = 0
        cost = self.calc_cost()
        # keep iterating while the iteration budget remains AND the cost
        # is still improving by more than min_cost per iteration
        while n_iter < self.max_iter and last_cost - cost > self.min_cost:
            last_cost = cost
            self.theta = self.get_theta()
            cost = self.calc_cost()
            print(self.theta)
            print('Cost: ' + str(cost))
            n_iter += 1

        print('Stopped after ' + str(n_iter) + ' iterations.')
        print(self.theta)

    def calc_cost(self, no_square=False, feature=-1):
        # no_square=False: returns the cost J(theta) = 1/(2m) * sum((h(x) - y)^2).
        # no_square=True:  returns the partial derivative of J with respect to
        #                  theta[feature], i.e. 1/m * sum((h(x) - y) * x[feature]);
        #                  feature=-1 means the bias term, whose x factor is 1.
        m = self.data_len
        result = 0
        for x, y in zip(self.x_arr, self.y_arr):
            h = 0
            for j in range(self.h_num):
                h += x[j] * self.theta[j]
            h += self.theta[self.h_num]  # bias term
            error = h - y
            if no_square is False:
                result += error * error
            elif feature != -1:
                result += error * x[feature]
            else:
                result += error
        if no_square is False:
            return result / float(m * 2)
        return result / float(m)

    def get_theta(self):
        # compute every partial derivative against the *current* theta first,
        # then apply all updates at once; updating in place would let later
        # gradients see already-updated parameters
        gradients = [self.calc_cost(no_square=True, feature=j) for j in range(self.h_num)]
        gradients.append(self.calc_cost(no_square=True, feature=-1))
        return [t - self.learning_rate * g for t, g in zip(self.theta, gradients)]


# x_arr = [[random.randrange(100) for i in range(3)] for j in range(50)]
# y_arr = [random.randrange(50) for i in range(50)]

x_arr = [
    [0], [1], [2], [3], [4], [5], [6], [7], [8], [9]
]
y_arr = [
    0, 1, 2, 3, 4.3, 4.7, 6, 7, 8, 9
]

xy = XYLinearRegression(x_arr, y_arr, max_iter=10000, min_cost=0.01, learning_rate=0.01)
xy.linear_regression()
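For this toy data set the best fit should be close to a slope of 1 and an intercept of 0. As an optional sanity check (not part of the code above), the result can be compared against NumPy's least-squares fit:

slope, intercept = np.polyfit([x[0] for x in x_arr], y_arr, 1)
print(slope, intercept)  # ~0.996 and ~0.016; gradient descent should land in the
                         # same neighborhood, and the smaller min_cost is, the closer it gets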