K Nearest Neighborhood

2016-05-22

机器学习

K Nearest Neighborhood

K Nearest Neighborhood

下面引用下维基百科对 KNN 的描述，具体点击这里

在模式识别领域中，最近邻居法（KNN 算法，又译 K-近邻算法）是一种用于分类和回归的非参数统计方法。在这两种情况下，输入包含特征空间中的 k 个最接近的训练样本。

在 k-NN 分类中，输出是一个分类族群。一个对象的分类是由其邻居的“多数表决”确定的，k 个最近邻居（k 为正整数，通常较小）中最常见的分类决定了赋予该对象的类别。若 k = 1，则该对象的类别直接由最近的一个节点赋予。

在 k-NN 回归中，输出是该对象的属性值。该值是其 k 个最近邻居的值的平均值。

算法

训练样本是多维特征空间向量，其中每个训练样本带有一个类别标签。算法的训练阶段只包含存储的特征向量和训练样本的标签。在分类阶段，k 是一个用户定义的常数。一个没有类别标签的向量（查询或测试点）将被归类为最接近该点的 k 个样本点中最频繁使用的一类。如何定义两个样本点在 KNN 中是最关键的，根据不同的数据分布特征可以选择不同的距离公式，最常用的是欧氏距离、汉明距离、余弦距离，也有使用 Pearson 和 Spearman 相关系数的。如果适当运用算法来计算相似度(距离)的话，k 近邻分类精度可显著提高。无论是分类还是回归，衡量邻居的权重都非常有用，使较近邻居的权重比较远邻居的权重大。例如，一种常见的加权方案是给每个邻居权重赋值为 \(\frac{1}{d}\)，其中\(d\)是到邻居的距离

特点

KNN 使用“多数表决”进行分类，这样会在类别分布偏斜时出现缺陷。也就是说，出现频率较多的样本将会主导测试点的预测结果，因为他们比较大可能出现在测试点的 K 邻域而测试点的属性又是通过 k 邻域内的样本计算出来的。

解决这个缺点的方法之一是在进行分类时将样本到 k 个近邻点的距离考虑进去。k 近邻点中每一个的分类（对于回归问题来说，是数值）都乘以与测试点之间距离的成反比的权重。另一种克服偏斜的方式是通过数据表示形式的抽象

决策过程

具体过程可以参考维基百科这幅图

测试样本（绿色圆形）应归入要么是第一类的蓝色方形或是第二类的红色三角形。如果 k=3（实线圆圈）它被分配给第二类，因为有 2 个三角形和只有 1 个正方形在内侧圆圈之内。如果 k=5（虚线圆圈）它被分配到第一类（3 个正方形与 2 个三角形在外侧圆圈之内）。

python 代码实现

def kNN(test, train, target, k=1, normalize=False):
    """
    This is the k nearest neighbour algorithm
    :param test: N*D numpy array, each row record an example,
    each column represent a feature field
    :param train: N*D numpy array, each row record an example,
    each column represent a feature field
    :param target: 1*D numpy array, each element is the target value
    :param k: int, the best k near neighbour
    :param normalize: bool, determine whether to normalize `test` and `train` or not
    :return: 1*D numpy array, the predict value
    """
    y_hat = np.zeros(test.shape[0])
    if normalize:
        test = test / norm(test, axis=1)[:, np.newaxis]
        train = train / norm(train, axis=1)[:, np.newaxis]
    for i in range(test.shape[0]):
        #tmp = train - test[i]
        dist = distance(train, test[i], method="cosine")
        idx = dist.argsort()[:k]
        # given the weight of each neighbour
        w = np.array([1]) / dist[idx]
        w /= w.sum()
        y_hat[i] = (target[idx] * w).sum()
    return y_hat

def distance(x, y, method="euclidean", axis=1):
    """
    compute the distance between `x` and `y` by `method`
    :param x: 1*D numpy array or array like
    :param y: 1*D numpy array or array like
    :param method: the method using to compute the distance between `x` and `y`
                    the method can only be "euclidean" or "cosine"
    :return: a float number represent the distance between `x` and `y`
    """
    x = np.asarray(x)
    y = np.asarray(y)
    if method == "cosine":
        tmp = (x * y).sum(axis=axis) / (np.linalg.norm(x, axis=axis) * np.linalg.norm(y))
        return np.array([1]) / (tmp + 1e-3)
    elif method == "euclidean":
        return np.linalg.norm(x - y, axis=axis)
    else:
        raise NotImplemented("Not implement method" + method)