- K-means++ under approximation stability

###### Options

# K-means++ under approximation stability

Journal

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

ISSN

03029743

Date Issued

2013-01-01

Author(s)

Agarwal, Manu

Jaiswal, Ragesh

Pal, Arindam

AbstractShow more

The Lloyd's algorithm, also known as the k-means algorithm, is one of the most popular algorithms for solving the k-means clustering problem in practice. However, it does not give any performance guarantees. This means that there are datasets on which this algorithm can behave very badly. One reason for poor performance on certain datasets is bad initialization. The following simple sampling based seeding algorithm tends to fix this problem: pick the first center randomly from among the given points and then for i ≥ 2, pick a point to be the ith center with probability proportional to the squared distance of this point from the previously chosen centers. This algorithm is more popularly known as the k-means++ seeding algorithm and is known to exhibit some nice properties. These have been studied in a number of previous works [AV07, AJM09, ADK09, BR11]. The algorithm tends to perform well when the optimal clusters are separated in some sense. This is because the algorithm gives preference to further away points when picking centers. Ostrovsky et al. [ORSS06] discuss one such separation condition on the data. Jaiswal and Garg [JG12] show that if the dataset satisfies the separation condition of [ORSS06], then the sampling algorithm gives a constant approximation with probability Ω(1/k). Another separation condition that is strictly weaker than [ORSS06] is the approximation stability condition discussed by Balcan et al.[BBG09]. In this work, we show that the sampling algorithm gives a constant approximation with probability Ω(1/k) if the dataset satisfies the separation condition of [BBG09] and the optimal clusters are not too small. We give a negative result for datasets that have small optimal clusters. © Springer-Verlag Berlin Heidelberg 2013.

Volume

7876 LNCS

Unpaywall