folkertdev / elm-kmeans / KMeans

K-means is a method for partitioning data points into k clusters.

The standard method only guarantees at most k clusters: sometimes there are fewer. In many cases an exact k-clustering is desired, and this use case is also supported.

Cluster

cluster : Basics.Int -> List ( Basics.Float, Basics.Float ) -> List (List ( Basics.Float, Basics.Float ))

Partition a list of 2d points into at most k clusters.
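
A minimal sketch of calling it; the point values are made up, and the grouping shown in the comment is the likely, but not guaranteed, outcome:

import KMeans

points : List ( Float, Float )
points =
    [ ( 0, 0 ), ( 0.5, 0.2 ), ( 10, 10 ), ( 10.5, 9.8 ) ]

-- likely: [ [ ( 0, 0 ), ( 0.5, 0.2 ) ], [ ( 10, 10 ), ( 10.5, 9.8 ) ] ]
clustered : List (List ( Float, Float ))
clustered =
    KMeans.cluster 2 points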

clusterBy : (a -> List Basics.Float) -> Basics.Int -> List a -> { centroids : List (List Basics.Float), clusters : List (List a) }

Partition a list of points into at most k clusters.

This function offers more flexibility in the type of coordinate you have: just turn it into a list of float values, e.g.

tuple2d : ( Float, Float ) -> List Float
tuple2d ( x, y ) =
    [ x, y ]

myTuples : List ( Float, Float )

clusterBy tuple2d 4 myTuples

Or, using ianmackenzie/elm-geometry and ianmackenzie/elm-units:

point2d : Point2d Pixels Float -> List Float
point2d point =
    let
        ( a, b ) =
            Point2d.toTuple Pixels.inPixels point
    in
    [ a, b ]

myPoint2ds : List (Point2d Pixels Float)

clusterBy point2d 4 myPoint2ds

This function also works with 1, 3, or n dimensions. Additionally, you get back not only the clustered values, but also the centroid (the mean of all points in the cluster) of each cluster.
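
For example, a 3-dimensional record type could be converted the same way; the Vec3 alias and myVectors value below are made up for illustration:

type alias Vec3 =
    { x : Float, y : Float, z : Float }

vec3 : Vec3 -> List Float
vec3 { x, y, z } =
    [ x, y, z ]

myVectors : List Vec3

clusterBy vec3 4 myVectors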

clusterExactlyBy : (a -> List Basics.Float) -> Basics.Int -> List a -> { centroids : List (List Basics.Float), clusters : List (List a) }

Try to find a clustering with exactly k clusters.

The K-means algorithm initially groups the data randomly into clusters. In some cases, this can cause a cluster to be "bumped out" and become empty. Therefore, normally you get at most, but not always exactly, k clusters.

This function will retry clustering when fewer than k clusters are found, by moving data points from the front to the back of the input. It will retry at most n times (where n is the number of input points).

For big sorted inputs, an initial full random shuffle can help decrease computation time. See the advice on shuffling below.
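
A sketch of how a call could look, reusing the hypothetical tuple2d and myTuples from the clusterBy example above; even with retries, exactly k clusters are not strictly guaranteed:

import KMeans

exact : { centroids : List (List Float), clusters : List (List ( Float, Float )) }
exact =
    KMeans.clusterExactlyBy tuple2d 4 myTuples

-- normally 4, but can be lower if no rotation of the input yields 4 non-empty clusters
numberOfClusters : Int
numberOfClusters =
    List.length exact.clusters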

Helpers

associate : { centroids : List (List Basics.Float), clusters : List (List a) } -> List { centroid : List Basics.Float, points : List a }

Associate a centroid with its points.
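
For example, pairing each centroid with its cluster after clusterBy, again using the hypothetical tuple2d and myTuples from above:

import KMeans

labeled : List { centroid : List Float, points : List ( Float, Float ) }
labeled =
    KMeans.clusterBy tuple2d 4 myTuples
        |> KMeans.associate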

Shuffling

K-means is sensitive to the initial guess of the centroids. If two centroid points are too close, one of them often becomes empty during the clustering process, and we don't have k clusters any more. This is especially likely when the input data is sorted.

The clusterExactlyBy function tries to work around this issue by attempting up to n rotations of the list (it moves items from the front to the back, then clusters again to see if k clusters emerge). Another method that helps is shuffling the input list: on average, the initial clusters will then be distributed more evenly.

In Elm, shuffling a list is easiest with the elm-community/random-extra package, which exposes a Random.List.shuffle function.

import KMeans
import Random
import Random.List

shuffleAndClusterBy :
    (a -> List Float)
    -> Int
    -> List a
    -> Random.Generator { centroids : List (List Float), clusters : List (List a) }
shuffleAndClusterBy toVector k items =
    Random.List.shuffle items
        |> Random.map (KMeans.clusterBy toVector k)

The guide explains how to work with randomness and Random.Generator.
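
As a rough sketch, one way to run such a generator outside of a command is Random.step with an explicit seed; the seed value and the reuse of tuple2d and myTuples are illustrative:

clusterWithSeed : Random.Seed -> { centroids : List (List Float), clusters : List (List ( Float, Float )) }
clusterWithSeed seed =
    -- Random.step returns ( value, nextSeed ); we only keep the clustering here
    Random.step (shuffleAndClusterBy tuple2d 4 myTuples) seed
        |> Tuple.first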