Phillip Trelford's Array

POKE 36879,255

k-means clustering

The machine learning theme continues to be popular at the F#unctional Londoners meetup group. Last night Matt Moloney gave a great hands-on session on k-means clustering. Matt has worked on large machine learning systems at eBay. More recently he has been working on the Tsunami IDE, an extensible REPL environment for the desktop and cloud.

Tsunami provides a lightweight environment focused on interactive development, which makes it well suited to machine learning. And with F# 3 Type Providers you get typed access to a diverse set of data, from CSV files all the way up to Hadoop. Interestingly, Tsunami can be embedded into Excel and used as a replacement for VBA.

Greg Young describes Tsunami as a REPL on steroids.

Machine Learning - Matt Moloney from ptrelford

k-means clustering has a number of interesting application areas, from search to pharmaceuticals. For the session Matt provided an F# script to analyse the canonical iris data set (flowers). The script also produces a variety of charts for visualizing the data including animated gifs showing the centroid positions at each iteration:

(Animated GIF: centroid positions at each iteration)

The FSharp.Data CSV Type Provider, available on NuGet, gives typed access to CSV files and was used to extract the values from the iris data file:

type Iris = CsvProvider<irisDataFile>
let iris = Iris.Load(irisDataFile)
let irisData = iris.Data |> Seq.toArray

/// classifications
let y = irisData |> Array.map (fun row -> row.Class)
/// feature vectors
let X = irisData |> Array.map (fun row -> 
  [|row.``Sepal Length`` 
    row.``Sepal Width`` 
    row.``Petal Length`` 
    row.``Petal Width``|])

Computing k-means centroids:

let K = 3 // The Iris dataset is known to only have 3 clusters

let seed = 
  [|X.[0]; X.[1]; X.[2]|]  // pick bad centroids on purpose

let centroidResults = 
  KMeans.computeCentroids seed X |> Seq.take iterationLimit

I was particularly impressed by the conciseness of Matt’s implementation of the algorithm:

(* K-Means Algorithm *)

/// Group all the vectors by the nearest center. 
let classify centroids vectors = 
  vectors |> Array.groupBy (fun v -> centroids |> Array.minBy (distance v))

/// Repeatedly classify the vectors, starting with the seed centroids
let computeCentroids seed vectors = 
  seed |> Seq.iterate (fun centers -> classify centers vectors 
                                      |> Array.map (snd >> average))
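
The snippet above leans on a few helpers that are not shown in the post: distance, average and Seq.iterate (which is not part of FSharp.Core). Below is a minimal sketch of how they might look, assuming Euclidean distance and a component-wise mean; the definitions in Matt's script may well differ. The two functions from the post are repeated so the sketch runs on its own, and Array.groupBy requires F# 4.0 or later.

```fsharp
// Hypothetical helpers for illustration only; not taken from the session's script.

/// Euclidean distance between two feature vectors of equal length.
let distance (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> acc + (x - y) * (x - y)) 0.0 a b |> sqrt

/// Component-wise mean of a non-empty array of vectors.
let average (vectors: float[][]) =
    let n = float vectors.Length
    Array.init vectors.[0].Length (fun i ->
        (vectors |> Array.sumBy (fun v -> v.[i])) / n)

module Seq =
    /// Infinite lazy sequence: x, f x, f (f x), ...
    let iterate f x = Seq.unfold (fun state -> Some (state, f state)) x

/// Group all the vectors by the nearest center (repeated from the post).
let classify centroids vectors =
    vectors |> Array.groupBy (fun v -> centroids |> Array.minBy (distance v))

/// Repeatedly classify the vectors, starting with the seed centroids (repeated from the post).
let computeCentroids seed vectors =
    seed |> Seq.iterate (fun centers ->
        classify centers vectors |> Array.map (snd >> average))

// Toy one-dimensional data with two obvious clusters.
let toyData = [| [|0.0|]; [|1.0|]; [|10.0|]; [|11.0|] |]
let toySeed = [| [|0.0|]; [|1.0|] |]

// The sequence is infinite, so always bound it before consuming it.
let toyCentroids = computeCentroids toySeed toyData |> Seq.item 5
```

Because Seq.iterate is lazy and never terminates by itself, the caller decides how many iterations to take, which is what Seq.take iterationLimit does earlier. One thing to watch for in this sketch: a centroid that attracts no vectors simply drops out of the grouping, so the number of centroids can shrink between iterations.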

Thanks again to Matt for giving a really interesting session.

Learning Machine Learning


If you’re interested in learning more, Matt’s also giving an in-depth session on machine learning at the Progressive F# Tutorials in London at the end of October:

ProgFsharp London 2013 

And if you’re in New York next week you can catch Rachel Reese giving an introduction to data science, followed by a machine learning introduction with Mathias Brandewinder and me, at the Progressive F# Tutorials NYC.