The machine learning theme continues to be popular at the F#unctional Londoners meetup group. Last night Matt Moloney gave a great hands-on session on k-means clustering. Matt has worked on large machine learning systems at eBay. More recently he has been working on the Tsunami IDE, an extensible REPL environment for the desktop and cloud.
Tsunami provides a lightweight environment focused on interactive development, which makes it well suited to machine learning. And with F# 3 Type Providers you get typed access to a diverse set of data, from CSV files all the way up to Hadoop. Interestingly, Tsunami can be embedded into Excel and used as a replacement for VBA.
Greg Young describes Tsunami as a REPL on steroids.
k-means clustering has a number of interesting application areas, from search to pharmaceuticals. For the session Matt provided an F# script to analyse the canonical iris data set (flowers). The script also produces a variety of charts for visualizing the data, including animated GIFs showing the centroid positions at each iteration:
The FSharp.Data CSV Type Provider, available on NuGet, gives typed access to CSV files and was used to extract the values from the iris data file:
type Iris = CsvProvider<irisDataFile>
let iris = Iris.Load(irisDataFile)
let irisData = iris.Data |> Seq.toArray
/// classifications
let y = irisData |> Array.map (fun row -> row.Class)
/// feature vectors
let X = irisData |> Array.map (fun row ->
    [| row.``Sepal Length``
       row.``Sepal Width``
       row.``Petal Length``
       row.``Petal Width`` |])
Computing k-means centroids:
let K = 3 // The Iris dataset is known to only have 3 clusters
let seed =
    [|X.[0]; X.[1]; X.[2]|] // pick bad centroids on purpose
let centroidResults =
    KMeans.computeCentroids seed X |> Seq.take iterationLimit
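The snippet above caps the run at a fixed number of iterations. As an alternative, here is a minimal sketch that stops as soon as two consecutive results are equal. The helper name `untilStable` is hypothetical and not part of Matt's script; it works on any lazy sequence of states, and a stand-in sequence of integers is used here in place of real centroids.

```fsharp
/// Hypothetical helper (not from the session script): yields each new state
/// until a state equals its predecessor, then stops.
/// The very first state is not yielded, only the states that follow it.
let untilStable (states: seq<'T>) =
    states
    |> Seq.pairwise                                      // (s0, s1), (s1, s2), ...
    |> Seq.takeWhile (fun (prev, next) -> prev <> next)  // stop once nothing changes
    |> Seq.map snd

// Stand-in for a centroid sequence: 0, 1, 2, 3, 3, 3, ...
let demoStates = Seq.initInfinite (fun i -> min i 3)
let demoResult = demoStates |> untilStable |> Seq.toList
```

Because F# compares arrays structurally, the same helper should also work on a sequence of centroid arrays, on the assumption that the iteration eventually reaches an exact fixed point.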
I was particularly impressed by the conciseness of Matt’s implementation of the algorithm:
(* K-Means Algorithm *)
/// Group all the vectors by the nearest center.
let classify centroids vectors =
    vectors |> Array.groupBy (fun v -> centroids |> Array.minBy (distance v))
/// Repeatedly classify the vectors, starting with the seed centroids
let computeCentroids seed vectors =
    seed |> Seq.iterate (fun centers -> classify centers vectors
                                        |> Array.map (snd >> average))
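The excerpt relies on three helpers that are not shown in this post: `distance`, `average` and `Seq.iterate` (which is not part of FSharp.Core). The following self-contained sketch fills them in under my own assumptions, namely Euclidean distance, a component-wise mean, and an `iterate` built on `Seq.unfold`. Matt's actual definitions may well differ.

```fsharp
/// Assumed helper: Euclidean distance between two equal-length vectors.
let distance (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> acc + (x - y) * (x - y)) 0.0 a b |> sqrt

/// Assumed helper: component-wise mean of a non-empty group of vectors.
let average (vectors: float[][]) =
    let n = float vectors.Length
    Array.init vectors.[0].Length (fun i ->
        (vectors |> Array.sumBy (fun v -> v.[i])) / n)

/// Assumed helper: the infinite lazy sequence x, f x, f (f x), ...
let iterate f x = Seq.unfold (fun state -> Some (state, f state)) x

/// Group all the vectors by the nearest center.
let classify centroids vectors =
    vectors |> Array.groupBy (fun v -> centroids |> Array.minBy (distance v))

/// Repeatedly classify the vectors, starting with the seed centroids.
let computeCentroids seed vectors =
    seed |> iterate (fun centers ->
        classify centers vectors |> Array.map (snd >> average))

// Toy data (made up for illustration): two obvious clusters in 2-D.
let points = [| [|1.0; 1.0|]; [|1.0; 2.0|]; [|9.0; 9.0|]; [|10.0; 9.0|] |]
// Deliberately poor seed: both centroids start in the same cluster.
let toySeed = [| points.[0]; points.[1] |]
// Centroids after three update steps.
let toyResult = computeCentroids toySeed points |> Seq.item 3
```

On this toy data the centroids settle at (1.0, 1.5) and (9.5, 9.0) after two updates. One caveat of this sketch: `Array.groupBy` only returns groups that contain at least one vector, so a centroid that attracts no points silently disappears from the next iteration.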
Thanks again to Matt for giving a really interesting session.
If you’re interested in learning more, Matt’s also giving an in-depth session on machine learning at the Progressive F# Tutorials in London at the end of October:
And if you’re in New York next week you can catch Rachel Reese giving an introduction to data science, followed by a machine learning introduction with Mathias Brandewinder and me, at the Progressive F# Tutorials NYC.