Description
System Information (please complete the following information):
- OS & Version: Windows 10
- ML.NET Version: ML.NET v1.7.1 (also tested with 2.0.0-preview.22313.1)
- .NET Version: .NET 6.0
Describe the bug
When a KMeans clustering prediction engine is created repeatedly from a training data set that does not change, the predicted cluster ids
are not consistent from one run to the next, even when the seed is specified for the MLContext.
To Reproduce
For this data set, which is generated once and then reused unchanged for every run:
using Microsoft.ML;
using Microsoft.ML.Data;
public class ModelData
{
    public float Value1 { get; set; }
    public float Value2 { get; set; }
}
public class ClusterPrediction
{
    [ColumnName("PredictedLabel")]
    public uint PredictedClusterId;
    [ColumnName("Score")]
    public float[] Distances = null!;
    [ColumnName("Features")]
    public float[] Features = null!;
}
var data = Enumerable.Range(0, 60).Select(x => new ModelData { Value1 = Random.Shared.Next(0, 2000), Value2 = Random.Shared.Next(0, 7) }).ToList();
And this function to create a new instance of the prediction engine:
const string FeaturesColumnName = "Features";
const int ClusterCount = 4;
public PredictionEngine<ModelData, ClusterPrediction> CreateModel(IEnumerable<ModelData> data)
{
    var mlContext = new MLContext(seed: 0);
    var dataView = mlContext.Data.LoadFromEnumerable(data);
    IEstimator<ITransformer> pipeline = mlContext.Transforms
        .Concatenate(FeaturesColumnName, new[] { nameof(ModelData.Value1), nameof(ModelData.Value2) })
        .Append(mlContext.Clustering.Trainers.KMeans(FeaturesColumnName, numberOfClusters: ClusterCount));
    var model = pipeline.Fit(dataView);
    return mlContext.Model.CreatePredictionEngine<ModelData, ClusterPrediction>(model);
}
We should be able to create the same prediction engine many times and get the same results each time. The following loop creates the engine ten times and, for each engine, predicts the cluster id of every data point in the data set and prints the number of items that end up in each cluster:
using System.Linq;
for (var i = 0; i < 10; i++)
{
    var engine = CreateModel(data);
    var clusterCounts = data.Select(d => engine.Predict(d).PredictedClusterId).ToLookup(x => (int)x);
    Console.WriteLine(string.Join(" ", Enumerable.Range(1, ClusterCount).Select(x => $"Cluster {x}: {clusterCounts[x].Count()} items")));
}
This outputs:
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 13 items Cluster 2: 20 items Cluster 3: 12 items Cluster 4: 15 items
Cluster 1: 15 items Cluster 2: 15 items Cluster 3: 17 items Cluster 4: 13 items
Cluster 1: 23 items Cluster 2: 22 items Cluster 3: 8 items Cluster 4: 7 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 22 items Cluster 2: 23 items Cluster 3: 8 items Cluster 4: 7 items
Cluster 1: 20 items Cluster 2: 13 items Cluster 3: 15 items Cluster 4: 12 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
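Some of these lines contain the same four counts in a different order, while others contain different counts altogether. As a rough way to tell those two cases apart, the sketch below sorts the cluster sizes of a run into an order-independent signature. SizeSignature is a hypothetical helper written for this report, not an ML.NET API, and matching signatures only suggest (they do not prove) that two runs found the same partition with the cluster ids relabelled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of ML.NET): reduces the per-cluster item counts
// of one run to an order-independent signature by sorting them, largest first.
static string SizeSignature(IEnumerable<int> clusterSizes) =>
    string.Join(",", clusterSizes.OrderByDescending(n => n));

// Cluster sizes copied from the first, second and fourth lines of the output above.
var run1 = new[] { 20, 15, 12, 13 };
var run2 = new[] { 13, 20, 12, 15 };
var run4 = new[] { 23, 22, 8, 7 };

// Same sizes in a different order: consistent with relabelled cluster ids.
Console.WriteLine(SizeSignature(run1) == SizeSignature(run2)); // True
// Different sizes: the partition itself differs between the two runs.
Console.WriteLine(SizeSignature(run1) == SizeSignature(run4)); // False
```

By this measure the output above contains at least three distinct groupings (20/15/13/12, 17/15/15/13 and 23/22/8/7), so the inconsistency is not only a matter of cluster ids being permuted.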
Expected behavior
I would expect that each time the model is trained on the same data from an MLContext with a fixed seed, the predicted cluster counts would be identical, with the same data points assigned to each cluster.
Screenshots, Code, Sample Projects
I've attached a .NET Interactive notebook (zipped) for ease of reproduction.