KMeans cluster analysis is non-deterministic when using KMeansYinyang initialization, even with fixed MLContext seed #6375

Open
@mikegoatly

Description

System Information:

  • OS & Version: Windows 10
  • ML.NET Version: ML.NET v1.7.1 (also tested with 2.0.0-preview.22313.1)
  • .NET Version: .NET 6.0

Describe the bug
When a KMeans clustering model is trained repeatedly on the same, unchanging training data set, the predicted cluster ids
differ from run to run, even when a seed is specified for the MLContext.

To Reproduce
For this data set, which is generated once and then reused unchanged for every run:

using Microsoft.ML;
using Microsoft.ML.Data;

public class ModelData
{
    public float Value1 { get; set; }
    public float Value2 { get; set; }
}

public class ClusterPrediction
{
    [ColumnName("PredictedLabel")]
    public uint PredictedClusterId;

    [ColumnName("Score")]
    public float[] Distances = null!;

    [ColumnName("Features")]
    public float[] Features = null!;
}

var data = Enumerable.Range(0, 60).Select(x => new ModelData { Value1 = Random.Shared.Next(0, 2000), Value2 = Random.Shared.Next(0, 7) }).ToList();

And this function, which trains a new model and returns a prediction engine for it:

const string FeaturesColumnName = "Features";
const int ClusterCount = 4;

public PredictionEngine<ModelData, ClusterPrediction> CreateModel(IEnumerable<ModelData> data)
{
    var mlContext = new MLContext(seed: 0);

    var dataView = mlContext.Data.LoadFromEnumerable(data);

    IEstimator<ITransformer> pipeline = mlContext.Transforms
        .Concatenate(FeaturesColumnName, new[] { nameof(ModelData.Value1), nameof(ModelData.Value2) })
        .Append(mlContext.Clustering.Trainers.KMeans(FeaturesColumnName, numberOfClusters: ClusterCount));

    var model = pipeline.Fit(dataView);

    return mlContext.Model.CreatePredictionEngine<ModelData, ClusterPrediction>(model);
}

It should be possible to build the same prediction engine, producing the same results, any number of times. The following loop creates the engine ten times, predicts the cluster id for every data point, and prints the number of items assigned to each cluster:

using System.Linq;

for (var i = 0; i < 10; i++)
{
    var engine = CreateModel(data);

    var clusterCounts = data.Select(d => engine.Predict(d).PredictedClusterId).ToLookup(x => (int)x);

    Console.WriteLine(string.Join(" ", Enumerable.Range(1, ClusterCount).Select(x => $"Cluster {x}: {clusterCounts[x].Count()} items")));
}

This outputs, for example:

Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 13 items Cluster 2: 20 items Cluster 3: 12 items Cluster 4: 15 items
Cluster 1: 15 items Cluster 2: 15 items Cluster 3: 17 items Cluster 4: 13 items
Cluster 1: 23 items Cluster 2: 22 items Cluster 3: 8 items Cluster 4: 7 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 22 items Cluster 2: 23 items Cluster 3: 8 items Cluster 4: 7 items
Cluster 1: 20 items Cluster 2: 13 items Cluster 3: 15 items Cluster 4: 12 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
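
Cluster sizes alone cannot show whether two runs found the same grouping under different cluster ids or genuinely different groupings. Rows such as 20/15/12/13 and 13/20/12/15 could in principle be the same partition with the ids permuted, whereas 23/22/8/7 cannot be. The following is a minimal sketch of a stricter check that compares the per-point assignments of two runs while ignoring how the ids are named. `ClusterComparison` and `SamePartition` are hypothetical helper names introduced here for illustration; they are not part of ML.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of ML.NET).
public static class ClusterComparison
{
    // Returns true when the two label sequences describe the same partition
    // of the data points, i.e. they differ at most by a renaming of cluster ids.
    public static bool SamePartition(IReadOnlyList<uint> a, IReadOnlyList<uint> b)
    {
        if (a.Count != b.Count)
        {
            return false;
        }

        var forward = new Dictionary<uint, uint>();  // id in a -> id in b
        var backward = new Dictionary<uint, uint>(); // id in b -> id in a

        for (var i = 0; i < a.Count; i++)
        {
            // Every id in a must always map to the same id in b ...
            if (forward.TryGetValue(a[i], out var mappedB))
            {
                if (mappedB != b[i]) return false;
            }
            else
            {
                forward[a[i]] = b[i];
            }

            // ... and every id in b must always map back to the same id in a.
            if (backward.TryGetValue(b[i], out var mappedA))
            {
                if (mappedA != a[i]) return false;
            }
            else
            {
                backward[b[i]] = a[i];
            }
        }

        return true;
    }
}
```

Inside the loop above, the labels of each run could be collected with `data.Select(d => engine.Predict(d).PredictedClusterId).ToList()` and compared against the labels of the first run. A `false` result means that at least one data point was grouped differently, which rules out a pure relabelling.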

Expected behavior
I would expect that each time the model is trained from an MLContext with a fixed seed, the predicted cluster counts would be identical, with the same data points assigned to each cluster.

Screenshots, Code, Sample Projects
I've attached a .NET Interactive notebook (zipped) for ease of reproduction.

Labels: bug, help wanted
