KMeans cluster analysis is non-deterministic when using KMeansYinyang initialization, even with fixed MLContext seed

**System Information (please complete the following information):**
 - OS & Version: Windows 10
 - ML.NET Version: ML.NET v1.7.1 (also tested with 2.0.0-preview.22313.1)
 - .NET Version: .NET 6.0

**Describe the bug**
When a KMeans prediction engine is trained repeatedly on the same, unchanging training data set, the predicted cluster IDs differ between runs, even when a seed is specified for the MLContext.

**To Reproduce**
For this data set, which is generated once and then reused unchanged for every run:

``` csharp
using Microsoft.ML;
using Microsoft.ML.Data;

public class ModelData
{
    public float Value1 { get; set; }
    public float Value2 { get; set; }
}

public class ClusterPrediction
{
    [ColumnName("PredictedLabel")]
    public uint PredictedClusterId;

    [ColumnName("Score")]
    public float[] Distances = null!;

    [ColumnName("Features")]
    public float[] Features = null!;
}

var data = Enumerable.Range(0, 60).Select(x => new ModelData { Value1 = Random.Shared.Next(0, 2000), Value2 = Random.Shared.Next(0, 7) }).ToList();
```

And this function to create a new instance of the prediction engine:

``` csharp
const string FeaturesColumnName = "Features";
const int ClusterCount = 4;

public PredictionEngine<ModelData, ClusterPrediction> CreateModel(IEnumerable<ModelData> data)
{
    var mlContext = new MLContext(seed: 0);

    var dataView = mlContext.Data.LoadFromEnumerable(data);

    IEstimator<ITransformer> pipeline = mlContext.Transforms
        .Concatenate(FeaturesColumnName, new[] { nameof(ModelData.Value1), nameof(ModelData.Value2) })
        .Append(mlContext.Clustering.Trainers.KMeans(FeaturesColumnName, numberOfClusters: ClusterCount));

    var model = pipeline.Fit(dataView);

    return mlContext.Model.CreatePredictionEngine<ModelData, ClusterPrediction>(model);
}
```

Calling this function repeatedly on the same data should produce prediction engines that give identical results. The following loop creates the engine ten times, predicts the cluster ID for every data point, and prints the number of items that end up in each cluster:

``` csharp
using System.Linq;

for (var i = 0; i < 10; i++)
{
    var engine = CreateModel(data);

    var clusterCounts = data.Select(d => engine.Predict(d).PredictedClusterId).ToLookup(x => (int)x);

    Console.WriteLine(string.Join(" ", Enumerable.Range(1, ClusterCount).Select(x => $"Cluster {x}: {clusterCounts[x].Count()} items")));
}
```

This outputs:

```
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 13 items Cluster 2: 20 items Cluster 3: 12 items Cluster 4: 15 items
Cluster 1: 15 items Cluster 2: 15 items Cluster 3: 17 items Cluster 4: 13 items
Cluster 1: 23 items Cluster 2: 22 items Cluster 3: 8 items Cluster 4: 7 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 22 items Cluster 2: 23 items Cluster 3: 8 items Cluster 4: 7 items
Cluster 1: 20 items Cluster 2: 13 items Cluster 3: 15 items Cluster 4: 12 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
```

**Expected behavior**
I would expect that each time the model is trained from an MLContext with a fixed seed, the cluster counts would be identical, with the same data points assigned to the same clusters.
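
To help narrow down where the variation comes from, the trainer can also be configured explicitly through `KMeansTrainer.Options`, which exposes the initialization algorithm and the thread count. The following is a minimal sketch, not a confirmed fix: `CreateModelWithOptions` is a hypothetical variant of `CreateModel` above, it reuses the `ModelData` and `ClusterPrediction` classes from this report, and whether any particular combination of settings removes the non-determinism has not been verified here.

``` csharp
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Trainers;

// Hypothetical helper (not part of the original repro): same pipeline as CreateModel,
// but with the initialization algorithm and thread count passed in explicitly.
PredictionEngine<ModelData, ClusterPrediction> CreateModelWithOptions(
    IEnumerable<ModelData> data,
    KMeansTrainer.InitializationAlgorithm initialization,
    int? numberOfThreads)
{
    var mlContext = new MLContext(seed: 0);
    var dataView = mlContext.Data.LoadFromEnumerable(data);

    var options = new KMeansTrainer.Options
    {
        FeatureColumnName = "Features",
        NumberOfClusters = 4,
        InitializationAlgorithm = initialization,
        // Assumption to test: a single thread may rule out ordering effects from parallel training.
        NumberOfThreads = numberOfThreads,
    };

    var pipeline = mlContext.Transforms
        .Concatenate("Features", nameof(ModelData.Value1), nameof(ModelData.Value2))
        .Append(mlContext.Clustering.Trainers.KMeans(options));

    var model = pipeline.Fit(dataView);

    return mlContext.Model.CreatePredictionEngine<ModelData, ClusterPrediction>(model);
}
```

Running the same counting loop with, for example, `CreateModelWithOptions(data, KMeansTrainer.InitializationAlgorithm.KMeansYinyang, 1)` and then with `KMeansTrainer.InitializationAlgorithm.KMeansPlusPlus` would show whether the variation follows the initialization algorithm, the thread count, or neither.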

**Screenshots, Code, Sample Projects**
I've attached a [.NET Interactive notebook (zipped)](https://github.com/dotnet/machinelearning/files/9774987/kmeans.cluster.analysis.zip) for ease of reproduction.


