
[AutoML] Allow to use serialized IDataView as an input #3684

Open
@sergey-tihon

Description

ML.NET supports at least two IDataView serialization formats out of the box: text files and binary files.

So I can use either of them to prepare my data set for AutoML:

using (var stream = File.Create(textFileName))
    mlContext.Data.SaveAsText(data, stream);

using (var stream = File.Create(binFileName))
    mlContext.Data.SaveAsBinary(data, stream);

But when I try to use a serialized file as input for AutoML (both the CLI and the GUI version), it is unable to parse it.

Binary format

Using binary format

mlnet auto-train --task binary-classification --dataset "data-bin.idv" --label-column-name IsCS --cache on --max-exploration-time 60 --verbosity diag

I see the following error:

Inferring Columns ...
An Error occured during inferring columns
Unable to split the file provided into multiple, consistent columns.
Microsoft.ML.AutoML.InferenceException: Unable to split the file provided into multiple, consistent columns.
   at Microsoft.ML.AutoML.ColumnInferenceApi.InferSplit(MLContext context, TextFileSample sample, Nullable`1 separatorChar, Nullable`1 allowQuotedStrings, Nullable`1 supportSparse)
   at Microsoft.ML.AutoML.ColumnInferenceApi.InferColumns(MLContext context, String path, ColumnInformation columnInfo, Nullable`1 separatorChar, Nullable`1 allowQuotedStrings, Nullable`1 supportSparse, Boolean trimWhitespace, Boolean groupColumns)
   at Microsoft.ML.CLI.CodeGenerator.AutoMLEngine.InferColumns(MLContext context, ColumnInformation columnInformation)
   at Microsoft.ML.CLI.CodeGenerator.CodeGenerationHelper.GenerateCode()
   at Microsoft.ML.CLI.Program.<>c__DisplayClass1_0.<Main>b__0(NewCommandSettings options)
Please see the log file for more info.
Exiting ...

Text format

With --verbosity diag it gets stuck after the following output:

Inferring Columns ...
Creating Data loader ...
Loading data ...
Exploring multiple ML algorithms and settings to find you the best model for ML task: binary-classification
For further learning check: https://aka.ms/mlnet-cli
|     Trainer                              Accuracy      AUC    AUPRC  F1-score  Duration #Iteration             |
[Source=AutoML, Kind=Trace] Channel started

With the default verbosity

mlnet auto-train --task binary-classification --dataset "data-txt.tsv" --label-column-name IsCS --cache on --max-exploration-time 60

it returns a type mismatch error:

Exploring multiple ML algorithms and settings to find you the best model for ML task: binary-classification
For further learning check: https://aka.ms/mlnet-cli
──────────────────────────
Waiting for the first iteration to complete ...                                                                                                                                       00:00:00
Exception occured while exploring pipelines:
Provided label column 'IsCS' was of type Single, but only type Boolean is allowed.
Please see the log file for more info.

But the data file looks correct (it was serialized by ML.NET).
This is the header and the first lines of the dataset:

#@ TextLoader{
#@   header+
#@   sep=tab
#@   col=IsCS:BL:0
#@   col=Features:R4:1-19
#@ }
IsCS	19	0:""
0	2	0.259255171	0	0	0	1.41421354	0	1.41421354	0	1.41421354	0	1.41421354	0	3	6	0	0	1	1192
0	6	0.259255171	0	0	0	1.41421354	0	1.41421354	0	1.41421354	0	1.41421354	0	3	6	0	0	1	1192
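
As a cross-check outside the CLI, the same serialized file could be loaded back through the API and passed to an AutoML experiment directly. This is only a sketch and has not been verified against this dataset. It assumes the Microsoft.ML and Microsoft.ML.AutoML packages are referenced and that C# 9 top-level statements are available, and it reuses the data-bin.idv file and the IsCS label column from above.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.AutoML;

var mlContext = new MLContext();

// Load the binary IDataView file written earlier by SaveAsBinary.
// The file name is the one from the CLI command above.
IDataView data = mlContext.Data.LoadFromBinary("data-bin.idv");

// Print the label column type as it comes back from the file,
// to compare with the Boolean type that the binary-classification task expects.
Console.WriteLine($"IsCS type after reload: {data.Schema["IsCS"].Type}");

// Run a binary-classification experiment with the same 60-second budget
// that was passed to the CLI via --max-exploration-time.
var result = mlContext.Auto()
    .CreateBinaryClassificationExperiment(maxExperimentTimeInSeconds: 60)
    .Execute(data, labelColumnName: "IsCS");

Console.WriteLine($"Best trainer: {result.BestRun.TrainerName}");
Console.WriteLine($"Accuracy: {result.BestRun.ValidationMetrics.Accuracy}");
```

If the reloaded schema reports IsCS as Boolean here, that would suggest the type is lost in the CLI's column inference and not in the serialized file itself.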

Labels

AutoML.NET: Automating various steps of the machine learning process
P1: Priority of the issue for triage purpose: Needs to be fixed soon.
bug: Something isn't working
classification: Bugs related classification tasks
loadsave: Bugs related loading and saving data or models
