Interpreting a pipeline's resulting Schema and/or .Preview() #4023

Open
@nganju98

I'm going over this article to learn how to inspect data after the preprocessing pipeline:
https://docs.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/inspect-intermediate-data-ml-net

Coming from Python, where you have a dataframe and can just call dataframe.show(), this is quite an ordeal. CreateEnumerable() is pretty impractical because your original POCO no longer fits the schema once you've done one-hot encoding and similar transforms. One-hot encoding creates multiple columns with the same name. How can that ever map back to a POCO?

And there's a mandatory Features column at the end of the row, that repeats all the values in the row. And if you then NormalizeMinMax the Features column, you get ANOTHER Features column with the same name. Then you're trying to look at that in a Preview() and it's very confusing to see what's going on.

Then there's DataViewRowCursor, where you need to specify reflection-style getters for each column. So each time you tweak the pipeline, you have to rewrite the code that lets you see its results? That defeats the ability to quickly tweak and look, tweak and look, the way Python makes so simple with dataframe.show().

So assuming this is how we have to inspect our data, I've got two questions:

  1. When a transform (like OneHotEncoding) creates multiple columns with the same name, what's going on? Is the training algorithm going to look at all of them? Is IsHidden how it decides which one to use? If so, can there be some official documentation on how all of this works?

  2. Why is it our responsibility to create a Features column at all? Can't the algorithms just run on the IDataView we created, like in Python? It seems like a complicated and unnecessary step. Also, is it good OO design to have a Features field at the end of a row that repeats all the values in that row? If you need this Features vector for performance reasons, why not create it when Fit() is called, keep it out of our data table, and hide this implementation detail from the user?
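
For reference on question 1, here is a minimal sketch of a schema-driven dump that does not depend on a POCO. It assumes the `Microsoft.ML` NuGet package; `PipelineInspector` and `Dump` are hypothetical names, not part of ML.NET. It lists every column in the `IDataView` schema together with its `IsHidden` flag, then prints the first few rows from `Preview()`. When a transform writes a column whose name already exists, the older column stays in the schema but is marked hidden, and lookups by name resolve to the newest one.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical helper for illustration only; not part of ML.NET.
public static class PipelineInspector
{
    // Prints each schema column (including hidden ones), then the first rows.
    public static void Dump(IDataView view, int maxRows = 5)
    {
        // The schema is enumerable, so no per-column getter code is needed here.
        foreach (DataViewSchema.Column column in view.Schema)
        {
            Console.WriteLine(
                $"[{column.Index}] {column.Name} : {column.Type} (IsHidden={column.IsHidden})");
        }

        // Preview() materializes up to maxRows rows for debugging.
        DataDebuggerPreview preview = view.Preview(maxRows);
        foreach (var row in preview.RowView)
        {
            // Each cell is a (column name, boxed value) pair.
            foreach (var cell in row.Values)
            {
                Console.Write($"{cell.Key}={cell.Value}  ");
            }
            Console.WriteLine();
        }
    }
}
```

Calling `PipelineInspector.Dump(transformedData)` after each pipeline tweak should keep working without edits, because it reads column names and types from the schema at run time. Vector-valued cells such as Features print whatever their `ToString()` returns, so they may need extra formatting to be readable.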

Labels: API, P3, documentation, enhancement