Description
I've been going through this article on how to inspect data after the preprocessing pipeline:
https://docs.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/inspect-intermediate-data-ml-net
Coming from Python, where you have a dataframe and can just call dataframe.show(), this is quite an ordeal. CreateEnumerable() is pretty impractical because your original POCO won't fit the schema any more once you've done one-hot encoding etc. One-hot encoding creates multiple columns with the same name, so how can that ever map back to a POCO?
And there's a mandatory Features column at the end of the row that repeats all the values in the row. If you then apply NormalizeMinMax to the Features column, you get ANOTHER Features column with the same name. When you then look at that in a Preview(), it's very hard to see what's going on.
Then there's DataViewRowCursor, where you need to specify reflection-style getters for each column. So each time you tweak the pipeline, you have to rewrite the code that lets you see the results of your pipeline? It defeats the ability to quickly tweak and look, tweak and look, the way Python does it so simply with dataframe.show().
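To make that concrete, here is roughly what the cursor approach ends up looking like, following the pattern in the linked guide. This is only a sketch: the column names Price and Features are placeholders I made up for illustration, and the types (float, VBuffer<float>) are assumptions about what the pipeline produces.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

public static class CursorDump
{
    // Prints two columns of a transformed IDataView row by row.
    // "Price" and "Features" are hypothetical column names; every column you
    // want to see needs its own typed getter and its own buffer variable.
    public static void Dump(IDataView data)
    {
        DataViewSchema schema = data.Schema;

        using (DataViewRowCursor cursor = data.GetRowCursor(schema))
        {
            // One getter per column, with the exact type spelled out.
            ValueGetter<float> priceGetter =
                cursor.GetGetter<float>(schema["Price"]);
            ValueGetter<VBuffer<float>> featuresGetter =
                cursor.GetGetter<VBuffer<float>>(schema["Features"]);

            float price = default;
            VBuffer<float> features = default;

            while (cursor.MoveNext())
            {
                priceGetter(ref price);
                featuresGetter(ref features);
                Console.WriteLine(
                    $"{price} [{string.Join(", ", features.DenseValues())}]");
            }
        }
    }
}
```

Add, rename, or retype a column in the pipeline and the getters above have to change with it, which is the part that makes quick iteration painful.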
So assuming this is how we have to inspect our data, I've got two questions:
- When a Transform (like OneHotEncoding) creates multiple columns with the same name, what's going on? Is the training algorithm going to look at all of them? Is the IsHidden how it's deciding what to use? If so, can there be some official documentation on how all of this works?
- Why is it our responsibility to create a Features column at all? Can't the algorithms just run on the IDataView we created, like in Python? It seems like a complicated and unnecessary step. Also does it seem like good OO design to have a Features field at the end of a row that repeats all the values in said row? If you need to make this Features vector for performance reasons, why not create it once Fit() is called, keep it out of our data table, and hide this implementation detail from the user?