Description
When doing object detection, especially with pre-trained models, a number of pre-processing and post-processing steps are required to get the input into the format the model expects and to make sense of its output. These steps prepare the data and interpret the results, but they are not directly related to the training / prediction task itself. With state-of-the-art models this process is mostly boilerplate, and writing the code is currently left up to the user to implement every time. A good example of this can be seen when making predictions using the pre-trained Tiny YOLOv2 model.
In that tutorial, a significant portion of the code is boilerplate for a parser that decodes the model output (a 1-D tensor) into bounding box dimensions, a confidence score, and class probabilities. Internalizing these steps in a high-level API, especially for pre-trained state-of-the-art models where the inputs and outputs are well defined, would make these types of models much easier for users to consume.
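For concreteness, here is a minimal sketch of that post-processing step, written in Python/NumPy rather than the tutorial's C#. It assumes the common Tiny YOLOv2 (VOC, 20-class) conventions: a flat 13×13×125 output, five anchor boxes per grid cell, anchor-major channel layout; the anchor sizes and threshold are illustrative, not taken from the tutorial.

```python
import numpy as np

# Illustrative anchor sizes for the Tiny YOLOv2 (VOC, 20-class) model.
ANCHORS = [(1.08, 1.19), (3.42, 4.41), (6.63, 11.38), (9.42, 5.11), (16.62, 10.52)]
GRID_SIZE, CELL_SIZE, NUM_CLASSES = 13, 32, 20


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def parse_yolo_output(flat_output, threshold=0.3):
    """Decode the flat 13*13*125 tensor into (box, score, class_id) detections."""
    # Assumed layout: (anchor, 4 box coords + objectness + 20 class scores, row, col).
    data = np.asarray(flat_output, dtype=np.float32).reshape(
        len(ANCHORS), 5 + NUM_CLASSES, GRID_SIZE, GRID_SIZE)
    detections = []
    for a, (anchor_w, anchor_h) in enumerate(ANCHORS):
        for row in range(GRID_SIZE):
            for col in range(GRID_SIZE):
                tx, ty, tw, th, to = data[a, :5, row, col]
                x = (col + sigmoid(tx)) * CELL_SIZE    # box centre (pixels)
                y = (row + sigmoid(ty)) * CELL_SIZE
                w = np.exp(tw) * anchor_w * CELL_SIZE  # box size (pixels)
                h = np.exp(th) * anchor_h * CELL_SIZE
                objectness = sigmoid(to)
                class_scores = data[a, 5:, row, col]
                class_probs = np.exp(class_scores - class_scores.max())
                class_probs /= class_probs.sum()       # softmax over the 20 classes
                best = int(np.argmax(class_probs))
                score = float(objectness * class_probs[best])
                if score >= threshold:
                    detections.append(((x - w / 2, y - h / 2, w, h), score, best))
    return detections
```

Every consumer of the model has to re-implement roughly this logic today; the proposal below is to move it behind the API.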
Problem:
- Pre-processing/post-processing boilerplate code is required to prepare the model's inputs and interpret its outputs when performing object detection. Although the processing is model-specific, the expected inputs and outputs are already well defined for state-of-the-art pre-trained models, so users should not have to re-write the same code every time for what is effectively a solved problem.
- No consistent way to interpret model outputs
Proposal:
- Internalize the pre-processing/post-processing boilerplate code as part of the high-level API
- Provide an option for the user to select the model architecture (e.g. SSD, YOLO, Fast R-CNN) they want to train / score with. Based on that selection, the appropriate pre-processing/post-processing transformations would be applied internally to produce a simple, consistent output for the user.
- Provide a consistent way to interpret outputs
- Currently, tasks like binary classification have well-known output column names (e.g. PredictedLabel, Probability, Score) that the user can access to get the result of training/scoring. Something similar should exist for object detection models. Typical outputs include the dimensions of each detected bounding box, the labels or class probabilities of the objects detected in those boxes, and the confidence that a box actually contains an object. These could be exposed as part of the output schema for the user to access once the model's raw scores have been post-processed; a sketch of such a schema follows this list.
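As a rough, language-agnostic illustration of what that consistent output could look like, here is a hypothetical sketch (the names BoundingBox, Detection, and detect are invented for this example and are not an existing ML.NET API):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BoundingBox:
    x: float       # top-left corner of the detected object (pixels)
    y: float
    width: float
    height: float


@dataclass
class Detection:
    box: BoundingBox   # where the object is
    label: str         # predicted class of the object inside the box
    score: float       # confidence that the box contains an object of that class


def detect(model, image, architecture: str = "yolo") -> List[Detection]:
    """Hypothetical high-level entry point: 'architecture' selects the model-specific
    pre-/post-processing internally, and the caller only ever sees Detection records."""
    raise NotImplementedError("illustrative sketch only")
```

The important point is the shape of the result, not the names: regardless of whether the underlying model is SSD, YOLO, or Fast R-CNN, the user would always get back boxes, labels, and scores under the same schema, analogous to PredictedLabel / Probability / Score for binary classification.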
Resources:
TensorFlow provides an Object Detection API that lets the user extract class probabilities, bounding box coordinates, and objectness scores from the model outputs (see the sketch after the links below).
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md
https://github.com/tensorflow/models/tree/master/research/object_detection
https://github.com/tensorflow/models/blob/master/research/object_detection/object_detection_tutorial.ipynb
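Extracting those values from a model exported with the TF Object Detection API looks roughly like this. This is a sketch assuming a TF2 SavedModel from the model zoo linked above; the model path and dummy image are placeholders.

```python
import numpy as np
import tensorflow as tf

# Placeholder path: a detection model downloaded from the model zoo linked above.
detect_fn = tf.saved_model.load("path/to/saved_model")

# Placeholder input: exported detection models take a batched uint8 image tensor.
image = np.zeros((480, 640, 3), dtype=np.uint8)
input_tensor = tf.convert_to_tensor(image)[tf.newaxis, ...]

outputs = detect_fn(input_tensor)

num = int(outputs["num_detections"][0])
boxes = outputs["detection_boxes"][0][:num].numpy()    # [ymin, xmin, ymax, xmax], normalized
scores = outputs["detection_scores"][0][:num].numpy()  # confidence per detection
classes = outputs["detection_classes"][0][:num].numpy().astype(int)  # class ids
```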
Windows ML uses LearningModelEvaluationResult, from which the user can extract the respective outputs of the model. In the case of ML.NET, something similar could expose the output schema of an object detection prediction.
https://docs.microsoft.com/en-us/windows/ai/windows-ml/evaluate-model-inputs
https://docs.microsoft.com/en-us/uwp/api/windows.ai.machinelearning.learningmodelevaluationresult