Design: Clustering in SQLflow #737
doc/cluster_design.md
Outdated
For analysts and real business people, in the daily analysis work, most of the work is not prediction, but analysis of the patterns in the data. This can help them mine user behavioral characteristics and differences, helping the business discover value and operate.

This design doc introduces how to support the `Cluster Model` in SQLFlow.
Add a section to introduce the `Cluster Model`?
As one of the most powerful methods in pattern recognition, clustering focuses on finding the similarities between items within a group and the differences between groups. Hence, `Cluster Model` can help analysts build a model that automatically splits data samples into groups according to their features.
@Echo9573 If it can help.
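To make the "similar within a group, different between groups" idea concrete, here is a minimal sketch using plain k-means. This is only an illustration: the design doc proposes an autoencoder-based cluster model, not k-means, and the function name `kmeans` and the sample points are made up for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points, k, iterations=20, seed=0):
    """Group 2-D points into k clusters with plain k-means (illustration only)."""
    rng = random.Random(seed)
    # Start from k distinct samples chosen at random as the initial centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centroids


# Two visibly separate groups of made-up samples.
samples = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.1), (7.9, 8.3), (8.2, 7.9)]
labels, centroids = kmeans(samples, k=2)
```

The first three samples end up sharing one label and the last three share the other, which is the kind of automatic split the comment above describes.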
doc/cluster_design.md
Outdated
``` sql
SELECT * FROM train_table
EXTRCT clusterModel
How do you want to expose `ClusterModel` to SQLFlow? In the current implementation this should be a TensorFlow premade Estimator or a custom Keras model. Can we implement `ClusterModel` as a custom Keras model?
doc/cluster_design.md
Outdated
-

## Note
- The **EXTRCT SQL** includes two models, the autoencode model and the cluster model.
Maybe we need a more general target for the keyword EXTRACT
This keyword is just for illustrating the design. Next, maybe we can think about a better keyword together.
doc/cluster_design.md
Outdated
-

## Note
- The **EXTRCT SQL** includes two models, the autoencode model and the cluster model.
`EXTRCT SQL` => `EXTRACT SQL`
doc/cluster_design.md
Outdated
## User interface

Users usually use a **TRAIN SQL** to train a model in Supervised learning. But, in this scenario, we focus on the extraction of data patterns in unsupervised learning. Therefore, we use **EXTRCT SQL** for pattern extraction, the simple pipeline like:
Do you mean EXTRACT?
yes~
Are we planning to support semi-supervised clustering? If so, we may allow users to optionally pass a subset of labels.
doc/cluster_design.md
Outdated
``` sql
SELECT * FROM train_table
EXTRCT clusterModel
"EXTRACT" a model does not read well: we can train a model, save a model, or use a model to do prediction, but not extract a model. Since the model name is "clusterModel", we can distinguish the model type just like we did in xgboost. Maybe we can use `TRAIN ClusterModel.myDeepClusteringModel`; then the code generator can generate specific code for the training pipeline for clustering models.
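A rough sketch of how a code generator could branch on the model name in the `TRAIN` clause, as suggested above. The function `select_pipeline` and the pipeline labels are hypothetical and are not SQLFlow's actual code generator API.

```python
def select_pipeline(estimator):
    """Pick a code-generation pipeline from the model name in the TRAIN clause.

    Hypothetical helper: the name prefix before the first dot is treated
    as the model family, mirroring how xgboost models are told apart.
    """
    family = estimator.split(".", 1)[0].lower()
    if family == "clustermodel":
        return "clustering"  # generate the clustering training pipeline
    if family == "xgboost":
        return "xgboost"     # generate the xgboost training pipeline
    return "default"         # fall back to the regular training pipeline


pipeline = select_pipeline("ClusterModel.myDeepClusteringModel")
```

With this shape the keyword stays `TRAIN`, and only the generated training code differs per model family.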
Hi @Echo9573, thanks for this informative design doc. It helped the team realize that a significant portion of an analyst's work is unsupervised clustering.
We would not consider semi-supervised learning at present, because it may not be widely used among analysts.
Is there a reference for that? It is pretty popular in applications where there are not sufficient labels or where the labels are polluted. This will help us decide whether we need to reuse
From the design doc, maybe the root reason for the
Maybe we can reuse the
SELECT * FROM train_table
TRAIN ClusteringModel
INTO
model = my_cluster_model
table = cluster.predict
For the KMeans algorithms:
SELECT * FROM train_table
TRAIN KMeans
INTO
table = kmeans.predict
doc/cluster_design.md
Outdated
- `output_table` is the cluster result for input_table, which is adding the `group_id` column predicted by the cluster model to the input_table.

## clusterModel Details
<img src="figures/cluster_model_train_overview.png">
Maybe paste the link here before this figure, so that readers can understand this figure well.
doc/cluster_design.md
Outdated
```python
class clusterModel(tf.keras.Model):

def pre_train(dataset):
please remove the blank lines and use 4 spaces as indentation.
OK~~thx~
doc/cluster_design.md
Outdated
2. run_pretrain = true & Using model.existed_pretrain_model = existed_pretrain_model:
existed_pretrain_model Pretrain+ Random initialization weights for cluster. (Note that model.encode_units `is not work` at this time.)
3. run_pretrain = false & Using model.existed_pretrain_model = None:
Random initialization weights for cluster. (Note that model.encode_units `is not work` at this time.)
You mean “does not work”? Same in other places. Also did you mean to use double quotes instead?
The `USING` statement has higher precedence than `pre_train=True` in the `WITH` statement. `model.encode_units` configures the pre-train part of the autoencoder network, and the pre_train setting has no effect when a `USING` statement exists, so `encode_units` has no effect either.
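The precedence described above could be sketched roughly as follows. The function `resolve_encoder_source` and its return labels are hypothetical; cases 2 and 3 follow the quoted list, while the first case (pretraining with no existing model) is not visible in the quoted hunk and is assumed here.

```python
def resolve_encoder_source(run_pretrain=True, existed_pretrain_model=None):
    """Decide where the encoder weights come from.

    Returns (source, encode_units_used). Hypothetical sketch, not the
    actual SQLFlow implementation.
    """
    if existed_pretrain_model is not None:
        # USING takes precedence over run_pretrain: reuse the saved
        # autoencoder, so model.encode_units has no effect.
        return "load_existing", False
    if run_pretrain:
        # Assumed case 1: pretrain a new autoencoder from encode_units.
        return "pretrain_autoencoder", True
    # No pretraining and no saved model: random cluster weights only.
    return "random_init", False


source, uses_encode_units = resolve_encoder_source(
    run_pretrain=True, existed_pretrain_model="existed_pretrain_model"
)
```

Writing the rule as a single function makes it easy to see that `encode_units` only matters in the one branch where a new autoencoder is pretrained.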
I understand that. I was just asking you to revisit this part of the grammar. It should be “does” instead of “is”. And then switch to use double quotes.
OK~~thx!
doc/cluster_design.md
Outdated
model.n_clusters = 5
model.run_pretrain = false
COLUMN m1, m2, m3, m4, m5, m6, m7, m8, m9, m10
USING model.existed_pretrain_model = existed_pretrain_model
Do we need to refine the `USING` syntax, or just follow the existing syntax `USING existed_pretrain_model`?
In my opinion, it can be divided into two scenarios:
- User has no pre-trained model (auto-encoder).
  In this scenario the user wants the full training process, which consists of `auto-encoder` and `clustering`. Thus the user should be forbidden from using the `USING` clause, because there is no pre-trained model ready to use in the training process. The user should clearly define the related parameters of the `auto-encoder` model in the `WITH` clause, like `model.encode_units`.
- User has a pre-trained model (auto-encoder).
  In this case, the user already has at least one pre-trained `auto-encoder` model and wants to use it without training this part again. The `USING` clause should be used to define the path/name of the pre-trained `auto-encoder` model, and in the `WITH` clause the user should guarantee the correct structure of the pre-trained model so that the model data can be loaded correctly. This requires some additional checks to be performed in the background.
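The "additional checks" for the two scenarios might look something like the sketch below. Everything here is an assumption made for illustration: the helper `check_clauses`, the idea that the saved model records its encoder layer widths (`saved_encode_units`), and the example layer sizes.

```python
def check_clauses(with_attrs, using_model=None, saved_encode_units=None):
    """Return a list of problems found in the WITH/USING combination.

    Hypothetical sketch: with_attrs maps WITH attribute names to values,
    using_model is the name given in USING (or None), and
    saved_encode_units is the layer layout stored with that model.
    """
    problems = []
    declared = with_attrs.get("model.encode_units")
    if using_model is None:
        # Scenario 1: full training, so the encoder layout must be declared.
        if declared is None:
            problems.append("model.encode_units is required without USING")
    elif declared is None or list(declared) != list(saved_encode_units or []):
        # Scenario 2: the declared layout must match the saved model.
        problems.append("model.encode_units does not match " + using_model)
    return problems


# Made-up layer widths, used only to exercise the check.
ok = check_clauses({"model.encode_units": [100, 50, 7]})
mismatch = check_clauses(
    {"model.encode_units": [100, 7]}, "existed_pretrain_model", [100, 50, 7]
)
```

Returning a list of messages rather than raising on the first problem would let the parser report every clause issue at once.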
doc/cluster_design.md
Outdated
- template_tf.go
```python
if 'pre_train' is in classifier:
It's not Python syntax; maybe use:
    if hasattr(classifier, 'pre_train'):
        classifier.pre_train(...)
    if hasattr(classifier, 'cluster_train_loop'):
        classifier.cluster_train_loop(...)
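A self-contained version of that suggestion, for reference. The two stand-in classes and the helper `run_optional_hooks` are made up for this sketch, and passing `dataset` to `cluster_train_loop` is an assumption (the design doc only shows `pre_train(dataset)`).

```python
class PlainClassifier:
    """Stand-in for a model that defines no clustering hooks."""


class ClusterClassifier:
    """Stand-in for a clustering model that defines both optional hooks."""

    def __init__(self):
        self.calls = []

    def pre_train(self, dataset):
        self.calls.append("pre_train")

    def cluster_train_loop(self, dataset):
        self.calls.append("cluster_train_loop")


def run_optional_hooks(classifier, dataset):
    """Call the clustering hooks only when the model defines them."""
    called = []
    for hook in ("pre_train", "cluster_train_loop"):
        method = getattr(classifier, hook, None)
        if callable(method):
            method(dataset)
            called.append(hook)
    return called


plain_result = run_optional_hooks(PlainClassifier(), dataset=[])
cluster_result = run_optional_hooks(ClusterClassifier(), dataset=[])
```

Models without the hooks pass through untouched, so the same template could serve both ordinary and clustering models.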
doc/cluster_design.md
Outdated
- `my_cluster_model` is the trained cluster model.
- `run_pretrain` is used to determine if autoencoder pretrain needs to be run, default true.
- `model.existed_pretrain_model` is used to specify an existing pretrain_model
- `output_table` is the cluster result for input_table, which is adding the `group_id` column predicted by the cluster model to the input_table.
Maybe specifying the result column with `PREDICT output_table.group_id` is more accurate.
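For illustration, producing `output_table` amounts to copying each input row and appending a `group_id` column. The sketch below uses nearest-centroid assignment as a stand-in for the trained cluster model; the helper `add_group_id`, the centroids, and the sample rows are all made up.

```python
def add_group_id(rows, centroids, feature_columns):
    """Return copies of rows with a group_id column appended.

    Stand-in prediction: group_id is the index of the nearest centroid.
    The real cluster model in the design doc would supply this value.
    """
    output = []
    for row in rows:
        features = [row[name] for name in feature_columns]
        distances = [
            sum((f - c) ** 2 for f, c in zip(features, centroid))
            for centroid in centroids
        ]
        labeled = dict(row)  # copy so the input table is left untouched
        labeled["group_id"] = distances.index(min(distances))
        output.append(labeled)
    return output


input_table = [{"m1": 0.0, "m2": 0.1}, {"m1": 5.0, "m2": 5.2}]
output_table = add_group_id(
    input_table, centroids=[(0.0, 0.0), (5.0, 5.0)], feature_columns=["m1", "m2"]
)
```

Naming the result column explicitly, as in `output_table.group_id`, corresponds to the single key this helper adds.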
doc/cluster_design.md
Outdated
- `output_table` is the cluster result for input_table, which is adding the `group_id` column predicted by the cluster model to the input_table.

## clusterModel Details
<img src="figures/cluster_model_train_overview.png">
How about moving the cluster model introduction section to the top of the document? The structure can be:
- ClusterModel introduction
- User interface in SQLFlow
- How to implement ClusterModel in SQLFlow
doc/cluster_design.md
Outdated
The below figure demonstrates overall workflow for clusterModel train. This figure includes two parts, the pretrian autoencode model and the cluster model are included.
1. First, the former is used to train a pretrain model. The `model.encode_units` describes the layer structure of the encoder of the autoencoder network. We only use the output of the trained encode layer (10000*7) as the input to the clustering model.
2. Then, the clustering model starts training, randomly initializes weights and multiple iterations, generates clustering models.
3. Finally, the overall train process ultimately outputs an unsupervised clustering model.
How about splitting the `Cluster` section into `Train` and `Predict`, so that users know what `TRAIN SQL` and `PREDICT SQL` do?
doc/cluster_design.md
Outdated
- `output_table` is the cluster result for input_table, which is adding the `group_id` column predicted by the cluster model to the input_table.

## clusterModel Details
<img src="figures/cluster_model_train_overview.png">
I think the `decoder` should be included in the `Pre-train` stage, because the `auto-encoder` is used to build the `encoder` for the next training process, and the `decoder` is created at the same time. Even if the `decoder` will never be used in the future, it should still be treated as part of `Pre-train` (just my opinion).
@Echo9573 @BlackPoint-CX Thanks for submitting this excellent PR. I am approving this PR because the general design looks great to me.
Please also take a look at possible readability improvements mentioned by other reviewers. :)
Thanks for the excellent design for unsupervised learning. LGTM; we can merge this PR first and keep improving it during implementation.
#648