Design: Clustering in SQLflow #737

Echo9573 · 2019-08-29T05:47:58Z

Yancey0623 · 2019-08-29T09:28:25Z

doc/cluster_design.md

+
+For analysts and real business people, in the daily analysis work, most of the work is not prediction, but analysis of the patterns in the data. This can help them mine user behavioral characteristics and differences, helping the business discover value and operate.
+
+This design doc introduces how to support the `Cluster Model` in SQLFlow. 


Add a section to introduce the Cluster Model?

As one of the most powerful methods in pattern recognition, clustering focuses on finding the similarities among items within a group and the differences between groups. Hence, the Cluster Model can help analysts build a model that automatically splits data samples into different groups according to their features.

@Echo9573 If it can help.

Yancey0623 · 2019-08-29T09:32:29Z

doc/cluster_design.md

+
+``` sql
+SELECT * FROM train_table
+EXTRCT clusterModel


How do you want to expose ClusterModel to SQLFlow? In the current implementation, this should be a TensorFlow premade Estimator or a custom Keras model. Can we implement ClusterModel as a custom Keras model?

Yancey0623 · 2019-08-29T09:34:20Z

doc/cluster_design.md

+- 
+
+## Note
+- The **EXTRCT SQL** includes two models, the autoencode model and the cluster model. 


Maybe we need a more general target for the keyword EXTRACT

This keyword is just for illustrating the design. Next, maybe we can think about a better keyword together.

Yancey0623 · 2019-08-29T09:34:35Z

doc/cluster_design.md

+- 
+
+## Note
+- The **EXTRCT SQL** includes two models, the autoencode model and the cluster model. 


EXTRCT SQL => EXTRACT SQL

terrytangyuan · 2019-08-29T12:26:27Z

doc/cluster_design.md

+
+## User interface
+
+Users usually use a **TRAIN SQL** to train a model in Supervised learning. But, in this scenario, we focus on the extraction of data patterns in unsupervised learning. Therefore, we use **EXTRCT SQL** for pattern extraction, the simple pipeline like:


Do you mean EXTRACT?

terrytangyuan · 2019-08-29T12:29:35Z

Are we planning to support semi-supervised clustering? If so, we may allow users to optionally pass a subset of labels.

typhoonzero · 2019-08-29T12:54:44Z

doc/cluster_design.md

+
+``` sql
+SELECT * FROM train_table
+EXTRCT clusterModel


"EXTRACT" a model does not read well, we can train a model, save a model, use a model to do prediction but not extract a model. Since the model name is "clusterModel", we can distinguish the model type just like we did in xgboost. Maybe we can use TRAIN ClusterModel.myDeepClusteringModel then the code generator can generate specific code for the training pipeline for clustering models.

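The prefix-based dispatch suggested above could look roughly like the following Python sketch. The function name `select_codegen`, the returned labels, and the spelling of the `xgboost` prefix are hypothetical and only illustrate the idea; they are not SQLFlow's actual code generator API.

```python
def select_codegen(model_name: str) -> str:
    """Pick a code generator from the model identifier in the TRAIN clause.

    Hypothetical helper: the prefixes and labels are assumptions for illustration.
    """
    # "ClusterModel.myDeepClusteringModel" -> prefix "ClusterModel"
    prefix, _, _ = model_name.partition(".")
    generators = {
        "clustermodel": "clustering",  # e.g. TRAIN ClusterModel.myDeepClusteringModel
        "xgboost": "xgboost",          # assumed spelling of the xgboost prefix
    }
    # Identifiers without a known prefix fall back to the regular TensorFlow path.
    return generators.get(prefix.lower(), "tensorflow")
```

With this shape, no new keyword is needed: the model identifier alone decides which training pipeline is generated.
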
tonyyang-svail

Hi @Echo9573, thanks for this informative design doc. It helped the team realize that a significant portion of an analyst's work is unsupervised clustering.

doc/cluster_design.md

Echo9573 · 2019-08-30T00:03:54Z

Are we planning to support semi-supervised clustering? If so, we may allow users to optionally pass a subset of labels.

We would not consider semi-supervised learning at present, because it may not be widely used among analysts.

terrytangyuan · 2019-08-30T00:11:17Z

Are we planning to support semi-supervised clustering? If so, we may allow users to optionally pass a subset of labels.

We would not consider semi-supervised learning at present, because it may not be widely used among analysts.

Is there a reference for that? It is pretty popular in applications where there are not sufficient labels or where the labels are polluted. This will help us decide whether we need to reuse TRAIN for unsupervised problems and LABEL for semi-supervised problems instead of inventing a new EXTRACT syntax.

Yancey0623 · 2019-08-30T04:06:17Z

Is there a reference for that? It is pretty popular in applications where there are not sufficient labels or where the labels are polluted. This will help us decide whether we need to reuse TRAIN for unsupervised problems and LABEL for semi-supervised problems instead of inventing a new EXTRACT syntax.

@terrytangyuan @Echo9573

From the design doc, maybe the root reason for the EXTRACT keyword is:

The output of ClusteringModel includes both the trained model and the clustering result, which differs from TRAIN SQL.
Other unsupervised algorithms like KMeans only need to output the result and do not need to train a model.

Maybe we can reuse the TRAIN keyword and extend the INTO clause to:

SELECT * FROM train_table
TRAIN ClusteringModel
INTO
    model = my_cluster_model
    table = cluster.predict

For the KMeans algorithm:

SELECT * FROM train_table
TRAIN KMeans
INTO
    table = kmeans.predict
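
As a rough illustration of how the extended INTO clause proposed above might be read, here is a minimal Python sketch. The function `parse_into_clause` and the restriction to the two keys `model` and `table` are assumptions taken from the two examples; SQLFlow's real parser is not shown in this thread.

```python
def parse_into_clause(lines):
    """Parse the `key = value` lines of the proposed extended INTO clause.

    Hypothetical helper: only the keys seen in the examples are accepted.
    """
    targets = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if not sep or not value:
            raise ValueError(f"expected `key = value`, got: {line!r}")
        if key not in ("model", "table"):
            raise ValueError(f"unknown INTO target: {key!r}")
        targets[key] = value
    return targets
```

For the ClusteringModel example this yields both a `model` and a `table` target, while the KMeans example yields only `table`, which matches the distinction drawn in the comment.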

typhoonzero · 2019-08-31T06:22:00Z

doc/cluster_design.md

+- `output_table` is the cluster result for input_table, which is adding the `group_id` column predicted by the cluster model to the input_table.
+
+## clusterModel Details
+<img src="figures/cluster_model_train_overview.png">


Maybe paste the link here before this figure, so that readers can understand this figure well.

typhoonzero · 2019-08-31T06:22:43Z

doc/cluster_design.md

+```python
+class  clusterModel(tf.keras.Model):
+
+	def pre_train(dataset):


please remove the blank lines and use 4 spaces as indentation.

terrytangyuan · 2019-08-31T07:52:03Z

doc/cluster_design.md

+2.  run_pretrain = true & Using model.existed_pretrain_model = existed_pretrain_model：
+existed_pretrain_model Pretrain+ Random initialization weights for cluster. (Note that model.encode_units `is not work` at this time.)
+3.  run_pretrain = false & Using model.existed_pretrain_model = None: 
+Random initialization weights for cluster. (Note that model.encode_units `is not work` at this time.)


You mean “does not work”? Same in other places. Also did you mean to use double quotes instead?

The USING statement has a higher precedence than pre_train=True in the WITH statement. Since model.encode_units configures the pre-train part of the autoencoder network, and the pre-train setting has no effect when the USING statement exists, encode_units has no effect either.

I understand that. I was just asking you to revisit this part of the grammar. It should be “does” instead of “is”. And then switch to use double quotes.
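
The precedence rule described in this thread can be sketched in a few lines of Python. The function `resolve_pretrain_mode` and its return labels are hypothetical; the parameter names follow the `run_pretrain` and `existed_pretrain_model` attributes in the design doc, and the default of `True` follows the doc's statement that `run_pretrain` defaults to true.

```python
def resolve_pretrain_mode(run_pretrain=True, existed_pretrain_model=None):
    """Decide how the encoder weights are obtained before cluster training.

    Hypothetical sketch: USING (existed_pretrain_model) takes precedence
    over run_pretrain, as described in the review comment.
    """
    if existed_pretrain_model is not None:
        # USING wins: load the given model, so encode_units has no effect.
        return {"mode": "load_existing", "uses_encode_units": False}
    if run_pretrain:
        # No USING clause: train the autoencoder described by encode_units.
        return {"mode": "train_autoencoder", "uses_encode_units": True}
    # Neither pre-training nor an existing model: random initialization.
    return {"mode": "random_init", "uses_encode_units": False}
```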

Yancey0623 · 2019-09-01T02:45:22Z

doc/cluster_design.md

+    model.n_clusters = 5
+    model.run_pretrain = false
+COLUMN m1, m2, m3, m4, m5, m6, m7, m8, m9, m10 
+USING model.existed_pretrain_model =  existed_pretrain_model


Do we need to refine the USING syntax? Or just follow the existing syntax USING existed_pretrain_model.

In my opinion, it can be divided into two scenarios:

The user has no pre-trained model (auto-encoder).
In this scenario the user wants the full training process, which consists of auto-encoder pre-training and clustering. The user should therefore be forbidden from using the USING clause, because there is no pre-trained model ready to use in the training process. The user should clearly define the related parameters of the auto-encoder model in the WITH clause, such as model.encode_units.

The user has a pre-trained model (auto-encoder).
In this case, the user already has at least one pre-trained auto-encoder model and wants to use it without training this part again. The USING clause should define the path/name of the pre-trained auto-encoder model, and in the WITH clause the user should guarantee the correct structure of the pre-trained model so that the model data can be loaded correctly. This requires some additional checks to be performed in the background.
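
The "additional checks" for the two scenarios might look something like the following Python sketch. Everything here is hypothetical: the function `validate_pretrain_clauses`, the dictionary form of the WITH attributes, and the `saved_units` argument (standing in for layer widths read back from a saved model) are assumptions made for illustration.

```python
def validate_pretrain_clauses(with_attrs, using_model=None, saved_units=None):
    """Check the WITH/USING combination for the two scenarios.

    Hypothetical sketch; `saved_units` stands in for the layer widths
    that would be read from the pre-trained model.
    """
    declared = with_attrs.get("model.encode_units")
    if using_model is None:
        # Scenario 1: full training, so the encoder structure must be declared.
        if not declared:
            raise ValueError("model.encode_units must be set in WITH")
        return "full_training"
    # Scenario 2: reuse a pre-trained model; a declared structure must match it.
    if declared is not None and saved_units is not None:
        if list(declared) != list(saved_units):
            raise ValueError("declared encode_units do not match the saved model")
    return "reuse_pretrained"
```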

Yancey0623 · 2019-09-01T02:49:19Z

doc/cluster_design.md

+
+- template_tf.go
+```python
+if 'pre_train' is in classifier:


It's not Python syntax; maybe use:

    if hasattr(classifier, 'pre_train'):
        classifier.pre_train(...)
    if hasattr(classifier, 'cluster_train_loop'):
        classifier.cluster_train_loop(...)
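
A self-contained version of this `hasattr` dispatch is sketched below. The stand-in classes `FakeClusterModel` and `PlainModel` and the helper `run_optional_hooks` are hypothetical, and passing a single `dataset` argument to both hooks is an assumption; the thread only shows `pre_train(dataset)`.

```python
class FakeClusterModel:
    """Stand-in for a Keras model that defines both optional hooks."""

    def __init__(self):
        self.calls = []

    def pre_train(self, dataset):
        self.calls.append("pre_train")

    def cluster_train_loop(self, dataset):
        self.calls.append("cluster_train_loop")


class PlainModel:
    """Stand-in for a model without clustering-specific hooks."""

    def __init__(self):
        self.calls = []


def run_optional_hooks(classifier, dataset):
    """Call each clustering hook only if the model defines it."""
    for hook in ("pre_train", "cluster_train_loop"):
        if hasattr(classifier, hook):
            getattr(classifier, hook)(dataset)
    return classifier
```

Models that lack the hooks pass through untouched, so the same generated template could serve both clustering and regular models.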

Yancey0623 · 2019-09-01T02:58:32Z

doc/cluster_design.md

+- `my_cluster_model` is the trained cluster model.
+- `run_pretrain`  is used to determine if autoencoder pretrain needs to be run, default true.
+- `model.existed_pretrain_model` is used to specify an existing pretrain_model
+- `output_table` is the cluster result for input_table, which is adding the `group_id` column predicted by the cluster model to the input_table.


Maybe specifying the result column with PREDICT output_table.group_id would be more accurate.
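
To make the `output_table.group_id` form concrete, here is a minimal Python sketch that splits the target into a table and a column and appends the predicted group to each input row. The function `attach_group_ids` and the list-of-dicts row representation are hypothetical and used only for illustration.

```python
def attach_group_ids(rows, group_ids, predict_target="output_table.group_id"):
    """Append the predicted cluster id to each input row.

    Hypothetical sketch: rows are plain dicts, one per record of input_table.
    """
    table, _, column = predict_target.rpartition(".")
    if not table or not column:
        raise ValueError("expected a target of the form table.column")
    if len(rows) != len(group_ids):
        raise ValueError("one group id is required per input row")
    # Copy each row so the input table is left unchanged.
    return table, [dict(row, **{column: gid}) for row, gid in zip(rows, group_ids)]
```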

Yancey0623 · 2019-09-01T03:03:11Z

doc/cluster_design.md

+- `output_table` is the cluster result for input_table, which is adding the `group_id` column predicted by the cluster model to the input_table.
+
+## clusterModel Details
+<img src="figures/cluster_model_train_overview.png">


How about moving the cluster model introduction section to the top of the document? The structure could be:

ClusterModel introduction

User interface in SQLFlow

How to implement ClusterModel in SQLFlow

Yancey0623 · 2019-09-01T03:08:39Z

doc/cluster_design.md

+The below figure demonstrates overall workflow for clusterModel train. This figure includes two parts, the pretrian autoencode model and the cluster model are included.
+1. First, the former is used to train a pretrain model. The `model.encode_units` describes the layer structure of the encoder of the autoencoder network. We only use the output of the trained encode layer (10000*7) as the input to the clustering model. 
+2. Then, the clustering model starts training, randomly initializes weights and multiple iterations, generates clustering models.
+3. Finally, the overall train process ultimately outputs an unsupervised clustering model.


How about splitting the Cluster section into Train and Predict, so that users know what TRAIN SQL and PREDICT SQL do?

BlackPoint-CX · 2019-09-01T09:22:41Z

doc/cluster_design.md

+- `output_table` is the cluster result for input_table, which is adding the `group_id` column predicted by the cluster model to the input_table.
+
+## clusterModel Details
+<img src="figures/cluster_model_train_overview.png">


I think the decoder should be included in the Pre-train stage, because the auto-encoder is used to build the encoder for the next training process and the decoder is created at the same time. Even if the decoder is never used afterwards, it should still be treated as part of Pre-train (just my opinion).
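
The point that the decoder is created together with the encoder can be illustrated with a small Python sketch that derives decoder layer widths from `encode_units`. The mirrored (symmetric) layout is a common autoencoder convention and an assumption here, not something the design doc states; the function name and the example widths are hypothetical.

```python
def autoencoder_layer_units(input_dim, encode_units):
    """Return (encoder, decoder) layer widths for a symmetric autoencoder.

    Assumption: the decoder mirrors the encoder's hidden layers and ends
    with a layer as wide as the input, so it can reconstruct the input.
    """
    encoder = list(encode_units)
    # Walk the hidden widths backwards, skipping the bottleneck itself.
    decoder = encoder[-2::-1] + [input_dim]
    return encoder, decoder
```

For example, with 10 input columns and a 7-unit bottleneck, `autoencoder_layer_units(10, [500, 500, 2000, 7])` gives a decoder of `[2000, 500, 500, 10]`; only the encoder half would then feed the clustering stage.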

tonyyang-svail

@Echo9573 @BlackPoint-CX Thanks for submitting this excellent PR. I am approving this PR because the general design looks great to me.

Please also take a look at possible readability improvements mentioned by other reviewers. :)

Yancey0623

Thanks for the excellent design for unsupervised learning. LGTM; we can merge this PR first and keep improving it during implementation.

Echo9573 added 2 commits August 28, 2019 19:54

Fix executor test

6e34279

Design: Clustering in SQLflow

d44903e

Echo9573 requested a review from Yancey0623 August 29, 2019 05:48

Echo9573 changed the title ~~Zwj~~ Design doc: Clustering in SQLflow Aug 29, 2019

Echo9573 changed the title ~~Design doc: Clustering in SQLflow~~ Design: Clustering in SQLflow Aug 29, 2019

fix:Design of Clustering in SQLflow

50a74c8

Yancey0623 reviewed Aug 29, 2019

Yancey0623 requested a review from typhoonzero August 29, 2019 09:55

terrytangyuan reviewed Aug 29, 2019

typhoonzero reviewed Aug 29, 2019

tonyyang-svail reviewed Aug 29, 2019

Echo9573 added 3 commits August 30, 2019 23:40

cluster_model_train_overview.png

e53cbd3

fix 2.0 Design: Clustering in SQLflow

b86bce7

fix2.0 Design: Clustering in SQLflow

25e19a0

Echo9573 requested review from Yancey0623, tonyyang-svail, typhoonzero and terrytangyuan August 31, 2019 03:19

typhoonzero reviewed Aug 31, 2019

terrytangyuan reviewed Aug 31, 2019

Yancey0623 reviewed Sep 1, 2019

BlackPoint-CX reviewed Sep 1, 2019

tonyyang-svail previously approved these changes Sep 2, 2019

fix3.0 Design: Clustering in SQLflow

0322358

Echo9573 dismissed tonyyang-svail’s stale review via 0322358 September 2, 2019 08:55

modify cluster_model_train_overview.png

0b4302c

Echo9573 requested review from Yancey0623, tonyyang-svail, typhoonzero, BlackPoint-CX and terrytangyuan September 2, 2019 09:01

Yancey0623 approved these changes Sep 2, 2019

Echo9573 merged commit b741910 into develop Sep 2, 2019

