Design: Clustering in SQLflow #737

Merged
merged 8 commits into from
Sep 2, 2019

Echo9573
Collaborator

@Echo9573 Echo9573 requested a review from Yancey0623 August 29, 2019 05:48
@Echo9573 Echo9573 changed the title Zwj Design doc: Clustering in SQLflow Aug 29, 2019
@Echo9573 Echo9573 changed the title Design doc: Clustering in SQLflow Design: Clustering in SQLflow Aug 29, 2019

For analysts and business users, most daily analysis work is not prediction but the analysis of patterns in the data. Such analysis helps them mine users' behavioral characteristics and differences, which helps the business discover value and operate.

This design doc introduces how to support the `Cluster Model` in SQLFlow.
Collaborator

Add a section to introduce the Cluster Model?

As one of the most powerful methods in pattern recognition, clustering focuses on finding the similarities between items within a group and the differences between groups. Hence, Cluster Model can help analysts build a model that automatically splits data samples into different groups according to their features.

@Echo9573 If it can help.
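To make the description above concrete, here is a toy illustration of what "splitting samples into groups according to their features" means. It is a plain k-means on a single scalar feature, not the deep clustering model proposed in this PR; `kmeans_1d` and the sample values are made up for illustration.

```python
import random


def kmeans_1d(values, k, iterations=20, seed=0):
    """Toy k-means on scalar features; returns (centroids, group_ids)."""
    rng = random.Random(seed)
    # Start from k distinct samples as the initial centroids.
    centroids = rng.sample(values, k)
    group_ids = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        group_ids = [
            min(range(k), key=lambda c: abs(v - centroids[c]))
            for v in values
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, g in zip(values, group_ids) if g == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, group_ids


# Two obvious groups: values near 1 and values near 10.
values = [1.0, 1.2, 0.8, 9.9, 10.1, 10.3]
centroids, group_ids = kmeans_1d(values, k=2)
```

Samples that are similar end up with the same `group_id`, and the two groups are far apart, which is the "similar within a group, different between groups" property described above.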


``` sql
SELECT * FROM train_table
EXTRCT clusterModel
```
Collaborator

How do you want to expose ClusterModel to SQLFlow? In the current implementation, this should be a TensorFlow premade Estimator or a custom Keras Model. Can we implement ClusterModel as a custom Keras Model?

-

## Note
- The **EXTRCT SQL** includes two models, the autoencode model and the cluster model.
Collaborator

Maybe we need a more general target for the keyword EXTRACT

Collaborator Author

This keyword is just for illustration of the design. Next, maybe we can think about a better keyword together.

-

## Note
- The **EXTRCT SQL** includes two models, the autoencode model and the cluster model.
Collaborator

EXTRCT SQL => EXTRACT SQL

@Yancey0623 Yancey0623 requested a review from typhoonzero August 29, 2019 09:55

## User interface

Users usually use a **TRAIN SQL** to train a model in Supervised learning. But, in this scenario, we focus on the extraction of data patterns in unsupervised learning. Therefore, we use **EXTRCT SQL** for pattern extraction, the simple pipeline like:
Member

Do you mean EXTRACT?

Collaborator Author

yes~

@terrytangyuan
Member

Are we planning to support semi-supervised clustering? If so, we may allow users to optionally pass a subset of labels.


``` sql
SELECT * FROM train_table
EXTRCT clusterModel
```
Collaborator

"EXTRACT" a model does not read well: we can train a model, save a model, or use a model to do prediction, but not extract a model. Since the model name is "clusterModel", we can distinguish the model type just as we did for xgboost. Maybe we can use TRAIN ClusterModel.myDeepClusteringModel; then the code generator can generate specific code for the training pipeline of clustering models.

Collaborator

@tonyyang-svail tonyyang-svail left a comment

Hi @Echo9573, thanks for this informative design doc. It helped the team realize that a significant portion of an analyst's work is unsupervised clustering.

@Echo9573
Collaborator Author

> Are we planning to support semi-supervised clustering? If so, we may allow users to optionally pass a subset of labels.

We would not consider semi-supervised learning at present, because it may not be widely used among analysts.

@terrytangyuan
Member

> > Are we planning to support semi-supervised clustering? If so, we may allow users to optionally pass a subset of labels.
>
> We would not consider semi-supervised learning at present, because it may not be widely used among analysts.

Is there a reference for that? It is pretty popular in applications where there are not sufficient labels or that the labels are polluted. This will help us decide whether we need to reuse TRAIN for unsupervised problems and LABEL for semi-supervised problems instead of inventing a new EXTRACT syntax.

@Yancey0623
Collaborator

> Is there a reference for that? It is pretty popular in applications where there are not sufficient labels or that the labels are polluted. This will help us decide whether we need to reuse TRAIN for unsupervised problems and LABEL for semi-supervised problems instead of inventing a new EXTRACT syntax.

@terrytangyuan @Echo9573

From the design doc, maybe the root reason for the EXTRACT keyword is:

  • The output of ClusteringModel includes both the trained model and the clustering result, which differs from TRAIN SQL.
  • Other unsupervised algorithms like KMeans only need to output the result and don't need to train a model.

Maybe we can reuse the TRAIN keyword and extend the INTO clause to:

SELECT * FROM train_table
TRAIN ClusteringModel
INTO
    model = my_cluster_model
    table = cluster.predict

For the KMeans algorithms:

SELECT * FROM train_table
TRAIN KMeans
INTO
    table = kmeans.predict
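
To illustrate how the extended INTO clause proposed above could be read by a code generator, here is a minimal sketch. It is not SQLFlow's parser: `parse_into_clause` is a hypothetical helper, and the assumption that only the keys `model` and `table` are allowed is taken from the two examples above.

```python
def parse_into_clause(clause):
    """Parse an 'INTO' block of 'key = value' lines into a dict.

    Hypothetical helper for illustration; the allowed keys ('model',
    'table') are assumed from the two examples in this comment.
    """
    lines = [ln.strip() for ln in clause.strip().splitlines() if ln.strip()]
    if not lines or lines[0].upper() != "INTO":
        raise ValueError("clause must start with INTO")
    targets = {}
    for ln in lines[1:]:
        key, sep, value = ln.partition("=")
        if not sep:
            raise ValueError(f"expected 'key = value', got {ln!r}")
        targets[key.strip()] = value.strip()
    unknown = set(targets) - {"model", "table"}
    if unknown:
        raise ValueError(f"unknown INTO targets: {sorted(unknown)}")
    return targets


# ClusteringModel: saves a trained model and writes the clustering result.
clustering = parse_into_clause("""
INTO
    model = my_cluster_model
    table = cluster.predict
""")

# KMeans: only writes the clustering result.
kmeans = parse_into_clause("""
INTO
    table = kmeans.predict
""")
```

With this shape, whether a model is saved depends only on whether `model` appears in the parsed targets, so TRAIN could serve both cases without a new keyword.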

- `output_table` is the cluster result for input_table, which is adding the `group_id` column predicted by the cluster model to the input_table.

## clusterModel Details
<img src="figures/cluster_model_train_overview.png">
Collaborator

Maybe paste the link here before this figure, so that readers can understand this figure well.

```python
class clusterModel(tf.keras.Model):

def pre_train(dataset):
```
Collaborator

please remove the blank lines and use 4 spaces as indentation.

Collaborator Author

OK~~thx~

2. run_pretrain = true & Using model.existed_pretrain_model = existed_pretrain_model:
existed_pretrain_model Pretrain+ Random initialization weights for cluster. (Note that model.encode_units `is not work` at this time.)
3. run_pretrain = false & Using model.existed_pretrain_model = None:
Random initialization weights for cluster. (Note that model.encode_units `is not work` at this time.)
Member

You mean “does not work”? Same in other places. Also did you mean to use double quotes instead?

Collaborator Author

The USING statement has a higher precedence than pre_train=True in the WITH statement. Since model.encode_units configures the pre_train part of the autoencoder network, the pre_train setting has no effect when the USING statement exists, so encode_units has no effect either.

Member

@terrytangyuan terrytangyuan Aug 31, 2019

I understand that. I was just asking you to revisit this part of the grammar. It should be “does” instead of “is”. And then switch to use double quotes.

Collaborator Author

OK~~thx!

model.n_clusters = 5
model.run_pretrain = false
COLUMN m1, m2, m3, m4, m5, m6, m7, m8, m9, m10
USING model.existed_pretrain_model = existed_pretrain_model
Collaborator

Do we need to refine the USING syntax? Or just follow the existing syntax, USING existed_pretrain_model?

In my opinion, it can be divided into two scenarios:

  1. The user has no pre-trained model (auto-encoder).
    In this scenario the user wants the full training process, which consists of the auto-encoder and the clustering. The user should therefore be forbidden from using the USING clause, because no pre-trained model is available for the training process. The user should define the related parameters of the auto-encoder model clearly in the WITH clause, such as model.encode_units.

  2. The user has a pre-trained model (auto-encoder).
    In this case, the user already has at least one pre-trained auto-encoder model and wants to use it without training this part again. The USING clause should be used to define the path/name of the pre-trained auto-encoder model, and in the WITH clause the user should guarantee the correct structure of the pre-trained model so that the model data can be loaded correctly. This requires some additional checks to be performed in the background.
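
The two scenarios above, combined with the precedence rule stated earlier in this thread (USING wins over the pre-train setting), can be read as a small decision function. The following is only a sketch under those assumptions: `resolve_pretrain_plan`, the mode names, and the example `encode_units` value are hypothetical and are not part of the design doc.

```python
def resolve_pretrain_plan(run_pretrain=True, existed_pretrain_model=None,
                          encode_units=None):
    """Decide how the autoencoder stage is handled (illustrative only).

    Returns (mode, effective_encode_units). The mode names are made up.
    """
    if existed_pretrain_model is not None:
        # USING has higher precedence: reuse the saved autoencoder and
        # ignore both run_pretrain and encode_units.
        return ("load_existing", None)
    if run_pretrain:
        # Full process: encode_units describes the encoder layers.
        if not encode_units:
            raise ValueError("encode_units is required when pre-training")
        return ("pretrain", list(encode_units))
    # No USING and no pre-training: clustering weights start random,
    # and encode_units has no effect.
    return ("random_init", None)


# Hypothetical layer sizes, used only to exercise the function.
full = resolve_pretrain_plan(run_pretrain=True, encode_units=[100, 7])
reuse = resolve_pretrain_plan(run_pretrain=True,
                              existed_pretrain_model="existed_pretrain_model",
                              encode_units=[100, 7])
skip = resolve_pretrain_plan(run_pretrain=False)
```

Whether supplying both USING and `model.encode_units` should be an error rather than silently ignored is exactly the question raised in scenario 1; this sketch ignores the value, following the note in the doc.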


- template_tf.go
```python
if 'pre_train' is in classifier:
```
Collaborator

This is not valid Python syntax; maybe use:

if hasattr(classifier, 'pre_train'):
    classifier.pre_train(...)
if hasattr(classifier, 'cluster_train_loop'):
    classifier.cluster_train_loop(...)
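
A runnable version of the `hasattr` dispatch suggested above might look like the sketch below. The two classes are stand-ins invented for this example, and passing `dataset` as the only argument is an assumption; the real template would call these hooks on the generated Keras model.

```python
class DemoClusterModel:
    """Stand-in for a clustering model that defines both optional hooks."""

    def pre_train(self, dataset):
        # Placeholder for autoencoder pre-training.
        pass

    def cluster_train_loop(self, dataset):
        # Placeholder for the clustering training loop.
        pass


class PlainModel:
    """Stand-in for an ordinary model without clustering hooks."""


def run_cluster_hooks(classifier, dataset):
    """Call the optional hooks if present; return the names that ran."""
    invoked = []
    if hasattr(classifier, "pre_train"):
        classifier.pre_train(dataset)
        invoked.append("pre_train")
    if hasattr(classifier, "cluster_train_loop"):
        classifier.cluster_train_loop(dataset)
        invoked.append("cluster_train_loop")
    return invoked


cluster_hooks = run_cluster_hooks(DemoClusterModel(), dataset=[])
plain_hooks = run_cluster_hooks(PlainModel(), dataset=[])
```

Models that do not define the hooks are left untouched, so the same template can serve both ordinary and clustering models.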

- `my_cluster_model` is the trained cluster model.
- `run_pretrain` is used to determine if autoencoder pretrain needs to be run, default true.
- `model.existed_pretrain_model` is used to specify an existing pretrain_model
- `output_table` is the cluster result for input_table, which is adding the `group_id` column predicted by the cluster model to the input_table.
Collaborator

It may be more accurate to specify the result column as PREDICT output_table.group_id.
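
For reference, the `output_table` described in the doc is the input rows plus one predicted column. A minimal sketch of that shape, assuming rows are plain dicts and using a hypothetical helper `attach_group_id`:

```python
def attach_group_id(rows, group_ids, column="group_id"):
    """Return copies of the input rows with the predicted column added."""
    if len(rows) != len(group_ids):
        raise ValueError("need exactly one group id per input row")
    return [{**row, column: gid} for row, gid in zip(rows, group_ids)]


# Made-up input rows and predictions, for illustration only.
input_table = [{"m1": 0.1, "m2": 0.9}, {"m1": 0.8, "m2": 0.2}]
output_table = attach_group_id(input_table, [0, 1])
```

Naming the column explicitly, as in PREDICT output_table.group_id, corresponds to the `column` parameter here.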

- `output_table` is the cluster result for input_table, which is adding the `group_id` column predicted by the cluster model to the input_table.

## clusterModel Details
<img src="figures/cluster_model_train_overview.png">
Collaborator

How about moving the cluster model introduction section to the top of the document? The structure could be:

  1. ClusterModel introduction
  2. User interface in SQLFlow
  3. How to implement ClusterModel in SQLFlow

The figure below demonstrates the overall workflow for clusterModel training. It has two parts: the pretrained autoencoder model and the cluster model.
1. First, the autoencoder is trained as a pretrain model. `model.encode_units` describes the layer structure of the encoder of the autoencoder network. Only the output of the trained encoder layer (10000*7) is used as the input to the clustering model.
2. Then the clustering model starts training: it randomly initializes its weights and, over multiple iterations, produces the clustering model.
3. Finally, the overall training process outputs an unsupervised clustering model.
Collaborator

How about splitting the Cluster section into Train and Predict, so that users know what TRAIN SQL and PREDICT SQL do?

- `output_table` is the cluster result for input_table, which is adding the `group_id` column predicted by the cluster model to the input_table.

## clusterModel Details
<img src="figures/cluster_model_train_overview.png">

I think the decoder should be included in the pre-train stage, because the auto-encoder is used to build the encoder for the next training process and the decoder is created at the same time. Even if the decoder is never used afterwards, it should still be treated as part of pre-training (just my opinion).

tonyyang-svail
tonyyang-svail previously approved these changes Sep 2, 2019
Collaborator

@tonyyang-svail tonyyang-svail left a comment

@Echo9573 @BlackPoint-CX Thanks for submitting this excellent PR. I am approving this PR because the general design looks great to me.

Please also take a look at possible readability improvements mentioned by other reviewers. :)

Collaborator

@Yancey0623 Yancey0623 left a comment

Thanks for the excellent design for unsupervised learning. LGTM; we can merge this PR first and keep improving it as we implement.

@Echo9573 Echo9573 merged commit b741910 into develop Sep 2, 2019
6 participants