
ElasticDL code overall review discussion

Bright edited this page Sep 25, 2019 · 11 revisions

20190924

Ready for execution

Need further consideration

  1. A common scenario: the task list contains a run of consecutive training tasks followed by a run of consecutive evaluation tasks. The dataset uses the prefetch function, so while a worker is handling the last training task in its sublist, the prefetch pulls all of the following evaluation tasks into that same worker.
  2. How should early stopping be done? It needs both the training and the evaluation metrics to make the decision.
  3. Refactor the evaluation process.
  4. Each time an ElasticDL command is executed, the client builds a new image. Support reusing an existing image.
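
The prefetch problem in item 1 can be illustrated with a small simulation. This is only a sketch in plain Python, not ElasticDL or tf.data code: `task_stream`, `prefetch`, `simulate`, the task names, and the buffer size of 2 are all hypothetical, and the prefetch here is a simplified single-threaded model of a read-ahead buffer.

```python
from collections import deque


def task_stream(tasks, pulled):
    """Yield tasks one by one, recording each task the moment it is pulled."""
    for task in tasks:
        pulled.append(task)
        yield task


def prefetch(iterator, buffer_size):
    """Simplified read-ahead: keep up to buffer_size items fetched ahead."""
    buffer = deque()
    for item in iterator:
        buffer.append(item)
        if len(buffer) > buffer_size:
            yield buffer.popleft()
    # The source is exhausted; drain whatever is still buffered.
    while buffer:
        yield buffer.popleft()


def simulate(tasks, buffer_size):
    """Map each task to the list of tasks already pulled when it is handed over."""
    pulled = []
    snapshot = {}
    for task in prefetch(task_stream(tasks, pulled), buffer_size):
        snapshot[task] = list(pulled)
    return snapshot


# Hypothetical task list: training tasks followed by evaluation tasks.
tasks = ["train-0", "train-1", "train-2", "eval-0", "eval-1"]
snapshot = simulate(tasks, buffer_size=2)

for task in tasks:
    print(task, "->", snapshot[task])
```

In this model, by the time the worker receives `train-2` it has already pulled `eval-0` and `eval-1`, so no other worker can take them. Calling `simulate(tasks, buffer_size=0)` shows the contrasting case in which nothing is pulled ahead of the task being handled.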

Include in next plan

  1. Support failover of the EmbeddingService Redis cluster. At present, it is a single point of failure.

Performance issue
