[refactor] Store and restore state along with checkpoints #4025
Conversation
ml-agents/mlagents/trainers/stats.py
Outdated
@@ -304,6 +343,39 @@ def __init__(self, category: str):
    def add_writer(writer: StatsWriter) -> None:
        StatsReporter.writers.append(writer)

    @staticmethod
StatsReporter doesn't seem like it's the right place to be saving and loading global state. Can you make a new class for this? TrainingGlobalState?
The main reason I put it in StatsReporter was that the trainers already have one initialized with the right key (the brain_name). No problem, it can be moved.
ml-agents/mlagents/trainers/learn.py
Outdated
@@ -82,6 +82,9 @@ def run_training(run_seed: int, options: RunOptions) -> None:
    )
    # Make run logs directory
    os.makedirs(run_logs_dir, exist_ok=True)
    # Load any needed states
    if checkpoint_settings.resume:
        StatsReporter.load_state(os.path.join(run_logs_dir, "training_status.json"))
Move "training_status.json" to a constant?
I've moved it to a constant and the method to a helper method similar to the timing tree and configuration file writes.
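A sketch of what such a helper might look like, mirroring the timing-tree and run-options write helpers the reply mentions (the constant and function names here are assumptions, not the actual code):

```python
import json
import os

# Assumed constant name, per the suggestion to avoid a hard-coded string.
TRAINING_STATUS_FILE_NAME = "training_status.json"


def write_training_status(output_dir: str, saved_state: dict) -> None:
    # Write the accumulated training state next to the other run logs,
    # in the same spirit as write_timing_tree / write_run_options.
    with open(os.path.join(output_dir, TRAINING_STATUS_FILE_NAME), "w") as f:
        json.dump(saved_state, f, indent=4)
```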
for brain_name, curriculum in self.brains_to_curricula.items():
    # Create a temporary StatsReporter with the right brain name
    _statsreporter = StatsReporter(brain_name)
It's unclear what's going on here; it feels really hacky (and probably brittle).
It's less of a concern now since we're no longer using StatsReporters, but we're still using the brain_name to refer to meta curriculums.
ml-agents/mlagents/trainers/stats.py
Outdated
    # Update saved state.
    StatsReporter.saved_state.update(loaded_dict)
except FileNotFoundError:
    pass
Should there be a warning here, or is it expected that this won't be there most of the time?
Added a warning. This should only happen if the user is loading from an older version of ML-Agents that did not save out the training_status.json file, or (as Vince mentioned below) the file was not saved out due to a crash.
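A minimal sketch of the load path with the warning discussed above, assuming the function and message wording (both are illustrative, not the actual implementation):

```python
import json
import logging

logger = logging.getLogger(__name__)


def load_state(path: str) -> dict:
    # Return the saved state, warning rather than raising if the file is
    # absent, e.g. when resuming a run made with an older ML-Agents version
    # or after a crash that prevented the file from being written.
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        logger.warning(
            f"Training status file {path} not found. Not all functions will resume."
        )
        return {}
```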
Please don't overload StatsReporter for this; I don't think it's the right place to stash the data.
ml-agents/mlagents/trainers/learn.py
Outdated
@@ -159,6 +162,7 @@ def run_training(run_seed: int, options: RunOptions) -> None:
    env_manager.close()
    write_run_options(write_path, options)
    write_timing_tree(run_logs_dir)
    StatsReporter.save_state(os.path.join(run_logs_dir, "training_status.json"))
I don't think this is the best place to store this global state. I would put it within the save model calls.
Hmm, that makes sense, since we'd want to resume in the event of a crash as well.
However, the save model calls will be moved into the trainers, since it seems logical that each trainer drives its own checkpointing (at possibly different frequencies). We wouldn't want the global save to be there as well. We could have each trainer manage its own save state, or have every trainer just trigger the save_state function.
An alternative would be to write to the JSON every time a new state is written. For values like the lesson number (which is written very infrequently) this is OK.
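The write-on-every-update alternative could look like the following sketch (class and method names are assumptions for illustration; this is not the PR's code):

```python
import json
from typing import Any, Dict


class EagerStatus:
    """Illustrative sketch: persist the JSON on every write. This is cheap
    for infrequently updated values such as the curriculum lesson number."""

    saved_state: Dict[str, Dict[str, Any]] = {}

    def __init__(self, path: str):
        self.path = path

    def set_parameter_state(self, category: str, key: str, value: Any) -> None:
        # Update the in-memory state, then immediately flush it to disk so a
        # crash never loses more than the value currently being written.
        EagerStatus.saved_state.setdefault(category, {})[key] = value
        with open(self.path, "w") as f:
            json.dump(EagerStatus.saved_state, f, indent=4)
```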
Co-authored-by: Chris Elion <[email protected]>
…ml-agents into develop-lessonresume
Add warning if file not found
type of status needed to be saved (e.g. Lesson Number). Finally the Value is the float value
attached to this stat.
"""
self.category: str = category
This still feels awkward having a mix of instance and static data. Why not drop self.category and add an extra argument to restore_parameter_state and store_parameter_state?
I went this way to keep it in line with the StatsReporter, and so that if a trainer uses it multiple times it doesn't need to keep passing in the category.
Don't have a strong preference since it's used much less frequently than StatsReporter; I'm OK with changing it to a parameter to the class methods. We could also then make the class methods static and not need an instance of this at all.
Changed it to two args for get_parameter_state and set_parameter_state.
Can you remove self.category (and the initializer) now?
Yes - oversight on my part. Removed both.
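The shape agreed on in this thread, with no instance-level category and the category passed to the methods instead, might look like this sketch (names taken from the discussion; details are assumptions):

```python
from enum import Enum
from typing import Any, Dict


class StatusType(Enum):
    # One of the status kinds mentioned in the thread; others are possible.
    LESSON_NUM = "lesson_num"


class GlobalTrainingStatus:
    # All state is class-level, so no instance (and no self.category) is
    # needed; callers pass the category to each method.
    saved_state: Dict[str, Dict[str, Any]] = {}

    @staticmethod
    def set_parameter_state(category: str, key: StatusType, value: Any) -> None:
        GlobalTrainingStatus.saved_state.setdefault(category, {})[key.value] = value

    @staticmethod
    def get_parameter_state(category: str, key: StatusType) -> Any:
        # Unknown category or key returns None rather than raising.
        return GlobalTrainingStatus.saved_state.get(category, {}).get(key.value)
```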
with open(path, "w") as f:
    json.dump(GlobalTrainingStatus.saved_state, f, indent=4)

def store_parameter_state(self, key: StatusType, value: Any) -> None:
nit: set_parameter_state() and get_parameter_state()? I think restore_parameter_state is a bad name since "restore" makes it sound like it's doing loading.
Changed.
def restore_parameter_state(self, key: StatusType) -> Any:
    """
    Stores an arbitrary-named parameter in training_status.json.
copy-pasted docstring.
👍 Fixed
statsreporter_new = GlobalTrainingStatus("Category1")
GlobalTrainingStatus.load_state(path_dir)
restored_val = statsreporter_new.restore_parameter_state(StatusType.LESSON_NUM)
Also test that restore_parameter_state() on an unknown category or StatusType returns None instead of raising an exception.
Added test for these.
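A self-contained pytest-style sketch of the suggested test; the tiny stub class below stands in for the real GlobalTrainingStatus so the example runs on its own (all names are assumptions):

```python
import json
import os
import tempfile


class _Status:
    """Minimal stand-in for GlobalTrainingStatus, for this test sketch only."""

    saved_state = {}

    @classmethod
    def load_state(cls, path):
        with open(path) as f:
            cls.saved_state.update(json.load(f))

    @classmethod
    def get_parameter_state(cls, category, key):
        return cls.saved_state.get(category, {}).get(key)


def test_restore_roundtrip_and_unknown():
    path = os.path.join(tempfile.mkdtemp(), "training_status.json")
    with open(path, "w") as f:
        json.dump({"Category1": {"lesson_num": 3}}, f)
    _Status.load_state(path)
    assert _Status.get_parameter_state("Category1", "lesson_num") == 3
    # Unknown category or key should return None, not raise.
    assert _Status.get_parameter_state("NoSuchCategory", "lesson_num") is None
    assert _Status.get_parameter_state("Category1", "no_such_key") is None
```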
One question, looks good though
Proposed change(s)
This PR adds a mechanism to store data that is then written out as a JSON file (training_status.json) that can be loaded on resume. This is done through a new class (GlobalTrainingStatus) that keeps both metadata about the JSON file and the key/values that need to be written, organized by behavior name. Currently, it stores just the lesson number during curriculum learning. This value is loaded when --resume is specified.

training_status.json is versioned in a very similar way to the timers.json file, and versions are checked on resume. Warnings are thrown if the versions don't match.

We also no longer need the --lesson CLI option, as that was only used to reset a lesson on resume.

Types of change(s)
Checklist
Other comments
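Since the description says training_status.json is versioned much like timers.json and keyed by behavior name, a purely illustrative shape might be (key names and the version string here are assumptions, not the actual format):

```json
{
    "metadata": {
        "version": "0.1.0"
    },
    "SomeBehaviorName": {
        "lesson_num": 2
    }
}
```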