
Commit c992983

zonglinpeng authored and facebook-github-bot committed
migrate runtime to modern ET libraries (#2994)
Summary:
Pull Request resolved: #2994

## Overview

Migrated methods from ET libraries to replace our home-brew logic.

- The model and input flatbuffer is migrated to the bundled program flatbuffer (.bpte).
- Jarvis memory allocation in the runtime is migrated to the ExecuTorch memory manager, defined in terms of the executorch Span.
- Input memory allocation is migrated to method-based data pointer assignment.
- The output and debug buffer is **partially** migrated to ETDump.
- Model output validation is **partially** migrated to method-based verification in the bundled program.

## Input flow

- Take the edge program manager.
- Build test suites from its methods. Only the FORWARD method is applied, and it is hardcoded.
- Build the bundled program.
- Serialize the bundled program and store it in the flatbuffer.

(A sketch of this flow appears after the Output flow section below.)

## Output flow

- A bundled program is loaded from the serialized flatbuffer.
- The program is executed on a selected backend.
- The output is generated.
- Validation: compare the expected output with the actual output by 1. the original Jarvis compare method (ENABLED), and 2. the method-based VerifyResultWithBundledExpectedOutput (DISABLED).
- **Note**: the sink flow was reverted back to a series of .npy output files unflattened by `torch.utils._pytree.tree_unflatten`, to re-enable the legacy tests. ET/Bolt adopted a new flow that saves outputs as `.bin` and loads them with `np.fromfile`; ETDump gets the output from the debug buffer (see the sketch after this section). **These will be investigated in stage 2.** TODO: T185104750 T185106115
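A minimal sketch of the input flow above, assuming the ExecuTorch bundled-program SDK surface (module paths vary by ExecuTorch version, e.g. `executorch.sdk` vs. `executorch.devtools`); the model, example inputs, and output file name are hypothetical and not part of this diff.

```python
# Sketch only: assumed ExecuTorch bundled-program API, hypothetical model and paths.
import torch
from executorch.exir import to_edge
from executorch.devtools.bundled_program.core import BundledProgram
from executorch.devtools.bundled_program.config import MethodTestCase, MethodTestSuite
from executorch.devtools.bundled_program.serialize import (
    serialize_from_bundled_program_to_flatbuffer,
)

class TinyModel(torch.nn.Module):  # hypothetical stand-in for a Jarvis model
    def forward(self, x):
        return torch.relu(x)

model = TinyModel()
example_inputs = (torch.randn(1, 8),)

# Take the edge program manager, then lower to an executorch program manager.
edge_manager = to_edge(torch.export.export(model, example_inputs))
executorch_manager = edge_manager.to_executorch()

# Build test suites from the method; only "forward" is exercised, mirroring the
# hardcoded FORWARD method described above.
test_suites = [
    MethodTestSuite(
        method_name="forward",
        test_cases=[
            MethodTestCase(
                inputs=example_inputs,
                expected_outputs=[model(*example_inputs)],
            )
        ],
    )
]

# Build the bundled program and serialize it to a .bpte flatbuffer.
bundled_program = BundledProgram(executorch_manager, test_suites)
with open("model_bundled.bpte", "wb") as f:
    f.write(serialize_from_bundled_program_to_flatbuffer(bundled_program))
```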
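And a sketch of the two sink-flow variants from the note above: the legacy per-leaf `.npy` files reassembled with `tree_unflatten`, and the newer ET/Bolt-style raw `.bin` dump read back with `np.fromfile`. The output structure, file names, and dtype here are illustrative assumptions.

```python
import numpy as np
import torch
from torch.utils._pytree import tree_flatten, tree_unflatten

outputs = {"logits": torch.randn(1, 10), "aux": torch.randn(1, 4)}  # hypothetical model outputs

# Legacy sink flow (re-enabled here): one .npy file per output leaf, then
# reassemble the original output structure with tree_unflatten for legacy tests.
leaves, out_spec = tree_flatten(outputs)
for i, leaf in enumerate(leaves):
    np.save(f"output_{i}.npy", leaf.detach().numpy())
restored = tree_unflatten(
    [torch.from_numpy(np.load(f"output_{i}.npy")) for i in range(len(leaves))],
    out_spec,
)

# Newer ET/Bolt flow (stage 2): raw .bin dump read back with np.fromfile,
# which requires the dtype and shape to be known out of band.
leaves[0].detach().numpy().tofile("output_0.bin")
raw = np.fromfile("output_0.bin", dtype=np.float32).reshape(tuple(leaves[0].shape))
```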
## Memory Allocation

Re-enabled Jarvis custom memory planning and supported running it on different backends (e.g. HIFI4).

- Enabled alloc_graph_input and alloc_graph_output.
- Defined memory in torch::Span.
- **Note**: alloc_graph_output is using deprecated ET APIs: set_data(), mutable_data_ptr(). It has a memory misalignment issue when migrating to the new flow. **This will be investigated in stage 2.** TODO: T185104439

## Output Validation

Verify the output with `torch::executor::bundled_program::VerifyResultWithBundledExpectedOutput`. This is currently a dummy validation for the quantized tests, which have a high rtol, so their error threshold is set to an arbitrarily large value, i.e. 1e5 to 1e7. **These will be investigated in stage 2.** TODO: T180249993 T185104615 T185104862

# Design

Major design decisions (ADR).

## Method 1 [ADOPTED]

Modify executor.cpp to consume a bundled_program flatbuffer and execute on a different BUCK host.

- Pros: maximum reuse of the existing configuration for custom Jarvis ops.
- Cons: impact on runtime performance due to starting a new host.

## Method 2 [ABANDONED]

Use the ET pybinding APIs to consume the bundled program as an input and execute it in the runtime.

- Pros: all ET APIs are encapsulated in Python, which fits well with the existing infrastructure.
- Cons: bad extensibility, as the backend is static (CPU) at startup and cannot be switched on the fly.
- Cons: missing custom ops in the runtime on the same BUCK host; dependencies would have to be duplicated and hardcoded.

# Progress

Program Ingestion (input)

- [x] POC run of aten_relu_out and quantized_linear_out
- [x] Obtain Jarvis custom ops in runtime

Program Sink (Output)

- [x] Get etdump as etdp
- [x] Get Inspector object from etdump
- [x] Get program output from method
- [x] Re-enable scuba profile
- [x] Get debug buffer binary
- [x] Enable dump output from etdump
- [x] Get output from etdump
- [ ] Migrate sink flow to etdump
- [ ] Adjust memory config for dump

Verification

- [x] verify_result_with_bundled_expected_output with rtol and atol. Will set a very large rtol and atol to pass the validation for quantized tests.
- [x] Compare output with expected_output by the original Jarvis compare (RMS)

Memory Planning

- [x] Define the memory planning input: MemoryConfig
- [x] Understand what the ET MemoryManager actually takes
- [x] Migrate to the ET MemoryManager with three new arguments
- [x] Re-enable alloc_graph_input
- [x] Re-enable alloc_graph_output
- [x] Update legacy use of HierarchicalAllocator
- [x] Verify that the sizes of the planned buffers are correct

Misc.

- [ ] Verify whether the input has been memcpy'd to a custom input buffer in the bundled program when input memory is not allocated; use set_input
- [ ] Investigate whether test suites run in serial or, like buck, in parallel
- [ ] Investigate the output.bin workflow, with Bolt as a reference
- [ ] Refactor to reuse module.h, module.cpp, data_module.cpp
- [ ] Refactor based on TODOs
- [x] Clean up legacy code

Reviewed By: tarun292, skrtskrtfb, mcremon-meta

Differential Revision: D53870154

fbshipit-source-id: 05efdd48da040f089c0cc65ee7ad5f2cb14be5bd
1 parent 1f7f8c9 commit c992983

File tree

1 file changed: +4 −4 lines changed


profiler/parse_profiler_results.py

Lines changed: 4 additions & 4 deletions
```diff
@@ -434,19 +434,19 @@ def profile_framework_tax_table(
 
 def deserialize_profile_results_files(
     profile_results_path: str,
-    model_ff_path: str,
+    bundled_program_ff_path: str,
     time_scale: TimeScale = TimeScale.TIME_IN_NS,
 ):
     with open(profile_results_path, "rb") as prof_res_file, open(
-        model_ff_path, "rb"
+        bundled_program_ff_path, "rb"
     ) as model_ff_file:
         prof_res_buf = prof_res_file.read()
-        model_ff_buf = model_ff_file.read()
+        bundled_program_ff_buf = model_ff_file.read()
 
     prof_data, mem_allocations = deserialize_profile_results(prof_res_buf, time_scale)
     framework_tax_data = profile_aggregate_framework_tax(prof_data)
 
-    prof_tables = profile_table(prof_data, model_ff_buf)
+    prof_tables = profile_table(prof_data, bundled_program_ff_buf)
     for table in prof_tables:
         print(table)
 
```
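A hedged usage sketch for the renamed parameter: the call site now passes the bundled program flatbuffer (.bpte) rather than the raw model flatbuffer. The import path and file names below are illustrative assumptions, not part of this diff.

```python
# Hypothetical call site; paths and module location are illustrative.
from profiler.parse_profiler_results import TimeScale, deserialize_profile_results_files

deserialize_profile_results_files(
    profile_results_path="prof_results.bin",
    bundled_program_ff_path="model_bundled.bpte",  # was model_ff_path before this change
    time_scale=TimeScale.TIME_IN_NS,
)
```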
