Skip to content

WIP: (//core): Added device meta data serialization/deserialization implic… #175

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

andi4191
Copy link
Contributor

@andi4191 andi4191 commented Aug 24, 2020

…itly

Signed-off-by: Anurag Dixit [email protected]

Description

This change enables the end-user to bypass setting CUDA device ID at runtime. It automatically sets the CUDA device based on serialization.

Fixes # (issue)

Type of change

Please delete options that are not relevant and/or add your own.

  • New feature (non-breaking change which adds functionality)
    Added abstraction of Device Info configuration at runtime

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes

@andi4191 andi4191 requested a review from narendasan August 24, 2020 22:45
@andi4191 andi4191 self-assigned this Aug 24, 2020
@andi4191 andi4191 added component: core Issues re: The core compiler component: execution Issues re: Execution of engines labels Aug 24, 2020
@narendasan
Copy link
Collaborator

@andi4191 Any updates on the comments. Also there was some changes in the execution phase to support the non deterministic binding order. You will probably need to rebase on master

@andi4191 andi4191 force-pushed the anuragd/device_info_serialize branch from 819f6d6 to 9c70a39 Compare September 2, 2020 06:06
@narendasan
Copy link
Collaborator

All existing tests pass on x86 and Jetson for me

@andi4191 andi4191 force-pushed the anuragd/device_info_serialize branch from afb0de7 to 1d3529a Compare September 17, 2020 00:43
@andi4191 andi4191 requested a review from narendasan September 18, 2020 19:10
@narendasan
Copy link
Collaborator

Are the Python API changes for device supposed to be in this PR?

@github-actions
Copy link

This PR has not seen activity for 30 days, Remove stale label or comment or this will be closed in 5 days

@andi4191 andi4191 force-pushed the anuragd/device_info_serialize branch 2 times, most recently from 1ea0b45 to 3df301e Compare November 20, 2020 19:32
@andi4191 andi4191 added this to the v0.2.0 milestone Nov 20, 2020
{
cuda_device = deserialize_device(serialized_device_info);
// Set CUDA device as configured in serialized meta data
set_cuda_device(cuda_device);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we store the deserialized device somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't save the deserialized device information anywhere. Do you want to perform file write in runtime?

@andi4191 andi4191 force-pushed the anuragd/device_info_serialize branch from 755a55e to 5069368 Compare December 9, 2020 23:22
@@ -174,7 +183,7 @@ torch::jit::script::Module CompileGraph(const torch::jit::script::Module& mod, C
}

void set_device(const int gpu_id) {
TRTORCH_ASSERT(cudaSetDevice(gpu_id) == cudaSuccess, "Unable to set CUDA device: " << gpu_id);
TRTORCH_CHECK((cudaSetDevice(gpu_id) == cudaSuccess), "Unable to set CUDA device: " << gpu_id);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we call runtime::set_cuda_device here just to centralize responsibility for device management in the runtime section?

CudaDevice cuda_device;
// Deserialize device meta data if device_info is non-empty
if (!serialized_device_info.empty()) {
cuda_device = deserialize_device(serialized_device_info);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the device should be maintained as a field in TRTEngine instead of querying set device before serialization

@@ -19,14 +40,16 @@ struct TRTEngine : torch::CustomClassHolder {
std::pair<uint64_t, uint64_t> num_io;
EngineID id;
std::string name;
CudaDevice device_info;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think we use this but we should

: logger(
std::string("[") + mod_name + std::string("_engine] - "),
util::logging::get_logger().get_reportable_severity(),
util::logging::get_logger().get_is_colored_output_on()) {
CudaDevice cuda_device;
// Deserialize device meta data if device_info is non-empty
if (!serialized_device_info.empty()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems potentially unsafe. What happens if we provide a DLA engine but no serialized_device_info?

TRTEngine::TRTEngine(
std::string mod_name,
std::string serialized_engine,
std::string serialized_device_info = std::string())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we take a CudaDevice struct as the arg here instead?

[](const c10::intrusive_ptr<TRTEngine>& self) -> std::string {
auto serialized_engine = self->cuda_engine->serialize();
return std::string((const char*)serialized_engine->data(), serialized_engine->size());
[](const c10::intrusive_ptr<TRTEngine>& self) -> std::vector<std::string> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How hard would it be to use a map instead of a vector? I think a map will be a more scalable solution long term for metadata but if its too hard its not worth

util::logging::get_logger().get_reportable_severity(),
util::logging::get_logger().get_is_colored_output_on()) {
std::string _name = "deserialized_trt";
std::string device_info = serialized_info[0];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need size or key checking here.

@narendasan
Copy link
Collaborator

narendasan commented Dec 10, 2020

Do you think we need to add a section to the main execution function that sets device on the fly? or does it need to be done at de-serialization time only? What happens if there is one engine on GPU 1 and one engine on GPU 2?

@narendasan
Copy link
Collaborator

Do we need to do something in the runtime that moves input tensors around to the right GPUs for execution?

@github-actions
Copy link

This PR has not seen activity for 30 days, Remove stale label or comment or this will be closed in 5 days

@narendasan narendasan modified the milestones: v0.2.0, v0.3.0 Jan 11, 2021
@andi4191 andi4191 force-pushed the anuragd/device_info_serialize branch from 5069368 to fa942e7 Compare March 3, 2021 18:06
@andi4191 andi4191 changed the title (//core): Added device meta data serialization/deserialization implic… WIP: (//core): Added device meta data serialization/deserialization implic… Mar 3, 2021
@andi4191 andi4191 force-pushed the anuragd/device_info_serialize branch 2 times, most recently from 1c2cf90 to 00c1dd3 Compare March 5, 2021 22:20
@andi4191 andi4191 force-pushed the anuragd/device_info_serialize branch from 00c1dd3 to 574d77e Compare March 5, 2021 22:23
@andi4191 andi4191 force-pushed the anuragd/device_info_serialize branch from 574d77e to 968cb82 Compare May 20, 2021 21:53
@andi4191
Copy link
Contributor Author

After design discussion #311, the PR is being tracked here: #484

Closing this PR.

@andi4191 andi4191 closed this May 25, 2021
@andi4191 andi4191 deleted the anuragd/device_info_serialize branch June 1, 2021 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: core Issues re: The core compiler component: execution Issues re: Execution of engines component: runtime component: tests Issues re: Tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants