Replies: 6 comments
-
Thank you for using Amazon SageMaker. Would you be able to provide the following information to help us troubleshoot your issue?
Looking forward to hearing back from you.
-
Thank you. Here are answers to your questions:
SageMaker Python SDK version: 2.13.0
This appears to be a problem with the TensorFlow Estimator: GPU utilization goes to zero. I am supplying a notebook that demonstrates this by training my model with the estimator and without it. The link below is my codebase that reproduces the issue. Please open the notebook DeepTradingAI.ipynb on an AWS ml.p3.2xlarge instance. The notebook has two parts:
Note: You do not need to worry about the data, as it is fetched from a database connection. Here is the download link:
Thanks,
Nektarios
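For reference, launching this kind of job with SageMaker Python SDK v2 (such as 2.13.0) typically looks like the minimal sketch below. The entry point, framework version, and S3 path are placeholders, not the actual values from DeepTradingAI.ipynb:

import sagemaker
from sagemaker.tensorflow import TensorFlow

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Hypothetical entry point, framework version, and data location --
# substitute the values actually used by the notebook.
estimator = TensorFlow(
    entry_point="train.py",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # single V100 GPU
    framework_version="2.3.0",
    py_version="py37",
    sagemaker_session=session,
)

estimator.fit({"training": "s3://my-bucket/training-data"})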
-
Even if I use my own custom Docker container, I get:
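For context, a bring-your-own-container job in SDK v2 is wired up roughly as in this sketch; the ECR image URI and S3 path are placeholders for the custom container described above:

import sagemaker
from sagemaker.estimator import Estimator

role = sagemaker.get_execution_role()

# Placeholder ECR image URI for the custom GPU container.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-tf-gpu:latest",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
)

estimator.fit("s3://my-bucket/training-data")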
-
I managed to get the above loaded, but I still get:

2020-10-07 16:17:42.690258: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.

This is my Dockerfile:
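A quick way to confirm whether that warning actually prevents GPU use is to check device visibility from inside the container or training script, for example with this small diagnostic snippet (not part of the original scripts):

# If CUDA and cuDNN are installed correctly in the image, TensorFlow
# should report at least one visible GPU on a p3.2xlarge instance.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

If the last line prints an empty list, training silently falls back to the CPU, which would explain the slow epochs.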
-
Do the instances have CUDA installed properly? This all used to work; all of a sudden things broke.
-
Did you solve this issue? I am facing similar problems. Running the same script locally takes about 10 s per epoch, while on a SageMaker instance it takes about 2 min per epoch. I don't understand what the problem is.
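One way to check whether the SageMaker run is silently training on the CPU is to enable device-placement logging in the training script. This is a small diagnostic sketch, not part of the original code:

import tensorflow as tf

# Log the device each op is placed on. If the training ops all land on
# /device:CPU:0 instead of /device:GPU:0, the GPU is not being used and
# a ~10x per-epoch slowdown versus a local GPU run would be expected.
tf.debugging.set_log_device_placement(True)

# ...build and fit the model as usual; placement is printed to the job log.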
-
When I trained my deep learning model on a p3.2xlarge instance, it used to take 2 seconds per epoch. Now, all of a sudden, it takes around 39 seconds per epoch. Total training time used to be 15 minutes and can now reach 2 hours!
Please advise why this is happening.
See image attachment.
Thank you!
Nektarios
Describe the bug
Training is very slow on p3.2xlarge; the GPU may not be being used.
To reproduce
A clear, step-by-step set of instructions to reproduce the bug.
Expected behavior
A clear and concise description of what you expected to happen.
Screenshots or logs

If applicable, add screenshots or logs to help explain your problem.
System information
A description of your system. Please provide:
Additional context
Add any other context about the problem here.