Description
Describe the bug
The AWS Transcribe Streaming SDK C++ implementation consumes excessive CPU when processing audio streams. Each individual stream uses approximately 100% of one CPU core, and usage scales linearly with the number of streams (e.g., 3 streams = 300% CPU usage). This seems inefficient for an operation that should mostly consist of transmitting audio data to the AWS Transcribe service.
I also tested the CRT-HTTP version and got similar results. I can follow up with a CRT-HTTP Docker version if requested.
CPU usage fluctuates slightly but mostly stays around 100%. I tested on a MacBook M1 running Docker and on multiple Linux EC2 instance types, with the same results.
Is this performance intended/expected?
Regression Issue
- Select this option if this issue appears to be a regression.
Expected Behavior
- Minimal CPU usage for streaming audio to AWS Transcribe service
- Efficient handling of multiple concurrent streams without linear CPU scaling
- CPU usage should primarily be focused on audio data transmission rather than processing
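For a rough sense of scale, the audio payload itself is small. The sketch below computes the raw PCM data rate. The 8000 Hz sample rate comes from the ffmpeg command in the reproduction steps; 16-bit mono is an assumption about the converted file, and `pcm_bytes_per_second` is a hypothetical helper name used only for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: raw PCM payload rate in bytes per second.
constexpr std::uint64_t pcm_bytes_per_second(std::uint32_t sample_rate_hz,
                                             std::uint32_t bits_per_sample,
                                             std::uint32_t channels) {
    return static_cast<std::uint64_t>(sample_rate_hz) * (bits_per_sample / 8) * channels;
}

// 8000 Hz is the rate used for the test file; 16-bit mono is an assumption.
static_assert(pcm_bytes_per_second(8000, 16, 1) == 16000,
              "about 16 KB/s per stream under these assumptions");
```

At roughly 16 KB/s per stream under these assumptions, moving the bytes alone is unlikely to keep a core busy.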
Current Behavior
- Each individual stream consumes 100% CPU
- Multiple streams scale linearly (e.g., 3 streams = 300% CPU)
- CPU usage monitored with the top command shows sustained high utilization
- The high CPU usage persists throughout the entire streaming session
- Behavior is consistent across multiple test runs
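To get numbers that are easier to paste into a report than a top screenshot, the per-process CPU time can be sampled from `/proc/<pid>/stat` on Linux. The following is a minimal sketch, not the method used for the measurements above: `cpu_ticks_from_stat` and `cpu_percent` are hypothetical helper names, and the caller is assumed to read the stat line twice, a known interval apart, and to obtain the tick rate from `sysconf(_SC_CLK_TCK)`.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Sum of utime + stime (in clock ticks) parsed from one line of /proc/<pid>/stat.
// Returns -1 if the line cannot be parsed.
long long cpu_ticks_from_stat(const std::string& stat_line) {
    // The command name (field 2) is parenthesised and may contain spaces,
    // so skip past the last ')' before splitting on whitespace.
    const std::size_t close = stat_line.rfind(')');
    if (close == std::string::npos) return -1;
    std::istringstream rest(stat_line.substr(close + 1));
    std::string token;
    long long utime = -1, stime = -1;
    // After the command name, token 0 is field 3 (state); utime and stime
    // are fields 14 and 15, i.e. tokens 11 and 12.
    for (int i = 0; rest >> token; ++i) {
        if (i == 11) utime = std::stoll(token);
        if (i == 12) { stime = std::stoll(token); break; }
    }
    return (utime < 0 || stime < 0) ? -1 : utime + stime;
}

// CPU usage as a percentage of one core between two samples taken
// interval_s seconds apart. ticks_per_s normally comes from sysconf(_SC_CLK_TCK).
double cpu_percent(long long ticks_before, long long ticks_after,
                   long ticks_per_s, double interval_s) {
    return 100.0 * static_cast<double>(ticks_after - ticks_before) /
           (static_cast<double>(ticks_per_s) * interval_s);
}
```

With a tick rate of 100, a process that accumulates 100 ticks over one second is using 100% of one core, which matches how top reports a single saturated stream.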
Reproduction Steps
Here are minimal reproduction steps in a single Dockerfile, using the sample code.
Dockerfile
FROM public.ecr.aws/lts/ubuntu:22.04_stable
RUN apt-get update && \
    apt-get install build-essential cmake git libcurl4-openssl-dev zlib1g-dev libssl-dev curl ffmpeg -y
# Build the SDK from source
RUN git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp && \
    cd aws-sdk-cpp && \
    mkdir build && \
    cd build && \
    cmake .. -G "Unix Makefiles" -DBUILD_ONLY="transcribestreaming;transcribe" && \
    make install
# Build the Transcribe samples
RUN git clone https://github.com/awsdocs/aws-doc-sdk-examples.git && \
    cd aws-doc-sdk-examples/cpp/example_code/transcribe-streaming && \
    mkdir build && \
    cd build && \
    cmake .. -G "Unix Makefiles" && \
    make
# Download and convert the test file
RUN cd /aws-doc-sdk-examples/cpp/example_code/transcribe-streaming/.media && \
    rm -f transcribe-test-file.wav && \
    curl -L "https://ia800202.us.archive.org/26/items/desophisticiselenchis/desophisticiselenchis_01_aristotle_pdf557.wav" -o original.wav && \
    ffmpeg -i original.wav -ar 8000 transcribe-test-file.wav && \
    rm original.wav
Please note:
- Test file: a longer audio file from archive.org, resampled to 8000 Hz to match the original sample file's specs
Steps:
- Build the Docker container using provided Dockerfile:
docker build -t transcribe-cpu-test-example .
- Run the container with AWS credentials:
docker run -d \
-e AWS_ACCESS_KEY_ID=<key> \
-e AWS_SECRET_ACCESS_KEY=<secret> \
-e AWS_SESSION_TOKEN=<token> \
--name transcribe-container \
transcribe-cpu-test-example \
tail -f /dev/null
- In first terminal, run:
docker exec -it transcribe-container bash
top # Keep this running to monitor CPU
- In second terminal, execute:
docker exec -it transcribe-container bash
/aws-doc-sdk-examples/cpp/example_code/transcribe-streaming/build/get_transcript
- Repeat the previous step in additional terminals to observe CPU scaling with multiple streams
Each running get_transcript process shows roughly 100% CPU usage in top.
Possible Solution
Potential memory leaks or inefficient resource handling in the streaming implementation.
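A steady ~100% of one core per stream is more typical of a busy-wait or polling loop than of a memory leak, though that is only a guess and nothing here confirms what the SDK actually does. As a generic illustration of the distinction (`ChunkQueue` is a hypothetical class, not part of the SDK), a consumer that blocks on a condition variable uses essentially no CPU while waiting for the next audio chunk, whereas one that keeps re-checking an empty queue without blocking pins a core:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical audio chunk queue. The consumer sleeps inside wait()
// until a chunk arrives or the queue is closed, instead of spinning.
class ChunkQueue {
public:
    void push(std::vector<unsigned char> chunk) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            chunks_.push(std::move(chunk));
        }
        ready_.notify_one();
    }

    // Signals that no more chunks will be pushed.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Blocks until a chunk is available and moves it into `out`.
    // Returns false once the queue is closed and fully drained.
    bool pop(std::vector<unsigned char>& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !chunks_.empty(); });
        if (chunks_.empty()) return false;
        out = std::move(chunks_.front());
        chunks_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::vector<unsigned char>> chunks_;
    bool closed_ = false;
};
```

If profiling the reproduction (for example with perf) shows most time spent in a loop that never blocks, that would point toward this kind of pattern rather than a leak.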
Additional Information/Context
- This is just one example; I have also seen it in my own implementation with different file types
- Issue affects scalability of applications requiring multiple concurrent streams
AWS CPP SDK version used
Latest
Compiler and Version used
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Operating System and version
Ubuntu 22.04 LTS (running in Docker container)