High CPU Usage (100% per stream) in AWS Transcribe Streaming #3257

Open
@blundercode

Describe the bug

The C++ implementation of the AWS Transcribe Streaming SDK consumes excessive CPU when processing audio streams. Each stream uses approximately 100% of one CPU core, and usage scales linearly with the number of streams (e.g., 3 streams = 300% CPU). This seems inefficient for an operation that should mostly be transmitting audio data to the AWS Transcribe service.

I have also tested the CRT-HTTP version and get similar results. I will follow up with a CRT-HTTP Docker version if requested.

CPU usage fluctuates slightly but mostly stays around 100%. I tested on a MacBook (M1) running Docker and on multiple Linux EC2 instance types, with the same results.

Is this performance intended/expected?

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

  1. Minimal CPU usage for streaming audio to AWS Transcribe service
  2. Efficient handling of multiple concurrent streams without linear CPU scaling
  3. CPU usage should primarily be focused on audio data transmission rather than processing

Current Behavior

  1. Each individual stream consumes 100% CPU
  2. Multiple streams scale linearly (e.g., 3 streams = 300% CPU)
  3. CPU usage monitored through top command shows excessive utilization
  4. The high CPU usage persists throughout the entire streaming session
  5. Behavior is consistent across multiple test runs
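
For anyone who wants a number to paste into a reply instead of watching interactive top, per-process CPU can be sampled directly. This is a minimal sketch, assuming a Linux /proc filesystem (as in the Ubuntu container used here). The helper names cpu_ticks and cpu_pct are made up for this example and are not part of the SDK or the sample code.

```shell
# Total CPU time (user + system) used so far by process $1, in clock ticks.
# These are fields 14 and 15 of /proc/PID/stat. Everything up to the ") " that
# closes the command name is stripped first, so a name containing spaces
# cannot shift the field positions.
cpu_ticks() {
  sed 's/.*) //' "/proc/$1/stat" | awk '{ print $12 + $13 }'
}

# Average CPU usage of process $1 over $2 whole seconds (default 1),
# as a percentage of one core. The counters cover all threads of the process.
cpu_pct() {
  pid=$1
  interval=${2:-1}
  hz=$(getconf CLK_TCK)
  before=$(cpu_ticks "$pid")
  sleep "$interval"
  after=$(cpu_ticks "$pid")
  echo $(( (after - before) * 100 / (hz * interval) ))
}
```

Inside the container, something like `cpu_pct "$(pgrep -n get_transcript)" 5` should print the average over five seconds for the most recently started stream. To see which thread is busy, `ps -L -o tid,pcpu,comm -p <pid>` lists CPU per thread.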

Reproduction Steps

Here are minimal reproduction steps in a single Dockerfile, using the sample code.

Dockerfile

FROM public.ecr.aws/lts/ubuntu:22.04_stable

RUN apt-get update && \
  apt-get install build-essential cmake git libcurl4-openssl-dev zlib1g-dev libssl-dev curl ffmpeg -y

# Build the SDK from source (only the Transcribe components)
RUN git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp && \
    cd aws-sdk-cpp && \
    mkdir build && \
    cd build && \
    cmake .. -G "Unix Makefiles" -DBUILD_ONLY="transcribestreaming;transcribe" && \
    make install

# Build the Transcribe streaming sample
RUN git clone https://github.com/awsdocs/aws-doc-sdk-examples.git && \
    cd aws-doc-sdk-examples/cpp/example_code/transcribe-streaming && \
    mkdir build && \
    cd build && \
    cmake .. -G "Unix Makefiles" && \
    make

# Download and convert the test file
RUN cd /aws-doc-sdk-examples/cpp/example_code/transcribe-streaming/.media && \
    rm -f transcribe-test-file.wav && \
    curl -L "https://ia800202.us.archive.org/26/items/desophisticiselenchis/desophisticiselenchis_01_aristotle_pdf557.wav" -o original.wav && \
    ffmpeg -i original.wav -ar 8000 transcribe-test-file.wav && \
    rm original.wav

Please note:

  • Test file: a longer audio file from archive.org, resampled to 8 kHz with ffmpeg to match the specs of the original sample file

Steps:

  1. Build the Docker image using the provided Dockerfile:
docker build -t transcribe-cpu-test-example .
  2. Run the container with AWS credentials:
docker run -d \
-e AWS_ACCESS_KEY_ID=<key> \
-e AWS_SECRET_ACCESS_KEY=<secret> \
-e AWS_SESSION_TOKEN=<token> \
--name transcribe-container \
transcribe-cpu-test-example \
tail -f /dev/null
  3. In the first terminal, run:
docker exec -it transcribe-container bash
top  # Keep this running to monitor CPU
  4. In a second terminal, execute:
docker exec -it transcribe-container bash
/aws-doc-sdk-examples/cpp/example_code/transcribe-streaming/build/get_transcript

Repeat step 4 in additional terminals to observe CPU scaling with multiple streams.

In top, each running get_transcript process stays at around 100% CPU.
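
Repeating step 4 by hand gets tedious past two or three streams. The sketch below starts several copies of a command at once from one shell. The function name run_streams is hypothetical and only for illustration; it is a plain background-job loop and nothing specific to the SDK.

```shell
# Start $1 copies of the given command in the background,
# then block until all of them have exited.
run_streams() {
  count=$1
  shift
  i=0
  while [ "$i" -lt "$count" ]; do
    "$@" &
    i=$((i + 1))
  done
  wait
}
```

With top still running in the first terminal, `run_streams 3 /aws-doc-sdk-examples/cpp/example_code/transcribe-streaming/build/get_transcript > /dev/null` should start three concurrent streams, which is the case where I see roughly 300% in total.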

Possible Solution

Possibly a memory leak or inefficient resource handling in the streaming implementation. This is a guess; I have not identified the cause.
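
The memory-leak guess can be checked separately from CPU. The sketch below reads the resident set size of a process, assuming a Linux /proc filesystem; rss_kb is a hypothetical helper name. As a rough heuristic only, RSS that keeps growing for the whole stream would support a leak, while flat RSS with pegged CPU would point more toward a busy loop or tight polling. Neither reading is a diagnosis.

```shell
# Resident set size of process $1 in kB, taken from the VmRSS line
# of /proc/PID/status.
rss_kb() {
  awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}
```

For example, `for i in 1 2 3 4 5 6; do rss_kb "$(pgrep -n get_transcript)"; sleep 10; done` prints one reading every ten seconds for a minute while a stream is running.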

Additional Information/Context

  • This is just a single example; I have also seen it in my own implementation with different file types
  • The issue affects the scalability of applications that require multiple concurrent streams

AWS CPP SDK version used

Latest

Compiler and Version used

gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Operating System and version

Ubuntu 22.04 LTS (running in Docker container)

Labels

bug, guidance
