[analyzer][docs] Document how to use perf and uftrace to debug performance issues #126520

File: clang/docs/analyzer/developer-docs/PerformanceInvestigation.rst

=========================
Performance Investigation
=========================
Multiple factors contribute to the time it takes to analyze a file with Clang Static Analyzer.
A translation unit contains multiple entry points, each of which takes multiple steps to analyze.

Performance analysis using ``-ftime-trace``
===========================================

You can add the ``-ftime-trace=file.json`` option to break down the analysis time into individual entry points and steps within each entry point.
You can explore the generated JSON file in a Chromium browser using the ``chrome://tracing`` URL,
or using `speedscope <https://speedscope.app>`_.
Note: Both Chrome-tracing and speedscope tools might struggle with large time traces.
Luckily, in most cases the default max-steps boundary of 225 000 produces traces of a manageable size
for a single entry point.
You can use ``-analyze-function=get_global_options`` together with ``-ftime-trace`` to narrow down analysis to a specific entry point.
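If you prefer the command line over a trace viewer, you can also inspect the generated JSON directly. The snippet below is a sketch that runs on a hand-made toy trace; real ``-ftime-trace`` output follows the same Chrome trace schema (``traceEvents`` entries with ``dur`` in microseconds), but the event names here are only illustrative.

```shell
# Toy stand-in for a real -ftime-trace output file (illustrative names only).
cat > file.json <<'EOF'
{"traceEvents": [
  {"name": "ExecuteCompiler", "dur": 5000},
  {"name": "Frontend", "dur": 4200},
  {"name": "HandleTranslationUnit", "dur": 3900}
]}
EOF

# List the events sorted by duration (microseconds), most expensive first.
python3 - <<'EOF'
import json
events = json.load(open("file.json"))["traceEvents"]
for e in sorted(events, key=lambda e: e["dur"], reverse=True):
    print(e["dur"], e["name"])
EOF
```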


Performance analysis using ``perf``
===================================

`Perf <https://perfwiki.github.io/main/>`_ is a tool for conducting sampling-based profiling of an application.
It's easy to start profiling; there are only two prerequisites:
build with ``-fno-omit-frame-pointer`` and with debug info (``-g``).
You can use release builds, but the easiest is to set ``CMAKE_BUILD_TYPE=RelWithDebInfo``
along with ``CMAKE_CXX_FLAGS="-fno-omit-frame-pointer"`` when configuring ``llvm``.
See the `quick start <https://llvm.org/docs/CMake.html#quick-start>`_ guide if you run into trouble.
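Putting the two prerequisites together, a configuration along these lines should work (a sketch, not a verified recipe; adjust the generator and enabled projects to your setup):

```shell
# Sketch: RelWithDebInfo keeps optimizations plus debug info (-g), and
# -fno-omit-frame-pointer preserves frame pointers for perf's call graphs.
cmake -S llvm -B build -G Ninja \
  -DLLVM_ENABLE_PROJECTS="clang" \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer"
```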

.. code-block:: bash
   :caption: Running the Clang Static Analyzer through ``perf`` to gather samples of the execution.

   # -F: Sampling frequency, use `-F max` for maximal frequency
   # -g: Enable call-graph recording for both kernel and user space
   perf record -F 99 -g -- clang -cc1 -nostdsysteminc -analyze -analyzer-constraints=range \
     -setup-static-analyzer -analyzer-checker=core,unix,alpha.unix.cstring,debug.ExprInspection \
     -verify ./clang/test/Analysis/string.c

Once you have the profile data, you can use it to produce a Flame graph.
A Flame graph is a visual representation of the stack frames of the samples.
Common stack frame prefixes are squashed together, making up a wider bar.
The wider the bar, the more time was spent under that particular stack frame,
giving a sense of how the overall execution time was spent.
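To make the format concrete, here is a hand-written example of the intermediate folded-stack file that ``stackcollapse-perf.pl`` produces (the function names below are made up): each line is a semicolon-separated stack followed by a sample count, and ``flamegraph.pl`` turns those counts into bar widths.

```shell
# A tiny, hand-made folded-stack file; the frame names are hypothetical.
cat > perf.folded <<'EOF'
main;HandleTranslationUnit;runCheckers 70
main;HandleTranslationUnit;evalCall 25
main;parseAST 5
EOF

# Sum the samples per second-level frame: HandleTranslationUnit would be
# the widest bar, with 95 of the 100 samples.
awk '{n = $NF; sub(/ [0-9]+$/, ""); split($0, f, ";"); sum[f[2]] += n}
     END {for (k in sum) print k, sum[k]}' perf.folded | sort
```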

Clone the `FlameGraph <https://github.com/brendangregg/FlameGraph>`_ git repository,
as we will use some of its scripts to convert the ``perf`` samples into a Flame graph.
It's also worth checking out the `homepage <https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html>`_
of Brendan Gregg, the author of FlameGraph.


.. code-block:: bash
   :caption: Converting the ``perf`` profile into a Flamegraph, then opening it in Firefox.

   perf script | /path/to/FlameGraph/stackcollapse-perf.pl > perf.folded
   /path/to/FlameGraph/flamegraph.pl perf.folded > perf.svg
   firefox perf.svg

.. image:: ../images/flamegraph.svg


Performance analysis using ``uftrace``
======================================

`uftrace <https://github.com/namhyung/uftrace/wiki/Tutorial#getting-started>`_ is a tool for generating rich profile data
that you can use to focus and drill down into the timeline of your application.
We will use it to generate a Chromium trace JSON file.
In contrast to ``perf``, this approach statically instruments every function, so it should be more precise and thorough than sampling-based approaches like ``perf``.
In contrast to using ``-ftime-trace``, functions don't need to opt in to being profiled via ``llvm::TimeTraceScope``.
All functions are profiled due to static instrumentation.

There is only one prerequisite for using this tool:
you need to build the binary you are about to instrument with ``-pg`` or ``-finstrument-functions``.
This will make it run substantially slower, but allows rich instrumentation.
Keep in mind that the recordings can also take up a substantial amount of disk space.
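For example, an instrumented build could be configured along these lines (a sketch under the same assumptions as the ``perf`` section; adjust to your setup):

```shell
# Sketch: -pg adds instrumentation hooks to every function so that
# uftrace can record them. Expect the resulting binary to run much slower.
cmake -S llvm -B build-uftrace -G Ninja \
  -DLLVM_ENABLE_PROJECTS="clang" \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DCMAKE_CXX_FLAGS="-pg"
```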
.. code-block:: bash
   :caption: Recording with ``uftrace``, then dumping the result as a Chrome trace JSON.

   uftrace record clang -cc1 -nostdsysteminc -analyze -analyzer-constraints=range \
     -setup-static-analyzer -analyzer-checker=core,unix,alpha.unix.cstring,debug.ExprInspection \
     -verify ./clang/test/Analysis/string.c
   uftrace dump --filter=".*::AnalysisConsumer::HandleTranslationUnit" --time-filter=300 --chrome > trace.json

.. image:: ../images/uftrace_detailed.png

In this picture, you can see the functions below the Static Analyzer's entry point that took at least 300 nanoseconds to run, visualized by Chrome's ``about:tracing`` page.
You can also see how deep the function calls can get due to AST visitors.

Using different filters can reduce the number of functions to record.
For the common options, refer to the ``uftrace`` `documentation <https://github.com/namhyung/uftrace/blob/master/doc/uftrace-record.md#common-options>`_.

Similar filters can be applied for dumping too. That way you can reuse the same (detailed)
recording to selectively focus on some special part using a refinement of the filter flags.
Remember, the trace JSON needs to fit into Chrome's ``about:tracing`` or `speedscope <https://speedscope.app>`_,
thus it needs to be of a limited size.
If you do not apply filters on recording, you will collect a large trace, and every dump
operation will need to sieve through the whole recording, which may be annoying if done repeatedly.

If the trace JSON is still too large to load, have a look at the dump as plain text and look for frequent entries that refer to non-interesting parts.
Once you have some of those, add them as ``--hide`` flags to the ``uftrace dump`` call.
To see what functions appear frequently in the trace, use this command:

.. code-block:: bash

   grep -Po '"name":"(.+)"' trace.json | sort | uniq -c | sort -nr | head -n 50
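As a sanity check, this is how the pipeline behaves on a toy dump (the event names are hypothetical): the most frequent name floats to the top, making it a candidate for ``--hide``.

```shell
# A toy stand-in for an uftrace Chrome trace; event names are hypothetical.
cat > trace.json <<'EOF'
{"name":"clang::Stmt::children"}
{"name":"clang::Stmt::children"}
{"name":"ento::ExprEngine::Visit"}
EOF

# Count occurrences of each event name, most frequent first.
grep -Po '"name":"(.+)"' trace.json | sort | uniq -c | sort -nr | head -n 50
```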

``uftrace`` can also dump the report as a Flame graph using ``uftrace dump --flame-graph``.