
LTO build profile extension documentation #1219


Merged: 4 commits merged on Jul 8, 2020
65 changes: 65 additions & 0 deletions docs/api/memory/link_time_optimization.md
@@ -0,0 +1,65 @@
# Link Time Optimization

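Link Time Optimization (LTO) is a mechanism that the compiler uses at link time to reduce a program's memory usage. A translation unit is usually a single C/C++ source file (`.c`/`.cpp`). At compile time, the compiler normally turns each translation unit into an object file. At link time, the linker combines those object files into a single executable. With LTO, the compiler instead emits a special intermediate representation for every translation unit and then optimizes all of them together as a single unit at link time. The resulting program typically uses less memory than a non-LTO build.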
Contributor

What is link time? What are translation units?

Contributor Author

A translation unit is usually a single C/C++ source file (.c/.cpp), which is compiled (compile time) into a single object file (.o/.obj). The linker then links these object files (link time) into a single executable file.


## Using LTO in Mbed OS

The Mbed OS build system implements LTO as an optional profile extension in `tools/profiles/extensions/lto.json`. Enabling LTO adds LTO flags to the selected build profile.

<span class="notes">**Note:** LTO performs heavy memory optimizations that break debugging, so we recommend using it only with the `release` profile, not with the `debug` and `develop` profiles.</span>

To enable LTO, add a `--profile` option with the LTO file path `tools/profiles/extensions/lto.json` to the build command.

<span class="notes">**Note**: For profile extensions, you must specify the full path relative to the project's root folder.</span>

To enable LTO with the `release` profile:

```
mbed compile -t TOOLCHAIN -m TARGET --profile release --profile mbed-os/tools/profiles/extensions/lto.json
```

Example LTO profile memory savings for [mbed-os-example-blinky](https://github.com/ARMmbed/mbed-os-example-blinky):

|Build type|Total static RAM memory (data + BSS)|Total flash memory (text + data)|
| --- | --- | --- |
| GCC_ARM - release - no LTO | 12,096B | 44,628B |
| GCC_ARM - release - LTO | 11,800B | 41,088B |
|***saved memory***|296B|3,540B|
| ARM - release - no LTO | 10,365B | 35,496B |
| ARM - release - LTO | 10,153B | 31,514B |
|***saved memory***|212B|3,982B|

Contributor

What's BSS?

Contributor Author

Data, BSS, and text are memory segments that store different types of variables and data.

Contributor

Do we expect these numbers to change? If so, we should remove the table, so we don't have to keep updating it.

Contributor Author

Frankly, only the amount of saved memory is important. But it's true that the numbers will change slightly depending on the Mbed OS version used.

Maybe it would be better to round the numbers to kB, as in the example below?
What do you think, @jamesbeyond?

| Build type | Total static RAM memory (data + BSS) | Total flash memory (text + data) |
| --- | --- | --- |
| GCC_ARM - release - no LTO | 12.1 kB | 44.6 kB |
| GCC_ARM - release - LTO | 11.8 kB | 41.1 kB |
| saved memory | 0.3 kB | 3.5 kB |
| ARM - release - no LTO | 10.36 kB | 35.5 kB |
| ARM - release - LTO | 10.15 kB | 31.5 kB |
| saved memory | 0.21 kB | 4 kB |

LTO profile build results for [mbed-cloud-client-example](https://github.com/ARMmbed/mbed-cloud-client-example):

|Build type|Total static RAM memory (data + BSS)|Total flash memory (text + data)|
| --- | --- | --- |
| GCC_ARM - release - no LTO | 59,760B | 389,637B |
| GCC_ARM - release - LTO | 59,432B | 354,167B |
|***saved memory***| 328B | 35,470B|
| ARM - release - no LTO | 58,099B | 353,849B |
| ARM - release - LTO | 57,150B | 322,500B |
|***saved memory***| 949B | 31,349B|

Contributor

Do we expect these numbers to change? If so, we should remove the table, so we don't have to keep updating it.

<span class="notes">**Note**: In LTO builds, the compiler produces bytecode or bitcode instead of regular object code. This output is hard to analyze with object code analysis tools.</span><!-- Can we remove this note as it's repeated in the "Limitations" section below? -->

Contributor

Can we say something about the benefits of LTO as well? For example, how much ROM/RAM can be saved in the blinky example and the PDMC example? It doesn't need many details. A table showing roughly how many bytes can be saved would be good enough.

Contributor Author

Updated.

## Limitations

Contributor

Also, I suggest mentioning two more effects in the Limitations section: the impact on debuggability, and the impact on symbol size, which disrupts memory map generation. We mentioned both briefly above. Here is an article that could be useful: https://interrupt.memfault.com/blog/best-and-worst-gcc-clang-compiler-flags#-flto

Contributor Author

Updated.

Contributor

Could we move most or all of the Limitations section to our release notes? That may be a better place for this content.

- LTO slows down the build process.
- It's very hard to control memory placement when using LTO.
- LTO performs heavy memory optimizations that break debugging.
- In LTO builds, the compiler produces bytecode or bitcode instead of regular object code. This output is hard to analyze with object code analysis tools.
- LTO can increase the required stack space because of cross-object inlining.

Contributor

We've already said this in the main text (regarding LTO breaking debugging).

### Arm Compiler 6

- No bitcode libraries: `armlink` only supports bitcode objects on the command line. It does not support bitcode objects coming from libraries. `armlink` gives an error message if it encounters a file containing bitcode while loading from a library.
- Partial linking is not supported with LTO because it only works with ELF objects, not bitcode files.
- Arm recommends that link time optimization is only performed on code and data that does not require precise placement in the scatter file, with general input section selectors such as `*(+RO)` and `.ANY(+RO)` used to select sections generated by link time optimization. It is not possible to match bitcode in `.llvmbc` sections by name in a scatter file.
- Bitcode objects are not guaranteed to be compatible across compiler versions. This means that you should ensure all your bitcode files are built using the same version of the compiler when linking with LTO.


### GCC_ARM

Contributor

Do we have any known limitations to mention for ARMC6 LTO? I can see a lot of places where we need to add the MBED_USED macro.


- The minimum required version of `GCC_ARM` is the GNU Arm Embedded Toolchain Version 9-2019-q4-major. Earlier `GCC_ARM` versions can cause various issues when the `-flto` flag is used. For example, GCC8 produces a platform-specific error during the final link stage on Windows hosts.
- You must use the `noinline` attribute for every function that must be placed into a specific section (specified with a `section(".section_name")` attribute). In general, when a function is considered for inlining, the `section` attribute is always ignored. However, with the link time optimizer enabled, the chances for inlining are much higher because the inliner works across multiple translation units. As a result, the output sections' sizes change compared to a non-LTO build. This may lead to `section ".section_name" will not fit in region "region_name"` type errors.
- In all GCC versions, LTO removes C functions declared as weak in assembler (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83967 and https://bugs.launchpad.net/gcc-arm-embedded/+bug/1747966). For Mbed OS, the problem emerges when exporting Mbed OS projects to other build systems. You can fix this by changing the order of object files in the linker command: objects that provide the weak symbols and are compiled from assembly must be listed before the objects that provide the strong symbols.