
[Support] Vendor rpmalloc in-tree and use it for the Windows 64-bit release #91862

Merged 13 commits on Jun 20, 2024
30 changes: 29 additions & 1 deletion llvm/CMakeLists.txt
@@ -733,7 +733,35 @@ if( WIN32 AND NOT CYGWIN )
endif()
set(LLVM_NATIVE_TOOL_DIR "" CACHE PATH "Path to a directory containing prebuilt matching native tools (such as llvm-tblgen)")

set(LLVM_INTEGRATED_CRT_ALLOC "" CACHE PATH "Replace the Windows CRT allocator with any of {rpmalloc|mimalloc|snmalloc}. Only works with CMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded.")
set(LLVM_ENABLE_RPMALLOC "" CACHE BOOL "Replace the CRT allocator with rpmalloc.")
if(LLVM_ENABLE_RPMALLOC)
if(NOT (CMAKE_SYSTEM_NAME MATCHES "Windows|Linux"))
message(FATAL_ERROR "LLVM_ENABLE_RPMALLOC is only supported on Windows and Linux.")
Review comment (Contributor): Only supported on Windows, surely? Since it relies on LLVM_INTEGRATED_CRT_ALLOC, which is only supported on Windows (see line 767 below).

endif()
if(LLVM_USE_SANITIZER)
message(FATAL_ERROR "LLVM_ENABLE_RPMALLOC cannot be used along with LLVM_USE_SANITIZER!")
endif()
if(WIN32)
if(CMAKE_CONFIGURATION_TYPES)
foreach(BUILD_MODE ${CMAKE_CONFIGURATION_TYPES})
string(TOUPPER "${BUILD_MODE}" uppercase_BUILD_MODE)
if(uppercase_BUILD_MODE STREQUAL "DEBUG")
message(WARNING "The Debug target isn't supported along with LLVM_ENABLE_RPMALLOC!")
endif()
endforeach()
else()
if(CMAKE_BUILD_TYPE AND uppercase_CMAKE_BUILD_TYPE STREQUAL "DEBUG")
message(FATAL_ERROR "The Debug target isn't supported along with LLVM_ENABLE_RPMALLOC!")
endif()
endif()
endif()

# Override the C runtime allocator with the in-tree rpmalloc
set(LLVM_INTEGRATED_CRT_ALLOC "${CMAKE_CURRENT_SOURCE_DIR}/lib/Support")
set(CMAKE_MSVC_RUNTIME_LIBRARY "MultiThreaded")
endif()

set(LLVM_INTEGRATED_CRT_ALLOC "${LLVM_INTEGRATED_CRT_ALLOC}" CACHE PATH "Replace the Windows CRT allocator with any of {rpmalloc|mimalloc|snmalloc}. Only works with CMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded.")
if(LLVM_INTEGRATED_CRT_ALLOC)
if(NOT WIN32)
message(FATAL_ERROR "LLVM_INTEGRATED_CRT_ALLOC is only supported on Windows.")
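As a reading aid for the hunk above and the review comment on it, here is a minimal Python model of the configure-time checks as the diff reads at this revision. The function name `check_rpmalloc_options` and its parameters are hypothetical. `WIN32` is approximated as `CMAKE_SYSTEM_NAME` being exactly `Windows`, which is an assumption. The sketch collects every message, whereas CMake stops at the first `FATAL_ERROR`.

```python
import re


def check_rpmalloc_options(system_name, use_sanitizer="", build_type="",
                           config_types=()):
    """Model the messages emitted when LLVM_ENABLE_RPMALLOC is ON.

    Returns (errors, warnings). Unlike CMake, which stops at the first
    FATAL_ERROR, this collects every failing check.
    """
    errors, warnings = [], []
    # Assumption: WIN32 is approximated as CMAKE_SYSTEM_NAME == "Windows".
    is_win32 = system_name == "Windows"

    # if(NOT (CMAKE_SYSTEM_NAME MATCHES "Windows|Linux"))
    if not re.search(r"Windows|Linux", system_name):
        errors.append(
            "LLVM_ENABLE_RPMALLOC is only supported on Windows and Linux.")

    # if(LLVM_USE_SANITIZER)
    if use_sanitizer:
        errors.append(
            "LLVM_ENABLE_RPMALLOC cannot be used along with LLVM_USE_SANITIZER!")

    debug_msg = ("The Debug target isn't supported along with "
                 "LLVM_ENABLE_RPMALLOC!")
    if is_win32:
        if config_types:
            # Multi-config generators only get a warning per Debug config.
            for mode in config_types:
                if mode.upper() == "DEBUG":
                    warnings.append(debug_msg)
        elif build_type and build_type.upper() == "DEBUG":
            # Single-config generators fail outright.
            errors.append(debug_msg)

    # The hunk then points LLVM_INTEGRATED_CRT_ALLOC at lib/Support, so the
    # existing "if(NOT WIN32)" check on that variable also applies.
    if not is_win32:
        errors.append("LLVM_INTEGRATED_CRT_ALLOC is only supported on Windows.")

    return errors, warnings
```

Under these assumptions, a Linux configuration passes the first platform check but still fails on the `LLVM_INTEGRATED_CRT_ALLOC` check, which is the point raised in the review comment.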
9 changes: 8 additions & 1 deletion llvm/docs/CMake.rst
@@ -710,8 +710,15 @@ enabled sub-projects. Nearly all of these variable names begin with
$ D:\git> git clone https://github.com/mjansson/rpmalloc
$ D:\llvm-project> cmake ... -DLLVM_INTEGRATED_CRT_ALLOC=D:\git\rpmalloc
This flag needs to be used along with the static CRT, ie. if building the
This option needs to be used along with the static CRT, ie. if building the
Release target, add -DCMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded.
Note that rpmalloc is also supported natively in-tree, see option below.

**LLVM_ENABLE_RPMALLOC**:BOOL
Similar to LLVM_INTEGRATED_CRT_ALLOC, embeds the in-tree rpmalloc into the
host toolchain as a C runtime allocator. The version currently used is
rpmalloc 1.4.5. This option also implies linking with the static CRT, there's
no need to provide CMAKE_MSVC_RUNTIME_LIBRARY.

**LLVM_LINK_LLVM_DYLIB**:BOOL
If enabled, tools will be linked with the libLLVM shared library. Defaults
11 changes: 10 additions & 1 deletion llvm/docs/ReleaseNotes.rst
@@ -86,7 +86,16 @@ Changes to LLVM infrastructure
Changes to building LLVM
------------------------

- The ``LLVM_ENABLE_TERMINFO`` flag has been removed. LLVM no longer depends on
* LLVM now has rpmalloc version 1.4.5 in-tree, as a replacement C allocator for
hosted toolchains. This supports several host platforms such as Mac or Unix,
however currently only the Windows 64-bit LLVM release uses it.
This greatly reduces build times on Windows when linking with ThinLTO,
especially on machines with many cores, where the improvement can reach an
order of magnitude or more. Clang compilation is also improved. Please see some build timings in
(`#91862 <https://github.com/llvm/llvm-project/pull/91862#issue-2291033962>`_).
For more information, refer to the **LLVM_ENABLE_RPMALLOC** option in `CMake variables <https://llvm.org/docs/CMake.html#llvm-related-variables>`_.

* The ``LLVM_ENABLE_TERMINFO`` flag has been removed. LLVM no longer depends on
terminfo and now always uses the ``TERM`` environment variable for color
support autodetection.

3 changes: 2 additions & 1 deletion llvm/lib/Support/CMakeLists.txt
@@ -101,9 +101,10 @@ if(LLVM_INTEGRATED_CRT_ALLOC)
message(FATAL_ERROR "Cannot find the path to `git clone` for the CRT allocator! (${LLVM_INTEGRATED_CRT_ALLOC}). Currently, rpmalloc, snmalloc and mimalloc are supported.")
endif()

if(LLVM_INTEGRATED_CRT_ALLOC MATCHES "rpmalloc$")
if((LLVM_INTEGRATED_CRT_ALLOC MATCHES "rpmalloc$") OR LLVM_ENABLE_RPMALLOC)
add_compile_definitions(ENABLE_OVERRIDE ENABLE_PRELOAD)
set(ALLOCATOR_FILES "${LLVM_INTEGRATED_CRT_ALLOC}/rpmalloc/rpmalloc.c")
set(delayload_flags "${delayload_flags} -INCLUDE:malloc")
elseif(LLVM_INTEGRATED_CRT_ALLOC MATCHES "snmalloc$")
set(ALLOCATOR_FILES "${LLVM_INTEGRATED_CRT_ALLOC}/src/snmalloc/override/new.cc")
set(system_libs ${system_libs} "mincore.lib" "-INCLUDE:malloc")
19 changes: 19 additions & 0 deletions llvm/lib/Support/rpmalloc/CACHE.md
@@ -0,0 +1,19 @@
# Thread caches
rpmalloc has a thread cache of free memory blocks which can be used in allocations without interfering with other threads or going to the system to map more memory, as well as a global cache shared by all threads to let spans of memory pages flow between threads. Configuring the size of these caches can be crucial to obtaining good performance while limiting memory overhead. Below is a simple case study using the benchmark tool to compare different thread cache configurations for rpmalloc.

The rpmalloc thread cache is configured in one of four ways: unlimited; performance oriented, meaning the default values; size oriented, where both the thread cache and the global cache are reduced significantly; or disabled, where both thread and global caches are disabled and completely free pages are directly unmapped.

The benchmark is configured to run threads allocating 150000 blocks distributed in the `[16, 16000]` bytes range with a linear falloff probability. It runs 1000 loops, and in every iteration 75000 blocks (50%) are freed and allocated in a scattered pattern. There are no cross-thread allocations/deallocations. Parameters: `benchmark n 0 0 0 1000 150000 75000 16 16000`. The benchmarks are run on an Ubuntu 16.10 machine with 8 cores (4 physical, HT) and 12GiB RAM.
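The workload described above can be approximated with a short Python sketch. It is not the benchmark tool itself: `sample_size` and `run_workload` are hypothetical names, `bytearray` objects stand in for allocated blocks, and "linear falloff" is assumed to mean a probability density that decreases linearly from the smallest size to zero at the largest size. The default counts are scaled down from 150000 blocks, 75000 churned blocks and 1000 loops.

```python
import math
import random


def sample_size(rng, min_size=16, max_size=16000):
    """Draw a block size whose probability falls off linearly with size.

    Assumption: density p(x) proportional to (max_size - x) on
    [min_size, max_size], sampled through its inverse CDF.
    """
    u = rng.random()
    size = max_size - (max_size - min_size) * math.sqrt(1.0 - u)
    return int(size)


def run_workload(blocks=1500, churn=750, loops=10, seed=0):
    """Allocate `blocks` buffers, then free and reallocate `churn` per loop."""
    rng = random.Random(seed)
    live = [bytearray(sample_size(rng)) for _ in range(blocks)]
    for _ in range(loops):
        # Scattered pattern: replace blocks at randomly chosen slots.
        for slot in rng.sample(range(blocks), churn):
            live[slot] = bytearray(sample_size(rng))
    return live
```

Calling `run_workload(blocks=150000, churn=75000, loops=1000)` would mirror the counts in the parameter line for a single thread, although Python's own allocator behavior makes the timings unrelated to those in the charts.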

The benchmark also includes results for the standard library malloc implementation as a reference for comparison with the nocache setting.

![Ubuntu 16.10 random [16, 16000] bytes, 8 cores](https://docs.google.com/spreadsheets/d/1NWNuar1z0uPCB5iVS_Cs6hSo2xPkTmZf0KsgWS_Fb_4/pubchart?oid=387883204&format=image)
![Ubuntu 16.10 random [16, 16000] bytes, 8 cores](https://docs.google.com/spreadsheets/d/1NWNuar1z0uPCB5iVS_Cs6hSo2xPkTmZf0KsgWS_Fb_4/pubchart?oid=1644710241&format=image)

For the single-threaded case, the unlimited cache and performance oriented cache settings have identical performance and memory overhead, indicating that the memory pages fit in the combined thread and global cache. As the number of threads increases to 2-4, the performance setting has slightly higher performance. This can seem odd at first, but it can be explained by low contention on the global cache, where some memory pages can flow between threads without stalling, reducing the overall number of calls to map new memory pages (also indicated by the slightly lower memory overhead).

As the thread count increases further, to 5-10 threads, the increased contention and the eventual limit of the global cache give the unlimited setting a slight advantage in performance. As expected, the memory overhead remains constant for unlimited caches, while it goes down for the performance setting as the number of threads increases.

The size oriented setting maintains good performance compared to the standard library while reducing the memory overhead by a decent amount compared to the performance setting.

The nocache setting still outperforms the reference standard library allocator for workloads up to 6 threads while maintaining a near zero memory overhead, which is even slightly lower than that of the standard library. For use cases where the number of allocations in each size class is lower, the overhead in rpmalloc from the 64KiB span size will of course increase.
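The closing remark about span overhead can be illustrated with simple arithmetic. The sketch below assumes that each size class is served from its own 64KiB spans and ignores span headers and size-class rounding, so the numbers are an upper-level approximation rather than rpmalloc's exact accounting. The function name `span_overhead` is hypothetical.

```python
SPAN_SIZE = 64 * 1024  # 64KiB span size, as stated in the text


def span_overhead(block_size, live_blocks, span_size=SPAN_SIZE):
    """Return (unused_bytes, reserved_bytes) for one size class.

    Assumption: spans hold blocks of a single size class, and span
    headers and size-class rounding are ignored.
    """
    if not 0 < block_size <= span_size:
        raise ValueError("block_size must be in (0, span_size]")
    blocks_per_span = span_size // block_size
    spans = -(-live_blocks // blocks_per_span)  # ceiling division
    reserved = spans * span_size
    return reserved - live_blocks * block_size, reserved
```

Under these assumptions, a single live 1024-byte block leaves 64512 of the 65536 reserved bytes unused, while 64 such blocks fill the span exactly.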