fix: use vm_allocate to allocate CPU backend buffer on macOS #9875


Merged
merged 15 commits into ggml-org:master on Oct 16, 2024

Conversation

giladgd
Contributor

@giladgd giladgd commented Oct 14, 2024

Creating a context in an Electron app using node-llama-cpp on macOS without Metal crashes the process with some models (issue). I investigated what's happening and found that allocating a large memory block with malloc is the culprit.

For some reason this happens only under Electron and not under Node.js, but I couldn't figure out why.

To fix this, I switched to GGML_ALIGNED_MALLOC instead of malloc in ggml_backend_cpu_buffer_type_alloc_buffer, and added support for vm_allocate to GGML_ALIGNED_MALLOC on macOS.
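
For context, the fix boils down to something like the following sketch (aligned_alloc_sketch is an illustrative name; the real change lives in GGML_ALIGNED_MALLOC and ggml_backend_cpu_buffer_type_alloc_buffer, and the hard-coded 32 stands in for ggml's TENSOR_ALIGNMENT):

#include <stdlib.h>
#ifdef __APPLE__
#include <mach/mach.h>
#endif

// sketch: on macOS, get pages straight from the kernel with vm_allocate
// (page-aligned, zero-filled) instead of going through the malloc heap
static void * aligned_alloc_sketch(size_t size) {
#ifdef __APPLE__
    vm_address_t addr = 0;
    kern_return_t status = vm_allocate((vm_map_t) mach_task_self(), &addr, size, VM_FLAGS_ANYWHERE);
    return status == KERN_SUCCESS ? (void *) addr : NULL;
#else
    void * ptr = NULL;
    return posix_memalign(&ptr, 32, size) == 0 ? ptr : NULL;
#endif
}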

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Oct 14, 2024
@@ -770,12 +772,21 @@ static const char * ggml_backend_cpu_buffer_type_get_name(ggml_backend_buffer_ty
}

static ggml_backend_buffer_t ggml_backend_cpu_buffer_type_alloc_buffer(ggml_backend_buffer_type_t buft, size_t size) {
#if defined(GGML_USE_METAL) || defined(TARGET_OS_OSX)
Member

We cannot rely on backend macros like GGML_USE_METAL, and I don't see the point of enabling this specifically with Metal anyway, when the crash is happening in the CPU backend. Instead I would suggest using GGML_ALIGNED_MALLOC everywhere like the TODO in the comment suggests.

Contributor Author

The main focus here is on TARGET_OS_OSX, but I figured that including GGML_USE_METAL covers more cases where posix_memalign is available and can be used instead of malloc.

I'll move GGML_ALIGNED_MALLOC to ggml-backend-impl.h and use it instead then.

Member

It still needs some changes:

  • It should be in ggml-impl.h, not ggml-backend-impl.h
  • The implementation details need to be hidden in the source file; the header should only have the ggml_aligned_malloc and ggml_aligned_free functions, and the macros should be removed

I realize this is a lot more work than just fixing the bug, but we cannot add technical debt every time we fix a bug.
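
For reference, the split being asked for would look roughly like this (a sketch; on macOS the free function plausibly needs the size too, since vm_deallocate takes the region size):

// ggml-impl.h — the header exposes only the two functions:
void * ggml_aligned_malloc(size_t size);
void   ggml_aligned_free  (void * ptr, size_t size);

// ggml.c — all platform branches (_aligned_malloc, posix_memalign,
// vm_allocate, ...) stay inside the function bodies, out of the header.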

@giladgd giladgd changed the title from "fix: use posix_memalign to allocate CPU backend buffer on macOS" to "fix: use vm_allocate to allocate CPU backend buffer on macOS" on Oct 14, 2024
ggml/src/ggml.c Outdated
#if defined(_MSC_VER) || defined(__MINGW32__)
#define GGML_ALIGNED_MALLOC(size) _aligned_malloc(size, GGML_MEM_ALIGN)
#define GGML_ALIGNED_FREE(ptr) _aligned_free(ptr)
return _aligned_malloc(size, GGML_MEM_ALIGN);
Member

GGML_MEM_ALIGN is 16, but the CPU backend expects an alignment of TENSOR_ALIGNMENT = 32. I suspect that's why it is crashing sometimes. It would probably be ok to increase the alignment of this function to the page size in every case.
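
For illustration, page-size alignment on POSIX systems could look like this (page_aligned_alloc is a hypothetical name, not existing code):

#include <stdlib.h>  // posix_memalign
#include <unistd.h>  // sysconf

// align to the OS page size instead of GGML_MEM_ALIGN (16);
// the page size is 4096 on most systems and 16384 on Apple Silicon
static void * page_aligned_alloc(size_t size) {
    void * ptr = NULL;
    size_t page_size = (size_t) sysconf(_SC_PAGESIZE);
    return posix_memalign(&ptr, page_size, size) == 0 ? ptr : NULL;
}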

@max-krasnyansky
Collaborator

Do we really need vm_allocate() here?
posix_memalign() works well.
I've not seen any issues with the CPU backend with models up to 70B (they are dog slow, but they run on devices with enough RAM).
If you suspect a large malloc is the issue (not quite sure why it would be), perhaps we can just add get_max_size and limit the allocs to, say, 1-2GB.

      static struct ggml_backend_buffer_type ggml_backend_cpu_buffer_type = {
        /* .iface   = */ {
            /* .get_name         = */ ggml_backend_cpu_buffer_type_get_name,
            /* .alloc_buffer     = */ ggml_backend_cpu_buffer_type_alloc_buffer,
            /* .get_alignment    = */ ggml_backend_cpu_buffer_type_get_alignment,
            /* .get_max_size     = */ NULL, // defaults to SIZE_MAX  <<<<<<< here
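
A hypothetical version of that cap (the getter and the 1 GiB value are illustrative, not existing code) might look like:

// illustrative: cap individual CPU buffer allocations at 1 GiB;
// note the caveat below — this also caps the size of any single tensor
static size_t ggml_backend_cpu_buffer_type_get_max_size(ggml_backend_buffer_type_t buft) {
    GGML_UNUSED(buft);
    return 1024u * 1024u * 1024u; // 1 GiB
}
// ... and in the iface: /* .get_max_size = */ ggml_backend_cpu_buffer_type_get_max_size,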

@giladgd
Contributor Author

giladgd commented Oct 15, 2024

When loading llama.cpp from inside an Electron app using node-llama-cpp on macOS, using malloc or posix_memalign to allocate large buffers crashes the process.
This happens only in Electron on macOS; it doesn't happen on Node.js or on other platforms.

I used vm_allocate since this is what the Apple documentation recommends, and it works well regardless of whether a GPU is used.

We can limit the allocs to a smaller size, but I think that using vm_allocate to work around this issue is the better solution.

@slaren
Member

slaren commented Oct 15, 2024

The problem with reducing the value of get_max_size is that it limits the maximum size of an individual tensor. The output layer and token embeddings of current models are often several gigabytes in size due to the very large vocabularies. I think that going directly to the OS with vm_allocate/VirtualAlloc/mmap may be the more sensible way to allocate very large buffers. Do you expect any issues with using these instead of malloc?
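
For reference, all three OS-level routes hand out page-aligned, zero-initialized memory directly from the kernel; a minimal sketch (os_alloc is an illustrative name):

#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>
#endif

// sketch: bypass the malloc heap and request pages straight from the OS
static void * os_alloc(size_t size) {
#ifdef _WIN32
    return VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE); // NULL on failure
#else
    void * p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
#endif
}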

@max-krasnyansky
Collaborator

The problem with reducing the value of get_max_size is that it limits the maximum size of an individual tensor. The output layer and token embeddings of current models are often several gigabytes in size due to the very large vocabularies. I think that going directly to the OS with vm_allocate/VirtualAlloc/mmap may be the more sensible way to allocate very large buffers. Do you expect any issues with using these instead of malloc?

Just seems like unnecessary complexity, i.e. if posix_memalign works on most platforms, why introduce yet another allocation wrapper type of thing.
No objections / expected issues otherwise, though it would be good to figure out what exactly fails under Electron with posix_memalign(). I doubt that it's an issue with the underlying allocator itself. The only thing I can think of is the malloc heap getting corrupted (double-free, overruns, etc.) and that causing the crashes. vm_allocate() probably calls mmap() or some other low-level VM API directly and is unaffected by malloc heap corruption.

@giladgd It might be worth trying to run your test with malloc debugging enabled:
https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/MallocDebug.html
I'm not familiar with Electron, so I'm not sure how easy that would be.

@giladgd
Contributor Author

giladgd commented Oct 16, 2024

@max-krasnyansky Thanks for the guidance on malloc debugging.
I tried it but couldn't find any memory corruption.
My guess is that it has something to do with Electron's V8 memory cage, although that shouldn't be an issue since V8 has no direct access to the allocated external memory. And it only happens on macOS, so I'm not sure about that direction.

To investigate further, I ran some tests calling posix_memalign with different values from a native module loaded in Electron. Here are the results:

I ran all the tests on an M1 Max MacBook Pro.

Test 1

#include <stdio.h>   // fprintf, fputs, fflush
#include <stdlib.h>  // posix_memalign
#include <unistd.h>  // sysconf

const long alloc_size = 587218944;
fprintf(stderr, "alloc_size: %ld\n", alloc_size);
fflush(stderr);

void * aligned_memory = NULL;
void * aligned_memory2 = NULL;
int result = posix_memalign(&aligned_memory, sysconf(_SC_PAGESIZE), alloc_size);
int result2 = posix_memalign(&aligned_memory2, sysconf(_SC_PAGESIZE), alloc_size);
if (result != 0 || result2 != 0) {
    fputs("failed to allocate memory\n", stderr);
    fflush(stderr);
} else {
    fputs("done\n", stderr);
    fflush(stderr);
}

Result:

alloc_size: 587218944
done

Test 2

#include <stdio.h>   // fprintf, fputs, fflush
#include <stdlib.h>  // posix_memalign
#include <unistd.h>  // sysconf

const long alloc_size = 1073741824;
fprintf(stderr, "alloc_size: %ld\n", alloc_size);
fflush(stderr);

void * aligned_memory = NULL;
int result = posix_memalign(&aligned_memory, sysconf(_SC_PAGESIZE), alloc_size);
if (result != 0) {
    fputs("failed to allocate memory\n", stderr);
    fflush(stderr);
} else {
    fputs("done\n", stderr);
    fflush(stderr);
}

Result:

alloc_size: 1073741824
/Users/user/Documents/node-llama-cpp/templates/electron-typescript-react/node_modules/electron/dist/Electron.app/Contents/MacOS/Electron exited with signal SIGTRAP

(crashed)

Test 3

#include <stdio.h>      // fprintf, fputs, fflush
#include <mach/mach.h>  // vm_allocate, mach_task_self

const long alloc_size = 1073741824;
fprintf(stderr, "alloc_size: %ld\n", alloc_size);
fflush(stderr);

void * aligned_memory = NULL;
kern_return_t alloc_status = vm_allocate((vm_map_t) mach_task_self(), (vm_address_t *) &aligned_memory, alloc_size, VM_FLAGS_ANYWHERE);
if (alloc_status != KERN_SUCCESS) {
    fputs("failed to allocate memory\n", stderr);
    fflush(stderr);
} else {
    fputs("done\n", stderr);
    fflush(stderr);
}

Result:

alloc_size: 1073741824
done

Notice that Test 1 allocated a buffer of size 587218944 twice and worked fine, while allocating a single 1073741824-byte buffer in Test 2 crashed, even though 587218944 * 2 > 1073741824. So there seems to be an issue with posix_memalign specifically, rather than some sort of memory corruption.

I've shared my findings on the Electron repo in the past for a similar issue, but I'll open another issue for this so they can investigate it further.
Until it's resolved on the Electron side, I think using vm_allocate is a good solution that works in all cases.

Update: here's the issue on the Electron repo: link
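
One caveat worth keeping in mind with this approach: memory obtained from vm_allocate cannot be passed to free(); it must be released with vm_deallocate, which also needs the region's size. A sketch of the matching free path (os_free is an illustrative name):

#include <stddef.h>
#include <mach/mach.h>

// release a region previously obtained with vm_allocate;
// unlike free(), the kernel needs to be told the region's size
static void os_free(void * ptr, size_t size) {
    vm_deallocate((vm_map_t) mach_task_self(), (vm_address_t) ptr, size);
}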

@slaren slaren merged commit 73afe68 into ggml-org:master Oct 16, 2024
50 of 53 checks passed
@max-krasnyansky
Collaborator

@giladgd
Sounds good. Thanks for the additional testing.

drollings pushed a commit to drollings/llama.cpp that referenced this pull request Oct 18, 2024
…org#9875)

* fix: use `vm_allocate` to allocate CPU backend buffer on macOS

* fix: switch to `posix_memalign` to keep existing `free()` usages work

* feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS

* style: formatting

* fix: move const outside of `#ifndef`

* style: formatting

* fix: unused var

* fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h`

* fix: unused var

* fix: page align to `GGUF_DEFAULT_ALIGNMENT`

* fix: page align to `TENSOR_ALIGNMENT`

* fix: convert `TENSOR_ALIGNMENT` to a macro

* fix: increase page size to `32` on iOS

* fix: iOS page size

* fix: `hbw_posix_memalign` alignment
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024