fix: use vm_allocate to allocate CPU backend buffer on macOS #9875


Merged
merged 15 commits into ggml-org:master on Oct 16, 2024

Conversation

giladgd
Contributor

@giladgd giladgd commented Oct 14, 2024

Creating a context in an Electron app using node-llama-cpp on macOS without Metal crashes the process with some models (issue). I investigated what's happening and found that allocating a large memory block with malloc is the culprit.

For some reason this happens only under Electron and not under Node.js, but I couldn't figure out why.

To fix this, I switched to GGML_ALIGNED_MALLOC instead of malloc in ggml_backend_cpu_buffer_type_alloc_buffer, and added support for vm_allocate to GGML_ALIGNED_MALLOC on macOS.
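
For context, the fix boils down to something like the following sketch (aligned_alloc_sketch is an illustrative name; the real change lives in GGML_ALIGNED_MALLOC and ggml_backend_cpu_buffer_type_alloc_buffer, and the hard-coded 32 stands in for ggml's TENSOR_ALIGNMENT):

#include <stdlib.h>
#ifdef __APPLE__
#include <mach/mach.h>
#endif

// sketch: on macOS, get pages straight from the kernel with vm_allocate
// (page-aligned, zero-filled) instead of going through the malloc heap
static void * aligned_alloc_sketch(size_t size) {
#ifdef __APPLE__
    vm_address_t addr = 0;
    kern_return_t status = vm_allocate((vm_map_t) mach_task_self(), &addr, size, VM_FLAGS_ANYWHERE);
    return status == KERN_SUCCESS ? (void *) addr : NULL;
#else
    void * ptr = NULL;
    return posix_memalign(&ptr, 32, size) == 0 ? ptr : NULL;
#endif
}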

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Oct 14, 2024
@@ -770,12 +772,21 @@ static const char * ggml_backend_cpu_buffer_type_get_name(ggml_backend_buffer_ty
}

static ggml_backend_buffer_t ggml_backend_cpu_buffer_type_alloc_buffer(ggml_backend_buffer_type_t buft, size_t size) {
#if defined(GGML_USE_METAL) || defined(TARGET_OS_OSX)
Member

We cannot rely on backend macros like GGML_USE_METAL, and I don't see the point of enabling this specifically with Metal anyway, when the crash is happening in the CPU backend. Instead I would suggest using GGML_ALIGNED_MALLOC everywhere like the TODO in the comment suggests.

Contributor Author

The main focus here is on TARGET_OS_OSX, but I figured that including GGML_USE_METAL covers more cases where posix_memalign is available and can be used instead of malloc.

I'll move GGML_ALIGNED_MALLOC to ggml-backend-impl.h and use it instead then.

Member

It still needs some changes:

  • It should be in ggml-impl.h, not ggml-backend-impl.h
  • The implementation details need to be hidden in the source file; the header should only have the ggml_aligned_malloc and ggml_aligned_free functions, and the macros should be removed

I realize this is a lot more work than just fixing the bug, but we cannot add technical debt every time we fix a bug.
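
For reference, the split being asked for would look roughly like this (a sketch; on macOS the free function plausibly needs the size too, since vm_deallocate takes the region size):

// ggml-impl.h — the header exposes only the two functions:
void * ggml_aligned_malloc(size_t size);
void   ggml_aligned_free  (void * ptr, size_t size);

// ggml.c — all platform branches (_aligned_malloc, posix_memalign,
// vm_allocate, ...) stay inside the function bodies, out of the header.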

@giladgd giladgd changed the title from "fix: use posix_memalign to allocate CPU backend buffer on macOS" to "fix: use vm_allocate to allocate CPU backend buffer on macOS" on Oct 14, 2024
ggml/src/ggml.c Outdated
#if defined(_MSC_VER) || defined(__MINGW32__)
#define GGML_ALIGNED_MALLOC(size) _aligned_malloc(size, GGML_MEM_ALIGN)
#define GGML_ALIGNED_FREE(ptr) _aligned_free(ptr)
return _aligned_malloc(size, GGML_MEM_ALIGN);
Member

GGML_MEM_ALIGN is 16, but the CPU backend expects an alignment of TENSOR_ALIGNMENT = 32. I suspect that's why it is crashing sometimes. It would probably be ok to increase the alignment of this function to the page size in every case.
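
For illustration, page-size alignment on POSIX systems could look like this (page_aligned_alloc is a hypothetical name, not existing code):

#include <stdlib.h>  // posix_memalign
#include <unistd.h>  // sysconf

// align to the OS page size instead of GGML_MEM_ALIGN (16);
// the page size is 4096 on most systems and 16384 on Apple Silicon
static void * page_aligned_alloc(size_t size) {
    void * ptr = NULL;
    size_t page_size = (size_t) sysconf(_SC_PAGESIZE);
    return posix_memalign(&ptr, page_size, size) == 0 ? ptr : NULL;
}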

@max-krasnyansky
Collaborator

Do we really need vm_allocate() here?
posix_memalign() works well.
I've not seen any issues with the CPU backend with models up to 70B (they are dog slow, but they run on devices with enough RAM).
If you suspect a large malloc is the issue (not quite sure why it would be), perhaps we can just add get_max_size and limit the allocs to, say, 1-2GB.

      static struct ggml_backend_buffer_type ggml_backend_cpu_buffer_type = {
        /* .iface   = */ {
            /* .get_name         = */ ggml_backend_cpu_buffer_type_get_name,
            /* .alloc_buffer     = */ ggml_backend_cpu_buffer_type_alloc_buffer,
            /* .get_alignment    = */ ggml_backend_cpu_buffer_type_get_alignment,
            /* .get_max_size     = */ NULL, // defaults to SIZE_MAX  <<<<<<< here
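
A hypothetical version of that cap (the getter and the 1 GiB value are illustrative, not existing code) might look like:

// illustrative: cap individual CPU buffer allocations at 1 GiB;
// note the caveat below — this also caps the size of any single tensor
static size_t ggml_backend_cpu_buffer_type_get_max_size(ggml_backend_buffer_type_t buft) {
    GGML_UNUSED(buft);
    return 1024u * 1024u * 1024u; // 1 GiB
}
// ... and in the iface: /* .get_max_size = */ ggml_backend_cpu_buffer_type_get_max_size,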

@giladgd
Contributor Author

giladgd commented Oct 15, 2024

When loading llama.cpp from inside an Electron app using node-llama-cpp on macOS, using malloc or posix_memalign to allocate large buffers crashes the process.
This happens only in Electron on macOS; it doesn't happen on Node.js or on other platforms.

I used vm_allocate since this is what the Apple documentation recommends, and it works well regardless of whether a GPU is used.

We can limit the allocs to a smaller size, but I think that using vm_allocate to work around this issue is the better solution.

@slaren
Member

slaren commented Oct 15, 2024

The problem with reducing the value of get_max_size is that it limits the maximum size of an individual tensor. The output layer and token embeddings of current models are often several gigabytes in size due to the very large vocabularies. I think that going directly to the OS with vm_allocate/VirtualAlloc/mmap may be the more sensible way to allocate very large buffers. Do you expect any issues with using these instead of malloc?
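
For reference, all three OS-level routes hand out page-aligned, zero-initialized memory directly from the kernel; a minimal sketch (os_alloc is an illustrative name):

#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>
#endif

// sketch: bypass the malloc heap and request pages straight from the OS
static void * os_alloc(size_t size) {
#ifdef _WIN32
    return VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE); // NULL on failure
#else
    void * p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
#endif
}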

@max-krasnyansky
Collaborator

The problem with reducing the value of get_max_size is that it limits the maximum size of an individual tensor. The output layer and token embeddings of current models are often several gigabytes in size due to the very large vocabularies. I think that going directly to the OS with vm_allocate/VirtualAlloc/mmap may be the more sensible way to allocate very large buffers. Do you expect any issues with using these instead of malloc?

Just seems like unnecessary complexity, i.e. if posix_memalign works on most platforms, why introduce yet another allocation wrapper type of thing.
No objections / expected issues otherwise, though it would be good to figure out what exactly fails under Electron with posix_memalign(). I doubt that it's an issue with the underlying allocator itself. The only thing I can think of is the malloc heap getting corrupted (double-free, overruns, etc.) and that causing the crashes. vm_allocate() probably calls mmap() or some other low-level VM API directly and is unaffected by malloc heap corruption.

@giladgd It might be worth trying to run your test with malloc debugging enabled:
https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/MallocDebug.html
I'm not familiar with Electron, so I'm not sure how easy that would be.

@giladgd
Contributor Author

giladgd commented Oct 16, 2024

@max-krasnyansky Thanks for the guidance on malloc debugging.
I tried it but couldn't find any memory corruption.
My guess is that it has something to do with Electron's V8 memory cage, although that shouldn't be an issue since V8 has no direct access to the allocated external memory. And it only happens on macOS, so I'm not sure about that direction.

To investigate further, I ran some tests calling posix_memalign with different values from a native module loaded in Electron. Here are the results:

I ran all the tests on an M1 Max MacBook Pro.

Test 1

#include <stdio.h>   // fprintf, fputs, fflush
#include <stdlib.h>  // posix_memalign
#include <unistd.h>  // sysconf

const long alloc_size = 587218944;
fprintf(stderr, "alloc_size: %ld\n", alloc_size);
fflush(stderr);

void * aligned_memory = NULL;
void * aligned_memory2 = NULL;
int result = posix_memalign(&aligned_memory, sysconf(_SC_PAGESIZE), alloc_size);
int result2 = posix_memalign(&aligned_memory2, sysconf(_SC_PAGESIZE), alloc_size);
if (result != 0 || result2 != 0) {
    fputs("failed to allocate memory\n", stderr);
    fflush(stderr);
} else {
    fputs("done\n", stderr);
    fflush(stderr);
}

Result:

alloc_size: 587218944
done

Test 2

#include <stdio.h>   // fprintf, fputs, fflush
#include <stdlib.h>  // posix_memalign
#include <unistd.h>  // sysconf

const long alloc_size = 1073741824;
fprintf(stderr, "alloc_size: %ld\n", alloc_size);
fflush(stderr);

void * aligned_memory = NULL;
int result = posix_memalign(&aligned_memory, sysconf(_SC_PAGESIZE), alloc_size);
if (result != 0) {
    fputs("failed to allocate memory\n", stderr);
    fflush(stderr);
} else {
    fputs("done\n", stderr);
    fflush(stderr);
}

Result:

alloc_size: 1073741824
/Users/user/Documents/node-llama-cpp/templates/electron-typescript-react/node_modules/electron/dist/Electron.app/Contents/MacOS/Electron exited with signal SIGTRAP

(crashed)

Test 3

#include <stdio.h>      // fprintf, fputs, fflush
#include <mach/mach.h>  // vm_allocate, mach_task_self

const long alloc_size = 1073741824;
fprintf(stderr, "alloc_size: %ld\n", alloc_size);
fflush(stderr);

void * aligned_memory = NULL;
kern_return_t alloc_status = vm_allocate((vm_map_t) mach_task_self(), (vm_address_t *) &aligned_memory, alloc_size, VM_FLAGS_ANYWHERE);
if (alloc_status != KERN_SUCCESS) {
    fputs("failed to allocate memory\n", stderr);
    fflush(stderr);
} else {
    fputs("done\n", stderr);
    fflush(stderr);
}

Result:

alloc_size: 1073741824
done

Notice that Test 1 allocated a buffer of size 587218944 twice and worked fine, while allocating a single 1073741824-byte buffer in Test 2 crashed, even though 587218944 * 2 > 1073741824. So there seems to be an issue with posix_memalign specifically, rather than some sort of memory corruption.

I've shared my findings on the Electron repo in the past for a similar issue, but I'll open another issue for this so they can investigate it further.
Until it's resolved on the Electron side, I think using vm_allocate is a good solution that works in all cases.

Update: here's the issue on the Electron repo: link
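
One caveat worth keeping in mind with this approach: memory obtained from vm_allocate cannot be passed to free(); it must be released with vm_deallocate, which also needs the region's size. A sketch of the matching free path (os_free is an illustrative name):

#include <stddef.h>
#include <mach/mach.h>

// release a region previously obtained with vm_allocate;
// unlike free(), the kernel needs to be told the region's size
static void os_free(void * ptr, size_t size) {
    vm_deallocate((vm_map_t) mach_task_self(), (vm_address_t) ptr, size);
}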

@slaren slaren merged commit 73afe68 into ggml-org:master Oct 16, 2024
50 of 53 checks passed
@max-krasnyansky
Collaborator

@giladgd
Sounds good. Thanks for the additional testing.

drollings pushed a commit to drollings/llama.cpp that referenced this pull request Oct 18, 2024
…org#9875)

* fix: use `vm_allocate` to allocate CPU backend buffer on macOS

* fix: switch to `posix_memalign` to keep existing `free()` usages work

* feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS

* style: formatting

* fix: move const outside of `#ifndef`

* style: formatting

* fix: unused var

* fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h`

* fix: unused var

* fix: page align to `GGUF_DEFAULT_ALIGNMENT`

* fix: page align to `TENSOR_ALIGNMENT`

* fix: convert `TENSOR_ALIGNMENT` to a macro

* fix: increase page size to `32` on iOS

* fix: iOS page size

* fix: `hbw_posix_memalign` alignment
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024