enable CPU HBM #2603


Merged
merged 7 commits into from
Sep 8, 2023


jikunshang
Contributor

This PR enables CPU HBM support by using the memkind library to allocate HBM memory.
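As a rough illustration of the idea (not the PR's actual code), an aligned allocation can be routed through memkind's `hbwmalloc` interface when a build flag is set, and through plain `posix_memalign` otherwise. The flag `SKETCH_USE_HBM` and the helper names `sketch_aligned_alloc` and `sketch_aligned_free` are hypothetical and chosen only for this sketch; `hbw_posix_memalign` and `hbw_free` are the memkind calls it assumes are available when the flag is defined.

```c
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#ifdef SKETCH_USE_HBM             /* hypothetical build flag, not taken from the PR */
#include <hbwmalloc.h>            /* memkind's high-bandwidth memory interface */
#endif

/* Allocate `size` bytes aligned to `alignment` (a power of two that is a
 * multiple of sizeof(void *)). Returns NULL on failure. */
static void *sketch_aligned_alloc(size_t size, size_t alignment) {
    void *ptr = NULL;
    int rc;

    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

#ifdef SKETCH_USE_HBM
    rc = hbw_posix_memalign(&ptr, alignment, size);  /* served from HBM */
#else
    rc = posix_memalign(&ptr, alignment, size);      /* served from regular memory */
#endif
    return rc == 0 ? ptr : NULL;
}

/* Release memory with the allocator that produced it. */
static void sketch_aligned_free(void *ptr) {
#ifdef SKETCH_USE_HBM
    hbw_free(ptr);
#else
    free(ptr);
#endif
}
```

Building with `-DSKETCH_USE_HBM` would additionally require linking against memkind (typically `-lmemkind`); without the flag the sketch compiles on any POSIX system and behaves like an ordinary aligned allocator.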

see #2602

@netrunnereve
Collaborator

Are you seeing any performance improvements?

@jikunshang
Contributor Author

> Are you seeing any performance improvements?

Sorry for the late reply; I only recently got access to a test machine.
On a Xeon CPU Max 9462 running llama2-7b:
baseline inference performance is 124 ms/token;
with this change and HBM enabled, it is 87 ms/token, a gain of about 40% (124/87 ≈ 1.43x throughput).
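The two numbers above can be read either as a throughput gain or as a latency reduction, and the two figures differ. A small sketch of the arithmetic, with hypothetical helper names:

```c
#include <stddef.h>

/* Throughput ratio: how many times more tokens per second after the change. */
static double sketch_speedup(double baseline_ms, double new_ms) {
    return baseline_ms / new_ms;
}

/* Fraction of per-token latency removed by the change. */
static double sketch_latency_reduction(double baseline_ms, double new_ms) {
    return (baseline_ms - new_ms) / baseline_ms;
}

/* Tokens generated per second at a given per-token latency. */
static double sketch_tokens_per_second(double ms_per_token) {
    return 1000.0 / ms_per_token;
}
```

With the reported 124 ms/token and 87 ms/token, `sketch_speedup` gives roughly 1.43 (the "about 40%" figure), while `sketch_latency_reduction` gives roughly 0.30, that is, about 30% less time per token.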

@jikunshang
Contributor Author

Hi @ggerganov, could you review this at your convenience?

@hydroo
Contributor

hydroo commented Sep 2, 2023

Sorry for my ignorance.
Why doesn't the system allocate on HBM without this change?
Is this a system with both DDR and HBM, where some BIOS setting (I vaguely remember caching versus other modes for Optane and Knights* chips) makes the system either prioritize DDR or never touch HBM at all?
I'm asking because it's unintuitive to me that a system wouldn't prioritize HBM over DDR.

@jikunshang
Contributor Author

> Sorry for my ignorance. Why doesn't the system allocate on HBM without this change? Is this a system with both DDR and HBM, where some BIOS setting (I vaguely remember caching versus other modes for Optane and Knights* chips) makes the system either prioritize DDR or never touch HBM at all? I'm asking because it's unintuitive to me that a system wouldn't prioritize HBM over DDR.

Yes, you are right.
The Xeon CPU Max Series actually has three memory modes: HBM-only mode, Flat mode (1LM), and Cache mode (2LM). In Cache mode it works as you describe, but it lacks fine-grained memory management, since all HBM is transparent to software, like an L4 cache.
This change targets Flat mode, in which HBM and DDR are exposed to software as separate address spaces, so we can use HBM on demand.
More details about HBM configuration can be found here
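To make "use HBM on demand" concrete, here is a minimal sketch of how software might decide, in Flat mode, whether a buffer should go to HBM. It assumes memkind's `hbw_check_available()` (which returns 0 when high-bandwidth memory is usable) and uses a hypothetical build flag `SKETCH_USE_HBM`, a hypothetical size threshold, and hypothetical helper names; none of these are taken from the PR.

```c
#include <stddef.h>

#ifdef SKETCH_USE_HBM             /* hypothetical build flag, not taken from the PR */
#include <hbwmalloc.h>            /* memkind's high-bandwidth memory interface */
#endif

/* Returns 1 if high-bandwidth memory can be allocated from, 0 otherwise.
 * Without the build flag the sketch always reports "not available". */
static int sketch_hbm_available(void) {
#ifdef SKETCH_USE_HBM
    return hbw_check_available() == 0;
#else
    return 0;
#endif
}

/* Illustrative placement policy: send only buffers of at least `min_bytes`
 * to HBM, and only when HBM is actually present. */
static int sketch_place_in_hbm(size_t bytes, size_t min_bytes) {
    return sketch_hbm_available() && bytes >= min_bytes;
}
```

In Cache mode no such decision is possible, because the hardware manages HBM transparently; the explicit check only makes sense when HBM is visible to software as its own memory.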

@ggerganov
Member

Merge if CI passes.

@jikunshang
Contributor Author

Hi @ggerganov, could you approve the workflows again? Thanks!

@slaren merged commit 7f412da into ggml-org:master on Sep 8, 2023

@kunger97

Is HBM support compiled into llama.cpp by default, or does it need to be enabled at build time?
