Fix embedding when embedding layer on GPU #1873
Conversation
OpenCL is not needed for now, but it will be useful eventually.
When is the embedding layer ever on the GPU?
When we create the KV cache on the GPU.
I don't see how that is going to happen. Please provide a concrete set of parameters with which the embedding layer ends up on the GPU.
I agree that the embedding layer probably isn't being offloaded to the GPU right now. But having the option to move memory back to the host could be a handy feature, especially if we're thinking about bringing these changes back into the GGML repo. This would definitely give us more freedom in terms of offloading and acceleration.
To keep the code simpler, I think this should only be merged if there is an actual use case. Currently you can set the backend of a tensor to CPU.
Oh, I didn't know that. If I can simply set the backend to CPU, that should be good enough for my use case. In that case we don't need an additional function to get the data back to the host.
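For reference, here is a minimal sketch of that idea, assuming the `backend` field and `GGML_BACKEND_CPU` value exposed by ggml tensors from this period; it is an illustration, not code from this PR:

```cpp
// Hedged sketch: keep a graph output (e.g. the embeddings tensor) in host
// memory by leaving/setting its backend to CPU, so no extra
// "copy back to host" helper is needed.
#include "ggml.h"

static void keep_embeddings_on_host(struct ggml_tensor * embeddings) {
    // GGML_BACKEND_CPU is the default backend; setting it explicitly
    // documents that this tensor's data must stay in host memory and
    // must not be offloaded to VRAM.
    embeddings->backend = GGML_BACKEND_CPU;
}
```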
The logic is currently like this: if any out of
Thank you, this information is exactly what I needed. However, I have a question concerning the behavior when both are set. I'm asking because I'm currently integrating your CUDA acceleration into another project. I'm probably not the first and won't be the last person who is a bit confused about how the offloading works via the backends. What do you think, should we create a short README that describes the process? Or am I just stupid and the documentation already exists?
If any out of
Here's what should work: do not apply
Documentation is always useful, but it's a question of opportunity cost. Right now the CUDA code is changing relatively quickly, so I don't want to spend time writing documentation that may become outdated after I notice that one of my earlier design decisions was bad and will need to be changed. I'm happy to answer specific questions though; I have a Mumble server that could be used for talking about it if desired.
Yup, that was the exception I got, and I already suspected the alibi function; I'll give it another try tomorrow. Could you link me to your Mumble server? If I have additional questions I'll ask them there.
Either write an email to the address on my GitHub page, or add your email address to your page and I'll send you the address and password for the Mumble server.
Here is a repo:
Okay, it seems I misunderstood the point of this PR. I can confirm that the
I made another PR #1891 that should also work as a fix, but with an extremely simple change.
Sure, your fix is much simpler. I am thinking about how we can make the samplers run inside CUDA and keep everything on the GPU.
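As an illustrative aside (not part of this PR or #1891): the simplest sampler, greedy/argmax selection, could stay on the device with something like the sketch below. `d_logits` and `n_vocab` are assumed names for a device-resident logits buffer and the vocabulary size.

```cpp
// Hedged sketch of GPU-side greedy sampling using Thrust.
#include <thrust/execution_policy.h>
#include <thrust/extrema.h>

static int sample_greedy_on_gpu(const float * d_logits, int n_vocab) {
    // Find the position of the largest logit without copying the
    // logits back to the host.
    const float * best = thrust::max_element(thrust::device,
                                             d_logits, d_logits + n_vocab);
    return (int)(best - d_logits);
}
```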
Are you sure that would actually be faster? Copying small amounts of data between CPU and GPU takes a few microseconds. Doing that once or twice per token is not going to make a meaningful difference. So the question would then be whether sampling is suitable for GPU acceleration. If it isn't, then I suspect a GPU implementation will be slower than what is currently on master.
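For context, the per-token device-to-host copy being discussed amounts to something like the following hypothetical helper (again, `d_logits` and `n_vocab` are assumed names, not identifiers from the PR):

```cpp
// Hedged sketch: copy one token's logits from the GPU to the host so that
// the existing CPU samplers can run unchanged.
#include <cuda_runtime.h>
#include <vector>

static std::vector<float> fetch_logits(const float * d_logits, int n_vocab) {
    std::vector<float> logits(n_vocab);
    // For a typical vocabulary (~32000 floats, roughly 128 KiB) this
    // transfer is tiny compared to one token's worth of matrix multiplies.
    cudaMemcpy(logits.data(), d_logits, n_vocab * sizeof(float),
               cudaMemcpyDeviceToHost);
    return logits;
}
```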
In any case, I think the proper way to do what you want to do would be to add a backend like
This needs more tests. My suspicion is that the synchronization between GPU and CPU needed for sampling is slow. Of course, I don't have data to prove it; I will do some tests and get back.
Not needed anymore.
Support copying data from GPU back to HOST