Skip to content

[SYCL] Add a leaf limit to the execution graph #1070

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Feb 5, 2020

Conversation

sergey-semenov
Copy link
Contributor

This patch adds a leaf limit (per memory object) for the command
execution graph in order to avoid graph bloat in applications that have
an overwhelming number of command groups that can be executed in
parallel.

Whenever the limit is exceeded, one of the old leaves is added as a
dependency of the new leaf instead.

Signed-off-by: Sergey Semenov [email protected]

@keryell
Copy link
Contributor

keryell commented Feb 1, 2020

It is not clear to me what is the problem you are trying to solve. I admit I am not familiar with your graph scheduler...
I do not understand how your circular buffer works. Is there any modulo operation lacking somewhere?

@sergey-semenov
Copy link
Contributor Author

It is not clear to me what is the problem you are trying to solve. I admit I am not familiar with your graph scheduler...

Admittedly, this patch is mainly solving a problem that has not been merged to the code base yet. It has to do with the graph cleanup (#1066) which regularly deletes finished non-alloca non-leaf command nodes from the graph to avoid large memory usage. A large enough number of leaves in the graph dramatically slows down this process and that's where this change comes in.

Thanks for pointing this out, I'll amend the commit message to make the intent clearer.

I do not understand how your circular buffer works. Is there any modulo operation lacking somewhere?

Where do you expect it to be?

@AlexeySachkov
Copy link
Contributor

@sergey-semenov,

Where do you expect it to be?

I guess that usually circular buffer is implemented on top of regular vector:

void push_back(T elem) {
  auto Index = ++LastIndex % StorageSize;
  Storage[Index] = elem;
}

This patch adds a leaf limit (per memory object) for the command
execution graph in order to avoid leaf bloat in applications that have
an overwhelming number of command groups that can be executed in
parallel. Limiting the number of leaves is necessary for reducing
performance overhead of regular cleanup of finished command nodes.

Whenever the limit is exceeded, the oldest leaf is added as a
dependency of the new one instead.

Signed-off-by: Sergey Semenov <[email protected]>
Signed-off-by: Sergey Semenov <[email protected]>
@keryell
Copy link
Contributor

keryell commented Feb 3, 2020

@sergey-semenov,

Where do you expect it to be?

I guess that usually circular buffer is implemented on top of regular vector:

void push_back(T elem) {
  auto Index = ++LastIndex % StorageSize;
  Storage[Index] = elem;
}

Yes usually this is what a circular buffer is. Here it is more like a FIFO...
So it is a std::queue with a bounded capacity.
I have the feeling it is more efficient to use a vector + % or & (if power of 2, the best) or ?: to handle the end instead of paying the price of a deque, that will allocate and deallocate memory on a regular basis.

@keryell
Copy link
Contributor

keryell commented Feb 3, 2020

Thanks for pointing this out, I'll amend the commit message to make the intent clearer.

Thanks for the explanation, even if I think I do not have the back-ground to understand.
Otherwise I am always confused with the name alloca since it is neither here the UNIX alloca nor the LLVM alloca...

Signed-off-by: Sergey Semenov <[email protected]>
Signed-off-by: Sergey Semenov <[email protected]>
@sergey-semenov
Copy link
Contributor Author

I have the feeling it is more efficient to use a vector + % or & (if power of 2, the best) or ?: to handle the end instead of paying the price of a deque, that will allocate and deallocate memory on a regular basis.

That's definitely a valid concern, deque was chosen primarily for the ease of implementation (double ended push/pop + out-of-box iterators), but making that switch might prove to be a worthwhile optimization eventually.

@romanovvlad romanovvlad merged commit 7c293e2 into intel:sycl Feb 5, 2020
@sergey-semenov sergey-semenov deleted the leaflimit branch February 5, 2020 09:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants