osc/pt2pt hang in master #1299
Comments
Not quite right but I see what is going on. Working on a patch now.
hjelmn added a commit to hjelmn/ompi that referenced this issue on Feb 2, 2016
This commit fixes several bugs identified by a new multi-threaded RMA benchmarking suite. The following bugs have been identified and fixed:

- The code that signaled the actual start of an access epoch changed the eager_send_active flag on a synchronization object without holding the object's lock. This could cause another thread waiting on eager sends to block indefinitely because the entirety of ompi_osc_pt2pt_sync_expected could execute between the check of eager_send_active and the condition wait of ompi_osc_pt2pt_sync_wait.
- The bookkeeping of fragments could get screwed up when performing long put/accumulate operations from different threads. This was caused by the fragment flush code at the end of both put and accumulate. This code was put in place to avoid sending a large number of unexpected messages to a peer. To fix the bookkeeping issue we now 1) wait for eager sends to be active before starting any large isends, and 2) keep track of the number of large isends associated with a fragment. If the number of large isends reaches 32 the active fragment is flushed.
- Use atomics to update the large receive/send tag counters. This prevents duplicate tags from being used. The tag space has also been updated to use the entire 16 bits of the tag space.

These changes should also fix open-mpi#1299.

Signed-off-by: Nathan Hjelm <[email protected]>
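The first item in the commit message above describes a classic lost wakeup: the flag is tested, the signaller runs to completion, and only then does the waiter block on the condition variable. The following is a minimal sketch of the locked pattern using plain POSIX threads. It is not the Open MPI code; `demo_sync_t` and every `demo_*` function are hypothetical names chosen for illustration, and only the `eager_send_active` field name is taken from the commit message.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a synchronization object (not the Open MPI struct). */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            eager_send_active;
} demo_sync_t;

static void demo_sync_init(demo_sync_t *sync)
{
    int rc = pthread_mutex_init(&sync->lock, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&sync->cond, NULL);
    assert(rc == 0);
    (void) rc;
    sync->eager_send_active = false;
}

/* Signaller: change the flag and wake waiters while holding the lock, so the
 * update cannot slip in between a waiter's check and its call to wait. */
static void demo_sync_activate(demo_sync_t *sync)
{
    pthread_mutex_lock(&sync->lock);
    sync->eager_send_active = true;
    pthread_cond_broadcast(&sync->cond);
    pthread_mutex_unlock(&sync->lock);
}

/* Waiter: the check and the wait happen under the same lock; the loop also
 * guards against spurious wakeups. */
static void demo_sync_wait(demo_sync_t *sync)
{
    pthread_mutex_lock(&sync->lock);
    while (!sync->eager_send_active) {
        pthread_cond_wait(&sync->cond, &sync->lock);
    }
    pthread_mutex_unlock(&sync->lock);
}

static void *demo_waiter(void *arg)
{
    demo_sync_wait((demo_sync_t *) arg);
    return NULL;
}

/* Starts one waiter, activates the flag, and returns 1 once the waiter has
 * been released and joined (0 if the thread could not be created). */
static int demo_run(void)
{
    demo_sync_t sync;
    pthread_t thread;

    demo_sync_init(&sync);
    if (pthread_create(&thread, NULL, demo_waiter, &sync) != 0) {
        return 0;
    }
    demo_sync_activate(&sync);
    pthread_join(thread, NULL);

    pthread_cond_destroy(&sync.cond);
    pthread_mutex_destroy(&sync.lock);
    return 1;
}
```

If `demo_sync_activate` set the flag without taking `lock`, the waiter could observe `false`, lose the race, and then sleep in `pthread_cond_wait` with nobody left to wake it, which matches the hang described in the commit message.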
hjelmn added a commit to hjelmn/ompi that referenced this issue on Feb 2, 2016
This commit fixes open-mpi#1299. Signed-off-by: Nathan Hjelm <[email protected]>
hjelmn added a commit to hjelmn/ompi-release that referenced this issue on Feb 3, 2016
This commit fixes several bugs identified by a new multi-threaded RMA benchmarking suite. The following bugs have been identified and fixed:

- The code that signaled the actual start of an access epoch changed the eager_send_active flag on a synchronization object without holding the object's lock. This could cause another thread waiting on eager sends to block indefinitely because the entirety of ompi_osc_pt2pt_sync_expected could execute between the check of eager_send_active and the condition wait of ompi_osc_pt2pt_sync_wait.
- The bookkeeping of fragments could get screwed up when performing long put/accumulate operations from different threads. This was caused by the fragment flush code at the end of both put and accumulate. This code was put in place to avoid sending a large number of unexpected messages to a peer. To fix the bookkeeping issue we now 1) wait for eager sends to be active before starting any large isends, and 2) keep track of the number of large isends associated with a fragment. If the number of large isends reaches 32 the active fragment is flushed.
- Use atomics to update the large receive/send tag counters. This prevents duplicate tags from being used. The tag space has also been updated to use the entire 16 bits of the tag space.

These changes should also fix open-mpi/ompi#1299.

Signed-off-by: Nathan Hjelm <[email protected]>
(cherry picked from open-mpi/ompi@d7264aa)
Signed-off-by: Nathan Hjelm <[email protected]>
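The last item in the commit message above (atomic tag counters over a 16-bit tag space) can be illustrated with C11 atomics. This is only a sketch under stated assumptions: `demo_tag_counter`, `DEMO_TAG_MASK`, and `demo_next_tag` are hypothetical names, and the real osc/pt2pt tag layout may reserve bits or step the counter differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumption for this sketch: tags occupy the full 16-bit range. */
#define DEMO_TAG_MASK 0xFFFFu

static_assert(DEMO_TAG_MASK == UINT16_MAX,
              "mask must cover the full 16-bit tag space");

/* Shared counter; static storage, so it starts at zero. */
static atomic_uint_fast32_t demo_tag_counter;

/* Hands out the next tag. atomic_fetch_add makes the read-modify-write a
 * single indivisible step, so two threads calling this concurrently never
 * receive the same counter value (tags only repeat after 65536 calls). */
static uint16_t demo_next_tag(void)
{
    uint_fast32_t previous = atomic_fetch_add(&demo_tag_counter, 1);
    return (uint16_t) (previous & DEMO_TAG_MASK);
}
```

With a plain `counter++` instead, two threads could both read the same value before either writes it back, and both would issue the same tag, which is the duplicate-tag problem the commit message mentions.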
hjelmn added a commit to hjelmn/ompi-release that referenced this issue on Feb 3, 2016
This commit fixes open-mpi/ompi#1299. Signed-off-by: Nathan Hjelm <[email protected]> (cherry picked from open-mpi/ompi@519fffb) Signed-off-by: Nathan Hjelm <[email protected]>
bosilca pushed a commit to bosilca/ompi that referenced this issue on Oct 3, 2016
This commit fixes several bugs identified by a new multi-threaded RMA benchmarking suite. The following bugs have been identified and fixed:

- The code that signaled the actual start of an access epoch changed the eager_send_active flag on a synchronization object without holding the object's lock. This could cause another thread waiting on eager sends to block indefinitely because the entirety of ompi_osc_pt2pt_sync_expected could execute between the check of eager_send_active and the condition wait of ompi_osc_pt2pt_sync_wait.
- The bookkeeping of fragments could get screwed up when performing long put/accumulate operations from different threads. This was caused by the fragment flush code at the end of both put and accumulate. This code was put in place to avoid sending a large number of unexpected messages to a peer. To fix the bookkeeping issue we now 1) wait for eager sends to be active before starting any large isends, and 2) keep track of the number of large isends associated with a fragment. If the number of large isends reaches 32 the active fragment is flushed.
- Use atomics to update the large receive/send tag counters. This prevents duplicate tags from being used. The tag space has also been updated to use the entire 16 bits of the tag space.

These changes should also fix open-mpi#1299.

Signed-off-by: Nathan Hjelm <[email protected]>
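The fragment bookkeeping rule in the commit message above (count the large isends attached to a fragment and flush when the count reaches 32) can be sketched as follows. This is a single-threaded illustration only: `demo_frag_t` and the `demo_frag_*` functions are hypothetical, the flush is reduced to a counter, and the real code must additionally protect this state when several threads issue puts and accumulates.

```c
#include <assert.h>
#include <stdbool.h>

/* Threshold taken from the commit message. */
#define DEMO_MAX_LARGE_ISENDS 32

/* Hypothetical fragment bookkeeping (not the Open MPI fragment struct). */
typedef struct {
    int large_isend_count; /* large isends associated with the active fragment */
    int flush_count;       /* how many times the fragment has been flushed */
} demo_frag_t;

/* Stand-in for flushing the active fragment: record it and reset the count. */
static void demo_frag_flush(demo_frag_t *frag)
{
    frag->flush_count++;
    frag->large_isend_count = 0;
}

/* Records one large isend against the fragment. Returns true if this call
 * caused the fragment to be flushed because the threshold was reached. */
static bool demo_frag_note_large_isend(demo_frag_t *frag)
{
    assert(frag->large_isend_count < DEMO_MAX_LARGE_ISENDS);
    frag->large_isend_count++;
    if (frag->large_isend_count == DEMO_MAX_LARGE_ISENDS) {
        demo_frag_flush(frag);
        return true;
    }
    return false;
}
```

Tying the flush to a per-fragment count, rather than flushing unconditionally at the end of every put or accumulate, keeps the number of outstanding unexpected messages bounded without a second thread flushing a fragment the first thread is still filling.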
bosilca pushed a commit to bosilca/ompi that referenced this issue on Oct 3, 2016
This commit fixes open-mpi#1299. Signed-off-by: Nathan Hjelm <[email protected]>
@hjelmn can you please have a look at this?

Here is a reproducer. It can be run with only one MPI task.

It works fine with --mca osc sm on both v1.10 and master, but with --mca osc pt2pt it works fine on v1.10 and hangs on master.

I ran this under the debugger and ended up writing this patch so that master mimics v1.10. That being said, I have no idea whether this is correct or not ...