
Defect: event post hangs using 2 images per node. #411


Closed
Ambra91 opened this issue Jul 12, 2017 · 9 comments

@Ambra91 commented Jul 12, 2017

Defect/Bug Report

Hi. When running the following toy example under certain configurations, the event post statement hangs.
For example, this happens with 4 and 8 images, but only when the images are distributed 2 per node.

program prova_post
    use iso_fortran_env
    implicit none
    integer :: np, me
    type(event_type), allocatable :: snd_copied(:)[:]
 
    me=this_image()
    np = num_images()
    if (allocated(snd_copied)) deallocate(snd_copied)
    allocate(snd_copied(np)[*])
    if (me == 2)  print*,'I am  image 2, I am posting to 4'
    if (me == 2) event post(snd_copied(2)[4])
    if (me == 2) print*,' I am image 2, I have posted to 4'
    if (me == 4) event wait(snd_copied(2))
    if (allocated(snd_copied)) deallocate(snd_copied)
end program prova_post
  • OpenCoarrays Version: 1.9.0
  • Fortran Compiler: gfortran 7.1
  • C compiler used for building lib: gcc 7.1
  • Installation method: cmake
  • Output of uname -a: Linux yoda 2.6.32-642.11.1.el6.x86_64 #1 SMP Fri Nov 18 19:25:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
  • MPI library being used: MPICH 3.2
  • Version of CMake: 3.8.2

Observed Behavior

I am  image 2, I am posting to 4

(program hangs)

Expected Behavior

I am  image 2, I am posting to 4
 I am image 2, I have posted to 4

Steps to Reproduce

I am using a PBS script to run the program.
I have compiled the program using:

caf -o prova_post.x -g prova_post.f90

And I run it using the following script:

#!/bin/bash
#PBS -l nodes=2:ppn=2,walltime=04:00:00
#PBS -N caf_prova_post
#PBS -e caf_prova_post.err
#PBS -o caf_prova_post.out
#PBS  -W x="NACCESSPOLICY:SINGLEJOB"  
cd $PBS_O_WORKDIR

cafrun -np 4 ./prova_post.x > prova_post.out
@afanfa afanfa self-assigned this Jul 12, 2017
@afanfa (Contributor) commented Jul 12, 2017

Hi, this sounds like a problem related to MPI progress and the fact that events are currently based on MPI atomic operations. What happens if you run the same code on a single node with 4 processes?
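For readers unfamiliar with the mechanism: below is a minimal sketch, assuming a pre-created MPI window named ev_win that exposes the event counters, of how an EVENT POST to a remote image can map onto a passive-target MPI atomic update. It is illustrative only, not the actual OpenCoarrays source.

! Illustrative sketch only -- not the actual OpenCoarrays implementation.
! Assumes ev_win is an MPI window exposing the integer event counter on
! every rank, and target_rank is the MPI rank of the image being posted to.
subroutine sketch_event_post(ev_win, target_rank)
    use mpi_f08
    implicit none
    type(MPI_Win), intent(in) :: ev_win
    integer, intent(in) :: target_rank
    integer :: one, old
    integer(kind=MPI_ADDRESS_KIND) :: disp

    one = 1
    disp = 0_MPI_ADDRESS_KIND

    ! Atomically add 1 to the event counter stored on target_rank.
    call MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank, 0, ev_win)
    call MPI_Fetch_and_op(one, old, MPI_INTEGER, target_rank, disp, MPI_SUM, ev_win)
    call MPI_Win_unlock(target_rank, ev_win)
    ! On some inter-node transports this update only completes when the
    ! target process enters the MPI library ("makes progress"); if the
    ! target is busy spinning in EVENT WAIT without calling MPI, the
    ! posting image can appear to hang, which matches the symptom above.
end subroutine sketch_event_post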

@Ambra91 (Author) commented Jul 12, 2017

In that case I get the expected output. I also get the expected output if I run the same code on 4 nodes. The problem arises only when I distribute the images across 2 nodes (2 per node).

@afanfa (Contributor) commented Jul 12, 2017

Ok, I know what is going on; I'll produce a patch in the next hour. Because of our policy, the patch may take up to 24 hours before hitting the trunk. If you need this issue fixed quickly, I recommend applying the patch to your OpenCoarrays installation yourself. The patch will be just a few lines long.

@afanfa (Contributor) commented Jul 12, 2017

I'm just curious: have you tried to run your code with MVAPICH?

@zbeekman (Collaborator) commented:

@afanfa Alessandro, very responsive fix! I'm happy to merge this soon, but I saw over on the PR that you wanted to discuss it further. I'm happy to look at it with you, or we can try to get Damian online too.

@Ambra91 Thanks for the detailed bug report, and for using the template! It helps A LOT when people go the extra mile to help make our lives easier! Should be getting a fix out pretty soon.

@Ambra91 (Author) commented Jul 14, 2017

Thank you for the quick response and for addressing the issue.
@afanfa I am experiencing the same problem with MVAPICH too.

@zbeekman (Collaborator) commented:

@Ambra91 thanks again for such a wonderful and detailed report! 💯 Please try again using the master branch and confirm that this issue is fixed. If not, we'll reopen it.

Thanks again for your contribution!

@zbeekman (Collaborator) commented Aug 8, 2017

@Ambra91 Can we use your code for a regression/unit test? FYI we use the Linux Foundation CLA: https://gist.github.com/zbeekman/0a5d60a1cbd1f6a8cfa5
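
For context, a self-checking variant of the reproducer might look like the sketch below. It is a hypothetical illustration, not necessarily the test that ended up in the suite: it reports success from image 1 once the post/wait pair has completed, while a regression would show up as a hang caught by the test harness's timeout.

! Hypothetical self-checking variant of the reproducer; not necessarily
! the regression test that was committed to OpenCoarrays.
program test_event_post
    use iso_fortran_env, only: event_type
    implicit none
    type(event_type), allocatable :: snd_copied(:)[:]
    integer :: np, me

    me = this_image()
    np = num_images()
    if (np < 4) error stop 'This test requires at least 4 images.'

    allocate(snd_copied(np)[*])

    ! Image 2 posts to the event copy that lives on image 4;
    ! image 4 waits on its own copy.
    if (me == 2) event post(snd_copied(2)[4])
    if (me == 4) event wait(snd_copied(2))

    sync all  ! all images reach this point only if the wait completed
    if (me == 1) print *, 'Test passed.'

    deallocate(snd_copied)
end program test_event_post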

@Ambra91 (Author) commented Aug 8, 2017

@zbeekman Yes, of course.

zbeekman added a commit that referenced this issue Aug 9, 2017
Really fix #411 on all mpi-installations reliably.