Add installer detection of gfortran/mpich mismatch #246


Closed

rouson opened this issue Nov 8, 2016 · 15 comments

@rouson
Member

rouson commented Nov 8, 2016

A common installation failure mode arises when OpenCoarrays is built with an MPICH installation that was itself built by a different gfortran version than the one currently invoked via mpif90. Detecting this failure mode in the OpenCoarrays installer has long been on my to-do list. This issue is a reminder to work on it. Detecting and preventing such subtle, hard-to-diagnose problems is one of the main aims of the OpenCoarrays installer.

@zbeekman
Collaborator

zbeekman commented Nov 8, 2016

There may be some cases where this isn't a problem, at least once we've removed the need to explicitly pass FC=mpif90 to CMake and rely more heavily on find_package().

One big issue stems from the .mod files and their format, which is not standardized, differs radically between compiler vendors, and sometimes changes from minor version to minor version within a single compiler. (gfortran has a reputation for this...) However, just because the .mod file format changed doesn't mean that the previously built MPI is incompatible with the current OpenCoarrays and gfortran toolchain: the ABI of the object files may still be the same.

Furthermore, the only hard dependency of the actual library is on the C MPI interface: if my understanding is correct, we depend on the Fortran MPI interface only for the OpenCoarrays wrapper module and perhaps some tests or system introspection.

I think the best option for detecting this would be system introspection at configure/CMake time: we can try to build and run an MPI "hello world" Fortran program. If we're using a recent GCC, we could even consider skipping the OpenCoarrays wrapper module and any tests in which use mpi makes an appearance. Furthermore, since it's possible that the ABI for object files etc. is consistent even if the ABI for .mod files isn't, we could also try using include 'mpif.h' instead of use mpi.
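
As a concrete illustration of the configure-time probe described above, here is a minimal sketch (the file name is mine, not part of OpenCoarrays; the check simply fails to compile or run if mpi.mod or the MPI runtime is incompatible with the gfortran behind mpif90):

    ! hello_mpi_probe.f90 -- illustrative configure-time check, not actual OpenCoarrays code
    program hello_mpi_probe
      use mpi            ! compilation fails here if mpi.mod was written by an incompatible gfortran
      implicit none
      integer :: ierr, rank, nprocs
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      print '(a,i0,a,i0)', 'Hello from rank ', rank, ' of ', nprocs
      call MPI_Finalize(ierr)
    end program hello_mpi_probe

Compiling this with mpif90 and running it under mpiexec exercises both the mpi.mod compatibility and the MPI launch, so it also catches the case where linking succeeds but startup fails.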

@rouson
Member Author

rouson commented Nov 8, 2016

@zbeekman Thanks for writing this thorough and thoughtful response. It helped me remember the details of the failure mode I've seen. On several occasions, both in my own work and in supporting collaborators, I've seen the OpenCoarrays installation proceed without error or warning, and the resulting installation produce executable programs without warning, yet the simplest of executable programs fails to launch MPI correctly. Your suggestion to build and run a simple MPI hello world program is spot on. In such situations, we could either terminate the build with an error message or attempt to recover by finding a working MPI installation and offering the user the option to build with it.

This also reminds me, however, that the failure mode is sometimes even more subtle. I think it might even be the cause of behavior that @vehre just encountered. He downloaded the virtual machine and built OpenCoarrays with the system-installed MPI. The system MPI was built with an older version of gfortran, but in his setup, mpif90 invokes a recent gfortran 7.0.0 trunk build. Apparently, the system-installed MPI was sufficiently recent that the only sign of a problem was one test failure (#8), which I suspect will go away after he builds a new MPICH using the GCC trunk and then builds OpenCoarrays with that new GCC.

@vehre please let us know whether your new build confirms my suspicion.

@zbeekman
Collaborator

zbeekman commented Nov 8, 2016

Apparently, the system-installed MPI was sufficiently recent that the only sign of a problem was one test failure (#8), which I suspect will go away after he builds a new MPICH using the GCC trunk and then builds OpenCoarrays with that new GCC.

If this is true, that is a very insidious error. I wonder if there is a way to determine which GCC built MPICH. If not, then there's no real guarantee that you can catch and prevent this issue. Continuing to use mpi.mod (rather than include 'mpif.h') will be the best/closest thing to directly detecting this.

@vehre
Collaborator

vehre commented Nov 9, 2016

Hi all,

my previous post did not get inserted here, so I am inlining it now.

I did the following test cycle:

  1. a) Remove /opt/mpich completely in the VM.
     b) Reinstall system-supplied MPICH version 3.2.
     c) Delete build-opencoarrays and rebuild using:

        cmake -D CMAKE_BUILD_TYPE=Debug -D CMAKE_C_COMPILER='gcc' \
              -D CMAKE_CXX_COMPILER='g++' \
              -D CMAKE_EXE_LINKER_FLAGS=' -Wl,-rpath -Wl,/usr/lib/mpich/lib -Wl,--enable-new-dtags /usr/lib/mpich/lib/libmpich.so /usr/lib/mpich/lib/libmpichfort.so' \
              -D CMAKE_Fortran_FLAGS="-I /usr/include/mpich" ../opencoarrays && make

     d) ctest . => Test #8 (co_broadcast with character scalar) still failing.
  2. a) Remove system-supplied MPICH.
     b) Install libopenmpi-dev and openmpi-bin version 1.10.2.
     c) As above, but with:

        cmake -D CMAKE_BUILD_TYPE=Debug -D CMAKE_C_COMPILER='gcc' \
              -D CMAKE_CXX_COMPILER='g++' \
              -D CMAKE_EXE_LINKER_FLAGS=' -Wl,-rpath -Wl,/usr/lib/openmpi/lib/ -Wl,--enable-new-dtags /usr/lib/openmpi/lib/libmpi_usempif08.so /usr/lib/openmpi/lib/libmpi_usempi_ignore_tkr.so /usr/lib/openmpi/lib/libmpi_mpifh.so /usr/lib/openmpi/lib/libmpi.so' \
              -D CMAKE_Fortran_FLAGS="-I /usr/lib/openmpi/lib" ../opencoarrays && make

     d) ctest . => Test #8 (co_broadcast with character scalar) failing.
  3. a) Remove system-supplied openmpi-common.
     b) wget http://www.mpich.org/static/downloads/3.2/mpich-3.2.tar.gz
     c) ../mpich-3.2/configure --prefix=/opt/mpich/3.2 && make -j4 && sudo make install
     d) Add modulefile to /usr/share/modules/modulefiles/mpich/3.2:

        ##
        ## modules modulefile
        ##
        # for Tcl script use only
        set     version         3.2
        set     modname         "mpich/3.2"
        set     moddesc         "MPICH environment"
        set     root_path       "/opt/mpich/3.2/"

        #setenv  MPIEXEC_COMM    p4

        # include code common to all mpi modules
        source /usr/share/modules/modulefiles/mpi_common

        prepend-path    LD_LIBRARY_PATH $root_path/lib
        prepend-path    PATH            $root_path/bin
        prepend-path    MANPATH         $root_path/share/man

     e) module load mpich/3.2
     f) cmake -D CMAKE_BUILD_TYPE=Debug -D CMAKE_C_COMPILER='gcc' \
              -D CMAKE_CXX_COMPILER='g++' \
              -D CMAKE_EXE_LINKER_FLAGS=' -Wl,-rpath -Wl,/usr/lib/mpich/lib -Wl,--enable-new-dtags /opt/mpich/3.2/lib/libmpich.so /opt/mpich/3.2/lib/libmpifort.so' \
              -D CMAKE_Fortran_FLAGS="-I /opt/mpich/3.2/include" ../opencoarrays && make
     g) ctest . => Test #8 failing.

  4. a) Unload mpich/3.2.
     b) wget http://www.mpich.org/static/downloads/3.3a1/mpich-3.3a1.tar.gz
        tar -xzf mpich-3.3a1.tar.gz && mkdir build-mpich33 && cd build-mpich33
     c) ../mpich-3.3a1/configure --prefix=/opt/mpich/3.3a1 && make -j4
     d) sudo make install and adapt the modulefile.
     e) module load mpich/3.3a1
     f) cmake -D CMAKE_BUILD_TYPE=Debug -D CMAKE_C_COMPILER='gcc' \
              -D CMAKE_CXX_COMPILER='g++' \
              -D CMAKE_EXE_LINKER_FLAGS=' -Wl,-rpath -Wl,/opt/mpich/3.3a1/lib -Wl,--enable-new-dtags /opt/mpich/3.3a1/lib/libmpich.so /opt/mpich/3.3a1/lib/libmpifort.so' \
              -D CMAKE_Fortran_FLAGS="-I /opt/mpich/3.3a1/include" ../opencoarrays && make
     g) ctest . => Test #8 (co_broadcast with character scalar) failing.

So nothing helped. It seems to be some other issue that does not depend on the MPI library. I will now take a deep look into the code generated by gfortran:

The arrays in the test case are all plain C-style arrays, i.e., without an array descriptor. Unfortunately, the CAF communication routines require the array bounds to be set correctly when temporary array descriptors are generated. This is not the case in the current pseudo-code, and therefore the test case is failing. I am wondering where that got lost, but it plainly is a compiler problem. I will see to it.

Regards,
Andre

@vehre
Collaborator

vehre commented Nov 9, 2016

Well, my last comment above is incorrect. I was looking at a scalar that was transferred, which obviously has no array bounds attached to it.
I investigated further and figured out that there have been some commits between my copy of master and the current master. After updating to current master, I get the same error on bare metal, i.e., something between commit dfe2ec0 and HEAD has caused the regression.
I hope this helps.

- Andre

@zbeekman
Collaborator

zbeekman commented Nov 9, 2016

@vehre So are you saying that the error goes away when you checkout dfe2ec0?

@vehre
Collaborator

vehre commented Nov 9, 2016

Correct, with dfe2ec0 all tests pass on bare metal and on the virtual machine. You still need to bisect to find which commit is the troublemaker; this commit is only the one that I know is working.

@zbeekman
Collaborator

zbeekman commented Nov 9, 2016

Yes, I'm about to start running git bisect. Which gcc trunk are you currently on? For me, I always get a test failure, but the odd part is that when I run from the latest master with GCC 7 I get test #8 (get_self) failing, and when I run from dfe2ec0, test #9 (get_with_offset_1d) fails. Any chance you have time for a short Skype call?

@vehre
Collaborator

vehre commented Nov 9, 2016

On bare metal I am on trunk from noon today. On the virtual machine, trunk is from yesterday noon. So both are quite recent. I don't see test #9 failing on the VM with the dfe2ec0 commit. The VM is using MPICH 3.3a1. Sure, give me a call.

@zbeekman
Collaborator

@vehre Have we determined the cause of the test #8 and/or test #9 failures, or are your comments still pertinent?

@vehre
Collaborator

vehre commented Nov 24, 2016

Well, I am convinced I know the causes of the failures:

sameloc.f90: needs an improved gfortran compiler, as available in vehre/coarray on github.gcc.
get_with_offset_1D.f90: either needs the improved gfortran compiler or the strided sendget patch.

@zbeekman
Collaborator

@vehre Great, thanks so much for clarifying this for me.

@zbeekman
Collaborator

zbeekman commented Nov 29, 2016

So, @rouson, I've done some more research, and AFAICT there is no good way to determine which compiler built the MPI library. Perhaps on some Linux systems you can use readelf, and on some systems that don't use strip to shrink the binaries you can use strings -a libmpi.a | grep -i gcc or similar, but in general there is no good way to do this.

I think the closest we can come to detecting these sorts of issues is some system introspection: testing a few MPI hello-world-type examples. This will at least tell us whether there is an incompatible or missing mpi.mod, which is necessary but not sufficient to conclude that a different Fortran compiler was used when building MPI, as discussed over email with @jerryd.

zbeekman self-assigned this Nov 30, 2016
zbeekman added a commit that referenced this issue Nov 30, 2016
 Fall back to using `#include 'mpif.h'`
 Create interfaces when needed

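
The fallback named in the commit above can be sketched roughly as follows; this is an illustrative example, not the actual OpenCoarrays source. When mpi.mod is missing or incompatible, the Fortran 77-style include file provides only constants, so any interface the code relies on must be declared by hand (the commit subject uses the preprocessor form #include, appropriate for .F90 files; the plain Fortran include below is the same idea):

    ! mpif_fallback_probe.f90 -- illustrative only; names are hypothetical
    program mpif_fallback_probe
      implicit none
      include 'mpif.h'   ! constants such as MPI_COMM_WORLD, but no explicit interfaces

      ! Hand-written interface so the compiler can check this call;
      ! mpif.h itself supplies none.
      interface
        subroutine MPI_Comm_rank(comm, rank, ierror)
          integer, intent(in)  :: comm
          integer, intent(out) :: rank, ierror
        end subroutine MPI_Comm_rank
      end interface

      integer :: ierr, rank
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      print '(a,i0)', 'Hello from rank ', rank
      call MPI_Finalize(ierr)
    end program mpif_fallback_probe

Because this variant never touches mpi.mod, it builds even when the module was produced by a different gfortran, which is why it serves as a fallback rather than as a detection mechanism.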
zbeekman added a commit that referenced this issue Nov 30, 2016
zbeekman added a commit that referenced this issue Dec 1, 2016
@zbeekman
Collaborator

zbeekman commented Dec 1, 2016

I'm going to close this when we merge #258 unless anyone objects; I don't really see a way to do anything beyond this.

zbeekman added a commit that referenced this issue Dec 1, 2016
…smatch-detection

Handle MPI and Fortran mod file compatibility more robustly

 - Fixes #246
ghost removed the in-progress label Dec 1, 2016
@jerryd

jerryd commented Dec 1, 2016 via email
