Build failure when compiling w/ patched build system in parallel #366

Closed
amckinstry opened this issue Apr 25, 2017 · 20 comments
Comments

@amckinstry

This is seen in Debian (stretch, unstable):

make[3]: Entering directory '/«PKGBUILDDIR»/obj-arm-linux-gnueabi'
[  4%] Building Fortran object src/mpi/CMakeFiles/caf_mpi.dir/__/extensions/opencoarrays.F90.o
[  4%] Building Fortran object src/mpi/CMakeFiles/caf_mpi_static.dir/__/extensions/opencoarrays.F90.o
cd /«PKGBUILDDIR»/obj-arm-linux-gnueabi/src/mpi && /usr/bin/mpifort  -DMPI_WORKING_MODULE -DPREFIX_NAME=_gfortran_caf_ -I/usr/lib/arm-linux-gnueabi/openmpi/lib -I/usr/lib/arm-linux-gnueabi/openmpi/include -I/usr/lib/arm-linux-gnueabi/openmpi/include/openmpi/opal/mca/event/libevent2022/libevent/include -I/usr/lib/arm-linux-gnueabi/openmpi/include/openmpi/opal/mca/event/libevent2022/libevent -I/usr/lib/arm-linux-gnueabi/openmpi/include/openmpi -I/«PKGBUILDDIR»/src -I/«PKGBUILDDIR»/obj-arm-linux-gnueabi/mod  -g -O2 -fdebug-prefix-map=/«PKGBUILDDIR»=. -fstack-protector-strong -J../../mod   -c /«PKGBUILDDIR»/src/extensions/opencoarrays.F90 -o CMakeFiles/caf_mpi_static.dir/__/extensions/opencoarrays.F90.o
cd /«PKGBUILDDIR»/obj-arm-linux-gnueabi/src/mpi && /usr/bin/mpifort  -DMPI_WORKING_MODULE -DPREFIX_NAME=_gfortran_caf_ -Dcaf_mpi_EXPORTS -I/usr/lib/arm-linux-gnueabi/openmpi/lib -I/usr/lib/arm-linux-gnueabi/openmpi/include -I/usr/lib/arm-linux-gnueabi/openmpi/include/openmpi/opal/mca/event/libevent2022/libevent/include -I/usr/lib/arm-linux-gnueabi/openmpi/include/openmpi/opal/mca/event/libevent2022/libevent -I/usr/lib/arm-linux-gnueabi/openmpi/include/openmpi -I/«PKGBUILDDIR»/src -I/«PKGBUILDDIR»/obj-arm-linux-gnueabi/mod  -g -O2 -fdebug-prefix-map=/«PKGBUILDDIR»=. -fstack-protector-strong -J../../mod -fPIC   -c /«PKGBUILDDIR»/src/extensions/opencoarrays.F90 -o CMakeFiles/caf_mpi.dir/__/extensions/opencoarrays.F90.o
f951: Fatal Error: Can't rename module file '../../mod/opencoarrays.mod0' to '../../mod/opencoarrays.mod': No such file or directory
compilation terminated.
src/mpi/CMakeFiles/caf_mpi.dir/build.make:113: recipe for target 'src/mpi/CMakeFiles/caf_mpi.dir/__/extensions/opencoarrays.F90.o' failed
make[3]: *** [src/mpi/CMakeFiles/caf_mpi.dir/__/extensions/opencoarrays.F90.o] Error 1
make[3]: Leaving directory '/«PKGBUILDDIR»/obj-arm-linux-gnueabi'
CMakeFiles/Makefile2:178: recipe for target 'src/mpi/CMakeFiles/caf_mpi.dir/all' failed
make[2]: *** [src/mpi/CMakeFiles/caf_mpi.dir/all] Error 2

This appears (randomly?) when make -j2 or -j4 is used, but not with -j1. A race condition?

This is with openmpi-2.0.2.

@zbeekman
Collaborator

This appears to be a CMake bug, although I need to confirm this to ensure we're setting up the dependencies correctly. Which version of CMake are you using?

@zbeekman zbeekman self-assigned this Apr 25, 2017
@zbeekman
Collaborator

@amckinstry I suspect that your patch to src/mpi/CMakeLists.txt is to blame for this race condition. IMO, this IS still a CMake bug: CMake should be able to build a static and a shared Fortran library that use module files in parallel. Because of the way CMake handles .mod files, building two targets that produce the same .mod file appears to cause a race condition. I suspect that if we serialize the build of the shared and static libs, this problem will go away.
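
One way to try the serialization idea is an explicit target-level dependency. The following is only a sketch, assuming the two targets are named caf_mpi and caf_mpi_static as in the Debian patch; it is not necessarily how the project ended up resolving the issue.

```cmake
# Sketch only: both targets still compile opencoarrays.F90, but never at the
# same time, so they cannot collide on ../../mod/opencoarrays.mod.
add_library(caf_mpi SHARED mpi_caf.c ../common/caf_auxiliary.c ${MPI_CAF_FORTRAN_FILES})
add_library(caf_mpi_static STATIC mpi_caf.c ../common/caf_auxiliary.c ${MPI_CAF_FORTRAN_FILES})

# The static target is not started until the shared target is completely built.
add_dependencies(caf_mpi_static caf_mpi)
```

The cost is lost parallelism between the two libraries, and the module is still compiled twice.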

Knowing which version of CMake you're using would still be helpful so I can test/confirm locally and then submit a bug report to CMake.

@amckinstry
Author

3.7.2
It might also be triggered by my patch:
#365 (comment)
but I can't tell why.

@zbeekman
Collaborator

> It might also be triggered by my patch:
> #365 (comment)
> but I can't tell why.

Yes, it is triggered by your patch. It's because we're compiling the coarrays extension module as part of the library, so both the shared and the static target compile it. If we were to make that its own target, this would go away. I'll look into this.
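
A rough sketch of what "its own target" could look like for the patched src/mpi/CMakeLists.txt. The names opencoarrays_mod and MPI_CAF_FORTRAN_OBJECTS are hypothetical, and the sketch assumes that directory-wide definitions such as PREFIX_NAME and the module output directory also apply to the new target.

```cmake
if(MPI_Fortran_MODULE_COMPILES)
  # Compile the extension module exactly once (hypothetical target name).
  add_library(opencoarrays_mod OBJECT ../extensions/opencoarrays.F90)
  # The object file also goes into the shared library, so it must be PIC.
  set_target_properties(opencoarrays_mod PROPERTIES POSITION_INDEPENDENT_CODE ON)
  set(MPI_CAF_FORTRAN_OBJECTS $<TARGET_OBJECTS:opencoarrays_mod>)
endif()

# Both libraries reuse the same object file; neither recompiles the module.
add_library(caf_mpi SHARED mpi_caf.c ../common/caf_auxiliary.c ${MPI_CAF_FORTRAN_OBJECTS})
add_library(caf_mpi_static STATIC mpi_caf.c ../common/caf_auxiliary.c ${MPI_CAF_FORTRAN_OBJECTS})
```

With this layout only one compiler invocation ever writes opencoarrays.mod, at the price of building the static library's copy of that object with -fPIC as well.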

@zbeekman
Collaborator

Moving patch issue from #365 here because implementing the patch goes hand in hand with fixing the parallel build issue...

Index: ./src/mpi/CMakeLists.txt
===================================================================
--- ./src/mpi/CMakeLists.txt
+++ ./src/mpi/CMakeLists.txt
@@ -36,7 +36,8 @@ if (MPI_Fortran_MODULE_COMPILES)
   set(MPI_CAF_FORTRAN_FILES ../extensions/opencoarrays.F90)
 endif()

-add_library(caf_mpi mpi_caf.c ../common/caf_auxiliary.c ${MPI_CAF_FORTRAN_FILES})
+add_library(caf_mpi SHARED mpi_caf.c ../common/caf_auxiliary.c ${MPI_CAF_FORTRAN_FILES})
+add_library(caf_mpi_static STATIC  mpi_caf.c ../common/caf_auxiliary.c ${MPI_CAF_FORTRAN_FILES})
 target_link_libraries(caf_mpi PRIVATE ${MPI_C_LIBRARIES} ${MPI_Fortran_LIBRARIES})

 set_target_properties ( caf_mpi
@@ -53,9 +54,14 @@ endif()
 include_directories(${CMAKE_BINARY_DIR}/mod)

 install(TARGETS caf_mpi EXPORT OpenCoarraysTargets
+  DESTINATION "${CMAKE_INSTALL_LIBDIR}"
+  LIBRARY DESTINATION "${CMAKE_INSTALL_LIBDIR}"
+)
+install(TARGETS caf_mpi_static EXPORT OpenCoarraysTargets
   ARCHIVE DESTINATION "${CMAKE_INSTALL_LIBDIR}"
   LIBRARY DESTINATION "${CMAKE_INSTALL_LIBDIR}"
 )
+set_target_properties(caf_mpi PROPERTIES SOVERSION 1 SONAME "libcaf_mpi.so.${OpenCoarraysVersion}")

 # Install modules to standard include dir, but namespace them with compiler/version
 set (mod_install "OpenCoarrays/${CMAKE_Fortran_COMPILER_ID}/${CMAKE_Fortran_COMPILER_VERSION}")

dylib-targets.patch.txt

@zbeekman zbeekman changed the title Build failure when compiling in parallel Build failure when compiling w/ patched build system in parallel May 3, 2017
@adamryczkowski

Hello. I come from a completely different Fortran project. We have also hit what I believe is the same race condition during compilation. Unlike you, we don't build shared libraries, though. The error is non-deterministic and hard to reproduce reliably. I'd like to file a bug report against (CMake? gfortran?), but I can't really nail it down.

I understand from this topic that you also didn't find the root cause of the bug. The bug went away accidentally with the fix for another issue, #365. Am I correct?

@zbeekman
Collaborator

zbeekman commented May 31, 2017

@adamryczkowski

> I understand from this topic that you also didn't find the root cause of the bug. The bug went away accidentally with the fix for another issue, #365. Am I correct?

No, I know the root cause of this behavior. It is due to CMake's handling of .mod files. The "bug" arises when one or more Fortran source files containing modules are used in the compilation of multiple targets. CMake manipulates the .mod file output by the compiler, so make (or whatever "generator" you're using) might launch parallel compilation tasks for each target that has the Fortran source with the module, and then one of those jobs ends up overwriting, moving, or renaming the output .mod file while the other one still needs it.

CMake/Kitware is aware of this behavior, and it is intentional (not a bug). In fact, under some circumstances the same problem may be encountered using make directly. The CMake way to handle this is to put the Fortran source file(s) with the module into their own library. If you don't want to explicitly build an additional library you can create an "object library" which is a sort of alias to a collection of object files. You can't install or export this "object library" target, but you CAN pass it, via a generator expression, in the sources list to add_library and add_executable:

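# Compile the module sources exactly once, in their own object library.
add_library(my_obj_lib OBJECT ftn_mod1.f90 ftn_mod2.f90 ftn_mod3.f90)
# Reuse the resulting object files in every target that needs the modules.
add_library(lib_using_ftn_mods STATIC $<TARGET_OBJECTS:my_obj_lib> lib_src.f90)
# main.f90 is a placeholder for the program source (sources must be source files, not a.out).
add_executable(exe_using_ftn_mods $<TARGET_OBJECTS:my_obj_lib> main.f90)
add_library(other_ftn_lib STATIC $<TARGET_OBJECTS:my_obj_lib> other_lib_src.f90)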

Now CMake compiles object files (and corresponding .mod files) into the target my_obj_lib first and then is able to safely use them across multiple targets compiled in parallel.
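
One detail the snippet leaves implicit is where the .mod files land and how consumers find them. A minimal sketch, assuming a single shared module directory (the variable name mod_dir and the file names are placeholders):

```cmake
cmake_minimum_required(VERSION 3.7)
project(objlib_demo Fortran)

# Assumed layout: one directory that holds every generated .mod file.
set(mod_dir ${CMAKE_BINARY_DIR}/mod)

# The only target that compiles, and therefore writes, the module.
add_library(my_obj_lib OBJECT ftn_mod1.f90)
set_target_properties(my_obj_lib PROPERTIES Fortran_MODULE_DIRECTORY ${mod_dir})

# Consumers only read the .mod file, so they just need the directory on
# their include path.
add_library(lib_using_ftn_mods STATIC $<TARGET_OBJECTS:my_obj_lib> lib_src.f90)
target_include_directories(lib_using_ftn_mods PRIVATE ${mod_dir})
```

Keeping a single writer per .mod file is the point; any number of readers can follow once the object library has been built.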

Hope this helps!

@zbeekman
Collaborator

zbeekman commented Sep 5, 2017

Fixed in #440

@zbeekman zbeekman closed this as completed Sep 5, 2017
@tclune

tclune commented Sep 5, 2017

I am seeing a very similar race condition, but was thinking this problem was with the Intel Fortran compiler.

In my case I compile the same source directory twice in two different build directories. Each creates a different library and the .mod files are copied to different "include" subdirectories. (One build is single precision and one is double precision, but I don't think that is relevant here.) I was thinking that the Intel compiler is somehow using a shared resource because the source file is the same for both builds and I get errors like "empty stream". CMake uses Intel's "-module" flag to relocate the .mod files, and I think this is correlated.

OTOH, I've not been able to make a non-cmake reproducer for this behavior yet. I'll try some more experiments tomorrow. Unfortunately, I only hit the error in about 50% of my builds and it takes about 2 minutes to get to the race after a make clean.

@zbeekman
Collaborator

zbeekman commented Sep 5, 2017

oh boy, that doesn't sound fun to debug...

I'm not sure I completely understand though... are the builds in the two separate directories (single and double precision) happening concurrently? Are they triggered manually or by a script/super-build?

My thoughts on the matter are:

  1. It's possible that the connection with two build directories could be a red herring, if a parallel build is being attempted in each; there could be source files defining modules used to build multiple targets that don't always get serialized
  2. If the two build directories are built at the same time, it's possible that the compilation order may be non-deterministic if the build system is trying to compile more files than there are available threads. If the collision is somehow happening across the two build directories, then I would suspect Intel is to blame, but this strikes me as unlikely.

I would make sure that all source files containing module definitions appear in one and only one add_library or add_executable CMake statement. If you need to specify a source file defining a module in multiple targets, your best bet is to move to using the "object" library w/ generator expression technique. Another idea is to serialize targets referencing the same source file defining the module, but this will slow down the build in an unnecessary fashion.

I've never been one to shy away from accusing Intel of having Fortran compiler bugs, but the "empty stream" error you're getting really sounds to me like a CMake parallel build issue. This CMake behavior is pretty obnoxious, IMO, whether they consider it a "bug" or not, and I'd be willing to share my opinions on the matter with them if you file a new issue.

@zbeekman
Collaborator

zbeekman commented Sep 5, 2017

Just to elaborate more, the "empty stream" error sounds like one instance/thread of ifort is opening a freshly generated .mod file to compile or link a target, but then another target has been told to recompile with the source file creating that .mod file and is stepping on it after it's been opened by the first compiler thread/instance for reading.

@tclune

tclune commented Sep 5, 2017

The directory in question is built in two separate directories. This is using cmake's optional binary directory argument for add_subdirectory. Until recently, only one of the directories was actually being built because only one was a dependency for my targets. Having resolved a long-standing issue with another target, both directories are now dependencies for my ultimate target and so now both are being built, apparently simultaneously.

In theory, there is no need to serialize anything. If the compiler only creates files in the different working directories, it should be fine that two threads are reading from the same file simultaneously.

I've had a number of issues where Intel is "overly" clever about finding .mod files in directories that are not part of the build, so I'm perhaps predisposed to be suspicious here. I have a project where I can do a conventional GNUmake build in the source tree or a CMake build in a build tree. I've spent hours tracking down a problem that eventually turned out to be that Intel was first looking in the same directory as the source file for .mod files rather than the working directory. (No "-I" was pointing to the source directory.) May have been tied to a particular version of Intel, but I'm now careful to either clone or to do make clean if I need to check the GNUmake build.

@tclune

tclune commented Sep 5, 2017

Regarding your "empty stream" comment. Each invocation of the compiler should be creating a different .mod file. Either in a different build directory as explained above, or due to the use of CMake's module move option which specifies a different target directory for each precision.

Hence why I think Intel is doing something unnecessary (like a tmp file based upon the source file name) but possibly not technically a bug.
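
For readers trying to picture this setup, here is a hypothetical sketch of building one source directory twice with a separate module directory per precision. All directory, variable, and target names are invented for illustration and are not taken from the project being described.

```cmake
# --- Top-level CMakeLists.txt (sketch) ---
# Same source directory added twice, each time with its own binary directory.
set(precision r4)
add_subdirectory(shared_src shared_src_r4)
set(precision r8)
add_subdirectory(shared_src shared_src_r8)

# --- shared_src/CMakeLists.txt (sketch) ---
# The target name depends on ${precision}, so the two passes do not clash.
add_library(mylib_${precision} STATIC my_module.F90)
# Each precision writes its .mod files into its own directory.
set_target_properties(mylib_${precision} PROPERTIES
  Fortran_MODULE_DIRECTORY ${CMAKE_BINARY_DIR}/include/mylib_${precision})
```

In a layout like this the two compiler invocations share only the source file, which they read; the object files and .mod files go to different places.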

@tclune

tclune commented Sep 5, 2017

And of course, typing "make" a second time always appears to work, so it is more annoying than problematic. (May undermine my attempt to sell cmake to the org though.)

@zbeekman
Collaborator

zbeekman commented Sep 5, 2017 via email

@tclune

tclune commented Jul 21, 2018

I am now encountering this issue (or a similar one) with gfortran on Darwin. It happens almost immediately when I start to build (but of course not with VERBOSE=1 or in a serial build).

[  0%] Building Fortran object GMAO_Shared/MAPL_cfio_r4/CMakeFiles/MAPL_cfio_r4.dir/ESMF_CFIOBaseMod.f.o
[  0%] Building Fortran object GMAO_Shared/MAPL_pFUnit/CMakeFiles/MAPL_pFUnit.dir/ESMF_TestParameter.F90.o
[  0%] Building Fortran object GMAO_Shared/GMAO_pFIO/CMakeFiles/GMAO_pFIO.dir/pFIO_Constants.F90.o
[  0%] Building Fortran object GMAO_Shared/GMAO_pFIO/CMakeFiles/GMAO_pFIO.dir/pFIO_Constants.F90.o
f951: Fatal Error: Can't rename module file '../../include/GMAO_pFIO/pfio_constantsmod.mod0' to '../../include/GMAO_pFIO/pfio_constantsmod.mod': No such file or directory

Notice that apparently two different threads are trying to build the same file. I've double checked that this file is only listed once in a single target for the build. Further, it is not always this file or even this target that is affected. Another potential data point is that this is currently happening when I am building a top level target that includes EXCLUDE_FROM_ALL targets (building tests).

Hmmmm.

This is now with cmake 3.11.4

@tclune

tclune commented Jul 21, 2018

After a bit more investigating, I understand what was causing my current issue. It could easily be argued that it is not a CMake bug.

I had used ALLOW_DUPLICATE_CUSTOM_TARGETS for my tests and this was triggering simultaneous builds. I had already planned my next bit of work to be eliminating that cmake antipattern before I raised the issue above. Having done that now, the build seems fine again.
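
A common alternative to duplicate custom targets is a single aggregate target that every subdirectory attaches to. The sketch below is one possible shape of that, with hypothetical target and file names; it is not the actual change made in the project described here.

```cmake
# --- Top-level CMakeLists.txt (sketch) ---
# Create the aggregate target exactly once, instead of relying on
#   set_property(GLOBAL PROPERTY ALLOW_DUPLICATE_CUSTOM_TARGETS ON)
add_custom_target(tests)

# --- In each subdirectory's CMakeLists.txt (sketch) ---
# A uniquely named test executable, kept out of the default build ...
add_executable(pfio_tests EXCLUDE_FROM_ALL test_pfio.F90)
# ... and hooked into the single aggregate target.
add_dependencies(tests pfio_tests)
```

Because each target name exists only once, the build system has a single rule per object file and does not schedule the same compilation from two places.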

@zbeekman
Collaborator

Happy to hear you've sorted it out. This is always a source of confusion. My rule of thumb is never, ever allow a source file providing a .mod module file to be built by more than one thread. In practice this usually means the use of object libraries.

@tclune

tclune commented Jul 23, 2018 via email

@zbeekman
Collaborator

zbeekman commented Jul 23, 2018 via email
