Pull in failed images support. #370

vehre · 2017-04-30T11:42:59Z

Avg response time	coverage on master
![Issue Stats][PR response img]	![Codecov branch][coverage]

Summary of changes

Added failed images support.
Fixed #309.
Fixed #354.

Rationale for changes

It wasn't present yet?!

Additional info and certifications

This pull request (PR) is a:

Bug fix
Feature addition
Other, Please describe:

I certify that:

I have reviewed the [contributing guidelines]
If this PR is a work in progress I have added WIP: to the
beginning of the PR title
If this PR is problematic for any reason, I have added
DO NOT MERGE: to the beginning of the title
The branch name and title of this PR contains the text
issue-<#> where <#> is replaced by the issue that this PR
is addressing
I have deleted trailing white space on any lines that this PR
touches
I have used spaces for indentation on any lines that this PR
touches
I have included some comments to explain non-obvious code
changes
I have run the tests localy (ctest) and all tests pass
Each commit is a logically atomic, self-consistent, cohesive
set of changes

Come on! Get a grip on all these options. My time is limited!

ATTENTION !!! Build-system work is needed !!!

For this patch to be most effective these requirements need to be added to the build system:

I propose to add a general flag to the build system, to allow the user to switch on failed_images support on demand. Additionally to this flag the following lines are needed in the build system:

if (compiler is 'gcc' and gcc-version >= 7 and mpi is 'mpich' and mpich-version >= 3.2) then
add_definition ('-DUSE_FAILED_IMAGES') to compile-statement of caf_mpi.c

// There is currently no sufficient support for openmpi or other mpi-implementations for fault tolerant mpi. When they add support other may be added.

In the top-most CMakeList, the #ifdef WITH_FAIL_IMAGES needs to be replaced by something reasonable to switch on these tests only when failed images support is present.

I have also added another preprocessor flag 'STOP_ON_UNSUPPORTED' independent of the above to allow stopping execution when an unsupported or unimplemented feature is hit. (see caf_mpi.c at the end).

Tested with gcc-trunk (aka 8.0), gcc-6.3.1 (the failed images test are not compilable, because 6.3 does know nothing about the new functions) and mpich v3.2 on Fedora 25 x86_64.

zbeekman · 2017-05-07T04:16:55Z

src/tests/unit/fail_images/image_fail_test_1.f90

+    if (me == 2) print *, "Test failed."
+
+    sync all (STAT=syncAllStat)
+    if (me == 1) print *, "Test passed."


Should we check syncAllStat==STAT_FAILED_IMAGE?

never mind, I see you have done a very nice and thorough job testing all aspects (that are currently obvious to me) of failed images! Nice work! 🎉 💯 🥇 🙏 🙇

We could, but I like to keep unit tests as small as necessary to test a single feature. The check whether sync all returns the correct stat-value is done in a separate test.

zbeekman · 2017-05-07T04:28:16Z

This looks good to me, all that's left is to merge in the latest changes and provide some CMake logic to ensure tests are only run with GCC 7 and MPICH 3.2, as well as minor cleanups etc.

Very nice work @afanfa and @vehre 👏

rouson · 2017-05-23T03:20:14Z

@zbeekman Any chance this can be merged this week? I start teaching the short course at the University of Cyprus on Wednesday and would like to give the attendees an outline of what I'll cover. I hope to introduce failed images next week.

zbeekman · 2017-05-23T03:25:00Z

Yes I'm close. Adding required build system stuff. Hope it will be done by the call.

…

On Mon, May 22, 2017 at 11:20 PM Damian Rouson ***@***.***> wrote: @zbeekman <https://github.com/zbeekman> Any chance this can be merged this week? I start teaching the short course at the University of Cyprus on Wednesday and would like to give the attendees an outline of what I'll cover. I hope to introduce failed images next week. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#370 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAREPAahlaNzb-d21QYawiV2vb1BFKJJks5r8lBugaJpZM4NMi7C> .

- Ensure that tests requiring GCC 7 are only built when it is available - Add a default option (set to ON) to enable failed images support when the required experimental/proposed fault tolerant MPI features are detected.

This will create better grouping in `ccmake` and `cmake-gui`

- Test for existence, assume symbols/functionality provided elsewhere if it is missing. Should induce compile time error if this is not true.

…rrays into vehre/failed-images

codecov · 2017-05-23T22:58:17Z

Codecov Report

Merging #370 into master will increase coverage by 4.07%.
The diff coverage is 62%.

@@            Coverage Diff             @@
##           master     #370      +/-   ##
==========================================
+ Coverage    39.3%   43.38%   +4.07%     
==========================================
  Files           3        3              
  Lines        1707     1950     +243     
  Branches      294      320      +26     
==========================================
+ Hits          671      846     +175     
- Misses        937      993      +56     
- Partials       99      111      +12

May fix #388 (issue with Clang 4 on FreeBSD)

zbeekman · 2017-05-24T00:29:03Z

@vehre It looks like there may be a race condition or another parallelism issue: image_fail_and_sync_test_2 seems to reliably hang if you run the test multiple times in succession: https://travis-ci.org/sourceryinstitute/OpenCoarrays/jobs/235412457#L1160

If you run with, e.g., ctest --output-on-failure --repeat-until-fail 10 you might be able to trigger this on your own machine... another idea may be to try running it more oversubscribed.

This test is similar to the tests bearing the same name, but with the 1 and 2 suffixes. The biggest difference is that this test calls sync images(*, STAT=...) whereas the other two test sync all (STAT=...) and image_status().

I've created a branch with a spurious sync all (STAT=...) ahead of the sync images(*, STAT=...) to see if that will prevent the bug/race condition.

zbeekman · 2017-05-24T17:44:04Z

@vehre: Adding the extra synchronization in 3e11eac causes image_fail_and_sync_test_2 to pass every time now: https://travis-ci.org/sourceryinstitute/OpenCoarrays/jobs/235695125#L4287

This leads me to suspect that there is a race condition in sync images(*, STAT=...) when an image is failing... I think for now I'll add this to the "developer only" tests and open a new issue for it.

zbeekman · 2017-05-24T21:39:48Z

@verhe: Additional update: It seems that some other the other failed images tests can fail sometimes and that my attempted work around exhibits this... I'm not really sure what's going on.

Is it possible that MPI could have a bug/issue where the -disable-auto-cleanup flag causes some issues if a new job is launched too quickly afterwards? I'm going to push some other small commits and see if I see this behavior again or not.. if it always works the first time it is launched, perhaps MPI has a bug that is dependent on machine state. If this is the case I guess I'll put any test that fails regularly in the "developer tests" so that we can move on.

- Intermittent failure on Travis-CI when run repeatedly `ctest --output-on-failure --repeat-until-fail 10` - Bug report: #390

vehre · 2017-05-25T11:21:57Z

@zbeekman I don't have a single issue with running the testcases even multiple times. My mpi-version is mpich-3.2-6 Fedora system supplied.

As for the failing travis-run for gcc7, I suppose that it is simply timing out after 10 minutes total runtime.

…nds. That the number of images failed is reported incorrectly, may happen when in the finalization stage. To prevent this, the mpi-error handler now checks for the correct number and on incorrect ones just exists ok. This commit should Fix #390.

…encoarrays into vehre/failed-images

vehre · 2017-05-25T11:54:44Z

I only see segfaults on massive oversubscription like 64 processes on a 4-core machine. I have added a commit that addresses these.

zbeekman · 2017-05-25T11:58:28Z

Ok, the time out is because the one test hangs for 10+ minutes, not because the entire run takes over ten minutes. But I suspect this might be a virtualization/cloud/MPI issue so until we get reports of others encountering it or find a way to reliably reproduce I'm comfortable closing the issue & merging

…

On Thu, May 25, 2017 at 7:54 AM Dr. Andre Vehreschild < ***@***.***> wrote: I only see segfaults on massive oversubscription like 64 processes on a 4-core machine. I have added a commit that addresses these. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#370 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAREPMu3_Vf5S5cl1pBtciMaZix4sd9Oks5r9WwEgaJpZM4NMi7C> .

zbeekman · 2017-05-25T13:41:45Z

LGTM

zbeekman · 2017-05-25T13:46:02Z

@vehre I'm going to delete this branch on Github since this is merged. Let me know if you would like me to restore it for any reason. Thanks so much to @vehre and @afanfa for your hard work, and ninja coding skills! 💯 🎉

Alessandro Fanfarillo added 30 commits May 9, 2016 08:48

Multiple communicators

26d788f

fail image and communicator error handling

2ba109e

MPI Win error handling

b6688c1

MPI Win error handling

fb24a1b

Failed image number detected

9226f3a

Image_status

6dc12a4

Disabling general error detection in image_status

b38abc2

Supporting stat_stopped_image in image_status

a99a2cc

Resuming multiple communicators

c01a05a

Failed images draft

8cdc90e

Shrinking communicator

d6640d9

Partial fault tolerant support for locks

17699e5

Partial patch for locks

d6a57b8

Partial patch for locks

b72742d

Minor changes

23d2f47

Debug message as last instruction

b9d0761

Working version of failed_images

b3d2f9e

Sorting failed images

aa4a61a

Managing stat for sync all

c4841e8

Managing stat for sync all

01b75ce

Fixed bug in sync_all

5448ec5

Fixed bug in sync_all

be8d98f

Minor change

c12de97

Adding stat constraint to sync images, lock, unlock

f7de5fd

Adding stat constraint to event post and event wait

32e4cb4

Fixed few bugs on

c1190e8

Adding stat constraint to allocate

881cc26

Fixed few bugs on

02ffc81

Fixed bug in register

f607483

Using MPI_Comm_revoke

4fb6da1

zbeekman reviewed May 7, 2017

View reviewed changes

Merge branch 'master' into vehre/failed-images

e720481

Damian Rouson and others added 5 commits May 23, 2017 21:17

Add comment to syncimages_status test

d45aa2c

Update build system for failed images

0eb8345

- Ensure that tests requiring GCC 7 are only built when it is available - Add a default option (set to ON) to enable failed images support when the required experimental/proposed fault tolerant MPI features are detected.

Rename CMake OpenCoarrays developer tests option

dd8442e

This will create better grouping in `ccmake` and `cmake-gui`

Handle <alloca.h> better

151eb80

- Test for existence, assume symbols/functionality provided elsewhere if it is missing. Should induce compile time error if this is not true.

Merge branch 'master' of https://github.com/sourceryinstitute/OpenCoa…

244b01a

…rrays into vehre/failed-images

Ensure that libcaf.h is early in includes

59b0165

May fix #388 (issue with Clang 4 on FreeBSD)

zbeekman and others added 2 commits May 24, 2017 19:41

Flag image_fail_and_sync_test_2 as developer

9c6eb82

- Intermittent failure on Travis-CI when run repeatedly `ctest --output-on-failure --repeat-until-fail 10` - Bug report: #390

Merge branch 'master' into vehre/failed-images

9674d71

vehre added 2 commits May 25, 2017 13:44

Merge branch 'vehre/failed-images' of github.com:sourceryinstitute/op…

b8e9f56

…encoarrays into vehre/failed-images

zbeekman changed the title ~~WIP: Pull in failed images support.~~ Pull in failed images support. May 25, 2017

zbeekman removed WIP waiting labels May 25, 2017

zbeekman merged commit 5915924 into master May 25, 2017

zbeekman removed the needs review label May 25, 2017

zbeekman deleted the vehre/failed-images branch May 25, 2017 13:46

Uh oh!

Pull in failed images support. #370

Pull in failed images support. #370

Uh oh!

Conversation

vehre commented Apr 30, 2017

Summary of changes

Rationale for changes

Additional info and certifications

Uh oh!

zbeekman May 7, 2017

Choose a reason for hiding this comment

Uh oh!

zbeekman May 7, 2017

Choose a reason for hiding this comment

Uh oh!

vehre May 8, 2017

Choose a reason for hiding this comment

Uh oh!

zbeekman commented May 7, 2017

Uh oh!

rouson commented May 23, 2017

Uh oh!

zbeekman commented May 23, 2017 via email

Uh oh!

codecov bot commented May 23, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

zbeekman commented May 24, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zbeekman commented May 24, 2017

Uh oh!

zbeekman commented May 24, 2017

Uh oh!

vehre commented May 25, 2017

Uh oh!

vehre commented May 25, 2017

Uh oh!

zbeekman commented May 25, 2017 via email

Uh oh!

zbeekman commented May 25, 2017 • edited by rouson Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zbeekman commented May 25, 2017

Uh oh!

Uh oh!

codecov bot commented May 23, 2017 •

edited

Loading

zbeekman commented May 24, 2017 •

edited

Loading

zbeekman commented May 25, 2017 •

edited by rouson

Loading