Skip to content

Pull in failed images support. #370

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 67 commits into from
May 25, 2017
Merged

Pull in failed images support. #370

merged 67 commits into from
May 25, 2017

Conversation

vehre
Copy link
Collaborator

@vehre vehre commented Apr 30, 2017

Avg response time coverage on master
![Issue Stats][PR response img] ![Codecov branch][coverage]

Summary of changes

Added failed images support.
Fixed #309.
Fixed #354.

Rationale for changes

It wasn't present yet?!

Additional info and certifications

This pull request (PR) is a:

  • Bug fix
  • Feature addition
  • Other, Please describe:

I certify that:

  • I have reviewed the [contributing guidelines]
  • If this PR is a work in progress I have added WIP: to the
    beginning of the PR title
  • If this PR is problematic for any reason, I have added
    DO NOT MERGE: to the beginning of the title
  • The branch name and title of this PR contains the text
    issue-<#> where <#> is replaced by the issue that this PR
    is addressing
  • I have deleted trailing white space on any lines that this PR
    touches
  • I have used spaces for indentation on any lines that this PR
    touches
  • I have included some comments to explain non-obvious code
    changes
  • I have run the tests localy (ctest) and all tests pass
  • Each commit is a logically atomic, self-consistent, cohesive
    set of changes

Come on! Get a grip on all these options. My time is limited!

ATTENTION !!! Build-system work is needed !!!

For this patch to be most effective these requirements need to be added to the build system:

I propose to add a general flag to the build system, to allow the user to switch on failed_images support on demand. Additionally to this flag the following lines are needed in the build system:

if (compiler is 'gcc' and gcc-version >= 7 and mpi is 'mpich' and mpich-version >= 3.2) then
add_definition ('-DUSE_FAILED_IMAGES') to compile-statement of caf_mpi.c

// There is currently no sufficient support for openmpi or other mpi-implementations for fault tolerant mpi. When they add support other may be added.

In the top-most CMakeList, the #ifdef WITH_FAIL_IMAGES needs to be replaced by something reasonable to switch on these tests only when failed images support is present.

I have also added another preprocessor flag 'STOP_ON_UNSUPPORTED' independent of the above to allow stopping execution when an unsupported or unimplemented feature is hit. (see caf_mpi.c at the end).

Tested with gcc-trunk (aka 8.0), gcc-6.3.1 (the failed images test are not compilable, because 6.3 does know nothing about the new functions) and mpich v3.2 on Fedora 25 x86_64.

if (me == 2) print *, "Test failed."

sync all (STAT=syncAllStat)
if (me == 1) print *, "Test passed."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check syncAllStat==STAT_FAILED_IMAGE?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

never mind, I see you have done a very nice and thorough job testing all aspects (that are currently obvious to me) of failed images! Nice work! 🎉 💯 🥇 🙏 🙇

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but I like to keep unit tests as small as necessary to test a single feature. The check whether sync all returns the correct stat-value is done in a separate test.

@zbeekman
Copy link
Collaborator

zbeekman commented May 7, 2017

This looks good to me, all that's left is to merge in the latest changes and provide some CMake logic to ensure tests are only run with GCC 7 and MPICH 3.2, as well as minor cleanups etc.

Very nice work @afanfa and @vehre 👏

@rouson
Copy link
Member

rouson commented May 23, 2017

@zbeekman Any chance this can be merged this week? I start teaching the short course at the University of Cyprus on Wednesday and would like to give the attendees an outline of what I'll cover. I hope to introduce failed images next week.

@zbeekman
Copy link
Collaborator

zbeekman commented May 23, 2017 via email

Damian Rouson and others added 5 commits May 23, 2017 21:17
 - Ensure that tests requiring GCC 7 are only built when it is available
 - Add a default option (set to ON) to enable failed images support when
   the required experimental/proposed fault tolerant MPI features are
   detected.
 This will create better grouping in `ccmake` and `cmake-gui`
 - Test for existence, assume symbols/functionality provided elsewhere
   if it is missing. Should induce compile time error if this is not
   true.
@codecov
Copy link

codecov bot commented May 23, 2017

Codecov Report

Merging #370 into master will increase coverage by 4.07%.
The diff coverage is 62%.

@@            Coverage Diff             @@
##           master     #370      +/-   ##
==========================================
+ Coverage    39.3%   43.38%   +4.07%     
==========================================
  Files           3        3              
  Lines        1707     1950     +243     
  Branches      294      320      +26     
==========================================
+ Hits          671      846     +175     
- Misses        937      993      +56     
- Partials       99      111      +12

 May fix #388 (issue with Clang 4 on FreeBSD)
@zbeekman
Copy link
Collaborator

zbeekman commented May 24, 2017

@vehre It looks like there may be a race condition or another parallelism issue: image_fail_and_sync_test_2 seems to reliably hang if you run the test multiple times in succession: https://travis-ci.org/sourceryinstitute/OpenCoarrays/jobs/235412457#L1160

If you run with, e.g., ctest --output-on-failure --repeat-until-fail 10 you might be able to trigger this on your own machine... another idea may be to try running it more oversubscribed.

This test is similar to the tests bearing the same name, but with the 1 and 2 suffixes. The biggest difference is that this test calls sync images(*, STAT=...) whereas the other two test sync all (STAT=...) and image_status().

I've created a branch with a spurious sync all (STAT=...) ahead of the sync images(*, STAT=...) to see if that will prevent the bug/race condition.

@zbeekman
Copy link
Collaborator

@vehre: Adding the extra synchronization in 3e11eac causes image_fail_and_sync_test_2 to pass every time now: https://travis-ci.org/sourceryinstitute/OpenCoarrays/jobs/235695125#L4287

This leads me to suspect that there is a race condition in sync images(*, STAT=...) when an image is failing... I think for now I'll add this to the "developer only" tests and open a new issue for it.

@zbeekman
Copy link
Collaborator

@verhe: Additional update: It seems that some other the other failed images tests can fail sometimes and that my attempted work around exhibits this... I'm not really sure what's going on.

Is it possible that MPI could have a bug/issue where the -disable-auto-cleanup flag causes some issues if a new job is launched too quickly afterwards? I'm going to push some other small commits and see if I see this behavior again or not.. if it always works the first time it is launched, perhaps MPI has a bug that is dependent on machine state. If this is the case I guess I'll put any test that fails regularly in the "developer tests" so that we can move on.

zbeekman and others added 2 commits May 24, 2017 19:41
 - Intermittent failure on Travis-CI when run repeatedly
   `ctest --output-on-failure --repeat-until-fail 10`
 - Bug report: #390
@vehre
Copy link
Collaborator Author

vehre commented May 25, 2017

@zbeekman I don't have a single issue with running the testcases even multiple times. My mpi-version is mpich-3.2-6 Fedora system supplied.

As for the failing travis-run for gcc7, I suppose that it is simply timing out after 10 minutes total runtime.

vehre added 2 commits May 25, 2017 13:44
…nds.

That the number of images failed is reported incorrectly, may happen when in the finalization stage.
To prevent this, the mpi-error handler now checks for the correct number and on incorrect ones
just exists ok.

This commit should
Fix #390.
@vehre
Copy link
Collaborator Author

vehre commented May 25, 2017

I only see segfaults on massive oversubscription like 64 processes on a 4-core machine. I have added a commit that addresses these.

@zbeekman
Copy link
Collaborator

zbeekman commented May 25, 2017 via email

@zbeekman zbeekman changed the title WIP: Pull in failed images support. Pull in failed images support. May 25, 2017
@zbeekman
Copy link
Collaborator

zbeekman commented May 25, 2017

LGTM

Approved with PullApprove

@zbeekman zbeekman merged commit 5915924 into master May 25, 2017
@zbeekman
Copy link
Collaborator

@vehre I'm going to delete this branch on Github since this is merged. Let me know if you would like me to restore it for any reason. Thanks so much to @vehre and @afanfa for your hard work, and ninja coding skills! 💯 🎉

@zbeekman zbeekman deleted the vehre/failed-images branch May 25, 2017 13:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Defect: SYNC (ALL|IMAGES) w/o STAT do not error out on error stop_numeric, stop_string not handled correctly
3 participants