Commit 4bcdb37

dulinriley authored and facebook-github-bot committed
Add more logging to check_supervision in case it is failing
Summary: For a OneWay message like `ControllerActor::check_supervision`, a failure halfway through the handler means it neither sends a message to the client nor reschedules itself to run again. We need logs to confirm that the message is sent. I suspect that occasionally, when a worker fails with a DeviceException, the failure is not propagated back to the client.

Reviewed By: shayne-fletcher

Differential Revision: D75232803

fbshipit-source-id: 25c6fa2e927a762305b8bbd73a181174a1129bda
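The failure mode described in the summary can be illustrated with a small Python sketch. This is not the controller's code: `FakeController`, `send_to_client`, and `deliver_one_way` are hypothetical names, and the sketch assumes that the handler re-arms itself only as its final step. It shows why a log line on each side of the send helps tell "never attempted" apart from "attempted but failed".

```python
import logging

logger = logging.getLogger("supervision_sketch")


class FakeController:
    """Hypothetical stand-in for an actor with a self-rescheduling one-way handler."""

    def __init__(self, fail_on_send):
        self.fail_on_send = fail_on_send
        self.sent_to_client = []
        self.rescheduled = False

    def send_to_client(self, failure):
        # Simulates the fallible step in the middle of the handler.
        if self.fail_on_send:
            raise ConnectionError("client unreachable")
        self.sent_to_client.append(failure)

    def check_supervision(self, failure):
        logger.error("Sending failure to client: %r", failure)
        self.send_to_client(failure)  # may raise; nothing below runs if it does
        logger.error("Failure successfully sent to client")
        # Assumption: the handler only re-arms itself at the very end.
        self.rescheduled = True


def deliver_one_way(actor, failure):
    """One-way delivery: the sender gets no reply, so a handler error is only logged."""
    try:
        actor.check_supervision(failure)
    except Exception:
        logger.exception("check_supervision failed part way through")


healthy = FakeController(fail_on_send=False)
deliver_one_way(healthy, "DeviceException on worker 3")

broken = FakeController(fail_on_send=True)
deliver_one_way(broken, "DeviceException on worker 3")
```

In the `broken` case only the first log line appears, the client list stays empty, and `rescheduled` stays `False`, which is the silent stall the added logging is meant to expose.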
1 parent e3911c9 commit 4bcdb37

2 files changed: +3 -0 lines changed

controller/src/lib.rs

Lines changed: 2 additions & 0 deletions
@@ -519,10 +519,12 @@ impl ControllerMessageHandler for ControllerActor {
         address: failed_state.proc_addr.to_string(),
         backtrace: failure_reason,
     });
+    tracing::error!("Sending failure to client: {exc:?}");
     // Seq does not matter as the client will raise device error immediately before setting the results.
     self.client()?
         .result(this, Seq::default(), Some(Err(exc)))
         .await?;
+    tracing::error!("Failure successfully sent to client");

     // No need to set history failures as we are directly sending back failure results.
 }

python/monarch/common/client.py

Lines changed: 1 addition & 0 deletions
@@ -302,6 +302,7 @@ def _handle_pending_result(self, output: MessageResult) -> None:
     self.last_processed_seq = max(self.last_processed_seq, seq)

     if error is not None:
+        logging.error("Received error for seq %s: %s", seq, error)
         # We should not have set result if we have an error.
         assert result is None
         if not isinstance(error, RemoteException):
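As a rough illustration of what the new client-side log line records, the sketch below mirrors the shape of the hunk above in a standalone form. `PendingResults`, `handle`, and `ListHandler` are hypothetical names, not part of `monarch`; only the format string and the `max` update are taken from the diff. The handler that collects messages is there so the output can be inspected.

```python
import logging
from typing import Optional

logger = logging.getLogger("client_sketch")


class ListHandler(logging.Handler):
    """Collects formatted messages so the example can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


class PendingResults:
    """Hypothetical tracker that mimics the sequence bookkeeping in the diff."""

    def __init__(self) -> None:
        self.last_processed_seq = -1

    def handle(self, seq: int, result: Optional[object], error: Optional[Exception]) -> None:
        # Results may arrive out of order, so keep the highest seq seen so far.
        self.last_processed_seq = max(self.last_processed_seq, seq)
        if error is not None:
            # Lazy %s formatting: the message is built only if the record is emitted.
            logger.error("Received error for seq %s: %s", seq, error)
            # An error and a result should never arrive together.
            assert result is None


handler = ListHandler()
logger.addHandler(handler)
logger.propagate = False  # keep the sketch's output out of the root logger

tracker = PendingResults()
tracker.handle(5, None, RuntimeError("device failed"))
tracker.handle(3, "ok", None)
```

After these two calls, `handler.messages` holds a single entry for seq 5 and `tracker.last_processed_seq` remains 5, since the later seq 3 result does not lower it.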
