Commit ad7c121
Bug#29885899 NDB : LCP F-S WATCHDOG FIRES DUE TO GCP STOP IN STATE LCP_WAIT_END_LCP
Problem

The LCP watchdog monitors LCP progress in an LDM instance, including the fragment scans and the overall LCP round with its end phase.

During the LCP, fragment data files from the previous round are scheduled for deletion when the latest fragment data files become valid, on some pending GCI boundary. The end of the LCP round is delayed until all scheduled data file deletions have completed, so there is an indirect dependency from the LCP on the GCP protocol: LCP completion requires that GCPs complete.

The GCP protocol is distributed and multi-phase. It has its own GCP Monitor, which checks its liveness and takes escalating action to evict lagging nodes if the protocol stalls, potentially allowing the cluster to survive situations causing a stall on one or more nodes (overload, IO problems, software bugs, ...).

The LCP watchdog is designed to give the GCP Monitor time to take effect before it takes action itself. This is essential, as the LCP watchdog observes only node-local conditions: if there is a distributed GCP stall, all node-local LCP watchdogs will detect it at around the same time and escalate to node failure, which is very likely to cause cluster failure.

In the default case, where no specific GCP Monitor timeout is defined, the GCP Monitor takes no escalating action when the GCP protocol stalls. If the stall lasts longer than the LCP watchdog limit (default 60s in 7.6, 180s in 8.0), then all LCP watchdog instances will consider their local LCPs stalled and escalate to node failures, resulting in cluster failure. While this escalation raises the profile of the problem, it presents a GCP problem as an LCP problem, and may turn a node-local problem into a full-cluster problem, defeating the system's redundancy.

Solution

The LCP watchdog should always allow GCP handling to take precedence in the wait-end state.

Where no GCP Monitor timeout is configured, the LCP watchdog should take no escalating action while it is waiting for the LCP to end. In this situation, the GCP Monitor and the LCP watchdog will generate logs indicating that there is a stall, but no action will be taken. Users that want protection should configure the GCP Monitor timeout, giving protection from both the GCP Monitor and the LCP watchdog. A future fix will change the default for GCP protection so that all users have *some* loose GCP Monitor protection unless they specifically configure it away.

Testing
- Cause a GCP stall for a period longer than the LCP watchdog timeout
- Observe LCP watchdog logs with 0 max time
- Observe that the configured limit passes with no escalation

Reviewed by: Maitrayi Sabaratnam <[email protected]>
Change-Id: I19901096ddd753bc52ec8091e94db48424e9b432
1 parent eb3d7b2 commit ad7c121

File tree

1 file changed: +27 −5 lines changed

storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp

Lines changed: 27 additions & 5 deletions

@@ -31922,11 +31922,32 @@ Dblqh::checkLcpFragWatchdog(Signal* signal)
   Uint32 max_no_progress_time =
     c_lcpFragWatchdog.MaxElapsedWithNoProgressMillis;
 
-  if ((c_lcpFragWatchdog.lcpState == LcpStatusConf::LCP_WAIT_END_LCP) &&
-      (max_no_progress_time < c_lcpFragWatchdog.MaxGcpWaitLimitMillis))
+  if (c_lcpFragWatchdog.lcpState == LcpStatusConf::LCP_WAIT_END_LCP)
   {
     jam();
-    max_no_progress_time = c_lcpFragWatchdog.MaxGcpWaitLimitMillis;
+    /**
+     * In WAIT_END_LCP state we have a dependency on GCP completion
+     * We will therefore extend the allowed duration so that GCP
+     * related issues do not result in LCP shutdown
+     */
+    if (c_lcpFragWatchdog.MaxGcpWaitLimitMillis == 0)
+    {
+      jam();
+      /**
+       * No GCP limit set, therefore LCP should not time out in
+       * WAIT_END_LCP state.
+       */
+      max_no_progress_time = 0;
+    }
+    else if (max_no_progress_time < c_lcpFragWatchdog.MaxGcpWaitLimitMillis)
+    {
+      jam();
+      /**
+       * GCP limit set which is larger than LCP limit, extend
+       * LCP limit to cover it
+       */
+      max_no_progress_time = c_lcpFragWatchdog.MaxGcpWaitLimitMillis;
+    }
   }
 
   char buf2[512];
@@ -31950,11 +31971,12 @@ Dblqh::checkLcpFragWatchdog(Signal* signal)
     g_eventLogger->info("%s", buf2);
   }
 
-  if (c_lcpFragWatchdog.elapsedNoProgressMillis >= max_no_progress_time)
+  if ((max_no_progress_time > 0) &&
+      (c_lcpFragWatchdog.elapsedNoProgressMillis >= max_no_progress_time))
   {
     jam();
     /* Too long with no progress... */
-    warningEvent("Waited too long with LCP not progressing.");
+    warningEvent("Waited too long with LCP not progressing.");
     g_eventLogger->info("Waited too long with LCP not progressing.");
 
     /**
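As the commit message notes, users who want escalation protection should configure a GCP Monitor timeout. A sketch of the relevant `config.ini` section on the management server, assuming the standard NDB data node parameter names `TimeBetweenEpochsTimeout`, `TimeBetweenGlobalCheckpointsTimeout`, and `LcpScanProgressTimeout` (the values shown are illustrative; check the documentation for your NDB version, since defaults and availability vary):

```ini
[ndbd default]
# GCP Monitor limits: non-zero values enable escalating action when the
# GCP protocol stalls, so the GCP Monitor acts before the LCP watchdog.
TimeBetweenEpochsTimeout=32000              # ms; 0 disables this check
TimeBetweenGlobalCheckpointsTimeout=120000  # ms

# LCP watchdog limit, in seconds (per the commit text: default 60s in
# 7.6, 180s in 8.0).
LcpScanProgressTimeout=180
```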
