Commit ad7c121
Bug#29885899 NDB : LCP F-S WATCHDOG FIRES DUE TO GCP STOP IN STATE LCP_WAIT_END_LCP
Problem

The LCP watchdog monitors LCP progress in an LDM instance, including the fragment scans and the overall LCP round with its end phase.

During the LCP, fragment data files from the previous round are scheduled for deletion when the latest fragment data files become valid, on some pending GCI boundary. The end of the LCP round is delayed until all scheduled data file deletions have completed, so there is an indirect dependency from the LCP on the GCP protocol: LCP completion requires that GCPs complete.

The GCP protocol is distributed and multi-phase. It has its own GCP Monitor, which checks its liveness and takes escalating action to evict lagging nodes if the protocol stalls, potentially allowing the cluster to survive situations causing a stall on one or more nodes (overload, IO problems, software bugs, ...).

The LCP watchdog is designed to give the GCP Monitor time to take effect before it takes action itself. This is essential, as the LCP watchdog observes only node-local conditions: if there is a distributed GCP stall, all node-local LCP watchdogs will detect it at around the same time and escalate to node failure, which is very likely to cause cluster failure.

In the default case, where no specific GCP Monitor timeout is defined, the GCP Monitor takes no escalating action when the GCP protocol stalls. If the stall lasts longer than the LCP watchdog limit (default 60s in 7.6, 180s in 8.0), then all LCP watchdog instances will consider their local LCPs stalled and escalate to node failures, resulting in cluster failure. While this escalation raises the profile of the problem, it presents a GCP problem as an LCP problem, and may turn a node-local problem into a full-cluster problem, defeating the system's redundancy.

Solution

The LCP watchdog should always allow GCP handling to take precedence in the wait-end state.

Where no GCP Monitor timeout is configured, the LCP watchdog should take no escalating action while it is waiting for the LCP to end. In this situation, the GCP Monitor and the LCP watchdog will generate logs indicating that there is a stall, but no action will be taken. Users that want protection should configure the GCP Monitor timeout, giving protection from both the GCP Monitor and the LCP watchdog. A future fix will change the default for GCP protection so that all users have *some* loose GCP Monitor protection unless they specifically configure it away.

Testing
- Cause a GCP stall for a period longer than the LCP watchdog timeout
- Observe LCP watchdog logs with 0 max time
- Observe that the configured limit passes with no escalation

Reviewed by: Maitrayi Sabaratnam <[email protected]>
Change-Id: I19901096ddd753bc52ec8091e94db48424e9b432
1 parent eb3d7b2 commit ad7c121

File tree

1 file changed: +27 −5 lines changed

storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp

Lines changed: 27 additions & 5 deletions

@@ -31922,11 +31922,32 @@ Dblqh::checkLcpFragWatchdog(Signal* signal)
   Uint32 max_no_progress_time =
     c_lcpFragWatchdog.MaxElapsedWithNoProgressMillis;
 
-  if ((c_lcpFragWatchdog.lcpState == LcpStatusConf::LCP_WAIT_END_LCP) &&
-      (max_no_progress_time < c_lcpFragWatchdog.MaxGcpWaitLimitMillis))
+  if (c_lcpFragWatchdog.lcpState == LcpStatusConf::LCP_WAIT_END_LCP)
   {
     jam();
-    max_no_progress_time = c_lcpFragWatchdog.MaxGcpWaitLimitMillis;
+    /**
+     * In WAIT_END_LCP state we have a dependency on GCP completion
+     * We will therefore extend the allowed duration so that GCP
+     * related issues do not result in LCP shutdown
+     */
+    if (c_lcpFragWatchdog.MaxGcpWaitLimitMillis == 0)
+    {
+      jam();
+      /**
+       * No GCP limit set, therefore LCP should not time out in
+       * WAIT_END_LCP state.
+       */
+      max_no_progress_time = 0;
+    }
+    else if (max_no_progress_time < c_lcpFragWatchdog.MaxGcpWaitLimitMillis)
+    {
+      jam();
+      /**
+       * GCP limit set which is larger than LCP limit, extend
+       * LCP limit to cover it
+       */
+      max_no_progress_time = c_lcpFragWatchdog.MaxGcpWaitLimitMillis;
+    }
   }
 
   char buf2[512];
@@ -31950,11 +31971,12 @@ Dblqh::checkLcpFragWatchdog(Signal* signal)
     g_eventLogger->info("%s", buf2);
   }
 
-  if (c_lcpFragWatchdog.elapsedNoProgressMillis >= max_no_progress_time)
+  if ((max_no_progress_time > 0) &&
+      (c_lcpFragWatchdog.elapsedNoProgressMillis >= max_no_progress_time))
   {
     jam();
     /* Too long with no progress... */
-    warningEvent("Waited too long with LCP not progressing.");
+    warningEvent("Waited too long with LCP not progressing.");
     g_eventLogger->info("Waited too long with LCP not progressing.");
 
     /**
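As the commit message notes, users who want escalation protection should configure a GCP Monitor timeout. A sketch of the relevant `config.ini` section on the management server, assuming the standard NDB data node parameter names `TimeBetweenEpochsTimeout`, `TimeBetweenGlobalCheckpointsTimeout`, and `LcpScanProgressTimeout` (the values shown are illustrative; check the documentation for your NDB version, since defaults and availability vary):

```ini
[ndbd default]
# GCP Monitor limits: non-zero values enable escalating action when the
# GCP protocol stalls, so the GCP Monitor acts before the LCP watchdog.
TimeBetweenEpochsTimeout=32000              # ms; 0 disables this check
TimeBetweenGlobalCheckpointsTimeout=120000  # ms

# LCP watchdog limit, in seconds (per the commit text: default 60s in
# 7.6, 180s in 8.0).
LcpScanProgressTimeout=180
```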
