
[ETCM-211] Tracking of headers/bodies requests that will be ignored due to invalidation #749


Merged: 4 commits merged into develop on Oct 27, 2020

Conversation

@ntallar commented Oct 20, 2020

Description

Address the cause of the CPU and network spikes on the testnet.

Proposed Solution

  1. Start tracking when the fetcher should start from scratch while requests are still in progress. This lets the fetcher ignore those responses when they arrive and only then (once no requests are in progress) send a new request; see the sketch after this list
  2. Receiving a new block should never cancel a request in progress (as happened in one scenario)
  3. Better document the syncing behavior
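
A minimal sketch of the tracking idea from point 1, reusing the state names that appear later in this PR (NoHeadersFetched, AwaitingHeaders, AwaitingHeadersToBeIgnored); the exact shape and method names here are illustrative, not the final PR code:

sealed trait FetchingHeadersState
case object NoHeadersFetched extends FetchingHeadersState           // nothing in flight
case object AwaitingHeaders extends FetchingHeadersState            // request in flight, response wanted
case object AwaitingHeadersToBeIgnored extends FetchingHeadersState // request in flight, response is stale

final case class FetcherState(fetchingHeadersState: FetchingHeadersState) {
  // Invalidation while a request is in flight: instead of firing a second,
  // parallel request, mark the pending response as one to be ignored.
  def invalidated: FetcherState = fetchingHeadersState match {
    case AwaitingHeaders => copy(fetchingHeadersState = AwaitingHeadersToBeIgnored)
    case _               => this
  }

  // On response: report whether it should be processed; either way the slot is
  // freed, so only now may a new request be sent.
  def onHeadersResponse: (FetcherState, Boolean) = {
    val process = fetchingHeadersState == AwaitingHeaders
    (copy(fetchingHeadersState = NoHeadersFetched), process)
  }
}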

Scenario addressed

  1. Temporarily add logging for when a best block number is saved on the db (here)
  2. Start node 3 and, with the mocked miner, mine 500 blocks. Check with the log from step 1 that close to 500 blocks were saved on the db; if not, continue mining until that's the case
  3. Stop node 3
  4. Start node 2 and, with the mocked miner, mine 300 blocks
  5. Start node 1, connect it to node 2, and wait until it has synced all of its blocks
  6. Start node 3 and connect it to node 1
  7. Mine enough blocks on node 3 so that it has more blocks than the best block number that had been saved on node 2
  8. The new blocks from step 7 should trigger node 1 to start syncing from node 3

Eventually the invalidation of local blocks due to an unknown branch should happen while a request is in progress, causing a new request to be sent and leaving 2 of them in parallel at that moment. As the sync doesn't handle 2 parallel requests for the same type of data, node 1 should eventually end up with a lot of requests in parallel, using up a lot of network and CPU resources.

Note that step 1 and the related checks are needed to prevent falling into this issue: https://jira.iohk.io/browse/ETCM-246
Note that step 8 is needed due to: https://jira.iohk.io/browse/ETCM-248

Indicators of the issue occurring:

  • CPU usage should increase abnormally
  • Logs like:
2020-10-20 15:27:47,315 [i.i.e.blockchain.sync.PeersClient] - Peer (PeerId(127.0.0.1:35232)) would be blacklisted (reason: Given headers are not sequence with already fetched ones), but blacklisting duration is zero
2020-10-20 15:27:47,315 [i.i.e.blockchain.sync.PeersClient] - Selected peer PeerId(127.0.0.1:35232) with address 127.0.0.1
2020-10-20 15:27:47,315 [i.i.e.blockchain.sync.PeersClient] - Peer (PeerId(127.0.0.1:35232)) would be blacklisted (reason: Given headers are not sequence with already fetched ones), but blacklisting duration is zero
2020-10-20 15:27:47,315 [i.i.e.blockchain.sync.PeersClient] - Selected peer PeerId(127.0.0.1:35232) with address 127.0.0.1
2020-10-20 15:27:47,315 [i.i.e.blockchain.sync.PeersClient] - Peer (PeerId(127.0.0.1:35232)) would be blacklisted (reason: Given headers are not sequence with already fetched ones), but blacklisting duration is zero
2020-10-20 15:27:47,315 [i.i.e.blockchain.sync.PeersClient] - Selected peer PeerId(127.0.0.1:35232) with address 127.0.0.1
2020-10-20 15:27:47,315 [i.i.e.blockchain.sync.PeersClient] - Peer (PeerId(127.0.0.1:35232)) would be blacklisted (reason: Given headers are not sequence with already fetched ones), but blacklisting duration is zero
2020-10-20 15:27:47,315 [i.i.e.blockchain.sync.PeersClient] - Selected peer PeerId(127.0.0.1:35232) with address 127.0.0.1
2020-10-20 15:27:47,317 [i.i.e.b.sync.regular.BlockFetcher] - Fetched 100 headers starting from block Some(359)
2020-10-20 15:27:47,317 [i.i.e.b.sync.regular.BlockFetcher] - Fetching headers from block 246
2020-10-20 15:27:47,317 [i.i.e.blockchain.sync.PeersClient] - Peer (PeerId(127.0.0.1:35232)) would be blacklisted (reason: Given headers are not sequence with already fetched ones), but blacklisting duration is zero
2020-10-20 15:27:47,317 [i.i.e.blockchain.sync.PeersClient] - Selected peer PeerId(127.0.0.1:35232) with address 127.0.0.1
2020-10-20 15:27:47,317 [i.i.e.b.sync.regular.BlockFetcher] - Fetched 100 headers starting from block Some(359)
2020-10-20 15:27:47,317 [i.i.e.b.sync.regular.BlockFetcher] - Fetching headers from block 246
2020-10-20 15:27:47,317 [i.i.e.b.sync.regular.BlockFetcher] - Fetched 100 headers starting from block Some(359)
2020-10-20 15:27:47,317 [i.i.e.b.sync.regular.BlockFetcher] - Fetching headers from block 246
2020-10-20 15:27:47,317 [i.i.e.b.sync.regular.BlockFetcher] - Fetched 100 headers starting from block Some(359)
2020-10-20 15:27:47,317 [i.i.e.b.sync.regular.BlockFetcher] - Fetching headers from block 246

Testing

  • Reproduce the scenario that triggered this task and check that the issue is no longer there
  • Check that syncing continues to work, either with mainnet or mordor; just having it sync a couple of thousand blocks should be enough. Synced 300000 blocks from mordor without any problems
  • Add unit tests on the block fetcher

@ntallar added the bug label Oct 20, 2020
@ntallar force-pushed the etcm-211-removing-multiple-parallel-sync-requests branch 5 times, most recently from 4350799 to 0a18fc7, on October 21, 2020 15:24
@ntallar marked this pull request as ready for review October 21, 2020 15:24
@ntallar force-pushed the etcm-211-removing-multiple-parallel-sync-requests branch from 0a18fc7 to 523e90a on October 21, 2020 15:26
    state.appendHeaders(validHeaders)
}
val newState =
  if (state.fetchingHeadersState == AwaitingHeadersToBeIgnored) {
Contributor:
Maybe I am missing something, but what will happen in case of a request timeout? I see a path like this:

  1. request headers
  2. the fetcher receives an invalidation
  3. the request times out, but we are still in the AwaitingHeadersToBeIgnored state
  4. the fetcher creates a request for new headers, starting from the newly invalidated blocks
  5. the fetcher receives the response with the new block headers, but it is ignored as the fetcher is still in AwaitingHeadersToBeIgnored

Is that correct, or is there some detail I am missing?

Author:

You are right! I thought I had analyzed that but I missed it.

I'll update this PR with something even closer to what we do when receiving the response.
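
For reference, a hedged sketch of one way the timeout path could be closed (reusing the illustrative states from the sketch above; this is one possible shape, not necessarily what the PR update does). A retry while in AwaitingHeadersToBeIgnored means the timed-out request is exactly the one whose response was going to be ignored, so the flag can be cleared and a fresh request sent:

sealed trait FetchingHeadersState
case object NoHeadersFetched extends FetchingHeadersState
case object AwaitingHeaders extends FetchingHeadersState
case object AwaitingHeadersToBeIgnored extends FetchingHeadersState

def onRetryHeadersRequest(state: FetchingHeadersState): FetchingHeadersState =
  state match {
    // The request we were going to ignore has timed out, so nothing is in
    // flight anymore: clear the flag and send a fresh request right away.
    case AwaitingHeadersToBeIgnored => AwaitingHeaders
    // Normal timeout: retry and keep awaiting the (wanted) response.
    case AwaitingHeaders            => AwaitingHeaders
    // No request pending: nothing to retry.
    case NoHeadersFetched           => NoHeadersFetched
  }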

@ntallar requested a review from KonradStaniec October 23, 2020 22:24

"BlockFetcher" - {

"should not requests headers upon invalidation if a request is already in progress" in new TestSetup {
Contributor:

Maybe we could add a separate test case for the timeout case?


      fetchBlocks(newState)
    case RetryHeadersRequest if state.isFetchingHeaders =>
      log.debug("Retrying request for headers")
      fetchHeaders(state)
      log.debug("Time-out occurred while waiting for headers")
Contributor:

More of a discussion-type comment. Do you think it would bring problems if we received this timed-out response later? i.e. something like:

  1. we request headers, let's say from the block with number 100
  2. we receive an invalidation from 100 down to 80
  3. we receive a timeout for our request, so we blacklist the peer and clear the state
  4. after some time (200 s by default) the peer is unblacklisted, and we request headers from the block with number 90 from it
  5. we receive the old response due to some network traffic problems or something like that

Do you think it could result in something more than blacklisting the peer? (for example, triggering this bug with unbounded resource usage)

Author:

Definitely; we are currently doing no validations that the headers received match the request.

With that sort of validations we should be safe I think, but for now I was assuming that that case will never happen; the 30 seconds till timeout should be more than enough to prevent it, right?

Contributor:

That makes me wonder whether this whole solution shouldn't be implemented as an actor/class in between the fetcher and the peers client. It could track the requests made and invalidate previous ones based on a chosen strategy (at-most-n, allow-all, etc.).
This could simplify things significantly and let the code in the fetcher focus on just orchestrating the fetching of blocks.

Author:

What would this actor do? Are you maybe thinking of something like the BlockChainDataFetcher of our other project? (only in how it requests headers and bodies)

It could track requests

That would at least focus on the ☝️ part and allow validating that the responses received match the requests we sent.

Either way, I'd do that effort in parallel with this kind of patching that only touches the minimum required (and is needed for the testnet).

I'm a bit worried about how risky any design changes to our sync could be... though that BlockChainDataFetcher refactoring would be low risk, I think.

Contributor:

Well, BlockFetcher is more or less the same as BlockchainDataFetcher there already, IMO. I'm thinking more of a structure like this:

                              block fetcher
           /                        |                          \
headers fetcher (Fetcher)   bodies fetcher (Fetcher)   state nodes fetcher (Fetcher)
           \                        |                          /
                              peers client

Where Fetcher is that specialized class/actor that manages requests of a given type and could have an interface like:

abstract class Fetcher(multipleReqStrategy: AtMostN | AllowAll) {
  def makeRequest(msg: MessageSerializable): IO[msg.Response] // issues a request and adds it to the tracked pool
  def cancelPendingRequests: IO[Unit] // stores information about the cancellation, so it knows which responses cannot be passed on if received
}
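
A self-contained sketch of how that proposal might be used (hedged: the stubbed types below are made up for illustration, assuming cats-effect IO; only the names Fetcher, makeRequest, cancelPendingRequests and the strategies come from the comment above):

import cats.effect.IO

sealed trait MultipleReqStrategy
case object AllowAll extends MultipleReqStrategy
final case class AtMostN(n: Int) extends MultipleReqStrategy

trait Fetcher[Req, Resp] {
  def strategy: MultipleReqStrategy   // at-most-n / allow-all, as proposed above
  def makeRequest(msg: Req): IO[Resp] // issue a request and track it in the pool
  def cancelPendingRequests: IO[Unit] // mark all in-flight requests as stale
}

// On invalidation, the block fetcher would first cancel, then re-request; stale
// responses never reach it because the Fetcher filters them out.
def onInvalidation[Req, Resp](fetcher: Fetcher[Req, Resp], req: Req): IO[Resp] =
  fetcher.cancelPendingRequests.flatMap(_ => fetcher.makeRequest(req))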

Contributor:

So IMO the details are up for discussion, but I am all for doing some kind of detailed request-response tracking. Unfortunately, as eth does not have any request-id for its request-response messages, such a thing is ultimately necessary if we want the possibility to cancel in-flight requests.

Author:

we receive the old response due to some network traffic problems or something like that.

I just realized this might not be a problem: our PeerRequestHandler awaiting the response should have been killed by then, so any response arriving too far into the future should be discarded.

Either way, I added this task to add validations that the request/response match and then detect malicious peer behaviour: https://jira.iohk.io/browse/ETCM-283

Unfortunately, as eth does not have any request-id for its request-response messages, such a thing is ultimately necessary if we want the possibility to cancel in-flight requests.

Due to our PeerRequestHandler design this might be easier to do 🤔 we can identify the request by the actor that's in charge of getting the response. If we use a single peer per request type, that might work (for now we don't even have more than one request type in parallel).
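
A hedged sketch of that identify-by-the-actor idea (assuming Akka's ActorRef for the PeerRequestHandler instance; the class and method names here are made up for illustration):

import akka.actor.ActorRef

final case class InFlightRequests(headersHandler: Option[ActorRef]) {
  // Remember which PeerRequestHandler was spawned for the current headers request.
  def startedHeadersRequest(handler: ActorRef): InFlightRequests =
    copy(headersHandler = Some(handler))

  // A headers response is accepted only when it comes from the tracked handler;
  // anything else is a stale response from a request we no longer care about.
  def acceptsHeadersResponse(sender: ActorRef): Boolean =
    headersHandler.contains(sender)

  // "Cancelling" an in-flight request is then just forgetting its handler.
  def cancelledHeadersRequest: InFlightRequests = copy(headersHandler = None)
}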

      case Right(validHeaders) =>
        state.appendHeaders(validHeaders)
    }
    val newState =
Contributor:

I'd move this code into a method on BlockFetcherState and leave only the logging here.

Author:

As a handleHeaderResponse function, you mean? Isn't the current structure more oriented towards handling the responses in the actor? Shouldn't we move the bodies/nodes response handling as well?

Contributor:

After a second look, let's keep it as is. It's just that the more I work with actors, the more I want to extract all the logic from them so it can be tested without creating the actor.

Contributor:

Those are totally my current sentiments about actors too: they should have as little logic as possible and should be only a thin communication layer over the main logic.

Author:

Yeah, I'm having the same concerns with them.

However, for this case in particular, I think we would end up with a too big and hard to understand BlockFetcherState class if we start moving everything there; we should probably start splitting this logic up into other classes.


def fetchingHeaders(isFetching: Boolean): BlockFetcherState = copy(isFetchingHeaders = isFetching)
def isFetchingHeaders: Boolean = fetchingHeadersState != NoHeadersFetched
def withNewHeadersFetch: BlockFetcherState = copy(fetchingHeadersState = AwaitingHeaders)
def withHeaderFetchReceived: BlockFetcherState = copy(fetchingHeadersState = NoHeadersFetched)
Contributor:
Naming: “no headers fetched” or rather “not fetching headers”?

def fetchingBodies(isFetching: Boolean): BlockFetcherState = copy(isFetchingBodies = isFetching)
def isFetchingBodies: Boolean = fetchingBodiesState != NoBodiesFetched
def withNewBodiesFetch: BlockFetcherState = copy(fetchingBodiesState = AwaitingBodies)
def withBodiesFetchReceived: BlockFetcherState = copy(fetchingBodiesState = NoBodiesFetched)
Contributor:

ditto


trait FetchingHeadersState
case object NoHeadersFetched extends FetchingHeadersState
case object AwaitingHeaders extends FetchingHeadersState
Contributor:

What about tracking the parameters of the request made? It would allow us to track (and at least log, for now) the case Konrad mentioned: receiving a response at a moment when we're waiting for a different one.

Author (@ntallar, Oct 27, 2020):

I'd do this in a separate task, one that adds the BlockChainDataFetcher class and all the validations of requests from there, wdyt?

(maybe not adding it as is, but just a class that could be used by the fetcher and that contains all the request/response validations; a sketch of the tracking itself is below)
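
For what it's worth, a hedged sketch of how that parameter tracking could look if folded into the states from this PR (the fields and the check below are illustrative, not an agreed design):

sealed trait FetchingHeadersState
case object NoHeadersFetched extends FetchingHeadersState
// Instead of a bare flag, the awaiting state remembers what was asked for...
final case class AwaitingHeaders(fromBlock: BigInt, amount: Int) extends FetchingHeadersState

// ...so a received response can be checked (or at least logged) against it:
def matchesRequest(state: FetchingHeadersState, firstHeaderNumber: BigInt, count: Int): Boolean =
  state match {
    case AwaitingHeaders(fromBlock, amount) => firstHeaderNumber == fromBlock && count <= amount
    case NoHeadersFetched                   => false
  }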

import akka.util.ByteString
import io.iohk.ethereum.domain.Block

trait BlockHelpers {
Contributor:

Could you make it an object? I'd move the helpers I extracted here: https://github.com/input-output-hk/mantis/pull/735/files#diff-840c293a5c8d271766b7ad38d72f23fdc2a5d0406db32ee60fd727cc6615ca75R1 into the same package, so fewer conflicts and/or code duplicates are created.

Contributor @kapke left a comment:

LGTM!

@ntallar merged commit ae398e1 into develop Oct 27, 2020
@ntallar deleted the etcm-211-removing-multiple-parallel-sync-requests branch October 27, 2020 19:45