
[ETCM-102] Fast sync integration tests #672


Merged
merged 17 commits into develop from etcm-102/fast-sync-integration-test on Sep 22, 2020

Conversation

@KonradStaniec (Contributor) commented Sep 16, 2020

Description

Adds an integration test suite for fast sync, with 3 main happy-path scenarios:

  • fast sync of a blockchain with an empty MPT
  • fast sync of a blockchain with a non-empty MPT
  • fast sync of a blockchain with a target block update

I intended to add more test cases and refactor fast sync a little, but due to the bugs found along the way this PR grew too large for my taste.

Important Changes Introduced

During testing, 3 bugs were found and fixed:

  • Bug in FastSync: if the target block contains the hash of an empty trie, fast sync would get stuck asking peers for non-existing nodes (see the sketch after this list).

  • Bug in EtcPeerManager peer info handling: after the initial handshake, a peer's best block hash was not updated. This led FastSync to ask for most probably stale blocks, which in turn led to syncing towards a target block that had already been pruned.

  • Bug in EtcPeerManager providing handshaked peers to other parts of the system: if a newly connected peer was just starting, i.e. its best block was its genesis block, it would not be treated as a handshaked peer. This matters because Mantis would not even broadcast blocks to such a peer.
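A minimal sketch of the idea behind the first fix; EmptyTrieCheck, isEmptyTrieRoot, and initialNodesToRequest are hypothetical names, not the actual Mantis code. The root hash of an empty MPT is a well-known constant (keccak256 of the RLP-encoded empty string), so state download can be skipped when the target block's stateRoot equals it:

import akka.util.ByteString
import org.bouncycastle.util.encoders.Hex

object EmptyTrieCheck {
  // keccak256(rlp("")) — the well-known root hash of an empty Merkle Patricia Trie
  val EmptyTrieRootHash: ByteString =
    ByteString(Hex.decode("56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421"))

  def isEmptyTrieRoot(stateRoot: ByteString): Boolean =
    stateRoot == EmptyTrieRootHash

  // Hypothetical guard: request no state nodes at all for an empty trie,
  // otherwise peers are asked for nodes that do not exist and sync gets stuck.
  def initialNodesToRequest(stateRoot: ByteString): Seq[ByteString] =
    if (isEmptyTrieRoot(stateRoot)) Seq.empty else Seq(stateRoot)
}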

@KonradStaniec force-pushed the etcm-102/fast-sync-integration-test branch from 01584d5 to 72da7a3 on September 16, 2020 09:48
appStateStorage.putEstimatedHighestBlock(maxBlockNumber)

if (maxBlockNumber > initialPeerInfo.maxBlockNumber)
  initialPeerInfo.withMaxBlockNumber(maxBlockNumber).withMaxBlockHash(maxBlockHash)
Contributor: Shouldn't we use a method that updates both the number and the hash atomically? I think there is no case where we update only one field.

Contributor Author: I wondered about that too, so if another person raises the same concern I will change it.
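For illustration, a sketch of what such an atomic update could look like (withBestBlockInfo is a hypothetical name, not the actual Mantis API):

import akka.util.ByteString

// Hypothetical sketch: a combined setter makes it impossible to update
// the number without the hash (or vice versa).
case class PeerInfo(maxBlockNumber: BigInt, bestBlockHash: ByteString) {
  def withBestBlockInfo(number: BigInt, hash: ByteString): PeerInfo =
    copy(maxBlockNumber = number, bestBlockHash = hash)
}

// At the call site shown above this would become:
//   if (maxBlockNumber > initialPeerInfo.maxBlockNumber)
//     initialPeerInfo.withBestBlockInfo(maxBlockNumber, maxBlockHash)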

object FakePeer {
  def startFakePeer(peerName: String): Task[FakePeer] = {
    for {
      peer <- Task(new FakePeer(peerName)).memoizeOnSuccess
Contributor: It is a def, so .memoizeOnSuccess won't work here.

Contributor Author: Some leftover from debugging; I will remove it.
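For context, a small illustration of why memoization has no effect inside a def (assuming Monix Task, which the snippet appears to use):

import monix.eval.Task

object MemoizeExample {
  // Inside a def, a brand-new Task (with its own memoization cell) is built
  // on every call, so memoizeOnSuccess never shares results across calls:
  def freshEachTime(name: String): Task[String] =
    Task(s"peer-$name").memoizeOnSuccess

  // memoizeOnSuccess only caches when the same Task value is reused:
  val sharedOnce: Task[String] = Task("peer-1").memoizeOnSuccess
}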

@@ -112,7 +118,7 @@ class EtcPeerManagerActor(peerManagerActor: ActorRef, peerEventBusActor: ActorRe
* @return new updated peer info
*/
private def handleSentMessage(message: Message, initialPeerWithInfo: PeerWithInfo): PeerInfo =
Contributor: Do we need message: Message?

Contributor Author: We do not, but updatePeersWithInfo expects a function of shape messageHandler: (Message, PeerWithInfo) => PeerInfo; that is why I would leave it as it is.
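A hedged sketch of the constraint being described, with simplified stand-in types (the real signatures in EtcPeerManagerActor differ):

// Simplified stubs standing in for the real Mantis types:
trait Message
case class PeerInfo(maxBlockNumber: BigInt)
case class PeerWithInfo(peerInfo: PeerInfo)

object HandlerShape {
  // updatePeersWithInfo expects exactly this shape, so even an unused
  // Message parameter has to stay in handleSentMessage's signature:
  def updatePeersWithInfo(pwi: PeerWithInfo, msg: Message, handler: (Message, PeerWithInfo) => PeerInfo): PeerInfo =
    handler(msg, pwi)

  def handleSentMessage(message: Message, initialPeerWithInfo: PeerWithInfo): PeerInfo =
    initialPeerWithInfo.peerInfo // message intentionally unused
}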

@mmrozek (Contributor) left a comment: LGTM

@ntallar left a comment: Nice catches 🥇

  sender() ! HandshakedPeers(peersWithInfo.collect {
-   case (_, PeerWithInfo(peer, peerInfo)) if peerInfo.maxBlockNumber > 0 => peer -> peerInfo
+   case (_, PeerWithInfo(peer, peerInfo)) if peerHasUdpatedBestBlock(peerInfo) => peer -> peerInfo
ntallar: Doesn't the status exchange step guarantee that the peer has our node's genesis?

And since we update bestBlockHash together with bestBlockNumber, shouldn't bestBlockHash != genesisHash always imply maxBlockNumber > 0? In what case would peerHasUdpatedBestBlock not be true for a peer?

Contributor Author: Bear in mind that we do not exchange the best block number during the handshake, so just after the handshake each peer has maxBlockNumber set to 0. Only after the handshake do we make an additional request for the peer's best block header, and we only update maxBlockNumber if we receive that header.

So just after the handshake there can be peers with bestBlockHash != genesisHash && maxBlockNumber == 0, and we do not want to provide those as fully handshaked peers. (A sketch of the resulting predicate follows.)
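A hedged sketch of the predicate this thread discusses (the real peerHasUdpatedBestBlock implementation may differ): a peer counts as fully handshaked once its best block info is settled, either because we confirmed a best block header (maxBlockNumber > 0) or because its best block is simply the genesis block:

import akka.util.ByteString

case class PeerInfo(maxBlockNumber: BigInt, bestBlockHash: ByteString) // simplified stub

object HandshakedPeersFilter {
  // genesisHash would come from the blockchain config in the real code:
  def peerHasUdpatedBestBlock(peerInfo: PeerInfo, genesisHash: ByteString): Boolean =
    peerInfo.maxBlockNumber > 0 || peerInfo.bestBlockHash == genesisHash
}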

ntallar: Oh, I see the purpose of this now.

But why don't we want to provide them as fully handshaked peers? I wasn't able to find any usage of maxBlockNumber from this response so far.

Contributor Author: My guess is that it is mostly because maxBlockNumber takes part in block broadcasting, i.e. if we provided peers with maxBlockNumber set to 0, then until they sent us some NewBlock we would broadcast all blocks to them, even really old ones.

IMO we could get rid of it if we ported known-nodes and transaction tracking from the other project.

ntallar: It sounds like a reasonable guess; should we add a comment on this case?

@ntallar left a comment: Last comments, apart from which LGTM!

@ntallar left a comment: Awaiting the results from testing it on mainnet, apart from which LGTM!

 .\\            //.
. \ \          / /.
.\  ,\     /` /,.-
 -.   \  /'/ /  .
 ` -   `-'  \  -
   '.       /.\`
      -    .-
      :`//.'
      .`.'
      .' BP 

@KonradStaniec force-pushed the etcm-102/fast-sync-integration-test branch from 2ea0d6c to 68cf614 on September 21, 2020 09:38
@KonradStaniec force-pushed the etcm-102/fast-sync-integration-test branch from 68cf614 to cd44ef1 on September 21, 2020 09:45
@KonradStaniec (Contributor Author) commented Sep 21, 2020

So @mmrozek @ntallar, some info from sync testing:

  1. After all the fixes (and tweaking some config values, for which I will make a second PR) fast sync started working! I was able to sync to the tip of mainnet twice and hold a little bit on top of it.
  2. I found another bug which made the test flaky. Basically, if we read 2 frames from the socket at once, i.e. Seq(Hello, Status), then based on the Hello message we should decompress (or not) the Status. With the current handling this was impossible, as we read all messages from the socket and only then set the compression option. This could lead to the Status message not being handled correctly and, due to that, failed handshakes (see the sketch after this list).
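A simplified sketch of the idea, not the actual Mantis frame codec (Frame, decode, and the message types here are stand-ins). Each frame's decoding must depend on the compression state established by the frames already processed, because Hello is what negotiates Snappy compression (devp2p version >= 5):

object FrameReading {
  sealed trait Message
  case class Hello(p2pVersion: Long) extends Message
  case object Status extends Message

  case class Frame(payload: Array[Byte]) // stand-in for the RLPx frame type

  def decode(frame: Frame, compressed: Boolean): Message = ??? // stand-in decoder

  // Process frames one at a time, so that a Hello read in this very batch
  // already affects how the next frame (e.g. Status) is decompressed:
  def readFrames(frames: Seq[Frame], compressionEnabled: Boolean): Seq[Message] =
    frames match {
      case Seq() => Seq.empty
      case frame +: rest =>
        val message = decode(frame, compressionEnabled)
        val nextCompression = message match {
          case Hello(p2pVersion) => p2pVersion >= 5 // Snappy negotiated via Hello
          case _                 => compressionEnabled
        }
        message +: readFrames(rest, nextCompression)
    }
}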

@KonradStaniec (Contributor Author) commented Sep 21, 2020

Also note: during debugging I found that some Parity nodes do not follow the protocol for the disconnect message, i.e. they do not compress the Disconnect message sent after the Hello message, which leads to decoding failures on our side. My approach is to just ignore it, as Parity no longer supports ETC and it forms a much smaller base of clients than Geth.

@ntallar commented Sep 21, 2020

> My approach is to just ignore it, as Parity no longer supports ETC and it forms a much smaller base of clients than Geth.

Could this be a problem for Mantis ETH support? (as we are, at least for now, keeping it)

Btw, the last bug fix LGTM!

@KonradStaniec (Contributor Author)

I do not think it would be a big problem: the only message that did not follow the protocol was Disconnect, so either way we are disconnecting from that peer, and even on ETH about 80% of clients are Geth nodes. (A sketch of the tolerant decoding follows.)
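A hedged sketch of what tolerating the uncompressed Disconnect could look like (using the xerial snappy-java API; the actual Mantis handling may differ):

import org.xerial.snappy.Snappy
import scala.util.Try

object TolerantDecoding {
  // If Snappy decompression fails on a peer that should be compressing
  // (e.g. Parity's uncompressed Disconnect right after Hello), fall back
  // to treating the payload as plain, uncompressed bytes:
  def payloadBytes(payload: Array[Byte], compressionEnabled: Boolean): Array[Byte] =
    if (!compressionEnabled) payload
    else Try(Snappy.uncompress(payload)).getOrElse(payload)
}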

@KonradStaniec merged commit 6a35748 into develop on Sep 22, 2020
@KonradStaniec deleted the etcm-102/fast-sync-integration-test branch on September 22, 2020 07:56