-
Notifications
You must be signed in to change notification settings - Fork 788
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes to handling partial responses #1686
Conversation
Context: paritytech#493 (comment) Problem: When downloading from a fork, the peer may send a partial response that extends the current chain. Even though the response starts at a block number after the current best number, the parent block from the fork is not yet downloaded/imported. We currently try to import the partial response in this scenario, which causes import to fail as parent is not found in the backend. This causes sync to be aborted/restarted which may not converge. Fix: when we receive a partial response in this scenario, hold on to the partial blocks and try to download the missing blocks. And import when the full range is downloaded.
bot fmt |
@altonen https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/3799758 was started for your command Comment |
@altonen Command |
I don't understand how the fmt works with these changes. I try running cargo fmt locally (or with the command from the failing fmt job), but it formats several files not part of the change, or most of code in same file that the PR does not touch. The contributor guidelines also does not talk about this much. |
@rahulksnv did you run |
warn!( | ||
target: LOG_TARGET, | ||
"Download state: unable to get block number of the downloaded blocks: \ | ||
common = {}, range = {:?}, downloaded = {}", | ||
self.common_number, self.range, self.downloaded.len() | ||
); | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would the block not be available?
It looks like right now the peer is still left in DownloadingNew
state so I assume it will attempt to retry with the same failed result. It may be safer to disconnect the peer in case this function returns None
unless syncing can somehow recover from this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the existing behavior when blocks are validated: /~https://github.com/paritytech/polkadot-sdk/blob/master/substrate/client/network/sync/src/lib.rs#L2796
Blocks are validated before adding to the self.downloaded
list, so this should never happen (why this is a warn
message). Initially thought of caching the block number when it is added to the list, but that would be redundant fields, and we need to worry about keeping them in sync
let from = if peer_best_number == last { | ||
FromBlock::Hash(*peer_best_hash) | ||
} else { | ||
FromBlock::Number(last) | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comment explaining this logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is existing behavior, just retained as is: /~https://github.com/paritytech/polkadot-sdk/blob/master/substrate/client/network/sync/src/lib.rs#L2561
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in case of downloading a fork we always need to use FromBlock::Hash
, looking it up from the first block in the already downloaded range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The request.from
is the hash of the first block we want to download, which we don't know until the block is downloaded (now I understand why current code sets it hash if block_number == peer.best_block_number
, in this case we happen to know the corresponding peer.best_block_hash
)
Unless we want to set it to parent hash of the first block already downloaded. I would prefer to keep this uniform with existing code, don't know why current code does not set it to parent hash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I meant setting it to the parent hash of the already downloaded first block in the range. Otherwise we can't be sure that we are downloading the right fork and can fail when importing the range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to pass parent hash
new_blocks.append(&mut self.downloaded); | ||
self.downloaded.append(&mut new_blocks); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this? Since it's dealing with the block bodies which can be large, it'd be better if we'd tried to handle them efficiently. Looking at this code alone I don't know whether it's making full copies of them or not but this code looks quite dubious in the first place. Why does it first have to append blocks from self.downloaded
to new_blocks
and then append them back to self.downloaded
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This prepends the new blocks to the existing blocks (updated the comment to reflect it). Prepending is needed because the blocks are returned in the reverse order (i.e) the request sets request.from = FromBlock::Number(x)
, and request.direction = Direction::Descending
.
The Vec::append()
is a move only operation, no copies/clones involved. The first append appends/empties self.downloaded
, second one moves it back
if self.range.start == (self.common_number + One::one()) && | ||
self.downloaded.len() < range_len |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment for this check please. Otherwise it'll be hard to maintain this code because looking at this alone, I don't know what self.range.start == (self.common_number + One::one())
implies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had comments below, moved it up and expanded it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check self.range.start == (self.common_number + One::one())
looks like a hackish way to determine if PeerDownloadState
is responsible for accumulating blocks, or BlockCollection
is responsible for this on the upper level in ChainSync
. May be it's possible to incorporate this code into BlockCollection
so that not distribute the logic over multiple components?
If for some reasons it's not possible to incorporate this code into BlockCollection
, I would introduce a clear separation of the code needed to pass only one chunk to BlockCollection
, and the code responsible for holding up the blocks. For example, this can be done by introducing a separate enum variant into PeerSyncState
for the latter (DownloadingRange
) with comments that it's used specifically for fork downloading in multiple chunks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check self.range.start == (self.common_number + One::one()) looks like a hackish way to determine if PeerDownloadState is responsible for accumulating blocks, or BlockCollection is responsible for this on the upper level in ChainSync. May be it's possible to incorporate this code into BlockCollection so that not distribute the logic over multiple components?
BlockCollection
currently does something slightly different: manages the downloaded range (with holes + merging holes) across several peers, which the PeerDownloadState
manages a range with a single peer. It would be cleaner/less complex to keep this separation. The PeerDownloadState
is just a staging area before adding the blocks to the collection.
If for some reasons it's not possible to incorporate this code into BlockCollection, I would introduce a clear separation of the code needed to pass only one chunk to BlockCollection, and the code responsible for holding up the blocks. For example, this can be done by introducing a separate enum variant into PeerSyncState for the latter (DownloadingRange) with comments that it's used specifically for fork downloading in multiple chunks.
This change does the same thing, except it enhances the state maintained by PeerSyncState::DownloadingNew
, to manage download of new blocks from a single/unified path. I feel that adding separate variant only adds more complexity/code duplication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check self.range.start == (self.common_number + One::one())
looks confusing and can be replaced with a single flag download_full_range
, the value of which we know from the creation of PeerDownloadState
. Otherwise we don't need self.common_number
for anything. This is mostly equivalent to having a dedicated enum type for the case when we need to download the full range.
Having said that, IMO updating BlockCollection
/ importing logic to trigger import only when the parent hash corresponds to the already imported block (instead of comparison by number) would be less hacky than introducing a special case for downloading a range starting from common_number
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check self.range.start == (self.common_number + One::one()) looks confusing and can be replaced with a single flag download_full_range, the value of which we know from the creation of PeerDownloadState. Otherwise we don't need self.common_number for anything. This is mostly equivalent to having a dedicated enum type for the case when we need to download the full range.
Yeah this looks cleaner, changed it to have a separate enum. The current DownloadingNew
remains unchanged
Having said that, IMO updating BlockCollection / importing logic to trigger import only when the parent hash corresponds to the already imported block (instead of comparison by number) would be less hacky than introducing a special case for downloading a range starting from common_number.
Agreed. I looked at it, this looks like a much bigger change to use block hash instead of block number (this potentially touches several other paths). I feel this is better done as a separate change
let mut downloaded = Vec::new(); | ||
downloaded.append(&mut self.downloaded); | ||
collection.insert(start_block, downloaded, *who); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here, why is this append()
necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is similar, it moves the completed blocks out and passes it to the BlockCollection
@altonen this repo should seriously consider using Nightly in |
Yes, I also ran the cmd from the failing job, but same behavior |
@rahulksnv update nightly to latest and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very promising, left some comments.
return Err(BadPeer(*who, rep::BAD_RESPONSE)) | ||
} | ||
let new_blocks_len = new_blocks.len(); | ||
new_blocks.append(&mut self.downloaded); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would check first that start_block + new_blocks.len() == self.downloaded.first().number()
(pseudocode) or start_block + new_blocks.len() == self.range.end()
(if self.downloaded
is empty). I.e., we can correctly prepend the range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored this part to break into smaller functions, added unit tests for the new paths
Self { common_number, range, downloaded: Vec::new() } | ||
} | ||
|
||
/// Handles the new blocks received from the peer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a comment stating that it only works when downloading the range backwards, may be even rename it to reflect this (prepend_blocks
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed name and expanded the comments
if self.range.start == (self.common_number + One::one()) && | ||
self.downloaded.len() < range_len |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check self.range.start == (self.common_number + One::one())
looks like a hackish way to determine if PeerDownloadState
is responsible for accumulating blocks, or BlockCollection
is responsible for this on the upper level in ChainSync
. May be it's possible to incorporate this code into BlockCollection
so that not distribute the logic over multiple components?
If for some reasons it's not possible to incorporate this code into BlockCollection
, I would introduce a clear separation of the code needed to pass only one chunk to BlockCollection
, and the code responsible for holding up the blocks. For example, this can be done by introducing a separate enum variant into PeerSyncState
for the latter (DownloadingRange
) with comments that it's used specifically for fork downloading in multiple chunks.
let from = if peer_best_number == last { | ||
FromBlock::Hash(*peer_best_hash) | ||
} else { | ||
FromBlock::Number(last) | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in case of downloading a fork we always need to use FromBlock::Hash
, looking it up from the first block in the already downloaded range.
41626e8
to
4477868
Compare
@rahulksnv sorry for not getting back with the review of PR for quite some time. We have been having internal discussions regarding the bug and the PR, and, to be honest, have not decided yet if we should merge this PR or properly fix the underlying issue (import triggered by number instead of parent hash) instead. We are currently looking if it's feasible to properly fix the import. I'm going to publish the review comments I've already written, but it may be a good idea to hold back and wait until we have a clear understanding of what way of fixing the issue to prefer before addressing them. The review is not complete either, so this is another reason why I'd ask you to wait. We'll try to get back with the decision next week. |
} | ||
} | ||
|
||
/// `PeerDownloadStateState` tracks the download from a fork. It acts a staging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick:
/// `PeerDownloadStateState` tracks the download from a fork. It acts a staging | |
/// `PeerDownloadState` tracks the download from a fork. It acts as a staging |
downloaded.blocks.append(&mut new_blocks); | ||
downloaded.start_block = start_block; | ||
downloaded.start_block_parent_hash = start_block_parent_hash; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick:
downloaded.blocks.append(&mut new_blocks); | |
downloaded.start_block = start_block; | |
downloaded.start_block_parent_hash = start_block_parent_hash; | |
downloaded = PeerDownloaded { blocks: new_blocks, start_block, start_block_parent_hash }; |
This way we will catch possible errors if the new fields are added to PeerDownloaded
in the future.
let first_different = peer.common_number + One::one(); | ||
if range.start == first_different { | ||
peer.state = PeerSyncState::DownloadingFork(PeerDownloadState::new(range)); | ||
} else { | ||
peer.state = PeerSyncState::DownloadingNew(range.start); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add a comment here why we are handling DownloadFork
case differently? And also add a TODO
with a reference to the issue of triggering import by hash instead of number.
/// So instead, hold on to the partial response in this scenario, issue more requests to | ||
/// download the missing [916, 925). And submit to the block collection after the full | ||
/// range is downloaded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a TODO
to this description referencing the issue of triggering block import by hash instead of number (#1778)? Something like:
/// So instead, hold on to the partial response in this scenario, issue more requests to | |
/// download the missing [916, 925). And submit to the block collection after the full | |
/// range is downloaded. | |
/// So instead, hold on to the partial response in this scenario, issue more requests to | |
/// download the missing [916, 925). And submit to the block collection after the full | |
/// range is downloaded. | |
/// | |
/// TODO: trigger block import by parent hash comparison instead of block number | |
/// comparison. See /~https://github.com/paritytech/polkadot-sdk/issues/1778. |
Thanks, will address the comments and hold on to it for now. I would also take a look at moving to a hash based scheme, to get an idea |
CC @arkpar |
@rahulksnv We can probably close this PR now as superseded by #1812. |
yeah, closing this in favor of #1812 |
Context: #493 (comment)
Problem: When downloading from a fork, the peer may send a partial response that extends the current chain. Even though the response starts at a block number after the current best number, the parent block from the fork is not yet downloaded/imported. We currently try to import the partial response in this scenario, which causes import to fail as parent is not found in the backend. This causes sync to be aborted/restarted which may not converge.
Fix: when we receive a partial response in this scenario, hold on to the partial blocks and try to download the missing blocks. And import when the full range is downloaded.