Publish metrics test is flaky #7267

ghost · 2023-05-22T13:18:15Z

Working on some HostConfiguration changes, I had the runtime_can_publish_metrics test failing https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/2871112.

alexggh · 2023-05-23T14:54:04Z

Looking at the logs, and it seems the reason the test failed is because the assumption that Alice produced 2 blocks is not met:

// Wait for Alice to author two blocks.
alice.wait_for_blocks(2).await;

The reason that is not happening is because a re-org is happening in the test, so instead of producing two blocks we receive two notifications for block one:

[2023-05-22 12:36:56] 2023-05-22 12:36:56.512  INFO               tokio-runtime-worker substrate: [//Alice] ✨ Imported #1 (0x46d1…3c55)    
[2023-05-22 12:36:56] 2023-05-22 12:36:56.512  INFO               tokio-runtime-worker sc_informant: [//Alice] ♻️  Reorg on #1,0x46d1…3c55 to #1,0x6866…80ca, common ancestor #0,0x7b54…4b42    
[2023-05-22 12:36:56] 2023-05-22 12:36:56.512  INFO               tokio-runtime-worker substrate: [//Alice] ✨ Imported #1 (0x6866…80ca)

The code for wait_for_blocks is something like this, so it will count twice block one and it will exit, hence why the fail assert later down the line.

while let Some(notification) = import_notification_stream.next().await {
    if notification.is_new_best {
	    blocks.insert(notification.hash);
	    if blocks.len() == count {
		    break
	    }
    }
}

Possible solutions:

The quickest would be to wait more than 2 blocks before we ask for the runtime metrics, here
Implement a test helper alternative to wait_for_block that takes into consideration Reorgs might happen.
Fix wait_for_block helper in substrate, TBD if the current behaviour is correct or not.

When an re-org happens wait_for_blocks(2) would actually exit after the second import of blocks 1, so the conditions for the metric to exist won't be met hence the occasional test failure. More details in: paritytech#7267 Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>

alexggh · 2023-05-24T07:57:16Z

Proposed fix here: #7279

bkchr · 2023-05-24T08:02:47Z

3. Fix wait_for_block helper in substrate, TBD if the current behaviour is correct or not.

Good idea.

alexggh · 2023-05-24T08:24:51Z

Fix wait_for_block helper in substrate, TBD if the current behaviour is correct or not.

Good idea.

Any thoughts if the current behaviour of the the function is correct in the presence of a Reorg, just by looking at the documentation of the API, I would say it does what we expect it to do:

/// Wait for `count` blocks to be imported in the node and then exit. This function will not
/// return if no blocks are ever created, thus you should restrict the maximum amount of time of
/// the test execution.
fn wait_for_blocks(&self, count: usize) -> Pin<Box<dyn Future<Output = ()> + Send>>;

bkchr · 2023-05-24T08:38:48Z

I think I have written this and the idea was waiting for the chain to build n unique blocks. Not just imports.

* metrics: tests: Fix flaky runtime_can_publish_metrics When an re-org happens wait_for_blocks(2) would actually exit after the second import of blocks 1, so the conditions for the metric to exist won't be met hence the occasional test failure. More details in: #7267 Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io> * metrics: tests: Cleanup un-needed box pin Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io> --------- Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>

sandreim · 2023-05-24T11:32:57Z

fixed in #7279

In the cases where a reorg happens we might receive notifications for different blocks at the same level, so instead of the chain having count new blocks it has less and that will break the tests which use this function. So, use the block number to identify that `count` blocks have been built in the chain. Examples where this issue was hit: paritytech/polkadot#7267 Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>

alexggh · 2023-05-24T13:47:27Z

I think I have written this and the idea was waiting for the chain to build n unique blocks. Not just imports.

I guess you mean something like this: paritytech/substrate#14215 ?

In the cases where a reorg happens we might receive notifications for different blocks at the same level, so instead of the chain having count new blocks it has less and that will break the tests which use this function. So, use the block number to identify that `count` blocks have been built in the chain. Examples where this issue was hit: paritytech/polkadot#7267 Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>

ghost moved this to To do in Parachains-core May 22, 2023

ghost added this to Parachains-core May 22, 2023

ghost added the T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance. label May 22, 2023

sandreim assigned alexggh May 22, 2023

alexggh mentioned this issue May 24, 2023

metrics: tests: Fix flaky runtime_can_publish_metrics #7279

Merged

bkchr closed this as completed May 24, 2023

github-project-automation bot moved this from To do to Done in Parachains-core May 24, 2023

alexggh mentioned this issue May 24, 2023

test-utils: Fix wait_for_blocks in presence of reorgs paritytech/substrate#14215

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Publish metrics test is flaky #7267

Publish metrics test is flaky #7267

ghost commented May 22, 2023

alexggh commented May 23, 2023

alexggh commented May 24, 2023

bkchr commented May 24, 2023

alexggh commented May 24, 2023

bkchr commented May 24, 2023

sandreim commented May 24, 2023

alexggh commented May 24, 2023

Publish metrics test is flaky #7267

Publish metrics test is flaky #7267

Comments

ghost commented May 22, 2023

alexggh commented May 23, 2023

Possible solutions:

alexggh commented May 24, 2023

bkchr commented May 24, 2023

alexggh commented May 24, 2023

bkchr commented May 24, 2023

sandreim commented May 24, 2023

alexggh commented May 24, 2023