Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Publish metrics test is flaky #7267

Closed
ghost opened this issue May 22, 2023 · 7 comments
Closed

Publish metrics test is flaky #7267

ghost opened this issue May 22, 2023 · 7 comments
Assignees
Labels
T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance.

Comments

@ghost
Copy link

ghost commented May 22, 2023

Working on some HostConfiguration changes, I had the runtime_can_publish_metrics test failing https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/2871112.

@ghost ghost moved this to To do in Parachains-core May 22, 2023
@ghost ghost added this to Parachains-core May 22, 2023
@ghost ghost added the T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance. label May 22, 2023
@alexggh
Copy link
Contributor

alexggh commented May 23, 2023

Looking at the logs, and it seems the reason the test failed is because the assumption that Alice produced 2 blocks is not met:

// Wait for Alice to author two blocks.
alice.wait_for_blocks(2).await;

The reason that is not happening is because a re-org is happening in the test, so instead of producing two blocks we receive two notifications for block one:

[2023-05-22 12:36:56] 2023-05-22 12:36:56.512  INFO               tokio-runtime-worker substrate: [//Alice] ✨ Imported #1 (0x46d1…3c55)    
[2023-05-22 12:36:56] 2023-05-22 12:36:56.512  INFO               tokio-runtime-worker sc_informant: [//Alice] ♻️  Reorg on #1,0x46d1…3c55 to #1,0x6866…80ca, common ancestor #0,0x7b54…4b42    
[2023-05-22 12:36:56] 2023-05-22 12:36:56.512  INFO               tokio-runtime-worker substrate: [//Alice] ✨ Imported #1 (0x6866…80ca)   

The code for wait_for_blocks is something like this, so it will count twice block one and it will exit, hence why the fail assert later down the line.

while let Some(notification) = import_notification_stream.next().await {
    if notification.is_new_best {
	    blocks.insert(notification.hash);
	    if blocks.len() == count {
		    break
	    }
    }
}

Possible solutions:

  1. The quickest would be to wait more than 2 blocks before we ask for the runtime metrics, here
  2. Implement a test helper alternative to wait_for_block that takes into consideration Reorgs might happen.
  3. Fix wait_for_block helper in substrate, TBD if the current behaviour is correct or not.

alexggh added a commit to alexggh/polkadot that referenced this issue May 24, 2023
When an re-org happens wait_for_blocks(2) would actually exit after the second
import of blocks 1, so the conditions for the metric to exist won't be met
hence the occasional test failure.

More details in:
  paritytech#7267

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh
Copy link
Contributor

alexggh commented May 24, 2023

Proposed fix here: #7279

@bkchr
Copy link
Member

bkchr commented May 24, 2023

3. Fix wait_for_block helper in substrate, TBD if the current behaviour is correct or not.

Good idea.

@alexggh
Copy link
Contributor

alexggh commented May 24, 2023

  1. Fix wait_for_block helper in substrate, TBD if the current behaviour is correct or not.

Good idea.

Any thoughts if the current behaviour of the the function is correct in the presence of a Reorg, just by looking at the documentation of the API, I would say it does what we expect it to do:

/// Wait for `count` blocks to be imported in the node and then exit. This function will not
/// return if no blocks are ever created, thus you should restrict the maximum amount of time of
/// the test execution.
fn wait_for_blocks(&self, count: usize) -> Pin<Box<dyn Future<Output = ()> + Send>>;

@bkchr
Copy link
Member

bkchr commented May 24, 2023

I think I have written this and the idea was waiting for the chain to build n unique blocks. Not just imports.

paritytech-processbot bot pushed a commit that referenced this issue May 24, 2023
* metrics: tests: Fix flaky runtime_can_publish_metrics

When an re-org happens wait_for_blocks(2) would actually exit after the second
import of blocks 1, so the conditions for the metric to exist won't be met
hence the occasional test failure.

More details in:
  #7267

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>

* metrics: tests: Cleanup un-needed box pin

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@sandreim
Copy link
Contributor

fixed in #7279

@bkchr bkchr closed this as completed May 24, 2023
@github-project-automation github-project-automation bot moved this from To do to Done in Parachains-core May 24, 2023
alexggh added a commit to alexggh/substrate that referenced this issue May 24, 2023
In the cases where a reorg happens we might receive notifications
for different blocks at the same level, so instead of the chain having
count new blocks it has less and that will break the tests which use this
function.

So, use the block number to identify that `count` blocks have been built in the
chain.

Examples where this issue was hit:
  paritytech/polkadot#7267

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh
Copy link
Contributor

alexggh commented May 24, 2023

I think I have written this and the idea was waiting for the chain to build n unique blocks. Not just imports.

I guess you mean something like this: paritytech/substrate#14215 ?

paritytech-processbot bot pushed a commit to paritytech/substrate that referenced this issue May 26, 2023
In the cases where a reorg happens we might receive notifications
for different blocks at the same level, so instead of the chain having
count new blocks it has less and that will break the tests which use this
function.

So, use the block number to identify that `count` blocks have been built in the
chain.

Examples where this issue was hit:
  paritytech/polkadot#7267

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Ank4n pushed a commit to paritytech/substrate that referenced this issue Jul 8, 2023
In the cases where a reorg happens we might receive notifications
for different blocks at the same level, so instead of the chain having
count new blocks it has less and that will break the tests which use this
function.

So, use the block number to identify that `count` blocks have been built in the
chain.

Examples where this issue was hit:
  paritytech/polkadot#7267

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
nathanwhit pushed a commit to nathanwhit/substrate that referenced this issue Jul 19, 2023
In the cases where a reorg happens we might receive notifications
for different blocks at the same level, so instead of the chain having
count new blocks it has less and that will break the tests which use this
function.

So, use the block number to identify that `count` blocks have been built in the
chain.

Examples where this issue was hit:
  paritytech/polkadot#7267

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance.
Projects
No open projects
Status: Done
Development

No branches or pull requests

3 participants