Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vdev_mirror: when resilvering, try reading first #12327

Closed
wants to merge 1 commit into from

Conversation

nwf
Copy link
Contributor

@nwf nwf commented Jul 4, 2021

Motivation and Context

From experience, resilvers can make it most of the way through and die for one reason or another.

Description

Optimistically, then, fling reads at devices being resilvered and respond to checksum mismatches rather than always writing.

As this is a kind of nop-write, and out of an abundnace of caution, restrict this to checksum types strong enough to be considered good enough. (This, notably, excludes fletcher4, which is always used for pool metadata.)

Signed-off-by: Nathaniel Wesley Filardo nwfilardo@gmail.com

How Has This Been Tested?

Local use with zhack scrub.

Types of changes

  • Performance enhancement (non-breaking change which improves efficiency)

Checklist:

From experience, resilvers can make it most of the way through and die for one
reason or another.  Optimistically, then, fling reads at devices being
resilvered and respond to checksum mismatches rather than always writing.

As this is a kind of nop-write, and out of an abundnace of caution, restrict
this to checksum types strong enough to be considered good enough.  (This,
notably, excludes fletcher4, which is always used for pool metadata.)

Signed-off-by: Nathaniel Wesley Filardo <nwfilardo@gmail.com>
@jumbi77
Copy link
Contributor

jumbi77 commented Jul 4, 2021

Just referencing an old Illumos / OpenZFS ticket 5532, which covers similar code lines..but not really sure if its applicable...

@nwf
Copy link
Contributor Author

nwf commented Jul 4, 2021

Just referencing an old Illumos / OpenZFS ticket 5532, which covers similar code lines..but not really sure if its applicable...

Looks still applicable to me. Adding a test for mm->children > 1 to the conditional here will cause the mm->children == 1 case to fall through to vdev_mirror_child_select, which will pick the one child (barring badness) and reach the later zio calls in vdev_mirror_io_start, which avoids abd_alloc_sametype, bcopy, and abd_free. In fact it might be nice to reach that case if there's one child left standing after the test introduced by #11930, too.

@behlendorf behlendorf added the Status: Design Review Needed Architecture or design is under discussion label Jul 6, 2021
Copy link
Member

@ahrens ahrens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is saying that when we resilver, we read from both sides (old and new), and then write to the new side. Does this also short-circuit the write if the read from the new side checksums correctly? If so, that means that a resilver would read the entire new disk and then also write it, making resilvers go about half as fast? What's the benefit behind this?

!mm->mm_resilvering;
boolean_t resilver_read = have_checksum &&
(zio->io_flags & (ZIO_FLAG_SCRUB | ZIO_FLAG_RESILVER)) &&
(zio_checksum_table[BP_GET_CHECKSUM(zio->io_bp)].ci_flags &
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a good place to add a comment explaining the reasoning behind this.

@nwf
Copy link
Contributor Author

nwf commented Jul 9, 2021

For some drive technologies, most notably SMR, writing can be significantly slower than reading and this can significantly improve recovering from aborted resilvers or from the ill effects of pre-#11930 scans of mirrors with offline devices.

That said, it's probably a good idea to request this read-before-write strategy only on some scans and on some pools, so that we don't pay the read-then-write slowdown when we don't expect the new side to already be full of the correct data. That might mean a new ZIO flag that's controlled by the scan logic. I'll flesh the proposal out to include that if you think this change isn't a challenge to data integrity.

@jumbi77
Copy link
Contributor

jumbi77 commented Jul 12, 2021

Just referencing an old Illumos / OpenZFS ticket 5532, which covers similar code lines..but not really sure if its applicable...

Looks still applicable to me. Adding a test for mm->children > 1 to the conditional here will cause the mm->children == 1 case to fall through to vdev_mirror_child_select, which will pick the one child (barring badness) and reach the later zio calls in vdev_mirror_io_start, which avoids abd_alloc_sametype, bcopy, and abd_free. In fact it might be nice to reach that case if there's one child left standing after the test introduced by #11930, too.

Will you then add the mm->mm_children > 1 check ?

@jumbi77
Copy link
Contributor

jumbi77 commented Sep 18, 2021

@nwf Can you may rebase this, resolve the feedback (add requested comment) and optionally add the mm->mm_children > 1 check ? Much thanks in advance and looking forward to get this upstreamed.

@mmaybee Can you may take a look too? Thanks in advance for your review.

@behlendorf behlendorf added Status: Revision Needed Changes are required for the PR to be accepted and removed Status: Design Review Needed Architecture or design is under discussion labels Sep 16, 2022
@jumbi77
Copy link
Contributor

jumbi77 commented Sep 22, 2022

@nwf Can you may rebase this, resolve the feedback (add requested comment) and optionally add the mm->mm_children > 1 check ? Much thanks in advance and looking forward to get this upstreamed.

@mmaybee Can you may take a look too? Thanks in advance for your review.

Politely ping @nwf. Do you may can rebase this? Much thanks in advance.

@amotin
Copy link
Member

amotin commented Oct 31, 2024

I think we should address resilver restarts issue via reducing the amount of block pointer we needlessly accumulate in RAM for sorting if the pool is not fragmented, so that we could create checkpoints more often. I don't think it is reasonable to expect during general case resilver (unlike scrub) for the target device to already have all the data. I agree with @ahrens that this would take at least twice more time for new disks.

@amotin amotin closed this Oct 31, 2024
@jumbi77
Copy link
Contributor

jumbi77 commented Nov 1, 2024

I think we should address resilver restarts issue via reducing the amount of block pointer we needlessly accumulate in RAM for sorting if the pool is not fragmented, so that we could create checkpoints more often. I don't think it is reasonable to expect during general case resilver (unlike scrub) for the target device to already have all the data. I agree with @ahrens that this would take at least twice more time for new disks.

@amotin And is just the one liner mm->mm_children > 1 from your original illumos ticket applyable (any benefit without side effect?)

@amotin
Copy link
Member

amotin commented Nov 1, 2024

And is just the one liner mm->mm_children > 1 from your original illumos ticket applyable (any benefit without side effect?)

@jumbi77 If you are about avoiding extra memory copy, then it is already handled by #13606 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Revision Needed Changes are required for the PR to be accepted
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants