
cluster: reject writes only when data disk is degraded #24436

Merged
nvartolomei merged 1 commit into redpanda-data:dev from nvartolomei:nv/CORE-8349 on Dec 7, 2024

Conversation

@nvartolomei (Contributor) commented Dec 4, 2024

Background: https://redpandadata.atlassian.net/browse/CORE-8349

The health monitor now tracks only the data disk, since the sole consumer of that state is the write-rejection path. The cache disk's state is irrelevant at the cluster level.

This was tested manually by creating a cluster with a custom cache disk mountpoint and trying to produce to it.

Before this commit, producing would have failed with a full cache disk. After this commit, producing fails only if the data disk is full.
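
As a rough illustration of the change (a minimal C++ sketch with hypothetical names; the actual Redpanda types differ), the health monitor now carries only the data disk's space alert, and the write path keys off that single value:

```cpp
// Minimal sketch, assuming hypothetical names; not the actual
// Redpanda API.
enum class disk_space_alert { ok, low_space, degraded };

struct node_health_report {
    // Previously this report also carried the cloud storage cache
    // disk's state. Only the data disk is tracked now, since the
    // sole consumer of this state is the write-rejection path.
    disk_space_alert data_disk_alert;
};

bool should_reject_writes(const node_health_report& report) {
    return report.data_disk_alert == disk_space_alert::degraded;
}
```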

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Bug Fixes

  • If a separate (discrete) disk is used for the cloud storage cache, Redpanda previously rejected writes when that disk (the cache disk) was full (in a degraded state). This was incorrect, since the cache disk is not on the write path. Writes are now rejected only when the data disk is full (in a degraded state); see the sketch below.
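
To make the before/after concrete, here is a hedged sketch (illustrative names only, reusing the `disk_space_alert` enum from the sketch above; not the real code). Previously the rejection predicate effectively considered every local disk, so a full cache disk alone could block produces; now only the data disk is consulted:

```cpp
// Old behavior (sketch): a degraded cache disk alone triggered
// write rejection.
bool should_reject_writes_old(disk_space_alert data_disk,
                              disk_space_alert cache_disk) {
    return data_disk == disk_space_alert::degraded
           || cache_disk == disk_space_alert::degraded;
}

// New behavior (sketch): only the data disk matters, because the
// cache disk is not on the write path.
bool should_reject_writes_new(disk_space_alert data_disk,
                              disk_space_alert /*cache_disk*/) {
    return data_disk == disk_space_alert::degraded;
}
```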

@vbotbuildovich (Collaborator) commented Dec 4, 2024

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/59241#01939393-73db-4935-87e8-9b193bc30f60:

"rptest.tests.full_disk_test.WriteRejectTest.test_refresh_disk_health"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/59251#01939471-a183-4f65-8f55-0d0b88888849:

"rptest.tests.archive_retention_test.CloudArchiveRetentionTest.test_delete.cloud_storage_type=CloudStorageType.ABS.retention_type=retention.ms"

@vbotbuildovich (Collaborator) commented

Retry command for Build#59241

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/full_disk_test.py::WriteRejectTest.test_refresh_disk_health

@vbotbuildovich (Collaborator) commented

the below tests from https://buildkite.com/redpanda/redpanda/builds/59251#01939417-d322-4b6a-b8f3-30c4e837a917 have failed and will be retried

gtest_raft_rpunit

@vbotbuildovich (Collaborator) commented

Retry command for Build#59251

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/archive_retention_test.py::CloudArchiveRetentionTest.test_delete@{"cloud_storage_type":2,"retention_type":"retention.ms"}

@dotnwat (Member) left a comment

makes sense to me!

@nvartolomei (Contributor, Author) commented

/ci-repeat 1
tests/rptest/tests/archive_retention_test.py::CloudArchiveRetentionTest.test_delete@{"cloud_storage_type":2,"retention_type":"retention.ms"}

@nvartolomei changed the title from "cluster: reject writes only if data disk is degraded" to "cluster: reject writes only when data disk is degraded" on Dec 6, 2024
@nvartolomei (Contributor, Author) commented

/ci-repeat 1
skip-redpanda-builds

@nvartolomei merged commit 252f5be into redpanda-data:dev on Dec 7, 2024
17 checks passed
@nvartolomei deleted the nv/CORE-8349 branch on December 7, 2024 at 03:35
@nvartolomei restored the nv/CORE-8349 branch on December 7, 2024 at 03:35
@vbotbuildovich (Collaborator) commented

/backport v24.3.x

@vbotbuildovich (Collaborator) commented

/backport v24.2.x

@vbotbuildovich (Collaborator) commented

/backport v24.1.x
