Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix broker crash in resource.status RPC handling when exlcuded ranks are also down #5870

Merged
merged 4 commits into from
Apr 8, 2024

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Apr 8, 2024

This PR fixes issue #5856.

When ranks are excluded, the resource module removes them from the constructed rlist with rlist_remove_ranks(). However, there is a bug in rlist_remove_ranks() that leaves pointers to the destroyed rnode objects in the rnode lookup hash. This causes a subsequent rlist_mark_down() to crash if an excluded rank is also down.

This PR fixes that main issue and also ignores missing ranks in rlist_mark_down() and rlist_mark_up() to avoid a subsequent NULL deref.

@grondo
Copy link
Contributor Author

grondo commented Apr 8, 2024

Hm, asan detected another (or I didn't really fix the original) use-after-free, so hold off on a review.

grondo added 2 commits April 8, 2024 14:13
Problem: rlist_remove_rank() removes the requested rnode from the rlist
nodes list, but doesn't delete the item from the rank hash.  This can
lead to a use-after-free if the removed rank is looked up after this.

Call rank_hash_delete() after the rank is deleted from the nodes list
to ensure references to the removed rnode.
Problem: rlist_mark_state() doesn't check the return value from
rlist_find_rank(). If a rank is passed to rlist_mark_up() or
rlist_mark_down() that doesn't exist in the current rlist, then this
will result in a dereference of NULL and a segfault.

Skip ranks that don't exist in rlist_mark_start(), effectively
ignoring invalid ranks passed to rlist_mark_up() and rlist_mark_down().

This is simpler than returning an error, which would require multiple
iterations of the ranks idset as well as more complex error handling
on the caller side.
grondo added 2 commits April 8, 2024 14:14
Problem: In some cases, rlist_alloc() leaves errp->text uninitialized
when returning an error. This can lead the caller to access unitialized
bytes, potentially causing a crash.

Always use errprintf() before returning an error from this function.
Problem: There is no test that ensure marking ranks up/down works
after some ranks have been removed from an rlist.

Add a test specific to this case (Issue flux-framework#5868)
@grondo
Copy link
Contributor Author

grondo commented Apr 8, 2024

Ok, duh, you can't remove the object from the hash after it has been destroyed. Fixed and repushed. Also fixed an uninitialized bytes warning from valgrind (error.text not initialized after all error returns from rlist_alloc())

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for running this down so fast.

@grondo
Copy link
Contributor Author

grondo commented Apr 8, 2024

Thanks I'll set MWP. (Hopefully tests pass)

Copy link

codecov bot commented Apr 8, 2024

Codecov Report

Merging #5870 (17a2855) into master (e479077) will decrease coverage by 0.02%.
The diff coverage is 87.50%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5870      +/-   ##
==========================================
- Coverage   83.30%   83.29%   -0.02%     
==========================================
  Files         514      514              
  Lines       82751    82756       +5     
==========================================
- Hits        68939    68932       -7     
- Misses      13812    13824      +12     
Files Coverage Δ
src/common/librlist/rlist.c 85.74% <87.50%> (+0.04%) ⬆️

... and 18 files with indirect coverage changes

@mergify mergify bot merged commit 3464b31 into flux-framework:master Apr 8, 2024
35 checks passed
@grondo grondo deleted the issue#5868 branch April 8, 2024 21:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants