-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix broker crash in resource.status
RPC handling when exlcuded ranks are also down
#5870
Conversation
Hm, asan detected another (or I didn't really fix the original) use-after-free, so hold off on a review. |
Problem: rlist_remove_rank() removes the requested rnode from the rlist nodes list, but doesn't delete the item from the rank hash. This can lead to a use-after-free if the removed rank is looked up after this. Call rank_hash_delete() after the rank is deleted from the nodes list to ensure references to the removed rnode.
Problem: rlist_mark_state() doesn't check the return value from rlist_find_rank(). If a rank is passed to rlist_mark_up() or rlist_mark_down() that doesn't exist in the current rlist, then this will result in a dereference of NULL and a segfault. Skip ranks that don't exist in rlist_mark_start(), effectively ignoring invalid ranks passed to rlist_mark_up() and rlist_mark_down(). This is simpler than returning an error, which would require multiple iterations of the ranks idset as well as more complex error handling on the caller side.
Problem: In some cases, rlist_alloc() leaves errp->text uninitialized when returning an error. This can lead the caller to access unitialized bytes, potentially causing a crash. Always use errprintf() before returning an error from this function.
Problem: There is no test that ensure marking ranks up/down works after some ranks have been removed from an rlist. Add a test specific to this case (Issue flux-framework#5868)
Ok, duh, you can't remove the object from the hash after it has been destroyed. Fixed and repushed. Also fixed an uninitialized bytes warning from valgrind ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for running this down so fast.
Thanks I'll set MWP. (Hopefully tests pass) |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #5870 +/- ##
==========================================
- Coverage 83.30% 83.29% -0.02%
==========================================
Files 514 514
Lines 82751 82756 +5
==========================================
- Hits 68939 68932 -7
- Misses 13812 13824 +12
|
This PR fixes issue #5856.
When ranks are excluded, the resource module removes them from the constructed rlist with
rlist_remove_ranks()
. However, there is a bug inrlist_remove_ranks()
that leaves pointers to the destroyedrnode
objects in the rnode lookup hash. This causes a subsequentrlist_mark_down()
to crash if an excluded rank is also down.This PR fixes that main issue and also ignores missing ranks in
rlist_mark_down()
andrlist_mark_up()
to avoid a subsequent NULL deref.