Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: crash for delconsumer during stream reading #4513

Merged
merged 1 commit into from
Jan 27, 2025

Conversation

BorysTheDev
Copy link
Contributor

if (sitem.consumer->pel->numnodes == 0) {
LOG(DFATAL) << "Internal error when accessing consumer data, seen_time "
<< sitem.consumer->seen_time;
result = OpStatus::CANCELLED;
return OpStatus::OK;
}
}

range_opts.noack = opts->noack;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's this? does the test cover this fix as well?

Copy link
Contributor Author

@BorysTheDev BorysTheDev Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was 2340 line I haven't changed it.

@BorysTheDev BorysTheDev merged commit e434f62 into main Jan 27, 2025
10 checks passed
@BorysTheDev BorysTheDev deleted the fix_crash_stream_delconsumer branch January 27, 2025 16:32
}
sitem.consumer = range_opts.consumer;
if (!sitem.consumer) {
return OpStatus::OUT_OF_MEMORY;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly misleading in this context, because FindOrAddConsumer did not fail because we are out of memory but because the consumer group (alice in the test) is deleted and the pointer is nullptr

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is something else that is happening, I did not understand the crash until I actually used the debugger and stepped through but a few comments:

if sitem.consumer == nullptr we would return OUT_OF_MEMORY which would make the XREAD command return return rb->SendNullArray(); (see how we handle the transaction result below).

So that means that we never triggered that path in the first place! The actual crash came from below:

        if (sitem.consumer->pel->numnodes == 0) {
          LOG(DFATAL) << "Internal error when accessing consumer data, seen_time "
                      << sitem.consumer->seen_time;
          result = OpStatus::CANCELLED;
          return OpStatus::OK;
        }

So my question here is:

How does the introduced flow which is never triggered change the value of the check below ?

if (sitem.consumer->pel->numnodes == 0)

?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BorysTheDev the first comment is worth responding.
@kostasrim I believe the fix to the crash is this line:
sitem.consumer = range_opts.consumer;

Copy link
Contributor Author

@BorysTheDev BorysTheDev Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romange You are absolutely right sitem.consumer = range_opts.consumer; is the fix, because sitem.consumer contains already invalidated pointer

I've added

if (!sitem.consumer) {
          return OpStatus::OUT_OF_MEMORY;

for the case if the memory for the consumer can not be allocated, but it isn't a fix of the crash

Copy link
Contributor

@kostasrim kostasrim Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, for (1) and from my understanding FindOrAddConsumer shall return null if cg is empty or if we try to insert the consumer into the rax data structure and we fail. Not sure if the second can fail only under out of memory scenario but either way it would worth a comment.

For (2), yes I got it and checked it myself. I thought the fix was the early return but that didn't make sense and that's why I left the comment above yesterday.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BorysTheDev but we don't return NULL within FindOrAddConsumer only when OUT_OF_MEMORY right ?
From my understanding, cg can be null, if the consumer does not exist then it must be interted with raxInsert which can return NULL as well (not sure if this is only for OUT OF MEMORY though)

The rest make sense :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep we got checks for cg + raxInsert will return null only if it fails to insert. all good 👍

romange pushed a commit that referenced this pull request Jan 28, 2025
fix: crash for delconsumer during reading stream
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants