-
Notifications
You must be signed in to change notification settings - Fork 791
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[19696][20409] Fix data race on PDP #4220
Conversation
3f6008b
to
4e428af
Compare
@richiprosima please test this |
1 similar comment
@richiprosima please test this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- There is a test that has failed with segmentation fault both in Mac and Windows
- I don't like the solution of copying the whole
ParticipantProxyData
, it might be better to change the API of the methods receiving it so they only use the necessary information, and just copy that information
4e428af
to
d25b750
Compare
@richiprosima please test this |
@richiprosima please test this |
0c75e09
to
11288a8
Compare
@richiprosima please test this |
@juanlofer-eprosima please address linter issue. Is the linux discovery failed test related with the PR? |
Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com>
Signed-off-by: tempate <danieldiaz@eprosima.com>
…rectly Signed-off-by: tempate <danieldiaz@eprosima.com>
Signed-off-by: tempate <danieldiaz@eprosima.com>
Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com>
11288a8
to
b2e6398
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job in this PR as it effectively solves the data race it describes, I also like the simplification on the PDPClient
in order to use the is_connected
, avoiding to store the entire ParticipantProxyData
. I am approving though leaving a reflection.
Trying to give a second thought on @MiguelCompany suggested, I think this work may admit an even fine-grained solution to avoid copying the entire ParcipantProxyData
in stack. But it is also true that we are highly tied to the entire pdata
within PDP::assign_remote_endpoints()
specially when we are notifying the user with the DiscoveredParticipantInfo
. The rest of the methods, matched_reader/writer_add()
and assign_low_level_endpoints()
do not make an intensive use of the ParticipantProxyData
and it would be enough to pass them only some of the fields: some locator lists
, the guid
and the m_availableBuiltinEndpoints
in essence.
Finally, I am attaching a possible patch of a different approach instead of stack-copying, which explores the option of synchronizing the PDP::remove_remote_participant()
and PDP::assign_remote_endpoints()
only in the case that we are trying to delete a pdata
that is is currently being processed. The fix is fast and dirty just to make an idea of the approach and I would still need to think if it would solve all the situations that we are solving by making that stack copy.
pdp_patch_another_approach.txt
Approving !
I included the backports in the PR description |
@Mergifyio backport 2.12.x 2.10.x 2.6.x |
✅ Backports have been created
|
I also agree the solution @MiguelCompany suggested was better, but we decided to proceed with the copy approach given the effort required, and also considering that the root of the problem (releasing the PDP mutex before completing an entire discovery operation, which might lead to inconsistent states apart from data races as the one we are dealing with here) would still remain unsolved. We also discussed that a possible solution to tackle the root of the problem would be to derive all discovery actions to a dedicated thread. Regarding your proposal, I believe it is a bit risky to wait as some lower level resources might be blocked by that thread, leading to potential deadlocks. And two additional comments:
|
Yeah, agree that it would be risky as a consequence of relying on a cond var. A further |
That's an interesting approach we could further explore. The only caveat I see at the moment is that |
* Add regression test Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> * Make a copy of the participant proxy data Signed-off-by: tempate <danieldiaz@eprosima.com> * Use the copy method to ensure all the attributes are being copied correctly Signed-off-by: tempate <danieldiaz@eprosima.com> * Lookup the participant proxy data to pass the discovery-server tests Signed-off-by: tempate <danieldiaz@eprosima.com> * Get rid of proxy reference in RemoteServerAttributes Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> * Fix failing test Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> * Uncrustify Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> --------- Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> Signed-off-by: tempate <danieldiaz@eprosima.com> Co-authored-by: Juan Lopez Fernandez <juanlopez@eprosima.com> (cherry picked from commit 3053832)
* Add regression test Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> * Make a copy of the participant proxy data Signed-off-by: tempate <danieldiaz@eprosima.com> * Use the copy method to ensure all the attributes are being copied correctly Signed-off-by: tempate <danieldiaz@eprosima.com> * Lookup the participant proxy data to pass the discovery-server tests Signed-off-by: tempate <danieldiaz@eprosima.com> * Get rid of proxy reference in RemoteServerAttributes Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> * Fix failing test Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> * Uncrustify Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> --------- Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> Signed-off-by: tempate <danieldiaz@eprosima.com> Co-authored-by: Juan Lopez Fernandez <juanlopez@eprosima.com> (cherry picked from commit 3053832)
Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com>
* Fix data race on PDP (#4220) * Add regression test Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> * Make a copy of the participant proxy data Signed-off-by: tempate <danieldiaz@eprosima.com> * Use the copy method to ensure all the attributes are being copied correctly Signed-off-by: tempate <danieldiaz@eprosima.com> * Lookup the participant proxy data to pass the discovery-server tests Signed-off-by: tempate <danieldiaz@eprosima.com> * Get rid of proxy reference in RemoteServerAttributes Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> * Fix failing test Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> * Uncrustify Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> --------- Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> Signed-off-by: tempate <danieldiaz@eprosima.com> Co-authored-by: Juan Lopez Fernandez <juanlopez@eprosima.com> (cherry picked from commit 3053832) * Export ParticipantProxyData constructor/destructor Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> --------- Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> Co-authored-by: Daniel Díaz Quílez <33602217+Tempate@users.noreply.github.com> Co-authored-by: Juan Lopez Fernandez <juanlopez@eprosima.com>
* Add regression test Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> * Make a copy of the participant proxy data Signed-off-by: tempate <danieldiaz@eprosima.com> * Use the copy method to ensure all the attributes are being copied correctly Signed-off-by: tempate <danieldiaz@eprosima.com> * Lookup the participant proxy data to pass the discovery-server tests Signed-off-by: tempate <danieldiaz@eprosima.com> * Get rid of proxy reference in RemoteServerAttributes Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> * Fix failing test Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> * Uncrustify Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> --------- Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> Signed-off-by: tempate <danieldiaz@eprosima.com> Co-authored-by: Juan Lopez Fernandez <juanlopez@eprosima.com> (cherry picked from commit 3053832)
Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com>
* Fix data race on PDP (#4220) Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> * Export ParticipantProxyData constructor/destructor Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> --------- Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> Co-authored-by: Juan Lopez Fernandez <juanlopez@eprosima.com>
* Fix data race on PDP (#4220) * Add regression test Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> * Make a copy of the participant proxy data Signed-off-by: tempate <danieldiaz@eprosima.com> * Use the copy method to ensure all the attributes are being copied correctly Signed-off-by: tempate <danieldiaz@eprosima.com> * Lookup the participant proxy data to pass the discovery-server tests Signed-off-by: tempate <danieldiaz@eprosima.com> * Get rid of proxy reference in RemoteServerAttributes Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> * Fix failing test Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> * Uncrustify Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> --------- Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> Signed-off-by: tempate <danieldiaz@eprosima.com> Co-authored-by: Juan Lopez Fernandez <juanlopez@eprosima.com> (cherry picked from commit 3053832) * Export ParticipantProxyData constructor/destructor Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> --------- Signed-off-by: Juan Lopez Fernandez <juanlopez@eprosima.com> Co-authored-by: Daniel Díaz Quílez <33602217+Tempate@users.noreply.github.com> Co-authored-by: Juan Lopez Fernandez <juanlopez@eprosima.com>
Description
This PR intends to solve a race condition on PDP discovery phase, in which a data UP is received and processed while its corresponding previously received data P is still being processed.
As a result, the participant proxy created and registered can be cleared before calling assignRemoteEndpoints, as the PDP mutex is released just before (on purpose to avoid interlocking problems).
This issue has been reported by TSAN in this particular scenario, when the data P is received in the UDP multicast thread and then the data UP in the unicast one. However, this problem should also affect other scenarios such as:
The proposed solution is to copy the received participant proxy data before releasing the PDP mutex, and then passing this copy forward to the methods requiring this information.
@Mergifyio backport 2.12.x 2.10.x 2.6.x
Contributor Checklist
versions.md
file (if applicable).Reviewer Checklist