-
-
Notifications
You must be signed in to change notification settings - Fork 750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make update_execution() atomic #5358
Make update_execution() atomic #5358
Conversation
…1/st2 into bkhushboo/race_condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So basically, all you did was add one line to create the with ...get_lock(...):
line, and then indent then block of code to protect it within the lock. And then add some more to the tests. Nice.
Why did adding 1 lock in 1 method increase the number of lock side effects so much in the tests?
with Timer(key="action.executions.calculate_result_size"): | ||
result_size = len( | ||
ActionExecutionDB.result._serialize_field_value(liveaction_db.result) | ||
with coordination.get_coordinator().get_lock(liveaction_db.id): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a prefix to this to scope the change just to this block? Or are there other places already using this the live action id as the lock name that you also want to block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cognifloyd No, get_lock() is not being used anywhere else.
And to answer why tests are failing, /~https://github.com/StackStorm/st2/blob/master/st2actions/tests/unit/test_workflow_engine.py#L155-L159 mocks get_lock() to return ToozConnectionError
(to test
st2/st2common/st2common/services/workflows.py
Line 941 in 007beed
with coord_svc.get_coordinator(start_heart=True).get_lock(wf_ex_id): |
Because of the above mocking, action execution update calls will fail with my changes and cause the test assertions to fail. I'm working on the fix for this.
@cognifloyd All the tests are passing now. Could you please review and merge the request? Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM I'm not sure if this will get merged for 3.6.0 or if we'll wait till 3.7.0 to include it.
I remember there is another PR that is more involved related to changing the behavior in st2workflowengine #5367. @khushboobhatia01 Could you please include the Changelog record as well for this PR to make it complete? |
I added a changelog entry |
Thanks, looks we're all good ✔️ on this one! |
Making update_execution() atomic to avoid concurrent updates causing inconsistency.
In the below screenshot two concurrent updates to the execution object resulted in inconsistent data where overall execution is set to running, but the log field depicts the execution had succeeded.

How did this happen?
Two interleaved update_execution() can cause this.
succeeded
log is pushedAfter the above interleaved execution, P1 will overwrite overall execution status (succeeded set by P2) to running.