[Feature] Ability to cancel in-flight activities / sub-orchestrations when terminating parent #506
Comments
@cgillum Any news on this topic? Is there an ETA? We are also running ETL on huge datasets and would like the option to stop activities that are currently awaited by the orchestration when the orchestration gets terminated. |
We've had some internal discussions about this, but there is no ETA yet. Until we have something, the workaround would be to use a background thread to periodically poll the status of the parent orchestration to see if it was terminated. |
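A minimal sketch of the polling workaround described above, in plain Python rather than the Durable Functions SDK. `get_parent_status` is a hypothetical callable standing in for whatever status query your client offers, and the `"Terminated"` status string and the polling interval are assumptions for illustration.

```python
import threading
import time
from typing import Callable, Iterable, List


def watch_parent(get_parent_status: Callable[[], str],
                 stop: threading.Event,
                 interval_s: float = 5.0) -> threading.Thread:
    """Start a daemon thread that sets `stop` once the parent reports Terminated."""

    def poll() -> None:
        while not stop.is_set():
            # Hypothetical status check; replace with a real status query.
            if get_parent_status() == "Terminated":
                stop.set()
                return
            # Event.wait doubles as an interruptible sleep between polls.
            stop.wait(interval_s)

    thread = threading.Thread(target=poll, daemon=True)
    thread.start()
    return thread


def long_running_activity(items: Iterable[int],
                          get_parent_status: Callable[[], str],
                          interval_s: float = 5.0) -> List[int]:
    """Process items one at a time, stopping early if the parent was terminated."""
    stop = threading.Event()
    watcher = watch_parent(get_parent_status, stop, interval_s)
    processed: List[int] = []
    try:
        for item in items:
            if stop.is_set():
                break  # parent terminated: abandon the remaining work
            processed.append(item)
            time.sleep(0.01)  # stand-in for real work
    finally:
        stop.set()      # also shuts the watcher down on normal completion
        watcher.join()
    return processed
```

The activity only notices termination between units of work, so the check needs to sit inside the loop at a granularity you are comfortable with. A shorter polling interval reacts faster but issues more status queries.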
@cgillum +1 for this please |
+1 |
It'd be great +1 |
Any progress on this? |
For anyone waiting for something like this, you can achieve it (at least for sub-orchestrations) by prefixing the instance IDs of all your sub-orchestrations with the instance ID of the parent, then using ListInstances with a prefix to find all child orchestrators to terminate. It doesn't do anything about in-flight activities, but that might not be required for your use case. |
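The prefix approach could look roughly like the following sketch. `InMemoryClient` is a hypothetical stand-in for a durable client (it is not the real SDK), and the `:` separator in child IDs is an assumption; the separator matters because a bare prefix such as `job-1` would also match an unrelated instance named `job-10`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InMemoryClient:
    """Hypothetical stand-in for a durable client: maps instance ID to status."""
    instances: Dict[str, str] = field(default_factory=dict)

    def start(self, instance_id: str) -> None:
        self.instances[instance_id] = "Running"

    def list_by_prefix(self, prefix: str) -> List[str]:
        return [i for i in self.instances if i.startswith(prefix)]

    def terminate(self, instance_id: str, reason: str) -> None:
        if self.instances.get(instance_id) == "Running":
            self.instances[instance_id] = "Terminated"


def child_instance_id(parent_id: str, name: str) -> str:
    """Build a child ID that embeds the parent ID plus a separator."""
    return f"{parent_id}:{name}"


def terminate_with_children(client: InMemoryClient, parent_id: str) -> List[str]:
    """Terminate the parent first, then every instance sharing its prefix."""
    reason = f"parent {parent_id} terminated"
    # Parent goes first so it cannot schedule further children meanwhile.
    client.terminate(parent_id, reason)
    children = client.list_by_prefix(child_instance_id(parent_id, ""))
    for child in children:
        client.terminate(child, reason)
    return children
```

Because nested children would carry the full chain of ancestor IDs, a single prefix query on the root also finds grandchildren, provided every level follows the same naming rule.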
Hi, |
No updates at this point. Adding @AnatoliB for visibility on this ask. |
I've considered building this myself in the past, I think it would be relatively simple to do if you wanted to give it a go. It involves the following :
Here's a quick sequence diagram that expresses the flow. Notes
@cgillum any ideas why this might not work? |
Thanks @olitomlinson for this detailed suggestion. There is definitely some complexity involved that makes me worry about potential corner cases where things could go wrong (notes 6 and 7 would have me especially worried). But I think the most important thing you point out is this:
Based on my reading of this issue and the feedback we receive internally, most users need to be able to terminate activity executions that have already started. For example, there's a runaway activity execution that's consuming important resources and needs to be explicitly killed. The existing termination logic will already ensure that no new activities will be scheduled (assuming there's not already a message in the work-item queue) so it seems the window for which this approach applies might be too small. All that said, this issue is coming up frequently enough that we've decided to work on an official proposal for how to solve it. I'll create a separate issue tracking the proposal and post a link to it here for folks that are interested in weighing in. |
A simple (but disruptive) answer to this is to let the operator recycle the app? That would kill the activity, right? If so, then the next challenge is how to alert the operator that there is a runaway Activity? This sounds like it could be highly subjective? |
It will stop the current activity execution, but the activity will start running again as soon as the app starts back up. I'm thinking we need something to handle this case as well. |
Ah okay, sorry I thought a runaway process (in this instance) might be when the runtime gets stuck due to some rare race condition, thus just recycle to let the app try the Activity again. But I see that’s not the case here. — Thought : I wonder if the cancellationToken that is passed from the host to every Function invocation, could be leveraged and extended to support targeted Activity termination? Somehow DF would hook into the pipeline of that CancellationToken and wrap/combine it with activity-aware cancellation behaviours. This would require developers to actively decorate their Activity code to react to the cancellationToken being invoked, but IMO that’s a best practice anyway - we provide many reasons to encourage developers to utilise the provided cancellationToken, as they all have the same outcome which is the opportunity for graceful termination regardless of the cause of the termination. |
Yes, ideally this is how we'd surface it so that developers don't have to learn any new concepts. There is a technical problem, which is that I don't think we, as an extension, have access to this cancellation token and thus can't change it or wire it up to our own. A bigger problem, however, is that cancellation tokens are unique to .NET. There isn't any equivalent in Python, JavaScript, PowerShell, or even Java. We'd likely need to invent some new abstraction for those. |
I remember seeing an exception-based pattern for simulating cancellation tokens in JS and Python that we could look into. Just my 2 cents. |
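One possible shape for such an exception-based pattern in Python is sketched below. The `CancellationToken` and `OperationCancelled` names are invented for illustration and are not part of any Durable Functions SDK; the activity checks the token at safe points and unwinds through a normal exception.

```python
import threading
from typing import Iterable, List


class OperationCancelled(Exception):
    """Raised inside an activity once cancellation has been requested."""


class CancellationToken:
    """Hypothetical token: thread-safe flag plus a raising check."""

    def __init__(self) -> None:
        self._event = threading.Event()

    def cancel(self) -> None:
        self._event.set()

    @property
    def is_cancelled(self) -> bool:
        return self._event.is_set()

    def raise_if_cancelled(self) -> None:
        if self._event.is_set():
            raise OperationCancelled()


def copy_rows(rows: Iterable[int], sink: List[int], token: CancellationToken) -> int:
    """Copy rows into sink, checking for cancellation before each row."""
    for row in rows:
        token.raise_if_cancelled()  # safe point: nothing half-written here
        sink.append(row)
    return len(sink)
```

Raising an exception lets existing `try`/`finally` blocks and context managers run their cleanup, which is the graceful-termination behaviour discussed earlier in the thread.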
For PowerShell, exposing the .NET cancellation token object as is should be fine: it's not the most idiomatic, but there is nothing idiomatic, so we can at least avoid inventing an entirely new thing. |
It's tagged as high priority and has been four years.... |
Linking to existing DTFx issue here: Azure/durabletask#446 |
Wondering if it would be acceptable to add |
@brandonh-msft @cgillum, am I right in saying that this is being addressed in this PR? Azure/durabletask#787 |
@gorillapower: almost! That PR applies specifically to cascading management operations (terminate, suspend, resume) to sub-orchestrators. It doesn't apply to Activities, which execute quite differently. |
@brandonh-msft: That's one of the options we're exploring. One blocker is that we will probably need support from the Azure Functions Host in order to manage that cancellation token. |
Any update on this topic? Are you considering adding support for CancellationToken for activities, @davidmrdavid? |
@yoozek: We have discussed this a few times internally since the last update here, but it is not currently on our immediate agenda. As always, activity in this issue, like your comment (thanks!), is a helpful signal to give it more priority. |
A small update on this: I recently merged a protobuf update for supporting cascade terminate for orchestrations. This will make it possible for gRPC-based out-of-proc SDKs (including .NET Isolated) to implement the cascading termination of orchestrations if/when it's supported by the Durable Task Framework for .NET. |
We are running a big migration through a hierarchy of orchestrations; the whole thing can take days. Sometimes we need to pause, for example for a system upgrade. So we keep track of all the instance IDs in a SQL table. When we need to pause, we run a routine to kill all sub-orchestrations, then another routine to restart everything. So yes, we would be very interested in this functionality :) |
+1 on the need for this feature :) any updates? |
+1 here also. Anyone got a working workaround for terminating activities? |
Workaround.. In your activity function you can call |
Just wanted to check in on a status of this and see if there is an ETA. Looking forward to hopefully getting this feature. Some of our activities are very long running. Thank you! |
Unfortunately, this hasn't been prioritized yet, but it is an item of constant discussion, and I was just thinking about it this morning. I think the main challenge is minimizing cost: a long-running activity will need to query some external store to determine if its parent orchestrator is terminated, and if we do that too often that'll result in more storage costs for the end user. Here's my current thinking (for the Azure Storage backend):
Not sure how applicable this design is to the other storage providers though, at least for the Azure Storage this seems reasonable to me. |
Would there be any additional cost to make the orchestrator stop picking up new activities once the parent has been terminated? My interest in this ticket is about canceling an in-progress orchestration that is using a fan-out approach which is comprised of many small tasks |
@chazmuzz: hmm, can you clarify what this scenario looks like? Sounds to me like there are two orchestrators, A and B: A calls B as a sub-orchestrator, then B fans out, and A gets terminated. Did I get that right? And then you want "B" to no longer perform any replays? |
Something to consider, if not already mentioned: if you recorded the activities related to an orchestration (or orchestration instances) in storage, that would certainly add to storage costs, but as mentioned earlier this could be opt-in. For example, our volume is fairly low and the priority is on audit/control, so it is of high value for us to see the activities and their run state and to be able to control them as well as possible. Azure Storage retention policies (lifecycle management) can help with storage costs as well. Of course, lifecycle management only helps with Azure Storage implementations; other storage mechanisms would need to implement something similar. For example, if SQL Server is used, perhaps system-versioned tables could be utilized to track the history of orchestrations/activities and keep that history for a limited time period. |
@chazmuzz Activity functions cannot access the |
I believe they can access the client if you declare the client as a binding. You can definitely have an activity function trigger with arbitrary bindings, and one such binding is the durable client. |
Does anyone have a proper example of using a cancellation token with Durable Functions? Suppose the user cancels the function call from the browser; the entire durable function workflow should then stop. Please share a code snippet. |
@sand06-web I would recommend looking into external events as a way to implement cancellation. Please see: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-external-events?tabs=csharp |
We've built a few ETL processes using Durable Functions that fan-out to thousands of activities. If we find a problem anywhere, we still generally have to wait for all child functions to finish execution, even if we terminate the parent orchestration directly (which is related to #153 but I'm including activities as well here).
It would be really useful and save a lot of time for us to be able to gracefully cancel OR abort child functions based on a parent instance being terminated. I could envisage this working (hypothetically) in a couple of ways:
These are common approaches that I've seen before for cancelling / aborting long-running jobs. It would be great to understand if this is something on the roadmap for delivery, as it is something missing that would be trivial to do in standard .NET apps.
Additionally, for fan-out scenarios, if the parent instance is terminated and a child activity has not been executed yet, it should just be removed from queues and not executed at all.
Having the ability to abort functions as quickly as possible becomes particularly valuable when the work being performed could be very damaging to a business (eg, accidentally mailing confidential information to the wrong group of users).