[Feature] Ability to cancel in-flight activities / sub-orchestrations when terminating parent #506

sam-piper · 2018-11-09T02:11:16Z

We've built a few ETL processes using Durable Functions that fan-out to thousands of activities. If we find a problem anywhere, we still generally have to wait for all child functions to finish execution, even if we terminate the parent orchestration directly (which is related to #153 but I'm including activities as well here).

It would be really useful and save a lot of time for us to be able to gracefully cancel OR abort child functions based on a parent instance being terminated. I could envisage this working (hypothetically) in a couple of ways:

Inject CancellationToken into activity / sub-orchestrator function signatures, so that function code can check for cancellation on this token whenever possible. This also supports opt-in to the feature when required, similar to ILogger injection.
Abort execution immediately by throwing a specific exception type on the underlying thread running the child code (OperationCanceledException or similar). This might require different high-level operations on the client API (e.g. TerminateAsync with new options to mimic cancel vs. abort semantics).
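To make the first option concrete, here is a minimal sketch of cooperative cancellation. It is plain Python rather than Durable Functions code: `threading.Event` stands in for a .NET `CancellationToken`, and `copy_rows` and `ActivityCancelled` are hypothetical names invented for the illustration.

```python
import threading


class ActivityCancelled(Exception):
    """Hypothetical exception raised when the activity observes cancellation."""


def copy_rows(rows, cancel_token: threading.Event):
    """Hypothetical activity: checks the token between units of work."""
    processed = []
    for row in rows:
        if cancel_token.is_set():
            # Cooperative cancel: stop at a safe point chosen by the activity.
            raise ActivityCancelled(f"stopped after {len(processed)} rows")
        processed.append(row.upper())
    return processed


token = threading.Event()
result = copy_rows(["a", "b"], token)  # token not set: runs to completion

token.set()  # simulates the parent orchestration being terminated
try:
    copy_rows(["c", "d"], token)
    cancelled = False
except ActivityCancelled:
    cancelled = True
```

The design point is that the activity decides where it is safe to stop, which is what makes this variant "graceful"; the abort variant would interrupt the code without that choice.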

These are common approaches that I've seen before for cancelling / aborting long-running jobs. It would be great to understand if this is something on the roadmap for delivery, as it is something missing that would be trivial to do in standard .NET apps.

Additionally, for fan-out scenarios, if the parent instance is terminated and a child activity has not been executed yet, it should just be removed from queues and not executed at all.

Having the ability to abort functions as quickly as possible becomes particularly valuable when the work being performed could be very damaging to a business (e.g., accidentally mailing confidential information to the wrong group of users).

lukasvosyka · 2019-10-21T17:15:01Z

@cgillum Any news on this topic? Is there an ETA? We are also running some ETL on huge datasets and would like to have the option to stop activities that are currently awaited by the orchestration when the orchestration gets terminated.

cgillum · 2019-10-21T19:40:44Z

We've had some internal discussions about this, but there is no ETA yet. Until we have something, the workaround would be to use a background thread to periodically poll the status of the parent orchestration to see if it was terminated.
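A rough sketch of that polling workaround, written as a plain-Python simulation: `get_parent_status` is a hypothetical callable standing in for whatever status query the app uses, the `"Terminated"` string is assumed to match the runtime status name, and the poll interval is arbitrary.

```python
import threading
import time


def watch_parent(get_parent_status, cancel_event, stop_event, poll_seconds=0.01):
    """Poll the parent's status; set cancel_event once it is 'Terminated'."""
    while not stop_event.is_set():
        if get_parent_status() == "Terminated":
            cancel_event.set()
            return
        stop_event.wait(poll_seconds)  # sleep, but wake early when the activity ends


def run_activity(items, get_parent_status):
    """Hypothetical activity body that stops early when the watcher signals."""
    cancel_event = threading.Event()
    stop_event = threading.Event()
    watcher = threading.Thread(
        target=watch_parent,
        args=(get_parent_status, cancel_event, stop_event),
        daemon=True,
    )
    watcher.start()
    done = []
    try:
        for item in items:
            if cancel_event.is_set():
                break
            time.sleep(0.02)  # simulated unit of work
            done.append(item)
    finally:
        stop_event.set()  # always stop the watcher when the activity exits
        watcher.join()
    return done


finished = run_activity([1, 2, 3], lambda: "Running")
stopped = run_activity(list(range(50)), lambda: "Terminated")
```

Every poll is a status query against the backing store, so the interval trades cancellation latency against query cost.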

olitomlinson · 2020-01-27T13:34:51Z

@cgillum +1 for this please

vany0114 · 2020-04-14T02:08:50Z

+1

pkarda · 2020-04-28T14:42:52Z

It'd be great +1

tylermurry · 2021-02-19T19:41:41Z

Any progress on this?

emilbm · 2021-06-09T14:00:38Z

For anyone waiting for something like this, you can achieve it (at least for sub-orchestrators) by prefixing the instance IDs of all your sub-orchestrations with the instance ID of the parent, then using ListInstances with a prefix to find all child orchestrators to terminate.

It doesn't do anything about in-flight activities, but that might not be required for your use case.
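The naming scheme can be sketched as follows. This is an in-memory simulation, not the real client API: the `instances` dict stands in for the result of a prefix query, and `child_instance_id`, `terminate_children`, and the `:` delimiter are assumptions made for the example.

```python
def child_instance_id(parent_id: str, name: str) -> str:
    """Build a child ID that starts with the parent's ID plus a delimiter."""
    return f"{parent_id}:{name}"


def terminate_children(instances: dict, parent_id: str) -> list:
    """Mark every still-active instance under parent_id as terminated.

    `instances` is a hypothetical map of instance ID -> runtime status.
    """
    prefix = f"{parent_id}:"  # the delimiter keeps 'job-1' from matching 'job-10'
    terminated = []
    for instance_id, status in instances.items():
        if instance_id.startswith(prefix) and status in ("Pending", "Running"):
            instances[instance_id] = "Terminated"
            terminated.append(instance_id)
    return sorted(terminated)


instances = {
    child_instance_id("job-1", "a"): "Running",
    child_instance_id("job-1", "b"): "Completed",
    child_instance_id("job-10", "a"): "Running",
}
killed = terminate_children(instances, "job-1")
```

Including a delimiter in the prefix matters when parent IDs can be prefixes of one another; without it, terminating `job-1` would also hit the children of `job-10`.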

paramoh · 2022-03-11T12:53:05Z

Hi,
Any update on this one?
We are migrating our workflows to Azure and this is a deal breaker for us.
Whenever the operations team terminates the orchestration, we want the activity function to terminate as well.

cgillum · 2022-03-11T17:40:47Z

No updates at this point. Adding @AnatoliB for visibility on this ask.

olitomlinson · 2022-03-11T22:33:05Z

@paramoh

I've considered building this myself in the past, I think it would be relatively simple to do if you wanted to give it a go.

It involves the following :

Each Activity function checking for the presence of a terminated blob in a container. If the blob is present, skip any logic in the activity function.
The terminated blob is placed in the container by a new Azure Function. This Azure Function is wired up to the lifecycle events. Tutorial for how to wire-up the Lifecycle events is here.

Here's a quick sequence diagram that expresses the flow.

Notes

1. This is pseudo-termination. Semantically, it is the same as activities being terminated before they execute. In reality the Activity still runs, but the business logic inside it does not, because user code checks for the terminated blob and skips the work.
2. Any in-flight Activities wouldn't be terminated. Only upcoming Activities that have yet to be executed will be terminated.
3. There may be a lag between the Terminate command being processed and the terminated blob being created, because lifecycle events are fired asynchronously. Given that the Event Grid backplane is designed for low latency, this should be a very small window of time, easily <100ms under normal conditions.
4. This implementation is tuned for the scenario where, most of the time, your orchestrations are allowed to run to completion without any intervention through Termination.
5. You are beholden to the throughput of the Storage Account for checking this blob. This may or may not matter depending on the scale of your operation. It is unlikely to matter if your app architecture hasn't already hit choke points and bottlenecks.
6. You may need to run a housekeeping process that cleans up terminated blobs when they are no longer needed.
7. Be careful when re-using Orchestration Ids. You wouldn't want to accidentally cripple a new orchestration because it re-used an Orchestration Id from a previous orchestration that was terminated.
8. If your use case is high-throughput, you may wish to opt in only to the terminated lifecycle event. This is a host.json configuration change.
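The flow described above can be sketched as an in-memory simulation. Nothing here calls Azure: `MarkerStore` is a hypothetical stand-in for the blob container, the event dictionary shape (`instanceId`, `runtimeStatus`) is an assumption about the lifecycle payload, and `send_email_activity` is an invented activity.

```python
class MarkerStore:
    """Hypothetical stand-in for a blob container of 'terminated' markers."""

    def __init__(self):
        self._markers = set()

    def mark_terminated(self, instance_id):
        self._markers.add(instance_id)

    def is_terminated(self, instance_id):
        return instance_id in self._markers

    def clear(self, instance_id):
        # Housekeeping: also needed before an instance ID is re-used.
        self._markers.discard(instance_id)


def on_lifecycle_event(store, event):
    """Hypothetical handler wired to lifecycle notifications."""
    if event["runtimeStatus"] == "Terminated":
        store.mark_terminated(event["instanceId"])


def send_email_activity(store, instance_id, recipient, outbox):
    """Hypothetical activity: skips its business logic if a marker exists."""
    if store.is_terminated(instance_id):
        return "skipped"
    outbox.append(recipient)
    return "sent"


store = MarkerStore()
outbox = []
first = send_email_activity(store, "job-1", "a@example.com", outbox)
on_lifecycle_event(store, {"instanceId": "job-1", "runtimeStatus": "Terminated"})
second = send_email_activity(store, "job-1", "b@example.com", outbox)
```

The check happens only at the start of the activity, which is why an activity that is already running is unaffected by the marker.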

@cgillum any ideas why this might not work?

cgillum · 2022-03-16T22:46:13Z

Thanks @olitomlinson for this detailed suggestion. There is definitely some complexity involved that makes me worry about potential corner cases where things could go wrong (notes 6 and 7 would have me especially worried). But I think the most important thing you point out is this:

> Any in-flight Activities wouldn't be terminated. Only upcoming Activities that have yet to be executed will be terminated.

Based on my reading of this issue and the feedback we receive internally, most users need to be able to terminate activity executions that have already started. For example, there's a runaway activity execution that's consuming important resources and needs to be explicitly killed. The existing termination logic will already ensure that no new activities will be scheduled (assuming there's not already a message in the work-item queue) so it seems the window for which this approach applies might be too small.

All that said, this issue is coming up frequently enough that we've decided to work on an official proposal for how to solve it. I'll create a separate issue tracking the proposal and post a link to it here for folks that are interested in weighing in.

olitomlinson · 2022-03-25T14:03:12Z

@cgillum

> For example, there's a runaway activity execution that's consuming important resources and needs to be explicitly killed.

A simple (but disruptive) answer to this is to let the operator recycle the app? That would kill the activity, right?

If so, then the next challenge is how to alert the operator that there is a runaway Activity? This sounds like it could be highly subjective?

cgillum · 2022-03-25T16:44:44Z

> A simple (but disruptive) answer to this is to let the operator recycle the app? That would kill the activity, right?

It will stop the current activity execution, but the activity will start running again as soon as the app starts back up. I'm thinking we need something to handle this case as well.

olitomlinson · 2022-03-25T18:28:47Z

@cgillum

Ah okay, sorry I thought a runaway process (in this instance) might be when the runtime gets stuck due to some rare race condition, thus just recycle to let the app try the Activity again.

But I see that’s not the case here.

—

Thought : I wonder if the cancellationToken that is passed from the host to every Function invocation, could be leveraged and extended to support targeted Activity termination? Somehow DF would hook into the pipeline of that CancellationToken and wrap/combine it with activity-aware cancellation behaviours.

This would require developers to actively decorate their Activity code to react to the cancellationToken being invoked, but IMO that’s a best practice anyway - we provide many reasons to encourage developers to utilise the provided cancellationToken, as they all have the same outcome which is the opportunity for graceful termination regardless of the cause of the termination.

cgillum · 2022-04-07T19:48:50Z

> Thought : I wonder if the cancellationToken that is passed from the host to every Function invocation, could be leveraged and extended to support targeted Activity termination?

Yes, ideally this is how we'd surface it so that developers don't have to learn any new concepts. There is a technical problem, which is that I don't think we, as an extension, have access to this cancellation token and thus can't change it or wire it up to our own. A bigger problem, however, is that cancellation tokens are unique to .NET. There isn't any equivalent in Python, JavaScript, PowerShell, or even Java. We'd likely need to invent some new abstraction for those.

davidmrdavid · 2022-04-07T21:15:51Z

I remember seeing an exception-based pattern for simulating cancellation tokens in JS and Python that we could look into. Just my 2 cents.

AnatoliB · 2022-04-11T22:20:09Z

For PowerShell, exposing the .NET cancellation token object as is should be fine: it's not the most idiomatic, but there is nothing idiomatic, so we can at least avoid inventing an entirely new thing.

a2741890 · 2022-05-28T17:39:41Z

It's tagged as high priority and has been four years....
And there is still no way to stop an activity function.
The most important point is, it charges A LOT.

cgillum · 2022-07-15T16:43:10Z

Linking to existing DTFx issue here: Azure/durabletask#446

brandonh-msft · 2022-09-17T17:26:50Z

Wondering if it would be acceptable to add CancellationToken as a parameter on an [ActivityTrigger] function and simply inject a cancellation token? Then activities could use existing patterns such as ThrowIfCancellationRequested() or IsCancellationRequested, etc.

gorillapower · 2022-09-18T15:01:46Z

@brandonh-msft @cgillum, am I right in saying that this is being addressed in this PR? Azure/durabletask#787

davidmrdavid · 2022-09-22T17:26:12Z

@gorillapower: almost! That PR applies specifically to cascading management operations (terminate, suspend, resume) to sub-orchestrators. It doesn't apply to Activities, which execute quite differently.

davidmrdavid · 2022-09-22T17:28:20Z

@brandonh-msft: That's one of the options we're exploring. One blocker is that we will probably need support from the Azure Functions Host in order to manage that cancellation token.

yoozek · 2023-02-10T17:45:14Z

Any update on this topic? Are you considering adding CancellationToken support for activities, @davidmrdavid?

davidmrdavid · 2023-02-10T19:21:25Z

@yoozek: We have discussed this a few times internally since the last update here, but it is not currently on our immediate agenda. As always, activity in this issue, like your comment (thanks!), is a helpful signal to give it more priority.

cgillum · 2023-07-10T19:48:32Z

A small update on this: I recently merged a protobuf update for supporting cascade terminate for orchestrations. This will make it possible for gRPC-based out-of-proc SDKs (including .NET Isolated) to implement the cascading termination of orchestrations if/when it's supported by the Durable Task Framework for .NET.

verydarkmagic · 2023-09-21T10:34:45Z

We are running a big migration through a hierarchy of orchestrations, and the whole thing can take days. Sometimes we need to pause, for example for a system upgrade. So we keep track of all the instance IDs in a SQL table. When we need to pause, we run a routine to kill all sub-orchestrations, then another routine to restart the thing.

So yes.. we would be very interested in this functionality :)

forteddyt · 2023-12-08T22:08:48Z

+1 on the need for this feature :) any updates?

TordJoranger · 2023-12-20T09:52:59Z

+1 here also. Anyone got a working workaround for terminating activities?

chazmuzz · 2024-01-08T01:24:29Z

> +1 here also. Anyone got a working workaround for terminating activities?

Workaround.. In your activity function you can call client.getStatus(ParentInstanceId) and do an early return if the status is terminated

chassq · 2024-05-10T09:15:10Z

Just wanted to check in on the status of this and see if there is an ETA. Looking forward to hopefully getting this feature. Some of our activities are very long running. Thank you!

davidmrdavid · 2024-05-10T22:38:36Z

Unfortunately, this hasn't been prioritized yet, but it is an item of constant discussion, and I was just thinking about it this morning.

I think the main challenge is minimizing cost: a long-running activity will need to query some external store to determine if its parent orchestrator is terminated, and if we do that too often, that'll result in more storage costs for the end user.

Here's my current thinking (for the Azure Storage backend):

we probably want this to be an opt-in feature, to avoid sudden cost increases just from updating to the latest package
we could re-use the Instances table as the external store to determine if the calling orchestrator is terminated
the logic to check if the parent is terminated should trigger only after the activity has run for long enough (say over ~30 minutes, or some user-configurable threshold), after which it could check the Instances table every ~X seconds (possibly also configurable) to see if the parent is terminated.
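As a rough illustration of how such a policy limits the number of queries, here is a small sketch. All names and defaults are hypothetical: the 30-minute threshold comes from the text above, while the 30-second interval is just a placeholder for the unspecified "X".

```python
def should_check_parent(elapsed_s, last_check_s, threshold_s=1800.0,
                        interval_s=30.0, enabled=True):
    """Decide whether the activity should query the parent's status now.

    elapsed_s: seconds since the activity started.
    last_check_s: elapsed time of the previous check, or None if none yet.
    """
    if not enabled or elapsed_s < threshold_s:
        return False  # opt-in is off, or the activity is not long-running yet
    if last_check_s is None:
        return True  # first check as soon as the threshold is reached
    return elapsed_s - last_check_s >= interval_s


def count_checks(duration_s, step_s=1.0, **policy):
    """Simulate an activity of the given duration and count status queries."""
    checks, last, t = 0, None, 0.0
    while t <= duration_s:
        if should_check_parent(t, last, **policy):
            checks += 1
            last = t
        t += step_s
    return checks


short_run = count_checks(20 * 60)                # ends before the threshold
long_run = count_checks(31 * 60)                 # crosses the threshold
opted_out = count_checks(60 * 60, enabled=False)
```

Under these assumed defaults, an activity that finishes within the threshold issues no status queries at all, so only genuinely long-running activities pay the extra storage cost.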

Not sure how applicable this design is to the other storage providers, though; at least for Azure Storage, this seems reasonable to me.

chazmuzz · 2024-05-11T01:00:41Z

@davidmrdavid

Would there be any additional cost to make the orchestrator stop picking up new activities once the parent has been terminated?

My interest in this ticket is about canceling an in-progress orchestration that is using a fan-out approach which is comprised of many small tasks

davidmrdavid · 2024-05-11T01:08:38Z

@chazmuzz: hmm, can you clarify what this scenario looks like?

Sounds to me like there are two orchestrators, A and B: A calls B as a sub-orchestrator, then B fans out, and A gets terminated. Did I get that right? And then you want "B" to no longer perform any replays?

chassq · 2024-05-11T10:29:51Z

Something to consider, if not already mentioned: recording the activities related to an orchestration (or the orchestration instances) in storage would certainly add to storage costs, but as mentioned earlier this could be opt-in. For example, our volume is somewhat low and the priority is on audit/control, so it is of high value to us to see the activities and their run state and to be able to control them as well as possible.

As a side note, Azure Storage retention policies (lifecycle management) can help with storage costs as well. Of course, lifecycle management really only helps with Azure Storage implementations. Other storage mechanisms would need to implement something similar where possible. For example, if SQL Server is used, perhaps system-versioned tables could be utilized to track the history of orchestrations/activities and keep that history for a limited time period.

aldrichdev · 2025-01-02T15:45:24Z

> Workaround.. In your activity function you can call client.getStatus(ParentInstanceId) and do an early return if the status is terminated

@chazmuzz Activity functions cannot access the client, can they? Only the activity context. Can you provide a larger code sample?

davidmrdavid · 2025-01-06T17:56:49Z

I believe they can access the client if you declare the client as a binding.

You can definitely have an activity function trigger with arbitrary bindings, and one such binding is the durable client.

sand06-web · 2025-02-16T23:10:06Z

Does anyone have a proper example of using a cancellation token with Durable Functions? Suppose the user cancels the function call from the browser; the entire Durable Functions workflow should then stop. Please share a code snippet.

davidmrdavid · 2025-02-18T17:10:46Z

@sand06-web I would recommend looking into external events as a way to implement cancellation. Please see: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-external-events?tabs=csharp
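The general shape of event-based cancellation can be sketched with plain `asyncio`. This is an analogy, not Durable Functions code: `asyncio.Event` stands in for an external "cancel" event, and `orchestrate` and the step functions are hypothetical. The orchestration races each step against the cancel signal and stops scheduling further steps once the signal wins.

```python
import asyncio


async def orchestrate(steps, cancel_event: asyncio.Event):
    """Run steps in order, racing each one against a cancel signal."""
    completed = []
    for step in steps:
        if cancel_event.is_set():
            return completed, "cancelled"
        work = asyncio.create_task(step())
        cancel = asyncio.create_task(cancel_event.wait())
        done, pending = await asyncio.wait(
            {work, cancel}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            # asyncio cleanup only. Assumption: in Durable Functions an
            # activity that has already started would keep running.
            task.cancel()
        if work not in done:
            return completed, "cancelled"
        completed.append(work.result())
    return completed, "completed"


async def main():
    cancel_event = asyncio.Event()

    async def extract():
        await asyncio.sleep(0.01)
        return "extract"

    async def transform():
        cancel_event.set()  # simulates the user cancelling during this step
        await asyncio.sleep(1)
        return "transform"

    async def load():
        return "load"

    return await orchestrate([extract, transform, load], cancel_event)


outcome = asyncio.run(main())
```

Note that this pattern stops the workflow from progressing; it does not by itself interrupt work that has already been handed to an activity, which is the gap this issue is about.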

cgillum added the Enhancement Feature requests. label Nov 24, 2018

cgillum mentioned this issue May 30, 2019

Durable functions RuntimeStatus is never in status ContinuedAsNew #777

Closed

ConnorMcMahon added this to the vNext milestone Mar 15, 2021

ConnorMcMahon added the needs-discussion label Mar 15, 2021

ConnorMcMahon mentioned this issue Mar 15, 2021

a termination call to an orchestrator doesn't terminate any sub-orchestrations it started #153

Closed

cgillum self-assigned this Mar 16, 2022

davidmrdavid mentioned this issue Apr 22, 2022

No documented way to cancel a sub-orchestrator task? Azure/azure-functions-durable-js#348

Closed

cgillum modified the milestone: High Priority Jul 15, 2022

cgillum mentioned this issue Feb 13, 2023

Forceful Termination of Running Orchestrations microsoft/durabletask-java#111

Open

nilsmehlhorn mentioned this issue Mar 8, 2023

Official support for rewinding failed orchestrations Azure/durabletask#731

Open

cgillum removed their assignment Jan 10, 2024

AnatoliB mentioned this issue Dec 25, 2024

Activity function keeps executing even after orchestrator has been terminated #2996

Open
