[Feature] Ability to cancel in-flight activities / sub-orchestrations when terminating parent #506

Open
sam-piper opened this issue Nov 9, 2018 · 39 comments

@sam-piper

We've built a few ETL processes using Durable Functions that fan-out to thousands of activities. If we find a problem anywhere, we still generally have to wait for all child functions to finish execution, even if we terminate the parent orchestration directly (which is related to #153 but I'm including activities as well here).

It would be really useful and save a lot of time for us to be able to gracefully cancel OR abort child functions based on a parent instance being terminated. I could envisage this working (hypothetically) in a couple of ways:

  1. Inject CancellationToken into activity / sub-orchestrator function signatures, so that function code can check for cancellation on this token whenever possible. This also supports opt-in to the feature when required, similar to ILogger injection.
  2. Abort execution immediately by throwing a specific exception type on the underlying thread running the child code (OperationCanceledException or similar). This might require different high-level operations on the client API (e.g. TerminateAsync with new options to mimic cancel vs. abort semantics).

These are common approaches that I've seen before for cancelling / aborting long-running jobs. It would be great to understand if this is something on the roadmap for delivery, as it is something missing that would be trivial to do in standard .NET apps.
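
A minimal sketch of the cooperative style in option 1, written in Python with a `threading.Event` standing in for an injected cancellation token. The names `process_items` and `ActivityCancelled` are hypothetical and are not part of any Durable Functions API; the sketch only illustrates an activity checking for cancellation between units of work.

```python
import threading


class ActivityCancelled(Exception):
    """Raised when the activity notices a cancellation request."""


def process_items(items, cancel_event):
    """Double each item, checking the (hypothetical) injected token between items."""
    results = []
    for item in items:
        # Cooperative check: stop at the next safe point once cancellation is requested.
        if cancel_event.is_set():
            raise ActivityCancelled(
                f"cancelled after {len(results)} of {len(items)} items"
            )
        results.append(item * 2)  # stand-in for the real unit of work
    return results
```

The activity decides where the safe points are, which is what makes this graceful; the abort style in option 2 would interrupt the work without asking.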

Additionally, for fan-out scenarios, if the parent instance is terminated and a child activity has not been executed yet, it should just be removed from queues and not executed at all.

Having the ability to abort functions as quickly as possible becomes particularly valuable when the work being performed could be very damaging to a business (eg, accidentally mailing confidential information to the wrong group of users).

@lukasvosyka

@cgillum Any news on this topic? Is there an ETA? We are also running some ETL on huge datasets and would like to have the option to stop activities that are currently awaited by the orchestration when the orchestration gets terminated.

@cgillum
Member

cgillum commented Oct 21, 2019

We've had some internal discussions about this, but there is no ETA yet. Until we have something, the workaround would be to use a background thread to periodically poll the status of the parent orchestration to see if it was terminated.
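
A rough sketch of that polling workaround, in Python. `get_parent_status` is a hypothetical callable that returns the parent orchestration's status as a string; how it is obtained (for example through a client binding) is outside this sketch, and `run_activity` is an illustrative name.

```python
import threading
import time


def run_activity(get_parent_status, max_steps, poll_interval_s=0.01):
    """Run up to max_steps units of work, stopping early if the parent is terminated.

    get_parent_status is a hypothetical callable returning a status string.
    """
    cancelled = threading.Event()
    stop_watcher = threading.Event()

    def watch():
        # Background poll loop: flag cancellation once the parent reports "Terminated".
        while not stop_watcher.is_set():
            if get_parent_status() == "Terminated":
                cancelled.set()
                return
            stop_watcher.wait(poll_interval_s)

    watcher = threading.Thread(target=watch, daemon=True)
    watcher.start()
    steps = 0
    try:
        while steps < max_steps and not cancelled.is_set():
            time.sleep(0.001)  # stand-in for one unit of real work
            steps += 1
    finally:
        stop_watcher.set()  # always stop polling when the activity exits
        watcher.join()
    return steps, cancelled.is_set()
```

The poll interval trades reaction time against the number of status reads, so a real activity would likely use seconds rather than milliseconds.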

@olitomlinson
Contributor

@cgillum +1 for this please

@vany0114

+1

@pkarda

pkarda commented Apr 28, 2020

It'd be great +1

@tylermurry

Any progress on this?

@emilbm

emilbm commented Jun 9, 2021

For anyone waiting for something like this: you can achieve it (at least for sub-orchestrators) by prefixing all your sub-orchestration instance IDs with the instance ID of the parent, then using ListInstances with a prefix to find all child orchestrators to terminate.

It doesn't do anything about in-flight activities, but that might not be required for your use case.
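
A small sketch of the prefix idea in Python, using an in-memory stand-in for the instance store. `FakeInstanceStore`, `child_id`, and `terminate_tree` are hypothetical names for illustration; a real implementation would call the client's list and terminate operations instead.

```python
class FakeInstanceStore:
    """In-memory stand-in for the orchestration instance store (hypothetical)."""

    def __init__(self):
        self.status = {}

    def start(self, instance_id):
        self.status[instance_id] = "Running"

    def list_by_prefix(self, prefix):
        return sorted(i for i in self.status if i.startswith(prefix))

    def terminate(self, instance_id):
        if self.status[instance_id] == "Running":
            self.status[instance_id] = "Terminated"


def child_id(parent_id, name):
    """Build a child instance ID that carries the parent's ID as a prefix."""
    return f"{parent_id}:{name}"


def terminate_tree(store, parent_id):
    """Terminate the parent and every instance whose ID starts with '<parent>:'."""
    targets = [parent_id] + store.list_by_prefix(parent_id + ":")
    for instance_id in targets:
        store.terminate(instance_id)
    return targets
```

Including a separator in the prefix is a design choice in this sketch: it keeps a parent named `job-1` from also matching the children of `job-10`.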

@paramoh

paramoh commented Mar 11, 2022

Hi,
Any update on this one?
We are migrating our workflows to Azure and this is a deal breaker for us.
Whenever the operations team terminates the orchestration, we want the activity function to terminate as well.

@cgillum
Member

cgillum commented Mar 11, 2022

No updates at this point. Adding @AnatoliB for visibility on this ask.

@olitomlinson
Contributor

olitomlinson commented Mar 11, 2022

@paramoh

I've considered building this myself in the past; I think it would be relatively simple to do if you wanted to give it a go.

It involves the following:

  • Each Activity function checking for the presence of a terminated blob in a container. If the blob is present, skip any logic in the activity function.
  • The terminated blob is placed in the container by a new Azure Function. This Azure Function is wired up to the lifecycle events. Tutorial for how to wire-up the Lifecycle events is here.

Here's a quick sequence diagram that expresses the flow.

[image: sequence diagram of the flow]

Notes

  1. This is pseudo-termination. Semantically this is the same as activities being terminated prior to being executed. However the reality is the Activity runs, but the business logic inside the Activity does not, as it is diverted by user-code checking for the presence of the terminated blob.
  2. Any in-flight Activities wouldn't be terminated. Only upcoming Activities that have yet to be executed will be terminated.
  3. There may be a lag in between the Terminate command being processed and the terminated blob being created. This is due to the async nature of how lifecycle events are fired. Given that the EventGrid back plane is designed for low-latency, this should be a very small window of time. Easily <100ms under normal conditions.
  4. This implementation is tuned to the scenario that most of the time your orchestrations are allowed to run to completion without any intervention through Termination.
  5. You are beholden to the throughput performance of the Storage Account for checking this blob. This may or may not matter depending on the scale of your operation. Unlikely to matter if your App architecture hasn't already hit choke points and bottlenecks.
  6. You may need to run a house-keeping process that cleans up any terminated blobs when they are no longer needed.
  7. Be careful when re-using Orchestration Ids. You wouldn't want to accidentally cripple a new orchestration because it re-used an Orchestration Id from a previous Orchestration which was terminated.
  8. If your use-case is high-throughput, you may wish to opt-in only to the terminated lifecycle event. This is a host.json configuration change.
    [image: host.json configuration]
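
The marker approach described above can be sketched in Python as follows, with an in-memory set standing in for the blob container. `MarkerStore`, `on_lifecycle_event`, and `send_email_activity` are hypothetical names, and the event field names `instanceId` and `runtimeStatus` are assumptions about the lifecycle event payload rather than something confirmed in this thread.

```python
class MarkerStore:
    """In-memory stand-in for the container holding 'terminated' markers."""

    def __init__(self):
        self._markers = set()

    def put(self, instance_id):
        self._markers.add(instance_id)

    def exists(self, instance_id):
        return instance_id in self._markers

    def delete(self, instance_id):
        # House-keeping (note 6): remove markers that are no longer needed.
        self._markers.discard(instance_id)


def on_lifecycle_event(store, event):
    """Lifecycle handler: only a Terminated event creates a marker."""
    if event["runtimeStatus"] == "Terminated":
        store.put(event["instanceId"])


def send_email_activity(store, instance_id, recipient, outbox):
    """Activity body: skip the business logic if the orchestration was terminated."""
    if store.exists(instance_id):
        return "skipped"
    outbox.append(recipient)  # stand-in for the real side effect
    return "sent"
```

As note 2 says, this only diverts activities that have not yet started their business logic; an activity already past the check runs to completion.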

@cgillum any ideas why this might not work?

@cgillum cgillum self-assigned this Mar 16, 2022
@cgillum
Member

cgillum commented Mar 16, 2022

Thanks @olitomlinson for this detailed suggestion. There is definitely some complexity involved that makes me worry about potential corner cases where things could go wrong (notes 6 and 7 would have me especially worried). But I think the most important thing you point out is this:

> Any in-flight Activities wouldn't be terminated. Only upcoming Activities that have yet to be executed will be terminated.

Based on my reading of this issue and the feedback we receive internally, most users need to be able to terminate activity executions that have already started. For example, there's a runaway activity execution that's consuming important resources and needs to be explicitly killed. The existing termination logic will already ensure that no new activities will be scheduled (assuming there's not already a message in the work-item queue) so it seems the window for which this approach applies might be too small.

All that said, this issue is coming up frequently enough that we've decided to work on an official proposal for how to solve it. I'll create a separate issue tracking the proposal and post a link to it here for folks that are interested in weighing in.

@olitomlinson
Contributor

olitomlinson commented Mar 25, 2022

@cgillum

> For example, there's a runaway activity execution that's consuming important resources and needs to be explicitly killed.

A simple (but disruptive) answer to this is to let the operator recycle the app? That would kill the activity, right?

If so, then the next challenge is how to alert the operator that there is a runaway Activity? This sounds like it could be highly subjective?

@cgillum
Member

cgillum commented Mar 25, 2022

> A simple (but disruptive) answer to this is to let the operator recycle the app? That would kill the activity, right?

It will stop the current activity execution, but the activity will start running again as soon as the app starts back up. I'm thinking we need something to handle this case as well.

@olitomlinson
Contributor

olitomlinson commented Mar 25, 2022

@cgillum

Ah okay, sorry I thought a runaway process (in this instance) might be when the runtime gets stuck due to some rare race condition, thus just recycle to let the app try the Activity again.

But I see that’s not the case here.

Thought : I wonder if the cancellationToken that is passed from the host to every Function invocation, could be leveraged and extended to support targeted Activity termination? Somehow DF would hook into the pipeline of that CancellationToken and wrap/combine it with activity-aware cancellation behaviours.

This would require developers to actively decorate their Activity code to react to the cancellationToken being invoked, but IMO that’s a best practice anyway - we provide many reasons to encourage developers to utilise the provided cancellationToken, as they all have the same outcome which is the opportunity for graceful termination regardless of the cause of the termination.
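
To illustrate the wrap/combine idea, here is a minimal Python sketch of a linked cancellation signal that fires when either the host or an activity-aware source requests cancellation. `LinkedCancellation` and the source names are hypothetical; whether the extension can actually hook into the host's token is not settled here.

```python
import threading


class LinkedCancellation:
    """Cancelled as soon as any one of its named source events is set."""

    def __init__(self, **sources):
        self._sources = sources  # name -> threading.Event

    def cancelled_by(self):
        """Return the name of the first source that requested cancellation, or None."""
        for name, event in self._sources.items():
            if event.is_set():
                return name
        return None

    def is_cancelled(self):
        return self.cancelled_by() is not None
```

Activity code would check one object and react the same way regardless of which source triggered the cancellation.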

@cgillum
Member

cgillum commented Apr 7, 2022

> Thought : I wonder if the cancellationToken that is passed from the host to every Function invocation, could be leveraged and extended to support targeted Activity termination?

Yes, ideally this is how we'd surface it so that developers don't have to learn any new concepts. There is a technical problem, which is that I don't think we, as an extension, have access to this cancellation token and thus can't change it or wire it up to our own. A bigger problem, however, is that cancellation tokens are unique to .NET. There isn't any equivalent in Python, JavaScript, PowerShell, or even Java. We'd likely need to invent some new abstraction for those.

@davidmrdavid
Contributor

davidmrdavid commented Apr 7, 2022

I remember seeing an exception-based pattern for simulating cancellation tokens in JS and Python that we could look into. Just my 2 cents.
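
One exception-based pattern that already exists in Python is `asyncio` task cancellation, where `task.cancel()` raises `asyncio.CancelledError` inside the running coroutine at its next `await`. The sketch below shows only that mechanism; `long_activity` and `main` are hypothetical names, and this is not necessarily the pattern referred to above.

```python
import asyncio


async def long_activity(log):
    """Simulated long-running work that cleans up when cancelled."""
    try:
        for step in range(1000):
            log.append(f"step {step}")
            await asyncio.sleep(0.01)  # cancellation is delivered at an await point
    except asyncio.CancelledError:
        log.append("cleanup")  # release resources, then let the cancellation propagate
        raise


async def main():
    log = []
    task = asyncio.create_task(long_activity(log))
    await asyncio.sleep(0.03)  # let the activity make some progress
    task.cancel()  # request cancellation; raises CancelledError inside the task
    try:
        await task
    except asyncio.CancelledError:
        log.append("cancelled")
    return log
```

Running `asyncio.run(main())` returns the log of steps followed by the cleanup and cancellation entries.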

@AnatoliB
Collaborator

For PowerShell, exposing the .NET cancellation token object as is should be fine: it's not the most idiomatic, but there is nothing idiomatic, so we can at least avoid inventing an entirely new thing.

@a2741890

It's tagged as high priority and has been four years....
And there is still no way to stop an activity function.
The most important point is, it charges A LOT.

@cgillum cgillum modified the milestone: High Priority Jul 15, 2022
@cgillum
Member

cgillum commented Jul 15, 2022

Linking to existing DTFx issue here: Azure/durabletask#446

@brandonh-msft
Member

Wondering if it would be acceptable to add CancellationToken as a parameter on an [ActivityTrigger] function and simply inject a cancellation token? Then activities could use existing patterns such as ThrowIfCancellationRequested() or IsCancellationRequested.

@gorillapower

@brandonh-msft @cgillum, am I right in saying that this is being addressed in this PR? Azure/durabletask#787

@davidmrdavid
Contributor

@gorillapower: almost! That PR applies specifically to cascading management operations (terminate, suspend, resume) to sub-orchestrators. It doesn't apply to Activities, which execute quite differently.

@davidmrdavid
Contributor

@brandonh-msft: That's one of the options we're exploring. One blocker is that we will probably need support from the Azure Functions Host in order to manage that cancellation token.

@yoozek

yoozek commented Feb 10, 2023

Any update on this topic? Are you considering adding support for CancellationToken for activities, @davidmrdavid?

@davidmrdavid
Contributor

@yoozek: We have discussed this a few times internally since the last update here, but it is not currently on our immediate agenda. As always, activity in this issue, like your comment (thanks!), is a helpful signal to give it more priority.

@cgillum
Member

cgillum commented Jul 10, 2023

A small update on this: I recently merged a protobuf update for supporting cascade terminate for orchestrations. This will make it possible for gRPC-based out-of-proc SDKs (including .NET Isolated) to implement the cascading termination of orchestrations if/when it's supported by the Durable Task Framework for .NET.

@verydarkmagic

We are running a big migration through a hierarchy of orchestrations; the whole thing can take days. Sometimes we need to pause, for example for a system upgrade. So we keep track of all the instance IDs in a SQL table. When we need to pause, we run a routine to kill all sub-orchestrations, then another routine to restart the thing.

So yes.. we would be very interested in this functionality :)

@forteddyt

forteddyt commented Dec 8, 2023

+1 on the need for this feature :) any updates?

@TordJoranger

+1 here also. Anyone got a working workaround for terminating activities?

@chazmuzz

chazmuzz commented Jan 8, 2024

> +1 here also. Anyone got a working workaround for terminating activities?

Workaround.. In your activity function you can call client.getStatus(ParentInstanceId) and do an early return if the status is terminated
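
A sketch of what that early-return check could look like, in Python. `FakeClient` is a stand-in for a durable client and its `get_status` here returns a plain string; the real client's method name, return type, and status values are assumptions in this sketch, as is the `copy_rows_activity` function itself.

```python
class FakeClient:
    """Stand-in client that reports 'Terminated' after a number of status calls."""

    def __init__(self, terminate_after_calls):
        self._remaining = terminate_after_calls

    def get_status(self, instance_id):
        if self._remaining <= 0:
            return "Terminated"
        self._remaining -= 1
        return "Running"


def copy_rows_activity(client, parent_instance_id, rows, batch_size=2):
    """Copy rows in batches, returning early if the parent has been terminated."""
    copied = []
    for start in range(0, len(rows), batch_size):
        # Check the parent before each batch so a terminate takes effect quickly.
        if client.get_status(parent_instance_id) == "Terminated":
            return {"completed": False, "copied": copied}
        copied.extend(rows[start:start + batch_size])  # stand-in for real work
    return {"completed": True, "copied": copied}
```

The parent's instance ID has to be passed to the activity as part of its input, since the check is made against the parent rather than the activity.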

@cgillum cgillum removed their assignment Jan 10, 2024
@chassq

chassq commented May 10, 2024

Just wanted to check in on a status of this and see if there is an ETA. Looking forward to hopefully getting this feature. Some of our activities are very long running. Thank you!

@davidmrdavid
Contributor

Unfortunately, this hasn't been prioritized yet, but it is an item of constant discussion, and I was just thinking about it this morning.

I think the main challenge is minimizing cost: a long-running activity will need to query some external store to determine if its parent orchestrator is terminated, and if we do that too often that'll result in more storage costs for the end user.

Here's my current thinking (for the Azure Storage backend):

  • we probably want this to be an opt-in feature, to avoid sudden cost increases just from updating to the latest package
  • we could re-use the Instances table as the external store to determine if the calling orchestrator is terminated
  • the logic to check if the parent is terminated should trigger only after the activity has run for long enough (say over ~30 minutes, or some user-configurable threshold), after which it could check the instances table every ~X seconds (possibly also configurable) to see if the parent is terminated.

Not sure how applicable this design is to the other storage providers though; at least for the Azure Storage backend this seems reasonable to me.
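
The threshold-then-interval idea above can be sketched as a small policy object in Python. `ParentCheckPolicy`, `count_checks`, and the default values are hypothetical and only illustrate how the number of store reads depends on the two settings; this is not an actual or planned implementation.

```python
class ParentCheckPolicy:
    """Decide when a running activity should read the parent's status."""

    def __init__(self, grace_s=1800, interval_s=30):
        self.grace_s = grace_s        # no checks before the activity has run this long
        self.interval_s = interval_s  # minimum spacing between checks afterwards
        self._last_check_s = None

    def should_check(self, elapsed_s):
        if elapsed_s < self.grace_s:
            return False
        if self._last_check_s is None or elapsed_s - self._last_check_s >= self.interval_s:
            self._last_check_s = elapsed_s
            return True
        return False


def count_checks(policy, total_s, tick_s):
    """Count how many store reads an activity of total_s seconds would make."""
    checks = 0
    elapsed = 0
    while elapsed <= total_s:
        if policy.should_check(elapsed):
            checks += 1
        elapsed += tick_s
    return checks
```

Under these assumed defaults, an activity that finishes within the 30-minute threshold makes no reads at all, which is the cost property the design is after.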

@chazmuzz

@davidmrdavid

Would there be any additional cost to make the orchestrator stop picking up new activities once the parent has been terminated?

My interest in this ticket is about canceling an in-progress orchestration that is using a fan-out approach which is comprised of many small tasks

@davidmrdavid
Contributor

@chazmuzz: hmm, can you clarify what this scenario looks like?

Sounds to me like there are two orchestrators, A and B: A calls B as a sub-orchestrator, then B fans out, and A gets terminated. Did I get that right? And then you want "B" to no longer perform any replays?

@chassq

chassq commented May 11, 2024

Something to consider, if not already mentioned. If you recorded the activities related to an orchestration in storage (or orchestration instances), you would certainly add to storage costs, but as mentioned earlier this could be opt-in. For example, our volume is somewhat low and the priority is on audit/control, so it is of high value to us to see the activities and their run state and to be able to control them as well as possible.

It's also worth mentioning that Azure Storage retention policies (lifecycle management) can help with storage costs. Of course, lifecycle management really only helps with Azure Storage implementations. Other storage mechanisms would need to implement something similar where possible. For example, if SQL Server is used, perhaps system-versioned tables could be utilized to track the history of orchestrations/activities and keep that history for a limited time period.

@aldrichdev

> Workaround.. In your activity function you can call client.getStatus(ParentInstanceId) and do an early return if the status is terminated

@chazmuzz Activity functions cannot access the client, can they? Only the activity context. Can you provide a larger code sample?

@davidmrdavid
Contributor

I believe they can access the client if you declare the client as a binding.

You can definitely have an activity function trigger with arbitrary bindings, and one such binding is the durable client.

@sand06-web

sand06-web commented Feb 16, 2025

Does anyone have a proper example of using a cancellation token with Durable Functions? Suppose the user cancels the function call from the browser; the entire durable function workflow should then stop. Please share a code snippet.

@davidmrdavid
Contributor

@sand06-web I would recommend looking into external events as a way to implement cancellation. Please see: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-external-events?tabs=csharp
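
The general shape of event-based cancellation is a race between the work and a cancellation signal. The sketch below shows that shape with plain `asyncio` in Python, not with the Durable Functions API; `run_with_cancel` is a hypothetical helper, and an `asyncio.Event` stands in for the external event. Note that this only changes what the caller does next; it does not by itself stop an activity that is already executing elsewhere.

```python
import asyncio


async def run_with_cancel(work, cancel_event):
    """Await work, but give up as soon as cancel_event is set."""
    work_task = asyncio.create_task(work)
    cancel_task = asyncio.create_task(cancel_event.wait())
    done, pending = await asyncio.wait(
        {work_task, cancel_task}, return_when=asyncio.FIRST_COMPLETED
    )
    # Stop whichever side lost the race and wait for it to unwind.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    if work_task in done:
        return ("completed", work_task.result())
    return ("cancelled", None)


async def demo(cancel_first):
    cancel_event = asyncio.Event()

    async def work():
        await asyncio.sleep(0.05)  # stand-in for the real workflow step
        return 42

    if cancel_first:
        cancel_event.set()  # simulate the user cancelling from the browser
    return await run_with_cancel(work(), cancel_event)
```

Calling `asyncio.run(demo(True))` exercises the cancelled path and `asyncio.run(demo(False))` the completed path.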
