
Implement retry for user deletion #50

Conversation

@romainDavaze commented May 27, 2021

Implementing a retry for user deletion due to Graph API inconsistency when deleting a user right after creating it.

The API returns a 200 status code on the GET request, indicating that the user has been created. However, if you call the DELETE endpoint right after, it returns a 404 because the user creation has not yet had time to propagate across the API nodes.

After discussing with @manicminer, implementing a retry here seems to be the cleanest way to handle this.

Maybe we should implement a proper retry mechanism in the performRequest function in the future, as it currently only handles retries on rate-limiting-related status codes.
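
To illustrate the idea, here's a rough sketch of the pattern in Go (the interface, attempt count, and delay are placeholders for illustration, not the exact implementation in this PR):

```go
package example

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// userDeleter stands in for the SDK's users client; the real client's
// signature may differ, this only illustrates the retry pattern.
type userDeleter interface {
	Delete(ctx context.Context, id string) (status int, err error)
}

// deleteWithRetry retries the DELETE while the API returns 404, assuming
// the newly created user simply has not replicated yet to the node
// handling the request.
func deleteWithRetry(ctx context.Context, c userDeleter, id string) error {
	const attempts = 10
	for i := 0; i < attempts; i++ {
		status, err := c.Delete(ctx, id)
		if err == nil {
			return nil
		}
		if status != http.StatusNotFound {
			return err // a non-404 failure is not replication lag
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(2 * time.Second): // give the write time to propagate
		}
	}
	return fmt.Errorf("user %q still returning 404 after %d attempts", id, attempts)
}
```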

@romainDavaze romainDavaze force-pushed the feature/add-retry-on-user-delete branch from 639f955 to e1a664b Compare May 27, 2021 08:23
@manicminer manicminer added the api-bug Issue caused by API bug or fix to workaround an API bug label May 27, 2021
@tsologub (Contributor)

I was passing by out of curiosity and have the following question/observation (not a critique, just curious to get your opinion).

Short version:
Does handling these kinds of scenarios have to be the SDK's responsibility, or the end user's? i.e., the user is aware of Azure AD's eventual consistency model and implements retries/failover on their side.

Long version:
We have a similar situation. We are using hamilton to create a service principal, and right afterward we call azure-sdk to create role assignments. Unfortunately, we often get a 404, as the service principal write hasn't been replicated to all read replicas, so we need to perform a retry on our side for a couple of seconds. Following the suggested approach in this PR, it would be nice to have a similar retry in CreateRoleAssignment() in azure-sdk. But does it have to be an SDK responsibility?
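
For context, the workaround on our side looks roughly like the sketch below (the function name, signature, and timings are illustrative placeholders, not actual azure-sdk API):

```go
package example

import (
	"context"
	"fmt"
	"time"
)

// createRoleAssignmentWithRetry sketches the caller-side retry described
// above: keep retrying the role assignment for a short window while the new
// service principal replicates. createRoleAssignment is a placeholder for
// the real azure-sdk call; a real version would only retry on a "principal
// not found" style error rather than on any error.
func createRoleAssignmentWithRetry(ctx context.Context, principalID string,
	createRoleAssignment func(ctx context.Context, principalID string) error) error {

	deadline := time.Now().Add(30 * time.Second)
	for {
		if err := createRoleAssignment(ctx, principalID); err == nil {
			return nil
		} else if time.Now().After(deadline) {
			return fmt.Errorf("role assignment for %s still failing: %w", principalID, err)
		}
		time.Sleep(2 * time.Second) // give the SP write time to replicate
	}
}
```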

@romainDavaze (Author) commented May 28, 2021

Hello @tsologub, thanks for sharing your opinion.

For me, this needs to be implemented at the SDK level for a couple of reasons:

  1. An SDK is designed to be the interface between an existing API and an end user (a Go developer in our case). To me, its role is to deliver consistent and reliable functions to the end user so that they do not have to handle problems caused by the API itself.
  2. An SDK is often developed by people either working on the API or knowing it well enough to identify its weaknesses. I would guess these people are in the best position to fix such problems.
  3. If this were an end-user responsibility, these kinds of problems would have to be fixed in every app using the SDK. I'd rather fix them upstream so that we don't have to think about these issues elsewhere. I reckon this SDK is mostly used by the azuread Terraform provider, so we could fix it there, but I'm also thinking about other people wanting to use this SDK directly in their apps.

I think you pointed out the real problem we both encountered. In most cases, operations are not propagated to the Azure read replicas quickly enough, or at least the API does not make sure they have been properly replicated before responding. What is really awkward to me is that if you execute a GET right after the CREATE, the Graph API will always return a 200 even though the information has not yet been spread to enough nodes. I guess the proper way would be to fix this at the Graph API level, but since we don't have control over that, to me it needs to be fixed here.

More generally, I think we should have the choice of using the client with or without a retry mechanism, so that it covers most use cases.

I would like to hear @manicminer's take on this since this SDK is developed by him and maybe he has a better idea.

@tsologub (Contributor)

Unfortunately or luckily, depending on how you look at it, the Azure AD team went with an eventual consistency model instead of strong consistency.

In most cases, operations are not propagated to the Azure read replicas quickly enough

They will be spread eventually.

the API does not make sure they have been properly replicated before responding

This would require strong consistency, and every interaction would be much slower: AD would have to ensure that every read replica gets the update before responding to the client.

if you execute a GET right after the CREATE, the Graph API will always return a 200

For the node where your request landed, the write was successful; that is why AD responds with a 200.

I guess the proper way would be to fix this at the Graph API level, but since we don't have control over that, to me it needs to be fixed here.

I believe this is not an API-level problem; it is Microsoft's deliberate decision. To change it, Microsoft would have to completely redesign Active Directory.


I get your point. It would be nice to have such retries at the SDK level, especially in the GET scenarios, i.e., if the first response is a 404, try a couple more times. But I can't imagine how this could be implemented for GETs that return lists.

An example: I have a group, I delete this group, and now I want to list all deleted groups. Will my deleted group be on that list? It depends on which node my ListDeleted() request hits, an updated one or not.

Maybe the retries you are talking about don't have to be implemented for every GET, only for some. I don't know, I'm simply thinking out loud.
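
To make the distinction concrete, a hedged sketch (the helper names are made up): a single-object GET has a clear "not found yet" signal to retry on, whereas a list call returns 200 whether or not the change has replicated, so there is nothing to key a retry on.

```go
package example

import (
	"context"
	"net/http"
	"time"
)

// getter stands in for any single-object GET on the SDK; illustrative only.
type getter func(ctx context.Context) (status int, err error)

// getWithRetry retries a single-object GET while it returns 404, assuming
// the object exists but has not replicated yet. A list endpoint offers no
// such signal: it returns 200 either way, so a missing (or lingering) item
// cannot be told apart from replication lag.
func getWithRetry(ctx context.Context, get getter) error {
	for i := 0; i < 10; i++ {
		status, err := get(ctx)
		if err == nil || status != http.StatusNotFound {
			return err
		}
		time.Sleep(2 * time.Second) // wait for the write to propagate
	}
	_, err := get(ctx) // one final attempt before giving up
	return err
}
```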

@manicminer (Owner) commented Jun 2, 2021

Where it's feasible, I would like to handle these kinds of consistency issues in the SDK. We might not catch everything, but if we can absorb the most common cases I think it will still improve the experience for app developers.

I was hoping the ConsistencyLevel header might yield some useful behavior on this front but alas it seems designed solely for odata filtering/counting/searching and doesn't make any difference for naked GET requests, or non-idempotent requests like PATCH or DELETE.

I've been doing some experimenting with implicit request retrying and have had some success. It seems safe to retry on 404, and on 400 responses where the odata error is understood with confidence: e.g. when creating a service principal, if the related application is very fresh and not yet found by the node receiving the SP creation, a 400 with a distinctive odata error can be observed.

I'm hoping this can be built into the base client. Pending some more testing, I may have an implementation with safe defaults (i.e. no retrying unless opted in) that also allows top-level clients to decide what's retryable and what isn't.
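
As a sketch of the shape I have in mind (not necessarily what will land; the names are placeholders):

```go
package example

import (
	"context"
	"net/http"
	"time"
)

// RetryFunc is a placeholder: a top-level client supplies one to tell the
// base client which responses look like replication lag (e.g. a 404, or a
// 400 with a distinctive odata error) and are therefore safe to retry.
type RetryFunc func(resp *http.Response) bool

// doWithOptionalRetry sketches the base-client behaviour: no retrying at
// all unless a RetryFunc is provided (safe default), otherwise retry a
// bounded number of times while the predicate matches.
func doWithOptionalRetry(ctx context.Context, do func(ctx context.Context) (*http.Response, error), shouldRetry RetryFunc) (*http.Response, error) {
	resp, err := do(ctx)
	if shouldRetry == nil {
		return resp, err // opt-in only: default behaviour is unchanged
	}
	for i := 0; i < 5 && err == nil && shouldRetry(resp); i++ {
		time.Sleep(2 * time.Second) // wait out the replication lag
		resp, err = do(ctx)
	}
	return resp, err
}
```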

@manicminer (Owner)

@romainDavaze I'm going to go ahead and close this one now that #57 is merged. Many thanks for the idea and inspiration 🙇

@manicminer manicminer closed this Jun 8, 2021