-
Notifications
You must be signed in to change notification settings - Fork 433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarifying that orphan semantics only apply to provisioning #254
Conversation
Hey arschles! Thanks for submitting this pull request! I'm here to inform the recipients of the pull request that you and the commit authors have already signed the CLA. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arschles I think we need more clarification in the requirements, see my comments
spec.md
Outdated
|
||
Orphans can result if the broker does not return a response before a provision request from the marketplace times out (typically 60 seconds). For example, if a broker does not return a response to a provision request before the request times out, the broker might eventually succeed in provisioning a service instance after the marketplace considers the request a failure. This results in an orphan service instance on the broker's side. | ||
|
||
To mitigate orphan service instances and bindings, the marketplace SHOULD attempt to deprovision resources it cannot be sure were successfully created, and SHOULD keep trying to delete them until the broker responds with a success. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does it work with /~https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#blocking-operations requirement? Do we expect to receive 422 Unprocessable Entity
if the provisioning is still ongoing, and shall retry until we succeed deleting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it explicitly says "it cannot be sure were successfully created", what about the update request? /~https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#updating-a-service-instance
What happens if the update request times out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the marketplace made a create request and received an error, it should attempt to delete the resource. If the broker responds to the delete request with an error, CF retries for a week with a backoff schedule. @zrob Is that right?
There is no orphan mitigation in CF for failed updates.
spec.md
Outdated
| 5xx | Broker error | Yes | | ||
| Timeout | Failure | Yes | | ||
|
||
If the platform marketplace encounters an internal error provisioning a service instance or binding (for example, saving to the database fails), then it MUST at least send a single delete or unbind request to the service broker to prevent creation of an orphan. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence is ambiguous and confusing (I know you just copied it from the other section, but I think we need to fix it).
"it MUST at least send a single delete or unbind request" - what does it mean? What if this request doesn't reach service broker due to network failure, for example? What if marketplace crashed right after initiating provisioning? If there is no guaranteed persistence on marketplace side, it simply cannot guarantee to fulfill this MUST requirement (or, in other words, "at-least once" semantics).
I'd rather require marketplace to save the binding to the database before it makes the first call to service broker for provisioning. Given that marketplace is considered the source of truth, infrastructure failure should not be an excuse for running into out-of-sync state. I expect the platform and service brokers to always eventually become in sync without any manual human intervention (if there are no bugs in the implementation obviously).
If the marketplace fails to save the result of provisioning (update instance status), then it should whether sync the status later by invoking last operation endpoint, or retry deleting it until success (as mentioned above). I don't see why this case should be an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to be prescriptive in the API spec about marketplace behavior. Maybe some of this content belongs in the profile doc, as CF may choose to have different orphan mitigation behavior than K8s Service-catalog.
In the CF marketplace we decided eventual consistency was not acceptable. As the user-facing interface, the marketplace sets expectations to users for billable utilization. Unless the marketplace is certain that a service instance was successfully provisioned, it should not set contrary expectations, nor bill for the resource.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shalako I agree that marketplace behavior should have been out of scope of the OSB spec, but I don't think it's realistic to achieve this without breaking changes in the spec.
I personally would prefer service broker to be responsible for the cleanup, for example. But because it is considered "stateless" (see #225), marketplace has to do half of the broker's job. And orphan handling is part of this job, which has to be done by marketplace. And we need to be clear about this in the spec.
In other words, the spec has to be clear about which side (marketplace or broker) is responsible for the consistency of particular operation. But it should not be prescriptive on the "eventual consistency" and other extra expectations from the marketplace, I agree.
What do people think about removing this entire section and just giving each operation a more complete Response code table - even if it means duplicating some text in each one? I would find that much easier to follow and less confusing because even with this change we ask people to look in two different spots to figure out the right action to take upon a non-20x result. |
I think that'd be clearer. In most cases, the "Any other status code" description would be quite short anyway, like "Platforms SHOULD try again" etc. |
I'm happy to do that instead |
+1, I suggest that we do this one section at a time. I have been working through the possibilities for error handling just for provision alone, and it's pretty complicated. Let's discuss further on tomorrow's call. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick comment, Could you use the spec defined terminology "Platform" instead of "Client" that is used while defining the orphan scenarios. Thanks.
Do we agree to close this one and address the return codes in subsequent PRs? |
+1 to "Platform" instead of "Client". |
I also like that this PR adds consistency with the /~https://github.com/arschles/servicebroker/blob/770a02f0dd8a704c8d4c00d3c8ca1cbb480dfb7f/spec.md#polling-last-operation section of the docs. |
@arschles did you push commits for this comment ? #254 (comment) |
@duglin I only added rows for failure codes to the table for provision. I suggest that we tackle those in subsequent PRs. Thoughts? As for the rest of this PR, I believe I need to change instances of "platform" to "client", and then it'll be ready for a re-review and hopefully a merge. |
I have a preference to do it all at once instead of piecemeal so that we don't leave the spec in an in-between state. |
My memory of all of CF's orphan mitigation behaviors is fuzzy. I think we do it for create-binding, as well as delete-instance and delete-binding. @zrob Could you help us confirm? |
Ref #255, which talks about orphan handling for deprovision & unbind |
Needs rebase. |
The binding, unbinding, and deprovisioning sections all specify their own semantics for handling orphans, so the specific major section on orphans was confusing because it used language that implied that it only applies to provisioning, but it was not under the provisioning section of the spec. This patch moves it to be under the provisioning section
Rebased |
Now Travis is failing. |
@arschles give me a sec to see if I can make the checker a little smarter to handle this case... |
Should fix the issue we see in openservicebrokerapi#254 Signed-off-by: Doug Davis <dug@us.ibm.com>
spec.md
Outdated
|
||
Responses with any other status code will be interpreted as a failure. Brokers can include a user-facing message in the `description` field; for details see [Broker Errors](#broker-errors). | ||
| Status Code | Description | Orphan Handling | | ||
|---|---|---| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| --- | --- | --- |
needs spaces
spec.md
Outdated
| 422 Unprocessable Entity | MUST be returned if the broker only supports asynchronous provisioning for the requested plan and the request did not include `?accepts_incomplete=true`. The expected response body is: `{ "error": "AsyncRequired", "description": "This service plan requires client support for asynchronous service operations." }`, as described below. | N/A | | ||
| Any other 4xx | MAY be returned if the broker rejected the request for reasons aside from those described in the above `408`, `409` and `422` response codes. | The client SHOULD attempt to issue a corresponding deprovision request to handle potential orphans. | | ||
| Any 5xx| MAY be returned if the broker encountered an error. | The client SHOULD attempt to issue a corresponding deprovision request to handle potential orphans. | | ||
| Connection timeout | It is RECOMMENDED that brokers do not time out in request processing. | The client SHOULD attempt to issue a corresponding deprovision request to handle potential orphans. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing trailing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great start, I think we need to fully define what is meant by orphan mitigation (I didn't see a place where it was clearly defined outside of the status table), and should clarify orphan mitigation behavior for bindings in this PR too.
spec.md
Outdated
broker related to this error. | ||
|
||
Brokers can include a user-facing message in the `description` field of any failure response. | ||
For details, see [Broker Errors](#broker-errors). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to have a section about mitigating orphaned resources that spells out exactly what the orphan mitigation behavior is supposed to be. The happy-land case of orphan mitigation is easy to imagine, but I think it would be good to spell out what the right thing to do is when the orphan-mitigating-deprovision itself times out.
Put another way, I do think there's a still a need for this text (which is no longer present in this diff):
To mitigate orphan service instances and bindings, the marketplace SHOULD attempt
to delete resources it cannot be sure were successfully created, and SHOULD keep
trying to delete them until the broker responds with a success.
I do also think we should clarify the behavior for bindings in this PR as well.
Though I find the description of orphans less clear than what we had before, I like the pattern of consolidating response codes and orphan mitigation. |
| 200 OK | MUST be returned if the service instance already exists, is fully provisioned, and the requested parameters are identical to the existing service instance. The expected response body is below. | N/A | | ||
| 201 Created | MUST be returned if the service instance was provisioned as a result of this request. The expected response body is below. | The client SHOULD attempt to issue a corresponding deprovision request if the response body was invalid. | | ||
| 202 Accepted | MUST be returned if the service instance provisioning is in progress. This triggers the platform marketplace to poll the [Service Instance Last Operation Endpoint](#polling-last-operation) for operation status. Note that a re-sent `PUT` request MUST return a `202 Accepted`, not a `200 OK`, if the service instance is not yet fully provisioned. | N/A | | ||
| Any other 2xx | MAY be returned by the broker in a failure scenario. | The client SHOULD attempt to issue a corresponding deprovision request to handle potential orphans. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is trying to match what's in the spec today but the way its worded feels odd because it reads like we're changing the HTTP spec by defining a 2xx as an error condition, when the HTTP spec says its not.
Down below we say that any response code not listed is an error but no orphan handling happens - why is "other 2xx" not part of that category? Or, I guess it might be better to ask why the section below doesn't require orphan handling for any unexpected response code? If we do it for unknown 2xx errors (which per HTTP is a successful response code) then I would think we would do it for non-success response codes too.
If we do keep this line, I think a more accurate way of describing the situation is to say (for the middle column):
Brokers MUST NOT return any other 2xx status code and Platforms receiving any other 2xx status code MUST treat it as an error condition.
While this text gives us the same net result, it tries to make it clear that we're rejecting the response NOT because its an "error status code" but rather because the broker didn't adhere to the spec's "MUST NOT return any other 2xx" language.
1-week closing warning. |
I think it's ok to close this. Several relevant changes have been made to the spec and it would be better to revisit & reopen rather than rebase. |
The binding, unbinding, and deprovisioning sections all specify their own semantics for handling orphans, so the specific major section on orphans was confusing because it used language that implied that it only applies to provisioning, but it was not under the provisioning section of the spec.
This patch removes much of the orphans section and translates it to individual rows in the provision response codes table
Closes #256