Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarifying that orphan semantics only apply to provisioning #254

Closed
wants to merge 10 commits into from

Conversation

arschles
Copy link
Contributor

@arschles arschles commented Jun 23, 2017

The binding, unbinding, and deprovisioning sections all specify their own semantics for handling orphans, so the specific major section on orphans was confusing because it used language that implied that it only applies to provisioning, but it was not under the provisioning section of the spec.

This patch removes much of the orphans section and translates it to individual rows in the provision response codes table

Closes #256

@cfdreddbot
Copy link

Hey arschles!

Thanks for submitting this pull request! I'm here to inform the recipients of the pull request that you and the commit authors have already signed the CLA.

Copy link

@nilebox nilebox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arschles I think we need more clarification in the requirements, see my comments

spec.md Outdated

Orphans can result if the broker does not return a response before a provision request from the marketplace times out (typically 60 seconds). For example, if a broker does not return a response to a provision request before the request times out, the broker might eventually succeed in provisioning a service instance after the marketplace considers the request a failure. This results in an orphan service instance on the broker's side.

To mitigate orphan service instances and bindings, the marketplace SHOULD attempt to deprovision resources it cannot be sure were successfully created, and SHOULD keep trying to delete them until the broker responds with a success.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it work with /~https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#blocking-operations requirement? Do we expect to receive 422 Unprocessable Entity if the provisioning is still ongoing, and shall retry until we succeed deleting?

Copy link

@nilebox nilebox Jun 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it explicitly says "it cannot be sure were successfully created", what about the update request? /~https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#updating-a-service-instance
What happens if the update request times out?

Copy link
Contributor

@shalako shalako Jun 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the marketplace made a create request and received an error, it should attempt to delete the resource. If the broker responds to the delete request with an error, CF retries for a week with a backoff schedule. @zrob Is that right?

There is no orphan mitigation in CF for failed updates.

spec.md Outdated
| 5xx | Broker error | Yes |
| Timeout | Failure | Yes |

If the platform marketplace encounters an internal error provisioning a service instance or binding (for example, saving to the database fails), then it MUST at least send a single delete or unbind request to the service broker to prevent creation of an orphan.
Copy link

@nilebox nilebox Jun 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence is ambiguous and confusing (I know you just copied it from the other section, but I think we need to fix it).

"it MUST at least send a single delete or unbind request" - what does it mean? What if this request doesn't reach service broker due to network failure, for example? What if marketplace crashed right after initiating provisioning? If there is no guaranteed persistence on marketplace side, it simply cannot guarantee to fulfill this MUST requirement (or, in other words, "at-least once" semantics).

I'd rather require marketplace to save the binding to the database before it makes the first call to service broker for provisioning. Given that marketplace is considered the source of truth, infrastructure failure should not be an excuse for running into out-of-sync state. I expect the platform and service brokers to always eventually become in sync without any manual human intervention (if there are no bugs in the implementation obviously).

If the marketplace fails to save the result of provisioning (update instance status), then it should whether sync the status later by invoking last operation endpoint, or retry deleting it until success (as mentioned above). I don't see why this case should be an exception.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to be prescriptive in the API spec about marketplace behavior. Maybe some of this content belongs in the profile doc, as CF may choose to have different orphan mitigation behavior than K8s Service-catalog.

In the CF marketplace we decided eventual consistency was not acceptable. As the user-facing interface, the marketplace sets expectations to users for billable utilization. Unless the marketplace is certain that a service instance was successfully provisioned, it should not set contrary expectations, nor bill for the resource.

Copy link

@nilebox nilebox Jun 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shalako I agree that marketplace behavior should have been out of scope of the OSB spec, but I don't think it's realistic to achieve this without breaking changes in the spec.

I personally would prefer service broker to be responsible for the cleanup, for example. But because it is considered "stateless" (see #225), marketplace has to do half of the broker's job. And orphan handling is part of this job, which has to be done by marketplace. And we need to be clear about this in the spec.

In other words, the spec has to be clear about which side (marketplace or broker) is responsible for the consistency of particular operation. But it should not be prescriptive on the "eventual consistency" and other extra expectations from the marketplace, I agree.

@duglin
Copy link
Contributor

duglin commented Jun 26, 2017

What do people think about removing this entire section and just giving each operation a more complete Response code table - even if it means duplicating some text in each one? I would find that much easier to follow and less confusing because even with this change we ask people to look in two different spots to figure out the right action to take upon a non-20x result.

@mattmcneeney
Copy link
Contributor

I think that'd be clearer. In most cases, the "Any other status code" description would be quite short anyway, like "Platforms SHOULD try again" etc.

@arschles
Copy link
Contributor Author

I'm happy to do that instead

@pmorie
Copy link
Contributor

pmorie commented Jun 27, 2017

What do people think about removing this entire section and just giving each operation a more complete Response code table - even if it means duplicating some text in each one? I would find that much easier to follow and less confusing because even with this change we ask people to look in two different spots to figure out the right action to take upon a non-20x result.

+1, I suggest that we do this one section at a time. I have been working through the possibilities for error handling just for provision alone, and it's pretty complicated. Let's discuss further on tomorrow's call.

Copy link

@rajanra rajanra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick comment, Could you use the spec defined terminology "Platform" instead of "Client" that is used while defining the orphan scenarios. Thanks.

@angarg12
Copy link
Contributor

Do we agree to close this one and address the return codes in subsequent PRs?

@duglin
Copy link
Contributor

duglin commented Jun 28, 2017

+1 to "Platform" instead of "Client".

@avade
Copy link
Contributor

avade commented Jun 29, 2017

Does CF currently mitigate orphans for bindings @zrob / @shalako?

@avade
Copy link
Contributor

avade commented Jun 29, 2017

I also like that this PR adds consistency with the /~https://github.com/arschles/servicebroker/blob/770a02f0dd8a704c8d4c00d3c8ca1cbb480dfb7f/spec.md#polling-last-operation section of the docs.

@duglin
Copy link
Contributor

duglin commented Jun 29, 2017

@arschles did you push commits for this comment ? #254 (comment)

@arschles
Copy link
Contributor Author

arschles commented Jul 6, 2017

@duglin I only added rows for failure codes to the table for provision. I suggest that we tackle those in subsequent PRs. Thoughts?

As for the rest of this PR, I believe I need to change instances of "platform" to "client", and then it'll be ready for a re-review and hopefully a merge.

@duglin
Copy link
Contributor

duglin commented Jul 6, 2017

I have a preference to do it all at once instead of piecemeal so that we don't leave the spec in an in-between state.

@shalako
Copy link
Contributor

shalako commented Jul 10, 2017

My memory of all of CF's orphan mitigation behaviors is fuzzy. I think we do it for create-binding, as well as delete-instance and delete-binding.

@zrob Could you help us confirm?

@arschles
Copy link
Contributor Author

Ref #255, which talks about orphan handling for deprovision & unbind

@angarg12
Copy link
Contributor

Needs rebase.

The binding, unbinding, and deprovisioning sections all specify their
own semantics for handling orphans, so the specific major section on
orphans was confusing because it used language that implied that it
only applies to provisioning, but it was not under the provisioning
section of the spec.

This patch moves it to be under the provisioning section
@arschles
Copy link
Contributor Author

Rebased

@angarg12
Copy link
Contributor

Now Travis is failing.

@duglin
Copy link
Contributor

duglin commented Jul 25, 2017

@arschles give me a sec to see if I can make the checker a little smarter to handle this case...

duglin pushed a commit to duglin/servicebroker that referenced this pull request Jul 25, 2017
Should fix the issue we see in openservicebrokerapi#254

Signed-off-by: Doug Davis <dug@us.ibm.com>
@duglin
Copy link
Contributor

duglin commented Jul 25, 2017

@arschles #276 should fix the checker issue - but there are some edits that need to be made to your new text.

spec.md Outdated

Responses with any other status code will be interpreted as a failure. Brokers can include a user-facing message in the `description` field; for details see [Broker Errors](#broker-errors).
| Status Code | Description | Orphan Handling |
|---|---|---|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

| --- | --- | --- | needs spaces

spec.md Outdated
| 422 Unprocessable Entity | MUST be returned if the broker only supports asynchronous provisioning for the requested plan and the request did not include `?accepts_incomplete=true`. The expected response body is: `{ "error": "AsyncRequired", "description": "This service plan requires client support for asynchronous service operations." }`, as described below. | N/A |
| Any other 4xx | MAY be returned if the broker rejected the request for reasons aside from those described in the above `408`, `409` and `422` response codes. | The client SHOULD attempt to issue a corresponding deprovision request to handle potential orphans. |
| Any 5xx| MAY be returned if the broker encountered an error. | The client SHOULD attempt to issue a corresponding deprovision request to handle potential orphans. |
| Connection timeout | It is RECOMMENDED that brokers do not time out in request processing. | The client SHOULD attempt to issue a corresponding deprovision request to handle potential orphans.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing trailing |

Copy link
Contributor

@pmorie pmorie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great start, I think we need to fully define what is meant by orphan mitigation (I didn't see a place where it was clearly defined outside of the status table), and should clarify orphan mitigation behavior for bindings in this PR too.

spec.md Outdated
broker related to this error.

Brokers can include a user-facing message in the `description` field of any failure response.
For details, see [Broker Errors](#broker-errors).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to have a section about mitigating orphaned resources that spells out exactly what the orphan mitigation behavior is supposed to be. The happy-land case of orphan mitigation is easy to imagine, but I think it would be good to spell out what the right thing to do is when the orphan-mitigating-deprovision itself times out.

Put another way, I do think there's a still a need for this text (which is no longer present in this diff):

To mitigate orphan service instances and bindings, the marketplace SHOULD attempt
to delete resources it cannot be sure were successfully created, and SHOULD keep
trying to delete them until the broker responds with a success.

I do also think we should clarify the behavior for bindings in this PR as well.

@shalako
Copy link
Contributor

shalako commented Aug 7, 2017

Though I find the description of orphans less clear than what we had before, I like the pattern of consolidating response codes and orphan mitigation.

| 200 OK | MUST be returned if the service instance already exists, is fully provisioned, and the requested parameters are identical to the existing service instance. The expected response body is below. | N/A |
| 201 Created | MUST be returned if the service instance was provisioned as a result of this request. The expected response body is below. | The client SHOULD attempt to issue a corresponding deprovision request if the response body was invalid. |
| 202 Accepted | MUST be returned if the service instance provisioning is in progress. This triggers the platform marketplace to poll the [Service Instance Last Operation Endpoint](#polling-last-operation) for operation status. Note that a re-sent `PUT` request MUST return a `202 Accepted`, not a `200 OK`, if the service instance is not yet fully provisioned. | N/A |
| Any other 2xx | MAY be returned by the broker in a failure scenario. | The client SHOULD attempt to issue a corresponding deprovision request to handle potential orphans. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is trying to match what's in the spec today but the way its worded feels odd because it reads like we're changing the HTTP spec by defining a 2xx as an error condition, when the HTTP spec says its not.

Down below we say that any response code not listed is an error but no orphan handling happens - why is "other 2xx" not part of that category? Or, I guess it might be better to ask why the section below doesn't require orphan handling for any unexpected response code? If we do it for unknown 2xx errors (which per HTTP is a successful response code) then I would think we would do it for non-success response codes too.

If we do keep this line, I think a more accurate way of describing the situation is to say (for the middle column):
Brokers MUST NOT return any other 2xx status code and Platforms receiving any other 2xx status code MUST treat it as an error condition.

While this text gives us the same net result, it tries to make it clear that we're rejecting the response NOT because its an "error status code" but rather because the broker didn't adhere to the spec's "MUST NOT return any other 2xx" language.

@avade
Copy link
Contributor

avade commented Nov 21, 2017

1-week closing warning.

@arschles
Copy link
Contributor Author

I think it's ok to close this. Several relevant changes have been made to the spec and it would be better to revisit & reopen rather than rebase.

@arschles arschles closed this Nov 21, 2017
@arschles arschles deleted the orphan-clarify branch November 21, 2017 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants