Clarifying that orphan semantics only apply to provisioning #254

arschles · 2017-06-23T18:37:57Z

The binding, unbinding, and deprovisioning sections all specify their own semantics for handling orphans, so the specific major section on orphans was confusing because it used language that implied that it only applies to provisioning, but it was not under the provisioning section of the spec.

This patch removes much of the orphans section and translates it to individual rows in the provision response codes table

Closes #256

cfdreddbot · 2017-06-23T18:37:59Z

Hey arschles!

Thanks for submitting this pull request! I'm here to inform the recipients of the pull request that you and the commit authors have already signed the CLA.

nilebox

@arschles I think we need more clarification in the requirements, see my comments

nilebox · 2017-06-25T14:39:22Z

spec.md

+
+Orphans can result if the broker does not return a response before a provision request from the marketplace times out (typically 60 seconds). For example, if a broker does not return a response to a provision request before the request times out, the broker might eventually succeed in provisioning a service instance after the marketplace considers the request a failure. This results in an orphan service instance on the broker's side.
+
+To mitigate orphan service instances and bindings, the marketplace SHOULD attempt to deprovision resources it cannot be sure were successfully created, and SHOULD keep trying to delete them until the broker responds with a success.


How does it work with /~https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#blocking-operations requirement? Do we expect to receive 422 Unprocessable Entity if the provisioning is still ongoing, and shall retry until we succeed deleting?

Also, it explicitly says "it cannot be sure were successfully created", what about the update request? /~https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#updating-a-service-instance
What happens if the update request times out?

If the marketplace made a create request and received an error, it should attempt to delete the resource. If the broker responds to the delete request with an error, CF retries for a week with a backoff schedule. @zrob Is that right?

There is no orphan mitigation in CF for failed updates.

nilebox · 2017-06-25T14:50:32Z

spec.md

+| 5xx | Broker error | Yes |
+| Timeout | Failure | Yes |
+
+If the platform marketplace encounters an internal error provisioning a service instance or binding (for example, saving to the database fails), then it MUST at least send a single delete or unbind request to the service broker to prevent creation of an orphan.


This sentence is ambiguous and confusing (I know you just copied it from the other section, but I think we need to fix it).

"it MUST at least send a single delete or unbind request" - what does it mean? What if this request doesn't reach service broker due to network failure, for example? What if marketplace crashed right after initiating provisioning? If there is no guaranteed persistence on marketplace side, it simply cannot guarantee to fulfill this MUST requirement (or, in other words, "at-least once" semantics).

I'd rather require marketplace to save the binding to the database before it makes the first call to service broker for provisioning. Given that marketplace is considered the source of truth, infrastructure failure should not be an excuse for running into out-of-sync state. I expect the platform and service brokers to always eventually become in sync without any manual human intervention (if there are no bugs in the implementation obviously).

If the marketplace fails to save the result of provisioning (update instance status), then it should whether sync the status later by invoking last operation endpoint, or retry deleting it until success (as mentioned above). I don't see why this case should be an exception.

I don't think we need to be prescriptive in the API spec about marketplace behavior. Maybe some of this content belongs in the profile doc, as CF may choose to have different orphan mitigation behavior than K8s Service-catalog.

In the CF marketplace we decided eventual consistency was not acceptable. As the user-facing interface, the marketplace sets expectations to users for billable utilization. Unless the marketplace is certain that a service instance was successfully provisioned, it should not set contrary expectations, nor bill for the resource.

@shalako I agree that marketplace behavior should have been out of scope of the OSB spec, but I don't think it's realistic to achieve this without breaking changes in the spec.

I personally would prefer service broker to be responsible for the cleanup, for example. But because it is considered "stateless" (see #225), marketplace has to do half of the broker's job. And orphan handling is part of this job, which has to be done by marketplace. And we need to be clear about this in the spec.

In other words, the spec has to be clear about which side (marketplace or broker) is responsible for the consistency of particular operation. But it should not be prescriptive on the "eventual consistency" and other extra expectations from the marketplace, I agree.

duglin · 2017-06-26T12:35:14Z

What do people think about removing this entire section and just giving each operation a more complete Response code table - even if it means duplicating some text in each one? I would find that much easier to follow and less confusing because even with this change we ask people to look in two different spots to figure out the right action to take upon a non-20x result.

mattmcneeney · 2017-06-26T13:50:36Z

I think that'd be clearer. In most cases, the "Any other status code" description would be quite short anyway, like "Platforms SHOULD try again" etc.

arschles · 2017-06-26T19:57:58Z

I'm happy to do that instead

pmorie · 2017-06-27T05:46:14Z

What do people think about removing this entire section and just giving each operation a more complete Response code table - even if it means duplicating some text in each one? I would find that much easier to follow and less confusing because even with this change we ask people to look in two different spots to figure out the right action to take upon a non-20x result.

+1, I suggest that we do this one section at a time. I have been working through the possibilities for error handling just for provision alone, and it's pretty complicated. Let's discuss further on tomorrow's call.

rajanra

Quick comment, Could you use the spec defined terminology "Platform" instead of "Client" that is used while defining the orphan scenarios. Thanks.

angarg12 · 2017-06-28T11:03:57Z

Do we agree to close this one and address the return codes in subsequent PRs?

duglin · 2017-06-28T12:10:11Z

+1 to "Platform" instead of "Client".

avade · 2017-06-29T10:37:53Z

Does CF currently mitigate orphans for bindings @zrob / @shalako?

avade · 2017-06-29T10:42:26Z

I also like that this PR adds consistency with the /~https://github.com/arschles/servicebroker/blob/770a02f0dd8a704c8d4c00d3c8ca1cbb480dfb7f/spec.md#polling-last-operation section of the docs.

duglin · 2017-06-29T11:56:31Z

@arschles did you push commits for this comment ? #254 (comment)

arschles · 2017-07-06T18:18:18Z

@duglin I only added rows for failure codes to the table for provision. I suggest that we tackle those in subsequent PRs. Thoughts?

As for the rest of this PR, I believe I need to change instances of "platform" to "client", and then it'll be ready for a re-review and hopefully a merge.

duglin · 2017-07-06T19:15:04Z

I have a preference to do it all at once instead of piecemeal so that we don't leave the spec in an in-between state.

shalako · 2017-07-10T18:08:12Z

My memory of all of CF's orphan mitigation behaviors is fuzzy. I think we do it for create-binding, as well as delete-instance and delete-binding.

@zrob Could you help us confirm?

arschles · 2017-07-11T17:07:29Z

Ref #255, which talks about orphan handling for deprovision & unbind

angarg12 · 2017-07-18T07:18:33Z

Needs rebase.

The binding, unbinding, and deprovisioning sections all specify their own semantics for handling orphans, so the specific major section on orphans was confusing because it used language that implied that it only applies to provisioning, but it was not under the provisioning section of the spec. This patch moves it to be under the provisioning section

arschles · 2017-07-24T21:20:43Z

Rebased

angarg12 · 2017-07-25T06:21:58Z

Now Travis is failing.

duglin · 2017-07-25T10:58:03Z

@arschles give me a sec to see if I can make the checker a little smarter to handle this case...

Should fix the issue we see in openservicebrokerapi#254 Signed-off-by: Doug Davis <dug@us.ibm.com>

duglin · 2017-07-25T12:49:21Z

@arschles #276 should fix the checker issue - but there are some edits that need to be made to your new text.

duglin · 2017-07-25T12:50:11Z

spec.md

-
-Responses with any other status code will be interpreted as a failure. Brokers can include a user-facing message in the `description` field; for details see [Broker Errors](#broker-errors).
+| Status Code | Description | Orphan Handling |
+|---|---|---|


| --- | --- | --- | needs spaces

duglin · 2017-07-25T12:50:33Z

spec.md

+| 422 Unprocessable Entity | MUST be returned if the broker only supports asynchronous provisioning for the requested plan and the request did not include `?accepts_incomplete=true`. The expected response body is: `{ "error": "AsyncRequired", "description": "This service plan requires client support for asynchronous service operations." }`, as described below. | N/A |
+| Any other 4xx | MAY be returned if the broker rejected the request for reasons aside from those described in the above `408`, `409` and `422` response codes. | The client SHOULD attempt to issue a corresponding deprovision request to handle potential orphans. |
+| Any 5xx| MAY be returned if the broker encountered an error. | The client SHOULD attempt to issue a corresponding deprovision request to handle potential orphans. |
+| Connection timeout | It is RECOMMENDED that brokers do not time out in request processing. | The client SHOULD attempt to issue a corresponding deprovision request to handle potential orphans.


missing trailing |

pmorie

Great start, I think we need to fully define what is meant by orphan mitigation (I didn't see a place where it was clearly defined outside of the status table), and should clarify orphan mitigation behavior for bindings in this PR too.

pmorie · 2017-07-25T15:48:29Z

spec.md

+ broker related to this error.
+
+ Brokers can include a user-facing message in the `description` field of any failure response.
+ For details, see [Broker Errors](#broker-errors).


I think it would be good to have a section about mitigating orphaned resources that spells out exactly what the orphan mitigation behavior is supposed to be. The happy-land case of orphan mitigation is easy to imagine, but I think it would be good to spell out what the right thing to do is when the orphan-mitigating-deprovision itself times out.

Put another way, I do think there's a still a need for this text (which is no longer present in this diff):

To mitigate orphan service instances and bindings, the marketplace SHOULD attempt to delete resources it cannot be sure were successfully created, and SHOULD keep trying to delete them until the broker responds with a success.

I do also think we should clarify the behavior for bindings in this PR as well.

shalako · 2017-08-07T17:12:04Z

Though I find the description of orphans less clear than what we had before, I like the pattern of consolidating response codes and orphan mitigation.

duglin · 2017-08-08T14:17:55Z

spec.md

+| 200 OK | MUST be returned if the service instance already exists, is fully provisioned, and the requested parameters are identical to the existing service instance. The expected response body is below. | N/A |
+| 201 Created | MUST be returned if the service instance was provisioned as a result of this request. The expected response body is below. | The client SHOULD attempt to issue a corresponding deprovision request if the response body was invalid. |
+| 202 Accepted | MUST be returned if the service instance provisioning is in progress. This triggers the platform marketplace to poll the [Service Instance Last Operation Endpoint](#polling-last-operation) for operation status. Note that a re-sent `PUT` request MUST return a `202 Accepted`, not a `200 OK`, if the service instance is not yet fully provisioned. | N/A |
+| Any other 2xx | MAY be returned by the broker in a failure scenario. | The client SHOULD attempt to issue a corresponding deprovision request to handle potential orphans. |


I know this is trying to match what's in the spec today but the way its worded feels odd because it reads like we're changing the HTTP spec by defining a 2xx as an error condition, when the HTTP spec says its not.

Down below we say that any response code not listed is an error but no orphan handling happens - why is "other 2xx" not part of that category? Or, I guess it might be better to ask why the section below doesn't require orphan handling for any unexpected response code? If we do it for unknown 2xx errors (which per HTTP is a successful response code) then I would think we would do it for non-success response codes too.

If we do keep this line, I think a more accurate way of describing the situation is to say (for the middle column):
Brokers MUST NOT return any other 2xx status code and Platforms receiving any other 2xx status code MUST treat it as an error condition.

While this text gives us the same net result, it tries to make it clear that we're rejecting the response NOT because its an "error status code" but rather because the broker didn't adhere to the spec's "MUST NOT return any other 2xx" language.

avade · 2017-11-21T14:52:30Z

1-week closing warning.

arschles · 2017-11-21T17:31:56Z

I think it's ok to close this. Several relevant changes have been made to the spec and it would be better to revisit & reopen rather than rebase.

nilebox reviewed Jun 25, 2017

View reviewed changes

duglin mentioned this pull request Jun 27, 2017

Incorrect http status code listing for orphan handling #256

Closed

This was referenced Jun 27, 2017

Orphan handling for deprovisioning and unbinding #255

Closed

Deprovision when provision failed #247

Closed

rajanra reviewed Jun 28, 2017

View reviewed changes

arschles force-pushed the orphan-clarify branch from 770a02f to f6e1f18 Compare July 24, 2017 21:16

arschles force-pushed the orphan-clarify branch from f6e1f18 to 630edd1 Compare July 24, 2017 21:20

duglin pushed a commit to duglin/servicebroker that referenced this pull request Jul 25, 2017

Fix table checker so it supports sentences in more than just last column

187b782

Should fix the issue we see in openservicebrokerapi#254 Signed-off-by: Doug Davis <dug@us.ibm.com>

duglin mentioned this pull request Jul 25, 2017

Fix table checker so it supports sentences in more than just last column #276

Merged

duglin reviewed Jul 25, 2017

View reviewed changes

pmorie reviewed Jul 25, 2017

View reviewed changes

arschles added 9 commits July 31, 2017 16:02

client => platform

0ca04f3

fixing syntax

13e66d9

Adding section explaining orphans

66bfdb1

Adding newlines and exponential backoff language

0836d14

Adding orphan handling language to binding

71a2cdf

Adding newlines

df8026a

Adding trailing pipe char

6936494

Using alternative, more descriptive language

32708f0

Adding trailing space

a3a7ba5

duglin reviewed Aug 8, 2017

View reviewed changes

wryun mentioned this pull request Aug 30, 2017

Validity of retrying PUT instance requests (and concurrent request behaviour) #301

Closed

duglin added the needs rebase/work label Sep 19, 2017

duglin mentioned this pull request Nov 21, 2017

SHOULD vs MUST with orphan mitigation #370

Closed

arschles closed this Nov 21, 2017

arschles deleted the orphan-clarify branch November 21, 2017 17:31


		Orphans can result if the broker does not return a response before a provision request from the marketplace times out (typically 60 seconds). For example, if a broker does not return a response to a provision request before the request times out, the broker might eventually succeed in provisioning a service instance after the marketplace considers the request a failure. This results in an orphan service instance on the broker's side.

		To mitigate orphan service instances and bindings, the marketplace SHOULD attempt to deprovision resources it cannot be sure were successfully created, and SHOULD keep trying to delete them until the broker responds with a success.

Clarifying that orphan semantics only apply to provisioning #254

Clarifying that orphan semantics only apply to provisioning #254

Conversation

arschles commented Jun 23, 2017 • edited Loading

cfdreddbot commented Jun 23, 2017

nilebox left a comment

Choose a reason for hiding this comment

nilebox Jun 25, 2017

Choose a reason for hiding this comment

nilebox Jun 25, 2017 • edited Loading

Choose a reason for hiding this comment

shalako Jun 26, 2017 • edited Loading

Choose a reason for hiding this comment

nilebox Jun 25, 2017 • edited Loading

Choose a reason for hiding this comment

shalako Jun 26, 2017

Choose a reason for hiding this comment

nilebox Jun 27, 2017 • edited Loading

Choose a reason for hiding this comment

duglin commented Jun 26, 2017

mattmcneeney commented Jun 26, 2017

arschles commented Jun 26, 2017

pmorie commented Jun 27, 2017

rajanra left a comment

Choose a reason for hiding this comment

angarg12 commented Jun 28, 2017

duglin commented Jun 28, 2017

avade commented Jun 29, 2017

avade commented Jun 29, 2017

duglin commented Jun 29, 2017

arschles commented Jul 6, 2017

duglin commented Jul 6, 2017

shalako commented Jul 10, 2017

arschles commented Jul 11, 2017

angarg12 commented Jul 18, 2017

arschles commented Jul 24, 2017

angarg12 commented Jul 25, 2017

duglin commented Jul 25, 2017

duglin commented Jul 25, 2017

duglin Jul 25, 2017

Choose a reason for hiding this comment

duglin Jul 25, 2017

Choose a reason for hiding this comment

pmorie left a comment

Choose a reason for hiding this comment

pmorie Jul 25, 2017

Choose a reason for hiding this comment

shalako commented Aug 7, 2017

duglin Aug 8, 2017

Choose a reason for hiding this comment

avade commented Nov 21, 2017

arschles commented Nov 21, 2017

arschles commented Jun 23, 2017 •

edited

Loading

nilebox Jun 25, 2017 •

edited

Loading

shalako Jun 26, 2017 •

edited

Loading

nilebox Jun 25, 2017 •

edited

Loading

nilebox Jun 27, 2017 •

edited

Loading