bug 1562412 - Notarization poller worker #121
Conversation
Moving to review-ready before I've fully tested, since I'm a ways away from in-tree patches. I'm hoping to avoid the situation where I'm all ready to land, but am missing review during the holidays. I can try to keep track of the interdiff between the reviewed revision and any new changes. @mitchhentges do you have time / headspace for giving this a review pass? If not, possibly @JohanLorenzo ? |
Oof, this is a big patch. I'll look at this later this week if I have time :) |
Yeah. @tomprince said he could take a look too. |
Looks great to me! I like how multiple tasks are handled in a single worker. I haven't found any major issue.
I haven't looked too closely at the project configuration (tox, flake8 and all), because I assumed it's a copy-pasta of the other projects.
👍 👍 👍
exit_code = STATUSES["internal-error"]


class RetryError(WorkerError):
Nit: we can reuse ClientError here:

class RetryError(ClientError):

I wonder if we should move the other exceptions there too. What do you think?
Hm. This is a bit different in that we're setting a resource-unavailable status. Other than that, possibly.
I've left a bunch of comments on the implementation with suggestions on some restructuring that I think would make the code easier to reason about, and in one case make the notarization code clearly separate from the taskcluster worker code.
It is probably worth chatting about this before addressing the comments.
new_tasks = await self._run_cancellable(claim_work(self.config, queue, num_tasks=num_tasks_to_claim))
self.last_claim_work = arrow.utcnow()
for claim_task in new_tasks.get("tasks", []):
    new_task = Task(self.config, claim_task)
If Task has a .start method that returns a future, you can add a done_callback to remove it from running_tasks, rather than periodically pruning it.
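For illustration, a minimal sketch of that done_callback approach; Task.start() returning a future, and the surrounding self, claim_task, and running_tasks names, are assumptions about this code rather than the actual patch.

```python
# Hypothetical helper inside the claim loop: track the task, and remove it
# from running_tasks as soon as its future finishes, instead of pruning the
# list periodically.
def _start_and_track(self, claim_task):
    new_task = Task(self.config, claim_task)  # Task as used in the snippet above
    fut = new_task.start()                    # assumed to return an asyncio.Future/Task
    self.running_tasks.append(new_task)
    fut.add_done_callback(lambda _fut, task=new_task: self.running_tasks.remove(task))
```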
I haven't done this. The current approach seems to be working. Not sure if this was a blocker?
self.running_tasks.append(new_task)
await self.prune_running_tasks()
sleep_time = self.last_claim_work.timestamp + self.config["claim_work_interval"] - arrow.utcnow().timestamp
sleep_time > 0 and await self._run_cancellable(sleep(sleep_time))
I think if the body of this function is wrapped in an asyncio.Task, then you don't need _run_cancellable, and cancelling the task will propagate to the awaited futures here.
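For illustration, a rough sketch of that structure under assumed class, method, and attribute names; only claim_work and the claim_work_interval config key come from the patch itself.

```python
import asyncio


class ClaimWorkLoop:
    """Sketch: wrap the claim-work loop in a single asyncio.Task.

    Cancelling self.claim_fut propagates CancelledError into whatever the
    loop is currently awaiting, so no _run_cancellable wrapper is needed.
    """

    def __init__(self, config, queue):
        self.config = config
        self.queue = queue
        self.claim_fut = None

    async def _loop_body(self):
        while True:
            # claim_work is the existing coroutine used in the snippet above.
            await claim_work(self.config, self.queue, num_tasks=1)
            await asyncio.sleep(self.config["claim_work_interval"])

    def invoke(self):
        self.claim_fut = asyncio.ensure_future(self._loop_body())

    def cancel(self):
        if self.claim_fut:
            self.claim_fut.cancel()
```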
I think I've done this in the latest working patchset.
I still see _run_cancellable here. It exists to propagate cancellation to the sleep and claim_work calls. If you don't have that function, and cancel the future from invoke, it should automatically propagate the cancellation to the sleep and claim_work calls.

That said, this seems like something that can be left for a followup.
Tom's wip here. I'll push up my latest changes here after I have a chance to clean them up. I haven't split the queue interactions from the apple interactions, but I may do so to support integration tests better (since we can easily "mock" apple interactions out by instantiating a different class for the task logic). |
I'm planning on merging and deploying this. We can follow up with any needed changes. |
/notarization_poller has errors:
else:
    log.exception("reclaim_task unexpected exception: %s %s", self.task_id, self.run_id)
    self.status = STATUSES["internal-error"]
    self.task_fut and self.task_fut.cancel()
I'd be slightly inclined to switch the creation of task_fut and reclaim_fut above, and drop the "self.task_fut and" here, but probably not important.
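For illustration, a sketch of that reordering; run_task and reclaim_task are hypothetical stand-ins for the coroutines in this class.

```python
import asyncio


def invoke(self):
    # Create task_fut first, so it is guaranteed to exist by the time the
    # reclaim loop hits an error.
    self.task_fut = asyncio.ensure_future(self.run_task())
    self.reclaim_fut = asyncio.ensure_future(self.reclaim_task())


def handle_reclaim_error(self):
    self.status = STATUSES["internal-error"]
    # No "self.task_fut and" guard needed any more.
    self.task_fut.cancel()
```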
log.info("SIGTERM received; shutting down")
nonlocal done
done = True
await running_tasks.cancel()
Since add_signal_handler doesn't handle futures, I think it would be better if _handle_sigterm were sync and handled anything that needed to be async itself.

That said, it looks like it is async so that running_tasks.cancel can wait on the task futures. Since nothing is waiting on the result of that, I don't think that await actually has any effect. Thus, I think this function and running_tasks.cancel can both be sync.

fae37f8#diff-5fb318e79005ec63cfed70064fa1e861R71 is where I handled waiting on all of those tasks to finish in my sketch.
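For illustration, a sketch of the fully synchronous handler; running_tasks stands in for the worker's task tracker, and its cancel() is assumed here to be a plain sync method that calls .cancel() on each task future.

```python
import asyncio
import logging
import signal

log = logging.getLogger(__name__)


async def async_main(running_tasks):
    done = False

    def _handle_sigterm():
        # add_signal_handler takes a plain callback, so keep this sync.
        nonlocal done
        log.info("SIGTERM received; shutting down")
        done = True
        running_tasks.cancel()

    loop = asyncio.get_running_loop()
    loop.add_signal_handler(signal.SIGTERM, _handle_sigterm)

    while not done:
        await asyncio.sleep(1)
```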
I mainly copied over these coroutines. I'm almost leaning towards tearing out this code, since the worst that can happen with notarization_poller dying is we wait a bit longer on tasks that just poll and wait. Do you have a preference between tearing these out and leaving them as-is?
I'm fine either way. My main concerns are:
- the code is more complicated than it needs to be
- the code makes it look like it will wait for the futures to complete when it likely won't. (Whether they will or not depends on the order of callbacks in the event loop, and may also depend on how much, if any, work they do in being cancelled.) One way to actually wait is sketched below.
Long-term, I think we should improve the code here and in scriptworker to handle this case well. But, I'm fine landing without this code or as-is.
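For illustration, a sketch of one way to actually wait for the cancelled futures to finish, roughly what the fae37f8 sketch linked above does; futures stands in for the running tasks' futures.

```python
import asyncio


async def cancel_and_wait(futures):
    for fut in futures:
        fut.cancel()
    # gather(..., return_exceptions=True) only returns once every future has
    # finished handling its cancellation, rather than just requesting it.
    await asyncio.gather(*futures, return_exceptions=True)
```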
Right. My main concerns are: I have done a significant amount of testing with the code as-is, and making non-trivial changes is likely to introduce new errors just as I'm about to roll out.
One minor comment (about handling graceful shutdown well), and a number of possible followups.
@@ -76,9 +76,12 @@ def start(self):
         except TaskError:
             self.status = STATUSES["malformed-payload"]
             self.task_log(traceback.format_exc(), level=logging.CRITICAL)
+        except asyncio.CancelledError:
+            # We already dealt with self.status in reclaim_task
+            self.task_log(traceback.format_exc(), level=logging.CRITICAL)
(followup) We may want to consider not logging here, since we'll have already logged elsewhere. It probably isn't worth worrying about until after this has been deployed and is in production for a while though.
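For illustration, a possible shape of that followup; run_task is a hypothetical stand-in for the work awaited in start().

```python
import asyncio


async def start(self):
    try:
        await self.run_task()
    except asyncio.CancelledError:
        # self.status and the logging were already handled in reclaim_task,
        # so skip the redundant traceback here.
        pass
```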
new_task.start()
self.running_tasks.append(new_task)
await self.prune_running_tasks()
sleep_time = self.last_claim_work.timestamp + self.config["claim_work_interval"] - arrow.utcnow().timestamp
(potential followup) Since we don't do anything async[1] between setting last_claim_work and now, I suspect we could use claim_work_interval directly here, but it doesn't hurt to do this calculation. (It would be slightly more interesting if we took the time before calling claim_work, so it was the time between calls to claim_work.)

[1] We await prune_running_tasks, so it may take several event loop iterations, but we won't wait on I/O or anything.
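For illustration, a sketch of that "time between calls to claim_work" variant: take the timestamp before claiming, so the interval measures call-to-call time. claim_work, queue, and the config keys come from the surrounding code; the method name is a stand-in.

```python
import asyncio

import arrow


async def claim_iteration(self, queue, num_tasks_to_claim):
    self.last_claim_work = arrow.utcnow()  # timestamp taken *before* claiming
    new_tasks = await claim_work(self.config, queue, num_tasks=num_tasks_to_claim)
    for claim_task in new_tasks.get("tasks", []):
        ...  # start and track the newly claimed task
    sleep_time = self.last_claim_work.timestamp + self.config["claim_work_interval"] - arrow.utcnow().timestamp
    if sleep_time > 0:
        await asyncio.sleep(sleep_time)
```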
This is the poller worker, as described in the user story for bug 1562412. We need to be able to claim and track multiple concurrent tasks, poll Apple for each of their statuses, and resolve tasks when Apple's ready.
It looks like we could potentially abstract away some of this logic into a base python worker module, and share it with scriptworker. I've explicitly decided not to do that this time around.
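For illustration, a high-level sketch of the claim/poll/resolve flow described above; the helper names (apple_notarization_complete, POLL_INTERVAL, max_concurrent_tasks) are illustrative assumptions, not the actual layout of this patch.

```python
import asyncio

POLL_INTERVAL = 30  # seconds; illustrative


async def poll_one_task(queue, claim_task):
    """Poll Apple until notarization finishes, then resolve the task."""
    while not await apple_notarization_complete(claim_task):  # hypothetical helper
        await asyncio.sleep(POLL_INTERVAL)
    await queue.reportCompleted(claim_task["status"]["taskId"], claim_task["runId"])


async def main_loop(config, queue):
    running = set()
    while True:
        # Claim up to the configured number of tasks, then track each poller.
        claimed = await claim_work(config, queue, num_tasks=config["max_concurrent_tasks"])
        for claim_task in claimed.get("tasks", []):
            fut = asyncio.ensure_future(poll_one_task(queue, claim_task))
            running.add(fut)
            fut.add_done_callback(running.discard)
        await asyncio.sleep(config["claim_work_interval"])
```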