DWS: handle broken kubernetes connection #197

jameshcorbett · 2024-08-10T00:53:51Z

An error occurred on elcap:

Traceback (most recent call last):
  File "/usr/bin/coral2_dws.py", line 937, in <module>
    main()
  File "/usr/bin/coral2_dws.py", line 928, in main
    handle.reactor_run()
  File "/usr/lib64/flux/python3.6/flux/core/handle.py", line 322, in reactor_run
    Flux.raise_if_exception()
  File "/usr/lib64/flux/python3.6/flux/core/handle.py", line 133, in raise_if_exception
    raise cls.set_exception(None) from None
  File "/usr/lib64/flux/python3.6/flux/core/watchers.py", line 68, in timeout_handler_wrapper
    watcher.callback(watcher.flux_handle, watcher, revents, watcher.args)
  File "/usr/lib64/flux/python3.6/flux_k8s/watch.py", line 63, in watch_cb
    watchers.watch()
  File "/usr/lib64/flux/python3.6/flux_k8s/watch.py", line 99, in watch
    watch.watch()
  File "/usr/lib64/flux/python3.6/flux_k8s/watch.py", line 47, in watch
    for event in stream:
  File "/usr/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/usr/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 46, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/usr/lib/python3.6/site-packages/urllib3/response.py", line 694, in read_chunked
    self._original_response.close()
  File "/usr/lib64/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python3.6/site-packages/urllib3/response.py", line 378, in _error_catcher
    raise ProtocolError('Connection broken: %r' % e, e)
  urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(2062 bytes read, 608 more expected

Relates to #145 and #159.
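
For illustration, one way to survive a dropped connection like the one in the traceback above is to reopen the watch stream with a bounded number of retries, rather than letting the exception propagate into the reactor. The following is only a minimal sketch, not the fix that was merged: `watch_with_retry`, `open_stream`, `handle_event`, and `retry_on` are hypothetical names that do not come from `flux_k8s`, and the retry limit and delay are arbitrary.

```python
import time


def watch_with_retry(open_stream, handle_event, retry_on,
                     max_retries=5, delay=1.0, sleep=time.sleep):
    """Consume events from open_stream(), reopening it if the connection breaks.

    All parameter names are hypothetical stand-ins for illustration:
      open_stream  -- callable returning a fresh iterable of events
      handle_event -- callable invoked once per event
      retry_on     -- exception class (or tuple of classes) treated as
                      "connection broken", e.g. urllib3.exceptions.ProtocolError
    Returns the number of times the stream had to be reopened.
    """
    failures = 0    # consecutive failures without any event in between
    reconnects = 0  # total reconnects, reported to the caller
    while True:
        try:
            for event in open_stream():
                handle_event(event)
                failures = 0  # an event arrived, so reset the failure budget
            return reconnects  # the stream ended normally
        except retry_on:
            failures += 1
            reconnects += 1
            if failures > max_retries:
                raise  # give up and let the caller decide what to do
            sleep(delay)
```

With the real client, `retry_on` would presumably be `urllib3.exceptions.ProtocolError`, the exception shown in the traceback. A real watcher would also have to decide where to resume the stream after reconnecting so that events are neither lost nor replayed; the sketch does not address that.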

This error stranded a number of workflows in PreRun, because the coral2-dws service was down when the dws.post_run RPCs were sent. Perhaps relatedly, a lot of dws.prolog_remove RPCs appear to have been sent (possibly because the workflows received updates while PreRun was ready: True); @grondo noted that the Flux logs were filling up with messages like:

[ +40.069567] job-manager[0]: failed to fetch 'dws_prolog_active' aux for 508988203961679872: No such file or directory
[ +40.069597] job-manager[0]: Failed to setup DWS workflow object for job 508988203961679872
[ +40.396645] job-manager[0]: failed to fetch 'dws_prolog_active' aux for 508987988726775808: No such file or directory
[ +40.396659] job-manager[0]: Failed to setup DWS workflow object for job 508987988726775808
[ +40.774926] job-manager[0]: failed to fetch 'dws_prolog_active' aux for 508988073585934336: No such file or directory
[ +40.774943] job-manager[0]: Failed to setup DWS workflow object for job 508988073585934336

jameshcorbett mentioned this issue Aug 22, 2024

dws: add error-handling cycle for kubernetes #199

Merged

mergify bot closed this as completed in #199 Aug 26, 2024

jameshcorbett mentioned this issue Nov 8, 2024

dws: retry k8s failures while cleaning up workflows #237

Merged
