DWS: handle broken kubernetes connection #197

Closed
jameshcorbett opened this issue Aug 10, 2024 · 0 comments · Fixed by #199

An error occurred on elcap:

Traceback (most recent call last):
  File "/usr/bin/coral2_dws.py", line 937, in <module>
    main()
  File "/usr/bin/coral2_dws.py", line 928, in main
    handle.reactor_run()
  File "/usr/lib64/flux/python3.6/flux/core/handle.py", line 322, in reactor_run
    Flux.raise_if_exception()
  File "/usr/lib64/flux/python3.6/flux/core/handle.py", line 133, in raise_if_exception
    raise cls.set_exception(None) from None
  File "/usr/lib64/flux/python3.6/flux/core/watchers.py", line 68, in timeout_handler_wrapper
    watcher.callback(watcher.flux_handle, watcher, revents, watcher.args)
  File "/usr/lib64/flux/python3.6/flux_k8s/watch.py", line 63, in watch_cb
    watchers.watch()
  File "/usr/lib64/flux/python3.6/flux_k8s/watch.py", line 99, in watch
    watch.watch()
  File "/usr/lib64/flux/python3.6/flux_k8s/watch.py", line 47, in watch
    for event in stream:
  File "/usr/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/usr/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 46, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/usr/lib/python3.6/site-packages/urllib3/response.py", line 694, in read_chunked
    self._original_response.close()
  File "/usr/lib64/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python3.6/site-packages/urllib3/response.py", line 378, in _error_catcher
    raise ProtocolError('Connection broken: %r' % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(2062 bytes read, 608 more expected

Relates to #145 and #159.

Because of this error, a number of workflows were stranded in PreRun: the coral2-dws service was down when the dws.post_run RPCs were sent. Perhaps relatedly, it seems that many dws.prolog_remove RPCs were sent (possibly because the workflows received updates while PreRun was ready: True). @grondo noted that the Flux logs were filling up with messages like:

[ +40.069567] job-manager[0]: failed to fetch 'dws_prolog_active' aux for 508988203961679872: No such file or directory
[ +40.069597] job-manager[0]: Failed to setup DWS workflow object for job 508988203961679872
[ +40.396645] job-manager[0]: failed to fetch 'dws_prolog_active' aux for 508987988726775808: No such file or directory
[ +40.396659] job-manager[0]: Failed to setup DWS workflow object for job 508987988726775808
[ +40.774926] job-manager[0]: failed to fetch 'dws_prolog_active' aux for 508988073585934336: No such file or directory
[ +40.774943] job-manager[0]: Failed to setup DWS workflow object for job 508988073585934336
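
One possible way to keep the service alive when the connection drops as in the traceback above is to catch the error inside the watch callback, remember the last event that was processed, and let the next timer tick reopen the stream from there. The sketch below is only an illustration under stated assumptions: `ResilientWatch`, `open_stream`, and `handle_event` are hypothetical names, not part of flux_k8s or the kubernetes client, and it assumes the stream can be reopened from a given resource version.

```python
import logging

LOGGER = logging.getLogger(__name__)


class ResilientWatch:
    """Hypothetical wrapper around one watch stream; not the flux_k8s API."""

    def __init__(self, open_stream, handle_event, retriable=(ConnectionError,)):
        # open_stream(resource_version) is an assumed stand-in for whatever
        # call opens the Kubernetes watch. It must return an iterable of
        # (resource_version, event) pairs.
        self.open_stream = open_stream
        self.handle_event = handle_event
        # Exception types that mean "the connection broke, try again later".
        self.retriable = retriable
        self.resource_version = None
        self.failures = 0

    def poll(self):
        """Drain the stream once; return False if the connection broke."""
        try:
            for version, event in self.open_stream(self.resource_version):
                self.handle_event(event)
                # Record progress only after the event was handled, so the
                # next poll resumes after the last processed event.
                self.resource_version = version
        except self.retriable as exc:
            self.failures += 1
            LOGGER.warning(
                "watch stream broke (%d failures so far): %s", self.failures, exc
            )
            return False
        return True
```

In the real service the `retriable` tuple would presumably include `urllib3.exceptions.ProtocolError`, since that is the exception in the traceback. Returning instead of raising means the exception never reaches `reactor_run()`, so the reactor keeps running and the service stays up to answer RPCs while the connection recovers.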