An error occurred on elcap:
Traceback (most recent call last):
File "/usr/bin/coral2_dws.py", line 937, in <module>
main()
File "/usr/bin/coral2_dws.py", line 928, in main
handle.reactor_run()
File "/usr/lib64/flux/python3.6/flux/core/handle.py", line 322, in reactor_run
Flux.raise_if_exception()
File "/usr/lib64/flux/python3.6/flux/core/handle.py", line 133, in raise_if_exception
raise cls.set_exception(None) from None
File "/usr/lib64/flux/python3.6/flux/core/watchers.py", line 68, in timeout_handler_wrapper
watcher.callback(watcher.flux_handle, watcher, revents, watcher.args)
File "/usr/lib64/flux/python3.6/flux_k8s/watch.py", line 63, in watch_cb
watchers.watch()
File "/usr/lib64/flux/python3.6/flux_k8s/watch.py", line 99, in watch
watch.watch()
File "/usr/lib64/flux/python3.6/flux_k8s/watch.py", line 47, in watch
for event in stream:
File "/usr/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 144, in stream
for line in iter_resp_lines(resp):
File "/usr/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 46, in iter_resp_lines
for seg in resp.read_chunked(decode_content=False):
File "/usr/lib/python3.6/site-packages/urllib3/response.py", line 694, in read_chunked
self._original_response.close()
File "/usr/lib64/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/lib/python3.6/site-packages/urllib3/response.py", line 378, in _error_catcher
raise ProtocolError('Connection broken: %r' % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(2062 bytes read, 608 more expected)', IncompleteRead(2062 bytes read, 608 more expected))
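The crash is the kubernetes watch stream itself: urllib3 raises ProtocolError when the apiserver (or something in between) drops the chunked response mid-read, and since nothing around the watch catches it, the exception unwinds the reactor and takes coral2-dws down with it. Below is a minimal sketch, assuming the standard kubernetes Python client, of a watch loop that treats this as a recoverable condition; the group/version/plural names are assumptions for illustration, not the actual flux_k8s.watch code.

```python
import urllib3
from kubernetes import client, config, watch

# Assumed resource names for illustration only.
GROUP, VERSION, PLURAL = "dataworkflowservices.github.io", "v1alpha2", "workflows"

config.load_kube_config()
api = client.CustomObjectsApi()
resource_version = None

while True:
    try:
        stream = watch.Watch().stream(
            api.list_cluster_custom_object,
            GROUP,
            VERSION,
            PLURAL,
            resource_version=resource_version,
        )
        for event in stream:
            # Remember where we are so a restarted watch can resume here.
            resource_version = event["object"]["metadata"]["resourceVersion"]
            print(event["type"], event["object"]["metadata"].get("name"))
    except urllib3.exceptions.ProtocolError:
        # The apiserver (or a proxy) dropped the chunked response mid-read --
        # the IncompleteRead above.  Re-establish the watch from the last seen
        # resourceVersion instead of letting the exception propagate up
        # through the reactor and kill the service.
        continue
```

Whether the retry belongs at this level or inside flux_k8s.watch is a design question, but the broken read itself does not have to be fatal.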
Because of this error, a bunch of workflows were stranded in PreRun, since the coral2-dws service was down when the dws.post_run RPCs were sent (a sketch of what that failure looks like from the sender's side follows the log excerpt below). Perhaps relatedly, it seems that a lot of dws.prolog_remove RPCs were sent (perhaps because the workflows had some updates occur while PreRun was ready: True), since @grondo noted that the Flux logs were filling up with messages like:
[ +40.069567] job-manager[0]: failed to fetch 'dws_prolog_active' aux for 508988203961679872: No such file or directory
[ +40.069597] job-manager[0]: Failed to setup DWS workflow object for job 508988203961679872
[ +40.396645] job-manager[0]: failed to fetch 'dws_prolog_active' aux for 508987988726775808: No such file or directory
[ +40.396659] job-manager[0]: Failed to setup DWS workflow object for job 508987988726775808
[ +40.774926] job-manager[0]: failed to fetch 'dws_prolog_active' aux for 508988073585934336: No such file or directory
[ +40.774943] job-manager[0]: Failed to setup DWS workflow object for job 508988073585934336
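For reference, a hedged sketch (using the standard Flux Python bindings) of what the sender side sees when coral2-dws is down: nothing is registered to service the dws.* topics, so the RPC fails instead of moving the workflow along. The payload shape here is illustrative only.

```python
import flux

h = flux.Flux()
try:
    # With coral2-dws down, no service owns the "dws" topics, so this RPC
    # cannot be delivered to a handler.
    h.rpc("dws.post_run", {"jobid": 508988203961679872}).get()
except OSError as exc:
    # Typically surfaces as ENOSYS ("Function not implemented"); the workflow
    # is left sitting in PreRun because post_run processing never happens.
    print(f"dws.post_run failed: {exc}")
```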
Relates to #145 and #159.