
Reactivate dashboard CI #952

Merged: 50 commits merged into Epistimio:develop on Nov 23, 2022

Conversation

notoraptor
Collaborator

Description

Re-add CI tests for dashboard src.

Changes

  • Make sure to run the backend in the repo root, so that it can find the database PKL file (a sketch of such a step follows below)
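
A minimal sketch of what such a workflow step could look like (the step name and the backend command are illustrative, not necessarily the exact ones used in this PR):

  - name: Start Orion backend
    working-directory: .   # run in the repository root so the backend finds the database PKL file
    run: orion serve -c .github/workflows/orion/orion_config.yaml &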

Checklist

Tests

  • I added corresponding tests for bug fixes and new features. If possible, the tests fail without the changes
  • All new and existing tests are passing ($ tox -e py38; replace 38 by your Python version if necessary)

Documentation

  • I have updated the relevant documentation related to my changes

Quality

  • I have read the CONTRIBUTING doc
  • My commit messages follow this format
  • My code follows the style guidelines ($ tox -e lint)

@lebrice
Collaborator

lebrice commented Jun 22, 2022

Is there a way for this to only run when a change has been made to the dashboard?

@notoraptor
Collaborator Author

Hi @lebrice, yes, it should be possible!
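
For example, the workflow trigger could be restricted with a paths filter (a sketch, assuming the dashboard sources live under dashboard/):

  on:
    push:
      paths:
        - "dashboard/**"
    pull_request:
      paths:
        - "dashboard/**"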

@notoraptor notoraptor force-pushed the reactivate-dashboard-ci branch from 999ec1b to 0c034be Compare June 28, 2022 15:45
@notoraptor
Collaborator Author

Hi @bouthilx !

Thanks to your suggestion to use cpulimit, I am now able to reproduce timeout issues on my computer.

After further investigation, my current hypothesis is that the timeout in the frontend is (at least partly) caused by the gunicorn timeout configuration on the backend side, which defaults to 30 seconds: https://docs.gunicorn.org/en/stable/settings.html#timeout .

When running with cpulimit (with a 10% CPU usage limit applied to the worker process), I found the following lines in the backend output, which seem to occur just before the timeout error in the frontend:

[2022-06-28 12:06:12 -0400] [172390] [CRITICAL] WORKER TIMEOUT (pid:172408)
[2022-06-28 12:06:12 -0400] [172408] [INFO] Worker exiting (pid: 172408)
[2022-06-28 12:06:12 -0400] [172390] [WARNING] Worker with pid 172408 was terminated due to signal 9
[2022-06-28 12:06:12 -0400] [172579] [INFO] Booting worker with pid: 172579

The backend terminates the worker due to signal 9 and then boots another worker, and this seems to disconnect the API call.

I don't yet know why this happens, but I decided to just increase the timeout in the backend. To do that, I rebased your branch feature/benchmark_webapi into another branch, feature/benchmark_webapi_rebased, with supplementary commits to make sure the timeout config is taken into account: /~https://github.com/Epistimio/orion/compare/develop...notoraptor:orion:feature/benchmark_webapi_rebased?expand=1 (see the changes in files .github/workflows/orion/orion_config.yaml and src/orion/core/cli/serve.py).
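
For reference, here is a rough sketch of the kind of settings involved; the exact config keys and how serve.py forwards them to gunicorn are in the linked branch, so treat these names as illustrative:

  # Illustrative snippet for the backend config (orion_config.yaml);
  # the real keys are defined in the linked branch.
  gunicorn:
    timeout: 300   # raise the default 30-second worker timeout
    workers: 1     # gunicorn defaults, kept explicit
    threads: 1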

What do you think?

@bouthilx
Member

Thank you for the detailed explanation! I think increasing the timeout on the backend is a good solution. I did experience timeouts on the backend too when I was playing with large benchmarks. We'll need to improve the efficiency of the backend/API later, but for now increasing the timeout is good.

@bouthilx
Member

It's a bit late for me to say this, but you could have removed all tests except test-dashboard-build and made it run 10 times. This would have made it easier to automate the 10 runs and would have saved a lot of compute time. Sorry for saying this only after 7 reruns 😅
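
For example, a single job could be repeated with a matrix (a sketch; the job name and test command are illustrative):

  jobs:
    test-dashboard-build:
      runs-on: ubuntu-latest
      strategy:
        fail-fast: false   # let all 10 runs finish even if one fails
        matrix:
          run: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
      steps:
        - uses: actions/checkout@v3
        - name: Run dashboard build test (attempt ${{ matrix.run }})
          run: ./run-test-dashboard-build.sh   # illustrative test command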

@notoraptor
Collaborator Author

(For the record/reminder) I am finally able to get the orion backend output when errors occur, using GitHub artifacts. I fortunately got an error with the last commit, and the artifact for the corresponding error (with node 14.x) is available at the bottom of this page: /~https://github.com/Epistimio/orion/actions/runs/2593818816
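
The backend log is uploaded as an artifact with a step along these lines (a sketch; the actual step and log path in the workflow may differ):

  - name: Upload backend log
    if: always()   # upload even when an earlier step fails
    uses: actions/upload-artifact@v3
    with:
      name: orion-backend-log
      path: orion-backend.log   # hypothetical log file path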

There is indeed an error reported in the log, and then the worker is restarted:

Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/multiprocessing/pool.py", line 729, in _terminate_pool
    p.join()
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/multiprocessing/popen_fork.py", line 47, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/gunicorn/workers/base.py", line 203, in handle_abort
    sys.exit(1)
SystemExit: 1
[2022-07-01 01:59:41 +0000] [1862] [WARNING] Worker with pid 1873 was terminated due to signal 9
[2022-07-01 01:59:41 +0000] [2692] [INFO] Booting worker with pid: 2692

Some remarks:

  • The exception seems related to gunicorn.
  • I guess the worker being restarted in the backend is what causes the error in the frontend.
  • The error in the frontend occurs because an expected text was not found in the page after 300 seconds, so it seems the worker lasts for at least 300 seconds in the backend, which matches the 300-second timeout set on the backend. So, is the exception in the backend caused by the timeout, or is it raised for another reason?

@notoraptor notoraptor force-pushed the reactivate-dashboard-ci branch 3 times, most recently from 1d6925f to d289e3e Compare July 12, 2022 17:37
@notoraptor notoraptor force-pushed the reactivate-dashboard-ci branch from d289e3e to e38b9b5 Compare September 13, 2022 03:28
@notoraptor notoraptor force-pushed the reactivate-dashboard-ci branch from 046990c to b29ae87 Compare September 25, 2022 20:54
@notoraptor notoraptor changed the title (WIP) Reactivate dashboard ci Reactivate dashboard CI Sep 25, 2022
Move Axios and HTTP implementations in separate files.
Still use Axios implementations.
…ark_webapi_rebased for orion backend

- [dashboard/src] Get back to Axios only
…ion/runs/7097242640?check_suite_focus=true

Trigger CI again, with backend branch updated to use only 1 worker as in default
CI passed. Trigger again bis.

CI passed. Trigger again (x3)
…e updated here.

- add timeout 300
- set workers to 1 and threads to 1 as it is by default: https://docs.gunicorn.org/en/stable/settings.html
CI passed (2). Retry

CI passed (3). Retry.

CI passed (4). Retry.

CI passed (5). Retry.

CI passed (6). Retry.

CI passed (7). Retry.
Try to kill backend process on end, to make sure log files are saved on disk
@notoraptor notoraptor force-pushed the reactivate-dashboard-ci branch from b29ae87 to e395cdc Compare November 9, 2022 18:38
@notoraptor
Collaborator Author

Rebased

@bouthilx bouthilx left a comment (Member)

LGTM, thanks!! :)

@bouthilx bouthilx merged commit ea7ae76 into Epistimio:develop Nov 23, 2022
@bouthilx bouthilx added the ci label Dec 19, 2022
@notoraptor notoraptor deleted the reactivate-dashboard-ci branch January 19, 2023 17:31
@notoraptor notoraptor mentioned this pull request Mar 2, 2023