Machine Learning Design #14

Open
vsoch opened this issue Apr 24, 2024 · 0 comments


vsoch commented Apr 24, 2024

These are old notes from a few weeks ago, about how to integrate ML here.

We want an algorithm that maximizes utilization, which means having nodes ready to go exactly when the jobs that need them are ready to run. With our current approach, we just take the next job in the queue, whatever it is, and scale to that. In practice this is too late: the job is ready but the nodes are not, so the job sits waiting for the scale up. We would want the scale request to go in at exactly N-<seconds> before the job is ready, OR decide not to scale because it's better to wait for running jobs to finish (if they are finishing soon).
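A minimal sketch of that timing decision, assuming we already have predictions in seconds for when the job becomes ready, how long provisioning takes, and (optionally) when running jobs will free enough nodes. The names `plan_scale_request`, `ScaleDecision`, and all parameters are hypothetical, and the "do not scale" rule here is only one possible interpretation of "finishing soon".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScaleDecision:
    scale: bool                  # True if we should request new nodes
    request_at: Optional[float]  # seconds from now to send the request; None if not scaling


def plan_scale_request(job_ready_in, provision_time, nodes_free_in=None):
    """Decide whether and when to request nodes (illustrative only).

    job_ready_in:   predicted seconds until the job is ready to run
    provision_time: assumed seconds needed to bring new nodes up
    nodes_free_in:  predicted seconds until running jobs free enough nodes,
                    or None if we have no prediction
    """
    # Earliest moment newly requested nodes could actually be useful.
    new_nodes_usable_in = max(job_ready_in, provision_time)

    # Assumption: if existing nodes free up no later than that, waiting
    # is at least as fast as scaling, so we do not scale.
    if nodes_free_in is not None and nodes_free_in <= new_nodes_usable_in:
        return ScaleDecision(scale=False, request_at=None)

    # Otherwise send the request so nodes arrive just as the job is ready.
    # If we are already too late, request immediately.
    return ScaleDecision(scale=True, request_at=max(0, job_ready_in - provision_time))
```

For example, with a job ready in 300 seconds and 120 seconds of provisioning, the request would go in 180 seconds from now, unless running jobs are predicted to free nodes within those 300 seconds.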

What we'd want is an algorithm that can predict when the running jobs will finish, and whether it's cheaper to wait for them to finish (and reuse their resources) or to scale up then and there. This is actually just like what we started to think about with Rajib.
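As a rough illustration of the prediction piece, here is a sketch that uses a running mean of observed runtimes per (ensemble type, size) as a stand-in for a real model. The class `RuntimeModel` and its methods are hypothetical, and a running mean is an assumption chosen for simplicity, not the model we would necessarily use.

```python
from collections import defaultdict


class RuntimeModel:
    """Running mean of runtimes keyed by (ensemble type, size)."""

    def __init__(self):
        self._count = defaultdict(int)
        self._mean = defaultdict(float)

    def observe(self, kind, size, runtime):
        """Record the runtime (seconds) of a finished job."""
        key = (kind, size)
        self._count[key] += 1
        # Incremental mean update, so we never store the full history.
        self._mean[key] += (runtime - self._mean[key]) / self._count[key]

    def samples(self, kind, size):
        """Number of finished jobs seen for this type and size."""
        return self._count.get((kind, size), 0)

    def predict(self, kind, size):
        """Predicted runtime in seconds, or None if nothing observed yet."""
        if self.samples(kind, size) == 0:
            return None
        return self._mean[(kind, size)]

    def remaining(self, kind, size, elapsed):
        """Predicted seconds until a running job finishes (never negative)."""
        predicted = self.predict(kind, size)
        if predicted is None:
            return None
        return max(0.0, predicted - elapsed)
```

The `remaining` estimate for each running job is what tells us when its nodes are likely to become available to something waiting in the queue.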

  • Start out submitting a bunch of jobs at random.
  • Start building a model for each ensemble type, and each size within that.
  • Once the model has been trained on some number of jobs, stop submitting at random.
  • When we stop submitting at random, set job urgencies to 0 so nothing submits.
  • Then calculate the time/cost for each size and ensemble type in the queue under two conditions:
    • if we wait for nodes to be ready
    • if we ask for them right now and then add nodes to the cluster

Choose the ensemble member / size and the option above (wait or scale) that minimizes the cost.
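The selection step above could be sketched roughly as follows. Everything here is an assumption for illustration: the `Candidate` fields, the cost formula (a per-second delay price plus a per-node-second price for newly added nodes only), and the rule that the random phase ends once every queued type and size has at least `min_samples` finished jobs. Holding jobs by setting urgency to 0 is not modeled.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    kind: str              # ensemble type
    size: int              # number of nodes needed
    runtime: float         # predicted runtime in seconds
    wait_for_nodes: float  # predicted seconds until existing nodes free up


def option_costs(c, provision_time, node_price, delay_price):
    """Cost of each option for one candidate (assumed cost model)."""
    # Wait: reuse existing nodes, pay only for the delay until they are free.
    wait = delay_price * c.wait_for_nodes
    # Scale: pay for the provisioning delay, plus the new nodes while they
    # come up and while the job runs.
    scale = (delay_price * provision_time
             + node_price * c.size * (provision_time + c.runtime))
    return {"wait": wait, "scale": scale}


def choose(queue, samples, min_samples, provision_time, node_price,
           delay_price, rng=random):
    """Return (candidate, option), where option is 'random', 'wait', or 'scale'."""
    # Exploration phase: keep submitting at random until every queued
    # (type, size) has enough finished jobs behind its model.
    if any(samples.get((c.kind, c.size), 0) < min_samples for c in queue):
        return rng.choice(queue), "random"

    # Otherwise score every candidate under both conditions, take the minimum.
    scored = [
        (c, name, cost)
        for c in queue
        for name, cost in option_costs(c, provision_time, node_price,
                                       delay_price).items()
    ]
    best, option, _ = min(scored, key=lambda item: item[2])
    return best, option
```

With this cost model, a small job whose nodes free up soon tends to win with "wait", while a job facing a long wait for existing nodes can be cheaper to run by scaling.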

Ping @milroy since we recently chatted about the above - I wrote this before our discussion yesterday, anticipating it could be interesting to work on / think about. Please disregard if you're not interested or don't have time (I understand).
