These are old notes from a few weeks ago, about how to integrate ML here.
We want an algorithm that maximizes utilization, meaning nodes are ready to go exactly when the jobs that need them are ready to run. With our current approach, we just take the next job in the queue, whatever it is, and scale to that. In practice this means we are too late: the job is ready but the nodes are not, and the job sits waiting for the scale-up. Ideally, the scale request would go in exactly N-<seconds> before the job is ready, OR we would decide not to scale at all because it's better to wait for running jobs to finish (if they are finishing soon).
What we'd want is an algorithm that can predict when running jobs will finish, and then decide whether it's cheaper to wait for them to finish (and reuse their resources) or to scale up then and there. This is just like what we started to think about with Rajib.
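As a rough sketch of that decision (all names and timings here are hypothetical, not an existing API), the core comparison is whether running jobs are predicted to free enough nodes before a scale-up would complete:

```python
from dataclasses import dataclass

@dataclass
class RunningJob:
    # Predicted seconds until this job finishes (from a trained model).
    predicted_seconds_left: float
    nodes: int

def should_wait(running, nodes_needed, scale_up_seconds):
    """Return True if waiting for running jobs to finish is predicted
    to free enough nodes sooner than a scale-up would complete."""
    freed = 0
    # Consider jobs in order of predicted completion time.
    for job in sorted(running, key=lambda j: j.predicted_seconds_left):
        freed += job.nodes
        if freed >= nodes_needed:
            # Enough nodes free up at this job's predicted finish time.
            return job.predicted_seconds_left < scale_up_seconds
    # Running jobs alone can never free enough nodes; we must scale.
    return False

# Example: two running jobs, need 4 nodes, scale-up takes 300s.
jobs = [RunningJob(120.0, 2), RunningJob(200.0, 2)]
print(should_wait(jobs, nodes_needed=4, scale_up_seconds=300.0))  # True
```

The completion-time predictions are exactly what the per-ensemble models below would supply.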
The rough plan:

1. Start out submitting a bunch of jobs at random.
2. Build a model for each ensemble type, and each size within that.
3. Once enough jobs have run to train the models, stop submitting at random.
4. When we stop submitting at random, set job urgencies to 0 so nothing submits.
5. Then calculate the time/cost for each size and ensemble type in the queue under two conditions:
   - if we wait for nodes to be ready
   - if we ask for them right now and add nodes to the cluster
6. Choose the ensemble member / size and the option above that minimizes the cost.
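The selection step could look something like this minimal sketch. The cost model (node-seconds paid, including startup delay), the runtime predictor, and every name here are assumptions for illustration, not an implemented interface:

```python
def choose_next(queue, predict_runtime, seconds_until_nodes_free,
                scale_up_seconds, node_cost_per_second):
    """Pick the (ensemble_type, size) and strategy with minimum cost.

    queue: list of (ensemble_type, size) candidates waiting to run.
    predict_runtime(ensemble_type, size) -> predicted runtime in seconds,
    i.e. the per-ensemble-type/size model described above.
    """
    best = None
    for ensemble_type, size in queue:
        runtime = predict_runtime(ensemble_type, size)
        for strategy, delay in (("wait", seconds_until_nodes_free),
                                ("scale", scale_up_seconds)):
            # Cost = node-seconds we pay for: the delay before the job
            # starts plus its predicted runtime, across all its nodes.
            cost = size * (delay + runtime) * node_cost_per_second
            if best is None or cost < best[0]:
                best = (cost, (ensemble_type, size), strategy)
    return best[1], best[2]

# Stand-in usage with a constant runtime model (hypothetical numbers):
candidate, strategy = choose_next(
    [("lammps", 4), ("gromacs", 8)],
    predict_runtime=lambda t, s: 600.0,
    seconds_until_nodes_free=120.0,
    scale_up_seconds=300.0,
    node_cost_per_second=0.001,
)
print(candidate, strategy)  # ('lammps', 4) wait
```

A real version would also need to account for queue fairness and for the cost of keeping idle nodes around after a scale-up, but the minimize-over-candidates shape is the same.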
Ping @milroy since we recently chatted about the above. I wrote this before our discussion yesterday, anticipating it could be interesting to work on or think about. Please disregard if you're not interested or don't have time (I understand).