Machine Learning Design #14

vsoch · 2024-04-24T00:34:07Z

These are old notes from a few weeks ago, about how to integrate ML here.

We want an algorithm that maximizes utilization, which means having nodes ready to go only when the jobs that need them are ready to run. With our current approach, we just take the next job in the queue, whatever it is, and scale to that. In practice this means we are too late (the job is ready but the nodes are not), so the job sits waiting for the scale up. We would want the scale request to go in exactly N seconds before the job is ready (N being roughly how long the scale up takes). OR decide not to scale, because it's better to wait for running jobs to finish (if they are finishing soon).
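To make the timing idea concrete, here is a minimal sketch. It assumes we already have an estimate of when the job becomes ready and of how long a scale up takes; the function name and both parameters are hypothetical, not something that exists in the project.

```python
def scale_request_delay(seconds_until_job_ready, provision_seconds):
    """Seconds to wait before asking for nodes so they arrive as the job becomes ready.

    Both inputs are assumed estimates. A result of 0.0 means we are already
    late and should request the nodes immediately.
    """
    return max(0.0, float(seconds_until_job_ready) - float(provision_seconds))


# Job ready in 5 minutes, nodes take 2 minutes: hold the request for 3 minutes.
print(scale_request_delay(300, 120))  # 180.0
# Job ready in 1 minute, nodes take 2 minutes: request now, the job will still wait.
print(scale_request_delay(60, 120))   # 0.0
```

The second case is what happens today: by the time we look at the job, the lead time is already gone.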

What we'd want is an algorithm that can predict when the running jobs will finish, and whether it's cheaper to wait for them to finish (and reuse their resources) or to scale up then and there. This is actually just like what we started to think about with Rajib.
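A rough sketch of the prediction half, assuming the simplest possible model (mean of past runtimes for the same kind of job) and that each running job releases all of its nodes when it finishes. The function names and the `(nodes_held, remaining_seconds)` tuple layout are made up for illustration.

```python
from statistics import mean


def predict_remaining(elapsed, past_runtimes):
    """Estimate seconds left for a running job: mean past runtime minus elapsed time."""
    return max(0.0, mean(past_runtimes) - elapsed)


def seconds_until_nodes_free(running, nodes_needed):
    """Predicted wait until running jobs have released at least nodes_needed nodes.

    running is a list of (nodes_held, predicted_remaining_seconds) tuples.
    Returns None if the running jobs never free enough nodes.
    """
    freed = 0
    # Walk the jobs in the order we expect them to finish.
    for nodes, remaining in sorted(running, key=lambda job: job[1]):
        freed += nodes
        if freed >= nodes_needed:
            return remaining
    return None


running = [(2, 30.0), (1, 10.0), (4, 90.0)]
print(seconds_until_nodes_free(running, 3))  # 30.0
```

That wait estimate is what would get compared against the time to scale up.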

Start out submitting a bunch of jobs at random.

Start building a model for each ensemble type, and each size within that.

When each model has been trained on some number of jobs, stop submitting at random.

When we stop submitting at random, set job urgencies to 0 so nothing submits.

Then calculate the time/cost for each size and ensemble type in the queue under two conditions:

if we wait for nodes to be ready

if we ask for them right now and then add nodes to the cluster

Choose the ensemble member / size and the option above (wait or scale) that minimizes the cost.
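The steps above can be sketched roughly as follows. Everything here is an assumption for illustration: the "model" is just a mean of observed runtimes per (ensemble type, size), `MIN_SAMPLES` is an arbitrary training threshold, and the cost function (time to completion plus a per-node price when scaling) is a placeholder for whatever cost we actually care about.

```python
from collections import defaultdict
from statistics import mean

MIN_SAMPLES = 3  # hypothetical number of finished jobs before a model counts as trained


class RuntimeModels:
    """One trivial runtime model (mean of observed runtimes) per (ensemble type, size)."""

    def __init__(self):
        self.samples = defaultdict(list)

    def record(self, ensemble_type, size, runtime):
        self.samples[(ensemble_type, size)].append(runtime)

    def trained(self, keys):
        # Once this is True for every key, we would stop submitting at random.
        return all(len(self.samples.get(key, [])) >= MIN_SAMPLES for key in keys)

    def predict(self, ensemble_type, size):
        return mean(self.samples[(ensemble_type, size)])


def option_costs(runtime, size, wait_seconds, provision_seconds,
                 node_price=0.01, time_weight=1.0):
    """Assumed cost model: weighted time to completion, plus node spend if we scale."""
    wait_cost = time_weight * (wait_seconds + runtime)
    scale_time = provision_seconds + runtime
    scale_cost = time_weight * scale_time + node_price * size * scale_time
    return {"wait": wait_cost, "scale": scale_cost}


def choose(models, queue, wait_for, provision_seconds):
    """Return the cheapest (cost, ensemble_type, size, action) over the queue.

    queue is a list of (ensemble_type, size); wait_for maps size to the
    predicted seconds until that many nodes are free.
    """
    best = None
    for ensemble_type, size in queue:
        runtime = models.predict(ensemble_type, size)
        costs = option_costs(runtime, size, wait_for[size], provision_seconds)
        for action, cost in costs.items():
            candidate = (cost, ensemble_type, size, action)
            if best is None or candidate < best:
                best = candidate
    return best


models = RuntimeModels()
for _ in range(MIN_SAMPLES):
    models.record("a", 2, 100.0)
    models.record("a", 4, 200.0)

queue = [("a", 2), ("a", 4)]
if models.trained(queue):
    print(choose(models, queue, wait_for={2: 10.0, 4: 500.0}, provision_seconds=120.0))
```

In this toy run the size 2 member wins with "wait", because nodes for it free up in 10 seconds while a scale up takes 120. A real version would swap in a proper runtime model and real pricing.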

Ping @milroy since we recently chatted about the above - I wrote this before our discussion yesterday, anticipating it could be interesting to work on/think about. Please disregard if not interested / don't have time (I understand).
