This repository is focused on benchmarking jobs and classifying them into relevant classes. The basic idea is that by feeding in the traces collected from Prometheus after a job finishes, we classify the job as memory intensive, half memory intensive, CPU intensive, or GPU intensive.
- The EDA and ML notebooks utilize the data available in the local repository, allowing for immediate execution.
- The deep learning repository extends this work by assuming that job traces remain almost constant.
- Under this assumption, we achieved 65% accuracy.
- An alternative approach involves treating each job trace as a matrix with dimensions `timesteps × number of signals`. This method may yield improved results. One challenge with this idea is that jobs do not all have the same length, so each job's matrix can have a different number of rows. In the initial version, we took the longest job and padded all other jobs with rows of zeros so that every matrix has the same length (see the sketch below).
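The following is a minimal sketch of that zero-padding step, assuming each job trace has already been loaded as a NumPy array of shape `(timesteps, n_signals)`; the function name `pad_job_traces` and the variable names are illustrative and not part of the repository.

```python
import numpy as np

def pad_job_traces(traces):
    """Zero-pad a list of (timesteps, n_signals) arrays to the longest job.

    Returns a single array of shape (n_jobs, max_timesteps, n_signals).
    Assumes every trace has the same number of signal columns.
    """
    max_len = max(t.shape[0] for t in traces)
    n_signals = traces[0].shape[1]
    padded = np.zeros((len(traces), max_len, n_signals), dtype=np.float32)
    for i, t in enumerate(traces):
        padded[i, : t.shape[0], :] = t  # original rows first, zeros after
    return padded

# Example: three jobs with different lengths but the same signals
jobs = [np.random.rand(120, 8), np.random.rand(300, 8), np.random.rand(45, 8)]
X = pad_job_traces(jobs)
print(X.shape)  # (3, 300, 8)
```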
- The `cleaning_benchmark_job_data` notebook processes JSON files typically generated by the HPC team.
- It queries SLURM to retrieve the exact node and runtime of each job (a sketch of this step is shown below).
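As one way to perform that SLURM lookup, here is a minimal sketch using the standard `sacct` accounting command; the notebook itself may use a different tool or request different fields, so treat the format string and parsing as assumptions.

```python
import subprocess

def get_job_node_and_runtime(job_id):
    """Query SLURM accounting for the node list and runtime of a finished job.

    Uses the standard `sacct` CLI; the requested columns (JobID, NodeList,
    Start, End, Elapsed) are valid sacct fields but are an assumption about
    what the notebook actually retrieves.
    """
    out = subprocess.run(
        ["sacct", "-j", str(job_id), "--noheader", "--parsable2",
         "--format=JobID,NodeList,Start,End,Elapsed"],
        capture_output=True, text=True, check=True,
    ).stdout
    # The first non-empty line corresponds to the job allocation itself
    first = next(line for line in out.splitlines() if line.strip())
    job, nodelist, start, end, elapsed = first.split("|")
    return {"job_id": job, "nodes": nodelist, "start": start,
            "end": end, "elapsed": elapsed}

# Example (hypothetical job id):
# info = get_job_node_and_runtime(123456)
```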
- The `prom_benchmark_job_extraction` notebook assumes that traces for all nodes have already been gathered and stored in Parquet format.
- This notebook identifies the node and time for each job and extracts the corresponding signals or traces (see the sketch below).
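A minimal sketch of that extraction step with pandas, assuming one Parquet file per node with a timestamp index and one column per signal; the file layout, naming, and column schema are assumptions rather than the notebook's actual structure.

```python
import pandas as pd

def extract_job_trace(parquet_dir, node, start, end):
    """Load the Prometheus traces for one node and slice out one job's window.

    Assumes one Parquet file per node named '<node>.parquet' whose index can
    be parsed as timestamps; the real notebook may organize the data differently.
    """
    df = pd.read_parquet(f"{parquet_dir}/{node}.parquet")
    df.index = pd.to_datetime(df.index)
    # Keep only the rows that fall inside the job's runtime window
    return df.loc[pd.Timestamp(start):pd.Timestamp(end)]

# Example (hypothetical node name and time window):
# trace = extract_job_trace("traces", "node042",
#                           "2024-05-01 10:00:00", "2024-05-01 12:30:00")
# print(trace.shape)  # timesteps x number of signals
```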