Keras callbacks and metrics for tracking GPU utilization, temperature, and power consumption.
Supports NVIDIA GPUs only, through the nvidia-ml-py library, a Python wrapper for the NVIDIA Management Library (NVML) APIs.
This library supports two main use cases:
- Instantaneous metrics: GPU utilization, temperature, and power consumption
- Metric tracking during a training session
pip install keras-gpu-metrics
Basic example code to get current GPU status.
Track GPU utilization, temperature, and power consumption while training a TensorFlow model.
Demonstrates how to estimate the total GPU energy used to train and test a TensorFlow model.
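As a rough illustration of the energy estimate mentioned above, total energy can be approximated by integrating power readings over time. This is a minimal sketch, not the library's implementation: it assumes you have already collected (timestamp in seconds, power in milliwatts) pairs yourself, and the helper names `estimate_energy_joules` and `joules_to_watt_hours` are hypothetical.

```python
from typing import Sequence, Tuple


def estimate_energy_joules(samples: Sequence[Tuple[float, float]]) -> float:
    """Integrate (timestamp_seconds, power_milliwatts) samples with the trapezoidal rule.

    Hypothetical helper, not part of keras-gpu-metrics.
    """
    energy_mj = 0.0  # accumulated energy in millijoules (mW * s)
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        # Area of the trapezoid between two consecutive readings
        energy_mj += 0.5 * (p0 + p1) * (t1 - t0)
    return energy_mj / 1000.0  # millijoules -> joules


def joules_to_watt_hours(joules: float) -> float:
    """Convert joules to watt-hours (1 Wh = 3600 J)."""
    return joules / 3600.0


# Made-up readings: a steady 300 W (300000 mW) draw for one hour, sampled every 60 s
samples = [(t, 300_000.0) for t in range(0, 3601, 60)]
energy_j = estimate_energy_joules(samples)
print(f"{energy_j:.0f} J = {joules_to_watt_hours(energy_j):.1f} Wh")
```

The accuracy of such an estimate depends on the sampling interval: short power spikes between two readings are not captured.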
def get_gpu_statuses() -> List[GPUStatus]:
Returns a list of GPUStatus objects, one for each GPU on the system.
Usage:
from keras_gpu_metrics import get_gpu_statuses
# Gets a list of GPUStatus objects for each available GPU device
gpu_statuses = get_gpu_statuses()
Returns:
List[GPUStatus]: A list of GPUStatus objects, one for each GPU on the system.
@dataclass
class GPUStatus:
    timestamp: int        # UNIX timestamp (seconds) of this measurement
    gpu_id: int           # Integer ID of the GPU
    device_name: str      # Name of the GPU, e.g. 'NVIDIA GeForce RTX 3090'
    pids: List[int]       # List of process IDs using the GPU
    utilization: int      # GPU utilization percentage
    clock_speed_mhz: int  # GPU clock speed in MHz
    temperature: int      # GPU temperature in Celsius
    memory_free: int      # Free GPU memory in bytes
    memory_used: int      # Used GPU memory in bytes
    fan_speed: int        # Fan speed percentage
    power_usage_mw: int   # Power usage in milliwatts
from keras_gpu_metrics.gpu_info import get_gpu_statuses
# Gets a list of GPUStatus objects for each available GPU device
gpu_statuses = get_gpu_statuses()
# Prints the status of GPU 0
if len(gpu_statuses) > 0:
    print(gpu_statuses[0])
else:
    print('No compatible GPU detected')
Example output:
Status of GPU 0 at timestamp 1673587592 (2023-01-12 21:26:32)
Device Name: NVIDIA GeForce RTX 3090
PIDs: [1840, 1832, 9692]
Utilization: 2%
Clock Speed: 210 MHz
Temperature: 34 C
Memory Free: 891281408 bytes
Memory Used: 24878522368 bytes
Fan Speed: 0%
Power Usage: 21824 mW
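The raw values above are reported in bytes and milliwatts. The following sketch shows one way to derive more readable figures from them. It uses a stand-in dataclass holding a subset of the documented GPUStatus fields so that it runs without a GPU; the helper names are hypothetical, and it assumes that total memory equals free plus used memory, which matches the example output but is not stated by the library.

```python
from dataclasses import dataclass


@dataclass
class GPUStatusStandIn:
    """Stand-in holding a subset of the GPUStatus fields (not the library's class)."""
    memory_free: int     # bytes
    memory_used: int     # bytes
    power_usage_mw: int  # milliwatts


def bytes_to_gib(num_bytes: int) -> float:
    """Convert bytes to gibibytes (1 GiB = 1024**3 bytes)."""
    return num_bytes / 1024 ** 3


def memory_used_fraction(status) -> float:
    """Fraction of memory in use, assuming total = free + used."""
    total = status.memory_free + status.memory_used
    return status.memory_used / total if total else 0.0


def power_usage_watts(status) -> float:
    """Convert the milliwatt reading to watts."""
    return status.power_usage_mw / 1000.0


# Values copied from the example output above
status = GPUStatusStandIn(
    memory_free=891281408,
    memory_used=24878522368,
    power_usage_mw=21824,
)
total_gib = bytes_to_gib(status.memory_free + status.memory_used)
print(f"Memory: {memory_used_fraction(status):.1%} of {total_gib:.0f} GiB in use")
print(f"Power:  {power_usage_watts(status):.1f} W")
```

Because the helpers only read attributes, they should also accept the objects returned by get_gpu_statuses(), provided the field names match those documented above.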
GPU stats can be tracked as training metrics using a Keras callback (GPUMetricTrackerCallback).
Add this callback to the list of callbacks passed to the fit method of a Keras model.
Metric values are updated at the end of each training batch and can be tracked using the metric functions
provided by the callback object.
import tensorflow as tf
from keras_gpu_metrics import GPUMetricTrackerCallback

# `model` and `dataset` are assumed to be defined elsewhere.

# GPUMetricTrackerCallback is needed to update variables so that metrics
# (which are part of the tensorflow graph) can receive updated GPU info.
gpu_tracker_callback = GPUMetricTrackerCallback()
metrics = [
    tf.keras.metrics.BinaryAccuracy(),
    gpu_tracker_callback.utilization_metric(),
    gpu_tracker_callback.clock_speed_metric(),
    gpu_tracker_callback.temperature_metric(),
    gpu_tracker_callback.fan_speed_metric(),
    gpu_tracker_callback.power_usage_metric(),
]
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
    metrics=metrics
)
history = model.fit(
    dataset,
    epochs=10,
    validation_data=dataset,
    callbacks=[gpu_tracker_callback]
)
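After training, the History object returned by fit exposes a `history` dictionary that maps each metric name to a list of per-epoch values. The sketch below shows one possible way to summarize such a dictionary. The function name `summarize_history` is hypothetical, and the key `gpu_temperature` is purely illustrative: the names under which the GPU metrics are actually reported are not documented here, so inspect `history.history.keys()` to find them.

```python
from statistics import mean
from typing import Dict, List


def summarize_history(history: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Return the min, mean, and max of each metric's per-epoch values.

    Hypothetical helper, not part of keras-gpu-metrics. Metrics with no
    recorded values are skipped.
    """
    return {
        name: {"min": min(values), "mean": mean(values), "max": max(values)}
        for name, values in history.items()
        if values
    }


# Made-up per-epoch values; the key names are illustrative only
example_history = {
    "loss": [0.69, 0.52, 0.41],
    "gpu_temperature": [61.0, 68.0, 70.0],
}
summary = summarize_history(example_history)
for name, stats in summary.items():
    print(f"{name}: min={stats['min']:.2f} mean={stats['mean']:.2f} max={stats['max']:.2f}")
```

With a real training run, the same function could be called as `summarize_history(history.history)`.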