nvitop.callbacks package

Submodules

nvitop.callbacks.keras module

class nvitop.callbacks.keras.GpuStatsLogger(gpus: int | list[int | str] | tuple[int | str, ...], memory_utilization: bool = True, gpu_utilization: bool = True, intra_step_time: bool = False, inter_step_time: bool = False, fan_speed: bool = False, temperature: bool = False)[source]

Bases: Callback

Automatically log GPU stats during the training stage. GpuStatsLogger is a callback; to use it, also attach a TensorBoard callback or a CSVLogger callback to the model.

Parameters:
  • gpus (int | list[int | str] | tuple[int | str, ...]) – The GPUs to monitor. May be given as a number of GPUs, a list or tuple of logical indices, or TensorFlow device names such as '/gpu:0'.

  • memory_utilization (bool) – Set to True to log the used and free memory and the percentage of memory utilization at the start and end of each step. Default: True.

  • gpu_utilization (bool) – Set to True to log the percentage of GPU utilization at the start and end of each step. Default: True.

  • intra_step_time (bool) – Set to True to log the time of each step. Default: False.

  • inter_step_time (bool) – Set to True to log the time between the end of one step and the start of the next step. Default: False.

  • fan_speed (bool) – Set to True to log the fan speed as a percentage of its maximum. Default: False.

  • temperature (bool) – Set to True to log the GPU temperature in degrees Celsius. Default: False.

Raises:

ValueError – If the NVIDIA driver is not installed, or the gpus argument does not match the available devices.

Examples

>>> from tensorflow.python.keras.utils.multi_gpu_utils import multi_gpu_model
>>> from tensorflow.python.keras.callbacks import TensorBoard
>>> from nvitop.callbacks.keras import GpuStatsLogger
>>> gpus = ['/gpu:0', '/gpu:1']  # or gpus = [0, 1] or gpus = 2
>>> model = Xception(weights=None, ..)
>>> model = multi_gpu_model(model, gpus)
>>> model.compile(..)
>>> tb_callback = TensorBoard(log_dir='./logs')
>>> gpu_stats = GpuStatsLogger(gpus)
>>> model.fit(.., callbacks=[gpu_stats, tb_callback])
Note

The GpuStatsLogger callback should be placed before the TensorBoard / CSVLogger callback.

GPU stats are mainly based on NVML queries. The description of the queries is as follows:

  • fan.speed - The fan speed value is the percent of maximum speed that the device’s fan is currently intended to run at. It ranges from 0 to 100 %. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, this output will not match the actual fan speed. Many parts do not report fan speeds because they rely on cooling via fans in the surrounding enclosure.

  • memory.used - Total memory allocated by active contexts, in MiBs.

  • memory.free - Total free memory, in MiBs.

  • utilization.gpu - Percent of time over the past sample period during which one or more kernels was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product.

  • utilization.memory - Percent of time over the past sample period during which global (device) memory was being read or written. The sample period may be between 1 second and 1/6 second depending on the product.

  • temperature - Core GPU temperature, in degrees C.
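These per-device readings are flattened into a single dict of scalar metrics before logging, one key per device and query. The sketch below illustrates that shape with a stub device standing in for nvitop.Device; the stub class, its fixed readings, and the metric key format here are assumptions for illustration only (the real callback queries NVML):

```python
# Illustration only: flatten per-device GPU stats into a flat metric dict,
# mirroring the dict[str, float] shape returned by get_gpu_stats below.
from __future__ import annotations


class StubDevice:
    """Hypothetical stand-in for nvitop.Device, with fixed readings."""

    def __init__(self, index: int) -> None:
        self.index = index

    def memory_used(self) -> int:      # bytes
        return 4 * 1024**3

    def memory_free(self) -> int:      # bytes
        return 12 * 1024**3

    def gpu_utilization(self) -> int:  # percent
        return 75

    def temperature(self) -> int:      # degrees Celsius
        return 60


def gpu_stats(devices: list[StubDevice]) -> dict[str, float]:
    """Build a flat {metric_name: value} dict, one entry per device per query."""
    stats: dict[str, float] = {}
    for device in devices:
        prefix = f'gpu:{device.index}'
        stats[f'{prefix}/memory.used (MiB)'] = device.memory_used() / 1024**2
        stats[f'{prefix}/memory.free (MiB)'] = device.memory_free() / 1024**2
        stats[f'{prefix}/utilization.gpu (%)'] = float(device.gpu_utilization())
        stats[f'{prefix}/temperature (C)'] = float(device.temperature())
    return stats
```

With two stub devices, `gpu_stats([StubDevice(0), StubDevice(1)])` yields eight scalar entries, ready to hand to any logger that accepts a flat metric dict.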

GPU_NAME_PATTERN = re.compile('^/(\\w*device:)?GPU:(?P<ID>\\d+)$', re.IGNORECASE)
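The pattern above parses TensorFlow-style device names, accepting an optional `device:` (or `physical_device:`, etc.) prefix and capturing the index in the `ID` group; for example:

```python
import re

# Same pattern as GPU_NAME_PATTERN above: matches '/GPU:0', '/gpu:1',
# '/device:GPU:2', etc., case-insensitively, capturing the index as 'ID'.
GPU_NAME_PATTERN = re.compile(r'^/(\w*device:)?GPU:(?P<ID>\d+)$', re.IGNORECASE)

for name in ('/gpu:0', '/GPU:1', '/device:GPU:2', 'cpu:0'):
    match = GPU_NAME_PATTERN.match(name)
    print(name, '->', match.group('ID') if match else 'no match')
```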

on_train_batch_end(batch, logs=None) → None[source]

Called at the end of a training batch in fit methods.

Subclasses should override for any actions to run.

Note that if the steps_per_execution argument to compile in tf.keras.Model is set to N, this method will only be called every N batches.

Parameters:
  • batch – Integer, index of batch within the current epoch.

  • logs – Dict. Aggregated metric results up until this batch.

on_train_batch_start(batch, logs=None) → None[source]
on_train_epoch_start(epoch, logs=None) → None[source]

nvitop.callbacks.lightning module

class nvitop.callbacks.lightning.GpuStatsLogger(memory_utilization: bool = True, gpu_utilization: bool = True, intra_step_time: bool = False, inter_step_time: bool = False, fan_speed: bool = False, temperature: bool = False)[source]

Bases: Callback

Automatically log GPU stats during the training stage. GpuStatsLogger is a callback; to use it, assign a logger to the Trainer.

Parameters:
  • memory_utilization (bool) – Set to True to log the used and free memory and the percentage of memory utilization at the start and end of each step. Default: True.

  • gpu_utilization (bool) – Set to True to log the percentage of GPU utilization at the start and end of each step. Default: True.

  • intra_step_time (bool) – Set to True to log the time of each step. Default: False.

  • inter_step_time (bool) – Set to True to log the time between the end of one step and the start of the next step. Default: False.

  • fan_speed (bool) – Set to True to log the fan speed as a percentage of its maximum. Default: False.

  • temperature (bool) – Set to True to log the GPU temperature in degrees Celsius. Default: False.

Raises:

MisconfigurationException – If the NVIDIA driver is not installed, the Trainer is not running on GPUs, or the Trainer has no logger.

Examples

>>> from lightning.pytorch import Trainer
>>> from nvitop.callbacks.lightning import GpuStatsLogger
>>> gpu_stats = GpuStatsLogger()
>>> trainer = Trainer(accelerator='gpu', devices=[..], logger=True, callbacks=[gpu_stats])

GPU stats are mainly based on NVML queries. The description of the queries is as follows:

  • fan.speed - The fan speed value is the percent of maximum speed that the device’s fan is currently intended to run at. It ranges from 0 to 100 %. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, this output will not match the actual fan speed. Many parts do not report fan speeds because they rely on cooling via fans in the surrounding enclosure.

  • memory.used - Total memory allocated by active contexts, in MiBs.

  • memory.free - Total free memory, in MiBs.

  • utilization.gpu - Percent of time over the past sample period during which one or more kernels was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product.

  • utilization.memory - Percent of time over the past sample period during which global (device) memory was being read or written. The sample period may be between 1 second and 1/6 second depending on the product.

  • temperature - Core GPU temperature, in degrees C.

on_train_batch_end(trainer: pl.Trainer, pl_module: pl.LightningModule, **kwargs: Any) → None[source]

Called when the train batch ends.

Note

The value outputs["loss"] here will be the normalized value w.r.t. accumulate_grad_batches of the loss returned from training_step.
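As a concrete illustration of that normalization (the values here are arbitrary):

```python
# With gradient accumulation, the loss visible in outputs['loss'] is the raw
# training_step loss divided by accumulate_grad_batches.
accumulate_grad_batches = 4
raw_loss = 2.0  # hypothetical value returned from training_step
normalized_loss = raw_loss / accumulate_grad_batches
print(normalized_loss)  # 0.5
```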

on_train_batch_start(trainer: pl.Trainer, pl_module: pl.LightningModule, **kwargs: Any) → None[source]

Called when the train batch begins.

on_train_epoch_start(trainer: pl.Trainer, pl_module: pl.LightningModule) → None[source]

Called when the train epoch begins.

on_train_start(trainer: pl.Trainer, pl_module: pl.LightningModule) → None[source]

Called when the train begins.

nvitop.callbacks.pytorch_lightning module

class nvitop.callbacks.pytorch_lightning.GpuStatsLogger(memory_utilization: bool = True, gpu_utilization: bool = True, intra_step_time: bool = False, inter_step_time: bool = False, fan_speed: bool = False, temperature: bool = False)[source]

Bases: Callback

Automatically log GPU stats during the training stage. GpuStatsLogger is a callback; to use it, assign a logger to the Trainer.

Parameters:
  • memory_utilization (bool) – Set to True to log the used and free memory and the percentage of memory utilization at the start and end of each step. Default: True.

  • gpu_utilization (bool) – Set to True to log the percentage of GPU utilization at the start and end of each step. Default: True.

  • intra_step_time (bool) – Set to True to log the time of each step. Default: False.

  • inter_step_time (bool) – Set to True to log the time between the end of one step and the start of the next step. Default: False.

  • fan_speed (bool) – Set to True to log the fan speed as a percentage of its maximum. Default: False.

  • temperature (bool) – Set to True to log the GPU temperature in degrees Celsius. Default: False.

Raises:

MisconfigurationException – If the NVIDIA driver is not installed, the Trainer is not running on GPUs, or the Trainer has no logger.

Examples

>>> from pytorch_lightning import Trainer
>>> from nvitop.callbacks.pytorch_lightning import GpuStatsLogger
>>> gpu_stats = GpuStatsLogger()
>>> trainer = Trainer(gpus=[..], logger=True, callbacks=[gpu_stats])

GPU stats are mainly based on NVML queries. The description of the queries is as follows:

  • fan.speed - The fan speed value is the percent of maximum speed that the device’s fan is currently intended to run at. It ranges from 0 to 100 %. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, this output will not match the actual fan speed. Many parts do not report fan speeds because they rely on cooling via fans in the surrounding enclosure.

  • memory.used - Total memory allocated by active contexts, in MiBs.

  • memory.free - Total free memory, in MiBs.

  • utilization.gpu - Percent of time over the past sample period during which one or more kernels was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product.

  • utilization.memory - Percent of time over the past sample period during which global (device) memory was being read or written. The sample period may be between 1 second and 1/6 second depending on the product.

  • temperature - Core GPU temperature, in degrees C.

on_train_batch_end(trainer: pl.Trainer, pl_module: pl.LightningModule, **kwargs: Any) → None[source]

Called when the train batch ends.

Note

The value outputs["loss"] here will be the normalized value w.r.t. accumulate_grad_batches of the loss returned from training_step.

on_train_batch_start(trainer: pl.Trainer, pl_module: pl.LightningModule, **kwargs: Any) → None[source]

Called when the train batch begins.

on_train_epoch_start(trainer: pl.Trainer, pl_module: pl.LightningModule) → None[source]

Called when the train epoch begins.

on_train_start(trainer: pl.Trainer, pl_module: pl.LightningModule) → None[source]

Called when the train begins.

nvitop.callbacks.tensorboard module

nvitop.callbacks.tensorboard.add_scalar_dict(writer: SummaryWriter, main_tag: str, tag_scalar_dict: dict[str, int | float | np.floating], global_step: int | np.integer | None = None, walltime: float | None = None) → None[source]

Add a batch of scalars to the writer.

Batched version of writer.add_scalar.
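A minimal sketch of what such a batched helper might look like, assuming only that the writer exposes the standard add_scalar(tag, scalar_value, global_step, walltime) method; the RecordingWriter stub and the 'main_tag/tag' key join are assumptions for illustration, not the library's exact implementation:

```python
# Sketch of a batched add_scalar helper: log every (tag, scalar) pair in the
# dict under 'main_tag/tag', forwarding global_step and walltime unchanged.
def add_scalar_dict(writer, main_tag, tag_scalar_dict,
                    global_step=None, walltime=None):
    for tag, scalar in tag_scalar_dict.items():
        writer.add_scalar(f'{main_tag}/{tag}', scalar,
                          global_step=global_step, walltime=walltime)


# Hypothetical stub writer that records calls instead of writing event files.
class RecordingWriter:
    def __init__(self):
        self.records = []

    def add_scalar(self, tag, scalar_value, global_step=None, walltime=None):
        self.records.append((tag, scalar_value, global_step))


writer = RecordingWriter()
add_scalar_dict(writer, 'gpu:0', {'utilization': 75.0, 'temperature': 60.0},
                global_step=10)
print(writer.records)
# [('gpu:0/utilization', 75.0, 10), ('gpu:0/temperature', 60.0, 10)]
```

In real use the writer would be a torch.utils.tensorboard.SummaryWriter, whose add_scalar accepts the same four arguments.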

nvitop.callbacks.utils module

nvitop.callbacks.utils.get_devices_by_logical_ids(device_ids: list[int], unique: bool = True) → list[CudaDevice][source]
nvitop.callbacks.utils.get_gpu_stats(devices: list[Device], memory_utilization: bool = True, gpu_utilization: bool = True, fan_speed: bool = False, temperature: bool = False) → dict[str, float][source]

Get the GPU status from NVML queries.

Module contents