Job Statistics with NVIDIA Data Center GPU Manager and SLURM
https://developer.nvidia.com/blog/job-statistics-nvidia-data-center-gpu-manager-slurm/ (developer.nvidia.com)
NVIDIA's Data Center GPU Manager (DCGM) can be integrated with resource managers like SLURM to provide detailed job-level GPU statistics. The process involves using SLURM prolog and epilog scripts to start and stop DCGM's statistics collection for each job, identified by its unique SLURM job ID. This allows administrators and users to monitor resource utilization without modifying the user's workload. The resulting reports provide key metrics such as energy consumption, power usage, maximum GPU memory used, and error counts, helping to understand system performance and diagnose issues.
0 points•by hdt•4 days ago
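The prolog/epilog flow in the summary can be sketched roughly as follows. This is a minimal illustration, not the article's own scripts: the helper names (`prolog_commands`, `epilog_commands`, `run`) are hypothetical, the DCGM GPU group ID is assumed to have been created beforehand, and the `dcgmi stats` flags shown (`-e`, `-s`, `-x`, `-v -j`) should be checked against the `dcgmi` version installed on the cluster.

```python
import subprocess
from typing import List


def prolog_commands(job_id: str, group_id: str) -> List[List[str]]:
    """Commands a SLURM prolog could run before the job starts.

    group_id is an assumption: a DCGM GPU group that already exists.
    """
    return [
        # Enable stats watches on the GPU group.
        ["dcgmi", "stats", "-g", group_id, "-e"],
        # Start recording job statistics, keyed by the SLURM job ID.
        ["dcgmi", "stats", "-g", group_id, "-s", job_id],
    ]


def epilog_commands(job_id: str) -> List[List[str]]:
    """Commands a SLURM epilog could run after the job ends."""
    return [
        # Stop recording for this job.
        ["dcgmi", "stats", "-x", job_id],
        # Print the verbose per-job report (energy, power, memory, errors).
        ["dcgmi", "stats", "-v", "-j", job_id],
    ]


def run(commands: List[List[str]]) -> None:
    """Run each command in order, raising if any of them fails."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

A prolog wrapper might then call `run(prolog_commands(os.environ["SLURM_JOB_ID"], "2"))`, where the group ID `"2"` is only a placeholder, and the epilog would call `run(epilog_commands(os.environ["SLURM_JOB_ID"]))`. Keeping command construction separate from execution makes the flow easy to inspect without DCGM installed.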