RFC: Implement nvidia_smi module#2759
Conversation
|
I'm looking forwards to being able to monitor my NVIDIA GPu's without needing to install some outdated Python modules. I'll see about getting this tested on my laptop (GTX 1080, Nvidia driver version 384.69) once the Travis build goes green. |
|
Hmm, the macOS build from before the most recent push is still running after over an hour, compared to the 4.5 minutes it appears to usually take. I think something might be broken there, but I don't have a macOS system to test with to confirm. |
|
I don't think I've seen travis-ci complete any netdata builds to date, but maybe I missed it. |
|
Here's the most recent one that succeeded: Other than this PR, it looks like all the PR test builds have been working fine for the past week or so according to: |
|
Huh, I stand corrected. I've pretty much always just seen the yellow "in progress" icon for Travis on my PRs, but I guess it is working. |
|
travis is having difficulties with macOS, it is usually a bit slow and unreliable (I have to restart a lot of jobs to make it finish). I see now it is stuck because it cannot find a macOS slot to build netdata. My click for nvidia-smi outputI haven't installed nvidia on my XPS9560 (nvidia 1050 - @Ferroin nvidia 1080 on a laptop? wow! which laptop? ) The truth is I am afraid to install it... :) will it break everything? XPS9560 has many issues with drivers, on both Linux and Windows... @Ferroin which distro do you use? |
|
@ktsaou Can you pastebin the output of |
|
btw @tycho this plugin is nice! @l2isbad is there a way to avoid polling external commands all the time? I mean, both Something like |
|
@tycho here it is: click for XML output<?xml version="1.0" ?>
<!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v6.dtd">
<nvidia_smi_log>
<timestamp>Mon Sep 18 22:10:18 2017</timestamp>
<driver_version>340.102</driver_version>
<attached_gpus>1</attached_gpus>
<gpu id="0000:01:00.0">
<product_name>GeForce 8800 GT</product_name>
<product_brand>GeForce</product_brand>
<display_mode>N/A</display_mode>
<display_active>N/A</display_active>
<persistence_mode>Disabled</persistence_mode>
<accounting_mode>N/A</accounting_mode>
<accounting_mode_buffer_size>N/A</accounting_mode_buffer_size>
<driver_model>
<current_dm>N/A</current_dm>
<pending_dm>N/A</pending_dm>
</driver_model>
<serial>N/A</serial>
<uuid>GPU-da613ba8-27bb-76b9-2c57-d36a5905478a</uuid>
<minor_number>0</minor_number>
<vbios_version>62.92.16.00.04</vbios_version>
<multigpu_board>N/A</multigpu_board>
<board_id>N/A</board_id>
<inforom_version>
<img_version>N/A</img_version>
<oem_object>N/A</oem_object>
<ecc_object>N/A</ecc_object>
<pwr_object>N/A</pwr_object>
</inforom_version>
<gpu_operation_mode>
<current_gom>N/A</current_gom>
<pending_gom>N/A</pending_gom>
</gpu_operation_mode>
<pci>
<pci_bus>01</pci_bus>
<pci_device>00</pci_device>
<pci_domain>0000</pci_domain>
<pci_device_id>061110DE</pci_device_id>
<pci_bus_id>0000:01:00.0</pci_bus_id>
<pci_sub_system_id>080110B0</pci_sub_system_id>
<pci_gpu_link_info>
<pcie_gen>
<max_link_gen>N/A</max_link_gen>
<current_link_gen>N/A</current_link_gen>
</pcie_gen>
<link_widths>
<max_link_width>N/A</max_link_width>
<current_link_width>N/A</current_link_width>
</link_widths>
</pci_gpu_link_info>
<pci_bridge_chip>
<bridge_chip_type>N/A</bridge_chip_type>
<bridge_chip_fw>N/A</bridge_chip_fw>
</pci_bridge_chip>
</pci>
<fan_speed>40 %</fan_speed>
<performance_state>P12</performance_state>
<clocks_throttle_reasons>N/A</clocks_throttle_reasons>
<fb_memory_usage>
<total>1023 MiB</total>
<used>386 MiB</used>
<free>637 MiB</free>
</fb_memory_usage>
<bar1_memory_usage>
<total>N/A</total>
<used>N/A</used>
<free>N/A</free>
</bar1_memory_usage>
<compute_mode>Default</compute_mode>
<utilization>
<gpu_util>N/A</gpu_util>
<memory_util>N/A</memory_util>
<encoder_util>N/A</encoder_util>
<decoder_util>N/A</decoder_util>
</utilization>
<ecc_mode>
<current_ecc>N/A</current_ecc>
<pending_ecc>N/A</pending_ecc>
</ecc_mode>
<ecc_errors>
<volatile>
<single_bit>
<device_memory>N/A</device_memory>
<register_file>N/A</register_file>
<l1_cache>N/A</l1_cache>
<l2_cache>N/A</l2_cache>
<texture_memory>N/A</texture_memory>
<total>N/A</total>
</single_bit>
<double_bit>
<device_memory>N/A</device_memory>
<register_file>N/A</register_file>
<l1_cache>N/A</l1_cache>
<l2_cache>N/A</l2_cache>
<texture_memory>N/A</texture_memory>
<total>N/A</total>
</double_bit>
</volatile>
<aggregate>
<single_bit>
<device_memory>N/A</device_memory>
<register_file>N/A</register_file>
<l1_cache>N/A</l1_cache>
<l2_cache>N/A</l2_cache>
<texture_memory>N/A</texture_memory>
<total>N/A</total>
</single_bit>
<double_bit>
<device_memory>N/A</device_memory>
<register_file>N/A</register_file>
<l1_cache>N/A</l1_cache>
<l2_cache>N/A</l2_cache>
<texture_memory>N/A</texture_memory>
<total>N/A</total>
</double_bit>
</aggregate>
</ecc_errors>
<retired_pages>
<multiple_single_bit_retirement>
<retired_count>N/A</retired_count>
<retired_page_addresses>N/A</retired_page_addresses>
</multiple_single_bit_retirement>
<double_bit_retirement>
<retired_count>N/A</retired_count>
<retired_page_addresses>N/A</retired_page_addresses>
</double_bit_retirement>
<pending_retirement>N/A</pending_retirement>
</retired_pages>
<temperature>
<gpu_temp>54 C</gpu_temp>
<gpu_temp_max_threshold>N/A</gpu_temp_max_threshold>
<gpu_temp_slow_threshold>N/A</gpu_temp_slow_threshold>
</temperature>
<power_readings>
<power_state>P12</power_state>
<power_management>N/A</power_management>
<power_draw>N/A</power_draw>
<power_limit>N/A</power_limit>
<default_power_limit>N/A</default_power_limit>
<enforced_power_limit>N/A</enforced_power_limit>
<min_power_limit>N/A</min_power_limit>
<max_power_limit>N/A</max_power_limit>
</power_readings>
<clocks>
<graphics_clock>N/A</graphics_clock>
<sm_clock>N/A</sm_clock>
<mem_clock>N/A</mem_clock>
</clocks>
<applications_clocks>
<graphics_clock>N/A</graphics_clock>
<mem_clock>N/A</mem_clock>
</applications_clocks>
<default_applications_clocks>
<graphics_clock>N/A</graphics_clock>
<mem_clock>N/A</mem_clock>
</default_applications_clocks>
<max_clocks>
<graphics_clock>N/A</graphics_clock>
<sm_clock>N/A</sm_clock>
<mem_clock>N/A</mem_clock>
</max_clocks>
<clock_policy>
<auto_boost>N/A</auto_boost>
<auto_boost_default>N/A</auto_boost_default>
</clock_policy>
<supported_clocks>N/A</supported_clocks>
<compute_processes>N/A</compute_processes>
<accounted_processes>N/A</accounted_processes>
</gpu>
</nvidia_smi_log> |
|
Thanks @ktsaou. Pushed some changes to make it properly handle the XML output on your system (some charts don't/can't have any metrics available, so now we don't bother creating them). |
|
It might take a week or two before I can setup a proper dual GPU system to test with, but I'll add that to my list for this week. |
51d3b36 to
327ab7d
Compare
|
@l2isbad can we support a python.d.service like So external commands that can output data repeatedly, this would greatly improve the overhead of such plugins. If you can do this, I suggest to re-implement this ( and #2714 ) using the new service. |
|
Well, you can read from quick try (click)# -*- coding: utf-8 -*-
# Description: nvidia-smi netdata python.d module
# Author: Steven Noonan (tycho)
import copy
import xml.etree.ElementTree as et
try:
from queue import Queue, Empty
except ImportError:
from Queue import Queue, Empty
from subprocess import Popen, PIPE
from threading import Thread
from base import SimpleService
CHART_TEMPLATES = {
'pci_bandwidth': {
'options': [None, 'PCI express bandwidth utilization', 'KB/s', 'nvidia_smi', None, 'area'],
'_metrics': {'pci': ['rx_util', 'tx_util']},
'lines': []
},
'fan_speed': {
'options': [None, 'Fan speed', 'percentage', 'nvidia_smi', None, 'line'],
'_metrics': {None: ['fan_speed']},
'lines': []
},
'utilization': {
'options': [None, 'Resource utilization', 'percentage', 'nvidia_smi', None, 'line'],
'_metrics': {'utilization': ['gpu_util', 'memory_util', 'encoder_util', 'decoder_util']},
'lines': []
},
'mem_usage': {
'options': [None, 'Memory utilization', 'MB', 'nvidia_smi', None, 'line'],
'_metrics': {'fb_memory_usage': ['used']},
'lines': []
},
'temperature': {
'options': [None, 'Temperature', 'Celsius', 'nvidia_smi', None, 'line'],
'_metrics': {'temperature': ['gpu_temp']},
'lines': []
},
'clocks': {
'options': [None, 'Clock frequencies', 'MHz', 'nvidia_smi', None, 'line'],
'_metrics': {'clocks': ['graphics_clock', 'sm_clock', 'mem_clock', 'video_clock']},
'lines': []
},
'power': {
'options': [None, 'Power utilization', 'Watts', 'nvidia_smi', None, 'line'],
'_metrics': {'power_readings': ['power_draw']},
'_divisor': 10,
'lines': []
},
}
LAST_POLL = Queue()
def is_end_of_snapshot(line):
return '</nvidia_smi_log>' in line
def convert_to_xml_obj(data):
if not data:
return None
try:
return et.fromstring(data)
except et.ParseError:
return None
def invoke_nvidia_smi():
p = Popen(['nvidia-smi', '-x', '-q'], stdout=PIPE)
stdout, _ = p.communicate()
return stdout
class Poller(Thread):
def __init__(self, command, break_condition, service):
Thread.__init__(self)
self.command = command
self.break_condition = break_condition
self.service = service
def run(self):
try:
self.process = Popen(self.command.split(), stdout=PIPE, stderr=PIPE)
except (OSError, FileNotFoundError) as error:
self.error(str(error))
return
self.read_nvidia_smi_output()
def read_nvidia_smi_output(self):
output = list()
get_line = self.process.stdout.readline
read_lines_number = 0
while True:
line = get_line().decode()
read_lines_number += 1
if read_lines_number >= self.read_lines_limit:
self.error('After reading {0} lines,'
' the end of "nvidia-smi" output can not be found.'
' Poller thread exiting'. format(read_lines_number))
return
if self.break_condition(line):
output.append(line)
if not LAST_POLL.empty():
LAST_POLL.queue.clear()
LAST_POLL.put(output)
read_lines_number, output = 0, list()
continue
output.append(line)
def __getattr__(self, item):
return getattr(self.service, item)
class Service(SimpleService):
def __init__(self, configuration=None, name=None):
SimpleService.__init__(self, configuration=configuration, name=name)
self.order = list()
self.definitions = dict()
self.assignment = dict()
self.nvidia_smi = self.find_binary('nvidia-smi')
self.read_lines_limit = 1024 # will be redefined in check method
command = '{nvidia_smi} -x -q -l -lms {interval}'.format(nvidia_smi=self.nvidia_smi,
interval=self.update_every * 1000 - 100)
self.data_poller = Poller(command=command,
break_condition=is_end_of_snapshot,
service=self)
def parse_xml_obj(self, xml_obj):
data = dict()
for gpu in xml_obj.findall('gpu'):
gpu_id = self.assignment[gpu.get('id')]['gpu_id']
for template in CHART_TEMPLATES.values():
for root_name, metrics in template['_metrics'].items():
if root_name:
root = gpu.find(root_name)
else:
root = gpu
for metric in metrics:
metric_name = '_'.join([gpu_id, metric])
elem = root.find(metric)
if elem is None:
continue
try:
val = float(elem.text.split()[0])
except ValueError:
continue
val *= float(template.get('_divisor', 1))
data[metric_name] = val
return data or None
def _get_data(self):
try:
raw_data = LAST_POLL.get(timeout=0.2)
except Empty:
return None
if not raw_data:
return None
xml_obj = convert_to_xml_obj(''.join(raw_data))
if not xml_obj:
return None
return self.parse_xml_obj(xml_obj)
def check(self):
if not self.nvidia_smi:
self.error('"nvidia-smi" binary not found')
return None
nvidia_smi_output = invoke_nvidia_smi()
self.read_lines_limit = len(nvidia_smi_output.split()) * 2
smi = convert_to_xml_obj(nvidia_smi_output)
if not smi:
self.error("Failed to invoke 'nvidia-smi'."
" Do you have the proprietary NVIDIA driver and tools installed?")
return False
for i, gpu in enumerate(smi.findall('gpu')):
gpu_id = gpu.get('id')
self.assignment[gpu_id] = dict(gpu_id='gpu{0}'.format(i))
if not self.assignment:
self.error("Could not find any NVIDIA GPUs in nvidia-smi XML output")
return False
data = self.parse_xml_obj(smi)
self.create_charts(data)
self._data_from_check = data
self.data_poller.daemon = True
self.data_poller.start()
return True
def create_charts(self, data):
for chart, template in CHART_TEMPLATES.items():
for name in sorted(self.assignment, key=lambda v: self.assignment[v]['gpu_id']):
assignment = self.assignment[name]
gpu_id = assignment['gpu_id'] # gpu0
chart_name = '_'.join([chart, gpu_id]) # fan_speed_gpu0
chart_def = copy.deepcopy(template)
for root_name, metrics in chart_def['_metrics'].items():
collected_metrics = set('_'.join([gpu_id, metric]) for metric in metrics) & set(data)
if not collected_metrics:
break
self.order.append(chart_name)
for metric in collected_metrics:
direction = -1 if root_name == 'pci' and metric == 'tx_util' else 1
nickname = metric.split('_', 1)[1]
divisor = template.get('_divisor', 1)
chart_def['lines'].append([metric, nickname, 'absolute', direction, divisor])
self.definitions[chart_name] = chart_def |
|
The idea is to specify a special function (break_condition) that determines when we need to finish reading from |
|
I see some issues with the approach you're demonstrating. The first is that we could get out of step with the child process. For instance, let's say we're late at reading the output of the subprocess a few times, and eventually we're behind a whole interval. The subprocess has had enough time to output the https://thraxil.org/users/anders/posts/2008/03/13/Subprocess-Hanging-PIPE-is-your-enemy/ (I ran into this a few times at a previous job. I don't know if it still applies to Python 3.x; we were using Python 2.7.10 or so at the time.) It would seem like a better approach would be to spawn a thread that is constantly attempting to read the output of the subprocess. It could keep hold of the last full snapshot it's read. When our module needs to read an update, it would just pull the last complete snapshot read by the other thread. |
indeed
code above edited |
|
With the |
|
@l2isbad I can't confirm your observations. With the code in the PR, I verified So, my conclusion is that if |
327ab7d to
8bc124e
Compare
|
I just pushed a version demonstrating how I'd implement it. |
|
My version is live here: https://netdata.uplinklabs.net/incubus/#menu_gpu_submenu_nvidia_smi;theme=slate;help=true |
if not LAST_POLL.empty():
LAST_POLL.queue.clear()
LAST_POLL.put(output)Well, it acts differently on my 2 PCs (work NVIDIA-SMI 375.82 GeForce GT 730, home NVIDIA-SMI 375.82 GeForce GTX 960). lgz@lgz-homepc[~]nvidia-smi -q -x -l -lms 1000 > /dev/null |
|
@l2isbad OK, so you intend for |
|
i see try:
r = q.get(timeout=0.2)
except Queue.Empty:
return None |
|
@l2isbad Yeah, that could work. I'd rather not make the main thread wait at all if I can help it though. Another thing we're not yet accounting for is what to do in the Poller if |
|
work (GeForce GT 730) home (GeForce GTX 960) Seem it depends on gpu, number of gpus
we can get number of lines after invoking |
Actually it's must have feature. Try sudo |
4a37b28 to
70d7d08
Compare
d25e9c2 to
0b763e7
Compare
|
I have a very strange case that is identified by this plugin: The PCI Express bandwidth utilization spikes like crazy when I switch tabs at the browser. The lag is visible, even mouse movement is lagging while this happens. hm... can't figure out why... btw, I really like this plugin and I would like to merge it. Some suggestions:
|
|
btw, since we know that nvidia-smi may consume quite some CPU, the plugin could me merged and installed, but disabled by default, so that users would have to edit a config to enable it. |
|
I used to have the graphs broken out per-GPU, but it got really crowded in a box with 4 GPUs in it. I could turn it into a preference or something. I'll see what I can do about the suggestions you have. |
0b763e7 to
90f31b3
Compare
90f31b3 to
15526c4
Compare
15526c4 to
1cadd87
Compare
1cadd87 to
7e39052
Compare
|
is this working? |
|
Yes? |
|
Are we merging this? |
|
Just realized I never got back to this to actually test it. Appears to work correctly on my Laptop (GTX 1080) and my home-server (GeForce 210), including confirming the caveats mentioned in the original description. Both tests were done with the most recent Travis build of the PR. I also agree entirely with all of @ktsaou's suggestions on this, especially disabling by default. |
Signed-off-by: Steven Noonan <steven@uplinklabs.net>
7e39052 to
e78da93
Compare
|
Is this blocked because of subprocess call failing the PR quality review? |
| import threading | ||
| import xml.etree.ElementTree as et | ||
| import time | ||
| from base import SimpleService |
There was a problem hiding this comment.
please use
from bases.FrameworkServices.SimpleService import SimpleServicebase.py was removed
| self.break_condition = break_condition | ||
| self.lock = threading.RLock() | ||
| self.last_time = 0 | ||
| self.last_data = '' |
|
|
||
| return data | ||
|
|
||
| def check(self): |
There was a problem hiding this comment.
check method is super long and hard to read. Please split it into several functions. At least let's move chart creating to a separate function.
Actually no special reason. Since this plugin is requested by people several times already i think we should merge it (disabled by default?). So let's work on this. |
|
This pull request introduces 2 alerts when merging e78da93 into 6ec1605 - view on LGTM.com new alerts:
Comment posted by LGTM.com |
|
This is missing the following items:
The complaints reported by Codacy about use of |
|
This also needs a rebase. |
|
How is this? Not sure I did the rebase correct? |
|
After committing my changes to my fork of the branch I added the netdata/netdata repo as origin and ran: But then it wouldn't let me push without force. |
|
Having to force-push after a rebase is normal. Rebasing rewrites commit history, so it's not a simple case of just putting a few extra commits at the end of the branch, and therefore git is (rightfully) picky about making sure you want to push the results. |
|
@tycho Do you still have interest in getting this merged upstream? If so, I would suggest collaborating with @madhavajay as they appear to have an interest in it. If not, please mention so here so we can close this and @madhavajay can open a PR for it. |
|
If this hits some time limit we can just merge from my branch since it still attributes all of @tycho work on this so I doubt they would care. Im using the unofficial on my servers right now and i'm itching to see it go live and let all the Crypto fiends and AI / ML nerds have at it! 😝 |
|
closed in favor of #4357 |




This is mostly ready for merging, I think, but I'd like to do some iterating and get some feedback on it.
I've been using this on my own machines for a few days without any issues. My dual Xeon box is currently running a TensorFlow learner on the GPU, so you can see this module in action here:
https://netdata.uplinklabs.net/incubus/#menu_gpu_submenu_nvidia_smi;theme=slate;help=true
A couple of screenshots here, in case it's not busy when you're looking:
There are a couple caveats to note. First, I haven't tried this with a multi-GPU box yet, so I'm not sure what the module will do when it sees more than a single GPU. It probably does the wrong thing, but the intent is that it creates a separate set of graphs per GPU.
There are also some caveats with
nvidia-smi.nvidia-smi -qcan occasionally be slightly slower when the GPU is under load:And it can be much slower if the GPU is completely idle, has no X11 session running, and has persistence mode disabled (

nvidia-smi -pm 0):