Implement nvidia smi module#4357
Conversation
| CHART_TEMPLATES = { | ||
| 'pci_bandwidth': { | ||
| 'options': [None, 'PCI express bandwidth utilization', 'KB/s', 'nvidia_smi', 'nvidia_smi', 'area'], | ||
| '_metrics': { 'pci': [ 'rx_util', 'tx_util' ] }, |
There was a problem hiding this comment.
You've got some formatting issues here and a couple of places below too.
This line should look like this:
'_metrics': {'pci': ['rx_util', 'tx_util']},
For future reference, you can easily check for stuff like this by using flake8. The only difference from the default behavior is that we use a line length limit of 120 characters instead of 79 (so you can get behavior identical to what our checking does by just passing --max-line-length=120 to flake8).
There was a problem hiding this comment.
Yes I appreciate formatting, I have recently switched to vscode and dont have all my PEP8 stuff setup yet.
|
|
||
| class Service(SimpleService): | ||
| def __init__(self, configuration=None, name=None): | ||
| SimpleService.__init__(self, configuration=configuration, name=name) |
There was a problem hiding this comment.
Please update this to use super() like so:
super().__init__(configuration=configuration,name=name)
|
This pull request introduces 1 alert when merging 485b2dd into aab4285 - view on LGTM.com new alerts:
Comment posted by LGTM.com |
|
Please add an entry to |
Ferroin
left a comment
There was a problem hiding this comment.
The duplicate variable issue reported by LGTM, while technically wrong, should still be resolved.
|
@ktsaou Runtime on this is pretty high. About 50-70ms when the GPU is under only light load, roughly 80-100ms under heavy load, and well over 150ms when there is nothing using it. Do you think we should have it disabled by default or just bump up the collection interval to a couple of seconds? |
|
@Ferroin I checked one of my servers and I think anything less than 1 second might lose too much granularity, the tool actually allows for ms level polling. |
|
This pull request introduces 1 alert when merging 5d40632 into aab4285 - view on LGTM.com new alerts:
Comment posted by LGTM.com |
| CHART_TEMPLATES = { | ||
| 'pci_bandwidth': { | ||
| 'options': [None, 'PCI express bandwidth utilization', 'KB/s', | ||
| 'nvidia_smi', 'nvidia_smi', 'area'], |
There was a problem hiding this comment.
A lot of (actually, I think all of) the lines you reformatted as two lines were fine as is. While PEP 8 itself insists on a 79 character line length limit, we deviate from that and limit it to at most 120 characters instead (most people have big enough screens these days that 120 still fits perfectly fine).
Sorry about any confusion regarding this, we're still in the process of getting the rest of our existing code reformatted to be compliant with our coding standards, and we don't yet have thing set up to show nice notifications about whether the code meets the coding standards or not.
There was a problem hiding this comment.
Also, just to clarify, it's fine to merge this as it currently is (assuming of course that @l2isbad agrees), I just wanted to mention that this particular case of reformatting wasn't entirely necessary.
There was a problem hiding this comment.
I personally still argue for pep8 80 char for split pane etc etc, but I understand the desire for consistency. Brb.
There was a problem hiding this comment.
Done. 👍 I left the small changes to code because I dont think theres any advantage to overly long nested and complex lines of code, but for array literals and strings let them breath!
|
This pull request introduces 1 alert when merging e8320c4 into aab4285 - view on LGTM.com new alerts:
Comment posted by LGTM.com |
|
How are we looking? Do I need an ignore comment for these subprocess / xml checks? |
|
Yeah, the stuff reported by Codacy can be ignored. All the stuff they're complaining about is an unfortunate requirement for working with |
|
At this point, it just needs approving reviews from @l2isbad and @ktsaou, and ideally a concrete decision as to whether or not it should be enabled by default. Given the amount of time it takes to run each cycle and the fact that |
|
|
Its weird because GPUStat just wraps nvidia-smi but its faster at least for similar output: Also they mention a nvidia-smi daemon mode? |
|
Maybe a version 2 could be to investigate why gpustat is faster and use it instead. |
|
What matters here isn't as much exactly how long Total run-time for each cycle is tracked by Netdata itself and reported in the 'Netdata Monitoring' section at the bottom of the dashboard. There's a sub-section there for the python modules which has one graph per module which tracks how long it takes from the start of data collection each cycle until data actually gets reported. Based on what I'm seeing here in that section, and what was included with the original PR< it looks like the absolute minimum amount of time it takes for a single collection cycle is somewhere around 40ms on decent hardware, which is not horrible, but that value goes up when the GPU is under load (running a CUDA task on my laptop's GTX 1050 bumps the times up to around 100ms), and calling it when the GPU is completely idle (and therefore likely to enter low power states) makes the collection time skyrocket to somewhere around a quarter of a second. This wouldn't be much of a problem, except that because of how |
|
@Ferroin Do you think thats because of this?: |
|
Also, last thing we want is to poll the nvidia-smi tool and have it wake up peoples asleep discreet laptop GPUs when they're supposed to be in power saving mode, if that is why it takes so long to load up. But then... surely NetData is for more for Servers out of the box? |
|
i see @tycho still not answered @madhavajay |
That's probably part of it, but I'd be willing to bet that it also involves the device transitioning to a low power state. I'm pretty sure just about all modern NVIDIA cards support PCI ASPM, and transitioning out of L1 there is usually a very high-latency operation. Even coming out of L0s can be pretty bad on some controllers.
As far as stuff that's likely to be using this plugin, yes. That said, I do run Netdata on my laptop just for a number of the alarms, though said laptop only has a discrete GPU (it's got a nice Core i7 CPU, but the iGPU is hard-disabled by firmware and never gets power or a clock signal after the system finishes POST). |
|
Not really sure what the next steps are here? |
|
@madhavajay current implementation has no protection from memory leak in case of |
|
FYI it looks like there are more programmatic ways of doing this than by shelling out to https://github.com/Syllo/nvtop/blob/master/src/extract_gpuinfo.c I'd recommend implementing the NVIDIA GPU module with NVML. It'd definitely have lower overhead and feel like less of a hacky implementation. |
|
Also looks like they have an official python library: |
|
@l2isbad please check this. |
|
@madhavajay hi |
|
@l2isbad sorry was away, im not sure what to do here, any hints? Is there another plugin which deals with something similar to look at? |
|
@madhavajay we have to implement some protection from it. How? if i remember it right while True:
line = self.process.stdout.readline().decode()
lines.append(line)
if self.break_condition(line):
self.lock.acquire()
self.last_data = '\n'.join(lines)
self.last_time = time.time()
self.lock.release()
returnIf this occurs we need stop poller thread. |
|
@l2isbad okay got it, I will take a look when I have a spare moment and find a way to gracefully handle the error. Thanks. |
|
i rewrote click me!
class NvidiaSMI:
def __init__(self):
self.command = find_binary(NVIDIA_SMI)
self.active_proc = None
def __bool__(self):
return bool(self.command)
def run_once(self):
proc = subprocess.Popen([self.command, '-x', '-q'], stdout=subprocess.PIPE)
stdout, _ = proc.communicate()
return stdout
def run_loop(self, interval):
if self.active_proc:
self.kill()
proc = subprocess.Popen([self.command, '-x', '-q', '-l', str(interval)], stdout=subprocess.PIPE)
self.active_proc = proc
return proc.stdout
def kill(self):
if self.active_proc:
self.active_proc.kill()
self.active_proc = None
class Poller(threading.Thread):
def __init__(self, command, break_condition, poll_interval):
threading.Thread.__init__(self)
self.daemon = True
self.command = command
self.break_condition = break_condition
self.interval = int(poll_interval)
self.lock = threading.RLock()
self.last_data = str()
self.exit = False
self.empty_rows = 0
self.rows = list()
def run(self):
stdout = self.command.run_loop(self.interval)
for row in stdout:
if self.exit or self.empty_rows > EMPTY_LIMIT:
break
self.process_row(row)
self.command.kill()
def process_row(self, row):
row = row.decode()
self.empty_rows += (row == EMPTY)
self.rows.append(row)
if self.break_condition(row):
self.lock.acquire()
self.last_data = '\n'.join(self.rows)
self.lock.release()
self.rows.clear()
self.empty_rows = 0
def is_started(self):
return self.ident is not None
def shutdown(self):
self.exit = True
def data(self):
self.lock.acquire()
data = self.last_data
self.lock.release()
return databtw create charts part and get_data are hard to read/follow, i suggest to simplify them. If you need help feel free to ask |
|
The module has been developing for ages 😄 |
9134341 to
70de864
Compare
|
hey @madhavajay |
|
@l2isbad yes please. Unfortunately I had spare time back when I started but just lately I haven't had a chance. If you can finish it that would be awesome! 👍 |
|
ok, closing this in favor of #4589 |
|
#4589 merged |
|
I added a native implementation in golang here netdata/go.d.plugin#318 |
In the interest of speeding things up this is my PR meeting the requirements of this earlier PR from @tycho and their awesome work on this plugin.
#2759