Implement nvidia smi module by madhavajay · Pull Request #4357 · netdata/netdata

madhavajay · 2018-10-04T15:04:38Z

In the interest of speeding things up this is my PR meeting the requirements of this earlier PR from @tycho and their awesome work on this plugin.
#2759

CLAassistant · 2018-10-04T15:04:47Z

All committers have signed the CLA.

Ferroin · 2018-10-04T15:14:27Z

+CHART_TEMPLATES = {
+    'pci_bandwidth': {
+        'options': [None, 'PCI express bandwidth utilization', 'KB/s', 'nvidia_smi', 'nvidia_smi', 'area'],
+        '_metrics': { 'pci': [ 'rx_util', 'tx_util' ] },


You've got some formatting issues here and a couple of places below too.

This line should look like this:

'_metrics': {'pci': ['rx_util', 'tx_util']},

For future reference, you can easily check for stuff like this by using flake8. The only difference from the default behavior is that we use a line length limit of 120 characters instead of 79 (so you can get behavior identical to what our checking does by just passing --max-line-length=120 to flake8).

Yes I appreciate formatting, I have recently switched to vscode and dont have all my PEP8 stuff setup yet.

Ferroin · 2018-10-04T15:15:26Z

+
+class Service(SimpleService):
+    def __init__(self, configuration=None, name=None):
+        SimpleService.__init__(self, configuration=configuration, name=name)


Please update this to use super() like so:

super().__init__(configuration=configuration,name=name)

netdatabot · 2018-10-04T15:19:06Z

This pull request introduces 1 alert when merging 485b2dd into aab4285 - view on LGTM.com

new alerts:

1 for Variable defined multiple times

Comment posted by LGTM.com

Ferroin · 2018-10-04T15:19:26Z

Please add an entry to python.d/README.md for the plugin as well. You can largely just copy the style of stuff there directly, it should provide info about what data is collected and how to configure the plugin, as well as mentioning the performance implications (runs really slow if the GPU is completely idle, runs slightly slower than usual when GPU is under heavy load).

Ferroin

The duplicate variable issue reported by LGTM, while technically wrong, should still be resolved.

Ferroin · 2018-10-04T15:24:40Z

@ktsaou Runtime on this is pretty high. About 50-70ms when the GPU is under only light load, roughly 80-100ms under heavy load, and well over 150ms when there is nothing using it. Do you think we should have it disabled by default or just bump up the collection interval to a couple of seconds?

madhavajay · 2018-10-04T15:55:46Z

@Ferroin I checked one of my servers and I think anything less than 1 second might lose too much granularity, the tool actually allows for ms level polling.
https://imgur.com/a/4dlVvl7

netdatabot · 2018-10-04T16:43:31Z

This pull request introduces 1 alert when merging 5d40632 into aab4285 - view on LGTM.com

new alerts:

1 for Unused local variable

Comment posted by LGTM.com

Ferroin · 2018-10-04T17:16:37Z

+CHART_TEMPLATES = {
+    'pci_bandwidth': {
+        'options': [None, 'PCI express bandwidth utilization', 'KB/s',
+                    'nvidia_smi', 'nvidia_smi', 'area'],


A lot of (actually, I think all of) the lines you reformatted as two lines were fine as is. While PEP 8 itself insists on a 79 character line length limit, we deviate from that and limit it to at most 120 characters instead (most people have big enough screens these days that 120 still fits perfectly fine).

Sorry about any confusion regarding this, we're still in the process of getting the rest of our existing code reformatted to be compliant with our coding standards, and we don't yet have thing set up to show nice notifications about whether the code meets the coding standards or not.

Also, just to clarify, it's fine to merge this as it currently is (assuming of course that @l2isbad agrees), I just wanted to mention that this particular case of reformatting wasn't entirely necessary.

I personally still argue for pep8 80 char for split pane etc etc, but I understand the desire for consistency. Brb.

Done. 👍 I left the small changes to code because I dont think theres any advantage to overly long nested and complex lines of code, but for array literals and strings let them breath!

netdatabot · 2018-10-04T17:37:02Z

This pull request introduces 1 alert when merging e8320c4 into aab4285 - view on LGTM.com

new alerts:

1 for Unused local variable

Comment posted by LGTM.com

madhavajay · 2018-10-04T18:05:39Z

How are we looking? Do I need an ignore comment for these subprocess / xml checks?

Ferroin · 2018-10-04T18:43:22Z

Yeah, the stuff reported by Codacy can be ignored. All the stuff they're complaining about is an unfortunate requirement for working with nvidia-smi.

Ferroin · 2018-10-04T18:47:25Z

At this point, it just needs approving reviews from @l2isbad and @ktsaou, and ideally a concrete decision as to whether or not it should be enabled by default. Given the amount of time it takes to run each cycle and the fact that nvidia-smi appears to be installed on pretty much any system that's using the proprietary NVIDIA drivers, I'm inclined to say it probably should be disabled by default, but let's wait to hear what @l2isbad and @ktsaou think regarding that.

madhavajay · 2018-10-04T18:54:20Z

user@box-1:~$ time nvidia-smi
Thu Oct  4 18:53:39 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P0    50W / 300W |  15355MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   37C    P0    36W / 300W |  15349MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:06.0 Off |                    0 |
| N/A   40C    P0    46W / 300W |  15350MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   40C    P0    59W / 300W |  15350MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2354      G   /usr/lib/xorg/Xorg                            98MiB |
|    0      9035      C   .../virtualenvs/tools-yyZS7auo/bin/python3 15245MiB |
|    1      9035      C   .../virtualenvs/tools-yyZS7auo/bin/python3 15339MiB |
|    2      9035      C   .../virtualenvs/tools-yyZS7auo/bin/python3 15339MiB |
|    3      9035      C   .../virtualenvs/tools-yyZS7auo/bin/python3 15339MiB |
+-----------------------------------------------------------------------------+

real	0m0.381s
user	0m0.000s
sys	0m0.304s

user@box-1:~$ time gpustat
box-1  Thu Oct  4 18:53:07 2018
[0] Tesla V100-SXM2-16GB | 41'C,   0 % | 15355 / 16130 MB | user(15245M) root(98M)
[1] Tesla V100-SXM2-16GB | 37'C,   0 % | 15349 / 16130 MB | user(15339M)
[2] Tesla V100-SXM2-16GB | 40'C,   0 % | 15350 / 16130 MB | user(15339M)
[3] Tesla V100-SXM2-16GB | 40'C,   0 % | 15350 / 16130 MB | user(15339M)

real	0m0.163s
user	0m0.111s
sys	0m0.033s

madhavajay · 2018-10-04T19:02:53Z

Its weird because GPUStat just wraps nvidia-smi but its faster at least for similar output:
https://github.com/wookayin/gpustat

Also they mention a nvidia-smi daemon mode?
But from all i can find thats just running:

$ sudo nvidia-smi --loop=1 --filename=/var/log/nvidia-smi.log &

madhavajay · 2018-10-04T19:03:16Z

Maybe a version 2 could be to investigate why gpustat is faster and use it instead.

Ferroin · 2018-10-04T19:06:55Z

What matters here isn't as much exactly how long nvidia-smi itself takes to run, but the combination of that and how long it takes the module to parse the data.

Total run-time for each cycle is tracked by Netdata itself and reported in the 'Netdata Monitoring' section at the bottom of the dashboard. There's a sub-section there for the python modules which has one graph per module which tracks how long it takes from the start of data collection each cycle until data actually gets reported.

Based on what I'm seeing here in that section, and what was included with the original PR< it looks like the absolute minimum amount of time it takes for a single collection cycle is somewhere around 40ms on decent hardware, which is not horrible, but that value goes up when the GPU is under load (running a CUDA task on my laptop's GTX 1050 bumps the times up to around 100ms), and calling it when the GPU is completely idle (and therefore likely to enter low power states) makes the collection time skyrocket to somewhere around a quarter of a second. This wouldn't be much of a problem, except that because of how python.d.plugin works, one module taking a long time to collect data can delay data collection for other modules.

madhavajay · 2018-10-04T20:16:46Z

@Ferroin Do you think thats because of this?:
The NVIDIA linux driver automatically unloads driver libraries and releases state resources after a period of inactivity when no client (be that the X server, a user space application, or nvidia-smi itself) is connected to the driver. The unloading causes a loss of driver state including compute mode settings.

https://devtalk.nvidia.com/default/topic/476424/nvidia-smi-how-to-make-compute-mode-permanent-compute-mode-reverts-to-0-after-reboot/

madhavajay · 2018-10-04T21:26:40Z

Also, last thing we want is to poll the nvidia-smi tool and have it wake up peoples asleep discreet laptop GPUs when they're supposed to be in power saving mode, if that is why it takes so long to load up. But then... surely NetData is for more for Servers out of the box?

ilyam8 · 2018-10-05T02:33:05Z

i see @tycho still not answered
#2759 (comment)

@madhavajay
do a sudo kill $(pidof nvidia-smi) and watch for a memory usage of python.d

Ferroin · 2018-10-05T12:26:21Z

The NVIDIA linux driver automatically unloads driver libraries and releases state resources after a period of inactivity when no client (be that the X server, a user space application, or nvidia-smi itself) is connected to the driver. The unloading causes a loss of driver state including compute mode settings.

That's probably part of it, but I'd be willing to bet that it also involves the device transitioning to a low power state. I'm pretty sure just about all modern NVIDIA cards support PCI ASPM, and transitioning out of L1 there is usually a very high-latency operation. Even coming out of L0s can be pretty bad on some controllers.

Also, last thing we want is to poll the nvidia-smi tool and have it wake up peoples asleep discreet laptop GPUs when they're supposed to be in power saving mode, if that is why it takes so long to load up. But then... surely NetData is for more for Servers out of the box?

As far as stuff that's likely to be using this plugin, yes. That said, I do run Netdata on my laptop just for a number of the alarms, though said laptop only has a discrete GPU (it's got a nice Core i7 CPU, but the iGPU is hard-disabled by firmware and never gets power or a clock signal after the system finishes POST).

madhavajay · 2018-10-07T21:16:23Z

Not really sure what the next steps are here?

ilyam8 · 2018-10-08T14:06:52Z

@madhavajay current implementation has no protection from memory leak in case of nvidia_smi die (i check it with sudo kill $(pidof nvidia-smi))

tycho · 2018-10-08T16:20:37Z

FYI it looks like there are more programmatic ways of doing this than by shelling out to nvidia-smi:

https://github.com/Syllo/nvtop/blob/master/src/extract_gpuinfo.c
https://developer.nvidia.com/nvidia-management-library-nvml

I'd recommend implementing the NVIDIA GPU module with NVML. It'd definitely have lower overhead and feel like less of a hacky implementation.

madhavajay · 2018-10-08T16:23:07Z

Also looks like they have an official python library:
https://pypi.org/project/nvidia-ml-py/

ktsaou · 2018-10-22T21:49:59Z

@l2isbad please check this.

ilyam8 · 2018-10-23T02:09:09Z

@madhavajay hi
let's start from
#4357 (comment)

madhavajay · 2018-10-29T03:17:20Z

@l2isbad sorry was away, im not sure what to do here, any hints? Is there another plugin which deals with something similar to look at?

ilyam8 · 2018-10-29T06:58:38Z

@madhavajay we have to implement some protection from it. How?
I suggest to see what exactly happens when nvidia-smi process dies, why is memory leak occurs.
If we know the reason it will be easy to handle the problem.

if i remember it right self.process.stdout.readline().decode() indefinitely returns empty lines after nvidia process dies.

        while True:
            line = self.process.stdout.readline().decode()
            lines.append(line)
            if self.break_condition(line):
                self.lock.acquire()
                self.last_data = '\n'.join(lines)
                self.last_time = time.time()
                self.lock.release()
                return

If this occurs we need stop poller thread.

madhavajay · 2018-10-31T15:07:06Z

@l2isbad okay got it, I will take a look when I have a spare moment and find a way to gracefully handle the error. Thanks.

ilyam8 · 2018-10-31T15:30:24Z

i rewrote nvidia_smi invoke part and poller, take a look.
use it if you find it useful

click me!

class NvidiaSMI:
    def __init__(self):
        self.command = find_binary(NVIDIA_SMI)
        self.active_proc = None

    def __bool__(self):
        return bool(self.command)

    def run_once(self):
        proc = subprocess.Popen([self.command, '-x', '-q'], stdout=subprocess.PIPE)
        stdout, _ = proc.communicate()
        return stdout

    def run_loop(self, interval):
        if self.active_proc:
            self.kill()
        proc = subprocess.Popen([self.command, '-x', '-q', '-l', str(interval)], stdout=subprocess.PIPE)
        self.active_proc = proc
        return proc.stdout

    def kill(self):
        if self.active_proc:
            self.active_proc.kill()
            self.active_proc = None


class Poller(threading.Thread):
    def __init__(self, command, break_condition, poll_interval):
        threading.Thread.__init__(self)
        self.daemon = True

        self.command = command
        self.break_condition = break_condition
        self.interval = int(poll_interval)

        self.lock = threading.RLock()
        self.last_data = str()
        self.exit = False
        self.empty_rows = 0
        self.rows = list()

    def run(self):
        stdout = self.command.run_loop(self.interval)

        for row in stdout:
            if self.exit or self.empty_rows > EMPTY_LIMIT:
                break
            self.process_row(row)
        self.command.kill()

    def process_row(self, row):
        row = row.decode()
        self.empty_rows += (row == EMPTY)
        self.rows.append(row)

        if self.break_condition(row):
            self.lock.acquire()
            self.last_data = '\n'.join(self.rows)
            self.lock.release()

            self.rows.clear()
            self.empty_rows = 0

    def is_started(self):
        return self.ident is not None

    def shutdown(self):
        self.exit = True

    def data(self):
        self.lock.acquire()
        data = self.last_data
        self.lock.release()
        return data

btw create charts part and get_data are hard to read/follow, i suggest to simplify them.

If you need help feel free to ask

ilyam8 · 2018-10-31T15:43:23Z

The module has been developing for ages 😄
We need to finish it.

ilyam8 · 2018-11-08T00:05:05Z

hey @madhavajay
We are interested in this being merged. If you don't have spare time to finish the PR, i can do it. What do you think?

madhavajay · 2018-11-08T23:10:52Z

@l2isbad yes please. Unfortunately I had spare time back when I started but just lately I haven't had a chance. If you can finish it that would be awesome! 👍

ktsaou · 2018-11-10T01:26:33Z

ok, closing this in favor of #4589

ilyam8 · 2018-11-12T14:47:50Z

#4589 merged

Ehekatl · 2020-01-09T18:46:55Z

I added a native implementation in golang here netdata/go.d.plugin#318

madhavajay requested review from Ferroin, ilyam8 and ktsaou as code owners October 4, 2018 15:04

Ferroin reviewed Oct 4, 2018

View reviewed changes

Ferroin requested changes Oct 4, 2018

View reviewed changes

Ferroin reviewed Oct 4, 2018

View reviewed changes

Ferroin previously approved these changes Oct 4, 2018

View reviewed changes

ilyam8 mentioned this pull request Nov 1, 2018

RFC: Implement nvidia_smi module #2759

Closed

paulfantom force-pushed the master branch 8 times, most recently from 9134341 to 70de864 Compare November 2, 2018 13:12

ilyam8 added the module/python label Nov 6, 2018

ilyam8 mentioned this pull request Nov 9, 2018

nvidia_smi: init version added #4589

Merged

ktsaou closed this Nov 10, 2018

paulfantom added area/collectors Everything related to data collection area/docs area/web no changelog Issues which are not going to be added to changelog and removed no changelog Issues which are not going to be added to changelog labels Nov 29, 2018

Conversation

madhavajay commented Oct 4, 2018 • edited by prologic Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CLAassistant commented Oct 4, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Ferroin Oct 4, 2018

Choose a reason for hiding this comment

Uh oh!

madhavajay Oct 4, 2018

Choose a reason for hiding this comment

Uh oh!

Ferroin Oct 4, 2018

Choose a reason for hiding this comment

Uh oh!

netdatabot commented Oct 4, 2018

Uh oh!

Ferroin commented Oct 4, 2018

Uh oh!

Ferroin left a comment

Choose a reason for hiding this comment

Uh oh!

Ferroin commented Oct 4, 2018

Uh oh!

madhavajay commented Oct 4, 2018

Uh oh!

netdatabot commented Oct 4, 2018

Uh oh!

Ferroin Oct 4, 2018

Choose a reason for hiding this comment

Uh oh!

Ferroin Oct 4, 2018

Choose a reason for hiding this comment

Uh oh!

madhavajay Oct 4, 2018

Choose a reason for hiding this comment

Uh oh!

madhavajay Oct 4, 2018

Choose a reason for hiding this comment

Uh oh!

netdatabot commented Oct 4, 2018

Uh oh!

madhavajay commented Oct 4, 2018

Uh oh!

Ferroin commented Oct 4, 2018

Uh oh!

Ferroin commented Oct 4, 2018

Uh oh!

madhavajay commented Oct 4, 2018

Uh oh!

madhavajay commented Oct 4, 2018

Uh oh!

madhavajay commented Oct 4, 2018

Uh oh!

Ferroin commented Oct 4, 2018

Uh oh!

madhavajay commented Oct 4, 2018

Uh oh!

madhavajay commented Oct 4, 2018

Uh oh!

ilyam8 commented Oct 5, 2018

Uh oh!

Ferroin commented Oct 5, 2018

Uh oh!

madhavajay commented Oct 7, 2018

Uh oh!

ilyam8 commented Oct 8, 2018

Uh oh!

tycho commented Oct 8, 2018

Uh oh!

madhavajay commented Oct 8, 2018

Uh oh!

ktsaou commented Oct 22, 2018

Uh oh!

ilyam8 commented Oct 23, 2018

Uh oh!

madhavajay commented Oct 29, 2018

Uh oh!

ilyam8 commented Oct 29, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

madhavajay commented Oct 4, 2018 •

edited by prologic

Loading

CLAassistant commented Oct 4, 2018 •

edited

Loading

ilyam8 commented Oct 29, 2018 •

edited

Loading

ilyam8 commented Oct 31, 2018 •

edited

Loading

ilyam8 commented Oct 31, 2018 •

edited

Loading