Skip to content

Implement nvidia smi module#4357

Closed
madhavajay wants to merge 8 commits into
netdata:masterfrom
madhavajay:implement-nvidia-smi-module
Closed

Implement nvidia smi module#4357
madhavajay wants to merge 8 commits into
netdata:masterfrom
madhavajay:implement-nvidia-smi-module

Conversation

@madhavajay
Copy link
Copy Markdown

@madhavajay madhavajay commented Oct 4, 2018

In the interest of speeding things up this is my PR meeting the requirements of this earlier PR from @tycho and their awesome work on this plugin.
#2759

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Oct 4, 2018

CLA assistant check
All committers have signed the CLA.

Comment thread python.d/nvidia_smi.chart.py Outdated
CHART_TEMPLATES = {
'pci_bandwidth': {
'options': [None, 'PCI express bandwidth utilization', 'KB/s', 'nvidia_smi', 'nvidia_smi', 'area'],
'_metrics': { 'pci': [ 'rx_util', 'tx_util' ] },
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've got some formatting issues here and a couple of places below too.

This line should look like this:

'_metrics': {'pci': ['rx_util', 'tx_util']},

For future reference, you can easily check for stuff like this by using flake8. The only difference from the default behavior is that we use a line length limit of 120 characters instead of 79 (so you can get behavior identical to what our checking does by just passing --max-line-length=120 to flake8).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I appreciate formatting, I have recently switched to vscode and dont have all my PEP8 stuff setup yet.

Comment thread python.d/nvidia_smi.chart.py Outdated

class Service(SimpleService):
def __init__(self, configuration=None, name=None):
SimpleService.__init__(self, configuration=configuration, name=name)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update this to use super() like so:

super().__init__(configuration=configuration,name=name)

@netdatabot
Copy link
Copy Markdown
Member

This pull request introduces 1 alert when merging 485b2dd into aab4285 - view on LGTM.com

new alerts:

  • 1 for Variable defined multiple times

Comment posted by LGTM.com

@Ferroin
Copy link
Copy Markdown
Member

Ferroin commented Oct 4, 2018

Please add an entry to python.d/README.md for the plugin as well. You can largely just copy the style of stuff there directly, it should provide info about what data is collected and how to configure the plugin, as well as mentioning the performance implications (runs really slow if the GPU is completely idle, runs slightly slower than usual when GPU is under heavy load).

Copy link
Copy Markdown
Member

@Ferroin Ferroin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The duplicate variable issue reported by LGTM, while technically wrong, should still be resolved.

@Ferroin
Copy link
Copy Markdown
Member

Ferroin commented Oct 4, 2018

@ktsaou Runtime on this is pretty high. About 50-70ms when the GPU is under only light load, roughly 80-100ms under heavy load, and well over 150ms when there is nothing using it. Do you think we should have it disabled by default or just bump up the collection interval to a couple of seconds?

@madhavajay
Copy link
Copy Markdown
Author

@Ferroin I checked one of my servers and I think anything less than 1 second might lose too much granularity, the tool actually allows for ms level polling.
https://imgur.com/a/4dlVvl7

@netdatabot
Copy link
Copy Markdown
Member

This pull request introduces 1 alert when merging 5d40632 into aab4285 - view on LGTM.com

new alerts:

  • 1 for Unused local variable

Comment posted by LGTM.com

Comment thread python.d/nvidia_smi.chart.py Outdated
CHART_TEMPLATES = {
'pci_bandwidth': {
'options': [None, 'PCI express bandwidth utilization', 'KB/s',
'nvidia_smi', 'nvidia_smi', 'area'],
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of (actually, I think all of) the lines you reformatted as two lines were fine as is. While PEP 8 itself insists on a 79 character line length limit, we deviate from that and limit it to at most 120 characters instead (most people have big enough screens these days that 120 still fits perfectly fine).

Sorry about any confusion regarding this, we're still in the process of getting the rest of our existing code reformatted to be compliant with our coding standards, and we don't yet have thing set up to show nice notifications about whether the code meets the coding standards or not.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, just to clarify, it's fine to merge this as it currently is (assuming of course that @l2isbad agrees), I just wanted to mention that this particular case of reformatting wasn't entirely necessary.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally still argue for pep8 80 char for split pane etc etc, but I understand the desire for consistency. Brb.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. 👍 I left the small changes to code because I dont think theres any advantage to overly long nested and complex lines of code, but for array literals and strings let them breath!

@netdatabot
Copy link
Copy Markdown
Member

This pull request introduces 1 alert when merging e8320c4 into aab4285 - view on LGTM.com

new alerts:

  • 1 for Unused local variable

Comment posted by LGTM.com

@madhavajay
Copy link
Copy Markdown
Author

How are we looking? Do I need an ignore comment for these subprocess / xml checks?

@Ferroin
Copy link
Copy Markdown
Member

Ferroin commented Oct 4, 2018

Yeah, the stuff reported by Codacy can be ignored. All the stuff they're complaining about is an unfortunate requirement for working with nvidia-smi.

Ferroin
Ferroin previously approved these changes Oct 4, 2018
@Ferroin
Copy link
Copy Markdown
Member

Ferroin commented Oct 4, 2018

At this point, it just needs approving reviews from @l2isbad and @ktsaou, and ideally a concrete decision as to whether or not it should be enabled by default. Given the amount of time it takes to run each cycle and the fact that nvidia-smi appears to be installed on pretty much any system that's using the proprietary NVIDIA drivers, I'm inclined to say it probably should be disabled by default, but let's wait to hear what @l2isbad and @ktsaou think regarding that.

@madhavajay
Copy link
Copy Markdown
Author

user@box-1:~$ time nvidia-smi
Thu Oct  4 18:53:39 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P0    50W / 300W |  15355MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   37C    P0    36W / 300W |  15349MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:06.0 Off |                    0 |
| N/A   40C    P0    46W / 300W |  15350MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   40C    P0    59W / 300W |  15350MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2354      G   /usr/lib/xorg/Xorg                            98MiB |
|    0      9035      C   .../virtualenvs/tools-yyZS7auo/bin/python3 15245MiB |
|    1      9035      C   .../virtualenvs/tools-yyZS7auo/bin/python3 15339MiB |
|    2      9035      C   .../virtualenvs/tools-yyZS7auo/bin/python3 15339MiB |
|    3      9035      C   .../virtualenvs/tools-yyZS7auo/bin/python3 15339MiB |
+-----------------------------------------------------------------------------+

real	0m0.381s
user	0m0.000s
sys	0m0.304s
user@box-1:~$ time gpustat
box-1  Thu Oct  4 18:53:07 2018
[0] Tesla V100-SXM2-16GB | 41'C,   0 % | 15355 / 16130 MB | user(15245M) root(98M)
[1] Tesla V100-SXM2-16GB | 37'C,   0 % | 15349 / 16130 MB | user(15339M)
[2] Tesla V100-SXM2-16GB | 40'C,   0 % | 15350 / 16130 MB | user(15339M)
[3] Tesla V100-SXM2-16GB | 40'C,   0 % | 15350 / 16130 MB | user(15339M)

real	0m0.163s
user	0m0.111s
sys	0m0.033s

@madhavajay
Copy link
Copy Markdown
Author

Its weird because GPUStat just wraps nvidia-smi but its faster at least for similar output:
https://github.com/wookayin/gpustat

Also they mention a nvidia-smi daemon mode?
But from all i can find thats just running:

$ sudo nvidia-smi --loop=1 --filename=/var/log/nvidia-smi.log &

@madhavajay
Copy link
Copy Markdown
Author

Maybe a version 2 could be to investigate why gpustat is faster and use it instead.

@Ferroin
Copy link
Copy Markdown
Member

Ferroin commented Oct 4, 2018

What matters here isn't as much exactly how long nvidia-smi itself takes to run, but the combination of that and how long it takes the module to parse the data.

Total run-time for each cycle is tracked by Netdata itself and reported in the 'Netdata Monitoring' section at the bottom of the dashboard. There's a sub-section there for the python modules which has one graph per module which tracks how long it takes from the start of data collection each cycle until data actually gets reported.

Based on what I'm seeing here in that section, and what was included with the original PR< it looks like the absolute minimum amount of time it takes for a single collection cycle is somewhere around 40ms on decent hardware, which is not horrible, but that value goes up when the GPU is under load (running a CUDA task on my laptop's GTX 1050 bumps the times up to around 100ms), and calling it when the GPU is completely idle (and therefore likely to enter low power states) makes the collection time skyrocket to somewhere around a quarter of a second. This wouldn't be much of a problem, except that because of how python.d.plugin works, one module taking a long time to collect data can delay data collection for other modules.

@madhavajay
Copy link
Copy Markdown
Author

@Ferroin Do you think thats because of this?:
The NVIDIA linux driver automatically unloads driver libraries and releases state resources after a period of inactivity when no client (be that the X server, a user space application, or nvidia-smi itself) is connected to the driver. The unloading causes a loss of driver state including compute mode settings.

https://devtalk.nvidia.com/default/topic/476424/nvidia-smi-how-to-make-compute-mode-permanent-compute-mode-reverts-to-0-after-reboot/

@madhavajay
Copy link
Copy Markdown
Author

Also, last thing we want is to poll the nvidia-smi tool and have it wake up peoples asleep discreet laptop GPUs when they're supposed to be in power saving mode, if that is why it takes so long to load up. But then... surely NetData is for more for Servers out of the box?

@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented Oct 5, 2018

i see @tycho still not answered
#2759 (comment)

@madhavajay
do a sudo kill $(pidof nvidia-smi) and watch for a memory usage of python.d

@Ferroin
Copy link
Copy Markdown
Member

Ferroin commented Oct 5, 2018

The NVIDIA linux driver automatically unloads driver libraries and releases state resources after a period of inactivity when no client (be that the X server, a user space application, or nvidia-smi itself) is connected to the driver. The unloading causes a loss of driver state including compute mode settings.

That's probably part of it, but I'd be willing to bet that it also involves the device transitioning to a low power state. I'm pretty sure just about all modern NVIDIA cards support PCI ASPM, and transitioning out of L1 there is usually a very high-latency operation. Even coming out of L0s can be pretty bad on some controllers.

Also, last thing we want is to poll the nvidia-smi tool and have it wake up peoples asleep discreet laptop GPUs when they're supposed to be in power saving mode, if that is why it takes so long to load up. But then... surely NetData is for more for Servers out of the box?

As far as stuff that's likely to be using this plugin, yes. That said, I do run Netdata on my laptop just for a number of the alarms, though said laptop only has a discrete GPU (it's got a nice Core i7 CPU, but the iGPU is hard-disabled by firmware and never gets power or a clock signal after the system finishes POST).

@madhavajay
Copy link
Copy Markdown
Author

Not really sure what the next steps are here?

@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented Oct 8, 2018

@madhavajay current implementation has no protection from memory leak in case of nvidia_smi die (i check it with sudo kill $(pidof nvidia-smi))

@tycho
Copy link
Copy Markdown
Contributor

tycho commented Oct 8, 2018

FYI it looks like there are more programmatic ways of doing this than by shelling out to nvidia-smi:

https://github.com/Syllo/nvtop/blob/master/src/extract_gpuinfo.c
https://developer.nvidia.com/nvidia-management-library-nvml

I'd recommend implementing the NVIDIA GPU module with NVML. It'd definitely have lower overhead and feel like less of a hacky implementation.

@madhavajay
Copy link
Copy Markdown
Author

Also looks like they have an official python library:
https://pypi.org/project/nvidia-ml-py/

@ktsaou
Copy link
Copy Markdown
Member

ktsaou commented Oct 22, 2018

@l2isbad please check this.

@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented Oct 23, 2018

@madhavajay hi
let's start from
#4357 (comment)

@madhavajay
Copy link
Copy Markdown
Author

@l2isbad sorry was away, im not sure what to do here, any hints? Is there another plugin which deals with something similar to look at?

@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented Oct 29, 2018

@madhavajay we have to implement some protection from it. How?
I suggest to see what exactly happens when nvidia-smi process dies, why is memory leak occurs.
If we know the reason it will be easy to handle the problem.

if i remember it right self.process.stdout.readline().decode() indefinitely returns empty lines after nvidia process dies.

        while True:
            line = self.process.stdout.readline().decode()
            lines.append(line)
            if self.break_condition(line):
                self.lock.acquire()
                self.last_data = '\n'.join(lines)
                self.last_time = time.time()
                self.lock.release()
                return

If this occurs we need stop poller thread.

@madhavajay
Copy link
Copy Markdown
Author

@l2isbad okay got it, I will take a look when I have a spare moment and find a way to gracefully handle the error. Thanks.

@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented Oct 31, 2018

i rewrote nvidia_smi invoke part and poller, take a look.
use it if you find it useful

click me!

class NvidiaSMI:
    def __init__(self):
        self.command = find_binary(NVIDIA_SMI)
        self.active_proc = None

    def __bool__(self):
        return bool(self.command)

    def run_once(self):
        proc = subprocess.Popen([self.command, '-x', '-q'], stdout=subprocess.PIPE)
        stdout, _ = proc.communicate()
        return stdout

    def run_loop(self, interval):
        if self.active_proc:
            self.kill()
        proc = subprocess.Popen([self.command, '-x', '-q', '-l', str(interval)], stdout=subprocess.PIPE)
        self.active_proc = proc
        return proc.stdout

    def kill(self):
        if self.active_proc:
            self.active_proc.kill()
            self.active_proc = None


class Poller(threading.Thread):
    def __init__(self, command, break_condition, poll_interval):
        threading.Thread.__init__(self)
        self.daemon = True

        self.command = command
        self.break_condition = break_condition
        self.interval = int(poll_interval)

        self.lock = threading.RLock()
        self.last_data = str()
        self.exit = False
        self.empty_rows = 0
        self.rows = list()

    def run(self):
        stdout = self.command.run_loop(self.interval)

        for row in stdout:
            if self.exit or self.empty_rows > EMPTY_LIMIT:
                break
            self.process_row(row)
        self.command.kill()

    def process_row(self, row):
        row = row.decode()
        self.empty_rows += (row == EMPTY)
        self.rows.append(row)

        if self.break_condition(row):
            self.lock.acquire()
            self.last_data = '\n'.join(self.rows)
            self.lock.release()

            self.rows.clear()
            self.empty_rows = 0

    def is_started(self):
        return self.ident is not None

    def shutdown(self):
        self.exit = True

    def data(self):
        self.lock.acquire()
        data = self.last_data
        self.lock.release()
        return data

btw create charts part and get_data are hard to read/follow, i suggest to simplify them.

If you need help feel free to ask

@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented Oct 31, 2018

The module has been developing for ages 😄
We need to finish it.

@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented Nov 8, 2018

hey @madhavajay
We are interested in this being merged. If you don't have spare time to finish the PR, i can do it. What do you think?

@madhavajay
Copy link
Copy Markdown
Author

@l2isbad yes please. Unfortunately I had spare time back when I started but just lately I haven't had a chance. If you can finish it that would be awesome! 👍

@ktsaou
Copy link
Copy Markdown
Member

ktsaou commented Nov 10, 2018

ok, closing this in favor of #4589

@ktsaou ktsaou closed this Nov 10, 2018
@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented Nov 12, 2018

#4589 merged

@paulfantom paulfantom added area/collectors Everything related to data collection area/docs area/web no changelog Issues which are not going to be added to changelog and removed no changelog Issues which are not going to be added to changelog labels Nov 29, 2018
@Ehekatl
Copy link
Copy Markdown
Contributor

Ehekatl commented Jan 9, 2020

I added a native implementation in golang here netdata/go.d.plugin#318

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/collectors Everything related to data collection area/docs area/web

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants