Skip to content

RFC: Implement nvidia_smi module#2759

Closed
tycho wants to merge 1 commit into
netdata:masterfrom
tycho:implement-nvidia-smi-module
Closed

RFC: Implement nvidia_smi module#2759
tycho wants to merge 1 commit into
netdata:masterfrom
tycho:implement-nvidia-smi-module

Conversation

@tycho
Copy link
Copy Markdown
Contributor

@tycho tycho commented Sep 18, 2017

This is mostly ready for merging, I think, but I'd like to do some iterating and get some feedback on it.

I've been using this on my own machines for a few days without any issues. My dual Xeon box is currently running a TensorFlow learner on the GPU, so you can see this module in action here:

https://netdata.uplinklabs.net/incubus/#menu_gpu_submenu_nvidia_smi;theme=slate;help=true

A couple of screenshots here, in case it's not busy when you're looking:

screenshot_2017-09-18_09-08-26
screenshot_2017-09-18_09-08-20

There are a couple caveats to note. First, I haven't tried this with a multi-GPU box yet, so I'm not sure what the module will do when it sees more than a single GPU. It probably does the wrong thing, but the intent is that it creates a separate set of graphs per GPU.

There are also some caveats with nvidia-smi.

nvidia-smi -q can occasionally be slightly slower when the GPU is under load:
screenshot_2017-09-18_09-10-19

And it can be much slower if the GPU is completely idle, has no X11 session running, and has persistence mode disabled (nvidia-smi -pm 0):
screenshot_2017-09-18_09-13-11

@Ferroin
Copy link
Copy Markdown
Member

Ferroin commented Sep 18, 2017

I'm looking forwards to being able to monitor my NVIDIA GPu's without needing to install some outdated Python modules.

I'll see about getting this tested on my laptop (GTX 1080, Nvidia driver version 384.69) once the Travis build goes green.

@Ferroin
Copy link
Copy Markdown
Member

Ferroin commented Sep 18, 2017

Hmm, the macOS build from before the most recent push is still running after over an hour, compared to the 4.5 minutes it appears to usually take. I think something might be broken there, but I don't have a macOS system to test with to confirm.

@tycho
Copy link
Copy Markdown
Contributor Author

tycho commented Sep 18, 2017

I don't think I've seen travis-ci complete any netdata builds to date, but maybe I missed it.

@Ferroin
Copy link
Copy Markdown
Member

Ferroin commented Sep 18, 2017

Here's the most recent one that succeeded:
https://travis-ci.org/firehol/netdata/builds/276661558

Other than this PR, it looks like all the PR test builds have been working fine for the past week or so according to:
https://travis-ci.org/firehol/netdata/pull_requests

@tycho
Copy link
Copy Markdown
Contributor Author

tycho commented Sep 18, 2017

Huh, I stand corrected. I've pretty much always just seen the yellow "in progress" icon for Travis on my PRs, but I guess it is working.

@ktsaou
Copy link
Copy Markdown
Member

ktsaou commented Sep 18, 2017

travis is having difficulties with macOS, it is usually a bit slow and unreliable (I have to restart a lot of jobs to make it finish). I see now it is stuck because it cannot find a macOS slot to build netdata.

My nvidia-smi -q from an old 8800GT machine:

click for nvidia-smi output
==============NVSMI LOG==============

Timestamp                           : Mon Sep 18 21:51:27 2017
Driver Version                      : 340.102

Attached GPUs                       : 1
GPU 0000:01:00.0
    Product Name                    : GeForce 8800 GT
    Product Brand                   : GeForce
    Display Mode                    : N/A
    Display Active                  : N/A
    Persistence Mode                : Disabled
    Accounting Mode                 : N/A
    Accounting Mode Buffer Size     : N/A
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-da613ba8-27bb-76b9-2c57-d36a5905478a
    Minor Number                    : 0
    VBIOS Version                   : 62.92.16.00.04
    MultiGPU Board                  : N/A
    Board ID                        : N/A
    Inforom Version
        Image Version               : N/A
        OEM Object                  : N/A
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x01
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x061110DE
        Bus Id                      : 0000:01:00.0
        Sub System Id               : 0x080110B0
        GPU Link Info
            PCIe Generation
                Max                 : N/A
                Current             : N/A
            Link Width
                Max                 : N/A
                Current             : N/A
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
    Fan Speed                       : 41 %
    Performance State               : P12
    Clocks Throttle Reasons         : N/A
    FB Memory Usage
        Total                       : 1023 MiB
        Used                        : 344 MiB
        Free                        : 679 MiB
    BAR1 Memory Usage
        Total                       : N/A
        Used                        : N/A
        Free                        : N/A
    Compute Mode                    : Default
    Utilization
        Gpu                         : N/A
        Memory                      : N/A
        Encoder                     : N/A
        Decoder                     : N/A
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
        Aggregate
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 54 C
        GPU Shutdown Temp           : N/A
        GPU Slowdown Temp           : N/A
    Power Readings
        Power Management            : N/A
        Power Draw                  : N/A
        Power Limit                 : N/A
        Default Power Limit         : N/A
        Enforced Power Limit        : N/A
        Min Power Limit             : N/A
        Max Power Limit             : N/A
    Clocks
        Graphics                    : N/A
        SM                          : N/A
        Memory                      : N/A
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : N/A
        SM                          : N/A
        Memory                      : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Compute Processes               : N/A

I haven't installed nvidia on my XPS9560 (nvidia 1050 - @Ferroin nvidia 1080 on a laptop? wow! which laptop? ) The truth is I am afraid to install it... :) will it break everything? XPS9560 has many issues with drivers, on both Linux and Windows... @Ferroin which distro do you use?

@tycho
Copy link
Copy Markdown
Contributor Author

tycho commented Sep 18, 2017

@ktsaou Can you pastebin the output of nvidia-smi -x -q (i.e. the XML version of what you posted) so I can test it with the module? I'm betting all the missing values aren't handled well. It probably doesn't throw exceptions, but if some values are marked "N/A" then we could just omit the graphs for them.

@ktsaou
Copy link
Copy Markdown
Member

ktsaou commented Sep 18, 2017

btw @tycho this plugin is nice!

@l2isbad is there a way to avoid polling external commands all the time? I mean, both turbostat and nvidia-smi can loop by themselves (e.g. nvidia-smi -q -l -lms 1000). It would be perfect if we could have them always running, pushing their output to the plugin to parse it.

Something like nvidia-smi -q -l -lms 1000 | awk ... but in python...

@ktsaou
Copy link
Copy Markdown
Member

ktsaou commented Sep 18, 2017

@tycho here it is:

click for XML output
<?xml version="1.0" ?>
<!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v6.dtd">
<nvidia_smi_log>
	<timestamp>Mon Sep 18 22:10:18 2017</timestamp>
	<driver_version>340.102</driver_version>
	<attached_gpus>1</attached_gpus>
	<gpu id="0000:01:00.0">
		<product_name>GeForce 8800 GT</product_name>
		<product_brand>GeForce</product_brand>
		<display_mode>N/A</display_mode>
		<display_active>N/A</display_active>
		<persistence_mode>Disabled</persistence_mode>
		<accounting_mode>N/A</accounting_mode>
		<accounting_mode_buffer_size>N/A</accounting_mode_buffer_size>
		<driver_model>
			<current_dm>N/A</current_dm>
			<pending_dm>N/A</pending_dm>
		</driver_model>
		<serial>N/A</serial>
		<uuid>GPU-da613ba8-27bb-76b9-2c57-d36a5905478a</uuid>
		<minor_number>0</minor_number>
		<vbios_version>62.92.16.00.04</vbios_version>
		<multigpu_board>N/A</multigpu_board>
		<board_id>N/A</board_id>
		<inforom_version>
			<img_version>N/A</img_version>
			<oem_object>N/A</oem_object>
			<ecc_object>N/A</ecc_object>
			<pwr_object>N/A</pwr_object>
		</inforom_version>
		<gpu_operation_mode>
			<current_gom>N/A</current_gom>
			<pending_gom>N/A</pending_gom>
		</gpu_operation_mode>
		<pci>
			<pci_bus>01</pci_bus>
			<pci_device>00</pci_device>
			<pci_domain>0000</pci_domain>
			<pci_device_id>061110DE</pci_device_id>
			<pci_bus_id>0000:01:00.0</pci_bus_id>
			<pci_sub_system_id>080110B0</pci_sub_system_id>
			<pci_gpu_link_info>
				<pcie_gen>
					<max_link_gen>N/A</max_link_gen>
					<current_link_gen>N/A</current_link_gen>
				</pcie_gen>
				<link_widths>
					<max_link_width>N/A</max_link_width>
					<current_link_width>N/A</current_link_width>
				</link_widths>
			</pci_gpu_link_info>
			<pci_bridge_chip>
				<bridge_chip_type>N/A</bridge_chip_type>
				<bridge_chip_fw>N/A</bridge_chip_fw>
			</pci_bridge_chip>
		</pci>
		<fan_speed>40 %</fan_speed>
		<performance_state>P12</performance_state>
		<clocks_throttle_reasons>N/A</clocks_throttle_reasons>
		<fb_memory_usage>
			<total>1023 MiB</total>
			<used>386 MiB</used>
			<free>637 MiB</free>
		</fb_memory_usage>
		<bar1_memory_usage>
			<total>N/A</total>
			<used>N/A</used>
			<free>N/A</free>
		</bar1_memory_usage>
		<compute_mode>Default</compute_mode>
		<utilization>
			<gpu_util>N/A</gpu_util>
			<memory_util>N/A</memory_util>
			<encoder_util>N/A</encoder_util>
			<decoder_util>N/A</decoder_util>
		</utilization>
		<ecc_mode>
			<current_ecc>N/A</current_ecc>
			<pending_ecc>N/A</pending_ecc>
		</ecc_mode>
		<ecc_errors>
			<volatile>
				<single_bit>
					<device_memory>N/A</device_memory>
					<register_file>N/A</register_file>
					<l1_cache>N/A</l1_cache>
					<l2_cache>N/A</l2_cache>
					<texture_memory>N/A</texture_memory>
					<total>N/A</total>
				</single_bit>
				<double_bit>
					<device_memory>N/A</device_memory>
					<register_file>N/A</register_file>
					<l1_cache>N/A</l1_cache>
					<l2_cache>N/A</l2_cache>
					<texture_memory>N/A</texture_memory>
					<total>N/A</total>
				</double_bit>
			</volatile>
			<aggregate>
				<single_bit>
					<device_memory>N/A</device_memory>
					<register_file>N/A</register_file>
					<l1_cache>N/A</l1_cache>
					<l2_cache>N/A</l2_cache>
					<texture_memory>N/A</texture_memory>
					<total>N/A</total>
				</single_bit>
				<double_bit>
					<device_memory>N/A</device_memory>
					<register_file>N/A</register_file>
					<l1_cache>N/A</l1_cache>
					<l2_cache>N/A</l2_cache>
					<texture_memory>N/A</texture_memory>
					<total>N/A</total>
				</double_bit>
			</aggregate>
		</ecc_errors>
		<retired_pages>
			<multiple_single_bit_retirement>
				<retired_count>N/A</retired_count>
				<retired_page_addresses>N/A</retired_page_addresses>
			</multiple_single_bit_retirement>
			<double_bit_retirement>
				<retired_count>N/A</retired_count>
				<retired_page_addresses>N/A</retired_page_addresses>
			</double_bit_retirement>
			<pending_retirement>N/A</pending_retirement>
		</retired_pages>
		<temperature>
			<gpu_temp>54 C</gpu_temp>
			<gpu_temp_max_threshold>N/A</gpu_temp_max_threshold>
			<gpu_temp_slow_threshold>N/A</gpu_temp_slow_threshold>
		</temperature>
		<power_readings>
			<power_state>P12</power_state>
			<power_management>N/A</power_management>
			<power_draw>N/A</power_draw>
			<power_limit>N/A</power_limit>
			<default_power_limit>N/A</default_power_limit>
			<enforced_power_limit>N/A</enforced_power_limit>
			<min_power_limit>N/A</min_power_limit>
			<max_power_limit>N/A</max_power_limit>
		</power_readings>
		<clocks>
			<graphics_clock>N/A</graphics_clock>
			<sm_clock>N/A</sm_clock>
			<mem_clock>N/A</mem_clock>
		</clocks>
		<applications_clocks>
			<graphics_clock>N/A</graphics_clock>
			<mem_clock>N/A</mem_clock>
		</applications_clocks>
		<default_applications_clocks>
			<graphics_clock>N/A</graphics_clock>
			<mem_clock>N/A</mem_clock>
		</default_applications_clocks>
		<max_clocks>
			<graphics_clock>N/A</graphics_clock>
			<sm_clock>N/A</sm_clock>
			<mem_clock>N/A</mem_clock>
		</max_clocks>
		<clock_policy>
			<auto_boost>N/A</auto_boost>
			<auto_boost_default>N/A</auto_boost_default>
		</clock_policy>
		<supported_clocks>N/A</supported_clocks>
		<compute_processes>N/A</compute_processes>
		<accounted_processes>N/A</accounted_processes>
	</gpu>

</nvidia_smi_log>

@tycho
Copy link
Copy Markdown
Contributor Author

tycho commented Sep 18, 2017

Thanks @ktsaou. Pushed some changes to make it properly handle the XML output on your system (some charts don't/can't have any metrics available, so now we don't bother creating them).

@sarlalian
Copy link
Copy Markdown

It might take a week or two before I can setup a proper dual GPU system to test with, but I'll add that to my list for this week.

@tycho tycho force-pushed the implement-nvidia-smi-module branch from 51d3b36 to 327ab7d Compare September 21, 2017 18:35
@ktsaou
Copy link
Copy Markdown
Member

ktsaou commented Sep 24, 2017

@l2isbad can we support a python.d.service like ExecutableService that keeps the pipe open, reading repeatedly data from the external program?

So external commands that can output data repeatedly, this would greatly improve the overhead of such plugins.

If you can do this, I suggest to re-implement this ( and #2714 ) using the new service.

@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented Sep 25, 2017

Well, you can read from stdout line by line... Not sure about turbostat but we can implement it for nvidia-smi.

quick try (click)
# -*- coding: utf-8 -*-
# Description: nvidia-smi netdata python.d module
# Author: Steven Noonan (tycho)
import copy
import xml.etree.ElementTree as et

try:
    from queue import Queue, Empty
except ImportError:
    from Queue import Queue, Empty
from subprocess import Popen, PIPE
from threading import Thread

from base import SimpleService

CHART_TEMPLATES = {
    'pci_bandwidth': {
        'options': [None, 'PCI express bandwidth utilization', 'KB/s', 'nvidia_smi', None, 'area'],
        '_metrics': {'pci': ['rx_util', 'tx_util']},
        'lines': []
    },
    'fan_speed': {
        'options': [None, 'Fan speed', 'percentage', 'nvidia_smi', None, 'line'],
        '_metrics': {None: ['fan_speed']},
        'lines': []
    },
    'utilization': {
        'options': [None, 'Resource utilization', 'percentage', 'nvidia_smi', None, 'line'],
        '_metrics': {'utilization': ['gpu_util', 'memory_util', 'encoder_util', 'decoder_util']},
        'lines': []
    },
    'mem_usage': {
        'options': [None, 'Memory utilization', 'MB', 'nvidia_smi', None, 'line'],
        '_metrics': {'fb_memory_usage': ['used']},
        'lines': []
    },
    'temperature': {
        'options': [None, 'Temperature', 'Celsius', 'nvidia_smi', None, 'line'],
        '_metrics': {'temperature': ['gpu_temp']},
        'lines': []
    },
    'clocks': {
        'options': [None, 'Clock frequencies', 'MHz', 'nvidia_smi', None, 'line'],
        '_metrics': {'clocks': ['graphics_clock', 'sm_clock', 'mem_clock', 'video_clock']},
        'lines': []
    },
    'power': {
        'options': [None, 'Power utilization', 'Watts', 'nvidia_smi', None, 'line'],
        '_metrics': {'power_readings': ['power_draw']},
        '_divisor': 10,
        'lines': []
    },
}

LAST_POLL = Queue()


def is_end_of_snapshot(line):
    return '</nvidia_smi_log>' in line


def convert_to_xml_obj(data):
    if not data:
        return None
    try:
        return et.fromstring(data)
    except et.ParseError:
        return None


def invoke_nvidia_smi():
    p = Popen(['nvidia-smi', '-x', '-q'], stdout=PIPE)
    stdout, _ = p.communicate()
    return stdout


class Poller(Thread):
    def __init__(self, command, break_condition, service):
        Thread.__init__(self)
        self.command = command
        self.break_condition = break_condition
        self.service = service

    def run(self):
        try:
            self.process = Popen(self.command.split(), stdout=PIPE, stderr=PIPE)
        except (OSError, FileNotFoundError) as error:
            self.error(str(error))
            return
        self.read_nvidia_smi_output()

    def read_nvidia_smi_output(self):
        output = list()
        get_line = self.process.stdout.readline
        read_lines_number = 0
        while True:
            line = get_line().decode()
            read_lines_number += 1

            if read_lines_number >= self.read_lines_limit:
                self.error('After reading {0} lines,'
                           ' the end of "nvidia-smi" output can not be found.'
                           ' Poller thread exiting'. format(read_lines_number))
                return

            if self.break_condition(line):
                output.append(line)
                if not LAST_POLL.empty():
                    LAST_POLL.queue.clear()
                LAST_POLL.put(output)
                read_lines_number, output = 0, list()
                continue

            output.append(line)

    def __getattr__(self, item):
        return getattr(self.service, item)


class Service(SimpleService):
    def __init__(self, configuration=None, name=None):
        SimpleService.__init__(self, configuration=configuration, name=name)
        self.order = list()
        self.definitions = dict()
        self.assignment = dict()
        self.nvidia_smi = self.find_binary('nvidia-smi')
        self.read_lines_limit = 1024  # will be redefined in check method
        command = '{nvidia_smi} -x -q -l -lms {interval}'.format(nvidia_smi=self.nvidia_smi,
                                                                 interval=self.update_every * 1000 - 100)
        self.data_poller = Poller(command=command,
                                  break_condition=is_end_of_snapshot,
                                  service=self)

    def parse_xml_obj(self, xml_obj):
        data = dict()
        for gpu in xml_obj.findall('gpu'):
            gpu_id = self.assignment[gpu.get('id')]['gpu_id']

            for template in CHART_TEMPLATES.values():
                for root_name, metrics in template['_metrics'].items():
                    if root_name:
                        root = gpu.find(root_name)
                    else:
                        root = gpu

                    for metric in metrics:
                        metric_name = '_'.join([gpu_id, metric])
                        elem = root.find(metric)
                        if elem is None:
                            continue
                        try:
                            val = float(elem.text.split()[0])
                        except ValueError:
                            continue

                        val *= float(template.get('_divisor', 1))
                        data[metric_name] = val
        return data or None

    def _get_data(self):
        try:
            raw_data = LAST_POLL.get(timeout=0.2)
        except Empty:
            return None

        if not raw_data:
            return None

        xml_obj = convert_to_xml_obj(''.join(raw_data))
        if not xml_obj:
            return None

        return self.parse_xml_obj(xml_obj)

    def check(self):
        if not self.nvidia_smi:
            self.error('"nvidia-smi" binary not found')
            return None

        nvidia_smi_output = invoke_nvidia_smi()
        self.read_lines_limit = len(nvidia_smi_output.split()) * 2

        smi = convert_to_xml_obj(nvidia_smi_output)

        if not smi:
            self.error("Failed to invoke 'nvidia-smi'."
                       " Do you have the proprietary NVIDIA driver and tools installed?")
            return False

        for i, gpu in enumerate(smi.findall('gpu')):
            gpu_id = gpu.get('id')
            self.assignment[gpu_id] = dict(gpu_id='gpu{0}'.format(i))

        if not self.assignment:
            self.error("Could not find any NVIDIA GPUs in nvidia-smi XML output")
            return False

        data = self.parse_xml_obj(smi)
        self.create_charts(data)
        self._data_from_check = data

        self.data_poller.daemon = True
        self.data_poller.start()
        return True

    def create_charts(self, data):
        for chart, template in CHART_TEMPLATES.items():
            for name in sorted(self.assignment, key=lambda v: self.assignment[v]['gpu_id']):

                assignment = self.assignment[name]
                gpu_id = assignment['gpu_id']  # gpu0
                chart_name = '_'.join([chart, gpu_id])  # fan_speed_gpu0
                chart_def = copy.deepcopy(template)

                for root_name, metrics in chart_def['_metrics'].items():
                    collected_metrics = set('_'.join([gpu_id, metric]) for metric in metrics) & set(data)
                    if not collected_metrics:
                        break
                    self.order.append(chart_name)

                    for metric in collected_metrics:
                        direction = -1 if root_name == 'pci' and metric == 'tx_util' else 1
                        nickname = metric.split('_', 1)[1]
                        divisor = template.get('_divisor', 1)
                        chart_def['lines'].append([metric, nickname, 'absolute', direction, divisor])
                    self.definitions[chart_name] = chart_def

screenshot_20170925_154052

@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented Sep 25, 2017

The idea is to specify a special function (break_condition) that determines when we need to finish reading from stdout and start processing data.

@tycho
Copy link
Copy Markdown
Contributor Author

tycho commented Sep 25, 2017

I see some issues with the approach you're demonstrating. The first is that we could get out of step with the child process. For instance, let's say we're late at reading the output of the subprocess a few times, and eventually we're behind a whole interval. The subprocess has had enough time to output the nvidia_smi_log XML block twice, but we're completely unaware because we are reading where we last left off. So we read the one-interval-old nvidia_smi_log, hit the closing </nvidia_smi_log> tag, and stop reading. But there's still a 2nd output in the buffer that we will read a whole refresh interval later. So one problem is that we could be reading outdated information from the pipe. A second problem is a side effect: if we lag behind enough, eventually the pipe could fill up, leading to dreaded subprocess hangs:

https://thraxil.org/users/anders/posts/2008/03/13/Subprocess-Hanging-PIPE-is-your-enemy/

(I ran into this a few times at a previous job. I don't know if it still applies to Python 3.x; we were using Python 2.7.10 or so at the time.)

It would seem like a better approach would be to spawn a thread that is constantly attempting to read the output of the subprocess. It could keep hold of the last full snapshot it's read. When our module needs to read an update, it would just pull the last complete snapshot read by the other thread.

@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented Sep 25, 2017

@tycho

The first is that we could get out of step with the child process

indeed

It would seem like a better approach would be to spawn a thread that is constantly attempting to read the output of the subprocess. It could keep hold of the last full snapshot it's read. When our module needs to read an update, it would just pull the last complete snapshot read by the other thread.

code above edited

@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented Sep 25, 2017

continuously running nvidia-smi consumes more cpu
screenshot_20170925_222947

@tycho
Copy link
Copy Markdown
Contributor Author

tycho commented Sep 25, 2017

With the Queue you're just moving the buffer we'll end up behind in reading (e.g. two responses in the Queue). On the other extreme: if the Queue ends up empty, then the LAST_POLL.get() will block, waiting for the Poller thread to put something there. Why not just have a member variable that you assign in the Poller thread and read in the main thread? You might just be able to rely on the GIL for synchronization, but if you're really worried you could put a lock around reading/writing the variable.

@ktsaou
Copy link
Copy Markdown
Member

ktsaou commented Sep 25, 2017

@l2isbad I can't confirm your observations.

With the code in the PR, python.d.plugin is runs with 2-3% CPU, while with the new code nvidia-smi (which is always running) it barely consumes 0.5% CPU and python.d.plugin runs with 1% CPU.

I verified nvidia-smi is really standing idle between iterations (run it by hand and set -lms 10000, and observe it with top -p $(pidof nvidia-smi)).

So, my conclusion is that if nvidia-smi is not always running, apps.plugin will report a higher increase for the CPU consumption of python.d.plugin. apps.plugin reports the CPU utilization of a process children, even if they don't currently run.

@tycho tycho force-pushed the implement-nvidia-smi-module branch from 327ab7d to 8bc124e Compare September 25, 2017 21:41
@tycho
Copy link
Copy Markdown
Contributor Author

tycho commented Sep 25, 2017

I just pushed a version demonstrating how I'd implement it.

@tycho
Copy link
Copy Markdown
Contributor Author

tycho commented Sep 25, 2017

@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented Sep 25, 2017

With the Queue you're just moving the buffer we'll end up behind in reading (e.g. two responses in the Queue).

if not LAST_POLL.empty():
    LAST_POLL.queue.clear()
LAST_POLL.put(output)

@ktsaou

Well, it acts differently on my 2 PCs (work NVIDIA-SMI 375.82 GeForce GT 730, home NVIDIA-SMI 375.82 GeForce GTX 960).

lgz@lgz-homepc[~]nvidia-smi -q -x -l -lms 1000 > /dev/null

screenshot_20170926_075220

@tycho
Copy link
Copy Markdown
Contributor Author

tycho commented Sep 25, 2017

@l2isbad OK, so you intend for LAST_POLL.get() to block the main thread? What happens if nvidia-smi dies?

@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented Sep 26, 2017

i see Queue().get has timeout argument
like

try:
    r = q.get(timeout=0.2)
except Queue.Empty:
    return None

@tycho
Copy link
Copy Markdown
Contributor Author

tycho commented Sep 26, 2017

@l2isbad Yeah, that could work. I'd rather not make the main thread wait at all if I can help it though. Another thing we're not yet accounting for is what to do in the Poller if </nvidia_smi_log> never appears in the output... Right now it's around 500-ish lines from the open tag to the close tag. Maybe we should say if we read 1000 lines then it's time to give up?

@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented Sep 26, 2017

work (GeForce GT 730)

nvidia-smi -x -q | wc -l
190

home (GeForce GTX 960)

nvidia-smi -x -q | wc -l
533

Seem it depends on gpu, number of gpus

if we read 1000 lines then it's time to give up?

we can get number of lines after invoking nvidia-smi -x -q at check() and pass it to Poller instance or something (and set the limit to number of lines * 2 as example - see my code above)

@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented Sep 26, 2017

Maybe we should say if we read 1000 lines then it's time to give up?

Actually it's must have feature. Try sudo kill $(pidof nvidia-smi) without it 😄

@tycho tycho force-pushed the implement-nvidia-smi-module branch from 4a37b28 to 70d7d08 Compare September 27, 2017 16:45
@tycho tycho force-pushed the implement-nvidia-smi-module branch from d25e9c2 to 0b763e7 Compare March 24, 2018 03:09
@ktsaou
Copy link
Copy Markdown
Member

ktsaou commented Mar 31, 2018

I have a very strange case that is identified by this plugin:

image

The PCI Express bandwidth utilization spikes like crazy when I switch tabs at the browser. The lag is visible, even mouse movement is lagging while this happens.

hm... can't figure out why...

btw, I really like this plugin and I would like to merge it. Some suggestions:

  1. If the system is having more than one GPU, I would like to see them separately, with a different set of charts. It would also be nice if the dashboard section could be named after the gpu model found. So charts could be named nvidia_<model_name>_gpu<X>.<chart_name> (make sure the model name has only alphanumerics and underscores - change any other character to underscore).

    This will also allow to have better families: cpu, memory, frequencies, fans, temperatures. So the menu would be like:

    • Nvidia GTX 1050 gpu0
      • cpu
      • memory
      • frequencies
      • temperatures
      • fans
    • Nvidia GTX 1070 gpu1
      • cpu
      • memory
      • frequencies
      • temperatures
      • fans
  2. I suggest to provide a stacked memory chart for memory utilization, featuring used and free. The chart that shows the percentage is not required. netdata can turn such a stacked chart to a percentage if required.

  3. I would be perfect to have a set of heading gauges, featuring:

    • cpu %
    • memory %
    • encoder %
    • decoder %
    • pcie rx bandwidth
    • pcie tx bandwidth

    The heading row can perfectly have 6 gauges (containers and VMs have that many).

@ktsaou
Copy link
Copy Markdown
Member

ktsaou commented Mar 31, 2018

btw, since we know that nvidia-smi may consume quite some CPU, the plugin could me merged and installed, but disabled by default, so that users would have to edit a config to enable it.

@tycho
Copy link
Copy Markdown
Contributor Author

tycho commented Mar 31, 2018

I used to have the graphs broken out per-GPU, but it got really crowded in a box with 4 GPUs in it. I could turn it into a preference or something.

I'll see what I can do about the suggestions you have.

@tycho tycho force-pushed the implement-nvidia-smi-module branch from 0b763e7 to 90f31b3 Compare April 2, 2018 14:45
@tycho tycho force-pushed the implement-nvidia-smi-module branch from 90f31b3 to 15526c4 Compare April 14, 2018 18:12
@tycho tycho force-pushed the implement-nvidia-smi-module branch from 15526c4 to 1cadd87 Compare May 12, 2018 06:19
@tycho tycho force-pushed the implement-nvidia-smi-module branch from 1cadd87 to 7e39052 Compare June 5, 2018 04:39
@calvyntwh
Copy link
Copy Markdown

is this working?

@tycho
Copy link
Copy Markdown
Contributor Author

tycho commented Jul 16, 2018

Yes?

@paulfantom
Copy link
Copy Markdown
Contributor

Are we merging this?

@Ferroin
Copy link
Copy Markdown
Member

Ferroin commented Sep 10, 2018

Just realized I never got back to this to actually test it. Appears to work correctly on my Laptop (GTX 1080) and my home-server (GeForce 210), including confirming the caveats mentioned in the original description. Both tests were done with the most recent Travis build of the PR.

I also agree entirely with all of @ktsaou's suggestions on this, especially disabling by default.

Signed-off-by: Steven Noonan <steven@uplinklabs.net>
@tycho tycho force-pushed the implement-nvidia-smi-module branch from 7e39052 to e78da93 Compare September 12, 2018 11:10
@madhavajay
Copy link
Copy Markdown

Is this blocked because of subprocess call failing the PR quality review?

import threading
import xml.etree.ElementTree as et
import time
from base import SimpleService
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use

from bases.FrameworkServices.SimpleService import SimpleService

base.py was removed

self.break_condition = break_condition
self.lock = threading.RLock()
self.last_time = 0
self.last_data = ''
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add

self.daemon = True


return data

def check(self):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check method is super long and hard to read. Please split it into several functions. At least let's move chart creating to a separate function.

@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented Sep 27, 2018

Is this blocked because of subprocess call failing the PR quality review?

Actually no special reason. Since this plugin is requested by people several times already i think we should merge it (disabled by default?).

So let's work on this.

@ilyam8 ilyam8 requested a review from Ferroin September 27, 2018 02:36
@netdatabot
Copy link
Copy Markdown
Member

This pull request introduces 2 alerts when merging e78da93 into 6ec1605 - view on LGTM.com

new alerts:

  • 1 for Missing call to __init__ during object initialization
  • 1 for Variable defined multiple times

Comment posted by LGTM.com

@Ferroin
Copy link
Copy Markdown
Member

Ferroin commented Sep 27, 2018

This is missing the following items:

  • A configuration file for the module in conf.d/python.d, including an entry for it in conf.d/Makefile.am
  • An entry for the module in conf.d/python.d.conf
  • A section with information on the plugin in python.d/README.md

The complaints reported by Codacy about use of subprocess can be ignored in this case. The complaints on Codacy about use of xml.etree can also be ignored.

@paulfantom
Copy link
Copy Markdown
Contributor

This also needs a rebase.

@madhavajay
Copy link
Copy Markdown

How is this?
https://github.com/madhavajay/netdata/tree/implement-nvidia-smi-module

Not sure I did the rebase correct?

@madhavajay
Copy link
Copy Markdown

After committing my changes to my fork of the branch I added the netdata/netdata repo as origin and ran:

$ git pull --rebase origin master

But then it wouldn't let me push without force.

@Ferroin
Copy link
Copy Markdown
Member

Ferroin commented Oct 3, 2018

Having to force-push after a rebase is normal. Rebasing rewrites commit history, so it's not a simple case of just putting a few extra commits at the end of the branch, and therefore git is (rightfully) picky about making sure you want to push the results.

@Ferroin
Copy link
Copy Markdown
Member

Ferroin commented Oct 3, 2018

@tycho Do you still have interest in getting this merged upstream? If so, I would suggest collaborating with @madhavajay as they appear to have an interest in it. If not, please mention so here so we can close this and @madhavajay can open a PR for it.

@madhavajay
Copy link
Copy Markdown

If this hits some time limit we can just merge from my branch since it still attributes all of @tycho work on this so I doubt they would care. Im using the unofficial on my servers right now and i'm itching to see it go live and let all the Crypto fiends and AI / ML nerds have at it! 😝

@ilyam8
Copy link
Copy Markdown
Member

ilyam8 commented Nov 1, 2018

closed in favor of #4357

@ilyam8 ilyam8 closed this Nov 1, 2018
@tycho tycho deleted the implement-nvidia-smi-module branch December 17, 2018 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants