c410x, GPU/M2075 and C6145

Martin Flemming martin.flemming at desy.de
Fri Mar 30 06:57:23 CDT 2012


> Martin,
>
> At the time I built the cluster, I installed:
>
> NVIDIA-Linux-x86_64-285.05.09.run
>
> and
>
> CUDA-Z-0.5.95-i686.run
>
> The second part is a bit "complicated" - if the machine gets rebooted for any reason other than a hung GPU process (a "normal" reboot), then the GPUs are there and ready when the machine comes back up.
>
> We have done development work on the GPUs with programs that can lock up the PCI bus; the computer then gets rebooted, but in that case the GPUs are not there and we need to power off and restart everything.
>
> "nvidia-smi -L" as "root" - that command is your friend - I've toyed with the idea of adding that to the rc.local, it will wake up GPU's that seem unresponsive after a reboot.
>
> Pat
>
>

Hi, and thanks for the response!

The driver works, and the newer ones do too :-)
... which they didn't before ?!?!

Actually, the only problem I have left
is losing the GPUs after a normal reboot :-(

Adding "nvidia-smi -L" and/or "/usr/bin/nvidia-smi -pm 1"
to /etc/rc.local didn't solve the problem either, because
the controller/GPUs really do seem to be gone :-(


/usr/bin/nvidia-smi -pm 1
FATAL: Error inserting nvidia 
(/lib/modules/2.6.32-220.7.1.el6.x86_64/kernel/drivers/video/nvidia.ko): No such device
NVIDIA: failed to load the NVIDIA kernel module.
Nvidia-smi has failed because it couldn't communicate with NVIDIA driver.
  Make sure that latest NVIDIA driver is installed and running.
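
For completeness, this is roughly what the rc.local attempt looks like (just a sketch of what I tried; the lspci check and the logger call are my own additions, not from any Dell or NVIDIA documentation):

    # lines added at the end of /etc/rc.local
    # (only useful if the devices are actually visible on the PCI bus)
    if /sbin/lspci -d 10de: | grep -q "3D controller"; then
            /usr/bin/nvidia-smi -L        # enumerate the GPUs
            /usr/bin/nvidia-smi -pm 1     # enable persistence mode
    else
            logger "rc.local: no NVIDIA devices found on the PCI bus"
    fi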

Before the reboot:
42:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
43:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
43:08.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
45:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
46:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
46:08.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
48:00.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev aa)
49:04.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev aa)
49:08.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev aa)
49:10.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev aa)
49:14.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev aa)
4a:00.0 3D controller: nVidia Corporation Tesla M2075 Dual-Slot Computing Processor Module (rev a1)
4c:00.0 3D controller: nVidia Corporation Tesla M2075 Dual-Slot Computing Processor Module (rev a1)

After the reboot:

42:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
43:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
43:08.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
45:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
46:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
46:08.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)

What's going wrong in my setup? :-(
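
One thing I still want to try after the next reboot is a manual PCI rescan, in case the kernel simply missed the devices during enumeration (only a sketch - I haven't verified that it brings the missing PEX 8696 switch and the GPUs back; the bridge address is just taken from my listing above):

    # as root: re-enumerate the whole PCI bus
    echo 1 > /sys/bus/pci/rescan

    # or remove the still-visible upstream PLX bridge (the one the PEX 8696
    # sat behind before the reboot) and rescan, so everything below it is re-probed
    echo 1 > /sys/bus/pci/devices/0000:45:00.0/remove
    echo 1 > /sys/bus/pci/rescan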

Just to complete the picture of my setup:

I've got only two C6145 nodes connected:

Slot 1 -> first of the top nodes, and
Slot 2 -> first of the bottom nodes

The GPUs are in slots 1, 2, 3 and 4,

and the mapping looks like this:

iPass port 1 -> PCIe slots 1,2,15,16
iPass port 5 -> N/A

iPass port 2 -> PCIe slots 3,4,13,14
iPass port 6 -> N/A
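
To check what each node actually sees, I just run the following on the node (nothing special, shown only for reference):

    /sbin/lspci -d 10de:       # should list one "3D controller" per mapped GPU
    /usr/bin/nvidia-smi -L     # should list the same GPUs once the driver is loaded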


Thanks in advance

 	martin


>
>
> ----- Original Message -----
> From: "Martin Flemming" <martin.flemming at desy.de>
> To: "Dell poweredge Mailling-liste" <linux-poweredge at dell.com>
> Sent: Wednesday, March 28, 2012 3:44:41 PM (GMT-0500) America/New_York
> Subject: Re: c410x, GPU/M2075 and C6145
>
>
> Hi, Pat !
>
> Thanks for the hint about the slot-to-node cabling; that problem is solved :-)
>
> But which driver do you use or build?
>
> .. and another question ....
>
> ... is your "system" also so extremely sensitive
> that if one machine gets rebooted, it loses "its" GPUs? :-(
>
> .. and does this also mean that if you want the GPUs back for this machine, the
> whole "system" has to be shut down (first the nodes, then the c410x; restart the c410x and finally
> power the nodes back on)?
>
> thanks & cheers
>
> 	martin
>
>
> On Tue, 27 Mar 2012, Patrick McMahon wrote:
>
>> Martin,
>>
>> I have one of those same setups.
>>
>> I found I needed to power up the C410 and let it initialize well before I power up the C6145 (3 or 5 min...)
>>
>> After that I also found the problem was usually due to the thick cables and connectors not seating properly, either in the c410 or in the c6145.
>>
>> Do you have four cables from the c410 to the c6145?
>>
>> Slot #1 and Slot #3 on the C410x to the two connectors on the top node in the C6145
>> Slot #2 and Slot #4 on the C410x to the two connectors on the bottom node in the C6145
>>
>> That should allow for 4 GPU's per C6145 node with the default C410x mapping.
>>
>> Pat
>>
>>
>>
>>
>> ----- Original Message -----
>> From: "Martin Flemming" <martin.flemming at desy.de>
>> To: "Dell poweredge Mailling-liste" <linux-poweredge at dell.com>
>> Sent: Tuesday, March 27, 2012 1:39:49 AM (GMT-0500) America/New_York
>> Subject: c410x, GPU/M2075 and C6145
>>
>>
>> Hi !
>>
>> I've got a problem getting the
>> NVIDIA M2075 PCIe x16 GPGPU cards in a PowerEdge C410x
>> to work with (at this time) two C6145s :-(
>>
>> ... lspci shows nothing about them :-(
>>
>> After disabling all special PCIe BIOS settings,
>> one machine shows the controllers :-)
>>
>> 42:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
>> 43:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
>> 43:08.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
>> 45:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev ff)
>> 46:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev ff)
>> 46:08.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev ff)
>> 47:00.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev ff)
>> 48:04.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev ff)
>> 48:08.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev ff)
>> 49:00.0 3D controller: NVIDIA Corporation Tesla M2075 Dual-Slot Computing Processor Module (rev ff)
>> 4a:00.0 3D controller: NVIDIA Corporation Tesla M2075 Dual-Slot Computing Processor Module (rev ff)
>>
>> The other machine shows nothing :-(
>>
>> Both machines are connected via the default port mapping
>> (iPass mapping to PCIe controller):
>>
>> Mapping 1
>>
>> 1 -> 1,15
>> VS
>> 5 -> 2,16
>>
>>
>> But I can't build the NVIDIA driver on the machine with the detected
>> Tesla M2075:
>>
>> WARNING: You do not appear to have an NVIDIA GPU supported by the 295.20
>> NVIDIA Linux graphics driver installed in this system.  For further
>> details, please see the appendix SUPPORTED NVIDIA GRAPHICS CHIPS in the
>> README available on the Linux driver download page at www.nvidia.com.
>>
>> I'm running Scientific Linux 6.2 (a Red Hat clone).
>>
>> Any hint is welcome !
>>
>> thanks & cheers,
>>
>>        Martin
>
> _______________________________________________
> Linux-PowerEdge mailing list
> Linux-PowerEdge at dell.com
> https://lists.us.dell.com/mailman/listinfo/linux-poweredge
>
>
> -- 
> Happiness lies in being privileged to work hard for long hours in doing whatever you think is worth doing. - Robert Heinlein
> ---
> Patrick McMahon,  CITA IV
> University of Delaware
> Department of Chemistry & Biochemistry
> Phone: (302)831-4289   Mobile: (302)690-1049
>
>


