c410x, GPU/M2075 and C6145

Patrick McMahon pmcmahon at mail.chem.udel.edu
Fri Mar 30 07:49:01 CDT 2012


Martin,

I may be wrong, but with only 4 GPUs in the C410x, you want to put them in positions 1, 2 and 15, 16.

Then use the default mapping on the C410x, which would put

1 & 15 on connector # 1
2 & 16 on connector # 5

Then connect C410x connector #1 to the first slot on the top C6145 node, and C410x connector #5 to the first slot on the bottom C6145 node.

If you do it the way you currently have it (GPUs in 1, 2, 3, 4) with that (1/5 vs) and (2/6 vs) mapping, you are creating two problems:

1) With GPUs missing from positions 13, 14, 15 and 16, there is no guarantee the PCI detection will complete correctly,

and

2) With that mapping you are using two PCI bridges in the C410x, so there's the added complication that PCI detection may not find your GPUs across multiple, partially populated bridged PCI buses.


Not to mention that you are going to cost the users of the GPUs time: every time data needs to cross a PCI bridge there is a delay, so the more bridges in the path, the longer each cycle takes. I don't think your code will run at its optimum speed for your equipment with that setup.

I think your setup would only work correctly with 8 GPUs in positions 1, 2, 3, 4, 13, 14, 15, 16, and even then performance would not be optimal.

Pat





----- Original Message -----
From: "Martin Flemming" <martin.flemming at desy.de>
To: "Dell poweredge Mailling-liste" <linux-poweredge at dell.com>
Sent: Friday, March 30, 2012 7:57:23 AM (GMT-0500) America/New_York
Subject: Re: c410x, GPU/M2075 and C6145


> Martin,
>
> At the time I built the cluster, I installed:
>
> NVIDIA-Linux-x86_64-285.05.09.run
>
> and
>
> CUDA-Z-0.5.95-i686.run
>
> The second part is a bit "complicated": if the machine gets rebooted for any reason other than a hung GPU process (a "normal" reboot), then the GPUs are there and ready when the machine comes back up.
>
> We have done development work on the GPUs with programs that can lock up the PCI bus; the computer then gets rebooted, but in that case the GPUs are not there and we need to power off and restart everything.
>
> "nvidia-smi -L" as "root" - that command is your friend - I've toyed with the idea of adding that to the rc.local, it will wake up GPU's that seem unresponsive after a reboot.
>
> Pat
>
>

Hi, and thanks for the response!

The drivers work now, even the newer ones :-)
... which they didn't before ?!

Actually, the only problem I still have
is losing the GPUs after a normal reboot :-(

Adding "nvidia-smi -L" and/or "/usr/bin/nvidia-smi -pm 1"
to /etc/rc.local didn't solve the problem either, because it seems the
controller/GPUs are really lost :-(


/usr/bin/nvidia-smi -pm 1
FATAL: Error inserting nvidia 
(/lib/modules/2.6.32-220.7.1.el6.x86_64/kernel/drivers/video/nvidia.ko): No such device
NVIDIA: failed to load the NVIDIA kernel module.
Nvidia-smi has failed because it couldn't communicate with NVIDIA driver.
  Make sure that latest NVIDIA driver is installed and running.
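Since the module fails with "No such device", the GPUs have dropped off the PCI bus entirely, so running nvidia-smi alone from rc.local can't help. One thing that sometimes recovers devices missed at boot, without a full power cycle, is forcing a PCI rescan before loading the module. A sketch of an rc.local fragment (untested on this exact hardware; whether it helps depends on whether the PLX switches themselves came back up):

```shell
# Sketch of an /etc/rc.local fragment (assumption: run as root at boot):
# force the kernel to re-enumerate the PCI bus before loading the NVIDIA
# module, in case the C410x devices were missed during boot.
echo 1 > /sys/bus/pci/rescan     # ask the kernel to re-scan all PCI buses
sleep 5                          # give the rescan a moment to settle
/sbin/modprobe nvidia            # retry loading the NVIDIA kernel module
/usr/bin/nvidia-smi -pm 1        # enable persistence mode
/usr/bin/nvidia-smi -L           # list the GPUs the driver can now see
```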

Before the reboot:
42:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
43:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
43:08.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
45:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
46:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
46:08.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
48:00.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev aa)
49:04.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev aa)
49:08.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev aa)
49:10.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev aa)
49:14.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev aa)
4a:00.0 3D controller: nVidia Corporation Tesla M2075 Dual-Slot Computing Processor Module (rev a1)
4c:00.0 3D controller: nVidia Corporation Tesla M2075 Dual-Slot Computing Processor Module (rev a1)

After the reboot:

42:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
43:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
43:08.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
45:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
46:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
46:08.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
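Note that in the "after reboot" listing it is not just the Teslas that are gone: the PEX 8696 multi-root switch (the C410x side) has vanished as well, leaving only the host-side PEX 8647 bridges. One quick way to compare saved lspci dumps is to count bridges and GPU endpoints; a sketch (the file name is made up, and a two-line sample stands in for a real dump):

```shell
# Sketch: count bridges and GPU endpoints in a saved `lspci` dump.
# A two-line sample stands in for a real "after reboot" dump here.
cat > /tmp/lspci.after <<'EOF'
42:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
43:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
EOF
echo "bridges: $(grep -c 'PCI bridge' /tmp/lspci.after)"   # prints "bridges: 2"
echo "GPUs: $(grep -c '3D controller' /tmp/lspci.after)"   # prints "GPUs: 0"
```

Running the same counts on the real before/after dumps makes the missing PEX 8696 and Tesla entries obvious at a glance.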

What's going wrong in my setup ? :-(

Just to complete the picture of my setup:

I've got only two C6145 nodes connected:

Slot 1 -> first of the top nodes and
Slot 2 -> first of the bottom nodes

The GPUs are inside slots 1, 2, 3 and 4

and the mapping looks like

1    1,2,15,16
- vs ----------
5     N/A

2    3,4,13,14
- vs -------- 
6      N/A


Thanks in advance

 	martin


>
>
> ----- Original Message -----
> From: "Martin Flemming" <martin.flemming at desy.de>
> To: "Dell poweredge Mailling-liste" <linux-poweredge at dell.com>
> Sent: Wednesday, March 28, 2012 3:44:41 PM (GMT-0500) America/New_York
> Subject: Re: c410x, GPU/M2075 and C6145
>
>
> Hi, Pat !
>
> Thanks for the hint about slots to nodes - that problem is solved :-)
>
> But which driver do you use or build ?
>
> .. and another question ....
>
> ...  is your "system" also so extremely sensitive
> that if one machine gets rebooted, it loses "its" GPUs :-(
>
> .. and does this also mean that if you want the GPUs back for that machine, the
> whole "system" has to be shut down (first the nodes, then the c410x; restart the c410x, and last
> power on the nodes) ?
>
> thanks & cheers
>
> 	martin
>
>
> On Tue, 27 Mar 2012, Patrick McMahon wrote:
>
>> Martin,
>>
>> I have one of those same setups.
>>
>> I found I needed to power up the C410x and let it initialize well before powering up the C6145 (3 to 5 minutes).
>>
>> After that, I also found the problem was usually due to the thick cables and connectors not seating properly, either in the C410x or in the C6145.
>>
>> Do you have four cables from the c410 to the c6145?
>>
>> Slot #1 and Slot #3 on the C410x to the two connectors on the top node in the C6145
>> Slot #2 and Slot #4 on the C410x to the two connectors on the bottom node in the C6145
>>
>> That should allow for 4 GPU's per C6145 node with the default C410x mapping.
>>
>> Pat
>>
>>
>>
>>
>> ----- Original Message -----
>> From: "Martin Flemming" <martin.flemming at desy.de>
>> To: "Dell poweredge Mailling-liste" <linux-poweredge at dell.com>
>> Sent: Tuesday, March 27, 2012 1:39:49 AM (GMT-0500) America/New_York
>> Subject: c410x, GPU/M2075 and C6145
>>
>>
>> Hi !
>>
>> I've got a problem getting the
>> NVIDIA M2075 PCIe x16 GPGPU cards in a PowerEdge C410x
>> to work with (at this time two) C6145 :-(
>>
>> ... lspci shows nothing about them :-(
>>
>> After disabling all special PCIe BIOS settings,
>> one machine shows the controllers :-)
>>
>> 42:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
>> 43:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
>> 43:08.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
>> 45:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev ff)
>> 46:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev ff)
>> 46:08.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev ff)
>> 47:00.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev ff)
>> 48:04.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev ff)
>> 48:08.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev ff)
>> 49:00.0 3D controller: NVIDIA Corporation Tesla M2075 Dual-Slot Computing Processor Module (rev ff)
>> 4a:00.0 3D controller: NVIDIA Corporation Tesla M2075 Dual-Slot Computing Processor Module (rev ff)
>>
>> The other machine shows nothing :-(
>>
>> Both machines are connected via the default port mapping
>> (iPass mapping to PCIe controller):
>>
>> Mapping 1
>>
>> 1 -> 1,15
>> VS
>> 5 -> 2,16
>>
>>
>> But I can't build the nvidia driver on the machine with the detected
>> Tesla M2075:
>>
>> WARNING: You do not appear to have an NVIDIA GPU supported by the 295.20
>> NVIDIA Linux graphics driver installed in this system.  For further
>> details, please see the appendix SUPPORTED NVIDIA GRAPHICS CHIPS in the
>> README available on the Linux driver download page at www.nvidia.com.
>>
>> I'm running Scientific Linux 6.2 (Red Hat clone)
>>
>> Any hint is welcome !
>>
>> thanks & cheers,
>>
>>        Martin
>

_______________________________________________
Linux-PowerEdge mailing list
Linux-PowerEdge at dell.com
https://lists.us.dell.com/mailman/listinfo/linux-poweredge


-- 
Happiness lies in being privileged to work hard for long hours in doing whatever you think is worth doing. - Robert Heinlein
---
Patrick McMahon,  CITA IV
University of Delaware 
Department of Chemistry & Biochemistry
Phone: (302)831-4289   Mobile: (302)690-1049


