c410x, GPU/M2075 and C6145

Martin Flemming martin.flemming at desy.de
Fri Mar 30 08:02:36 CDT 2012


On Fri, 30 Mar 2012, Patrick McMahon wrote:

> Martin,
>
> I may be wrong, but with only 4 GPUs in the C410x, you want to put them in positions 1,2 & 15,16.
>
> Then use the default mapping on the C410x, which would put
>
> 1 & 15 on connector # 1
> 2 & 16 on connector # 5
>
> Then connect C410x connector #1 to the first slot on the top C6145 and C410x connector #5 to the first slot on the bottom C6145.
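>
> (As a quick sanity check after recabling, one way to confirm that each host really sees two GPUs is to list just the NVIDIA devices by their PCI vendor ID, 10de:
>
>     lspci -d 10de:            # should show two Tesla M2075 "3D controller" lines per host
>     lspci -d 10de: | wc -l    # quick count
>
> If either node doesn't see two devices, the cabling or the C410x mapping still isn't what that host expects.)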
>
> If you do it the way you currently have it (GPUs in 1,2,3,4) with that (1 vs 5) and (2 vs 6) mapping, you are creating two problems:
>
> 1) With GPUs missing from slots 15,16 & 13,14, there is no guarantee the PCI detection will complete correctly
>
> and
>
> 2) With that mapping you are using two PCI bridges in the C410x, so there's the added complication that PCI detection won't find your GPUs across multiply bridged and partially populated PCI buses.
>
>
> Not to mention that you are going to cost the users of the GPUs time... there is a delay for each cycle if you use more PCI bridges; every time the data needs to cross a bridge, there's a delay. I don't think your code will run at its optimum speed on your equipment with that setup.
>
> I think your setup would only work correctly with 8 GPUs in slots 1,2,3,4,13,14,15,16, and even that would not give optimum performance.
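>
> (One way to see how many bridges actually sit between a host and its GPUs is the tree view of lspci, for example:
>
>     lspci -tv | less    # shows the PCIe hierarchy; the Tesla devices hang off the PLX switches
>
> The more PLX switch levels between the host's root port and a GPU, the more hops every transfer has to cross.)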


Hi, Pat !

Unfortunately this setup is the only one
that got me 2 GPUs for each of these two nodes :-(

I know your favoured setup, and that was also my first logical attempt,
but (maybe I made some mistake, though I don't know what I could have done wrong)
I never saw 2 GPUs on each host; I think it was only 2 GPUs
on one host ... I know it sounds crazy, but the combination of nodes/GPUs
and the mapping on the c410x really drives me crazy :-(

OK, I will test your setup (which was also my first logical attempt) again on Monday,
but I'm afraid it won't work.

thanks again & nice weekend

        martin

>
>
>
>
> ----- Original Message -----
> From: "Martin Flemming" <martin.flemming at desy.de>
> To: "Dell poweredge Mailling-liste" <linux-poweredge at dell.com>
> Sent: Friday, March 30, 2012 7:57:23 AM (GMT-0500) America/New_York
> Subject: Re: c410x, GPU/M2075 and C6145
>
>
>> Martin,
>>
>> At the time I built the cluster, I installed:
>>
>> NVIDIA-Linux-x86_64-285.05.09.run
>>
>> and
>>
>> CUDA-Z-0.5.95-i686.run
>>
>> The second part is a bit "complicated" - if the machine gets rebooted for any reason other than a hung GPU process (a "normal" reboot), then the GPUs are there and ready when the machine comes back up.
>>
>> We have done development work on the GPUs with programs that can lock up the PCI bus, and the computer then gets rebooted, but in that case the GPUs are not there and we need to power off and restart everything.
>>
>> "nvidia-smi -L" as "root" - that command is your friend - I've toyed with the idea of adding it to rc.local; it will wake up GPUs that seem unresponsive after a reboot.
>>
>> Pat
>>
>>
>
> Hi, and thanks for the response!
>
> The driver works now, even the newer ones :-)
> ... which they didn't do before ?!?!?!?
>
> Actually, my only remaining problem is
> losing the GPUs after a normal reboot :-(
>
> Adding "nvidia-smi -L" and/or "/usr/bin/nvidia-smi -pm 1"
> to /etc/rc.local didn't solve the problem either, because it seems the
> controller/GPUs are really gone :-(
>
>
> /usr/bin/nvidia-smi -pm 1
> FATAL: Error inserting nvidia
> (/lib/modules/2.6.32-220.7.1.el6.x86_64/kernel/drivers/video/nvidia.ko): No such device
> NVIDIA: failed to load the NVIDIA kernel module.
> Nvidia-smi has failed because it couldn't communicate with NVIDIA driver.
>  Make sure that latest NVIDIA driver is installed and running.
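>
> ("No such device" means the nvidia module can't find any NVIDIA device on the PCI bus at all, which matches the lspci output below. One thing that might be worth trying before reloading the driver - no guarantee it helps if the C410x side has dropped the links - is asking the kernel to re-enumerate the bus:
>
>     echo 1 > /sys/bus/pci/rescan    # rescan the PCI bus (run as root)
>     lspci -d 10de:                  # check whether the Teslas reappear
>
> If they still don't show up, the GPUs really are gone from the host's point of view and, as described earlier in the thread, only a full power cycle of the nodes and the C410x brings them back.)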
>
> Before the reboot:
> 42:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
> 43:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
> 43:08.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
> 45:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
> 46:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
> 46:08.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
> 48:00.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev aa)
> 49:04.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev aa)
> 49:08.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev aa)
> 49:10.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev aa)
> 49:14.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev aa)
> 4a:00.0 3D controller: nVidia Corporation Tesla M2075 Dual-Slot Computing Processor Module (rev a1)
> 4c:00.0 3D controller: nVidia Corporation Tesla M2075 Dual-Slot Computing Processor Module (rev a1)
>
> After the reboot:
>
> 42:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
> 43:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
> 43:08.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
> 45:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
> 46:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
> 46:08.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
>
> What's going wrong in my setup? :-(
>
> Just to complete the picture of my setup:
>
> I've got only two C6145 nodes connected:
>
> Slot 1 -> first of the top nodes and
> Slot 2 -> first of the bottom nodes
>
> The GPUs are in C410x slots 1, 2, 3 and 4,
>
> and the mapping looks like:
>
> port 1 -> PCIe slots 1,2,15,16
>   - vs -
> port 5 -> N/A
>
> port 2 -> PCIe slots 3,4,13,14
>   - vs -
> port 6 -> N/A
>
>
> Thanks in advance
>
> 	martin
>
>
>>
>>
>> ----- Original Message -----
>> From: "Martin Flemming" <martin.flemming at desy.de>
>> To: "Dell poweredge Mailling-liste" <linux-poweredge at dell.com>
>> Sent: Wednesday, March 28, 2012 3:44:41 PM (GMT-0500) America/New_York
>> Subject: Re: c410x, GPU/M2075 and C6145
>>
>>
>> Hi, Pat !
>>
>> Thanks for the hint about the slot-to-node cabling; that problem is solved :-)
>>
>> But which driver do you use or build ?
>>
>> .. and another question ....
>>
>> ...  is your "system" also so extremely sensitive
>> that if one machine gets rebooted, it loses "its" GPUs :-(
>>
>> .. and does that also mean that if you want the GPUs back for that machine, the
>> whole "system" has to be shut down (first the nodes, then the c410x; restart the c410x and finally
>> power the nodes back on)?
>>
>> thanks & cheers
>>
>> 	martin
>>
>>
>> On Tue, 27 Mar 2012, Patrick McMahon wrote:
>>
>>> Martin,
>>>
>>> I have one of those same setups.
>>>
>>> I found I needed to power up the C410x and let it initialize well before powering up the C6145 (3 or 5 min...)
>>>
>>> After that I also found the problem was usually due to the thick cables and connectors not seating properly, either in the c410 or in the c6145.
>>>
>>> Do you have four cables from the c410 to the c6145?
>>>
>>> Slot #1 and Slot #3 on the C410x to the two connectors on the top node in the C6145
>>> Slot #2 and Slot #4 on the C410x to the two connectors on the bottom node in the C6145
>>>
>>> That should allow for 4 GPUs per C6145 node with the default C410x mapping.
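>>>
>>> (With all four cables and the default mapping, "nvidia-smi -L" run as root on each node should then print one line per GPU, e.g.
>>>
>>>     nvidia-smi -L
>>>     GPU 0: Tesla M2075 (...)
>>>     GPU 1: Tesla M2075 (...)
>>>     GPU 2: Tesla M2075 (...)
>>>     GPU 3: Tesla M2075 (...)
>>>
>>> Fewer lines than expected usually points back at the cabling or the C410x mapping.)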
>>>
>>> Pat
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: "Martin Flemming" <martin.flemming at desy.de>
>>> To: "Dell poweredge Mailling-liste" <linux-poweredge at dell.com>
>>> Sent: Tuesday, March 27, 2012 1:39:49 AM (GMT-0500) America/New_York
>>> Subject: c410x, GPU/M2075 and C6145
>>>
>>>
>>> Hi !
>>>
>>> I've got a problem getting the
>>> NVIDIA M2075 PCIe x16 GPGPU cards in a PowerEdge C410x
>>> to work with (at the moment, two) C6145 nodes :-(
>>>
>>> ... lspci shows nothing about them :-(
>>>
>>> After disabling all the special PCIe BIOS settings,
>>> one machine shows the controller :-)
>>>
>>> 42:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
>>> 43:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
>>> 43:08.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev bb)
>>> 45:00.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev ff)
>>> 46:04.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev ff)
>>> 46:08.0 PCI bridge: PLX Technology, Inc. PEX 8647 48-Lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev ff)
>>> 47:00.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev ff)
>>> 48:04.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev ff)
>>> 48:08.0 PCI bridge: PLX Technology, Inc. PEX 8696 96-lane, 24-Port PCI Express Gen 2 (5.0 GT/s) Multi-Root Switch (rev ff)
>>> 49:00.0 3D controller: NVIDIA Corporation Tesla M2075 Dual-Slot Computing Processor Module (rev ff)
>>> 4a:00.0 3D controller: NVIDIA Corporation Tesla M2075 Dual-Slot Computing Processor Module (rev ff)
>>>
>>> The other machine shows nothing :-(
>>>
>>> Both machines are connected via the default port mapping
>>> (iPass to PCIe controller):
>>>
>>> Mapping 1:
>>>
>>> port 1 -> slots 1,15
>>>    vs.
>>> port 5 -> slots 2,16
>>>
>>>
>>> But I can't build the nvidia driver on the machine with the detected
>>> Tesla M2075:
>>>
>>> WARNING: You do not appear to have an NVIDIA GPU supported by the 295.20
>>> NVIDIA Linux graphics driver installed in this system.  For further
>>> details, please see the appendix SUPPORTED NVIDIA GRAPHICS CHIPS in the
>>> README available on the Linux driver download page at www.nvidia.com.
>>>
>>> I'm running Scientific Linux 6.2 (a Red Hat clone).
>>>
>>> Any hint is welcome !
>>>
>>> thanks & cheers,
>>>
>>>        Martin
>>
>> _______________________________________________
>> Linux-PowerEdge mailing list
>> Linux-PowerEdge at dell.com
>> https://lists.us.dell.com/mailman/listinfo/linux-poweredge
>>
>>
>> --
>> Happiness lies in being privileged to work hard for long hours in doing whatever you think is worth doing. - Robert Heinlein
>> ---
>> Patrick McMahon,  CITA IV
>> University of Delaware
>> Department of Chemistry & Biochemistry
>> Phone: (302)831-4289   Mobile: (302)690-1049
>>
>>
>
> _______________________________________________
> Linux-PowerEdge mailing list
> Linux-PowerEdge at dell.com
> https://lists.us.dell.com/mailman/listinfo/linux-poweredge
>
>
> -- 
> Happiness lies in being privileged to work hard for long hours in doing whatever you think is worth doing. - Robert Heinlein
> ---
> Patrick McMahon,  CITA IV
> University of Delaware
> Department of Chemistry & Biochemistry
> Phone: (302)831-4289   Mobile: (302)690-1049
>
>


