NFS problems with 2650 and 2.4.18-26.7

Steven Kehlet steven.kehlet at conexant.com
Mon Mar 24 13:02:00 CST 2003


Have you tried tcp?

fileserver:/home      /home  nfs    tcp,nosuid,rsize=8192,wsize=8192,timeo=600,hard,intr 0 0

I'm still trying various kernel revs and mount options to get rid of my
own NFS woes.  TCP looks promising and would probably help with your
fragmentation problem; see the following NetApp bug:

http://now.netapp.com/Knowledgebase/solutionarea.asp?id=4.0.852643.2515326&resource=
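
A back-of-the-envelope sketch of why large block sizes over UDP end up
fragmented. It assumes a standard 1500-byte Ethernet MTU (not stated in
this thread) and takes the per-WRITE overhead from the tcpdump output
quoted below, where a 1024-byte write travels in a 1156-byte UDP payload.
The function name is made up for illustration.

```python
import math

IP_HEADER = 20    # bytes, IPv4 header without options
UDP_HEADER = 8    # bytes
# From the quoted trace: "1156 write ... 1024 bytes" means 132 bytes of
# RPC/NFS header ride along with each WRITE in this setup.
RPC_NFS_WRITE_OVERHEAD = 1156 - 1024


def fragments_for_write(wsize, mtu=1500):
    """Estimate how many IP fragments one NFS-over-UDP WRITE needs.

    mtu=1500 is an assumption (plain Ethernet, no jumbo frames).
    """
    ip_payload = wsize + RPC_NFS_WRITE_OVERHEAD + UDP_HEADER
    # Every fragment except the last carries a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil(ip_payload / per_fragment)


for wsize in (1024, 8192, 32768):
    print(f"wsize={wsize:5d}: {fragments_for_write(wsize)} IP fragment(s)")
```

Under those assumptions a 1K write fits in a single datagram, while 8K
and 32K writes are split into several fragments, and losing any one of
them costs the whole RPC. Over TCP the same write is cut into MTU-sized
segments by TCP itself, so no IP-level reassembly is involved.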


I'm currently trying 2.4.18-27 with the following mount options:

tcp,rsize=32768,wsize=32768,hard,intr,timeo=600

So far, no errors on a very lightly loaded system.  I'll try my Oracle
server later this week...
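
To make the difference between the two setups easier to see, here is a
small sketch that parses the option string from the original report
(quoted below) and the one I'm testing, then lists what changed. The
helper name parse_opts is hypothetical, and the parsing is only meant for
simple comma-separated strings like these.

```python
def parse_opts(opts):
    """Split a comma-separated NFS option string into a dict.

    Flags such as "hard" map to True; key=value options map to ints
    (every value in this thread is numeric).
    """
    parsed = {}
    for item in opts.split(","):
        key, sep, value = item.partition("=")
        parsed[key] = int(value) if sep else True
    return parsed


original = parse_opts("rw,nosuid,soft,rsize=1024,wsize=1024,timeo=14,intr")
current = parse_opts("tcp,rsize=32768,wsize=32768,hard,intr,timeo=600")

# Collect every option whose value differs between the two mounts.
changed = {
    key: (original.get(key), current.get(key))
    for key in sorted(set(original) | set(current))
    if original.get(key) != current.get(key)
}

for key, (old, new) in changed.items():
    print(f"{key:8} {old!s:>6} -> {new!s:>6}")
```

The changes that matter for the hang are soft to hard, UDP to TCP, and
the much longer timeo (timeo is given in tenths of a second, so 14 is
1.4 seconds and 600 is 60 seconds).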

Steve





On Thu, 2003-03-20 at 19:59, David C. Kovar wrote:
> Good evening,
> 
> I'm rather stumped by an NFS error we're seeing and I am not sure where
> to investigate next.
> 
> Earlier in the week I posted a question about some NFS errors we were
> seeing. One particular file, when written to an NFS partition, would
> cause the NFS mount to hang. I've narrowed the problem down
> considerably.
> 
> Both machines are 2650s running 2.4.18-26.7, connected to a fairly
> lightly loaded Dell gigabit switch in a very generic configuration.
> 
> In my initial tcpdump traces and netstat -r output we were seeing a lot
> of IP fragmentation. Once the file server replied with an "icmp
> reassembly" message, it would not respond to any new packets from the
> client.
> 
> We turned the block sizes down to 1K on the client to prevent
> fragmentation and ran the test again. 
> 
> The write will succeed most of the time, but it will take anywhere from
> 9 seconds to 1 minute. When it fails, the file server stops replying to
> packets, the client ARPs for it, the client tries again, and the cycle
> repeats.
> 
> In both cases - fragmentation and no fragmentation - once the file
> server stops replying, the exchange fails.
> 
> It's only this one 196M file. Larger files and smaller files work fine,
> and an identically sized file does not have any problems.
> 
> What bug am I exercising and what is it about this file that is
> exercising the bug?
> 
> Mount options:
> 
> fileserver:/home      /home  nfs    rw,nosuid,soft,rsize=1024,wsize=1024,timeo=14,intr 0 0
> 
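> That is: NFS mount, read-write, setuid bits ignored (nosuid), soft
> mount, read and write block sizes of 1024 bytes, a 1.4-second initial
> timeout (timeo is in tenths of a second), and interruptible. Being a
> soft mount, a stalled request eventually times out with an error
> instead of hanging.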
> 
> Sometimes it works:
> 
> 57 >time cp cmc-backup.mpp foo1.mpp
> 
> real    0m30.834s
> user    0m0.002s
> sys     0m0.000s
> Thu Mar 20 18:54:10 kovar at bee02.kealia.com:~/test
> 
> (The time will vary between 9 seconds and a minute for the cp command to
> complete.)
> 
> Sometimes it fails:
> 
> 58 >^1^2
> time cp cmc-backup.mpp foo2.mpp
> cp: writing `foo2.mpp': Input/output error
> 
> real    1m2.932s
> user    0m0.002s
> sys     0m0.004s
> Thu Mar 20 18:55:36 kovar at bee02.kealia.com:~/test
> 
> 
> Network traffic at failure:
> 
> [A normal write/ACK exchange.]
> 18:54:33.216095 bee02.kealia.com.2952007036 > fileserver.nfs: 1156 write
> fh
> Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006400
> 1024 bytes @ 0x000006400 <filesync> (DF) (ttl 64, id 0, len 1184)
> 18:54:33.216309 fileserver.nfs > bee02.kealia.com.2952007036: reply ok
> 136 write PRE: POST: REG 100644 ids 2031/2031 sz 0x000006800 nlink 1
> rdev 0/0 fsid 0x000000000 nodeid 0x000000000 a/m/ctime 1048215273.000000
> 1048215273.000000 1048215273.000000 1024 bytes <filesync> (DF) (ttl 64,
> id 0, len 164)
> [Four unACK'd writes.]
> 18:54:33.216334 bee02.kealia.com.2968784252 > fileserver.nfs: 1156 write
> fh
> Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
> 1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
> 18:54:34.613551 bee02.kealia.com.2968784252 > fileserver.nfs: 1156 write
> fh
> Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
> 1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
> 18:54:37.410087 bee02.kealia.com.2968784252 > fileserver.nfs: 1156 write
> fh
> Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
> 1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
> 18:54:43.003177 bee02.kealia.com.2968784252 > fileserver.nfs: 1156 write
> fh
> Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
> 1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
> [Bee02 checks fileserver's address with ARP and gets a reply.]
> 18:54:48.002570 arp who-has fileserver tell bee02.kealia.com
> 18:54:48.002642 arp reply fileserver is-at 0:6:5b:f2:c0:5f
> [Seven unACK'd writes.]
> 18:54:54.189479 bee02.kealia.com.2985561468 > fileserver.nfs: 1156 write
> fh
> Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
> 1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
> 18:54:55.587612 bee02.kealia.com.2985561468 > fileserver.nfs: 1156 write
> fh
> Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
> 1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
> 18:54:58.384155 bee02.kealia.com.2985561468 > fileserver.nfs: 1156 write
> fh
> Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
> 1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
> 18:55:03.977238 bee02.kealia.com.2985561468 > fileserver.nfs: 1156 write
> fh
> Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
> 1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
> 18:55:15.163486 bee02.kealia.com.3002338684 > fileserver.nfs: 1156 write
> fh
> Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
> 1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
> 18:55:16.561675 bee02.kealia.com.3002338684 > fileserver.nfs: 1156 write
> fh
> Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
> 1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
> 18:55:19.358218 bee02.kealia.com.3002338684 > fileserver.nfs: 1156 write
> fh
> Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
> 1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
> [Cycle repeats until timeout.]
> 
> 
> 
> 
> 
> 
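
The retransmission gaps in the quoted trace can be checked against the
timeo=14 mount option. The sketch below assumes the client doubles its
timeout after each retransmission, as nfs(5) describes for UDP mounts;
the timestamps are copied from the four writes with RPC id 2968784252.

```python
from datetime import datetime

# Timestamps of the first unanswered write and its three retransmissions.
stamps = [
    "18:54:33.216334",
    "18:54:34.613551",
    "18:54:37.410087",
    "18:54:43.003177",
]
times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]
gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]

timeo = 14  # from the mount options, in tenths of a second
# Assumed back-off: the initial timeout doubles after every retransmission.
expected = [timeo / 10 * 2 ** i for i in range(len(gaps))]

for gap, exp in zip(gaps, expected):
    print(f"observed {gap:.3f}s   expected {exp:.1f}s")
```

The observed gaps come out within a few milliseconds of 1.4, 2.8 and
5.6 seconds, so the client appears to be retrying exactly as configured
and it is the server that has gone quiet.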



