NFS problems with 2650 and 2.4.18-26.7

David C. Kovar kovar at kealia.com
Thu Mar 20 21:48:00 CST 2003


Good evening,

I'm rather stumped by an NFS error we're seeing and I am not sure where
to investigate next.

Earlier in the week I posted a question about some NFS errors we were
seeing. One particular file, when written to an NFS partition, would
cause the NFS mount to hang. I've narrowed the problem down
considerably.

Both machines are PowerEdge 2650s running kernel 2.4.18-26.7, connected
to a fairly lightly loaded Dell gigabit switch in a very generic
configuration.

In my initial tcpdump traces and netstat -r output we were seeing a lot
of IP fragmentation. Once the file server replied with an "ICMP
reassembly" message, it would not respond to any new packets from the
client.

We turned the block sizes down to 1K on the client to prevent
fragmentation and ran the test again. 
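
As a rough check on why 1K blocks avoid fragmentation, here is a minimal
sketch of the arithmetic. It takes the packet sizes from the trace later
in this message (a 1024-byte write has a 1156-byte UDP payload and an IP
length of 1184, which fits 1156 + 8 + 20). The 1500-byte MTU and the
8192-byte comparison size are assumptions, not values from our setup,
and the sketch assumes the RPC/NFS header size stays the same for larger
writes.

```python
import math

IP_HEADER = 20    # bytes, IPv4 header without options
UDP_HEADER = 8    # bytes
MTU = 1500        # assumption: standard Ethernet MTU, not stated in the post

# From the trace: a 1024-byte write has a 1156-byte UDP payload,
# so the RPC/NFS headers take 1156 - 1024 = 132 bytes.
RPC_NFS_OVERHEAD = 1156 - 1024


def ip_datagram_len(wsize):
    """Total IP length of one NFS write carrying wsize bytes of data."""
    return wsize + RPC_NFS_OVERHEAD + UDP_HEADER + IP_HEADER


def fragment_count(wsize, mtu=MTU):
    """Number of IP fragments needed to carry one write of wsize bytes."""
    ip_payload = ip_datagram_len(wsize) - IP_HEADER
    # Every fragment except the last carries a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil(ip_payload / per_fragment)


# 8192 is a hypothetical larger block size used only for comparison.
for wsize in (1024, 8192):
    print(wsize, ip_datagram_len(wsize), fragment_count(wsize))
```

Under those assumptions a 1024-byte write fits in one packet, while an
8192-byte write would be split into several fragments.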

The write will succeed most of the time, but it will take anywhere from
9 seconds to 1 minute. When it fails, the file server stops replying to
packets, the client ARPs for it, the client tries again, and the cycle
repeats.

In both cases - fragmentation and no fragmentation - once the file
server stops replying, the exchange fails.

It's only this one 196M file. Larger files and smaller files work fine,
and an identically sized file does not have any problems.

What bug am I exercising and what is it about this file that is
exercising the bug?

Mount options:

fileserver:/home      /home  nfs    rw,nosuid,soft,rsize=1024,wsize=1024,timeo=14,intr 0 0

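This is a read-write NFS mount with setuid bits ignored (nosuid),
soft-mounted and interruptible (intr), with read and write block sizes
of 1024 bytes and an initial RPC timeout of 1.4 seconds (timeo is in
tenths of a second). Because the mount is soft, a write that keeps
timing out eventually fails with an I/O error.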

Sometimes it works:

57 >time cp cmc-backup.mpp foo1.mpp

real    0m30.834s
user    0m0.002s
sys     0m0.000s
Thu Mar 20 18:54:10 kovar at bee02.kealia.com:~/test

(The time will vary between 9 seconds and a minute for the cp command to
complete.)

Sometimes it fails:

58 >^1^2
time cp cmc-backup.mpp foo2.mpp
cp: writing `foo2.mpp': Input/output error

real    1m2.932s
user    0m0.002s
sys     0m0.004s
Thu Mar 20 18:55:36 kovar at bee02.kealia.com:~/test


Network traffic at failure:

[A normal write/ACK exchange.]
18:54:33.216095 bee02.kealia.com.2952007036 > fileserver.nfs: 1156 write
fh
Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006400
1024 bytes @ 0x000006400 <filesync> (DF) (ttl 64, id 0, len 1184)
18:54:33.216309 fileserver.nfs > bee02.kealia.com.2952007036: reply ok
136 write PRE: POST: REG 100644 ids 2031/2031 sz 0x000006800 nlink 1
rdev 0/0 fsid 0x000000000 nodeid 0x000000000 a/m/ctime 1048215273.000000
1048215273.000000 1048215273.000000 1024 bytes <filesync> (DF) (ttl 64,
id 0, len 164)
[The next write is sent four times with no reply.]
18:54:33.216334 bee02.kealia.com.2968784252 > fileserver.nfs: 1156 write
fh
Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
18:54:34.613551 bee02.kealia.com.2968784252 > fileserver.nfs: 1156 write
fh
Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
18:54:37.410087 bee02.kealia.com.2968784252 > fileserver.nfs: 1156 write
fh
Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
18:54:43.003177 bee02.kealia.com.2968784252 > fileserver.nfs: 1156 write
fh
Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
[Bee02 checks fileserver's address with ARP, gets a reply.]
18:54:48.002570 arp who-has fileserver tell bee02.kealia.com
18:54:48.002642 arp reply fileserver is-at 0:6:5b:f2:c0:5f
[Seven more unanswered transmissions of the same write.]
18:54:54.189479 bee02.kealia.com.2985561468 > fileserver.nfs: 1156 write
fh
Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
18:54:55.587612 bee02.kealia.com.2985561468 > fileserver.nfs: 1156 write
fh
Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
18:54:58.384155 bee02.kealia.com.2985561468 > fileserver.nfs: 1156 write
fh
Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
18:55:03.977238 bee02.kealia.com.2985561468 > fileserver.nfs: 1156 write
fh
Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
18:55:15.163486 bee02.kealia.com.3002338684 > fileserver.nfs: 1156 write
fh
Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
18:55:16.561675 bee02.kealia.com.3002338684 > fileserver.nfs: 1156 write
fh
Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
18:55:19.358218 bee02.kealia.com.3002338684 > fileserver.nfs: 1156 write
fh
Unknown/010000020008000501800000A3027900A47D59ABE80279000000000000006800
1024 bytes @ 0x000006800 <filesync> (DF) (ttl 64, id 0, len 1184)
[Cycle repeats until timeout.]
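
The spacing of the retransmissions above looks like the client doubling
its timeout from the timeo=14 mount option (1.4 seconds), then starting
over with a new transaction ID after every fourth send. The following is
a minimal sketch that checks this against the timestamps in the trace.
The helper names are made up for illustration, and the doubling
interpretation is my reading of the numbers, not something confirmed
from the kernel source. It says nothing about why the server stops
replying.

```python
from datetime import datetime

# Send times of the repeated write at offset 0x6800, copied from the trace.
SEND_TIMES = [
    "18:54:33.216334", "18:54:34.613551", "18:54:37.410087", "18:54:43.003177",
    "18:54:54.189479", "18:54:55.587612", "18:54:58.384155", "18:55:03.977238",
    "18:55:15.163486", "18:55:16.561675", "18:55:19.358218",
]

TIMEO_TENTHS = 14  # timeo=14 from the mount options, in tenths of a second


def gaps(stamps):
    """Seconds between consecutive timestamps (all on the same day)."""
    times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


def nearest_backoff_step(gap, base):
    """Return n in 0..3 for which base * 2**n is closest to gap."""
    return min(range(4), key=lambda n: abs(base * 2 ** n - gap))


base = TIMEO_TENTHS / 10.0
for gap in gaps(SEND_TIMES):
    n = nearest_backoff_step(gap, base)
    print(f"gap {gap:7.3f}s  ~ {base * 2 ** n:4.1f}s (timeo x 2^{n})")
```

Each gap lands within a few hundredths of a second of 1.4, 2.8, 5.6, or
11.2 seconds, in that repeating order.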







-- 
David C. Kovar <kovar at kealia.com>



