LINUX: How To Troubleshoot High I/O WAIT Issues in Linux Servers With NFS Shares?

In Linux systems with NFS shares exported from storage servers (such as ZFS Storage Appliance or NetApp), any time the load on the storage exceeds what it can handle, new and existing NFS requests from Linux clients will see slowness, because there will be I/O WAITs on the NFS devices. I/O WAITs can happen for a number of reasons, such as backups running, a virus scan (which does significant reads), or high I/O ops coming from a particular process on a particular VM client.

In virtualized deployments like Oracle VM Manager, if the repository filesystem is on NFS shares exported from storage servers (for example, in systems like Exalogic), any read/write operations on local disks (non-NFS shares) on guest VMs will also be NFS ops on the ZFS shares. That is because the root system disk of the guest VMs resides in repository shares on the Dom0 hypervisors, which are exported from storage. So any heavy reads/writes done on a guest VM's local disk will also lead to NFS read/write ops on the OVS repository share.
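
For example, to confirm that a repository is NFS-backed, you can list the NFS mounts on the Dom0 hypervisor. Below is a minimal sketch; the repository path mentioned in the comment is typical, but treat it as an assumption for your environment.

# List all NFS/NFSv4 mounts on the Dom0 hypervisor; NFS-backed OVS
# repositories typically appear under /OVS/Repositories.
mount -t nfs,nfs4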


In order to narrow down high I/O WAIT issues, the following things should be checked:


•     Which Linux client is seeing an increase in the number of reads and writes at the time of the issue.
•     Which NFS devices are seeing longer I/O WAITs on the Linux clients.
•     Which processes on the Linux clients show a significant increase in ops at the time of the issue.
•     At the storage level, to narrow down the issue further, get a report that shows NFSv3/NFSv4 operations per client and NFSv3/NFSv4 operations per share drilled down by client. The steps and methods to get the report vary depending on the storage server being used; in the case of the ZFS Storage Appliance, analytics can be enabled from the ZFS BUI console.

The following are the steps for troubleshooting I/O WAIT issues.

1. On the affected Linux client VMs, run the iostat command below to get device stats such as the number of reads/writes, utilization, and wait times (the -x flag produces the extended statistics, including await and %util, shown in the sample).

iostat -x
Below is a sample snippet of the iostat output. You will see the overall I/O WAIT (%iowait) from the avg-cpu line as well as the individual wait times (await) for each device.

If the utilization (%util) is very low but the await times are high, the storage side is responding slowly, and that is what causes the slowness on the Linux client side.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.04    0.00    0.15    0.02    0.00   99.79

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  1.00     0.00     4.00     8.00     0.00    0.00   0.00   0.00
sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda4              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda5              0.00     0.00  0.00  1.00     0.00     4.00     8.00     0.00    0.00   0.00   0.00
sda6              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00  1.00  1.00     2.00     0.50     2.50     0.00    0.50   0.50   0.10
dm-1              0.00     0.00  0.00  4.00     0.00    44.00    22.00     0.01    2.25   1.75   0.70
dm-2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-3              0.00     0.00  0.00  3.00     0.00    16.00    10.67     0.00    1.33   1.33   0.40
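
To watch for devices that are genuinely waiting on storage rather than relying on a single snapshot, you can take repeated extended samples and filter on the await column. Below is a minimal sketch; the 20 ms threshold is an arbitrary assumption, and the column positions match the older sysstat layout shown above (newer versions split await into r_await/w_await).

# Take 3 extended samples, 5 seconds apart, and print any device whose
# await (column 10 in this layout) exceeds 20 ms, along with its %util.
iostat -x 5 3 | awk '$10+0 > 20 {print $1, "await=" $10 "ms", "util=" $NF "%"}'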

2. On the affected Linux clients, run the nfsiostat command below.

nfsiostat

The nfsiostat command shows the read/write bytes and ops on each of the NFS devices. Below is a sample snippet of its output.

Device:                   rkB_nor/s    wkB_nor/s    rkB_dir/s    wkB_dir/s    rkB_svr/s    wkB_svr/s     ops/s    rops/s    wops/s
192.168.21.5:/export/general/         0.00         0.00         0.00         0.00         0.00         0.00     85.00      0.00      0.00

You can then correlate the read/write ops of the NFS shares from nfsiostat with the I/O wait and read/write details of the devices from iostat, and narrow down which NFS share has significant writes, ops, and I/O WAIT.
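
As with iostat, it helps to sample nfsiostat at intervals rather than taking a single snapshot, so a sustained rise in ops stands out from a one-off spike. Below is a minimal sketch; the mount point is an assumption, so substitute the share you are investigating.

# Report stats for the given NFS mount every 5 seconds, 3 times.
nfsiostat 5 3 /mnt/general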


3. Run the iotop command below on the affected Linux client.

iotop -n 3 -d 5 -t  
The iotop command shows which processes are seeing a significant increase in ops and reads/writes at the time of the issue. This helps narrow down whether a particular process on the client VM is causing the high I/O ops and reads/writes. Below is sample output of iotop.


Total DISK READ :       0.00 B/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:       0.00 B/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 4608 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [dm-thin]
    1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % systemd --switched-root --system --deserialize 21
    2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
    3 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
    5 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker/0:0H]
    7 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [rcu_sched]
    8 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [rcu_bh]
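
If the output is dominated by idle threads like the ones above, you can restrict iotop to processes that are actually doing I/O and capture the output in batch mode for later comparison. Below is a minimal sketch; the log path is an assumption.

# -o shows only processes/threads actually doing I/O, -b runs in batch
# (non-interactive) mode, -q suppresses repeated headers, and -t adds a
# timestamp to each line.
iotop -obqt -n 3 -d 5 > /tmp/iotop_sample.log
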
The above steps will help narrow down whether high I/O WAIT is caused by slowness on the storage (ZFS) heads, by a particular share/device, or by a particular process on the Linux client.

Products to which Article Applies


All Linux Operating Systems.


Additional Reference

https://bencane.com/2012/08/06/troubleshooting-high-io-wait-in-linux/


 
