May 15, 2008

What goes on inside the kernel when you issue a system call

Ever wondered what happens inside the kernel when you issue a system call like read or write? DTrace is the answer. This simple snippet of DTrace code can trace every operation performed inside the kernel while the call is in flight.

#!/usr/sbin/dtrace -Fs
#pragma D option flowindent

syscall::read:entry /pid == 5142/ { self->follow = 1; }
:::entry,return /self->follow/ { }
syscall::read:return /pid == 5142/ { self->follow = 0; }



Replace "read" with whichever system call you want to trace, and PID 5142 with the process ID you are interested in. Alternatively, you can match on the process name with a predicate like /execname == "process name"/.
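
For example, here is a quick one-liner that follows the write(2) calls of any process named "dd" (both the syscall and the process name are just illustrative placeholders, pick your own):

root# dtrace -Fn '
    /* start following when a dd process enters write(2) */
    syscall::write:entry /execname == "dd"/ { self->follow = 1; }
    /* show every probe entry/return while the flag is set */
    :::entry,return /self->follow/ { }
    /* stop following when write(2) returns */
    syscall::write:return /self->follow/ { self->follow = 0; }'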

Weird "vxdmpadm listenclosure" output ...

Last week, I ran into a situation where the output of "vxdmpadm listenclosure all" on a Solaris 10 host looked weird.
Here is the output I saw:
root# vxdmpadm listenclosure all
ENCLR_NAME      ENCLR_TYPE    ENCLR_SNO         STATUS     ARRAY_TYPE
======================================================================
Disk            Disk          DISKS             CONNECTED  Disk
EMC_CLARiiON0   EMC_CLARiiON  trFMOCSFRMHsURBB  CONNECTED  A/A
EMC_CLARiiON1   EMC_CLARiiON  lwIEFHOLRMHsURBB  CONNECTED  A/A
EMC_CLARiiON2   EMC_CLARiiON  psNOJHOHRMHsURBB  CONNECTED  A/A
EMC_CLARiiON3   EMC_CLARiiON  lsOPNKRERMHsURBB  CONNECTED  A/A
EMC_CLARiiON4   EMC_CLARiiON  twHKLPSARMHsURBB  CONNECTED  A/A
EMC_CLARiiON5   EMC_CLARiiON  rsJLMIJIRMHsURBB  CONNECTED  A/A
EMC_CLARiiON6   EMC_CLARiiON  pjLIAIMJRMHsURBB  CONNECTED  A/A
EMC_CLARiiON7   EMC_CLARiiON  vtCAKEGARMHsURBB  CONNECTED  A/A
EMC_CLARiiON8   EMC_CLARiiON  pjEGHMSGRMHsURBB  CONNECTED  A/A
EMC_CLARiiON9   EMC_CLARiiON  rpKINBFOQMHsURBB  CONNECTED  A/A
EMC_CLARiiON10  EMC_CLARiiON  tmOIFMRLQMHsURBB  CONNECTED  A/A
EMC_CLARiiON11  EMC_CLARiiON  nqMMMKIORMHsURBB  CONNECTED  A/A
EMC_CLARiiON12  EMC_CLARiiON  nkCOHEIFQMHsURBB  CONNECTED  A/A
EMC_CLARiiON13  EMC_CLARiiON  vuEPBLMEQMHsURBB  CONNECTED  A/A
EMC_CLARiiON14  EMC_CLARiiON  prNMBARDQMHsURBB  CONNECTED  A/A
EMC_CLARiiON15  EMC_CLARiiON  ppLEGBFLQMHsURBB  CONNECTED  A/A
EMC_CLARiiON16  EMC_CLARiiON  jhOJFASBQMHsURBB  CONNECTED  A/A
EMC_CLARiiON17  EMC_CLARiiON  xhNJOMEGQMHsURBB  CONNECTED  A/A
EMC_CLARiiON18  EMC_CLARiiON  rpGIPKEDQMHsURBB  CONNECTED  A/A
EMC_CLARiiON19  EMC_CLARiiON  xoMDMALNQMHsURBB  CONNECTED  A/A
EMC_CLARiiON20  EMC_CLARiiON  lhGOAEFBQMHsURBB  CONNECTED  A/A


Even though there is only one CX array connected to this host, DMP reports a separate CLARiiON enclosure for every LUN presented to it (EMC_CLARiiON0 through EMC_CLARiiON20). The host is using EMC PowerPath, so this is not a functional issue, but it looks weird and causes confusion.

To fix this, I had a look at the CX configuration for the host's storage group and found something interesting. Here are the initiator records reported by the SP:

SP Name: SP B
SP Port ID: 2
HBA Devicename: PwrP:c3t5006016A1060284Bd0s0
Trusted: NO
Logged In: YES
Source ID: 6946832
Defined: YES
Initiator Type: 3
StorageGroup Name: sstm9732psy
ArrayCommPath: 1
Unit serial number: LUN
Failover mode: 1

SP Name: SP A
SP Port ID: 2
HBA Devicename: PwrP:c3t500601621060284Bd0s0
Trusted: NO
Logged In: YES
Source ID: 6946832
Defined: YES
Initiator Type: 3
StorageGroup Name: sstm9732psy
ArrayCommPath: 1
Unit serial number: LUN
Failover mode: 1


The "Unit serial number: LUN" setting causes each LUN to report its own serial number to the host, so DMP treats every LUN as a separate enclosure, which produces the weird output. Once the storage team changed the setting to "Array", everything looked right again:

ENCLR_NAME      ENCLR_TYPE    ENCLR_SNO         STATUS     ARRAY_TYPE
======================================================================
Disk            Disk          DISKS             CONNECTED  Disk
EMC_CLARiiON0   EMC_CLARiiON  CK200040200384    CONNECTED  A/A
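
For reference, the initiator records shown earlier were pulled from the storage processor with the Navisphere CLI. A command along the following lines lists them; the SP address is a placeholder and the exact options vary between navicli/naviseccli releases, so treat this as a sketch rather than the exact command used:

root# navicli -h <sp_a_address> port -list -hba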

May 13, 2008

prex, tnfxtract and tnfdump ...

On Solaris 10, DTrace is the way to go if you want to probe or profile system calls. However, if you are still stuck on an older Solaris release and need to dig deeper into kernel profiling, read on.

"prex" has been present since Solaris 2.5 days and is part of Solaris tracing architecture that controls probes in a process or the kernel.

root# prex -k
Type "help" for help ...
prex> buffer dealloc
There is no buffer to deallocate
prex> buffer alloc 2M
Buffer of size 2097152 bytes allocated
prex> enable io
prex> trace io
prex> ktrace on


.... wait for some time to collect data

prex> ktrace off
prex> quit
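
If the system is otherwise idle, it helps to generate some block I/O while ktrace is on; a throwaway raw read is enough (the device path below is only a placeholder, use any disk that exists on your host):

root# dd if=/dev/rdsk/c0t0d0s0 of=/dev/null bs=8k count=10000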

Extract the data from the kernel buffer into a trace file:
root# tnfxtract data1.tnf

Then convert the binary "tnf" file into a readable text file:
root# tnfdump data1.tnf > data1.txt

The output file has two sections: 1) a header and 2) the detailed probe data.
Here is a sample header, which lists the probes that have been enabled:

probe tnf_name: "strategy" tnf_string: "keys io blockio;file ../../common/os/driver.c;line 411;"
probe tnf_name: "pageout" tnf_string: "keys vm pageio io;file ../../common/vm/vm_pvn.c;line 571;"
probe tnf_name: "biodone" tnf_string: "keys io blockio;file ../../common/os/bio.c;line 1222;"
probe tnf_name: "physio_start" tnf_string: "keys io rawio;file ../../common/os/vm_subr.c;line 158;"
probe tnf_name: "physio_end" tnf_string: "keys io rawio;file ../../common/os/vm_subr.c;line 304;"

A little info on the above probes:

strategy - the strategy() routine is called indirectly by the kernel to read/write blocks of data on block devices. Its job is to set up and initiate the data transfer. It records the start of an I/O.
biodone - releases the buffer after a buffer I/O transfer and notifies blocked threads. It records the completion of an I/O.
pageout - I/O due to paging is recorded as pageout.
physio_start - raw I/O (or kaio) start.
physio_end - raw I/O (or kaio) stop.

These probes record the device number, offset, size, and direction of the I/O transfer.

The contents of the actual probe data are shown below (edited for clarity):
------------  -----------  -----  -----  -------------  ---  ------------  ---------------------------
Elapsed (ms)  Delta (ms)   PID    LWPID  TID            CPU  Probe Name    Data / Description ...
------------  -----------  -----  -----  -------------  ---  ------------  ---------------------------

    0.000000     0.000000  15024     12  0x3000bf93c80    3  strategy      device: ... buf: 0x3001427dc80 flags: 34078993
    8.149680     0.022960      0      0  0x2a10030fcc0    2  biodone       device: .. block: 162881792 buf: 0x3001427dc80
    8.251120     0.101440  15024     12  0x3000bf93c80    1  pageout       vnode: ... pages_pageout: 0 pages_freed: 0 pages_reclaimed: 0
 1529.298240  1515.858800      0      0  0x2a10029fcc0    3  strategy      device: .. block: 8249 size: 512 buf: 0x60002515d10 flags: 33554689
...
 1536.184480     0.007440      0      0  0x2a100af5cc0    9  biodone       device: 137438953516 block: 8249 buf: 0x60002515d10
 5046.189280     0.121040     68      1  0x30003442380    2  physio_start  device: 506806142280 offset: 0 size: 36 rw: 64
 5046.461840     0.021520     68      1  0x30003442380    2  physio_end    device: 506806142280

From the above data, we can observe the block location and the size of each data transfer; this information can be used to determine whether the application is doing random or sequential I/O.
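
As a rough illustration, the nawk sketch below prints the distance between consecutive strategy block offsets; it assumes the "block:" field appears in the strategy lines exactly as in the sample above (lines without a block number are skipped). Small, steady distances suggest sequential I/O, while large erratic jumps suggest random I/O:

root# nawk '/ strategy / && match($0, /block: [0-9]+/) {
          blk = substr($0, RSTART + 7, RLENGTH - 7) + 0   # the number after "block:"
          if (prev != "") print blk - prev                # distance from the previous strategy block
          prev = blk
      }' data1.txt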

If you need to trace only a single process, the process filter can be turned on using:

prex> pfilter on
Warning: Process filtering on, but pid filter list is empty
prex> pfilter add 936
prex> pfilter
Process filtering is on
Process filter set is {936}


References:


http://developers.sun.com/solaris/developer/support/driver/notes/iotracing.pdf
http://www.brendangregg.com/Perf/paper_diskubyp1.pdf