November 30, 2015

Linux memory: What are those buffers?

Let's figure out how the 'free' command on Linux gathers the information for its output, especially the 'buffers' field.
Here is sample output of the free command on Linux
# free
             total       used       free     shared    buffers     cached
Mem:       1004300     852292     152008          4     194872     335584
-/+ buffers/cache:     321836     682464 
Swap:      1048572      88500     960072
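Before digging into the kernel side, note that the '-/+ buffers/cache' row is pure arithmetic on the Mem row: buffers and cached are treated as reclaimable, so they are subtracted from used and added to free. A quick check against the sample numbers above (the helper name is mine):

```python
# Reproduce free's "-/+ buffers/cache" row from the Mem row.
# Values (in kB) are taken from the sample output above.

def buffers_cache_adjusted(used, free, buffers, cached):
    """Return (used, free) with buffers and cached counted as reclaimable."""
    return used - buffers - cached, free + buffers + cached

used_adj, free_adj = buffers_cache_adjusted(852292, 152008, 194872, 335584)
print(used_adj, free_adj)  # 321836 682464 -- matches the -/+ row above
```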
As a simple strace makes evident, free is simply reading and parsing information from /proc/meminfo
# strace -e open,read free 
.
open("/proc/meminfo", O_RDONLY)         = 3
read(3, "MemTotal:        1004300 kB\nMemF"..., 2047) = 1170
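The parsing free does amounts to splitting each "Field:   value kB" line. A minimal sketch of that (the SAMPLE text uses fields from the output above; the helper name is mine):

```python
# Minimal sketch of parsing /proc/meminfo-style text, as free does.
# Field names are from the sample output above.

SAMPLE = """\
MemTotal:        1004300 kB
MemFree:          152008 kB
Buffers:          194872 kB
Cached:           335584 kB
"""

def parse_meminfo(text):
    """Map field name -> value in kB."""
    info = {}
    for line in text.splitlines():
        field, rest = line.split(":", 1)
        info[field] = int(rest.split()[0])
    return info

print(parse_meminfo(SAMPLE)["Buffers"])  # 194872
```

On a real system you would read the text from open("/proc/meminfo") instead of SAMPLE.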
Looking at the kernel source (meminfo.c), the information is collected with the si_meminfo and si_swapinfo functions and displayed in kB. The /proc/meminfo fields of interest map as below
56                 "MemTotal:       %8lu kB\n" -> 108                 K(i.totalram),
57                 "MemFree:        %8lu kB\n" -> 109                 K(i.freeram),
58                 "Buffers:        %8lu kB\n" -> 110                 K(i.bufferram),
59                 "Cached:         %8lu kB\n" -> 111                 K(cached),
60                 "SwapCached:     %8lu kB\n" -> 112                 K(total_swapcache_pages),
As seen here, the buffers field is populated by the si_meminfo call, which in turn leads to
2831         val->bufferram = nr_blockdev_pages();
nr_blockdev_pages() is defined in block_dev.c as shown below
555 long nr_blockdev_pages(void)
556 {
557         struct block_device *bdev;
558         long ret = 0;
559         spin_lock(&bdev_lock);
560         list_for_each_entry(bdev, &all_bdevs, bd_list) {
561                 ret += bdev->bd_inode->i_mapping->nrpages;
562         }
563         spin_unlock(&bdev_lock);
564         return ret;
565 }
As seen above, this function returns the sum of 'nrpages' over the address spaces (i_mapping) of all block device inodes (bd_inode). So the 'Buffers' field is directly tied to block devices; nrpages is the number of resident pages in use by an address space.
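The walk nr_blockdev_pages() does can be modeled in a few lines. A toy model of it (structure and function names mirror the kernel; the Python classes are of course mine, and I assume 4 kB pages for the kB conversion):

```python
# Toy model of nr_blockdev_pages(): walk all block devices and sum the
# resident page count of each device's inode address space.

class AddressSpace:
    def __init__(self, nrpages):
        self.nrpages = nrpages

class BlockDevice:
    def __init__(self, name, nrpages):
        self.name = name
        self.i_mapping = AddressSpace(nrpages)  # bd_inode->i_mapping

# nrpages values taken from the crash-script output later in this post
all_bdevs = [BlockDevice("vda1", 25639), BlockDevice("dm-0", 11844)]

def nr_blockdev_pages(bdevs):
    # the kernel holds bdev_lock around this walk; no locking needed here
    return sum(b.i_mapping.nrpages for b in bdevs)

pages = nr_blockdev_pages(all_bdevs)
print(pages, pages * 4)  # 37483 149932 -- pages, then kB at 4 kB/page
```

Note that 149932 kB is exactly the 'buffers' value free reports in the crash-script run below.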

Looking at the source, I found two functions in mm/filemap.c that update nrpages: __delete_from_page_cache and add_to_page_cache_locked

Of all the I/O that passes through the page cache, anything whose inode mapping belongs to a regular file is accounted under the 'cached' field, while anything mapped to a block device updates the 'buffers' field.

So, reading a block device directly, for example with dd, should increase the buffers field of free's output. Let's give it a try
[root@cent8 ~]# free
             total       used       free     shared    buffers     cached
Mem:       1004300     312268     692032          4         64       9844
-/+ buffers/cache:     302360     701940 
Swap:      1048572      88496     960076
[root@cent8 ~]# dd if=/dev/vda1 of=/dev/null bs=1024k count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.279128 s, 376 MB/s 
[root@cent8 ~]# free
             total       used       free     shared    buffers     cached
Mem:       1004300     428852     575448          4     102592      10008
-/+ buffers/cache:     316252     688048 
Swap:      1048572      88496     960076 
As expected, 'Buffers' increased by around 100MB. Another source of block device inode mappings is file system metadata, i.e. on-disk blocks for directories, inode tables etc. Let's run find to read a bunch of files and directories and observe the 'buffers' field
[root@cent8 ~]# free
             total       used       free     shared    buffers     cached
Mem:       1004300     432796     571504          4     104612      12092
-/+ buffers/cache:     316092     688208 
Swap:      1048572      88496     960076 
[root@cent8 ~]# find / > /dev/null
[root@cent8 ~]# free
             total       used       free     shared    buffers     cached
Mem:       1004300     494556     509744          4     148880      12328
-/+ buffers/cache:     333348     670952 
Swap:      1048572      88496     960076 
Sure enough, 40MB+ of data was pulled into 'buffers'. We can find the nrpages used per block device using crash with live kernel debugging, i.e. basically finding the block_device.bd_inode->i_mapping->nrpages value for each block_device

Using the below commands from crash, you will be able to find this info

1. list block_device.bd_list -H all_bdevs -s block_device.bd_inode  => all_bdevs is the list head containing all block_devices; it is walked using bd_list as the pointer to the next member
2. once you have each bd_inode, you can find i_mapping, i.e. the address_space struct, using the below (for each device)
struct inode.i_mapping [address of bd_inode struct from step 1]
3. Find nrpages using above address_space struct (for each device)
struct address_space.nrpages [address of address_space struct from step 2]
I wrote a simple Python script to do the same: nrpages_by_dev
[root@cent8 ~]# /tmp/nrpages_by_dev;free
Running /usr/bin/crash to determine nrpages by device ...
Device#              DeviceName           nrpages              nrpages - KB        
--------------------------------------------------------------------------------
264241153            "vda1                25639                102556              
265289728            "dm-0                11844                47376               
             total       used       free     shared    buffers     cached
Mem:       1004300     689448     314852          4     149932     153600
-/+ buffers/cache:     385916     618384 
Swap:      1048572      88496     960076 
P.S: The above script has no error checking and I have tested it only on Fedora/CentOS Linux. Please use with *care*; it assumes the crash and kernel-debuginfo rpm's are installed and that 'crash -s' works (of course, you need to be root to run it). I am also interested in finding this info by other means, probably via ftrace or perf, to avoid the crash/debuginfo package dependency. Let me know if anyone knows a way!
 

November 10, 2015

file write with O_SYNC enabled

Reading one of the questions on StackExchange, I thought to verify it with ftrace. So I enabled function graph tracing for the write(2) syscall as below to capture the code flow
root@test1:/sys/kernel/debug/tracing# echo SyS_write > set_graph_function 
root@test1:/sys/kernel/debug/tracing# echo function_graph > current_tracer 
root@test1:/sys/kernel/debug/tracing# echo 1 > tracing_on
At this point, I ran a simple program that opens a file with the O_SYNC flag and performs a write (see the strace output of the program below)
open("/tmp/testing_o_sync", O_WRONLY|O_CREAT|O_SYNC, 0644) = 3
write(3, "This write has to go to disk dir"..., 100) = 100
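The traced program can be reproduced in a few lines using Python's os wrappers around the same open/write syscalls. A sketch (the helper name is mine, and the payload is dummy data rather than the program's original string):

```python
import os
import tempfile

def o_sync_write(path, data):
    """Open path with O_SYNC and write data; returns bytes written."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
    try:
        return os.write(fd, data)  # each write hits stable storage
    finally:
        os.close(fd)

path = os.path.join(tempfile.gettempdir(), "testing_o_sync")
print(o_sync_write(path, b"x" * 100))  # 100
```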
Below is an edited version of the captured ftrace output (minimal details on uninteresting functions; the first column is the line number)
     5  1)               |  SyS_write() {
     .
     9  1)               |    vfs_write() {
    10  1)               |      rw_verify_area() {
    14  1)   1.736 us    |      }
    15  1)   0.176 us    |      __sb_start_write();
    30  1)               |      new_sync_write() {
    31  1)   0.105 us    |        iov_iter_init();
    32  1)               |        ext4_file_write_iter [ext4]() {
    33  1)               |          mutex_lock() {
    34  1)   0.100 us    |            _cond_resched();
    35  1)   0.898 us    |          }
    36  1)               |          __generic_file_write_iter() {  # http://lxr.free-electrons.com/source/mm/filemap.c?v=3.16#L2518 
     # As per the above source link, most of the work of writing to the file is done in this function; however it does not handle O_SYNC
     # Syncing of data needs to be handled by the caller
    37  1)   0.105 us    |            file_remove_suid();
    38  1)               |            file_update_time() {
    39  1)               |              current_fs_time() {
    40  1)   0.230 us    |                current_kernel_time();
    41  1)   0.999 us    |              }
    42  1)               |              __mnt_want_write_file() {
    43  1)   0.146 us    |                mnt_clone_write();
    44  1)   1.048 us    |              }
    45  1)               |              update_time() {
    46  1)               |                __mark_inode_dirty() {
    47  1)               |                  ext4_dirty_inode [ext4]() {
   146  1) + 74.959 us   |                  }  # ext4_dirty_inode _ends_here_
   147  1)   0.120 us    |                  _raw_spin_lock();
   148  1)   0.096 us    |                  sb_is_blkdev_sb();
   149  1)   0.276 us    |                  _raw_spin_lock();
   150  1) + 79.373 us   |                }
   151  1) + 80.631 us   |              }
   152  1)   0.176 us    |              __mnt_drop_write_file();
   153  1) + 85.662 us   |            } # file_update_time _ends_here_
   154  1)               |            generic_perform_write() {
   155  1)   0.140 us    |              iov_iter_fault_in_readable();
   156  1)               |              ext4_da_write_begin [ext4]() {
   192  1) + 25.083 us   |              } # ext4_da_write_begin _ends_here_
   193  1)   0.442 us    |              iov_iter_copy_from_user_atomic();
   194  1)               |              ext4_da_write_end [ext4]() {
   234  1) + 22.415 us   |              } # ext4_da_write_end _ends_here_
   235  1)   0.090 us    |              _cond_resched();
   236  1)   0.111 us    |              iov_iter_advance();
   237  1)   0.436 us    |              balance_dirty_pages_ratelimited();
   238  1) + 54.575 us   |            } # generic_perform_write _ends_here_
   239  1) ! 143.216 us  |          }
   240  1)   0.125 us    |          mutex_unlock();
   241  1)               |          vfs_fsync_range() { # http://lxr.free-electrons.com/source/include/linux/fs.h?v=3.16#L2233
     # Called to flush files opened with the O_SYNC flag. vfs_fsync_range() is called from the generic_write_sync function and
     # calls the respective fsync method, as seen below
178 int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
179 {
180         if (!file->f_op->fsync)
181                 return -EINVAL;
182         return file->f_op->fsync(file, start, end, datasync);
183 }
   242  1)               |            ext4_sync_file [ext4]() {
   243  1)               |              filemap_write_and_wait_range() {
   244  1)               |                __filemap_fdatawrite_range() {
   245  1)               |                  do_writepages() {
   246  1)               |                    ext4_writepages [ext4]() {
   247  1)   0.211 us    |                      mapping_tagged();
   248  1)   0.096 us    |                      ext4_has_inline_data [ext4]();
   249  1)   0.090 us    |                      ext4_io_submit_init [ext4]();
   254  1)   0.105 us    |                      blk_start_plug();
   255  1)               |                      ext4_init_io_end [ext4]() {
       .
   261  1)   5.532 us    |                      }
   262  1)   0.095 us    |                      jbd2_journal_blocks_per_page [jbd2]();
   263  1)               |                      ext4_meta_trans_blocks [ext4]() {
       .
   267  1)   2.473 us    |                      }
   268  1)               |                      __ext4_journal_start_sb [ext4]() {
   280  1)   7.227 us    |                      }
   281  1)               |                      mpage_prepare_extent_to_map [ext4]() {
       .
   360  1) + 57.845 us   |                      } # mpage_prepare_extent_to_map _ends_here_
   361  1)               |                      __ext4_journal_stop [ext4]() {
       .
   370  1)   5.121 us    |                      }
   371  1)               |                      ext4_io_submit [ext4]() {
   372  1)               |                        submit_bio() {
   373  1)               |                          generic_make_request() {
       .
   517  1) ! 116.884 us  |                          }
   518  1) ! 117.787 us  |                        }  # submit_bio _ends_here_
   519  1)   0.135 us    |                        bio_put();
   520  1) ! 119.573 us  |                      }
   521  1)   0.286 us    |                      mpage_release_unused_pages [ext4]();
   522  1)   0.161 us    |                      ext4_put_io_end [ext4]();
   523  1)               |                      blk_finish_plug() {
   524  1)               |                        blk_flush_plug_list() {
       .
   706  1) ! 141.451 us  |                        }
   707  1) ! 142.228 us  |                      }  # blk_finish_plug _ends_here_
   708  1) ! 355.522 us  |                    }
   709  1) ! 356.280 us  |                  }  # do_writepages _ends_here_
   710  1) ! 357.167 us  |                }
   711  1)               |                filemap_fdatawait_range() {
      .
   816  1) ! 154.310 us  |                }
   817  1) ! 513.164 us  |              } # filemap_write_and_wait_range _ends_here_
   818  1)               |              jbd2_trans_will_send_data_barrier [jbd2]() {
   819  1)   0.150 us    |                _raw_read_lock();
   820  1)   0.973 us    |              }
   821  1)               |              jbd2_complete_transaction [jbd2]() {
   877  1)               |                jbd2_log_wait_commit [jbd2]() {
      .
  1130  1) ! 25593.78 us |                } # jbd2_log_wait_commit _ends_here_
  1131  1) ! 25627.76 us |              } # jbd2_complete_transaction _ends_here_
  1132  1) ! 26144.90 us |            } # ext4_sync_file _ends_here
  1133  1) ! 26145.69 us |          }
  1134  1) ! 26293.56 us |        }
  1135  1) ! 26294.93 us |      }
  1136  1)               |      __fsnotify_parent() {
     .
  1165  1)   9.083 us    |      }
  1166  1)               |      fsnotify() {
  1169  1)   0.858 us    |      }
  1170  1)   0.191 us    |      __sb_end_write();
  1171  1) ! 26321.44 us |    } # vfs_write _ends_here_
  1172  1) ! 26324.27 us |  }  # SyS_write _ends_here_
2233 static inline int generic_write_sync(struct file *file, loff_t pos, loff_t count)
2234 {
2235         if (!(file->f_flags & O_DSYNC) && !IS_SYNC(file->f_mapping->host))
2236                 return 0;
2237         return vfs_fsync_range(file, pos, pos + count - 1,
2238                                (file->f_flags & __O_SYNC) ? 0 : 1);
2239 }
As seen in the above function, generic_write_sync returns 0 for plain buffered writes and calls vfs_fsync_range for writes to files opened with O_SYNC
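The check above can be modeled in a couple of lines. One detail worth noting (my addition, true on Linux since the flags were split in 2.6.33): O_SYNC includes the O_DSYNC bit, which is why testing f_flags & O_DSYNC also catches files opened with O_SYNC. A toy model of the decision, with needs_sync a name of my own:

```python
import os

# Toy model of generic_write_sync's gate: sync only if the file was
# opened with O_(D)SYNC or the inode itself is marked sync (IS_SYNC).

def needs_sync(f_flags, inode_is_sync=False):
    return bool(f_flags & os.O_DSYNC) or inode_is_sync

print(needs_sync(os.O_WRONLY | os.O_CREAT))            # buffered write
print(needs_sync(os.O_WRONLY | os.O_CREAT | os.O_SYNC))  # True on Linux
```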

November 5, 2015

Linux ftrace: UDP transmit packet kernel code path

Check out the earlier blog entry on setting up ftrace to capture this trace. I have removed most details within the IP filter related functions

Below trace is for udp_sendmsg (edited for clarity)

 0)               |  udp_sendmsg() {
 0)   0.093 us    |    sock_tx_timestamp();
 0)               |    sk_dst_check() {
 0)   0.070 us    |      ipv4_dst_check();
 0)   1.307 us    |    }
 0)               |    ip_make_skb() {
 0)               |      ip_setup_cork() {
 0)   0.209 us    |        ipv4_mtu();
 0)   0.909 us    |      }
 0)               |      __ip_append_data.isra.45() {
 0)               |        sock_alloc_send_skb() {
 0)               |          sock_alloc_send_pskb() {
 0)               |            __alloc_skb() {
 0)               |              kmem_cache_alloc_node() {
 0)   0.141 us    |                __might_sleep();
 0)   0.070 us    |                _cond_resched();
 0)   0.116 us    |                should_failslab();
 0)   0.102 us    |                kmemleak_alloc();
 0)   2.454 us    |              }
 0)               |              __kmalloc_reserve.isra.26() {
 0)               |                __kmalloc_node_track_caller() {
 0)   0.157 us    |                  kmalloc_slab();
 0)   0.133 us    |                  __might_sleep();
 0)   0.065 us    |                  _cond_resched();
 0)   0.074 us    |                  should_failslab();
 0)   0.075 us    |                  kmemleak_alloc();
 0)   3.474 us    |                }
 0)   3.980 us    |              }
 0)   0.252 us    |              ksize();
 0)   8.598 us    |            }
 0)   9.271 us    |          }
 0)   9.910 us    |        }
 0)   0.060 us    |        skb_put();
 0)               |        ip_generic_getfrag() {
 0)               |          memcpy_fromiovecend() {
 0)               |            might_fault() {
 0)   0.150 us    |              __might_sleep();
 0)   0.961 us    |            }
 0)   1.595 us    |          }
 0)   2.192 us    |        }
 0) + 14.200 us   |      }
 0)               |      __ip_make_skb() {
 0)   0.063 us    |        ipv4_mtu();
 0)               |        __ip_select_ident() {
 0)   1.272 us    |          inet_getpeer();
 0)               |          inet_putpeer() {
 0)   0.084 us    |            __smp_mb__before_atomic();
 0)   0.556 us    |          }
 0)   2.828 us    |        }
 0)   0.078 us    |        kfree();
 0)   0.070 us    |        dst_release();
 0)   5.206 us    |      }
 0) + 22.094 us   |    }
 0)               |    udp_send_skb() {
 0)               |      ip_send_skb() {
 0)               |        ip_local_out_sk() {
 0)               |          __ip_local_out() {
 0)               |            nf_hook_slow() {    
                                               ...edited...
 0) + 88.178 us   |            }
 0) + 88.869 us   |          }
 0)               |          ip_output() {
 0)               |            nf_hook_slow() {    
                                              ... edited ...
 0) + 50.133 us   |            }
 0)               |            ip_finish_output() {
 0)   0.070 us    |              ipv4_mtu();
 0)               |              local_bh_disable() {
 0)   0.092 us    |                __local_bh_disable();
 0)   0.625 us    |              }
 0)   0.130 us    |              rcu_read_lock_bh_held();
 0)   0.070 us    |              rcu_read_lock_bh_held();
 0)   0.062 us    |              skb_push();
 0)               |              dev_queue_xmit() {  http://lxr.free-electrons.com/source/net/core/dev.c?v=3.10#L2777
 0)               |                local_bh_disable() {
 0)   0.091 us    |                  __local_bh_disable();
 0)   0.551 us    |                }
 0)   0.086 us    |                rcu_read_lock_bh_held();
 0)   0.156 us    |                netdev_pick_tx();
 0)   0.066 us    |                rcu_read_lock_bh_held();
 0)   0.072 us    |                rcu_read_lock_bh_held();
 0)   0.637 us    |                _raw_spin_lock();
 0)               |                sch_direct_xmit() {
 0)   0.259 us    |                  _raw_spin_unlock();
 0)   0.579 us    |                  _raw_spin_lock();
 0)               |                  dev_hard_start_xmit() { http://lxr.free-electrons.com/source/net/core/dev.c?v=3.10#L2512
 0)   0.077 us    |                    dst_release();
 0)               |                    netif_skb_features() {
 0)               |                      harmonize_features.part.86() {
 0)   0.096 us    |                        skb_network_protocol();
 0)   0.704 us    |                      }
 0)   1.307 us    |                    }
 

In this function, the common code calls the driver-specific transmit function, as shown at http://lxr.free-electrons.com/source/net/core/dev.c?v=3.10#L2580
 2580                 rc = ops->ndo_start_xmit(skb, dev);
Here ops is a pointer to net_device_ops, as defined http://lxr.free-electrons.com/source/include/linux/netdevice.h?v=3.10#L906
The driver code sets up this struct and assigns its functions to the generic function pointers, as shown at http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/e1000e/netdev.c?v=3.10#L6457

6457 static const struct net_device_ops e1000e_netdev_ops = {
6458         .ndo_open               = e1000_open,
6459         .ndo_stop               = e1000_close,
6460         .ndo_start_xmit         = e1000_xmit_frame,
6461         .ndo_get_stats64        = e1000e_get_stats64,
6462         .ndo_set_rx_mode        = e1000e_set_rx_mode,
6463         .ndo_set_mac_address    = e1000_set_mac,
6464         .ndo_change_mtu         = e1000_change_mtu,
6465         .ndo_do_ioctl           = e1000_ioctl,
6466         .ndo_tx_timeout         = e1000_tx_timeout,
6467         .ndo_validate_addr      = eth_validate_addr,
6468 
6469         .ndo_vlan_rx_add_vid    = e1000_vlan_rx_add_vid,
6470         .ndo_vlan_rx_kill_vid   = e1000_vlan_rx_kill_vid,
6471 #ifdef CONFIG_NET_POLL_CONTROLLER
6472         .ndo_poll_controller    = e1000_netpoll,
6473 #endif
6474         .ndo_set_features = e1000_set_features,
6475 };
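This is the classic ops-table pattern: generic code dispatches through a table of function pointers that each driver fills in. A rough Python analogy (all names below except ndo_start_xmit and NETDEV_TX_OK are mine; NETDEV_TX_OK = 0 matches the kernel's value):

```python
# Ops-table dispatch, as used by net_device_ops, modeled in Python.

NETDEV_TX_OK = 0

def e1000_xmit_frame(skb, dev):
    # a real driver would DMA-map skb and place it on the NIC TX ring
    return NETDEV_TX_OK

# the "driver" registers its functions in a table...
e1000e_netdev_ops = {"ndo_start_xmit": e1000_xmit_frame}

def dev_hard_start_xmit(skb, dev, ops):
    # ...and the generic code calls through the table, driver-agnostic
    return ops["ndo_start_xmit"](skb, dev)

rc = dev_hard_start_xmit(b"frame-bytes", "eth0", e1000e_netdev_ops)
print(rc)  # 0
```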
The underlying driver code gets executed
 0)               |                    e1000_xmit_frame [e1000]() {
 0)   0.187 us    |                      skb_pad();
 0)               |                      nommu_map_page() {
 0)   0.177 us    |                        check_addr();
 0)   0.861 us    |                      }
 0)   0.730 us    |                      _raw_spin_lock_irqsave();
 0)   0.348 us    |                      _raw_spin_unlock_irqrestore();
 0)   0.539 us    |                      _raw_spin_lock_irqsave();
 0)   0.333 us    |                      _raw_spin_unlock_irqrestore();
 0)               |                      save_stack_trace() {
 0)               |                        dump_trace() {
 0) + 20.129 us   |                          print_context_stack();
 0) + 21.455 us   |                        }
 0) + 22.056 us   |                      }
 0)   0.321 us    |                      _raw_spin_lock_irqsave();
 0)   0.249 us    |                      _raw_spin_unlock_irqrestore();
 0)   0.261 us    |                      _raw_spin_lock_irqsave();
 0)   0.226 us    |                      _raw_spin_unlock_irqrestore();
 0)               |                      skb_clone_tx_timestamp() {
 0)   0.059 us    |                        classify();
 0)   0.778 us    |                      }
 0)   ==========> |
 0)               |                      do_IRQ() {
 0)               |                        irq_enter() {
 0)   0.142 us    |                          rcu_irq_enter();
 0)   0.735 us    |                        }
 0)   0.071 us    |                        exit_idle();
 0)               |                        handle_irq() {
 0)   0.813 us    |                          irq_to_desc();
 0)               |                          handle_fasteoi_irq() {
 0)   1.031 us    |                            _raw_spin_lock();
 0)               |                            handle_irq_event() {
 0)   0.342 us    |                              _raw_spin_unlock();
 0)               |                              handle_irq_event_percpu() {
 0)               |                                e1000_intr [e1000]() {
 0)               |                                  __napi_schedule() {
 0)   0.078 us    |                                    __raise_softirq_irqoff();
 0)   0.932 us    |                                  }
 0) + 49.101 us   |                                }
 0)   0.655 us    |                                add_interrupt_randomness();
 0)   0.132 us    |                                note_interrupt();
 0) + 51.976 us   |                              }
 0)   0.840 us    |                              _raw_spin_lock();
 0) + 54.969 us   |                            }
 0)               |                            ack_apic_level() {
 0)   0.212 us    |                              irq_complete_move();
 0) + 41.956 us   |                            }
 0)   0.570 us    |                            _raw_spin_unlock();
 0) ! 101.772 us  |                          }
 0) ! 104.131 us  |                        }
 0)               |                        irq_exit() {
 0)   0.099 us    |                          idle_cpu();
 0)   0.164 us    |                          rcu_irq_exit();
 0)   1.684 us    |                        }
 0) ! 108.989 us  |                      }
 0)   <========== |
 0) ! 285.379 us  |                    }
 0) ! 289.474 us  |                  }
 0)   1.046 us    |                  _raw_spin_unlock();
 0)   1.076 us    |                  _raw_spin_lock();
 0) ! 295.938 us  |                }
 0)   0.261 us    |                _raw_spin_unlock();
 0)   0.211 us    |                local_bh_enable();
 0) ! 304.568 us  |              }   ==> dev_queue_xmit ends here

The code below looks to be cleaning up the ring buffer/memory used for the TX packet

 0)               |              local_bh_enable() {
 0)               |                do_softirq() {
 0)               |                  __do_softirq() {
 0)   0.082 us    |                    msecs_to_jiffies();
 0)   0.187 us    |                    __local_bh_disable();
 0)               |                    run_timer_softirq() {
 0)   0.089 us    |                      hrtimer_run_pending();
 0)   0.971 us    |                      _raw_spin_lock_irq();
 0)   0.545 us    |                      _raw_spin_unlock_irq();
 0)   4.199 us    |                    }
 0)   0.075 us    |                    rcu_bh_qs();
 0)               |                    net_rx_action() {
 0)               |                      e1000_clean [e1000]() {
 0)   0.186 us    |                        e1000_unmap_and_free_tx_resource.isra.36 [e1000]();
 0)               |                        e1000_unmap_and_free_tx_resource.isra.36 [e1000]() {
 0)   1.078 us    |                          _raw_spin_lock_irqsave();
 0)   0.833 us    |                          _raw_spin_lock_irqsave();
 0)   0.382 us    |                          _raw_spin_unlock_irqrestore();
 0)   0.404 us    |                          _raw_spin_unlock_irqrestore();
 0)               |                          dev_kfree_skb_any() {
 0)               |                            consume_skb() {
 0)               |                              skb_release_all() {
 0)               |                                skb_release_head_state() {
 0)               |                                  sock_wfree() {
 0)   1.178 us    |                                    sock_def_write_space();
 0)   2.294 us    |                                  }
 0)   3.436 us    |                                }
 0)               |                                skb_release_data() {
 0)               |                                  skb_free_head() {
 0)               |                                    kfree() {
 0)   0.223 us    |                                      kmemleak_free();
 0)   0.814 us    |                                      _raw_spin_lock_irqsave();
 0)   0.247 us    |                                      _raw_spin_unlock_irqrestore();
 0)   4.380 us    |                                    }
 0)   5.180 us    |                                  }
 0)   6.080 us    |                                }
 0) + 11.199 us   |                              }
 0)               |                              kfree_skbmem() {
 0)               |                                kmem_cache_free() {
 0)   0.091 us    |                                  kmemleak_free();
 0)   0.634 us    |                                  _raw_spin_lock_irqsave();
 0)   0.254 us    |                                  _raw_spin_unlock_irqrestore();
 0)   3.368 us    |                                }
 0)   4.317 us    |                              }
 0) + 17.709 us   |                            }
 0) + 18.576 us   |                          }
 0) + 26.477 us   |                        }
 0)   0.327 us    |                        e1000_clean_rx_irq [e1000]();
 0)   0.182 us    |                        e1000_update_itr [e1000]();
 0)   0.094 us    |                        e1000_update_itr [e1000]();
 0)               |                        napi_complete() {
 0)   0.094 us    |                          napi_gro_flush();
 0)               |                          __napi_complete() {
 0)   0.073 us    |                            __smp_mb__before_atomic();
 0)   1.002 us    |                          }
 0)   2.888 us    |                        }
 0) + 53.550 us   |                      }
 0)   0.162 us    |                      net_rps_action_and_irq_enable.isra.84();
 0) + 56.990 us   |                    }
 0)   0.084 us    |                    rcu_bh_qs();
 0)   0.185 us    |                    __local_bh_enable();
 0) + 67.206 us   |                  }
 0) + 68.256 us   |                }
 0) + 69.261 us   |              }
 0) ! 380.182 us  |            }
 0) ! 431.741 us  |          }
 0) ! 521.827 us  |        }
 0) ! 522.526 us  |      }
 0) ! 523.713 us  |    }
 0)   0.084 us    |    dst_release();
 0) ! 551.849 us  |  }

August 15, 2015

Linux ftrace: UDP receive packet kernel code path

What happens when a packet arrives at NIC ?
How is data flowing through various stacks ?
 .
 .
ftrace gives us the ability to look deep into kernel code and figure out what exactly happens. In this blog entry, I look at the UDP receive code. To capture the required info, I used simple perl code to send/receive data on a specified port using the UDP protocol. On my test VM, the NIC was using the e1000 driver.

Using simple strace, I could see my UDP server blocked in the recvfrom syscall

[cv@mylinux1 ~]$ strace -e network -f ./us.pl 
socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(5000), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
recvfrom(3, 
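The perl pair can be mimicked with Python's socket module; each call below maps to one of the syscalls in the strace. A self-contained sketch (udp_round_trip is my name, and I bind to an ephemeral loopback port instead of the 5000 shown above):

```python
import socket

def udp_round_trip(payload):
    """Send payload over loopback UDP and receive it on a bound socket."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # socket()
    server.bind(("127.0.0.1", 0))                              # bind()
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.sendto(payload, server.getsockname())  # -> udp_sendmsg
    data, _peer = server.recvfrom(1024)           # -> udp_recvmsg (blocks)
    server.close()
    client.close()
    return data

print(udp_round_trip(b"hello"))  # b'hello'
```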
recvfrom in turn calls 'udp_recvmsg' for UDP. Follow the below steps to enable graph tracing for three functions in ftrace

[root@mylinux1 tracing]# pwd
/sys/kernel/debug/tracing

[root@mylinux1 tracing]# grep e1000 available_filter_functions |grep -i intr
e1000_intr [e1000]
e1000_test_intr [e1000]

[root@mylinux1 tracing]# grep udp_recv available_filter_functions 
udp_recvmsg

[root@mylinux1 tracing]# grep net_rx_action available_filter_functions 
net_rx_action

[root@mylinux1 tracing]# echo "e1000_intr net_rx_action udp_recvmsg" > set_graph_function 

[root@mylinux1 tracing]# cat set_graph_function
e1000_intr [e1000]
net_rx_action
udp_recvmsg

[root@mylinux1 tracing]# echo function_graph > current_tracer 

Below is the trace captured (after some manual edits)
0)               |  udp_recvmsg() {
 0)               |    __skb_recv_datagram() {
 0)   0.843 us    |      _raw_spin_lock_irqsave();
 0)   0.379 us    |      _raw_spin_unlock_irqrestore();
 0)               |      prepare_to_wait_exclusive() {
 0)   0.815 us    |        _raw_spin_lock_irqsave();
 0)   0.295 us    |        _raw_spin_unlock_irqrestore();
 0)   2.152 us    |      }
 0)               |      schedule_timeout() {
 0)               |        schedule() {
 0)               |          __schedule() {
 0)   0.472 us    |            _raw_spin_lock_irq();
 0)               |            deactivate_task() {
 0)               |              dequeue_task() {
 0)   0.110 us    |                update_rq_clock.part.72();
 0)               |                dequeue_task_fair() {
 0)               |                  dequeue_entity() {
 0)               |                    update_curr() {
 0)   0.183 us    |                      update_min_vruntime();
 0)   1.011 us    |                      cpuacct_charge();
 0)   2.380 us    |                    }
 0)   0.142 us    |                    update_cfs_rq_blocked_load();
 0)   0.067 us    |                    clear_buddies();
 0)               |                    account_entity_dequeue() {
 0)   0.386 us    |                      __cpu_to_node();
 0)   1.156 us    |                    }
 0)   0.070 us    |                    update_min_vruntime();
 0)   0.722 us    |                    update_cfs_shares();
 0)   7.507 us    |                  }
 0)   0.070 us    |                  hrtick_update();
 0)   8.903 us    |                }
 0) + 10.218 us   |              }
 0) + 10.833 us   |            }
 0)   0.161 us    |            put_prev_task_fair();
 0)               |            pick_next_task_fair() {
 0)   0.067 us    |              clear_buddies();
 0)               |              set_next_entity() {
 0)   0.202 us    |                update_stats_wait_end();
 0)   1.251 us    |              }
 0)   2.195 us    |            }
At this point, the recvfrom/udp_recvmsg call is blocked, waiting for data to arrive. As seen above, the task is deactivated (its state changed to sleep) and put on a wait queue

Later, when the client sends data and it is received by the NIC, e1000_intr() gets called, which in turn calls __napi_schedule() to schedule the receive polling and raise a soft IRQ

 0)               |  e1000_intr [e1000]() {
 0)               |    __napi_schedule() {
 0)   0.151 us    |      __raise_softirq_irqoff();
 0)   1.400 us    |    }
 0) + 36.024 us   |  }
The soft IRQ handler later calls net_rx_action(), which does most of the work, as seen in the trace below
 0)               |  net_rx_action() {
 0)               |    e1000_clean [e1000]() {
 0)               |      e1000_clean_rx_irq [e1000]() {
 0)   1.422 us    |        _raw_spin_lock_irqsave();
 0)   1.033 us    |        _raw_spin_lock_irqsave();
 0)   0.374 us    |        _raw_spin_unlock_irqrestore();
 0)   0.339 us    |        _raw_spin_unlock_irqrestore();
 0)               |        __netdev_alloc_skb() {
 0)   0.256 us    |          __netdev_alloc_frag();
 0)               |          build_skb() {
 0)               |            kmem_cache_alloc() {
 0)   0.471 us    |              should_failslab();
 0)   0.436 us    |              kmemleak_alloc();
 0)   2.641 us    |            }
 0)   3.653 us    |          }
 0)   5.250 us    |        }
 0)   0.082 us    |        skb_put();
 0)   0.153 us    |        e1000_rx_checksum.isra.25 [e1000]();
 0)   0.376 us    |        eth_type_trans();
 0)               |        napi_gro_receive() {
 0)   0.085 us    |          skb_gro_reset_offset();
 0)               |          dev_gro_receive() {
 0)               |            inet_gro_receive() {
 0)               |              udp4_gro_receive() {
 0)   0.259 us    |                udp_gro_receive();
 0)   1.060 us    |              }
 0)   2.804 us    |            }
 0)   5.094 us    |          }
 0)               |          netif_receive_skb() {
 0)               |            ktime_get_real() {
 0)               |              getnstimeofday() {
 0)   0.161 us    |                __getnstimeofday();
 0)   0.630 us    |              }
 0)   1.202 us    |            }
 0)               |            skb_defer_rx_timestamp() {
 0)   0.132 us    |              classify();
 0)   0.660 us    |            }
 0)               |            __netif_receive_skb() {
 0)               |              __netif_receive_skb_core() {
 0)               |                ip_rcv() {
 0)               |                  nf_hook_slow() {
 0)               |                    nf_iterate() {
 0)   0.087 us    |                      ip_sabotage_in [bridge]();
 0)   1.564 us    |                    }
 0)   2.997 us    |                  }
 0)               |                  ip_rcv_finish() {
 0)               |                    ip_route_input_noref() {
 0)               |                      fib_table_lookup() {
 0)   1.021 us    |                        check_leaf.isra.7();
 0)   3.657 us    |                      }
 0)               |                      fib_validate_source() {
 0)               |                        fib_table_lookup() {
 0)   0.243 us    |                          check_leaf.isra.7();
 0)   1.595 us    |                        }
 0)               |                        fib_table_lookup() {
 0)   0.564 us    |                          check_leaf.isra.7();
 0)   1.952 us    |                        }
 0)   5.546 us    |                      }
 0)   0.150 us    |                      __skb_dst_set_noref();
 0) + 13.506 us   |                    }
 0)               |                    ip_local_deliver() {
 0)               |                      ip_local_deliver_finish() {
 0)   0.235 us    |                        raw_local_deliver();
 0)               |                        udp_rcv() {
 0)               |                          __udp4_lib_rcv() {
 0)               |                            __skb_checksum_complete() {
 0)               |                              skb_checksum() {
 0)               |                                __skb_checksum() {
 0)               |                                  csum_partial() {
 0)   0.330 us    |                                    do_csum();
 0)   0.889 us    |                                  }
 0)   1.567 us    |                                }
 0)   2.146 us    |                              } /* skb_checksum */
 0)   2.729 us    |                            }
 0)   1.352 us    |                            __udp4_lib_lookup();
 0)               |                            udp_queue_rcv_skb() {
 0)               |                              ipv4_pktinfo_prepare() {
 0)   0.457 us    |                                fib_compute_spec_dst();
 0)   1.311 us    |                              }
 0)   1.334 us    |                              _raw_spin_lock();
 0)               |                              __udp_queue_rcv_skb() {
 0)               |                                sock_queue_rcv_skb() {
 0)               |                                  sk_filter() {
 0)               |                                    security_sock_rcv_skb() {
 0)               |                                      selinux_socket_sock_rcv_skb() {
 0)   0.204 us    |                                        netlbl_enabled();
 0)   1.537 us    |                                      }
 0)   2.632 us    |                                    }
 0)   3.922 us    |                                  }
 0)   0.387 us    |                                  __sk_mem_schedule();
 0)   0.854 us    |                                  _raw_spin_lock_irqsave();
 0)   0.499 us    |                                  _raw_spin_unlock_irqrestore();
 0)               |                                  sock_def_readable() {
 0)               |                                    __wake_up_sync_key() {
 0)   0.769 us    |                                      _raw_spin_lock_irqsave();
 0)               |                                      __wake_up_common() {
 0)               |                                        receiver_wake_function() {
 0)               |                                          autoremove_wake_function() {
 0)               |                                            default_wake_function() {
 0)               |                                              try_to_wake_up() {
 0)   0.954 us    |                                                _raw_spin_lock_irqsave();
 0)   0.217 us    |                                                task_waking_fair();
 0)   0.249 us    |                                                select_task_rq_fair();
 0)   0.833 us    |                                                _raw_spin_lock();
 0)               |                                                ttwu_do_activate.constprop.83() {
 0)               |                                                  activate_task() {
 0)               |                                                    enqueue_task() {
 0)   0.201 us    |                                                      update_rq_clock.part.72();
 0)               |                                                      enqueue_task_fair() {
 0)               |                                                        enqueue_entity() {
 0)   0.148 us    |                                                          update_curr();
 0)   0.080 us    |                                                          __compute_runnable_contrib.part.45();
 0)   0.251 us    |                                                          update_cfs_rq_blocked_load();
 0)               |                                                          account_entity_enqueue() {
 0)   0.464 us    |                                                            __cpu_to_node();
 0)   1.153 us    |                                                          }
 0)   0.369 us    |                                                          update_cfs_shares();
 0)   0.150 us    |                                                          place_entity();
 0)   0.240 us    |                                                          __enqueue_entity();
 0)   7.862 us    |                                                        }
 0)   0.235 us    |                                                        hrtick_update();
 0)   9.481 us    |                                                      } /* enqueue_task_fair */
 0) + 10.909 us   |                                                    }
 0) + 11.514 us   |                                                  }
 0)               |                                                  ttwu_do_wakeup() {
 0)               |                                                    check_preempt_curr() {
 0)   0.098 us    |                                                      resched_task();
 0)   0.762 us    |                                                    }
 0)   2.776 us    |                                                  }
 0) + 15.257 us   |                                                }
 0)   0.297 us    |                                                _raw_spin_unlock();
 0)   0.360 us    |                                                ttwu_stat();
 0)   0.373 us    |                                                _raw_spin_unlock_irqrestore();
 0) + 45.922 us   |                                              }
 0) + 46.438 us   |                                            }
 0) + 47.088 us   |                                          }
 0) + 47.682 us   |                                        }
 0) + 48.499 us   |                                      }
 0)   0.522 us    |                                      _raw_spin_unlock_irqrestore();
 0) + 51.418 us   |                                    }
 0) + 53.066 us   |                                  }
 0) + 61.798 us   |                                }
 0) + 62.568 us   |                              }
 0)   0.350 us    |                              _raw_spin_unlock();
 0) + 68.290 us   |                            }
 0) + 75.644 us   |                          }
 0) + 76.260 us   |                        }
 0) + 78.225 us   |                      }
 0) + 78.743 us   |                    }
 0) + 94.300 us   |                  }
 0) + 99.118 us   |                }
 0) ! 101.755 us  |              }
 0) ! 102.300 us  |            }
 0) ! 106.790 us  |          }
 0) ! 113.511 us  |        }
 0)               |        e1000_alloc_rx_buffers [e1000]() {
 0)   0.293 us    |          skb_trim();
 0)               |          nommu_map_page() {
 0)   0.177 us    |            check_addr();
 0)   0.711 us    |          }
 0)   0.666 us    |          _raw_spin_lock_irqsave();
 0)   0.253 us    |          _raw_spin_unlock_irqrestore();
 0)   0.472 us    |          _raw_spin_lock_irqsave();
 0)   0.238 us    |          _raw_spin_unlock_irqrestore();
 0)               |          save_stack_trace() {
 0)               |            dump_trace() {
 0)   0.067 us    |              save_stack_stack();
 0)               |              print_context_stack() {
 0)               |                __kernel_text_address() {
 0)               |                  is_module_text_address() {
 0)               |                    __module_text_address() {
 0)   0.153 us    |                      __module_address();
 0)   0.684 us    |                    }
 0)   1.314 us    |                  }
 0)   1.892 us    |                }
 0)   0.084 us    |                __kernel_text_address();
 0)   0.077 us    |                save_stack_address();
 0)               |                print_ftrace_graph_addr.isra.1.part.2() {
 0)   0.067 us    |                  save_stack_address();
 0)   0.541 us    |                }
 0)   0.073 us    |                __kernel_text_address();
 0)   0.080 us    |                save_stack_address();
 0)               |                print_ftrace_graph_addr.isra.1.part.2() {
 0)   0.081 us    |                  save_stack_address();
 0)   0.534 us    |                }
     .
     .
     .
 0)   3.415 us    |          _raw_spin_lock_irqsave();
 0)   1.336 us    |          _raw_spin_unlock_irqrestore();
 0)   0.959 us    |          _raw_spin_lock_irqsave();
 0)   0.628 us    |          _raw_spin_unlock_irqrestore();
 0) ! 687.078 us  |        }
 0) ! 821.321 us  |      }
 0)   0.363 us    |      e1000_update_itr [e1000]();
 0)   0.351 us    |      e1000_update_itr [e1000]();
 0)               |      napi_complete() {
 0)   0.354 us    |        napi_gro_flush();
 0)               |        __napi_complete() {
 0)   0.251 us    |          __smp_mb__before_atomic();
 0)   2.903 us    |        }
 0)   6.895 us    |      }
 0) ! 866.912 us  |    }
 0)   0.250 us    |    net_rps_action_and_irq_enable.isra.84();
 0) ! 872.356 us  |  }

In the above trace, some important functions to note are:
ip_rcv()
ip_rcv_finish()
ip_local_deliver()
ip_local_deliver_finish()
udp_rcv()
Also observe the function calls that wake up any tasks sleeping on this data (sock_def_readable() down through try_to_wake_up()). Now that our task has been woken up, udp_recvmsg() continues to run, and the trace below is seen


 0)               |            finish_task_switch() {
 0)   0.792 us    |              _raw_spin_unlock_irq();
 0)   4.152 us    |            }
 0) ! 11280427 us |          } /* __schedule */
 0) ! 11280429 us |        } /* schedule */
 0) ! 11280430 us |      } /* schedule_timeout */
 0)   0.304 us    |      finish_wait();
 0)   2.098 us    |      _raw_spin_lock_irqsave();
 0)   0.785 us    |      _raw_spin_unlock_irqrestore();
 0) ! 11280447 us |    } /* __skb_recv_datagram */
 0)               |    skb_copy_datagram_iovec() {
 0)               |      might_fault() {
 0)   0.922 us    |        __might_sleep();
 0)   5.446 us    |      }
 0)   9.181 us    |    }
 0)               |    skb_free_datagram_locked() {
 0)               |      lock_sock_fast() {
 0)   0.489 us    |        __might_sleep();
 0)   0.261 us    |        _cond_resched();
 0)               |        _raw_spin_lock_bh() {
 0)               |          local_bh_disable() {
 0)   0.320 us    |            __local_bh_disable();
 0)   1.851 us    |          }
 0)   4.977 us    |        }
 0) + 11.082 us   |      }
 0)   0.484 us    |      sock_rfree();
 0)               |      _raw_spin_unlock_bh() {
 0)   0.651 us    |        local_bh_enable_ip();
 0)   2.742 us    |      }
 0)               |      __kfree_skb() {
 0)               |        skb_release_all() {
 0)   0.280 us    |          skb_release_head_state();
 0)               |          skb_release_data() {
 0)               |            skb_free_head() {
 0)               |              put_page() {
 0)   0.250 us    |                put_compound_page();
 0)   1.897 us    |              }
 0)   4.156 us    |            }
 0)   5.770 us    |          }
 0)   8.950 us    |        }
 0)               |        kfree_skbmem() {
 0)               |          kmem_cache_free() {
 0)   1.042 us    |            kmemleak_free();
 0)   1.360 us    |            _raw_spin_lock_irqsave();
 0)   0.852 us    |            _raw_spin_unlock_irqrestore();
 0)   9.442 us    |          }
 0) + 11.503 us   |        }
 0) + 23.744 us   |      }
 0) + 45.085 us   |    }
 0) ! 11280507 us |  } /* udp_recvmsg */
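The traces above come from the function_graph tracer. A sketch of the setup (the SyS_recvfrom filter is my assumption about the entry point traced here; adjust the symbol name for your kernel):

```shell
cd /sys/kernel/debug/tracing
echo 0 > tracing_on                   # stop any running tracer first
echo function_graph > current_tracer
echo SyS_recvfrom > set_graph_function   # graph only the receive path
echo 1 > tracing_on
cat trace_pipe                        # stream the trace while traffic arrives
```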

We will look at UDP transmit in the next blog entry

Linux ftrace: see underlying semantics of a system call

I have always wondered what happens under the hood of a system call. strace is a simple tool to see which syscalls an application makes (with a significant performance impact on the app), but to know what is really going on beyond the syscall boundary we need ftrace

ftrace is an internal tracer designed to help developers and system designers find out what is going on inside the kernel. It uses the debugfs file system to hold the control files as well as the files that display output
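If the tracing control files are not visible, debugfs may not be mounted yet. A quick check (assuming the conventional /sys/kernel/debug mount point):

```shell
# Mount debugfs if it is not already mounted (requires root)
mountpoint -q /sys/kernel/debug || mount -t debugfs none /sys/kernel/debug
ls /sys/kernel/debug/tracing/current_tracer
```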

Kernel documentation on ftrace can be found here
Brendan Gregg's lwn article talks about using ftrace and his perf-tools

Below is an example trace of read(2) syscall using function_graph tracer
[root@mylinux1 tracing]# pwd
/sys/kernel/debug/tracing
[root@mylinux1 tracing]# echo SyS_read > set_graph_function
[root@mylinux1 tracing]# echo function_graph > current_tracer 
[root@mylinux1 tracing]# echo 1 > tracing_on 
Observe the tracer output in the trace or trace_pipe files. The output below shows the kernel code path
taken for a read(2) syscall on my Linux VM running the 3.10.0-229 kernel
 
 0)               |  SyS_read() {
 0)   0.261 us    |    fget_light();
 0)               |    vfs_read() {
 0)               |      rw_verify_area() {
 0)               |        security_file_permission() {
 0)               |          selinux_file_permission() {
 0)   0.066 us    |            avc_policy_seqno();
 0)   0.581 us    |          }
 0)               |          __fsnotify_parent() {
 0)               |            dget_parent() {
 0)   0.580 us    |              _raw_spin_lock();
 0)   0.180 us    |              _raw_spin_unlock();
 0)   2.213 us    |            }
 0)               |            dput() {
 0)   0.265 us    |              _raw_spin_lock();
 0)   0.172 us    |              _raw_spin_unlock();
 0)   1.318 us    |            }
 0)   4.431 us    |          }
 0)   0.080 us    |          fsnotify();
 0)   6.372 us    |        }
 0)   6.803 us    |      }
 0)               |      do_sync_read() {
 0)               |        xfs_file_aio_read [xfs]() {
 0)   0.065 us    |          generic_segment_checks();
 0)               |          xfs_ilock [xfs]() {
 0)               |            down_read_nested() {
 0)   0.142 us    |              __might_sleep();
 0)   0.058 us    |              _cond_resched();
 0)   1.416 us    |            }
 0)   1.919 us    |          }
 0)               |          generic_file_aio_read() {
 0)   0.072 us    |            generic_segment_checks();
 0)   0.153 us    |            __might_sleep();
 0)   0.059 us    |            _cond_resched();
 0)   0.635 us    |            __find_get_page();
 0)   0.070 us    |            mark_page_accessed();
 0)               |            file_read_actor() {
 0)               |              do_page_fault() {
 0)               |                __do_page_fault() {
 0)   0.276 us    |                  down_read_trylock();
 0)   0.164 us    |                  __might_sleep();
 0)   0.064 us    |                  _cond_resched();
 0)   0.163 us    |                  find_vma();
 0)               |                  handle_mm_fault() {
 0)   0.499 us    |                    __mem_cgroup_count_vm_event();
 0)   0.112 us    |                    sync_mm_rss();
 0)               |                    anon_vma_prepare() {
 0)   0.203 us    |                      __might_sleep();
 0)   0.059 us    |                      _cond_resched();
 0)   1.284 us    |                    }
 0)               |                    alloc_pages_vma() {
 0)   0.095 us    |                      get_vma_policy();
 0)   0.070 us    |                      policy_nodemask();
 0)   0.081 us    |                      policy_zonelist();
 0)               |                      __alloc_pages_nodemask() {
 0)   0.136 us    |                        __might_sleep();
 0)   0.059 us    |                        _cond_resched();
 0)   0.069 us    |                        next_zones_zonelist();
 0)               |                        get_page_from_freelist() {
 0)   0.073 us    |                          next_zones_zonelist();
 0)   0.061 us    |                          __zone_watermark_ok();
 0)   0.060 us    |                          __mod_zone_page_state();
 0)               |                          zone_statistics() {
 0)   0.072 us    |                            __inc_zone_state();
 0)   0.064 us    |                            __inc_zone_state();
 0)   0.929 us    |                          }
 0)   0.060 us    |                          bad_range();
 0)               |                          mem_cgroup_bad_page_check() {
 0)               |                            lookup_page_cgroup_used() {
 0)   0.061 us    |                              lookup_page_cgroup();
 0)   0.694 us    |                            }
 0)   1.175 us    |                          }
 0)   5.841 us    |                        }
 0)   8.006 us    |                      }
 0) + 10.248 us   |                    }
 0)               |                    mem_cgroup_newpage_charge() {
 0)               |                      mem_cgroup_charge_common() {
 0)   0.549 us    |                        __mem_cgroup_try_charge();
 0)               |                        __mem_cgroup_commit_charge() {
 0)   0.095 us    |                          lookup_page_cgroup();
 0)   0.061 us    |                          mem_cgroup_charge_statistics.isra.26();
 0)   0.061 us    |                          memcg_check_events();
 0)   1.615 us    |                        }
 0)   3.113 us    |                      }
 0)   3.667 us    |                    }
 0)   0.497 us    |                    _raw_spin_lock();
 0)   0.058 us    |                    add_mm_counter_fast();
 0)               |                    page_add_new_anon_rmap() {
 0)               |                      __inc_zone_page_state() {
 0)   0.079 us    |                        __inc_zone_state();
 0)   0.586 us    |                      }
 0)               |                      lru_cache_add() {
 0)   0.062 us    |                        __lru_cache_add();
 0)   0.592 us    |                      }
 0)   2.218 us    |                    }
 0)   0.188 us    |                    _raw_spin_unlock();
 0) + 23.584 us   |                  }
 0)   0.182 us    |                  up_read();
 0) + 27.283 us   |                }
 0) + 27.703 us   |              }
 0) + 28.902 us   |            }
 0)   0.064 us    |            put_page();
 0)   0.143 us    |            __might_sleep();
 0)   0.058 us    |            _cond_resched();
 0)   0.443 us    |            __find_get_page();
 0)   0.063 us    |            put_page();
 0)               |            touch_atime() {
 0)               |              current_fs_time() {
 0)   0.071 us    |                current_kernel_time();
 0)   0.534 us    |              }
 0)   1.022 us    |            }
 0) + 36.860 us   |          }
 0)               |          xfs_iunlock [xfs]() {
 0)   0.197 us    |            up_read();
 0)   1.325 us    |          }
 0) + 41.959 us   |        }
 0) + 42.513 us   |      }
 0)               |      __fsnotify_parent() {
 0)               |        dget_parent() {
 0)   0.525 us    |          _raw_spin_lock();
 0)   0.200 us    |          _raw_spin_unlock();
 0)   2.022 us    |        }
 0)               |        dput() {
 0)   0.238 us    |          _raw_spin_lock();
 0)   0.201 us    |          _raw_spin_unlock();
 0)   1.299 us    |        }
 0)   4.228 us    |      }
 0)   0.069 us    |      fsnotify();
 0) + 55.277 us   |    }
 0) + 56.499 us   |  }
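When done, it is a good idea to stop and reset the tracer so it does not keep adding overhead; for example:

```shell
cd /sys/kernel/debug/tracing
echo 0 > tracing_on             # stop tracing
echo nop > current_tracer       # detach the function_graph tracer
echo > set_graph_function       # clear the graph filter
```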

June 6, 2015

Linux ARP Cache

Recently I found a very interesting question asked on stackoverflow.com, as mentioned here. The user was using host aliases to simulate local clients to load test his server; however, he was hitting a limit of 1023 devices, with clients timing out

Upon further investigation, I found the issue was due to the ARP cache getting full.
Detailed information about ARP tunables can be found in the arp(7) man page
       gc_thresh1 (since Linux 2.2)
              The  minimum  number of entries to keep in the ARP cache.  The garbage collector will not run if there are fewer than
              this number of entries in the cache.  Defaults to 128.

       gc_thresh2 (since Linux 2.2)
              The soft maximum number of entries to keep in the ARP cache.  The garbage collector will allow the number of  entries
              to exceed this for 5 seconds before collection will be performed.  Defaults to 512.

       gc_thresh3 (since Linux 2.2)
              The hard maximum number of entries to keep in the ARP cache.  The garbage collector will always run if there are more
              than this number of entries in the cache.  Defaults to 1024.

As seen above, the default for gc_thresh3 is 1024. When the ARP table reaches 1024 entries, no more can be added until the garbage collector is able to clean up some stale ARP entries. The simple fix was to bump up the default values to suit the need
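For example, the thresholds can be raised with sysctl (the values below are illustrative; pick sizes that fit your device count, and add the settings to /etc/sysctl.conf to persist them across reboots):

```shell
# Raise the ARP cache garbage-collector thresholds (requires root)
sysctl -w net.ipv4.neigh.default.gc_thresh1=1024
sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
sysctl -w net.ipv4.neigh.default.gc_thresh3=4096
```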



April 20, 2015

DNS notes: sub domain delegation with global forwarders

When you have global forwarders configured in your named configuration, subdomain delegation does not work as expected. Lookups for subdomains of zones (even ones you are authoritative for) get forwarded to the forwarders.

To disable forwarders for the zones you are authoritative for, you need to add an empty forwarders list to your zone configuration

zone "mydomian.test" in {
    type master;
    file "master/named.test";
    allow-update { "localhost"; };
    forwarders { }; 
};
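After editing, the configuration can be sanity-checked and the zone reloaded; for example (using the zone name from the snippet above):

```shell
named-checkconf /etc/named.conf   # syntax-check the named configuration
rndc reload mydomian.test         # reload just this zone
```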

April 13, 2015

Debian static IP configuration: Wifi adapter

The following steps illustrate configuring a static IP for a Wi-Fi connection on Debian Wheezy

The static configuration for the Wi-Fi interface is added to /etc/network/interfaces

First you need to generate the 256-bit Pre-Shared Key (PSK) from the passphrase for your Wi-Fi network, as below

$ sudo wpa_passphrase MY_SSID
# reading passphrase from stdin
xxxxxx123   <= type in the passphrase
network={
    ssid="MY_SSID"
    #psk="xxxxxx123"
    psk=c2ddfd959c6febdc4aac1f3575ded17278a0749d9997d08c07d03a241a812e1c
}


In the above example, MY_SSID is the wireless SSID and 'psk' is the 256-bit pre-shared key calculated from the provided passphrase

Now you can add this information to the 'interfaces' file as below

$ tail /etc/network/interfaces
# for wifi connection
auto wlan0
iface wlan0 inet static
    address 192.168.0.200
    netmask 255.255.255.0
    gateway 192.168.0.1
    dns-nameservers 192.168.0.1
    wpa-ssid MY_SSID
    wpa-scan-ssid 1
    wpa-psk c2ddfd959c6febdc4aac1f3575ded17278a0749d9997d08c07d03a241a812e1c


If your SSID is configured to be hidden, then 'wpa-scan-ssid' needs to be set to 1

Once the interfaces file is properly configured, you can bring up the interface using "ifup wlan0"

Check the interface details with 'iw' or 'iwconfig' as below

$ sudo iw wlan0 link
Connected to xx:xx:xx:xx:xx:xx (on wlan0)
    SSID: MY_SSID
    freq: 2427
    RX: 73561328 bytes (398315 packets)
    TX: 8602925 bytes (32204 packets)
    signal: -39 dBm
    tx bitrate: 39.0 MBit/s MCS 10

    bss flags:    short-slot-time
    dtim period:    0
    beacon int:    100


$ sudo iwconfig wlan0
wlan0     IEEE 802.11bgn  ESSID:"MY_SSID" 
          Mode:Managed  Frequency:2.427 GHz  Access Point: xx:xx:xx:xx:xx:xx  
          Bit Rate=39 Mb/s   Tx-Power=20 dBm  
          Retry  long limit:7   RTS thr:off   Fragment thr:off
          Encryption key:off
          Power Management:off
          Link Quality=70/70  Signal level=-38 dBm 
          Rx invalid nwid:0  Rx invalid crypt:0  Rx invalid frag:0
          Tx excessive retries:6  Invalid misc:877   Missed beacon:0


Note: Disable avahi-daemon and also change the hosts line in nsswitch.conf as below

$ grep hosts: /etc/nsswitch.conf
#hosts:          files mdns4_minimal [NOTFOUND=return] dns mdns4
hosts:          files dns


April 2, 2015

tcpdump: bpf_load: genmajor failed: A file or directory in the path name does not exist

If you get the above error while trying to run tcpdump via sudo on an AIX host, it is due to the environment variable ODMDIR not being set

$ env | grep ODMD
ODMDIR=/etc/objrepos
By default sudo resets your environment, so ODMDIR is not available in the sudo session. Hence commands which require ODMDIR (e.g. cfgmgr, savebase) fail.

To resolve this, either keep the entire user environment via 'sudo -E', or update sudoers to keep the ODMDIR variable (preferred)

Defaults env_keep +="ODMDIR"

AIX rpm dependency

AIX fileset/package information is stored in the ODM; however, you can also install packages in RPM format on AIX. RPM uses its own database to store information on installed packages.

To interlink the shared libraries from AIX and resolve dependencies for RPMs, AIX uses a virtual/pseudo RPM called AIX-rpm

$ rpm -q AIX-rpm
AIX-rpm-6.1.5.0-6

The version of this RPM is based on the oslevel of AIX. This RPM contains no files, but it provides all installed libraries and shells to resolve dependencies. You can verify this via

$ rpm --provides -q AIX-rpm

Whenever new LPP filesets are installed or updated, the information in this pseudo RPM needs to be updated. This is done via a script: /usr/sbin/updtvpkg

If you run into issues installing RPMs due to missing shared libraries, it is possible that you need to regenerate the AIX-rpm. Depending on the installed software, this script could take a long time to complete
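For example, after installing new filesets (the grep pattern below is just an illustration of checking for a particular library):

```shell
/usr/sbin/updtvpkg                        # rebuild the AIX-rpm provides list
rpm --provides -q AIX-rpm | grep libc.a   # confirm the library is now provided
```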