Sunday, January 01, 2006

 

Performance Tuning for Linux Servers


--------------------------------------------------------------------
--------------------------------------------------------------------
--------------------------------------------------------------------


Installation
--------------------------------------------------------------------
Use separate partitions for root (/), swap, /var, /usr, and /home

Most drives today pack more sectors on the outer tracks of the hard drive platter than on the inner tracks, so it’s much faster to read and write data from the outer tracks. Lower-numbered partitions are usually allocated at the outer tracks (for example, /dev/hda1 is closer to the drive’s outer edge than /dev/hda3), so place partitions that require frequent access first.
? http://www.pcguide.com/ref/hdd/geom/tracksZBR-c.html ?
The first partition should be the swap partition (to optimize memory swap operations).

The next partition should be /var because log entries are frequently written to /var/log.

The next partition should be /usr, because base system utilities and commands are placed in /usr.

The root and /home partitions can reside near the end of the drive.

USE MULTIPLE DRIVES
-Place frequently accessed partitions on the faster drives !duh, ofcourse!
-Place frequently accessed partitions (ie /var and /usr) on separate drives.
-Use RAID
-(IDE) Place each drive as master device on its own I/O channel


ext3 is a journaling file system
convert ext2 to ext3 (adds a journal; existing data is preserved):
tune2fs -j /dev/hda1
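After adding the journal, change the file system type in /etc/fstab so the partition mounts as ext3. A minimal sketch, assuming /dev/hda1 is mounted on /boot:

# /etc/fstab entry before:  /dev/hda1  /boot  ext2  defaults  1 2
# /etc/fstab entry after:
/dev/hda1   /boot   ext3    defaults        1 2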

ReiserFS - Best performance with small files
xfs - Best performance, especially with large files


RAID
mkraid -V
cat /proc/mdstat
Create or modify /etc/raidtab
#---
/* Create RAID device md0 */
raiddev /dev/md0 /* New RAID device */
raid-level 0 /* RAID 0 as example here */
nr-raid-disks 2 /* Assume two disks */
/* Automatically detect RAID devices on boot */
persistent-superblock 1
chunk-size 32 /* Writes 32 KB of data to each disk */
device /dev/hda1
raid-disk 0
device /dev/hdc1
raid-disk 1
#---

mkraid /dev/md0
mkreiserfs /dev/md0

IDE DISKS
Verify that DMA is enabled:
hdparm -d /dev/hda

If DMA is not enabled, enable it by issuing the following command:
hdparm -d 1 /dev/hda

Verify that 32-bit transfers are enabled:
hdparm -c /dev/hda

If 32-bit transfers are not enabled, enable them by issuing the following command:
hdparm -c 1 /dev/hda

Verify the effectiveness of the options by running simple disk read tests as follows:
hdparm -T -t /dev/hda

--------------------------------------------------------------------
--------------------------------------------------------------------


2.6 Kernel Features:
--------------------------------------------------------------------
I/O Elevators - anticipatory and deadline
An elevator is a queue in which I/O requests are ordered as a function of their sector on disk
default = anticipatory //anticipates the "next" read operation
database applications that seek all over the disk, performing reads and synchronous writes, suffer under the anticipatory elevator
the deadline scheduler can give roughly a 10% improvement over the anticipatory scheduler for such workloads; select it by booting with elevator=deadline on the kernel command line
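A minimal sketch of selecting the deadline elevator at boot via GRUB (file name, kernel version, and root device are illustrative):

# /boot/grub/menu.lst
title Linux 2.6 (deadline elevator)
    kernel /vmlinuz-2.6.9 ro root=/dev/hda2 elevator=deadline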


Huge TLB Page Support -
!!KJR TLB = Translation Lookaside Buffer!!
TLB is the processor’s cache of virtual-to-physical memory address translations
large TLB entry can map a 2MB or 4MB page, thus reducing the number of TLB misses
TLB miss is very costly in terms of processor cycles
mmap system calls or shared memory system calls
kernel config:
CONFIG_HUGETLB_PAGE (under processor section)
CONFIG_HUGETLBFS (under file system section)
cat /proc/meminfo #show huge page size support
cat /proc/filesystems #hugetlbfs
cat /proc/sys/vm/nr_hugepages #configured huge pages

tune with:
echo x >/proc/sys/vm/nr_hugepages
x is the number of huge pages to be preallocated (a page count, not megabytes)
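A minimal sketch of preallocating huge pages and making them usable through the hugetlbfs pseudo file system (page count and mount point are illustrative):

echo 64 > /proc/sys/vm/nr_hugepages   # 64 huge pages, not megabytes
mkdir -p /mnt/huge
mount -t hugetlbfs none /mnt/huge     # mmap() files here to get largepage-backed memory
grep -i huge /proc/meminfo            # verify HugePages_Total / HugePages_Free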

--------------------------------------------------------------------
--------------------------------------------------------------------


Logging Facility:
--------------------------------------------------------------------
!duh, ofcourse!
/var/log/messages
/var/log/XFree86.0.log

man logger
Logger makes entries in the system log

/etc/syslog.conf
/etc/sysconfig/syslog (RedHat)
kern.* /var/adm/kernel
kern.crit @remotehost #KJR says /etc/hosts loghost
kern.crit /dev/console
kern.info;kern.!err /var/adm/kernel-info
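A quick way to test a syslog.conf rule is to inject a message with logger; the facility/priority and message text below are illustrative, and where the message lands depends on the local syslog.conf:

logger -p user.warning "test message from logger"
tail -n 1 /var/log/messages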

--------------------------------------------------------------------
--------------------------------------------------------------------


System Initialization
--------------------------------------------------------------------
init reads /etc/inittab

BSD -
/etc/rc.d/rc.S - a single rc.S file //daemons started here
the rc.S file enables the system's virtual memory, mounts necessary file systems, cleans up certain log directories, initializes Plug and Play devices, loads kernel modules, configures PCMCIA devices, and sets up serial ports
a local script (rc.local) is available for site-specific additions

System V - multiple independent files; each runlevel is given its own subdirectory
scripts are run for runlevels 0 to 6
/etc/rc.d contains rc0.d through rc6.d plus init.d
the rcN.d entries are links to the master scripts stored in /etc/rc.d/init.d
K is kill
S is start
scripts run in numeric order
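For example (runlevel and sequence numbers are illustrative), the links for runlevel 3 can be inspected and a new start link added by hand:

ls /etc/rc.d/rc3.d                                      # S* links start services, K* links stop them, in numeric order
ln -s /etc/rc.d/init.d/httpd /etc/rc.d/rc3.d/S85httpd   # start httpd late in runlevel 3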


Initialization Table (/etc/inittab)
id:runlevel:action:process

id =
unique identifier

runlevel =
0=halt, 1=single, 2=multiuser w/o NFS, 3=multiuser full, 4=unused, 5=X, 6=reboot

action =
respawn, once, sysinit, boot, bootwait, wait, off, ondemand, initdefault, powerwait, powerfail, powerokwait, ctrlaltdel, or kbrequest

process =
specific process or program to run
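A few typical Red Hat-style /etc/inittab entries as a sketch (exact lines vary by distribution):

id:3:initdefault:                         # default runlevel is 3
si::sysinit:/etc/rc.d/rc.sysinit          # run once at boot
ca::ctrlaltdel:/sbin/shutdown -t3 -r now  # trap Ctrl-Alt-Del
1:2345:respawn:/sbin/mingetty tty1        # respawn a getty on tty1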

--------------------------------------------------------------------
--------------------------------------------------------------------
--------------------------------------------------------------------

Kernel Overview

--------------------------------------------------------------------
--------------------------------------------------------------------
--------------------------------------------------------------------
Linus Torvalds in 1991 on Intel 80x86 processor !duh, ofcourse!
kernel = interact/control system hardware components !duh, ofcourse!
kernel = provide an environment in which applications can run !duh, ofcourse!

Linux kernel is monolithic
Linux kernels extended by modules
module is an object that can be linked to the kernel at runtime

microkernel operating systems provide bare, minimal functionality, and all other operating system layers are performed on top of microkernels as processes
microkernels are slow due to message passing between the various layers

--------------------------------------------------------------------

/proc File System
a virtual file system that is created dynamically by the kernel
provides kernel data and allows run-time fine-tuning

--------------------------------------------------------------------

Memory Management
address space, physical memory, memory mapping, paging, and swapping

Address Space - "virtual memory" space is mapped to physical memory
address space is a flat linear address space
linear address space is divided into two parts: user address space and kernel address space
x86 32-bit architecture supports 4GB address space
3GB is reserved for user space and 1GB is reserved for the kernel
location of the split is determined by the PAGE_OFFSET kernel configuration variable

!!KJR VM=Virtual Memory or Manager!!
Physical Memory - VM represents this arrangement as a node
Each node is divided into a number of blocks called zones that represent ranges within memory
ZONE_DMA - First 16MB of memory
ZONE_NORMAL - 16MB – 896MB
ZONE_HIGHMEM - 896MB – end

Memory Mapping
kernel has only 1GB of virtual address space for its use
the other 3GB is reserved for user space
Intel PAE (Physical Address Extension) Pentium processors support up to 64GB of physical memory
kernel address a page in high memory, it maps that page into a small virtual address space (kmap) window, operates on that page, and unmaps the page
64-bit architectures do not have this problem because their address space is huge

Paging
Virtual address space is divided into fixed-size chunks called pages
three-level paging mechanism
Page Global Directory (PGD)
Page Middle Directory (PMD)
Page Table Entry (PTE)

Swapping
Swapping is the moving of an entire process to and from secondary storage when main memory is low
moving entire processes (and the resulting context switches) is very expensive
so in Linux, swapping is performed at the page level rather than at the process level
the major disadvantage of swapping is speed - disks are very slow compared to RAM
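Commands for a quick look at current swap activity, as a sketch (any of these works):

free -m        # physical memory and swap totals, in MB
swapon -s      # per-device swap usage
vmstat 5       # watch the si/so columns for ongoing page-in/page-out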

--------------------------------------------------------------------

Processes, Tasks, and Kernel Threads
task is simply a generic “description of work that needs to be done,” whether it is a lightweight thread or a full process
thread is the most lightweight instance of a task
process is a “heavier” data structure, Several threads can operate within a single process
kernel thread is a thread that always operates in kernel mode and has no user context

Threads and processes are scheduled identically by the scheduler

--------------------------------------------------------------------

Scheduling and Context Switching
Linux scheduler principle:
slow-running processes are better than processes that stop dead in their tracks, whether due to deliberate scheduling-policy choices or outright bugs

context switch = process stops running and another replaces it
overhead for this is high
timeslice = period of time in which to run

--
example:
a disk with data ready causes an interrupt
kernel calls the interrupt handler
interrupting the process that is currently running
utilizing many of its resources
currently running process resumes
effect steals time from the currently running process
--

Interrupt handlers are usually very fast and compact and thereby handle and clear interrupts quickly
an interrupt utilizes a random process’s resources

--------------------------------------------------------------------

Interprocess Communications (IPC) ipcs

Signals - job control
SIGSTOP signal causes a process to halt its execution
SIGKILL signal causes a process to exit; it cannot be caught or ignored


pipe is a unidirectional, first-in first-out (FIFO), unstructured stream of data
named pipes are not temporary objects; they are entities in the file system and can be created using the mkfifo command
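A minimal named-pipe sketch (path and commands are illustrative):

mkfifo /tmp/mypipe
ls -l /tmp/mypipe                       # mode field starts with 'p'
gzip -c < /tmp/mypipe > /tmp/log.gz &   # reader blocks until a writer appears
cat /var/log/messages > /tmp/mypipe     # writer; data streams through the FIFO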

##file perms## ## ## ## ## ## ## ##
##########################################
from the 'man ls' on apple OSX:

b Block special file.
c Character special file.
d Directory.
l Symbolic link.
s Socket link.
p FIFO.
- Regular file.

!!KJR - it really perturbs me that linux man pages do not have this. Typically I'd have to shell into a solaris box to get this kind of info. Thank you apple OSX BSD flavor!!
d = directory
l = symbolic link
s = socket
p = named pipe
- = regular file
c = character (unbuffered) device file special
b = block (buffered) device file special

umask - a mask (e.g., 0022) subtracted from the default creation mode
default file creation mode is 666
default directory creation mode is 777

# umask
0022
# touch file
# mkdir dir
# ls -al
drwxr-xr-x 2 root root 4096 Dec 31 15:05 dir
-rw-r--r-- 1 root root 0 Dec 31 15:05 file

1 is set sticky bit
2 is set gid
4 is set uid

setuid is a security risk because the process runs with the privileges of the file's owner

ie. /tmp has the sticky bit, noted by the trailing "t"; files created in this dir can be deleted or renamed only by their owner (or root).
drwxrwxrwt 18 root root 4096 Dec 31 15:04 /tmp
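Setting the special bits with chmod, as a sketch (paths are illustrative):

chmod 1777 /shared                    # sticky bit on a world-writable directory, like /tmp
chmod 2775 /project                   # setgid directory: new files inherit the group
chmod 4755 /usr/local/bin/someprog    # setuid: runs as the file's owner (use sparingly)
ls -ld /shared /project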

!!KJR - They get more (S|s)quirrely:!!

The next three fields are three characters each: owner permissions, group permissions, and other permissions. Each field has three character positions:

1. If r, the file is readable; if -, it is not readable.

2. If w, the file is writable; if -, it is not writable.

3. The first of the following that applies:

S  If in the owner permissions, the file is not executable and set-user-ID mode is set. If in the group permissions, the file is not executable and set-group-ID mode is set.

s  If in the owner permissions, the file is executable and set-user-ID mode is set. If in the group permissions, the file is executable and set-group-ID mode is set.

x  The file is executable or the directory is searchable.

-  The file is neither readable, writable, executable, nor set-user-ID nor set-group-ID mode, nor sticky. (See below.)

These next two apply only to the third character in the last group (other permissions).

T  The sticky bit is set (mode 1000), but not execute or search permission. (See chmod(1) or sticky(8).)

t  The sticky bit is set (mode 1000), and is searchable or executable. (See chmod(1) or sticky(8).)

http://www.comptechdoc.org/os/linux/usersguide/linux_ugfilesp.html
##########################################
##file perms## ## ## ## ## ## ## ##

System V IPC Mechanisms

Message Queues - allow one or more processes to write messages
message queues are equivalent to pipes
Message queues pass data in messages rather than as an unformatted stream of bytes, allowing data to be processed easily
messages can be associated with a type, so the receiver can check for urgent messages before processing non-urgent messages

Semaphores - objects that support two atomic operations: set and test
counters that control access to shared resources by multiple processes
used as a locking mechanism to prevent processes from accessing a particular resource while another process is using it
problem = deadlocking, occurs when one process has altered a semaphore’s value as it enters a critical region but then fails to leave the critical region because it crashed or was killed
Linux protects by maintaining lists of adjustments to the semaphore arrays

Shared Memory - one or more processes to communicate via memory that appears in all of their virtual address spaces
Access to shared memory areas is controlled through keys and access rights checking
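The System V IPC resources above can be inspected from the shell with ipcs:

ipcs -q      # message queues
ipcs -s      # semaphore arrays
ipcs -m      # shared memory segments (key, shmid, owner, size, nattch)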

--------------------------------------------------------------------
--------------------------------------------------------------------
--------------------------------------------------------------------
--------------------------------------------------------------------

Processors and Multiprocessing
--------------------------------------------------------------------
--------------------------------------------------------------------
16 processors on 2.4-based kernels
32 processors on 2.6-based kernels
up to 512 processors on some architectures ??x86_64??

NUMA - Nonuniform Memory Access (also written Non-Uniform Memory Architecture)

or

Cluster
high-performance clusters (HPCs) higher node count 100+
spreading the work across a large number of nodes
Each node in an HPC has its own local disk storage to maintain the operating system, provide swap space, store programs

high-availability clusters 2-16
operates as an enterprise server
HA cluster consists minimally of two independent computers with a “heartbeat” monitoring program that monitors the health of the other node(s) in the cluster
http://linux-ha.org

--------------------------------------------------------------------
Symmetrical Multiprocessing (SMP)


Loosely coupled systems consist of processors that operate stand-alone
-Each processor has its own bus, memory, and I/O subsystem, and communicates with other processors through the network medium

Tightly coupled systems consist of processors that share the memory, bus, devices, and sometimes cache
-run a single instance of the operating system

!!KJR - Myth Buster = no SMP system is 100% scalable because of the overhead involved in managing the additional processors!!

--------------------------------------------------------------------
Symmetric Multithreading (SMT)
single physical CPU appears as two or more virtual CPUs
virtual CPUs share the core resources of the physical processor
Symmetric multithreading allows two or more tasks to be executed simultaneously in the processor
it has scheduler implications




--------------------------------------------------------------------

File Systems
--------------------------------------------------------------------

Virtual File System (VFS)
Virtual file system (VFS) allows Linux to support many, often very different, file systems, each presenting a common software interface to the VFS
Virtual File System layer allows you to transparently mount many different file systems at the same time

ext2fs

LVM - Logical Volume Manager
volume manager is used to hide the physical storage characteristics from the file systems and higher-level applications

RAID - Redundant Array of Inexpensive Disks
RAID-Linear = concatenation
RAID-0 = striping
RAID-1 = mirroring
RAID-5 = striping with parity
!! KJR RAID TEN (1+0) = a RAID 0 stripe across multiple RAID 1 mirrors !!

devfs - virtual device file system. ie, like procfs
/dev
Device drivers can register devices to devfs through device names instead of through the traditional major-minor number scheme
namespace is not limited by the number of major and minor numbers

--------------------------------------------------------------------
--------------------------------------------------------------------
--------------------------------------------------------------------

--------------------------------------------------------------------
--------------------------------------------------------------------

Memory
32-bit processors have a 4GB limit on memory addressability (2 raised to the 32nd power)
64-bit processors maximum address (2 raised to the 64th power)

32-bit processors (Pentium) implement additional address bits for accessing physical addresses greater than 32 bits
via virtual addressing by use of additional bits in page table entries
x86-based processors currently support up to 64GB of physical memory through this mechanism
virtual addressability is still restricted to 4GB

--------------------------------------------------------------------
I/O
disks limited to 256 in the 2.4 kernel series
Multipath I/O (MPIO) provides more than one path to a storage device


--------------------------------------------------------------------
--------------------------------------------------------------------
--------------------------------------------------------------------
--------------------------------------------------------------------

System Performance Monitoring
--------------------------------------------------------------------

CPU Utilization
cat /proc/cpuinfo

uptime
iostat
vmstat
top
sar //from sysstat pkg !!KJR - System Analysis Reporting || System And Reporting!!

load average represents the average number of runnable tasks over the last 1, 5, and 15 minutes
linux load average
http://www.teamquest.com/resources/gunther/ldavg1.shtml



---
# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 0 5336 40564 134016 0 0 3 30 325 233 5 2 92 0

--procs
r = running processes
b = blocked processes

--memory
swpd = memory swapped out #linux kswapd
free = free memory
buff = buffer cache for I/O data
cache = memory for file reads on disk in kilobytes

--swap
si = memory swapped in from disk #linux page fault activities as pages are swapped back to physical mem.
so = memory swapped out to disk in kilobytes per second

--io
bi = block read in from devices
bo = block written out to devices

--system
in = interrupts
cs = context switches

--cpu
us = user
sy = system
id = true idleness
wa = waiting for I/O completion
---

Also Look At: /proc/irq/ID/smp_affinity
if 0x0001 is echoed to /proc/irq/ID/smp_affinity, where ID is the IRQ number of a device, only CPU 0 will service interrupts for that device
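A sketch of pinning a device's interrupts to CPU 0 (the IRQ number 19 is illustrative; the value is a hex CPU bitmask):

cat /proc/interrupts                  # find the IRQ number for the device
echo 1 > /proc/irq/19/smp_affinity    # 0x0001 = CPU 0 only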

--------------------------------------------------------------------
Memory Utilization

cat /proc/meminfo
cat /proc/slabinfo

/proc/meminfo
MemTotal = total amount of physical memory of the system
MemFree = total amount of unused memory
Buffers = buffer cache for I/O operations
Cached = memory reading files from disk
SwapCached = amount of cache memory that has been swapped out in the swap space
SwapTotal = amount of disk memory for swapping purposes
HighTotal = total high memory (physical memory above ~860MB on IA-32)
LowTotal = total low memory (memory directly mapped into the kernel's address space)
Mapped = files that are memory-mapped
Slab = memory used for the kernel data structures

If an IA32-based system has more than 1GB of physical memory, HighTotal is nonzero

/proc/slabinfo
tcp_bind_bucket 56 224 32 2 2 1

first column lists the names of the kernel data structures
56 of which are active
total of 224 tcp_bind_bucket
Each data structure takes up 32 bytes
There are two pages that have at least one active object,
and there is a total of two allocated pages

--------------------------------------------------------------------
ps aux
!!KJR aux is better than -ef!!

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND

root 1 0.0 0.0 1528 528 ? S 15:24 0:00 init [2]

%CPU = percentage of CPU time that each process consumes
%MEM = total percentage of system memory that each process consumes
VSZ = virtual memory footprint
RSS = amount of physical memory that the process is currently using

/proc/pid/maps
layout of the process's virtual address space, where pid is the process ID of a particular process
cat /proc/3162/maps


Check It Tips:

to monitor I/O workloads:
vmstat: watch the bi and bo transfer rates

to see whether the system is swapping:
monitor swpd, si, and so; if it is swapping, check the swapping rate

to monitor CPU utilization:
watch us, sy, id, and wa; if wa is large, examine the I/O subsystem
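Typical usage when watching these columns (intervals are illustrative):

vmstat 5 10    # ten samples at five-second intervals; the first line is averages since boot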

--------------------------------------------------------------------

I/O Utilization

iostat
sar

iostat reports CPU utilization similar to how it is provided by the top tool
splits the CPU time into user, nice, system, I/O wait, and system idle

--------------------------------------------------------------------

Network Utilization

netstat, nfsstat, tcpdump, ethtool, snmp, ifport, ifconfig, route, arp, ping, traceroute, host, and nslookup
!!KJR what about mii-tool !!
!!KJR what about ip !!
!!KJR what about tc !!

ping (ICMP) = Internet Control Message Protocol - Echo function
A small packet is sent through the network for a given IP address
icmp type 255 = any
http://www.iana.org/assignments/icmp-parameters

route !!no duh!!
ie. route add default gw 192.168.0.1 //adds a default gateway

Flags Possible flags include
U (route is up)
H (target is a host)
G (use gateway)
R (reinstate route for dynamic routing)
D (dynamically installed by daemon or redirect)
M (modified from routing daemon or redirect)
A (installed by addrconf)
C (cache entry)
! (reject route)


arp !!no duh!! Address Resolution Protocol
ie. arp -d hostname //deletes arp entry of 'hostname' from arp table
Flags are same as from route

traceroute - find hops

tcpdump - sniffs network packets

host is a tool used to retrieve the host name for a given IP address from the Domain Name System

#network traffic
netstat -i
netstat -s
ip -s link


netstat -rn // shows routes
netstat -nlut // show open ports


ifconfig eth0:1 creates an alias
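A minimal alias sketch (addresses are illustrative):

ifconfig eth0:1 192.168.0.50 netmask 255.255.255.0 up
ifconfig eth0:1        # verify
ifconfig eth0:1 down   # remove the alias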

MAC stands for Media Access Control !!KJR Not Machine Address Code like most think!!
six hexadecimal numbers
ifconfig eth0 down hw ether 00:00:00:00:00:01
ifconfig eth0 up

nfsstat
network file system

--------------------------------------------------------------------
--------------------------------------------------------------------
System Trace Tools
identify performance problems and bottlenecks

top

strace

Oprofile
opcontrol initializes the OProfile tool
oprof_start is a GUI interface
oprofpp produces reports
op_time produces summary reports relative to the binaries that are running on the system
op_to_source tool generates annotated source for assembly listings
op_merge merges profiling samples

Performance Inspector
swtrace ai
run.tprof command to perform the trace and produce the default reports
run.itrace

vtune

dprobes - kernel and load-module debug tracing-type information

tracer - hooks into the kernel and provides tracing information


--------------------------------------------------------------------
--------------------------------------------------------------------
Benchmarks
Component benchmarks are often referred to as microbenchmarks
larger benchmarks are often referred to as application benchmarks or enterprise benchmarks

Operating System Benchmark Tools:
LMbench
AIM7 and AIM9
Reaim
SPEC SDET

Disk Benchmark Tools:
Bonnie/Bonnie++
IOzone
IOmeter
tiobench
dbench

Network Benchmark Tools:
Netperf
SPEC SFS

Application Benchmark Tools:
The Java benchmarks Volanomark, SPECjbb, and SPECjvm
PostMark
Database benchmarks
postfix, included w/ source

Database Benchmark Tools:
Open Source Development Lab
TPC http://www.tpc.org
SPEC benchmarks http://www.spec.org
Oracle Applications Standard Benchmark
SAP Standard Application Benchmark
MySQL, included w/ source

Web Server Benchmark Tools:
SPECweb, SPECweb SSL, and TPC-W
SPECjAppServer and ECPerf


oprofile
http://oprofile.sourceforge.net

Performance Inspector
http://perfinsp.sourceforge.net

linux trace toolkit
http://www.opersys.com/LTT/

dprobes
http://dprobes.sourceforge.net

vtune
http://www.intel.com/cd/software/products/asmo-na/eng/vtune/index.htm

--------------------------------------------------------------------
Performance Evaluation Methodologies
aka Theory:

-Tracing
-Workload Characterization
-Numerical Analysis
-Simulation

SPECweb99. Representative of web serving performance.

SPECsfs. Representative of NFS performance.

Database query. Representative of database query performance.

NetBench. Representative of SMB file-serving performance

Netperf3. Measures the performance of the network stack, including TCP, IP, and network device drivers.

VolanoMark. Measures the performance of the scheduler, signals, TCP send/receive, and loopback.

Block I/O test. Measures the performance of VFS, raw and direct I/O, block device layer, SCSI layer, and low-level SCSI/fibre device driver.

Lmbench. Used to measure performance of the Linux APIs.

IOzone. Used to measure native file system throughput.

dbench. Used to measure the file system component of NetBench.

SMB Torture. Used to measure SMB file-serving performance.


--------------------------------------------------------------------
--------------------------------------------------------------------
--------------------------------------------------------------------

System Tuning

2.6 Linux Scheduler
nice() system call
priority classes ranging from 0 to MAX_PRIO-1, where MAX_PRIO=140
The first MAX_RT_PRIO priorities, where MAX_RT_PRIO=100, are set aside for real-time tasks
The remaining 40 priority classes, [100..139], are set aside for time-sharing (that is, normal) jobs
normal jobs map to the [-20..19] nice values of UNIX processes
-20 (highest priority) to 19 (lowest)

sleep average is a number in the range of [0..MAX_SLEEP_AVG(=10 seconds)]

timeslice is the maximum time a task can run before yielding to another task
range of [MIN_TIMESLICE(=10 milliseconds)..MAX_TIMESLICE(=200 milliseconds)]

STARVATION_LIMIT (=10 seconds) times the number of tasks in the run queue

scheduler attempts to keep the system load as balanced as possible
event balancing = rebalance code runs when tasks change state or make specific system calls
active balancing = rebalancing at specified intervals measured in jiffies
!!KJR "I'll have that done in a jiffie" !!

Active balancing happens at each tick

CHILD_PENALTY - percentage of the parent’s sleep average that a child inherits
Increasing the value of this parameter increases the child’s effective priority

CREDIT_LIMIT - number of times a task earns sleep_avg over MAX_SLEEP_AVG
Reducing the value of this parameter helps highly interactive tasks by raising them to the highly interactive level

EXIT_WEIGHT - penalized for creating children that are processor hogs relative to the parent
Setting this value to zero causes the parent to inherit the child’s sleep average when the child exits

INTERACTIVE_DELTA - determines the offset that is added in determining whether or not a task is considered interactive
parameter is increased, a task needs to accumulate a larger sleep average to be considered interactive

MAX_SLEEP_AVG - A task with this sleep average gets the maximum bonus as indicated by PRIO_BONUS_RATIO
Increasing the value of this parameter gives the highest-priority task more time for execution before it is rescheduled

MAX_TIMESLICE - timeslice that is allocated to the task with the highest static priority (MAX_RT_PRIO)

MIN_TIMESLICE - timeslice that is allocated to the task with the lowest static priority (MAX_PRIO-1)

PARENT_PENALTY - percentage of the sleep average that the parent is permitted to keep

PRIO_BONUS_RATIO - percentage of the priority range used to provide a temporary bonus to interactive tasks

STARVATION_LIMIT - multiplication factor used to decide whether an interactive task is placed in an active or expired array

--------------------------------------------------------------------
--------------------------------------------------------------------
Address Space
The kernel creates the basic skeleton of a process’s virtual address space when the fork() system call is initiated

User Address Space
Each address space is represented in the Linux kernel through an object known as the mm structure
mm structure is a reference counted object that exists as long as the reference count is greater than zero

The VM Area Structures
To circumvent the issue of large page tables, Linux does not represent address spaces with page tables per se, but utilizes a set of VM area structure lists instead
a drawback of the VM area-based approach is that if a process maps a significant number of different files into its address space, the list of VM area structures becomes long and expensive to search

Kernel Address Space
vmalloc()
two platform-specific parameters VMALLOC_START and VMALLOC_END
a simple mapping formula (pfn = (addr – PAGE_OFFSET) / PAGE_SIZE)

High-Memory Support
highmem Interface
highmem interface provides indirect access to this memory by dynamically mapping high-memory pages into a small portion of the kernel address space that is reserved for this purpose
kmap()

Paging and Swapping
When accessing a virtual page that is not present, the CPU generates a page fault
The technique of borrowing a page from a process and writing it to the disk subsystem is referred to as paging
swapping—a much more aggressive form of paging that steals not only an individual page, but also a process’s entire page set

Replacement Policy
procedure that determines which page to evict from the main memory subsystem
least recently used (LRU) approach analyzes the past behavior
most UNIX operating systems utilize variations of lower-overhead replacement policies such as not recently used (NRU)
Linux relies on an LRU-based approach
not just a replacement policy, but also a memory balancing policy that determines how much memory is utilized for kernel buffers and how much is used to back virtual pages

Page Replacement and Memory Balancing
2 extra bits in each page-table entry = access and dirty bits
access bit indicates whether the page has been accessed since the access bit was last cleared
dirty bit indicates whether the page has been modified since it was last paged in
kswapd clears the access bit

Linux Page Tables
system maintains a page table for each process in physical memory and accesses the actual page tables via the identity mapped kernel segment
Page tables in Linux cannot be paged out to the swap space
per-process page table layout is based on a multilayer tree consisting of three levels
first layer consists of the global directory (pgd)
second layer consists of the middle directory (pmd)
third layer consists of the page table entry (pte)
different memory zones (ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM)
VM system impacts every other subcomponent in the system

rmap and objrmap
One new VM feature of the Linux 2.6 kernel is referred to as reversed mapping (rmap)
with objrmap, the struct page structure utilizes its mapping field to point to an address_space structure describing the object that backs that particular page

Largepages Support
IA-32 architecture supports either 4KB or 4MB pages
Largepage usage is primarily intended to provide performance improvements for high-performance computing (HPC) and other memory-intensive applications
Linux utilizes either 2MB or 4MB largepages, AIX uses 16MB largepages, and Solaris uses 4MB
translation lookaside buffer (TLB)
number of available largepages can be configured through the proc file system
/proc/sys/vm/nr_hugepages
The core of the largepage implementation in Linux 2.6 is referred to as the hugetlbfs, a pseudo file system (implemented in fs/hugetlbfs/inode.c) based on ramfs
A process may access largepages either through the shmget() interface to set up a shared region that is backed by largepages or by utilizing the mmap() call on a file that has been opened in the huge page file system

Slab Allocator
In Linux 2.4, kmem_cache_reap() is called in low-memory situations
2.6 The set_shrinker() function populates a struct with a pointer to the callback and a weight that indicates the complexity of re-creating the object

VM Tunables
/proc/sys/vm
# cd /proc/sys/vm
# ls
block_dump hugetlb_shm_group min_free_kbytes page-cluster
dirty_background_ratio laptop_mode nr_hugepages swappiness
dirty_expire_centisecs legacy_va_layout nr_pdflush_threads swap_token_timeout
dirty_ratio lowmem_reserve_ratio overcommit_memory vfs_cache_pressure
dirty_writeback_centisecs max_map_count overcommit_ratio

--------------------------------------------------------------------
--------------------------------------------------------------------
--------------------------------------------------------------------

I/O Subsystems—Performance Implications

These threads perform raw I/O or are generated by VMM components of the kernel, such as the kswapd or pdflush threads

Scheduler Tunables
/sys/block/device/iosched
Completely Fair Queuing (CFQ) I/O scheduler
Stochastic Fair Queuing (SFQ)

bdflush starts, flushes, or tunes the buffer-dirty-flush daemon

cat /proc/sys/vm/bdflush
50 500 0 0 500 3000 60 20 0

The first parameter (nfract), default 50, governs the maximum percentage of dirty buffers in the buffer cache
The second parameter (ndirty), default 500, is the maximum number of dirty buffers that bdflush can write to the disk at one time.
The third and fourth parameters are not currently used.
The fifth parameter (interval), default 500, is the delay between kupdate flushes
The sixth parameter (age_buffer), default 3000, is the time for a normal buffer to age before it is flushed.
The seventh parameter (nfract_sync), default 60, is the percentage of buffer cache that is dirty to activate bdflush synchronously
The eighth parameter (nfract_stop_bdflush), default 20, is the percentage of buffer cache that is dirty to stop bdflush
The ninth parameter is not currently used.

echo "100 1200 0 0 500 3000 60 20 0">/proc/sys/vm/bdflush

Setting up Raw I/O on Linux
# ln -s /dev/your_raw_dev_ctrl /dev/rawctl
raw -a to see which raw device nodes are already in use
raw /dev/raw/raw1 /dev/sda5


--------------------------------------------------------------------
--------------------------------------------------------------------
--------------------------------------------------------------------
Network Tuning

sysctl
/proc/sys/net/core
/proc/sys/net/ipv4

Default Socket Buffer Size
net.core.wmem_default (/proc/sys/net/core/wmem_default)
net.core.rmem_default (/proc/sys/net/core/rmem_default)

........................................
Memory | <=4KB | <=128KB | >128KB
........................................
rmem_default 32KB 64 KB 64KB
wmem_default 32KB 64 KB 64KB
wmem_max 32KB 64 KB 128KB
rmem_max 32KB 64 KB 128KB
........................................

Maximum Socket Buffer Size
net.core.rmem_max (/proc/sys/net/core/rmem_max)
net.core.wmem_max (/proc/sys/net/core/wmem_max)

netdev_max_backlog
net.core.netdev_max_backlog (/proc/sys/net/core/netdev_max_backlog)
The default value is 300, which is typically too small for heavy network loads
Increasing this value permits a larger store of packets queued and reduces the number of packets dropped
dropped packets result in a significant reduction in throughput

somaxconn
net.core.somaxconn (/proc/sys/net/core/somaxconn)
default maximum is 128.

optmem_max
optmem_max (/proc/sys/net/core/optmem_max)
This variable is the maximum initialization size of socket buffers, expressed in bytes

TCP Buffer and Memory Management
net.ipv4.tcp_rmem (/proc/sys/net/ipv4/tcp_rmem)
This variable is an array of three integers:
net.ipv4.tcp_rmem[0] = minimum size of the read buffer
net.ipv4.tcp_rmem[1] = default size of the read buffer
net.ipv4.tcp_rmem[2] = maximum size of the read buffer
......................................................
Default TCP Socket Read Buffer Sizes

Minimum[0] Default[1] Maximum[2]
......................................................
Low Memory PAGE_SIZE 43689 43689*2
Normal 4KB 87380 87380*2
......................................................
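A sketch of raising the TCP read-buffer limits at run time (the values 4096/87380/16MB are illustrative):

sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
# equivalent via /proc
echo "4096 87380 16777216" > /proc/sys/net/ipv4/tcp_rmem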


tcp_wmem
net.ipv4.tcp_wmem (/proc/sys/net/ipv4/tcp_wmem)
As with the read buffer, the TCP socket write buffer is also an array of three integers:
net.ipv4.tcp_wmem[0] = minimum size of the write buffer
net.ipv4.tcp_wmem[1] = default size of the write buffer
net.ipv4.tcp_wmem[2] = maximum size of the write buffer
........................................................
Default TCP Socket Write Buffer Sizes

Minimum[0] Default[1] Maximum[2]
........................................................
Low Memory 4KB 16KB 64KB
Normal 4KB 16KB 128KB
........................................................

tcp_mem
net.ipv4.tcp_mem[] (/proc/sys/net/ipv4/tcp_mem)
This kernel parameter is also an array of three integers that are used to control memory management behavior by defining the boundaries of memory management zones:
net.ipv4.tcp_mem[0] = pages below which TCP does not consider itself under memory pressure
net.ipv4.tcp_mem[1] = pages at which TCP enters memory pressure region
net.ipv4.tcp_mem[2] = pages at which TCP refuses further socket allocations (with some exceptions)

tcp_window_scaling
net.ipv4.tcp_window_scaling (/proc/sys/net/ipv4/tcp_window_scaling)
enables the use of TCP window sizes larger than 64KB (RFC 1323 window scaling)
noted that socket buffers larger than 64K are still potentially beneficial even when window scaling is turned off

tcp_sack
net.ipv4.tcp_sack (/proc/sys/net/ipv4/tcp_sack)
This variable enables the TCP Selective Acknowledgments (SACK) feature
SACK is a TCP option for congestion control

tcp_dsack
net.ipv4.tcp_dsack (/proc/sys/net/ipv4/tcp_dsack)
This variable enables the TCP D-SACK feature
enhancement to SACK to detect unnecessary retransmits

tcp_fack
net.ipv4.tcp_fack (/proc/sys/net/ipv4/tcp_fack)
This variable enables the TCP Forward Acknowledgment (FACK) feature
FACK is a refinement of the SACK protocol to improve congestion control in TCP

TCP Connection Management

tcp_max_syn_backlog
net.ipv4.tcp_max_syn_backlog (/proc/sys/net/ipv4/tcp_max_syn_backlog)
This variable controls the length of the TCP Syn Queue for each port

tcp_synack_retries
net.ipv4/tcp_synack_retries (/proc/sys/net/ipv4/tcp_synack_retries)
This variable controls the number of times the kernel retransmits the SYN/ACK response to an incoming connection request (SYN)
Reducing this number results in earlier detection of a failed connection attempt from the remote host

tcp_retries2
net.ipv4/tcp_retries2 (/proc/sys/net/ipv4/tcp_retries2)
This variable controls the number of times the kernel tries to resend data to a remote host with which it has an established connection
Reducing this number results in earlier detection of a failed connection to the remote host
This allows busy servers to quickly free up the resources tied to the failed connection
makes it easier for the server to support a larger number of simultaneous connections

TCP Keep-Alive Management

tcp_keepalive_time
net.ipv4.tcp_keepalive_time (/proc/sys/net/ipv4/tcp_keepalive_time)
If a connection is idle for the number of seconds specified by this parameter
the kernel initiates a probing of the connection to the remote host

tcp_keepalive_intvl
net.ipv4.tcp_keepalive_intvl (/proc/sys/net/ipv4/tcp_keepalive_intvl)
This parameter specifies the time interval in seconds between the keepalive probes sent by the kernel to the remote host

tcp_keepalive_probes
net.ipv4.tcp_keepalive_probes (/proc/sys/net/ipv4/tcp_keepalive_probes)
This parameter specifies the maximum number of keepalive probes the kernel sends to the remote host to detect if it is still alive

The default values are as follows:
tcp_keepalive_time = 7200 seconds (2 hours)
tcp_keepalive_probes = 9
tcp_keepalive_intvl = 75 seconds
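A sketch of detecting dead peers faster than the two-hour default (values are illustrative):

sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=5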

IP Port Space Range

ip_local_port_range
sysctl.net.ipv4.ip_local_port_range (/proc/sys/net/ipv4/ip_local_port_range)
This parameter specifies the range of ephemeral ports that are available to the system
increasing this range allows a larger number of simultaneous connections for each protocol (TCP and UDP)
on systems with more than 128MB of memory, it is set to 32768 to 61000
maximum of 28,232 ports can be in use simultaneously
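A sketch of widening the ephemeral port range (values are illustrative):

echo "1024 65000" > /proc/sys/net/ipv4/ip_local_port_range
cat /proc/sys/net/ipv4/ip_local_port_range    # verify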

?? What is a port ??
A port is a logical abstraction that the IP protocol uses as an address of sorts to distinguish between individual sockets
it is simply an integer sequence space

?? what can you tune with sysctl ??

TCP socket and buffer sizes, i.e., max socket connections and backlogs
TCP buffer sizes, i.e., max window sizes and socket buffers
TCP memory management, i.e., TCP read/write buffers
TCP connection management, i.e., keepalives and intervals
IP port space range
(a sketch for persisting these settings follows below)
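A sketch of making such settings persistent in /etc/sysctl.conf (values are illustrative):

# /etc/sysctl.conf
net.core.rmem_max = 262144
net.core.wmem_max = 262144
net.ipv4.ip_local_port_range = 1024 65000

# apply without rebooting
sysctl -p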

--------------------------------------------------------------------
--------------------------------------------------------------------

What Is Interprocess Communication?

Interprocess communication allows processes to synchronize with each other and exchange data. In general, System V (SysV) IPC facilities provide three types of resources:

Semaphores. Allow processes to synchronize with each other and also prevent collisions when multiple processes are sharing resources.

Message queues. Asynchronously pass small data, such as messages, between processes.

Shared memory segments. Provide a fast way for processes to share relatively large amounts of data by sharing a common segment of memory among multiple processes.

In addition to these resources, IPC pipes and FIFOs are among the most commonly used IPC facilities in UNIX-based systems:

Pipes are unidirectional, first-in/first-out data channels that pass unstructured data streams between related processes.

FIFOs (a.k.a. named pipes) are pipes that have a persistent name associated with them.


ipcs -u //resources
ipcs -l //limits



--------------------------------------------------------------------
"Too many open files?" adjust the ulimit

the default open-files limit (ulimit -n) on Linux is 1024
ulimit -n 2048   # raise the limit for the current shell
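To make the limit persistent across logins, it is typically set in /etc/security/limits.conf (assuming pam_limits is enabled; values are illustrative):

# /etc/security/limits.conf
*    soft    nofile    2048
*    hard    nofile    4096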

--------------------------------------------------------------------
IDE DISKS ONLY

hdparm /dev/hda
IO_support = 0 (default 16-bit)

put in rc.local
hdparm -c 1 /dev/hda
IO_support = 1 (32-bit)

DMA
hdparm -d 1 /dev/hda
--------------------------------------------------------------------


////////////////////////////////////////////////////////////////////
Other Misc questions

/17 network
formula: 2^n - 2 (where n is the number of bits)
nodes? /17 leaves 15 host bits: 2^15 - 2 = 32,766 usable hosts
netmask? 255.255.128.0

2^s - 2 available subnets (s = subnet bits, classic rules) and 2^h - 2 available hosts (h = host bits)
http://www.pantz.org/networking/tcpip/subnetchart.shtml
CIDR - Classless Inter-Domain Routing ie. /17

3-way Tcp handshake syn-->, <--syn/ack, ack-->
three-way handshake
"SYN" to establish communication and "synchronize" sequence numbers in counting bytes of data which will be exchanged
destination then sends a "SYN/ACK" which again "synchronizes" his byte count with the originator and acknowledges the initial packet
originator then returns an "ACK" which acknowledges the packet the destination just sent him

connection is now "OPEN" and ongoing communication between the originator and the destination are permitted until one of them issues a "FIN" packet, or a "RST" packet, or the connection times out


faster, order 1-4 ?:
? context switch
? read from ram
? read from disk
? read from cpu register

2 to the 11th power = 2048

2^1 = 2
2^2 = 4
2^3 = 8
2^4 = 16
2^5 = 32
2^6 = 64
2^7 = 128
2^8 = 256
2^9 = 512
2^10 = 1024

How to reduce the sync or seek time for data on a hard disk?

////////////////

know these services and ports:

#ftpd
ftp 21 tcp
ftp-data 20 tcp //data connection
active mode = client connects to server port 21 from a random unprivileged port; the server opens the data connection back to the client from port 20
passive mode = client initiates both the control and the data connections to the server
http://slacksite.com/other/ftp.html

#tftpd-hpa
tftp 69 udp //trivial file transfer protocol

#sshd + scp
ssh 22 tcp
sftp 115/tcp and 115/udp //legacy Simple File Transfer Protocol entries in /etc/services; SSH-based sftp and scp run over port 22


#in.telnetd typically run through inetd & inetd.conf
telnet 23 tcp

#bind9 aka named
dns 53 udp //queries
53 tcp //zone transfers and large responses

953 rndc control socket bind9

#dhcpd
67 & 68 = bootpc (client) bootps (server)
67 = dhcp Dynamic Host Configuration Protocol
DHCP is based on BOOTP and maintains some backward compatibility
RARP is a protocol used by Sun and other vendors that allows a computer to find out its own IP number
DHCP, like BOOTP runs over UDP, utilizing ports 67 and 68

#apache
http 80 tcp
https 443 tcp //ssl

#mail services...
pop3 110 tcp
pop3s 995 tcp //ssl

imap 143 tcp
imaps 993 tcp //ssl

smtp 25 tcp
smtps 465 tcp //ssl

#databases...
postgres 5432 tcp
mysql 3306 tcp

#nfs
http://nfs.sourceforge.net/nfs-howto/security.html
portmap aka sunrpc
111 tcp & udp
portmapper, rpc.statd, and rpc.lockd
mountd
statd, mountd, lockd, and rquotad
nfs 2049/tcp nfsd
nfs 2049/udp nfsd

#auth 113 tcp
host auth stuff
ircd 6667/tcp # Internet Relay Chat
ircd 6667/udp # Internet Relay Chat


#ntp 123 network time protocol
ntpdate

rsync 873/tcp # rsync
rsync -avz /data remotehost:/data //remotehost is an example destination

syslog 514/udp
loghost

snmp 161/tcp # Simple Net Mgmt Proto
snmp 161/udp # Simple Net Mgmt Proto
snmptrap 162/udp snmp-trap # Traps for SNMP


x11 6000/tcp X # the X Window System

#microsoft
netbios-ns 137/tcp # NETBIOS Name Service
netbios-ns 137/udp
netbios-dgm 138/tcp # NETBIOS Datagram Service
netbios-dgm 138/udp
netbios-ssn 139/tcp # NETBIOS session service
netbios-ssn 139/udp
microsoft-ds 445/tcp # microsoft name services
microsoft-ds 445/udp
ms-sql-s 1433/tcp # Microsoft-SQL-Server
ms-sql-s 1433/udp # Microsoft-SQL-Server
ms-sql-m 1434/tcp # Microsoft-SQL-Monitor
ms-sql-m 1434/udp # Microsoft-SQL-Monitor
wins 1512/tcp # Microsoft's Windows Internet Name Service
wins 1512/udp # Microsoft's Windows Internet Name Service
