7
perf stat Collects statistics
8
perf record Record events
9
perf report/annotate Report the recorded events
10
perf diff Difference between 2 saved profiles
11
perf probe Dynamic tracepoints on kernel and user-space code
12
perf sched Schedulling profile
13
perf mem/kmem Kernel memory allocation and access
14
perf lock Spinlocks (require CONFIG_LOCKDEP and CONFIG_LOCK_STAT which are not enabled by default in SuSE)
15
perf trace Simple strace-style trace
16
perf timechart Visualization
21
- Compile everything with -g -gdwarf-2 -g3 -fno-omit-frame-pointer
22
* frame pointers are needed to resolve callgraph
23
* dwarf is a workaround for missing frame pointers using libunwind
24
* Haswell provides LBR (Last Branch Record) facility which can be used by recent perf to resolve callgraph (not tested)
25
- We also need frame pointers to resolve callgraph's
26
* CONFIG_FRAME_POINTER=y (not set on SuSE)
27
- Sometimes, the following may help with resolution of symbols in kernel
28
echo 0 > /proc/sys/kernel/kptr_restrict
29
- Perf fails to find modules in /lib/modules/kernel-*/extra
30
* Temporarily move them into the /lib/modules/kernel-*/kernel/extra
35
perf top (show the most time consuming fucntions)
36
perf top -p PID (just for selected process)
38
perf timechart record -- sleep 1 (record system-wide statistics)
39
perf timechart (generate timeline in svg, best viewed using chromium)
41
perf stat -d ./mult_sse_debug
42
perf stat -d -p PID (specified process/thread, until Ctrl+C)
43
perf stat -d -p PID -- sleep 5 (specified process/thread, for 5 seconds)
44
perf stat -d -a (system wide)
46
perf record -F 1000 ./app (sampling at 1000 Hz)
47
perf record -F 100 -g dwarf ./app (-g instruct to include stack traces, with dwarf much more information obtained)
48
perf record -F 100 --call-graph lbr ./app (a better approach for stack traces, but requires Haswell processor and 4.1+ kernel)
49
perf report (all samples separately)
50
perf script (better for script processing)
51
perf report -n (standard profile)
52
perf report -n --stdio (output to standard output)
53
perf annotate --source dma_ipe_stream_read (mixed with assembler, no other way)
55
perf stat -e event1,event2 <app> (specific events)
56
perf stat -e r80a2 <app> (custom events where format is rUUEE, UU is umask and EE is event number)
57
perf record -e L1-dcache-loads-misses -g <app> (catch callgraph on all events when L1 cache miss occurs)
58
perf record -e event -c 1000 -g <app> (catch only 1 event in 1000 to avoid performance pentalty)
60
* In OpenSuSE many events are disabled by default, a tracing kernel is required
61
* perf tracing is much faster compared to strace (but still incur significant penalty unless limited)
62
perf stat -e 'syscalls:sys_enter_*' Count system calls (seperate number for each of them)
63
perf stat -e 'block:*' -a Count block device I/O events
64
perf record -e 'syscalls:*' -a Will also provide some arguments for syscall
65
- Filter events based on the parameters
66
perf record -e 'syscalls:sys_enter_read' --filter 'count == 1 || count == 8' -a
67
- Create histograms (size of incomming read requests)
68
perf stat -e syscalls:sys_enter_read --filter 'count < 1024' -e syscalls:sys_enter_read --filter 'count >= 1024 && count < 1048576' -e syscalls:sys_enter_read --filter 'count > 1048576' -a -- sleep 5
69
- Memory (just executes record/report with appropriate set of parameters for memory profilling)
70
perf mem -t load record <application> (recording)
71
perf mem -t load report (show number of specified memory operations)
72
perf mem -t load report --sort mem
74
perf bench (microbenchmarks)
75
perf bench mem memcpy (memcpy performance, strange non-consistent results...)
79
- Instrument the kernel tcp_sendmsg() function (get all the calls with timestamps and callgraph if stack is enabled)
80
perf probe --add tcp_sendmsg
81
perf record -e probe:tcp_sendmsg -ag -- sleep 5
82
perf probe --del tcp_sendmsg
84
- Adding probes on invocation/return of kernel/module function and measuring latencies
85
perf probe --add pcidriver_ioctl_in=pcidriver_ioctl
86
perf probe --add pcidriver_ioctl_out=pcidriver_ioctl%return
87
perf record -p <pid> -e probe:pcidriver_ioctl_in -e probe:pcidriver_ioctl_out -- sleep 5
88
perf script | grep -v '#' | awk '{ gsub(/:/, "") } $5 ~ /ioctl_in/ { ts[$1, $2] = $4 } $5 ~ /ioctl_out/ { if (l = ts[$1, $2]) { printf "%.f %.f\n", $4 * 1000000, ($4 - l) * 1000000; ts[$1, $2] = 0 } }' > out.lat_us
89
perf probe --del pcidriver_ioctl_in
90
perf probe --del pcidriver_ioctl_out
91
- Inspecting variable values, we just expect the kmem ids and measure function execution time depending on it
92
perf probe -m pciDriver --add 'kmem_sync_in=pcidriver_kmem_sync ID=kmem_sync->handle.handle_id'
93
perf probe -m pciDriver --add 'kmem_sync_out=pcidriver_kmem_sync:37 ID=kmem_entry->id'
94
perf record -p 16390 -e probe:kmem_sync_in -e probe:kmem_sync_out -- sleep 5
96
- Instrumenting user-space functions
97
* Requires recent kernel, 4.0 is reported to work and 3.11 is reported not to
99
perf probe -m <module> ... - Specify module to instrument (or full path to module)
100
perf probe -x <full_path_to_lib_or_binary> - Specify object to instrument
101
perf probe -l - List defined probe
102
perf probe -F - Show functions to instrument
103
perf probe -m pciDriver -L pcidriver_ioctl - Show lines to instrument
104
perf probe -m pciDriver -V pcidriver_ioctl:8 - Show define variables at defined position in code
108
perf sched record -p <pid> -- sleep 5 - Record schedulling information
109
perf sched latency --sort max - Report most significant latencies (how much task have waited for scheduler to put it on CPU)
110
perf sched map - Show the task placement (. - idle CPU, * - event on cpu (new task?), X# - a task running on CPU)
111
perf sched trace - Show raw trace
112
perf sched replay - Emulates the workload (the apps will be visible in top, etc...)
117
perf kmem record -- sleep 1 - Record kernel memory allocation statistics
118
perf kmem stat --caller [--page|--slab] - Per-symbol statistics on usage of slab/page kernel allocator
119
perf kmem stat --alloc [...] - Show usage statistics of individual allocated memory blocks
120
perf mem record - Record kernel memory access statistics
121
perf mem report -t [load|store]
126
perf lock info -t - current status
127
perf lock record -- sleep 1 - record
128
perf lock report - report
132
- Measure disk I/O latencies
133
perf record -e block:block_rq_issue -e block:block_rq_complete -a -- sleep 5
134
perf script | awk '{ gsub(/:/, "") } $5 ~ /issue/ { ts[$6, $10] = $4 } $5 ~ /complete/ { if (l = ts[$6, $9]) { printf "%.f %.f\n", $4 * 1000000, ($4 - l) * 1000000; ts[$6, $10] = 0 } }' > out.lat_us
138
perf stat -e task-clock,cycles,instructions,cache-references,cache-misses <app>
139
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores L1 details
143
perf record -e sched:sched_process_exec -a Started processes
144
perf record -e syscalls:sys_enter_connect -a Created connection
145
perf record -e 'skb:consume_skb' -ag Who is doing socket I/O
149
- Very intuitive profiller charts which can be obtained instead opreport.
150
* git clone https://github.com/brendangregg/FlameGraph
152
* perf script | ./stackcollapse-perf.pl > out.perf-folded
153
* ./flamegraph.pl out.perf-folded > perf-kernel.svg
154
* Use mozilla to open svg file
155
- Another tool to visualize latencies produced like exlained in probes
156
* git clone https://github.com/brendangregg/HeatMap.git
159
* ./trace2heatmap.pl --unitstime=us --unitslat=us --maxlat=50 --unitstime ms out.lat_us > out.svg
160
* Use mozilla to open svg file