Linux , perf , 性能诊断 , stap , systemtap , strace , dtrace , dwarf , profiler , perf_events
Linux在服务端已占据非常大的比例,很多业务很多服务都跑在Linux上面。
软件运行在Linux下,软件本身、以及Linux系统的性能诊断也成为热门的话题。
例如,你要如何回答这些问题
Why is the kernel on-CPU so much? What code-paths? Which code-paths are causing CPU level 2 cache misses? Are the CPUs stalled on memory I/O? Which code-paths are allocating memory, and how much? What is triggering TCP retransmits? Is a certain kernel function being called, and how often? What reasons are threads leaving the CPU?又或者你是一名DBA或者开发人员,想知道数据库在跑某些benchmark时,性能瓶颈在哪里,是IO,是等待,还是网络,代码瓶颈在哪里?
在Linux下诊断的工具比较多,比如systemtap, dtrace, perf。
本文将介绍一下perf的用法,网上很多叫法如perf_events , perf profiler , Performance Counters for Linux。叫法不同,都指perf。
perf是Linux 2.6+内核中的一个工具,在内核源码包中的位置 tools/perf。
perf利用Linux的trace特性,可以用于实时跟踪,统计event计数(perf stat);或者使用采样(perf record),报告(perf report|script|annotate)的使用方式进行诊断。
perf命令行接口并不能利用所有的Linux trace特性,有些trace需要通过ftrace接口得到。
参考 https://github.com/brendangregg/perf-tools
这张图大致列出了perf支持的跟踪事件,从kernerl到user space,支持块设备、网络、CPU、文件系统、内存等,同时还支持系统调用,用户库的事件跟踪。
你可以使用perf list输出当前内核perf 支持的预置events
perf list List of pre-defined events (to be used in -e): ref-cycles [Hardware event] alignment-faults [Software event] context-switches OR cs [Software event] cpu-clock [Software event] cpu-migrations OR migrations [Software event] dummy [Software event] emulation-faults [Software event] major-faults [Software event] minor-faults [Software event] page-faults OR faults [Software event] task-clock [Software event] .....略....... writeback:writeback_pages_written [Tracepoint event] writeback:writeback_queue [Tracepoint event] writeback:writeback_task_start [Tracepoint event] writeback:writeback_task_stop [Tracepoint event]我们看到perf支持这么多的事件和trace,它依赖了很多的接口来干这件事情。
1. Symbols
没有符号表,无法将内存地址翻译成函数和变量名。
例如,无符号表的跟踪显示如下
57.14% sshd libc-2.15.so [.] connect | --- connect | |--25.00%-- 0x7ff3c1cddf29 | |--25.00%-- 0x7ff3bfe82761 | 0x7ff3bfe82b7c | |--25.00%-- 0x7ff3bfe82dfc --25.00%-- [...]有符号表的跟踪显示如下
57.14% sshd libc-2.15.so [.] __GI___connect_internal | --- __GI___connect_internal | |--25.00%-- add_one_listen_addr.isra.0 | |--25.00%-- __nscd_get_mapping | __nscd_get_map_ref | |--25.00%-- __nscd_open_socket --25.00%-- [...]如何安装符号表?
对于内核代码的符号表,在编译内核时,使用CONFIG_KALLSYMS=y。 检查如下
# cat /boot/config-`uname -r` |grep CONFIG_KALLSYMS CONFIG_KALLSYMS=y CONFIG_KALLSYMS_ALL=y CONFIG_KALLSYMS_EXTRA_PASS=y对于用户安装软件的符号表,如果是yum安装的,可以安装对应的debuginfo包。
如果是用户自己编译的,例如使用GCC编译时加上-g选项。
2. perf annotate
perf annotate can generate sourcecode level information if the application is compiled with -ggdb.
3. Stack Traces (使用perf record -g收集stack traces)
要跟踪完整的stack,编译时需要注意几个东西。
Always compile with frame pointers. Omitting frame pointers is an evil compiler optimization that breaks debuggers, and sadly, is often the default. Without them, you may see incomplete stacks from perf_events, like seen in the earlier sshd symbols example. There are two ways to fix this: either using dwarf data to unwind the stack, or returning the frame pointers.3.1 编译perf时包含libunwind和-g dwarf,需要3.9以上的内核版本。
Since about the 3.9 kernel, perf_events has supported a workaround for missing frame pointers in user-level stacks: libunwind, which uses dwarf. This can be enabled using "-g dwarf".3.2 有些编译优化项会忽略frame pointer,所以编译软件时必须指定 -fno-omit-frame-pointer ,才能跟踪完整的stack trace.
The earlier sshd example was a default build of OpenSSH, which uses compiler optimizations (-O2), which in this case has omitted the frame pointer. Here's how it looks after recompiling OpenSSH with -fno-omit-frame-pointer3.3 编译内核时包含 CONFIG_FRAME_POINTER=y
总结一下,要愉快的跟踪更完备的信息,就要在编译软件时打开符号表的支持(gcc -g),开启annotate的支持(gcc -ggdb),以及Stack trace的支持(gcc -fno-omit-frame-pointer)。
了解了perf event后,我们可以更精细的,有针对性的对事件进行跟踪,采样,报告。
当然,你也可以不指定事件,全面采样。
例如centos你可以使用yum安装,也可以使用源码安装。
perf在内核源码的tools/perf中,所以下载与你的内核大版本一致的内核源码即可
uname -a wget https://cdn.kernel.org/pub/linux/kernel/v3.x/linux-3.10.104.tar.xz tar -xvf linux-3.10.104.tar.xz cd linux-3.10.104/tools/perf/安装依赖库,有一个小窍门可以找到依赖的库
$cat Makefile |grep found msg := $(warning No libelf found, disables 'probe' tool, please install elfutils-libelf-devel/libelf-dev); msg := $(error No gnu/libc-version.h found, please install glibc-dev[el]/glibc-static); msg := $(warning No libdw.h found or old libdw.h found or elfutils is older than 0.138, disables dwarf support. Please install new elfutils-devel/libdw-dev); msg := $(warning No libunwind found, disabling post unwind support. Please install libunwind-dev[el] >= 0.99); msg := $(warning No libaudit.h found, disables 'trace' tool, please install audit-libs-devel or libaudit-dev); msg := $(warning slang not found, disables TUI support. Please install slang-devel or libslang-dev); msg := $(warning GTK2 not found, disables GTK2 support. Please install gtk2-devel or libgtk2.0-dev); $(if $(1),$(warning No $(1) was found)) msg := $(warning No bfd.h/libbfd found, install binutils-dev[el]/zlib-static to gain symbol demangling) msg := $(warning No numa.h found, disables 'perf bench numa mem' benchmark, please install numa-libs-devel or libnuma-dev);通常依赖 gcc make bison flex elfutils libelf-dev libdw-dev libaudit-dev python-dev binutils-dev
并不是每个开关都需要,但是有些没有就不方便或者功能缺失,例如没有打开符号表的话,看到的是一堆内存地址。
# for perf_events: CONFIG_PERF_EVENTS=y # for stack traces: CONFIG_FRAME_POINTER=y # kernel symbols: CONFIG_KALLSYMS=y # tracepoints: CONFIG_TRACEPOINTS=y # kernel function trace: CONFIG_FTRACE=y # kernel-level dynamic tracing: CONFIG_KPROBES=y CONFIG_KPROBE_EVENTS=y # user-level dynamic tracing: CONFIG_UPROBES=y CONFIG_UPROBE_EVENTS=y # full kernel debug info: CONFIG_DEBUG_INFO=y # kernel lock tracing: CONFIG_LOCKDEP=y # kernel lock tracing: CONFIG_LOCK_STAT=y # kernel dynamic tracepoint variables: CONFIG_DEBUG_INFO=y一些开关的用途介绍
1. Kernel-level symbols are in the kernel debuginfo package, or when the kernel is compiled with CONFIG_KALLSYMS.
2. The kernel stack traces are incomplete. Now a similar profile with CONFIG_FRAME_POINTER=y
3. 当我们使用perf record [stack traces (-g)]时,可以跟踪stack,但是如果内核编译时没有指定CONFIG_FRAME_POINTER=y,perf report时就会看到缺失的信息。
不包含CONFIG_FRAME_POINTER=y时
99.97% swapper [kernel.kallsyms] [k] default_idle | --- default_idle 0.03% sshd [kernel.kallsyms] [k] iowrite16 | --- iowrite16 __write_nocancel (nil)包含CONFIG_FRAME_POINTER=y时 (Much better -- the entire path from the write() syscall (__write_nocancel) to iowrite16() can be seen.)
99.97% swapper [kernel.kallsyms] [k] default_idle | --- default_idle cpu_idle | |--87.50%-- start_secondary | --12.50%-- rest_init start_kernel x86_64_start_reservations x86_64_start_kernel 0.03% sshd [kernel.kallsyms] [k] iowrite16 | --- iowrite16 vp_notify virtqueue_kick start_xmit dev_hard_start_xmit sch_direct_xmit dev_queue_xmit ip_finish_output ip_output ip_local_out ip_queue_xmit tcp_transmit_skb tcp_write_xmit __tcp_push_pending_frames tcp_sendmsg inet_sendmsg sock_aio_write do_sync_write vfs_write sys_write system_call_fastpath __write_nocancel4. 如果要使用动态跟踪,即跟踪任意指定代码,则需要打开这些开关:
For kernel analysis, using CONFIG_KPROBES=y and CONFIG_KPROBE_EVENTS=y, to enable kernel dynamic tracing. and CONFIG_FRAME_POINTER=y, for frame pointer-based kernel stacks.
For user-level analysis, CONFIG_UPROBES=y and CONFIG_UPROBE_EVENTS=y, for user-level dynamic tracing.
5. 如果打开了CONFIG_DEBUG_INFO,则可以在动态跟踪中打印内核变量的值。
If your kernel has debuginfo (CONFIG_DEBUG_INFO=y), you can fish out kernel variables from functions. This is a simple example of examining a size_t (integer)
例如
1. Listing variables available for tcp_sendmsg(): # perf probe -V tcp_sendmsg Available variables at tcp_sendmsg @<tcp_sendmsg+0> size_t size struct kiocb* iocb struct msghdr* msg struct sock* sk 2. Creating a probe for tcp_sendmsg() with the "size" variable: # perf probe --add 'tcp_sendmsg size' Added new event: probe:tcp_sendmsg (on tcp_sendmsg with size) 3. You can now use it in all perf tools, such as: perf record -e probe:tcp_sendmsg -aR sleep 1通过以下命令可以查看linux的config
cat /boot/config-`uname -r` # # Automatically generated make config: don't edit # Linux kernel version: 2.6.32-573.el6.x86_64 # Thu Jul 23 15:38:20 2015 # CONFIG_64BIT=y # CONFIG_X86_32 is not set CONFIG_X86_64=y CONFIG_X86=y CONFIG_OUTPUT_FORMAT="elf64-x86-64" CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig" CONFIG_GENERIC_CMOS_UPDATE=y .......先了解一下概貌
perf 命令用法还是挺简单的,根据功能区分了COMMAND,每个COMMAND有各自的用法。
用得比较多的有list, record, report, script, stat, top。
usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS] The most commonly used perf commands are: annotate Read perf.data (created by perf record) and display annotated code archive Create archive with object files with build-ids found in perf.data file bench General framework for benchmark suites buildid-cache Manage build-id cache. buildid-list List the buildids in a perf.data file data Data file related processing diff Read perf.data files and display the differential profile evlist List the event names in a perf.data file inject Filter to augment the events stream with additional information kmem Tool to trace/measure kernel memory properties kvm Tool to trace/measure kvm guest os list List all symbolic event types lock Analyze lock events mem Profile memory accesses record Run a command and record its profile into perf.data report Read perf.data (created by perf record) and display the profile sched Tool to trace/measure scheduler properties (latencies) script Read perf.data (created by perf record) and display trace output stat Run a command and gather performance counter statistics test Runs sanity tests. timechart Tool to visualize total system behavior during a workload top System profiling tool. probe Define new dynamic tracepoints trace strace inspired tool See 'perf help COMMAND' for more information on a specific command.要得到每个command的用法也蛮简单,可以使用perf help COMMAND得到。
例如
perf help record PERF-RECORD(1) perf Manual PERF-RECORD(1) NAME perf-record - Run a command and record its profile into perf.data SYNOPSIS perf record [-e <EVENT> | --event=EVENT] [-l] [-a] <command> perf record [-e <EVENT> | --event=EVENT] [-l] [-a] — <command> [<options>] DESCRIPTION This command runs a command and gathers a performance counter profile from it, into perf.data - without displaying anything. This file can then be inspected later on, using perf report. OPTIONS <command>... Any command you can specify in a shell. .....跟踪时可以指定事件,CPU,以及是否跟踪stack trace。
perf top -ag -a, --all-cpus System-wide collection. (default) -g Enables call-graph (stack chain/backtrace) recording.输出如下
Samples: 240 of event 'cpu-clock', Event count (approx.): 19122881 Children Self Shared Object Symbol + 14.64% 14.64% [kernel] [k] _spin_unlock_irqrestore + 10.91% 10.91% libslang.so.2.2.1 [.] SLtt_smart_puts + 6.02% 6.02% perf [.] symbols__insert + 6.02% 6.02% [kernel] [k] kallsyms_expand_symbol + 6.01% 6.01% [kernel] [k] copy_page + 3.96% 0.64% libc-2.12.so [.] _int_malloc + 3.61% 3.61% [kernel] [k] number + 3.31% 3.31% [kernel] [k] clear_page + 2.71% 2.71% [kernel] [k] pointer ....输入 ? 可以得到top的帮助介绍
│ │UP/DOWN/PGUP │ │PGDN/SPACE Navigate │ │q/ESC/CTRL+C Exit browser │ │ │ │For multiple event sessions: │ │ │ │TAB/UNTAB Switch events │ │ │ │For symbolic views (--sort has sym): │ │ │ │-> Zoom into DSO/Threads & Annotate current symbol│ │<- Zoom out │ │a Annotate current symbol │ │C Collapse all callchains │ │d Zoom into current DSO │ │E Expand all callchains │ │F Toggle percentage of filtered entries │ │H Display column headers │ │P Print histograms to perf.hist.N │ │t Zoom into current Thread │ │V Verbose (DSO names in callchains, etc) │ │z Toggle zeroing of samples │ │/ Filter symbol by name输入E全部展开,展开后可以得到stack trace的结果, 如果发现有地址的信息,但是没有符号表的信息,可能是软件编译时没有指定-g,如果stack信息不完整,则软件编译时需要加-fno-omit-frame-pointer(参考前面章节的介绍)。
Samples: 1K of event 'cpu-clock', Event count (approx.): 17510648 Children Self Shared Object Symbol - 28.51% 28.51% [kernel] [k] _spin_unlock_irqrestore -→_spin_unlock_irqrestore [kernel] 17.24% __write_nocancel libc-2.12.so 13.79% 0xe7cd libpthread-2.12.so 10.34% 0x7f670b15b923 unknown 10.34% __fdatasync_nocancel libc-2.12.so 0x1b00000017 unknown 4 10.34% __libc_fork libc-2.12.so 6.90% __select libc-2.12.so 3.45% 0x7f670b15b822 unknown 3.45% 0x7f670b14d560 unknown 3.45% 0x7f670b11ede7 unknown 3.45% 0x7f670b11eb5d unknown 0x2ad1 pam_unix.so 0xdcdd4c000000000 unknown 3.45% __clone libc-2.12.so 3.45% __libc_writev libc-2.12.so 3.45% __read_nocancel libc-2.12.so 3.45% 0x7b294 libslang.so.2.2.1 3.45% 0x7efee6ee2560 unknown输入C全部收起
Samples: 240 of event 'cpu-clock', Event count (approx.): 19122881 Children Self Shared Object Symbol + 14.64% 14.64% [kernel] [k] _spin_unlock_irqrestore + 10.91% 10.91% libslang.so.2.2.1 [.] SLtt_smart_puts + 6.02% 6.02% perf [.] symbols__insert + 6.02% 6.02% [kernel] [k] kallsyms_expand_symbol + 6.01% 6.01% [kernel] [k] copy_page + 3.96% 0.64% libc-2.12.so [.] _int_malloc + 3.61% 3.61% [kernel] [k] number + 3.31% 3.31% [kernel] [k] clear_page + 2.71% 2.71% [kernel] [k] pointer ....Symbol列[]中字符代表的含义如下,通常是k或者.表示kernel和user space事件。
[.]: user level [k]: kernel level [g]: guest kernel level (virtualization) [u]: guest os user space [H]: hypervisorperf record用来收集统计信息,通常的用法包括
1. 使用record收集统计信息,可以收集全局的,指定PID或者线程ID的,指定CPU的,指定USER ID的。
-a, --all-cpus System-wide collection from all CPUs. -p, --pid= Record events on existing process ID (comma separated list). -t, --tid= Record events on existing thread ID (comma separated list). This option also disables inheritance by default. Enable it by adding --inherit. -u, --uid= Record events in threads owned by uid. Name or number.2. 收集间隔,可以指定采样的频率,采样的EVENT数,采样时间。
-c, --count= Event period to sample. -F, --freq= Profile at this frequency.3. 采集的详细程度,可以指定event,使用CPU实时调度策略的进程(可以参考RHEL 讲CGROUP cpu部分的文章),是否跟踪stack chain,
-r, --realtime= Collect data with this RT SCHED_FIFO priority. -g Enables call-graph (stack chain/backtrace) recording. --call-graph Setup and enable call-graph (stack chain/backtrace) recording, implies -g. Allows specifying "fp" (frame pointer) or "dwarf" (DWARF′s CFI - Call Frame Information) or "lbr" (Hardware Last Branch Record facility) as the method to collect the information used to show the call graphs. In some systems, where binaries are build with gcc --fomit-frame-pointer, using the "fp" method will produce bogus call graphs, using "dwarf", if available (perf tools linked to the libunwind library) should be used instead. Using the "lbr" method doesn′t require any compiler options. It will produce call graphs from the hardware LBR registers. The main limition is that it is only available on new Intel platforms, such as Haswell. It can only get user call chain. It doesn′t work with branch stack sampling at the same time. -e, --event= Select the PMU event. Selection can be: · a symbolic event name (use perf list to list all events) · a raw PMU event (eventsel+umask) in the form of rNNN where NNN is a hexadecimal event descriptor. · a symbolically formed PMU event like pmu/param1=0x3,param2/ where param1, param2, etc are defined as formats for the PMU in /sys/bus/event_sources/devices/<pmu>/format/*. · a symbolically formed event like pmu/config=M,config1=N,config3=K/ where M, N, K are numbers (in decimal, hex, octal format). Acceptable values for each of ′config′, ′config1′ and ′config2′ are defined by corresponding entries in /sys/bus/event_sources/devices/<pmu>/format/* param1 and param2 are defined as formats for the PMU in: /sys/bus/event_sources/devices/<pmu>/format/* · a group of events surrounded by a pair of brace ("{event1,event2,...}"). Each event is separated by commas and the group should be quoted to prevent the shell interpretation. You also need to use --group on "perf report" to view group events together. --filter=<filter> Event filter.例子
开始跑一个benchmark,同时收集统计信息(指定-a, -g选项)。 CFLAGS='-g -ggdb -fno-omit-frame-pointer' ./configure --prefix=/home/digoal/pgsql9.6 make world -j 32 make install-world -j 32 初始化略 pgbench -i -s 100 跟踪 perf record -ag Ctrl-C,退出perf record,统计信息已输出到perf.data。解读前面收集到的perf.data.
常用的开关如下,--tui是交互式的文本显示窗口,--stdio是文本显示窗口。
-i, --input= Input file name. (default: perf.data) -v, --verbose Be more verbose. (show symbol address, etc) -n, --show-nr-samples Show the number of samples for each symbol -g [type,min], --call-graph Display call chains using type and min percent threshold. type can be either: · flat: single column, linear exposure of call chains. · graph: use a graph tree, displaying absolute overhead rates. · fractal: like graph, but displays relative rates. Each branch of the tree is considered as a new profiled object. Default: fractal,0.5. --stdio Use the stdio interface. --tui Use the TUI interface, that is integrated with annotate and allows zooming into DSOs or threads, among other features. Use of --tui requires a tty, if one is not present, as when piping to other commands, the stdio interface is used.交互式显示例子,看概貌挺方便的 (常用的交互命令: E扩展,C收敛,q退出)
#perf report -g --tui Events: 52K cycles + 11.17% postgres postgres [.] hash_seq_search [.]表示user call, [k]表示kernel call + 4.41% pgbench libc-2.12.so [.] __GI_vfprintf + 4.02% postgres postgres [.] NextCopyFrom + 3.27% postgres [kernel.kallsyms] [k] copy_user_generic_string + 2.83% postgres postgres [.] CopyReadLineText + 2.25% postgres postgres [.] pg_comp_crc32c_sse42 + 2.18% postgres libc-2.12.so [.] memcpy + 1.99% postgres postgres [.] CopyReadAttributesText + 1.72% postgres [xfs] [k] ftrace_raw_init_event_xfs_dir2_leafn_add + 1.65% postgres postgres [.] heap_fill_tuple + 1.52% postgres libc-2.12.so [.] __GI_____strtoll_l_internal + 1.26% postgres postgres [.] heap_form_tuple + 1.22% pgbench libc-2.12.so [.] _IO_default_xsputn_internal + 1.17% postgres postgres [.] AllocSetAlloc + 1.16% postgres postgres [.] heap_compute_data_size + 1.10% postgres postgres [.] heap_multi_insert + 1.03% postgres postgres [.] pg_atoi + 1.03% postgres libc-2.12.so [.] __memset_sse2 + 1.01% postgres postgres [.] pg_verify_mbstr_len + 0.95% pgbench libc-2.12.so [.] _itoa_word + 0.89% postgres postgres [.] int4in + 0.83% postgres postgres [.] resetStringInfo + 0.81% postgres postgres [.] InputFunctionCall + 0.81% postgres postgres [.] hash_search_with_hash_value + 0.79% postgres postgres [.] ExecConstraints + 0.71% postgres postgres [.] CopyFrom + 0.71% pgbench libc-2.12.so [.] __strchrnul + 0.67% postgres postgres [.] PageAddItemExtended + 0.66% postgres [kernel.kallsyms] [k] find_get_page + 0.65% postgres postgres [.] CopyReadLine + 0.62% postgres [kernel.kallsyms] [k] page_fault + 0.58% postgres postgres [.] heap_prepare_insert + 0.57% postgres postgres [.] heap_prune_chain + 0.52% postgres postgres [.] bpcharin + 0.52% pgbench libpq.so.5.9 [.] PQputCopyData + 0.52% postgres postgres [.] AllocSetFreeIndex + 0.52% postgres postgres [.] pq_getbytes + 0.51% postgres postgres [.] RelationPutHeapTuple + 0.49% postgres postgres [.] BufferGetBlockNumber + 0.47% postgres postgres [.] enlargeStringInfo + 0.47% postgres postgres [.] bpchar_input + 0.46% postgres postgres [.] CopyGetData + 0.45% postgres postgres [.] lazy_scan_heap + 0.45% postgres [kernel.kallsyms] [k] radix_tree_lookup_slot + 0.44% pgbench libpq.so.5.9 [.] PQputline + 0.44% postgres postgres [.] CopyLoadRawBuf + 0.42% postgres postgres [.] HeapTupleSatisfiesVacuum + 0.42% postgres [kernel.kallsyms] [k] _spin_lock_irq + 0.42% postgres [kernel.kallsyms] [k] list_del + 0.38% flush-8:16 [xfs] [k] ftrace_raw_init_event_xfs_dir2_leafn_add + 0.37% postgres postgres [.] ExecStoreTuple + 0.36% postgres postgres [.] appendBinaryStringInfo + 0.35% postgres postgres [.] pq_getmessage + 0.35% pgbench libpq.so.5.9 [.] pqPutMsgEnd Press '?' for help on key bindings文本显示例子,看细节挺方便
#perf report -v -n --showcpuutilization -g --stdio首先显示了一些异常,如果你发现少了符号表或者什么的,可以根据提示安装debuginfo,或者重新编译内核或软件。
Looking at the vmlinux_path (6 entries long) dso__load_sym: cannot get elf header. Using /proc/kallsyms for symbols Looking at the vmlinux_path (6 entries long) No kallsyms or vmlinux with build-id 3187a0b0fc53e27c19f9fad3e63f9437402e5548 was found [bianque_driver] with build id 3187a0b0fc53e27c19f9fad3e63f9437402e5548 not found, continuing without symbols Looking at the vmlinux_path (6 entries long) No kallsyms or vmlinux with build-id e6c06734499e665685cd28ac846d8d69d95cce8c was found [missing_slash] with build id e6c06734499e665685cd28ac846d8d69d95cce8c not found, continuing without symbols /lib64/ld-2.12.so was updated, restart the long running apps that use it! /usr/lib64/libsyslog-ng-3.6.so.0.0.0 was updated, restart the long running apps that use it! /disk1/gpdb_20160101/bin/postgres was updated, restart the long running apps that use it! Looking at the vmlinux_path (6 entries long) No kallsyms or vmlinux with build-id 253ed41ca3926c96e4181ac4df30cf3cde868089 was found [pdeath_signal] with build id 253ed41ca3926c96e4181ac4df30cf3cde868089 not found, continuing without symbols no symbols found in /sbin/iscsid, maybe install a debug package? no symbols found in /bin/egrep, maybe install a debug package? no symbols found in /usr/sbin/dmidecode, maybe install a debug package? /opt/aegis/lib64/aegis_monitor.so was updated, restart the long running apps that use it!报告的细节如下
# Events: 52K cycles # # Overhead Samples sys us Command Shared Object # 事件百分比 采样百分比 系统消耗占比 系统消耗u秒 命令 共享对象 # ........ .......... ............... ............................................ ............................................................... # 第一条 事件百分比 采样百分比 系统消耗占比 系统消耗u秒 命令 共享对象 11.17% 0.00% 5.56% 2173 postgres /home/digoal/pgsql9.6/bin/postgres 0x50eb35 B [.] hash_seq_search | --- hash_seq_search (输出call stack trace,需要在编译软件时开启-g开关, -fno-omit-frame-pointer开关) GetRunningTransactionLocks LogStandbySnapshot (有两个分支会调用 LogStandbySnapshot) | |--50.23%-- CreateCheckPoint (分支开销占比) | CheckpointerMain | AuxiliaryProcessMain | StartChildProcess | reaper | __restore_rt | PostmasterMain | startup_hacks | __libc_start_main | --49.77%-- BackgroundWriterMain (分支开销占比) AuxiliaryProcessMain StartChildProcess reaper __restore_rt PostmasterMain startup_hacks __libc_start_main 下面这条全部是系统开销 4.41% 0.00% 4.41% 1808 pgbench /lib64/libc-2.12.so 0x44005 B [.] __GI_vfprintf | --- __GI_vfprintf | |--99.48%-- __vsnprintf | | 后面。。。。。。略一条条打印perf.data内的数据,输出的是perf record收集到的原始数据。
生成热力图、火焰图也需要perf script的输出,从最原始的信息中提取数据,生成svg。
perf-script - Read perf.data (created by perf record) and display trace output 主要的options -G, --hide-call-graph When printing symbols do not display call chain. -v, --verbose be more verbose (show symbol address, etc) -L, --Latency show latency attributes (irqs/preemption disabled, etc) -d, --debug-mode do various checks like samples ordering and lost events -D, --dump-raw-trace dump raw trace in ASCII #perf script -v -G|wc -l 52958 刚好等于前面使用perf report输出的# Events: 52K cycles 详细内容就不列出了,非常多解读前面收集到的perf.data,辅以汇编指令显示。
perf-annotate - Read perf.data (created by perf record) and display annotated code -i, --input= Input file name. (default: perf.data) -v, --verbose Be more verbose. (Show symbol address, etc) -D, --dump-raw-trace Dump raw trace in ASCII. -l, --print-line Print matching source lines (may be slow). -P, --full-paths Don’t shorten the displayed pathnames. --stdio Use the stdio interface. --tui Use the TUI interface Use of --tui requires a tty, if one is not present, as when piping to other commands, the stdio interface is used. This interfaces starts by centering on the line with more samples, TAB/UNTAB cycles through the lines with more samples.例子
Sorted summary for file /home/digoal/pgsql9.6/bin/postgres ---------------------------------------------- 32.43 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1) 16.20 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1) 14.16 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1) 10.68 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1406 10.68 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1) 9.97 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1412 3.76 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1) Percent | Source code & Disassembly of /home/digoal/pgsql9.6/bin/postgres ------------------------------------------------ 50eac0: 466 50eac4: 1 50eac7: 3 50eaca: 1 50eaec: 435 50eaf5: 12 50eafb: 7 50eb00: 20 50eb08: 5 50eb10: 8 50eb14: 1 50eb18: 5 50eb1b: 4 50eb1e: 15 50eb22: 466 50eb26: 2 50eb2e: 2 50eb32: 6 50eb35: 618 50eb38: 707 50eb3c: 164 50eb41: 1415 h->sum: 4363 : : : : Disassembly of section .text: : : 000000000090e9cc <hash_seq_search>: : register_seq_scan(hashp); : } : : void * : hash_seq_search(HASH_SEQ_STATUS *status) : { 0.00 : 90e9cc: 55 push %rbp 0.00 : 90e9cd: 48 89 e5 mov %rsp,%rbp 0.00 : 90e9d0: 48 83 ec 60 sub $0x60,%rsp 0.00 : 90e9d4: 48 89 7d a8 mov %rdi,-0x58(%rbp) : long segment_ndx; : HASHSEGMENT segp; : uint32 curBucket; : HASHELEMENT *curElem; : : if ((curElem = status->curEntry) != NULL) 0.00 : 90e9d8: 48 8b 45 a8 mov -0x58(%rbp),%rax 0.00 : 90e9dc: 48 8b 40 10 mov 0x10(%rax),%rax 0.00 : 90e9e0: 48 89 45 d8 mov %rax,-0x28(%rbp) 0.00 : 90e9e4: 48 83 7d d8 00 cmpq $0x0,-0x28(%rbp) 0.00 : 90e9e9: 74 3a je 90ea25 <hash_seq_search+0x59> : { : /* Continuing scan of curBucket... */ : status->curEntry = curElem->link; 0.00 : 90e9eb: 48 8b 45 d8 mov -0x28(%rbp),%rax 0.00 : 90e9ef: 48 8b 10 mov (%rax),%rdx 0.00 : 90e9f2: 48 8b 45 a8 mov -0x58(%rbp),%rax 0.00 : 90e9f6: 48 89 50 10 mov %rdx,0x10(%rax) : if (status->curEntry == NULL) /* end of this bucket */ 0.00 : 90e9fa: 48 8b 45 a8 mov -0x58(%rbp),%rax 0.00 : 90e9fe: 48 8b 40 10 mov 0x10(%rax),%rax 0.00 : 90ea02: 48 85 c0 test %rax,%rax 0.00 : 90ea05: 75 11 jne 90ea18 <hash_seq_search+0x4c> : ++status->curBucket; 0.00 : 90ea07: 48 8b 45 a8 mov -0x58(%rbp),%rax 0.00 : 90ea0b: 8b 40 08 mov 0x8(%rax),