Perf
Perf 抓函数调用热点 / perf record/ perf report
perf record -g -o perf.data --pid <pid> -- sleep <要 profile 的时间(s)>
perf report -i perf.data
输出示例在尾部。输出解读:
- 我们可以看到前两列分别是 Children 和 Self:
- Self: 表示该函数自身的指令所消耗的 CPU 时间百分比(不包括它调用的子函数)。
- Children: 表示该函数所调用的所有子函数消耗的 CPU 时间百分比(不包括它自己)。这两者合起来,就构成了该函数整个调用栈的总开销。
- 默认是按 Children 从高到低排的;可以
perf report --no-children从高到低排。 - 按 E/e 键可以展开,可以看到函数之间调用链,更加直观,按 C 退回原来模式。
- 按方向键来移动。Enter 选中,q 退出。h help。Space 往下翻页,但是怎么翻回去还不清楚。
- 默认包含了 kernel/userspace 的函数,按 k 可以只看 kernel 函数调用。
Samples: 40K of event 'cycles:P', Event count (approx.): 30780586616
Children Self Command Shared Object Symbol
+ 93.63% 0.00% python3 [unknown] [k] 0x0000000000000001
+ 91.33% 0.00% python3 libc10_cuda.so [.] c10::cuda::device_synchronize()
+ 91.33% 0.00% python3 libcudart.so.12 [.] cudaDeviceSynchronize
+ 91.33% 0.00% python3 libcudart.so.12 [.] 0x0000796c49e11a9d
+ 91.33% 0.00% python3 libcuda.so.575.57.08 [.] cuCtxSynchronize
+ 91.33% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d783bff
+ 91.33% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d6b4616
+ 91.33% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d6a1e21
+ 89.81% 0.00% python3 [unknown] [k] 0x0000000000000e00
+ 89.76% 0.00% python3 [unknown] [k] 0x0000000600000000
+ 68.62% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d68042c
+ 48.44% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7e347dc8
+ 47.66% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7e347998
+ 45.15% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d6b52eb
+ 29.17% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d674493
+ 14.06% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d68073c
+ 10.81% 0.67% python3 libc.so.6 [.] clock_gettime@@GLIBC_2.17
+ 8.41% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7e347a8e
+ 6.67% 0.00% python3 [vdso] [.] 0x00007ffcad160b03
+ 6.67% 6.67% python3 [vdso] [.] 0x0000000000000b00
+ 5.65% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d8bcc90
+ 5.64% 5.64% python3 libcuda.so.575.57.08 [.] 0x000000000027448d
+ 4.30% 0.00% python3 libcublasLt.so.12 [.] 0x0000796b930563fb
+ 3.96% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d680421
+ 3.95% 3.95% python3 libcuda.so.575.57.08 [.] 0x000000000028041f
+ 3.94% 0.00% python3 libcublasLt.so.12 [.] 0x0000796b92f96bf1
+ 3.87% 0.00% python3 libcublasLt.so.12 [.] 0x0000796b9397fda3
+ 3.86% 0.00% python3 libcublasLt.so.12 [.] 0x0000796b960fd49b
+ 3.78% 0.00% python3 libcublasLt.so.12 [.] 0x0000796b9609a7ce
+ 3.78% 0.00% python3 libcuda.so.575.57.08 [.] cuLaunchKernel
+ 3.77% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d7f2417
+ 3.75% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d634b06
+ 3.70% 3.70% python3 libcuda.so.575.57.08 [.] 0x00000000004bcca1
+ 3.70% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d8bcca3
+ 3.70% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d8bccc4
+ 3.69% 3.69% python3 libcuda.so.575.57.08 [.] 0x00000000004bccbf
+ 3.68% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d634250
+ 3.65% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d6289a5
+ 3.57% 3.57% python3 libcuda.so.575.57.08 [.] 0x00000000004bcdc4
+ 3.57% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d8bcdc6
+ 3.54% 3.54% python3 libcuda.so.575.57.08 [.] 0x000000000027447a
+ 3.54% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d67447e
+ 3.20% 0.00% python3 libtorch_cuda.so [.] void at::cuda::blas::gemm_inter
+ 3.11% 3.11% python3 libcuda.so.575.57.08 [.] 0x00000000004bcc96
+ 3.11% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d8bcc98
+ 3.05% 0.01% python3 libcublas.so.12 [.] cublasGemmEx
+ 3.03% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d8bcdb0
+ 3.02% 0.00% python3 libcublas.so.12 [.] 0x0000796bb4526610
+ 3.01% 0.00% python3 libcublas.so.12 [.] 0x0000796bb4526184
+ 3.00% 0.00% python3 libcublas.so.12 [.] 0x0000796bb4285b7e
+ 2.97% 0.00% python3 libcublas.so.12 [.] 0x0000796bb42b13a6
+ 2.87% 0.00% python3 [unknown] [.] 0xec8348fb89485355
+ 2.87% 0.00% python3 libc10.so [.] c10::TensorImpl::~TensorImpl()
+ 2.82% 2.82% python3 libcuda.so.575.57.08 [.] 0x00000000004bcdd0
+ 2.80% 2.80% python3 libcuda.so.575.57.08 [.] 0x00000000004bccab
+ 2.80% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d8bccb4
+ 2.75% 0.00% python3 libcublasLt.so.12 [.] 0x0000796b93aca4ed
+ 2.60% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7e36aebd
+ 2.58% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d674470
+ 2.58% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7e34216f
+ 2.55% 0.00% python3 libcublas.so.12 [.] 0x0000796bb42af24d
+ 2.55% 0.01% python3 libcublasLt.so.12 [.] cublasLtTSTMatmul
+ 2.54% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d911e40
+ 2.52% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7e347d14
+ 2.52% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d676aeb
+ 2.43% 2.43% python3 libcuda.so.575.57.08 [.] 0x00000000004bcdcb
+ 2.43% 0.00% python3 libcuda.so.575.57.08 [.] 0x0000796b7d8bcdcd
+ 2.37% 0.00% python3 [unknown] [.] 0x0000000100000001
+ 2.34% 2.34% python3 libcuda.so.575.57.08 [.] 0x00000000004bccd5
+ 2.30% 0.00% python3 [unknown] [.] 0x00007969d19d5000
+ 2.23% 2.23% python3 libcuda.so.575.57.08 [.] 0x00000000002744aa
+ 2.22% 0.00% python3 libtorch_cuda.so [.] c10::impl::wrap_kernel_functor_
+ 2.20% 0.03% python3 libtorch_cuda.so [.] at::native::(anonymous namespac
+ 2.20% 0.01% python3 libtorch_cuda.so [.] at::(anonymous namespace)::wrap
按 E 展开后的(可以按小 e 来切换简洁模式,更加直观):
Samples: 40K of event 'cycles:P', Event count (approx.): 30780586616
Symbol
- [k] 0x0000000000000001 ◆
- 0x1 ▒
- 91.33% c10::cuda::device_synchronize() ▒
cudaDeviceSynchronize ▒
0x796c49e11a9d ▒
cuCtxSynchronize ▒
0x796b7d783bff ▒
0x796b7d6b4616 ▒
- 0x796b7d6a1e21 ▒
- 68.62% 0x796b7d68042c ▒
- 48.44% 0x796b7e347dc8 ▒
- 46.97% 0x796b7e347998 ▒
- 44.47% 0x796b7d6b52eb ▒
- 28.74% 0x796b7d674493 ▒
3.64% 0x796b7d8bccc4 ▒
3.64% 0x796b7d8bcca3 ▒
3.52% 0x796b7d8bcdc6 ▒
3.08% 0x796b7d8bcc98 ▒
2.99% 0x796b7d8bcdb0 ▒
2.78% 0x796b7d8bccb4 ▒
2.40% 0x796b7d8bcdcd ▒
0.80% 0x796b7d8bcd11 ▒
0.78% 0x796b7d8bcd31 ▒
0.64% 0x796b7d8bccf8 ▒
0.62% 0x796b7d8bcd20 ▒
0.57% 0x796b7d8bcd04 ▒
5.58% 0x796b7d8bcc90 ▒
3.49% 0x796b7d67447e ▒
2.53% 0x796b7d674470 ▒
1.10% 0x796b7d674460 ▒
- 8.41% 0x796b7e347a8e ▒
1.31% 0x796b7d842cab ▒
0.54% 0x796b7d842ccc ▒
0.51% 0x796b7d842cae ▒
2.52% 0x796b7e347d14 ▒
0.64% 0x796b7e347a12 ▒
0.56% 0x796b7e347acf ▒
0.55% 0x796b7e347a6e ▒
0.52% 0x796b7e347a7b ▒
- 14.06% 0x796b7d68073c ▒
- 10.41% clock_gettime@@GLIBC_2.17 ▒
6.67% 0x7ffcad160b03 ▒
1.64% 0x796b7d75d641 ▒
3.96% 0x796b7d680421 ▒
1.63% 0x796b7d680715 ▒
1.25% 0x796b7e3479e0 ▒
- 2.22% c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunct▒
- 2.20% at::(anonymous namespace)::wrapper_CUDA_addmm(at::Tensor const&, at::Tensor c▒
- 2.17% at::native::(anonymous namespace)::addmm_out_cuda_impl(at::Tensor&, at::Te▒
- 2.12% void at::cuda::blas::gemm_and_bias<c10::BFloat16>(bool, bool, long, lon▒
- 1.94% cublasLtMatmul ▒
- 1.90% 0x796b930563fb ▒
- 1.83% 0x796b92f96bf1 ▒
- 1.81% 0x796b9397fda3 ▒
- 1.81% 0x796b93aca4ed ▒
- 0x796b960fd49b ▒
- 1.79% 0x796b9609a7ce ▒
- cuLaunchKernel ▒
- 1.78% 0x796b7d7f2417 ▒
- 1.78% 0x796b7d634b06 ▒
- 1.76% 0x796b7d634250 ▒
- 1.75% 0x796b7d6289a5 ▒
- 1.48% 0x796b7e36aebd ▒
- 1.48% 0x796b7e34216f ▒
- 1.47% 0x796b7d911e40 ▒
- 1.46% 0x796b7d676aeb ▒
- 0.99% __sched_yield ▒
- 0.83% entry_SYSCALL_64_▒
- 0.82% do_syscall_64 ▒
- 0.71% x64_sys_cal▒
- __x64_sys_sche▒
- 0.69% do_sc▒
- 0.62% sc▒
0.59%▒
Perf 抓火焰图
py-spy 抓取 Python 进程火焰图
py-spy 本身安装很简单,只有 2 点几 M,而且也没有什么其他依赖的包。
# 万金油命令
py-spy record --native -o profile.svg -- python3 <your program>
🗞️ Recent Posts