Perf 抓函数调用热点 / perf record/ perf report

perf record -g -o perf.data --pid <pid> -- sleep <要 profile 的时间(s)>
perf report -i perf.data

输出示例在尾部。输出解读:

  • 我们可以看到前两列分别是 Children 和 Self:
    • Self: 表示该函数自身的指令所消耗的 CPU 时间百分比(不包括它调用的子函数)。
    • Children: 表示该函数所调用的所有子函数消耗的 CPU 时间百分比(不包括它自己)。这两者合起来,就构成了该函数整个调用栈的总开销。
  • 默认是按 Children 从高到低排的;可以 perf report --no-children 从高到低排。
  • 按 E/e 键可以展开,可以看到函数之间调用链,更加直观,按 C 退回原来模式。
  • 按方向键来移动。Enter 选中,q 退出。h help。Space 往下翻页,但是怎么翻回去还不清楚。
  • 默认包含了 kernel/userspace 的函数,按 k 可以只看 kernel 函数调用。
Samples: 40K of event 'cycles:P', Event count (approx.): 30780586616
  Children      Self  Command         Shared Object         Symbol
+   93.63%     0.00%  python3         [unknown]             [k] 0x0000000000000001
+   91.33%     0.00%  python3         libc10_cuda.so        [.] c10::cuda::device_synchronize()
+   91.33%     0.00%  python3         libcudart.so.12       [.] cudaDeviceSynchronize
+   91.33%     0.00%  python3         libcudart.so.12       [.] 0x0000796c49e11a9d
+   91.33%     0.00%  python3         libcuda.so.575.57.08  [.] cuCtxSynchronize
+   91.33%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d783bff
+   91.33%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d6b4616
+   91.33%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d6a1e21
+   89.81%     0.00%  python3         [unknown]             [k] 0x0000000000000e00
+   89.76%     0.00%  python3         [unknown]             [k] 0x0000000600000000
+   68.62%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d68042c
+   48.44%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7e347dc8
+   47.66%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7e347998
+   45.15%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d6b52eb
+   29.17%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d674493
+   14.06%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d68073c
+   10.81%     0.67%  python3         libc.so.6             [.] clock_gettime@@GLIBC_2.17
+    8.41%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7e347a8e
+    6.67%     0.00%  python3         [vdso]                [.] 0x00007ffcad160b03
+    6.67%     6.67%  python3         [vdso]                [.] 0x0000000000000b00
+    5.65%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d8bcc90
+    5.64%     5.64%  python3         libcuda.so.575.57.08  [.] 0x000000000027448d
+    4.30%     0.00%  python3         libcublasLt.so.12     [.] 0x0000796b930563fb
+    3.96%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d680421
+    3.95%     3.95%  python3         libcuda.so.575.57.08  [.] 0x000000000028041f
+    3.94%     0.00%  python3         libcublasLt.so.12     [.] 0x0000796b92f96bf1
+    3.87%     0.00%  python3         libcublasLt.so.12     [.] 0x0000796b9397fda3
+    3.86%     0.00%  python3         libcublasLt.so.12     [.] 0x0000796b960fd49b
+    3.78%     0.00%  python3         libcublasLt.so.12     [.] 0x0000796b9609a7ce
+    3.78%     0.00%  python3         libcuda.so.575.57.08  [.] cuLaunchKernel
+    3.77%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d7f2417
+    3.75%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d634b06
+    3.70%     3.70%  python3         libcuda.so.575.57.08  [.] 0x00000000004bcca1
+    3.70%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d8bcca3
+    3.70%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d8bccc4
+    3.69%     3.69%  python3         libcuda.so.575.57.08  [.] 0x00000000004bccbf
+    3.68%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d634250
+    3.65%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d6289a5
+    3.57%     3.57%  python3         libcuda.so.575.57.08  [.] 0x00000000004bcdc4
+    3.57%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d8bcdc6
+    3.54%     3.54%  python3         libcuda.so.575.57.08  [.] 0x000000000027447a
+    3.54%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d67447e
+    3.20%     0.00%  python3         libtorch_cuda.so      [.] void at::cuda::blas::gemm_inter
+    3.11%     3.11%  python3         libcuda.so.575.57.08  [.] 0x00000000004bcc96
+    3.11%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d8bcc98
+    3.05%     0.01%  python3         libcublas.so.12       [.] cublasGemmEx
+    3.03%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d8bcdb0
+    3.02%     0.00%  python3         libcublas.so.12       [.] 0x0000796bb4526610
+    3.01%     0.00%  python3         libcublas.so.12       [.] 0x0000796bb4526184
+    3.00%     0.00%  python3         libcublas.so.12       [.] 0x0000796bb4285b7e
+    2.97%     0.00%  python3         libcublas.so.12       [.] 0x0000796bb42b13a6
+    2.87%     0.00%  python3         [unknown]             [.] 0xec8348fb89485355
+    2.87%     0.00%  python3         libc10.so             [.] c10::TensorImpl::~TensorImpl()
+    2.82%     2.82%  python3         libcuda.so.575.57.08  [.] 0x00000000004bcdd0
+    2.80%     2.80%  python3         libcuda.so.575.57.08  [.] 0x00000000004bccab
+    2.80%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d8bccb4
+    2.75%     0.00%  python3         libcublasLt.so.12     [.] 0x0000796b93aca4ed
+    2.60%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7e36aebd
+    2.58%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d674470
+    2.58%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7e34216f
+    2.55%     0.00%  python3         libcublas.so.12       [.] 0x0000796bb42af24d
+    2.55%     0.01%  python3         libcublasLt.so.12     [.] cublasLtTSTMatmul
+    2.54%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d911e40
+    2.52%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7e347d14
+    2.52%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d676aeb
+    2.43%     2.43%  python3         libcuda.so.575.57.08  [.] 0x00000000004bcdcb
+    2.43%     0.00%  python3         libcuda.so.575.57.08  [.] 0x0000796b7d8bcdcd
+    2.37%     0.00%  python3         [unknown]             [.] 0x0000000100000001
+    2.34%     2.34%  python3         libcuda.so.575.57.08  [.] 0x00000000004bccd5
+    2.30%     0.00%  python3         [unknown]             [.] 0x00007969d19d5000
+    2.23%     2.23%  python3         libcuda.so.575.57.08  [.] 0x00000000002744aa
+    2.22%     0.00%  python3         libtorch_cuda.so      [.] c10::impl::wrap_kernel_functor_
+    2.20%     0.03%  python3         libtorch_cuda.so      [.] at::native::(anonymous namespac
+    2.20%     0.01%  python3         libtorch_cuda.so      [.] at::(anonymous namespace)::wrap

按 E 展开后的(可以按小 e 来切换简洁模式,更加直观):

Samples: 40K of event 'cycles:P', Event count (approx.): 30780586616
  Symbol
- [k] 0x0000000000000001                                                                      ◆
   - 0x1                                                                                      ▒
      - 91.33% c10::cuda::device_synchronize()                                                ▒
           cudaDeviceSynchronize                                                              ▒
           0x796c49e11a9d                                                                     ▒
           cuCtxSynchronize                                                                   ▒
           0x796b7d783bff                                                                     ▒
           0x796b7d6b4616                                                                     ▒
         - 0x796b7d6a1e21                                                                     ▒
            - 68.62% 0x796b7d68042c                                                           ▒
               - 48.44% 0x796b7e347dc8                                                        ▒
                  - 46.97% 0x796b7e347998                                                     ▒
                     - 44.47% 0x796b7d6b52eb                                                  ▒
                        - 28.74% 0x796b7d674493                                               ▒
                             3.64% 0x796b7d8bccc4                                             ▒
                             3.64% 0x796b7d8bcca3                                             ▒
                             3.52% 0x796b7d8bcdc6                                             ▒
                             3.08% 0x796b7d8bcc98                                             ▒
                             2.99% 0x796b7d8bcdb0                                             ▒
                             2.78% 0x796b7d8bccb4                                             ▒
                             2.40% 0x796b7d8bcdcd                                             ▒
                             0.80% 0x796b7d8bcd11                                             ▒
                             0.78% 0x796b7d8bcd31                                             ▒
                             0.64% 0x796b7d8bccf8                                             ▒
                             0.62% 0x796b7d8bcd20                                             ▒
                             0.57% 0x796b7d8bcd04                                             ▒
                          5.58% 0x796b7d8bcc90                                                ▒
                          3.49% 0x796b7d67447e                                                ▒
                          2.53% 0x796b7d674470                                                ▒
                          1.10% 0x796b7d674460                                                ▒
               - 8.41% 0x796b7e347a8e                                                         ▒
                    1.31% 0x796b7d842cab                                                      ▒
                    0.54% 0x796b7d842ccc                                                      ▒
                    0.51% 0x796b7d842cae                                                      ▒
                 2.52% 0x796b7e347d14                                                         ▒
                 0.64% 0x796b7e347a12                                                         ▒
                 0.56% 0x796b7e347acf                                                         ▒
                 0.55% 0x796b7e347a6e                                                         ▒
                 0.52% 0x796b7e347a7b                                                         ▒
            - 14.06% 0x796b7d68073c                                                           ▒
               - 10.41% clock_gettime@@GLIBC_2.17                                             ▒
                    6.67% 0x7ffcad160b03                                                      ▒
                 1.64% 0x796b7d75d641                                                         ▒
              3.96% 0x796b7d680421                                                            ▒
              1.63% 0x796b7d680715                                                            ▒
              1.25% 0x796b7e3479e0                                                            ▒
      - 2.22% c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunct▒
         - 2.20% at::(anonymous namespace)::wrapper_CUDA_addmm(at::Tensor const&, at::Tensor c▒
            - 2.17% at::native::(anonymous namespace)::addmm_out_cuda_impl(at::Tensor&, at::Te▒
               - 2.12% void at::cuda::blas::gemm_and_bias<c10::BFloat16>(bool, bool, long, lon▒
                  - 1.94% cublasLtMatmul                                                      ▒
                     - 1.90% 0x796b930563fb                                                   ▒
                        - 1.83% 0x796b92f96bf1                                                ▒
                           - 1.81% 0x796b9397fda3                                             ▒
                              - 1.81% 0x796b93aca4ed                                          ▒
                                 - 0x796b960fd49b                                             ▒
                                    - 1.79% 0x796b9609a7ce                                    ▒
                                       - cuLaunchKernel                                       ▒
                                          - 1.78% 0x796b7d7f2417                              ▒
                                             - 1.78% 0x796b7d634b06                           ▒
                                                - 1.76% 0x796b7d634250                        ▒
                                                   - 1.75% 0x796b7d6289a5                     ▒
                                                      - 1.48% 0x796b7e36aebd                  ▒
                                                         - 1.48% 0x796b7e34216f               ▒
                                                            - 1.47% 0x796b7d911e40            ▒
                                                               - 1.46% 0x796b7d676aeb         ▒
                                                                  - 0.99% __sched_yield       ▒
                                                                     - 0.83% entry_SYSCALL_64_▒
                                                                        - 0.82% do_syscall_64 ▒
                                                                           - 0.71% x64_sys_cal▒
                                                                              - __x64_sys_sche▒
                                                                                 - 0.69% do_sc▒
                                                                                    - 0.62% sc▒
                                                                                         0.59%▒

Perf 抓火焰图

py-spy 抓取 Python 进程火焰图

py-spy 本身安装很简单,只有 2 点几 M,而且也没有什么其他依赖的包。

# 万金油命令
py-spy record --native -o profile.svg -- python3 <your program>