From: Robert Schöne
Subject: [Libunwind-devel] Question about performance of threaded access in libunwind
Date: Thu, 06 Oct 2016 12:55:52 +0200
Hello,
Could it be that unwinding does not work well with threading?
I am running an Intel dual-core system with Hyperthreading under Ubuntu 16.04,
and I patched tests/Gperf-trace.c so that this part
unw_set_caching_policy (unw_local_addr_space, UNW_CACHE_NONE);
doit ("no cache ");
unw_set_caching_policy (unw_local_addr_space, UNW_CACHE_GLOBAL);
doit ("global cache ");
unw_set_caching_policy (unw_local_addr_space, UNW_CACHE_PER_THREAD);
doit ("per-thread cache");
is executed in parallel, once per thread:
unw_set_caching_policy (unw_local_addr_space, UNW_CACHE_NONE);
#pragma omp parallel
{
doit ("no cache ");
}
unw_set_caching_policy (unw_local_addr_space, UNW_CACHE_GLOBAL);
#pragma omp parallel
{
doit ("global cache ");
}
unw_set_caching_policy (unw_local_addr_space, UNW_CACHE_PER_THREAD);
#pragma omp parallel
{
doit ("per-thread cache");
}
With this modification, the benchmark prints one line per active
thread. The number of threads can be set with the environment variable
OMP_NUM_THREADS. I compile the test with:
gcc Gperf-trace.c -I../include/ -lunwind-x86_64 -lunwind -fopenmp -o Gperf-trace_omp
The original result, which is also what I get with OMP_NUM_THREADS set to 1,
is:
unw_getcontext : cold avg= 50.068 nsec, warm avg= 28.610 nsec
unw_init_local : cold avg= 138.283 nsec, warm avg= 38.147 nsec
no cache        : unw_step : 1st= 3589.648 min= 354.286 avg= 368.953 nsec
global cache    : unw_step : 1st=  523.630 min= 354.286 avg= 365.754 nsec
per-thread cache: unw_step : 1st=  532.542 min= 354.286 avg= 364.794 nsec
For every thread that I add, the reported times increase, independently
of the caching method that is set. 2 threads:
no cache        : unw_step : 1st= 5454.929 min= 582.801 avg= 660.841 nsec
no cache        : unw_step : 1st= 5551.434 min= 393.719 avg= 656.303 nsec
global cache    : unw_step : 1st=  843.295 min= 582.801 avg= 652.083 nsec
global cache    : unw_step : 1st=  761.190 min= 393.719 avg= 648.359 nsec
per-thread cache: unw_step : 1st=  860.956 min= 593.839 avg= 658.977 nsec
per-thread cache: unw_step : 1st=  763.377 min= 402.468 avg= 654.147 nsec
3 threads:
no cache        : unw_step : 1st= 6794.930 min= 501.121 avg= 1096.294 nsec
no cache        : unw_step : 1st= 7426.297 min= 631.368 avg= 1098.641 nsec
no cache        : unw_step : 1st= 6835.395 min= 393.719 avg= 1089.486 nsec
global cache    : unw_step : 1st= 1028.732 min= 702.010 avg= 1077.695 nsec
global cache    : unw_step : 1st= 1046.393 min= 399.572 avg= 1092.139 nsec
global cache    : unw_step : 1st= 1347.393 min= 393.719 avg= 1088.214 nsec
per-thread cache: unw_step : 1st= 2194.334 min= 554.102 avg= 1092.061 nsec
per-thread cache: unw_step : 1st= 1907.349 min= 565.140 avg= 1093.666 nsec
per-thread cache: unw_step : 1st= 1852.665 min= 393.719 avg= 1088.304 nsec
4 threads:
no cache        : unw_step : 1st= 7991.438 min= 788.106 avg= 1282.086 nsec
no cache        : unw_step : 1st= 7962.739 min= 684.350 avg= 1291.780 nsec
no cache        : unw_step : 1st= 8368.934 min= 582.801 avg= 1299.171 nsec
no cache        : unw_step : 1st= 7705.951 min= 448.402 avg= 1289.624 nsec
global cache    : unw_step : 1st= 2055.256 min= 602.669 avg= 1276.634 nsec
global cache    : unw_step : 1st= 1165.602 min= 582.801 avg= 1285.326 nsec
global cache    : unw_step : 1st= 1582.834 min= 509.951 avg= 1297.648 nsec
global cache    : unw_step : 1st= 1054.291 min= 393.719 avg= 1286.160 nsec
per-thread cache: unw_step : 1st= 2203.164 min= 860.956 avg= 1284.144 nsec
per-thread cache: unw_step : 1st= 1249.490 min= 843.295 avg= 1292.000 nsec
per-thread cache: unw_step : 1st= 1119.243 min= 620.330 avg= 1297.951 nsec
per-thread cache: unw_step : 1st= 1559.564 min= 393.719 avg= 1288.791 nsec
Is this intended? We would like to use libunwind to gather stack traces in
a multithreaded application, but with this behavior I expect that this
won't work without a significant performance impact (e.g., when 128
threads are used).
According to perf and strace, a significant amount of time is spent in
the kernel, specifically in sigprocmask.
Robert