qemu-commits
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Qemu-commits] [qemu/qemu] 911a4d: compiler.h: add QEMU_ALIGNED() to enf


From: GitHub
Subject: [Qemu-commits] [qemu/qemu] 911a4d: compiler.h: add QEMU_ALIGNED() to enforce struct a...
Date: Mon, 13 Jun 2016 03:30:13 -0700

  Branch: refs/heads/master
  Home:   https://github.com/qemu/qemu
  Commit: 911a4d2215b05267b16925503218f49d607c6b29
      
https://github.com/qemu/qemu/commit/911a4d2215b05267b16925503218f49d607c6b29
  Author: Emilio G. Cota <address@hidden>
  Date:   2016-06-11 (Sat, 11 Jun 2016)

  Changed paths:
    M include/qemu/compiler.h

  Log Message:
  -----------
  compiler.h: add QEMU_ALIGNED() to enforce struct alignment

Reviewed-by: Sergey Fedorov <address@hidden>
Reviewed-by: Richard Henderson <address@hidden>
Reviewed-by: Alex Bennée <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Message-Id: <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: ccdb3c1fc8865cf9142235b00e3a8fe3246f2a59
      
https://github.com/qemu/qemu/commit/ccdb3c1fc8865cf9142235b00e3a8fe3246f2a59
  Author: Emilio G. Cota <address@hidden>
  Date:   2016-06-11 (Sat, 11 Jun 2016)

  Changed paths:
    M cpus.c
    M include/qemu/seqlock.h

  Log Message:
  -----------
  seqlock: remove optional mutex

This option is unused; besides, it bloats the struct when not needed.
Let's just let writers define their own locks elsewhere.

Reviewed-by: Sergey Fedorov <address@hidden>
Reviewed-by: Alex Bennée <address@hidden>
Reviewed-by: Richard Henderson <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Message-Id: <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 03719e44b6d6d559c506808e5b6c340feff867b8
      
https://github.com/qemu/qemu/commit/03719e44b6d6d559c506808e5b6c340feff867b8
  Author: Emilio G. Cota <address@hidden>
  Date:   2016-06-11 (Sat, 11 Jun 2016)

  Changed paths:
    M cpus.c
    M include/qemu/seqlock.h

  Log Message:
  -----------
  seqlock: rename write_lock/unlock to write_begin/end

It is a more appropriate name, now that the mutex embedded
in the seqlock is gone.

Reviewed-by: Sergey Fedorov <address@hidden>
Reviewed-by: Alex Bennée <address@hidden>
Reviewed-by: Richard Henderson <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Message-Id: <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 462cda505f304c3915e5e30429cee22418f5a6ea
      
https://github.com/qemu/qemu/commit/462cda505f304c3915e5e30429cee22418f5a6ea
  Author: Emilio G. Cota <address@hidden>
  Date:   2016-06-11 (Sat, 11 Jun 2016)

  Changed paths:
    A include/qemu/processor.h

  Log Message:
  -----------
  include/processor.h: define cpu_relax()

Taken from the linux kernel.

Reviewed-by: Sergey Fedorov <address@hidden>
Reviewed-by: Richard Henderson <address@hidden>
Reviewed-by: Alex Bennée <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Message-Id: <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: ac9a9eba1e43c5354905c0c7f8e17852ed396ac2
      
https://github.com/qemu/qemu/commit/ac9a9eba1e43c5354905c0c7f8e17852ed396ac2
  Author: Guillaume Delbergue <address@hidden>
  Date:   2016-06-11 (Sat, 11 Jun 2016)

  Changed paths:
    M include/qemu/thread.h

  Log Message:
  -----------
  qemu-thread: add simple test-and-set spinlock

Reviewed-by: Sergey Fedorov <address@hidden>
Signed-off-by: Guillaume Delbergue <address@hidden>
[Rewritten. - Paolo]
Signed-off-by: Paolo Bonzini <address@hidden>
[Emilio's additions: use TAS instead of atomic_xchg; emit acquire/release
 barriers; return bool from trylock; call cpu_relax() while spinning;
 optimize for uncontended locks by acquiring the lock with TAS instead
 of TATAS; add qemu_spin_locked().]
Signed-off-by: Emilio G. Cota <address@hidden>
Message-Id: <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: dc8b295d05ec35a8c032f9abca421772347ba5d4
      
https://github.com/qemu/qemu/commit/dc8b295d05ec35a8c032f9abca421772347ba5d4
  Author: Emilio G. Cota <address@hidden>
  Date:   2016-06-11 (Sat, 11 Jun 2016)

  Changed paths:
    A include/exec/tb-hash-xx.h

  Log Message:
  -----------
  exec: add tb_hash_func5, derived from xxhash

This will be used by upcoming changes for hashing the tb hash.

Add this into a separate file to include the copyright notice from
xxhash.

Reviewed-by: Sergey Fedorov <address@hidden>
Reviewed-by: Richard Henderson <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Message-Id: <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 42bd32287f3a18d823f2258b813824a39ed7c6d9
      
https://github.com/qemu/qemu/commit/42bd32287f3a18d823f2258b813824a39ed7c6d9
  Author: Emilio G. Cota <address@hidden>
  Date:   2016-06-11 (Sat, 11 Jun 2016)

  Changed paths:
    M cpu-exec.c
    M include/exec/tb-hash.h
    M translate-all.c

  Log Message:
  -----------
  tb hash: hash phys_pc, pc, and flags with xxhash

For some workloads such as arm bootup, tb_phys_hash is performance-critical.
The is due to the high frequency of accesses to the hash table, originated
by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
More info:
  https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html

To dig further into this I modified an arm image booting debian jessie to
immediately shut down after boot. Analysis revealed that quite a bit of time
is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
results in very uneven loading of chains in the hash table's buckets;
the longest observed chain had ~550 elements.

The appended addresses this with two changes:

1) Use xxhash as the hash table's hash function. xxhash is a fast,
   high-quality hashing function.

2) Feed the hashing function with not just tb_phys, but also pc and flags.

This improves performance over using just tb_phys for hashing, since that
resulted in some hash buckets having many TB's, while others getting very few;
with these changes, the longest observed chain on a single hash bucket is
brought down from ~550 to ~40.

Tests show that the other element checked for in tb_find_physical,
cs_base, is always a match when tb_phys+pc+flags are a match,
so hashing cs_base is wasteful. It could be that this is an ARM-only
thing, though. UPDATE:
On Tue, Apr 05, 2016 at 08:41:43 -0700, Richard Henderson wrote:
> The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB
> consisting of only a delay slot).
> It may well still turn out to be reasonable to ignore cs_base for hashing.

BTW, after this change the hash table should not be called "tb_hash_phys"
anymore; this is addressed later in this series.

This change gives consistent bootup time improvements. I tested two
host machines:
- Intel Xeon E5-2690: 11.6% less time
- Intel i7-4790K: 19.2% less time

Increasing the number of hash buckets yields further improvements. However,
using a larger, fixed number of buckets can degrade performance for other
workloads that do not translate as many blocks (600K+ for debian-jessie arm
bootup). This is dealt with later in this series.

Reviewed-by: Sergey Fedorov <address@hidden>
Reviewed-by: Richard Henderson <address@hidden>
Reviewed-by: Alex Bennée <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Message-Id: <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: bf3afd5f419a2054bf03d963bdd223fbb27b72d2
      
https://github.com/qemu/qemu/commit/bf3afd5f419a2054bf03d963bdd223fbb27b72d2
  Author: Emilio G. Cota <address@hidden>
  Date:   2016-06-11 (Sat, 11 Jun 2016)

  Changed paths:
    A include/qemu/qdist.h
    M util/Makefile.objs
    A util/qdist.c

  Log Message:
  -----------
  qdist: add module to represent frequency distributions of data

Sometimes it is useful to have a quick histogram to represent a certain
distribution -- for example, when investigating a performance regression
in a hash table due to inadequate hashing.

The appended allows us to easily represent a distribution using Unicode
characters. Further, the data structure keeping track of the distribution
is so simple that obtaining its values for off-line processing is trivial.

Example, taking the last 10 commits to QEMU:

 Characters in commit title  Count
-----------------------------------
                   39      1
                   48      1
                   53      1
                   54      2
                   57      1
                   61      1
                   67      1
                   78      1
                   80      1
qdist_init(&dist);
qdist_inc(&dist, 39);
[...]
qdist_inc(&dist, 80);

char *str = qdist_pr(&dist, 9, QDIST_PR_LABELS);
// -> [39.0,43.6)▂▂ █▂ ▂ ▄[75.4,80.0]
g_free(str);

char *str = qdist_pr(&dist, 4, QDIST_PR_LABELS);
// -> [39.0,49.2)▁█▁▁[69.8,80.0]
g_free(str);

Reviewed-by: Richard Henderson <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Message-Id: <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: ff9249b7337df52ff36cd8d35da496f6d9d009b1
      
https://github.com/qemu/qemu/commit/ff9249b7337df52ff36cd8d35da496f6d9d009b1
  Author: Emilio G. Cota <address@hidden>
  Date:   2016-06-11 (Sat, 11 Jun 2016)

  Changed paths:
    M tests/.gitignore
    M tests/Makefile.include
    A tests/test-qdist.c

  Log Message:
  -----------
  qdist: add test program

Acked-by: Sergey Fedorov <address@hidden>
Reviewed-by: Richard Henderson <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Message-Id: <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 2e11264aafe476c7a53accde4a23cfc2395a02fd
      
https://github.com/qemu/qemu/commit/2e11264aafe476c7a53accde4a23cfc2395a02fd
  Author: Emilio G. Cota <address@hidden>
  Date:   2016-06-11 (Sat, 11 Jun 2016)

  Changed paths:
    A include/qemu/qht.h
    M util/Makefile.objs
    A util/qht.c

  Log Message:
  -----------
  qht: QEMU's fast, resizable and scalable Hash Table

This is a fast, scalable chained hash table with optional auto-resizing, 
allowing
reads that are concurrent with reads, and reads/writes that are concurrent
with writes to separate buckets.

A hash table with these features will be necessary for the scalability
of the ongoing MTTCG work; before those changes arrive we can already
benefit from the single-threaded speedup that qht also provides.

Signed-off-by: Emilio G. Cota <address@hidden>
Message-Id: <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 1a95404fbd34f4ed66fa2450869fb3b384b0a97b
      
https://github.com/qemu/qemu/commit/1a95404fbd34f4ed66fa2450869fb3b384b0a97b
  Author: Emilio G. Cota <address@hidden>
  Date:   2016-06-11 (Sat, 11 Jun 2016)

  Changed paths:
    M tests/.gitignore
    M tests/Makefile.include
    A tests/test-qht.c

  Log Message:
  -----------
  qht: add test program

Acked-by: Sergey Fedorov <address@hidden>
Reviewed-by: Alex Bennée <address@hidden>
Reviewed-by: Richard Henderson <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Message-Id: <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 515864a0d7a2033163e8e08df20b7873bb6287ae
      
https://github.com/qemu/qemu/commit/515864a0d7a2033163e8e08df20b7873bb6287ae
  Author: Emilio G. Cota <address@hidden>
  Date:   2016-06-11 (Sat, 11 Jun 2016)

  Changed paths:
    M tests/.gitignore
    M tests/Makefile.include
    A tests/qht-bench.c

  Log Message:
  -----------
  qht: add qht-bench, a performance benchmark

This serves as a performance benchmark as well as a stress test
for QHT. We can tweak quite a number of things, including the
number of resize threads and how frequently resizes are triggered.

A performance comparison of QHT vs CLHT[1] and ck_hs[2] using
this same benchmark program can be found here:
  http://imgur.com/a/0Bms4

The tests are run on a 64-core AMD Opteron 6376, pinning threads
to cores favoring same-socket cores. For each run, qht-bench is
invoked with:
  $ tests/qht-bench -d $duration -n $n -u $u -g $range
, where $duration is in seconds, $n is the number of threads,
$u is the update rate (0.0 to 100.0), and $range is the number
of keys.

Note that ck_hs's performance drops significantly as writes go
up, since it requires an external lock (I used a ck_spinlock)
around every write.

Also, note that CLHT instead of using a seqlock, relies on an
allocator that does not ever return the same address during the
same read-critical section. This gives it a slight performance
advantage over QHT on read-heavy workloads, since the seqlock
writes aren't there.

[1] CLHT: https://github.com/LPD-EPFL/CLHT
    https://infoscience.epfl.ch/record/207109/files/ascy_asplos15.pdf

[2] ck_hs: http://concurrencykit.org/
     http://backtrace.io/blog/blog/2015/03/13/workload-specialization/

A few of those plots are shown in text here, since that site
might not be online forever. Throughput is on Mops/s on the Y axis.
                        200K keys, 0 % updates

  450 ++--+------+------+-------+-------+-------+-------+------+-------+--++
      |   +      +      +       +       +       +       +      +      +N+  |
  400 ++                                                           ---+E+ ++
      |                                                       +++----      |
  350 ++          9 ++------+------++                       --+E+    -+H+ ++
      |             |      +H+-     |                 -+N+----   ---- +++  |
  300 ++          8 ++     +E+     ++             -----+E+  --+H+         ++
      |             |      +++      |         -+N+-----+H+--               |
  250 ++          7 ++------+------++  +++-----+E+----                    ++
  200 ++                    1         -+E+-----+H+                        ++
      |                           ----                     qht +-E--+      |
  150 ++                      -+E+                        clht +-H--+     ++
      |                   ----                              ck +-N--+      |
  100 ++               +E+                                                ++
      |            ----                                                    |
   50 ++       -+E+                                                       ++
      |   +E+E+  +      +       +       +       +       +      +       +   |
    0 ++--E------+------+-------+-------+-------+-------+------+-------+--++
    1      8      16      24      32      40      48     56      64
                          Number of threads
                        200K keys, 1 % updates

  350 ++--+------+------+-------+-------+-------+-------+------+-------+--++
      |   +      +      +       +       +       +       +      +     -+E+  |
  300 ++                                                         -----+H+ ++
      |                                                       +E+--        |
      |           9 ++------+------++                  +++----             |
  250 ++            |      +E+   -- |                 -+E+                ++
      |           8 ++         --  ++             ----                     |
  200 ++            |      +++-     |  +++  ---+E+                        ++
      |           7 ++------N------++ -+E+--               qht +-E--+      |
      |                     1  +++----                    clht +-H--+      |
  150 ++                      -+E+                          ck +-N--+     ++
      |                   ----                                             |
  100 ++               +E+                                                ++
      |            ----                                                    |
      |        -+E+                                                        |
   50 ++    +H+-+N+----+N+-----+N+------                                  ++
      |   +E+E+  +      +       +      +N+-----+N+-----+N+----+N+-----+N+  |
    0 ++--E------+------+-------+-------+-------+-------+------+-------+--++
    1      8      16      24      32      40      48     56      64
                          Number of threads
                        200K keys, 20 % updates

  300 ++--+------+------+-------+-------+-------+-------+------+-------+--++
      |   +      +      +       +       +       +       +      +       +   |
      |                                                              -+H+  |
  250 ++                                                         ----     ++
      |           9 ++------+------++                       --+H+  ---+E+  |
      |           8 ++     +H+--   ++                 -+H+----+E+--        |
  200 ++            |      +E+    --|             -----+E+--  +++         ++
      |           7 ++      + ---- ++       ---+H+---- +++ qht +-E--+      |
  150 ++          6 ++------N------++ -+H+-----+E+        clht +-H--+     ++
      |                     1     -----+E+--                ck +-N--+      |
      |                       -+H+----                                     |
  100 ++                  -----+E+                                        ++
      |                +E+--                                               |
      |            ----+++                                                 |
   50 ++       -+E+                                                       ++
      |     +E+ +++                                                        |
      |   +E+N+-+N+-----+       +       +       +       +      +       +   |
    0 ++--E------+------N-------N-------N-------N-------N------N-------N--++
    1      8      16      24      32      40      48     56      64
                          Number of threads
                       200K keys, 100 % updates       qht +-E--+
                                                    clht +-H--+
  160 ++--+------+------+-------+-------+-------+-------+---ck-+-N-----+--++
      |   +      +      +       +       +       +       +      +   ----H   |
  140 ++                                                      +H+--  -+E+ ++
      |                                                +++----   ----      |
  120 ++          8 ++------+------++                 -+H+    +E+         ++
      |           7 ++     +H+---- ++             ---- +++----             |
  100 ++            |      +E+      |  +++  ---+H+    -+E+                ++
      |           6 ++     +++     ++ -+H+--   +++----                     |
   80 ++          5 ++------N----------+E+-----+E+                        ++
      |                     1 -+H+---- +++                                 |
      |                   -----+E+                                         |
   60 ++               +H+---- +++                                        ++
      |            ----+E+                                                 |
   40 ++        +H+----                                                   ++
      |       --+E+                                                        |
   20 ++    +E+                                                           ++
      |  +EE+    +      +       +       +       +       +      +       +   |
    0 ++--+N-N---N------N-------N-------N-------N-------N------N-------N--++
    1      8      16      24      32      40      48     56      64
                          Number of threads

Signed-off-by: Emilio G. Cota <address@hidden>
Message-Id: <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 896a9ee96720c5cc7e00e63943435ccccec54d23
      
https://github.com/qemu/qemu/commit/896a9ee96720c5cc7e00e63943435ccccec54d23
  Author: Emilio G. Cota <address@hidden>
  Date:   2016-06-11 (Sat, 11 Jun 2016)

  Changed paths:
    M tests/.gitignore
    M tests/Makefile.include
    A tests/test-qht-par.c

  Log Message:
  -----------
  qht: add test-qht-par to invoke qht-bench from 'check' target

Signed-off-by: Emilio G. Cota <address@hidden>
Message-Id: <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 909eaac9bbc2ed4f3a82ce38e905b87d478a3e00
      
https://github.com/qemu/qemu/commit/909eaac9bbc2ed4f3a82ce38e905b87d478a3e00
  Author: Emilio G. Cota <address@hidden>
  Date:   2016-06-11 (Sat, 11 Jun 2016)

  Changed paths:
    M cpu-exec.c
    M include/exec/exec-all.h
    M include/exec/tb-context.h
    M include/exec/tb-hash.h
    M translate-all.c

  Log Message:
  -----------
  tb hash: track translated blocks with qht

Having a fixed-size hash table for keeping track of all translation blocks
is suboptimal: some workloads are just too big or too small to get maximum
performance from the hash table. The MRU promotion policy helps improve
performance when the hash table is a little undersized, but it cannot
make up for severely undersized hash tables.

Furthermore, frequent MRU promotions result in writes that are a scalability
bottleneck. For scalability, lookups should only perform reads, not writes.
This is not a big deal for now, but it will become one once MTTCG matures.

The appended fixes these issues by using qht as the implementation of
the TB hash table. This solution is superior to other alternatives considered,
namely:

- master: implementation in QEMU before this patchset
- xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
- xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
        MRU is implemented here by adding an intermediate struct
        that contains the u32 hash and a pointer to the TB; this
        allows us, on an MRU promotion, to copy said struct (that is not
        at the head), and put this new copy at the head. After a grace
        period, the original non-head struct can be eliminated, and
        after another grace period, freed.
- qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
             no MRU for lookups; MRU for inserts.
The appended solution is the following:
- qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
           no MRU for lookups; MRU for inserts.

The plots below compare the considered solutions. The Y axis shows the
boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
sweeps the number of buckets (or initial number of buckets for qht-autoresize).
The plots in PNG format (and with errorbars) can be seen here:
  http://imgur.com/a/Awgnq

Each test runs 5 times, and the entire QEMU process is pinned to a
single core for repeatability of results.
                       Host: Intel Xeon E5-2690

  28 ++------------+-------------+-------------+-------------+------------++
     A*****        +             +             +             master **A*** +
  27 ++    *                                                 xxhash ##B###++
     |      A******A******                               xxhash-rcu $$C$$$ |
  26 C$$                  A******A******            qht-fixed-nomru*%%D%%%++
     D%%$$                              A******A******A*qht-dyn-mru A*E****A
  25 ++ %%$$                                          qht-dyn-nomru &&F&&&++
     B#####%                                                               |
  24 ++    #C$$$$$                                                        ++
     |      B###  $                                                        |
     |          ## C$$$$$$                                                 |
  23 ++           #       C$$$$$$                                         ++
     |             B######       C$$$$$$                                %%%D
  22 ++                  %B######       C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
     |                    D%%%%%%B######      @E@@@@@@    %%%D%%%@@@E@@@@@@E
  21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
     +             E@@@   F&&&   +      E@     +      F&&&   +             +
  20 ++------------+-------------+-------------+-------------+------------++
     14            16            18            20            22            24
                       log2 number of buckets
                            Host: Intel i7-4790K

  14.5 ++------------+------------+-------------+------------+------------++
       A**           +            +             +            master **A*** +
    14 ++ **                                                 xxhash ##B###++
  13.5 ++   **                                           xxhash-rcu $$C$$$++
       |                                            qht-fixed-nomru %%D%%% |
    13 ++     A******                                   qht-dyn-mru @@E@@@++
       |             A*****A******A******             qht-dyn-nomru &&F&&& |
  12.5 C$$                               A******A******A*****A******    ***A
    12 ++ $$                                                        A***  ++
       D%%% $$                                                             |
  11.5 ++  %%                                                             ++
       B###  %C$$$$$$                                                      |
    11 ++  ## D%%%%% C$$$$$                                               ++
       |     #      %      C$$$$$$                                         |
  10.5 F&&&&&&B######D%%%%%       C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$    $$$C
    10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
       +             F&&          D%%%%%%B######B######B#####B###@@@D%%%   +
   9.5 ++------------+------------+-------------+------------+------------++
       14            16           18            20           22            24
                        log2 number of buckets

Note that the original point before this patch series is X=15 for "master";
the little sensitivity to the increased number of buckets is due to the
poor hashing function in master.

xxhash-rcu has significant overhead due to the constant churn of allocating
and deallocating intermediate structs for implementing MRU. An alternative
would be do consider failed lookups as "maybe not there", and then
acquire the external lock (tb_lock in this case) to really confirm that
there was indeed a failed lookup. This, however, would not be enough
to implement dynamic resizing--this is more complex: see
"Resizable, Scalable, Concurrent Hash Tables via Relativistic
Programming" by Triplett, McKenney and Walpole. This solution was
discarded due to the very coarse RCU read critical sections that we have
in MTTCG; resizing requires waiting for readers after every pointer update,
and resizes require many pointer updates, so this would quickly become
prohibitive.

qht-fixed-nomru shows that MRU promotion is advisable for undersized
hash tables.

However, qht-dyn-mru shows that MRU promotion is not important if the
hash table is properly sized: there is virtually no difference in
performance between qht-dyn-nomru and qht-dyn-mru.

Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
can achieve with optimum sizing of the hash table, while keeping the hash
table scalable for readers.

The improvement we get before and after this patch for booting debian jessie
with arm-softmmu is:

- Intel Xeon E5-2690: 10.5% less time
- Intel i7-4790K: 5.2% less time

We could get this same improvement _for this particular workload_ by
statically increasing the size of the hash table. But this would hurt
workloads that do not need a large hash table. The dynamic (upward)
resizing allows us to start small and enlarge the hash table as needed.

A quick note on downsizing: the table is resized back to 2**15 buckets
on every tb_flush; this makes sense because it is not guaranteed that the
table will reach the same number of TBs later on (e.g. most bootup code is
thrown away after boot); it makes sense to grow the hash table as
more code blocks are translated. This also avoids the complication of
having to build downsizing hysteresis logic into qht.

Reviewed-by: Sergey Fedorov <address@hidden>
Reviewed-by: Alex Bennée <address@hidden>
Reviewed-by: Richard Henderson <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Message-Id: <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: 329844d4bc3d5a11f1e63938d66f74c9584c7abc
      
https://github.com/qemu/qemu/commit/329844d4bc3d5a11f1e63938d66f74c9584c7abc
  Author: Emilio G. Cota <address@hidden>
  Date:   2016-06-11 (Sat, 11 Jun 2016)

  Changed paths:
    M translate-all.c

  Log Message:
  -----------
  translate-all: add tb hash bucket info to 'info jit' dump

Examples:

- Good hashing, i.e. tb_hash_func5(phys_pc, pc, flags):
TB count            715135/2684354
[...]
TB hash buckets     388775/524288 (74.15% head buckets used)
TB hash occupancy   33.04% avg chain occ. Histogram: [0,10)%|▆ █  
▅▁▃▁▁|[90,100]%
TB hash avg chain   1.017 buckets. Histogram: 1|█▁▁|3

- Not-so-good hashing, i.e. tb_hash_func5(phys_pc, pc, 0):
TB count            712636/2684354
[...]
TB hash buckets     344924/524288 (65.79% head buckets used)
TB hash occupancy   31.64% avg chain occ. Histogram: [0,10)%|█ ▆  
▅▁▃▁▂|[90,100]%
TB hash avg chain   1.047 buckets. Histogram: 1|█▁▁▁|4

- Bad hashing, i.e. tb_hash_func5(phys_pc, 0, 0):
TB count            702818/2684354
[...]
TB hash buckets     112741/524288 (21.50% head buckets used)
TB hash occupancy   10.15% avg chain occ. Histogram: [0,10)%|█ ▁  
▁▁▁▁▁|[90,100]%
TB hash avg chain   2.107 buckets. Histogram: [1.0,10.2)|█▁▁▁▁▁▁▁▁▁|[83.8,93.0]

- Good hashing, but no auto-resize:
TB count            715634/2684354
TB hash buckets     8192/8192 (100.00% head buckets used)
TB hash occupancy   98.30% avg chain occ. Histogram: 
[95.3,95.8)%|▁▁▃▄▃▄▁▇▁█|[99.5,100.0]%
TB hash avg chain   22.070 buckets. Histogram: 
[15.0,16.7)|▁▂▅▄█▅▁▁▁▁|[30.3,32.0]

Acked-by: Sergey Fedorov <address@hidden>
Suggested-by: Richard Henderson <address@hidden>
Reviewed-by: Richard Henderson <address@hidden>
Signed-off-by: Emilio G. Cota <address@hidden>
Message-Id: <address@hidden>
Signed-off-by: Richard Henderson <address@hidden>


  Commit: da2fdd0bd1514a44309dd5be162ebfb6c166a716
      
https://github.com/qemu/qemu/commit/da2fdd0bd1514a44309dd5be162ebfb6c166a716
  Author: Peter Maydell <address@hidden>
  Date:   2016-06-13 (Mon, 13 Jun 2016)

  Changed paths:
    M cpu-exec.c
    M cpus.c
    M include/exec/exec-all.h
    M include/exec/tb-context.h
    A include/exec/tb-hash-xx.h
    M include/exec/tb-hash.h
    M include/qemu/compiler.h
    A include/qemu/processor.h
    A include/qemu/qdist.h
    A include/qemu/qht.h
    M include/qemu/seqlock.h
    M include/qemu/thread.h
    M tests/.gitignore
    M tests/Makefile.include
    A tests/qht-bench.c
    A tests/test-qdist.c
    A tests/test-qht-par.c
    A tests/test-qht.c
    M translate-all.c
    M util/Makefile.objs
    A util/qdist.c
    A util/qht.c

  Log Message:
  -----------
  Merge remote-tracking branch 'remotes/rth/tags/pull-tcg-20160611' into staging

TB hashing improvements

# gpg: Signature made Sun 12 Jun 2016 01:12:50 BST
# gpg:                using RSA key 0xAD1270CC4DD0279B
# gpg: Good signature from "Richard Henderson <address@hidden>"
# gpg:                 aka "Richard Henderson <address@hidden>"
# gpg:                 aka "Richard Henderson <address@hidden>"
# Primary key fingerprint: 9CB1 8DDA F8E8 49AD 2AFC  16A4 AD12 70CC 4DD0 279B

* remotes/rth/tags/pull-tcg-20160611:
  translate-all: add tb hash bucket info to 'info jit' dump
  tb hash: track translated blocks with qht
  qht: add test-qht-par to invoke qht-bench from 'check' target
  qht: add qht-bench, a performance benchmark
  qht: add test program
  qht: QEMU's fast, resizable and scalable Hash Table
  qdist: add test program
  qdist: add module to represent frequency distributions of data
  tb hash: hash phys_pc, pc, and flags with xxhash
  exec: add tb_hash_func5, derived from xxhash
  qemu-thread: add simple test-and-set spinlock
  include/processor.h: define cpu_relax()
  seqlock: rename write_lock/unlock to write_begin/end
  seqlock: remove optional mutex
  compiler.h: add QEMU_ALIGNED() to enforce struct alignment

Signed-off-by: Peter Maydell <address@hidden>


Compare: https://github.com/qemu/qemu/compare/a93c1bdf0bd4...da2fdd0bd151

reply via email to

[Prev in Thread] Current Thread [Next in Thread]