
Re: [INFO] Some preliminary performance data


From: Ahmed Karaman
Subject: Re: [INFO] Some preliminary performance data
Date: Sun, 3 May 2020 08:47:56 +0200

Thanks, Mr. Aleksandar, for the introduction.
I'm really looking forward to working with the QEMU developer community this summer.
Wishing all of you health and safety.


On Sun, May 3, 2020, 1:25 AM Aleksandar Markovic <address@hidden> wrote:
[correcting some email addresses]

On Sun, 3 May 2020 at 01:20, Aleksandar Markovic <address@hidden> wrote:
Hi, all.

I just want to share with you some bits and pieces of data that I got while doing some preliminary experimentation for the GSoC project "TCG Continuous Benchmarking", which Ahmed Karaman, a student in the fourth (final) year at the Faculty of Electrical Engineering in Cairo, will execute.

User Mode

   * As expected, for any program doing substantial floating-point calculation, the softfloat library will be the heaviest consumer of CPU cycles.
   * We plan to examine the performance behaviour of non-FP programs (integer arithmetic), or even non-numeric programs (sorting strings, for example); see the sketch right after this list.
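
To make that comparison concrete, here is a minimal sketch of the kind of FP-heavy test program we have in mind (the naive matrix-multiply kernel and its sizes are arbitrary, purely illustrative choices); an integer-only or string-sorting variant of the same shape would serve as the non-FP counterpart. Cross-compiled for a guest architecture and run under the corresponding qemu-<target> user-mode binary, a program like this should spend most of its host cycles in the softfloat library:

    /* Minimal FP-heavy guest workload (illustrative only): a naive
     * double-precision matrix multiply. When the guest FPU is emulated
     * with softfloat, these multiplies and adds dominate the profile. */
    #include <stdio.h>

    #define N 200

    static double a[N][N], b[N][N], c[N][N];

    int main(void)
    {
        /* Fill the inputs with some arbitrary values. */
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                a[i][j] = i * 0.5 + j;
                b[i][j] = i - j * 0.25;
            }
        }

        /* The FP-heavy kernel: N*N*N multiply-accumulate operations. */
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++) {
                    sum += a[i][k] * b[k][j];
                }
                c[i][j] = sum;
            }
        }

        /* Print one element so the work cannot be optimized away. */
        printf("%f\n", c[N - 1][N - 1]);
        return 0;
    }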

System Mode

   * I did profiling of booting several machines using a tool called callgrind (a part of valgrind); the rough invocation is sketched after this list. The tool offers a plethora of information, but it looks like it gets a little confused by the usage of coroutines, which makes some of its reports look very illogical, or plain ugly. Still, it seems valid data can be extracted from it. Without going into details, here is what it says for one machine (bear in mind that results may vary to a great extent between machines):
     ** The booting involved six threads: one for display handling, one for emulation, and four more. The last four did almost nothing during boot, sitting idle almost the entire time, waiting for something. In terms of "Total Instruction Fetch Count" (the main measure used in callgrind), the display thread and the emulation thread were in a proportion of roughly 1:3 (the remaining threads were negligible); interestingly enough, for another machine that proportion was 1:20.
     ** The display thread is dominated by the vga_update_display() function (21.5% "self" time, and 51.6% "self + callees" time, called almost 40,000 times). Other functions worth mentioning are cpu_physical_memory_snapshot_get_dirty() and memory_region_snapshot_get_dirty(), which are very small functions, but are both invoked over 26,000,000 times and together contribute over 20% of the display thread's instruction fetch count.
     ** Focusing now on the emulation thread, "Total Instruction Fetch Counts" were roughly distributed this way:
           - 15.7% is execution of JIT-ed code from the translation block buffer
           - 39.9% is execution of helpers
           - 44.4% is code translation stage, including some coroutine activities
        Top two among helpers:
          - helper_le_stl_memory()
          - helper_lookup_tb_ptr() (this one is invoked a whopping 36,000,000 times)
        Single largest instruction consumer of code translation:
           - liveness_pass_1(), which accounts for 21.5% of the entire "emulation thread" consumption, or, put another way, almost half of the code translation stage (which sits at 44.4%)
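
For reference, the profiling runs were along the lines of the sketch below; the machine type and guest options are placeholders rather than the exact setup, and --separate-threads is what produces the per-thread numbers quoted above:

    # Profile a full boot, writing one callgrind output file per thread.
    valgrind --tool=callgrind --separate-threads=yes \
        qemu-system-x86_64 -machine pc -m 1G -drive file=guest.img,format=raw

    # Annotate a per-thread output file to get per-function fetch counts.
    callgrind_annotate callgrind.out.<pid>-<thread id>

KCachegrind can be used instead of callgrind_annotate for a graphical view of callers and callees.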

Please take all this with a grain of salt, since these results are only preliminary.

I would like to use this opportunity to welcome Ahmed Karaman, a talented young man from Egypt, into the QEMU development community; he will work on the "TCG Continuous Benchmarking" project this summer. Please do help him in his first steps as our colleague. Best of luck to Ahmed!

Thanks,
Aleksandar

