

From: Stefan Weil
Subject: Re: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?
Date: Sat, 20 Apr 2024 22:25:18 +0200
User-agent: Mozilla Thunderbird

On 16.04.24 at 14:17, Stefan Weil wrote:
On 16.04.24 at 14:10, Peter Maydell wrote:

The cross-i686-tci job is flaky again, with persistent intermittent
failures due to jobs timing out.
[...]
Some of these timeouts are very high -- no test should be taking
10 minutes, even given TCI and a slowish CI runner -- which suggests
to me that there's some kind of intermittent deadlock going on.

Can somebody who cares about TCI investigate, please, and track
down whatever this is?

I'll have a look.

Short summary:

The "persistent intermittent failures due to jobs timing out" are not related to TCI: they also occur if the same tests are run with the normal TCG. I suggest that the CI tests should run single threaded.

But let's look at the details of my results.

I ran `time make test` in different scenarios on the rather old and not particularly fast VM which I typically use for QEMU builds. I did not restrict the tests to selected target architectures as is done in the QEMU CI jobs, so more tests ran, and all of them ended successfully:

Ok:                 848
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            68
Timeout:            0
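(For comparison, the CI jobs build only a subset of targets. Restricting a local build works with configure's --target-list option, roughly like this; the particular list below is only an illustration, not the one used by the cross-i686-tci job:

../configure --disable-werror --target-list=i386-softmmu,x86_64-softmmu,arm-softmmu,aarch64-softmmu
)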

---

1st test with normal TCG

`nohup time ../configure --enable-modules --disable-werror && nohup time make -j4 && nohup time make test`
[...]
Full log written to /home/stefan/src/gitlab/qemu-project/qemu/bin/ndebug/x86_64-linux-gnu/meson-logs/testlog.txt
2296.08user 1525.02system 21:49.78elapsed 291%CPU (0avgtext+0avgdata 633476maxresident)k
1730448inputs+14140528outputs (11668major+56827263minor)pagefaults 0swaps

---

2nd test with TCI

`nohup time ../configure --enable-modules --disable-werror --enable-tcg-interpreter && nohup time make -j4 && nohup time make test`
[...]
Full log written to /home/stefan/src/gitlab/qemu-project/qemu/bin/ndebug/x86_64-linux-gnu/meson-logs/testlog.txt
3766.74user 1521.38system 26:50.51elapsed 328%CPU (0avgtext+0avgdata 633012maxresident)k
32768inputs+14145080outputs (3033major+56121586minor)pagefaults 0swaps

---

So the total test time with TCI was 26:50.51 minutes, while the normal TCG run took 21:49.78 minutes.
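(In numbers: roughly 1610 s versus 1310 s, a ratio of about 1.23, i.e. around 23 % more wall-clock time for the whole run.)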

These were the ten individual tests with the longest durations:

1st test with normal TCG

  94/916 qtest-arm / qtest-arm/qom-test                 373.41s
  99/916 qtest-aarch64 / qtest-aarch64/qom-test         398.43s
100/916 qtest-i386 / qtest-i386/bios-tables-test       188.06s
103/916 qtest-x86_64 / qtest-x86_64/bios-tables-test   228.33s
106/916 qtest-aarch64 / qtest-aarch64/migration-test   201.15s
119/916 qtest-i386 / qtest-i386/migration-test         253.58s
126/916 qtest-x86_64 / qtest-x86_64/migration-test     266.66s
143/916 qtest-arm / qtest-arm/test-hmp                 101.72s
144/916 qtest-aarch64 / qtest-aarch64/test-hmp         113.10s
163/916 qtest-arm / qtest-arm/aspeed_smc-test          256.92s

2nd test with TCI

  68/916 qtest-arm / qtest-arm/qom-test                 375.35s
  82/916 qtest-aarch64 / qtest-aarch64/qom-test         403.50s
  93/916 qtest-i386 / qtest-i386/bios-tables-test       192.22s
  99/916 qtest-aarch64 / qtest-aarch64/bios-tables-test 379.92s
100/916 qtest-x86_64 / qtest-x86_64/bios-tables-test   240.19s
103/916 qtest-aarch64 / qtest-aarch64/migration-test   223.49s
106/916 qtest-ppc64 / qtest-ppc64/pxe-test             418.42s
113/916 qtest-i386 / qtest-i386/migration-test         284.96s
118/916 qtest-arm / qtest-arm/aspeed_smc-test          271.10s
119/916 qtest-x86_64 / qtest-x86_64/migration-test     287.36s
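(For anyone who wants to reproduce this ranking: assuming the per-test result lines in testlog.txt end with the duration, as in the excerpts above, a shell one-liner like the following lists the slowest tests; the exact log format may differ between meson versions, and the path has to be adjusted to the build directory.

awk '$NF ~ /^[0-9.]+s$/ { print $NF, $0 }' meson-logs/testlog.txt | sort -rn | head -n 10
)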

---

Summary:

TCI is not much slower than normal TCG. Surprisingly, it was even faster for tests 99 and 103. For other tests, like test 106, TCI was about half as fast as normal TCG, but overall it is not "factors" slower. A total test time of 26:50 minutes also compares reasonably with the 21:49 minutes of normal TCG.

No single test (including subtests) with TCI exceeded 10 minutes; the longest one, at 418 seconds, stayed well below that limit.

---

The tests above were run on x86_64, and I could not reproduce any timeouts. The GitLab CI tests run on i686 and use different configure options, so I ran additional tests with 32-bit builds in a chroot environment (Debian GNU/Linux bullseye, i386) using the original configure options. As expected, that reduced the number of tests to 250. All tests passed with `make test`; the results are shown below.
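(For reference, a 32-bit bullseye chroot can be created roughly like this; the target path and mirror are just examples, not necessarily what I used:

debootstrap --arch=i386 bullseye /srv/bullseye-i386 http://deb.debian.org/debian
setarch i686 chroot /srv/bullseye-i386 /bin/bash
)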

3rd test with normal TCG

Ok:                 250
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            8
Timeout:            0

Full log written to /root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
855.30user 450.53system 6:45.57elapsed 321%CPU (0avgtext+0avgdata 609180maxresident)k
28232inputs+4772968outputs (64944major+8328814minor)pagefaults 0swaps


4th test with TCI

Ok:                 250
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            8
Timeout:            0

Full log written to /root/qemu/build/meson-logs/testlog.txt
make[1]: Leaving directory '/root/qemu/build'
1401.64user 483.55system 9:03.25elapsed 347%CPU (0avgtext+0avgdata 609244maxresident)k
24inputs+4690040outputs (70405major+7972385minor)pagefaults 0swaps

---

Summary:

Again, TCI is not much slower than normal TCG. The total test time for all 250 tests stays below 10 minutes, even with TCI!
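(In numbers: 9:03.25 versus 6:45.57 elapsed, a ratio of roughly 1.34.)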

---

Could it be that the timeouts in GitLab CI are caused by wrongly configured multithreading? Those tests are not started with `make test`, which would run single-threaded, but with meson and the argument `--num-processes 9`. I tested that with normal TCG on my VM, which can only run 4 threads simultaneously:

5th test with TCG

pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9 --print-errorlogs
Summary of Failures:

254/258 qemu:softfloat+softfloat-ops / fp-test-mul TIMEOUT 30.12s killed by signal 15 SIGTERM

Ok:                 249
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            8
Timeout:            1

Full log written to /root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt

Repeating that test, it fails again, this time with 2 timeouts:

time pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9 --print-errorlogs
Summary of Failures:

253/258 qemu:softfloat+softfloat-ops / fp-test-mul TIMEOUT 30.14s killed by signal 15 SIGTERM
254/258 qemu:softfloat+softfloat-ops / fp-test-div TIMEOUT 30.18s killed by signal 15 SIGTERM

Ok:                 248
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            8
Timeout:            2

Full log written to /root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt

real    7m51.102s
user    14m0.470s
sys     7m58.427s

Now I reduced the number of threads to 4:

time pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 4 --print-errorlogs
Ok:                 250
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            8
Timeout:            0

Full log written to /root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt

real    6m42.648s
user    13m52.271s
sys     7m35.643s

---

Summary:

I could reproduce persistent intermittent failures due to jobs timing out without TCI when I ran the tests with more threads than my machine supports. This result is expected because the individual tests then cannot run 100% of the time: they are interrupted by scheduling, so it is normal that some (random) tests take longer, and some will even hit a timeout.

The final test shows that adjusting the number of threads to the host's capabilities fixes the problem. But it also shows that 4 threads do not speed up the whole test job compared to a single thread; using more threads obviously wastes a lot of user and sys CPU time.
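A minimal adjustment for the CI job would be to derive the process count from the runner instead of hard-coding 9, assuming nproc reports the number of CPUs actually available to the job:

pyvenv/bin/meson test --no-rebuild -t 1 --num-processes "$(nproc)" --print-errorlogs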

Regards
Stefan


