From: Stefan Weil
Subject: Re: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?
Date: Sat, 20 Apr 2024 22:25:18 +0200
User-agent: Mozilla Thunderbird
On 16.04.24 at 14:17, Stefan Weil wrote:
On 16.04.24 at 14:10, Peter Maydell wrote:
The cross-i686-tci job is flaky again, with persistent intermittent
failures due to jobs timing out.
[...]
Some of these timeouts are very high -- no test should be taking
10 minutes, even given TCI and a slowish CI runner -- which suggests
to me that there's some kind of intermittent deadlock going on.
Can somebody who cares about TCI investigate, please, and track
down whatever this is?
I'll have a look.
Short summary:
The "persistent intermittent failures due to jobs timing out" are not
related to TCI: they also occur when the same tests are run with the
normal TCG. I suggest that the CI tests should run single-threaded.
But let's look at the details of my results.
I ran `time make test` in several scenarios on the rather old and not
very fast VM which I typically use for QEMU builds. I did not restrict
the tests to selected architectures as is done in the QEMU CI jobs, so
more tests ran, all of which finished successfully:
Ok: 848
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 68
Timeout: 0
---
1st test with normal TCG
`nohup time ../configure --enable-modules --disable-werror && nohup time
make -j4 && nohup time make test`
[...]
Full log written to
/home/stefan/src/gitlab/qemu-project/qemu/bin/ndebug/x86_64-linux-gnu/meson-logs/testlog.txt
2296.08user 1525.02system 21:49.78elapsed 291%CPU (0avgtext+0avgdata
633476maxresident)k
1730448inputs+14140528outputs (11668major+56827263minor)pagefaults 0swaps
---
2nd test with TCI
`nohup time ../configure --enable-modules --disable-werror
--enable-tcg-interpreter && nohup time make -j4 && nohup time make test`
[...]
Full log written to
/home/stefan/src/gitlab/qemu-project/qemu/bin/ndebug/x86_64-linux-gnu/meson-logs/testlog.txt
3766.74user 1521.38system 26:50.51elapsed 328%CPU (0avgtext+0avgdata
633012maxresident)k
32768inputs+14145080outputs (3033major+56121586minor)pagefaults 0swaps
---
So the total test time with TCI was 26:50.51 minutes while for the
normal test it was 21:49.78 minutes.
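For a rough sense of scale, the overall slowdown implied by these two elapsed times can be checked with plain awk arithmetic on the figures quoted above (an illustrative one-liner, not part of the test run):

```shell
# Ratio of the two elapsed wall-clock times quoted above:
# TCI 26:50.51 vs. normal TCG 21:49.78.
awk 'BEGIN {
    tcg = 21*60 + 49.78
    tci = 26*60 + 50.51
    printf "TCI slowdown: %.2fx\n", tci / tcg
}'
```

So TCI costs roughly 23% extra wall-clock time here, not "factors".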
These were the 10 individual tests with the longest duration:
1st test with normal TCG
94/916 qtest-arm / qtest-arm/qom-test 373.41s
99/916 qtest-aarch64 / qtest-aarch64/qom-test 398.43s
100/916 qtest-i386 / qtest-i386/bios-tables-test 188.06s
103/916 qtest-x86_64 / qtest-x86_64/bios-tables-test 228.33s
106/916 qtest-aarch64 / qtest-aarch64/migration-test 201.15s
119/916 qtest-i386 / qtest-i386/migration-test 253.58s
126/916 qtest-x86_64 / qtest-x86_64/migration-test 266.66s
143/916 qtest-arm / qtest-arm/test-hmp 101.72s
144/916 qtest-aarch64 / qtest-aarch64/test-hmp 113.10s
163/916 qtest-arm / qtest-arm/aspeed_smc-test 256.92s
2nd test with TCI
68/916 qtest-arm / qtest-arm/qom-test 375.35s
82/916 qtest-aarch64 / qtest-aarch64/qom-test 403.50s
93/916 qtest-i386 / qtest-i386/bios-tables-test 192.22s
99/916 qtest-aarch64 / qtest-aarch64/bios-tables-test 379.92s
100/916 qtest-x86_64 / qtest-x86_64/bios-tables-test 240.19s
103/916 qtest-aarch64 / qtest-aarch64/migration-test 223.49s
106/916 qtest-ppc64 / qtest-ppc64/pxe-test 418.42s
113/916 qtest-i386 / qtest-i386/migration-test 284.96s
118/916 qtest-arm / qtest-arm/aspeed_smc-test 271.10s
119/916 qtest-x86_64 / qtest-x86_64/migration-test 287.36s
---
Summary:
TCI is not much slower than the normal TCG. Surprisingly, it was even
faster for tests 99 and 103. For other tests like test 106, TCI was
about half as fast as normal TCG, but overall it is not "factors"
slower. A total test time of 26:50 minutes is also not bad compared
with the 21:49 minutes of the normal TCG.
No single test (including subtests) with TCI exceeded 10 minutes; the
longest was well below that margin at 418 seconds.
---
The tests above ran on x86_64, and I could not reproduce the timeouts.
The GitLab CI tests run on i686 and use different configure options.
Therefore I ran additional tests with 32-bit builds in a chroot
environment (Debian GNU/Linux bullseye, i386) with the original
configure options. As expected, that reduced the number of tests
to 250. All tests passed with `make test`:
3rd test with normal TCG
Ok: 250
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 0
Full log written to
/root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
855.30user 450.53system 6:45.57elapsed 321%CPU (0avgtext+0avgdata
609180maxresident)k
28232inputs+4772968outputs (64944major+8328814minor)pagefaults 0swaps
4th test with TCI
Ok: 250
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 0
Full log written to /root/qemu/build/meson-logs/testlog.txt
make[1]: Leaving directory '/root/qemu/build'
1401.64user 483.55system 9:03.25elapsed 347%CPU (0avgtext+0avgdata
609244maxresident)k
24inputs+4690040outputs (70405major+7972385minor)pagefaults 0swaps
---
Summary:
Again TCI is not much slower than the normal TCG. The total test time
for all 250 tests is below 10 minutes, even with TCI!
---
Could it be that the timeouts in GitLab CI are caused by misconfigured
parallelism? Those tests are not started with `make test`, which would
run single-threaded, but with meson and the argument
`--num-processes 9`. I tested that with the normal TCG on my VM, which
can only run 4 threads simultaneously:
5th test with TCG
pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9
--print-errorlogs
Summary of Failures:
254/258 qemu:softfloat+softfloat-ops / fp-test-mul
TIMEOUT 30.12s killed by signal 15 SIGTERM
Ok: 249
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 1
Full log written to
/root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
Repeating that test fails again, this time with 2 timeouts:
time pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9
--print-errorlogs
Summary of Failures:
253/258 qemu:softfloat+softfloat-ops / fp-test-mul
TIMEOUT 30.14s killed by signal 15 SIGTERM
254/258 qemu:softfloat+softfloat-ops / fp-test-div
TIMEOUT 30.18s killed by signal 15 SIGTERM
Ok: 248
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 2
Full log written to
/root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
real 7m51.102s
user 14m0.470s
sys 7m58.427s
Now I reduced the number of processes to 4:
time pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 4
--print-errorlogs
Ok: 250
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 0
Full log written to
/root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
real 6m42.648s
user 13m52.271s
sys 7m35.643s
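The `--num-processes 4` above was chosen by hand to match the VM. A minimal sketch of deriving that number automatically (assuming GNU coreutils' `nproc`, which respects CPU affinity masks) instead of hard-coding 9:

```shell
# Derive the meson worker count from the CPUs actually available to the job.
NPROC=$(nproc)
echo "pyvenv/bin/meson test --no-rebuild -t 1 --num-processes ${NPROC} --print-errorlogs"
```

(The command is only echoed here for illustration; in CI one would run it directly.)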
---
Summary:
I could reproduce the persistent intermittent failures due to jobs
timing out without TCI when I ran the tests with more processes than my
machine supports. This result is expected: on an oversubscribed host,
not every test can run 100% of the time. Tests are interrupted by
scheduling, so it is normal that some (random) tests take longer, and
some will even hit a timeout.
The final test shows that adjusting the number of processes to the
host's capabilities fixes the problem. But it also shows that 4
processes don't speed up the whole test job compared to a single
thread. Oversubscribing the host obviously wastes a lot of user and sys
CPU time.
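The wasted CPU time is visible in the time(1) figures quoted above: dividing user+sys by real times 4 CPUs gives the fraction of the machine the tests kept busy in each run (illustrative awk arithmetic on the quoted numbers):

```shell
awk 'BEGIN {
    cpus = 4
    # 9 processes (5th test, 2nd run): 14m0.470s user, 7m58.427s sys, 7m51.102s real
    nine = (14*60 + 0.470 + 7*60 + 58.427) / ((7*60 + 51.102) * cpus)
    # 4 processes: 13m52.271s user, 7m35.643s sys, 6m42.648s real
    four = (13*60 + 52.271 + 7*60 + 35.643) / ((6*60 + 42.648) * cpus)
    printf "9 procs: %.0f%% busy, 4 procs: %.0f%% busy\n", nine*100, four*100
}'
```

So the oversubscribed run keeps the CPUs less busy overall while also taking longer in wall-clock time.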
Regards
Stefan