From: Stefan Weil
Subject: Re: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?
Date: Sat, 20 Apr 2024 22:25:18 +0200
User-agent: Mozilla Thunderbird
On 16.04.24 at 14:17, Stefan Weil wrote:
On 16.04.24 at 14:10, Peter Maydell wrote:
The cross-i686-tci job is flaky again, with persistent intermittent
failures due to jobs timing out.
[...]
Some of these timeouts are very high -- no test should be taking
10 minutes, even given TCI and a slowish CI runner -- which suggests
to me that there's some kind of intermittent deadlock going on.
Can somebody who cares about TCI investigate, please, and track
down whatever this is?
I'll have a look.
Short summary:
The "persistent intermittent failures due to jobs timing out" are not
related to TCI: they also occur when the same tests are run with the
normal TCG. I suggest that the CI tests should run single-threaded.
But let's look at the details of my results.
I ran `time make test` in several scenarios on the rather old and not
very fast VM which I typically use for QEMU builds. I did not restrict
the tests to selected architectures as is done in the QEMU CI jobs, so
more tests ran, all of which finished successfully:
Ok: 848
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 68
Timeout: 0
---
1st test with normal TCG
`nohup time ../configure --enable-modules --disable-werror && nohup time
make -j4 && nohup time make test`
[...]
Full log written to
/home/stefan/src/gitlab/qemu-project/qemu/bin/ndebug/x86_64-linux-gnu/meson-logs/testlog.txt
2296.08user 1525.02system 21:49.78elapsed 291%CPU (0avgtext+0avgdata
633476maxresident)k
1730448inputs+14140528outputs (11668major+56827263minor)pagefaults 0swaps
---
2nd test with TCI
`nohup time ../configure --enable-modules --disable-werror
--enable-tcg-interpreter && nohup time make -j4 && nohup time make test`
[...]
Full log written to
/home/stefan/src/gitlab/qemu-project/qemu/bin/ndebug/x86_64-linux-gnu/meson-logs/testlog.txt
3766.74user 1521.38system 26:50.51elapsed 328%CPU (0avgtext+0avgdata
633012maxresident)k
32768inputs+14145080outputs (3033major+56121586minor)pagefaults 0swaps
---
So the total test time with TCI was 26:50.51 minutes while for the
normal test it was 21:49.78 minutes.
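For a rough sense of scale, the overall slowdown implied by these two elapsed times can be checked with plain awk arithmetic on the figures quoted above (an illustrative one-liner, not part of the test run):

```shell
# Ratio of the two elapsed wall-clock times quoted above:
# TCI 26:50.51 vs. normal TCG 21:49.78.
awk 'BEGIN {
    tcg = 21*60 + 49.78
    tci = 26*60 + 50.51
    printf "TCI slowdown: %.2fx\n", tci / tcg
}'
```

So TCI costs roughly 23% extra wall-clock time here, not "factors".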
These were the 10 individual tests with the longest duration:
1st test with normal TCG
94/916 qtest-arm / qtest-arm/qom-test 373.41s
99/916 qtest-aarch64 / qtest-aarch64/qom-test 398.43s
100/916 qtest-i386 / qtest-i386/bios-tables-test 188.06s
103/916 qtest-x86_64 / qtest-x86_64/bios-tables-test 228.33s
106/916 qtest-aarch64 / qtest-aarch64/migration-test 201.15s
119/916 qtest-i386 / qtest-i386/migration-test 253.58s
126/916 qtest-x86_64 / qtest-x86_64/migration-test 266.66s
143/916 qtest-arm / qtest-arm/test-hmp 101.72s
144/916 qtest-aarch64 / qtest-aarch64/test-hmp 113.10s
163/916 qtest-arm / qtest-arm/aspeed_smc-test 256.92s
2nd test with TCI
68/916 qtest-arm / qtest-arm/qom-test 375.35s
82/916 qtest-aarch64 / qtest-aarch64/qom-test 403.50s
93/916 qtest-i386 / qtest-i386/bios-tables-test 192.22s
99/916 qtest-aarch64 / qtest-aarch64/bios-tables-test 379.92s
100/916 qtest-x86_64 / qtest-x86_64/bios-tables-test 240.19s
103/916 qtest-aarch64 / qtest-aarch64/migration-test 223.49s
106/916 qtest-ppc64 / qtest-ppc64/pxe-test 418.42s
113/916 qtest-i386 / qtest-i386/migration-test 284.96s
118/916 qtest-arm / qtest-arm/aspeed_smc-test 271.10s
119/916 qtest-x86_64 / qtest-x86_64/migration-test 287.36s
---
Summary:
TCI is not much slower than the normal TCG. Surprisingly, it was even
faster for tests 99 and 103. For other tests like test 106, TCI was
about half as fast as normal TCG, but overall it is not "factors"
slower. A total test time of 26:50 minutes is also not bad compared
with the 21:49 minutes of the normal TCG.
No single test (including subtests) with TCI exceeded 10 minutes; the
longest was well below that margin at 418 seconds.
---
The tests above ran on x86_64, and I could not reproduce the timeouts.
The GitLab CI tests run on i686 and use different configure options.
Therefore I ran additional tests with 32-bit builds in a chroot
environment (Debian GNU/Linux bullseye, i386) with the original
configure options. As expected, that reduced the number of tests
to 250. All tests passed with `make test`:
3rd test with normal TCG
Ok: 250
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 0
Full log written to
/root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
855.30user 450.53system 6:45.57elapsed 321%CPU (0avgtext+0avgdata
609180maxresident)k
28232inputs+4772968outputs (64944major+8328814minor)pagefaults 0swaps
4th test with TCI
Ok: 250
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 0
Full log written to /root/qemu/build/meson-logs/testlog.txt
make[1]: Leaving directory '/root/qemu/build'
1401.64user 483.55system 9:03.25elapsed 347%CPU (0avgtext+0avgdata
609244maxresident)k
24inputs+4690040outputs (70405major+7972385minor)pagefaults 0swaps
---
Summary:
Again TCI is not much slower than the normal TCG. The total test time
for all 250 tests is below 10 minutes, even with TCI!
---
Could it be that the timeouts in GitLab CI are caused by misconfigured
parallelism? Those tests are not started with `make test`, which would
run single-threaded, but with meson and the argument
`--num-processes 9`. I tested that with the normal TCG on my VM, which
can only run 4 threads simultaneously:
5th test with TCG
pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9
--print-errorlogs
Summary of Failures:
254/258 qemu:softfloat+softfloat-ops / fp-test-mul
TIMEOUT 30.12s killed by signal 15 SIGTERM
Ok: 249
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 1
Full log written to
/root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
Repeating that test fails again, this time with 2 timeouts:
time pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 9
--print-errorlogs
Summary of Failures:
253/258 qemu:softfloat+softfloat-ops / fp-test-mul
TIMEOUT 30.14s killed by signal 15 SIGTERM
254/258 qemu:softfloat+softfloat-ops / fp-test-div
TIMEOUT 30.18s killed by signal 15 SIGTERM
Ok: 248
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 2
Full log written to
/root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
real 7m51.102s
user 14m0.470s
sys 7m58.427s
Now I reduced the number of processes to 4:
time pyvenv/bin/meson test --no-rebuild -t 1 --num-processes 4
--print-errorlogs
Ok: 250
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 8
Timeout: 0
Full log written to
/root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
real 6m42.648s
user 13m52.271s
sys 7m35.643s
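The `--num-processes 4` above was chosen by hand to match the VM. A minimal sketch of deriving that number automatically (assuming GNU coreutils' `nproc`, which respects CPU affinity masks) instead of hard-coding 9:

```shell
# Derive the meson worker count from the CPUs actually available to the job.
NPROC=$(nproc)
echo "pyvenv/bin/meson test --no-rebuild -t 1 --num-processes ${NPROC} --print-errorlogs"
```

(The command is only echoed here for illustration; in CI one would run it directly.)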
---
Summary:
I could reproduce the persistent intermittent failures due to jobs
timing out without TCI when I ran the tests with more processes than my
machine supports. This result is expected: on an oversubscribed host,
not every test can run 100% of the time. Tests are interrupted by
scheduling, so it is normal that some (random) tests take longer, and
some will even hit a timeout.
The final test shows that adjusting the number of processes to the
host's capabilities fixes the problem. But it also shows that 4
processes don't speed up the whole test job compared to a single
thread. Oversubscribing the host obviously wastes a lot of user and sys
CPU time.
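The wasted CPU time is visible in the time(1) figures quoted above: dividing user+sys by real times 4 CPUs gives the fraction of the machine the tests kept busy in each run (illustrative awk arithmetic on the quoted numbers):

```shell
awk 'BEGIN {
    cpus = 4
    # 9 processes (5th test, 2nd run): 14m0.470s user, 7m58.427s sys, 7m51.102s real
    nine = (14*60 + 0.470 + 7*60 + 58.427) / ((7*60 + 51.102) * cpus)
    # 4 processes: 13m52.271s user, 7m35.643s sys, 6m42.648s real
    four = (13*60 + 52.271 + 7*60 + 35.643) / ((6*60 + 42.648) * cpus)
    printf "9 procs: %.0f%% busy, 4 procs: %.0f%% busy\n", nine*100, four*100
}'
```

So the oversubscribed run keeps the CPUs less busy overall while also taking longer in wall-clock time.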
Regards
Stefan