Subject: Re: [PATCH RFC 10/10] tests/migration-tests: Add test case for responsive CPU throttle
From: Peter Xu
Date: Wed, 11 Sep 2024 16:37:58 -0400
On Wed, Sep 11, 2024 at 04:48:21PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Tue, Sep 10, 2024 at 07:23:43PM -0300, Fabiano Rosas wrote:
> >> Peter Xu <peterx@redhat.com> writes:
> >>
> >> > On Mon, Sep 09, 2024 at 06:54:46PM -0300, Fabiano Rosas wrote:
> >> >> Peter Xu <peterx@redhat.com> writes:
> >> >>
> >> >> > On Mon, Sep 09, 2024 at 03:02:57PM +0100, Peter Maydell wrote:
> >> >> >> On Mon, 9 Sept 2024 at 14:51, Hyman Huang <yong.huang@smartx.com>
> >> >> >> wrote:
> >> >> >> >
> >> >> >> > Despite the fact that the responsive CPU throttle is enabled,
> >> >> >> > the dirty sync count may not always increase because this is
> >> >> >> > an optimization that might not happen in every situation.
> >> >> >> >
> >> >> >> > This test case just makes sure it doesn't interfere with any
> >> >> >> > current functionality.
> >> >> >> >
> >> >> >> > Signed-off-by: Hyman Huang <yong.huang@smartx.com>
> >> >> >>
> >> >> >> tests/qtest/migration-test already runs 75 different
> >> >> >> subtests, takes up a massive chunk of our "make check"
> >> >> >> time, and is very commonly a "times out" test on some
> >> >> >> of our CI jobs. It runs on five different guest CPU
> >> >> >> architectures, each one of which takes between 2 and
> >> >> >> 5 minutes to complete the full migration-test.
> >> >> >>
> >> >> >> Do we really need to make it even bigger?
> >> >> >
> >> >> > I'll try to find some time in the next few weeks looking into this to
> >> >> > see
> >> >> > whether we can further shrink migration test times after previous
> >> >> > attempts
> >> >> > from Dan. At least a low hanging fruit is we should indeed put some
> >> >> > more
> >> >> > tests into g_test_slow(), and this new test could also be a candidate
> >> >> > (then
> >> >> > we can run "-m slow" for migration PRs only).
> >> >>
> >> >> I think we could (using -m slow or any other method) separate tests
> >> >> that are generic enough that every CI run should benefit from them
> >> >> vs. tests that are only useful once someone starts touching migration
> >> >> code. I'd say very few in the former category and most of them in the
> >> >> latter.
> >> >>
> >> >> For an idea of where migration bugs lie, I took a look at what was
> >> >> fixed since 2022:
> >> >>
> >> >> # bugs | device/subsystem/arch
> >> >> ----------------------------------
> >> >> 54 | migration
> >> >> 10 | vfio
> >> >> 6 | ppc
> >> >> 3 | virtio-gpu
> >> >> 2 | pcie_sriov, tpm_emulator,
> >> >> vdpa, virtio-rng-pci
> >> >> 1 | arm, block, gpio, lasi,
> >> >> pci, s390, scsi-disk,
> >> >> virtio-mem, TCG
> >> >
> >> > Just curious; how did you collect these?
> >>
> >> git log --since=2022 and then squinted at it. I wrote a warning to take
> >> this with a grain of salt, but it missed the final version.
> >>
> >> >
> >> >>
> >> >> From these, ignoring the migration bugs, the migration-tests cover some
> >> >> of: arm, ppc, s390, TCG. The device_opts[1] patch hasn't merged yet, but
> >> >> once it is, then virtio-gpu would be covered and we could investigate
> >> >> adding some of the others.
> >> >>
> >> >> For actual migration code issues:
> >> >>
> >> >> # bugs | (sub)subsystem | kind
> >> >> ----------------------------------------------
> >> >> 13 | multifd | correctness/races
> >> >> 8 | ram | correctness
> >> >> 8 | rdma: | general programming
> >> >
> >> > 8 rdma bugs??? ouch..
> >>
> >> Mostly caught by a cleanup from Markus. Silly stuff like mixed signed
> >> integer comparisons and bugs in error handling. I don't even want to
> >> look too much at it.
> >>
> >> ...hopefully this release we'll manage to resolve that situation.
> >>
> >> >
> >> >> 7 | qmp | new api bugs
> >> >> 5 | postcopy | races
> >> >> 4 | file: | leaks
> >> >> 3 | return path | races
> >> >> 3 | fd_cleanup | races
> >> >> 2 | savevm, aio/coroutines
> >> >> 1 | xbzrle, colo, dirtyrate, exec:,
> >> >> windows, iochannel, qemufile,
> >> >> arch (ppc64le)
> >> >>
> >> >> Here, the migration-tests cover well: multifd, ram, qmp, postcopy,
> >> >> file, rp, fd_cleanup, iochannel, qemufile, xbzrle.
> >> >>
> >> >> My suggestion is we run per arch:
> >> >>
> >> >> "/precopy/tcp/plain"
> >> >> "/precopy/tcp/tls/psk/match",
> >> >> "/postcopy/plain"
> >> >> "/postcopy/preempt/plain"
> >> >> "/postcopy/preempt/recovery/plain"
> >> >> "/multifd/tcp/plain/cancel"
> >> >> "/multifd/tcp/uri/plain/none"
> >> >
> >> > Don't you want to still keep a few multifd / file tests?
> >>
> >> Not really, but I won't object if you want to add some more in there. To
> >> be honest, I want to get out of people's way as much as I can because
> >> having to revisit this every couple of months is stressful to me.
> >
> > I hope migration tests are not too obstructive so far :) - this
> > discussion is still within a context where we might add one more slow test
> > in CI, and we probably simply should make it a -m slow test.
> >
> >>
> >> My rationale for those is:
> >>
> >> "/precopy/tcp/plain":
> >> Smoke test, the most common migration
> >
> > E.g. unix is missing, and I'm not sure which we use for e.g. kubevirt.
> >
> > And note that even if unix shares the socket iochannel with tcp, it may not
> > work the same. For example, if you remember I mentioned I was looking at
> > threadify the dest qemu receive side, I have a draft there but currently it
> > has a bug to hang a unix migration, not tcp..
>
> Ok, let's add a unix one, no problem.
>
> >
> > So IMHO it's not easy to justify which we can simply drop, if we still want
> > to provide some good gating in CI.
>
> It's not exactly about dropping, it's about which ones need to be run at
> *every* invocation of make check (x5 because of the multiple archs) and
Yeah, indeed we could consider reducing the number of runs, maybe. However
I do remember we used to have migration-test bugs only reproduced with some
specific distro, well..
Now I'm looking at the pipelines..
Why 5x, btw? I saw alpine, centos, debian, fedora, opensuse, ubuntu,
cfi-x86_64, then compat-x86_64. So that's 8x? They're for the same arch
(amd64) so far.
Maybe we can skip some indeed.
> at every git branch push in CI (unless -o ci.skip). For gating we don't
> need all the tests. Many of them are testing the same core code with
> just a few differences at the margins.
>
> > And I won't be 100% surprised if some
> > other non-migration PR (e.g. io/) changed some slight bit that unix is
> > broken and tcp keeps working, for example, then we lose some CI benefits.
>
> IMO, these non-migration PRs are exactly what we have to worry
> about. Because migration PRs would run with -m slow and we'd catch the
> issue there.
>
> >
> >>
> >> "/precopy/tcp/tls/psk/match":
> >> Something might change in the distro regarding tls. Such as:
> >> https://gitlab.com/qemu-project/qemu/-/issues/1937
> >>
> >> "/postcopy/plain":
> >> Smoke test for postcopy
> >>
> >> "/postcopy/preempt/plain":
> >> Just to exercise the preempt thread
> >>
> >> "/postcopy/preempt/recovery/plain":
> >> Recovery has had some issues with races in the past
> >>
> >> "/multifd/tcp/plain/cancel":
> >> The MVP of catching concurrency issues
> >>
> >> "/multifd/tcp/uri/plain/none":
> >> Smoke test for multifd
> >>
> >> All in all, these will test basic functionality and run very often. The
> >> more tests we add to this set, the less return we get in relation to the
> >> time they take.
> >
> > This is true. We can try to discuss more on which is more important; I
> > still think something might be good to be added on top of above.
>
> Sure, just add what you think we need.
Let's leave the detailed discussion on what to choose to the patch
phase. IIUC this can be the last thing we do.
We've just queued your other patch to "slow down" the two time-consuming
tests, saving 40s for each CI kick (total ~90s now locally, so that's a 30%
cut already across the board!), not bad as a start.
Then if we can remove some of the 8 runs (e.g., keep 2-3), with 3 system
runs it'll be another ~62% cut, for a total of:
1.0 - 9/13 * 3/8 = ~75%
So if we remove 5 of the 8 system runs, we can reduce each CI run to only
about a quarter of the time it took before.
Would that be good enough for us to live with for quite a few more
releases? So far these are all low-hanging fruit.
If we want to go further (and you think we should start with reducing
test cases, rather than profiling them), then feel free to go ahead and
send a patch when you're ready. But maybe you won't want to bother once
you notice that we can already shrink ~75% of the runtime without
dropping anything. Your call!
>
> >
> > There's also the other way - at some point, I still want to improve
> > migration-test run speed, and please have a look if you like too at some
> > point: so there's still chance (average is ~2sec as of now), IMHO, we don't
> > lose anything in CI but runs simply faster.
>
> My only idea, but it requires a bit of work, is to do unit testing on
> the interfaces. Anything before migration_fd_connect(). Then we could
> stop doing a full migration for those.
What I'm thinking of is some further profiling, like what Dan used to do. I
feel like we still have room, considering the guest image we run is
simply a loop.. so basically zero boot time logically.
>
> >
> >>
> >> >
> >> > IIUC some file ops can still be relevant to archs. Multifd still has one
> >> > bug that can only reproduce on arm64.. but not x86_64. I remember it's a
> >> > race condition when migration finishes, and the issue could be memory
> >> > ordering relevant, but maybe not.
> >>
> >> I'm not aware of anything. I believe the last arm64 bug we had was the
> >> threadinfo stuff[1]. If you remember what it's about, let me know.
> >>
> >> 1- 01ec0f3a92 ("migration/multifd: Protect accesses to migration_threads").
> >
> > https://issues.redhat.com/browse/RHEL-45628
>
> Interesting. But if multifd/tcp/plain/cancel doesn't catch this I don't
> think other tests will. Also, we don't have tests for zerocopy AFAIK.
That'll be challenging as it requires locked_vm limit.
>
> >
> > On RH side Bandan is looking at it, but he's during a long holidays
> > recently.
>
> Good luck to him.
>
> >
> >>
> >> >
> >> >>
> >> >> and x86 gets extra:
> >> >>
> >> >> "/precopy/unix/suspend/live"
> >> >> "/precopy/unix/suspend/notlive"
> >> >> "/dirty_ring"
> >> >
> >> > dirty ring will be disabled anyway when !x86, so probably not a major
> >> > concern.
> >> >
> >> >>
> >> >> (the other dirty_* tests are too slow)
> >> >
> >> > These are the 10 slowest tests when I run locally:
> >> >
> >> > /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client 2.41
> >> > /x86_64/migration/postcopy/recovery/plain 2.43
> >> > /x86_64/migration/multifd/tcp/tls/x509/default-host 2.66
> >> > /x86_64/migration/multifd/tcp/tls/x509/override-host 2.86
> >> > /x86_64/migration/postcopy/tls/psk 2.91
> >> > /x86_64/migration/postcopy/preempt/recovery/tls/psk 3.08
> >> > /x86_64/migration/postcopy/preempt/tls/psk 3.30
> >> > /x86_64/migration/postcopy/recovery/tls/psk 3.81
> >> > /x86_64/migration/vcpu_dirty_limit 13.29
> >> > /x86_64/migration/precopy/unix/xbzrle 27.55
> >> >
> >> > Are you aware of people using xbzrle at all?
> >>
> >> Nope.
> >>
> >> >
> >> >>
> >> >> All the rest go behind a knob that people touching migration code will
> >> >> enable.
> >> >>
> >> >> wdyt?
> >> >
> >> > Agree with the general idea, but I worry the exact list above can be too
> >> > small.
> >>
> >> We won't stop running the full set of tests. We can run them in our
> >> forks' CI as much as we want. There are no cases of people reporting a
> >> migration bug because their 'make check' caught something that ours
> >> didn't.
> >
> > IIUC it's hard to say - when the test is in CI maintainers can catch them
> > already before sending a formal PR.
> >
> > If the test is run by default in make check, a developer can trigger a
> > migration issue (with his/her patch applied) then one can notice it
> > introduced a bug, fix it, then post the patches. We won't know whether
> > that happened.
>
> Good point.
>
> >
> > So one thing we can do (if you think worthwhile to do it now) is we shrink
> > the default test case a lot as you proposed, then we wait and see what
> > breaks, and then we gradually add tests back when it can be used to find
> > breakages. But that'll also take time if it really can find such tests,
> > because then we'll need to debug them one by one (instead of developer /
> > maintainer looking into them with their expertise knowledge..). I'm not
> > sure whether it's worthwhile to do it now, but I don't feel strongly if we
> > can still have a reliable set of default test cases.
>
> We first need a way to enable them for the migration CI. Do we have a
> variable that CI understands that can be used to enable slow tests?
I'm not aware of one. Yes, sounds like we should add one if it doesn't exist.
>
> >
> >>
> >> Besides, the main strength of CI is to catch bugs when someone makes a
> >> code change. If people touch migration code, then we'll run it in our CI
> >> anyway. If they touch device code and that device is migrated by default
> >> then any one of the simple tests will catch the issue when it runs via
> >> the migration-compat job. If the device is not enabled by default, then
> >> no tests will catch it.
> >>
> >> The worst case scenario is they touch some code completely unrelated and
> >> their 'make check' or CI run breaks because of some race in the
> >> migration code. That's not what CI is for, that's just an annoyance for
> >> everyone. I'd rather it breaks in our hands and then we'll go fix it.
> >>
> >> >
> >> > IMHO we can definitely, at least, move the last two into slow list
> >> > (vcpu_dirty_limit and xbzrle), then it'll already save us 40sec each
> >> > run..
> >>
> >> Agreed. I'll send a patch once I get out from under downstream stuff.
> >>
> >> >
> >> >>
> >> >> 1- allows adding devices to QEMU cmdline for migration-test
> >> >> https://lore.kernel.org/r/20240523201922.28007-4-farosas@suse.de
> >> >>
> >>
>
--
Peter Xu