From: Peter Xu
Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization of qpl compression
Date: Fri, 22 Mar 2024 12:40:32 -0400
On Fri, Mar 22, 2024 at 02:47:02PM +0000, Liu, Yuan1 wrote:
> > -----Original Message-----
> > From: Liu, Yuan1
> > Sent: Friday, March 22, 2024 10:07 AM
> > To: Peter Xu <peterx@redhat.com>
> > Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de;
> > qemu-devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com;
> > Zou, Nanhai <nanhai.zou@intel.com>
> > Subject: RE: [PATCH v5 5/7] migration/multifd: implement initialization
> > of qpl compression
> >
> > > -----Original Message-----
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Thursday, March 21, 2024 11:28 PM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de;
> > > qemu-devel@nongnu.org; hao.xiang@bytedance.com; bryan.zhang@bytedance.com;
> > > Zou, Nanhai <nanhai.zou@intel.com>
> > > Subject: Re: [PATCH v5 5/7] migration/multifd: implement initialization
> > > of qpl compression
> > >
> > > On Thu, Mar 21, 2024 at 01:37:36AM +0000, Liu, Yuan1 wrote:
> > > > > -----Original Message-----
> > > > > From: Peter Xu <peterx@redhat.com>
> > > > > Sent: Thursday, March 21, 2024 4:32 AM
> > > > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > > > Cc: Daniel P. Berrangé <berrange@redhat.com>; farosas@suse.de;
> > > > > qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> > > > > bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> > > > > Subject: Re: [PATCH v5 5/7] migration/multifd: implement
> > > > > initialization of qpl compression
> > > > >
> > > > > On Wed, Mar 20, 2024 at 04:23:01PM +0000, Liu, Yuan1 wrote:
> > > > > > Let me explain here: during the decompression operation of IAA,
> > > > > > the decompressed data can be output directly to the virtual
> > > > > > address of the guest memory by the IAA hardware. This avoids
> > > > > > copying the decompressed data to guest memory with the CPU.
> > > > >
> > > > > I see.
> > > > >
> > > > > > Without -mem-prealloc, none of the guest memory is populated,
> > > > > > so the IAA hardware needs to trigger an I/O page fault first and
> > > > > > then output the decompressed data to the guest memory region.
> > > > > > Besides that, CPU page faults will also trigger IOTLB flush
> > > > > > operations when IAA devices use SVM.
> > > > >
> > > > > Oh so the IAA hardware already can use CPU pgtables? Nice..
> > > > >
> > > > > Why is an IOTLB flush needed? AFAIU we're only installing new
> > > > > pages; the request can come either from a CPU access or a DMA. In
> > > > > all cases there should be no tearing down of an old page. Isn't an
> > > > > IOTLB flush only needed if a tear-down happens?
> > > >
> > > > As far as I know, IAA hardware uses SVM technology to use the
> > > > CPU's page table for address translation (IOMMU scalable mode
> > > > directly accesses the CPU page table). Therefore, when the CPU page
> > > > table changes, the device's invalidation operation needs to be
> > > > triggered to update the IOMMU and the device's caches.
> > > >
> > > > My current kernel version is mainline 6.2. The issue I see is as
> > > > follows:
> > > >
> > > >   handle_mm_fault
> > > >     -> wp_page_copy
> > >
> > > This is the CoW path. Not usual at all..
> > >
> > > I assume this issue should only be present on the destination. The
> > > guest pages should then be the destination of such DMAs, which means
> > > these should be write faults, and as we see here it is; otherwise it
> > > wouldn't trigger a CoW.
> > >
> > > However it's not clear to me why a pre-installed zero page existed.
> > > It means someone read the guest pages first.
> > >
> > > It might be interesting to know _why_ someone reads the guest pages,
> > > even if we know they're all zeros. If we can avoid such reads then
> > > it'll be a hole rather than a prefaulted read of the zero page; then
> > > the invalidations are not needed, and I expect that should fix the
> > > IOTLB storm issue.
> >
> > The received pages are first checked for zero pages. Although these
> > pages are zero pages, and the IAA hardware will not access them, the
> > CoW happens and causes the following IOTLB flush operation. As far as
> > I know, the IOMMU can quickly detect whether an address range has been
> > used by the device, and avoid invalidating addresses that the device
> > never used, but this has not yet been resolved in Linux kernel 6.2. I
> > will check the latest status for this.
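[To make the CoW discussion above concrete, here is a small Linux-only
sketch (this is an illustrative addition, not part of the original mail;
it assumes mainline Linux mmap/mincore semantics) showing how a mere read
of an untouched anonymous page installs the shared zero page, so a later
write must take the wp_page_copy path, whereas an untouched page stays a
hole:]

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return 1 if the page at addr currently has a present PTE, 0 if it is
 * still a hole, -1 on error. */
static int page_resident(void *addr)
{
    unsigned char vec = 0;
    long psz = sysconf(_SC_PAGESIZE);

    if (mincore(addr, (size_t)psz, &vec) != 0) {
        return -1;
    }
    return vec & 1;
}

/* Read-then-write on a fresh anonymous page: the read fault installs the
 * shared zero page, so the later write takes the CoW (wp_page_copy)
 * path, which with SVM is what triggers the IOTLB invalidation discussed
 * above. Skipping the read would leave a hole, and the write fault would
 * just install a new page with no invalidation. */
static void demo_zero_page_cow(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    volatile char *p = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    assert((void *)p != MAP_FAILED);
    assert(page_resident((void *)p) == 0);  /* hole: no PTE yet */

    (void)p[0];                             /* read fault: zero page */
    assert(page_resident((void *)p) == 1);  /* PTE now present */

    p[0] = 1;                               /* write now breaks CoW */
    munmap((void *)p, (size_t)psz);
}
```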
>
> I checked the Linux mainline 6.8 code; there are no big changes for
> this. In version 6.8, if the process needs to flush the MMU TLB, an I/O
> TLB flush will also be triggered when the process has SVM devices. I
> haven't found any code that checks whether pages have the EA
> (Extended-Accessed) bit set before submitting invalidation operations;
> this is the same as in version 6.2.
>
> VT-d spec, section 3.6.2:
>
> If the Extended-Accessed-Flag-Enable (EAFE) is 1 in a scalable-mode
> PASID-table entry that references a first-stage paging-structure entry
> used by the remapping hardware, it atomically sets the EA field in that
> entry. Whenever the EA field is atomically set, the A field is also set
> in the same atomic operation. For software usages where the first-stage
> paging structures are shared across heterogeneous agents (e.g., CPUs and
> accelerator devices such as GPUs), the EA flag may be used by software
> to identify pages accessed by non-CPU agent(s) (as opposed to the A
> flag, which indicates access by any agent sharing the paging
> structures).
This seems like a pretty new hardware feature. I didn't check in depth,
but what you said makes sense.
>
> > void multifd_recv_zero_page_process(MultiFDRecvParams *p)
> > {
> >     for (int i = 0; i < p->zero_num; i++) {
> >         void *page = p->host + p->zero[i];
> >
> >         if (!buffer_is_zero(page, p->page_size)) {
> >             memset(page, 0, p->page_size);
> >         }
> >     }
> > }
It may not matter much (where I also see your below comments), but just
to mention another solution to avoid this read: we can maintain
RAMBlock->receivedmap for precopy (especially multifd; afaiu multifd
doesn't yet update this bitmap, even though normal precopy does), then
here, instead of scanning every time, maybe we can do:

    /*
     * If it's the 1st time receiving it, no need to clear it as it must
     * be all zeros now.
     */
    if (bitmap_test(rb->receivedmap, page_offset)) {
        memset(page, 0, ...);
    } else {
        bitmap_set(rb->receivedmap, page_offset);
    }

And we also always set the bit when !zero too.

My rationale is that it's unlikely to be a zero page if it has been sent
once or more, while OTOH the 1st time we receive it, it must be a zero
page, so there is no need to scan on the 1st round.
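[A compilable, self-contained sketch of the receivedmap idea above (this
is an illustrative addition, not from the original mail; `RAMBlockSketch`,
`recv_zero_page`, `recv_normal_page`, and the bitmap helpers are stand-ins
for the real QEMU structures and helpers):]

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for the real QEMU types: a tiny guest-RAM block plus a
 * bitmap recording which pages have been received before. */
#define PAGE_SIZE     4096
#define NUM_PAGES     8
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

typedef struct {
    unsigned char host[NUM_PAGES * PAGE_SIZE];   /* guest RAM stand-in */
    unsigned long receivedmap[(NUM_PAGES + BITS_PER_LONG - 1) /
                              BITS_PER_LONG];
} RAMBlockSketch;

static bool bitmap_test(const unsigned long *map, size_t bit)
{
    return map[bit / BITS_PER_LONG] & (1UL << (bit % BITS_PER_LONG));
}

static void bitmap_set_bit(unsigned long *map, size_t bit)
{
    map[bit / BITS_PER_LONG] |= 1UL << (bit % BITS_PER_LONG);
}

/* Source reported page 'idx' as a zero page. */
static void recv_zero_page(RAMBlockSketch *rb, size_t idx)
{
    if (bitmap_test(rb->receivedmap, idx)) {
        /* Seen before: it may hold stale data, so clear it. */
        memset(rb->host + idx * PAGE_SIZE, 0, PAGE_SIZE);
    } else {
        /* First receipt: the page was never touched, hence already
         * zero; just mark it, no read or write of the page needed. */
        bitmap_set_bit(rb->receivedmap, idx);
    }
}

/* A normal (non-zero) page payload landed at 'idx'. */
static void recv_normal_page(RAMBlockSketch *rb, size_t idx,
                             const unsigned char *data)
{
    memcpy(rb->host + idx * PAGE_SIZE, data, PAGE_SIZE);
    bitmap_set_bit(rb->receivedmap, idx);
}
```

The point of the design is that the first receipt of a zero page touches
neither the page nor its PTE, so no zero-page mapping is installed and a
later DMA write faults on a hole instead of breaking CoW.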
> >
> >
> > > It'll still be good if we can fix this first so that qpl is not
> > > special in this regard; the hope is that the migration submodule
> > > shouldn't rely on any pre-configuration (-mem-prealloc) of guest
> > > memory behaviors to work properly.
> >
> > Even if the IOTLB problem can be avoided, the I/O page fault problem
> > remains (normal pages are loaded by the IAA device, and resolving
> > normal page faults through the IOMMU does not perform well).
Do you have a rough estimate of how slow that could be? It'll be good to
mention some details in the doc file in that case.
> >
> > We could let the decompressed data of the IAA device be output to
> > pre-populated memory instead of directly to the guest address, but
> > then each multifd thread needs two memory copies: one from the network
> > to the IAA input memory (pre-populated), and another from the IAA
> > output memory (pre-populated) to the guest address. This may become a
> > performance bottleneck at the destination during the live migration
> > process.
> >
> > So I think it is still necessary to use the -mem-prealloc option
Right, that complexity may not be necessary; in that case, maybe such a
suggestion is fine.
Thanks,
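[As an aside, the effect -mem-prealloc relies on can be approximated in a
standalone program (this is an illustrative addition, not from the
original mail; it assumes Linux mmap/mincore semantics): prefault every
page up front so a device DMA into the region never hits an I/O page
fault on first touch.]

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Approximation of what -mem-prealloc buys: MAP_POPULATE makes the
 * kernel prefault every page at mmap() time. Returns 1 when all pages
 * are resident right after the mmap(), 0 otherwise, -1 on error.
 */
static int populate_and_check(size_t npages)
{
    long psz = sysconf(_SC_PAGESIZE);
    size_t len = npages * (size_t)psz;
    unsigned char vec[64];          /* one byte per page for mincore() */
    int ok = 1;

    if (npages == 0 || npages > sizeof(vec)) {
        return -1;
    }
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (buf == MAP_FAILED) {
        return -1;
    }
    if (mincore(buf, len, vec) != 0) {
        munmap(buf, len);
        return -1;
    }
    for (size_t i = 0; i < npages; i++) {
        if (!(vec[i] & 1)) {        /* not resident: DMA would fault */
            ok = 0;
            break;
        }
    }
    munmap(buf, len);
    return ok;
}
```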
> >
> > > >       -> mmu_notifier_invalidate_range
> > > >            -> intel_invalidate_range
> > > >                 -> qi_flush_piotlb
> > > >                 -> qi_flush_dev_iotlb_pasid
> > >
> > > --
> > > Peter Xu
>
--
Peter Xu