Re: Performance Issue with CXL-emulation
From: lokesh jaliminche
Subject: Re: Performance Issue with CXL-emulation
Date: Mon, 16 Oct 2023 15:37:49 -0700
Hi Jonathan,
Thanks for your quick and detailed response. I'll explore these
options further and assess whether I get any performance uptick.
Thanks & Regards,
Lokesh
On Mon, Oct 16, 2023 at 2:56 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Sun, 15 Oct 2023 10:39:46 -0700
> lokesh jaliminche <lokesh.jaliminche@gmail.com> wrote:
>
> > Hi Everyone,
> >
> > I am facing performance issues while copying data to a CXL device
> > emulated with QEMU: I get approximately 500 KB/s. Any suggestions on
> > how to improve this?
>
> Hi Lokesh,
>
> The focus of QEMU's CXL device emulation so far has been functionality,
> not performance. I'm in favour of work to improve the latter, but it isn't
> likely to be my focus - I can offer some pointers on where to look, though!
>
> The fundamental problem (probably) is that CXL address decoding for
> interleaving happens at sub-page granularity. That means we can't use page
> tables to perform the address lookups in hardware. Note this also has the
> side effect that KVM won't work if there is any chance you will run
> instructions out of the CXL memory - it's fine if you are interested in
> data only (DAX etc.). (I've had a note on my todo list for a while to add
> a warning message about the KVM limitations.)
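>
> To make that concrete, here is an illustrative HPA-offset to DPA-offset
> decode for a power-of-two interleave set (a sketch of the arithmetic the
> decoders implement, not QEMU's actual code; the function names are made
> up):
>
>     #include <stdint.h>
>
>     /* Interleave set of 'ways' devices, 'gran' bytes per chunk. */
>     static uint64_t hpa_off_to_dpa_off(uint64_t hpa_off,
>                                        unsigned ways, unsigned gran)
>     {
>         uint64_t chunk = hpa_off / gran;      /* chunk index in HPA space  */
>         uint64_t local = chunk / ways;        /* chunk index on one device */
>         return local * gran + hpa_off % gran; /* offset within the chunk   */
>     }
>
>     /* Which device in the set owns a given chunk. */
>     static unsigned hpa_off_to_target(uint64_t hpa_off,
>                                       unsigned ways, unsigned gran)
>     {
>         return (hpa_off / gran) % ways;
>     }
>
> With, say, gran = 256 and ways = 2, consecutive 256-byte chunks of a
> single 4KiB page alternate between two devices, so no single page-table
> entry can describe that page.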
>
> There have been a few discussions (mostly when we were debugging some TCG
> issues and considering KVM support) about how we 'might' be able to improve
> this. Those focused on a general 'fix', but there may be some
> lower-hanging fruit.
>
> The options I think might work are:
>
> 1) Special-case configurations where there is no interleave going on.
>    I'm not entirely sure how this would fit together, and it won't deal
>    with the more interesting cases - if it does work, I'd want it to be
>    minimally invasive, because those complex cases are the main focus of
>    testing etc. There is an extension of this where we handle interleave,
>    but only if the granularity is 4k or above (on an appropriately
>    configured host).
>
> 2) Add a caching layer to the CXL fixed memory windows. That would hold
>    copies of a number of recently accessed pages in a software cache and
>    set up the mappings for the hardware page-table walkers to find them.
>    If a page isn't cached, we'd trigger a page fault and have to bring it
>    into the cache. If the interleave configuration is touched, all caches
>    would need to be written back etc. This would need to be optional,
>    because I don't want to have to add cache coherency protocols etc. when
>    we add shared memory support (fun though it would be ;). A rough sketch
>    of this idea follows after the list of options.
>
> 3) It might be worth looking at the critical paths for lookups in your
>    configuration. Maybe we can optimize the address decoders (basically a
>    software TLB for HPA to DPA). I've not looked at the performance of
>    those paths. For your example the lookup is:
>    * CFMWS - nothing to do.
>    * Host bridge - nothing to do beyond a sanity check on range, I think.
>    * Root port - nothing to do.
>    * Type 3 device - basic range match.
>    So I'm not sure it is worthwhile - but you could do a really simple
>    test by detecting that no interleave is going on and caching the offset
>    needed to go from HPA to DPA, plus a device reference, the first time
>    cxl_cfmws_find_device() is called:
>    https://elixir.bootlin.com/qemu/latest/source/hw/cxl/cxl-host.c#L129
>    Then just match on hwaddr on subsequent calls of cxl_cfmws_find_device()
>    and return the device directly. Maybe also shortcut lookups in
>    cxl_type3_hpa_to_as_and_dpa(), which does the endpoint decoding part.
>    A quick hack would let you know if it was worth looking at something
>    more general; a sketch of that also follows below.
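>
> As promised under option 2, here is a conceptual sketch of that software
> page cache in plain C (not QEMU code; slow_read_page(), slow_write_page()
> and map_guest_page() are assumed helpers standing in for the slow decoder
> path and the mapping machinery):
>
>     #include <stdint.h>
>
>     #define PAGE_SIZE   4096
>     #define CACHE_PAGES 256
>
>     typedef struct {
>         uint64_t hpa_page;        /* guest page number this slot holds */
>         uint8_t  data[PAGE_SIZE]; /* RAM copy the hardware can map     */
>         int      valid, dirty;
>     } CachePage;
>
>     static CachePage cache[CACHE_PAGES];
>
>     /* Assumed helpers: move one page via the slow decoder path, and
>      * map a host-RAM page so hardware page-table walks hit the copy. */
>     void slow_read_page(uint64_t hpa_page, uint8_t *buf);
>     void slow_write_page(uint64_t hpa_page, const uint8_t *buf);
>     void map_guest_page(uint64_t hpa_page, uint8_t *host_ram);
>
>     /* Fault path: the guest touched an uncached page in a CXL window. */
>     uint8_t *cxl_cache_fault(uint64_t hpa)
>     {
>         uint64_t page = hpa / PAGE_SIZE;
>         CachePage *slot = &cache[page % CACHE_PAGES]; /* direct-mapped */
>
>         if (slot->valid && slot->hpa_page == page) {
>             return slot->data;                        /* already cached */
>         }
>         if (slot->valid && slot->dirty) {
>             slow_write_page(slot->hpa_page, slot->data); /* write back */
>         }
>         slow_read_page(page, slot->data);          /* fill from device */
>         slot->hpa_page = page;
>         slot->valid = 1;
>         slot->dirty = 1;      /* conservative: assume the guest writes */
>         map_guest_page(page, slot->data);
>         return slot->data;
>     }
>
> Touching the interleave configuration would mean writing back and
> invalidating every slot, which is part of why this would need to stay
> optional.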
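>
> And the option 3 quick hack might look something like this (again an
> illustrative sketch, not the real cxl_cfmws_find_device() signature;
> CXLType3Dev and full_decoder_walk() are stand-ins):
>
>     #include <stdbool.h>
>     #include <stdint.h>
>
>     typedef uint64_t hwaddr;
>     typedef struct CXLType3Dev CXLType3Dev;  /* opaque stand-in */
>
>     /* Assumed helper: the existing full decoder walk, also reporting
>      * the HPA range it resolved and whether it is interleaved. */
>     CXLType3Dev *full_decoder_walk(hwaddr addr, hwaddr *base,
>                                    hwaddr *len, bool *interleaved);
>
>     static struct {
>         hwaddr base, len;   /* HPA range the cached answer covers */
>         CXLType3Dev *dev;   /* device that range resolved to      */
>         bool valid;
>     } hit_cache;
>
>     CXLType3Dev *find_device_cached(hwaddr addr)
>     {
>         if (hit_cache.valid && addr - hit_cache.base < hit_cache.len) {
>             return hit_cache.dev;         /* fast path: skip the walk */
>         }
>         hwaddr base, len;
>         bool interleaved;
>         CXLType3Dev *dev = full_decoder_walk(addr, &base, &len,
>                                              &interleaved);
>         if (dev && !interleaved) {    /* only safe without interleave */
>             hit_cache.base = base;
>             hit_cache.len = len;
>             hit_cache.dev = dev;
>             hit_cache.valid = true;
>         }
>         return dev;
>     }
>
> The cache would need invalidating whenever decoder programming changes,
> and the same trick (caching the HPA-to-DPA offset alongside the device
> pointer) could shortcut the endpoint decode as well.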
>
> Gut feeling is this last approach might get you some perf uptick, but it
> isn't going to solve the fundamental problem that, in general, we can't do
> the translation in hardware (unlike most other memory accesses in QEMU).
>
> Note that I believe all writes to file-backed memory will go all the way
> to the file, so you might want to try backing it with RAM (e.g. a
> memory-backend-ram object instead of memory-backend-file) - but as with
> the above, that's not going to address the fundamental problem.
>
>
> Jonathan
>
> >
> > Steps to reproduce :
> > ===============
> > 1. QEMU Command:
> > sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
> >   -hda ./images/ubuntu-22.04-server-cloudimg-amd64.img \
> >   -hdb ./images/user-data.img \
> >   -M q35,cxl=on,accel=kvm,nvdimm=on \
> >   -smp 16 \
> >   -m 16G,maxmem=32G,slots=8 \
> >   -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/mnt/qemu_files/cxltest.raw,size=256M \
> >   -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/mnt/qemu_files/lsa.raw,size=256M \
> >   -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> >   -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> >   -device cxl-type3,bus=root_port13,persistent-memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
> >   -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
> >   -nographic
> >
> > 2. Configure device with fsdax mode
> > ubuntu@ubuntu:~$ cxl list
> > [
> >   {
> >     "memdevs":[
> >       {
> >         "memdev":"mem0",
> >         "pmem_size":268435456,
> >         "serial":0,
> >         "host":"0000:0d:00.0"
> >       }
> >     ]
> >   },
> >   {
> >     "regions":[
> >       {
> >         "region":"region0",
> >         "resource":45365592064,
> >         "size":268435456,
> >         "type":"pmem",
> >         "interleave_ways":1,
> >         "interleave_granularity":1024,
> >         "decode_state":"commit"
> >       }
> >     ]
> >   }
> > ]
> >
> > 3. Format the device with ext4 file system in dax mode
> >
> > 4. Write data to mounted device with dd
> >
> > ubuntu@ubuntu:~$ time sudo dd if=/dev/urandom
> > of=/home/ubuntu/mnt/pmem0/test bs=1M count=128
> > 128+0 records in
> > 128+0 records out
> > 134217728 bytes (134 MB, 128 MiB) copied, 244.802 s, 548 kB/s
> >
> > real 4m4.850s
> > user 0m0.014s
> > sys 0m0.013s
> >
> >
> > Thanks & Regards,
> > Lokesh
> >
>