Re: [BUG] cxl can not create region
From: Jonathan Cameron
Subject: Re: [BUG] cxl can not create region
Date: Mon, 10 Oct 2022 17:20:57 +0100
On Fri, 19 Aug 2022 09:46:55 +0100
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> On Thu, 18 Aug 2022 17:37:40 +0100
> Jonathan Cameron via <qemu-devel@nongnu.org> wrote:
>
> > On Wed, 17 Aug 2022 17:16:19 +0100
> > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> >
> > > On Thu, 11 Aug 2022 17:46:55 -0700
> > > Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > > Dan Williams wrote:
> > > > > Bobo WL wrote:
> > > > > > Hi Dan,
> > > > > >
> > > > > > Thanks for your reply!
> > > > > >
> > > > > > On Mon, Aug 8, 2022 at 11:58 PM Dan Williams
> > > > > > <dan.j.williams@intel.com> wrote:
> > > > > > >
> > > > > > > What is the output of:
> > > > > > >
> > > > > > > cxl list -MDTu -d decoder0.0
> > > > > > >
> > > > > > > > ...? It might be the case that mem1 cannot be mapped by
> > > > > > > > decoder0.0, or at least not in the specified order, or that
> > > > > > > > validation check is broken.
> > > > > >
> > > > > > Command "cxl list -MDTu -d decoder0.0" output:
> > > > >
> > > > > Thanks for this, I think I know the problem, but will try some
> > > > > experiments with cxl_test first.
> > > >
> > > > Hmm, so my cxl_test experiment unfortunately passed so I'm not
> > > > reproducing the failure mode. This is the result of creating x4 region
> > > > with devices directly attached to a single host-bridge:
> > > >
> > > > > # cxl create-region -d decoder3.5 -w 4 -m -g 256 mem{12,10,9,11} -s $((1<<30))
> > > > {
> > > > "region":"region8",
> > > > "resource":"0xf1f0000000",
> > > > "size":"1024.00 MiB (1073.74 MB)",
> > > > "interleave_ways":4,
> > > > "interleave_granularity":256,
> > > > "decode_state":"commit",
> > > > "mappings":[
> > > > {
> > > > "position":3,
> > > > "memdev":"mem11",
> > > > "decoder":"decoder21.0"
> > > > },
> > > > {
> > > > "position":2,
> > > > "memdev":"mem9",
> > > > "decoder":"decoder19.0"
> > > > },
> > > > {
> > > > "position":1,
> > > > "memdev":"mem10",
> > > > "decoder":"decoder20.0"
> > > > },
> > > > {
> > > > "position":0,
> > > > "memdev":"mem12",
> > > > "decoder":"decoder22.0"
> > > > }
> > > > ]
> > > > }
> > > > cxl region: cmd_create_region: created 1 region
> > > >
> > > > > Did the commit_store() crash stop reproducing with latest cxl/preview
> > > > > branch?
> > > >
> > > > I missed the answer to this question.
> > > >
> > > > > All of these changes are now in Linus' tree; perhaps give that a try
> > > > > and post the debug log again?
> > >
> > > Hi Dan,
> > >
> > > > I've moved on to looking at this one.
> > > > 1 HB, 2 RP (to make it configure the HDM decoder in the QEMU HB, I'll tidy
> > > > that up at some stage), 1 switch, 4 downstream switch ports each with a
> > > > type 3 device.
> > > >
> > > > I'm not getting a crash, but can't successfully set up a region.
> > > > Upon adding the final target, it's failing in check_last_peer() as
> > > > pos < distance.
> > > > Seems distance is 4, which makes me think it's using the wrong level of
> > > > the hierarchy for some reason or that the distance check is wrong.
> > > > Wasn't a good idea to just skip that step though as it goes boom, though
> > > > the stack trace is not useful.
> >
> > > Turns out really weird corruption happens if you accidentally back two
> > > type3 devices with the same memory device. Who would have thought it :)
> > >
> > > That aside, ignoring the check_last_peer() failure seems to make everything
> > > work for this topology. I'm not seeing the crash, so my guess is we fixed
> > > it somewhere along the way.
> >
> > Now for the fun one. I've replicated the crash if we have
> >
> > 1HB 1*RP 1SW, 4SW-DSP, 4Type3
> >
> > > Now, I'd expect to see it not 'work' because the QEMU HDM decoder won't be
> > > programmed, but the null pointer dereference isn't related to that.
> > >
> > > The bug is straightforward. Not all decoders have commit callbacks...
> > > Will send out a possible fix shortly.
> >
> For completeness I'm carrying this hack because I haven't gotten my head
> around the right fix for check_last_peer() failing on this test topology.
>
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index c49d9a5f1091..275e143bd748 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -978,7 +978,7 @@ static int cxl_port_setup_targets(struct cxl_port *port,
> 				rc = check_last_peer(cxled, ep, cxl_rr,
> 						     distance);
> 				if (rc)
> -					return rc;
> +					// return rc;
> 				goto out_target_set;
> 			}
> 		goto add_target;
I'm still carrying this hack and still haven't worked out the right fix.
Suggestions welcome! If not, I'll hopefully get some time on this
towards the end of the week.
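
To make the failing condition concrete, here is a minimal standalone model of
the check, not the kernel code. The function name, the dport_of_pos[] array and
the -ENXIO return value are assumptions made for illustration; only the
"pos < distance" test is taken from the report above, and the same-dport
comparison is an assumed second condition.

```c
#include <assert.h>
#include <errno.h>

/*
 * Simplified, hypothetical model of the peer check; NOT the kernel code.
 * dport_of_pos[i] is the id of the downstream port (of the port being
 * programmed) through which the endpoint at interleave position i is reached.
 */
static int model_check_last_peer(const int *dport_of_pos, int pos, int distance)
{
	assert(pos >= 0 && distance >= 0);

	/* Condition reported in this thread: no peer exists 'distance' back. */
	if (pos < distance)
		return -ENXIO;

	/* Assumed second condition: the peer must sit behind the same dport. */
	if (dport_of_pos[pos] != dport_of_pos[pos - distance])
		return -ENXIO;

	return 0;
}
```

In an x4 region the positions are 0..3, so with distance == 4 no position can
satisfy pos >= distance and the model rejects every endpoint that shares a
dport. With distance == 1 and all four endpoints behind one dport, positions
1 to 3 pass. That is consistent with the suspicion that distance is being
computed at the wrong level of the hierarchy for this topology.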
Jonathan
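
For reference, the "not all decoders have commit callbacks" diagnosis quoted
above describes a common pattern: an optional function pointer called without a
NULL check. A minimal sketch of the guard follows. The struct, its fields and
both function names are hypothetical, and treating a missing callback as
success is an assumption, not necessarily what the actual fix does.

```c
#include <stddef.h>

/* Hypothetical stand-in for a decoder with an optional commit callback. */
struct model_decoder {
	int committed;
	int (*commit)(struct model_decoder *dec);	/* may be NULL */
};

/* Example callback for a decoder that does have something to program. */
static int model_commit_ok(struct model_decoder *dec)
{
	dec->committed = 1;
	return 0;
}

/* Call the callback only if present; assume "nothing to do" means success. */
static int model_commit_decoder(struct model_decoder *dec)
{
	if (!dec->commit)
		return 0;
	return dec->commit(dec);
}
```

A decoder initialised with .commit = NULL passes through model_commit_decoder()
untouched, while one initialised with .commit = model_commit_ok is marked
committed.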