[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [RFC PATCH 3/5] cxl/core: introduce cxl_mem_report_poison()
From: |
Tony Luck |
Subject: |
Re: [RFC PATCH 3/5] cxl/core: introduce cxl_mem_report_poison() |
Date: |
Wed, 14 Feb 2024 17:19:36 -0800 |
On Fri, Feb 09, 2024 at 07:54:15PM +0800, Shiyang Ruan wrote:
> If poison is detected(reported from cxl memdev), OS should be notified to
> handle it. Introduce this function:
> 1. translate DPA to HPA;
> 2. construct a MCE instance; (TODO: more details need to be filled)
> 3. log it into MCE event queue;
>
> After that, MCE mechanism can walk over its notifier chain to execute
> specific handlers.
This looks like a useful proof of concept patch to pass errors to all
the existing logging systems (console, mcelog, rasdaemon, EDAC). But
it's a bare minimum (just passing the address and dropping any other
interesting information about the error). I think we need something
more advanced that covers more CXL error types.
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
> arch/x86/kernel/cpu/mce/core.c | 1 +
> drivers/cxl/core/mbox.c | 33 +++++++++++++++++++++++++++++++++
> 2 files changed, 34 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index bc39252bc54f..a64c0aceb7e0 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -131,6 +131,7 @@ void mce_setup(struct mce *m)
> m->ppin = cpu_data(m->extcpu).ppin;
> m->microcode = boot_cpu_data.microcode;
> }
> +EXPORT_SYMBOL_GPL(mce_setup);
>
> DEFINE_PER_CPU(struct mce, injectm);
> EXPORT_PER_CPU_SYMBOL_GPL(injectm);
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 27166a411705..f9b6f50fbe80 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -4,6 +4,7 @@
> #include <linux/debugfs.h>
> #include <linux/ktime.h>
> #include <linux/mutex.h>
> +#include <asm/mce.h>
> #include <asm/unaligned.h>
> #include <cxlpci.h>
> #include <cxlmem.h>
> @@ -1290,6 +1291,38 @@ int cxl_set_timestamp(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_set_timestamp, CXL);
>
> +static void cxl_mem_report_poison(struct cxl_memdev *cxlmd,
> + struct cxl_poison_record *poison)
> +{
> + struct mce m;
> + u64 dpa = le64_to_cpu(poison->address) & CXL_POISON_START_MASK;
> + u64 len = le64_to_cpu(poison->length), i;
> + phys_addr_t phys_addr = cxl_memdev_dpa_to_hpa(cxlmd, dpa);
> +
> + if (phys_addr)
> + return;
> +
> + /*
> + * Initialize struct mce. Call preempt_disable() to avoid
> + * "BUG: using smp_processor_id() in preemptible" for now, not sure
> + * if this is a correct way.
> + */
> + preempt_disable();
> + mce_setup(&m);
> + preempt_enable();
> +
> + m.bank = -1;
> + /* Fake a memory read error with unknown channel */
> + m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV |
> + MCI_STATUS_MISCV | 0x9f;
> + m.misc = (MCI_MISC_ADDR_PHYS << 6);
> +
> + for (i = 0; i < len; i++) {
> + m.addr = phys_addr++;
> + mce_log(&m);
This loop looks wrong. What values do you expect for "len" (a.k.a.
poison->length)? Creating one log for each byte in the range will
be very noisy!
> + }
> +}
> +
> int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len,
> struct cxl_region *cxlr)
> {
> --
> 2.34.1
-Tony
- [RFC PATCH SET] cxl: add poison event handler, Shiyang Ruan, 2024/02/09
- [RFC PATCH 5/5] cxl/core: add poison injection event handler, Shiyang Ruan, 2024/02/09
- [RFC PATCH 1/2] hw/cxl/type3: add missing flag bit for GMER, Shiyang Ruan, 2024/02/09
- [RFC PATCH 2/5] cxl/core: introduce cxl_memdev_dpa_to_hpa(), Shiyang Ruan, 2024/02/09
- [RFC PATCH 3/5] cxl/core: introduce cxl_mem_report_poison(), Shiyang Ruan, 2024/02/09
- [RFC PATCH 1/5] cxl/core: correct length of DPA field masks, Shiyang Ruan, 2024/02/09
- [RFC PATCH 2/2] hw/cxl/type3: send a GMER while injecting poison, Shiyang Ruan, 2024/02/09
- [RFC PATCH 4/5] cxl/core: add report option for cxl_mem_get_poison(), Shiyang Ruan, 2024/02/09
- Re: [RFC PATCH SET] cxl: add poison event handler, Dave Jiang, 2024/02/12