[bug #64792] Bad IPMI DCMI response from Huawei and Xfusion BMCs
From: Ole Holm Nielsen
Subject: [bug #64792] Bad IPMI DCMI response from Huawei and Xfusion BMCs
Date: Thu, 19 Oct 2023 04:15:02 -0400 (EDT)
URL:
<https://savannah.gnu.org/bugs/?64792>
Summary: Bad IPMI DCMI response from Huawei and Xfusion BMCs
Group: GNU FreeIPMI
Submitter: oleholmnielsen
Submitted: Thu 19 Oct 2023 08:15:00 AM UTC
Category: None
Severity: 3 - Normal
Priority: 5 - Normal
Item Group: None
Status: None
Privacy: Public
Assigned to: None
Open/Closed: Open
Discussion Lock: Any
Operating System: None
_______________________________________________________
Follow-up Comments:
-------------------------------------------------------
Date: Thu 19 Oct 2023 08:15:00 AM UTC By: Ole Holm Nielsen <oleholmnielsen>
We have successfully integrated the development FreeIPMI version 1.7.0 in our
Linux cluster with the Slurm resource manager. My test is described in
https://bugs.schedmd.com/show_bug.cgi?id=17639#c55 and I have documented the
FreeIPMI setup in my Slurm Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#freeipmi-issues
Now we would like to deploy Slurm including the FreeIPMI power monitoring, but
we have discovered a snag:
We have 196 older Huawei XH620 V3 nodes (Intel Broadwell) whose BMC doesn't
seem to support the IPMI DCMI extensions. A colleague at another university
has the same problem with brand new Xfusion FusionOne HPC 1288H V6 servers
(Intel IceLake, essentially rebranded Huawei servers) even though the server's
BMC is documented to support DCMI 1.5!
On the Huawei and Xfusion nodes we get this error message:
$ ipmi-dcmi --get-system-power-statistics
ipmi_cmd_dcmi_get_power_reading: command invalid or unsupported
Because of this error, slurmd spams slurmd.log with the following message
every minute:
error: _get_dcmi_power_reading: get DCMI power reading failed
I've tried to find out how to query the Huawei BMC with IPMI DCMI but I only
get error messages:
$ ipmi-dcmi --get-dcmi-capability-info
ipmi_cmd_dcmi_get_dcmi_capability_info_supported_dcmi_capabilities: bad completion code
I also tried each of the WORKAROUNDS listed in the ipmi-dcmi manual page, but
every one of them returns the same error.
The debug option gives some details:
$ ipmi-dcmi --get-dcmi-capability-info --debug
=====================================================
Group Extension - Get DCMI Capability Info Request
=====================================================
[ 1h] = cmd[ 8b]
[ DCh] = group_extension_identification[ 8b]
[ 1h] = parameter_selector[ 8b]
=====================================================
Group Extension - Get DCMI Capability Info Response
=====================================================
[ 1h] = cmd[ 8b]
[ D6h] = comp_code[ 8b]
ipmi_cmd_dcmi_get_dcmi_capability_info_supported_dcmi_capabilities: bad completion code
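For readers comparing debug traces from other BMCs, the comp_code field in the response above can be pulled out and decoded with a small shell helper. This is only a sketch: the function names extract_comp_code and describe_comp_code are hypothetical, the table covers just a few generic completion codes, and the descriptions are paraphrased from memory of the IPMI 2.0 generic completion code table, so they should be checked against the specification before being relied on.

```shell
#!/bin/sh
# Hypothetical helpers, not part of FreeIPMI.

# Read FreeIPMI --debug output on stdin and print the hex value of the
# comp_code field (without the trailing "h"), e.g. "D6".
extract_comp_code() {
    sed -n 's/^\[ *\([0-9A-Fa-f]*\)h] = comp_code\[.*$/\1/p'
}

# Map a completion code to a short description.
# Assumption: texts are paraphrased from the IPMI 2.0 generic completion
# code table; only a small subset is listed.
describe_comp_code() {
    code=$(printf '%s' "$1" | tr 'a-f' 'A-F')
    case "$code" in
        00) echo "command completed normally" ;;
        C1) echo "invalid command" ;;
        C9) echo "parameter out of range" ;;
        CC) echo "invalid data field in request" ;;
        D4) echo "insufficient privilege level" ;;
        D5) echo "command or parameter not supported in present state" ;;
        D6) echo "command sub-function disabled or unavailable" ;;
        FF) echo "unspecified error" ;;
        *)  echo "unknown or not listed here" ;;
    esac
}
```

A possible use, assuming the debug trace is printed in the format shown above, is to pipe the output of "ipmi-dcmi --get-dcmi-capability-info --debug" through extract_comp_code and pass the result to describe_comp_code.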
The non-DCMI commands seem to be working correctly. For example, I can read
the system power:
$ ipmi-sensors -t Power_Unit
ID | Name | Type | Reading | Units | Event
22 | Power | Power Unit | 296.00 | W | 'OK'
(lines deleted)
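Since the sensor reading is available where the DCMI command is not, one conceivable stopgap is to take the wattage from the ipmi-sensors output. The sketch below is an assumption-laden illustration, not something FreeIPMI or Slurm provides: power_watts is a hypothetical name, and the parsing assumes the pipe-separated column layout and the sensor name "Power" exactly as shown above, which may differ on other BMCs.

```shell
#!/bin/sh
# Hypothetical helper: read ipmi-sensors output on stdin and print the
# reading of the first sensor named "Power" whose units are "W".
power_watts() {
    awk -F'|' '
        {
            name = $2; reading = $4; units = $5
            # Trim the padding around each column.
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", reading)
            gsub(/^[ \t]+|[ \t]+$/, "", units)
            if (name == "Power" && units == "W") { print reading; exit }
        }'
}
```

It would be used as "ipmi-sensors -t Power_Unit | power_watts". Matching on both the name and the units column is a deliberate choice so that the header line and unrelated sensors are skipped.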
Question: Would it be feasible to implement a WORKAROUND for Huawei and Xfusion
servers? If so, how can we help by providing debugging information?
Or is there some other way to get the DCMI extensions to work?
Thanks a lot,
Ole
_______________________________________________________
File Attachments:
-------------------------------------------------------
Date: Thu 19 Oct 2023 08:15:00 AM UTC Name: bmc-info.log Size: 2KiB By: oleholmnielsen
Output from bmc-info
<http://savannah.gnu.org/bugs/download.php?file_id=55257>
_______________________________________________________
Reply to this item at:
<https://savannah.gnu.org/bugs/?64792>