Re: [Qemu-devel] vfio failure with intel 760p 128GB nvme
From: Dongli Zhang
Subject: Re: [Qemu-devel] vfio failure with intel 760p 128GB nvme
Date: Thu, 27 Dec 2018 23:15:25 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.3.0
Hi Alex,
On 12/27/2018 10:20 PM, Alex Williamson wrote:
> On Thu, 27 Dec 2018 20:30:48 +0800
> Dongli Zhang <address@hidden> wrote:
>
>> Hi Alex,
>>
>> On 12/02/2018 09:29 AM, Dongli Zhang wrote:
>>> Hi Alex,
>>>
>>> On 12/02/2018 03:29 AM, Alex Williamson wrote:
>>>> On Sat, 1 Dec 2018 10:52:21 -0800 (PST)
>>>> Dongli Zhang <address@hidden> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I obtained below error when assigning an intel 760p 128GB nvme to guest
>>>>> via
>>>>> vfio on my desktop:
>>>>>
>>>>> qemu-system-x86_64: -device vfio-pci,host=0000:01:00.0: vfio
>>>>> 0000:01:00.0: failed to add PCI capability address@hidden: table & pba
>>>>> overlap, or they don't fit in BARs, or don't align
>>>>>
>>>>>
>>>>> This is because the MSI-X table overlaps with the PBA. According to the
>>>>> 'lspci -vv' output from the host below, the distance between the MSI-X
>>>>> table offset and the PBA offset is only 0x100, although 22 entries are
>>>>> supported (22 entries need 0x160). It looks like qemu supports at most
>>>>> 0x800.
>>>>>
>>>>> # sudo lspci -vv
>>>>> ... ...
>>>>> 01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a6
>>>>> (rev 03) (prog-if 02 [NVM Express])
>>>>> Subsystem: Intel Corporation Device 390b
>>>>> ... ...
>>>>> Capabilities: [b0] MSI-X: Enable- Count=22 Masked-
>>>>> Vector table: BAR=0 offset=00002000
>>>>> PBA: BAR=0 offset=00002100
>>>>>
>>>>>
>>>>>
>>>>> A patch below could workaround the issue and passthrough nvme
>>>>> successfully.
>>>>>
>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>> index 5c7bd96..54fc25e 100644
>>>>> --- a/hw/vfio/pci.c
>>>>> +++ b/hw/vfio/pci.c
>>>>> @@ -1510,6 +1510,11 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
>>>>>      msix->pba_offset = pba & ~PCI_MSIX_FLAGS_BIRMASK;
>>>>>      msix->entries = (ctrl & PCI_MSIX_FLAGS_QSIZE) + 1;
>>>>>
>>>>> +    if (msix->table_bar == msix->pba_bar &&
>>>>> +        msix->table_offset + msix->entries * PCI_MSIX_ENTRY_SIZE > msix->pba_offset) {
>>>>> +        msix->entries = (msix->pba_offset - msix->table_offset) /
>>>>> +                        PCI_MSIX_ENTRY_SIZE;
>>>>> +    }
>>>>> +
>>>>>      /*
>>>>>       * Test the size of the pba_offset variable and catch if it extends outside
>>>>>       * of the specified BAR. If it is the case, we need to apply a hardware
>>>>>
>>>>>
>>>>> Would you please help confirm whether this should be regarded as a bug in
>>>>> qemu, or as an issue with the nvme hardware? Should we fix this in qemu,
>>>>> or should we simply never use such buggy hardware with vfio?
>>>>
>>>> It's a hardware bug, is there perhaps a firmware update for the device
>>>> that resolves it? It's curious that a vector table size of 0x100 gives
>>>> us 16 entries and 22 in hex is 0x16 (table size would be reported as
>>>> 0x15 for the N-1 algorithm). I wonder if there's a hex vs decimal
>>>> mismatch going on. We don't really know if the workaround above is
>>>> correct, are there really 16 entries or maybe does the PBA actually
>>>> start at a different offset? We wouldn't want to generically assume
>>>> one or the other. I think we need Intel to tell us in which way their
>>>> hardware is broken and whether it can or is already fixed in a firmware
>>>> update. Thanks,
>>>
>>> Thank you very much for the confirmation.
>>>
>>> Just realized this would likely cause trouble on my desktop as well once 17
>>> vectors are used.
>>>
>>> I will report to intel and confirm how this can happen and if there is any
>>> firmware update available for this issue.
>>>
>>
>> I found a similar issue reported against kvm:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=202055
>>
>>
>> I checked my environment again. By default, the msi-x count is 16.
>>
>> Capabilities: [b0] MSI-X: Enable+ Count=16 Masked-
>> Vector table: BAR=0 offset=00002000
>> PBA: BAR=0 offset=00002100
>>
>>
>> The count is still 16 after the device is assigned to vfio (Enable- now):
>>
>> # echo 0000:01:00.0 > /sys/bus/pci/devices/0000\:01\:00.0/driver/unbind
>> # echo "8086 f1a6" > /sys/bus/pci/drivers/vfio-pci/new_id
>>
>> Capabilities: [b0] MSI-X: Enable- Count=16 Masked-
>> Vector table: BAR=0 offset=00002000
>> PBA: BAR=0 offset=00002100
>>
>>
>> After I boot qemu with "-device vfio-pci,host=0000:01:00.0", the count
>> becomes 22.
>>
>> Capabilities: [b0] MSI-X: Enable- Count=22 Masked-
>> Vector table: BAR=0 offset=00002000
>> PBA: BAR=0 offset=00002100
>>
>>
>>
>> Another interesting observation: the vfio-based userspace nvme driver also
>> changes the count from 16 to 22.
>>
>> I rebooted the host and the count was reset to 16. Then I booted the VM with
>> "-drive file=nvme://0000:01:00.0/1,if=none,id=nvmedrive0 -device
>> virtio-blk,drive=nvmedrive0,id=nvmevirtio0". As the userspace nvme driver
>> uses a different vfio path, it boots successfully without issue.
>>
>> However, the count becomes 22 then:
>>
>> Capabilities: [b0] MSI-X: Enable- Count=22 Masked-
>> Vector table: BAR=0 offset=00002000
>> PBA: BAR=0 offset=00002100
>>
>>
>> Both vfio passthrough and the userspace nvme driver (which is based on vfio)
>> change the count from 16 to 22.
>
> Yes, we've found in the bz you mention that it's resetting the device
> via FLR that causes the device to report a bogus interrupt count. The
> vfio-pci driver will always perform an FLR on the device before
> providing it to the user, so whether it's directly assigned with
> vfio-pci in QEMU or exposed as an nvme drive via nvme://, it will go
> through the same FLR path. It looks like we need yet another device
> specific reset for nvme. Ideally we could figure out how to recover
> the device after an FLR, but potentially we could reset the nvme
> controller rather than the PCI interface. It's becoming a problem that
> so many nvme controllers have broken FLR implementations. Thanks,
>
> Alex
>
I instrumented qemu and linux a little bit and narrowed it down as below.
On the qemu side, the count changes from 16 to 22 after line 1438, which is the
VFIO_GROUP_GET_DEVICE_FD ioctl.
1432 int vfio_get_device(VFIOGroup *group, const char *name,
1433                     VFIODevice *vbasedev, Error **errp)
1434 {
1435     struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
1436     int ret, fd;
1437
1438     fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
1439     if (fd < 0) {
1440         error_setg_errno(errp, errno, "error getting device from group %d",
1441                          group->groupid);
1442         error_append_hint(errp,
1443                           "Verify all devices in group %d are bound to vfio-<bus> "
1444                           "or pci-stub and not already in use\n", group->groupid);
1445         return fd;
1446
On the linux kernel side, the count changes from 16 to 22 in vfio_pci_enable():
the value is 16 before the pci_try_reset_function() call at line 231, and 22
after it.
 226     ret = pci_enable_device(pdev);
 227     if (ret)
 228         return ret;
 229
 230     /* If reset fails because of the device lock, fail this path entirely */
 231     ret = pci_try_reset_function(pdev);
 232     if (ret == -EAGAIN) {
 233         pci_disable_device(pdev);
 234         return ret;
 235     }
I will continue narrowing down later.
Dongli Zhang