qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-block] [PATCH v3] block/file-posix: do not fail on unlock byte


From: Max Reitz
Subject: Re: [Qemu-block] [PATCH v3] block/file-posix: do not fail on unlock bytes
Date: Wed, 3 Apr 2019 18:56:09 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.6.1

On 03.04.19 18:41, Max Reitz wrote:
> On 01.04.19 09:21, Vladimir Sementsov-Ogievskiy wrote:
>> 29.03.2019 22:32, Kevin Wolf wrote:
>>> Am 29.03.2019 um 19:00 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> 29.03.2019 20:58, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 29.03.2019 20:44, Max Reitz wrote:
>>>>>> On 29.03.19 18:40, Kevin Wolf wrote:
>>>>>>> Am 29.03.2019 um 18:30 hat Max Reitz geschrieben:
>>>>>>>> On 29.03.19 18:24, Kevin Wolf wrote:
>>>>>>>>> Am 29.03.2019 um 18:15 hat Max Reitz geschrieben:
>>>>>>>>>> On 29.03.19 12:04, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>>>>>> bdrv_replace_child() calls bdrv_check_perm() with error_abort on
>>>>>>>>>>> loosening permissions. However file-locking operations may fail even
>>>>>>>>>>> in this case, for example on NFS. And this leads to Qemu crash.
>>>>>>>>>>>
>>>>>>>>>>> Let's avoid such errors. Note, that we ignore such things anyway on
>>>>>>>>>>> permission update commit and abort.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <address@hidden>
>>>>>>>>>>> ---
>>>>>>>>>>>    block/file-posix.c | 12 ++++++++++++
>>>>>>>>>>>    1 file changed, 12 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/block/file-posix.c b/block/file-posix.c
>>>>>>>>>>> index db4cccbe51..1cf4ee49eb 100644
>>>>>>>>>>> --- a/block/file-posix.c
>>>>>>>>>>> +++ b/block/file-posix.c
>>>>>>>>>>> @@ -815,6 +815,18 @@ static int 
>>>>>>>>>>> raw_handle_perm_lock(BlockDriverState *bs,
>>>>>>>>>>>        switch (op) {
>>>>>>>>>>>        case RAW_PL_PREPARE:
>>>>>>>>>>> +        if ((s->perm | new_perm) == s->perm &&
>>>>>>>>>>> +            (s->shared_perm & new_shared) == s->shared_perm)
>>>>>>>>>>> +        {
>>>>>>>>>>> +            /*
>>>>>>>>>>> +             * We are going to unlock bytes, it should not fail. 
>>>>>>>>>>> If it fail due
>>>>>>>>>>> +             * to some fs-dependent permission-unrelated reasons 
>>>>>>>>>>> (which occurs
>>>>>>>>>>> +             * sometimes on NFS and leads to abort in 
>>>>>>>>>>> bdrv_replace_child) we
>>>>>>>>>>> +             * can't prevent such errors by any check here. And we 
>>>>>>>>>>> ignore them
>>>>>>>>>>> +             * anyway in ABORT and COMMIT.
>>>>>>>>>>> +             */
>>>>>>>>>>> +            return 0;
>>>>>>>>>>> +        }
>>>>>>>>>>>            ret = raw_apply_lock_bytes(s, s->fd, s->perm | new_perm,
>>>>>>>>>>>                                       ~s->shared_perm | ~new_shared,
>>>>>>>>>>>                                       false, errp);
>>>>>>>>>>
>>>>>>>>>> Help me understand the exact issue, please.  I understand that there 
>>>>>>>>>> are
>>>>>>>>>> operations like bdrv_replace_child() that pass &error_abort to
>>>>>>>>>> bdrv_check_perm() because they just loosen the permissions, so it 
>>>>>>>>>> should
>>>>>>>>>> not fail.
>>>>>>>>>>
>>>>>>>>>> However, if the whole effect really would be to loosen permissions,
>>>>>>>>>> raw_apply_lock_bytes() wouldn't have failed here in PREPARE anyway:
>>>>>>>>>> @unlock is passed as false, so no bytes will be unlocked.  And if
>>>>>>>>>> permissions are just loosened (as your condition checks), it should 
>>>>>>>>>> not
>>>>>>>>>> lock any bytes.
>>>>>>>>>>
>>>>>>>>>> So why does it attempt lock any bytes in the first place?  There 
>>>>>>>>>> must be
>>>>>>>>>> some discrepancy between s->perm and s->locked_perm, or 
>>>>>>>>>> ~s->shared_perm
>>>>>>>>>> and s->locked_shared_perm.  How does that occur?
>>>>>>>>>
>>>>>>>>> I suppose raw_check_lock_bytes() is what is failing, not
>>>>>>>>> raw_apply_lock_bytes().
>>>>>>>>
>>>>>>>> Hm, maybe in Vladimir's case, but not in e.g.
>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1652572 .
>>>>>>>
>>>>>>> This is reported against 3.0, which didn't avoid re-locking permissions
>>>>>>> that we already hold, so there raw_apply_lock_bytes() can still fail.
>>>>>>
>>>>>> That makes sense.  Which leaves the question why Vladimir still seems to
>>>>>> see the error there...?
>>>>>>
>>>>>
>>>>> I'm sorry :(. I'm trying to fix bug based on 2.10, and now I see that is 
>>>>> already fixed
>>>>>    upstream. I don't have a reproducer, only old coredumps.
>>>>>
>>>>> So, now it looks like we don't need this patch, as on permission 
>>>>> loosening file-posix
>>>>> don't call any FS apis, yes?
>>>>>
>>>>
>>>>
>>>> Ah, you mentioned, that raw_check_lock_bytes is still buggy.
>>>
>>> I haven't tried it out, but from looking at the code it seems so. Maybe
>>> you can reproduce on master just to be sure?
>>>
>>
>> I don't have a reproducer :(
> 
> I have one, but it only breaks before
> 2996ffad3acabe890fbb4f84a069cdc325a68108:
> 
> First, setup on an NFS mount on /mnt/nfs.  Second:
> 
> $ qemu-img create -f qcow2 /mnt/nfs/foo.qcow2 64M
> Formatting '/mnt/nfs/foo.qcow2', fmt=qcow2 size=67108864
> cluster_size=65536 lazy_refcounts=off refcount_bits=16
> $ (sleep 5; echo "{'execute':'qmp_capabilities'}"; \
>    echo "{'execute':'blockdev-del','arguments':{'node-name':'fmt'}}";
>    echo "{'execute':'quit'}") \
>   | x86_64-softmmu/qemu-system-x86_64 -qmp stdio \
>     -blockdev node-name=proto,driver=file,filename=/mnt/nfs/foo.qcow2 \
>     -blockdev node-name=fmt,driver=qcow2,file=proto
> {"QMP": {"version": {"qemu": {"micro": 90, "minor": 0, "major": 3},
> "package": "v3.1.0-rc0-71-ga883d6a0bc"}, "capabilities": []}}
> 
> Before the sleep is done, stop the service on the NFS host:
> 
> $ systemctl stop nfs-service
> 
> Once the sleep has run out (you get a {"return": {}} over QMP), start
> the service again:
> 
> $ systemctl start nfs-service
> 
> And then this happens:
> 
> Unexpected error in raw_apply_lock_bytes() at block/file-posix.c:705:
> Failed to lock byte 100
> [1]    30486 done                 ( sleep 5; echo
> "{'execute':'qmp_capabilities'}"; echo ; echo ; ) |
>        30487 abort (core dumped)  x86_64-softmmu/qemu-system-x86_64 -qmp
> stdio -blockdev  -blockdev
> 
> It works fine after 2996ffad3acabe890fbb4f84a069cdc325a68108.

Now I have a reproducer that breaks before this patch here and works
afterwards: You just need two parents and delete one of them, so some
permissions stay taken.

So, we can do this:

$ (echo "{'execute':'qmp_capabilities'}"; \
   echo "{'execute':'nbd-server-start',
          'arguments':{'addr':{'type':'inet',
              'data':{'host':'0.0.0.0','port':'10809'}}}}"; \
   echo "{'execute':'nbd-server-add',
          'arguments':{'device':'proto'}}"; \
   sleep 5; \
   echo "{'execute':'nbd-server-stop'}"; \
   echo "{'execute':'quit'}") \
  | x86_64-softmmu/qemu-system-x86_64 -qmp stdio \
     -blockdev node-name=proto,driver=file,filename=/mnt/nfs/foo.img \
     -device virtio-blk,drive=proto
{"QMP": {"version": {"qemu": {"micro": 91, "minor": 1, "major": 3},
"package": "v4.0.0-rc1-74-g38e694fcc9"}, "capabilities": ["oob"]}}
{"return": {}}
{"return": {}}
{"return": {}}

Then immediately this on the NFS host:

$ sudo systemctl stop nfs-server; sleep 6; \
  sudo systemctl start nfs-server

And this happens on the client:

Unexpected error in raw_check_lock_bytes() at block/file-posix.c:775:
Failed to get "consistent read" lock
[1]    21289 done                 ( echo
"{'execute':'qmp_capabilities'}"; echo ; echo ; sleep 5; echo ; echo ;  |
       21290 abort (core dumped)  x86_64-softmmu/qemu-system-x86_64 -qmp
stdio -blockdev  -device

No issues after 696aaaed579ac5bf5fa336216909b46d3d8f07a8 (this patch here).

Max

Attachment: signature.asc
Description: OpenPGP digital signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]