qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Qemu-devel] qcow2 lock


From: Vladimir Sementsov-Ogievskiy
Subject: [Qemu-devel] qcow2 lock
Date: Mon, 9 Sep 2019 10:13:57 +0000

Hi!

I have a (may be stupid) question: what is BDRVQcow2State.lock for and when 
should it be locked?


I faced SIGSEGV here:

#0  qcow2_process_discards (bs=bs@entry=0x564b93bc8000, ret=ret@entry=0) at 
block/qcow2-refcount.c:737
#1  0x0000564b90e9f15f in qcow2_cluster_discard (bs=bs@entry=0x564b93bc8000, 
offset=0, offset@entry=3062890496, bytes=bytes@entry=134217728, 
type=type@entry=QCOW2_DISCARD_REQUEST,
     full_discard=full_discard@entry=false) at block/qcow2-cluster.c:1853
#2  0x0000564b90e8f720 in qcow2_co_pdiscard (bs=0x564b93bc8000, 
offset=3062890496, bytes=134217728) at block/qcow2.c:3749
#3  0x0000564b90ec565d in bdrv_co_pdiscard (bs=0x564b93bc8000, 
offset=3062890496, bytes=134217728) at block/io.c:2939
#4  0x0000564b90eb5c04 in blk_aio_pdiscard_entry (opaque=0x564b94f968c0) at 
block/block-backend.c:1527
#5  0x0000564b90f681ea in coroutine_trampoline (i0=<optimized out>, 
i1=<optimized out>) at util/coroutine-ucontext.c:116

SIGSEGV is on QTAILQ_REMOVE in qcow2_process_discards:
(gdb) list
732     {
733         BDRVQcow2State *s = bs->opaque;
734         Qcow2DiscardRegion *d, *next;
735
736         QTAILQ_FOREACH_SAFE(d, &s->discards, next, next) {
737             QTAILQ_REMOVE(&s->discards, d, next);
738
739             /* Discard is optional, ignore the return value */
740             if (ret >= 0) {
741                 bdrv_pdiscard(bs->file->bs, d->offset, d->bytes);


(you see bs->file->bs, yes it's old code based on 2.12, but still,
I need some help on the following)

and problem is that d is already deleted:
(gdb) p d->next
$50 = {tqe_next = 0x564b94b0b140, tqe_prev = 0x0}

Such problems may occur when there is an interleaving of such
removing loops with other usage of the queue. And this is possible,
as we call bdrv_pdiscard inside the loop which may yield.

go to frame #5, and print co->caller stack:

#0  0x0000564b90f68180 in qemu_coroutine_switch ()
#1  0x0000564b90f66c84 in qemu_aio_coroutine_enter ()
#2  0x0000564b90f50764 in aio_co_enter ()
#3  0x0000564b90f50ea9 in thread_pool_completion_bh ()
#4  0x0000564b90f500d1 in aio_bh_poll ()
#5  0x0000564b90f5360b in aio_poll ()
#6  0x0000564b90ec59cd in bdrv_pdiscard ()
#7  0x0000564b90e96a36 in qcow2_process_discards ()
#8  0x0000564b90e97785 in update_refcount ()
#9  0x0000564b90e96bdd in qcow2_free_clusters ()
#10 0x0000564b90ea29c7 in update_ext_header_and_dir ()
#11 0x0000564b90ea3a14 in qcow2_remove_persistent_dirty_bitmap ()
#12 0x0000564b90ce7bce in qmp_block_dirty_bitmap_remove ()
#13 0x0000564b90cf5390 in qmp_marshal_block_dirty_bitmap_remove ()
#14 0x0000564b90f46080 in qmp_dispatch ()
#15 0x0000564b90bedc74 in monitor_qmp_dispatch_one ()
#16 0x0000564b90bee04a in monitor_qmp_bh_dispatcher ()
#17 0x0000564b90f500d1 in aio_bh_poll ()
#18 0x0000564b90f53430 in aio_dispatch ()
#19 0x0000564b90f4ffae in aio_ctx_dispatch ()
#20 0x00007f0a8e3e9049 in g_main_context_dispatch () from 
/lib64/libglib-2.0.so.0
#21 0x0000564b90f52727 in main_loop_wait ()
#22 0x0000564b90ba0c07 in main ()


And this (at least partly) confirms my guess.

So, my actual question is, what should be fixed here:

1. yielding in qcow2_process_discards, like this:

--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -732,9 +732,13 @@ void qcow2_process_discards(BlockDriverState *bs, int ret)
  {
      BDRVQcow2State *s = bs->opaque;
      Qcow2DiscardRegion *d, *next;
+    QTAILQ_HEAD (, Qcow2DiscardRegion) discards;

-    QTAILQ_FOREACH_SAFE(d, &s->discards, next, next) {
-        QTAILQ_REMOVE(&s->discards, d, next);
+    discards = s->discards;
+    QTAILQ_INIT(&s->discards);
+
+    QTAILQ_FOREACH_SAFE(d, &discards, next, next) {
+        QTAILQ_REMOVE(&discards, d, next);

          /* Discard is optional, ignore the return value */
          if (ret >= 0) {



or
2. calling qcow2_remove_persistent_dirty_bitmap without taking lock, like this:

--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -1359,8 +1359,8 @@ void 
qcow2_remove_persistent_dirty_bitmap(BlockDriverState *bs,
  {
      int ret;
      BDRVQcow2State *s = bs->opaque;
-    Qcow2Bitmap *bm;
-    Qcow2BitmapList *bm_list;
+    Qcow2Bitmap *bm = NULL;
+    Qcow2BitmapList *bm_list = NULL;

      if (s->nb_bitmaps == 0) {
          /* Absence of the bitmap is not an error: see explanation above
@@ -1368,15 +1368,17 @@ void 
qcow2_remove_persistent_dirty_bitmap(BlockDriverState *bs,
          return;
      }

+    qemu_co_mutex_lock(&s->lock);
+
      bm_list = bitmap_list_load(bs, s->bitmap_directory_offset,
                                 s->bitmap_directory_size, errp);
      if (bm_list == NULL) {
-        return;
+        goto out;
      }

      bm = find_bitmap_by_name(bm_list, name);
      if (bm == NULL) {
-        goto fail;
+        goto out;
      }

      QSIMPLEQ_REMOVE(bm_list, bm, Qcow2Bitmap, entry);
@@ -1384,12 +1386,14 @@ void 
qcow2_remove_persistent_dirty_bitmap(BlockDriverState *bs,
      ret = update_ext_header_and_dir(bs, bm_list);
      if (ret < 0) {
          error_setg_errno(errp, -ret, "Failed to update bitmap extension");
-        goto fail;
+        goto out;
      }

      free_bitmap_clusters(bs, &bm->table);

-fail:
+out:
+    qemu_co_mutex_unlock(&s->lock);
+
      bitmap_free(bm);
      bitmap_list_free(bm_list);
  }

And in this case, I'm afraid that locking is missed in some other bitmap 
related qcow2 codes :(


or
3. Just backport from upstream John's fix 0a6c86d024c52b, which acquires aio 
context around
bdrv_remove_persistent_dirty_bitmap call? Is it enough, or we still need locking
inside qcow2?


-- 
Best regards,
Vladimir

reply via email to

[Prev in Thread] Current Thread [Next in Thread]