[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] Re: Strategic decision: COW format
From: |
Chunqiang Tang |
Subject: |
Re: [Qemu-devel] Re: Strategic decision: COW format |
Date: |
Mon, 14 Mar 2011 12:32:58 -0400 |
> > FVD's novel uses of the reference count table reduces the metadata
update
> > overhead down to literally zero during normal execution of a VM. This
gets
> > the bests of QCOW2's reference count table but without its oeverhead.
In
> > FVD, the reference count table is only updated when creating a new
> > snapshot or deleting an existing snapshot. The reference count table
is
> > never updated during normal execution of a VM.
>
> Do you want to send out a break-down of the steps (and cost) involved in
doing:
>
> 1. Snapshot creation.
> 2. Snapshot deletion.
> 3. Opening an image with n snapshots.
Here is a detailed description. Relevant to the discussion of snapshot,
FVD uses a one-level lookup table and a refcount table. FVD’s one-level
lookup table is very similar to QCOW2’s two-level lookup table, except
that it is much smaller in FVD, and is preallocated and hence contiguous
in image.
FVD’s refcount table is almost identical to that of QCOW2, but with a key
difference. An image consists of an arbitrary number of read-only
snapshots, and a single writeable image front, which is the current image
state perceived by the VM. Below, I will simply refer to the read-only
snapshots as snapshots, and refer to the “writeable image front” as
“writeable-front.” QCOW2’s refcount table counts clusters that are used by
either read-only snapshots or writeable-front. Because writeable-front
changes as the VM runs, QCOW2 needs to update the refcount table on the
fast path of normal VM execution.
By contrast, FVD’s refcount table only counts chunks that are used by
read-only snapshots, and does not count chunks used by write-front. This
is the key that allows FVD to entirely avoid updating the refcount table
on the fast path of normal VM execution.
Below are the detailed steps for different operations.
O1: Open an image with n snapshots.
Let me introduce some basic concepts first. The storage allocation unit in
FVD is called a chunk (like cluster in QCOW2). The default chunk size is
1MB, like that in VDI (VMDK and Microsoft VHD use 2MB chunks). An FVD
image file is conceptually divided into chunks, where chunk 0 is the first
1MB of the image file, chunk 1 is the second 1MB, … chunk j, …and so
forth. The size of an image file grow as needed, just like that of QCOW2.
The refcount table is a linear array “uint16_t refcount[]”. If a chunk j
is referenced by s different snapshots, then refcount[j] = s. If a new
snapshot is created and this new snapshot also uses chunk j, then
refcount[j] is incremented to refcount[j] = s+1.
If all snapshots together use 1TB storage spaces, there are
1TB/1MB=1,000,000 chunks, and the size of the refcount table is 2MB.
Loading the entire 2MB refcount table from disk into memory takes about 15
milliseconds. If the virtual disk size perceived by the VM is also 1TB,
FVD’s one-level lookup table is 4MB. FVD’s one-level lookup table serves
the same purpose as QCOW2’s two-level lookup table, but FVD’s one-level
table is much smaller and is preallocated and hence continuous in the
image. Loading the entire 4MB lookup table from disk into memory takes
about 20 milliseconds. These numbers mean that it is quite affordable to
scan the entire tables at VM boot time, although the scan can also be
avoided in FVD. The optimizations will be described later.
When opening an image with n snapshots, an unoptimized version of FVD
performs the following steps:
O1: Load the entire 2MB reference count table from disk into memory. This
step takes about 15ms.
O2: Load the entire 4MB lookup table from disk into memory. This step
takes about 20ms.
O3: Use the two tables to build an in-memory data structure called
“free-chunk-bitmap.” This step takes about 2ms. The free-chunk-bitmap
identifies free chunks that are not used by either snapshots or writeable
front, and hence can be allocated for future writes. The size of the
free-chunk-bitmap is only 125KB for a 1TB disk, and hence the memory
overhead is negligible. The free-chunk-bitmap also supports trim
operations. The free-chunk-bitmap does not have to be persisted on disk as
it can always be rebuilt easily, although as an optimization it can be
persisted on disk on VM shutdown.
O4: Compare the refcount table and the lookup table to identify chunks
that are in both tables (i.e., shared) and hence the running VM’s write to
those chunks in writeable-front triggers copy-on-write. This step takes
about 2ms. One bit in the lookup table’s entry is stolen to mark whether a
chunk in writeable-front is shared with snapshots and hence needs
copy-on-write upon a write.
The whole process above, i.e., opening an image with n (e.g., n=1000)
snapshots, takes about 39ms and it is a one-time cost at VM boot. Later, I
will describe optimizations that can further reduce this 39ms by saving
the 125KB free-chunk-bitmap to disk on VM shutdown, but that optimization
is more than likely to an over-engineering effort, given that 39ms on VM
boot perhaps is not a big deal.
Once the image is opened, the refcount table is discarded and never
updated during normal execution of the VM. This is how FVD gets the bests
of QCOW2’s refcount table but without its runtime overhead. When the VM
issues a write to a chunk in writeable-front and this chunk is shared with
snapshots (this is known by looking at the special marker bit in lookup
table entries), FVD uses the free-chunk-bitmap to find a free chunk,
performs copy-on-write, and updates the lookup table to point to that new
chunk. This metadata change to the lookup table is persisted in FVD’s
journal and is not updated directly to the writeable-front’s on-disk
lookup table. When the VM shuts down gracefully, the entire lookup table
is written back to the writeable-front’s on-disk lookup table (this step
takes about 20ms), and FVD’s image header is updated to indicate that
there is no need to recover from journal on the next VM boot.
Now let me describe an optimization that can further reduce the 39ms time
needed for opening an image with n snapshots. When the VM shuts down
gracefully, FVD can save the in-memory 125KB free-chunk-bitmap to disk.
When the VM reboots, FVD can load the 125KB free-chunk-bitmap from disk
and skips steps O1, O2, and O3. Since the lookup table is saved to disk on
a clean shutdown and one bit in the table entries marks whether a chunk is
shared with snapshots, the scanning step in O4 can also be avoided. In
other words, all the steps above O1-O4 can all be avoided on a clean
shutdown. They are needed only after a host crash. During the recovery
process after a host crash, the journal re-builds the lookup table; the
refcount table and the lookup table rebuild the free-chunk-bitmap.
This is a description of FVD's image open operation.
(…to be continued… I am running out of time now, and will write about
FVD’s other snapshot operations in separate emails.)
Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang
- Re: [Qemu-devel] Re: Strategic decision: COW format, (continued)
- Re: [Qemu-devel] Re: Strategic decision: COW format, Chunqiang Tang, 2011/03/14
- Re: [Qemu-devel] Re: Strategic decision: COW format, Anthony Liguori, 2011/03/14
- Re: [Qemu-devel] Re: Strategic decision: COW format, Stefan Hajnoczi, 2011/03/14
- Re: [Qemu-devel] Re: Strategic decision: COW format, Kevin Wolf, 2011/03/14
- Re: [Qemu-devel] Re: Strategic decision: COW format, Stefan Hajnoczi, 2011/03/14
- Re: [Qemu-devel] Re: Strategic decision: COW format, Chunqiang Tang, 2011/03/14
- Re: [Qemu-devel] Re: Strategic decision: COW format, Kevin Wolf, 2011/03/14
- Re: [Qemu-devel] Re: Strategic decision: COW format, Chunqiang Tang, 2011/03/14
- Re: [Qemu-devel] Re: Strategic decision: COW format, Kevin Wolf, 2011/03/14
- Re: [Qemu-devel] Re: Strategic decision: COW format, Stefan Hajnoczi, 2011/03/14
- Re: [Qemu-devel] Re: Strategic decision: COW format,
Chunqiang Tang <=
- Re: [Qemu-devel] Re: Strategic decision: COW format, Kevin Wolf, 2011/03/14
- Re: [Qemu-devel] Re: Strategic decision: COW format, Chunqiang Tang, 2011/03/14
- Re: [Qemu-devel] Re: Strategic decision: COW format, Kevin Wolf, 2011/03/14
- Message not available
- Re: [Qemu-devel] Re: Strategic decision: COW format, Chunqiang Tang, 2011/03/14
- Re: [Qemu-devel] Re: Strategic decision: COW format, Anthony Liguori, 2011/03/14
- Re: [Qemu-devel] Re: Strategic decision: COW format, Kevin Wolf, 2011/03/14
- Re: [Qemu-devel] Re: Strategic decision: COW format, Anthony Liguori, 2011/03/14
- Re: [Qemu-devel] Re: Strategic decision: COW format, Chunqiang Tang, 2011/03/14
- Re: [Qemu-devel] Re: Strategic decision: COW format, Stefan Hajnoczi, 2011/03/14
- Re: [Qemu-devel] Re: Strategic decision: COW format, Kevin Wolf, 2011/03/14