Re: [Qemu-block] [PATCH v2 3/3] vmdk: Add read-only support for seSparse

On 19 Jun 2019, at 20:12, Max Reitz <address@hidden> wrote:

On 05.06.19 14:17, Sam Eiderman wrote:
Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in
QEMU).

This format was lacking in the following:

   * Grain directory (L1) and grain table (L2) entries were 32-bit,
     allowing access to only 2TB (slightly less) of data.
   * The grain size (default) was 512 bytes - leading to data
     fragmentation and many grain tables.
   * For space reclamation purposes, it was necessary to find all the
     grains which are not pointed to by any grain table - so a reverse
     mapping of "offset of grain in vmdk" to "grain table" must be
     constructed - which takes large amounts of CPU/RAM.

The format specification can be found in VMware's documentation:
https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf

In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
introduced: SESparse (Space Efficient).

This format fixes the above issues:

   * All entries are now 64-bit.
   * The grain size (default) is 4KB.
   * Grain directory and grain tables are now located at the beginning
     of the file.
     + seSparse format reserves space for all grain tables.
     + Grain tables can be addressed using an index.
     + Grains are located at the end of the file and can also be
       addressed with an index.
     - seSparse vmdks of large disks (64TB) have huge preallocated
       headers - mainly due to L2 tables, even for empty snapshots.
   * The header contains a reverse mapping ("backmap") of "offset of
     grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
     specifies for each grain - whether it is allocated or not.
     Using these data structures we can implement space reclamation
     efficiently.
   * Due to the fact that the header now maintains two mappings:
       * The regular one (grain directory & grain tables)
       * A reverse one (backmap and free bitmap)
     These data structures can lose consistency upon crash and result
     in a corrupted VMDK.
     Therefore, a journal is also added to the VMDK and is replayed
     when VMware reopens the file after a crash.

Since ESXi 6.7 - SESparse is the only snapshot format available.

Unfortunately, VMware does not provide documentation regarding the new
seSparse format.

This commit is based on black-box research of the seSparse format.
Various in-guest block operations and their effect on the snapshot file
were tested.

The only VMware provided source of information (regarding the underlying
implementation) was a log file on the ESXi:

   /var/log/hostd.log

Whenever an seSparse snapshot is created, the log is populated
with seSparse records.

Relevant log records are of the form:

[...] Const Header:
[...] constMagic     = 0xcafebabe
[...] version        = 2.1
[...] capacity       = 204800
[...] grainSize      = 8
[...] grainTableSize = 64
[...] flags          = 0
[...] Extents:
[...] Header         : <1 : 1>
[...] JournalHdr     : <2 : 2>
[...] Journal        : <2048 : 2048>
[...] GrainDirectory : <4096 : 2048>
[...] GrainTables    : <6144 : 2048>
[...] FreeBitmap     : <8192 : 2048>
[...] BackMap        : <10240 : 2048>
[...] Grain          : <12288 : 204800>
[...] Volatile Header:
[...] volatileMagic     = 0xcafecafe
[...] FreeGTNumber      = 0
[...] nextTxnSeqNumber = 0
[...] replayJournal     = 0

The sizes that are seen in the log file are in sectors.
Extents are of the following format: <offset : size>
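
As a rough illustration of the units above, the sketch below converts
sector counts from the log into bytes. It assumes 512-byte sectors
(consistent with "8 sectors (4KB)"); the type and helper names are
hypothetical and are not taken from block/vmdk.c.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one sector is 512 bytes, so a grain of 8 sectors is 4KB. */
#define SESPARSE_SECTOR_SIZE 512u

/* Hypothetical type: one "<offset : size>" pair from the log, in sectors. */
typedef struct {
    uint64_t offset_sectors;
    uint64_t size_sectors;
} SeSparseLogExtent;

/* Convert a sector count to bytes, refusing values that would overflow. */
static uint64_t sesparse_sectors_to_bytes(uint64_t sectors)
{
    assert(sectors <= UINT64_MAX / SESPARSE_SECTOR_SIZE);
    return sectors * SESPARSE_SECTOR_SIZE;
}

/* Number of grains needed to cover the capacity, rounding up. */
static uint64_t sesparse_grain_count(uint64_t capacity_sectors,
                                     uint64_t grain_size_sectors)
{
    assert(grain_size_sectors != 0);
    return (capacity_sectors + grain_size_sectors - 1) / grain_size_sectors;
}
```

With the values from the log above, a capacity of 204800 sectors is
100 MiB, and the GrainDirectory extent <4096 : 2048> starts at byte
offset 2097152.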

This commit is a strict implementation which enforces:
   * magics
   * version number 2.1
   * grain size of 8 sectors (4KB)
   * grain table size of 64 sectors
   * zero flags
   * extent locations

Additionally, this commit provides only a subset of the functionality
offered by the seSparse format:
   * Read-only
   * No journal replay
   * No space reclamation
   * No unmap support

Hence, journal header, journal, free bitmap and backmap extents are
unused, only the "classic" (L1 -> L2 -> data) grain access is
implemented.

However, there are several differences in the grain access itself.
Grain directory (L1):
   * Grain directory entries are indexes (not offsets) to grain
     tables.
   * Valid grain directory entries have their highest nibble set to
     0x1.
   * Since grain tables are always located at the beginning of the
     file - the index can fit into 32 bits - so we can use its low
     part if it's valid.
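
A minimal sketch of the grain directory check described above, under
the assumption that an entry is valid exactly when its highest nibble
is 0x1. The function name is hypothetical, and the patch itself may
validate more bits than this.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decode a 64-bit grain directory (L1) entry.
 * Returns true and stores the grain table index (low 32 bits) when the
 * highest nibble is 0x1; returns false for any other entry.
 */
static bool sesparse_l1_decode(uint64_t entry, uint32_t *gt_index)
{
    if ((entry >> 60) != 0x1) {
        return false;
    }
    *gt_index = (uint32_t)(entry & 0xffffffffULL);
    return true;
}
```

The result is an index, not an offset, so a caller would still have to
multiply it by the grain table size and add the start of the
GrainTables extent.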
Grain table (L2):
   * Grain table entries are indexes (not offsets) to grains.
   * If the highest nibble of the entry is:
       0x0:
           The grain is not allocated.
           The rest of the bytes are 0.
       0x1:
           The grain is unmapped - guest sees a zero grain.
           The rest of the bits point to the previously mapped grain,
           see 0x3 case.
       0x2:
           The grain is zero.
       0x3:
           The grain is allocated - to get the index calculate:
           ((entry & 0x0fff000000000000) >> 48) |
           ((entry & 0x0000ffffffffffff) << 12)
   * The difference between 0x1 and 0x2 is that 0x1 is an unallocated
     grain which results from the guest using sg_unmap to unmap the
     grain - but the grain itself still exists in the grain extent - a
     space reclamation procedure should delete it.
     Unmapping a zero grain has no effect (0x2 will not change to 0x1)
     but unmapping an unallocated grain will (0x0 to 0x1) - naturally.
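
The grain table cases above can be sketched as a small classifier. This
is only an illustration of the nibble rules and of the index formula
for the 0x3 case; the enum and function names are hypothetical and do
not come from block/vmdk.c.

```c
#include <stdint.h>

/* Hypothetical classification of a grain table (L2) entry. */
typedef enum {
    SESPARSE_GRAIN_UNALLOCATED, /* 0x0: not allocated */
    SESPARSE_GRAIN_UNMAPPED,    /* 0x1: unmapped, guest sees zeroes */
    SESPARSE_GRAIN_ZERO,        /* 0x2: zero grain */
    SESPARSE_GRAIN_ALLOCATED,   /* 0x3: data present, index is valid */
    SESPARSE_GRAIN_INVALID      /* any other highest nibble */
} SeSparseGrainKind;

/*
 * Classify a 64-bit L2 entry by its highest nibble.
 * *grain_index is written only for SESPARSE_GRAIN_ALLOCATED.
 */
static SeSparseGrainKind sesparse_l2_decode(uint64_t entry,
                                            uint64_t *grain_index)
{
    switch (entry >> 60) {
    case 0x0:
        return SESPARSE_GRAIN_UNALLOCATED;
    case 0x1:
        return SESPARSE_GRAIN_UNMAPPED;
    case 0x2:
        return SESPARSE_GRAIN_ZERO;
    case 0x3:
        /* Bits 48..59 hold the low 12 bits of the index,
         * bits 0..47 hold the upper part. */
        *grain_index = ((entry & 0x0fff000000000000ULL) >> 48) |
                       ((entry & 0x0000ffffffffffffULL) << 12);
        return SESPARSE_GRAIN_ALLOCATED;
    default:
        return SESPARSE_GRAIN_INVALID;
    }
}
```

For example, the entry 0x3001000000000002 decodes to grain index
0x2001. A read-only driver would presumably return zeroes for the 0x1
and 0x2 cases and consult the parent image for 0x0.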

In order to implement seSparse some fields had to be changed to support
both 32-bit and 64-bit entry sizes.

Reviewed-by: Karl Heubaum <address@hidden>
Reviewed-by: Eyal Moscovici <address@hidden>
Reviewed-by: Arbel Moshe <address@hidden>
Signed-off-by: Sam Eiderman <address@hidden>
---
block/vmdk.c | 357 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 341 insertions(+), 16 deletions(-)

diff --git a/block/vmdk.c b/block/vmdk.c
index 931eb2759c..4377779635 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c

[...]

+static int vmdk_open_se_sparse(BlockDriverState *bs,
+                               BdrvChild *file,
+                               int flags, Error **errp)
+{
+    int ret;
+    VMDKSESparseConstHeader const_header;
+    VMDKSESparseVolatileHeader volatile_header;
+    VmdkExtent *extent;
+
+    if (flags & BDRV_O_RDWR) {
+        error_setg(errp, "No write support for seSparse images available");
+        return -ENOTSUP;
+    }
Kind of works for me, but why not bdrv_apply_auto_read_only() like I had
proposed? The advantage is that this would make the node read-only if
the user has specified auto-read-only=on instead of failing.

Ah, I had not realized that bdrv_apply_auto_read_only() is preferred.

I’ll send a v3.

Sam

Max

From:	Sam Eiderman
Subject:	Re: [Qemu-block] [PATCH v2 3/3] vmdk: Add read-only support for seSparse snapshots
Date:	Thu, 20 Jun 2019 11:48:53 +0300