On 05.06.19 14:17, Sam Eiderman wrote:Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in QEMU).
This format was lacking in the following:
* Grain directory (L1) and grain table (L2) entries were 32-bit, allowing access to only 2TB (slightly less) of data. * The grain size (default) was 512 bytes - leading to data fragmentation and many grain tables. * For space reclamation purposes, it was necessary to find all the grains which are not pointed to by any grain table - so a reverse mapping of "offset of grain in vmdk" to "grain table" must be constructed - which takes large amounts of CPU/RAM.
The format specification can be found in VMware's documentation: https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf
In ESXi 6.5, to support snapshot files larger than 2TB, a new format was introduced: SESparse (Space Efficient).
This format fixes the above issues:
* All entries are now 64-bit. * The grain size (default) is 4KB. * Grain directory and grain tables are now located at the beginning of the file. + seSparse format reserves space for all grain tables. + Grain tables can be addressed using an index. + Grains are located in the end of the file and can also be addressed with an index. - seSparse vmdks of large disks (64TB) have huge preallocated headers - mainly due to L2 tables, even for empty snapshots. * The header contains a reverse mapping ("backmap") of "offset of grain in vmdk" to "grain table" and a bitmap ("free bitmap") which specifies for each grain - whether it is allocated or not. Using these data structures we can implement space reclamation efficiently. * Due to the fact that the header now maintains two mappings: * The regular one (grain directory & grain tables) * A reverse one (backmap and free bitmap) These data structures can lose consistency upon crash and result in a corrupted VMDK. Therefore, a journal is also added to the VMDK and is replayed when the VMware reopens the file after a crash.
Since ESXi 6.7 - SESparse is the only snapshot format available.
Unfortunately, VMware does not provide documentation regarding the new seSparse format.
This commit is based on black-box research of the seSparse format. Various in-guest block operations and their effect on the snapshot file were tested.
The only VMware provided source of information (regarding the underlying implementation) was a log file on the ESXi:
/var/log/hostd.log
Whenever an seSparse snapshot is created - the log is being populated with seSparse records.
Relevant log records are of the form:
[...] Const Header: [...] constMagic = 0xcafebabe [...] version = 2.1 [...] capacity = 204800 [...] grainSize = 8 [...] grainTableSize = 64 [...] flags = 0 [...] Extents: [...] Header : <1 : 1> [...] JournalHdr : <2 : 2> [...] Journal : <2048 : 2048> [...] GrainDirectory : <4096 : 2048> [...] GrainTables : <6144 : 2048> [...] FreeBitmap : <8192 : 2048> [...] BackMap : <10240 : 2048> [...] Grain : <12288 : 204800> [...] Volatile Header: [...] volatileMagic = 0xcafecafe [...] FreeGTNumber = 0 [...] nextTxnSeqNumber = 0 [...] replayJournal = 0
The sizes that are seen in the log file are in sectors. Extents are of the following format: <offset : size>
This commit is a strict implementation which enforces: * magics * version number 2.1 * grain size of 8 sectors (4KB) * grain table size of 64 sectors * zero flags * extent locations
Additionally, this commit proivdes only a subset of the functionality offered by seSparse's format: * Read-only * No journal replay * No space reclamation * No unmap support
Hence, journal header, journal, free bitmap and backmap extents are unused, only the "classic" (L1 -> L2 -> data) grain access is implemented.
However there are several differences in the grain access itself. Grain directory (L1): * Grain directory entries are indexes (not offsets) to grain tables. * Valid grain directory entries have their highest nibble set to 0x1. * Since grain tables are always located in the beginning of the file - the index can fit into 32 bits - so we can use its low part if it's valid. Grain table (L2): * Grain table entries are indexes (not offsets) to grains. * If the highest nibble of the entry is: 0x0: The grain in not allocated. The rest of the bytes are 0. 0x1: The grain is unmapped - guest sees a zero grain. The rest of the bits point to the previously mapped grain, see 0x3 case. 0x2: The grain is zero. 0x3: The grain is allocated - to get the index calculate: ((entry & 0x0fff000000000000) >> 48) | ((entry & 0x0000ffffffffffff) << 12) * The difference between 0x1 and 0x2 is that 0x1 is an unallocated grain which results from the guest using sg_unmap to unmap the grain - but the grain itself still exists in the grain extent - a space reclamation procedure should delete it. Unmapping a zero grain has no effect (0x2 will not change to 0x1) but unmapping an unallocated grain will (0x0 to 0x1) - naturally.
In order to implement seSparse some fields had to be changed to support both 32-bit and 64-bit entry sizes.
Reviewed-by: Karl Heubaum <address@hidden> Reviewed-by: Eyal Moscovici <address@hidden> Reviewed-by: Arbel Moshe <address@hidden> Signed-off-by: Sam Eiderman <address@hidden> --- block/vmdk.c | 357 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 341 insertions(+), 16 deletions(-)
diff --git a/block/vmdk.c b/block/vmdk.c index 931eb2759c..4377779635 100644 --- a/block/vmdk.c +++ b/block/vmdk.c
[...]+static int vmdk_open_se_sparse(BlockDriverState *bs, + BdrvChild *file, + int flags, Error **errp) +{ + int ret; + VMDKSESparseConstHeader const_header; + VMDKSESparseVolatileHeader volatile_header; + VmdkExtent *extent; + + if (flags & BDRV_O_RDWR) { + error_setg(errp, "No write support for seSparse images available"); + return -ENOTSUP; + }
Kind of works for me, but why not bdrv_apply_auto_read_only() like I hadproposed? The advantage is that this would make the node read-only ifthe user has specified auto-read-_only_=on instead of failing.
Ah, I have not realized that bdrv_apply_auto_read_only() is preferred. I’ll send a v3.
Sam Max
|