qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3 4/6] util: implement seqcache


From: Max Reitz
Subject: Re: [PATCH v3 4/6] util: implement seqcache
Date: Fri, 12 Mar 2021 14:41:40 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.0

On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
Implement cache for small sequential unaligned writes, so that they may
be cached until we get a complete cluster and then write it.

The cache is intended to be used for backup to qcow2 compressed target
opened in O_DIRECT mode, but can be reused for any similar (even not
block-layer related) task.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
  include/qemu/seqcache.h |  42 +++++
  util/seqcache.c         | 361 ++++++++++++++++++++++++++++++++++++++++
  MAINTAINERS             |   6 +
  util/meson.build        |   1 +
  4 files changed, 410 insertions(+)
  create mode 100644 include/qemu/seqcache.h
  create mode 100644 util/seqcache.c

Looks quite good to me, thanks.  Nice explanations, too. :)

The only design question I have is whether there’s a reason you’re using a list again instead of a hash table. I suppose we do need the list anyway because of the next_flush iterator, so using a hash table would only complicate the implementation, but still.

[...]

diff --git a/util/seqcache.c b/util/seqcache.c
new file mode 100644
index 0000000000..d923560eab
--- /dev/null
+++ b/util/seqcache.c
@@ -0,0 +1,361 @@
+/*
+ * Cache for small sequential write requests.
+ *
+ * Copyright (c) 2021 Virtuozzo International GmbH.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ *
+ *
+ * = Description =
+ *
+ * SeqCache is an abbreviation for Sequential Cache.
+ *
+ * The Cache is intended to improve performance of small unaligned sequential
+ * writes. Cache has a cluster_size parameter and the unit of caching is 
aligned
+ * cluster.  Cache has a list of cached clusters, several "finished" ones and 
at
+ * most one "unfinished". "unfinished" cluster is a cluster where last byte of
+ * last write operation is cached and there is a free place after that byte to
+ * the end of cluster. "finished" clusters are just stored in cache to be read
+ * or flushed and don't allow any writes to them.
+ *
+ * If write to the cache intersects cluster bounds, it's splat into several

*split

(Though I like “splat”.  Sounds like a wet blotch.)

+ * requests by cluster bounds. So, consider a write that doesn't intersect
+ * cluster bounds to describe the whole picture:
+ *
+ * There are two cases allowed:
+ *
+ * 1. Sequential write to "unfinished" cluster. Actually it's a write 
sequential
+ *    previous write. The data goes to "unfinished" cluster. If "unfinished" is
+ *    filled up to the cluster bound it becomes "finished".
+ *
+ * 2. Write to new cluster, not existing in the cache. It may be sequential to
+ *    previous write or not. Current "unfinshed" cluster (if exists) becomes

*unfinished

+ *    "finished" and new "unfinished" cluster is started. Note also that offset
+ *    of the write to new cluster is not required to be aligned.
+ *
+ *    Any other write operation (non-sequential write to "unfinished" cluster
+ *    or write to any of "finished" clusters) will crash.
+ */
+
+#include "qemu/osdep.h"
+
+#include "qemu/queue.h"
+#include "qemu/seqcache.h"
+
+/*
+ * Cluster
+ *
+ * Representation of one cached cluster, aligned to SeqCache::cluster_size.
+ * Caches only one subregion of the cluster, started at @offset (may be
+ * unaligned to cluster_size) and of @bytes length (may be unaligned as well).
+ * The whole subregion always lay in one aligned cluster:
+ *
+ *      QEMU_ALIGN_DOWN(offset, cluster_size) ==
+ *          QEMU_ALIGN_DOWN(offset + bytes - 1, cluster_size)
+ *
+ * @buf is allocated to be able to fill the cluster up to the cluster end, i.e.
+ * allocated @buf length is at least:
+ *
+ *      cluster_size - offset % cluster_size
+ */
+typedef struct Cluster {
+   int64_t offset;
+   int64_t bytes;
+   uint8_t *buf;
+   QSIMPLEQ_ENTRY(Cluster) entry;
+} Cluster;
+
+/*
+ * SeqCache caches small sequential writes into "unfinished" @cur_write 
cluster,
+ * until entire cluster (of @cluster_size bytes) is filled by seqcache_write()
+ * calls.
+ *
+ * @cur_write->offset may be unaligned to @cluster_size if first write offset 
is
+ * not aligned (for example, if there was a flush request and cache was 
flushed,
+ * then we continue from the middle of the cluster with an empty cache).
+ *
+ * @cur_write may be NULL, which means we don't have current cluster and next
+ * seqcache_write() will start a new one.
+ *
+ * @all is a list of all clusters cached in the cache, some "finished" and one
+ * "unfinished" @cur_write (if exists). If @cur_write is not NULL it is a last
+ * one in the list.
+ *
+ * @nb_clusters is number of elements in @all list.
+ *
+ * @next_flush is an iterator for flushing. SeqCache knows nothing about how
+ * data should be flushing, so for flush user calls seqcache_get_next_flush()

s/flushing/flushed/

+ * (which moves @next_flush iterator) and when data is flushed somehow and 
cache
+ * is not needed anymore, user can call seqcache_discard_cluster().

AFAIU, next_flush always points to the first finished cluster that has not yet been returned by seqcache_get_next_flush(), is that correct? (Yes, at least the latter part is implied by this comment, I’m just asking for clarity, because I want to be sure the simple

  s->next_flush = QSIMPLEQ_NEXT(s->next_flush, entry);

in seqcache_get_next_flush() does what I think it does, which is never to let s->next_flush be NULL even though there are still flushable clusters somewhere.)

+ */
+typedef struct SeqCache {
+    int64_t cluster_size;
+    int64_t nb_clusters;
+    Cluster *cur_write;
+    Cluster *next_flush;
+    QSIMPLEQ_HEAD(, Cluster) all;
+} SeqCache;

[...]

+/* Align down @offset to s->cluster_size and search for corresponding cluster 
*/
+static Cluster *seqcache_find_cluster(SeqCache *s, int64_t offset)
+{
+    Cluster *cl;
+    int64_t cl_start = cluster_start(s, offset);
+
+    QSIMPLEQ_FOREACH(cl, &s->all, entry) {
+        if (cluster_start(s, cl->offset) == cl_start) {
+            return cl;
+        }
+    }
+
+    return NULL;
+}
+
+/* Makes current "unfinished" cluster the "finished" one. */

This sounds a bit like there’d be only a single finished cluster, so I’d rather write it as “Mark the current "unfinished" cluster as "finished".”

+static void seqcache_finalize_current_cluster(SeqCache *s)
+{
+    if (s->cur_write && !s->next_flush) {
+        s->next_flush = s->cur_write;
+    }
+    s->cur_write = NULL;
+}
+
+/*
+ * Write an @offset, @bytes, @buf request into the cache. The requests MUST not

s/requests/request/

+ * intersect cluster bounds.
+ */
+static void seqcache_write_one(SeqCache *s, int64_t offset, int64_t bytes,
+                               uint8_t *buf)

Could use a const, though not a must.

+{
+    assert(bytes > 0);
+    assert(cluster_start(s, offset) == cluster_start(s, offset + bytes - 1));
+
+    if (s->cur_write && offset == cached_end(s->cur_write)) {
+        /* Continue sequential process */
+        memcpy(s->cur_write->buf + s->cur_write->bytes, buf, bytes);
+        s->cur_write->bytes += bytes;
+
+        if (cached_end(s->cur_write) == cluster_end(s, s->cur_write->offset)) {
+            seqcache_finalize_current_cluster(s);
+        }
+
+        return;
+    }
+
+    /* We are starting a new cluster. Check that it's really new */
+    assert(!seqcache_find_cluster(s, offset));
+
+    seqcache_finalize_current_cluster(s);
+
+    s->cur_write = g_new(Cluster, 1);
+    *s->cur_write = (Cluster) {
+        .offset = offset,
+        .bytes = bytes,
+        .buf = g_malloc(s->cluster_size),

I have to ask: Why not s->cluster_size - offset % s->cluster_size as advertised by the comment describing Cluster?

More practical question: Should we use qemu_memalign() (aligning either at the cluster size or at the block alignment, which would need to be passed to seqcache_new()) when offset is aligned to the cluster size? (Or with a custom alignment, if it is aligned to that.)

I feel that for O_DIRECT images it might be kind of important to align the buffer to the host block size.

+    };
+
+    memcpy(s->cur_write->buf, buf, bytes);
+    QSIMPLEQ_INSERT_TAIL(&s->all, s->cur_write, entry);
+    s->nb_clusters++;
+}
+
+/* Write an @offset, @bytes, @buf request into the cache. */
+void seqcache_write(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf)

“const” might again find its place here.

+{
+    while (bytes) {
+        int64_t cl_end = cluster_end(s, offset);
+        int64_t chunk = MIN(bytes, cl_end - offset);
+
+        seqcache_write_one(s, offset, chunk, buf);
+        offset += chunk;
+        bytes -= chunk;
+        buf += chunk;
+    }
+}

[...]

+/*
+ * Get next region for flushing. @offset, @bytes and @buf are out-parameters
+ * to return the region.
+ *
+ * @unfinished is in-out argument which means that user is interested in
+ * flushing unfinished cluster too:
+ *
+ * If there are "finished" clusters, "finished" cluster is returned and
+ * *@unfinished is set to false, independently of its original value.
+ *
+ * If there are no "finished" clusters but "unfinished" exists (i.e.
+ * s->cur_write != NULL and it is the only element of s->all), then 
*@unfinished
+ * value remains the same and the following logic works:
+ *
+ *    If *@unfinished:
+ *       return s->cur_write unfinished cluster for flushing
+ *    Else
+ *       return nothing
+ *
+ *
+ * Returns true and set @offset, @bytes, @buf and @unfinished if there is
+ * something to flush (accordingly to @unfinished value), returns false
+ * otherwise.
+ *
+ * Nothing is removed from the cache.

Out of curiosity, mainly, is that because the returned *buf is only valid as long as the entry is still in the cache, or is there a different reason that I’m missing?

(Hm, perhaps the fact that the user may want to keep it available for reading through seqcache_read()?)

+ */
+bool seqcache_get_next_flush(SeqCache *s, int64_t *offset, int64_t *bytes,
+                             uint8_t **buf, bool *unfinished)

Could be “uint8_t *const *buf”, I suppose. Don’t know how much the callers would hate that, though.

+{
+    Cluster *req = s->next_flush;
+
+    if (s->next_flush) {
+        *unfinished = false;
+        req = s->next_flush;
+        s->next_flush = QSIMPLEQ_NEXT(req, entry);
+        if (s->next_flush == s->cur_write) {
+            s->next_flush = NULL;
+        }
+    } else if (s->cur_write && *unfinished) {
+        req = s->cur_write;

I was wondering whether flushing an unfinished cluster wouldn’t kind of finalize it, but I suppose the problem with that would be that you can’t add data to a finished cluster, which wouldn’t be that great if you’re just flushing the cache without wanting to drop it all.

(The problem I see is that flushing it later will mean all the data that already has been written here will have to be rewritten. Not that bad, I suppose.)

+    } else {
+        return false;
+    }
+
+    *offset = req->offset;
+    *bytes = req->bytes;
+    *buf = req->buf;
+
+    return true;
+}
+
+/*
+ * Find corresponding cluster and drop it. No matter does requested @offset is
+ * cached itself or not.

The second sentence sounds strange grammatically, if I understand correctly, I’d change this to something like “Find the cluster corresponding to @offset and drop it. It does not matter whether @offset itself is actually within that cluster’s cached range or not.”

Max

+ */




reply via email to

[Prev in Thread] Current Thread [Next in Thread]