Re: [PATCH v3 4/6] util: implement seqcache


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [PATCH v3 4/6] util: implement seqcache
Date: Fri, 12 Mar 2021 17:37:41 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1

12.03.2021 16:41, Max Reitz wrote:
On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
Implement cache for small sequential unaligned writes, so that they may
be cached until we get a complete cluster and then write it.

The cache is intended to be used for backup to qcow2 compressed target
opened in O_DIRECT mode, but can be reused for any similar (even not
block-layer related) task.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
  include/qemu/seqcache.h |  42 +++++
  util/seqcache.c         | 361 ++++++++++++++++++++++++++++++++++++++++
  MAINTAINERS             |   6 +
  util/meson.build        |   1 +
  4 files changed, 410 insertions(+)
  create mode 100644 include/qemu/seqcache.h
  create mode 100644 util/seqcache.c

Looks quite good to me, thanks.  Nice explanations, too. :)

The only design question I have is whether there’s a reason you’re using a list 
again instead of a hash table.  I suppose we do need the list anyway because of 
the next_flush iterator, so using a hash table would only complicate the 
implementation, but still.

Yes, it seems correct for the flush iterator to go in the same order as the
writes come, so we need a list. We could add a hash table, but it would only
help reads. For the compressed cache in qcow2 we try to flush often enough
that there should not be many clusters in the cache. So I think a hash table
can be added later if needed.
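
Something like this could be layered on top later (a rough sketch only; the
by_start table and the Cluster::cl_start key field are made-up additions, the
rest reuses what the patch already has):

    /* in seqcache_new():
     *     s->by_start = g_hash_table_new(g_int64_hash, g_int64_equal);
     */

    /* Called from seqcache_write_one() right after QSIMPLEQ_INSERT_TAIL().
     * The key points into the Cluster itself, so it stays valid as long as
     * the cluster is cached. */
    static void seqcache_index_insert(SeqCache *s, Cluster *cl)
    {
        cl->cl_start = cluster_start(s, cl->offset);
        g_hash_table_insert(s->by_start, &cl->cl_start, cl);
    }

    /* O(1) replacement for the list walk in seqcache_find_cluster(). */
    static Cluster *seqcache_find_cluster(SeqCache *s, int64_t offset)
    {
        int64_t cl_start = cluster_start(s, offset);

        return g_hash_table_lookup(s->by_start, &cl_start);
    }

(The list stays authoritative for flush order; the table would only have to
be kept in sync on insert and discard.)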


[...]

diff --git a/util/seqcache.c b/util/seqcache.c
new file mode 100644
index 0000000000..d923560eab
--- /dev/null
+++ b/util/seqcache.c
@@ -0,0 +1,361 @@
+/*
+ * Cache for small sequential write requests.
+ *
+ * Copyright (c) 2021 Virtuozzo International GmbH.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ *
+ *
+ * = Description =
+ *
+ * SeqCache is an abbreviation for Sequential Cache.
+ *
+ * The Cache is intended to improve performance of small unaligned sequential
+ * writes. Cache has a cluster_size parameter and the unit of caching is aligned
+ * cluster.  Cache has a list of cached clusters, several "finished" ones and at
+ * most one "unfinished". "unfinished" cluster is a cluster where last byte of
+ * last write operation is cached and there is a free place after that byte to
+ * the end of cluster. "finished" clusters are just stored in cache to be read
+ * or flushed and don't allow any writes to them.
+ *
+ * If write to the cache intersects cluster bounds, it's splat into several

*split

(Though I like “splat”.  Sounds like a wet blotch.)

+ * requests by cluster bounds. So, consider a write that doesn't intersect
+ * cluster bounds to describe the whole picture:
+ *
+ * There are two cases allowed:
+ *
+ * 1. Sequential write to "unfinished" cluster. Actually it's a write sequential
+ *    previous write. The data goes to "unfinished" cluster. If "unfinished" is
+ *    filled up to the cluster bound it becomes "finished".
+ *
+ * 2. Write to new cluster, not existing in the cache. It may be sequential to
+ *    previous write or not. Current "unfinshed" cluster (if exists) becomes

*unfinished

+ *    "finished" and new "unfinished" cluster is started. Note also that offset
+ *    of the write to new cluster is not required to be aligned.
+ *
+ *    Any other write operation (non-sequential write to "unfinished" cluster
+ *    or write to any of "finished" clusters) will crash.
+ */
+
+#include "qemu/osdep.h"
+
+#include "qemu/queue.h"
+#include "qemu/seqcache.h"
+
+/*
+ * Cluster
+ *
+ * Representation of one cached cluster, aligned to SeqCache::cluster_size.
+ * Caches only one subregion of the cluster, started at @offset (may be
+ * unaligned to cluster_size) and of @bytes length (may be unaligned as well).
+ * The whole subregion always lay in one aligned cluster:
+ *
+ *      QEMU_ALIGN_DOWN(offset, cluster_size) ==
+ *          QEMU_ALIGN_DOWN(offset + bytes - 1, cluster_size)
+ *
+ * @buf is allocated to be able to fill the cluster up to the cluster end, i.e.
+ * allocated @buf length is at least:
+ *
+ *      cluster_size - offset % cluster_size
+ */
+typedef struct Cluster {
+   int64_t offset;
+   int64_t bytes;
+   uint8_t *buf;
+   QSIMPLEQ_ENTRY(Cluster) entry;
+} Cluster;
+
+/*
+ * SeqCache caches small sequential writes into "unfinished" @cur_write cluster,
+ * until entire cluster (of @cluster_size bytes) is filled by seqcache_write()
+ * calls.
+ *
+ * @cur_write->offset may be unaligned to @cluster_size if first write offset is
+ * not aligned (for example, if there was a flush request and cache was flushed,
+ * then we continue from the middle of the cluster with an empty cache).
+ *
+ * @cur_write may be NULL, which means we don't have current cluster and next
+ * seqcache_write() will start a new one.
+ *
+ * @all is a list of all clusters cached in the cache, some "finished" and one
+ * "unfinished" @cur_write (if exists). If @cur_write is not NULL it is a last
+ * one in the list.
+ *
+ * @nb_clusters is number of elements in @all list.
+ *
+ * @next_flush is an iterator for flushing. SeqCache knows nothing about how
+ * data should be flushing, so for flush user calls seqcache_get_next_flush()

s/flushing/flushed/

+ * (which moves @next_flush iterator) and when data is flushed somehow and cache
+ * is not needed anymore, user can call seqcache_discard_cluster().

AFAIU, next_flush always points to the first finished cluster that has not yet 
been returned by seqcache_get_next_flush(), is that correct?

Yes, right.

 (Yes, at least the latter part is implied by this comment, I’m just asking for 
clarity, because I want to be sure the simple

   s->next_flush = QSIMPLEQ_NEXT(s->next_flush, entry);

in seqcache_get_next_flush() does what I think it does, which is never to let 
s->next_flush be NULL even though there are still flushable clusters somewhere.)
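
To illustrate the intended flow from the caller's side, a rough sketch of a
flush loop (write_to_image() is just a placeholder here, and
seqcache_discard_cluster() is assumed to take the cluster's offset):

    static int flush_cache(SeqCache *s, bool flush_unfinished)
    {
        int64_t offset, bytes;
        uint8_t *buf;
        bool unfinished = flush_unfinished;

        while (seqcache_get_next_flush(s, &offset, &bytes, &buf,
                                       &unfinished)) {
            int ret = write_to_image(offset, bytes, buf);
            if (ret < 0) {
                return ret;
            }
            /* Data is safe now, so the cached cluster can be dropped. */
            seqcache_discard_cluster(s, offset);
            /* get_next_flush() clears *unfinished when it returns a
             * "finished" cluster, so restore the caller's intent. */
            unfinished = flush_unfinished;
        }

        return 0;
    }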

+ */
+typedef struct SeqCache {
+    int64_t cluster_size;
+    int64_t nb_clusters;
+    Cluster *cur_write;
+    Cluster *next_flush;
+    QSIMPLEQ_HEAD(, Cluster) all;
+} SeqCache;

[...]

+/* Align down @offset to s->cluster_size and search for corresponding cluster */
+static Cluster *seqcache_find_cluster(SeqCache *s, int64_t offset)
+{
+    Cluster *cl;
+    int64_t cl_start = cluster_start(s, offset);
+
+    QSIMPLEQ_FOREACH(cl, &s->all, entry) {
+        if (cluster_start(s, cl->offset) == cl_start) {
+            return cl;
+        }
+    }
+
+    return NULL;
+}
+
+/* Makes current "unfinished" cluster the "finished" one. */

This sounds a bit like there’d be only a single finished cluster, so I’d rather write it as “Mark 
the current "unfinished" cluster as "finished".”

+static void seqcache_finalize_current_cluster(SeqCache *s)
+{
+    if (s->cur_write && !s->next_flush) {
+        s->next_flush = s->cur_write;
+    }
+    s->cur_write = NULL;
+}
+
+/*
+ * Write an @offset, @bytes, @buf request into the cache. The requests MUST not

s/requests/request/

+ * intersect cluster bounds.
+ */
+static void seqcache_write_one(SeqCache *s, int64_t offset, int64_t bytes,
+                               uint8_t *buf)

Could use a const, though not a must.

+{
+    assert(bytes > 0);
+    assert(cluster_start(s, offset) == cluster_start(s, offset + bytes - 1));
+
+    if (s->cur_write && offset == cached_end(s->cur_write)) {
+        /* Continue sequential process */
+        memcpy(s->cur_write->buf + s->cur_write->bytes, buf, bytes);
+        s->cur_write->bytes += bytes;
+
+        if (cached_end(s->cur_write) == cluster_end(s, s->cur_write->offset)) {
+            seqcache_finalize_current_cluster(s);
+        }
+
+        return;
+    }
+
+    /* We are starting a new cluster. Check that it's really new */
+    assert(!seqcache_find_cluster(s, offset));
+
+    seqcache_finalize_current_cluster(s);
+
+    s->cur_write = g_new(Cluster, 1);
+    *s->cur_write = (Cluster) {
+        .offset = offset,
+        .bytes = bytes,
+        .buf = g_malloc(s->cluster_size),

I have to ask: Why not s->cluster_size - offset % s->cluster_size as advertised 
by the comment describing Cluster?

I just didn't care, as it should be a rare case. But for generic code it's
better to be precise. I'll update it.


More practical question: Should we use qemu_memalign() (aligning either at the 
cluster size or at the block alignment, which would need to be passed to 
seqcache_new()) when offset is aligned to the cluster size? (Or with a custom 
alignment, if it is aligned to that.)

I feel that for O_DIRECT images it might be kind of important to align the 
buffer to the host block size.

Will do
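
Roughly like this (a sketch only; s->alignment would be a new parameter of
seqcache_new(), and buffers allocated with qemu_memalign() would then need
qemu_vfree() instead of g_free()):

    /* In seqcache_write_one(), when starting a new cluster: allocate only
     * up to the cluster end, aligned for O_DIRECT when possible. */
    int64_t buf_len = s->cluster_size - offset % s->cluster_size;
    uint8_t *cl_buf;

    if (QEMU_IS_ALIGNED(offset, s->alignment)) {
        cl_buf = qemu_memalign(s->alignment,
                               QEMU_ALIGN_UP(buf_len, s->alignment));
    } else {
        cl_buf = g_malloc(buf_len);
    }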


+    };
+
+    memcpy(s->cur_write->buf, buf, bytes);
+    QSIMPLEQ_INSERT_TAIL(&s->all, s->cur_write, entry);
+    s->nb_clusters++;
+}
+
+/* Write an @offset, @bytes, @buf request into the cache. */
+void seqcache_write(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf)

“const” might again find its place here.

+{
+    while (bytes) {
+        int64_t cl_end = cluster_end(s, offset);
+        int64_t chunk = MIN(bytes, cl_end - offset);
+
+        seqcache_write_one(s, offset, chunk, buf);
+        offset += chunk;
+        bytes -= chunk;
+        buf += chunk;
+    }
+}

[...]

+/*
+ * Get next region for flushing. @offset, @bytes and @buf are out-parameters
+ * to return the region.
+ *
+ * @unfinished is in-out argument which means that user is interested in
+ * flushing unfinished cluster too:
+ *
+ * If there are "finished" clusters, "finished" cluster is returned and
+ * *@unfinished is set to false, independently of its original value.
+ *
+ * If there are no "finished" clusters but "unfinished" exists (i.e.
+ * s->cur_write != NULL and it is the only element of s->all), then *@unfinished
+ * value remains the same and the following logic works:
+ *
+ *    If *@unfinished:
+ *       return s->cur_write unfinished cluster for flushing
+ *    Else
+ *       return nothing
+ *
+ *
+ * Returns true and set @offset, @bytes, @buf and @unfinished if there is
+ * something to flush (accordingly to @unfinished value), returns false
+ * otherwise.
+ *
+ * Nothing is removed from the cache.

Out of curiosity, mainly, is that because the returned *buf is only valid as 
long as the entry is still in the cache, or is there a different reason that 
I’m missing?

(Hm, perhaps the fact that the user may want to keep it available for reading 
through seqcache_read()?)

Yes, that's for reads. This way reads are not blocked by an in-flight cache
flush: we read from the cache if the data is there, and from the image if not.
We don't have to wait for an intersecting in-flight write to finish.
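
From the caller's side that is roughly (seqcache_read() is not quoted above;
assume here that it mirrors seqcache_write() and returns how many leading
bytes it could serve from the cache, and read_from_image() is a placeholder):

    static int cached_read(SeqCache *s, int64_t offset, int64_t bytes,
                           uint8_t *buf)
    {
        int64_t cached = seqcache_read(s, offset, bytes, buf);

        if (cached == bytes) {
            return 0;   /* fully served from the cache */
        }

        /* Tail is not cached: read it from the image, without waiting for
         * any in-flight flush of the same cluster. */
        return read_from_image(offset + cached, bytes - cached,
                               buf + cached);
    }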


+ */
+bool seqcache_get_next_flush(SeqCache *s, int64_t *offset, int64_t *bytes,
+                             uint8_t **buf, bool *unfinished)

Could be “uint8_t *const *buf”, I suppose.  Don’t know how much the callers 
would hate that, though.

Will do. Actually, I had written quite a big explanation but missed the fact
that the caller doesn't get ownership of buf; that should be mentioned.


+{
+    Cluster *req = s->next_flush;
+
+    if (s->next_flush) {
+        *unfinished = false;
+        req = s->next_flush;
+        s->next_flush = QSIMPLEQ_NEXT(req, entry);
+        if (s->next_flush == s->cur_write) {
+            s->next_flush = NULL;
+        }
+    } else if (s->cur_write && *unfinished) {
+        req = s->cur_write;

I was wondering whether flushing an unfinished cluster wouldn’t kind of 
finalize it, but I suppose the problem with that would be that you can’t add 
data to a finished cluster, which wouldn’t be that great if you’re just 
flushing the cache without wanting to drop it all.

(The problem I see is that flushing it later will mean all the data that 
already has been written here will have to be rewritten.  Not that bad, I 
suppose.)

Yes, that's all correct. There is also an additional strong reason: qcow2
depends on the fact that clusters become "finished" by defined rules: only
when a cluster is really filled up to the end, or when qcow2 starts writing
another cluster.

For "finished" clusters with an unaligned end we can safely align that end up
to some good alignment, writing a bit more data than needed. It's safe because
the tail of the cluster is never used, and we'll perform better with an
aligned write, avoiding RMW.

But when flushing an "unfinished" cluster, we should write exactly what we
have in the cache, as a parallel write to the same cluster may happen, which
will continue the sequential process.
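
For example, the flush side could do something like this (a sketch;
cluster_is_finished() and @align are made up here, and Cluster::buf is
already allocated up to the cluster end, so rounding up stays in bounds):

    int64_t bytes;

    if (cluster_is_finished(cl)) {
        /* Tail of the cluster is never used, so writing a bit more than
         * is cached is safe and avoids RMW on O_DIRECT images. */
        bytes = MIN(QEMU_ALIGN_UP(cl->bytes, align),
                    cluster_end(s, cl->offset) - cl->offset);
    } else {
        /* "unfinished": write exactly what is cached, a parallel write
         * may still continue the sequential process in this cluster. */
        bytes = cl->bytes;
    }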


+    } else {
+        return false;
+    }
+
+    *offset = req->offset;
+    *bytes = req->bytes;
+    *buf = req->buf;
+
+    return true;
+}
+
+/*
+ * Find corresponding cluster and drop it. No matter does requested @offset is
+ * cached itself or not.

The second sentence sounds strange grammatically, if I understand correctly, 
I’d change this to something like “Find the cluster corresponding to @offset 
and drop it.  It does not matter whether @offset itself is actually within that 
cluster’s cached range or not.”


Right, thanks for the good rewording. I'm afraid I often use Russian language
constructions with English words.

--
Best regards,
Vladimir


