Re: [PATCH] migration/calc-dirty-rate: millisecond precision period
From: gudkov.andrei
Subject: Re: [PATCH] migration/calc-dirty-rate: millisecond precision period
Date: Mon, 31 Jul 2023 17:51:49 +0300
On Mon, Jul 17, 2023 at 03:08:37PM -0400, Peter Xu wrote:
> On Tue, Jul 11, 2023 at 03:38:18PM +0300, gudkov.andrei@huawei.com wrote:
> > On Thu, Jul 06, 2023 at 03:23:43PM -0400, Peter Xu wrote:
> > > On Thu, Jun 29, 2023 at 11:59:03AM +0300, Andrei Gudkov wrote:
> > > > Introduces an alternative argument, calc-time-ms, which is
> > > > the same as calc-time but accepts a millisecond value.
> > > > Millisecond precision makes it possible to predict whether
> > > > migration will succeed or not. To do this, calculate the dirty
> > > > rate with calc-time-ms set to the max allowed downtime, convert
> > > > the measured rate into a volume of dirtied memory, and divide it
> > > > by the network throughput. If the resulting time is lower than the
> > > > max allowed downtime, then migration will converge.
> > > >
> > > > Measurement results for a single thread randomly writing to
> > > > a 24GiB region:
> > > > +--------------+--------------------+
> > > > | calc-time-ms | dirty-rate (MiB/s) |
> > > > +--------------+--------------------+
> > > > |          100 |               1880 |
> > > > |          200 |               1340 |
> > > > |          300 |               1120 |
> > > > |          400 |               1030 |
> > > > |          500 |                868 |
> > > > |          750 |                720 |
> > > > |         1000 |                636 |
> > > > |         1500 |                498 |
> > > > |         2000 |                423 |
> > > > +--------------+--------------------+
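
(Editorial aside: the convergence check described in the commit message
above boils down to simple arithmetic. Below is a minimal Python sketch
of it; the function name and the throughput/downtime figures are made up
for illustration, with the dirty rate taken from the calc-time-ms=300
row of the table:)

def migration_converges(dirty_rate_mib_s, calc_time_ms, net_throughput_mib_s):
    # Volume of memory dirtied during the measurement window (MiB) ...
    dirty_volume_mib = dirty_rate_mib_s * (calc_time_ms / 1000.0)
    # ... and the time needed to send it over the migration channel (s).
    transfer_time_s = dirty_volume_mib / net_throughput_mib_s
    # Migration converges if that transfer fits into the downtime budget.
    return transfer_time_s <= calc_time_ms / 1000.0

# Hypothetical example: 300ms downtime budget, a 10 GiB/s link, and the
# dirty rate measured with calc-time-ms=300 from the table above.
print(migration_converges(dirty_rate_mib_s=1120, calc_time_ms=300,
                          net_throughput_mib_s=10 * 1024))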
> > >
> > > Do you mean the dirty workload is constant? Why does it differ so much
> > > with different calc-time-ms?
> >
> > The workload is as constant as it could be. But the naming is misleading.
> > What is named "dirty-rate" is in fact not a "rate" at all.
> > calc-dirty-rate measures the number of *uniquely* dirtied pages, i.e. each
> > page can contribute to the counter only once during the measurement period.
> > That's why the values are decreasing. Consider also the ad infinitum
> > argument: since a VM has a fixed number of pages and each page can be
> > counted only once, dirty-rate = number-of-dirtied-pages/calc-time -> 0
> > as calc-time -> inf. It would make more sense to report the number as
> > "dirty-volume" -- without dividing it by calc-time.
> >
> > Note that the number of *uniquely* dirtied pages in a given amount of time
> > is exactly what we need for doing migration-related predictions. There is
> > no error here.
>
> Is calc-time-ms the duration of the measurement?
>
> Taking the 1st line as an example, 1880MB/s * 0.1s = 188MB.
> For the 2nd line, 1340MB/s * 0.2s = 268MB.
> Even for the longest duration of 2s, that's 846MB in total.
>
> The range is 24GB. In this case, even with random access, most of the
> pages should only be written once for all these test durations, right?
>
Yes, I messed up the load generator:
the effective memory region was much smaller than 24GiB.
I performed more testing (after fixing the load generator),
now with different memory sizes and different measurement modes.
+--------------+-----------------------------------------------+
| calc-time-ms |               dirty rate (MiB/s)              |
|              +----------------+---------------+--------------+
|              |  theoretical   | page-sampling | dirty-bitmap |
|              | (at 3M wr/sec) |               |              |
+--------------+----------------+---------------+--------------+
|                             1GiB                             |
+--------------+----------------+---------------+--------------+
|          100 |           6996 |          7100 |         3192 |
|          200 |           4606 |          4660 |         2655 |
|          300 |           3305 |          3280 |         2371 |
|          400 |           2534 |          2525 |         2154 |
|          500 |           2041 |          2044 |         1871 |
|          750 |           1365 |          1341 |         1358 |
|         1000 |           1024 |          1052 |         1025 |
|         1500 |            683 |           678 |          684 |
|         2000 |            512 |           507 |          513 |
+--------------+----------------+---------------+--------------+
|                             4GiB                             |
+--------------+----------------+---------------+--------------+
|          100 |          10232 |          8880 |         4070 |
|          200 |           8954 |          8049 |         3195 |
|          300 |           7889 |          7193 |         2881 |
|          400 |           6996 |          6530 |         2700 |
|          500 |           6245 |          5772 |         2312 |
|          750 |           4829 |          4586 |         2465 |
|         1000 |           3865 |          3780 |         2178 |
|         1500 |           2694 |          2633 |         2004 |
|         2000 |           2041 |          2031 |         1789 |
+--------------+----------------+---------------+--------------+
|                             24GiB                            |
+--------------+----------------+---------------+--------------+
|          100 |          11495 |          8640 |         5597 |
|          200 |          11226 |          8616 |         3527 |
|          300 |          10965 |          8386 |         2355 |
|          400 |          10713 |          8370 |         2179 |
|          500 |          10469 |          8196 |         2098 |
|          750 |           9890 |          7885 |         2556 |
|         1000 |           9354 |          7506 |         2084 |
|         1500 |           8397 |          6944 |         2075 |
|         2000 |           7574 |          6402 |         2062 |
+--------------+----------------+---------------+--------------+
Theoretical values are computed according to the following formula:
    size * (1 - (1 - 4096/size)^(time*wps)) / (time * 2^20),
where size is in bytes, time is in seconds, and wps is the number of
writes per second (I measured approximately 3000000 on my system).
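
(For reference, here is a small Python check of this formula against the
table; the only assumed input is the ~3M wr/sec figure mentioned above:)

def theoretical_dirty_rate(size, time, wps=3_000_000):
    # Expected dirty rate in MiB/s for uniformly random 4KiB-page writes:
    # probability that a page is dirtied at least once during time*wps
    # writes, multiplied by the region size, divided by elapsed time.
    return size * (1 - (1 - 4096 / size) ** (time * wps)) / (time * 2**20)

# 1GiB region, calc-time-ms=1000 -> ~1024 MiB/s, matching the table above.
print(round(theoretical_dirty_rate(size=2**30, time=1.0)))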
Theoretical values and values obtained with page-sampling agree
reasonably well (<=25% difference). Dirty-bitmap values are much lower,
likely because the majority of writes trigger page faults. Even though
the dirty-bitmap logic is closer to what happens during live
migration, I still favor page sampling because it doesn't
impact the performance of the VM as much.
Whether calc-time < 1 sec is meaningful or not depends on the size
of the memory region with active writes:
1. If we have a big VM and writes are evenly spread over the whole
   address space, then almost all writes will go to unique pages.
   In this case the number of dirty pages will grow approximately
   linearly with time for small calc-time values.
2. But if the memory region with active writes is small enough, then many
   writes will go to the same page, and the number of dirty pages
   will grow sublinearly even for small calc-time values. Note that
   the second scenario can happen even when VM RAM is big. For example,
   imagine a 128GiB VM with an in-memory database that is used mostly
   for reading. Although the VM is big, the memory region with active
   writes is just the application stack.
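
The difference between the two scenarios is easy to reproduce with a toy
simulation along the lines of the sketch below (the region sizes and
write counts are arbitrary, and pages are assumed to be 4KiB):

import random

def unique_pages_dirtied(region_bytes, writes, page=4096):
    # Count unique pages touched by uniformly random writes.
    npages = region_bytes // page
    return len({random.randrange(npages) for _ in range(writes)})

# Scenario 1 -- large region: almost every write hits a fresh page,
# so the count grows nearly linearly with the number of writes.
# Scenario 2 -- small hot region: writes pile up on the same pages,
# so the count grows sublinearly.
for region in (24 << 30, 64 << 20):  # 24GiB vs 64MiB actively written
    counts = [unique_pages_dirtied(region, w) for w in (10_000, 20_000, 40_000)]
    print(f"{region >> 20} MiB region: {counts}")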