Re: [PATCH V3 00/22] Live Update
From: Steven Sistare
Subject: Re: [PATCH V3 00/22] Live Update
Date: Tue, 18 May 2021 12:00:59 -0400
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.10.1
On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
>>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
>>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
>>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
>>>>>> Provide the cprsave and cprload commands for live update. These save
>>>>>> and restore VM state, with minimal guest pause time, so that qemu may
>>>>>> be updated to a new version in between.
>>>>>>
>>>>>> cprsave stops the VM and saves vmstate to an ordinary file. It supports
>>>>>> two modes: restart and reboot. For restart, cprsave exec's the qemu
>>>>>> binary (or /usr/bin/qemu-exec if it exists) with the same argv. qemu
>>>>>> restarts in a paused state and waits for the cprload command.
>>>>>
>>>>> I think cprsave/cprload could be generalized by using QMP to stash the
>>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
>>>>> already opens fds passed using this mechanism.
>>>>>
>>>>> I haven't checked but it may be possible to drop some patches by reusing
>>>>> QEMU's monitor file descriptor passing since the code already knows how
>>>>> to open from 'getfd' fds.
>>>>>
>>>>> The reason why using QMP is interesting is because it eliminates the
>>>>> need for execve(2). QEMU may be unable to execute a program due to
>>>>> chroot, seccomp, etc.
>>>>>
>>>>> QMP would enable cprsave/cprload to work both with and without
>>>>> execve(2).
>>>>>
>>>>> One tricky thing with this approach might be startup ordering: how to
>>>>> get fds via the QMP monitor in the new process before processing the
>>>>> entire command-line.
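The fd-passing primitive that QMP's 'getfd' builds on can be sketched in a few lines. This is a hypothetical illustration, not QEMU code: it assumes Python 3.9+ on a Unix-like OS and uses a socketpair in one process to stand in for the old-qemu/new-qemu endpoints.

```python
import os
import socket

# Two endpoints standing in for the monitor client and the QEMU process.
parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# "Old" side: open a descriptor and hand it over with SCM_RIGHTS.
r, w = os.pipe()
os.write(w, b"vm state bytes")
os.close(w)
socket.send_fds(parent, [b"fd follows"], [r])
os.close(r)  # sender's copy may close; the in-flight fd stays valid

# "New" side: receive the descriptor and read through it.
msg, fds, _flags, _addr = socket.recv_fds(child, 1024, 1)
data = os.read(fds[0], 1024)
print(data.decode())  # prints "vm state bytes"
```

The received descriptor refers to the same open file description as the sender's, which is why state behind an fd can outlive the process that created it.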
>>>>
>>>> Early on I experimented with a similar approach. Old qemu passed
>>>> descriptors to an escrow process and exited; new qemu started and
>>>> retrieved the descriptors from escrow. vfio mostly worked after I hacked
>>>> the kernel to suppress the original-pid owner check. I suspect my recent
>>>> vfio extensions would smooth the rough edges.
>>>
>>> I wonder about the reason for VFIO's pid limitation, maybe because it
>>> pins pages from the original process?
>>
>> The dma unmap code verifies that the requesting task is the same as the
>> task that mapped the pages. We could add an ioctl that passes ownership
>> to a new task. We would also need to fix locked memory accounting, which
>> is associated with the mm of the original task.
>
>>> Is this VFIO pid limitation the main reason why you chose to make QEMU
>>> execve(2) the new binary?
>>
>> That is one. Plus, re-attaching to named shared memory for pc.ram causes
>> the vfio conflict errors I mentioned in the previous email. We would need
>> to suppress redundant dma map calls, but allow legitimate dma maps and
>> unmaps in response to the ongoing address space changes and diff callbacks
>> caused by some drivers. It would be messy and fragile. In general, it
>> felt like I was working against vfio rather than with it.
>
> OK the weirdness of vfio helps explain a bit about why you're doing it
> this way; can you help separate some difference between restart and
> reboot for me though:
>
> In 'reboot' mode, where the guest must do suspend in its drivers, how
> much of these vfio requirements are needed? I guess the memfd use for
> the anonymous areas isn't any use for reboot mode.
Correct. For reboot no special vfio support or fiddling is needed.
> You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
> vfio still care about the currently-anonymous areas?
Yes, for restart mode. The physical pages behind the anonymous memory
remain pinned and are targets for ongoing DMA. Post-exec qemu needs a way
to find those same pages.
>> Another big reason is a requirement to preserve anonymous memory for
>> legacy qemu updates (via code injection, which I briefly mentioned at KVM
>> Forum). If we extend cpr to allow updates without exec, I still need the
>> exec option.
>
> Can you explain what that code injection mechanism is for those of us
> who didn't see that?
Sure. Here is slide 12 from the talk. It relies on mmap(MADV_DOEXEC), which
was not accepted upstream.
-----------------------------------------------------------------------------
Legacy Live Update
* Update legacy qemu process to latest version
- Inject code into legacy qemu process to perform cprsave: vmsave.so
. Access qemu data structures and globals
- eg ram_list, savevm_state, chardevs, vhost_devices
- dlopen does not resolve them, must get addresses via symbol lookup.
. Delete some vmstate handlers, register new ones (eg vfio)
. Call MADV_DOEXEC on guest memory. Find devices, preserve fd
* Hot patch a monitor function to dlopen vmsave.so, call entry point
- write patch to /proc/pid/mem
- Call the monitor function via monitor socket
* Send cprload to update qemu
* vmsave.so has binary dependency on qemu data structures and variables
- Build vmsave-ver.so per legacy version
- Indexed by qemu's gcc build-id
-----------------------------------------------------------------------------
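The slide's point that dlopen does not resolve the host's globals, so vmsave.so must look addresses up by name, can be shown with a small analogy. This is not the actual injection code: it uses ctypes to dlopen(NULL) the running process's own image and resolve a libc symbol at runtime, the same dlsym-style lookup vmsave.so would use for qemu symbols like ram_list.

```python
import ctypes
import os

# dlopen(NULL): a handle onto the symbols of the current live process.
self_image = ctypes.CDLL(None)

# dlsym-style lookup by name; link-time resolution never happened for this
# symbol, exactly as with qemu globals seen from an injected library.
looked_up = self_image.getpid
looked_up.restype = ctypes.c_int
print(looked_up() == os.getpid())  # prints True: the resolved symbol is live
```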
- Steve
>>>> However, the main issue is that guest ram must be backed by named
>>>> shared memory, and we would need to add code to support shared memory
>>>> for all the secondary memory objects. That makes it less interesting
>>>> for us at this time; we care about updating legacy qemu instances with
>>>> anonymous guest memory.
>>>
>>> Thanks for explaining this more in the other sub-thread. The secondary
>>> memory objects you mentioned are relatively small so I don't think
>>> saving them in the traditional way is a problem.
>>>
>>> Two approaches for zero-copy memory migration fit into QEMU's existing
>>> migration infrastructure:
>>>
>>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
>>> etc) so they are not saved into the savevm file. The existing --object
>>> memory-backend-file syntax can be used.
>>>
>>> - Extending the live migration protocol to detect when file descriptor
>>> passing is available (i.e. UNIX domain socket migration) and using
>>> that for memory-backend-* objects that have fds.
>>>
>>> Either of these approaches would handle RAM with existing savevm/migrate
>>> commands.
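The first approach above uses syntax QEMU already has. A sketch of what such a configuration might look like (paths and sizes are placeholders; the guest's memory size must match):

```shell
# Back guest RAM with a named shared-memory file so a successor process can
# reopen it; such a block would not need to be copied into the savevm file.
qemu-system-x86_64 -m 4G \
  -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/guest-ram,share=on \
  -machine memory-backend=ram0
```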
>>
>> Yes, but the vfio issues would still need to be solved, and we would need
>> new command line options to back existing and future secondary memory
>> objects with named shared memory.
>>
>>> The remaining issue is how to migrate VFIO and other file descriptors
>>> that cannot be reopened by the new process. As mentioned, QEMU already
>>> has file descriptor passing support in the QMP monitor and support for
>>> opening passed file descriptors (see qemu_open_internal(),
>>> monitor_fd_param(), and socket_get_fd()).
>>>
>>> The advantage of integrating live update functionality into the existing
>>> savevm/migrate commands is that it will work in more use cases with less
>>> code duplication, maintenance, and bitrot than the special-case cprsave
>>> command in this patch series.
>>>
>>> Maybe there is a fundamental technical reason why live update needs to
>>> be different from QEMU's existing migration commands but I haven't
>>> figured it out yet.
>>
>> vfio and anonymous memory.
>>
>> Regarding code duplication, I did consider whether to extend the migration
>> syntax and implementation versus creating something new. Those functions
>> handle stuff like bdrv snapshot, aio, and migration which are n/a for the
>> cpr use case, and the cpr functions handle state that is n/a for the
>> migration case. I judged that handling both in the same functions would
>> be less readable and maintainable. After feedback during the V1 review, I
>> simplified the cprsave code by calling qemu_save_device_state, as Xen
>> does, thus eliminating any interaction with the migration code.
>>
>> Regarding bit rot, I still need to add a cpr test to the test suite, when
>> the review is more complete and folks agree on the final form of the
>> functionality.
>>
>> I do like the idea of supporting update without exec, but as a future
>> project, and not at the expense of dropping update with exec.
>>
>> - Steve
>>