Re: [PATCH V3 00/22] Live Update


From: Steven Sistare
Subject: Re: [PATCH V3 00/22] Live Update
Date: Tue, 18 May 2021 12:00:59 -0400
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.10.1

On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
>>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
>>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
>>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
>>>>>> Provide the cprsave and cprload commands for live update.  These save and
>>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
>>>>>> to a new version in between.
>>>>>>
>>>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
>>>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
>>>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
>>>>>> paused state and waits for the cprload command.
>>>>>
>>>>> I think cprsave/cprload could be generalized by using QMP to stash the
>>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
>>>>> already opens fds passed using this mechanism.
>>>>>
>>>>> I haven't checked but it may be possible to drop some patches by reusing
>>>>> QEMU's monitor file descriptor passing since the code already knows how
>>>>> to open from 'getfd' fds.
>>>>>
>>>>> The reason why using QMP is interesting is because it eliminates the
>>>>> need for execve(2). QEMU may be unable to execute a program due to
>>>>> chroot, seccomp, etc.
>>>>>
>>>>> QMP would enable cprsave/cprload to work both with and without
>>>>> execve(2).
>>>>>
>>>>> One tricky thing with this approach might be startup ordering: how to
>>>>> get fds via the QMP monitor in the new process before processing the
>>>>> entire command-line.
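
For reference, a minimal sketch of what the receive side could look like, using
the existing monitor_fd_param() helper to look up a descriptor that management
stashed with 'getfd'.  The fd name is made up, and the hard part -- fetching it
before the command line is processed -- is exactly the ordering question above,
which this does not solve:

    #include "qemu/osdep.h"
    #include "monitor/monitor.h"

    /* Retrieve a descriptor previously added via the 'getfd' QMP command.
     * "vfio-group-fd" is a hypothetical name chosen by management. */
    static int cpr_restore_fd(Monitor *mon, Error **errp)
    {
        int fd = monitor_fd_param(mon, "vfio-group-fd", errp);
        if (fd < 0) {
            return -1;
        }
        return fd;  /* the caller now owns the descriptor */
    }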
>>>>
>>>> Early on I experimented with a similar approach.  Old qemu passed
>>>> descriptors to an escrow process and exited; new qemu started and retrieved
>>>> the descriptors from escrow.  vfio mostly worked after I hacked the kernel
>>>> to suppress the original-pid owner check.  I suspect my recent vfio
>>>> extensions would smooth the rough edges.
>>>
>>> I wonder about the reason for VFIO's pid limitation, maybe because it
>>> pins pages from the original process?
>>
>> The dma unmap code verifies that the requesting task is the same as the task
>> that mapped the pages.  We could add an ioctl that passes ownership to a new
>> task.  We would also need to fix locked memory accounting, which is associated
>> with the mm of the original task.
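
For context, the mapping being discussed is the type1 map ioctl, roughly as in
this sketch (real uapi from <linux/vfio.h>, but simplified; the kernel ties the
mapping and its locked-memory accounting to the calling task's mm at this
point, which is what the unmap path later checks):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Map [vaddr, vaddr + size) at 'iova' on a type1 container fd. */
    static int dma_map(int container_fd, void *vaddr, uint64_t iova,
                       uint64_t size)
    {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (uint64_t)(uintptr_t)vaddr,
            .iova  = iova,
            .size  = size,
        };
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }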
> 
>>> Is this VFIO pid limitation the main reason why you chose to make QEMU
>>> execve(2) the new binary?
>>
>> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the
>> vfio conflict errors I mentioned in the previous email.  We would need to
>> suppress redundant dma map calls, but allow legitimate dma maps and unmaps in
>> response to the ongoing address space changes and diff callbacks caused by
>> some drivers.  It would be messy and fragile.  In general, it felt like I was
>> working against vfio rather than with it.
> 
> OK, the weirdness of vfio helps explain a bit about why you're doing it
> this way; can you help me separate some differences between restart and
> reboot though:
> 
> In 'reboot' mode, where the guest must do suspend in its drivers, how
> many of these vfio requirements are needed?  I guess the memfd use
> for the anonymous areas isn't any use for reboot mode.

Correct.  For reboot no special vfio support or fiddling is needed.

> You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
> vfio still care about the currently-anonymous areas?

Yes, for restart mode.  The physical pages behind the anonymous memory remain
pinned and are targets for ongoing DMA.  Post-exec qemu needs a way to find
those same pages.
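
Concretely, the sequence around exec is roughly the following sketch.  The
FLAG_VADDR flags are the vfio uapi extensions referenced above; error handling
and the memfd plumbing that lets new qemu mmap the same pages are elided:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Before exec: invalidate the vaddr.  DMA to the pinned pages
     * continues; only the process virtual address is forgotten. */
    static int dma_unmap_vaddr(int container_fd, uint64_t iova, uint64_t size)
    {
        struct vfio_iommu_type1_dma_unmap unmap = {
            .argsz = sizeof(unmap),
            .flags = VFIO_DMA_UNMAP_FLAG_VADDR,
            .iova  = iova,
            .size  = size,
        };
        return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
    }

    /* After exec: supply the new vaddr for the existing iova range. */
    static int dma_remap_vaddr(int container_fd, void *new_vaddr,
                               uint64_t iova, uint64_t size)
    {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_VADDR,
            .vaddr = (uint64_t)(uintptr_t)new_vaddr,
            .iova  = iova,
            .size  = size,
        };
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }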

>> Another big reason is a requirement to preserve anonymous memory for legacy
>> qemu updates (via code injection, which I briefly mentioned at KVM Forum).  If
>> we extend cpr to allow updates without exec, I still need the exec option.
> 
> Can you explain what that code injection mechanism is for those of us
> who didn't see that?

Sure.  Here is slide 12 from the talk.  It relies on mmap(MADV_DOEXEC), which
was not accepted upstream.

-----------------------------------------------------------------------------
Legacy Live Update

 * Update legacy qemu process to latest version
   - Inject code into legacy qemu process to perform cprsave: vmsave.so
     . Access qemu data structures and globals
       - eg ram_list, savevm_state, chardevs, vhost_devices
       - dlopen does not resolve them, must get addresses via symbol lookup.
     . Delete some vmstate handlers, register new ones (eg vfio)
     . Call MADV_DOEXEC on guest memory. Find devices, preserve fd
 * Hot patch a monitor function to dlopen vmsave.so, call entry point
   - write patch to /proc/pid/mem
   - Call the monitor function via monitor socket
 * Send cprload to update qemu
 * vmsave.so has binary dependency on qemu data structures and variables
   - Build vmsave-ver.so per legacy version
   - Indexed by qemu's gcc build-id

-----------------------------------------------------------------------------
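
To illustrate just the address-lookup step from the slide: the symbol offsets
come from offline lookup against the exact legacy binary (indexed by its
build-id), and the load base is parsed from /proc/<pid>/maps.  The names and
types below are illustrative stand-ins, not the real vmsave.so:

    #include <stdint.h>

    /* Stand-in for the real qemu type; dlopen cannot resolve these
     * globals, so addresses are computed as load_base + symbol offset. */
    typedef struct RAMList RAMList;

    /* Hypothetical entry point, called after the hot-patched monitor
     * function dlopen()s vmsave.so. */
    void vmsave_entry(uintptr_t load_base, uintptr_t ram_list_offset)
    {
        RAMList *ram_list = (RAMList *)(load_base + ram_list_offset);
        /* ... walk ram_list, fix up vmstate handlers, run cprsave ... */
        (void)ram_list;
    }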

- Steve
 
>>>> However, the main issue is that guest ram must be backed by named shared
>>>> memory, and we would need to add code to support shared memory for all the
>>>> secondary memory objects.  That makes it less interesting for us at this
>>>> time; we care about updating legacy qemu instances with anonymous guest
>>>> memory.
>>>
>>> Thanks for explaining this more in the other sub-thread. The secondary
>>> memory objects you mentioned are relatively small so I don't think
>>> saving them in the traditional way is a problem.
>>>
>>> Two approaches for zero-copy memory migration fit into QEMU's existing
>>> migration infrastructure:
>>>
>>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
>>>   etc) so they are not saved into the savevm file. The existing --object
>>>   memory-backend-file syntax can be used.
>>>
>>> - Extending the live migration protocol to detect when file descriptor
>>>   passing is available (i.e. UNIX domain socket migration) and using
>>>   that for memory-backend-* objects that have fds.
>>>
>>> Either of these approaches would handle RAM with existing savevm/migrate
>>> commands.
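
For the first approach, I assume the command line would look something like
this (illustrative id, size, and path; share=on is what makes the pages
reattachable by the new process):

    qemu-system-x86_64 \
        -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/guest0.ram,share=on \
        -machine q35,memory-backend=ram0 \
        ...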
>>
>> Yes, but the vfio issues would still need to be solved, and we would need new
>> command line options to back existing and future secondary memory objects
>> with named shared memory.
>>
>>> The remaining issue is how to migrate VFIO and other file descriptors
>>> that cannot be reopened by the new process. As mentioned, QEMU already
>>> has file descriptor passing support in the QMP monitor and support for
>>> opening passed file descriptors (see qemu_open_internal(),
>>> monitor_fd_param(), and socket_get_fd()).
>>>
>>> The advantage of integrating live update functionality into the existing
>>> savevm/migrate commands is that it will work in more use cases, with less
>>> code duplication and maintenance, and better bitrot prevention, than the
>>> special-case cprsave command in this patch series.
>>>
>>> Maybe there is a fundamental technical reason why live update needs to
>>> be different from QEMU's existing migration commands but I haven't
>>> figured it out yet.
>>
>> vfio and anonymous memory.
>>
>> Regarding code duplication, I did consider whether to extend the migration
>> syntax and implementation versus creating something new.  Those functions
>> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
>> use case, and the cpr functions handle state that is n/a for the migration
>> case.  I judged that handling both in the same functions would be less
>> readable and maintainable.  After feedback during the V1 review, I simplified
>> the cprsave code by calling qemu_save_device_state, as Xen does, thus
>> eliminating any interaction with the migration code.
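
Roughly, the simplified save path looks like this sketch; qemu_file_open_write()
is a hypothetical stand-in for however the QEMUFile is actually created over
the vmstate file, and the includes are illustrative:

    #include "qemu/osdep.h"
    #include "sysemu/runstate.h"
    #include "migration/savevm.h"      /* qemu_save_device_state() */

    static int cprsave_sketch(const char *path, Error **errp)
    {
        QEMUFile *f = qemu_file_open_write(path, errp);  /* hypothetical */
        if (!f) {
            return -1;
        }
        vm_stop(RUN_STATE_SAVE_VM);            /* pause the guest */
        int ret = qemu_save_device_state(f);   /* device state only; no
                                                * migration machinery */
        qemu_fclose(f);
        return ret;
    }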
>>
>> Regarding bit rot, I still need to add a cpr test to the test suite, when the
>> review is more complete and folks agree on the final form of the
>> functionality.
>>
>> I do like the idea of supporting update without exec, but as a future
>> project, and not at the expense of dropping update with exec.
>>
>> - Steve
>>


