qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH] migration: Cleanup during exit


From: Daniel P . Berrangé
Subject: Re: [Qemu-devel] [PATCH] migration: Cleanup during exit
Date: Thu, 28 Feb 2019 12:01:03 +0000
User-agent: Mutt/1.10.1 (2018-07-13)

On Thu, Feb 28, 2019 at 11:40:19AM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (address@hidden) wrote:
> > On Wed, Feb 27, 2019 at 04:49:00PM +0000, Dr. David Alan Gilbert (git) 
> > wrote:
> > > From: "Dr. David Alan Gilbert" <address@hidden>
> > > 
> > > Currently we cleanup the migration object as we exit main after the
> > > main_loop finishes; however if there's a migration running things
> > > get messy and we can end up with the migration thread still trying
> > > to access freed structures.
> > > 
> > > We now take a ref to the object around the migration thread itself,
> > > so the act of dropping the ref during exit doesn't cause us to lose
> > > the state until the thread quits.
> > > 
> > > Cancelling the migration during migration also tries to get the thread
> > > to quit.
> > > 
> > > We do this a bit earlier; so hopefully migration gets out of the way
> > > before all the devices etc are freed.
> > 
> > So does it mean that even with the patch it's still possible the
> > migration thread will be accessing device structs that have already
> > been freed which can still crash QEMU?
> 
> Possibly yes; I'm not sure how to go to the next stage and stop that
> case; the consensus seems to be we don't want to explicitly block
> during the exit process, so doing a join on the migration thread doesn't
> seem to be wanted.

Essentially the many  *_cleanup() calls at the end of main() in vl.c
are only ever safe if all background threads have stopped using the
resources that are being freed. This isn't the case with migration
currently. I also worry about other threads that might be running
in QEMU, SPICE in particular as it touchs many device backends.

Aborting the migration & joining the migration threads is the natural
way to address this. Cancelling appears to require the main loop to
still be running, so would require main_loop_should_exit() to issue
the cancel & return false unless it has completed. This would delay
shutdown for as long as it takes migration to abort.

FWIW, even the bdrv_close_all() call during shutdown can delay things
for a considerable time if draining the backends isn't a quick op,
which could be the case if there are storage problems (blocked local
I/O, or interrupted network - rbd/ceph/nfs clients). So I'm not sure
that stopping migration is inherantly worse than what's already
possible with block cleanup, in terms of delaying things.

A slightly more hacky approach would be to pthread_cancel() all the
migration threads. Normally we'd never use pthread_cancel() as it
is incredibly hard to ensure all memory used by the threads is
freed. Since we're about to exit the process though, this cleanup
does not matter. The risk, however, is that one of the threads is
holding a mutex which then blocks the rest of the *cleanup functions,
so this still feels dangerous.

Overall to me a clean cancel & join of the migration threads feels
like the only safe option, unless we just remove all notion of
cleanup from QEMU & delete all the _cleanup functions in main()
and let the OS free all memory.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



reply via email to

[Prev in Thread] Current Thread [Next in Thread]