From: Anthony Liguori
Subject: Re: [Qemu-devel] [RFC] Replace posix-aio with custom thread pool
Date: Fri, 12 Dec 2008 11:25:55 -0600
User-agent: Thunderbird 2.0.0.17 (X11/20080925)

Andrea Arcangeli wrote:
> On Fri, Dec 12, 2008 at 10:49:45AM -0600, Anthony Liguori wrote:
> > I meant, if you wanted to pass a file descriptor as a raw device. So:
> >
> >   qemu -hda raw:fd=4
> >
> > Or something like that. We don't support this today.
>
> ah ok.
>
> > I think bouncing the iov and just using pread/pwrite may be our best bet. It means memory allocation but we can cap it. Since we're using threads,
>
> It's already capped. However currently it generates an iovec, but we've simply to check the iovcnt to be 1, if it's 1 we pread from iov.iov_base, iov.iov_len. The dma api will take care to enforce iovcnt to be 1 for the iovec if preadv/pwritev isn't detected at compile time.

Hrm, that's more complex than I was expecting. I was thinking the bdrv aio infrastructure would always take an iovec. Any details about the underlying host's ability to handle the iovec would be insulated.

> > we just can force a thread to sleep until memory becomes available so it's actually pretty straight forward.
>
> There's no way to detect that and wait for memory,

If we artificially cap at say 50MB, then you do something like:

    while (buffer == NULL) {
        buffer = try_to_bounce(offset, iov, iovcnt, &size);
        if (buffer == NULL && errno == ENOMEM) {
            /* pseudocode: sleep on a "more memory" condition variable */
            pthread_cond_wait(more memory);
        }
    }

try_to_bounce allocs with malloc() but if you exceed 50MB, then you fail with an error of ENOMEM. In your bounce_free() function, you do a pthread_cond_broadcast() to wake up any threads potentially waiting to allocate memory.
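
The loop above could be fleshed out roughly as follows. This is only a sketch under assumptions, not code from QEMU: the names BOUNCE_CAP, bounce_lock, more_memory, bounce_used, iov_total and bounce_alloc are hypothetical, the signatures of try_to_bounce() and bounce_free() are guesses, and the offset argument is dropped because allocation does not need it. The real pthread_cond_wait() also takes a mutex, which is why the accounting is done under a lock.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/uio.h>

/* Hypothetical cap on outstanding bounce memory ("say 50MB"). */
#define BOUNCE_CAP (50u * 1024 * 1024)

static pthread_mutex_t bounce_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t more_memory = PTHREAD_COND_INITIALIZER;
static size_t bounce_used; /* bytes currently handed out, <= BOUNCE_CAP */

/* Sum of all vector lengths. */
static size_t iov_total(const struct iovec *iov, int iovcnt)
{
    size_t total = 0;
    for (int i = 0; i < iovcnt; i++)
        total += iov[i].iov_len;
    return total;
}

/* Call with bounce_lock held. Fails with ENOMEM once the cap is hit. */
static void *try_to_bounce(const struct iovec *iov, int iovcnt, size_t *size)
{
    size_t total = iov_total(iov, iovcnt);

    if (total > BOUNCE_CAP - bounce_used) {
        errno = ENOMEM;
        return NULL;
    }
    void *buf = malloc(total ? total : 1);
    if (buf == NULL) {
        errno = ENOMEM;
        return NULL;
    }
    bounce_used += total;
    *size = total;
    return buf;
}

/* Sleep until enough bounce memory is free, then allocate it. */
void *bounce_alloc(const struct iovec *iov, int iovcnt, size_t *size)
{
    void *buf;
    int err = 0;

    pthread_mutex_lock(&bounce_lock);
    for (;;) {
        buf = try_to_bounce(iov, iovcnt, size);
        if (buf != NULL)
            break;
        err = errno;
        /* With nothing outstanding, no bounce_free() will ever wake us. */
        if (err != ENOMEM || bounce_used == 0)
            break;
        pthread_cond_wait(&more_memory, &bounce_lock);
    }
    pthread_mutex_unlock(&bounce_lock);
    if (buf == NULL)
        errno = err;
    return buf;
}

/* Return the memory and wake every thread waiting for room under the cap. */
void bounce_free(void *buf, size_t size)
{
    free(buf);
    pthread_mutex_lock(&bounce_lock);
    assert(size <= bounce_used);
    bounce_used -= size;
    pthread_cond_broadcast(&more_memory);
    pthread_mutex_unlock(&bounce_lock);
}
```

Because the cap is accounted for in our own counter rather than by probing the host, the wait never depends on malloc() actually failing.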
This lets us expose a preadv/pwritev function that actually works. The expectation is that bouncing will outperform just doing pread/pwrite of each vector. Of course, you could get smart and, if try_to_bounce fails, fall back to pread/pwrite of each vector. Likewise, you can fast-path the case of a single iovec to avoid bouncing entirely.
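
To make the three paths concrete, here is a rough sketch of what such a preadv emulation might look like. It is an illustration only: emulated_preadv and pread_each are hypothetical names, and a plain malloc() stands in for the capped bounce allocator so that the example is self-contained.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700 /* for pread() */
#endif
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fallback: one pread() per vector, stopping at the first short read. */
static ssize_t pread_each(int fd, const struct iovec *iov, int iovcnt,
                          off_t offset)
{
    ssize_t done = 0;

    for (int i = 0; i < iovcnt; i++) {
        ssize_t n = pread(fd, iov[i].iov_base, iov[i].iov_len, offset + done);
        if (n < 0)
            return done > 0 ? done : -1;
        done += n;
        if ((size_t)n < iov[i].iov_len)
            break;
    }
    return done;
}

/* Hypothetical preadv() replacement for hosts without the syscall. */
ssize_t emulated_preadv(int fd, const struct iovec *iov, int iovcnt,
                        off_t offset)
{
    assert(iov != NULL && iovcnt > 0);

    /* Fast path: a single vector needs no bounce buffer. */
    if (iovcnt == 1)
        return pread(fd, iov[0].iov_base, iov[0].iov_len, offset);

    size_t total = 0;
    for (int i = 0; i < iovcnt; i++)
        total += iov[i].iov_len;

    /* Stand-in for the capped bounce allocation. */
    char *bounce = malloc(total ? total : 1);
    if (bounce == NULL)
        return pread_each(fd, iov, iovcnt, offset);

    /* One linear read, then scatter the bytes into the caller's vectors. */
    ssize_t n = pread(fd, bounce, total, offset);
    int saved_errno = errno;
    size_t copied = 0;
    for (int i = 0; n > 0 && i < iovcnt && copied < (size_t)n; i++) {
        size_t chunk = iov[i].iov_len;
        if (chunk > (size_t)n - copied)
            chunk = (size_t)n - copied;
        memcpy(iov[i].iov_base, bounce + copied, chunk);
        copied += chunk;
    }
    free(bounce);
    errno = saved_errno;
    return n;
}
```

The write side would mirror this: gather the vectors into the bounce buffer first, then issue a single pwrite().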

Regards,

Anthony Liguori

> it'd sigkill before you can check... at least with the default overcommit. The way the dma api works is that it doesn't send a mega large writev, but sends it in pieces capped by the max buffer size, with many iovecs with iovcnt = 1.
>
> > We can use libaio on older Linux's to simulate preadv/pwritev. Use the proper syscalls on newer kernels, on BSDs, and bounce everything else.
>
> Given READV/WRITEV aren't available in not very recent kernels and given that without O_DIRECT each iocb will become synchronous, we can't use the libaio. Also once they fix linux-aio, if we do that, the iocb logic would need to be largely refactored. So I'm not sure if it's worth it as it can't handle 2.6.16-18 when O_DIRECT is disabled (when O_DIRECT is enabled we could just build an array of linear iocb).