From: Xavier Hernandez
Subject: Re: [Gluster-devel] [RFC] A new caching/synchronization mechanism to speed up gluster
Date: Mon, 10 Feb 2014 09:46:47 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0
These are a few ideas I had about how to implement a MESI-like
protocol on gluster. It's more a bunch of ideas than a structured
proposal, but I hope it's clear enough to show the basic concepts as
I see them.

Each inode will have two separate access levels: one for the metadata
and one for the data. They could be different. Additionally, many
security checks, POSIX compliance and some other aspects will need to
be handled on the client side, since many requests could be satisfied
directly without accessing the bricks.

First the easy part: normal operation, without any node failures or
other errors.

The main idea is that each client, before processing a request, will
check whether it has enough information about the related inode to
process the request locally (i.e. at least shared access to the inode
for read requests, or exclusive access for writes). If it does, the
request will be processed immediately and returned to the upper
translators and, for write requests, the operation will be continued
in the background.

If the client doesn't have enough access to the inode, it can attach
information to the request telling the bricks which kind of access it
wants for the inode. By default an operation needs a specific access
level (i.e. shared access for reads and exclusive access for writes),
but the client can ask for a less strict level if it won't need the
full access in the near future (for example, a write request needs
exclusive access and the bricks will execute it with exclusive
access, but the client can ask for only shared access if it foresees
that the following operations will only be reads).

Additionally, for exclusive requests, an estimate of the required
space must also be attached to the request. The bricks will use this
value to reserve that amount of space for the client. This is needed
to control available space for writes and to allow them to be
executed locally on the client side (when the available space on a
brick gets too low, it can deny any exclusive access to keep better
control of the remaining space).

A request can specify access levels for more than one inode (this is
useful for operations like rename that involve more than one inode).
All of this information will be sent as new entries inside the xdata
argument.

Then the request will be sent to the bricks and the client will wait
for the answer. Bricks can answer in three ways:

1. The operation cannot be processed because the desired access to
   the inode(s) could not be obtained. This shouldn't happen, but it
   must be taken into account.

2. The operation has been processed successfully (even if the result
   of the operation is an error), but the desired level of access has
   not been granted.

3. The operation has been processed successfully and the desired
   level of access has been granted.

When the operation succeeds and the request involved more than one
inode, it might happen that the bricks grant access to one of them
but not to the others. It's also possible that one brick grants
access to an inode while another brick does not (for example, if a
brick is in a very low space condition). In this case the client will
consider that access has been denied.

When access has been denied but the request has succeeded, any future
request involving the same inode will need to be sent to the bricks
with the extra access information attached again. This also gives the
bricks enough control to refuse exclusive access to an inode if they
detect that multiple clients are accessing it concurrently.

All requests containing inode access information will need to be
strictly ordered to guarantee that all bricks process them in the
same order.
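To make the idea a bit more concrete, here is a minimal sketch of how
a client-side xlator could attach the requested access level and the
space reservation to the xdata of an outgoing fop. The key names
("glusterfs.cache.access", "glusterfs.cache.reserve"), the enum and
the helper are invented for illustration only; the dict_* calls are
the existing libglusterfs helpers. It's a sketch, not a proposal of
the actual implementation:

    #include <stdint.h>
    #include "dict.h"        /* libglusterfs dictionary API (in-tree header) */

    /* Hypothetical access levels kept per inode on the client. */
    typedef enum {
            CACHE_ACCESS_NONE = 0,   /* nothing cached, must ask the bricks */
            CACHE_ACCESS_SHARED,     /* enough to serve reads locally       */
            CACHE_ACCESS_EXCLUSIVE   /* enough to serve writes locally      */
    } cache_access_t;

    /* Attach the desired access level and, for exclusive requests, the
     * space estimate to the xdata dict of an outgoing fop. */
    static int
    cache_request_access(dict_t *xdata, cache_access_t level, uint64_t reserve)
    {
            int ret;

            /* Tell the bricks which access level we would like to keep. */
            ret = dict_set_int32(xdata, "glusterfs.cache.access", level);
            if (ret != 0)
                    return ret;

            /* For exclusive requests, also ask the bricks to reserve an
             * estimate of the space that background writes may consume. */
            if (level == CACHE_ACCESS_EXCLUSIVE)
                    ret = dict_set_uint64(xdata, "glusterfs.cache.reserve",
                                          reserve);

            return ret;
    }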
Requests executed in the background because the client already had
exclusive access can be executed in any order (the exclusive access
is enough to avoid corruption).

Specific details about some fops:

* open(), opendir(): the open flags can be used to determine the
  desired access. An O_RDONLY open will request 'shared' access. An
  O_RDWR or O_WRONLY open will request 'exclusive' access. For
  O_WRONLY, caching could also be disabled because it will never be
  used (see the small sketch in the PS at the end of this message).

* When the last fd of an inode is released, the current ownership can
  be released (i.e. the cache entry is set to 'invalid').

* Synchronization fops, like flush(), fsync() and fsyncdir(), will
  always be sent synchronously, even if the client has exclusive
  access to the inode.

The not so easy part: what happens if something fails.

The big problem is what to do when a client that holds exclusive
access to some inodes dies or loses its connection, or when a brick
has a problem. There are a lot of cases and I haven't analyzed all of
them deeply. This is only a first approach.

When a brick dies: all clients will cease to receive answers from it.
This would need to be handled as it's currently done, depending on
the volume type (for replicate, the other bricks will keep the volume
working; for disperse, a part of the volume could be lost). When the
brick comes online again and reconnects, the access levels currently
owned by each client will need to be requested again (this is similar
to the current procedure to reopen fd's). If any of the requests to
restore ownership fails, the client will consider that it has lost
access to the inode and will need to ask for it again in future
requests.

When a client dies: if it doesn't have ownership of any inode,
nothing special happens. Otherwise, if it has 'exclusive' access to
one or more inodes, all bricks will try to notify that client when
another client requests 'shared' or 'exclusive' access. This
notification will have a timeout. If the client doesn't answer in the
specified time, it will lose the ownership, and all further requests
coming from that client without access information attached to the
xdata will be denied. This can lead to some data loss; however, since
the caching will be write-through and flush(), fsync() and fsyncdir()
will have been executed synchronously, the likelihood of data loss is
small and POSIX semantics allow it (I'm not a POSIX expert, but I
think POSIX doesn't guarantee data to be recoverable until flush() or
fsync() has completed successfully). When the client reconnects, it
will continue to execute normally. It could receive an invalidation
notification for an inode that it no longer owns; in this case it
will simply acknowledge the notification.

When a client disconnects but does not die: this is basically the
same as the previous case, except that when the client reconnects it
will try to recover its previous ownerships. If nothing has changed,
it will recover them; otherwise some of the inodes will be
invalidated. Any pending operations on the invalidated inodes will be
lost (as if the client had died).

Xavi
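PS: as a small illustration of the open() mapping described above,
this is how the open flags could be translated into the access level
to request. It reuses the hypothetical cache_access_t enum from the
earlier sketch and is illustrative only:

    #include <fcntl.h>

    /* Map the flags of an incoming open() to the access level the
     * client should ask for: O_RDONLY -> shared, O_WRONLY/O_RDWR ->
     * exclusive. For O_WRONLY the read cache could additionally be
     * disabled, as mentioned above. */
    static cache_access_t
    cache_access_for_open(int flags)
    {
            switch (flags & O_ACCMODE) {
            case O_RDONLY:
                    return CACHE_ACCESS_SHARED;
            case O_WRONLY:
            case O_RDWR:
                    return CACHE_ACCESS_EXCLUSIVE;
            default:
                    return CACHE_ACCESS_NONE;
            }
    }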
On 06/02/14 00:24, Anand Avati wrote: