Re: [Linux-NTFS-Dev] NTFS resizer
From: Anton Altaparmakov
Subject: Re: [Linux-NTFS-Dev] NTFS resizer
Date: Thu, 16 Aug 2001 20:57:08 +0100
At 09:25 16/08/01, Andrew Clausen wrote:
On Thu, Aug 09, 2001 at 02:04:37AM +0100, Anton Altaparmakov wrote:
> At 01:18 09/08/2001, Andrew Clausen wrote:
> >On Wed, Aug 08, 2001 at 12:45:22PM +0100, Anton Altaparmakov wrote:
> > > You have a structure: the mft record. It contains all attributes nicely
> > > sorted.
> >
> >It contains the on-disk attributes, not ntfs_attr.
>
> That's my point. There should be no such thing as ntfs_attr. The on disk
> representation is entirely sufficient to do everything.
I guess this is ok with lots of accessor functions, etc.
(These are necessary for endianness, and making things like
accessing the name easier)
Obviously all accesses have to be wrapped with leXY_to_cpu() and vice
versa. But that is nothing unusual. After all, they just expand to nothing
if compiled on a little-endian architecture.
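A minimal sketch of what such an accessor looks like, following the Linux kernel's le16_to_cpu()/cpu_to_le16() naming. In the kernel the right variant is selected at compile time and expands to nothing on little-endian machines; here the host byte order is probed at runtime purely for illustration:

```c
/* Sketch of a little-endian-to-host accessor in the style of the
 * kernel's le16_to_cpu().  Runtime probing is for illustration only;
 * real implementations decide at compile time. */
#include <stdint.h>

static inline uint16_t swap16(uint16_t x)
{
    return (uint16_t)((x >> 8) | (x << 8));
}

static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const uint8_t *)&probe == 1;  /* low byte first => little endian */
}

/* Convert a little-endian on-disk value to host byte order. */
static uint16_t le16_to_cpu(uint16_t v)
{
    return host_is_little_endian() ? v : swap16(v);
}
```

On a little-endian host the conversion is the identity, which is why the macros cost nothing there.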
> >I think attributes need to be abstracted from MFT records, because
> >they may need to be shuffled around between MFTs. (Eg: an attribute
> >gets inserted, or a run-list grows...)
>
> Then space is made for this attribute and it is inserted. mkntfs even goes
> as far as CREATING the attributes in their correct place in the mft record
> itself, i.e. no memcpy involved for that.
>
> This is the only way to catch when attributes become too big and need to
> be made non-resident or when they have to be moved out to other mft records.
Why "the only way"?
Well if you are not working within the mft record (i.e. you have copied the
attributes into memory structures), you don't know how much space is in use
in the mft record. You could argue that you don't care about this until you
reach the stage of writing out to disk, but that doesn't really work
because you have to differentiate resident vs. non-resident attributes in
memory, too: for resident ones your in memory attribute contains the value
of the attribute, while for non-resident ones the value is stored on disk
and your in memory copy only holds the run list. You can't keep
non-resident attribute values in memory, because an attribute value can
easily be hundreds of megabytes or even gigabytes worth of data... Nobody
has that much RAM, or even that much swap. (Note the current libntfs has a big
problem here as it always loads the whole attribute value at once which is
of course madness but that has been ok for me so far as the library is only
used by ntfsfix which doesn't do anything too fancy with attribute values
except for loading the whole of $LogFile, but that is limited to 4MiB
maximum size (IIRC) in normal circumstances so it is not too bad...)
BTW: when you need to move an attribute into
another mft record... you might be able to move it into a record with
lots of free space, etc.
When you are moving an attribute out of an mft record you move it to an
empty mft record. It might be possible that you can have two attributes in
the same extension mft record but I have never seen that happen in Windows.
Each attribute extent is in a separate mft record, so unless someone sees
otherwise, I would operate on the assumption that Windows NTFS driver will
choke if you put two extents into the same mft record.
I would have thought this would be easy to do at "clean" time.
> >I was thinking this shuffling should happen when the file is synced
> >to disk. (ntfs_file_sync()?)
>
> This is a possible approach and is indeed what the old NTFS driver does
> (very badly so it trashes your fs most of the time...). I hate that
> approach (but I would accept it, if it were written properly, which I think
> is extremely hard to do because you are creating one single mammoth
> function which you will have a lot of fun to debug or alternatively you
> will have millions of if/else or switch statement to capture all the
> special cases).
I like this approach for exactly the opposite reason: it has fewer
special cases, and is much more elegant (almost trivial!)
Not single mammoth, but single miniature function...
Maybe I'm not understanding something.
Well, you can look at the kernel NTFS driver (just pick any 2.4.x kernel
out there, doesn't really matter, the concept is the same in all of them)
if you want to see how complex it is and how it doesn't manage to get it
right at all.
Alternatively, try to implement it! Perhaps my view is too biased against
this approach and I am not objective. But AFAICS this cannot be done in a
single miniature function. If you manage to do that, I will take my hat off
and bow down in awe... (((-;
> - It also has nasty side effect of resetting the attribute
> sequence numbers and even reshuffling them completely which is plain WRONG
> but considering we overwrite the journal with 0xff bytes not a too big
> problem.
why? (BTW: the update is completely atomic)
Because every attribute in a mft record is assigned a sequence number
unique for that mft record at the time of creation of the attribute. It has
nothing to do with the order of the attributes in the mft record; it's just
about order of creation. If you keep the $LogFile (and $UsnJrnl for that
matter, but see below), the values in the journal(s) will conflict with the
values you wrote out, and if the journal gets replayed or rolled back you
will end up with corruption. But never mind that. If the journal is still
present you will end up with corruption anyway...
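The per-record sequence numbering described above can be sketched like this. The field names are modelled loosely on the Linux NTFS driver's layout but simplified; treat them as illustrative, not the real on-disk structures:

```c
/* Illustrative sketch: the mft record header carries a counter from
 * which each new attribute is handed its instance number at creation
 * time, so the number reflects creation order, not position in the
 * record.  Field names are assumptions, not the exact NTFS layout. */
#include <stdint.h>

struct mft_record_hdr {
    uint16_t next_attr_instance;  /* incremented once per attribute created */
};

struct attr_record {
    uint16_t instance;  /* assigned at creation, never changed afterwards */
};

static void assign_instance(struct mft_record_hdr *m, struct attr_record *a)
{
    a->instance = m->next_attr_instance++;
}
```

Because the counter only ever increases, rewriting the record from scratch and renumbering the attributes changes values the journal still refers to.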
(Why is shuffling wrong?) I'm new to all of this!
Journalling only. So in fact as long as there is no journal left over it
should be ok.
> btw. We really need to delete the $Extend\$UsnJrnl file when
> writing to the partition or that could screw us badly, too, but deletion
> is not implemented yet at all.
Why?
$UsnJrnl is very similar to $LogFile but it doesn't log data changes. If
you like, it is a lightweight version of $LogFile for use by applications.
Backup and antivirus programs, for example, can look at it to see whether
a file has changed since a certain time, and then they know whether they
need to back up/scan the file again or not. - That's just an example of
what it can be used for. Obviously if you write to the partition the
contents of the log are out of date and to ensure data integrity you have
to deactivate the log. This happens in a different way from $LogFile (which
you cannot delete as it is an essential system file) and in fact you just
delete the file to deactivate the $UsnJrnl. Windows will reactivate it on
reboot (or the program wanting to use it will).
> For example if you are doing a new layout of all attributes from scratch
> you will need to check every single attribute for being a certain type and
> for whether it is allowed to be resident or non-resident or both, then you
> need to check whether there is still enough space in the mft record to add
> it and if not you have to start from scratch again or you have to edit what
> you have done so far to make space.
Ah, the knapsack problem strikes again (run into this a bit in
partitioning!). Starting "from scratch" is no worse. It's NP hard either
way.
'NP'?
> - Now if you start editing what you
> have already created you end up having _all_ the functionality required to
> handle attributes in place in their mft records AND because you are doing
> the over the top flush function you have almost all the code duplicated
> there for doing all the checks etc.
I don't see why... it still seems rather trivial to me.
You are considering editing mft records trivial? Maybe our perspectives on
what is defined as trivial differ... I guess the concept is trivial, but to
implement all the functions required you are talking about a lot of lines of
code. (The ntfs.sys binary in Windows is >500kiB for a reason... and it uses
other drivers/kernel DLLs extensively, too.)
It gets especially interesting when you start using extension mft records,
because you then require the attribute list attribute to be present. But to
have that attribute you already need to know where all the attributes are
and how many extents each spans; otherwise you don't know what the size of
the attribute list attribute value will be, so you can't reserve space in
the mft record for it, and so in turn you can't write the other attributes
if you are not going to be editing the mft record afterwards. (Note the
attribute list attribute is type 0x20, which means it comes between the
standard information and the file names, so most attributes are written
after it is written.) Sounds like a Catch-22 situation to me if you are
trying to write out all attributes into a mft record in one go. This also
raises the question whether you will maintain an attribute list attribute
in memory or whether you will create it on write to disk, when all the
other attributes are generated and written to the mft record. - I would
like to know what concept you will follow to handle attribute list
attributes? I don't see any straightforward way to do it if you have all
the attributes in separate in-memory attributes rather than in the mft
record...
For NTFS TNG I am thinking of making attribute list attribute handling part
of a slow code path because it is very infrequent to have them. But on the
other hand it is most often $MFT itself that uses attribute lists (due to
growing fragmentation with increasing age of the volume) which needs to be
accessible real fast, so to be honest I am not actually too sure how they
will get implemented eventually. I am ignoring the problem for the moment
until I get a working driver without support for those beasts and will
think about them then...
> What about the mft record then? I mean when you are writing back which mft
> record will you write to? The same one (you have to otherwise you would
> have to release the previous one and allocate a new one...)? How will you
> know which one that was?
No problem since when writing to the attribute, there is no allocation.
So, when you rearrange the MFT's MFT records, they just move, but this
doesn't hinder writing the MFT's MFT records.
That is not that easy. Your "just move" is a complex set of operations:
Each mft record is allocated in the mft bitmap (a bitmap having a bit for
each existing mft record; if a bit is 1 the corresponding mft record is in
use and if it is 0 the mft record is free). So to move an mft record you
need to deallocate the old one and allocate the new one. Further, if you
move a file from one mft record to another, all directory entries pointing
to this file have to be updated with the new mft record. Same for all
extension mft records, which need to point back to the base mft record.
Thus, IMHO, moving a file's mft record about is a bad idea unless you
absolutely have to do so. Add to that that Unix/Linux utilities expect
inode numbers to be persistent across mounts, which moving mft records
about on every write would screw up, so utilities like tar, for example,
would not work properly.
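The bookkeeping just described can be sketched as follows. The helper names are hypothetical, not an existing libntfs API; flipping the two bitmap bits is the easy part, while updating every directory entry and extension record that referenced the old position (not shown) is the expensive one:

```c
/* Hypothetical sketch of moving an mft record: deallocate the old slot
 * in the mft bitmap, allocate the new one, then fix up all references.
 * Helper names are illustrative only. */
#include <stdint.h>

static void set_bit(uint8_t *bitmap, uint64_t bit, int value)
{
    if (value)
        bitmap[bit >> 3] |= (uint8_t)(1u << (bit & 7));
    else
        bitmap[bit >> 3] &= (uint8_t)~(1u << (bit & 7));
}

static void move_mft_record(uint8_t *mft_bitmap, uint64_t from, uint64_t to)
{
    set_bit(mft_bitmap, from, 0);  /* old mft record is now free */
    set_bit(mft_bitmap, to, 1);    /* new mft record is now in use */
    /* ...update directory entries and extension mft records that
     * pointed at 'from' -- the costly part, omitted here... */
}
```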
Also, when writing to an attribute, while there is no allocation of mft
records, there will be allocation of disk clusters for non-resident
attributes. Again, you can't just keep the non-resident attributes in
memory due to the possibly huge size. [Maximum attribute size for $DATA
attribute of any mft record is 2TiB...]
> Also, surely parted will not be working at file level but much deeper
> below in the inode/mft record level? Or will it not treat files as opaque
> structures and use them to access the underlying mft records?
Well inode == file NOT mft, IMHO. It should work at the inode level, yes.
Terminology I have used so far is:
file = abstract file system object which has a name (dentry) which is
connected to an inode. Programs and processes work with files and the
kernel uses them to keep track of what files are open, etc.
inode = kernel object underneath the concept of files. Several files can
point to the same inode (hard links, where multiple dentries point to the
same inode) and in NTFS the reverse is also true that several inodes can
make up one file. One specific inode is in my definition the in memory
equivalent of one specific mft record of an NTFS volume.
Of course if you define a file to be an inode then it becomes clear we were
talking past each other...
Basically your definition of inode is my definition of file. (-;
And in your definition you take away the intermediate object between my
file and the mft record (i.e. my inode).
But if you do that, how can you work on your definition of files? I would
have thought that parted is a low-level program which doesn't
open/read/write/close files but instead operates on the on-disk structures
themselves. - At least in my understanding, a high-level interface like
open/read/write/close does not allow you to specify where a file's data is
written on disk.
> For example if the resize requires some data to be moved because it would
> be left in unallocated space otherwise, how would you do that? You need
> low level control of cluster allocations, file level access is useless in
> this context.
Well, in the first pass (traversing all inodes), it marks all blocks
(in a big in-memory bitmap) that need to be copied/relocated. (This
includes the above-mentioned blocks, and also the MFT, for doing the
atomic update trick)
Be very careful here. The in-memory bitmap, which holds one bit for each
cluster, can itself be very large on a big volume. We are
talking 2.5MiB for my 10GiB NTFS partition at home, and recently I have
seen people use up to 80GiB NTFS partitions, which would mean a bitmap size
of 20MiB. Next year (or whenever) I would not be surprised to see people
with 800GiB partitions (requiring a bitmap of 200MiB).
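The figures above follow directly from one bit per cluster; assuming 512-byte clusters (the cluster size in these examples is an assumption), a 10GiB volume needs 10GiB / 512 / 8 = 2.5MiB of bitmap:

```c
/* One bit per cluster, rounded up to whole bytes. */
#include <stdint.h>

static uint64_t cluster_bitmap_bytes(uint64_t volume_bytes, uint32_t cluster_size)
{
    uint64_t clusters = volume_bytes / cluster_size;
    return (clusters + 7) / 8;
}
```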
So, when copying blocks to free space, we need to allocate clusters, yes.
No big deal.
A file based interface (you said file==inode) is not compatible with
cluster allocation... Again I might be misunderstanding you. If you define
what your file/inode will look like it will make it easier for me to
understand what you have in mind. For me, accessing a file is done like this:
f = open(filename);
seek(f);
read(f);
write(f);
close(f);
And nothing else you can do to it. At least that is how Unix/Linux see files...
If this is not the interface we are talking about but something like:
i = read_inode(mft_record_number);
relocate_inode(i);
(de)allocate_inode(i);
write_inode(i);
etc., then I agree, that's the right level to do it.
> Also you will need to rewrite every single run list on the volume by
> adding/subtracting a fixed delta to every LCN value. - You can't do this
> at a file level access either.
File == inode. Files/Inode's have attributes, not MFT records.
That is wrong. MFT records have attributes. The mft record is the
fundamental file system unit. If this weren't the case it would not make
sense to be localizing attributes from one file in the same mft record
(apart from speed of access obviously).
> This is why I don't understand why you want to work on a file level...
>
> My getfile would look like:
> {
> is buffer in cache? -> yes: return buffer
what is buffer?
I would like to keep a cache in memory storing mft records so I don't have
to keep reading/writing them from/to disk all the time. So when you open a
file you would check whether the base mft record corresponding to the file
is in this cache already or not. In this context buffer would be the mft
record read in from disk into a memory buffer. Admittedly in user space you
do not necessarily need to do this because the kernel is already caching
device accesses via the buffer cache (unless you are using raw io).
> my file sync would look like:
>
> for (all mft records owned by file) {
> lock mft record cached copy()
> pre write mst_fixup buffer()
> write to disk()
> fast post write mft fixup buffer()
> unlock buffer()
> }
>
> Simple, only 6 lines of code (minus error handling).
But doesn't handle run lists overflowing.
Run lists cannot overflow. I am not sure what you mean. In my scheme the
run list is already compressed in the mapping pairs array inside the file's
mft records so when they are flushed to disk, the compressed run list is
flushed to disk with the mft record it is stored in.
The mapping pairs array is updated every time the run list is changed in
memory. When you change the run list it means you must have allocated
clusters on disk (clusters are what the run list points to, after all),
and that obviously involved updating the cluster bitmap of the partition
(the $Bitmap system file). So if you don't modify the run list / mapping
pairs array at the same time, and your cluster bitmap is committed to disk
and then the computer crashes, you end up with clusters which are allocated
on disk but are not associated with anything, which is not a good idea.
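A toy sketch of that invariant: the run list (and, in a real implementation, the mapping pairs array inside the mft record) is updated in the same operation that marks the clusters allocated in the bitmap, never separately. All names here are illustrative:

```c
/* Toy model: allocating clusters updates the run list and the volume
 * bitmap together, so a crash cannot leave clusters that the bitmap
 * says are in use but no run list points to. */
#include <stdint.h>

enum { MAX_RUNS = 16 };

struct runlist {
    uint64_t lcn[MAX_RUNS];  /* start cluster of each run */
    uint64_t len[MAX_RUNS];  /* length of each run in clusters */
    int count;
};

static int runlist_append(struct runlist *rl, uint64_t lcn, uint64_t len)
{
    if (rl->count >= MAX_RUNS)
        return -1;
    rl->lcn[rl->count] = lcn;
    rl->len[rl->count] = len;
    rl->count++;
    return 0;
}

static void bitmap_set_range(uint8_t *bm, uint64_t start, uint64_t n)
{
    for (uint64_t i = start; i < start + n; i++)
        bm[i >> 3] |= (uint8_t)(1u << (i & 7));
}

/* Extend an attribute by n clusters at lcn: both structures change in
 * one step.  (A real version would also rebuild the mapping pairs
 * array in the mft record here.) */
static int alloc_clusters(struct runlist *rl, uint8_t *bm,
                          uint64_t lcn, uint64_t n)
{
    if (runlist_append(rl, lcn, n) < 0)
        return -1;
    bitmap_set_range(bm, lcn, n);
    return 0;
}
```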
> >I'm not convinced we want this [un]map() thing on records. Records
> >aren't accessed on their own, except in the context of a file. Files
> >are the atomic pieces... So, I think we should just have {read,write}
> >on records, and [un]map() on files, although I've called it get()
> >here. (unmap() is just free() on a clean file)
>
> They are in my implementation... Files have nothing to do with mft records.
> Directories are mft records, too.
Directories are files.
Sorry, that is what I meant to write.
Maybe this is the source of our misunderstanding.
I'm saying file == inode == "set of MFT records".
Ok. Understand now.
I'll call it inode if that sounds better. (Just, I thought that was
NTFS terminology... sorry!)
In NTFS terminology it gets very confusing because an MFT record is a FILE
record (that's the magic identifier of an MFT record). Thus, when you talk
about files it is not clear whether you talk about actual files (in my
definition) or about mft records. I have hence personally adopted the
Unix/Linux terminology of what a file and an inode is and have equated an
inode to be an mft record for simplicity.
Technically, you are correct that file represents one inode which == "set
of MFT records", but in my twisted mind I like to think of it as file
represents one "base" inode (the base mft record), which can be pointing to
"helper inodes" (extension mft records), each being ONE mft record.
I only do this because it makes working with mft records easier, at least
in my NTFS TNG implementation. Basically I integrate MFT records very
tightly with in memory inodes, which means that my (un)map_mft_record
functions also modify the struct inode of the mft record (for locking
purposes for example), thus my implementation requires an inode object for
each mft record. I just ignore the fact that all those "helper inodes" will
not be used by the kernel for anything at all, i.e. you can't open files
with those inodes and things like that... Whether this approach is
sensible/efficient or not remains to be seen as I haven't implemented
attribute lists and hence the use of extension mft records yet... - So my
ideas might turn out to be wrong, only time will tell. (-;
There is another thing: parted doesn't really need to have a full NTFS
implementation. The relocation of clusters and possibly of the mft records
and other system files could be implemented optimized for this specific
use. For example you don't actually need to be able to create files or
delete them. You just need to be able to move mft records about and same
for clusters. At the same time you need to be able to update run lists,
directory entries and other attributes. I think it would be much easier for
parted to, say, read an mft record, modify the attributes to reflect its new
position and then write it to its new place. While doing each update it
would go about updating all references to this file to reflect the changes.
To move random clusters about you just need to copy over the data
unmodified, update the bitmap for the volume and update the run list of the
data attribute for the file. So you could get away with a much simpler
implementation of something like an integrated libNTFSresize/libNTFSmove
which would be extremely specific to parted. That would be a lot less code
than having a full blown libntfs that can do anything and everything on the
planet. Just an idea of how your development of parted for NTFS could be
sped up a lot...
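A toy in-memory model of the specialised "move clusters" primitive suggested above: copy the data unmodified, update the volume bitmap, patch the run list. The single-run layout and every name here are illustrative only, not a proposed parted or libntfs interface:

```c
/* Toy model of moving one cluster of data on an NTFS-like volume. */
#include <stdint.h>
#include <string.h>

enum { CLUSTER_SIZE = 512, NCLUSTERS = 64 };

struct toy_volume {
    uint8_t  data[NCLUSTERS][CLUSTER_SIZE];
    uint8_t  bitmap[NCLUSTERS / 8];  /* one bit per cluster, like $Bitmap */
    uint64_t run_lcn;                /* single-run data attribute, for brevity */
};

static void bit_set(uint8_t *bm, unsigned i) { bm[i / 8] |= (uint8_t)(1u << (i % 8)); }
static void bit_clr(uint8_t *bm, unsigned i) { bm[i / 8] &= (uint8_t)~(1u << (i % 8)); }
static int  bit_get(const uint8_t *bm, unsigned i) { return (bm[i / 8] >> (i % 8)) & 1; }

static void move_cluster(struct toy_volume *v, unsigned src, unsigned dst)
{
    memcpy(v->data[dst], v->data[src], CLUSTER_SIZE);  /* copy data unmodified */
    bit_clr(v->bitmap, src);                           /* free the old cluster */
    bit_set(v->bitmap, dst);                           /* allocate the new one */
    v->run_lcn = dst;                                  /* patch the run list */
}
```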
Anton