Re: [lmi] Converting a proprietary svn repository to git


From: Greg Chicares
Subject: Re: [lmi] Converting a proprietary svn repository to git
Date: Sat, 27 Feb 2016 18:33:22 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Icedove/38.5.0

On 2016-02-27 15:23, Vadim Zeitlin wrote:
> On Sat, 27 Feb 2016 02:11:51 +0000 Greg Chicares <address@hidden> wrote:
[...]
> GC> /home/greg/tainted/migration/rev2sha[0]$PATH=$PATH:/home/greg/tainted/migration git filter-branch --msg-filter msgfilter-rev2sha --tag-name-filter cat -- --date-order --all
> GC> 
> GC> Rewrite 248c5530142dde7ad67fcad348fcbd38ba6c9895 (57/237)fatal: ambiguous argument 'svn/trunk': unknown revision or path not in the working tree.
> GC> Use '--' to separate paths from revisions
> GC> rev-list --first-parent --pretty=medium svn/trunk: command returned error: 128
> 
>  But the problem is, of course, that the script hasn't been tested with
> svn repositories using non-standard layouts either, so it fails to work in
> this case. I could probably fix this if you're interested, but it would be
> really helpful to have a copy of your repository to test it with, would
> this be possible?

Sorry: forbidden. It contains test cases that represent actual insurance
policies. These should always have been sanitized to remove all personally
identifiable customer information. However, some were not sanitized before
they were originally committed. This repository cannot be shared even with
other employees of the insurance company, absent a clear business need.

> GC> Let's compare the '--no-metadata' and '--msg-filter msgfilter-rev2sha' results:
> ...
> GC> All the differences are in .git/ , and they seem to be just binary;
> GC> the contents of {data/ src/ test/} are identical. I think I can conclude
> GC> that for this migration 'msgfilter-rev2sha' isn't beneficial.
> 
>  Err, I am not sure how you reach this conclusion, even if the result may
> well be true.

Because I thought the script's job was to rewrite repository contents
to refer to git hashes rather than svn revision numbers--and it made
no such changes. But it seems I misunderstood completely:

> All msgfilter-rev2sha does is to update the references to svn
> revisions in the repository history to the corresponding git commits, so
> it's never going to result in any changes in the repository contents
> itself, it only works on metadata.
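Read literally, that description suggests a filter along the following
lines. This is only a sketch under my own assumptions, not the actual
msgfilter-rev2sha: the names REV2SHA and rewrite_message are hypothetical,
the lookup table holds just the one r8 pairing shown in the logs further
down, and replacing a reference with an abbreviated hash is a guess at the
output format. The pattern is the one quoted later in this message.

```python
import re

# Pattern quoted later in this thread, read here as a Perl-style regex.
REV_REF = re.compile(r'(r|rev\s*|revision\s*)([1-9][0-9]*)')

# Hypothetical svn-revision -> git-hash table. Only the r8 entry is taken
# from the 'svn log' / 'git log' pair shown in this message.
REV2SHA = {8: "14c134da700aec603b90faa9f6d31a346506af10"}


def rewrite_message(message, rev2sha=REV2SHA, abbrev=12):
    """Replace svn revision references in one commit message.

    References to revisions absent from the table are left untouched.
    The abbreviated-hash output format is an assumption.
    """
    def substitute(match):
        sha = rev2sha.get(int(match.group(2)))
        return match.group(0) if sha is None else sha[:abbrev]

    return REV_REF.sub(substitute, message)


if __name__ == "__main__":
    for msg in ("Revert part of r8",
                "Align data files with lmi revision 5667",
                "Sanitize personal data from inforce cases"):
        print(rewrite_message(msg))
```

In this sketch only a message that mentions a revision present in the table
changes; a message with no such reference, or one that refers to a revision
of some other repository, passes through unaltered.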

I still don't understand. Let me try to explain why, so that perhaps you
can say where my conceptual model is flawed.

I figure that any RCS is like a filesystem: a structure in which data
(file contents) can be saved and retrieved through some API. The underlying
mechanism uses code and metadata that are implementation details, hidden
behind the scenes. Thus, if I move a directory from vfat to ext4, it loses
the FAT and gains inodes, but I don't have to understand that as long as, say,
  home/greg/dog_photos/2015
still contains the same pictures with the same names and the same last-
modification dates (those metadata are relevant, but the physical order of
the files, and the way they're mapped to disk sectors, are irrelevant).

It does seem that I can do the same operations with my new git repository
as I could with the svn original. For example, 'svn log' and 'git log'
retrieve equivalent information (date, author, and commit message):

r8 | [redacted] | 2012-02-11 00:10:03 +0000 (Sat, 11 Feb 2012) | 1 line
Sanitize personal data from inforce cases

commit 14c134da700aec603b90faa9f6d31a346506af10
Author: [redacted] <address@hidden>
Date:   Sat Feb 11 00:10:03 2012 +0000
    Sanitize personal data from inforce cases

AFAICT, the only difference 'msgfilter-rev2sha' would make in that case
is that the git hash might differ. But why would I care about that? If
I follow a chain with hash 14c134da700aec603b90faa9f6d31a346506af10 to
get that log message, and the script would substitute a different chain
with a different hash to let me retrieve the same data in the same way,
what benefit does that have? What am I missing?
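
To make the question concrete, here is my understanding of why a hash
would differ at all, as a minimal sketch. It assumes git's SHA-1 object
format, in which an object's id is the SHA-1 of a short header followed by
the object's content, and a commit's content includes its log message. The
helper names, the author identity, and the tree are placeholders, not data
from my repository.

```python
import hashlib


def git_object_id(kind, body):
    """SHA-1 id of a git object of the given kind with the given content."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(message):
    """Content of a parentless commit; only the message varies here."""
    # Placeholder tree (git's empty tree), author, and committer.
    return (
        "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
        "author A U Thor <author@example.com> 1328919003 +0000\n"
        "committer A U Thor <author@example.com> 1328919003 +0000\n"
        "\n"
        f"{message}\n"
    ).encode()


if __name__ == "__main__":
    before = git_object_id("commit", commit_body("Revert part of r8"))
    after = git_object_id("commit", commit_body("Revert part of 14c134da700a"))
    print(before)
    print(after)
    print(before != after)
```

If that is right, a changed hash is merely a consequence of a changed log
message, and a commit whose message is left alone (and whose ancestors are
all left alone) would keep its hash.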

>  To see whether it's beneficial or not you should use "git log --grep=..."
> with the regular expression at the end of the script. As you're not going
> to have any 5 digit revision numbers (with only 237 revisions in total),
> and as it's not a problem to get some false positives here, it should be
> enough to run
> 
>       git log --grep='(r|rev\s*|revision\s*)([1-9][0-9]*)'
> 
> and check if there are any references you would like to replace.

git log --grep='(r|rev\s*|revision\s*)([1-9][0-9]*)' | wc
      0       0       0

I guess that's a Perl regex, which places it beyond my easy understanding;
but why didn't it find the following?

git log |grep 'revision [0-9]'
    Align data files with lmi revision 5667
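
One possible explanation, which I have not verified against this
repository: 'git log --grep' treats its pattern as a basic regular
expression unless '-E' ('--extended-regexp') or '-P' ('--perl-regexp') is
given, and in a basic regular expression unescaped '(', ')' and '|' are
ordinary characters. The sketch below imitates the two readings with
Python's 're' module; BASIC_READING is my own hand-written approximation
of the basic-regex interpretation, not anything git produces.

```python
import re

LINE = "Align data files with lmi revision 5667"

# The pattern from the thread, read as an extended (Perl-style) regex:
# parentheses group and '|' separates alternatives.
EXTENDED = re.compile(r'(r|rev\s*|revision\s*)([1-9][0-9]*)')

# Approximation of a basic-regex reading of the same text: the
# parentheses and bars must appear literally in the searched line.
BASIC_READING = re.compile(r'\(r\|rev\s*\|revision\s*\)\([1-9][0-9]*\)')


def matches(pattern, text):
    """True if the compiled pattern matches anywhere in the text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    print(matches(EXTENDED, LINE))
    print(matches(BASIC_READING, LINE))
```

If that is indeed the cause, adding '-E' to the 'git log --grep' command
might find the commit, assuming the regex library git was built with
accepts '\s' in that mode.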

> GC> What really bothers me is the git documentation:
[...]
>  Git documentation is written in rather informal style and personally I
> prefer it to more prescriptive style of the traditional man pages, I
> appreciate being told not only what I can do but also why doing it may or
> not be a good idea. And I think it does a rather good job of it in the 2
> cases highlighted above if you look at the full sentences and not just the
> quoted parts.

Thanks for explaining that. I'm used to formal technical documentation whose
sentences are as theorems and clauses are as lemmas, so that taking a clause
"out of context" cannot affect its truth-value. I always found this part of
the classic GNU maintainers' documentation charming:
  "Please reread the paragraph above, slowly and carefully. It is
  important to understand that rule precisely, much as you would
  understand a complicated C statement in order to hand-simulate it."
but I am after all a prescriptivist.



