Re: [rdiff-backup-users] Python 3 migration: considering non-UTF-8 confo
From: Patrik Dufresne
Subject: Re: [rdiff-backup-users] Python 3 migration: considering non-UTF-8 conform filenames
Date: Sat, 3 Aug 2019 06:49:31 -0500

Hello Éric, I'm very concerned about this. I did not review all your
changes and did not notice this. I back up a lot of different systems,
and their encodings are not all UTF-8. Invalid UTF-8 happens quite often.
The way to work around this, in rdiffweb at least, is to handle paths as
bytes. That is how rdiffweb 1.2.8 works: paths are bytes. That is also
how most filesystems work. Paths are bytes, and they are decoded only
to be displayed to the user.
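
To make the bytes-path idea concrete, here is a minimal Python 3 sketch. The helper names (list_raw_names, to_display, to_internal_str, to_raw) are made up for illustration and are not taken from rdiffweb or rdiff-backup. It assumes names are kept as raw bytes and are decoded only when shown to a user.

```python
import os


def list_raw_names(directory: bytes) -> list:
    """Return directory entries as raw bytes.

    os.listdir() returns bytes names when it is given a bytes path,
    so no decoding happens here.
    """
    return sorted(os.listdir(directory))


def to_display(raw: bytes) -> str:
    """Decode for display only; undecodable bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def to_internal_str(raw: bytes) -> str:
    """Lossless str form: undecodable bytes become lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_raw(name: str) -> bytes:
    """Inverse of to_internal_str(); restores the original bytes."""
    return name.encode("utf-8", errors="surrogateescape")
```

A name such as b"caf\xe9.txt" (Latin-1, not valid UTF-8) survives the round trip through to_internal_str() and to_raw() unchanged. Encoding the intermediate str with a strict UTF-8 codec instead raises the same "surrogates not allowed" error that appears in the warnings quoted below.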
Not supporting non-UTF-8 names is a deal breaker for me. What would I say to my
non-technical users? "Hmm, sorry, you must rename your files to get them backed up..."
Not sure
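
On the regex point in the quoted message below: Python's re module accepts bytes patterns as long as the subject is bytes too, so raw names can be matched without decoding them. The following is only a sketch. The ".tmp" exclude pattern and the function names are hypothetical, and whether this fits rdiff-backup's selection code is not something it claims.

```python
import re

# Hypothetical exclude rule, written as a bytes pattern.
EXCLUDE = re.compile(rb".*\.tmp$")


def is_excluded(raw_path: bytes) -> bool:
    """Match the pattern against the undecoded path bytes."""
    return EXCLUDE.match(raw_path) is not None


def compile_bytes_pattern(pattern: str):
    """Turn a user-supplied str pattern into a bytes pattern.

    Caveat: a non-ASCII character becomes several bytes, so character
    classes and quantifiers applied to it will not behave as they
    would in a str pattern.
    """
    return re.compile(pattern.encode("utf-8", errors="surrogateescape"))
```

Mixing the two types (a str pattern against a bytes subject, or the reverse) raises TypeError, so a program has to pick one representation and use it consistently.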
On Sat, Aug 3, 2019, 5:50 AM Eric L., <address@hidden> wrote:
> Hi,
>
> as I worked on migrating to Python 3, one of the "fanciest" aspects was
> the change from str/unicode to bytes/str "character chains" types.
>
> Without going into the technical details (python savvy persons will know
> what I mean), it means among other things that the codeset of file names
> becomes relevant and must be UTF-8. Files with a name which isn't
> compliant with UTF-8 aren't backed up.
>
> The warnings look something like:
>
> Sat Aug 3 10:51:51 2019 Warning: unable to read ACL from 'very
> complicated filename': 'utf-8' codec can't encode character '\udcb1' in
> position 54: surrogates not allowed
> Sat Aug 3 10:51:51 2019 Warning: ignoring file 'very complicated
> filename' with wrong encoding: 'utf-8' codec can't encode character
> '\udcb1' in position 54: surrogates not allowed
>
> I don't see many options, because only str (i.e. codeset-aware) can be
> matched against regex; bytes can't (filenames could still be read as
> bytes).
>
> A few consequences:
>
> 1. such files can't be backed up anymore.
> 2. old backup repos which contain such files are seen as broken. As
> long as such files appear only in increments and not in the last
> version, the repo will still be usable, though.
>
> That said, non-UTF-8-compatible file systems have been uncommon for many
> years, so the impact should be very limited (in my case, old
> Windows files lying around since 2010).
>
> I'm mostly concerned about the Asian region, because I've heard (but have
> no experience whatsoever) that they might use rich encodings other than
> Unicode. The original code was IMHO already not very clean in this
> regard, and the migration to UTF-8 hasn't improved things: strings are
> encoded/decoded sometimes with an explicit UTF-8 codec and sometimes
> without one.
>
> If the users on this list could comment on their experience and
> expectations it would be great. Doing tests with old backup repos on my
> PR [1] would be even greater.
>
> Don't expect miracles, though: currently I don't see any viable
> alternative to the decision I've taken. I mostly wanted to make sure
> it's taken transparently.
>
> Thanks, Eric
>
> [1] https://github.com/sol1/rdiff-backup/pull/40