Hello Éric, I'm very concerned about this. I did not review all your
changes and did not notice this. I back up a lot of different systems,
and their encodings are not all UTF-8. Invalid UTF-8 also happens quite
often. The way to work around this, in rdiffweb at least, is to handle
paths as bytes. That is how rdiffweb 1.2.8 works: paths are bytes. That
is also how most filesystems work: paths are bytes, and they are decoded
only to be displayed to the user.
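To illustrate the bytes approach, here is a minimal Python 3 sketch. It
is not rdiffweb's actual code: the helper names display_name and
list_names are hypothetical, and UTF-8 as the display encoding is an
assumption. Paths stay as bytes internally and are decoded only when
shown to the user.

```python
import os


def list_names(directory: bytes) -> list:
    """Return directory entries as bytes (os.listdir returns bytes for a bytes path)."""
    return sorted(os.listdir(directory))


def display_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a bytes path for display only.

    Undecodable bytes become U+FFFD, so this never raises. The original
    bytes, not this string, must be used to open or back up the file.
    """
    return raw.decode(encoding, errors="replace")


# A name containing a Latin-1 byte that is not valid UTF-8.
raw_name = b"caf\xe9.txt"
shown = display_name(raw_name)
```

The decoded string is lossy on purpose, which is why it is kept separate
from the bytes used for file access.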
Not supporting non-UTF-8 names is a deal breaker for me. What would I say
to my non-technical users? "Hum, sorry, you must rename your files to get
them backed up..."
Not sure
On Sat, Aug 3, 2019, 5:50 AM Eric L., <address@hidden> wrote:
Hi,
as I worked on migrating to Python 3, one of the "fanciest" aspects was
the change of the string types from str/unicode to bytes/str.
Without going into the technical details (Python-savvy people will know
what I mean), it means among other things that the codeset of file names
becomes relevant and must be UTF-8. Files whose names aren't valid
UTF-8 aren't backed up.
The warnings look something like:
Sat Aug 3 10:51:51 2019 Warning: unable to read ACL from 'very
complicated filename': 'utf-8' codec can't encode character '\udcb1' in
position 54: surrogates not allowed
Sat Aug 3 10:51:51 2019 Warning: ignoring file 'very complicated
filename' with wrong encoding: 'utf-8' codec can't encode character
'\udcb1' in position 54: surrogates not allowed
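For readers unfamiliar with this message, the following sketch
reproduces it outside of rdiff-backup. The file name is made up. Python 3
decodes undecodable file name bytes with the surrogateescape error
handler (PEP 383), so the byte 0xB1 becomes the lone surrogate '\udcb1',
which a strict UTF-8 encode then rejects.

```python
# Hypothetical file name with one byte (0xB1) that is not valid UTF-8.
raw = b"caf\xb1.txt"

# This mirrors what Python 3 does when it hands out file names as str:
# the invalid byte is mapped to the lone surrogate U+DCB1.
name = raw.decode("utf-8", errors="surrogateescape")

# A strict encode fails with the error seen in the warnings.
try:
    name.encode("utf-8")
    message = ""
except UnicodeEncodeError as exc:
    message = str(exc)

# Encoding with the same error handler restores the original bytes.
restored = name.encode("utf-8", errors="surrogateescape")
```

Encoding with surrogateescape round-trips the name, so the original
bytes are not lost as long as every encode step uses that handler.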
I don't see many options, because only str (i.e. codeset-aware) can be
matched against regex, bytes can't (filenames could still be read as
bytes).
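As a side note on the regex point, Python 3's re module does accept
bytes, provided the pattern and the subject are both bytes. The sketch
below uses a made-up pattern and file name; whether this fits
rdiff-backup's selection code is not something it tries to show.

```python
import re

# A bytes pattern (note the rb prefix) matches bytes subjects directly.
pattern = re.compile(rb".*\.txt$")

# Hypothetical file name containing a byte that is not valid UTF-8.
raw = b"caf\xb1.txt"
matched = pattern.match(raw) is not None

# Mixing a bytes pattern with a str subject is what fails.
try:
    pattern.match("cafe.txt")
    mixed_rejected = False
except TypeError:
    mixed_rejected = True
```

The constraint is therefore that patterns and file names use the same
type, not that file names must be decoded.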
A few consequences:
1. such files can't be backed up anymore.
2. old backup repos which contain such files are seen as broken. As
long as only the increments contain such files and not the last
version, the repo will still be usable though.
This said, non-UTF-8-compatible file systems have been uncommon for many
years, so the impact should be very limited (in my case, old
Windows files lying around since 2010).
I'm mostly concerned about the Asian region, because I've heard (but have
no experience whatsoever) that rich encodings other than Unicode might
be in use there. The original code was IMHO already not very clean in
this regard, and the migration to UTF-8 hasn't improved things: strings
are encoded/decoded sometimes with an explicit UTF-8 codec and sometimes
without one.
If the users on this list could comment on their experience and
expectations it would be great. Doing tests with old backup repos on my
PR [1] would be even greater.
Don't expect miracles though, currently I don't see any viable
alternative to the decision I've taken. I mostly wanted to make sure
it's taken transparently.
Thanks, Eric
[1] https://github.com/sol1/rdiff-backup/pull/40
_______________________________________________
rdiff-backup-users mailing list at address@hidden
https://lists.nongnu.org/mailman/listinfo/rdiff-backup-users
Wiki URL:
http://rdiff-backup.solutionsfirst.com.au/index.php/RdiffBackupWiki