Re: [Duplicity-talk] unicode support strategy

From:	Aaron
Subject:	Re: [Duplicity-talk] unicode support strategy
Date:	Tue, 2 Oct 2018 10:26:27 +0100

Hello Radim,

Thank you for your help with Duplicity and your comments. This is a good question.

Please see the comments in duplicity/util.py:

https://bazaar.launchpad.net/~duplicity-team/duplicity/0.8-series/view/head:/duplicity/util.py

for a discussion of unicode <--> bytes conversion in duplicity.

On Sep 30 2018, at 7:11 pm, Radim Tobolka via Duplicity-talk <address@hidden> wrote:

> Hi,
> going through the code, I see you've decided not to perform byte/unicode
> conversion on I/O boundaries, but rather work with byte filenames and
> convert to unicode when needed. Is absence of surrogateescape codec
> error handler in py2 sole reason for this?

While duplicity does not solely use unicode paths internally, the intention is definitely to decode/encode all bytes to/from unicode at I/O boundaries. This is a relatively recent effort, though, and has not been completed. Nearly all of the code used to assume bytes, so you may come across some internal conversions to keep everything working while we "unicodeify" one part at a time -- please feel free to work on these!

As set out in the comments linked above (and as you have alluded to yourself), Python 2 does not offer a way to losslessly translate *nix bytes paths (e.g. Linux filenames) to unicode and back again (cf. os.fsencode/os.fsdecode in Python 3). There are backports and workarounds, but they are not bulletproof. Note that I am talking about paths only; everything else (files etc.) should absolutely be pulled in as unicode at the I/O boundaries.
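To illustrate the round-trip problem, here is a minimal Python 3 sketch. It is not taken from duplicity; the filename and variable names are made up for the example, and it calls the "surrogateescape" error handler directly (the handler that os.fsdecode/os.fsencode rely on for POSIX paths) so that the result does not depend on the local filesystem encoding.

```python
# Illustration only: a hypothetical filename, not duplicity code.
# 0xff can never appear in valid UTF-8, so this name is "undecodable".
raw_name = b"caf\xff.txt"

# Strict decoding rejects the name outright.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate
# code point (0xff becomes U+DCFF) instead of raising...
text_name = raw_name.decode("utf-8", "surrogateescape")

# ...and maps that surrogate back to the original byte on encoding,
# so the bytes -> unicode -> bytes round trip is lossless.
round_tripped = text_name.encode("utf-8", "surrogateescape")
```

Python 2 has no equivalent built-in error handler, which is why a plain decode there has to either fail or lose information for names like this one.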

The approach for filenames in duplicity is therefore to do the decoding at I/O boundaries for internal use, but to keep a copy of the bytes version to use when interacting with that file. When a path is read or created, the bytes version of the filename is stored in path.name (which also means all the old code that assumes bytes keeps using that and keeps working). This is then decoded to unicode (using util.fsdecode, which itself uses os.fsdecode on Python 3.2+) and stored in path.uc_name, which is used for all internal purposes and filename matching.
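As a rough sketch of the idea (not the real duplicity Path class): the class name SketchPath, the matches() helper and the use of fnmatch are all assumptions made for illustration, and the decoding is written for Python 3 with an explicit UTF-8 codec rather than going through util.fsdecode.

```python
import fnmatch


class SketchPath(object):
    """Hypothetical stand-in for a path object; not duplicity's Path."""

    def __init__(self, name_bytes):
        # Bytes version: kept untouched and used whenever we actually
        # talk to the filesystem about this file.
        self.name = name_bytes
        # Unicode version: decoded once at the I/O boundary and used
        # for all internal logic such as matching and messages.
        self.uc_name = name_bytes.decode("utf-8", "surrogateescape")

    def matches(self, pattern):
        # Matching is done on the unicode name only.
        return fnmatch.fnmatchcase(self.uc_name, pattern)


p = SketchPath(b"/home/user/r\xc3\xa9sum\xc3\xa9.txt")
is_match = p.matches(u"*/r\u00e9sum\u00e9.*")
```

The point of keeping both attributes is that nothing ever needs to re-encode uc_name to reach the file: the original bytes are still there in name.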

I have taken this approach for selection.py and globmatch.py, and am fixing up other files as I get to them as part of the Python 3 prep, but it would be great to have it consistent across the codebase. Wherever possible:

- things should be using .uc_name instead of .name;
- strings should be unicode; and
- if, for some reason, things need to be converted, please use util.fsencode/fsdecode, as then we can transparently upgrade to using built-in os.fsencode/fsdecode as we move to Python 3.
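To show why going through a single pair of wrappers makes the later upgrade transparent, here is a hedged sketch of what such wrappers could look like. This is not the actual code in duplicity/util.py (please read that file for the real behaviour); the function names fsdecode_sketch/fsencode_sketch and the lossy "replace" fallback are assumptions for illustration.

```python
import os
import sys


def fsdecode_sketch(bytes_filename):
    """Hypothetical wrapper: bytes filename -> unicode."""
    if hasattr(os, "fsdecode"):
        # Python 3.2+: defer to the built-in.
        return os.fsdecode(bytes_filename)
    # Older Pythons: best-effort decode. "replace" is an assumption
    # here and is lossy for bytes that do not decode cleanly.
    encoding = sys.getfilesystemencoding() or "utf-8"
    return bytes_filename.decode(encoding, "replace")


def fsencode_sketch(unicode_filename):
    """Hypothetical wrapper: unicode filename -> bytes."""
    if hasattr(os, "fsencode"):
        # Python 3.2+: defer to the built-in.
        return os.fsencode(unicode_filename)
    encoding = sys.getfilesystemencoding() or "utf-8"
    return unicode_filename.encode(encoding, "replace")


decoded = fsdecode_sketch(b"backup.tar")
encoded = fsencode_sketch(decoded)
```

Callers only ever see the wrapper, so switching the body over to the built-ins does not require touching any calling code.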

Any questions, please ask.

Kind regards,

Aaron
