trans-coord-devel

Re: How can we optimise the way we link to translations? (private)


From: Kaloian Doganov
Subject: Re: How can we optimise the way we link to translations? (private)
Date: Mon, 10 Dec 2007 20:11:51 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.50 (gNewSense gnu/linux)

Yavor Doganov wrote:

    Kaloian, could you please summarize (on the list) the most serious
    problem?

Yes, it has been months, but I'll try to recall the latest developments.

As far as I remember, all serious problems were fixed.  The last
stumbling block was this: Po4a's Xhtml module processes its input in
such a way that all adjacent HTML/XML comments are merged into one
large comment in the output.  This was a major issue, because we used
HTML/XML comments to place marks and placeholders in the original
document that had to be expanded or treated specially by our
postprocessing tools at a later point.

Here is a typical example from some article.html:

    <p>
    Updated:
    <!-- timestamp start -->
    $Date: 2007/06/19 00:02:58 $
    <!-- timestamp end -->
    </p>

We don't want the line with $Date to be translatable at all, since it is
set automatically by CVS.  This is a constantly changing string that we
don't want to leak into the POT and PO files.  So we comment it out
temporarily:

    <p>
    Updated:
    <!-- timestamp start -->
    <!-- $Date: 2007/06/19 00:02:58 $ -->
    <!-- timestamp end -->
    </p>

and then, when we run Po4a (using the Html module), this line is not
treated as a translatable string:

    <p>
    Последно обновяване:
    <!-- timestamp start -->
    <!-- $Date: 2007/06/19 00:02:58 $ -->
    <!-- timestamp end -->
    </p>

After Po4a's processing, we uncomment the line in the resulting file:

    <p>
    Последно обновяване:
    <!-- timestamp start -->
    $Date: 2007/06/19 00:02:58 $
    <!-- timestamp end -->
    </p>
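
The comment-out and uncomment steps around the Po4a run can be sketched
roughly as follows.  This is only an illustration, not our actual
tooling: the function names hide_date and restore_date are made up for
this example, and it assumes the $Date keyword always sits alone on its
own line, as in the fragments above.

```python
import re

# Hypothetical helpers, not the real postprocessing scripts.
# Assumption: the CVS $Date: ... $ keyword is alone on its line.
_DATE_LINE = re.compile(r"^([ \t]*)(\$Date:[^$\n]*\$)[ \t]*$", re.MULTILINE)
_HIDDEN_DATE_LINE = re.compile(
    r"^([ \t]*)<!-- (\$Date:[^$\n]*\$) -->[ \t]*$", re.MULTILINE
)


def hide_date(html: str) -> str:
    """Wrap the $Date line in a comment before Po4a sees the file."""
    return _DATE_LINE.sub(r"\1<!-- \2 -->", html)


def restore_date(html: str) -> str:
    """Remove the comment markers again after Po4a has run."""
    return _HIDDEN_DATE_LINE.sub(r"\1\2", html)


if __name__ == "__main__":
    fragment = (
        "<p>\n"
        "Updated:\n"
        "<!-- timestamp start -->\n"
        "$Date: 2007/06/19 00:02:58 $\n"
        "<!-- timestamp end -->\n"
        "</p>\n"
    )
    hidden = hide_date(fragment)
    print(hidden)
    print(restore_date(hidden))
```

Leading indentation is preserved in both directions, so a round trip
through the two functions leaves an unmodified file unchanged.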

The final fragment, with the $Date line restored, is exactly what we
want.  By using comments, we have effectively marked some strings as
non-translatable.  This used to work perfectly with Po4a's Html module
(which unfortunately had other problems), but didn't work at all when we
switched to Po4a's Xhtml module.  The problem was due to the way the
Xhtml module treats comments.  For example, the fragment:

    <p>
    Updated:
    <!-- timestamp start -->
    <!-- $Date: 2007/06/19 00:02:58 $ -->
    <!-- timestamp end -->
    </p>

was parsed and integrated into this:

    <p>
    Updated:



    <!-- timestamp start
     $Date: 2007/06/19 00:02:58 $
     timestamp end -->
    </p>

As Nicolas François (one of Po4a's developers) puts it:

    There are newlines and spaces between the comments.  The comments
    are extracted, and the newline and spaces are added to the current
    paragraph (to be translated).  At the end of the paragraph, the
    cumulated text are presented in the PO.  Newlines and spaces at the
    beginning of the paragraph are extracted (i.e.  not translated), and
    printed in the output text.

So, comments are now context-dependent, and as such, they are not
suitable for marking strings as non-translatable.  Nicolas suggested
that we could use a special tag, say, <mumbo>, to mark such strings, and
tell the Xhtml module not to treat the contents of this tag as
translatable.

This was actually done using the `tags' option, supported by the Xhtml
module, with `<gnu.org-i18n>' as the special tag name:

    <p>
    Updated:
    <!-- timestamp start -->
    <gnu.org-i18n>$Date: 2007/06/19 00:02:58 $</gnu.org-i18n>
    <!-- timestamp end -->
    </p>

All content between <gnu.org-i18n> and </gnu.org-i18n> is kept out of
the POT and PO files.  After Po4a's run, the special tags are removed
from the output.  So, we have a way to mark some strings as
non-translatable again.
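
Removing the special tags from the output is a simple substitution.  A
minimal sketch, assuming the tags never carry attributes and are spelled
exactly as above (the function name strip_marker_tags is invented for
this example and is not the name of any existing script):

```python
import re

# Matches the opening and the closing form of the marker tag.
# Assumption: no attributes and no whitespace inside the tag.
_MARKER_TAG = re.compile(r"</?gnu\.org-i18n>")


def strip_marker_tags(html: str) -> str:
    """Drop the <gnu.org-i18n> markers but keep what they enclose."""
    return _MARKER_TAG.sub("", html)


if __name__ == "__main__":
    line = "<gnu.org-i18n>$Date: 2007/06/19 00:02:58 $</gnu.org-i18n>"
    print(strip_marker_tags(line))
```

Only the markers are removed; the enclosed text and the surrounding
comments are left as they are.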

Of course, adjacent ordinary comments are still integrated, and this
leads to large areas of whitespace.  For example, the input:

    <h4>Translations of this page</h4>

    <!-- Please keep this list alphabetical. -->
    <!-- Comment what the language is for each type, i.e. de is Deutsch.-->
    <!-- If you add a new language here, please -->
    <!-- advise address@hidden and add it to -->
    <!--  - /home/www/bin/nightly-vars either TAGSLANG or WEBLANG -->
    <!--  - /home/www/html/server/standards/README.translations.html -->
    <!--  - one of the lists under the section "Translations Underway" -->
    <!--  - if there is a translation team, you also have to add an alias -->
    <!--  to mail.gnu.org:/com/mailer/aliases -->
    <!-- Please also check you have the 2 letter language code right versus -->
    <!-- <URL:http://www.w3.org/WAI/ER/IG/ert/iso639.htm> -->
    <!-- Please use W3C normative character entities -->

    <ul class="translations-list">

Becomes this in the output:

    <h4>Други преводи на тази страница:</h4>














    <!-- Please keep this list alphabetical. 
     Comment what the language is for each type, i.e. de is Deutsch.
     If you add a new language here, please 
     advise address@hidden and add it to 
      - /home/www/bin/nightly-vars either TAGSLANG or WEBLANG 
      - /home/www/html/server/standards/README.translations.html 
      - one of the lists under the section "Translations Underway" 
      - if there is a translation team, you also have to add an alias 
      to mail.gnu.org:/com/mailer/aliases 
     Please also check you have the 2 letter language code right versus 
     <URL:http://www.w3.org/WAI/ER/IG/ert/iso639.htm> 
     Please use W3C normative character entities -->
    <ul class="translations-list">

This large blank area is ugly when you look at the source, but it does
no harm to the text of the article as presented to the reader (who
reads it in a web browser), nor does it make the document invalid or
unsuitable for processing with automatic tools (XHTML parsers).  Thus,
I don't consider this whitespace a major problem that compromises the
use of Po4a for extracting and merging PO and HTML files.  The job is
done, although suboptimally.

If we want to avoid this whitespace block, we could:

    1. Postprocess the result with a tool (some sed script, for example)
       that compresses continuous whitespace with multiple newlines into
       one or two newlines.

    2. In the original articles, use a single comment to mark a block
       with multiple lines and paragraphs, instead of a separate comment
       for every line.

    3. Modify Po4a so that it treats comments differently.

Personally, I think we should not bother with option 3.  It is not only
hard, but it's also questionable whether this is the right thing to do
at all.
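
For option 1, the postprocessing step could look something like this.
It is written in Python rather than sed only to keep the examples in
one language; the function name squeeze_blank_lines is invented here,
and the choice to keep exactly one blank line is an assumption.  Note
that it would also squeeze blank lines inside <pre> blocks, which a real
tool might need to skip.

```python
import re

# A newline followed by two or more lines that are empty or hold only
# spaces and tabs, i.e. a run of at least two blank lines.
_BLANK_RUN = re.compile(r"\n(?:[ \t]*\n){2,}")


def squeeze_blank_lines(text: str) -> str:
    """Collapse each run of two or more blank lines into one blank line."""
    return _BLANK_RUN.sub("\n\n", text)


if __name__ == "__main__":
    sample = (
        "<h4>Translations of this page</h4>\n"
        "\n\n\n\n\n"
        "<!-- Please keep this list alphabetical. -->\n"
    )
    print(squeeze_blank_lines(sample))
```

A single blank line between two blocks is left untouched, so running the
function twice gives the same result as running it once.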



