From: Václav Slavík
Subject: Re: [lmi] an xml schema for (single|multiple)_cell_document file XML format
Date: Mon, 12 Mar 2012 15:46:19 +0100
Hi,
On 9 Mar 2012, at 19:27, Václav Slavík wrote:
>>> We can try to produce a RELAX NG schema for them to see how it goes.
>>> Should we?
>>
>> Yes, please.
>
> I have some questions about the format:
>
> (1) Are the elements under <cell> optional or required? As far as I can tell,
> the reading code is permissive and will use defaults if a value is missing,
> but should that be considered a valid file?
>
> (2) Does the order of elements under <cell> matter? (The output is always
> alphabetically sorted with current code, reading code doesn't care. It's
> marginally simpler to write the grammar if the order matters, but both are
> easily possible.)
>
> (3) Are empty class_defaults and particular_cells allowed, or do they have to
> contain at least one cell each?
Attached are RELAX NG schemas (using the more readable Compact Syntax) for .cns
and .ill files. They cover only the latest version of the format and assume the
following answers to my questions: (1) the elements are required, (2) their
order is significant, and (3) at least one child <cell> must exist. All three
assumptions can be changed: (2) easily, (1) and (3) trivially.
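To make the three assumptions concrete, here is a rough Python sketch that checks the same constraints using only the standard library. It is not part of the attached schemas and is no substitute for them. The element names passed as `required` are hypothetical placeholders, and the sketch assumes the files use no XML namespaces.

```python
import xml.etree.ElementTree as ET

def check_cell(cell, required):
    """Return a list of problems found in one <cell> element."""
    names = [child.tag for child in cell]
    missing = [name for name in required if name not in names]
    if missing:
        # Assumption (1): every element under <cell> is required.
        return ["missing: " + ", ".join(missing)]
    if names != list(required):
        # Assumption (2): the order of elements is significant.
        return ["unexpected order or extra elements"]
    return []

def check_document(xml_text, required):
    """Check every <cell> inside class_defaults and particular_cells."""
    root = ET.fromstring(xml_text)
    problems = []
    for group in ("class_defaults", "particular_cells"):
        for container in root.iter(group):
            cells = container.findall("cell")
            if not cells:
                # Assumption (3): at least one child <cell> must exist.
                problems.append(group + ": no <cell>")
            for cell in cells:
                problems.extend(check_cell(cell, required))
    return problems
```

Relaxing (1) or (3) here means dropping one branch. Relaxing (2) means comparing sorted names instead of the lists themselves, which mirrors how much work each change takes in the grammar.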
I had a closer look at several RELAX NG tools; in the end, I settled on Jing
(http://www.thaiopensource.com/relaxng/jing.html, by the same folks as Trang).
It has the most complete implementation of the RELAX NG Compact Syntax and the
best error messages, and it supports other schema languages too.
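For anyone who wants to script the validation, a minimal Python wrapper around Jing might look like the sketch below. The jar location and the file names in the usage note are assumptions. The sketch also assumes that Jing signals validation failure through a nonzero exit status.

```python
import subprocess

def build_jing_command(schema, documents, jar="jing.jar"):
    """Build the argument list for validating documents against a .rnc schema."""
    # -c tells Jing that the schema uses the RELAX NG Compact Syntax.
    return ["java", "-jar", jar, "-c", schema, *documents]

def validate(schema, documents, jar="jing.jar"):
    """Run Jing and return (ok, diagnostics); needs Java and the Jing jar."""
    result = subprocess.run(
        build_jing_command(schema, documents, jar),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr
```

A call such as `validate("census.rnc", ["sample.cns"])` would then return whether validation passed, along with whatever messages Jing printed.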
Other than Jing, I tried:
1. xmllint: doesn't handle RELAX NG Compact Syntax at all, only the rather
verbose XML one.
2. rnv: a Compact Syntax-only validator, implemented in C. It doesn't recognize
all of the language (it couldn't handle the "grammar" keyword; fortunately,
it's optional). Its error messages were either cryptic or amounted to little
more than, paraphrased, "syntax error" or "invalid value". It didn't even
provide useful source file locations (the worst offender was that any issue
inside cell.rnc was reported at illustration.rnc:5, i.e. at the place where
cell.rnc was included).
I am also attaching an example census.xsd file with an XML Schema converted
from census.rnc. It's rather large (126kB compared to <19kB of .rnc files),
although not as large as the corresponding RELAX NG XML file (409kB). It's much
less human-readable than the .rnc files, though. For one thing, it's heavily
structured, verbose XML, which is hard for humans to read in itself. To make
matters worse, Trang doesn't support the RELAX NG external references that I
rely on, so I had to run the .rnc files through jing -s to produce simplified
versions without them (this is how I ended up with 409kB of .rng file) and
convert that to .xsd. This simplification step removed custom data types (by
expanding them) and duplicated the schema parts corresponding to <cell>, making
the result a poor choice for human reading.
The results aren't that bad if the simplification step is omitted; see the
attached illustration.xsd. To produce it, I had to modify illustration.rnc by
removing the external reference and inlining cell.rnc. That wouldn't be a good
idea for maintenance, because cell.rnc is shared by both .cns and .ill files.
If you think you'll need nice XML Schema files, then we can either write a
custom script to merge cell.rnc into the other two files before passing them to
Trang, or implement externalRef support directly in Trang.
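As a rough idea of what the custom merge script could look like, here is a Python sketch that textually inlines external references. It rests on several assumptions. The reference is written as `external "cell.rnc"`. The referenced file holds a single pattern with no namespace or datatypes declarations of its own. The text `external "..."` never appears inside a comment or string literal. None of this has been checked against the attached schemas.

```python
import re
from pathlib import Path

# Matches an external reference such as: external "cell.rnc"
EXTERNAL_REF = re.compile(r'\bexternal\s+"([^"]+)"')

def inline_externals(schema_text, load):
    """Replace each external reference with the referenced pattern.

    `load` maps a referenced file name to that file's text.
    """
    def substitute(match):
        body = load(match.group(1)).strip()
        # Parentheses keep the inlined pattern a single unit, so a
        # trailing "+", "*" or "?" still applies to all of it.
        return "(\n" + body + "\n)"
    return EXTERNAL_REF.sub(substitute, schema_text)

def merge_file(source, target):
    """Write a copy of `source` with references resolved relative to it."""
    source = Path(source)
    def load(name):
        return (source.parent / name).read_text(encoding="utf-8")
    merged = inline_externals(source.read_text(encoding="utf-8"), load)
    Path(target).write_text(merged, encoding="utf-8")
```

Running `merge_file("illustration.rnc", "illustration-merged.rnc")` (the output name is made up) would give a single file to hand to Trang, while cell.rnc stays the one maintained copy.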
Regards,
Vaclav