Re[2]: [lmi] an xml schema for (single|multiple)_cell_document file XML
From: Vadim Zeitlin
Subject: Re[2]: [lmi] an xml schema for (single|multiple)_cell_document file XML format
Date: Tue, 10 Aug 2010 16:34:10 +0200
On Tue, 10 Aug 2010 10:41:36 +0000 Greg Chicares <address@hidden> wrote:
GC> | Issue A (major):
GC> | ===============
GC> | The current format contains only 'cell' nodes which represent cases,
GC> | class and cells. To specify the number of nodes of each type helper nodes
GC> | 'NumberOfCases' and 'NumberOfClasses' are used. Each of 'NumberOfXXX'
GC> | is a positive integer number N, which is followed by exactly
GC> | N 'cell' nodes.
GC>
GC> The 'NumberOf' elements aren't really appropriate in xml.
GC>
GC> | A simple workaround would be to rename the 'cell' nodes into
GC> | the corresponding cell type: 'case', 'class', 'cell'. This allows
GC> | to fix the document node structure and to get rid of the redundant
GC> | nodes 'NumberOfXXX'.
GC>
GC> These three categories must be distinguished somehow. I'm inclined to add
GC> an attribute or a subelement. Changing the main element tag seems drastic.
...
GC> Alternatively, use enclosing elements instead of delimiters,
GC> transforming the present '.cns' format:
An attribute would work, but IMHO using enclosing elements would be
better. Adding an attribute is probably the smaller change, but I think
that putting elements of different types under different parent nodes makes
more sense, and it shouldn't be much more difficult to implement.
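
To illustrate the enclosing-element idea, here is a minimal Python sketch. It is not lmi's actual format: the wrapper names 'case_default', 'class_defaults' and 'particular_cells', the 'name' child element and the read_groups() helper are all hypothetical, chosen only to show that the parent node can carry the cell type so that no 'NumberOfXXX' count is needed.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: the three wrapper names and the <name> child are
# assumptions made for this illustration, not part of the real format.
GROUPED = """\
<multiple_cell_document>
  <case_default>
    <cell><name>case</name></cell>
  </case_default>
  <class_defaults>
    <cell><name>class 1</name></cell>
    <cell><name>class 2</name></cell>
  </class_defaults>
  <particular_cells>
    <cell><name>cell 1</name></cell>
    <cell><name>cell 2</name></cell>
    <cell><name>cell 3</name></cell>
  </particular_cells>
</multiple_cell_document>
"""

WRAPPERS = ("case_default", "class_defaults", "particular_cells")


def read_groups(xml_text):
    """Return a dict mapping each wrapper name to its list of 'cell' nodes."""
    root = ET.fromstring(xml_text)
    groups = {}
    for wrapper in WRAPPERS:
        parent = root.find(wrapper)
        # A missing wrapper simply yields an empty group.
        groups[wrapper] = [] if parent is None else parent.findall("cell")
    return groups


groups = read_groups(GROUPED)
counts = {wrapper: len(cells) for wrapper, cells in groups.items()}
```

With such a layout the number of cells of each type is implied by the number of children, so a schema could constrain each wrapper separately instead of relying on a count followed by exactly that many nodes.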
GC> | Issue B (minor):
GC> | ===============
GC> | Most of the elements that represent an array/list/sequence are stored as
GC> | a single string with items separated with spaces.
GC> |
GC> | The bruteforce approach solves the issue by supplying complex regular
GC> | expressions for each sequence type. But if changing current format
GC> | of cns/ill files could be considered, then sequence elements could be
GC> | properly represented by a node with children nodes (instead of a single
GC> | string) which will allow rather simple validation of array/list/sequence
GC> | items separately.
GC>
GC> For input sequences, generality and expressive power are important: e.g.,
GC> 10000, retirement; 0
GC> in the 'case' cell (and replicated to the others) may suffice to specify
GC> the premium pattern for an entire census. If that's difficult to validate
GC> with XSD, so be it.
Yes, I think we'll just have to leave this as is. We could probably split
the sequence into parts and validate it at least partially, but I don't
think it's worth it.
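
For the record, partial validation could look roughly like the Python sketch below. It assumes only what the quoted example shows, namely that ';' separates segments and ',' separates the parts of a segment, and it accepts any number or bare word as a part. The real input sequence grammar is richer than this, and the helper names split_sequence() and partially_valid() are made up for the illustration.

```python
import re

# Assumed token shapes: a plain number or a bare keyword such as 'retirement'.
NUMBER = re.compile(r"[+-]?\d+(\.\d+)?")
WORD = re.compile(r"[A-Za-z_]+")


def split_sequence(text):
    """Split 'a, b; c' into [['a', 'b'], ['c']], dropping empty segments."""
    return [
        [part.strip() for part in segment.split(",")]
        for segment in text.split(";")
        if segment.strip()
    ]


def partially_valid(text):
    """Check only that every part is a number or a word; nothing deeper."""
    segments = split_sequence(text)
    if not segments:
        return False
    return all(
        NUMBER.fullmatch(part) or WORD.fullmatch(part)
        for segment in segments
        for part in segment
    )


parts = split_sequence("10000, retirement; 0")
ok = partially_valid("10000, retirement; 0")
bad = partially_valid("10000,, 5")
```

This catches only gross errors such as an empty part; it says nothing about whether a keyword or a duration is meaningful.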
GC> | Issue D (minor):
GC> | ===============
GC> | Enum element values could contain '_' instead of spaces (' ').
GC>
GC> In the past, they could. Now we generally avoid that; for instance, solve
GC> types include:
GC> "Endowment"
GC> "Target CSV"
GC> "CSV = tax basis"
GC> "Avoid MEC"
GC> which make sense to end users, who would find "Avoid_MEC" weird.
I don't understand why end users should look at XML files, though. And
while using spaces in XML is possible, I'd indeed prefer to avoid it, as
meaningful whitespace in any text format is just a recipe for trouble
(let me add "unless it's limited to the start of the line" to preemptively
defend myself against anti-Pythonic accusations). So why do we have to use
the same strings in XML and in user-visible places?
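
One way to decouple the two is a simple lookup table, sketched here in Python. Apart from "Avoid_MEC", the XML-side tokens are hypothetical, as are the to_display() and to_xml() helpers; the display strings are the solve types quoted above.

```python
# XML-side tokens (no whitespace) mapped to user-visible strings.
# Only "Avoid_MEC" appears in this thread; the other tokens are invented.
XML_TO_DISPLAY = {
    "Endowment": "Endowment",
    "Target_CSV": "Target CSV",
    "CSV_equals_tax_basis": "CSV = tax basis",
    "Avoid_MEC": "Avoid MEC",
}

# Reverse table, valid because the display strings are unique.
DISPLAY_TO_XML = {label: token for token, label in XML_TO_DISPLAY.items()}


def to_display(token):
    """Return the user-visible string for a token read from XML."""
    return XML_TO_DISPLAY[token]


def to_xml(label):
    """Return the whitespace-free token to write to XML."""
    return DISPLAY_TO_XML[label]
```

The enumeration in a schema would then list only the whitespace-free tokens, while the GUI keeps showing "Avoid MEC" and the like.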
Regards,
VZ