
bug#20339: sxml simple: sxml->xml mishandles namespaces?


From: Ricardo Wurmus
Subject: bug#20339: sxml simple: sxml->xml mishandles namespaces?
Date: Mon, 04 Feb 2019 21:44:02 +0100
User-agent: mu4e 1.0; emacs 26.1

Hello!

I just looked at this again and I think I came up with something useful.
Here’s some context:

Andy Wingo <address@hidden> writes:

> Hi :)
>
> On Wed 13 Jul 2016 15:24, address@hidden writes:
>
>> Referring to Oleg Kiseliov's paper [1], there are actually three
>> things involved:
>
> This summary is helpful, thanks.
>
>> What is missing? From my point of view:
>>
>>  - At xml->sxml time, the user doesn't know which namespaces
>>    are in the xml. So it would be nice if the XML parser
>>    could provide that.
>
> For some documents you do know, of course.
>
> And for larger perspective, I think that SSAX gives you all the tools
> you need to build specialist and very flexible XML parsers.  So to an
> extent solving the general problem isn't necessary -- we can always
> point people to SSAX.  But that's a bit rude ;) so if there are common
> patterns we should try to capture them in xml->sxml.  I see this bug as
> being a search for those patterns, but without the requirement of
> solving the problem in its most general form.
>
>>  - It would be super-nice if the XML parser could put that
>>    into the same nodes it found it, as described in [1]
>>    (i.e. in the (*NAMESPACES* ...) pseudo-attribute).
>>    This way we wouldn't have a global mapping, but one
>>    that resembles the original XML, even with the same
>>    prefixes. Less surprises overall. The round trip
>>    xml -> sxml -> xml would be (nearly) the identity.
>>
>>    With Ricardo's patch it would lump all the namespace
>>    declarations up in the top node, which formally is
>>    correct, but might scare XML people a bit :-)
>
> ACK.
>
>>  - At sxml->xml time there should be a way to somehow
>>    generate prefixex for "new" namespaces. I don't know
>>    at the moment how this would work, that depends on
>>    how the user is supposed to insert new nodes in the
>>    SXML. Does she specify the namespace? Both prefix
>>    (aka namespace-id, under my current assumption) *and*
>>    namespace? (note that the namespace-id/prefix alone
>>    wouldn't be sufficient).
>
> ACK.
>
> What do you think the next step is?  I am happy to wait FWIW, dunno if
> Ricardo has any feelings here.

Attached is a patch that does the requested things.  The parser
procedures like FINISH-ELEMENT have access to all the namespaces, so I
changed the FINISH-ELEMENT procedure to return the list of namespaces
in addition to its SXML tree return value.

I changed name->sxml to use only the namespace aliases / abbreviations
instead of the namespace URIs.  (This is not very efficient because we
need to traverse the list of namespaces every time.  Maybe we could
memoize this.  On the other hand, the length of the namespaces list may
not be large enough to affect performance too much.)
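
To illustrate the lookup described above, here is a minimal sketch.
It assumes that each namespace entry has the shape (abbrev uri . rest)
that the patch matches on; `uri->abbrev`, `qualified-name` and the
example.org URI are hypothetical names used only for illustration, and
the fallback to the URI when no abbreviation is known is an assumption
that is not part of the patch.

```scheme
(use-modules (ice-9 match)
             (srfi srfi-1))

;; Hypothetical namespace URI, represented as a symbol.
(define ns-x (string->symbol "http://example.org/x"))

;; Hypothetical namespace list; each entry is (abbrev uri . rest).
(define example-namespaces
  (list (cons* 'x ns-x ns-x)))

;; Return the abbreviation recorded for URI, or #f if there is none.
;; This walks the whole list on every call, as noted above.
(define (uri->abbrev uri namespaces)
  (any (match-lambda
         ((abbrev ns-uri . _) (and (eq? ns-uri uri) abbrev))
         (_ #f))
       namespaces))

;; Turn (uri . local-part) into abbrev:local-part.  When no
;; abbreviation is known, fall back to the URI itself (an assumption).
(define (qualified-name name namespaces)
  (match name
    ((uri . local-part)
     (symbol-append (or (uri->abbrev uri namespaces) uri)
                    (string->symbol ":")
                    local-part))
    (_ name)))

(qualified-name (cons ns-x 'foo) example-namespaces) ; => x:foo
```

If the linear scan ever turned out to matter, the same interface could
be backed by a hash table keyed on the URI symbol.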

In the end we get both namespace list and SXML tree from running the
parser.  Before wrapping this up in *TOP* we generate xmlns attributes
for all abbreviations and “patch” the first proper element’s attribute
list (i.e. we skip over a *PI* element if it exists).

The result is an SXML tree that begins with namespace declarations,
mapping abbreviations to URIs.  Within the SXML tree we’re only using
abbreviations, so there are no more invalid characters when converting
SXML to a string.
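
As a rough sketch of the declaration step, the following turns a
namespace list into xmlns attributes.  The entry shape
(abbrev uri . rest) is taken from the patch; the procedure name
`namespaces->declarations` and the example.org URI are hypothetical.

```scheme
(use-modules (ice-9 match)
             (srfi srfi-1))

;; Build (xmlns:abbrev "uri") attributes from a namespace list whose
;; entries look like (abbrev uri . rest).  *DEFAULT* entries and
;; anything with an unexpected shape are skipped.
(define (namespaces->declarations namespaces)
  (filter-map (match-lambda
                (('*DEFAULT* . _) #f)
                ((abbrev uri . _)
                 (list (symbol-append 'xmlns: abbrev)
                       (symbol->string uri)))
                (_ #f))
              namespaces))

;; Hypothetical namespace URI, represented as a symbol.
(define ns-x (string->symbol "http://example.org/x"))

(namespaces->declarations
 (list (cons* 'x ns-x ns-x)
       (cons* '*DEFAULT* #f #f)))
;; => ((xmlns:x "http://example.org/x"))
```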

I would be happy if you could test this as I’m not 100% confident that
this is correct.  Here are questions I wasn’t able to answer
conclusively:

* Is the value for “namespaces” that’s passed in to the
  FINISH-ELEMENT procedure always the same?

* Will the second return value of the final call to FINISH-ELEMENT
  really always be the complete list of *all* namespaces that have been
  encountered?

* Are there valid XML documents for which the match patterns to inject
  namespace declarations would not apply?  (e.g. documents with a PI
  element and two separate XML trees)
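
One case related to the last question is a root element that has no
attribute list at all.  The sketch below is one possible way to handle
that case, not what the patch does; `add-attributes` and
`patch-first-element` are hypothetical names, and it assumes that the
node list is in document order and contains only *PI* nodes and
elements.

```scheme
(use-modules (ice-9 match))

;; Add DECLS, a list of (name value) attributes, to ELEM.  Create the
;; (@ ...) list when the element does not have one yet.
(define (add-attributes elem decls)
  (match elem
    ((tag ('@ . attrs) . children)
     `(,tag (@ ,@decls ,@attrs) ,@children))
    ((tag . children)
     `(,tag (@ ,@decls) ,@children))))

;; Apply add-attributes to the first node in NODES that is not a *PI*
;; node; leave all other nodes untouched.
(define (patch-first-element nodes decls)
  (match nodes
    (() '())
    (((and pi-node ('*PI* . _)) . rest)
     (cons pi-node (patch-first-element rest decls)))
    ((elem . rest)
     (cons (add-attributes elem decls) rest))))

(patch-first-element
 '((*PI* xml "version=\"1.0\"") (x:foo "text"))
 '((xmlns:x "http://example.org/x")))
;; => ((*PI* xml "version=\"1.0\"")
;;     (x:foo (@ (xmlns:x "http://example.org/x")) "text"))
```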

--
Ricardo


From 83ee9de18a0ecaa237eb73e1b75d0b21e3e8d321 Mon Sep 17 00:00:00 2001
From: Ricardo Wurmus <address@hidden>
Date: Mon, 4 Feb 2019 21:39:06 +0100
Subject: [PATCH] sxml: xml->sxml: Record and use namespace abbreviations.

* module/sxml/simple.scm (xml->sxml): Add namespace declarations to the
attribute list of the first XML element.
[name->sxml]: Accept namespaces argument to look up abbreviation.
Return name with abbreviation prefix.
[parser]: Let FINISH-ELEMENT procedure return namespaces in addition to
SXML tree.
---
 module/sxml/simple.scm | 50 +++++++++++++++++++++++++++++++++---------
 1 file changed, 40 insertions(+), 10 deletions(-)

diff --git a/module/sxml/simple.scm b/module/sxml/simple.scm
index 703ad9137..52dd9af12 100644
--- a/module/sxml/simple.scm
+++ b/module/sxml/simple.scm
@@ -1,7 +1,8 @@
 ;;;; (sxml simple) -- a simple interface to the SSAX parser
 ;;;;
-;;;;   Copyright (C) 2009, 2010, 2013  Free Software Foundation, Inc.
+;;;;   Copyright (C) 2009, 2010, 2013, 2019  Free Software Foundation, Inc.
 ;;;;    Modified 2004 by Andy Wingo <wingo at pobox dot com>.
+;;;;    Modified 2019 by Ricardo Wurmus <address@hidden>.
 ;;;;    Originally written by Oleg Kiselyov <oleg at pobox dot com> as SXML-to-HTML.scm.
 ;;;; 
 ;;;; This library is free software; you can redistribute it and/or
@@ -30,6 +31,7 @@
   #:use-module (sxml ssax)
   #:use-module (sxml transform)
   #:use-module (ice-9 match)
+  #:use-module (srfi srfi-1)
   #:use-module (srfi srfi-13)
   #:export (xml->sxml sxml->xml sxml->string))
 
@@ -123,10 +125,15 @@ port."
         (acons '*DEFAULT* default-entity-handler entities)
         entities))
 
-  (define (name->sxml name)
+  (define (name->sxml name namespaces)
     (match name
       ((prefix . local-part)
-       (symbol-append prefix (string->symbol ":") local-part))
+       (let ((abbrev (and=> (find (match-lambda
+                                    ((abbrev uri . rest)
+                                     (and (eq? uri prefix) abbrev)))
+                                  namespaces)
+                            first)))
+         (symbol-append abbrev (string->symbol ":") local-part)))
       (_ name)))
 
   (define (doctype-continuation seed)
@@ -152,14 +159,16 @@ port."
                        (ssax:reverse-collect-str seed)))
              (attrs (attlist-fold
                      (lambda (attr accum)
-                       (cons (list (name->sxml (car attr)) (cdr attr))
+                       (cons (list (name->sxml (car attr) namespaces)
+                                   (cdr attr))
                              accum))
                      '() attributes)))
-         (acons (name->sxml elem-gi)
-                (if (null? attrs)
-                    seed
-                    (cons (cons '@ attrs) seed))
-                parent-seed)))
+         (values (acons (name->sxml elem-gi namespaces)
+                        (if (null? attrs)
+                            seed
+                            (cons (cons '@ attrs) seed))
+                        parent-seed)
+                 namespaces)))
 
      CHAR-DATA-HANDLER ; fhere
      (lambda (string1 string2 seed)
@@ -212,7 +221,28 @@ port."
   (let* ((port (if (string? string-or-port)
                    (open-input-string string-or-port)
                    string-or-port))
-         (elements (reverse (parser port '()))))
+         (elements (call-with-values
+                       (lambda () (parser port '()))
+                     (lambda (elements namespaces)
+                       ;; Generate namespace declarations mapping
+                       ;; abbreviations to URLs.
+                       (let ((ns-declarations
+                              (filter-map (match-lambda
+                                            (('*DEFAULT* . _) #f)
+                                            ((abbrev uri . _)
+                                             (list (symbol-append 'xmlns: abbrev)
+                                                   (symbol->string uri))))
+                                          namespaces)))
+                         ;; Inject namespace declarations into the first
+                         ;; proper element.
+                         (match (reverse elements)
+                           (((and pi-elem ('*PI* . _))
+                             (tag ('@ . attrs) . children))
+                            `(,pi-elem (,tag (@ ,@ns-declarations ,@attrs)
+                                             ,@children)))
+                           (((tag ('@ . attrs) . children))
+                            `((,tag (@ ,@ns-declarations ,@attrs)
+                                    ,@children)))))))))
     `(*TOP* ,@elements)))
 
 (define check-name
-- 
2.20.1

