emacs-orgmode

[Orgmode] Re: org-feed XML entities and character encoding


From: David Maus
Subject: [Orgmode] Re: org-feed XML entities and character encoding
Date: Fri, 13 Aug 2010 17:59:19 +0200
User-agent: Wanderlust/2.15.9 (Almost Unreal) SEMI/1.14.6 (Maruoka) FLIM/1.14.9 (Gojō) APEL/10.8 Emacs/23.2 (i486-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO)

Michael Brand wrote:
>Hi all,

>org-feed is becoming very useful for me, so far to manage the
>episodes of podcasts. Now I have a patch and a request for help.

>1. patch for an issue with XML entities
>=======================================

>I found that some XML entities in my feeds are not substituted. The
>comments of two recent org-feed.el commits by David Maus
>http://repo.or.cz/w/org-mode.git/commitdiff/6875716e76acfbe1084a47e59d18a30a933d92b6
>and
>http://repo.or.cz/w/org-mode.git/commitdiff/6875716e76acfbe1084a47e59d18a30a933d92b6
>led me to the thread
>http://thread.gmane.org/gmane.emacs.orgmode/26352
>and invited me to replace org-feed-unescape with xml-substitute-special
>which converts more XML entities. The resulting patch below helps for
>me but of course I would like it to be reviewed by an experienced elisp
>programmer and org-feed user before being applied.

This patch is fine and `xml-substitute-special' is the right thing to
do (i.e. convert numeric character references, too).
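
For illustration, a minimal sketch of what `xml-substitute-special'
does with both named entities and numeric character references. The
sample string is made up, and the behaviour is assumed to be that of
xml.el as shipped with Emacs:

```elisp
(require 'xml)

;; Hypothetical sample text, as it might appear in a feed item title.
(let ((raw "Caf&#233; &amp; Bar &lt;live&gt;"))
  ;; Named entities (&amp; &lt; &gt;) and the numeric character
  ;; reference &#233; are replaced by the characters they stand for.
  (xml-substitute-special raw))
```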

>2. request for help about an issue with multibyte character encoding
>====================================================================

>There is an issue with multibyte characters that appear in the input
>as unescaped, multibyte-encoded characters (not as XML entities; when
>written as XML entities, multibyte characters are substituted
>correctly). I
>looked for an example with a character encoding specified in the first
>line of the XML feed like
><?xml version="1.0" encoding="utf-8"?>
>and found one here:
>http://www.openscreencast.de/blog/rss.xml

The problem with this feed is that it contains raw multibyte
characters (here encoded as utf-8) that must be decoded according to
the feed's character encoding before they can be properly inserted in
the target buffer.

The attached patch does this by explicitly decoding new entries
according to their detected character encoding.
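
The decoding step can be sketched roughly as follows. This is not the
attached patch itself: `my/decode-feed-text' is a hypothetical helper,
and the fallback to utf-8 is an assumption made for the example only.

```elisp
;; Hypothetical helper, for illustration only; not taken from the patch.
(defun my/decode-feed-text (raw coding)
  "Decode the unibyte string RAW according to CODING.
CODING is a coding system symbol such as `utf-8'.  Fall back to
`utf-8' when CODING is nil or not a known coding system."
  (decode-coding-string
   raw
   (if (and coding (coding-system-p coding)) coding 'utf-8)))

;; "\303\244" is the two-byte utf-8 encoding of U+00E4 (a with diaeresis).
(my/decode-feed-text "Sch\303\244fer" 'utf-8)
```

Without such a step the two bytes end up in the buffer as two separate
raw bytes instead of one character.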

Btw., a helpful introduction to the topic is

The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!)

by Joel Spolsky

http://www.joelonsoftware.com/articles/Unicode.html

Best,
  -- David
--
OpenPGP... 0x99ADB83B5A4478E6

Attachment: 0001-Decode-entry-according-to-its-character-encoding.patch
Description: Text document
