samizdat-devel

import_feeds.rb-0.3 - nonexistent usefulinc.com/rss/manifest/ + xhtml


From: boud
Subject: import_feeds.rb-0.3 - nonexistent usefulinc.com/rss/manifest/ + xhtml validation
Date: Tue, 6 Feb 2007 23:32:30 +0100 (CET)

hi samizdat-devel,

Here are some minor updates to the RDF import patch; this is version 0.3.

There's no change to index.rb relative to the previous version; only import_feeds.rb has changed. This is a patch relative to the 070120 snapshot.


(1) Some feeds such as:

http://argentina.indymedia.org/syn/features_long.rdf
http://jakarta.indymedia.org/newsfeed.php?type=feature&language=id

declare in the rdf header an xml namespace

xmlns:mn="http://usefulinc.com/rss/manifest/"

whose URL responds to requests with

File not found

Change this error message for pages not found in public/404.html

This results in the <rdf:Description> box at the bottom of the file
causing a parse error, presumably (i'm guessing) because it contains a
<mn:channels> sub-box and the rss parser is unable to handle undefined tags. In any
case, removing the whole <rdf:Description> box avoids the error.

The standard ruby rss library (as far as i understand it) does not have
an obvious way of handling this, so i gave up and wrote a hardwired
hack:

+        # Remove tag section not needed and known to be buggy for
+        # invalid "mn" type URI  http://usefulinc.com/rss/manifest/
+       if response =~ %r{http://usefulinc.com/rss/manifest/}
+           response.sub!(/<rdf:Description(.*\n)*?.*mn:channels.*(.*\n)*?.*<\/rdf:Description>/,"")
+        end


It works for at least the above two sites - argentina is running
sf-active (i think) and jakarta probably an old version of oscailt:

http://argentina.indymedia.org/syn/features_long.rdf
http://jakarta.indymedia.org/newsfeed.php?type=feature&language=id
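
A minimal standalone sketch of the same hack, for trying the regular expression outside Samizdat. The sample feed text, the constant names and the strip_manifest helper are made up for illustration; only the namespace URI, the tag names and the regular expression itself come from the patch.

```ruby
# Hypothetical sample feed: only the xmlns:mn URI and the tag names
# mirror the real feeds, the rest is invented for this sketch.
SAMPLE_FEED = <<~XML
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:mn="http://usefulinc.com/rss/manifest/">
    <item rdf:about="http://example.org/1">
      <title>Example item</title>
    </item>
    <rdf:Description rdf:ID="manifest">
      <mn:channels>
        <rdf:Seq/>
      </mn:channels>
    </rdf:Description>
  </rdf:RDF>
XML

# Same pattern as in the patch: from "<rdf:Description" to the first
# "</rdf:Description>" that follows a line containing "mn:channels".
MANIFEST_BLOCK = /<rdf:Description(.*\n)*?.*mn:channels.*(.*\n)*?.*<\/rdf:Description>/

# Returns the feed text with the manifest block removed. Text that does
# not mention the manifest namespace is returned unchanged.
def strip_manifest(feed_text)
  return feed_text unless feed_text.include?("http://usefulinc.com/rss/manifest/")
  feed_text.sub(MANIFEST_BLOCK, "")
end

cleaned = strip_manifest(SAMPLE_FEED)
puts cleaned
```

Note that the match starts at the first <rdf:Description in the text, so a feed with an earlier, unrelated <rdf:Description> element would lose everything between that element and the manifest block as well.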



(2) The second correction is replacing <it> by <em> and removing <br />
after </li> in order for the w3 xhtml validator not to complain.


cheers
boud



--- /tmp/tmp_snapshot/samizdat/cgi-bin/index.rb 2007-01-08 03:09:52.000000000 +0100
+++ /usr/share/samizdat/cgi-bin/index.rb        2007-02-01 00:13:29.000000000 +0100
@@ -12,6 +12,9 @@

 require 'samizdat/engine'

+require 'import_feeds.rb'  # TODO - should this be load or require?
+#require 'message_graph'  # TODO: file hierarchy probably wrong
+
 # messages that are related to any focus (and are not comments or old
 # versions), ordered chronologically by date of relation to a focus (so that
 # when message is edited, it doesn't flow up)
@@ -161,6 +164,13 @@
     features = features.join + %{<div class="foot">} + t.nav_rss(rss_features) +
       t.nav(features.size < config['limit']['features'],
       skip_feature + 1, 'index.rb?', 'skip_feature') + "</div>\n"
+
+     # This is to include a graph using  message_graph.rb
+#31.01.07 - off
+#    if( config['graph'] )
+#      node_pairs = collect_features_graph(0, false, limit_page)
+#      features += message_graph_method(node_pairs)
+#    end
   end

   if render_updates
@@ -172,6 +182,14 @@
       t.nav_rss(rss_updates) + t.nav(updates.size, skip + 1))
   end

+  imported_feeds = ""   # default is zero-length string
+  if( config['import_feeds'] )
+    imported_feeds = %{<tr><td class="links-head">}+ _('RDF Feeds')+
+      '</td></tr>
+    <tr><td class="links">' + import_feeds_method + '</td></tr>'
+  end
+
+
   page =
     if full_front_page
 %{<table>
@@ -180,10 +198,10 @@
   </thead>
   <tr>
     <td class="focuses">#{focuses}</td>
-    <td class="features" rowspan="3">#{features}</td>
-    <td class="updates" rowspan="3">#{updates}</td>
-  </tr>
-  <tr><td class="links-head">}+_('Links')+'</td></tr>
+    <td class="features" rowspan="6">#{features}</td>
+    <td class="updates" rowspan="6">#{updates}</td>
+  </tr>} + imported_feeds +
+  %{<tr><td class="links-head">}+_('Links')+'</td></tr>
   <tr><td class="links">
     <div class="focus"><a href="query.rb?run&amp;query='+CGI.escape('SELECT ?resource WHERE (dc::date ?resource ?date) (s::inReplyTo ?resource ?parent) LITERAL ?parent IS NOT NULL ORDER BY ?date DESC')+'">'+_('All Replies')+'</a></div>
     <div class="focus"><a href="foci.rb">'+_('All Focuses (verbose)')+'</a></div>




--- /dev/null   2005-09-15 04:53:34.000000000 +0200
+++ /usr/share/samizdat/cgi-bin/import_feeds.rb 2007-02-06 23:00:34.971304448 +0100
@@ -0,0 +1,176 @@
+#!/usr/bin/env ruby
+#
+# Samizdat logout
+#
+#   Copyright (c) 2002-2006  Dmitry Borodaenko <address@hidden>,
+#   Boud (Indymedia) <address@hidden>
+#
+#   This program is free software.
+#   You can distribute/modify this program under the terms of
+#   the GNU General Public License version 2 or later.
+#
+# vim: et sw=2 sts=2 ts=8 tw=0
+
+# VERSION import_feeds 0.3
+
+require 'samizdat/engine'
+
+require 'open-uri'
+require 'rss/1.0'
+require 'rss/dublincore'
+require 'rss/2.0'
+
+# TODO: The format_date method is from template.rb. In principle,
+# imported feeds should (could) be treated as resources - somewhat
+# similar to messages, but with some properties distinct from ordinary
+# messages. In that case, there would be no need to have redundancy
+# for the format_date method.
+def format_date(date)
+  date = date.to_time if date.methods.include? 'to_time'   # duck
+  date = date.strftime '%Y-%m-%d %H:%M' if date.kind_of? Time
+  date
+end
+
+
+def import_feeds_method()
+
+  import_feeds_body = "<ul>"
+
+  interval = config['timeout']['import_feeds'] # time interval for importing
+  interval = 3600 if (interval == nil)  # failsafe default
+  timenow = Time.now  # object of Time class
+
+  # The expected caching time is the last "round number" time interval,
+  # based on total time in seconds defined in the Time class.
+  expected_caching_time = timenow.to_i.divmod(interval)[0] * interval
+  import_feeds_cache_key = 'imported_feeds/' + expected_caching_time.to_s
+
+  import_feeds_list_array  = cache[import_feeds_cache_key]
+
+  if(import_feeds_list_array == nil)
+
+    import_feeds_list = Hash.new
+
+    config['import_feeds'].each do | feed_key, feed_value |
+      rss_source = feed_key
+
+      # At some point in the future, people might want to have e.g. https
+      # feeds, but there is no need to force people to write http:// when
+      # this is a very widely used default value. So protocol is optional
+      # here.
+
+      protocol = feed_value['protocol']
+      protocol = "http://" if( protocol == nil)
+
+      host = feed_value['host']
+      host = _(' Hostname missing.') if (host == nil)
+      filename = feed_value['filename']
+      filename = _(' Filename missing.') if (filename == nil)
+      anURI = protocol + host + filename
+      #    anURI = protocol + feed_value['host'] + feed_value['filename']
+
+      # TODO: security - check before untainting?
+      # TODO: store and prepare rdf feeds in all available languages
+      # and give the user the one s/he wants?
+      response= ""
+      valid_URI=0
+      begin
+        open(anURI.untaint,
+             "Accept-Language" => config['locale']['languages'][0]) do |file|
+          response += file.read
+          valid_URI=1
+        end
+      rescue SocketError
+        valid_URI=0
+        import_feeds_body += _('<li><em>Error opening ') + %{<a href="} +
+         anURI + %{">} + _('this feed') + "</a></em></li>\n"
+      rescue URI::InvalidURIError
+        valid_URI=0
+        import_feeds_body += _('<li><em>Error opening ') + %{<a href="} +
+         anURI + %{">} + _('this feed') + "</a></em></li>\n"
+      rescue
+        valid_URI=0
+        import_feeds_body += _('<li><em>Error opening ') + %{<a href="} +
+         anURI + %{">} + _('this feed') + "</a></em></li>\n"
+      end
+
+      if(valid_URI==1)
+
+        # Remove tag section not needed and known to be buggy for
+        # invalid "mn" type URI  http://usefulinc.com/rss/manifest/
+       if response =~ %r{http://usefulinc.com/rss/manifest/}
+           response.sub!(/<rdf:Description(.*\n)*?.*mn:channels.*(.*\n)*?.*<\/rdf:Description>/,"")
+        end
+
+        # The parsing of the feed initially allows non-RSS-1.0 compliant
+        # feeds, but the do_validate method is used on individual items
+        # later on to check their validity.
+        begin
+          rss = RSS::Parser.parse(response)  # for RSS 1.0 compliant feeds
+        rescue RSS::InvalidRSSError
+          rss = RSS::Parser.parse(response, false) # allow non RSS 1.0 compliant
+        end
+
+        if(rss)
+          # rss.channel in RSS 2.0 seems to contain info in "rss" for RSS 1.0
+          # So rss_channel is used here as a common name for either.
+          rss_channel = rss
+          if rss.rss_version == "2.0"
+            rss_channel = rss.channel
+          end
+
+          # if there is a 'max_entries' parameter, then use at most that
+          # number of items for that feed
+          n_items=rss_channel.items.length
+          if(feed_value['max_entries'])
+            if(n_items > feed_value['max_entries'])
+              n_items = feed_value['max_entries']
+            end
+          end
+
+          for item_number in 0...n_items
+            if rss_channel.item(item_number).do_validate
+              rss_link = rss_channel.item(item_number).link.strip
+              title = rss_channel.item(item_number).title.strip
+              date = format_date(rss_channel.item(item_number).date)
+
+              # add this feed to the list of valid feeds
+              import_feeds_list[rss_link] = { "rss_source" => rss_source,
+                "title" => title, "date" => date }
+
+            end
+          end  #     import_feeds_list.each { | feed_key, feed_value |
+        end  #    if(rss)
+      end #  if(valid_URI==1)
+    end # for feed_number in ...
+
+
+
+
+    # Sort the import feeds list by date. The result is an array of
+    # pairs.  The first element of each pair is the link (in principle,
+    # this should be unique). The second element of each pair is
+    # a hash, containing the other useful pieces of feed
+    # information (such as source, title, date)
+    import_feeds_list_array = import_feeds_list.sort {
+      |a,b| b[1]['date'] <=> a[1]['date'] }
+
+    # update the cache
+    cache[import_feeds_cache_key] = import_feeds_list_array
+
+  end #    if(import_feeds_list_array == nil)
+
+  import_feeds_list_array.each do | feed |
+    import_feeds_body +=
+      "<li> <em>" + feed[1]['rss_source'] +
+      '</em> <a href="' + feed[0] + '">' +
+      feed[1]['title'] + "</a> " +
+      feed[1]['date'] + "</li>\n"
+  end
+
+  import_feeds_body +=  "</ul>"
+
+  import_feeds_body
+
+end # def import_feeds_method
+
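
The cache key in import_feeds_method is built by rounding the current time down to the start of its interval, so every request within the same interval looks up the same cache entry. A minimal standalone sketch of that rounding follows. The helper names bucket_start and cache_key_for are hypothetical and not part of the patch; the 'imported_feeds/' prefix and the 3600 second default are taken from it.

```ruby
# Round a timestamp down to the start of its interval (in seconds).
# Integer division gives the same result as divmod(interval)[0].
def bucket_start(time, interval)
  (time.to_i / interval) * interval
end

# Build the cache key for the interval that contains the given time.
def cache_key_for(time, interval = 3600)
  "imported_feeds/#{bucket_start(time, interval)}"
end

# Two times within the same hour share a key, the next hour gets a new one.
puts cache_key_for(Time.at(7205))    # imported_feeds/7200
puts cache_key_for(Time.at(10799))   # imported_feeds/7200
puts cache_key_for(Time.at(10800))   # imported_feeds/10800
```

With this scheme the feeds are fetched again on the first request after each interval boundary, not a fixed time after the previous fetch.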



