Building an RSS feed

Draft of 2016.07.29

May include: meta ↘ Ruby ↗ &c.

I think I’d like to add an RSS feed of some sort to this site. While it’s not a blog, it does get updated frequently, and there are still enough people in the world using RSS readers to receive notification of updates. Including me, since Vienna has been revived.

As of this writing, the whole site is just a little pile of Ruby code, powered mainly by Sinatra, using HTML templates written in Liquid markup, serving these posts which are written in kramdown-flavored markdown. If you haven’t done much web design, that might sound inordinately complicated, but in fact it’s a lovely little separation of concerns. I can write in markdown (as I usually do anyway), save the files in a particular place, and the Ruby server parses that, jams it into the right slots in the templates I’ve written, renders the HTML for me, and that’s what you’re looking at now. I worry about words; the system I’ve built (and pared down) over the years does the lifting.

The same idea applies to an RSS feed, I suppose. Except instead of Liquid-templated HTML, the RSS feed will be some kind of XML? The Sinatra Cookbook certainly has a simple enough brute-force example on offer.

In my current setup here, whenever somebody visits /words, all the content is checked, the YAML headers in each file (if present) are consumed and parsed, and the resulting collection of “items” is a Ruby Array instance of one Hash per item, and each of those has keys like "title" and "date", all values specified in the YAML headers of the files. Along the way I remove any items that include a "hide" header, and sort the overall collection by date.

In other words, that whole page is produced by this snippet of Ruby code:

    get '/words' do
      yamls = get_article_yamls().
                reject {|y| y["hide"]}.
                sort_by {|y| y["sort_date"]}.
                reverse
      liquid :words, :locals => {:yamls => yamls}
    end

The last line does all the rendering, of course, and there’s a pile of ungainly HTML and Liquid stuff that makes the words appear on a page: translated into somewhat-plain English, that code says, “Render the liquid template named words.liquid (or words.html), which is stored in the /views folder by default, with the variable :yamls set to this pile of information.”

Certainly seems to me that the same basic principles will apply to an XML RSS feed. That’s the sense I get from reading the recipe example. In fact, I can use the same code, more or less. I’m a bit concerned that my yamls variable hasn’t got any extracts of the texts of items, just the header information. I may have to live with that for now.
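To make the shape of that collection concrete, here is a minimal sketch of the filter-and-sort step run against an in-memory array. The three hashes and their integer dates are made-up stand-ins for whatever get_article_yamls() really returns; only the key names ("title", "slug", "sort_date", "hide") come from the description above.

```ruby
# Hypothetical stand-in for the result of get_article_yamls():
# one Hash per file, keyed by the names used in the YAML headers.
# The sort_date integers are invented for illustration.
items = [
  { "title" => "Running Register Machines", "slug" => "words/an-artificial-chemistry-3", "sort_date" => 200 },
  { "title" => "Building an RSS feed",      "slug" => "words/building-an-RSS-feed",      "sort_date" => 300 },
  { "title" => "An unfinished draft",       "slug" => "words/a-draft",                   "sort_date" => 400, "hide" => true }
]

# Same chain as the route handler: drop hidden items, sort oldest-first,
# then reverse so the newest item comes first.
visible = items.
  reject { |y| y["hide"] }.
  sort_by { |y| y["sort_date"] }.
  reverse

titles = visible.map { |y| y["title"] }
puts titles
```

Each hash carries header fields only; nothing in it holds the body text of the post.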

I start from the example XML template, and after a bit of review of this intro to the RSS standard I realize (1) that I can boil it down to <title>, <link>, and <description> for each entry (skipping for now the date and guid fields present in the example code), and (2) that I’m obliged to have a description.

I should probably be sure of what I’m passing in before I edit the template, though. Just to peek at the constructed hash I set up this route handler in my Sinatra app:

    get '/rss' do
      yamls = get_article_yamls().
              reject {|y| y["hide"]}.
              sort_by {|y| y["origin_date"]}.
              reverse
      @entries = yamls.take(10)
      "#{@entries}" 
      # builder :rss
    end

In the “final” route handler, I’ll call builder :rss, but for now I’m just rendering the @entries object as a string. Because who knows what’s in there, and whether I’ve got it put together even remotely the way I imagine? Not me.

That said, the thing I see is about right. It has 10 entries, the entries are sorted by date (more on that below), and all the other stuff I need (except a <description>) is present. So now I feel a bit more comfortable knowing what’s going in, and flip the toggle in the route handler so that this object is sent to the template with builder :rss.

Here’s a first pass on the template itself:

xml.instruct! :xml, :version => '1.0'
xml.rss :version => "2.0" do
  xml.channel do
    xml.title "Vaguery.com"
    xml.link "http://vaguery.com"
    xml.description "Recent additions to Vaguery.com"

    @entries.each do |yaml|
      xml.item do
        xml.title yaml["title"]
        slug = yaml["slug"]
        xml.link "http://vaguery.com/#{slug}"
        xml.description "Description of #{slug}"
      end
    end
  end
end

And here’s what I see in my browser when I visit it:

<rss version="2.0">
<channel>
<title>Vaguery.com</title>
<link>http://vaguery.com</link>
<description>Recent additions to Vaguery.com</description>
<item>
<title>Building an RSS feed</title>
<link>http://vaguery.com/words/building-an-RSS-feed</link>
<description>Description of words/building-an-RSS-feed</description>
</item>
<item>
<title>Running Register Machines</title>
<link>http://vaguery.com/words/an-artificial-chemistry-3</link>
<description>Description of words/an-artificial-chemistry-3</description>
</item>
<item>
<title>A quick Artificial Chemistry in Clojure</title>
... (and so on)

In other words… well, it seems to be working already.

Except for <description>.

Just what is this?

At the moment, the code I’m running to construct this collection of data is starting off by calling YAML.load_file() on every file in a certain directory. The file structure being read is, like that of most text-based blog systems these days, a header section, followed by a body section. Something like this:

    ---
    date: "2016-07-29"
    title: "Building an RSS feed"
    topics: [meta, Ruby]
    ---

    I think I'd like to add an RSS feed of some sort to this site....

The Ruby YAML.load_file() function reads this in, but it depends on the YAML --- separators to stop parsing after the headers. In other words, whether or not it reads the entire file all at once (and it might), it never even glances at the body text following a second line ---.
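A small sketch of that behavior, using YAML.load on a string rather than YAML.load_file on a real file so it runs on its own. The sample text is the header shown above; the variable names are only for illustration.

```ruby
require "yaml"

# A post in the header-plus-body layout, held in a string for the demo.
text = <<~POST
  ---
  date: "2016-07-29"
  title: "Building an RSS feed"
  topics: [meta, Ruby]
  ---

  I think I'd like to add an RSS feed of some sort to this site....
POST

# YAML.load returns only the first YAML document in the text,
# which here is the header block between the two --- lines.
header = YAML.load(text)

puts header.keys.inspect
puts header["title"]
```

The result is a plain Hash of header fields, with no key holding the body.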

Thus I seem to have at least two ways to move forward. I could add a description: string at the top of each entry when I write it. Or I could change the loop I’m running now, so that instead of immediately calling YAML.load_file() for every file, it reads the file’s text, splits that into two chunks—header and body—trims a little piece off the top of the body part, and tucks that extract into the headers part under a key.
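The second option might look something like the sketch below. Everything in it is an assumption: header_with_extract is a hypothetical helper name, the regular expression assumes the file starts with a --- line and has a second --- line closing the header, and “the first dozen words” is an arbitrary choice of extract.

```ruby
require "yaml"

# Hypothetical helper: split a post's text into header and body, and
# use the first few words of the body when no description is given.
def header_with_extract(text, words: 12)
  match = text.match(/\A---[ \t]*\n(.*?)\n---[ \t]*\n(.*)\z/m)
  return {} unless match          # no recognizable header block

  header = YAML.load(match[1]) || {}
  body   = match[2].strip
  header["description"] ||= body.split.first(words).join(" ")
  header
end

post = <<~POST
  ---
  title: "Building an RSS feed"
  ---

  I think I'd like to add an RSS feed of some sort to this site.
POST

entry = header_with_extract(post, words: 5)
puts entry["description"]
```

A hand-written description: in the header wins over the extract, because of the ||= assignment.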

The question really boils down to: Is there an aesthetically pleasing “description” to be had from every entry I have written, and will ever write, that consists of a fixed number of characters or lines? I think… no. I think there isn’t one.

For now, and until it gets a bit too onerous, I guess what I’ll do is actually make an effort to add a description: line or two whenever I write something. I’ve seen too many automatically-extracted feeds through the years that render as off-putting gibberish.

I’d rather write my off-putting gibberish by hand.

Perhaps not surprisingly, this path involves a lot less programming. I change the template a bit, and also do a little refactoring:

xml.instruct! :xml, :version => '1.0'
xml.rss :version => "2.0" do
  xml.channel do
    xml.title "Vaguery.com"
    xml.link "http://vaguery.com"
    xml.description "Recent additions to Vaguery.com"

    @entries.each do |yaml|
      xml.item do
        title = yaml["title"]
        slug = yaml["slug"]
        description = yaml["description"] || "\"#{title}\" is hard to describe..."
        
        xml.title title
        xml.link "http://vaguery.com/#{slug}"
        xml.description description
      end
    end
  end
end

Now if there is a YAML tag called description, that value will be used here, and if there’s not, the placeholder will be "This Item's Title" is hard to describe.... Good enough to try.
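One wrinkle worth knowing about the || fallback, sketched here with a hypothetical description_for helper that mirrors the expression in the template: in Ruby only nil and false are falsy, so a missing key falls through to the placeholder, but an empty string does not.

```ruby
# Hypothetical helper mirroring the fallback expression in the template.
def description_for(yaml)
  yaml["description"] || "\"#{yaml["title"]}\" is hard to describe..."
end

missing = description_for({ "title" => "Running Register Machines" })
written = description_for({ "title" => "Building an RSS feed",
                            "description" => "I add an RSS feed to this simple Sinatra app" })
blank   = description_for({ "title" => "Oops", "description" => "" })

puts missing
puts written
puts blank.inspect   # the empty string is truthy, so no placeholder is used
```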

I test whether this is working by giving this very item—the one I’m writing, and you’re reading—a description field. Here’s what appears when I visit /rss now:

<rss version="2.0">
<channel>
<title>Vaguery.com</title>
<link>http://vaguery.com</link>
<description>Recent additions to Vaguery.com</description>
<item>
<title>Building an RSS feed</title>
<link>http://vaguery.com/words/building-an-RSS-feed</link>
<description>I add an RSS feed to this simple Sinatra app</description>
</item>
<item>
<title>Running Register Machines</title>
<link>http://vaguery.com/words/an-artificial-chemistry-3</link>
<description>"Running Register Machines" is hard to describe...</description>
</item>
<item>
...

Which shows me that the specified description has been swapped in, and the default description is working as well.

When exactly?

I should have dates, I guess. They’re not necessary, but they’re certainly helpful.

As things stand now, the dates from my YAML headers end up stored as integers. That’s so I can more or less type any machine-readable date format in the headers of items, and also parse old items I wrote back in the 1990s, and still sort them in a reasonable way. They’re all read as strings, those are parsed by Ruby’s excellent Date.parse() function, and the resulting standard object is converted into an integer with a (somewhat kludgey) pass through x.to_time.to_i.

For example, if the header is the string "2006-04-11", this is parsed by Date.parse() to be #<Date: 2006-04-11 ((2453837j,0s,0n),+0s,2299161j)>, and then that’s converted to 1144728000. That’s because numerical values like that one are stored and parsed in YAML as numbers, not strings, and more importantly so I can sort values like "2006-04-11" and "April 12, 2006" and have them end up in the right order.
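As a rough sketch of that normalization (sortable_date is a hypothetical name, not the site’s actual code), mixed date formats can be sorted like this. The exact integers depend on the time zone of the machine running it, because Date#to_time produces midnight in local time, so the sketch prints them without promising particular values.

```ruby
require "date"

# Hypothetical helper: turn whatever date string is in a header into an
# integer number of seconds, so that mixed formats sort together.
def sortable_date(string)
  Date.parse(string).to_time.to_i
end

raw    = ["April 12, 2006", "2006-04-11", "1997-03-02"]
sorted = raw.sort_by { |s| sortable_date(s) }

sorted.each { |s| puts "#{sortable_date(s)}  #{s}" }
```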

In any case, by the time these dates arrive at the rss.builder template, they’re integers. So I should convert them back again for the RSS feed itself:

xml.instruct! :xml, :version => '1.0'
xml.rss :version => "2.0" do
  xml.channel do
    xml.title "Vaguery.com"
    xml.link "http://vaguery.com"
    xml.description "Recent additions to Vaguery.com"

    @entries.each do |yaml|
      xml.item do
        title = yaml["title"]
        slug = yaml["slug"]
        description = yaml["description"] || "\"#{title}\" is hard to describe..."
        date = yaml["origin_date"]
        
        xml.title title
        xml.link "http://vaguery.com/#{slug}"
        xml.description description
        xml.pubDate Time.at(date)
      end
    end
  end
end

Confusingly, it’s Ruby’s Time class, not Date, that turns the integers back into dates (Time.at returns a Time object). Slightly less confusingly, the times are all midnight, but the dates turn out to be what I expect:

<rss version="2.0">
<channel>
<title>Vaguery.com</title>
<link>http://vaguery.com</link>
<description>Recent additions to Vaguery.com</description>
<item>
<title>Building an RSS feed</title>
<link>http://vaguery.com/words/building-an-RSS-feed</link>
<description>I add an RSS feed to this simple Sinatra app</description>
<pubDate>2016-07-29 00:00:00 -0400</pubDate>
</item>
<item>
<title>Running Register Machines</title>
<link>http://vaguery.com/words/an-artificial-chemistry-3</link>
<description>"Running Register Machines" is hard to describe...</description>
<pubDate>2016-07-28 00:00:00 -0400</pubDate>
</item>
<item>
<title>A quick Artificial Chemistry in Clojure</title>
<link>http://vaguery.com/words/an-artificial-chemistry</link>
<description>
"A quick Artificial Chemistry in Clojure" is hard to describe...
</description>
<pubDate>2016-07-27 00:00:00 -0400</pubDate>
...

That’s a good start, at least.

I’ll spend a few minutes adding description fields to older entries, and probably discover quickly how onerous that is. And I’ll also add a link element to the headers of all the pages, indicating there’s an auto-discoverable RSS feed present.

When I look at the feed in my own RSS reader, though, there’s an interesting thing: all the “midnight” times are changed to “8pm last night”. Apparently when I call Time.at(i), it uses my local machine’s offset (currently EDT) to decide what “time” a round number refers to, and assumes I mean UTC as a base. So “midnight” without any specified offset is 8 pm the night before, here in EDT.

Crap.

After messing around, and with Barbara pairing with me to help catch the messy docs, we finally settle on the slightly messier

xml.pubDate Time.at(date).getgm.strftime("%F")

In other words, I change the time to GMT, then change that into just a date string like 2016-07-29. The RSS reader still infers that a post has to have some kind of time, so it says it’s 8am now, but at least it’s the right day.

But that’s good enough for the moment. I don’t want to deal with file change dates or anything, so I can’t see a way to make real (or even “realistic”) clock times be associated with these. Maybe another day. For now, everything will be published at 8am.
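A possible follow-up, sketched rather than tested against any feed reader: the RSS 2.0 specification asks for pubDate values in RFC 822 format, and Ruby’s standard time library adds a Time#rfc822 method that produces one. The integer below is the 2006-04-11 value from the earlier example; whether a given reader displays the result any better is an open question.

```ruby
require "time"   # adds Time#rfc822 (and friends) to the core Time class

stamp = 1144728000   # midnight EDT on 2006-04-11, i.e. 04:00 UTC

# The date-only form used in the template above.
date_only = Time.at(stamp).getgm.strftime("%F")

# An RFC 822 style timestamp, also converted to GMT first.
rfc822 = Time.at(stamp).getgm.rfc822

puts date_only
puts rfc822
```

Because both lines convert to GMT before formatting, the output does not depend on the time zone of the machine that renders the feed.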