Capstone project: stepping back for a moment

We have a capstone project, a reworking in Clojure of some of the mashup features of Yahoo Pipes. At our last meeting, Luke VanderHart presented a framework for building and connecting components. I’m going to recap the framework as well as a few philosophical decisions we hammered out. But I’m also going to invite you to take a look at the code with me, with an eye towards 1) understanding each bit and 2) replaying the development of the framework by starting with simple building blocks. Luke and I both agree that the current framework, although awesome, has leapfrogged the kind of discussion and collaboration we want to foster in the study group. So I’m going to try an experiment with coding in public. On this blog, and soon on github, we’re going to build the framework again, together, feature by feature, always embracing the principle of the simplest thing that could possibly work. We’ll build up the might of Luke’s framework, but in a way that we get there together. And we’ll probably make it better from all being involved.

First off, we haven’t agreed on a name for the capstone project, so I’m just going to call it the clojure mashup project, or the mashup project, or just the project. Luke has named his repo “flowjure”, and there’s a lot of “tubes” and “pipes” in the namespaces, but that’s immaterial right now. Don’t worry about the names.

Next, you can find Luke’s code on github: http://github.com/levand/flowjure/tree/master

  1. If you don’t have git installed, install it
  2. Decide where you want to put the cloned repository. I have a ‘clojure-dev’ directory in my home directory.
  3. Clone the repository in that location: clojure-dev$> git clone git://github.com/levand/flowjure.git
  4. Make sure your classpath contains the path to flowjure/src/. The classpath is set either through an environment variable, if you’re running Clojure from the command line, or in a configuration file for emacs or another editor. In my case, that’s /Users/michael/clojure-dev/flowjure/src

Now let’s get started by going over a few things from the last meeting.

  • The mashup project is component-based. For now, we’re trying out a strict policy: every component takes one or more inputs, either external resources like URIs or other project components, but yields only one output, either a Clojure data structure or something intended for output from (and thus marking the completion of) the component chain.
  • The component chain operates in a “pull” style, with each component in the chain asking the component(s) immediately “upstream” to give them data.
  • Because of the many-in/one-out structure of the components and the “pull” metaphor, the component chain can’t branch. Data funnels down to a single output.
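
To make the “pull” idea concrete, here’s a minimal sketch. It is not Luke’s framework: the names source, combine, transform and call-log are made up for illustration, and I’m assuming a component can be modeled as a zero-argument function that asks its upstream components for data only when it is called.

```clojure
;; Hypothetical sketch, not the flowjure API: a component is a zero-argument
;; function that, when called, pulls from its upstream components.
(def call-log (atom []))

(defn source
  "A component with no upstream: yields a fixed collection when pulled."
  [label data]
  (fn []
    (swap! call-log conj label)
    data))

(defn combine
  "Many in, one out: pulls every upstream and concatenates the results."
  [& upstreams]
  (fn []
    (swap! call-log conj :combine)
    (doall (mapcat #(%) upstreams))))

(defn transform
  "One in, one out: pulls its single upstream and applies f to each record."
  [f upstream]
  (fn []
    (swap! call-log conj :transform)
    (doall (map f (upstream)))))

;; Wiring the chain runs nothing; data flows only when the last component is pulled.
(def chain
  (transform inc
             (combine (source :a [1 2])
                      (source :b [3]))))

(chain)    ; => (2 3 4)
@call-log  ; => [:transform :combine :a :b]
```

The log shows the request traveling upstream from the final component, and since every component returns exactly one value there is nowhere for the chain to branch.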

Furthermore, Paul Barry described the different kinds of objects in the project in five categories:

  1. source (something external like an RSS feed or file),
  2. builder (any format into internal representation),
  3. collector/combiner (multiple inputs to one output),
  4. transform (alter one input into one output),
  5. formatter/outputter (internal representation into any format)
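
As a rough illustration of those five roles, here is a toy chain built from plain functions. Everything in it is an assumption on my part: the function names, the comma-separated “external” data and the record shape are invented, and none of it comes from Luke’s code.

```clojure
(require '[clojure.string :as str])

;; All names here are hypothetical stand-ins for the five roles.
(defn source-a [] "apple,3\npear,5")     ; 1. source: raw external data
(defn source-b [] "plum,2")

(defn build                              ; 2. builder: raw text -> records
  [text]
  (for [line (str/split-lines text)
        :let [[item qty] (str/split line #",")]]
    {:item item :qty (Integer/parseInt qty)}))

(defn combine [& record-seqs]            ; 3. combiner: many in, one out
  (apply concat record-seqs))

(defn double-qty [records]               ; 4. transform: one in, one out
  (map #(update-in % [:qty] * 2) records))

(defn format-lines [records]             ; 5. formatter: records -> text
  (str/join "\n" (map #(str (:item %) "=" (:qty %)) records)))

(format-lines
 (double-qty
  (combine (build (source-a)) (build (source-b)))))
;; => "apple=6\npear=10\nplum=4"
```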

With this in mind, let’s look at the RSS reader in the file src/flowjure/components/rss.clj:

(ns flowjure.components.rss
  (:require [clojure.xml :as xml]
            [clojure.zip :as zip])
  (:use flowjure.engine))

(defn create-map
  "Creates a map object from an RSS 'item' entry"
  [item]
  (reduce (fn [acc it]
            (assoc acc (keyword-to-str (:tag it)) 
              (first (:content it)))) {} (:content item)))

(def-component {:name "rss"
                :category "Input"
                :description "Loads an RSS Feed"
                :output-type "Record Sequence"
                :args {"url" {:doc "The URL of the RSS feed"
                              :type "String"
                              :min-required 1
                              :max-required 1}}}
  (fn [pipe pipe-args args]
    (let [xml (xml/parse (args "url"))
          zipper (zip/xml-zip xml)
          elements (-> zipper zip/down zip/children)
          items (filter #(= :item (:tag %)) elements)]
      (map create-map items))))

Part 1: Examining the code

Now, fire up a REPL, because I’m going to walk you through the execution of the meat of this code, the five lines at the end. We’re going to have to wrangle with namespaces a bit, and I’m going to advocate using the pretty printer in clojure.contrib during our REPL session.

We start out with the xml and zip namespaces already available to us, as using resolve demonstrates:

user> (resolve 'clojure.xml/parse)
#'clojure.xml/parse
user> (resolve 'clojure.zip/zipper)
#'clojure.zip/zipper
user> (resolve 'clojure.zip/down)
#'clojure.zip/down

The pretty-printer is in clojure.contrib, so we’ll need to get it into our current namespace with use:

user> (resolve 'clojure.contrib.pprint/pprint)
nil
user> (use 'clojure.contrib.pprint)
nil
user> (resolve 'clojure.contrib.pprint/pprint)
#'clojure.contrib.pprint/pprint

See http://bc.tech.coop/blog/081029.html for more basics on namespaces.

See http://code.google.com/p/clojure-contrib/wiki/PprintApiDoc for API docs on the pretty printer.
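
If you’re following along on Clojure 1.2 or later, the pretty printer lives in clojure.pprint rather than clojure.contrib.pprint. Here is a small sketch of the same idea using require with an alias instead of use; the alias pp is my own choice.

```clojure
;; Assumes Clojure 1.2 or later, where the pretty printer lives in clojure.pprint.
;; require loads the namespace; :as gives it a short alias instead of pulling
;; every public name into the current namespace the way use does.
(require '[clojure.pprint :as pp])

(resolve 'pp/pprint)   ; => #'clojure.pprint/pprint

;; with-out-str captures what pprint prints, handy for checking it in code.
(with-out-str (pp/pprint {:tag :item}))   ; => "{:tag :item}\n"
```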

Back to the code. Here are the lines I’m interested in right now:

(let [xml (xml/parse url)
      zipper (zip/xml-zip xml)
      elements (-> zipper zip/down zip/children)
      items (filter #(= :item (:tag %)) elements)] ...)

Let’s call the lines in the function one at a time in the REPL.

First, use xml/parse to read the RSS available at a URL (let’s use “http://rss.cnn.com/rss/cnn_topstories.rss”) and convert it into a Clojure data map. That’s a powerful function!

user> (clojure.xml/parse "http://rss.cnn.com/rss/cnn_topstories.rss")
{:tag :rss, :attrs {:xmlns:media "http://search.yahoo.com/mrss/", :xmlns:feedburner "http://rssnamespace.org/feedburner/ext/1.0", :version "2.0"}, :content [{:tag :channel, :attrs nil, :content [{:tag :title, :attrs nil, :content ["CNN.com"]} {:tag :link, :attrs nil, :content ["http://www.cnn.com/?eref=rss
[snip]

Well, it’s a map, but it’s sure hard to read. Let’s take advantage of one of the pretty printer’s awesomer tools, the pp macro, which pretty-prints the last value the REPL returned (our ugly map):

user> (pp)
{:tag :rss,
 :attrs
 {:xmlns:media "http://search.yahoo.com/mrss/",
  :xmlns:feedburner "http://rssnamespace.org/feedburner/ext/1.0",
  :version "2.0"},
 :content
 [{:tag :channel,
   :attrs nil,
   :content
[snip]

That’s better, although this thing is enormous. The thing to look for is “:tag :item”, which denotes an item, the kind of element we’re after.

[snip]
{:tag :item,
     :attrs nil,
     :content
     [{:tag :title,
       :attrs nil,
       :content ["Mexico City shuts down venues due to swine flu"]}
      {:tag :guid,
       :attrs {:isPermaLink "false"},
       :content
       ["http://www.cnn.com/2009/HEALTH/04/28/swine.flu/index.html?eref=rss_topstories"]}
      {:tag :link,
       :attrs nil,
       :content
       ["http://rss.cnn.com/~r/rss/cnn_topstories/~3/kjZP0YJPYI0/index.html"]}
[snip]
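
If you’d rather poke at this structure without hitting the network, xml/parse also accepts an InputStream. The sketch below feeds it a made-up feed held in a string; sample-feed and parse-str are hypothetical names, and the feed contents are placeholders.

```clojure
(require '[clojure.xml :as xml])
(import 'java.io.ByteArrayInputStream)

;; A made-up two-item feed standing in for the CNN one.
(def sample-feed
  (str "<rss version=\"2.0\"><channel>"
       "<title>Sample</title>"
       "<item><title>First</title><link>http://example.com/1</link></item>"
       "<item><title>Second</title><link>http://example.com/2</link></item>"
       "</channel></rss>"))

(defn parse-str
  "Parses XML held in a string by wrapping it in an InputStream."
  [s]
  (xml/parse (ByteArrayInputStream. (.getBytes s "UTF-8"))))

(def parsed (parse-str sample-feed))

;; Every element is a map with the same three keys: :tag, :attrs, :content.
(:tag parsed)                                      ; => :rss
(:attrs parsed)                                    ; => {:version "2.0"}
(map :tag (:content (first (:content parsed))))    ; => (:title :item :item)
```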

Now, Luke uses zip/xml-zip (that is, clojure.zip/xml-zip) to assemble the value for ‘zipper.’ What’s xml-zip do? Let’s check the API docs:

(xml-zip root)
Returns a zipper for xml elements (as from xml/parse), given a root element

Huh. OK. Maybe we should evaluate our parsed XML and see what we get.

user> (pprint (clojure.zip/xml-zip (clojure.xml/parse "http://rss.cnn.com/rss/cnn_topstories.rss")))
[{:tag :rss,
  :attrs
  {:xmlns:media "http://search.yahoo.com/mrss/",
   :xmlns:feedburner "http://rssnamespace.org/feedburner/ext/1.0",
   :version "2.0"},
  :content
[snip]

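That looks pretty familiar, apart from the opening square bracket. The zipper prints as a vector holding our parsed map (plus a path, which is nil at the root). The fancier parts, the functions the zipper uses to navigate, travel along as metadata, which doesn’t show up when printed.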
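
Here’s a small sketch for peeking behind the printed form. It leans on how clojure.zip happens to represent a location, which is an implementation detail rather than a documented contract, and the root node is hand-written rather than parsed.

```clojure
(require '[clojure.zip :as zip])

;; A hand-written node in the same {:tag :attrs :content} shape xml/parse returns.
(def root {:tag :rss, :attrs nil,
           :content [{:tag :channel, :attrs nil, :content []}]})

(def loc (zip/xml-zip root))

(vector? loc)        ; => true: a location is a two-element vector
(zip/node loc)       ; => the original map, unchanged
(second loc)         ; => nil: no path yet, because we are at the root
(keys (meta loc))    ; the zipper's helper functions ride along as metadata
```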

For convenience, let’s def a zipper variable to hold our Zipper.

user> (def zipper (clojure.zip/xml-zip (clojure.xml/parse "http://rss.cnn.com/rss/cnn_topstories.rss")))
#'user/zipper

Now we use the -> macro (called thread, or more precisely thread-first). Clojure’s English descriptions suffer from namespace collisions: thread and map, for instance, each have multiple meanings. The thread macro is a way to line up functions, feeding each result to the next one:
(-> zipper clojure.zip/down clojure.zip/children) is the same as (clojure.zip/children (clojure.zip/down zipper)).
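
A quick sketch of that equivalence with numbers, plus a look at the expansion itself. In the quoted form, zipper, down and children are just symbols, so nothing has to be defined for the expansion to work.

```clojure
(require '[clojure.walk :as walk])

;; -> feeds each result in as the first argument of the next form.
(-> 5 inc (* 2))   ; => 12
(* (inc 5) 2)      ; => 12, the same calls written inside-out

;; Expanding the macro shows the rewrite without evaluating anything.
(walk/macroexpand-all '(-> zipper down children))
;; => (children (down zipper))
```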

What do these do? According to the docs:

(children loc)
Returns a seq of the children of node at loc, which must be a branch

(down loc)
Returns the loc of the leftmost child of the node at this loc, or nil if no children

In other words, we get the children of the first child of the RSS XML root element as a seq. This is important: we have all the tag maps we want at the top level of the seq. We don’t have to do any digging now. This is what the seq looks like:

({:tag :title, :attrs nil, :content ["CNN.com"]} [snip] {:tag :item, :attrs nil, :content [{:tag :title, :attrs nil, :content ["HHS secretary confirmed in midst of outbreak"]}]} [snip])
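
To see down and children at work on something small, here is a sketch with a hand-built tree in the same shape as the parsed feed. The feed and channel-loc names are made up. The last expression is my own observation, not part of Luke’s code: for this fixed rss/channel layout, plain map lookups reach the same elements.

```clojure
(require '[clojure.zip :as zip])

(def feed
  {:tag :rss, :attrs nil,
   :content [{:tag :channel, :attrs nil,
              :content [{:tag :title, :attrs nil, :content ["Sample"]}
                        {:tag :item,  :attrs nil, :content []}
                        {:tag :item,  :attrs nil, :content []}]}]})

;; down moves from the rss root to its leftmost child, the channel.
(def channel-loc (zip/down (zip/xml-zip feed)))

(:tag (zip/node channel-loc))            ; => :channel
(map :tag (zip/children channel-loc))    ; => (:title :item :item)

;; The same elements, reached with plain map lookups instead of a zipper.
(= (zip/children channel-loc)
   (-> feed :content first :content))    ; => true
```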

Now we filter the tags for those with :tag value :item:

user> (def elements (-> zipper clojure.zip/down clojure.zip/children))
#'user/elements
user> (def items (filter #(= :item (:tag %)) elements))
#'user/items
user> (pprint (first items))
{:tag :item,
 :attrs nil,
 :content
 [{:tag :title,
   :attrs nil,
   :content ["HHS secretary confirmed in midst of outbreak"]}
  {:tag :guid,
   :attrs {:isPermaLink "false"},
   :content
   ["http://www.cnn.com/2009/POLITICS/04/28/sebelius.confirmation/index.html?eref=rss_topstories"]}
  {:tag :link,
   :attrs nil,
   :content
   ["http://rss.cnn.com/~r/rss/cnn_topstories/~3/dcHTwVYr0QU/index.html"]}
  {:tag :description,
   :attrs nil,
   :content
   ["The Senate today confirmed Kansas Gov. Kathleen Sebelius as secretary of 
health and human services on a 65-31 vote. Sebelius takes office as swine flu 
numbers climb worldwide. But confirming Sebelius, who met several obstacles
during confirmation hearings, doesn't bring the health team to 100 percent. 
There are still no appointees in place for any of the department's 18 key jobs.
[snip]"]}
  {:tag :pubDate,
   :attrs nil,
   :content ["Tue, 28 Apr 2009 10:10:23 EDT"]}
  {:tag :feedburner:origLink,
   :attrs nil,
   :content
   ["http://www.cnn.com/2009/POLITICS/04/28/sebelius.confirmation/index.html?eref=rss_topstories"]}]}

Next, the create-map helper function is applied to each element in the items collection (via map; again, many meanings) to get to the content in each of these {:tag ...} data maps.

Let’s just call the function’s body without defining it. And I’m removing the keyword-to-str call within Luke’s function to make things simpler, so the keys stay keywords:

user> (pprint (reduce (fn [acc it]
            (assoc acc (:tag it) 
              (first (:content it)))) {} (:content (first items))))
{:feedburner:origLink
 "http://www.cnn.com/2009/POLITICS/04/28/sebelius.confirmation/index.html?eref=rss_topstories",
 :pubDate "Tue, 28 Apr 2009 10:10:23 EDT",
 :description
 "The Senate today confirmed Kansas Gov. Kathleen Sebelius as secretary of health and human services on a 65-31 vote. Sebelius takes office as swine flu numbers climb worldwide. But confirming Sebelius, who met several obstacles during confirmation hearings, doesn't bring the health team to 100 percent. There are still no appointees in place for any of the department's 18 key jobs.
\n ...",
 :link "http://rss.cnn.com/~r/rss/cnn_topstories/~3/dcHTwVYr0QU/index.html",
 :guid
 "http://www.cnn.com/2009/POLITICS/04/28/sebelius.confirmation/index.html?eref=rss_topstories",
 :title "HHS secretary confirmed in midst of outbreak"}

A better name for the create-map helper function might be “select-contents”.
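
Here is the reduce on a tiny hand-written item, next to one possible alternative that builds the same map with into and juxt. The second function, select-contents-into, is a hypothetical variation of mine, not something from Luke’s code.

```clojure
(def item
  {:tag :item, :attrs nil,
   :content [{:tag :title, :attrs nil, :content ["First"]}
             {:tag :link,  :attrs nil, :content ["http://example.com/1"]}]})

;; The reduce from the post: start with {} and assoc one tag at a time.
(defn select-contents [item]
  (reduce (fn [acc child]
            (assoc acc (:tag child) (first (:content child))))
          {}
          (:content item)))

;; A hypothetical alternative: build [tag text] pairs and pour them into a map.
(defn select-contents-into [item]
  (into {} (map (juxt :tag (comp first :content)) (:content item))))

(select-contents item)
;; => {:title "First", :link "http://example.com/1"}
(= (select-contents item) (select-contents-into item))   ; => true
```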

Part 2: The simplest thing

The current code in the flowjure.components.rss namespace adheres to a pretty highly generalized interface: the def-component macro takes a bunch of metadata, some of which is necessary for the rss component but some of which is not (we don’t really have to specify that a function that reads a url requires exactly one url string). The interface is complicated because of all the other components def-component could and will create. Similarly, the function that does the work, which for our purposes *is* the rss component, takes three arguments: pipe, pipe-args, and plain old args. Again, these components have a consistent and sophisticated interface because of what we may ask them to do tomorrow. But let’s not worry about tomorrow just now.

What’s the simplest thing that could work? Well, what do we need? We need to pass a URL to some component and get back a list of data maps, each containing the different pieces of content from each item tag in the feed.

OK, how about

(ns tubes.components.rss
  (:require [clojure.xml :as xml]
            [clojure.zip :as zip]))

(defn select-contents
  "Creates a data map of the contents of an RSS 'item' entry"
  [item]
  (reduce (fn [acc it]
            (assoc acc (:tag it)
                   (first (:content it))))
          {}
          (:content item)))

(defn rss-reader [url]
  (let [xml (xml/parse url)
        zipper (zip/xml-zip xml)
        elements (-> zipper zip/down zip/children)
        items (filter #(= :item (:tag %)) elements)]
    (map select-contents items)))

Then, from the REPL:

user> (use 'tubes.components.rss)
nil
user> (def cnn-rss (rss-reader "http://rss.cnn.com/rss/cnn_topstories.rss"))
#'user/cnn-rss
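
If the CNN feed is unreachable, or you just want a repeatable check, rss-reader can be tried on a feed held in memory, since xml/parse takes an InputStream as well as a URL string. The sketch below repeats the two functions so it runs on its own; tiny-feed and stories are made-up names, and the feed contents are placeholders.

```clojure
(require '[clojure.xml :as xml]
         '[clojure.zip :as zip])
(import 'java.io.ByteArrayInputStream)

;; Same two functions as in the post, repeated so this runs on its own.
(defn select-contents [item]
  (reduce (fn [acc it] (assoc acc (:tag it) (first (:content it))))
          {}
          (:content item)))

(defn rss-reader [url]
  (let [xml (xml/parse url)
        zipper (zip/xml-zip xml)
        elements (-> zipper zip/down zip/children)
        items (filter #(= :item (:tag %)) elements)]
    (map select-contents items)))

;; A made-up feed; the titles and example.com links are placeholders.
(def tiny-feed
  (str "<rss version=\"2.0\"><channel><title>Tiny</title>"
       "<item><title>First</title><link>http://example.com/1</link></item>"
       "<item><title>Second</title><link>http://example.com/2</link></item>"
       "</channel></rss>"))

;; Pass a stream where the post passes a URL string.
(def stories
  (rss-reader (ByteArrayInputStream. (.getBytes tiny-feed "UTF-8"))))

(map :title stories)   ; => ("First" "Second")
(first stories)        ; => {:title "First", :link "http://example.com/1"}
```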

Sure, it isn’t fully lazy: filter and map return lazy seqs, but xml/parse reads the whole feed up front. And it doesn’t support any kind of reuse; it’s use-specific. But let’s add those things in as we need them. For now, we’re just exploring how the components work when we need them.

What’s next?

I’m going to propose a discussion on the mailing list for the next week or so. I propose we take on one thing at a time, and pass code around. Please join us there. You don’t have to be a DC native at this point: the action is online.
