I’m posting the code I’ve written for date management in the mashup project. It’s just a sketch, really. There’s plenty of work to do to fill it out. But I wanted to post what I’ve got now.
Thanks to Keith for suggesting that I use the joda.time library.
Organization: namespaces
I’ve broken the code out into a couple of namespaces, and thus into different files. The files live in folders under a src directory that is on my development classpath. For instance, the date time utility functions in namespace org.clojurestudydc.mashups.util.datetime are in a file src/org/clojurestudydc/mashups/util/datetime.clj.
Datetime wrangling
So date parsing is a bitch because dates are expressed in so many formats. The joda.time library offers a way to parse datetime strings by describing them using different characters for different parts of the date and time. The details are here:
http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html
The org.joda.time.format.DateTimeFormat class has a factory method forPattern that takes a descriptive string in this format and returns a DateTimeFormatter object ready to use on datetime strings that match the description. In Clojure, to make a formatter that handles RSS 1.0 datetime strings:
user> (def rss-10-fmt (org.joda.time.format.DateTimeFormat/forPattern
                       "yyyy-MM-dd'T'HH:mm:ssZZ"))
#'user/rss-10-fmt
Then use the object’s parseDateTime method on a timestamp:
user> (.parseDateTime rss-10-fmt "2005-10-02T09:01:15-07:00")
#<DateTime 2005-10-02T12:01:15.000-04:00>
One serious shortcoming of the DateTimeFormat class is that you can’t ask it to parse a timezone expressed as an acronym (EDT, e.g.). So, I have to search timestamps for these and convert them into numerical timezone descriptors (e.g. -0400). So I wrote a function replace-timezone-abbreviation and a datamap timezones for this. (Code follows below.)
Then I built a vector of descriptions of datetime formats. I only put in two, for illustration purposes: RFC822, which RSS 2.0 holds to (mostly), and ISO8601, the closest format to RSS 1.0. Each format is expressed as a datamap with a name, a regex (:rpattern), and a DateTimeFormat pattern (:dpattern). The rest of the code builds up functions that process a seq of datamaps and return a copy in which each datamap’s value for a given key is converted from a string to a joda.time.DateTime object. The code uses the formats’ regexes to select a format to use.
I have placeholders in the code for “hints” of various kinds. The idea is that you could pass a format’s name, and the code would try that format first. Pass a format, or a seq of them, and the code would try to match with those first. Pass a function that can be applied to a string value to derive a joda.time.DateTime object, and the code would try applying it, going on to the formats only if it threw an exception.
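The hints are only placeholders so far, but the function hint could look something like the rough sketch below. It assumes joda-time is on the classpath. The names example-formats, parse-by-regex and parse-with-hint are made up for illustration and are not part of the code in this post.

```clojure
(import '(org.joda.time.format DateTimeFormat))

;; Hypothetical one-entry format table, following the :rpattern/:dpattern idea.
(def example-formats
  [{:name     :ISO8601
    :rpattern #"\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d[+-]\d\d:\d\d"
    :dpattern "yyyy-MM-dd'T'HH:mm:ssZZ"}])

(defn parse-by-regex
  "Pick the first format whose regex matches and parse with its pattern.
  Returns nil when no format matches."
  [datestring]
  (when-let [fmt (first (filter #(re-find (:rpattern %) datestring)
                                example-formats))]
    (.parseDateTime (DateTimeFormat/forPattern (:dpattern fmt)) datestring)))

(defn parse-with-hint
  "Try hint-fn on the string first. If hint-fn is nil, returns nil, or
  throws, fall back to the regex-selected format."
  [datestring hint-fn]
  (or (when hint-fn
        (try (hint-fn datestring)
             (catch Exception _ nil)))
      (parse-by-regex datestring)))
```

Swallowing every Exception keeps the sketch short. A fuller version would probably catch only the parse failures it expects and let anything else propagate.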
OK, so here’s the code
(ns org.clojurestudydc.mashups.util.datetime
  (:import (org.joda.time DateTime DateTimeComparator)
           (org.joda.time.format DateTimeFormat)))
(defn date-comparator [] (DateTimeComparator/getInstance))
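;; Abbreviation -> numeric offset. AST/ADT are Atlantic time;
;; Alaska uses AKST/AKDT, so the keys no longer collide.
(def timezones {"AST" "-0400" "ADT" "-0300" "EST" "-0500" "EDT" "-0400"
                "CST" "-0600" "CDT" "-0500" "MST" "-0700" "MDT" "-0600"
                "PST" "-0800" "PDT" "-0700" "AKST" "-0900" "AKDT" "-0800"
                "HST" "-1000" "HDT" "-0900"})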
(defn replace-timezone-abbreviation [datestring]
  (let [ptn (re-pattern (apply str (interpose "|" (keys timezones))))
        match (re-find ptn datestring)]
    (if match
      (.replace datestring match (timezones match))
      datestring)))
(defstruct date-format-entry :name :rpattern :dpattern)
(def date-format-entries
  [(struct date-format-entry :RFC822
           #"\w\w\w, \d\d \w\w\w \d\d\d\d \d\d:\d\d:\d\d"
           "EEE, dd MMM yyyy HH:mm:ss ZZ")
   (struct date-format-entry :ISO8601
           ;; [+-] accepts offsets east of UTC as well as west
           #"\d\d\d\d-\d\d-\d\dT\d\d:\d\d:\d\d[+-]\d\d:\d\d"
           "yyyy-MM-dd'T'HH:mm:ssZZ")])
(defn find-fmt-by-name [name & entries]
  (filter #(= name (:name %)) (concat date-format-entries (first entries))))
(defn find-fmt-by-regex-match [datestring & xtra-entries]
  (let [entries (concat date-format-entries (first xtra-entries))]
    (first (filter (complement #(empty?
                                 (re-seq (:rpattern %) datestring))) entries))))
(defn parse-string [datestring dpattern]
  (.parseDateTime (DateTimeFormat/forPattern dpattern)
                  (replace-timezone-abbreviation datestring)))
(defn parse-item [item datekey]
  (let [date-format-schema (find-fmt-by-regex-match (datekey item))]
    (if (nil? date-format-schema)
      item ;better to throw an exception
      (let [datestring (datekey item)
            dpattern (:dpattern date-format-schema)
            dvalue (parse-string datestring dpattern)]
        (assoc item datekey dvalue)))))
(defn parse-mapseq-date-values [mapseq datekey & hints]
  (map #(parse-item % datekey) mapseq))
Here’s how you’d use it on a seq of rss items like the cleaned-rss seq we built earlier:
user> (use 'org.clojurestudydc.mashups.util.datetime)
nil
user> (def date-corrected-rss
        (parse-mapseq-date-values cleaned-rss :pubDate))
#'user/date-corrected-rss
user> (:pubDate (first date-corrected-rss))
#<DateTime 2009-06-23T14:37:54.000-04:00>
Sorting
And here’s how I use the datetime functions to sort. I have more code in a namespace under ‘….alterations.sorting’, as this code conceptually makes a copy of the structure with its items in a different order:
(ns org.clojurestudydc.mashups.alterations.sorting
  (:require [org.clojurestudydc.mashups.util.datetime :as dt]))
...
(defn configure-sort-by-date-fn [date-key]
  (let [dt-comparator (dt/date-comparator)]
    ;; arguments are swapped (%2 before %1) so the newest items come first
    (fn [coll] (sort #(.compare
                       dt-comparator (date-key %2) (date-key %1)) coll))))
The function configure-sort-by-date-fn takes a key to use to locate datetime values in a datamap and returns a function that’s ready to sort a collection (seq) of datamaps. Here’s how it’s used:
user> (use 'org.clojurestudydc.mashups.alterations.sorting)
nil
user> (def sort-rss-items (configure-sort-by-date-fn :pubDate))
#'user/sort-rss-items
user> (map :pubDate (take 5 (sort-rss-items date-corrected-rss)))
(#<DateTime 2009-06-23T15:28:45.000-04:00>
 #<DateTime 2009-06-23T15:06:41.000-04:00>
 #<DateTime 2009-06-23T15:02:38.000-04:00>
 #<DateTime 2009-06-23T15:02:27.000-04:00>
 #<DateTime 2009-06-23T14:52:36.000-04:00>)
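For comparison, here is a minimal sketch of the same newest-first ordering without a DateTimeComparator. It relies on joda.time's DateTime implementing Comparable, so Clojure's own compare works on it. It assumes joda-time is on the classpath. The example-items data and the name sort-newest-first are invented for illustration.

```clojure
(import '(org.joda.time DateTime))

;; Hypothetical items; only :pubDate matters for the sort.
(def example-items
  [{:title "older"  :pubDate (DateTime. 2009 6 23 14 37 54 0)}
   {:title "newest" :pubDate (DateTime. 2009 6 23 15 28 45 0)}
   {:title "middle" :pubDate (DateTime. 2009 6 23 15 2 38 0)}])

(defn sort-newest-first
  "Sort maps by the DateTime under date-key, most recent first.
  The arguments to compare are swapped to reverse the natural order."
  [date-key coll]
  (sort-by date-key #(compare %2 %1) coll))

(map :title (sort-newest-first :pubDate example-items))
;; => ("newest" "middle" "older")
```

The DateTimeComparator version is still the better fit if you later want to compare on only part of the datetime (dates without times, say), since joda.time offers comparators for that.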
Marking structures generated by builders
As a closing note, another “hint” for a transformer like the datetime parser could be a metadata tag denoting the origin of the structure (and thus the likely format of its values). So here’s an edit to the RSS builder that marks our collection of RSS items with origin 'rss:
(ns org.clojurestudydc.mashups.builders.rss)
...
(defn rss-reader [url]
  (let [xml (xml/parse url)
        zipper (zip/xml-zip xml)
        elements (-> zipper zip/down zip/children)
        items (filter #(= :item (:tag %)) elements)]
    (with-meta (map select-contents items)
      {:origin 'rss})))
We’ll need something better than concat now to combine these seqs: concat drops metadata, and we have to combine that too.
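One possible shape for that, as a sketch only: concatenate the collections and record the set of their :origin tags on the result. The function name concat-with-meta, the :origins key and the 'atom tag are all assumptions made up for this example.

```clojure
(defn concat-with-meta
  "Concatenate colls and attach metadata listing the distinct :origin
  values found on the inputs. Inputs without an :origin are ignored."
  [& colls]
  (with-meta (apply concat colls)
    {:origins (set (keep #(:origin (meta %)) colls))}))

;; Hypothetical inputs tagged the way rss-reader tags its result.
(def rss-items  (with-meta [{:title "a"}] {:origin 'rss}))
(def atom-items (with-meta [{:title "b"}] {:origin 'atom}))

(def combined (concat-with-meta rss-items atom-items))

(:origins (meta combined))
;; => a set containing rss and atom
```

This loses track of which item came from which source. If the parser needs a per-item hint, tagging each item map with its origin before combining might be the better trade.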
June 27, 2009 at 9:55 am
Michael -
Time zone abbreviations in text format are very problematic. There are some abbreviations that refer to more than one time zone, and for a given time zone, the abbreviations themselves are localizable.
For example, this page shows that our daylight savings time is referred to by at least the two abbreviations shown, and that EDT refers to an Australian time zone as well as ours:
http://www.worldtimezone.com/wtz-names/wtz-edt.html
So parsing a time zone abbreviation will never be 100% reliable unless your data set is guaranteed to fall within language and geographical bounds. This date/time stuff is *really* hairy.
- Keith