ALIVE and Treetop

A sweet project

I’m proud to say that a really fun mini-project went live today: OKWU Alive for Oklahoma Wesleyan University, a very forward-thinking college. They wanted a site that is mainly updated by Twitter messages.

We used a combination of Radiant, Twitter4R, and an up-and-coming library, Treetop, that I heard about at RubyConf 2007. Ever since attending Nathan Sobo’s presentation I’ve wanted to put it to use, but kept putting it off.

The “challenge”

To give some context, OKWU wanted to parse direct Twitter messages and add them to the site. What made this interesting is that they wanted to be able to tag each message. Most messages take the form:

tag : message

Now I could obviously use regular expressions to parse out both the tag and the message, but what fun is that?
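
For reference, the regex route might look roughly like the sketch below. This is not code from the project; TAGGED_STATUS and split_status are names I made up for illustration, and the pattern simply assumes the "tag : message" shape described above.

```ruby
# Hypothetical regex baseline for the "tag : message" format.
# TAGGED_STATUS and split_status are illustrative names, not project code.
TAGGED_STATUS = /\A(?<tag>[a-zA-Z0-9_-]+) *: *(?<message>.*)\z/m

# Returns a hash with :tag and :message, or nil when the text is not tagged.
def split_status(text)
  match = TAGGED_STATUS.match(text)
  return nil unless match

  { tag: match[:tag], message: match[:message] }
end

result = split_status("awesomified : you won't believe it's that easy")
puts result[:tag]      # prints awesomified
puts result[:message]  # prints you won't believe it's that easy
```

It works, but everything lives in one pattern, which is exactly the part that gets harder to read as the format grows.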

Treetop to the rescue

Treetop takes a grammar file and turns it into a parser that can be loaded from Ruby code. Here is the grammar we used to define the Twitter message:

grammar Twitter
  rule status
    tag delimiter message
  end

  rule tag
    [a-zA-Z_0-9-]+
  end

  rule message
    .*
  end

  rule delimiter
    space* ':' space*
  end

  rule space
    ' '
  end
end

If you haven’t worked with grammar specifications before, don’t feel overwhelmed. What this essentially says is “a Twitter status (another name for a message from Twitter) is composed of a tag followed by a delimiter followed by a message.” Each part then has its own, more specific definition. For example, a tag may only contain alphanumeric characters, underscores, and dashes.
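
If it helps, one way to read the grammar is as a series of scans, one per rule, applied left to right. The sketch below does that by hand with Ruby's StringScanner. It is only a mental model, not how Treetop is implemented, and parse_status is a hypothetical helper name.

```ruby
require "strscan"

# Hand-written reading of the grammar above, one scan per rule.
# parse_status is a hypothetical helper, not part of Treetop.
def parse_status(text)
  scanner = StringScanner.new(text)

  # rule tag: one or more letters, digits, underscores, or dashes
  tag = scanner.scan(/[a-zA-Z_0-9-]+/)
  return nil unless tag

  # rule delimiter: space* ':' space*
  return nil unless scanner.scan(/ *: */)

  # rule message: every remaining character (possibly none)
  message = scanner.scan(/.*/m)
  [tag, message]
end

tag, message = parse_status("news : campus tour at noon")
puts tag      # prints news
puts message  # prints campus tour at noon
```

Each step either consumes its part of the input or the whole parse gives up, which is the same "followed by" reading the status rule expresses.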

“Ok, that’s neat, but how is it useful?”

The coolness comes in with the consumption of the grammar. Here’s the code that uses Treetop:

require "treetop"
Treetop.load "twitter"

parser = TwitterParser.new
parsed_results = parser.parse("awesomified : you won't believe it's that easy")

tag = parsed_results.get_tag
message = parsed_results.get_message
puts "message: #{message} classified under: #{tag}"

As you can see, Treetop loaded the grammar and immediately gave me a TwitterParser. From there I parsed an example Twitter message and retrieved the tag and message from the results.

“Wait, how did you get the tag and message?”

Well, I didn’t exactly show the entire grammar. Here’s the final one:

grammar Twitter
  rule status
    tag delimiter message {
      def get_tag
        tag.text_value
      end

      def get_message
        message.text_value
      end
    }
  end

  rule tag
    [a-zA-Z_0-9-]+
  end

  rule message
    .*
  end

  rule delimiter
    space* ':' space*
  end

  rule space
    ' '
  end
end

Almost identical to the above except…it has friggin’ Ruby code attached! That means when given a status, I can call #get_tag and #get_message to return the items. Pretty doggone easy.

“Impressive, but how is this better than just using regular expressions?”

I won’t deny that the same thing could be accomplished with a single regex, but this looks sexy. And it has additional benefits. Let’s say in the future they want to:

  • Allow multiple tags
  • Allow spaces and commas as valid tag delimiters
  • Allow the tags to be optional

Here’s a grammar modified with those exact requests:

grammar Twitter
  rule status
    (tags delimiter)? text {
      def get_tags
        if self.class.method_defined? "tags"
          tags.get_tags
        else
          []
        end
      end

      def get_message
        text.text_value
      end
    }
  end

  rule tags
    tag optional_tags:(optional_tag*) {
      def get_tags
        [tag.get_tag] + optional_tags.elements.map { |e| e.get_tag }
      end
    }
  end

  rule optional_tag
    tag_delimiter tag {
      def get_tag
        tag.text_value
      end
    }
  end

  rule tag
    [a-zA-Z_0-9-]+ {
      def get_tag
        text_value
      end
    }
  end

  rule text
    .*
  end

  rule delimiter
    space* ':' space*
  end

  rule space
    ' '
  end

  rule tag_delimiter
    space* ',' space* / space+
  end
end

Some examples and their output:

results = parser.parse("tag1 : the message")
results.get_tags     # => ["tag1"]
results.get_message  # => "the message"

results = parser.parse("tag1 tag2, tag3 : the message")
results.get_tags     # => ["tag1", "tag2", "tag3"]
results.get_message  # => "the message"

results = parser.parse("the message")
results.get_tags     # => []
results.get_message  # => "the message"

results = parser.parse(": the message")
results.get_tags     # => []
results.get_message  # => ": the message"

# Yeah, well, not bad for only 15 minutes; let's chalk the last one up to user error.
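
For comparison with the regex route mentioned earlier, here is a rough plain-Ruby sketch of the same three rules (multiple tags, space or comma separators, optional tags). None of this is project code: TAG, TAG_SEPARATOR, TAG_PREFIX, and split_tags are names invented for illustration, and I have only checked it against the four example inputs above.

```ruby
# Hypothetical regex version of the extended rules, for comparison only.
TAG           = /[a-zA-Z_0-9-]+/
TAG_SEPARATOR = / *, *| +/   # a comma with optional spaces, or one or more spaces
# Optional prefix: one tag, more tags split by separators, then the ':' delimiter.
TAG_PREFIX    = /\A(?<tags>#{TAG}(?:#{TAG_SEPARATOR}#{TAG})*) *: */

# Returns a hash with :tags (possibly empty) and :message.
def split_tags(text)
  match = TAG_PREFIX.match(text)
  # No tag prefix: treat the whole text as the message.
  return { tags: [], message: text } unless match

  { tags: match[:tags].scan(TAG), message: match.post_match }
end

result = split_tags("tag1 tag2, tag3 : the message")
puts result[:tags].inspect  # prints ["tag1", "tag2", "tag3"]
puts result[:message]       # prints the message
```

It gets the job done, but the structure is buried in one interpolated pattern, whereas the grammar keeps each rule named and separately editable.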

I want to thank Nathan Sobo for putting together such a useful and intuitive library. For more information about Treetop, you can check out the site as well as the mailing list.