Tuesday, April 3rd, 2007...1:11 pm
Scraping IMDB with Ruby and Hpricot
One service that cries out for a nice web service interface is IMDB. However, they want $$ or something for it...the nerve!
With just a few lines of Ruby we can query to our heart's content. In this post I'll go over how to create a very simple class for extracting information about movies from IMDB using the lovely Hpricot library.
Prerequisites
Download and install Hpricot
That's it!
The Class
So let's think of what information we want to grab.
I think we need the title of the movie, the rating, and all the other junk: genre, plot outline, etc.
So we'll create an IMDB class that takes an IMDB URL as input and can then give us all that other information.
So the beginning of our class looks like this
require 'hpricot'
require 'open-uri'
class IMDB
  def initialize(url)
    @url = url
    @hp = Hpricot(open(@url)) # fetch the page and parse it
  end
end
The Title
Now let's build our first method to grab the title. So I'll pull up my favorite movie, Lost In Translation, and take a look at the source code.
Here's the bit I'm looking for
<meta name="title" content="Lost in Translation (2003)">
So we need to search for a meta tag that has a name attribute of title and extract the contents of the 'content' attribute. Hpricot lets us use either the XPath query language or a CSS selector based query language to get quickly to the data we want. Since this query has a condition in it, we'll use XPath. The first step in getting this data is determining what sort of XPath query we need to write.
Here's what we have for our title method
def title
  @title = @hp.at("meta[@name='title']")['content']
end
We use our @hp instance variable, which is an Hpricot object, and tell it to return the first element with a tag of meta and a name attribute equal to title, then give us the value of the content attribute.
The Rating
Looking at the source for the rating, I see it's in a div with the class rating. So we'll simply search for that div, then use a regular expression to extract the number value out of it!
Here's the method
def rating
  rating_text = (@hp/"div.rating/b").inner_text # text of the bold tag inside the rating div
  if rating_text =~ /([\d\.]+)\/10/ # grab the number in front of "/10"
    @rating = $1
  end
  @rating
end
Looking closer, we have the query "div.rating/b", which in English says: find a bold tag in a div tag with the class name rating. The "/" operator we're using on @hp is a shortcut for the search method. It returns an Hpricot::Elements collection, and on that collection we call inner_text, which returns the text of the matched elements stripped of all HTML tags.
I then run the text through a regular expression to extract the rating out of 10 points.
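To see what that regular expression does on its own, here is a small sketch that runs without Hpricot. The helper name extract_rating and the sample strings are made up for illustration; only the pattern comes from the rating method above.

```ruby
# Hypothetical helper: applies the same pattern the rating method uses.
def extract_rating(text)
  # Capture the run of digits and dots that sits right before "/10".
  text =~ /([\d\.]+)\/10/ ? $1 : nil
end

# Sample strings are assumptions, not real IMDB markup.
p extract_rating("7.9/10 (74,647 votes)")
p extract_rating("awaiting 5 votes")
```

The first call returns the string "7.9" and the second returns nil, so anything calling rating should be ready for a nil when the page has no score.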
The Rest
The rest of the main information about the movie is contained in these divs with class name info, as seen in the html here
<div class="info">
<h5>Release Date:</h5>
3 October 2003 (USA)
<a class="tn15more inline" href="/rg/title-tease/releasedates/title/tt0335266/releaseinfo">more</a>
<a style="margin-left: 1em" class="tn15more inline" href="/rg/title-tease/trailers/title/tt0335266/trailers-screenplay-E19136-10-2">view trailer</a>
</div>
<div class="info">
<h5>Genre:</h5>
<a href="/Sections/Genres/Comedy/">Comedy</a> / <a href="/Sections/Genres/Drama/">Drama</a> / <a href="/Sections/Genres/Romance/">Romance</a> <a class="tn15more inline" href="/rg/title-tease/keywords/title/tt0335266/keywords">more</a>
</div>
<div class="info">
<h5>Tagline:</h5>
Everyone wants to be found. <a class="tn15more inline" href="/rg/title-tease/taglines/title/tt0335266/taglines">more</a>
</div>
<div class="info">
<h5>Plot Outline:</h5>
A movie star with a sense of emptiness, and a neglected newlywed meet up as strangers in Tokyo, Japan and form an unlikely bond. <a class="tn15more inline" href="/rg/title-tease/plotsummary/title/tt0335266/plotsummary">more</a>
</div>
<div class="info">
<h5>Plot Keywords:</h5>
<a href="/keyword/lyrical/">Lyrical</a>
/
<a href="/keyword/reflection/">Reflection</a>
/
<a href="/keyword/psychological-drama/">Psychological Drama</a>
/
<a href="/keyword/loneliness/">Loneliness</a>
/
<a href="/keyword/advertising/">Advertising</a>
<a class="tn15more inline" href="/rg/title-tease/keywords/title/tt0335266/keywords">more</a>
</div>
<div class="info">
<h5>Awards:</h5>
Won Oscar.
Another 67 wins
&
50 nominations
<a class="tn15more inline" href="/rg/title-tease/awards/title/tt0335266/awards">more</a>
</div>
We could extract each piece in its own separate method, but I'm lazy, so I think I'll just create one method called extrainfo that returns a hash of all the extra information.
def extrainfo
  if @extrainfo == nil #don't do it twice
    @extrainfo = {} #init our hash
    (@hp/"div.info").each do |inf| #go through each info div
      title = inf/"h5" #the type of infobox is stored in h5
      if title.any? #if we found one, we got data
        body = inf.inner_text
        body = body.gsub(/\n/,'') #remove newlines
        if body =~ /\:(.+)/ #extract body from our text
          body = $1
        end
        @extrainfo[title.inner_text.gsub(/[:\s]/,'').downcase] = body #store the body
      end
    end
  end
  @extrainfo
end
So our first query, "div.info", is a CSS based query that searches for all divs with a class of info. Now we've got to find a way to extract the title to use as a key in the hash, and the rest of the body will be the value in the hash.
Examining the info divs closer, I see the title is always stored in an h5 tag, so for each info div we'll grab the h5 tag from that div. Hpricot allows you to nest queries, so the results of our first search are also searchable, and so on. So we can search each of our inf variables for all h5 tags. Once we find one, we simply do a little bit of regular expression magic to separate the body from the title, then store that information in the @extrainfo hash.
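The "regular expression magic" is easier to follow when it is pulled out of the Hpricot loop. The sketch below does that with a hypothetical helper called info_pair and made-up sample strings standing in for what inner_text would return. The two gsub calls and the colon pattern are the ones from extrainfo.

```ruby
# Hypothetical helper: takes the h5 text and the whole div's text for one
# info div and returns the [key, body] pair that extrainfo would store.
def info_pair(heading_text, div_text)
  key = heading_text.gsub(/[:\s]/, '').downcase # "Plot Outline:" -> "plotoutline"
  body = div_text.gsub(/\n/, '')                # remove newlines
  body = $1 if body =~ /\:(.+)/                 # keep everything after the first colon
  [key, body]
end

# Sample strings are assumptions for illustration only.
key, body = info_pair("Plot Outline:", "\nPlot Outline:\nTwo strangers meet in Tokyo. more\n")
p key
p body

# Only the first colon is treated as the separator, so later colons survive.
p info_pair("Certification:", "Certification:USA:R / UK:15")
```

Because the match starts at the first colon and runs to the end of the string, the link text at the bottom of each div ("more") ends up glued onto the value, and values that contain colons of their own stay intact.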
All Done
So our finished class now looks like this:
#uncomment the next line if you installed hpricot from the gem
#require 'rubygems'
require 'hpricot'
require 'open-uri'
class IMDB
  def initialize(url)
    @url = url
    @hp = Hpricot(open(@url))
  end
  def title
    @title = @hp.at("meta[@name='title']")['content']
  end
  def rating
    rating_text = (@hp/"div.rating/b").inner_text
    if rating_text =~ /([\d\.]+)\/10/
      @rating = $1
    end
    @rating
  end
  def extrainfo
    if @extrainfo == nil #don't do it twice
      @extrainfo = {} #init our hash
      (@hp/"div.info").each do |inf| #go through each info div
        title = inf/"h5" #the type of infobox is stored in h5
        if title.any? #if we found one, we got data
          body = inf.inner_text
          body = body.gsub(/\n/,'') #remove newlines
          if body =~ /\:(.+)/ #extract body from our text
            body = $1
          end
          @extrainfo[title.inner_text.gsub(/[:\s]/,'').downcase] = body #store the body
        end
      end
    end
    @extrainfo
  end
  def reset
    @rating = nil
    @extrainfo = nil
  end
end
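The nil check in extrainfo and the reset method together act as a tiny cache. Here is a sketch of just that pattern, with a hypothetical CachedInfo class and a counter standing in for the expensive walk over the page, so it runs without Hpricot or a network connection.

```ruby
# Hypothetical class: mimics the caching in the IMDB class, nothing more.
class CachedInfo
  attr_reader :parses

  def initialize
    @parses = 0
  end

  def extrainfo
    if @extrainfo == nil # same guard as in the IMDB class
      @parses += 1       # stands in for looping over every info div
      @extrainfo = { "genre" => "Comedy / Drama / Romance" }
    end
    @extrainfo
  end

  def reset
    @extrainfo = nil     # the next call to extrainfo does the work again
  end
end

info = CachedInfo.new
info.extrainfo
info.extrainfo
p info.parses # 1, the second call reused the stored hash
info.reset
info.extrainfo
p info.parses # 2, reset forced a rebuild
```

Note that reset in the IMDB class clears only the stored results and leaves @hp alone, so as written it re-reads the already downloaded page rather than fetching it again.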
You could certainly get much more advanced in your IMDB scraper class, but the above should get you started. Here's a quick test script to see how our class works.
imdb = IMDB.new('http://imdb.com/title/tt0335266/')
p imdb.rating
p imdb.title
p imdb.extrainfo
So running this script gives us the output
"Lost in Translation (2003)"
{"company"=>"American Zoetropemore", "quotes"=>"Bob:I don't get that close to the glass until I'm on the floor.more", "director"=>"Sofia Coppola", "goofs"=>"Continuity: The pink flowers in Charlotte's room disappear when she is listening to the "soul searching" tape, but reappear later in the movie.more", "language"=>"English / Japanese / German / French", "usercomments"=>"breath takingmore", "moviemeter"=>" 35% since last weekwhy?", "mpaa"=>" Rated R for some sexual content.", "filminglocations"=>" Japanmore", "certification"=>"Indonesia:Dewasa / Malaysia:18PL (re-rating) / Malaysia:(Banned) (uncut version) / Iceland:L / Canada:PG (British Columbia/Manitoba/Nova Scotia/Ontario) / Hungary:14 / Argentina:13 / Australia:PG / Brazil:14 / Canada:14A (Alberta) / Canada:G (Québec) / Chile:TE / Finland:K-11 / Germany:6 (bw) / Hong Kong:IIB / Italy:T / Netherlands:AL / New Zealand:PG / Norway:A / Peru:PT / Philippines:PG-13 / Portugal:M/12 / Singapore:PG (edited for re-rating) / Singapore:R(A) (original rating) / South Korea:15 / Spain:13 / Sweden:Btl / Switzerland:12 (canton of the Grisons) / UK:15 / USA:R / Singapore:M18 (DVD rating)", "country"=>"USA / Japan", "awards"=>" Won Oscar. Another 67 wins&50 nominationsmore", "plotkeywords"=>"Lyrical / Reflection / Psychological Drama / Loneliness / Advertisingmore", "plotoutline"=>" A movie star with a sense of emptiness, and a neglected newlywed meet up as strangers in Tokyo, Japan and form an unlikely bond. more", "soundtrack"=>" She Gets Aroundmore", "soundmix"=>"Dolby Digital ", "runtime"=>"102 min ", "writer(wga)"=>"Sofia Coppola (written by)", "trivia"=>"In 1999, Bill Murray replaced his talent agency with an automated voice mailbox that can be reached with an 800 number he gives out sparingly. Sofia Coppola reportedly left hundreds of messages on Murray's mailbox before he finally called back to discuss her offer to cast him as the star.more", "color"=>"Color ", "tagline"=>"Everyone wants to be found. 
more", "genre"=>"Comedy / Drama / Romance more", "aspectratio"=>"1.85 : 1 more", "movieconnections"=>" Featured in The 2004 IFP/West Independent Spirit Awards (2004) (TV)more", "releasedate"=>" 3 October 2003 (USA) more view trailer"}
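Many of the values above end in "more" because the text of the "more" link at the bottom of each info div is part of inner_text. One possible cleanup, sketched here with a hypothetical helper called tidy that is not part of the class above, is to trim that suffix and the stray whitespace:

```ruby
# Hypothetical helper: strips a trailing "more" link label and outer whitespace.
def tidy(value)
  value.sub(/\s*more\z/, '').strip
end

p tidy("American Zoetropemore")
p tidy(" Japanmore")
p tidy("Sofia Coppola")
```

This is deliberately crude. A value that legitimately ends in those four letters would be clipped too, and entries like releasedate, which end in "more view trailer", would need their own handling. Removing the link elements before calling inner_text would be a more robust route.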
15 Comments
April 3rd, 2007 at 8:11 pm
Woops, deleted the previous comment. Anyway
James Smith Wrote
Why would I download a huge database and worry about keeping it up to date if I only wanted to scrape some films in real-time?
Think about it....
April 4th, 2007 at 2:46 am
Great work ... This could be used to extract any content as well ..
Thanks !!
April 4th, 2007 at 10:07 am
Peter Writes via e-mail:
I wonder if you have heard of scRUBYt!, a web extraction framework built
on top of WWW::Mechanize and Hpricot. I tried to come up with a similar
stuff to yours:
imdb_data = Scrubyt::Extractor.define do
fetch 'http://www.imdb.com'
fill_textfield 'q','lost in translation'
submit
click_link 'Lost in Translation'
stats do
rating '7.9/10'
votes '74,647 votes' do
count /\d+,\d+/
end
end
end
imdb_data.to_xml.write ($stdout,1)
as you can see there is not only scraping, but automatic navigation as
well...
The really great thing is that you can export this learning extractor to
a production one, and then it will work on all movie pages.
Check out http://scrubyt.org for further info.
April 5th, 2007 at 10:33 pm
Ruby Bikini - How to Process XML in Ruby...
[...] Continuing in the series of Brazilian bikini Web development tutorials, here is an experiment with the Yahoo Search API, Ruby and Brazilian bikinis. [...]
April 6th, 2007 at 12:09 pm
[...] Tim from We Heart Code has written an easy-to-follow, detailed tutorial about scraping data from the Internet Movie Database using Ruby and Hpricot. As I would have suspected, Peter Szinek, developer of ScRUBYt! presents an even simpler solution in the comments. [...]
April 7th, 2007 at 7:43 pm
Just be careful before you get too carried away with this.
Technically this violates the IMDB's terms of service
http://us.imdb.com/help/show_article?conditions
They prohibit screen scraping without their express permission.
Now you might get away with this, as long as they don't notice. but...
Some years ago I did an exercise in Java programming and decided to scrape the imdb in a program which played 'six degrees of Kevin Bacon'. A day or so into the exercise, they had disallowed access from my ip address. Since I was doing this from work, this effectively blocked access from my entire company. I e-mailed them and explained the situation, promised to stop, and they removed the block.
But they, or their code, do(es) seem to look at their server logs and they will notice if you do this too much.
April 7th, 2007 at 8:31 pm
for ruby web scraping you should check out http://software.pmade.com/scrapes/pages/show/Quick+Start , i help write scrapes, i have a few more projects at http://crookedhideout.com
April 8th, 2007 at 2:42 pm
This is great! I'm gonna play with it and have lots-o-fun. Thnx dude!
April 20th, 2007 at 11:38 am
why don't you just use omdb.org? It's implemented in Rails, it's free and it's RESTful. I see that omdb does not have as much information as imdb (at least not yet), but that's what user generated content is all about, right?
April 20th, 2007 at 1:27 pm
Ben,
You answered your own question. I'd rather not get into any specific implementation that I use, but suffice to say breadth of information is key. I'll keep an eye on OMDB though.
April 20th, 2007 at 3:19 pm
Tim,
even Wikipedia started with a single article :) We know that we have a long way to go, but we're confident that we're on a good start..
Ben
April 20th, 2007 at 10:17 pm
[...] These days the information that can feed a system or website comes in many different formats and from many different sources, for example decision support systems and mashup applications. To show how simple this is to do using Ruby, I wrote the example below based on the article "Scraping IMDB with Ruby and Hpricot". [...]
July 3rd, 2007 at 11:54 am
I found this useful to learn hpricot and solve a similar problem of web scraping. Thanks.
February 21st, 2008 at 10:23 am
[...] Scraping IMDB with Ruby and Hpricot [...]
April 1st, 2008 at 10:10 am
[...] Scraping IMDB with Ruby and Hpricot [...]