Tuesday, April 3rd, 2007...1:11 pm

Scraping IMDB with Ruby and Hpricot

One service that cries out for a nice web service interface is IMDB, but they want $$ or something for it… the nerve!

With just a few lines of Ruby we can query to our heart’s content. In this post I’ll go over how to create a very simple class for extracting information about movies from IMDB using the lovely Hpricot library.

Prerequisites

Download and Install Hpricot
That’s it!

The Class

So let’s think of what information we want to grab.

I think we need the title of the movie, the rating, and all the other junk – genre, plot outline, etc.

So we’ll create an IMDB class that takes an IMDB URL as input and can then give us all that other information.

So the beginning of our class looks like this:
[ruby]
class IMDB
  require 'hpricot'
  require 'open-uri'

  def initialize(url)
    @url = url
    @hp = Hpricot(open(@url)) # on Ruby 3.0+, this must be URI.open(@url)
  end
end
[/ruby]

The Title

Now let’s build our first method to grab the title. So I’ll pull up my favorite movie, Lost In Translation, and take a look at the source code.

Here’s the bit I’m looking for
[html]
<meta name="title" content="Lost in Translation (2003)">
[/html]

So we need to search for a meta tag that has a name attribute of title and extract the contents of the ‘content’ attribute. Hpricot lets us use either XPath or a CSS-selector-based query language to get quickly to the data we want. Since this lookup has a condition in it, we’ll use XPath, so the first step is working out what sort of XPath query we need to write.

Here’s what we have for our title method
[ruby]
def title
  @title = @hp.at("meta[@name='title']")['content']
end
[/ruby]
We use our @hp instance variable, which is an Hpricot document, and tell it to return the first meta element whose name attribute is title, then give us the value of its content attribute.
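If you want to try the XPath query without hitting IMDB, here is a minimal sketch that runs the same query against a hard-coded fragment. It swaps Hpricot for REXML (which ships with Ruby) so it runs without installing anything; REXML needs well-formed XML, so the sample tags are self-closed. The SAMPLE_HEAD constant, the description tag in it, and the meta_title helper are made up for illustration and are not part of the IMDB class.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the page source: a trimmed, well-formed fragment.
# The description tag is a placeholder so the query has something to skip.
SAMPLE_HEAD = <<~HTML
  <head>
    <meta name="description" content="placeholder"/>
    <meta name="title" content="Lost in Translation (2003)"/>
  </head>
HTML

# Returns the content attribute of the first <meta name="title"> element,
# or nil when the document has no such element.
def meta_title(markup)
  doc = REXML::Document.new(markup)
  meta = REXML::XPath.first(doc, "//meta[@name='title']")
  meta && meta.attributes['content']
end

puts meta_title(SAMPLE_HEAD)
```

The condition in square brackets is what picks out the right meta tag; without it the query would return the first meta element it found, whatever its name.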

The Rating

Looking at the source for the rating, I see it’s in a div with the class rating. So we’ll simply search for that div, then use a regular expression to extract the number value out of it!

Here’s the html
[html]

User Rating:
7.9/10
(74,513 votes)

[/html]
[ruby]
def rating
  rating_text = (@hp/"div.rating/b").inner_text
  if rating_text =~ /([\d\.]+)\/10/
    @rating = $1
  end
  @rating
end
[/ruby]
Looking closer, we have the query “div.rating/b”, which in English says: find a bold tag inside a div tag with the class name rating. The “/” operator we’re using on @hp is a shortcut for the search method; it returns an Hpricot::Elements collection, and on that we call inner_text, which returns the text of those elements stripped of all HTML tags.

I then run the text through a regular expression to extract the rating out of 10 points.
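The regular expression step can be tried on its own, with no Hpricot involved. This is a small sketch, assuming inner_text hands back something like the rating text shown in the HTML above; SAMPLE_RATING_TEXT and extract_rating are names invented for this example.

```ruby
# Assumed shape of the text inside the rating div (copied from the sample above).
SAMPLE_RATING_TEXT = "User Rating:\n7.9/10\n(74,513 votes)"

# Returns the rating as a String, or nil if the text contains
# nothing that looks like "<number>/10".
def extract_rating(text)
  match = text.match(%r{([\d.]+)/10})
  match && match[1]
end

p extract_rating(SAMPLE_RATING_TEXT)
p extract_rating("no rating yet")
```

Note that the captured value is a string; call to_f on it if you need a number to compare or sort by.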

The Rest

The rest of the main information about the movie is contained in divs with the class name info, as seen in the HTML here:

[html]

Release Date:

3 October 2003 (USA)
more

view trailer

Genre:

Comedy / Drama / Romance more

Tagline:

Everyone wants to be found. more

Plot Outline:

A movie star with a sense of emptiness, and a neglected newlywed meet up as strangers in Tokyo, Japan and form an unlikely bond. more

Awards:

Won Oscar.
Another 67 wins
&

50 nominations
more

[/html]

We could extract each piece in its own separate method, but I’m lazy, so I think I’ll just create one method called extrainfo that returns a hash of all the extra information.

[ruby]
def extrainfo
  if @extrainfo == nil #don't do it twice
    @extrainfo = {} #init our hash
    (@hp/"div.info").each do |inf| #go through each info div
      title = inf/"h5" #the type of infobox is stored in h5
      if title.any? #if we found one, we got data
        body = inf.inner_text
        body = body.gsub(/\n/,'') #remove newlines
        if body =~ /\:(.+)/ #extract body from our text
          body = $1
        end
        @extrainfo[title.inner_text.gsub(/[:\s]/,'').downcase] = body #store the body
      end
    end
  end
  @extrainfo
end
[/ruby]

So our first query, “div.info”, is a CSS-based query; it searches for all divs with a class of info. Now we’ve got to find a way to extract the title to use as a key in the hash, and the rest of the body will be the value in the hash.

Examining the info divs more closely, I see the title is always stored in an h5 tag, so for each info div we’ll grab the h5 tags inside it. Hpricot allows you to nest queries, so the results of our first search are also searchable, and so on. So we search each of our inf variables for h5 tags; once we find one, we do a little bit of regular expression magic to separate the body from the title, then store that information in the @extrainfo hash.
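To see what the “regular expression magic” does to a single info div, here is a sketch of just the string handling, pulled out into a standalone helper. The split_info name and the sample strings are assumptions for illustration; in the class, the two inputs come from the h5 tag’s inner_text and the whole div’s inner_text.

```ruby
# Splits the flattened text of one info div into a [key, body] pair,
# using the same gsub and regex steps as extrainfo.
def split_info(heading_text, full_text)
  key = heading_text.gsub(/[:\s]/, '').downcase  # "Plot Outline:" becomes "plotoutline"
  body = full_text.gsub(/\n/, '')                # remove newlines
  body = $1 if body =~ /:(.+)/                   # keep everything after the first colon
  [key, body]
end

key, body = split_info("Genre:", "\nGenre:\n\nComedy / Drama / Romance more\n")
p key
p body
```

Because the match starts at the first colon and runs to the end of the string, values that contain colons of their own (the certification list, for example) survive intact. The trailing “more” link text is not stripped, which is why it turns up in the output of the finished class.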

All Done

So our finished class now looks like this:
[ruby]
class IMDB
  #uncomment the next line if you installed hpricot from the gem
  #require 'rubygems'
  require 'hpricot'
  require 'open-uri'

  def initialize(url)
    @url = url
    @hp = Hpricot(open(@url)) # on Ruby 3.0+, this must be URI.open(@url)
  end

  def title
    @title = @hp.at("meta[@name='title']")['content']
  end

  def rating
    rating_text = (@hp/"div.rating/b").inner_text
    if rating_text =~ /([\d\.]+)\/10/
      @rating = $1
    end
    @rating
  end

  def extrainfo
    if @extrainfo == nil #don't do it twice
      @extrainfo = {} #init our hash
      (@hp/"div.info").each do |inf| #go through each info div
        title = inf/"h5" #the type of infobox is stored in h5
        if title.any? #if we found one, we got data
          body = inf.inner_text
          body = body.gsub(/\n/,'') #remove newlines
          if body =~ /\:(.+)/ #extract body from our text
            body = $1
          end
          @extrainfo[title.inner_text.gsub(/[:\s]/,'').downcase] = body #store the body
        end
      end
    end
    @extrainfo
  end

  #clear the cached values so they are extracted again on the next call
  def reset
    @rating = nil
    @extrainfo = nil
  end
end
[/ruby]

You could certainly get much more advanced in your IMDB scraper class, but the above should get you started. Here’s a quick test script to see how our class works.

[ruby]
require 'IMDB'

imdb = IMDB.new('http://imdb.com/title/tt0335266/')

p imdb.rating
p imdb.title
p imdb.extrainfo
[/ruby]

So running this script gives us the output
[code]
“7.9”
“Lost in Translation (2003)”
{“company”=>”American Zoetropemore”, “quotes”=>”Bob:I don’t get that close to the glass until I’m on the floor.more”, “director”=>”Sofia Coppola”, “goofs”=>”Continuity: The pink flowers in Charlotte’s room disappear when she is listening to the "soul searching" tape, but reappear later in the movie.more”, “language”=>”English / Japanese / German / French”, “usercomments”=>”breath takingmore”, “moviemeter”=>” 35% since last weekwhy?”, “mpaa”=>” Rated R for some sexual content.”, “filminglocations”=>” Japanmore”, “certification”=>”Indonesia:Dewasa / Malaysia:18PL (re-rating) / Malaysia:(Banned) (uncut version) / Iceland:L / Canada:PG (British Columbia/Manitoba/Nova Scotia/Ontario) / Hungary:14 / Argentina:13 / Australia:PG / Brazil:14 / Canada:14A (Alberta) / Canada:G (Québec) / Chile:TE / Finland:K-11 / Germany:6 (bw) / Hong Kong:IIB / Italy:T / Netherlands:AL / New Zealand:PG / Norway:A / Peru:PT / Philippines:PG-13 / Portugal:M/12 / Singapore:PG (edited for re-rating) / Singapore:R(A) (original rating) / South Korea:15 / Spain:13 / Sweden:Btl / Switzerland:12 (canton of the Grisons) / UK:15 / USA:R / Singapore:M18 (DVD rating)”, “country”=>”USA / Japan”, “awards”=>” Won Oscar. Another 67 wins&50 nominationsmore”, “plotkeywords”=>”Lyrical / Reflection / Psychological Drama / Loneliness / Advertisingmore”, “plotoutline”=>” A movie star with a sense of emptiness, and a neglected newlywed meet up as strangers in Tokyo, Japan and form an unlikely bond. more”, “soundtrack”=>” She Gets Aroundmore”, “soundmix”=>”Dolby Digital “, “runtime”=>”102 min “, “writer(wga)”=>”Sofia Coppola (written by)”, “trivia”=>”In 1999, Bill Murray replaced his talent agency with an automated voice mailbox that can be reached with an 800 number he gives out sparingly. Sofia Coppola reportedly left hundreds of messages on Murray’s mailbox before he finally called back to discuss her offer to cast him as the star.more”, “color”=>”Color “, “tagline”=>”Everyone wants to be found. 
more”, “genre”=>”Comedy / Drama / Romance more”, “aspectratio”=>”1.85 : 1 more”, “movieconnections”=>” Featured in The 2004 IFP/West Independent Spirit Awards (2004) (TV)more”, “releasedate”=>” 3 October 2003 (USA) more view trailer”}

[/code]

More Information

Hpricot Showcase

Hpricot API Documentation

26 Comments

  • Woops, deleted the previous comment. Anyway

    James Smith Wrote

    I never understood why do people bother web scrapping IMDB when all the data is available for download.

    IMDB can and will change their HTML, so why dont you just download the whole DB and play with that??

    Please, download IMDB db instead of scrapping it.

    Why would I download a huge database and worry about keeping it up to date if I only wanted to scrape some films in real-time.

    Think about it….

  • Great work … This could be used to extract any content as well ..
    Thanks !!

  • Peter Writes via e-mail:

    I wonder if you have heard of scRUBYt!, a web extraction framework built
    on top of WWW::Mechanize and Hpricot. I tried to come up with a similar
    stuff to yours:

    [ruby]
    require 'scrubyt'

    imdb_data = Scrubyt::Extractor.define do
      fetch 'http://www.imdb.com'

      fill_textfield 'q', 'lost in translation'
      submit
      click_link 'Lost in Translation'

      stats do
        rating '7.9/10'
        votes '74,647 votes' do
          count /\d+,\d+/
        end
      end
    end

    imdb_data.to_xml.write($stdout, 1)
    [/ruby]

    as you can see there is not only scraping, but automatic navigation as
    well…
    The really great thing is that you can export this learning extractor to
    a production one, and then it will work on all movie pages.

    Check out http://scrubyt.org for further info.

  • […] Tim from We Heart Code has written an easy-to-follow, detailed tutorial about scraping data from the Internet Movie Database using Ruby and Hpricot. As I would have suspected, Peter Szinek, developer of ScRUBYt! presents an even simpler solution in the comments. […]

  • Just be careful before you get too carried away with this.

    Technically this violates the IMDB’s terms of service
    http://us.imdb.com/help/show_article?conditions

    They prohibit screen scraping without their express permission.

    Now you might get away with this, as long as they don’t notice. but…

    Some years ago I did an exercise in Java programming and decided to scrape the imdb in a program which played ‘six degrees of Kevin Bacon’. A day or so into the exercise, they had disallowed access from my ip address. Since I was doing this from work, this effectively blocked access from my entire company. I e-mailed them and explained the situation, promised to stop, and they removed the block.

    But they, or their code, do(es) seem to look at their server logs and they will notice if you do this too much.

  • for ruby web scraping you should check out http://software.pmade.com/scrapes/pages/show/Quick+Start , i help write scrapes, i have a few more projects at http://crookedhideout.com

  • This is great! I’m gonna play with it and have lots-o-fun. Thnx dude!

  • why don’t you just use omdb.org? It’s implemented in Rails, it’s free and it’s RESTful. I see that omdb does not have as much information as imdb (at least not yet), but that’s what user-generated content is all about, right?

  • Ben,
    You answered your own question. I’d rather not get into any specific implementation that I use, but suffice to say breadth of information is key. I’ll keep an eye on OMDB though.

  • Tim,
    even Wikipedia started with a single article :) We know that we have a long way to go, but we’re confident that we’re on a good start..

    Ben

  • […] Nowadays the information that can feed a system or website comes in many different formats and from many different sources, for example decision support systems and mashup applications. To demonstrate how simple this is to do using Ruby, I wrote the example below, based on the article “Scraping IMDB with Ruby and Hpricot”. […]

  • I found this useful to learn Hpricot and solve a similar web scraping problem. Thanks.

  • Hi, I also wrote an IMDB scraper using scRUBYt!, it is available here: http://wiki.scrubyt.org/index.php?title=IMDB_Movie_Scraping

    Hope it’s useful to someone :).

    -vjt

  • Seems that the scRUBYt! scripts no longer work – imdb.org servers return “403 – forbidden”. It’s possible to use :agent => :firefox (see scrubyt.org for details) but I haven’t gotten that to work yet either.

  • […] grep imdb if ruby scrubyt! (not hpricot? ) […]

  • If you ever want to hear a reader’s feedback :) , I rate this article for four from five. Detailed info, but I have to go to that damn google to find the missed pieces. Thank you, anyway!

  • Thanks for the lowdown. Does large scale scraping require any special server requirements? Or would a single standard web server suffice?

  • It rained here all day, so I decided to read some blog posts. Very nice write up here!

  • any way to specify the number of times to loop in the extrainfo? Like say i just wanted the first 4 or so.

  • I don’t agree with this particular article. Nevertheless, I had searched with Yahoo and I’ve found out you are correct and I was thinking in the incorrect way. Continue creating high quality material such as this.

  • Thanks much for the example code. I successfully applied Hpricot to a script in place of Beautiful Soup’s Ruby implementation gem and it is much faster now.

  • Have just given this a try and glad to report this still works 4 years later. Cheers.
