
Simple Ruby Scraping Walk-Through

I’ll get better at using CSS selectors, I promise, but in the meantime, I wanted to scrape a live site with Ruby as quickly and simply as possible. I wanted to make sure I understood the basic process of pulling data from a public website and parsing it into a useful format. For more on what scraping is, who does it, and why, check out my earlier post.

1. Set up
A site to scrape — I wrote a post to scrape on my personal site, just four h2 headlines, “Dogs, Cats, Truth, Rain.” I ended up deleting the post, but one is easy enough to make on your own blog.


Files — Next, I created two blank files using my text editor (Sublime), and saved them to the desktop as:

list_items.rb
scraper.rb

Folder — Then, I created a new folder, named it lib, and saved it to my desktop. lib/, the library directory, is where code usually lives in basic Ruby programs. I dragged and dropped the two blank files inside.

Nokogiri — Now, install Nokogiri, a gem that takes a site’s HTML and “treats it like nested nodes to extract the desired information,” according to the Flatiron School.

In your terminal (on a Mac, it’s in applications > utilities), enter:

gem install nokogiri

2. Code the list_items.rb file
Enter some basic code in list_items.rb. Now, instances of the List_items class can be created and stored in @@all = [] (which we’ll use in scraper.rb).

class List_items

  attr_accessor :name

  @@all = []

  def initialize
    @@all << self
  end

  def self.all
    @@all
  end

  def self.reset_all
    @@all.clear
  end

end
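If you want to see what that registry does before wiring up the scraper, here’s a standalone sketch (the class body is repeated so the snippet runs on its own):

```ruby
# Standalone sketch of the registry pattern from list_items.rb:
# every new instance adds itself to the class-level @@all array.
class List_items
  attr_accessor :name

  @@all = []

  def initialize
    @@all << self
  end

  def self.all
    @@all
  end

  def self.reset_all
    @@all.clear
  end
end

item = List_items.new
item.name = "Dogs"
puts List_items.all.size        # 1
puts List_items.all.first.name  # Dogs
List_items.reset_all
puts List_items.all.size        # 0
```

reset_all is handy if you re-run the scraper in the same session, so old items don’t pile up in @@all.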

3. Require
At the top of scraper.rb, type:

require 'nokogiri'
require 'open-uri'
require_relative './list_items.rb'

OpenURI is a module that takes one argument, the URL of the website you want to scrape, and returns the site’s HTML. (In older Ruby you could call plain open; in Ruby 3 and later, use URI.open.) Then, Nokogiri parses it.

When we eventually run the program, we’ll run scraper.rb, which tells Ruby to also load (require) list_items.rb. So scraper.rb is kind of the brain of the operation.

4. Collect the HTML
Write a single line to fetch the HTML with OpenURI’s URI.open, and parse it into a Nokogiri document:

Nokogiri::HTML(URI.open("http://saraharvey.xyz/2019/04/12/testing-123/"))

5. Functionality
Put that line inside a Ruby method, and the method inside a Scraper class (below). We’ll eventually be calling a new instance of Scraper and acting on it with four methods.

require 'nokogiri'
require 'open-uri'

require_relative './list_items.rb'

class Scraper

  def get_page
    # URI.open comes from open-uri; plain open() no longer fetches URLs in Ruby 3+
    Nokogiri::HTML(URI.open("http://saraharvey.xyz/2019/04/12/testing-123/"))
  end

end

The post I created to scrape only has a couple of layers of CSS to drill through: “.entry-content” and “h2”. (The period before “entry-content” indicates that it’s a CSS class.)

I found the CSS using the browser’s inspect tool (Google’s Chrome DevTools tutorial walks through it).
Here’s the finished code:

require 'nokogiri'
require 'open-uri'

require_relative './list_items.rb'

class Scraper

  def get_page
    # URI.open comes from open-uri; plain open() no longer fetches URLs in Ruby 3+
    Nokogiri::HTML(URI.open("http://saraharvey.xyz/2019/04/12/testing-123/"))
  end

  def get_items
    # Grab each h2 inside the post body, so every headline is its own node
    self.get_page.css(".entry-content h2")
  end

  def make_items
    self.get_items.each do |headline|
      item = List_items.new
      item.name = headline.text
    end
  end

  def print_items
    self.make_items
    List_items.all.each do |item|
      if item.name
        puts "Is this working: #{item.name}"
      end
    end
  end
end

Scraper.new.print_items

To run the program, navigate from your home directory into the lib folder, then run the script:

cd ~
cd Desktop
cd lib
ruby scraper.rb

In trying to achieve the simplest scrape possible, I probably left something out. Connect with me on Twitter and let a girl know. @SaraHarvy