Simple Ruby Scraping Walk-Through

I’ll get better at using CSS selectors, I promise, but in the meantime, I wanted to scrape a live site with Ruby as quickly and simply as possible. I wanted to make sure I understood the basic process of pulling data from a public website, and parsing it into a useful format. For more information on what scraping is, who does it and why, check out my post, here.

1. Set up
A site to scrape — I wrote a post on my personal site with just four h2 headlines: “Dogs, Cats, Truth, Rain.” I ended up deleting the post, but one is easy enough to make on your own blog.


Files — Next, I created two blank files using my text editor (Sublime), and saved them to the desktop as:

list_items.rb
scraper.rb

Folder — Then, I created a new folder, named it lib, and also saved it to my desktop. Lib/, the library directory, is where code usually lives in basic Ruby programs. I dragged and dropped the two blank files I had made inside.

Nokogiri — Now, install Nokogiri, a gem that collects a site’s HTML into a huge string and “treats it like nested nodes to extract the desired information,” according to the Flatiron School.

In your terminal (on a Mac, it’s in Applications > Utilities), enter:

gem install nokogiri
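
To confirm the gem is available, you can ask Ruby for its version (an optional check):

ruby -e "require 'nokogiri'; puts Nokogiri::VERSION"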

2. Code the list_items.rb file
Enter some basic code in list_items.rb. Now, instances of the ListItems class can be initialized and stored in the class variable @@all = [] (which we’ll use in scraper.rb).

class ListItems

  # Each list item holds the text of one headline
  attr_accessor :name

  # Class variable that collects every instance we create
  @@all = []

  def initialize
    @@all << self
  end

  # Read access to all stored instances
  def self.all
    @@all
  end

  # Empty the collection (handy when re-running)
  def self.reset_all
    @@all.clear
  end

end
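
Here’s a quick sanity check of how the class behaves, in a hypothetical irb session:

item = ListItems.new
item.name = "Dogs"
ListItems.all.count       # => 1
ListItems.all.first.name  # => "Dogs"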

3. Require
At the top of scraper.rb, type:

require 'nokogiri'
require 'open-uri'
require_relative './list_items.rb'

Open-URI is a module that provides an open method (called as URI.open in current versions of Ruby). Pass it one argument, the URL of the website you want to scrape, and it returns the site’s HTML. Then, Nokogiri helps parse it.

When we eventually run the program, we’re going to call scraper.rb, which will tell Ruby to also load (require) list_items.rb. So, scraper.rb here is kind of the brain of the operation.

4. Collect the HTML
Write a single line to grab the HTML with Open-URI, and convert it into a parsed document with Nokogiri:

Nokogiri::HTML(URI.open("http://saraharvey.xyz/2019/04/12/testing-123/"))

5. Functionality
Put that line inside a Ruby method, and the method inside a Scraper class (below). We’ll eventually be calling a new instance of Scraper and acting on it with four methods.

require 'nokogiri'
require 'open-uri'

require_relative './list_items.rb'

class Scraper

  def get_page
    # Fetch the post and parse its HTML into a Nokogiri document
    Nokogiri::HTML(URI.open("http://saraharvey.xyz/2019/04/12/testing-123/"))
  end

end

The post I created to scrape only has a couple of layers of CSS, “.entry-content” and “h2”, which combine into the selector “.entry-content h2” below. (The period before “entry” indicates that it’s a CSS class.)

I found the CSS using the inspect tool (click here for the Google tutorial).
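
To double-check a selector before wiring everything up, you can try it in irb. This snippet assumes the four headlines from step 1; page stands for the document returned by get_page:

page.css(".entry-content h2").map(&:text)
# => ["Dogs", "Cats", "Truth", "Rain"]
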
Here’s the finished code:

require 'nokogiri'
require 'open-uri'

require_relative './list_items.rb'

class Scraper

  def get_page
    # Fetch the post and parse its HTML into a Nokogiri document
    Nokogiri::HTML(URI.open("http://saraharvey.xyz/2019/04/12/testing-123/"))
  end

  def get_items
    # Select every h2 headline inside the post body
    self.get_page.css(".entry-content h2")
  end

  def make_items
    # Create one ListItems instance per headline and store its text
    self.get_items.each do |heading|
      item = ListItems.new
      item.name = heading.text
    end
  end

  def print_items
    self.make_items
    ListItems.all.each do |item|
      if item.name
        puts "Is this working: #{item.name}"
      end
    end
  end
end

Scraper.new.print_items

To run the program, navigate to the lib folder on your desktop, then run the file:

cd ~
cd Desktop
cd lib
ruby scraper.rb
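
With the four headlines from step 1, the output should look like:

Is this working: Dogs
Is this working: Cats
Is this working: Truth
Is this working: Rain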

In trying to achieve the simplest scrape possible, I probably left something out. Connect with me on Twitter and let a girl know. @SaraHarvy

MoCCA Fest Roundup!

The MoCCA Fest Arts Festival wrapped up its 2019 edition yesterday in New York City. Here are a few things I want MoCCA to do again next year, and what I want it to add (or take away — nylon giants are especially scary for introverts).

10/10 awesome time. Would do again.


Getting started: coding Ruby in the terminal

A New York UPS center/metaphor

I typed my first HTML in a Codecademy browser window before moving on to build HTML, CSS and PHP sites. Now, I’m enrolled in the Flatiron School boot camp, coding in an online learning environment again, and realized I’ve never actually coded Ruby (the first part of the curriculum) in the wild.

Here’s my quick take on setting up to code in the terminal on a Mac, in case anyone else’s Ruby journey thus far has been mostly online.

Install Rails, Homebrew, git, RVM, gems, SQLite and Node on Mac OS

I ended up using a Flatiron tutorial to install these programs, but first tried the first guide I found online and got lots of errors. So, I uninstalled and reinstalled, and things went a lot more smoothly the second time around.

I didn’t have the Learn IDE installed, and I didn’t set up the Learn gem because I’m still planning to finish my coursework online.

Adding an SSH key to GitHub

This is one of the steps in the above tutorial. For specific instructions from GitHub on using SSH keys, click here. I was afraid this would somehow disrupt Flatiron’s Learn.co/GitHub system, but it didn’t. I just added my new key alongside all the other keys.

Commands for coding in the terminal

You code in the terminal (Applications > Utilities > Terminal), where you just installed Rails and the other systems, using commands. As a Ruby newbie, I’m not ashamed to say that this simple, obvious fact was a revelation.

Here’s a basic tutorial by Railsbridge on using commands in the terminal.
Another tutorial by Thoughtco on using the command line.
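
For quick reference, here are a few standard commands that come up constantly (these aren’t specific to either tutorial):

pwd       # print the folder you’re currently in
ls        # list the files in it
cd lib    # move into a folder
cd ..     # move back up one level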

Coding Ruby locally

Single file
I opened my text editor, Sublime, wrote some Ruby code and saved the file as my_program.rb to my desktop. I navigated to it in the terminal (by typing: cd Desktop) and then ran it: ruby my_program.rb. The code ran in the terminal.
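
Any Ruby code works for this test; mine was something like this one-liner (a placeholder, not the original file):

puts "Hello from my_program.rb!"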

Running a program with multiple files
I wrote a sample Ruby CLI with multiple files, including a bin and a lib folder. To run it, I navigated to the program folder on the desktop, all the way to greet.rb in the bin folder (which requires the lib file):

cd Desktop (enter)
cd greetings_earthing (enter)
cd bin (enter)
ruby greet.rb
=> Hello, earthling! What is your name?
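
For anyone recreating this, here’s a minimal sketch of how the two files might fit together (greet.rb, bin and lib come from my project; the lib file’s name and contents here are just an illustration):

# bin/greet.rb
require_relative '../lib/greeting'

puts Greeting.prompt
name = gets.chomp
puts "Nice to meet you, #{name}!"

# lib/greeting.rb (hypothetical)
class Greeting
  def self.prompt
    "Hello, earthling! What is your name?"
  end
end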

Using GitHub from the terminal

I created a new repository on GitHub and set it to private (since I’m just practicing). Then, I actually selected upload files and dragged and dropped my program in from the desktop. So, I’m not exactly interacting with GitHub from the terminal yet, but there will be opportunities; the basic commands are sketched below.

Other ways to upload to GitHub
Cheat sheet of Git commands
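
When I do make the jump, the standard workflow from the terminal looks roughly like this (the remote URL is a placeholder for your own repository):

cd ~/Desktop/greetings_earthing
git init
git add .
git commit -m "First commit"
git remote add origin https://github.com/YOUR-USERNAME/greetings_earthing.git
git push -u origin master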

If I missed anything in this simple guide, connect with me on Twitter and let me know! @SaraHarvy

10 Examples of Web Scraping in Use

Live web projects as cool as scraped ice

I’m studying Ruby through the Flatiron School, so I researched real-life examples of scraping, a common Ruby task. I understand web scraping is the process of finding and pulling information from public websites, but who does it and why? Here are six common uses with examples, in capsule form, though not all of them were scraped using Ruby.

Big business

Walmart vs. Amazon
E-commerce sites regularly use software to send waves of bots and web crawlers across multitudes of sites at a time. Bots search for information such as product reviews, contact information to use for marketing, or prices for comparison websites, and save copies in a spreadsheet or database.

It’s a web search that retrieves and saves enormous amounts of information. As Columbia University puts it, “Google scraped (emphasis mine) the web to catalogue all of the information on the internet and make it accessible.” A Google search, in essence, is a scrape.

Walmart was checking the prices on Amazon.com several million times a day in 2016 before Amazon blocked its bots, according to Reuters. In e-commerce, a customer toggling between two competitor sites can choose the product that’s 50 cents cheaper with a click. So, the stakes are high.

Startups

Proven
Skincare startup Proven scrapes millions of customer reviews, information on thousands of beauty products, and scientific articles, to personalize health and beauty products. The brand both consults with each customer and creates a product specifically for them, according to Forbes.

Co-founder Amy Yuan searched thousands of ingredients, products, consumer reviews and scientific journal articles to find a product that worked for her own skin, according to Forbes. She and co-founder Ming Zhao extended the project with machine learning and artificial intelligence algorithms to understand the correlations between people’s skin and the ingredients that work for each person.

Naked Apartments
Type in what you’re looking for in a New York apartment, and Naked Apartments retrieves the real estate options, scraped from the web, that fill the bill. It also allows brokers to contact an apartment seeker with permission, and ranks the best sellers, a distinction they can’t buy. Zillow bought the Webby Award-nominated company in 2016.

Naked Apartments launched its in-demand listings service in 2010 and turned a profit in 2011. Travel site Trivago.com scrapes to offer comparison prices, and Indeed.com offers job listings and other directories. Businesses of almost any size also scrape for competitor prices, leads, and search engine results to track SEO and marketing.

Community and social justice organizations

MCSafetyFeed.org
MCSafetyFeed scrapes public information in cooperation with Monroe County, New York, to maximize open data in government. The site provides a history of all the 911 calls that come in to the Monroe County dispatch center and additional data about each call that isn’t immediately available on the dispatch website or Monroe’s RSS feed. Read more about the tech aspects, here.

Community projects may provide scraped digital data, or information from reports requested from governing bodies, or a combination. The goal is usually just to get the information out there for civic transparency and accountability.

Journalism

Pfizer’s disclosures of payments to doctors
Journalism is obviously another powerful way to get information to the people. ProPublica offers a guide for journalists who may want to scrape public records themselves for reporting. The guide leads aspiring data journalists through the process of scraping pharmaceutical giant Pfizer’s record of payments made to doctors.

Another example of journalistic scraping is James Tozer’s Economist article on the countries likely to have liberal abortion laws. Read more about his entrance into data journalism, here.

Public health

Biostatistics
“What if you had an idea for an ecological study, but the data you needed wasn’t available to you?” – a pitch for Columbia University’s Mailman School of Public Health’s courses on web scraping and population health methods.

In public health, big data helps chart disease occurrence based on time, location and demographics, according to the SUNY Downstate Medical Center Department of Epidemiology and Biostatistics. Students use data to identify the relative contributions of biological, behavioral, socio-economic, and environmental risk factors to disease incidence.

Google Flu Trends
It feels more accurate to describe the complex Google Flu Trends as “a big data tool for epidemiologists” than as simply “scraping.” (Browse the Google project’s summary on Wikipedia.)

The idea was that when people have the flu, they search for flu-related information on Google, indicating the time and location of a potential outbreak. The goal was to predict outbreaks weeks earlier than the Centers for Disease Control and Prevention (CDC). Read more about the outcome, and how the process might be refined in the future, on Wired.

For more on the topic of tech and public health, epidemiology in particular, browse the 2015 article “Epidemiology in the Era of Big Data,” from the National Center for Biotechnology Information, a branch of the National Institutes of Health.

Academics and student work

Perception of French hip hop
Developer Alexandre Robin helped a friend write a student paper on the perception of French hip hop through the decades by scraping 7,000 newspaper articles using Node. Read more about his project, here.

Maestro
Flatiron alumni Clinton Nguyen and Jason Decker used web scraping to refine browser searches. As the project puts it: “Sign up to post your own educational trails and follow them, or simply search through on the main page for user-submitted trails that are already out there.”

If there’s anything that can be improved in this article, connect with me through Twitter! @SaraHarvy

Drawing Code

Cat photo: Buenosia Carol via Pexels

Does everyone doodle when they code? Maybe it’s because Ruby is object-oriented, or because I’m a visual learner, but when I started to study Ruby and encountered something hard to understand, my first reaction was to draw.

I “drew” the problem, like it was a landscape made of logic. Like I could just see a solution if I put it on paper. When my brain’s rainbow wheel spun for longer than 10 minutes, scribbling shapes and words helped.

It was more rubber duck debugging than white board coding. I patterned quick descriptions of functions, connected with arrows and corralled in ovals. I jotted down more questions.

It’d be cool to say I was simplifying concepts until I abstracted words away in favor of shapes. Like my inky mess was Ruby Hitsuzendō, coding Zen calligraphy. But really I think it helped me see the problem differently, and got me “out of my head” enough to stop freaking out completely.

Also, drawing word definitions, abstract thoughts and ideas can increase retention, according to a January 2019 article in the New York Times. It may be due to an “additional form of processing,” said a co-author of the study, Dr. Jeffrey Wammes, a postdoctoral fellow in the department of psychology at Yale.

I’d also have five or six pages open at once: my Learn.co lesson, previous lessons, ruby-doc.org, Stack Exchange, whatever programming blogs offered vital insights. So actual paper was a “non-screen” where I could think through the code, and I never lost it among the screens.

Here’s a link, though, to the first few lessons of OO Ruby on Learn.co, which I simplified in my spiral notebook and then recopied using Illustrator so anyone else could understand.

Scenes from a Brooklyn Sketchbook

I love productivity tricks, but I could never get into bullet journaling. I think it’s because I don’t like to draw lots of little boxes. However, I read a great article on Medium by Michael Korzonek that inspired me to try again.

I adapted his system, a topic for another post, and most consistently write down “what I want to remember from the day” (in a little box). It’s mostly New York City-style quirks.

This week, I illustrated what I wanted to remember. Here you go, a comic from Flatbush, Brooklyn, where I live with my wife and our senior Doberman, and toil away at a coding boot camp when I should be drawing.

I love that Bob’s Burgers doesn’t dog Bob for working in a restaurant and liking it. We just watched the episode where Bob burns out, and makes out with the mustardy “I hate to see you brie-ve, but I love to see you go” burger. His family makes him take a day off, but he just finds another little cafe to work in. “I-I-I might be dying,” Bob says. “But this is worth it.” March 13

“Nothing makes me happier/than serving food to some guy.
It may seem so boring, it might get you snoring/but to me it’s the Fourth of July.” — Bob

A neighbor, scented faintly of happy hour, smiled fondly at our dog.
“How old is she?” he asked.
“Twelve,” I said. “An old girl.”
He got off the elevator and told her, “love you.” Like when you’re about to hang up the phone, but accidentally tell your boss you love them, except this was the neighbor’s Doberman Pinscher. March 14

The next day, I asked another neighbor where she bought the bouquet in her grocery bag. There’s a place off the Q Train 7th Avenue Stop, she said, “nothing says spring like tulips and irises.” Which was freaking adorable. March 15

Also, I’ve waited at a subway stop for 10 years, and never noticed one of the tiles was missing a 2.

We love Rockaway Brewing in Long Island City, Queens, but had never visited their Rockaway Beach location. So, we took the bus, drank beer, ate banh mi, played Scrabble and watched people.

Five folks in vintage military getup (I couldn’t identify it online) drank outside. “I have a sword on me right the #@$% now!” a cassocked bro declared to his friends, like, as a matter of conversation. I wish I had taken a photo when they sent him inside alone for five more pints.

Otherwise, the “end of the A Train” in March is a scrabbly landscape of warehouses and highways. It’s beachfront, but still the only place you’d want to wear flip flops is the beach. March 16

I was holding out for one more “C” to make “ceviche” in Scrabble, which the little bag never produced, and so I not only lost, but got trounced. My wife is killer at Scrabble, and everything else.

I did get “fellate” before the game ended. By definition, she let me know, it’s an act specifically performed on a man. “Happy anniversary!” she joked. Happy anniversary to you, too, lady, who I want to remember most from all my best days. March 17