I’m studying Ruby through the Flatiron School, so I researched real-life examples of scraping, a common Ruby task. I understand web scraping is the process of finding and pulling information from public websites, but who does it and why? Here are five common uses in capsule form, with examples, though not all of the examples were scraped using Ruby.
Walmart vs. Amazon
E-commerce sites regularly use software to send waves of bots and web crawlers across multitudes of sites at a time. Bots search for information such as product reviews, contact information to use for marketing, or prices for comparison websites, and save copies in a spreadsheet or database.
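As a rough illustration of what such a bot does once it has fetched a page, here is a minimal Ruby sketch. The HTML fragment, the class names (`name`, `price`), and the `extract_products` helper are all made up for this example, and the regular expression is a simplification standing in for a real HTML parser such as Nokogiri, which an actual scraper would more likely use.

```ruby
# Hypothetical page fragment; a real bot would download this over HTTP.
SAMPLE_HTML = <<~HTML
  <div class="product"><span class="name">Desk Lamp</span><span class="price">$24.99</span></div>
  <div class="product"><span class="name">Notebook</span><span class="price">$3.50</span></div>
HTML

# Pulls each product's name and price out of the markup and returns
# one hash per product, ready to be saved to a spreadsheet or database.
def extract_products(html)
  pattern = %r{<span class="name">(.*?)</span><span class="price">\$([\d.]+)</span>}
  html.scan(pattern).map do |name, price|
    { name: name, price: price.to_f }
  end
end

# Print each record as a comma-separated line, the way a spreadsheet row might look.
extract_products(SAMPLE_HTML).each do |row|
  puts "#{row[:name]},#{row[:price]}"
end
```

Each hash is one "copy" of a product record; writing those rows to a file or a database table is the saving step the paragraph above describes.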
Scraping is, in effect, a web search that retrieves and saves enormous amounts of information. As Columbia University puts it, “Google scraped (emphasis mine) the web to catalogue all of the information on the internet and make it accessible.” Google’s search index, in essence, is the product of a scrape.
Walmart was checking the prices on Amazon.com several million times a day in 2016 before Amazon blocked its bots, according to Reuters. In e-commerce, a customer toggling between two competitor sites can choose the product that is 50 cents cheaper with a click, so the stakes are high.
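The comparison a price bot feeds can be sketched in a few lines of Ruby. The store labels, products, and prices below are invented for illustration, and `cheaper_store` is a hypothetical helper, not anything Walmart or Amazon is known to run.

```ruby
# Invented prices, as two scrapers might have saved them.
STORE_A_PRICES = { "Desk Lamp" => 24.99, "Notebook" => 3.50 }.freeze
STORE_B_PRICES = { "Desk Lamp" => 24.49, "Notebook" => 3.75 }.freeze

# Returns :store_a, :store_b, or :tie for a product both stores carry,
# and nil when either store has no price for it.
def cheaper_store(product, prices_a, prices_b)
  a = prices_a[product]
  b = prices_b[product]
  return nil if a.nil? || b.nil?

  case a <=> b
  when -1 then :store_a
  when 1 then :store_b
  else :tie
  end
end

STORE_A_PRICES.each_key do |product|
  puts "#{product}: #{cheaper_store(product, STORE_A_PRICES, STORE_B_PRICES)}"
end
```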
Skincare startup Proven scrapes millions of customer reviews, information on thousands of beauty products, and scientific articles to personalize health and beauty products. The brand consults with each customer and creates a product specifically for that person, according to Forbes.
Co-founder Amy Yuan searched thousands of ingredients, products, consumer reviews and scientific journal articles to find a product that worked for her own skin, according to Forbes. She and co-founder Ming Zhao extended the project with machine learning and artificial intelligence algorithms to understand the correlations between people’s skin and the ingredients that work for each person.
Type in what you’re looking for in a New York apartment, and Naked Apartments retrieves the real estate options, scraped from the web, that fill the bill. It also allows brokers to contact an apartment seeker with permission, and ranks the best sellers, a distinction they can’t buy. Zillow bought the Webby Award-nominated company in 2016.
Naked Apartments provided its in-demand listings service in 2010 and turned a profit in 2011. Travel site Trivago.com scrapes to offer comparison prices, and Indeed.com does so to offer job listings, as do other directories. Businesses of almost any size also scrape for competitor prices, leads, and search engine results to track SEO and marketing.
Community and social justice organizations
MCSafetyFeed scrapes public information in cooperation with Monroe County, New York, to maximize open data in government. The site provides a history of all the 911 calls that come in to the Monroe County dispatch center, plus additional data about each call that isn’t immediately available on the dispatch website or Monroe’s RSS feed. Read more about the tech aspects here.
Community projects may provide scraped digital data, or information from reports requested from governing bodies, or a combination. The goal is usually just to get the information out there for civic transparency and accountability.
Pfizer’s disclosures of payments to doctors
Journalism is obviously another powerful way to get information to the people. ProPublica offers a guide for journalists who may want to scrape public records themselves for reporting. The guide leads aspiring data journalists through the process of scraping pharmaceutical giant Pfizer’s record of payments made to doctors.
Another example of journalistic scraping is James Tozer’s Economist article on the countries likely to have liberal abortion laws. Read more about his entrance into data journalism here.
“What if you had an idea for an ecological study, but the data you needed wasn’t available to you?” – a pitch for Columbia University’s Mailman School of Public Health’s courses on web scraping and population health methods.
In public health, big data helps chart disease occurrence based on time, location and demographics, according to the SUNY Downstate Medical Center Department of Epidemiology and Biostatistics. Students use data to identify the relative contributions of biological, behavioral, socio-economic, and environmental risk factors to disease incidence.
Google Flu Trends
It feels more accurate to describe the complex Google Flu Trends as “a big data tool for epidemiologists” than as simply “scraping.” (Browse the Google project’s summary on Wikipedia.)
The idea was that when people have the flu, they search for flu-related information on Google, indicating the time and location of a potential outbreak. The goal was to predict outbreaks weeks earlier than the Centers for Disease Control and Prevention (CDC). Read more about the outcome, and how the process might be refined in the future, on Wired.
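The underlying idea can be sketched with a toy example in Ruby. The weekly counts, the region names, and the "more than twice the earlier average" threshold are all assumptions made up for illustration; Google's actual model was far more complex and is not reproduced here.

```ruby
# Invented weekly counts of flu-related searches per region.
WEEKLY_SEARCHES = {
  "Region A" => [120, 130, 125, 310],
  "Region B" => [80, 85, 90, 88]
}.freeze

# Flags a region when its latest week is more than `factor` times
# the average of the weeks before it (an arbitrary threshold).
def possible_outbreak?(counts, factor: 2.0)
  *earlier, latest = counts
  return false if earlier.empty?

  baseline = earlier.sum.to_f / earlier.size
  latest > baseline * factor
end

FLAGGED = WEEKLY_SEARCHES.select { |_region, counts| possible_outbreak?(counts) }.keys
puts FLAGGED.inspect
```

A jump in searches marks both a time (the latest week) and a place (the region), which is the signal the project hoped would arrive ahead of official reports.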
For more on the topic of tech and public health, epidemiology in particular, browse the 2015 article “Epidemiology in the Era of Big Data,” from the National Center for Biotechnology Information, part of the National Institutes of Health.
Academics and student work
Perception of French hip hop
Developer Alexandre Robin helped a friend write a student paper on the perception of French hip hop through the decades by scraping 7,000 newspaper articles using Node. Read more about his project here.
Flatiron alumni Clinton Nguyen and Jason Decker used web scraping to refine browser searches. “Sign up to post your own educational trails and follow them, or simply search through on the main page for user-submitted trails that are already out there.”