Web Scraping with Ruby and Kimurai gem

Saraswati Saud
4 min read · May 6, 2020

Web scraping is used to extract content and data from websites. This data is used by digital businesses, market researchers, and many others to make lucrative business decisions.

Web scraping with Ruby using the Kimurai gem is pretty simple. Here, I want to explain how we can scrape app data from the Slack apps directory with Ruby and the Kimurai gem. I am using this category URL as an example: https://slack.com/apps/category/AtHC5Q7RUJ-daily-tools. If you check the link, you will see a page something like this:

Here, when we click on each app under Daily Tools (as highlighted in the red section), it takes us to the app details page. From every app details page, we will scrape the Name, Description, and Categories, as shown in the picture below.

To keep this tutorial simple, I am using the Daily Tools section of the Slack apps directory. Other sections require pagination, which I am leaving as an exercise.

Before starting, I assume that you have already installed Ruby on your machine. I am using Ruby 2.6.1.
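You can check your installed version from the terminal:

$> ruby -v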

Let's start:

Step 1: Create a file named slack_apps_scraper.rb in your editor of choice and add the following lines of code:

# slack_apps_scraper.rb
require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'kimurai'
end

require 'kimurai'

class SlackAppsScraper < Kimurai::Base
  @name = 'slack_spider'
  @engine = :mechanize
  @start_urls = ['https://slack.com/apps/category/AtHC5Q7RUJ-daily-tools']

  def parse(response, url:, data: {})
    # scrape body here
  end
end

SlackAppsScraper.crawl!

This is pretty much all we need for the general skeleton of the scraper. Now save the file at your preferred location on your computer.

Remember that we left the parse method body empty for now. This method will contain all the logic required to scrape data from the URL. In the next step, let's fill it in.

Step 2: Add the following lines of code in the parse method:

def parse(response, url:, data: {})
  @base_uri = 'https://slack.com'
  response = browser.current_response
  response.css('ul.media_list li a').each do |app|
    href = app.attr('href')
    browser.visit(@base_uri + href)
    app_response = browser.current_response
    # TODO: code to scrape data
  end
end

If you go to our scraping URL and inspect the page, you will see how the data is structured inside the HTML tags. Example:
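Here is a rough sketch of the markup involved. The class names and IDs are the ones we will target in the next step; everything else is illustrative, and the actual page structure may have changed since this was written:

<!-- Category page: one list item per app -->
<ul class="media_list">
  <li>
    <a href="/apps/XXXXXXXXX-some-app">...app icon and name...</a>
  </li>
  <!-- ...one li per app... -->
</ul>

<!-- App details page -->
<h2 class="p-app_info_title">App Name</h2>
<div id="panel_more_info">
  <div class="p-app_description">A description of the app.</div>
</div>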

Step 3: Now let's scrape the data. Replace the TODO section with the following code:

item = {}
item[:name] = app_response.css('h2.p-app_info_title')&.text&.squish
item[:description] = app_response.css('#panel_more_info div.p-app_description')&.text&.squish

categories_element = app_response.css('p.no_margin:contains("Categories:")').first.next_sibling.parent
item[:categories] = categories_element.css('a')&.map(&:text)&.join(', ')

save_to "scraped_slack_apps.json", item, format: :pretty_json, position: false

This will scrape the name, description, and categories of each app, as shown in the figure below, and save the extracted records in pretty JSON format. Note that we query app_response, the details page we just visited, not the original category-page response. The squish helper, which collapses extra whitespace, comes from ActiveSupport, which Kimurai loads for us.

Step 4: Putting it all together, the whole code looks like this:
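# slack_apps_scraper.rb
require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'kimurai'
end

require 'kimurai'

class SlackAppsScraper < Kimurai::Base
  @name = 'slack_spider'
  @engine = :mechanize
  @start_urls = ['https://slack.com/apps/category/AtHC5Q7RUJ-daily-tools']

  def parse(response, url:, data: {})
    @base_uri = 'https://slack.com'
    response = browser.current_response

    # Visit each app's details page from the category listing
    response.css('ul.media_list li a').each do |app|
      href = app.attr('href')
      browser.visit(@base_uri + href)
      app_response = browser.current_response

      item = {}
      item[:name] = app_response.css('h2.p-app_info_title')&.text&.squish
      item[:description] = app_response.css('#panel_more_info div.p-app_description')&.text&.squish

      categories_element = app_response.css('p.no_margin:contains("Categories:")').first.next_sibling.parent
      item[:categories] = categories_element.css('a')&.map(&:text)&.join(', ')

      save_to "scraped_slack_apps.json", item, format: :pretty_json, position: false
    end
  end
end

SlackAppsScraper.crawl!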

Step 5: Finally, let's test the application.

Run the following command in the terminal.

$> ruby slack_apps_scraper.rb

This will start scraping data. It grabs the details of every Slack app in the Daily Tools section. Your terminal will look something like this:

And lastly, it saves the data in a file named scraped_slack_apps.json, created in the same directory as the slack_apps_scraper.rb file.

If you open the scraped JSON file, you will see results like this:
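Each record has a shape along these lines (the values below are illustrative placeholders, not real scraped output):

[
  {
    "name": "Example App",
    "description": "A one-line summary of what the app does.",
    "categories": "Daily Tools, Productivity"
  }
]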

This is the end of the tutorial.

Hope you liked it!

You can further implement many things to make the application more robust. For example: implement multi-page scraping, write tests and refactor, move the scraping task to background jobs (e.g., Active Job and Sidekiq), and many more.
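For instance, multi-page scraping could be sketched with Kimurai's request_to helper. Note that the a.next_page selector below is hypothetical; you would need to adjust it to the real pagination markup of the section you scrape:

def parse(response, url:, data: {})
  # ... scrape the apps on the current page, as in the steps above ...

  # Hypothetical selector: adjust to the actual "next page" link markup.
  next_page = response.css('a.next_page').first
  request_to(:parse, url: absolute_url(next_page[:href], base: url)) if next_page
end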
