Bypass reCaptcha for web scraping in Ruby

Saraswati Saud
4 min read · May 3, 2021


CAPTCHAs are very common these days, especially when buying products online or logging into a website. Websites use them to confirm that they are dealing with a human wherever human interaction is essential for security.

Handling CAPTCHAs matters in web scraping because an unexpected challenge can easily stop a crawler in the middle of data extraction. For developers who write their own scrapers, there are many CAPTCHA solving services that can be integrated into a scraping system. 2Captcha is one of the CAPTCHA solving services that use real humans to solve the problem for you.

Working with a CAPTCHA is fairly simple once a solving service is in place. In this article, I explain how to scrape data from a website after solving its CAPTCHA. I use reCaptcha v2 and the 2Captcha demo page, 2captcha.com/demo/recaptcha-v2, as the example. If you open that link, you will find a page like this:

We need a 2Captcha API key for this process, so the first thing to do is register an account on 2captcha.com if you haven’t already. Once you have registered and logged in to the 2Captcha website, you will see your API key like this:

You also need to add funds to your 2captcha.com account to use the service. Please refer to the 2Captcha website for details of its charges.

Prerequisites

Before starting, I assume that you have already installed Ruby on your computer.

  1. I am using Ruby 2.7.2, but it also works with other versions.
  2. A 2Captcha API key, with funds available in your 2Captcha account.

Let’s begin:

Step 1: Create a file called break_captcha.rb in your editor of choice and add the following lines of code:

This is the general skeleton of the scraper. Save the file wherever you prefer on your computer.

Step 2: Add your 2Captcha API key as a constant above the URL in the code:

class BreakCaptcha < Kimurai::Base
  API_KEY = 'XXXXXXXXXXXXXXXXXXXX'
  URL = 'https://2captcha.com/demo/recaptcha-v2'
  @name = 'break_captcha_spider'
  @engine = :selenium_chrome
  @start_urls = [URL]
  .
  .
end
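Hard-coding the key is fine for a demo. If you would rather keep it out of the source file, one option is to read it from the environment. The following is only a minimal sketch: the helper name fetch_api_key and the variable name TWO_CAPTCHA_API_KEY are my own assumptions, not something 2Captcha or the two_captcha gem defines.

```ruby
# Hypothetical helper: read the 2Captcha API key from the environment
# instead of hard-coding it. TWO_CAPTCHA_API_KEY is an assumed variable name.
def fetch_api_key(env = ENV)
  key = env['TWO_CAPTCHA_API_KEY'].to_s.strip
  # Fail early with a clear message rather than sending an empty key.
  raise ArgumentError, 'TWO_CAPTCHA_API_KEY is not set' if key.empty?

  key
end
```

With this in place, the constant could be written as API_KEY = fetch_api_key, and the spider would refuse to start when the variable is missing.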

Step 3: Next, add the two_captcha gem to the code above (install it with gem install two_captcha and require it at the top of the file).

Step 4: The next thing we need to do is find the data-sitekey in the page’s HTML. It identifies the website on which the captcha appears. The process for finding the data-sitekey is shown below:

First, we need to find the CSS selector or XPath of the captcha element (i.e. the dotted line in the above figure). Then, from that element, we can read the data-sitekey for that URL.

def parse(_response, _url, data: {})
  recaptcha = browser.find(:xpath, '//div[@id="recaptcha"]')
  google_key = recaptcha['data-sitekey']
  ...
end
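To see what this lookup does without opening a browser, here is a small standalone sketch that pulls the data-sitekey out of a piece of HTML using only the Ruby standard library. The markup and the sitekey value are made up for illustration, and extract_sitekey is a hypothetical helper, not part of Kimurai or two_captcha. A regular expression is enough for this illustration, but it is not a general HTML parser.

```ruby
# Sample markup standing in for the demo page; the sitekey value is made up.
SAMPLE_HTML = <<~HTML
  <form>
    <div id="recaptcha" class="g-recaptcha" data-sitekey="SAMPLE-SITE-KEY-123"></div>
    <textarea id="g-recaptcha-response" style="display: none;"></textarea>
  </form>
HTML

# Returns the first data-sitekey attribute value found in the HTML, or nil.
def extract_sitekey(html)
  html[/data-sitekey\s*=\s*["']([^"']+)["']/i, 1]
end

puts extract_sitekey(SAMPLE_HTML) # prints SAMPLE-SITE-KEY-123
```

In the spider itself, browser.find does the same job against the live page, so this helper is only useful for quick experiments on saved HTML.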

The next thing we need to do is initialize the two_captcha client and pass the data-sitekey and the URL as parameters to its built-in method decode_recaptcha_v2 to solve the captcha.

def parse(_response, _url, data: {})
  recaptcha = browser.find(:xpath, '//div[@id="recaptcha"]')
  google_key = recaptcha['data-sitekey']
  client = TwoCaptcha.new(API_KEY)
  options = {
    googlekey: google_key,
    pageurl: URL
  }
  captcha = client.decode_recaptcha_v2(options)
  ...
end

After this, wait a few seconds. You will then receive the g-recaptcha-response.

captcha.text gives you the solution of the captcha, and captcha.id gives you the numeric ID of the captcha solved by TwoCaptcha.
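The wait exists because a human has to solve the captcha before a token comes back. The general pattern behind such a wait is "check, sleep, check again". Below is a minimal sketch of that pattern with no network calls. poll_for_solution is a hypothetical helper of my own and not part of the two_captcha gem, and the attempt count and interval are arbitrary defaults.

```ruby
# Hypothetical polling helper: calls the given block until it returns a
# non-nil value or the attempts run out. Not part of the two_captcha gem.
def poll_for_solution(max_attempts: 24, interval: 5, sleeper: method(:sleep))
  max_attempts.times do |attempt|
    result = yield(attempt)
    return result unless result.nil?

    # Nothing yet: pause before asking again.
    sleeper.call(interval)
  end
  raise "captcha not solved after #{max_attempts} attempts"
end

# Stand-in for a real status check: "not ready" twice, then a fake token.
responses = [nil, nil, 'FAKE-TOKEN']
token = poll_for_solution(interval: 0) { responses.shift }
puts token # prints FAKE-TOKEN
```

The sleeper parameter is there only so the waiting can be swapped out in tests. In normal use the default Kernel#sleep applies.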

Step 5: The last thing you need to do is fill in the form with the captcha.text value, as shown below:

def parse(_response, _url, data: {})
  ...
  browser.execute_script("document.getElementById('g-recaptcha-response').style.display = 'block';")
  browser.fill_in 'g-recaptcha-response', with: captcha.text
  browser.execute_script("document.getElementById('g-recaptcha-response').style.display = 'none';")
  browser.find(:css, '.btn_blue').click
  ...
end

Since the captcha response field is initially hidden, we first need to make it visible. After that, we fill in the token, hide the field again, and finally submit the reCaptcha form.
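A different approach from the one above is to set the field's value directly with JavaScript, which avoids toggling its visibility. The sketch below only builds the script string. recaptcha_fill_script is a hypothetical helper, and I have not verified this variant against the demo page, so treat it as an idea to try rather than a tested replacement.

```ruby
require 'json'

# Hypothetical helper: builds a JavaScript snippet that writes the token into
# the hidden g-recaptcha-response textarea without making it visible first.
# to_json is used so quotes or backslashes in the values are escaped safely.
def recaptcha_fill_script(token, element_id: 'g-recaptcha-response')
  "document.getElementById(#{element_id.to_json}).value = #{token.to_json};"
end

puts recaptcha_fill_script('FAKE-TOKEN')
# prints document.getElementById("g-recaptcha-response").value = "FAKE-TOKEN";
```

Inside parse, the resulting string could be passed to browser.execute_script in place of the three show, fill, and hide calls.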

The whole code for the parse method is available below:

Now, we are at the end of the program.

Step 6: Let’s test the application:

Run the following command in your terminal:

$> HEADLESS=false ruby break_captcha.rb

This will start breaking the reCaptcha v2. It will take some time to solve the captcha, and the final result will look like this:

This is the end of the tutorial. If you want to check your code against the final version, here is what the final script looks like: https://gist.github.com/SaraswatiSaud/dca49ed902871f53bad9fbeb6eb5a942

Hope you liked it!
