Bypass reCaptcha for web scraping in Ruby
CAPTCHAs are very common these days, especially when purchasing products online or logging in to a website. Websites use them to confirm that they are dealing with a human wherever human interaction is essential for security.
Dealing with CAPTCHAs is essential in web scraping, since they can easily break a crawler in the middle of data extraction. For developers who code their own scrapers, there are many CAPTCHA-solving services that can be integrated into a scraping system. 2Captcha is one such service; it uses real humans to solve the challenge for you.
Working with a CAPTCHA in web scraping is fairly simple. In this article, I explain how to scrape data from a website after breaking its CAPTCHA. I am using reCaptcha v2 and the demo page 2captcha.com/demo/recaptcha-v2 as an example. If you open the link, you will see a page like this:
We need a 2Captcha API KEY for this process, so the first thing to do is register an account on 2captcha.com if you haven’t already. Once you have registered and logged in to the 2Captcha website, you will see your API KEY like this:
You also need to add funds to your 2captcha.com account to use the service. Please refer to the 2Captcha website for its current charges.
Prerequisites
Before starting, I assume that you have:
- Ruby installed on your computer. I am using Ruby 2.7.2; however, it also works with other versions.
- A 2Captcha API KEY and funds available in your 2Captcha account.
Let’s begin:
Step 1: Create a file called break_captcha.rb in your editor of choice and add the following lines of code:
This is the general skeleton of the scraper. Now, save the file in your preferred location on your computer.
Step 2: Add your 2Captcha API KEY above the URL in the code:
class BreakCaptcha < Kimurai::Base
  API_KEY = 'XXXXXXXXXXXXXXXXXXXX'
  URL = 'https://2captcha.com/demo/recaptcha-v2'
  @name = 'break_captcha_spider'
  @engine = :selenium_chrome
  @start_urls = [URL]
  .
  .
end
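Hard-coding the key is fine for a demo, but it is easy to commit by accident. As a minimal sketch, the key could be read from an environment variable instead. The variable name TWO_CAPTCHA_API_KEY and the helper load_api_key are my own assumptions for illustration; neither 2Captcha nor the two_captcha gem requires them.

```ruby
# Hypothetical helper: reads the 2Captcha key from an environment-like hash.
# TWO_CAPTCHA_API_KEY is an assumed variable name, chosen for this example.
def load_api_key(env = ENV)
  key = env['TWO_CAPTCHA_API_KEY'].to_s.strip
  raise ArgumentError, 'Set TWO_CAPTCHA_API_KEY before running the scraper' if key.empty?

  key
end
```

Inside the class you could then write API_KEY = load_api_key and start the script with the variable set in your shell.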
Step 3: Next, add the two_captcha gem to the code above (install it with gem install two_captcha and load it with require 'two_captcha').
Step 4: The next thing we need to do is find the data-sitekey in the page’s HTML. It identifies the website on which the captcha appears. The process for finding the data-sitekey is shown below:
First, we need to find the CSS selector or XPath of the captcha element (i.e. the dotted line in the above figure). Then, from that element, we can read the data-sitekey of the page.
def parse(_response, _url, data: {})
  recaptcha = browser.find(:xpath, '//div[@id="recaptcha"]')
  google_key = recaptcha['data-sitekey']
  ...
end
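To make it clearer what data-sitekey is, here is a small standalone sketch that pulls the attribute out of a piece of HTML using only the standard library. The markup and the EXAMPLE_SITE_KEY value are illustrative assumptions, not copied from the demo page, and the regular expression is only for demonstration; inside the scraper, browser.find as shown above is the better tool.

```ruby
# Illustrative markup only: the id, class and sitekey value are assumptions.
SAMPLE_HTML = <<~HTML
  <form>
    <div id="recaptcha" class="g-recaptcha" data-sitekey="EXAMPLE_SITE_KEY"></div>
    <textarea id="g-recaptcha-response" style="display: none;"></textarea>
  </form>
HTML

# Returns the data-sitekey of the first tag that carries one, or nil.
def extract_sitekey(html)
  match = html.match(/<[^>]*\bdata-sitekey\s*=\s*["']([^"']+)["']/i)
  match && match[1]
end

puts extract_sitekey(SAMPLE_HTML)
```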
The next thing we need to do is initialize the two_captcha gem and pass the data-sitekey and URL as parameters to its built-in decode_recaptcha_v2 method to break the captcha.
def parse(_response, _url, data: {})
  recaptcha = browser.find(:xpath, '//div[@id="recaptcha"]')
  google_key = recaptcha['data-sitekey']
  # API_KEY is the constant defined in Step 2
  client = TwoCaptcha.new(API_KEY)
  options = {
    googlekey: google_key,
    pageurl: URL
  }
  captcha = client.decode_recaptcha_v2(options)
  ...
end
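It can help to see roughly what a client has to send to 2Captcha for this call. The sketch below only builds the two request URLs and makes no network call. The in.php and res.php endpoints and the parameter names follow 2Captcha's classic HTTP API as publicly documented, but treat them as assumptions and check the current documentation; the two_captcha gem may build its requests differently. The helper names submit_url and result_url are hypothetical.

```ruby
require 'uri'

# Assumed base URL and endpoints of 2Captcha's classic HTTP API.
TWO_CAPTCHA_HOST = 'https://2captcha.com'

# URL that submits a reCAPTCHA v2 task; the reply carries a task ID.
def submit_url(api_key:, googlekey:, pageurl:)
  query = URI.encode_www_form(
    key: api_key,
    method: 'userrecaptcha',
    googlekey: googlekey,
    pageurl: pageurl
  )
  "#{TWO_CAPTCHA_HOST}/in.php?#{query}"
end

# URL that asks for the result of a previously submitted task.
def result_url(api_key:, captcha_id:)
  query = URI.encode_www_form(key: api_key, action: 'get', id: captcha_id)
  "#{TWO_CAPTCHA_HOST}/res.php?#{query}"
end
```

The googlekey and pageurl parameters carry the same two values as the options hash passed to decode_recaptcha_v2 above.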
After this, wait a few seconds until you receive the g-recaptcha-response.
captcha.text returns the solution of the captcha, and captcha.id returns the numeric ID of the captcha solved by 2Captcha.
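The waiting happens because a human solver needs time, so a client has to keep asking until the answer is ready. Here is a minimal polling sketch of that idea. The block stands in for the real HTTP request, and the reply strings CAPCHA_NOT_READY and OK|token mirror 2Captcha's plain-text replies as an assumption; wait_for_solution is a hypothetical helper, not part of the two_captcha gem.

```ruby
# Polls until the block returns a solved reply or the attempts run out.
# The block is a stand-in for the request that asks 2Captcha for the result.
def wait_for_solution(max_attempts: 24, interval: 5)
  max_attempts.times do
    reply = yield
    # A solved reply looks like "OK|<token>"; return only the token part.
    return reply.split('|', 2).last if reply.start_with?('OK|')
    # Anything other than "not ready yet" is treated as an error.
    raise "2Captcha error: #{reply}" unless reply == 'CAPCHA_NOT_READY'

    sleep interval
  end
  raise 'Timed out waiting for the captcha solution'
end

# Usage with canned replies instead of network calls.
replies = ['CAPCHA_NOT_READY', 'CAPCHA_NOT_READY', 'OK|example-token']
token = wait_for_solution(interval: 0) { replies.shift }
puts token
```

Capping the number of attempts keeps the scraper from hanging forever if a captcha is never solved.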
Step 5: The last thing you need to do is fill in the form with the captcha.text value, as shown below:
def parse(_response, _url, data: {})
  ...
  # Show the hidden response field so it can be filled in
  browser.execute_script("document.getElementById('g-recaptcha-response').style.display = 'block';")
  browser.fill_in 'g-recaptcha-response', with: captcha.text
  # Hide the field again before submitting the form
  browser.execute_script("document.getElementById('g-recaptcha-response').style.display = 'none';")
  browser.find(:css, '.btn_blue').click
  ...
end
Since the captcha response field is initially hidden, we first need to make it visible. After that, we fill in the text value and hide the field again. Lastly, we submit the reCaptcha form.
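The element ID has to match exactly in both JavaScript calls; a stray space inside the string makes getElementById return null. One way to avoid typing the ID twice is to build both snippets from a single constant. This is only a sketch: display_script and RESPONSE_FIELD_ID are hypothetical names introduced for this example.

```ruby
require 'json'

# Builds the JavaScript that shows or hides an element by its ID.
# to_json quotes the strings safely for embedding in JavaScript.
def display_script(element_id, display)
  "document.getElementById(#{element_id.to_json}).style.display = #{display.to_json};"
end

RESPONSE_FIELD_ID = 'g-recaptcha-response'

show_field = display_script(RESPONSE_FIELD_ID, 'block')
hide_field = display_script(RESPONSE_FIELD_ID, 'none')
puts show_field
puts hide_field
```

In the parse method, these strings would be passed to browser.execute_script before and after the fill_in call.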
The whole code for the parse method is available below:
Now, we are at the end of the program.
Step 6: Let’s test the application:
Run the following command on your terminal:
$> HEADLESS=false ruby break_captcha.rb
This will start breaking the reCaptcha v2. It will take some time to break the captcha, and the final result will look like this:
This is the end of the tutorial. If you want to verify your code, here is what the final script looks like: https://gist.github.com/SaraswatiSaud/dca49ed902871f53bad9fbeb6eb5a942
Hope you liked it!