Using Network cURL Scraping to Collect Response Objects from a Website

This is my first tech blog. I am writing this for myself so that I can come back and read it if I ever forget the process.

I will be demonstrating the process on the Indian WRIS website and scraping the data it displays. I chose to scrape this particular website through the Network tab rather than with browser-automation tools such as Selenium because the interface is too complicated and laggy, and going through 1000 pages would take a lot of time even if automated.

The URL of the site is here.

Understanding the Network

Open the site from which you want to scrape the data. Open Developer Tools by pressing F12 on your keyboard or through your browser's menu. Make sure that you are on the page from which you want the data. For me, I will go to the Applications tab on the left and select the options I need. In Developer Tools, switch to the Network tab.

Now, you need to reload the page and monitor the network.
The green lines show the data being transferred over the network. You will see that a lot of files were loaded by the website, ranging from images, CSS and JSON objects to font files. You need to find the data you wish to scrape in this panel.

There are a few shortcuts you can take, though, to find your data quickly.

1. Select the timeframe 

In the timeline above, you can select the time frame in which you are sure the data was loaded. Suppose you reloaded the website 5-6 times: the data would have been loaded 5-6 times, and every file would appear 5-6 times in the list below. But if you select a time frame, you will only be shown the requests made within that time frame.

For example, I will select the densest green-line areas, because I am sure my data is present there.

2. Filter the data according to your needs

Here, select the file type you want to scrape from the site. For example, if you are on a wallpaper site, you might just need to select the Img or Media tab. For JSON objects, select the XHR tab. XHR stands for XMLHttpRequest, an API in the form of an object whose methods transfer data between a web browser and a web server. The object is provided by the browser's JavaScript environment.
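To make this concrete: each XHR entry in the panel is just an HTTP request that happens to return structured data instead of a full page. A minimal Python sketch of the same idea, using a purely hypothetical endpoint, would be:

    import requests

    # Hypothetical endpoint standing in for whatever XHR the page fires;
    # the real URL is found in the Network panel, as described in the next section.
    response = requests.get("https://example.com/api/data")

    # XHR responses are usually JSON, so they parse straight into Python objects.
    print(response.json())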

Generating the cURL and scraping the Response Object

Now, it's time to find the data you want to scrape. For me, it is this JSON object.

It is a list of states. Now you need to right-click on the file (table for me, in the right panel) and copy the cURL for that file.

For me, it is:

If you want to know what cURL is: the copied command contains the URL of the request along with the headers and parameters the browser sent, so it can reproduce the same request outside the browser. Sort of like an API call.

According to online sources:
curl is a command-line tool to transfer data to or from a server, using any of the supported protocols (HTTP, FTP, IMAP, POP3, SCP, SFTP, SMTP, TFTP, TELNET, LDAP or FILE). curl is powered by libcurl. This tool is preferred for automation, since it is designed to work without user interaction.

Anyway, the magic begins here. Now, you need to convert this cURL into Python requests code and scrape the data.

Now, there is a site available that directly converts the cURL into a Python script, which you can run on your computer.

The URL of the site is: https://curl.trillworks.com/


You have to paste the cURL on the left side, and it will directly give you the Python script, which you can run in your IDE to get the response object.

Please note that I have added a line at the end to print out the response object as text to the terminal.
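For reference, the converted script generally looks something like the sketch below. The endpoint and header values here are placeholders, not the actual ones from my cURL; the converter fills in the real headers, cookies and parameters for you.

    import requests

    # Placeholder headers; the converter copies the real ones from your cURL.
    headers = {
        'User-Agent': 'Mozilla/5.0',
        'Accept': 'application/json',
    }

    # Placeholder endpoint; your script will contain the actual request URL.
    response = requests.get('https://example.com/api/table', headers=headers)

    # The line I added at the end: print the response object as text.
    print(response.text)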

Running the script and final thoughts

Here's the response object I collected. The request might take some time depending on the traffic and the size of the website you are running this on, but once it goes through, it fetches all the data in a second.


Note: If you are getting this error while running the script,
requests.exceptions.SSLError: HTTPSConnectionPool(host='wdo.indiawris.gov.in', port=443): Max retries exceeded with url: /api/reservoir/table (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))

You can bypass the error by passing a simple verify=False parameter in the request call.
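For example (a sketch with a placeholder URL), the call becomes:

    import requests
    import urllib3

    # verify=False skips SSL certificate verification, so only use it on sites
    # you trust; suppress the InsecureRequestWarning printed on every such call.
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

    response = requests.get('https://example.com/api/table', verify=False)
    print(response.text)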

You can collect anything from any site using this method, and it is comparatively faster than the usual approaches involving BeautifulSoup, Selenium and others.

You can save this data in the form of JSON files if you want to, or any other format you scraped it for.
  • If you have to make multiple such calls on a website, many times you just have to change the state reference and the data attributes. You can loop over them, keeping them as variables, and collect and save multiple such responses (see the sketch after this list).
  • If you have to collect multiple responses and it is running slowly on your local PC, try running this script on the Google Colab platform, mounting your Drive to save the files. You will be amazed by the speed.
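Here is a sketch of that looping idea. The endpoint, the state parameter name and the state names are hypothetical stand-ins for whatever your converted script actually sends.

    import json
    import requests

    # Hypothetical endpoint and parameter name; substitute the ones from
    # your converted script.
    url = 'https://example.com/api/table'
    states = ['State A', 'State B', 'State C']

    for state in states:
        # Only the state reference changes between calls.
        response = requests.get(url, params={'state': state})
        # Save each response as its own JSON file.
        with open(state + '.json', 'w') as f:
            json.dump(response.json(), f)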
This is it for the first tutorial. I figured out all of these steps by myself, and this might be too trivial for you, but it is very new and exciting for me. I will be posting more such tutorials as I find things, mostly related to networking and hacking.

Let me know if you face any difficulties in the above tutorial, need some clarifications or can help improve it. Thank you.
