I will be demonstrating the process on the India-WRIS website, scraping the data it serves. I chose to scrape this particular site through the Network panel rather than with browser-automation tools such as Selenium because the interface is complicated and laggy, and paging through 1000 pages would take a lot of time even when automated.
The URL of the site is here.
Understanding the Network
The green lines show the data being transferred over the network. You will see that a lot of files were loaded by the website, ranging from images, CSS, and JSON objects to font files. You need to find the data you wish to scrape in this panel.

1. Select the timeframe

2. Filter the data according to your needs
Generating the cURL and scraping the Response Object
If you want to know what cURL is: curl is a command-line tool to transfer data to or from a server using any of its supported protocols (HTTP, FTP, IMAP, POP3, SCP, SFTP, SMTP, TFTP, TELNET, LDAP, or FILE), and it is powered by libcurl. The cURL command you copy from the Network panel contains everything about the request being made: the URL plus the headers and body data the browser sent, sort of like a ready-made API call. This tool is preferred for automation, since it is designed to work without user interaction.
Anyways, the magic begins here. Now you need to convert this cURL command into python-requests code and scrape the data.
There is a site available which directly converts the cURL command into a Python script that you can run on your computer.
The URL of the site is: https://curl.trillworks.com/
Paste the cURL on the left side, and it will give you the Python script, which you can run in your IDE to get the response object.
Please note that I have added a line at the end to print the response object as text to the terminal.
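For reference, the generated script looks roughly like the sketch below. The endpoint is the one that appears in the error message later in this post; the headers and payload here are placeholders standing in for whatever your own cURL contains.

```python
import requests

# Endpoint taken from the cURL I copied (yours will differ)
url = 'https://wdo.indiawris.gov.in/api/reservoir/table'

# Headers and body copied from the browser request; the values below are
# illustrative placeholders, not the real ones the site expects
headers = {'Content-Type': 'application/json'}
payload = {'stationType': 'reservoir', 'page': 1}

# The actual call is commented out so the sketch stays self-contained:
# response = requests.post(url, headers=headers, json=payload)
# print(response.text)  # the line I added to print the response as text
```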
Running the script and final thoughts
requests.exceptions.SSLError: HTTPSConnectionPool(host='wdo.indiawris.gov.in', port=443): Max retries exceeded with url: /api/reservoir/table (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))
You can bypass the error by passing verify=False in the requests call.
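A minimal sketch of the workaround, using a requests Session so that verify=False applies to every call (requests will emit an InsecureRequestWarning when verification is off, which you can silence once via urllib3):

```python
import requests
import urllib3

# verify=False triggers an InsecureRequestWarning on every request;
# disable it once here
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

session = requests.Session()
session.verify = False  # every request on this session skips SSL verification

# response = session.post(url, headers=headers, json=payload)
```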
- If you have to make multiple such calls to a website, often you only need to change the state reference and the data attributes. Keep them as variables, loop over them, and collect and save each response.
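As a sketch, assuming the payload carries a state field (the field name and state values here are hypothetical; take the real ones from your own cURL payload), the loop might look like:

```python
import json

# Hypothetical state references; substitute the values from your cURL payload
states = ['MAHARASHTRA', 'GUJARAT', 'RAJASTHAN']

responses = {}
for state in states:
    payload = {'state': state, 'page': 1}  # only the state attribute changes
    # resp = requests.post(url, headers=headers, json=payload, verify=False)
    # responses[state] = resp.json()
    responses[state] = payload  # placeholder so the sketch runs offline

# Save everything to disk once the loop finishes
with open('responses.json', 'w') as f:
    json.dump(responses, f)
```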
- If you are collecting many responses and the script runs slowly on your local PC, try running it on the Google Colab platform, mounting your Drive to save the files. You will be amazed by the speed.





