For my second Android™ project/application, which will remain unnamed for the moment, I will first need to gather some historical data from more than a couple of local (Malaysian) web sites. Some of this publicly available data goes back as far as 1985! It is insane to even contemplate fetching all this data by hand, so I had to come up with some kind of software to help me do that easily and quickly.
Previously, I would simply use PHP’s fsockopen to quickly grab a web document or two, or to masquerade as a browser, but this time fsockopen was simply not going to cut it. I needed something a lot easier to set up, and one that could survive whatever robot traps these kinds of sites usually have.
cURL and Wget
When amateurs like me want to develop software that will act as a web robot or crawler, the obvious choices are of course Wget and cURL. I have had limited experience with Wget, mostly from setting up cron jobs on my web servers, but I have never had any with cURL.
After quickly doing some research on both, I concluded the one more suited for my needs today is cURL.
What is cURL?
From the web site:
cURL is the name of the project. The name is a play on ‘Client for URLs’, originally with URL spelled in uppercase to make it obvious it deals with URLs. The fact it can also be pronounced ‘see URL’ also helped, it works as an abbreviation for “Client URL Request Library” or why not the recursive version: “Curl URL Request Library”.
It took me nearly 3 weeks, but today I have completed my “web robot” that successfully crawls all the necessary web sites, grabs any document I want, extracts just the information I need and puts it all, very nicely, into a MySQL database!
My custom web crawler, powered by PHP and cURL, is able to connect to a web site, manage cookies, send referrer data, request compressed web pages, navigate its way around a web site to get to the best parts, fetch the document containing the data I want, parse it, extract just the data I need, verify that it is correct, and save it all to the database! And it does all this at the rate of 1.5 minutes for one month’s worth of data from one web site.
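To give an idea of what the cookie, referrer and compression parts look like with PHP’s cURL extension, here is a minimal sketch. It is not my actual crawler: the function names build_fetch_options and fetch_page, the user agent string, the timeouts and the example.com addresses are all made up for illustration. Only the CURLOPT_* options and curl_* functions are real.

```php
<?php
// Hypothetical helper: collects the cURL options for one page request.
function build_fetch_options(string $url, string $referer, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow HTTP redirects
        CURLOPT_REFERER        => $referer,    // send referrer data
        CURLOPT_COOKIEJAR      => $cookieFile, // save received cookies to this file
        CURLOPT_COOKIEFILE     => $cookieFile, // send cookies stored in this file
        CURLOPT_ENCODING       => '',          // empty string: accept every encoding cURL supports (e.g. gzip)
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleBot/0.1)', // made-up user agent
        CURLOPT_CONNECTTIMEOUT => 10,          // assumed values, in seconds
        CURLOPT_TIMEOUT        => 30,
    ];
}

// Hypothetical helper: fetches one page, returns its body or null on failure.
function fetch_page(string $url, string $referer, string $cookieFile): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, build_fetch_options($url, $referer, $cookieFile));
    $body   = curl_exec($ch);                        // false on a transfer error
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // final HTTP status code
    curl_close($ch);
    return ($body !== false && $status === 200) ? $body : null;
}

// Example call (placeholder addresses, not one of the real sites):
// $html = fetch_page('http://example.com/archive?month=1985-01',
//                    'http://example.com/', '/tmp/cookies.txt');
```

Reusing the same cookie file for every request is what lets the robot carry its session from one page to the next, and passing the previous page as the referrer makes each request look like a normal click.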
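Considering that I have over 20 years of data to fetch, and from more than one web site at that, it is not bad at all, if you ask me! Twenty years is 240 months, so at 1.5 minutes per month that works out to about six hours per web site.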
At this rate, GIDApp No. 2 should be ready in 3 months.