I wanted a diagram showing the number of hits for a particular search term in Google, Yahoo, Bing and other search engines, and how those numbers change over time. Naturally, this should be done as a Linux cron job executed at regular intervals.
The first attempt,
No Format |
---|
wget 'http://www.google.de/search?hl=en&q="my+query"' |
--2011-02-03 09:29:54-- http://www.google.de/search?hl=en&q=%22my+query%22 |
Resolving www.google.de... 74.125.79.147, 74.125.79.99, 74.125.79.104 |
Connecting to www.google.de|74.125.79.147|:80... connected. |
HTTP request sent, awaiting response... 403 Forbidden |
2011-02-03 09:29:55 ERROR 403: Forbidden. |
did not work. Google seems to block requests made with wget. Even if something is returned, it is only a few lines of rather cryptic JavaScript code.
It seems we have to use a real web browser here.
No Format |
---|
lynx 'http://www.google.de/search?hl=en&q="my+query"' |
works better, but we get a lot of requests for cookies. So, let's accept all cookies by default:
No Format |
---|
lynx -accept_all_cookies 'http://www.google.de/search?hl=en&q="my+query"' |
Finally, we would like to dump the whole page to stdout instead of running lynx interactively. We then find the interesting bit of information with some extra grep commands:
No Format |
---|
lynx -accept_all_cookies -dump 'http://www.google.de/search?hl=en&q="my+query"' | grep About | grep results |
About 3,660,000 results (0.12 seconds) |
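To draw a diagram over time, the bare number has to be pulled out of that line and stored together with a timestamp. The following is only a minimal sketch: the helper names `extract_hits` and `log_hits` and the log file are hypothetical, and it assumes the result line keeps the English form "About N results" shown above.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.

# extract_hits: read page text on stdin and print the hit count from the
# first line of the form "About 3,660,000 results", without the commas.
extract_hits() {
    sed -n 's/.*About \([0-9][0-9,]*\) results.*/\1/p' | head -n 1 | tr -d ','
}

# log_hits: append "timestamp<TAB>count" to a log file.
#   $1 = hit count, $2 = log file (path chosen by the caller)
log_hits() {
    printf '%s\t%s\n' "$(date '+%Y-%m-%d %H:%M')" "$1" >> "$2"
}
```

With these definitions, the pipeline above could become `lynx -accept_all_cookies -dump '...' | extract_hits`, and the printed number could be passed to `log_hits` along with a file name of your choice. The resulting tab-separated file should be easy to feed into a plotting tool.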
This script can now be called from the /etc/crontab file. It also works with Yahoo and Bing (but not with certain Web-2.0-intensive applications such as Twitter).
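An entry in /etc/crontab might look roughly like the following. Unlike a per-user crontab, the system-wide file has a user field between the schedule and the command. The script path and the user name here are assumptions for illustration only, as is the hourly schedule.

```
# m  h  dom mon dow  user    command
  0  *  *   *   *    nobody  /usr/local/bin/search-hits.sh
```

This would run the script at minute 0 of every hour. Any output the script prints is normally mailed to the owner by cron, so it may be preferable to redirect it to a file inside the script.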
Note that Google and probably all other search engines do not like to be queried by shell scripts. If you do this extensively your IP address will probably be blocked. So you can't build your own search engine on top of Google's. The proper way to do this is to register with Google and to use their API (and accept their terms of use).
So far, I have been running this automated script every hour for a few days, and I am not yet on Google's radar.
See also: the actual results of this search script (sorry, page available in German language only).