Tuesday, October 10, 2006

Monitor a webpage for updates

Yes, more Linux scripts.

I was participating in a flame war (er, online discussion) on a forum, and wanted to know the second someone replied to my post. So while I was hitting F5 over and over like an idiot, I realized that computers are supposed to do dull, repetitive tasks, not me! So I hacked together a few of my other scripts and came up with a kluge of a solution that worked.

I was able to continue working on my projects and be notified immediately if someone replied to my troll flame-bait (ahem, well-worded response). And there was much rejoicing...

More recently, I wanted to track a package that was supposed to be shipped to me quickly. (f'ing UPS) I dug up that old script and couldn't believe that I'd actually used the wasteful code I'd written! I quickly started on a more elegant solution.

First we find out how many lines are on the page:
[XXXX@XXXXXXXXXXXXX ~]$ lynx --dump 'http://page-to-check'|wc -l
193

Here we're using "Lynx", which is a text-only web browser. Since it only gets text, we don't have to worry about leeching a large amount of bandwidth. (In the case of the UPS tracking, I was pulling around 55K every 10 minutes, which is probably just fine, especially considering that I'm only going to be doing it for a short while.) I'm using lynx's "dump" function to dump the page to the screen instead of opening the page in a window for me to view and navigate through. I'm then taking this output and sending it to the "wc" program using a pipe (|) ("\" with shift). The "wc" program is a word count program that will count the words in a file or from input; I'm using the "-l" switch to make it count the lines of the web page that lynx dumped for us. The line count is 193.

Now that we know how many lines the page had when we last checked it, we can make a loop that checks whether the line count has increased and notifies us if it has.

[XXXX@XXXXXXXXXXXXX ~]$ while(true); do if [ `lynx --dump 'http://page-to-check'|wc -l` -gt "193" ]; then kdialog --sorry omgpancakesdangerous; fi; sleep 600; done


The lynx --dump 'http://page-to-check'|wc -l part is the command I've already gone over; everything else is new. The while(true); do means to do the following forever. The if [ test runs that command inside backticks, which substitute the command's output into the test, and checks whether that output is -gt (greater than) 193. If it is, it will run kdialog --sorry omgpancakesdangerous, which sends a pop-up box to the user with the text "omgpancakesdangerous", because they are. fi; closes the if statement, and sleep 600; tells the script to wait 600 seconds before trying again. The done at the end closes the while(true); do loop, which then starts over again at the if [ statement.
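If the one-liner gets hard to read, the same idea can be spread over a few lines. This is only a rough sketch: the function names fetch_page, notify, and watch_page are made up for illustration, and the tr call is an extra I added because some wc implementations pad the count with spaces.

```shell
#!/bin/bash
# Sketch of the one-liner as a reusable function.
# fetch_page, notify, and watch_page are hypothetical names, not from the post.

# Dump the page as plain text (same lynx call as the one-liner).
fetch_page() {
    lynx --dump "$1"
}

# Pop up a message box (same kdialog call as the one-liner).
notify() {
    kdialog --sorry "$1"
}

# watch_page URL BASELINE [INTERVAL]
# Polls URL until its line count exceeds BASELINE, then notifies and returns.
watch_page() {
    local url="$1" baseline="$2" interval="${3:-600}"
    local count
    while true; do
        # Strip any padding spaces that wc may print around the number.
        count=$(fetch_page "$url" | wc -l | tr -d ' ')
        if [ "$count" -gt "$baseline" ]; then
            notify "Page grew from $baseline to $count lines"
            return 0
        fi
        sleep "$interval"
    done
}
```

For example, watch_page 'http://page-to-check' 193 repeats the check every 600 seconds. Unlike the one-liner, this sketch stops after the first notification instead of popping up again on every pass.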

Important note on choosing a wait time:
The exact interval makes little practical difference. Waiting 60 seconds between checks instead of 5 barely delays the notification, considering how quickly one minute passes. Think about how often the page updates, and how long you'll have the script running. If you want to check a forum to fire a response back ASAP, set the wait time to 60 seconds. If you want to see if your favorite blog has made any updates, set the wait time to 1 hour. There's no need to check a blog for updates every minute, and checking a forum thread for updates every hour is a bit slow.

The point here is moderation! If you set a short wait time, you should only let the script run for about an hour. Don't check a site every 60 seconds for two weeks! Bandwidth costs money. Leeching unnecessary bandwidth from a site will increase costs for the site owner, and most webmasters don't take kindly to impolite scrapers (scripts that "scrape" content off a site over and over). Keep in mind how much bandwidth you're using; if you're scraping pictures off a site over and over, you'll definitely get noticed, since pictures take up a lot more data than just the text. If you're rude or wasteful, you may find your IP address blocked from accessing the site at all!
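One way to keep yourself honest about this is to cap the number of checks so the script gives up on its own. The following is a sketch under my own assumptions, not the script from this post: fetch_page, notify, and poll_politely are hypothetical names, and the check budget is an idea added for illustration.

```shell
#!/bin/bash
# Hypothetical polite variant: stop after a fixed number of checks.
# fetch_page, notify, and poll_politely are illustrative names only.

fetch_page() { lynx --dump "$1"; }
notify()     { kdialog --sorry "$1"; }

# poll_politely URL BASELINE INTERVAL MAX_CHECKS
# Returns 0 if the page grew, 1 if the check budget ran out first.
poll_politely() {
    local url="$1" baseline="$2" interval="$3" max_checks="$4"
    local checks=0 count
    while [ "$checks" -lt "$max_checks" ]; do
        # Strip any padding spaces that wc may print around the number.
        count=$(fetch_page "$url" | wc -l | tr -d ' ')
        checks=$((checks + 1))
        if [ "$count" -gt "$baseline" ]; then
            notify "Update detected after $checks check(s)"
            return 0
        fi
        # Skip the final sleep once the budget is used up.
        [ "$checks" -lt "$max_checks" ] && sleep "$interval"
    done
    return 1
}
```

With these assumed arguments, poll_politely 'http://page-to-check' 193 60 60 would check once a minute and quit after roughly an hour, while poll_politely 'http://page-to-check' 193 3600 24 would check hourly for about a day.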

BE NICE! Think about the nice webmasters who are running the site you're using. They don't want to have to block you, and you don't want to be blocked, so be nice and play by the rules.

1 comment:

Fletch said...

Heh. I had to look up this post to get my code when I wanted to monitor tracking information on a package...