This is a follow-up to my previous article.
The example in the previous post only checked three programs, but the overall execution time grows with every entry that is added. With about 36 entries a run takes around 20 to 40 seconds, depending on how fast the websites answer, because the sites are accessed in sequence and each request has to complete before the next one starts. This can and will add up. Additionally, running the various Unix commands in a pipe for every entry does not help overall efficiency.
So I decided to try to make the script better and hopefully a lot faster, especially when more websites are being accessed. My first thought was to try it with Perl, but parsing HTML with Perl never quite worked for me when I wrote some test scripts. This may be due to me being inept at using HTML::TreeBuilder and/or XML::XPath. In any case, I was getting nowhere…
The second choice of scripting language was Ruby. As I had already done some website scraping using the Ruby gems mechanize, nokogiri and hpricot, I thought converting the shell script into a Ruby script would be easy. Little did I know :)
Although the script only accesses a few websites and stores the version number (and the program name) in a sqlite3 database, I hope it can serve as an example of how to use threads, mechanize, nokogiri and sqlite3, and of how to pass code to a function.
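To illustrate what "passing code to a function" means in Ruby, here is a minimal sketch that runs without any network access. The helper name `with_data` and the sample strings are made up for illustration; only the `yield` / `block_given?` mechanism is the one the real script relies on.

```ruby
# Hypothetical helper: receives some raw data and lets the caller's
# block decide how to turn it into a result.
def with_data(raw)
  # Default return value, used when there is no data or no block
  result = ''
  result = yield raw if block_given? && !raw.empty?
  result
end

# The block receives the raw data and returns the extracted value
first_word = with_data(['2.10.7 (released)', 'older']) { |data| data.first.split(' ').first }

# Without a block (or without data) the default '' is returned
no_block = with_data(['2.10.7'])
no_data  = with_data([]) { |data| data.first }

puts first_word
```

The value of the block becomes the value of `yield`, so each website can bring its own extraction code while the function around it stays the same.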
Adding more checks does not increase the overall runtime much, because the requests now run concurrently. A test with 36 entries results in an overall runtime of between two and five seconds.
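The effect of the threads can be sketched without touching any website. In the following example `fake_fetch` is a hypothetical stand-in that merely sleeps, and the names, delays and version strings are invented; it only shows that the waiting times overlap instead of adding up.

```ruby
# Hypothetical stand-in for a slow website request: sleeps instead of
# doing network I/O, then returns a made-up version string.
def fake_fetch(name, delay)
  sleep(delay)
  "#{name}-1.0"
end

# Three pretend sites, each taking 0.2 seconds to answer
sites = { 'A' => 0.2, 'B' => 0.2, 'C' => 0.2 }

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

# Start one thread per site
threads = sites.map do |name, delay|
  [name, Thread.new { fake_fetch(name, delay) }]
end

# Thread#value waits for the thread and returns the value of its block
versions = threads.map { |name, thread| [name, thread.value] }.to_h

elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
puts versions.inspect
puts format('elapsed: %.2f s', elapsed)
```

Run in sequence, the three pretend requests would need about 0.6 seconds; with threads the total stays close to the slowest single request.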
Finding the actual XPath is not that easy. Firebug (Firefox) or the Developer Tools in Google Chrome can provide helpful hints, but the expression is best tested in a small test script, in which the Ruby code for the version extraction can be written as well.
XPath and version extraction test script (example Metapad):
#!/usr/bin/ruby
require 'rubygems'
require 'mechanize'
require 'nokogiri'

# exit upon control-c
trap("INT") { exit 1 }

# site data
url = 'http://liquidninja.com/metapad/download.html'
name = 'MetaPad'
xpath = '/html/body/table/tr/td/table/tr/td/table/tr/td/table/tr/td'

# and access the site
agent = Mechanize.new
agent.history.max_size = 0
agent.user_agent_alias = 'Mac Safari'
agent.read_timeout = 3
page = agent.get(url)
found = page.search(xpath)

# this is to make it easier to just "drop" in the source code
# into the "real script".
data = found

# display the "elements" we have hit to see if we are "close" with the xpath
data.each do |node|
  puts "-- NODE --"
  puts node
end
puts "--- OUTPUT ---"

##### version extraction code
versions = Array.new
data.each do |node|
  versions.push(node.text) if node.to_s.include?('Version')
end
data = versions.first
data = data.strip.gsub!(/Version /, '')
#### version extraction code end

# assign the result back to 'result'
result = data

# and print what we have
puts "#{name}: #{result}"
exit 0
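As a possible alternative for the extraction step, the version can be pulled out with a single regular expression capture. This is only a sketch: `extract_version` is a hypothetical helper, and the sample strings are invented rather than taken from the real Metapad page.

```ruby
# Hypothetical helper: returns whatever follows the word "Version",
# or nil when the text does not contain a version at all.
def extract_version(text)
  # String#[] with a regexp and a capture index returns the captured
  # group, or nil if the pattern does not match
  text.to_s[/Version\s+(\S+)/, 1]
end

# Made-up sample texts, standing in for node.text values
samples = [" Version 3.6 \n", "Download", nil]
results = samples.map { |sample| extract_version(sample) }

results.each { |result| puts result.inspect }
```

Because the helper returns `nil` when nothing matches, a layout change on the vendor's site shows up as a missing version rather than as a wrong one.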
The sqlite3 database table
sqlite> .schema win32
CREATE TABLE win32 (program varchar(30) PRIMARY KEY, cver TEXT, pver TEXT, err INT default '0');
sqlite>
Before the script can write to the database, an entry for the program must exist, because the script only updates existing rows. This is an example of how to create an entry:
sqlite> insert into win32 (program) values ('MetaPad');
sqlite> select * from win32 where program='MetaPad';
MetaPad|||0
sqlite>
And here is the final ruby script that checks three vendor websites and writes to the sqlite3 database:
#!/usr/bin/ruby
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'sqlite3'
require 'socket'

# exit upon control-c
trap("INT") { exit 1 }

# Setup the database
dbfile = 'win32.db'
dbtable = 'win32'

# no database, no party
# (File.exist? replaces File.exists?, which newer Ruby versions no longer provide)
if not File.exist?(dbfile)
  puts "Error, could not find database: #{dbfile}"
  exit 10
end

# connect to the database
database = SQLite3::Database.new(dbfile)

# This hash will store the program name
# and its (scraped) version string
pv = Hash.new

# Using threads to make website scraping concurrent (and thus faster)
mythreads = Array.new

# This is the core routine of the script
# It takes two arguments and a code block:
# url: What URL to access/get HTML from
# xpath: Using nokogiri, "extract" only a part of the HTML content
# code block: The source code to parse the part of the HTML content to obtain
#             the version string
def getversion(url,xpath)
  # Set the default return value to ''
  result = ''

  # TCP connection check, host/service available?
  host = url.split('/')
  host = host[2]

  # No socket, no access
  # (the socket is only opened inside begin/rescue, so a failed
  # connection returns '' instead of raising an exception)
  begin
    socket = TCPSocket.open(host, '80')
  rescue
    return result
  end
  socket.close

  # This sets up a simple web client
  agent = Mechanize.new
  agent.history.max_size = 0
  agent.user_agent_alias = 'Mac Safari'
  agent.read_timeout = 3
  agent.keep_alive = false

  # Try and download the page
  # If something goes wrong, result will not have changed
  # and thus will be '' which will force the error counter to be increased
  begin
    page = agent.get url
  rescue Mechanize::ResponseCodeError
    return result
  end

  # Parse the content using the xpath and nokogiri
  # which is accessed via the mechanize method .search
  found = page.search xpath

  # If the array (of nokogiri) objects is not empty
  # execute the provided code block to obtain the version number
  if not found.empty? then
    result = yield found if block_given?
  end

  return result
end

###
# Metapad
t = Thread.new {
  url = 'http://liquidninja.com/metapad/download.html'
  name = 'MetaPad'
  xpath = '/html/body/table/tr/td/table/tr/td/table/tr/td/table/tr/td'

  # This is the call where url, xpath and some code is being supplied to the function
  # The code passed uses the 'data' object which is the one also being 'returned'
  # to the subroutine when the code block finishes
  result = getversion( url, xpath ) { |data|
    versions = Array.new
    data.each do |node|
      versions.push(node.text) if node.to_s.include?('Version')
    end
    data = versions.first
    data = data.strip.gsub!(/Version /, '')
  }
  pv["#{name}"] = result
}
mythreads.push(t)

###
# Pidgin
t = Thread.new {
  name = 'Pidgin'
  url = 'http://developer.pidgin.im/wiki/ChangeLog'
  xpath = '/html/body/div/div/div/div/div/h2'

  # Simpler code to extract the version number
  result = getversion( url, xpath ) { |data|
    data = data.first.text
    data = data.split(' ')
    data = data[1]
  }
  pv["#{name}"] = result
}
mythreads.push(t)

###
# Adobe Flash Player
t = Thread.new {
  url = 'http://www.adobe.com/de/software/flash/about/'
  name = 'Adobe Flash Plugin'
  xpath = '/html/body/div/div/div/div/table/tbody/tr/td'

  # Simplest version :)
  result = getversion( url, xpath ) { |data|
    data = data[5].text
  }
  pv["#{name}"] = result
}
mythreads.push(t)

#### More entries could go here :)

# Collect all threads and wait, if necessary
mythreads.each { |t| t.join }

# Create a string of database commands by
# looping through the 'pv' hash and create the individual update commands for sqlite3
# Each command is one element of the 'sql' array
sql = Array.new
pv.each do |p,v|
  sql.push("update #{dbtable} set cver = '#{v}' where program = '#{p}';")
end

# Convert the 'sql' array to one big string and batch execute
database.execute_batch( sql.join("\n") )

# The cver column for each program has been updated
# What is left is to query the database and print the results, if there are any

# reset error counter when cver has a value
database.execute("update #{dbtable} set err = 0 WHERE cver != ''")

# Report if the err counter is >= 5
# This means 5 consecutive runs went 'wrong' and requires a check as to
# what/why.
# For example, if the vendor changes the output/structure of the website...
data = database.execute2("select program AS ProgramName,err AS ErrorRuns from win32 where err >= 5")

# When executing the method execute2 the table header is also returned
# this means, the data array will always contain one element
# It will contain a second (and more) element(s) if actual data is returned
if data[1] then
  data.each do |line|
    printf("%-20s %-15s\n", line[0], line[1])
  end
end

# Increase the error counter if current version is empty
database.execute("update win32 set err = err + 1 WHERE cver=''")

# Compare the current version with previous version
# Do not compare when current version is empty
# pver is NULL for a freshly inserted entry and '!=' never matches NULL,
# so the NULL-safe 'IS NOT' is used for the comparison
data = database.execute2("select program AS ProgramName,pver AS OldVersion,cver AS CurrentVersion from win32 where ( cver IS NOT pver and cver != '')")

# see remark about execute2 above
if data[1] then
  data.each do |line|
    printf("%-20s %-15s %-15s\n", line[0], line[1], line[2])
  end
end

# Set the previous version to the value of current version
# Do not set if the current version is empty
database.execute("update win32 SET pver = cver WHERE ( pver IS NOT cver and cver != '')")

exit 0