Simple Things
Ink and Incapability
How to query software program vendor websites for the program version using ruby
This is a followup to my previous article.
The example in the previous post checked only three programs, and the overall execution time grows with every entry added. With about 36 entries a run takes around 20 to 40 seconds, depending on how fast the websites answer, because the sites are accessed in sequence and the script waits for each request to complete before starting the next one. This adds up. Running several Unix commands in a pipe for every entry does not help efficiency either.
So I decided to try to make the script better and hopefully a lot faster, especially when more websites are being accessed. My first thought was to use Perl, but parsing HTML with Perl never quite worked for me when I wrote some test scripts. This may be due to me being inept at using HTML::TreeBuilder and/or XML::XPath. In any case, I was getting nowhere…
The second choice of scripting language was Ruby. As I had already done some website scraping using the Ruby gems mechanize, nokogiri and hpricot, I thought converting the shell script into a Ruby script would be easy. Little did I know :)
Although the script only accesses some websites and stores the version number (and the program name) in a sqlite3 database, I hope it can serve as an example of how to use threads, mechanize, nokogiri and sqlite3, and of how to pass code to a function.
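As a minimal sketch of "passing code to a function": in Ruby the code is handed over as a block and run with yield. The helper name with_extractor and the sample string are made up for illustration and are not part of the script in this article.

```ruby
# Hypothetical helper (not from the article's script): applies the
# caller's block to a piece of scraped text.
# Returns '' when there is no text or no block, following the
# "empty string means failure" convention used in this article.
def with_extractor(text)
  return '' if text.nil? || text.empty?
  return '' unless block_given?
  # yield hands 'text' to the block; the value of the block's last
  # expression becomes the value of the yield expression.
  yield(text).to_s
end

# Made-up sample input; the block picks the second word.
version = with_extractor('Program 1.2.3 released') { |t| t.split(' ')[1] }
puts version
```

The method itself knows nothing about the page layout; everything site specific lives in the block, which is what keeps each website entry short.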
Adding more checks does not increase the overall runtime much, because the requests now run concurrently. A test with 36 entries results in an overall runtime of between two and five seconds.
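The effect can be reproduced without any network access. The following sketch replaces the website request with a sleep; fake_fetch, run_sequential, run_threaded and timed are hypothetical names used only here, and the delays are arbitrary.

```ruby
# Stand-in for a slow website request: sleeps instead of using the network.
def fake_fetch(name, delay)
  sleep delay
  "#{name}-done"
end

# One request after the other: total time is roughly the sum of all delays.
def run_sequential(jobs)
  jobs.map { |name, delay| fake_fetch(name, delay) }
end

# One thread per request: total time is roughly the longest single delay.
def run_threaded(jobs)
  threads = jobs.map do |name, delay|
    Thread.new { fake_fetch(name, delay) }
  end
  # Thread#value waits for the thread and returns the block's result
  threads.map(&:value)
end

# Runs the given block and returns [result, elapsed seconds].
def timed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

jobs = { 'a' => 0.2, 'b' => 0.2, 'c' => 0.2 }
_, seq = timed { run_sequential(jobs) }
_, thr = timed { run_threaded(jobs) }
puts format('sequential: %.2fs, threaded: %.2fs', seq, thr)
```

Threads help here because the time is spent waiting, not computing. The sketch collects results with Thread#value instead of writing into a shared hash; that is a design choice of this example, not a requirement.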
Finding the actual XPath is not that easy. Firebug (Firefox) or the Developer Tools in Google Chrome can provide helpful hints, but the expression is best tested in a small test script, where the Ruby code for the version extraction can be written as well.
XPath and version extraction test script (example Metapad):
#!/usr/bin/ruby
require 'rubygems'
require 'mechanize'
require 'nokogiri'
# exit upon control-c
trap("INT") { exit 1 }
# site data
url = 'http://liquidninja.com/metapad/download.html'
name = 'MetaPad'
xpath = '/html/body/table/tr/td/table/tr/td/table/tr/td/table/tr/td'
# and access the site
agent = Mechanize.new
agent.history.max_size = 0
agent.user_agent_alias = 'Mac Safari'
agent.read_timeout = 3
page = agent.get(url)
found = page.search(xpath)
# this is to make it easier to just "drop" in the source code
# into the "real script".
data = found
# display the "elements" we have hit to see if we are "close" with the xpath
data.each do |node|
puts "-- NODE --"
puts node
end
puts "--- OUTPUT ---"
##### version extraction code
versions = Array.new
data.each do |node|
versions.push(node.text) if node.to_s.include?('Version')
end
# to_s guards against no match at all; gsub (not gsub!) always returns a string
data = versions.first.to_s
data = data.strip.gsub(/Version /, '')
#### version extraction code end
# assign the result back to 'result'
result = data
# and print what we have
puts "#{name}: #{result}"
exit 0
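One detail worth knowing when writing extraction code: String#gsub! returns nil when nothing was replaced, so assigning its result can silently turn a usable string into nil. The sketch below shows the difference and a nil-safe variation of the extraction that works on plain strings. extract_version is a hypothetical helper, and it returns only the token after "Version", which is slightly stricter than removing the word.

```ruby
# gsub! modifies in place and returns nil when there was nothing to replace
with_prefix    = 'Version 3.6'.gsub!(/Version /, '')  # a string
without_prefix = '3.6'.gsub!(/Version /, '')          # nil
puts with_prefix.inspect
puts without_prefix.inspect

# Hypothetical helper: takes the text of each node as a plain string,
# picks the first one containing 'Version' and returns the token that
# follows it. Returns '' when nothing matches.
def extract_version(texts)
  line = texts.find { |t| t.include?('Version') }
  return '' if line.nil?
  # String#[] with a regex and a group index returns nil on no match,
  # so to_s keeps the return type a string in every case.
  line[/Version\s+(\S+)/, 1].to_s
end

puts extract_version(['Download', '  Version 3.6  ']).inspect
```

With nokogiri nodes the input could be built with something like data.map { |node| node.text }, which keeps the extraction itself testable without fetching a page.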
The sqlite3 database table
sqlite> .schema win32
CREATE TABLE win32 (program varchar(30) PRIMARY KEY, cver TEXT, pver TEXT, err INT default '0');
sqlite>
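Prior to writing to the database, an entry for the program must exist, because the script only issues UPDATE statements. The cver and pver columns are set to empty strings rather than left NULL: the script later compares them with cver != pver, and in SQL a comparison against NULL never evaluates to true, so a NULL pver would never be reported or updated. This is an example of how to create an entry:
sqlite> insert into win32 (program, cver, pver) values ('MetaPad', '', '');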
sqlite> select * from win32 where program='MetaPad';
MetaPad|||0
sqlite>
And here is the final ruby script that checks three vendor websites and writes to the sqlite3 database:
#!/usr/bin/ruby
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'sqlite3'
require 'socket'
# exit upon control-c
trap("INT") { exit 1 }
# Setup the database
dbfile = 'win32.db'
dbtable = 'win32'
# no database, no party
if not File.exist?(dbfile)
puts "Error, could not find database: #{dbfile}"
exit 10
end
# connect to the database
database = SQLite3::Database.new(dbfile)
# This hash will store the program name
# and its (scraped) version string
pv = Hash.new
# Using threads to make website scraping concurrent (and thus faster)
mythreads = Array.new
# This is the core routine of the script
# It takes two arguments and a code block:
# url: What URL to access/get HTML from
# xpath: Using nokogiri, "extract" only a part of the HTML content
# block: The source code to parse the part of the HTML content to obtain
# the version string
def getversion(url,xpath)
# Set the default return value to ''
result = ''
# TCP connection check, host/service available?
host = url.split('/')
host = host[2]
# No socket, no access: the connection is opened inside begin/rescue so that
# an unreachable host returns '' instead of raising inside the thread
begin
socket = TCPSocket.open(host, '80')
rescue
return result
end
socket.close
# This sets up a simple web client
agent = Mechanize.new
agent.history.max_size = 0
agent.user_agent_alias = 'Mac Safari'
agent.read_timeout = 3
agent.keep_alive = false
# Try and download the page
# If something goes wrong, result will not have changed
# and thus will be '' which will force the error counter to be increased
begin
page = agent.get url
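# StandardError also covers timeouts and connection errors,
# not only HTTP error codes (Mechanize::ResponseCodeError)
rescue StandardError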
return result
end
# Parse the content using the xpath and nokogiri
# which is accessed via the mechanize method .search
found = page.search xpath
# If the array (of nokogiri) objects is not empty
# execute the provided code block to obtain the version number
if not found.empty?
then
result = yield found if block_given?
end
return result
end
###
# Metapad
t = Thread.new {
url = 'http://liquidninja.com/metapad/download.html'
name = 'MetaPad'
xpath = '/html/body/table/tr/td/table/tr/td/table/tr/td/table/tr/td'
# This is the call where url, xpath and some code is being supplied to the function
# The code passed uses the 'data' object which is the one also being 'returned'
# to the subroutine when the code block finishes
result = getversion( url, xpath ) { |data|
versions = Array.new
data.each do |node|
versions.push(node.text) if node.to_s.include?('Version')
end
# to_s guards against no match at all; gsub (not gsub!) always returns a string
data = versions.first.to_s
data = data.strip.gsub(/Version /, '')
}
pv["#{name}"] = result
}
mythreads.push(t)
###
# Pidgin
t = Thread.new {
name = 'Pidgin'
url = 'http://developer.pidgin.im/wiki/ChangeLog'
xpath = '/html/body/div/div/div/div/div/h2'
# Simpler code to extract the version number
result = getversion( url, xpath ) { |data|
data = data.first.text
data = data.split(' ')
data = data[1]
}
pv["#{name}"] = result
}
mythreads.push(t)
###
# Adobe Flash Player
t = Thread.new {
url = 'http://www.adobe.com/de/software/flash/about/'
name = 'Adobe Flash Plugin'
xpath = '/html/body/div/div/div/div/table/tbody/tr/td'
# Simplest version :)
result = getversion( url, xpath ) { |data| data = data[5].text }
pv["#{name}"] = result
}
mythreads.push(t)
#### More entries could go here :)
# Collect all threads and wait, if necessary
mythreads.each { |t| t.join }
# Loop through the 'pv' hash and update the cver column of each program
# Bind parameters (?) keep quotes in a scraped string from breaking the SQL
# A nil result is stored as '' so that the error counter logic below applies
pv.each do |p,v|
database.execute("update #{dbtable} set cver = ? where program = ?", [v.to_s, p])
end
# The cver column for each program has been updated
# What is left is to query the database and print the results, if there are any
# reset error counter when cver has a value
database.execute("update #{dbtable} set err = 0 WHERE cver != ''")
# Report if the err counter is >= 5
# This means 5 consecutive runs went 'wrong' and requires a check as to
# what/why.
# For example, if the vendor changes the output/structure of the website...
data = database.execute2("select program AS ProgramName,err AS ErrorRuns from win32 where err >= 5")
# When executing the method execute2 the table header is also returned
# this means, the data array will always contain one element
# It will contain a second (and more) element(s) if actual data is returned
if data[1]
then
data.each do |line|
printf("%-20s %-15s\n", line[0], line[1])
end
end
# Increase the error counter if current version is empty
database.execute("update win32 set err = err + 1 WHERE cver=''")
# Compare the current version with previous version
# Do not compare when current version is empty
data = database.execute2("select program AS ProgramName,pver AS OldVersion,cver AS CurrentVersion from win32 where ( cver != pver and cver != '')")
# see remark about execute2 above
if data[1]
then
data.each do |line|
printf("%-20s %-15s %-15s\n", line[0], line[1], line[2])
end
end
# Set the previous version to the value of current version
# Do not set if the current version is empty
database.execute("update win32 SET pver = cver WHERE ( pver != cver and cver != '')")
exit 0
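The bookkeeping done by the SQL statements at the end of the script (reset or increase the error counter, report a changed version, then copy cver to pver) can be easier to follow as plain Ruby. This is only a sketch of that logic for a single table row, assuming cver and pver start as empty strings rather than NULL. Row and apply_run are hypothetical names that do not appear in the script.

```ruby
# One row of the win32 table, reduced to the columns the logic touches.
Row = Struct.new(:cver, :pver, :err)

# Applies one run's bookkeeping to a row, given the freshly scraped
# version ('' or nil means the scrape failed).
# Returns [old_version, new_version] when the version changed, else nil.
def apply_run(row, scraped)
  row.cver = scraped.to_s
  if row.cver.empty?
    # failed run: count it, keep the previous version untouched
    row.err += 1
    return nil
  end
  # successful run: the error counter starts over
  row.err = 0
  return nil if row.cver == row.pver
  change = [row.pver, row.cver]
  row.pver = row.cver
  change
end

row = Row.new('', '', 0)
['1.0', '', '', '1.1'].each do |scraped|
  change = apply_run(row, scraped)
  puts "scraped=#{scraped.inspect} err=#{row.err} change=#{change.inspect}"
end
```

In the script, a row whose err value has reached 5 is what shows up in the first report, which is the hint that the vendor page has probably changed its structure.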