How to query software program vendor websites for the program version using ruby

This is a follow-up to my previous article.

The example in the previous post only checked three programs, and the overall execution time grows with every entry added. With about 36 entries the run takes around 20 to 40 seconds, depending on how fast the websites answer, because each website is accessed in sequence and the script waits for each request to complete. This adds up. Running the various Unix commands in a pipe does not help overall efficiency either.

So I decided to try to make the script better and hopefully a lot faster, especially when more websites are being accessed. My first thought was to try it with Perl, but parsing HTML with Perl never quite worked for me when I wrote some test scripts. This may be due to me being inept at using HTML::TreeBuilder and/or XML::XPath. In any case, I was getting nowhere…

The second choice of scripting language was Ruby. As I had already done some website scraping using the Ruby gems mechanize, nokogiri and hpricot, I thought converting the shell script into a Ruby script would be easy. Little did I know :)

Although it only accesses some websites and stores the version number (and the program name) in a sqlite3 database, I hope it can serve as an example of how to use threads, mechanize, nokogiri and sqlite3, and how to pass code to a function.
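
Passing code to a function is done in Ruby with a block, which the function runs via yield. A minimal sketch of the pattern, with a made-up transform method and sample data standing in for the scraped content (neither is part of the real script):

```ruby
# Hypothetical helper, for illustration only: it hands 'input' to the
# block supplied by the caller and returns whatever the block returns.
def transform(input)
    result = ''
    # Only run the block when the caller actually supplied one
    result = yield input if block_given?
    result
end

# The block receives the input as 'data'; its last expression is the return value
with_block    = transform('Version 3.6') { |data| data.split(' ').last }
without_block = transform('Version 3.6')

puts with_block.inspect
puts without_block.inspect
```

When no block is given, the default value is returned unchanged, which is the same fallback idea the real function uses for failed lookups.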

Adding more checks does not increase the overall runtime much. A test with 36 entries results in an overall runtime of two to five seconds.
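
The reason for the speedup is that the waiting times overlap instead of adding up. A small sketch of that effect, where sleep stands in for the network round trip; the site names, the delay and the fetch lambda are assumptions for illustration, not part of the real script:

```ruby
require 'benchmark'

# Made-up list of sites; sleep stands in for the network round trip
sites = %w[site-a site-b site-c site-d site-e]
delay = 0.2   # assumed response time per site, in seconds

# Hypothetical stand-in for a real page download
fetch = lambda do |site|
    sleep delay
    "#{site}: done"
end

# One request after the other: roughly sites.size * delay
sequential = Benchmark.realtime do
    sites.each { |site| fetch.call(site) }
end

# One thread per request: roughly a single delay, as the waits overlap
results  = {}
lock     = Mutex.new
threaded = Benchmark.realtime do
    threads = sites.map do |site|
        Thread.new do
            answer = fetch.call(site)
            # Guard the shared hash so concurrent writes cannot interleave
            lock.synchronize { results[site] = answer }
        end
    end
    threads.each(&:join)
end

printf("sequential: %.2fs, threaded: %.2fs\n", sequential, threaded)
```

Threads pay off here because the script spends most of its time waiting for the network rather than computing.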

Finding the actual XPath is not that easy. Firebug (Firefox) or the Developer Tools in Google Chrome can provide helpful hints, but the XPath is best tested in a small test script, where the Ruby code for the version extraction can be written as well.
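
The extraction code itself can be tried without any network access by feeding it sample strings. The sketch below uses a regular expression instead of the string splitting used in the scripts in this post; the extract_version helper and the sample texts are made up for illustration:

```ruby
# Hypothetical helper: return the first dotted number found in 'text',
# or '' when there is none (the scripts in this post treat '' as "no version found")
def extract_version(text)
    match = text.to_s.match(/\d+(?:\.\d+)+/)
    match ? match[0] : ''
end

# Made-up sample strings, shaped like text a node might contain
samples = [
    '  Version 3.6  ',
    'version 2.10.7 (01/01/2013)',
    'Download now'
]

samples.each do |text|
    puts "#{text.strip.inspect} -> #{extract_version(text).inspect}"
end
```

A pattern like this only fits version strings that contain at least one dot, so it would need adjusting for vendors that use other schemes.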

XPath and version extraction test script (example Metapad):

#!/usr/bin/ruby

require 'rubygems'
require 'mechanize'
require 'nokogiri'

# exit upon control-c
trap("INT") { exit 1 } 

# site data
url   = 'http://liquidninja.com/metapad/download.html'
name  = 'MetaPad'
xpath = '/html/body/table/tr/td/table/tr/td/table/tr/td/table/tr/td'

# and access the site 
agent                  = Mechanize.new
agent.history.max_size = 0 
agent.user_agent_alias = 'Mac Safari'
agent.read_timeout     = 3 
page                   = agent.get(url)
found                  = page.search(xpath)

# this is to make it easier to just "drop" in the source code
# into the "real script".
data = found

# display the "elements" we have hit to see if we are "close" with the xpath
data.each do |node|
    puts "-- NODE --"
    puts node
end
puts "--- OUTPUT ---"

##### version extraction code 

versions = Array.new

data.each  do |node|
   versions.push(node.text) if node.to_s.include?('Version')
end

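# to_s guards against an empty 'versions' array (nil)
# gsub is used instead of gsub! because gsub! returns nil when nothing was replaced
data = versions.first.to_s
data = data.strip.gsub(/Version /, '')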

#### version extraction code end
# assign the result back to 'result' 
result = data

# and print what we have
puts "#{name}: #{result}"
exit 0

 
The sqlite3 database table

sqlite> .schema win32
CREATE TABLE win32 (program varchar(30) PRIMARY KEY, cver TEXT, pver TEXT, err INT default '0');
sqlite> 

Before the script can write to the database, an entry for the program must exist, because the script only issues UPDATE statements. This is an example of how to create an entry:

sqlite> insert into win32 (program) values ('MetaPad');
sqlite> select * from win32 where program='MetaPad';
MetaPad|||0
sqlite> 

 

And here is the final Ruby script that checks three vendor websites and writes to the sqlite3 database:

#!/usr/bin/ruby

require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'sqlite3'
require 'socket'

# exit upon control-c
trap("INT") { exit 1 }

# Setup the database
dbfile  = 'win32.db'
dbtable = 'win32'

# no database, no party
if not File.exist?(dbfile)
    puts "Error, could not find database: #{dbfile}"
    exit 10
end

# connect to the database
database = SQLite3::Database.new(dbfile)

# This hash will store the program name
# and its (scraped) version string
pv = Hash.new

# Using threads to make website scraping concurrent (and thus faster)
mythreads = Array.new

# This is the core routine of the script
# It takes two arguments plus a code block:
#  url:      What URL to access/get HTML from
#  xpath:    Using nokogiri, "extract" only a part of the HTML content
#  block:    The source code to parse the part of the HTML content to obtain
#            the version string

def getversion(url,xpath)
    # Set the default return value to ''
    result = ''

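    # TCP connection check, host/service available?
    # (this assumes a plain http:// URL served on port 80)
    host   = url.split('/')
    host   = host[2]

    # No socket, no access
    # The connection attempt has to happen inside begin/rescue,
    # otherwise an unreachable host raises and ends the thread
    begin
        socket = TCPSocket.open(host, '80')
    rescue
        return result
    end
    socket.close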

    # This sets up a simple web client
    agent                  = Mechanize.new
    agent.history.max_size = 0
    agent.user_agent_alias = 'Mac Safari'
    agent.read_timeout     = 3
    agent.keep_alive       = false

    # Try and download the page
    # If something goes wrong, result will not have changed
    # and thus will be '' which will force the error counter to be increased
    # StandardError also covers timeouts and connection errors,
    # not only HTTP error codes (Mechanize::ResponseCodeError)
    begin
        page                   = agent.get url
    rescue StandardError
        return result
    end

    # Parse the content using the xpath and nokogiri
    # which is accessed via the mechanize method .search 
    found                  = page.search xpath

    # If the array (of nokogiri) objects is not empty
    # execute the provided code block to obtain the version number
    if not found.empty?
    then
        result = yield found if block_given?
    end

    return result
end

###
# Metapad
t = Thread.new {
    url    = 'http://liquidninja.com/metapad/download.html'
    name   = 'MetaPad'
    xpath  = '/html/body/table/tr/td/table/tr/td/table/tr/td/table/tr/td'

# This is the call where url, xpath and some code are supplied to the function
# The code block receives the nokogiri nodes as 'data'; the value of its last
# expression is handed back to the function, which returns it as the result

    result = getversion( url, xpath ) { |data|
                                        versions = Array.new
                                        data.each  do |node|
                                            versions.push(node.text) if node.to_s.include?('Version')
                                        end
                                        data = versions.first.to_s
                                        data = data.strip.gsub(/Version /, '')
                                      }
    pv["#{name}"] = result
}
mythreads.push(t)

###
# Pidgin
t = Thread.new {
    name   = 'Pidgin'
    url    = 'http://developer.pidgin.im/wiki/ChangeLog'
    xpath  = '/html/body/div/div/div/div/div/h2'

# Simpler code to extract the version number
    result = getversion( url, xpath ) { |data|
                                        data = data.first.text
                                        data = data.split(' ')
                                        data = data[1]
                                      }
    pv["#{name}"] = result
}
mythreads.push(t)

###
# Adobe Flash Player
t = Thread.new {
    url    = 'http://www.adobe.com/de/software/flash/about/'
    name   = 'Adobe Flash Plugin'
    xpath  = '/html/body/div/div/div/div/table/tbody/tr/td'

# Simplest version :)
    result = getversion( url, xpath ) { |data| data = data[5].text }
    pv["#{name}"] = result
}
mythreads.push(t)

#### More entries could go here :)

# Collect all threads and wait, if necessary
mythreads.each { |t| t.join }

# Loop through the 'pv' hash and update the cver column for each program
# Bind parameters (the '?' placeholders) are used instead of string
# interpolation, so a quote character in a scraped version string
# cannot break the statement
# v.to_s turns a nil result into '' so the error counter logic below still works
pv.each do |p,v|
    database.execute("update #{dbtable} set cver = ? where program = ?", [v.to_s, p])
end

# The cver column for each program has been updated
# What is left is to query the database and print the results, if there are any

# reset error counter when cver has a value
database.execute("update #{dbtable} set err = 0 WHERE cver != ''")

# Report if the err counter is >= 5
# This means 5 consecutive runs went 'wrong', which requires a check as to what/why.
# For example, if the vendor changes the output/structure of the website...
data = database.execute2("select program AS ProgramName,err AS ErrorRuns from win32 where err >= 5")

# When executing the method execute2 the table header is also returned
# this means, the data array will always contain one element
# It will contain a second (and more) element(s) if actual data is returned
if data[1]
then
    data.each do |line|
        printf("%-20s %-15s\n", line[0], line[1])
    end
end

# Increase the error counter if current version is empty
database.execute("update win32 set err = err  + 1 WHERE cver=''")

# Compare the current version with previous version
# Do not compare when current version is empty
data =  database.execute2("select program AS ProgramName,pver AS OldVersion,cver AS CurrentVersion from win32 where ( cver != pver and cver != '')")

# see remark about execute2 above
if data[1]
then
    data.each do |line|
        printf("%-20s %-15s %-15s\n", line[0], line[1], line[2])
    end
end

# Set the previous version to the value of current version
# Do not set if the current version is empty
database.execute("update win32 SET pver = cver WHERE ( pver != cver and cver != '')")

exit 0
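
The TCP check in getversion splits the URL by hand and always probes port 80. One possible variation, sketched here with Ruby's standard uri library, derives both host and port from the URL; the host_and_port helper and the example.org URLs are assumptions for illustration and are not used by the script above:

```ruby
require 'uri'

# Hypothetical helper: split a URL into the host and port to probe
def host_and_port(url)
    uri = URI.parse(url)
    # uri.port falls back to the scheme default when the URL names no port
    [uri.host, uri.port]
end

p host_and_port('http://liquidninja.com/metapad/download.html')
p host_and_port('https://example.org/downloads')
p host_and_port('http://example.org:8080/')
```

The returned pair could be passed straight to TCPSocket.open, which would let the check work for https and non-standard ports as well.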