Tuesday, May 1, 2007

Natural order String comparison for the Ruby Array class

Sometimes you want to order an array of strings with embedded numbers. But if we use the sort method from the array class we can have some big surprises (some information has been deleted for clarity):

irb>foo = ['foobar', 'a13bb32', 'a10', 'a13bb2c10',
       'a2', 'a3', 'a13bb2c9', 'a13', 'a20', 'a13bb2',
       'a13bb1', 'a13bb10', 'a13bb20', 'a13bb3', 'a13bb01']

irb>puts foo.sort
a10
a13
a13bb01
a13bb1
a13bb10
a13bb2
a13bb20
a13bb2c10
a13bb2c9
a13bb3
a13bb32
a2
a20
a3
foobar


The problem is that the sort method from the Array class cannot "see" the embedded numbers. It only sorts comparing each characters one at a time (usually with the ASCII numbers for the priority), so that 'a13' comes before 'a2' (because '1' comes before '2', the sort method doesn't compare '13' with '2'). Because we can open Ruby classes easily, we can add a new way to sort our strings with embedded numbers:

# --------8<--------
class Array

  # Method which sort an array composed of strings with embedded numbers by
  # the 'natural' representation of numbers inside a string.
  def natcmp
  
    reg_number = /\d+/
  
    # We call the sort method of the Array class.
    self.sort do |str1, str2|
  
      # We try to find an embedded number
      a = str1.match(reg_number)
      b = str2.match(reg_number)
  
      # If there is no number
      if [a,b].include? nil
        str1 <=> str2
      else
        while true
          begin
            # We compare strings before the number. If there
            # are equal, we will have to compare the numbers
            if (comp = a.pre_match <=> b.pre_match) == 0
              # If the numbers are equal
              comp = (a[0] == b[0]) ? comp = a[0] + a.post_match <=> b[0] + b.post_match :
                                      comp = a[0].to_i <=> b[0].to_i
            end
  
            str1, str2 = a.post_match, b.post_match
            a = str1.match(reg_number)
            b = str2.match(reg_number)
          rescue
            break
          end
        end
        comp
      end
    end
  end
  
  # Same as 'natcmp' but replace in place.
  def natcmp!
    self.replace(natcmp)
  end

end
# --------8<--------

So now, if we try our new method, here's the result:

irb>puts foo.natcmp
a2
a3
a10
a13
a13bb1
a13bb01
a13bb2
a13bb2c9
a13bb2c10
a13bb3
a13bb10
a13bb20
a13bb32
a20
foobar


We can even use the 'natcmp!' method if we want to change the array in place. I hope this will help somebody.