Tuesday, May 1, 2007

Natural order String comparison for the Ruby Array class

Sometimes you want to order an array of strings with embedded numbers. But if we use the sort method from the array class we can have some big surprises (some information has been deleted for clarity):

irb>foo = ['foobar', 'a13bb32', 'a10', 'a13bb2c10',
       'a2', 'a3', 'a13bb2c9', 'a13', 'a20', 'a13bb2',
       'a13bb1', 'a13bb10', 'a13bb20', 'a13bb3', 'a13bb01']

irb>puts foo.sort
a10
a13
a13bb01
a13bb1
a13bb10
a13bb2
a13bb20
a13bb2c10
a13bb2c9
a13bb3
a13bb32
a2
a20
a3
foobar


The problem is that the sort method from the Array class cannot "see" the embedded numbers. It only sorts comparing each characters one at a time (usually with the ASCII numbers for the priority), so that 'a13' comes before 'a2' (because '1' comes before '2', the sort method doesn't compare '13' with '2'). Because we can open Ruby classes easily, we can add a new way to sort our strings with embedded numbers:

# --------8<--------
class Array

  # Method which sort an array composed of strings with embedded numbers by
  # the 'natural' representation of numbers inside a string.
  def natcmp
  
    reg_number = /\d+/
  
    # We call the sort method of the Array class.
    self.sort do |str1, str2|
  
      # We try to find an embedded number
      a = str1.match(reg_number)
      b = str2.match(reg_number)
  
      # If there is no number
      if [a,b].include? nil
        str1 <=> str2
      else
        while true
          begin
            # We compare strings before the number. If there
            # are equal, we will have to compare the numbers
            if (comp = a.pre_match <=> b.pre_match) == 0
              # If the numbers are equal
              comp = (a[0] == b[0]) ? comp = a[0] + a.post_match <=> b[0] + b.post_match :
                                      comp = a[0].to_i <=> b[0].to_i
            end
  
            str1, str2 = a.post_match, b.post_match
            a = str1.match(reg_number)
            b = str2.match(reg_number)
          rescue
            break
          end
        end
        comp
      end
    end
  end
  
  # Same as 'natcmp' but replace in place.
  def natcmp!
    self.replace(natcmp)
  end

end
# --------8<--------

So now, if we try our new method, here's the result:

irb>puts foo.natcmp
a2
a3
a10
a13
a13bb1
a13bb01
a13bb2
a13bb2c9
a13bb2c10
a13bb3
a13bb10
a13bb20
a13bb32
a20
foobar


We can even use the 'natcmp!' method if we want to change the array in place. I hope this will help somebody.

6 comments:

Chris Fritz said...

This is great. I need natural order sorting for filenames for a minor task automation script of mine, which I'm moving from multiple bash batch files to one Ruby program, and I was almost ready to write my own function when I found yours. This is perfect!

Thanks for writing this, and for putting it out here for everyone to benefit from!

Chuck Bergeron said...

This proves extremely helpful for a project I'm currently working on, thanks a ton!

How come something like this isn't in Ruby itself?

Unknown said...

thanks, this is exactly what i was looking for!

shisohan said...

Nice
But I've got two issues with this:
a) you name the method "natcmp", this is IMO a misnomer (even you call it "natural sorting" in the text)

b) putting it into Array is the wrong place. This requires the array to be of strings and makes it impossible (unreasonably hard) to sort objects that have a string attribute by it. Also it means that you have to add 2 components: comparison and sorting.
Putting the comparison into String is the better choice for those reasons IMO.

For an example see http://pastie.org/557709
I also added String#natsortkey to allow using it with sort_by.

Robert Reiz said...

Here is a new open source project written in ruby. Called NaturalSorter. Please take a look.
https://github.com/reiz/naturalsorter

Robert Reiz said...

OH. Here again as a real link:

https://github.com/reiz/naturalsorter