oldmoe: Document Matching In Ruby

Document Matching In Ruby

Posted by oldmoe Wednesday, August 06, 2008 8/06/2008 09:55:00 PM

Labels: document matching , information retreival , ruby , tf-idf

With the debut of the new version of meOwns we have introduced several features that are concerned with how objects are related to each other. In the user profile page, you get a list of other user that are similar to him/her. And you are faced with the chemistry meter which tells you how much you are related to this user (if you are logged in of course). If you happen to be viewing your own profile page you will get a list of recommended items that you might be interested in. Last but not least, when you view an item, you get a list of similar items.

In this article I will be talking about the features from an implementation point of view. Naturally the first hurdle was to define the problem. What we needed was to find a way to match items and users. So first we needed to represent them in a way that can be matched. We started from the items first and looked at how to match two items together.

What is an item? In meOwns an item is simply a name, a description, a type and some tags. We decided to ignore photos and comments from the item. Each of those fields gets a weight which affects the value of terms found in it. Here is a sample product (encoded in Yaml):

Name : Fiat Sienna
Type : Car
Description : 1.6 HP, not bad for a sedan, relatively good performance for the price, best sedan i've bought
Tags : Cars Fiat Silver

The above fields are then processed to extract relevant terms from them, this is done in the following manner

Remove punctuation and non alphanumeric characters (replace them with spaces)

Collapse spaces and split the text on them

Match the generated list of terms to a stop word list to remove them (words like "on", "the" should not be considered in the index)

The remaining terms are converted to lower case and then converted to their stem representation (we use a snowball stemmer for now)

The above can be represented as follows:

Attribute : Value
===================
Name : fiat sienna
Type : car
Description : hp sedan performance price sedan buy
Tags : car fiat silver

After doing so, we create a list of terms with their frequencies, each field has a frequency multiplier according to its significance. Assumming a multiplier value of 1 for all the fields

Term : Frequency
=======================
fiat        : 2
car         : 2
sedan       : 2
sienna      : 1
performance : 1
silver      : 1
buy         : 1
hp          : 1

This Term-Frequency vector is the basis of doing item matching in meowns. Several different approaches can be implemented to reach an item representation. Even the details of a given approach can vary significantly. I have chosen to stick to the easiest approach in the initial implementation. Those willing to dig further are free to lookup more into document representation and indexing strategies.

User representation is just an aggregation of their item (and wished item) representations. This way user and items share the same term vector structure and hence we can match users to items as much as we can match items to items and users to users.

The matching process involves further encoding of the term frequency vector. Those in the academia refer to this as TF-IDF (Term Frequency, Inverse Document Frequency) representation. In lay man's terms this is a representation of how significant a term is to the certain document.

It is composed of the product of two parts. This first which is TF (Term Frequency) is simply the frequency of the term in the document divided by the total sum of frequency of terms in the same document.

The second (IDF) is the total number of documents in the corpse (the document store) divided by the number of documents which contain this term. If the term is found in all documents then the IDF will be equal to 1 and hence won't have an effect on the final product of the two parts. On the other hand, if we have a corpus of 1,000,000 documents and only one with the given term then it will multiply the TF value by 1,000,000 which is significant.

A slight variation (widely used) is to multiply the TF with the Logarithmic value of the IDF. After we are done calculating TF-IDF values for all the terms in term vectors matching can be done as follows

Term Vector A vs. Term Vector B = cos a = (A . B) / (|A|.|B|)
A.B = dot product for the two vectors of TF-IDF values 
|A|.|B| = scalar product of the magnitudes of the two vectors

The returned value is called the cosine similarity between the two items. It ranges between 0 and 1 where zero means no correlation and one means exact match. We are still experimenting with threshold values but these are essentially the figures you get when you see another user's chemistry meter for example. In another installment we will discuss how are we implementing behind the scenes matching of users and items in an efficient way.

Just to justify calling the post document matching in Ruby, here's a Ruby code to implement the above

# Monkey patch string to be able to extract terms from any string
Class String
  def to_terms(boost = 1, terms = {})
    # remove all non letters and reject stop words
    terms_list = self.gsub(/(\s|\d|\W)+/u,' ').rstrip.strip.split(' ').reject{|term|$stop_words.include?(term)}
    # transform to a hash with a frequency * boost value
    terms_list.each do|term|
      if terms[term]
        terms[term] = terms[term] + boost
      else
        terms[term] = boost
      end
    end
    terms
  end 
end

#our item class, which we match upon
Class Item 
  def to_terms(terms = {})
    #a hash of attributes to serialize 
    {:name => 10, 
     :description => 3, 
     :type_name => 1, 
     :tag_names => 1}.each_pair do |field, boost| 
      terms = send(field).to_terms(boost, terms)
    end
    terms
  end
end

The above methods enable us to extract the terms from different items. The first method refers to a global (bad me) stop word list which should be present.

Now that we have the items represented as term frequencies we can generate their tf-idf vectors (we will keep them as hashes though) and we can use them to do the matching

Class Item
  def to_tf_idf
    #assume we have a method that returns the df value for any term
    terms = self.to_terms
    total_frequency = terms.values.inject(0){|a,b|a+b}
    terms.each do |term,freq| 
      terms[term]= (freq / total_frequency) * self.df(term)
      magnitude = magnitude + terms[term]**2
    end
    terms, magnitude
  end
  def match(item)
    my_tf_idf,  my_magnitude  = self.to_tf_idf
    his_tf_idf, his_magnitude = item.to_tf_idf
    dot_product = 0
    my_tf_idf.each do |term,tf_idf|
      dot_product = dot_product + tf_idf * his_tf_idf[term] if his_tf_idf[term]
    end
    cosine_similarity = dot_product / (my_magnitude * his_magnitude)
  end
end

Pretty easy, now you get a value between 0 and 1 that represents how similar those two items are.

Comments (9)

Aron said on August 7, 2008 at 2:50 PM
	Outstanding! This is something I have to deal with often. A very useful tip.

Unknown said on August 8, 2008 at 6:36 PM
	Nice write up Muhammad. So you have been ramping up on your startup , meowns, recently?

oldmoe said on August 9, 2008 at 4:17 AM
	Alfred! how are you? hope things are going fine for you. I have been working on meowns yeah, amongst other things. What was keeping you busy lately?

Anonymous said on August 15, 2008 at 9:07 PM
	excellent. I'm doing something similar - as research - but using couchDB to hold the data + ruby-stemmer (access the snowball c implementation from ruby).

Anonymous said on December 25, 2008 at 11:47 AM
	There is a bug : while calculating cosine similarity, the square root of the magnitude shud have been used. => cosine_similarity = dot_product / Math.sqrt(my_magnitude * his_magnitude)

Anonymous said on December 25, 2008 at 11:52 AM

Also, while cleaning the text u have not used the stems of algorithms. It can be implemented by using the Porter Stemmer algorithm provided at http://tartarus.org/~martin/PorterStemmer/ruby.txt

U will have to comment the last few lines of the code (the if loop)and just call the string.stem method here to get the stems of words (ofcourse after including the stemmable.rb file in this code i.e. require 'stemmable')

Anonymous said on December 25, 2008 at 11:54 AM
	I am sorry, in the previous post it shud be "U have not used the stems of words"

oldmoe said on December 25, 2008 at 6:35 PM
	@Himanshu, I forgot both the sqrt and using the stemmer, thanks for pointing that out

Anonymous said on March 24, 2009 at 6:17 AM
	Making hero gold is the old question : Honestly there is no fast way to make lots of hero online gold. Sadly enough a lot of the people that all of a sudden come to with millions of hero online moneyalmost overnight probably duped . Although there are a lot of ways to make lots of hero money here I will tell you all of the ways that I know and what I do to buy hero gold.

Document Matching In Ruby

Comments (9)

About Me

Blog Archive

Followers

Labels

Site Feed

Links

Blog Roll