Building your favourite TV series search engine - Information Retrieval Using BM25 Ranking
One of the naive approaches to ranking documents is TF-IDF.
TF-IDF considers only two elements:
1.) TF - Term frequency
TF(w(i), d(i)) = (count of word w(i) in document d(i)) / (total word count in document d(i))
2.) IDF - Inverse document frequency
IDF(w(i)) = log(total number of documents / number of documents containing the word w(i))
Inverse document frequency gives more weight to words that are less frequent and less weight to words that occur often.
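To make these two quantities concrete, here is a minimal sketch in Python (the toy corpus and function names are my own, not from the original post):

```python
import math

corpus = [
    "machine learning is fun".split(),
    "deep learning works".split(),
    "cooking pasta recipes".split(),
]

def tf(word, doc):
    # count of the word in the document / total word count in the document
    return doc.count(word) / len(doc)

def idf(word, corpus):
    # log(total documents / documents containing the word)
    containing = sum(word in doc for doc in corpus)
    return math.log(len(corpus) / containing)

print(tf("learning", corpus[0]))   # 1/4 = 0.25
print(idf("learning", corpus))     # log(3/2) ≈ 0.41
```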
But what TF-IDF misses is the length of the document.
For example, a million-word document obviously has a better chance of containing all the query words, so it tends to get ranked at the top for almost any set of keywords.
BM25
BM25 ranking solves this issue by introducing a parameter for the relative length of the document.
The formula looks quite complex, but let's break it down:
Score(D, Q) = Σ over all query terms q(i) of: IDF(q(i)) * f(q(i), D) * (k + 1) / (f(q(i), D) + k * (1 - b + b * |D| / avgdl))
The BM25 score for the query “Machine Learning” would be the sum BM25Score(Machine) + BM25Score(Learning).
The first part of the formula is the IDF of the term.
The second part represents the TF of the word, normalized by the length of the document.
f(q(i), D) is the term frequency of the word q(i) in document D.
k and b are parameters that can be tuned; |D| is the length of the document, and avgdl is the average length of all the documents in the database.
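To make the formula concrete, here is the per-term score as a small Python function (a sketch with my own naming; k and b default to the values used in the example below):

```python
def bm25_term_score(tf, idf, doc_len, avgdl, k=1.5, b=0.5):
    # IDF(q) * TF(q,D) * (k + 1) / (TF(q,D) + k * (1 - b + b * |D| / avgdl))
    return idf * (tf * (k + 1)) / (tf + k * (1 - b + b * doc_len / avgdl))
```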
For example, the query “Machine Learning” is scored against two documents in the database, D1 and D2, where D1 has 10 words and D2 has 50 words.
Let’s assume
the TF and IDF of both words are 1, with k = 1.5, b = 0.5, and avgdl = (10 + 50) / 2 = 30 (the average length of our two documents).
Score(Machine, D1) = 1 * [1 * (1.5 + 1)] / [1 + 1.5 * (1 - 0.5 + 0.5 * 10/30)] = 2.5 / 2.0 = 1.25
Score(Learning, D1) = 1 * [1 * (1.5 + 1)] / [1 + 1.5 * (1 - 0.5 + 0.5 * 10/30)] = 2.5 / 2.0 = 1.25
Score(query, D1) = 1.25 + 1.25 = 2.50
Score(Machine, D2) = 1 * [1 * (1.5 + 1)] / [1 + 1.5 * (1 - 0.5 + 0.5 * 50/30)] = 2.5 / 3.0 = 0.83
Score(Learning, D2) = 1 * [1 * (1.5 + 1)] / [1 + 1.5 * (1 - 0.5 + 0.5 * 50/30)] = 2.5 / 3.0 = 0.83
Score(query, D2) = 0.83 + 0.83 = 1.67
Even though the TF and IDF values are the same for both terms in the query, the scores differ purely because of the difference in document length: the shorter document D1 ranks higher.
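You can sanity-check the arithmetic with the bm25_term_score sketch from above:

```python
avgdl = (10 + 50) / 2                        # = 30
d1 = 2 * bm25_term_score(1, 1, 10, avgdl)    # two identical terms
d2 = 2 * bm25_term_score(1, 1, 50, avgdl)
print(round(d1, 2), round(d2, 2))            # 2.5 1.67
```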
Now let’s look into an interesting project.
Ever wanted to rewatch a particular dialogue or scene from Friends but couldn’t find which episode it is from?
Let’s use BM25 to retrieve the episode.
What do we need?
We need to build a database of documents containing all the dialogues from Friends.
How do we do that? How can we get the dialogues?
SUBTITLES!!!
Yes, converting the subtitle files (.srt) into .csv files does the job for you, though a lot of pre-processing work is needed to clean the .srt files.
But don’t worry, I have already done that work for you. The dataset can be obtained at this link -> Dataset
If you are interested in how to create the dataset from scratch, refer to the create_dataset.py script in my Github Repository.
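For a feel of what that script does, here is a rough sketch of the .srt-to-.csv conversion using only the standard library (the folder layout, column names, and cleaning rules are my assumptions; the real logic lives in create_dataset.py):

```python
import csv, glob, re

rows = []
for path in glob.glob("subtitles/*.srt"):      # assumed folder layout
    raw = open(path, encoding="utf-8", errors="ignore").read()
    # drop cue numbers and "00:01:23,456 --> 00:01:25,000" timestamp lines
    raw = re.sub(r"\d+\s*\n\d{2}:\d{2}:\d{2},\d{3} --> .*", "", raw)
    # strip formatting tags such as <i>...</i> that subtitles often carry
    raw = re.sub(r"<[^>]+>", "", raw)
    dialogue = " ".join(line.strip() for line in raw.splitlines() if line.strip())
    rows.append({"episode": path, "dialogue": dialogue})

with open("friends_dialogues.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["episode", "dialogue"])
    writer.writeheader()
    writer.writerows(rows)
```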
Now we have our database of documents ready.
Let’s get started with building our own episode search engine!
rank_bm25 is a wonderful library; its BM25Okapi class makes our work easy.
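Here is a minimal sketch of the script (the CSV file name and column names are assumptions on my part; see the repository for the full version):

```python
from rank_bm25 import BM25Okapi
import pandas as pd
import numpy as np

df = pd.read_csv("friends_dialogues.csv")          # assumed file name
corpus = df["dialogue"].tolist()                   # assumed column name
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

def search(query):
    tokenized_query = query.lower().split()
    scores = bm25.get_scores(tokenized_query)
    top_indexes = np.argsort(scores)[::-1][:5]     # indexes of the top-5 scores
    top_episode = df["episode"].iloc[top_indexes[0]]  # assumed column name
    return top_episode
```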
Lines 1–3 import all the libraries.
Lines 5–7 load the dataset, convert it to a list, and tokenize the corpus.
Line 8, BM25Okapi(tokenized_corpus), builds the scoring index over all the documents in the database.
Lines 10–15 define the search function, which takes the query as an argument, converts it to lower case, and splits it into tokens.
bm25.get_scores(tokenized_query) then calculates a score against each document (episode); we take the indexes of the five highest-scoring documents and return the top episode.
To deploy this as a Flask app, refer to my app.py script in the Github Repository.
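A minimal Flask wrapper around the search function could look like this (the route name and response shape are my own choices, not necessarily what app.py does):

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/search")
def search_endpoint():
    # e.g. GET /search?q=we+were+on+a+break
    query = request.args.get("q", "")
    return {"episode": search(query)}  # search() from the script above

if __name__ == "__main__":
    app.run(debug=True)
```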
Thanks for reading!
Visit my LinkedIn Post and Github Repository.