If you have some song lyrics and you want to know what type of song it is (pop, jazz, rock, etc.), what do we do? Or you have a bundle of emails and you want to know which of them are spam and which are not. Yup! We have an idea. The Naive Bayes classifier is a very popular algorithm that can be used for text classification.
Here I would like to explain the Naive Bayes classifier practically, from the perspective of a Python developer. The theoretical side of Naive Bayes is explained here.
Here I take a bunch of song lyrics from a website. This website has lots of song lyrics belonging to different categories, so we select the artist category, find an artist from a genre, and download all the lyrics available for that artist on the website.
We should be aware of the amount of data: the more data we have, the more accurate our predictions become, so we want a large amount of training data for our classifier. It is not a good idea to copy each lyric from the website by hand and paste it into a file named after the song; that is tedious and error-prone. Instead, we use web scraping to download all the song lyrics of an artist. I built it so that, given the link to an artist's page as input, it first creates a directory named after the artist, then downloads the lyrics one by one and saves each as a separate file named after its song.
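A rough sketch of that scraper could look like this. The CSS selectors and the example URL are placeholders (the real ones depend on the lyrics website's HTML), so treat `scrape_artist` as pseudocode; the file-saving helpers, though, work as written:

```python
import os
import re


def safe_filename(song_title):
    """Turn a song title into a file name that is safe on disk."""
    cleaned = re.sub(r"[^\w\- ]", "", song_title).strip()
    return cleaned + ".txt"


def save_lyric(artist_dir, song_title, lyric_text):
    """Create the artist directory if needed and save one lyric
    as a separate file named after the song."""
    os.makedirs(artist_dir, exist_ok=True)
    path = os.path.join(artist_dir, safe_filename(song_title))
    with open(path, "w", encoding="utf-8") as f:
        f.write(lyric_text)
    return path


def scrape_artist(artist_url, artist_dir):
    """Download every lyric linked from an artist page.
    The selectors below are assumptions about the site's markup."""
    import requests
    from bs4 import BeautifulSoup

    page = BeautifulSoup(requests.get(artist_url).text, "html.parser")
    for link in page.select("a.song-link"):                    # assumed selector
        song_page = BeautifulSoup(requests.get(link["href"]).text, "html.parser")
        lyric = song_page.select_one("div.lyrics").get_text()  # assumed selector
        save_lyric(artist_dir, link.get_text(), lyric)
```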
Now we have our data in directories (each directory is named after an artist and contains many files of song lyrics, each file named after the song as it appears on the website), and it is time to build the classifier. In this case, though, we don't even have numeric features. We just have text. We need to somehow convert this text into numbers that we can do calculations on.
So what do we do? Simple! We use word frequencies. For that we want a cleaned set of data, so first we clean our text by removing extra spaces, special characters, and so on. Then we split this cleaned string into a list of words.
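The cleaning step can be sketched in a few lines. This is one reasonable choice of cleaning rules (lowercase everything, keep only letters and apostrophes), not the only one:

```python
import re


def to_word_list(text):
    """Lowercase the text, replace special characters with spaces,
    and split the result into a list of words."""
    cleaned = re.sub(r"[^a-z' ]", " ", text.lower())
    return cleaned.split()
```

For example, `to_word_list("Baby, baby, baby ooh!")` gives `["baby", "baby", "baby", "ooh"]`.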
Now we play with Bayes' theorem and the naive part. Suppose we have 100 lyrics by Justin Bieber, the famous pop artist. We use about 95% of the lyrics as training data for the classifier and keep only the remaining 5% as test data. For that, we take some files from the artist directories and put them into another directory called test data.
Let's go through an example to make this more concrete. We take Justin Bieber from pop and Metallica from metal, so we have training data for both categories, kept in separate directories, plus some test data.
Let's take a test file (a file from the test data directory); we don't know which genre it belongs to. How can we find out? OK, no problem. Put that file in your pocket and let's do something.
First, we compute the prior probability of each genre by counting the number of files in each directory and dividing by the total number of files in all directories. (For example, if there are 90 files in the metal directory and 50 files in the pop directory, then the probability of metal is 90 / 140.)
p(metal) = no. of files in the metal directory / total no. of files in both directories

Do the same for the other genre.
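As a sketch, the priors can be computed straight from the directory listing. The `genre_dirs` mapping is an assumed layout (genre name to directory path):

```python
import os


def genre_priors(genre_dirs):
    """Compute p(genre) for each genre as
    (files in that genre's directory) / (total files in all directories).
    `genre_dirs` maps a genre name to its directory path."""
    counts = {genre: len(os.listdir(path)) for genre, path in genre_dirs.items()}
    total = sum(counts.values())
    return {genre: count / total for genre, count in counts.items()}
```

With 90 metal files and 50 pop files, this returns `{"metal": 90/140, "pop": 50/140}`.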
Then take the test file out of your pocket, convert its text into a list of words, and put it back in your pocket.
OK, now you have a list of words in your pocket and the prior probability of each genre. Next, take the metal directory and convert the contents of every file in it into one big list of words, counting the total number of words in this list. Take the first word from the test data's word list and check how many times it appears in the list of training words. Now calculate:
p(first word | metal) = no. of occurrences of the word in the training list / total words in the training list
Then do the same for each and every word in the test data's list. (It is a good idea to remove duplicate words from the test data's list to save time and processing power.) After that, multiply all of those results together, along with the prior probability of the corresponding genre:
p(first word | metal) x p(second word | metal) x ... x p(metal)
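A tiny worked example of this multiplication, with a made-up four-word training list for the metal genre, also shows where it breaks down for unseen words:

```python
def word_prob(word, trained_words):
    """Unsmoothed p(word | genre): occurrences of the word in the
    genre's training words / total training words."""
    return trained_words.count(word) / len(trained_words)


# Toy training list for the "metal" genre (made up for illustration).
metal_words = ["fire", "night", "fire", "steel"]

score = 1.0
for word in ["fire", "night"]:
    score *= word_prob(word, metal_words)
# score = 2/4 * 1/4 = 0.125; multiplying by p(metal) finishes the score.

# But a word never seen in training, like "love", has probability 0,
# which would wipe out the whole product.
unseen = word_prob("love", metal_words)
```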
However, we run into a problem here: some words in the test data don't appear in the list of training words, so their count is 0. This is rather inconvenient, since we are going to multiply that with the other probabilities, and in a multiplication, if one of the terms is zero, the whole product is zero. Doing things this way gives us no information at all, so we have to find a way around it.
How do we do it? By using something called Laplace smoothing: we add 1 to every count so it is never zero. To balance this, we add the number of possible words to the divisor, so the result will never be greater than 1. The number of possible words is the number of distinct words in the training and test data combined (avoiding repetition).
So the equation becomes:

p(first word | genre) = (no. of occurrences of the word + 1) / (total words in the training data list + no. of possible words)
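The smoothed version of the earlier helper is a one-line change; the vocabulary size is passed in, since it depends on both the training and test data:

```python
def smoothed_word_prob(word, trained_words, vocab_size):
    """Laplace-smoothed p(word | genre): add 1 to the count and add
    the number of possible words to the divisor, so the result is
    never zero and never greater than 1."""
    return (trained_words.count(word) + 1) / (len(trained_words) + vocab_size)
```

With the toy metal list of 4 words and, say, 5 possible words, the unseen word "love" now gets (0 + 1) / (4 + 5) = 1/9 instead of 0.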
Now we just multiply all the probabilities,
p(test data | metal) = p(first word | metal) x p(second word | metal) x.....x p(metal)
Now repeat the above steps for pop. You have p(test data | metal) and p(test data | pop); find which is higher. If pop has the higher probability, the test data belongs to the pop genre; otherwise it belongs to metal. Do the same steps for the other files in the test data.
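Putting the whole procedure together, a minimal classifier could look like the sketch below. One practical tweak beyond the steps described above: it sums log-probabilities instead of multiplying raw probabilities, because multiplying hundreds of tiny numbers underflows to 0.0 in floating point (the genre with the highest sum of logs is the same one that would have the highest product):

```python
import math


def classify(test_words, training, priors):
    """Return the genre with the highest Naive Bayes score.
    `training` maps genre -> list of training words,
    `priors` maps genre -> p(genre)."""
    # Possible words: everything seen in training plus the test data.
    vocab = set(test_words)
    for words in training.values():
        vocab.update(words)

    best_genre, best_score = None, float("-inf")
    for genre, words in training.items():
        score = math.log(priors[genre])
        for word in set(test_words):  # deduplicated, as described above
            # Laplace-smoothed p(word | genre).
            p = (words.count(word) + 1) / (len(words) + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_genre, best_score = genre, score
    return best_genre
```

On toy data, `classify(["baby", "love"], {"metal": ["fire", "steel", "fire", "night"], "pop": ["baby", "love", "baby", "ooh"]}, {"metal": 0.5, "pop": 0.5})` picks `"pop"`.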
This is the simple idea behind the Naive Bayes classifier. I built a simple implementation of a song lyrics classifier using Naive Bayes, and here is the GitHub link.