This guest post is a tutorial on detecting YouTube comment spam by Dr. Joshua Eckroth, who holds a Ph.D. in AI and cognitive science from Ohio State University and is an assistant professor of computer science at Stetson University, where he teaches big data mining and analytics, artificial intelligence (AI), and software engineering.

In this article, you’ll look at a technique for detecting YouTube comment spam using bags of words and random forests. The dataset is straightforward: it contains about 2,000 comments from popular YouTube videos. Each row holds a comment followed by a class value of 1 for spam or 0 for not spam.

First, import a single file. The dataset is actually split into five different files, one per video. Your first set of comments comes from the PSY Gangnam Style video.

Then print a few comments as follows:
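A minimal sketch of loading one file with pandas and printing a few rows. The file name is not given here, so the example reads a small in-memory CSV instead. The rows are made up, and the column names COMMENT_ID, AUTHOR, DATE, and the upper-case spelling of CLASS are assumptions; only CONTENT appears as an identifier in this article.

```python
import io

import pandas as pd

# Made-up rows standing in for one of the dataset's CSV files.
# With the real data, pass the file path to pd.read_csv instead.
csv_text = io.StringIO(
    "COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS\n"
    "c1,alice,2013-11-07,this song never gets old,0\n"
    "c2,bob,2013-11-08,subscribe to me for call of duty vids,1\n"
    "c3,carol,2013-11-08,best video ever,0\n"
    "c4,dave,2013-11-09,please download my android photo editor,1\n"
)
d = pd.read_csv(csv_text)

# Print the last few rows, keeping only the two columns used later.
few_rows = d[["CONTENT", "CLASS"]].tail(3)
print(few_rows)
```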

The data has more than two columns, but you only need the content and class columns. The content column contains the comments, and the class column contains 1 for spam or 0 for not spam.

For example, the first two comments are marked as not spam, but the comment “subscribe to me for call of duty vids” is spam, and “hi guys please my android photo editor download yada yada” is spam as well. Before you start classifying comments, count how many rows in the dataset are spam and how many are not. The counts are 175 and 175, respectively, for a total of 350 rows in this file:
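One way to count the rows per class, sketched on a tiny made-up DataFrame. The column names CONTENT and CLASS are assumed to match the loaded file, and the variable names are illustrative.

```python
import pandas as pd

# Made-up rows; CONTENT and CLASS mirror the columns described in the text.
d = pd.DataFrame({
    "CONTENT": [
        "this song never gets old",
        "subscribe to me for call of duty vids",
        "best video ever",
        "please download my android photo editor",
    ],
    "CLASS": [0, 1, 0, 1],
})

# Filter by class value and count the matching rows.
n_spam = len(d[d["CLASS"] == 1])
n_not_spam = len(d[d["CLASS"] == 0])
print(n_spam, n_not_spam)
```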



In scikit-learn, the bag of words technique is implemented by CountVectorizer, which counts the number of times each word appears and puts those counts in a vector. To create the vectors, make a CountVectorizer object and then perform the fit and the transform together:

Under the hood, this is performed in two different steps.

  1. The fit step discovers the words present in the dataset.
  2. The transform step gives you the bag of words matrix for those phrases. The resulting matrix is 350 rows by 1,418 columns.

The 350 rows correspond to the 350 comments, and the 1,418 columns correspond to the distinct words that appear across all of those comments.
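The fit and transform steps described above can be sketched on four made-up comments. The variable names are illustrative, and the resulting shape belongs to this toy data, not to the real file.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up comments standing in for the CONTENT column.
comments = [
    "this song never gets old",
    "subscribe to me for call of duty vids",
    "best video ever",
    "please download my android photo editor",
]

# fit_transform learns the vocabulary and builds the count matrix in one call.
vectorizer = CountVectorizer()
dvec = vectorizer.fit_transform(comments)

# The same result in two explicit steps: fit, then transform.
dvec_two_steps = CountVectorizer().fit(comments).transform(comments)

print(dvec.shape)  # (number of comments, number of distinct words)
```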

Now print a single comment and then run the analyzer on that comment to see how it is broken apart into words.
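A small sketch of running the analyzer on one comment. The comment is made up; build_analyzer returns the function CountVectorizer uses to lowercase and tokenize text, which by default keeps only tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

comment = "Subscribe to me for Call of Duty vids!"
print(comment)

# The analyzer applies the vectorizer's preprocessing and tokenization.
analyze = CountVectorizer().build_analyzer()
tokens = analyze(comment)
print(tokens)
```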

You can ask the vectorizer which words it found in the dataset. The resulting list starts with numbers and ends with regular words.
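One way to list the learned words, sketched on made-up comments. This version reads the fitted vocabulary_ mapping, which should work across scikit-learn versions; recent releases also offer get_feature_names_out() for the same purpose.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up comments, one of which contains a number.
comments = [
    "2 billion views in 2015",
    "best video ever",
    "subscribe to my channel",
]

vectorizer = CountVectorizer()
vectorizer.fit(comments)

# vocabulary_ maps each word to its column index; sorting the words
# reproduces the column order, with digits ahead of letters.
words = sorted(vectorizer.vocabulary_)
print(words)
```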

Next, shuffle the dataset by sampling a fraction of 100% of its rows, that is, by passing frac=1.

Now split the dataset into training and testing sets. The first 300 rows will be used for training, and the remaining 50 for testing:
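The shuffle and split can be sketched as follows on ten made-up rows, using an 8/2 split in place of 300/50. The random_state argument is an assumption added so the shuffle is repeatable.

```python
import pandas as pd

# Ten made-up rows standing in for the 350-row file.
d = pd.DataFrame({
    "CONTENT": [f"comment number {i}" for i in range(10)],
    "CLASS": [i % 2 for i in range(10)],
})

# frac=1 samples 100% of the rows in random order, i.e. a shuffle.
dshuf = d.sample(frac=1, random_state=0)

# First 8 shuffled rows for training, the rest for testing.
d_train = dshuf.iloc[:8]
d_test = dshuf.iloc[8:]
print(len(d_train), len(d_test))
```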

Calling vectorizer.fit_transform(d_train['CONTENT']) on the training set is an important step. The fit transform learns the words and also produces the matrix. For the testing set, however, you don’t perform a fit transform again, since you don’t want the vectorizer to learn different words from the testing data. Instead, you use the same words that it learned on the training set.

Suppose the testing set contains words that never appeared in the training set. That’s perfectly fine; those words are simply ignored.

Whether you build a random forest, a decision tree, or any other model from the training set, it is built on a certain set of words, and the testing set has to use those same words. You cannot introduce new words in the testing set, since the model would not know what to do with them.

Now perform the transform on the testing set; the answers (the class labels) are used later for training and testing. The training set now has 300 rows and 1,287 different words, or columns, and the testing set has 50 rows with the same 1,287 columns:
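A sketch of the difference between fit_transform on the training comments and transform on the testing comments. The comments are made up, and the shapes belong to this toy data only.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_comments = [
    "this song never gets old",
    "subscribe to me for call of duty vids",
    "best video ever",
]
test_comments = [
    "best song ever",             # every word was seen in training
    "totally unseen words here",  # no word was seen in training
]

vectorizer = CountVectorizer()
# Learn the vocabulary from the training comments only.
X_train = vectorizer.fit_transform(train_comments)
# Reuse that vocabulary for the testing comments; unknown words are dropped.
X_test = vectorizer.transform(test_comments)

print(X_train.shape, X_test.shape)
```

Both matrices end up with the same number of columns, and the second test comment becomes an all-zero row because none of its words are in the training vocabulary.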

Even though the testing set contains different words, it must be transformed in the same way as the training set, with the same columns. Now build the random forest classifier. Create a forest of 80 trees and fit it to the training set so that you can score its performance on the testing set:
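A minimal sketch of training and scoring an 80-tree forest. The comments and labels are made up, so the score says nothing about the real dataset, and random_state is an assumption added for repeatability.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training data: 0 = not spam, 1 = spam.
train_comments = [
    "this song never gets old",
    "best video ever made",
    "i love this song so much",
    "the dance in this video is great",
    "still listening to this every day",
    "subscribe to me for call of duty vids",
    "please download my android photo editor",
    "check out my channel and subscribe",
    "free gift cards visit my website now",
    "click the link to win a free phone",
]
train_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
test_comments = ["this beat is amazing", "please subscribe to my channel for free stuff"]
test_labels = [0, 1]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_comments)
X_test = vectorizer.transform(test_comments)

# n_estimators is the number of trees in the forest.
clf = RandomForestClassifier(n_estimators=80, random_state=0)
clf.fit(X_train, train_labels)

# score returns the fraction of test comments classified correctly.
accuracy = clf.score(X_test, test_labels)
print(accuracy)
```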

The score is 98%, which is really good; the classifier confused spam and not-spam on only a small fraction of the test comments. To be more confident that the accuracy really is that high, you can perform a cross-validation with five different splits. Cross-validation uses all of the training data: in each of the five rounds, 20% of the data is held out as testing data and the remaining 80% is used as training data:
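Five-fold cross-validation can be sketched with cross_val_score, again on made-up comments. The scores from this toy data are not meaningful; the point is the shape of the call and the averaging of the five results.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

ham = [
    "this song never gets old",
    "best video ever made",
    "i love this song so much",
    "the dance in this video is great",
    "still listening to this every day",
    "this beat is amazing",
]
spam = [
    "subscribe to me for call of duty vids",
    "please download my android photo editor",
    "check out my channel and subscribe",
    "free gift cards visit my website now",
    "click the link to win a free phone",
    "please subscribe to my channel for free stuff",
]
comments = ham + spam
labels = [0] * len(ham) + [1] * len(spam)

X = CountVectorizer().fit_transform(comments)
clf = RandomForestClassifier(n_estimators=80, random_state=0)

# cv=5 produces five train/test splits and returns one accuracy per split.
scores = cross_val_score(clf, X, labels, cv=5)
mean_accuracy = scores.mean()
print(scores, mean_accuracy)
```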

Averaging the scores you just obtained gives about 95% accuracy. Now look at all of the data rather than a single file.

The entire dataset covers five different videos, which means that altogether you have around 2,000 rows. Counting across all of the comments gives 1,005 spam comments and 951 not-spam comments, which is close enough to an even split:

Now shuffle the entire dataset and separate the comments and answers:
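A sketch of combining several per-video tables, shuffling them, and separating the comments from the answers. The two small DataFrames are made-up stand-ins for the real files, and ignore_index and random_state are assumptions of this example.

```python
import pandas as pd

# Made-up stand-ins for two of the per-video files.
video_a = pd.DataFrame({
    "CONTENT": ["this song never gets old", "subscribe to me for call of duty vids"],
    "CLASS": [0, 1],
})
video_b = pd.DataFrame({
    "CONTENT": ["best video ever", "please download my android photo editor"],
    "CLASS": [0, 1],
})

# Stack the tables into one dataset.
d = pd.concat([video_a, video_b], ignore_index=True)

# Shuffle all rows, then split into comments and answers.
dshuf = d.sample(frac=1, random_state=0)
d_content = dshuf["CONTENT"]
d_label = dshuf["CLASS"]
print(len(d_content), len(d_label))
```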

You need to perform a couple of steps here: CountVectorizer followed by the random forest. For this, use a feature in scikit-learn called a pipeline. A pipeline is convenient because it brings together two or more steps so that they are treated as one.

Build a pipeline whose first step is the bag of words (CountVectorizer) and whose second step is the random forest classifier. Then print the pipeline’s steps:
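A minimal sketch of such a pipeline with explicitly named steps. The step names used here are hypothetical labels chosen for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Each step is a (name, estimator) pair; the names are made up here.
pl = Pipeline([
    ("bag-of-words", CountVectorizer()),
    ("random-forest", RandomForestClassifier()),
])

step_names = [name for name, _ in pl.steps]
print(pl.steps)
```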

You can also let scikit-learn name each step for you by passing CountVectorizer and RandomForestClassifier to make_pipeline; it names the steps countvectorizer and randomforestclassifier:
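Automatic step naming can be sketched with make_pipeline, which derives each step name from the lower-cased class name.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# No names are given; make_pipeline generates them from the class names.
pl = make_pipeline(CountVectorizer(), RandomForestClassifier())

step_names = [name for name, _ in pl.steps]
print(step_names)
```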

Once the pipeline is created, you can just call fit on it, and it will do the rest: first it performs the fit and transform with the CountVectorizer, and then a fit with the RandomForestClassifier. That’s the benefit of having a pipeline:

Now you can call score. The pipeline knows that, when scoring, it has to run the data through the bag of words CountVectorizer and then predict with the RandomForestClassifier:
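Fitting and scoring through the pipeline can be sketched as below. The comments are made up, so the score is only illustrative, and random_state is an assumption added for repeatability. Note that the pipeline takes raw text, not a count matrix.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

train_comments = [
    "this song never gets old",
    "best video ever made",
    "i love this song so much",
    "subscribe to me for call of duty vids",
    "please download my android photo editor",
    "check out my channel and subscribe",
]
train_labels = [0, 0, 0, 1, 1, 1]
test_comments = ["this beat is amazing", "please subscribe to my channel for free stuff"]
test_labels = [0, 1]

pl = make_pipeline(CountVectorizer(), RandomForestClassifier(random_state=0))

# fit: fit_transform in CountVectorizer, then fit in the random forest.
pl.fit(train_comments, train_labels)

# score: transform in CountVectorizer, then predict and compare with the labels.
accuracy = pl.score(test_comments, test_labels)
print(accuracy)
```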

This whole procedure produces a score of about 94%. You can also use the pipeline to predict a single example. For example, imagine you have a new comment after the classifier has been trained, and you want to know whether it’s a genuine comment that a user has just typed or whether it’s spam:

It’s detected correctly; but what about the following comment:

You could deploy this classifier into an environment and predict whether each new comment is spam or not as soon as someone types it. You can also use your pipeline in cross-validation to estimate how accurate it is. In this case, the average accuracy was about 94%:
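Predicting a single new comment and cross-validating the whole pipeline can be sketched as follows. The training comments and the new comment are made up, so the outputs are only illustrative; random_state is an assumption added for repeatability.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

ham = [
    "this song never gets old",
    "best video ever made",
    "i love this song so much",
    "the dance in this video is great",
    "still listening to this every day",
    "this beat is amazing",
]
spam = [
    "subscribe to me for call of duty vids",
    "please download my android photo editor",
    "check out my channel and subscribe",
    "free gift cards visit my website now",
    "click the link to win a free phone",
    "please subscribe to my channel for free stuff",
]
comments = ham + spam
labels = [0] * len(ham) + [1] * len(spam)

pl = make_pipeline(CountVectorizer(), RandomForestClassifier(random_state=0))
pl.fit(comments, labels)

# predict expects a list of texts, even for a single comment.
prediction = pl.predict(["what a neat video"])
print(prediction)

# Cross-validating the pipeline refits the vectorizer inside every split.
scores = cross_val_score(pl, comments, labels, cv=5)
print(scores.mean())
```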

That’s pretty good. Now add TF-IDF to your model to try to make it more accurate. The TF-IDF step is placed after CountVectorizer.

After you have produced the counts, you can then produce a TF-IDF score from these counts. Now add this step to the pipeline and perform another cross-validation check of the accuracy:

This shows the steps in the pipeline:
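A sketch of a three-step pipeline with TF-IDF between the counts and the forest. It uses TfidfTransformer with default settings, which is an assumption about the configuration; the comments are made up and the score is only illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

ham = [
    "this song never gets old",
    "best video ever made",
    "i love this song so much",
    "the dance in this video is great",
    "still listening to this every day",
    "this beat is amazing",
]
spam = [
    "subscribe to me for call of duty vids",
    "please download my android photo editor",
    "check out my channel and subscribe",
    "free gift cards visit my website now",
    "click the link to win a free phone",
    "please subscribe to my channel for free stuff",
]
comments = ham + spam
labels = [0] * len(ham) + [1] * len(spam)

# Counts first, then TF-IDF weighting of those counts, then the classifier.
pl2 = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(),
    RandomForestClassifier(random_state=0),
)
step_names = [name for name, _ in pl2.steps]
print(step_names)

scores = cross_val_score(pl2, comments, labels, cv=3)
print(scores.mean())
```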

The output lists CountVectorizer, a TF-IDF transformer, and RandomForestClassifier. Each step has parameters you can tune. With CountVectorizer, you can lowercase the text or leave its case as is, and it is up to you to decide how many words to keep. You can use single words, bigrams (pairs of words), or trigrams (triples of words). You can also remove stop words, which are very common English words such as and, or, and the. With TF-IDF, you can turn off the idf component and keep just the tf component, which is based on the word counts alone, or you can use idf as well. With random forests, you choose how many trees to use, which is the number of estimators.

scikit-learn has another feature that lets you search over all of these parameters and find out which values work best.

You make a small dictionary in which each key gives the name of a pipeline step followed by the name of a parameter (joined by two underscores), and each value lists the options to try. For demonstration, try limiting the vocabulary to a maximum of 1,000 or 2,000 words.

For n-grams, you can try just single words or pairs of words. For stop words, you can either use the English dictionary of stop words or use no stop words at all; in the first case the common words are removed, and in the second they are kept. For TF-IDF, you can set the use of idf to yes or no. For the random forest, you can try 20, 50, or 100 trees. With these options, you can perform a grid search, which runs through all of the combinations of parameters and finds the best one. Give the grid search your second pipeline, the one that includes TF-IDF, and use fit to perform the search.

Since there are a large number of words, the search takes a little while, around 40 seconds, and ultimately finds the best parameters. You can get the best parameters out of the grid search and print them, along with the score:
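The parameter dictionary and grid search described above can be sketched as follows. The comments are made up, cv=2 and random_state are assumptions chosen to keep this toy run quick and repeatable, and the exact option values (for example, (1, 2) for pairs of words) are one plausible reading of the description, so the best parameters found here say nothing about the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

ham = [
    "this song never gets old",
    "best video ever made",
    "i love this song so much",
    "the dance in this video is great",
    "still listening to this every day",
    "this beat is amazing",
]
spam = [
    "subscribe to me for call of duty vids",
    "please download my android photo editor",
    "check out my channel and subscribe",
    "free gift cards visit my website now",
    "click the link to win a free phone",
    "please subscribe to my channel for free stuff",
]
comments = ham + spam
labels = [0] * len(ham) + [1] * len(spam)

pl2 = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(),
    RandomForestClassifier(random_state=0),
)

# Keys are "<step name>__<parameter name>"; values are the options to try.
parameters = {
    "countvectorizer__max_features": (1000, 2000),
    "countvectorizer__ngram_range": ((1, 1), (1, 2)),
    "countvectorizer__stop_words": ("english", None),
    "tfidftransformer__use_idf": (True, False),
    "randomforestclassifier__n_estimators": (20, 50, 100),
}

# Every combination is cross-validated; the best one is refit at the end.
grid_search = GridSearchCV(pl2, parameters, cv=2)
grid_search.fit(comments, labels)

print(grid_search.best_score_)
for name in sorted(parameters):
    print(name, grid_search.best_params_[name])
```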

So, you got an accuracy of around 96%. The best combination used around 1,000 words, only single words, stop-word removal, 100 trees in the random forest, and the IDF component in the TF-IDF computation. Here you’ve demonstrated not only bag of words, TF-IDF, and random forests, but also the pipeline feature and the parameter search feature known as grid search.

If you found this article interesting, you can explore Dr. Joshua Eckroth’s Python Artificial Intelligence Projects for Beginners to build smart applications by implementing real-world artificial intelligence projects. This book demonstrates AI projects in Python, covering modern techniques that make up the world of artificial intelligence.