Text classification and feature union with DataFrameMapper in Python

A while ago, I submitted a machine learning exercise to predict fraudulent items from several input features, including the item description (text), the number of likes, the number of shares, and so on. I therefore needed an algorithm that could combine text data with categorical and continuous variables.

Step 1 – Starting with text data: Text feature extraction

There are several ways to do text feature extraction in Python, such as scikit-learn's CountVectorizer and TfidfTransformer, which transform text data into numerical features that can be used for machine learning.
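To make the relationship between these tools concrete, here is a minimal sketch on a made-up three-sentence corpus (the sentences are illustrative and not taken from the exercise data). It assumes default settings throughout and shows that CountVectorizer followed by TfidfTransformer produces the same matrix as TfidfVectorizer on its own.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus, made up for illustration only
docs = [
    "brand new phone for sale",
    "cheap phone cheap price",
    "genuine leather bag for sale",
]

# Step A: raw occurrence counts, one column per distinct word
counts = CountVectorizer().fit_transform(docs)

# Step B: re-weight the counts with tf-idf
two_step = TfidfTransformer().fit_transform(counts)

# TfidfVectorizer performs both steps in a single object
one_step = TfidfVectorizer().fit_transform(docs)

print(counts.shape)  # (number of documents, vocabulary size)
print(np.allclose(two_step.toarray(), one_step.toarray()))
```

In practice this means the one-step vectorizer is usually the more convenient choice when the raw counts are not needed separately.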

After researching the different text analysis libraries available in Python, I decided to use TfidfVectorizer to perform the text feature extraction needed for the algorithm.

Since Naive Bayes is known to be an effective classifier for text data, I started the text classification with a single feature: the description field.

# CountVectorizer gives the occurrence count of each word.
# However, raw counts do not account for word importance.
# tf-idf addresses this by downscaling the score of words that
# appear in many documents, which gives more weight to words that
# are informative but occur in only a small portion of the data.
# For the description column, we will use the TfidfVectorizer.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn_pandas import DataFrameMapper

categories = data['index']
desc = data['description'].fillna('')

vectoriser = TfidfVectorizer()
features = vectoriser.fit_transform(desc)

features.shape
#(35182, 66023)

x, x_test, y, y_test = train_test_split(features, categories, test_size=0.2, train_size=0.8, random_state=0)

clf = MultinomialNB().fit(x, y)
predicted = clf.predict(x_test)

def printreport(exp, pred):
    print(pd.crosstab(exp, pred, rownames=['Actual'], colnames=['Predicted']))

    print('\n \n')
    print(classification_report(exp, pred))

printreport(y_test, predicted)

 

Step 2 – Combining text with other features using DataFrameMapper

Using Naive Bayes with only one feature gave average results, so I decided to add non-text features to the model.

To use the transformed text data (which is now a matrix) together with other features, whether categorical or numerical, I found a useful package called sklearn-pandas. Its DataFrameMapper maps columns of the original pandas DataFrame to transformations and returns a single feature array that can be passed to the ML algorithm.
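Conceptually, combining features this way means transforming each column on its own and then stacking the results side by side. The sketch below illustrates that idea by hand with scipy, without sklearn-pandas; it is not how the package is implemented, and the toy DataFrame is hypothetical (only the column names mirror the ones used in this post).

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy frame; the values are made up for illustration
toy = pd.DataFrame({
    "description": ["brand new phone", "cheap phone", "leather bag"],
    "nb_like": [10, 0, 3],
    "nb_share": [2, 1, 0],
})

# Text column -> sparse tf-idf matrix, one column per vocabulary word
text_features = TfidfVectorizer().fit_transform(toy["description"])

# Numeric columns are passed through unchanged
numeric_features = csr_matrix(toy[["nb_like", "nb_share"]].to_numpy(dtype=float))

# Stack the blocks side by side: one row per item, all features in one matrix
combined = hstack([text_features, numeric_features]).tocsr()

print(text_features.shape, numeric_features.shape, combined.shape)
```

The number of rows stays the same, while the column count is the vocabulary size plus the number of pass-through columns.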

The sklearn-pandas package needs to be installed first, which can be done with the command below:

pip install sklearn-pandas

So how can we use it to combine transformed text data with other features?

Below is a short snippet showing how this can be done:


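# Fill missing text with empty strings; only the text columns are touched
# so that the numeric columns keep their numeric dtype
text_cols = ['description', 'picture_labels']
data[text_cols] = data[text_cols].fillna('')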

# Add the features of the dataframe that you want to transform and/or combine
# (None means the column is passed through without any transformation)
mapper = DataFrameMapper([
    ('description', TfidfVectorizer()),
    ('nb_like', None),
    ('picture_labels', TfidfVectorizer()),
    ('nb_share', None),
])

# Use the fit_transform method to transform the old dataframe into a new
# feature array that can be fed to the machine learning algorithm.

# Sample usage
features = mapper.fit_transform(data)
categories = data['INDEX New']

# Split the data between train and test
x, x_test, y, y_test = train_test_split(features, categories, test_size=0.2, train_size=0.8, random_state=0)

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)
clf.fit(x, y)

predicted = clf.predict(x_test)

printreport(y_test, predicted)

Another alternative would be to use the Pipeline class available in scikit-learn, in combination with FeatureUnion. Here’s a great tutorial that describes how they can be used.
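As a rough idea of what that alternative can look like, here is a minimal sketch using only scikit-learn. The toy data, the labels and the two helper functions select_text and select_numeric are hypothetical and exist only for illustration; this is not the code from the exercise.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data; values and labels are made up for illustration
toy = pd.DataFrame({
    "description": ["brand new phone", "cheap phone", "leather bag", "free money now"],
    "nb_like": [10, 0, 3, 0],
    "nb_share": [2, 1, 0, 0],
})
labels = [0, 1, 0, 1]


def select_text(df):
    # Hypothetical helper: pick the text column as a 1-D series of strings
    return df["description"]


def select_numeric(df):
    # Hypothetical helper: pick the numeric columns as a 2-D float array
    return df[["nb_like", "nb_share"]].to_numpy(dtype=float)


# Each branch of the union builds one block of features;
# FeatureUnion concatenates the blocks column-wise
features = FeatureUnion([
    ("text", Pipeline([
        ("select", FunctionTransformer(select_text)),
        ("tfidf", TfidfVectorizer()),
    ])),
    ("numeric", FunctionTransformer(select_numeric)),
])

# Feature extraction and classifier chained into a single estimator
model = Pipeline([
    ("features", features),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])

model.fit(toy, labels)
predictions = model.predict(toy)
print(predictions.shape)
```

One design benefit of this layout is that the vectorizer is fitted inside the pipeline, so when the pipeline is fitted on a training split only, the vocabulary and idf weights are learned from the training rows alone.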

The full exercise can be found here!

 
