A while ago, I worked on a machine learning exercise to predict fraudulent items from several input features, among them the item description (text), the number of likes, shares, etc. I therefore needed to run the algorithm on a combination of text data and categorical/continuous variables.
Step 1 – Starting with text data: Text feature extraction
There are several ways to do text feature extraction in Python, such as CountVectorizer and TfidfTransformer, which transform text data into numerical features that can be used for machine learning.
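To make the difference concrete, here is a minimal sketch comparing the two vectorizers on a couple of made-up example sentences (the sentences and variable names are for illustration only):

```python
# Toy comparison of CountVectorizer (raw counts) and TfidfVectorizer
# (counts reweighted by how rare each word is across documents).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["cheap watch free shipping", "genuine watch leather strap"]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

# Both produce a sparse matrix with one row per document and one
# column per vocabulary word (7 distinct words here).
print(counts.shape)  # (2, 7)
print(tfidf.shape)   # (2, 7)
```

The shapes are identical; the difference is in the values: the count matrix holds raw occurrence counts, while the tf-idf matrix downweights "watch", which appears in both documents, relative to the words unique to each one.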
After researching the different text analysis libraries available in Python, I decided to use TfidfVectorizer to perform the text feature extraction needed for the algorithm.
Since Naive Bayes is known to be an effective classifier for text data, I started with a text classification using only one feature: the description field.
# Using CountVectorizer, we get the occurrence count of words.
# However, the count does not account for word importance.
# This is usually handled by the tf-idf algorithm, which downscales
# the score of words that appear often, and therefore gives more
# weight to words that are significant but occur rarely.
# For the description column, we will be using the TfidfVectorizer.
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn_pandas import DataFrameMapper  # used in step 2

categories = data['index']
desc = data['description'].fillna('')

vectoriser = TfidfVectorizer()
features = vectoriser.fit_transform(desc)
features.shape  # (35182, 66023)

x, x_test, y, y_test = train_test_split(
    features, categories, test_size=0.2, train_size=0.8, random_state=0)

clf = MultinomialNB().fit(x, y)
predicted = clf.predict(x_test)

def printreport(exp, pred):
    print(pd.crosstab(exp, pred, rownames=['Actual'], colnames=['Predicted']))
    print('\n \n')
    print(classification_report(exp, pred))

printreport(y_test, predicted)
Step 2: Combining text with other features using the DataFrameMapper
Using Naive Bayes with only one feature gave average results, so I decided to add non-text features to the machine learning algorithm.
In order to use the transformed text data (which is now a matrix) in combination with other features, whether categorical or numerical, I found a useful package called sklearn-pandas. Its DataFrameMapper maps the original pandas dataframe columns to transformations and returns a single feature matrix that can be used with the ML algorithm.
The sklearn-pandas package needs to be installed (can be done using the command below):
pip install sklearn-pandas
So how can we use it to combine transformed text data with other features?
Below is a small code sample showing how this can be done:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn_pandas import DataFrameMapper

data = data.fillna('')

# Add the features of the dataframe that you want to transform and/or combine
mapper = DataFrameMapper([
    ('description', TfidfVectorizer()),
    ('nb_like', None),
    ('picture_labels', TfidfVectorizer()),
    ('nb_share', None),
])

# Use the fit_transform method to transform the old dataframe into a
# new feature matrix that can be fed to the machine learning algorithm.
features = mapper.fit_transform(data)
categories = data['INDEX New']

# Split the data between train and test
x, x_test, y, y_test = train_test_split(
    features, categories, test_size=0.2, train_size=0.8, random_state=0)

clf = RandomForestClassifier(random_state=0)
clf.fit(x, y)
predicted = clf.predict(x_test)

printreport(y_test, predicted)
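As a side note, scikit-learn ships a built-in alternative to DataFrameMapper, ColumnTransformer, which does the same column-to-transformer mapping without the extra dependency. Here is a minimal sketch on a made-up toy dataframe (the column names mirror the ones above but the data is invented):

```python
# Combine a tf-idf-transformed text column with a passthrough numeric
# column using scikit-learn's ColumnTransformer.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    'description': ['cheap watch', 'leather strap', 'free shipping'],
    'nb_like': [3, 10, 1],
})

mapper = ColumnTransformer([
    # TfidfVectorizer expects a 1-D column, so the name is passed as a string
    ('desc', TfidfVectorizer(), 'description'),
    # keep the numeric column as-is
    ('likes', 'passthrough', ['nb_like']),
])

features = mapper.fit_transform(df)
# one row per item; 6 tf-idf columns plus the passthrough column
print(features.shape)  # (3, 7)
```

The resulting matrix can be fed to the same train/test split and classifier as in the code above.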
The full exercise can be found here!