Text classification and feature union with DataFrameMapper in Python

A while ago, I submitted a Machine Learning exercise to predict fraudulent items based on several input features (among which: item description (text), number of likes, shares, etc.). Therefore, I needed to run the algorithm while combining both text data and categorical / continuous variables.

Step 1 – Starting with text data: Text feature extraction

There are several ways to do the text feature extraction in Python, like CountVectorizer and TfidfTransformer, which will transform the text data into numerical features that can be used for machine learning.
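To make the difference between raw counts and tf-idf weights concrete, here is a minimal sketch on a made-up three-document corpus (the sentences are hypothetical and are not taken from the exercise data). It only assumes that scikit-learn is installed.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical item descriptions, only for illustration
docs = [
    "cheap designer bag, brand new",
    "brand new phone, never used",
    "designer bag in good condition",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)   # sparse matrix of raw word counts

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)    # same shape, but tf-idf weighted

# Column positions of two words in the tf-idf matrix
cheap = tfidf_vec.vocabulary_["cheap"]   # appears in one document only
bag = tfidf_vec.vocabulary_["bag"]       # appears in two documents

print(counts.shape, tfidf.shape)
# Both words occur once in the first document, but the rarer one gets the larger weight
print(tfidf[0, cheap] > tfidf[0, bag])
```

Both vectorizers produce one row per document and one column per word; only the weighting differs.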

After researching the different text analysis libraries available in Python, I decided to use TfidfVectorizer to perform the text feature extraction needed for the algorithm.

Since Naive Bayes is known to be an effective classifier for text data, I started the text classification with a single feature: the description field.

# Using CountVectorizer, we get the occurrence count of words.
# However, the count does not account for word importance.
# Usually, this can be done using the tf-idf algorithm, which will
# downscale the score of the words that appear often, and
# therefore give more importance to the words that are
# significant but occur in only a small portion of the documents.
# For the description column, we will be using the TfidfVectorizer

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn_pandas import DataFrameMapper

categories = data['index']
desc = data['description'].fillna('')

vectoriser = TfidfVectorizer()
features = vectoriser.fit_transform(desc)

# features.shape is (35182, 66023)

x, x_test, y, y_test = train_test_split(features,categories,test_size=0.2,train_size=0.8, random_state = 0)

clf = MultinomialNB().fit(x, y)
predicted = clf.predict(x_test)

def printreport(exp, pred):
    print(pd.crosstab(exp, pred, rownames=['Actual'], colnames=['Predicted']))

    print('\n \n')
    print(classification_report(exp, pred))

printreport(y_test, predicted)


Step 2 – Combining text with other features using the DataFrameMapper

Using Naive Bayes with only one feature gave average results, so I decided to add non-text features to the machine learning algorithm.

In order to use the transformed text data (which is now a matrix) in combination with other features, whether they are categorical or numerical, I found a useful package called sklearn-pandas. Its DataFrameMapper maps the columns of the original pandas dataframe to transformations and then returns a single feature array that can be used with the ML algorithm.

The sklearn-pandas package needs to be installed first, which can be done using the command below:

pip install sklearn-pandas

So how can we use it to combine transformed text data with other features?

Below is a short example of how this can be done:

data = data.fillna('')

#Add the features of the dataframe that you want to transform and/or combine
mapper = DataFrameMapper([
     ('description', TfidfVectorizer()),
     ('nb_like', None),
     ('picture_labels', TfidfVectorizer()),
     ('nb_share', None),
])

Use the fit_transform method to transform the old dataframe into a new one
that can be fed to the machine learning algorithm.

#sample Usage
features = mapper.fit_transform(data)
categories = data['INDEX New']

# Split the data between train and test
x, x_test, y, y_test = train_test_split(features,categories,test_size=0.2,train_size=0.8, random_state = 0)

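# RandomForestClassifier lives in sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)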
clf.fit(x, y)

predicted = clf.predict(x_test)

printreport(y_test, predicted)

Another alternative would be to use the Pipeline class available in sklearn, in combination with FeatureUnion. Here’s a great tutorial that describes how they can be used.
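As a rough idea of what the Pipeline and FeatureUnion alternative could look like, here is a minimal sketch. It is not the code from the exercise: the toy dataframe, its values, the label column and the two selector helpers (select_text, select_numeric) are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical data with one text column and two numeric columns
toy = pd.DataFrame({
    "description": [
        "brand new designer bag",
        "used phone good condition",
        "cheap designer bag replica",
        "phone never used",
    ],
    "nb_like": [10, 3, 0, 7],
    "nb_share": [2, 1, 0, 3],
    "label": [0, 0, 1, 0],
})

def select_text(df):
    # Hypothetical helper: hand the raw text column to the vectorizer
    return df["description"]

def select_numeric(df):
    # Hypothetical helper: pass the numeric columns through unchanged
    return df[["nb_like", "nb_share"]].to_numpy()

features = FeatureUnion([
    ("text", Pipeline([
        ("select", FunctionTransformer(select_text)),
        ("tfidf", TfidfVectorizer()),
    ])),
    ("numeric", FunctionTransformer(select_numeric)),
])

# tf-idf columns and numeric columns stacked side by side
combined = features.fit_transform(toy)

model = Pipeline([
    ("features", features),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
model.fit(toy, toy["label"])
preds = model.predict(toy)
print(combined.shape, len(preds))
```

The appeal of this design is that the feature extraction and the classifier live in one object, so the same transformations are applied at training and prediction time.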

The full exercise can be found here!



Intro to Pandas functionalities in Python

Over the past few months, I started working with Python, and specifically with the Numpy and Pandas libraries.

Consequently, I decided to write this post to illustrate with some examples the basic functionalities of Pandas dataframes, NumPy arrays, lambda functions, etc., that I think could be useful to anyone who is starting to use Pandas.

The sum product, squared sum of squares and cosine similarity functions illustrated below are the ones I implemented for a recommendation engine exercise.


The Pandas library is built on NumPy and provides methods to manipulate and analyze dataframes.

By convention, we usually import the Pandas library as pd:

import pandas as pd
data = {
        'User 1': [10, 10, 0],
        'User 2': [14, 6, 0],
        'User 3': [5, 14, 1],
        'User 4': [8, 2, 10]
}

likesdf = pd.DataFrame(data,
                       index = ['likes', 'dislikes', 'neutral'],
                       columns=['User 1', 'User 2', 'User 3', 'User 4'])

The likesdf Pandas dataframe we just created looks like the one below:

          User 1  User 2  User 3  User 4
likes         10      14       5       8
dislikes      10       6      14       2
neutral        0       0       1      10

Now, how do we access the values in the dataframe?

Pandas provides several methods to access the row and column values in the dataframe. Some are based on position (of row or column, mainly iloc), others on index (mainly loc). It also supports the regular dataframe slicing, as we will see below.

Retrieving Columns:

There are several ways to view columns in a Pandas dataframe:

  • Using the column position and the double brackets: [[]], which will return a DataFrame object.
This will return the following dataframe:
          User 1
likes         10
dislikes      10
neutral        0
  • Using iloc: likesdf.iloc[:,0]. With this notation, we are specifying to return all rows of column 0.
  • We could also specify the column name: likesdf['User 1']


likesdf['User 1']

Both will lead to the same result:

likes       10
dislikes    10
neutral      0
Name: User 1, dtype: int64
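The column access options above can be tried side by side with the short sketch below. It rebuilds likesdf so that it runs on its own; the variable names as_frame, as_series and by_position are only illustrative.

```python
import pandas as pd

likesdf = pd.DataFrame(
    {
        "User 1": [10, 10, 0],
        "User 2": [14, 6, 0],
        "User 3": [5, 14, 1],
        "User 4": [8, 2, 10],
    },
    index=["likes", "dislikes", "neutral"],
)

as_frame = likesdf[["User 1"]]    # double brackets: a DataFrame with one column
as_series = likesdf["User 1"]     # single brackets: a Series
by_position = likesdf.iloc[:, 0]  # all rows of the column at position 0: a Series

print(type(as_frame).__name__, type(as_series).__name__)
print(as_series.equals(by_position))
```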

Retrieving Rows:

  • Using loc, we can access a row by its index label:

likesdf.loc['likes']

User 1    10
User 2    14
User 3     5
User 4     8
Name: likes, dtype: int64
  • If we want to select several rows, we can use the command below:

#This will return a dataframe
likesdf.loc[['likes', 'dislikes']]

          User 1  User 2  User 3  User 4
likes         10      14       5       8
dislikes      10       6      14       2
  • We can also use iloc to retrieve rows by their position

#This will return a dataframe

          User 3
likes          5
dislikes      14
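Since the selection command for the last table is not shown, here is a sketch of two ways that would produce it, one label based and one position based. These are my assumptions about what was intended, not necessarily the exact call used in the original notebook.

```python
import pandas as pd

likesdf = pd.DataFrame(
    {
        "User 1": [10, 10, 0],
        "User 2": [14, 6, 0],
        "User 3": [5, 14, 1],
        "User 4": [8, 2, 10],
    },
    index=["likes", "dislikes", "neutral"],
)

one_row = likesdf.loc["likes"]                  # a single row, returned as a Series
two_rows = likesdf.loc[["likes", "dislikes"]]   # several rows, returned as a DataFrame

# Rows 'likes' and 'dislikes' of column 'User 3', selected by label...
by_label = likesdf.loc[["likes", "dislikes"], ["User 3"]]
# ...and the same cells selected by position (rows 0 and 1, column 2)
by_position = likesdf.iloc[0:2, [2]]

print(by_label.equals(by_position))
```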

Using Pandas’ Supported Operations:

Using only basic Pandas functionalities, I implemented the functions below to calculate the squared sum of squares and the sum product of dataframes:

Squared sum of squares:

The below function will:
- Square each element of the dataframe
- Take the sum of each row
- Take the square root of the sum

def SQSumSq(df):
    return (df**2).sum(axis=1)**0.5
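A quick way to check the function is to run it on rows whose result is easy to verify by hand. The toy dataframe below is hypothetical and uses the 3-4-5 triangle.

```python
import pandas as pd

def SQSumSq(df):
    # Square every element, sum across each row, then take the square root
    return (df**2).sum(axis=1)**0.5

# Hypothetical values: sqrt(3^2 + 4^2) = 5 and sqrt(6^2 + 8^2) = 10
toy = pd.DataFrame({"a": [3, 6], "b": [4, 8]}, index=["row1", "row2"])
norms = SQSumSq(toy)
print(norms)
```

In other words, the function returns the Euclidean length of each row, which is what the cosine similarity further down needs as its denominator.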


Now, imagine that we want to calculate the sum product for the dataframes below, df1 and df2, respectively.

df1:
           Sports     Books  Leadership
question1     0.2         0         0.2
question2       0      0.25        0.25
question3       0         0           0
question4       0         0        0.25
question5       0  0.333333           0

df2:
           User 1  User 2  User 3  User 4
question1       1      -1       0       0
question2      -1       1       0       0
question3       0       0       0       0
question4       0       1       0       0
question5       0       0       1       0

An easy way to do this is using the dot product function available in pandas:

Sum Product:

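def sumproductDF(df1, df2):
    # each column of df2 (one user) is multiplied with df1 and summed per topic
    sumprDF = df2.apply(lambda x: x.dot(df1), axis = 0)
    # apply returns topics as rows, so transpose to get one row per user
    return sumprDF.T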

sumproductDF(df1, df2)

Here, there is no need to loop through any data frame to select specific columns or rows at each iteration to be able to perform the dot product! Fast and easy 🙂

The returned dataframe will look like the one below:

        Sports     Books  Leadership
User 1     0.2     -0.25       -0.05
User 2    -0.2      0.25         0.3
User 3       0  0.333333           0
User 4       0         0           0
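The same users-by-topics table can also be obtained with a single matrix multiplication. The sketch below rebuilds df1 and df2 from the tables above (with 0.333333 written as 1/3) and compares DataFrame.dot with the apply-based version; treat it as an equivalent formulation rather than the code from the exercise.

```python
import pandas as pd

questions = ["question1", "question2", "question3", "question4", "question5"]

df1 = pd.DataFrame(
    [[0.2, 0, 0.2], [0, 0.25, 0.25], [0, 0, 0], [0, 0, 0.25], [0, 1 / 3, 0]],
    index=questions,
    columns=["Sports", "Books", "Leadership"],
)
df2 = pd.DataFrame(
    [[1, -1, 0, 0], [-1, 1, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],
    index=questions,
    columns=["User 1", "User 2", "User 3", "User 4"],
)

# (users x questions) times (questions x topics) gives users x topics
result = df2.T.dot(df1)

# Same numbers through apply: one dot product per user column, then transpose
via_apply = df2.apply(lambda col: col.dot(df1), axis=0).T

print(result.round(6))
```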

Cosine Similarity:


For the cosine similarity between two vectors, I first started with a function that had 3 for loops. But then, I decided to go for a cleaner solution using Pandas functionalities, which turned out to be much more concise!

Using the two dataframes df1 and df2 below, we will get the cosine similarity of each user for every question:

df1:
            Sports  Books  Leadership
question1        1      0           1
question2        0      1           1
question4        0      0           1
question5        0      1           0
question6        1      0           0
question8        0      0           1
question10       0      1           0
question11       0      0           1
question12       1      0           0
question13       0      0           1
question14       0      1           1
question16       1      0           0
question17       0      1           1
question19       0      1           1
question20       0      0           1

df2:
        Sports  Books  Leadership
User 1       3     -2          -1
User 2      -2      2           2
User 3      -2      1           1


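import numpy as np

def cosineSimilarity(df1, df2):
    # row norms of both dataframes (square root of the sum of squares)
    df1SQSumSq = SQSumSq(df1)
    df2SQSumSq = SQSumSq(df2)
    # dot product of every row of df1 with every row of df2
    preds = df1.apply(lambda x: df2.dot(x), axis=1)
    #np.outer will calculate the outer product of the vectors
    return preds/np.outer(df1SQSumSq, df2SQSumSq)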

Which will return the below dataframe:

               User 1     User 2     User 3
question1    0.377964          0  -0.288675
question2   -0.566947   0.816497    0.57735
question4   -0.267261    0.57735   0.408248
question5   -0.534522    0.57735   0.408248
question6    0.801784   -0.57735  -0.816497
question8   -0.267261    0.57735   0.408248
question10  -0.534522    0.57735   0.408248
question11  -0.267261    0.57735   0.408248
question12   0.801784   -0.57735  -0.816497
question13  -0.267261    0.57735   0.408248
question14  -0.566947   0.816497    0.57735
question16   0.801784   -0.57735  -0.816497
question17  -0.566947   0.816497    0.57735
question19  -0.566947   0.816497    0.57735
question20  -0.267261    0.57735   0.408248
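A few of these values can be double-checked with a plain matrix formulation: dot products divided by the outer product of the row norms. The sketch below uses only three of the question rows from the table above to stay short; it is a verification aid and not a replacement for the function.

```python
import numpy as np
import pandas as pd

topics = ["Sports", "Books", "Leadership"]

# Three rows of df1 from the post, enough for a spot check
df1 = pd.DataFrame(
    [[1, 0, 1], [0, 1, 1], [1, 0, 0]],
    index=["question1", "question2", "question6"],
    columns=topics,
)
df2 = pd.DataFrame(
    [[3, -2, -1], [-2, 2, 2], [-2, 1, 1]],
    index=["User 1", "User 2", "User 3"],
    columns=topics,
)

dots = df1.dot(df2.T)                    # questions x users dot products
norms1 = np.sqrt((df1**2).sum(axis=1))   # length of each question vector
norms2 = np.sqrt((df2**2).sum(axis=1))   # length of each user vector
cosine = dots / np.outer(norms1, norms2)

print(cosine.round(6))
```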

Happy Pythoning!

You can check the full code for this project under my github.

Cross-Validation: Concept and Example in R

What is Cross-Validation?

In Machine Learning, cross-validation is a resampling method used for model evaluation to avoid testing a model on the same dataset on which it was trained. Testing on the training data is a common mistake, especially since a separate testing dataset is not always available. However, it usually leads to inaccurate performance measures (the model will have an almost perfect score since it is being tested on the same data it was trained on). To avoid this kind of mistake, cross-validation is usually preferred.

The concept of cross-validation is actually simple: Instead of using the whole dataset to train and then test on same data, we could randomly divide our data into training and testing datasets.

There are several types of cross-validation methods (LOOCV – Leave-one-out cross validation, the holdout method, k-fold cross validation). Here, I’m going to discuss the K-Fold cross validation method.
K-Fold basically consists of the below steps:

  1. Randomly split the data into k subsets, also called folds.
  2. Fit the model on the training data (or k-1 folds).
  3. Use the remaining part of the data as test set to validate the model. (Usually, in this step the accuracy or test error of the model is measured).
  4. Repeat the procedure k times.
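The four steps above can be sketched in a few lines. This version is written in plain Python (the language of the other posts on this blog) rather than R, it only produces the index splits, and the function name k_fold_indices is made up for the example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # step 1: random split...
    folds = [indices[i::k] for i in range(k)]   # ...into k folds of similar size
    for i in range(k):                          # step 4: repeat k times
        test_idx = folds[i]                     # step 3: the held-out fold
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]  # step 2: the other k-1 folds
        yield train_idx, test_idx

splits = list(k_fold_indices(10, 5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), len(train_idx))
```

Every sample ends up in the test fold exactly once, which is what makes the averaged score less dependent on one lucky or unlucky split.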

Below is a simple illustration of the procedure taken from Wikipedia.


How can it be done with R?

In the below exercise, I am using logistic regression to predict whether a passenger in the famous Titanic dataset has survived or not. The purpose is to find an optimal threshold on the predictions to know whether to classify the result as 1 or 0.

Threshold Example: Consider that the model has predicted the following values for two passengers: p1 = 0.7 and p2 = 0.4. If the threshold is 0.5, then p1 > threshold and passenger 1 is in the survived category. Whereas, p2 < threshold, so passenger 2 is in the not survived category.

However, depending on our data, the ‘default’ 0.5 threshold will not always ensure the maximum number of correct classifications. In this context, we could use cross-validation to determine the best threshold for each fold based on the results of running the model on the validation set.

In my implementation, I followed the below steps:

  1. Split the data randomly into 80% (train and validation) and 20% (test with unseen data).
  2. Run cross-validation on 80% of the data, which will be used to train and validate the model. 
  3. Get the optimal threshold after running the model on the validation dataset according to the best accuracy at each fold iteration.
  4. Store the best accuracy and the optimal threshold resulting from the fold iterations in a dataframe.
  5. Find the best threshold (the one that has the highest accuracy) and use it as a cutoff when testing the model against the test dataset.

Note: The ROC curve is usually the preferred way to find an optimal ‘cutoff’ probability, but for the sake of simplicity, I am using accuracy in the code below.
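To illustrate what picking a threshold by accuracy means, here is a small Python sketch. The probabilities, labels and candidate thresholds are invented for the example (the first two probabilities reuse p1 = 0.7 and p2 = 0.4 from above), and the helper names are hypothetical.

```python
def accuracy_at(threshold, probs, labels):
    # Classify as 1 when the predicted probability is above the threshold
    preds = [1 if p > threshold else 0 for p in probs]
    correct = sum(int(pred == y) for pred, y in zip(preds, labels))
    return correct / len(labels)

def best_threshold(probs, labels, candidates):
    # Keep the candidate with the highest accuracy (the first one wins on ties)
    return max(candidates, key=lambda t: accuracy_at(t, probs, labels))

# Hypothetical validation-set predictions and true outcomes
probs = [0.7, 0.4, 0.35, 0.2, 0.9, 0.45]
labels = [1, 1, 1, 0, 1, 0]
candidates = [0.1, 0.3, 0.5, 0.7]

chosen = best_threshold(probs, labels, candidates)
print(chosen, accuracy_at(chosen, probs, labels), accuracy_at(0.5, probs, labels))
```

On this toy data the default 0.5 cutoff misclassifies two of the survivors, while a lower cutoff gets one more prediction right.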

The below cross_validation method will:

  1. Create a ‘perf’ dataframe that will store the results of the testing of the model on the validation data.
  2. Use the createFolds method to create nbfolds number of folds.
  3. On each of the folds:
    • Train the model on k-1 folds
    • Test the model on the remaining part of the data
    • Measure the accuracy of the model using the performance method.
    • Add the optimal threshold and its accuracy to the perf dataframe.
  4. Look in the perf dataframe for optThresh – the threshold that has the highest accuracy.
  5. Use it as cutoff when testing the model on the test set (20% of original data).
  6. Use the F1 score to measure the performance of the model.

#createFolds comes from the caret package; prediction and performance come from ROCR
library(caret)
library(ROCR)

cross_validation = function(nbfolds, split){
  perf = data.frame()
  #create folds
  folds = createFolds(split$trainset$survived, nbfolds, list = TRUE, returnTrain = TRUE)

  #loop nbfolds times to find optimal threshold
  for(i in 1:nbfolds){
    #train the model on part of the data
    model = glm(survived~., data=split$trainset[folds[[i]],], family = "binomial")

    #validate on the remaining part of the data
    probs = predict(model, type="response", newdata = split$trainset[-folds[[i]],])

    #Threshold selection based on Accuracy
    #create a prediction object based on the predicted values
    pred = prediction(probs,split$trainset[-folds[[i]],]$survived)

    #measure performance of the prediction
    acc.perf = performance(pred, measure = "acc")

    #Find index of most accurate threshold and add threshold in data frame
    ind = which.max( slot(acc.perf, "y.values")[[1]] )
    acc = slot(acc.perf, "y.values")[[1]][ind]
    optimalThreshold = slot(acc.perf, "x.values")[[1]][ind]
    row = data.frame(threshold = optimalThreshold, accuracy = acc)

    #Store the best thresholds with their performance in the perf dataframe
    perf = rbind(perf, row)
  }

  #Get the threshold with the max accuracy among the nbfolds and predict based on it on the unseen test set
  indexOfMaxPerformance = which.max(perf$accuracy)
  optThresh = perf$threshold[indexOfMaxPerformance]
  #note: model is the one fitted in the last fold iteration
  probs = predict(model, type="response", newdata = split$testset)
  predictions = data.frame(survived=split$testset$survived, pred=probs)
  T = table(predictions$survived, predictions$pred > optThresh)
  F1 = (2*(T[1,1]))/((2*(T[1,1]))+T[2,1]+T[1,2])
  return(F1)
}

Then, if we run this method 100 times we can measure our max model accuracy when using cross-validation:

# Feature selection method
easyFeatureSelection = function(split) {
  #keep the columns whose absolute correlation with the first column of the training set is above 0.1
  corrs = abs(cor(split$trainset)[1,])
  toKeep = corrs[corrs > 0.1 & !is.na(corrs)]
  split$trainset = subset(split$trainset, select=names(toKeep))
  split$testset = subset(split$testset, select=names(toKeep))
  return(split)
}

# Perform cross validation
for(i in 1:100){
  #split the data into 80-20 (splitdf is a helper that is not shown in this post)
  split = splitdf(df, i, 0.8)
  #perform feature selection at each iteration on the training data
  split = easyFeatureSelection(split)
  #get optimal performance from 10 folds at each iteration
  performance = cross_validation(10, split)
}


We can see that we have a maximum performance of about 0.885 when using the accuracy method for threshold selection, which is not bad at all, taking into consideration that the model is being tested on unseen data.


My First Hackathon!

A month ago, I decided with 4 of my fellow IE Big Data Students to attend Hackatrips, our first hackathon!

The event was hosted during Fitur, one of the biggest tourism fairs in Europe. The theme for this year’s fair was ‘Sustainable Tourism’. According to the organizers, an ideal hackathon team would consist of three developers, one designer and one tourism expert. Even though this was not really a Data Science hackathon, which is our area of expertise, we decided to go in order to learn more about the hackathon process and its dynamics. We did want, however, for Big Data to be the center of our project.

An hour after the event had started, we already knew what our idea was and what types of tools we needed to implement it. However, as the first 6 hours went by, we were struggling to achieve a fully integrated solution due to some technical restrictions, which is when we decided that it was time for us to pack our bags and ‘leave with dignity’.

As we went to communicate our decision to the organizers, they cleared things up and explained that the purpose of the hackathon is to be able to present an idea while having at least a prototype available so that the jury could have a sense of what the purpose and design of the application were.

Knowing that, we decided to focus on selling our idea and to divide the tasks according to our profiles in order to have something to show for the demo and presentation that were due the next day. Although just a few hours before it seemed hard to imagine, at the end of the first day we already had a working bot and a fully working CARTO map! The next day, we got there even more motivated, and by the time the final presentations were due, we already had an elaborate presentation and a complete storyline.

Our project “BotTu”, is a data-driven solution aimed at analyzing unsustainable practices. It consists of a chatbot where users can denounce using images or text any violations against sustainability. All gathered information would then be centralized and stored so that NGOs or governmental organizations could use the CARTO maps, and visualize where, when and what type of violations are being reported.

At the end of the event, we were awarded the prize for the best use of CARTO. Since this is a Madrid-based Big Data Visualization startup, we were extremely happy with the result. We learned a lot, and not only did we ‘hack’ a Hackathon without the skills needed, but we also kept going even when the goal seemed impossible to achieve.

Since this was our first hackathon, we identified some of the pain points that held us back at the beginning of the event. So, if you ever attend a hackathon, keep the below tips in mind:

1- This mistake has cost us hours: You do NOT need to have a fully working solution. As long as your idea is solid and you have a prototype to show, you’ll be fine.

2- Collaborate with your teammates and give it your all.

3- Since time is limited, divide the tasks among your team according to each member’s skills.

4- The presentation is very important. Make sure to spend enough time preparing for it.

5- Work efficiently.

6- Don’t give up too soon.

Our presentation and interview can be found in the below link!


First blog post

Hello and welcome to my blog!

I have started this blog to share my journey in learning Big Data & Data Science: Data Analysis, Machine Learning, Big Data Technologies, etc.

I will be sharing code snippets and general ideas, and I will try to keep this blog as updated as possible!



Ideas From Big Data Spain 2016

Last week, on the 17th and 18th of November, I attended the Big Data Spain conference. It was my first time attending this type of event, and it was an excellent opportunity to meet experts in the field and attend high-quality talks. So I decided to write this post to share a few of the presented slides and ideas.

Ps: Please excuse the quality of some slides/pictures, they were all taken by my phone camera 🙂

First, congrats to Big Data Spain on being the second biggest Big Data conference in Europe, right after O’Reilly Strata. This year’s edition also saw an increase of around 50% over last year’s!

Now let’s dig into the details…

Has AI Arrived? (A keynote by Paco Nathan, O’Reilly)

Machine Learning is further becoming a powerful tool for companies. It is helping them to better understand their customers, and helping us as individuals to easily find what we need.

If we think about successful tech start-ups, can we find one that is not actually applying some kind of machine learning?  From this simple question we can sense how essential machine learning has become in shaping any company’s roadmap.

That said, there is still fear that AI will be coming for our jobs, which is understandable in the wake of self-driving cars, self-driving trucks, bots, etc. However, machine learning, or the ‘universal learner’, as Pedro Domingos describes it, is already part of our daily lives. It is growing at a very fast pace and touching every single industry, so staying ‘out’ of it is no longer an option, especially since it has also become easily accessible to non-experts. But we need to be aware that it can have a dark side when it’s not properly used.

In the end, it is essential to differentiate between machine and human capabilities. Machines are much better in some aspects (speed, scale, repeatability, etc.), whereas humans are better in others. And ultimately, knowing how to integrate both kinds of expertise will be key to success for any organization.

Shortening the feedback loop: How Spotify’s big data ecosystem has evolved to produce real-time processing (Josh Baer, Spotify)

As opposed to historical data and batch processing, which usually takes hours or days to clean and derive insights from, real-time processing provides instant insights and actionable data that allows us to take time-critical decisions.

When it comes to decision making, data is most valued when processed in real-time because it leads to a fast feedback loop. This, in turn, allows the early detection of bugs introduced by new features and can give developers real-time access to the data logged by the users.

Prepping Your Analytics Organization For Artificial Intelligence (Ramkumar Ravichandran, Visa)

Where do we stand with respect to AI?

It is important to keep in mind that AI may not be able to solve every problem. Now, as these systems continue to evolve, will AI ever reach the intelligence of Einstein, whom we consider a ‘superhuman’? If it does, will we ever be able to program common sense or creativity into a system?

On the other hand, organizations cannot start working on AI if they do not have the needed data and a mature analytics approach (which goes all the way to prescriptive analytics). These two are in fact the foundations for AI. The data needs to be reliable and the machine should not be overwhelmed with unnecessary or noisy data for later processing.

In addition, we should not think of AI as being a unique entity with only one role. It can rather be classified by capability. So while some AI systems are able to only execute one specific task, others can be much more complex, or may even be considered as super intelligent.

From Data to AI With The Machine Learning Canvas (Louis Dorard, PAPIs.io)

Louis presented the Machine Learning canvas, which is a template that can help organize and identify the tasks needed to develop predictive systems based on machine learning. By dividing the tasks into four categories (Goal, Predict, Learn, and Evaluate), it provides an explicit and systematic approach to tackling this type of project.

Keynote by Chema Alonso, Telefónica

In this talk, Chema specifically focused on the topic “Location data collection.”

It is no secret that our smartphone is constantly tracking our location, even when our location services are off. So, how is our location data usually collected?

Battery cookies: With this type of ‘hidden’ collection, it is enough to analyze the device’s battery consumption over time to know its location. The strength of the signal varies depending on how far you are from the cellular base station. A weaker signal leads to a faster drainage of the battery. Apps of this type do not even need to ask for permissions, because battery use is considered harmless and can be easily accessed.

Trust: The collection of location data might also be considered as a trade-off, but that doesn’t imply that we should blindly agree to the terms and conditions of a ‘free’ application.

In fact, we should always ask ourselves: Is this app worth giving away my privacy for it? Do I trust what the owners are going to do with my data?

Why Apache Flink Is Better Than Spark (Rubén Casado, Accenture Digital)

As more businesses are migrating from the traditional BI to the big data platform, it is essential to understand the difference between them, illustrated below:

In addition, now that real-time data is proving to be the most valuable when it comes to making timely business decisions, stream processing has come into play. One of its key features is the ability to process streaming data and get real-time insights, which makes it indispensable to ensure velocity in data analysis.

In fact, Stream Processing can be classified into 3 categories: Hard, Soft and Near real-time.

Hard systems (such as pacemakers), with a latency limit of micro or milliseconds, have no tolerance to delays. If delays do happen, they would lead to a system failure and human lives may be at risk.

Soft systems (such as VoIP), with latency ranging from milliseconds to seconds, have a low tolerance to delay: the system fails but there is no loss of human life.

Near real-time systems (usually video conferencing) have a high tolerance to delay. There’s no risk of system failure.

There are different types of processing semantics in stream processing systems:

  • At-least-once: Every message will be processed at least once. A message may be sent more than once to the application, but we ensure that every message is processed.
  • At-most-once: No message is processed more than once, but some messages may not be processed.
  • Exactly-once: Every message is processed once. This is the most complex method to implement.
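The difference between these guarantees is easier to see on a toy example. The Python sketch below simulates an at-least-once source that re-sends one message and shows how a consumer that remembers message ids still ends up with an exactly-once result. It is only an illustration of the idea with made-up function names, not a description of how Spark or Flink implement their guarantees.

```python
def deliver_at_least_once(messages, redeliver):
    """Simulate a source that sends some messages twice (e.g. after a lost acknowledgement)."""
    for msg_id, payload in messages:
        yield msg_id, payload
        if msg_id in redeliver:
            yield msg_id, payload   # duplicate delivery of the same message

def process_idempotently(stream):
    """Sum the payloads, skipping any message id that was already processed."""
    seen = set()
    total = 0
    for msg_id, payload in stream:
        if msg_id in seen:
            continue                # duplicate: ignore it
        seen.add(msg_id)
        total += payload
    return total

# Hypothetical (id, value) messages; message 2 is delivered twice
messages = [(1, 10), (2, 20), (3, 30)]

naive_total = sum(payload for _, payload in deliver_at_least_once(messages, redeliver={2}))
dedup_total = process_idempotently(deliver_at_least_once(messages, redeliver={2}))
print(naive_total, dedup_total)
```

The naive sum counts the duplicated message twice, while the deduplicating consumer returns the true total.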

How is processing done in Spark vs Flink?

Spark is based on micro-batches: before it starts processing the data, it needs to divide it into batches, each having a minimum size of 0.5 seconds, which introduces a processing delay. Flink, on the other hand, is capable of processing at event time, and therefore it does not cause that type of delay.
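As a back-of-the-envelope illustration of that delay, the Python sketch below computes how long an event waits before its batch window closes, using the 0.5 second figure mentioned above. It is a deliberately simplified model: the arrival times are invented, processing time is ignored, and the per-event case is idealised as zero wait.

```python
BATCH_MS = 500  # batch interval, taken from the 0.5 s figure mentioned above

def micro_batch_ready_time(arrival_ms, batch_ms=BATCH_MS):
    """An event is only processed once the batch window that contains it has closed."""
    return (arrival_ms // batch_ms + 1) * batch_ms

# Hypothetical event arrival times in milliseconds
arrivals_ms = [20, 130, 480, 510, 990]

# Waiting time before processing can even start, in each model
micro_batch_wait = [micro_batch_ready_time(t) - t for t in arrivals_ms]
per_event_wait = [0 for _ in arrivals_ms]  # idealised: handled on arrival

print(micro_batch_wait)
print(per_event_wait)
```

Events that arrive just after a window opens wait almost the full interval, which is the delay the talk attributed to the micro-batch model.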

Keynote by Oscar Méndez, Stratio

In his keynote, Oscar mainly addressed the digital transformation and the importance of implementing a data-centric approach to eliminate the need to replicate the data in different places. When our data is at the center of our applications, it becomes much easier to access it in real time. Below is the design and advantages of adopting such an approach.

These were some of the ideas I collected from Big Data Spain 2016. Outside of these talks, there were some really cool activities to play with too! MathWorks had their own Enigma crack machine at their booth, so we were able to see how it was decrypting our messages and well, I felt like I was Alan Turing for 10 minutes 🙂

A big thank you goes to the Big Data Spain community for organizing a great event, I hope to make it for next year’s edition!

At the end, what are we again?