Intro to Pandas functionalities in Python

Over the past few months, I started working with Python, and specifically with the NumPy and Pandas libraries.

Consequently, I decided to write this post to illustrate, with some examples, the basic functionality of Pandas dataframes, NumPy arrays, lambda functions, etc., that I think could be useful to anyone who is starting to use Pandas.

The sum product, square root of the sum of squares, and cosine similarity functions illustrated below are the ones I implemented for a recommendation engine exercise.

Pandas:

The Pandas library is built on NumPy and provides methods to manipulate and analyze dataframes.

By convention, we usually import the Pandas library as pd:

import pandas as pd
data = {
    'User 1': [10, 10, 0],
    'User 2': [14, 6, 0],
    'User 3': [5, 14, 1],
    'User 4': [8, 2, 10],
}

likesdf = pd.DataFrame(data,
                       index=['likes', 'dislikes', 'neutral'],
                       columns=['User 1', 'User 2', 'User 3', 'User 4'])

The likesdf Pandas dataframe we just created looks like the one below:

          User 1  User 2  User 3  User 4
likes         10      14       5       8
dislikes      10       6      14       2
neutral        0       0       1      10

Now, how do we access the values in the dataframe?

Pandas provides several methods to access the row and column values in a dataframe. Some are based on position (of the row or column, mainly iloc), others on index labels (mainly loc). It also supports regular slicing, as we will see below.

Retrieving Columns:

There are several ways to view columns in a Pandas dataframe:

  • Using the column position with iloc and double brackets, iloc[:, [0]], which will return a DataFrame object. (Plain double brackets select by column label instead: likesdf[['User 1']].)
likesdf.iloc[:, [0]]
"""
Will return the following dataframe:
          User 1
likes         10
dislikes      10
neutral        0
"""
  • Using iloc: likesdf.iloc[:, 0]. With this notation, we ask for all rows of column 0, returned as a Series.
  • We could also specify the column by its label: likesdf['User 1']

likesdf.iloc[:,0]

likesdf['User 1']

Both will lead to the same result:

likes       10
dislikes    10
neutral      0
Name: User 1, dtype: int64
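
To make the difference between the selections above concrete, here is a minimal sketch that rebuilds likesdf and checks what each form returns. The variable names (by_position, by_label, as_frame) are my own and only serve the illustration.

```python
import pandas as pd

# Rebuild the example dataframe from the beginning of the post
likesdf = pd.DataFrame(
    {
        'User 1': [10, 10, 0],
        'User 2': [14, 6, 0],
        'User 3': [5, 14, 1],
        'User 4': [8, 2, 10],
    },
    index=['likes', 'dislikes', 'neutral'],
)

by_position = likesdf.iloc[:, 0]   # Series, selected by column position
by_label = likesdf['User 1']       # Series, selected by column label
as_frame = likesdf[['User 1']]     # DataFrame with a single column

print(type(by_position).__name__)    # Series
print(type(as_frame).__name__)       # DataFrame
print(by_position.equals(by_label))  # True
```

Single brackets or a single position give back a Series, while a list of labels or positions keeps the result as a dataframe.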

Retrieving Rows:

  • Using loc, we can access a row by its index

likesdf.loc['likes']

User 1    10
User 2    14
User 3     5
User 4     8
Name: likes, dtype: int64
  • If we want to select several rows, we can use the command below:

#This will return a dataframe
likesdf.loc[['likes', 'dislikes']]

          User 1  User 2  User 3  User 4
likes         10      14       5       8
dislikes      10       6      14       2
  • We can also use iloc to retrieve rows (and columns) by their position

#This will return a dataframe
likesdf.iloc[[0,1],[2]]

          User 3
likes          5
dislikes      14
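
Slicing was mentioned earlier as another way to access the data. The sketch below (rebuilding likesdf, with variable names of my own choosing) shows three slices that should select the same two rows. Note that label slices with loc include the end label, while position slices with iloc exclude the end position.

```python
import pandas as pd

# Rebuild the example dataframe from the beginning of the post
likesdf = pd.DataFrame(
    {
        'User 1': [10, 10, 0],
        'User 2': [14, 6, 0],
        'User 3': [5, 14, 1],
        'User 4': [8, 2, 10],
    },
    index=['likes', 'dislikes', 'neutral'],
)

rows_by_label = likesdf.loc['likes':'dislikes']  # label slice, end label included
rows_by_position = likesdf.iloc[0:2]             # position slice, end position excluded
first_two = likesdf[0:2]                         # plain slicing applies to rows

print(rows_by_label.equals(rows_by_position))  # True
print(first_two.equals(rows_by_position))      # True
```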

Using Pandas’ Supported Operations:

Using only basic Pandas functionality, I implemented the functions below to calculate the square root of the sum of squares and the sum product of dataframes:

Square root of the sum of squares:

def SQSumSq(df):
    """
    Return the square root of the sum of squares of each row:
    - Square each element of the dataframe
    - Take the sum of each row
    - Take the square root of the sum
    """
    return (df**2).sum(axis=1)**0.5
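
As a quick usage sketch, we can apply the function to the likesdf dataframe from the beginning of the post. The function is repeated here so the snippet runs on its own; the comparison with np.linalg.norm is my own sanity check, not part of the original exercise.

```python
import numpy as np
import pandas as pd

def SQSumSq(df):
    # Square each element, sum each row, take the square root
    return (df**2).sum(axis=1)**0.5

likesdf = pd.DataFrame(
    {
        'User 1': [10, 10, 0],
        'User 2': [14, 6, 0],
        'User 3': [5, 14, 1],
        'User 4': [8, 2, 10],
    },
    index=['likes', 'dislikes', 'neutral'],
)

norms = SQSumSq(likesdf)
# For the 'likes' row: sqrt(10**2 + 14**2 + 5**2 + 8**2) = sqrt(385)
print(norms)

# The result is the Euclidean norm of each row
print(np.allclose(norms, np.linalg.norm(likesdf.to_numpy(), axis=1)))  # True
```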

 

Now, imagine that we want to calculate the sum product of the two dataframes below, df1 and df2, respectively.

df1:
            Sports     Books  Leadership
question1      0.2         0         0.2
question2        0      0.25        0.25
question3        0         0           0
question4        0         0        0.25
question5        0  0.333333           0

df2:
            User 1  User 2  User 3  User 4
question1        1      -1       0       0
question2       -1       1       0       0
question3        0       0       0       0
question4        0       1       0       0
question5        0       0       1       0

An easy way to do this is using the dot product function available in pandas:

Sum Product:


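def sumproductDF(df1, df2):
    # Dot each user column of df2 with df1. apply returns one column per user,
    # so we transpose to get one row per user.
    sumprDF = df2.apply(lambda x: x.dot(df1), axis=0).T
    return sumprDF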


sumproductDF(df1, df2)

Here, there is no need to loop through any data frame to select specific columns or rows at each iteration to be able to perform the dot product! Fast and easy 🙂

The returned dataframe will look like the one below:

        Sports     Books  Leadership
User 1     0.2     -0.25       -0.05
User 2    -0.2      0.25         0.3
User 3       0  0.333333           0
User 4       0         0           0
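
To reproduce this table, here is a minimal sketch that builds df1 and df2 from the values shown above and computes the same sum product as a single matrix product, df2.T.dot(df1). This one-liner is an alternative I am suggesting, not the function from the exercise, and the name sumpr is only for illustration.

```python
import pandas as pd

questions = ['question1', 'question2', 'question3', 'question4', 'question5']

# Values copied from the two tables shown earlier in the post
df1 = pd.DataFrame(
    {
        'Sports': [0.2, 0, 0, 0, 0],
        'Books': [0, 0.25, 0, 0, 0.333333],
        'Leadership': [0.2, 0.25, 0, 0.25, 0],
    },
    index=questions,
)
df2 = pd.DataFrame(
    {
        'User 1': [1, -1, 0, 0, 0],
        'User 2': [-1, 1, 0, 1, 0],
        'User 3': [0, 0, 0, 0, 1],
        'User 4': [0, 0, 0, 0, 0],
    },
    index=questions,
)

# Rows of df2.T are users; its columns (questions) align with the index of df1
sumpr = df2.T.dot(df1)
print(sumpr.round(6))
```

The result has one row per user and one column per topic, because the shared question axis is summed out by the dot product.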

Cosine Similarity:

For the cosine similarity between two vectors, I first started with a function that had three for loops. But then I decided to go for a cleaner solution using Pandas functionality, which turned out to be much more concise!
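
For reference, the usual definition of the cosine similarity between two vectors A and B is their dot product divided by the product of their norms:

```latex
\cos(\theta)
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
```

The two square roots in the denominator are exactly what the SQSumSq function above computes for each row of a dataframe.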

Using the two dataframes df1 and df2 below, we will get the cosine similarity of each user for every question:

df1:
            Sports  Books  Leadership
question1        1      0           1
question2        0      1           1
question4        0      0           1
question5        0      1           0
question6        1      0           0
question8        0      0           1
question10       0      1           0
question11       0      0           1
question12       1      0           0
question13       0      0           1
question14       0      1           1
question16       1      0           0
question17       0      1           1
question19       0      1           1
question20       0      0           1

df2:
        Sports  Books  Leadership
User 1       3     -2          -1
User 2      -2      2           2
User 3      -2      1           1

 

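import numpy as np

def cosineSimilarity(df1, df2):
    # Norm (square root of the sum of squares) of each row
    df1SQSumSq = SQSumSq(df1)
    df2SQSumSq = SQSumSq(df2)
    # Dot product of every row of df1 with every row of df2
    preds = df1.apply(lambda x: df2.dot(x), axis=1)
    # np.outer will calculate the outer product of the vectors
    return preds/np.outer(df1SQSumSq, df2SQSumSq)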

Which will return the below dataframe:

               User 1     User 2     User 3
question1    0.377964          0  -0.288675
question2   -0.566947   0.816497    0.57735
question4   -0.267261    0.57735   0.408248
question5   -0.534522    0.57735   0.408248
question6    0.801784   -0.57735  -0.816497
question8   -0.267261    0.57735   0.408248
question10  -0.534522    0.57735   0.408248
question11  -0.267261    0.57735   0.408248
question12   0.801784   -0.57735  -0.816497
question13  -0.267261    0.57735   0.408248
question14  -0.566947   0.816497    0.57735
question16   0.801784   -0.57735  -0.816497
question17  -0.566947   0.816497    0.57735
question19  -0.566947   0.816497    0.57735
question20  -0.267261    0.57735   0.408248
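
As a sanity check on a few of these values, here is a self-contained sketch that computes the same similarities with a single matrix product instead of apply. The helper name cosine_similarity_matrix is hypothetical (my own, not from the exercise), and only three of the question rows are used to keep the example short.

```python
import numpy as np
import pandas as pd

def cosine_similarity_matrix(df1, df2):
    # Hypothetical helper: cosine similarity of every row of df1
    # with every row of df2, assuming both share the same columns
    dots = df1.dot(df2.T)                    # pairwise dot products
    norms1 = np.sqrt((df1**2).sum(axis=1))   # row norms of df1
    norms2 = np.sqrt((df2**2).sum(axis=1))   # row norms of df2
    return dots / np.outer(norms1, norms2)

topics = ['Sports', 'Books', 'Leadership']

# Three question rows and the three user rows copied from the tables above
df1 = pd.DataFrame(
    [[1, 0, 1], [0, 1, 1], [1, 0, 0]],
    index=['question1', 'question2', 'question6'],
    columns=topics,
)
df2 = pd.DataFrame(
    [[3, -2, -1], [-2, 2, 2], [-2, 1, 1]],
    index=['User 1', 'User 2', 'User 3'],
    columns=topics,
)

sims = cosine_similarity_matrix(df1, df2)
print(sims.round(6))
```

For example, question1 and User 1 have a dot product of 2 and norms of sqrt(2) and sqrt(14), which gives 2 / sqrt(28), or about 0.377964, in line with the table.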

Happy Pythoning!

You can check the full code for this project on my GitHub.
