Over the past few months, I started working with Python, and specifically with the NumPy and Pandas libraries.

Consequently, I decided to write this post to illustrate, with some examples, the basic functionality of Pandas dataframes, NumPy arrays, lambda functions, etc., that I think could be useful to anyone who is starting to use Pandas.

The sum product, squared sum of squares (that is, the square root of the sum of squares) and cosine similarity functions illustrated below are the ones I implemented for a recommendation engine exercise.

### Pandas:

The Pandas library is built on NumPy and provides methods to manipulate and analyze dataframes.

By convention, we usually import the **Pandas** library as **pd**:

```python
import pandas as pd

data = {
    'User 1': [10, 10, 0],
    'User 2': [14, 6, 0],
    'User 3': [5, 14, 1],
    'User 4': [8, 2, 10]
}

likesdf = pd.DataFrame(data,
                       index=['likes', 'dislikes', 'neutral'],
                       columns=['User 1', 'User 2', 'User 3', 'User 4'])
```

The likesdf Pandas dataframe we just created looks like the one below:

| | User 1 | User 2 | User 3 | User 4 |
|---|---|---|---|---|
| likes | 10 | 14 | 5 | 8 |
| dislikes | 10 | 6 | 14 | 2 |
| neutral | 0 | 0 | 1 | 10 |

Now, how do we access the values in the dataframe?

**Pandas** provides several methods to access the row and column values in the dataframe. Some are based on position (of row or column, mainly **iloc**), others on index labels (mainly **loc**). It also supports regular slicing, as we will see below.
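To make the difference between the access styles concrete, here is a minimal sketch that rebuilds the same `likesdf` and reads the same cell once by label and once by position, then takes a plain slice of rows. The variable names (`by_label`, `by_position`, `first_two_rows`) are only illustrative.

```python
import pandas as pd

# Same data as the likesdf dataframe created earlier
likesdf = pd.DataFrame(
    {
        'User 1': [10, 10, 0],
        'User 2': [14, 6, 0],
        'User 3': [5, 14, 1],
        'User 4': [8, 2, 10],
    },
    index=['likes', 'dislikes', 'neutral'],
)

by_label = likesdf.loc['dislikes', 'User 3']  # row label, column label
by_position = likesdf.iloc[1, 2]              # row 1, column 2 (zero-based)
first_two_rows = likesdf[0:2]                 # plain slicing selects rows by position

print(by_label, by_position)
print(first_two_rows)
```

Both lookups point at the same cell, so they return the same value; the slice keeps the `likes` and `dislikes` rows.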

### Retrieving Columns:

There are several ways to view columns in a Pandas dataframe:

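- Using the column position and the double brackets, `[[]]`, which will return a **DataFrame** object:

```python
# Recent pandas versions expect a column label inside the brackets,
# so we look up the label stored at position 0 first
likesdf[[likesdf.columns[0]]]

"""
Will return the following dataframe:

          User 1
likes         10
dislikes      10
neutral        0
"""
```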

- Using **iloc**: `likesdf.iloc[:,0]`. With this notation, we ask for all rows of column 0.
- We could also specify the column label: `likesdf['User 1']`

```python
likesdf.iloc[:, 0]
likesdf['User 1']
```

Both will lead to the same result:

```
likes       10
dislikes    10
neutral      0
Name: User 1, dtype: int64
```
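The practical difference between single and double brackets is the type of the result. The short sketch below, which rebuilds `likesdf` so it can run on its own, checks this; the names `as_series` and `as_frame` are just for illustration.

```python
import pandas as pd

likesdf = pd.DataFrame(
    {
        'User 1': [10, 10, 0],
        'User 2': [14, 6, 0],
        'User 3': [5, 14, 1],
        'User 4': [8, 2, 10],
    },
    index=['likes', 'dislikes', 'neutral'],
)

as_series = likesdf['User 1']    # single brackets: a one-dimensional Series
as_frame = likesdf[['User 1']]   # double brackets: a one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The values are identical in both cases; only the container changes, which matters when a later step expects a dataframe rather than a series.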

### Retrieving Rows:

- Using **loc**, we can access a row by its index:

```python
likesdf.loc['likes']
```

```
User 1    10
User 2    14
User 3     5
User 4     8
Name: likes, dtype: int64
```

- If we want to select several rows, we can use the command below:

```python
# This will return a dataframe
likesdf.loc[['likes', 'dislikes']]
```

| | User 1 | User 2 | User 3 | User 4 |
|---|---|---|---|---|
| likes | 10 | 14 | 5 | 8 |
| dislikes | 10 | 6 | 14 | 2 |

- We can also use **iloc** to retrieve rows (and columns) by their **positions**:

```python
# This will return a dataframe
likesdf.iloc[[0, 1], [2]]
```

| | User 3 |
|---|---|
| likes | 5 |
| dislikes | 14 |
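For comparison, the same selection can be written with labels instead of positions. This is only a sketch: it rebuilds `likesdf` so it runs on its own, and uses `DataFrame.equals` to check that both forms return the same dataframe.

```python
import pandas as pd

likesdf = pd.DataFrame(
    {
        'User 1': [10, 10, 0],
        'User 2': [14, 6, 0],
        'User 3': [5, 14, 1],
        'User 4': [8, 2, 10],
    },
    index=['likes', 'dislikes', 'neutral'],
)

by_position = likesdf.iloc[[0, 1], [2]]                     # rows 0 and 1, column 2
by_label = likesdf.loc[['likes', 'dislikes'], ['User 3']]   # same cells, by label

print(by_position.equals(by_label))
```

Label-based selection tends to be easier to read, while position-based selection is handy when the labels are not known in advance.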

## Using Pandas’ Supported Operations:

Using only basic Pandas functionality, I implemented the functions below to calculate the squared sum of squares and the sum product of dataframes:

### Squared sum of squares:

```python
def SQSumSq(df):
    """
    This function will:
    - Square each element of the dataframe
    - Take the sum of each row
    - Take the square root of the sum
    """
    return (df**2).sum(axis=1)**0.5
```

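In other words, `SQSumSq` returns the Euclidean length of each row. As a quick sanity check, here is a sketch that compares it with NumPy's `np.linalg.norm` on a small made-up dataframe (the `example` data is hypothetical and chosen so the results are round numbers).

```python
import numpy as np
import pandas as pd


def SQSumSq(df):
    # Square every element, sum each row, then take the square root
    return (df**2).sum(axis=1)**0.5


# Hypothetical data: rows (3, 4) and (6, 8) have lengths 5 and 10
example = pd.DataFrame({'a': [3, 6], 'b': [4, 8]}, index=['row1', 'row2'])

norms = SQSumSq(example)
reference = np.linalg.norm(example.to_numpy(), axis=1)

print(norms)
print(np.allclose(norms, reference))
```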
Now, imagine that we want to calculate the sum product of the two dataframes below, df1 and df2, respectively.

| | Sports | Books | Leadership |
|---|---|---|---|
| question1 | 0.2 | 0 | 0.2 |
| question2 | 0 | 0.25 | 0.25 |
| question3 | 0 | 0 | 0 |
| question4 | 0 | 0 | 0.25 |
| question5 | 0 | 0.333333 | 0 |

| | User 1 | User 2 | User 3 | User 4 |
|---|---|---|---|---|
| question1 | 1 | -1 | 0 | 0 |
| question2 | -1 | 1 | 0 | 0 |
| question3 | 0 | 0 | 0 | 0 |
| question4 | 0 | 1 | 0 | 0 |
| question5 | 0 | 0 | 1 | 0 |

An easy way to do this is using the dot product function available in pandas:

### Sum Product:

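```python
def sumproductDF(df1, df2):
    # For each user column of df2, take its dot product with df1.
    # apply() returns one column per user, so we transpose the result
    # to get one row per user.
    sumprDF = df2.apply(lambda x: x.dot(df1), axis=0).T
    return sumprDF

sumproductDF(df1, df2)
```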

Here, there is no need to loop through any data frame to select specific columns or rows at each iteration to be able to perform the dot product! Fast and easy 🙂

The returned dataframe will look like the one below:

| | Sports | Books | Leadership |
|---|---|---|---|
| User 1 | 0.2 | -0.25 | -0.05 |
| User 2 | -0.2 | 0.25 | 0.3 |
| User 3 | 0 | 0.333333 | 0 |
| User 4 | 0 | 0 | 0 |
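Since the sum product here is a matrix multiplication, the same table can also be obtained without `apply`, by transposing df2 and calling `DataFrame.dot` once. The sketch below is an alternative formulation rather than the function used in the exercise; it rebuilds the two dataframes from the tables above, using `1 / 3` for the value shown as 0.333333.

```python
import pandas as pd

questions = ['question1', 'question2', 'question3', 'question4', 'question5']

# Question weights per topic (df1 in the tables above)
df1 = pd.DataFrame(
    {
        'Sports': [0.2, 0, 0, 0, 0],
        'Books': [0, 0.25, 0, 0, 1 / 3],
        'Leadership': [0.2, 0.25, 0, 0.25, 0],
    },
    index=questions,
)

# User answers per question (df2 in the tables above)
df2 = pd.DataFrame(
    {
        'User 1': [1, -1, 0, 0, 0],
        'User 2': [-1, 1, 0, 1, 0],
        'User 3': [0, 0, 0, 0, 1],
        'User 4': [0, 0, 0, 0, 0],
    },
    index=questions,
)

# (users x questions) . (questions x topics) -> (users x topics)
sumprDF = df2.T.dot(df1)
print(sumprDF.round(6))
```

`dot` aligns the question labels of both dataframes before multiplying, so the row order of the inputs does not need to match as long as the labels do.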

### Cosine Similarity:

For the cosine similarity between two vectors, I first started with a function that had 3 for loops. But then, I decided to go for a cleaner solution using Pandas functionality, which turned out to be much more concise!

Using the two dataframes df1 and df2 below, we will get the cosine similarity of each user for every question:

| | Sports | Books | Leadership |
|---|---|---|---|
| question1 | 1 | 0 | 1 |
| question2 | 0 | 1 | 1 |
| question4 | 0 | 0 | 1 |
| question5 | 0 | 1 | 0 |
| question6 | 1 | 0 | 0 |
| question8 | 0 | 0 | 1 |
| question10 | 0 | 1 | 0 |
| question11 | 0 | 0 | 1 |
| question12 | 1 | 0 | 0 |
| question13 | 0 | 0 | 1 |
| question14 | 0 | 1 | 1 |
| question16 | 1 | 0 | 0 |
| question17 | 0 | 1 | 1 |
| question19 | 0 | 1 | 1 |
| question20 | 0 | 0 | 1 |

| | Sports | Books | Leadership |
|---|---|---|---|
| User 1 | 3 | -2 | -1 |
| User 2 | -2 | 2 | 2 |
| User 3 | -2 | 1 | 1 |

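```python
import numpy as np

def cosineSimilarity(df1, df2):
    # Length of each row (SQSumSq is defined above)
    df1SQSumSq = SQSumSq(df1)
    df2SQSumSq = SQSumSq(df2)
    # Dot product of every row of df1 with every row of df2
    preds = df1.apply(lambda x: df2.dot(x), axis=1)
    # np.outer will calculate the outer product of the vectors
    return preds / np.outer(df1SQSumSq, df2SQSumSq)
```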

This returns the dataframe below:

| | User 1 | User 2 | User 3 |
|---|---|---|---|
| question1 | 0.377964 | 0 | -0.288675 |
| question2 | -0.566947 | 0.816497 | 0.57735 |
| question4 | -0.267261 | 0.57735 | 0.408248 |
| question5 | -0.534522 | 0.57735 | 0.408248 |
| question6 | 0.801784 | -0.57735 | -0.816497 |
| question8 | -0.267261 | 0.57735 | 0.408248 |
| question10 | -0.534522 | 0.57735 | 0.408248 |
| question11 | -0.267261 | 0.57735 | 0.408248 |
| question12 | 0.801784 | -0.57735 | -0.816497 |
| question13 | -0.267261 | 0.57735 | 0.408248 |
| question14 | -0.566947 | 0.816497 | 0.57735 |
| question16 | 0.801784 | -0.57735 | -0.816497 |
| question17 | -0.566947 | 0.816497 | 0.57735 |
| question19 | -0.566947 | 0.816497 | 0.57735 |
| question20 | -0.267261 | 0.57735 | 0.408248 |
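Each entry of this table is the dot product of a question row and a user row, divided by the product of their lengths. One way to double-check a few entries is to scale every row to unit length first and then take a single dot product. The sketch below does that on three of the questions; the function name `cosine_similarity_normalized` is hypothetical, and it assumes no row is all zeros (such a row would give NaN).

```python
import numpy as np
import pandas as pd

topics = ['Sports', 'Books', 'Leadership']

# Three of the question rows from the table above
df1 = pd.DataFrame(
    [[1, 0, 1], [0, 1, 1], [1, 0, 0]],
    index=['question1', 'question2', 'question6'],
    columns=topics,
)

# The user rows from the table above
df2 = pd.DataFrame(
    [[3, -2, -1], [-2, 2, 2], [-2, 1, 1]],
    index=['User 1', 'User 2', 'User 3'],
    columns=topics,
)


def cosine_similarity_normalized(df1, df2):
    # Divide every row by its Euclidean length, then multiply the matrices
    unit1 = df1.div(np.sqrt((df1 ** 2).sum(axis=1)), axis=0)
    unit2 = df2.div(np.sqrt((df2 ** 2).sum(axis=1)), axis=0)
    return unit1.dot(unit2.T)


similarity = cosine_similarity_normalized(df1, df2)
print(similarity.round(6))
```

The printed values for question1, question2 and question6 should match the corresponding rows of the table above.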

Happy Pythoning!

You can check the full code for this project on my GitHub.