Pandas Tutorial
Pandas is an open-source library that is built on top of the NumPy library. It is used for working with data sets, i.e. analyzing, cleaning, exploring, and manipulating data.
- Creating a Dataframe
- Get rows and columns using loc and iloc
- Indexes
- Filtering
- Updating Rows and Columns
- Add/Remove Rows and Columns
- Sorting
- Grouping and Aggregating
- Cleaning Data
person = {
    'first': 'Pranav',
    'last': 'Shirole',
    'email': '123pranav@email.com'
}
people = {
    'first': ['Pranav', 'Jane', 'John'],
    'last': ['Shirole', 'Doe', 'Doe'],
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com']
}
people['email']
Let's create a dataframe from the above dictionary.
import pandas as pd
import numpy as np
df = pd.DataFrame(people)
df
df['email']
The above code returns a `Series` object. A `Series` has an index, which you can see on the left (0, 1, 2).
df.email
It's better to use the brackets rather than the dot notation because if a column name is the same as an existing method or attribute of a dataframe, the dot notation will return the method instead of the column.
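For example (a minimal sketch, using a hypothetical column named count that collides with the built-in DataFrame.count method):
# 'count' is also a DataFrame method, so dot notation is ambiguous
df_demo = pd.DataFrame({'count': [1, 2, 3]})
df_demo['count']  # the column, as expected
df_demo.count     # the bound count method, not the column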
type(df['email'])
A `Series` is rows of data of a single column; it's a 1-D array. A `DataFrame` is rows and columns of data, a 2-D array; it's a container of multiple `Series` objects.
# pass a list of columns inside the brackets
df[['last', 'email']]
The above code returns a `DataFrame`, a filtered-down dataframe.
type(df[['last', 'email']])
df.columns
# returns a Series that contains the values of the first row of data
df.iloc[0]
The above code returns a `Series` that contains the values of the first row of data. Also, when accessing a row, the index is now set to the column names.
# returns a DataFrame
df.iloc[[0, 1]]
The above code returns a `DataFrame`.
We can also select columns with `loc` and `iloc`. The rows will be the first value and the columns will be the second value that we pass in the brackets. So if we think of `loc` and `iloc` as functions, rows are the first argument and columns are the second argument.
# index of email column will be 2 since it's the third column
df.iloc[[0, 1], 2]
df
df.loc[0]
df.loc[[0, 1]]
df.loc[[0, 1], 'email']
We can also pass in a list of columns with `loc` and `iloc`.
# the output follows the order of the columns passed in, regardless of their order in the dataframe
df.loc[[0, 1], ['email', 'last']]
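`loc` also accepts slices. Note that, unlike regular Python slicing, label slices with `loc` include the end value (a quick sketch using the dataframe above):
# label slices are inclusive on both ends
df.loc[0:1, 'first':'email']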
df['email']
We can set the email column as the index for the dataframe. The dataframe doesn't actually change unless you use `inplace=True`, which is nice since it lets us experiment with our dataset.
df.set_index('email', inplace=True)
df
df.index
Why would changing the index be useful?
Because it enables us to see all the information on someone just by using their email.
df.loc['123pranav@email.com']
df.loc['123pranav@email.com', 'last']
Note that once you change the index, you cannot use the default index to locate your rows. But you can still use `iloc`.
# df.loc[0]
# but you can still use iloc
df.iloc[0]
You can also reset the index back to the default.
df.reset_index(inplace=True)
df
df['last'] == 'Doe'
We get a `Series` object with Boolean values. `True` values are the ones that met our filter criteria. `False` values are the ones that did not meet our filter criteria.
filt = (df['last'] == 'Doe')
df[filt]
df[df['last'] == 'Doe']
You can filter data using `loc` by passing in a series of Boolean values.
df.loc[filt]
This is great because we can still grab data for specific columns as well.
# remember: df.loc[rows, cols]
df.loc[filt, 'email']
filt = (df['last'] == 'Doe') & (df['first'] == 'John')
df.loc[filt, 'email']
filt = (df['last'] == 'Shirole') | (df['first'] == 'John')
df.loc[filt, 'email']
You can get the opposite of a filter using `~`.
df.loc[~filt, 'email']
df = pd.DataFrame(people)
df
df.columns
df.columns = ['first name', 'last name', 'email']
df
df.columns = [x.upper() for x in df.columns]
df
df.columns = df.columns.str.replace(' ', '_')
df
df.columns = [x.lower() for x in df.columns]
df
We can rename only some of the columns by passing in a dictionary of column names. We need to include `inplace=True` for the changes to take effect.
df.rename(columns = {'first_name': 'first', 'last_name': 'last'}, inplace=True)
df
df.loc[2]
df.loc[2] = ['John', 'Smith', 'JohnSmith@email.com']
df
df.loc[2, ['last', 'email']]
df.loc[2, ['last', 'email']] = ['Doe', 'JohnDoe@email.com']
df
df.loc[2, 'last'] = 'Smith'
df
The `at` accessor can be used to look up or change a specific value. But you can, and for consistency maybe should, use `loc`.
df.at[2, 'last'] = 'Doe'
df
filt = (df['email'] == 'JohnDoe@email.com')
df[filt]
df[filt]['last']
# this raises a SettingWithCopyWarning and does not work
# df[filt] returns a copy, so the assignment is lost and the last name is not changed
df[filt]['last'] = 'Smith'
df
Just use `loc` or `at` when setting values.
df.loc[filt, 'last'] = 'Smith'
df
# this does not actually make the change
df['email'].str.lower()
df['email'] = df['email'].str.lower()
df
df['email'].apply(len)
def update_email(email):
    return email.upper()
df['email'].apply(update_email)
df['email'] = df['email'].apply(update_email)
df
`lambda` functions are anonymous functions, without a name.
df['email'] = df['email'].apply(lambda x: x.lower())
df
Using `apply` on a DataFrame.
df['email'].apply(len)
df.apply(len)
The above code returns the number of rows in each column.
len(df['email'])
df.apply(len, axis='columns')
df.apply(pd.Series.min) # will choose alphabetically
The above code returns the lowest value alphabetically in the `first`, `last` and `email` columns respectively.
df.apply(lambda x: x.min())
This is usually more useful when your dataframe contains numerical data.
Running `apply` on a Series applies a function to every value in the series. Running `apply` on a DataFrame applies a function to every Series in the DataFrame.
df.applymap(len)
df.applymap(str.lower)
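Note: in pandas 2.1 and later, `applymap` has been deprecated and renamed to `DataFrame.map`, which behaves the same way:
# equivalent to applymap in pandas >= 2.1
df.map(len)
df.map(str.lower)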
# map: values not found in the dict become NaN
df['first'].map({'Corey': 'Chris', 'Jane': 'Mary'})
# replace: values not found in the dict are left unchanged
df['first'].replace({'Corey': 'Chris', 'Jane': 'Mary'})
df['first'] = df['first'].replace({'Corey': 'Chris', 'Jane': 'Mary'})
df
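A quick sketch of the difference between `map` and `replace`, using a throwaway Series (the names here are just for illustration):
s = pd.Series(['Pranav', 'Jane', 'John'])
s.map({'Jane': 'Mary'})      # Pranav and John become NaN
s.replace({'Jane': 'Mary'})  # Pranav and John are kept as-is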
df = pd.DataFrame(people)
df
df['first'] + ' ' + df['last']
df['full_name'] = df['first'] + ' ' + df['last']
df
df.drop(columns=['first', 'last'], inplace=True)
df
We can also reverse this process and split the `full_name` column into two different columns.
# split where there is a space
df['full_name'].str.split(' ')
The result of the above code is a Series of lists, where the first name is the first element and the last name is the second element.
To assign the values to two different columns, we need to expand this list using the `expand` argument.
df['full_name'].str.split(' ', expand=True)
We get two columns of the split result.
df[['first', 'last']] = df['full_name'].str.split(' ', expand=True)
df
We have added the `first` and `last` columns with the values of the list.
# use ignore_index=True
df.append({'first': 'Tony'}, ignore_index=True)
Since we only assigned a first name to our row, the other values in the columns are `NaN` (not a number), used for missing values.
Let's create another dataframe.
people2 = {
    'first': ['Tony', 'Steve'],
    'last': ['Stark', 'Rogers'],
    'email': ['ironman@email.com', 'cap@email.com']
}
df2 = pd.DataFrame(people2)
df2
Let's add the two dataframes together.
df.append(df2, ignore_index=True)
Since there is no `inplace` argument here, we need to assign the result back to `df` to make the changes permanent.
df = df.append(df2, ignore_index=True)
df
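Note: `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0. In newer versions, `pd.concat` gives the same result:
# on pandas >= 2.0, use concat instead of the removed append
pd.concat([df, df2], ignore_index=True)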
# to apply changes, use inplace=True
df.drop(index=4)
To drop particular rows with a condition, you can pass in the indexes of the filter.
# note the .index attribute used at the end
df.drop(index=df[df['last'] == 'Doe'].index)
# this is more readable
filt = df['last'] == 'Doe'
df.drop(index=df[filt].index)
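An equivalent approach is to keep only the rows that don't match the condition, using the negated filter:
# keep everyone whose last name is not Doe
df[~filt]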
df.drop(['full_name'], axis=1, inplace=True)
df.sort_values(by='last')
df.sort_values(by='last', ascending=False)
When sorting on multiple columns, if the first column has identical values, it will then sort on the second column value.
So in the case below, if you want to sort by last name, it will first sort by last name and then sort by first name for the rows that share the last name Doe.
df.sort_values(by=['last', 'first'])
people3 = {
    'first': ['Pranav', 'Jane', 'John', 'Thor'],
    'last': ['Shirole', 'Doe', 'Doe', 'Odinson'],
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com', 'thor@email.com']
}
df = pd.DataFrame(people3)
df
We can also have one column sorted in ascending order and another in descending order.
# last name in descending and first in ascending
# pass a list to ascending too
df.sort_values(by=['last', 'first'], ascending=[False, True], inplace=True)
df
Here, we see that the indexes have changed in accordance with the sorted values.
We can set the indexes back to the default values using `sort_index`.
df.sort_index(inplace=True)
df
To sort only a single column, i.e. a Series, we can use `sort_values`.
# just leave the arguments blank
df['last'].sort_values()
people4 = {
    'first': ['Pranav', 'Jane', 'John', 'Thor', 'Tony', 'Steve', 'Bruce', 'Clark'],
    'last': ['Shirole', 'Doe', 'Doe', 'Odinson', 'Stark', 'Rogers', 'Wayne', 'Kent'],
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com', 'thor@email.com',
              'ironman@email.com', 'cap@email.com', 'thebatman@email.com', None],
    'age': [30, 35, 25, 90, 50, 75, 40, np.nan],
    'salary': [80000, 120000, 100000, 80000, 500000, 45000, 350000, None],
    'work': ['analyst', 'developer', 'developer', 'avenger', 'avenger', 'avenger', 'vigilante', 'superhero'],
    'country': ['India', 'India', 'India', 'USA', 'USA', 'USA', 'USA', 'Krypton']
}
df = pd.DataFrame(people4)
df
# ignore the NaN values
df['salary'].mean()
# ignore the NaN values
df['salary'].median()
# only for numerical values; newer pandas versions require numeric_only=True here
df.median(numeric_only=True)
df['salary'].count()
df['work']
df['work'].value_counts()
We can use the `normalize=True` argument of the `value_counts` method to view percentages.
df['work'].value_counts(normalize=True)
37.5% of the people are avengers. 25% are developers.
df.groupby(['country'])
This `DataFrameGroupBy` object contains a bunch of groups. Let's set the `GroupBy` object as a variable to reuse it later.
country_grp = df.groupby(['country'])
country_grp.get_group('USA')
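You can also iterate over the groups directly; each iteration yields the group key and its sub-dataframe (a quick sketch using the dataframe above):
# each group is a (key, sub-dataframe) pair
for country, group in df.groupby('country'):
    print(country, len(group))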
We can also use a filter to perform the same task as above.
filt = df['country'] == 'USA'
df.loc[filt]
df.loc[filt]['work'].value_counts()
filt = df['country'] == 'India'
df.loc[filt]
df.loc[filt]['work'].value_counts()
country_grp['work'].value_counts().loc['India']
The above code (using the `DataFrameGroupBy` object) is useful because we can run one operation for every country in the dataset instead of writing a separate filter for each country.
country_grp['work'].value_counts().loc['USA']
country_grp['salary'].median()
country_grp['salary'].median().loc['India']
country_grp['salary'].agg(['median', 'mean'])
country_grp['salary'].agg(['median', 'mean']).loc['USA']
country_grp['salary'].agg(['median', 'mean']).loc[['USA', 'India']]
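You can also aggregate several columns at once by passing `agg` a dictionary that maps column names to functions (a minimal sketch using the columns above):
# different aggregations per column
country_grp.agg({'salary': 'median', 'age': 'mean'})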
We can calculate sums using `sum`. It works on numbers as well as Boolean data types (where it will take `True` as 1 and `False` as 0).
filt = df['country'] == 'India'
df.loc[filt]['work'].str.contains('deve').sum()
avengers = country_grp['work'].apply(lambda x: x.str.contains('aven').sum())
avengers
country_respondents = df['country'].value_counts()
country_respondents
We can combine more than one Series together using `concat`.
avengers_df = pd.concat([country_respondents, avengers], axis='columns', sort=True)
avengers_df
avengers_df.rename(columns={'country':'number of people', 'work':'are avengers'}, inplace=True)
avengers_df
avengers_df['pct are avengers'] = (avengers_df['are avengers']/avengers_df['number of people'])*100
avengers_df
avengers_df.sort_values(by='pct are avengers', ascending=False, inplace=True)
avengers_df
avengers_df.loc['USA']
people5 = {
    'first': ['Pranav', 'Jane', 'John', 'Bruce', np.nan, None, 'NA'],
    'last': ['Shirole', 'Doe', 'Doe', 'Wayne', np.nan, None, 'Missing'],
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com',
              None, np.nan, 'anonymous@email.com', 'Missing'],
    'age': ['32', '38', '40', '45', None, None, 'Missing']
}
df = pd.DataFrame(people5)
df
df.dropna()
Two of the default arguments of `dropna` are as follows:
df.dropna(axis='index', how='any')
The `axis` argument can be set to either `index` or `columns`. `index` tells pandas to drop rows that have missing values; `columns` tells pandas to drop columns with missing values.
The `how` argument can be set to either `any` or `all`. `any` drops rows (or columns) with one or more missing values; `all` drops rows (or columns) in which all values are missing.
df.dropna(axis='columns', how='all')
df.dropna(axis='columns', how='any')
We get an empty dataframe because every column contains at least one missing value, so all of them are dropped.
df.dropna(axis='index', how='any', subset=['email'])
df.dropna(axis='index', how='all', subset=['last', 'email'])
df.replace('NA', np.nan, inplace=True)
df.replace('Missing', np.nan, inplace=True)
df
df.dropna()
df.dropna(axis='index', how='all', subset=['last', 'email'])
You can see which values would and would not be treated as NA by using the `isna` method.
df.isna()
To substitute NA values with a specific value, use the `fillna` method.
df.fillna('MISSING')
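`fillna` also accepts a dictionary, which lets you fill each column with a different value (a small sketch; the fill values here are arbitrary):
# fill missing values per column
df.fillna({'email': 'no email', 'age': 0})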
Check the data types using the `dtypes` attribute.
df.dtypes
So if we wanted the average age, it wouldn't work with the current object data type.
type(np.nan)
As you can see above, NaN values are of the data type float, which means that if your dataset has NaN values and you want to perform some math on the numbers, you need to convert your column data type to float (not int). Another option would be to convert the missing values into another number like 0 and then convert the data type to int, but in most cases this would be a bad idea (e.g., when you want to find an average).
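A further option (a hedged aside; available since pandas 0.24) is the nullable integer dtype `'Int64'`, which can hold whole numbers and missing values side by side:
# convert the string ages to floats first, then to nullable integers
df['age'].astype(float).astype('Int64')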
Note: If you have a dataframe with all values of the same data type, and you want to convert all columns at once to another data type, you can use the `astype` method of the `DataFrame` object. For example, you can convert all int columns to float using `df.astype(float)`.
df['age'] = df['age'].astype(float)
df.dtypes
df['age'].mean()