Pandas Tutorial
Pandas is an open-source library that is built on top of the NumPy library. It is used for working with data sets, i.e. analyzing, cleaning, exploring, and manipulating data.
- Creating a Dataframe
- Get rows and columns using loc and iloc
- Indexes
- Filtering
- Updating Rows and Columns
- Add/Remove Rows and Columns
- Sorting
- Grouping and Aggregating
- Cleaning Data
person = {
    'first': 'Pranav',
    'last': 'Shirole',
    'email': '123pranav@email.com'
}
people = {
    'first': ['Pranav', 'Jane', 'John'],
    'last': ['Shirole', 'Doe', 'Doe'],
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com']
}
people['email']
Let's create a dataframe from the above dictionary.
import pandas as pd
import numpy as np
df = pd.DataFrame(people)
df
df['email']
The above code returns a `Series` object. A `Series` has an index, which you can see on the left (0, 1, 2).
df.email
It's better to use the brackets rather than the dot notation because if a column name is the same as an existing method or attribute of a dataframe, the dot notation will return the method instead of the column.
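For example (a minimal sketch, using a hypothetical column named count that collides with the built-in DataFrame.count method):
# 'count' is also a DataFrame method, so dot notation is ambiguous
df_demo = pd.DataFrame({'count': [1, 2, 3]})
df_demo['count']  # the column, as expected
df_demo.count     # the bound count method, not the column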
type(df['email'])
A `Series` is rows of data of a single column; it's a 1-D array. A `DataFrame` is rows and columns of data, a 2-D array; it's a container of multiple `Series` objects.
# pass a list of columns inside the brackets
df[['last', 'email']]
The above code returns a `DataFrame`, a filtered-down dataframe.
type(df[['last', 'email']])
df.columns
# returns a Series that contains the values of the first row of data
df.iloc[0]
The above code returns a `Series` that contains the values of the first row of data. Also, when accessing a row, the index is now set to the column names.
# returns a DataFrame
df.iloc[[0, 1]]
The above code returns a `DataFrame`.
We can also select columns with `loc` and `iloc`. The rows will be the first value and the columns will be the second value that we pass in the brackets. So if we think of `loc` and `iloc` as functions, rows are the first argument and columns are the second argument.
# index of email column will be 2 since it's the third column
df.iloc[[0, 1], 2]
df
df.loc[0]
df.loc[[0, 1]]
df.loc[[0, 1], 'email']
We can also pass in a list of columns with `loc` and `iloc`.
# the output follows the order of the columns passed in, regardless of their order in the dataframe
df.loc[[0, 1], ['email', 'last']]
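`loc` also accepts slices. Note that, unlike regular Python slicing, label slices with `loc` include the end value (a quick sketch using the dataframe above):
# label slices are inclusive on both ends
df.loc[0:1, 'first':'email']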
df['email']
We can set the email column as the index for the dataframe. The dataframe doesn't actually change unless you use `inplace=True`, which is nice since it lets us experiment with our dataset.
df.set_index('email', inplace=True)
df
df.index
Why would changing the index be useful?
Because it enables us to see all the information on someone just by using their email.
df.loc['123pranav@email.com']
df.loc['123pranav@email.com', 'last']
Note that once you change the index, you cannot use the default index to locate your rows. But you can still use `iloc`.
# df.loc[0]
# but you can still use iloc
df.iloc[0]
You can also reset the index back to the default.
df.reset_index(inplace=True)
df
df['last'] == 'Doe'
We get a `Series` object with Boolean values. `True` values are the ones that met our filter criteria. `False` values are the ones that did not meet our filter criteria.
filt = (df['last'] == 'Doe')
df[filt]
df[df['last'] == 'Doe']
You can filter data using `loc` by passing in a series of Boolean values.
df.loc[filt]
This is great because we can still grab data for specific columns as well.
# remember: df.loc[rows, cols]
df.loc[filt, 'email']
filt = (df['last'] == 'Doe') & (df['first'] == 'John')
df.loc[filt, 'email']
filt = (df['last'] == 'Shirole') | (df['first'] == 'John')
df.loc[filt, 'email']
You can get the opposite of a filter using `~`.
df.loc[~filt, 'email']
df = pd.DataFrame(people)
df
df.columns
df.columns = ['first name', 'last name', 'email']
df
df.columns = [x.upper() for x in df.columns]
df
df.columns = df.columns.str.replace(' ', '_')
df
df.columns = [x.lower() for x in df.columns]
df
We can rename only some of the columns by passing in a dictionary of column names. We need to include `inplace=True` for the changes to take effect.
df.rename(columns = {'first_name': 'first', 'last_name': 'last'}, inplace=True)
df
df.loc[2]
df.loc[2] = ['John', 'Smith', 'JohnSmith@email.com']
df
df.loc[2, ['last', 'email']]
df.loc[2, ['last', 'email']] = ['Doe', 'JohnDoe@email.com']
df
df.loc[2, 'last'] = 'Smith'
df
The `at` accessor can be used to look up or change a specific value. But you can, and for consistency maybe should, use `loc`.
df.at[2, 'last'] = 'Doe'
df
filt = (df['email'] == 'JohnDoe@email.com')
df[filt]
df[filt]['last']
# this raises a SettingWithCopyWarning and does not work
# df[filt] returns a copy, so the assignment is lost and the last name is not changed
df[filt]['last'] = 'Smith'
df
Just use `loc` or `at` when setting values.
df.loc[filt, 'last'] = 'Smith'
df
# this does not actually make the change
df['email'].str.lower()
df['email'] = df['email'].str.lower()
df
df['email'].apply(len)
def update_email(email):
    return email.upper()
df['email'].apply(update_email)
df['email'] = df['email'].apply(update_email)
df
`lambda` functions are anonymous functions, without a name.
df['email'] = df['email'].apply(lambda x: x.lower())
df
Using `apply` on a DataFrame.
df['email'].apply(len)
df.apply(len)
The above code returns the number of rows in each column.
len(df['email'])
df.apply(len, axis='columns')
df.apply(pd.Series.min) # will choose alphabetically
The above code returns the lowest value alphabetically in the `first`, `last` and `email` columns respectively.
df.apply(lambda x: x.min())
This is usually more useful when your dataframe contains numerical data.
Running `apply` on a Series applies a function to every value in the series. Running `apply` on a DataFrame applies a function to every Series in the DataFrame.
df.applymap(len)
df.applymap(str.lower)
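Note: in pandas 2.1 and later, `applymap` has been deprecated and renamed to `DataFrame.map`, which behaves the same way:
# equivalent to applymap in pandas >= 2.1
df.map(len)
df.map(str.lower)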
# map: values not found in the dict become NaN
df['first'].map({'Corey': 'Chris', 'Jane': 'Mary'})
# replace: values not found in the dict are left unchanged
df['first'].replace({'Corey': 'Chris', 'Jane': 'Mary'})
df['first'] = df['first'].replace({'Corey': 'Chris', 'Jane': 'Mary'})
df
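A quick sketch of the difference between `map` and `replace`, using a throwaway Series (the names here are just for illustration):
s = pd.Series(['Pranav', 'Jane', 'John'])
s.map({'Jane': 'Mary'})      # Pranav and John become NaN
s.replace({'Jane': 'Mary'})  # Pranav and John are kept as-is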
df = pd.DataFrame(people)
df
df['first'] + ' ' + df['last']
df['full_name'] = df['first'] + ' ' + df['last']
df
df.drop(columns=['first', 'last'], inplace=True)
df
We can also reverse this process and split the `full_name` column into two different columns.
# split where there is a space
df['full_name'].str.split(' ')
The result of the above code is a Series of lists, where the first name is the first element and the last name is the second element.
To assign the values to two different columns, we need to expand this list using the `expand` argument.
df['full_name'].str.split(' ', expand=True)
We get two columns of the split result.
df[['first', 'last']] = df['full_name'].str.split(' ', expand=True)
df
We have added the `first` and `last` columns with the values of the list.
# use ignore_index=True
df.append({'first': 'Tony'}, ignore_index=True)
Since we only assigned a first name to our row, the other values in the columns are `NaN` (not a number), used for missing values.
Let's create another dataframe.
people2 = {
    'first': ['Tony', 'Steve'],
    'last': ['Stark', 'Rogers'],
    'email': ['ironman@email.com', 'cap@email.com']
}
df2 = pd.DataFrame(people2)
df2
Let's add the two dataframes together.
df.append(df2, ignore_index=True)
Since there is no `inplace` argument here, we need to assign the result back to `df` to make the changes permanent.
df = df.append(df2, ignore_index=True)
df
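Note: `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0. In newer versions, `pd.concat` gives the same result:
# on pandas >= 2.0, use concat instead of the removed append
pd.concat([df, df2], ignore_index=True)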
# to apply changes, use inplace=True
df.drop(index=4)
To drop particular rows with a condition, you can pass in the indexes of the filter.
# note the .index attribute used at the end
df.drop(index=df[df['last'] == 'Doe'].index)
# this is more readable
filt = df['last'] == 'Doe'
df.drop(index=df[filt].index)
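An equivalent approach is to keep only the rows that don't match the condition, using the negated filter:
# keep everyone whose last name is not Doe
df[~filt]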
df.drop(['full_name'], axis=1, inplace=True)
df.sort_values(by='last')
df.sort_values(by='last', ascending=False)
When sorting on multiple columns, if the first column has identical values, it will then sort on the second column value.
So in the case below, if you want to sort by last name, it will first sort by last name and then sort by first name for the rows that share the last name Doe.
df.sort_values(by=['last', 'first'])
people3 = {
    'first': ['Pranav', 'Jane', 'John', 'Thor'],
    'last': ['Shirole', 'Doe', 'Doe', 'Odinson'],
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com', 'thor@email.com']
}
df = pd.DataFrame(people3)
df
We can also have one column sorted in ascending order and another in descending order.
# last name in descending and first in ascending
# pass a list to ascending too
df.sort_values(by=['last', 'first'], ascending=[False, True], inplace=True)
df
Here, we see that the indexes have changed in accordance with the sorted values.
We can set the indexes back to the default values using `sort_index`.
df.sort_index(inplace=True)
df
To sort only a single column, i.e. a Series, we can use `sort_values`.
# just leave the arguments blank
df['last'].sort_values()
people4 = {
    'first': ['Pranav', 'Jane', 'John', 'Thor', 'Tony', 'Steve', 'Bruce', 'Clark'],
    'last': ['Shirole', 'Doe', 'Doe', 'Odinson', 'Stark', 'Rogers', 'Wayne', 'Kent'],
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com', 'thor@email.com',
              'ironman@email.com', 'cap@email.com', 'thebatman@email.com', None],
    'age': [30, 35, 25, 90, 50, 75, 40, np.nan],
    'salary': [80000, 120000, 100000, 80000, 500000, 45000, 350000, None],
    'work': ['analyst', 'developer', 'developer', 'avenger', 'avenger', 'avenger', 'vigilante', 'superhero'],
    'country': ['India', 'India', 'India', 'USA', 'USA', 'USA', 'USA', 'Krypton']
}
df = pd.DataFrame(people4)
df
# ignore the NaN values
df['salary'].mean()
# ignore the NaN values
df['salary'].median()
# only for numerical values; newer pandas versions require numeric_only=True here
df.median(numeric_only=True)
df['salary'].count()
df['work']
df['work'].value_counts()
We can use the `normalize=True` argument of the `value_counts` method to view percentages.
df['work'].value_counts(normalize=True)
37.5% of the people are avengers. 25% are developers.
df.groupby(['country'])
This `DataFrameGroupBy` object contains a bunch of groups. Let's set the `GroupBy` object as a variable to reuse it later.
country_grp = df.groupby(['country'])
country_grp.get_group('USA')
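You can also iterate over the groups directly; each iteration yields the group key and its sub-dataframe (a quick sketch using the dataframe above):
# each group is a (key, sub-dataframe) pair
for country, group in df.groupby('country'):
    print(country, len(group))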
We can also use a filter to perform the same task as above.
filt = df['country'] == 'USA'
df.loc[filt]
df.loc[filt]['work'].value_counts()
filt = df['country'] == 'India'
df.loc[filt]
df.loc[filt]['work'].value_counts()
country_grp['work'].value_counts().loc['India']
The above code (using the `DataFrameGroupBy` object) is useful because we can run one operation for every country in the dataset instead of writing a separate filter for each country.
country_grp['work'].value_counts().loc['USA']
country_grp['salary'].median()
country_grp['salary'].median().loc['India']
country_grp['salary'].agg(['median', 'mean'])
country_grp['salary'].agg(['median', 'mean']).loc['USA']
country_grp['salary'].agg(['median', 'mean']).loc[['USA', 'India']]
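You can also aggregate several columns at once by passing `agg` a dictionary that maps column names to functions (a minimal sketch using the columns above):
# different aggregations per column
country_grp.agg({'salary': 'median', 'age': 'mean'})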
We can calculate sums using `sum`. It works on numbers as well as Boolean data types (where it will take `True` as 1 and `False` as 0).
filt = df['country'] == 'India'
df.loc[filt]['work'].str.contains('deve').sum()
avengers = country_grp['work'].apply(lambda x: x.str.contains('aven').sum())
avengers
country_respondents = df['country'].value_counts()
country_respondents
We can combine more than one Series together using `concat`.
avengers_df = pd.concat([country_respondents, avengers], axis='columns', sort=True)
avengers_df
avengers_df.rename(columns={'country':'number of people', 'work':'are avengers'}, inplace=True)
avengers_df
avengers_df['pct are avengers'] = (avengers_df['are avengers']/avengers_df['number of people'])*100
avengers_df
avengers_df.sort_values(by='pct are avengers', ascending=False, inplace=True)
avengers_df
avengers_df.loc['USA']
people5 = {
    'first': ['Pranav', 'Jane', 'John', 'Bruce', np.nan, None, 'NA'],
    'last': ['Shirole', 'Doe', 'Doe', 'Wayne', np.nan, None, 'Missing'],
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com',
              None, np.nan, 'anonymous@email.com', 'Missing'],
    'age': ['32', '38', '40', '45', None, None, 'Missing']
}
df = pd.DataFrame(people5)
df
df.dropna()
Two of the default arguments of `dropna` are as follows:
df.dropna(axis='index', how='any')
The `axis` argument can be set to either `index` or `columns`. `index` tells pandas to drop rows that have missing values; `columns` tells pandas to drop columns with missing values.
The `how` argument can be set to either `any` or `all`. `any` drops rows (or columns) with one or more missing values; `all` drops rows (or columns) in which all values are missing.
df.dropna(axis='columns', how='all')
df.dropna(axis='columns', how='any')
We get an empty dataframe because every column contains at least one missing value, so all of them are dropped.
df.dropna(axis='index', how='any', subset=['email'])
df.dropna(axis='index', how='all', subset=['last', 'email'])
df.replace('NA', np.nan, inplace=True)
df.replace('Missing', np.nan, inplace=True)
df
df.dropna()
df.dropna(axis='index', how='all', subset=['last', 'email'])
You can see which values would and would not be treated as NA by using the `isna` method.
df.isna()
To substitute NA values with a specific value, use the `fillna` method.
df.fillna('MISSING')
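`fillna` also accepts a dictionary, which lets you fill each column with a different value (a small sketch; the fill values here are arbitrary):
# fill missing values per column
df.fillna({'email': 'no email', 'age': 0})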
Check the data types using the `dtypes` attribute.
df.dtypes
So if we wanted the average age, it wouldn't work with the current object data type.
type(np.nan)
As you can see above, NaN values are of the data type float, which means that if your dataset has NaN values and you want to perform some math on the numbers, you need to convert your column data type to float (not int). Another option would be to convert the missing values into another number like 0 and then convert the data type to int, but in most cases this would be a bad idea (e.g., when you want to find an average).
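A further option (a hedged aside; available since pandas 0.24) is the nullable integer dtype `'Int64'`, which can hold whole numbers and missing values side by side:
# convert the string ages to floats first, then to nullable integers
df['age'].astype(float).astype('Int64')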
Note: If you have a dataframe with all values of the same data type, and you want to convert all columns at once to another data type, you can use the `astype` method of the `DataFrame` object. For example, you can convert all int columns to float using `df.astype(float)`.
df['age'] = df['age'].astype(float)
df.dtypes
df['age'].mean()