Creating a DataFrame

person = {
    'first': 'Pranav', 
    'last': 'Shirole', 
    'email': '123pranav@email.com'
}
people = {
    'first': ['Pranav', 'Jane', 'John'], 
    'last': ['Shirole', 'Doe', 'Doe'], 
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com']
}
people['email']
['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com']

Let's create a dataframe from the above dictionary.

import pandas as pd
import numpy as np 
df = pd.DataFrame(people)
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
df['email']
0    123pranav@email.com
1      janedoe@email.com
2      johndoe@email.com
Name: email, dtype: object

The above code returns a Series object. A Series has an index, which you can see on the left (0, 1, 2).

df.email
0    123pranav@email.com
1      janedoe@email.com
2      johndoe@email.com
Name: email, dtype: object

It's better to use brackets rather than dot notation: if a column name is the same as an existing method or attribute of a DataFrame, dot notation returns that method or attribute instead of the column.
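For example, with a made-up column named count, dot notation resolves to the built-in DataFrame.count method instead of the column:

```python
import pandas as pd

# Hypothetical frame whose column name clashes with an existing DataFrame method.
demo = pd.DataFrame({'count': [1, 2, 3], 'name': ['a', 'b', 'c']})

# Bracket notation returns the column's data.
col = demo['count']

# Dot notation resolves to the built-in DataFrame.count method instead.
attr = demo.count

print(col.tolist())    # [1, 2, 3]
print(callable(attr))  # True -- the method, not the column
```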

type(df['email'])
pandas.core.series.Series

A Series is the rows of data for a single column: a 1-D labeled array.
A DataFrame is rows and columns of data: a 2-D structure that acts as a container for multiple Series objects.

# pass a list of columns inside the brackets
df[['last', 'email']]
last email
0 Shirole 123pranav@email.com
1 Doe janedoe@email.com
2 Doe johndoe@email.com

The above code returns a filtered-down DataFrame containing only the selected columns.

type(df[['last', 'email']])
pandas.core.frame.DataFrame
df.columns
Index(['first', 'last', 'email'], dtype='object')

Get rows and columns using loc and iloc

iloc

iloc allows us to access rows by integer location.

# returns a Series that contains the values of the first row of data
df.iloc[0]
first                 Pranav
last                 Shirole
email    123pranav@email.com
Name: 0, dtype: object

The above code returns a Series containing the values of the first row of data. Note that when accessing a row, the Series index is made up of the column names.

# returns a DataFrame
df.iloc[[0, 1]]
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com

The above code returns a DataFrame.

We can also select columns with loc and iloc. The rows are the first value and the columns are the second value that we pass in the brackets. If we think of loc and iloc as functions, rows are the first argument and columns are the second.

# index of email column will be 2 since it's the third column
df.iloc[[0, 1], 2]
0    123pranav@email.com
1      janedoe@email.com
Name: email, dtype: object

loc

For loc, we search by labels.

df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
df.loc[0]
first                 Pranav
last                 Shirole
email    123pranav@email.com
Name: 0, dtype: object
df.loc[[0, 1]]
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
df.loc[[0, 1], 'email']
0    123pranav@email.com
1      janedoe@email.com
Name: email, dtype: object

We can also pass in a list of columns with loc and iloc.

# the output will follow the order in which the columns are inputted regardless of the main dataframe
df.loc[[0, 1], ['email', 'last']]
email last
0 123pranav@email.com Shirole
1 janedoe@email.com Doe

Indexes

df['email']
0    123pranav@email.com
1      janedoe@email.com
2      johndoe@email.com
Name: email, dtype: object

We can set the email column as the index for the dataframe. The change doesn't actually persist unless you use inplace=True, which is nice since it lets us experiment with our dataset.

df.set_index('email', inplace=True)
df
first last
email
123pranav@email.com Pranav Shirole
janedoe@email.com Jane Doe
johndoe@email.com John Doe
df.index
Index(['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com'], dtype='object', name='email')

Why would changing the index be useful?
Because it enables us to look up all the information on someone just by using their email.

df.loc['123pranav@email.com']
first     Pranav
last     Shirole
Name: 123pranav@email.com, dtype: object
df.loc['123pranav@email.com', 'last']
'Shirole'

Note that once you change the index, you cannot use the default index to locate your rows. But you can still use iloc.

# df.loc[0]

# but you can still use iloc
df.iloc[0]
first     Pranav
last     Shirole
Name: 123pranav@email.com, dtype: object

You can also reset the index back to the default.

df.reset_index(inplace=True)
df
email first last
0 123pranav@email.com Pranav Shirole
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe

Filtering

You can filter specific data from the dataset using pandas.

df['last'] == 'Doe'
0    False
1     True
2     True
Name: last, dtype: bool

We get a Series object with Boolean values. True values are the ones that met our filter criteria. False values are the ones that did not meet our filter criteria.

filt = (df['last'] == 'Doe')
df[filt]
email first last
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe
df[df['last'] == 'Doe']
email first last
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe

You can filter data using loc by passing in a series of Boolean values.

df.loc[filt]
email first last
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe

This is great because we can still grab data for specific columns as well.

# remember: df.loc[rows, cols]
df.loc[filt, 'email']
1    janedoe@email.com
2    johndoe@email.com
Name: email, dtype: object
filt = (df['last'] == 'Doe') & (df['first'] == 'John')
df.loc[filt, 'email']
2    johndoe@email.com
Name: email, dtype: object
filt = (df['last'] == 'Shirole') | (df['first'] == 'John')
df.loc[filt, 'email']
0    123pranav@email.com
2      johndoe@email.com
Name: email, dtype: object

You can get the opposite of a filter using ~.

df.loc[~filt, 'email']
1    janedoe@email.com
Name: email, dtype: object

Updating Rows and Columns

Updating Columns

df = pd.DataFrame(people)
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
df.columns
Index(['first', 'last', 'email'], dtype='object')
df.columns = ['first name', 'last name', 'email']
df
first name last name email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
df.columns = [x.upper() for x in df.columns]
df
FIRST NAME LAST NAME EMAIL
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
df.columns = df.columns.str.replace(' ', '_')
df
FIRST_NAME LAST_NAME EMAIL
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
df.columns = [x.lower() for x in df.columns]
df
first_name last_name email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com

We can rename just a subset of the columns by passing a dictionary of old-to-new names to rename. We need to include inplace=True for the changes to take effect.

df.rename(columns = {'first_name': 'first', 'last_name': 'last'}, inplace=True)
df 
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com

Updating Rows

df.loc[2]
first                 John
last                   Doe
email    johndoe@email.com
Name: 2, dtype: object
df.loc[2] = ['John', 'Smith', 'JohnSmith@email.com']
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Smith JohnSmith@email.com
df.loc[2, ['last', 'email']]
last                   Smith
email    JohnSmith@email.com
Name: 2, dtype: object
df.loc[2, ['last', 'email']] = ['Doe', 'JohnDoe@email.com']
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe JohnDoe@email.com
df.loc[2, 'last'] = 'Smith'
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Smith JohnDoe@email.com

The at accessor can be used to look up or change a single scalar value. It works here, but for consistency you may prefer loc.

df.at[2, 'last'] = 'Doe'
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe JohnDoe@email.com
filt = (df['email'] == 'JohnDoe@email.com')
df[filt]
first last email
2 John Doe JohnDoe@email.com
df[filt]['last']
2    Doe
Name: last, dtype: object
# this raises a SettingWithCopyWarning and does not update df
# chained indexing like this operates on a copy, not the original
df[filt]['last'] = 'Smith'
df

/tmp/ipykernel_183/3199811375.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[filt]['last'] = 'Smith'
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe JohnDoe@email.com

Just use loc or at when setting values.

df.loc[filt, 'last'] = 'Smith'
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Smith JohnDoe@email.com
# this does not actually make the change
df['email'].str.lower()
0    123pranav@email.com
1      janedoe@email.com
2      johndoe@email.com
Name: email, dtype: object
df['email'] = df['email'].str.lower()
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Smith johndoe@email.com

Other methods to update

apply, map, applymap, replace

apply

apply is used for calling a function on our values. It works on both Series and DataFrames.

Using apply on a Series.

df['email'].apply(len)
0    19
1    17
2    17
Name: email, dtype: int64
def update_email(email):
    return email.upper()
df['email'].apply(update_email)
0    123PRANAV@EMAIL.COM
1      JANEDOE@EMAIL.COM
2      JOHNDOE@EMAIL.COM
Name: email, dtype: object
df['email'] = df['email'].apply(update_email)
df
first last email
0 Pranav Shirole 123PRANAV@EMAIL.COM
1 Jane Doe JANEDOE@EMAIL.COM
2 John Smith JOHNDOE@EMAIL.COM

lambda functions are anonymous functions, without a name.

df['email'] = df['email'].apply(lambda x: x.lower())
df 
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Smith johndoe@email.com

Using apply on a DataFrame.

df['email'].apply(len)
0    19
1    17
2    17
Name: email, dtype: int64
df.apply(len)
first    3
last     3
email    3
dtype: int64

The above code returns the number of rows in each column.

len(df['email'])
3
df.apply(len, axis='columns')
0    3
1    3
2    3
dtype: int64
df.apply(pd.Series.min) # will choose alphabetically
first                   Jane
last                     Doe
email    123pranav@email.com
dtype: object

The above code returns the lowest value alphabetically in the first, last and email columns respectively.

df.apply(lambda x: x.min())
first                   Jane
last                     Doe
email    123pranav@email.com
dtype: object

This is usually more useful when your dataframe contains numerical data.

Running apply on a Series applies the function to every value in the Series. Running apply on a DataFrame applies the function to every Series (each column, by default) in the DataFrame.
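A small numeric sketch of the difference (the frame here is made up for illustration):

```python
import pandas as pd

nums = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# On a Series, the function receives each individual value.
doubled = nums['a'].apply(lambda v: v * 2)

# On a DataFrame, the function receives each column as a Series.
ranges = nums.apply(lambda col: col.max() - col.min())

print(doubled.tolist())  # [2, 4, 6]
print(ranges.tolist())   # [2, 20]
```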

applymap

applymap applies a function to every individual element in the DataFrame. It only works on DataFrames. (In pandas 2.1+, applymap was renamed to DataFrame.map.)

df.applymap(len)
first last email
0 6 7 19
1 4 3 17
2 4 5 17
df.applymap(str.lower)
first last email
0 pranav shirole 123pranav@email.com
1 jane doe janedoe@email.com
2 john smith johndoe@email.com

map

The map method is used for substituting each value in a Series with another value. It only works on a Series.

df['first'].map({'Corey': 'Chris', 'Jane': 'Mary'})
0     NaN
1    Mary
2     NaN
Name: first, dtype: object

Note that the values we didn't substitute were converted to NaN values. That may not be what we want.
If you want to keep the other names, use the replace method.

replace

df['first'].replace({'Corey': 'Chris', 'Jane': 'Mary'})
0    Pranav
1      Mary
2      John
Name: first, dtype: object
df['first'] = df['first'].replace({'Corey': 'Chris', 'Jane': 'Mary'})
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Mary Doe janedoe@email.com
2 John Smith johndoe@email.com

Add/Remove Rows and Columns

df = pd.DataFrame(people)
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com

Add column

df['first'] + ' ' + df['last']
0    Pranav Shirole
1          Jane Doe
2          John Doe
dtype: object
df['full_name'] = df['first'] + ' ' + df['last']
df
first last email full_name
0 Pranav Shirole 123pranav@email.com Pranav Shirole
1 Jane Doe janedoe@email.com Jane Doe
2 John Doe johndoe@email.com John Doe

Drop column

df.drop(columns=['first', 'last'], inplace=True)
df
email full_name
0 123pranav@email.com Pranav Shirole
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe

We can also reverse this process and split the full_name column into two different columns.

# split where there is a space
df['full_name'].str.split(' ')
0    [Pranav, Shirole]
1          [Jane, Doe]
2          [John, Doe]
Name: full_name, dtype: object

The result of the above code is a Series of lists, where the first name is the first element and the last name is the second.
To assign these values to two different columns, we need to expand the lists into separate columns using the expand argument.

df['full_name'].str.split(' ', expand=True)
0 1
0 Pranav Shirole
1 Jane Doe
2 John Doe

We get two columns of the split result.

df[['first', 'last']] = df['full_name'].str.split(' ', expand=True)
df
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
1 janedoe@email.com Jane Doe Jane Doe
2 johndoe@email.com John Doe John Doe

We have added the first and last columns with the values of the list.

Add rows

You can add a new row using append. Make sure to use ignore_index=True as an argument.

# use ignore_index=True
df.append({'first': 'Tony'}, ignore_index=True)
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
1 janedoe@email.com Jane Doe Jane Doe
2 johndoe@email.com John Doe John Doe
3 NaN NaN Tony NaN

Since we only assigned a first name to our row, the other values in the columns are NaN (not a number), used for missing values.

Let's create another dataframe.

people2 = {
    'first': ['Tony', 'Steve'], 
    'last': ['Stark', 'Rogers'], 
    'email': ['ironman@email.com', 'cap@email.com']
}
df2 = pd.DataFrame(people2)
df2
first last email
0 Tony Stark ironman@email.com
1 Steve Rogers cap@email.com

Let's add the two dataframes together.

df.append(df2, ignore_index=True)
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
1 janedoe@email.com Jane Doe Jane Doe
2 johndoe@email.com John Doe John Doe
3 ironman@email.com NaN Tony Stark
4 cap@email.com NaN Steve Rogers

Since append has no inplace argument, we need to assign the result back to df to keep the change.

df = df.append(df2, ignore_index=True)
df
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
1 janedoe@email.com Jane Doe Jane Doe
2 johndoe@email.com John Doe John Doe
3 ironman@email.com NaN Tony Stark
4 cap@email.com NaN Steve Rogers
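Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On current versions, the same result comes from pd.concat; a minimal sketch with made-up frames:

```python
import pandas as pd

df = pd.DataFrame({'first': ['Pranav'], 'last': ['Shirole']})
df2 = pd.DataFrame({'first': ['Tony', 'Steve'], 'last': ['Stark', 'Rogers']})

# pd.concat stacks the frames; ignore_index=True renumbers the rows 0..n-1.
combined = pd.concat([df, df2], ignore_index=True)

print(combined['first'].tolist())  # ['Pranav', 'Tony', 'Steve']
```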

Remove rows

To remove rows, we can pass in the indexes that we want to remove using the drop method. To apply the changes, use inplace=True.

# to apply changes, use inplace=True
df.drop(index=4)
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
1 janedoe@email.com Jane Doe Jane Doe
2 johndoe@email.com John Doe John Doe
3 ironman@email.com NaN Tony Stark

To drop rows matching a condition, pass in the index of the filtered dataframe.

# note the index method used in the end
df.drop(index=df[df['last'] == 'Doe'].index)
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
3 ironman@email.com NaN Tony Stark
4 cap@email.com NaN Steve Rogers
# this is more readable
filt = df['last'] == 'Doe'
df.drop(index=df[filt].index)
email full_name first last
0 123pranav@email.com Pranav Shirole Pranav Shirole
3 ironman@email.com NaN Tony Stark
4 cap@email.com NaN Steve Rogers
df.drop(['full_name'], axis=1, inplace=True)

Sorting

Sort columns

df.sort_values(by='last')
email first last
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe
4 cap@email.com Steve Rogers
0 123pranav@email.com Pranav Shirole
3 ironman@email.com Tony Stark
df.sort_values(by='last', ascending=False)
email first last
3 ironman@email.com Tony Stark
0 123pranav@email.com Pranav Shirole
4 cap@email.com Steve Rogers
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe

When sorting on multiple columns, rows with identical values in the first column are then sorted by the second column.
So in the case below, the rows are first sorted by last name, and the rows that share the last name Doe are then sorted by first name.

df.sort_values(by=['last', 'first'])
email first last
1 janedoe@email.com Jane Doe
2 johndoe@email.com John Doe
4 cap@email.com Steve Rogers
0 123pranav@email.com Pranav Shirole
3 ironman@email.com Tony Stark
people3 = {
    'first': ['Pranav', 'Jane', 'John', 'Thor'], 
    'last': ['Shirole', 'Doe', 'Doe', 'Odinson'], 
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com', 'thor@email.com']
}
df = pd.DataFrame(people3)
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
3 Thor Odinson thor@email.com

We can also have one column sorted in ascending order and another in descending order.

# last name in descending and first in ascending
# pass a list to ascending too
df.sort_values(by=['last', 'first'], ascending=[False, True], inplace=True)
df
first last email
0 Pranav Shirole 123pranav@email.com
3 Thor Odinson thor@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com

Here, we see that the indexes have moved along with the sorted values.
We can restore the original row order by sorting on the index with sort_index.

df.sort_index(inplace=True)
df
first last email
0 Pranav Shirole 123pranav@email.com
1 Jane Doe janedoe@email.com
2 John Doe johndoe@email.com
3 Thor Odinson thor@email.com

To sort only a single column, i.e. a Series, we can use sort_values.

# just leave the arguments blank
df['last'].sort_values()
1        Doe
2        Doe
3    Odinson
0    Shirole
Name: last, dtype: object

Grouping and Aggregating

people4 = {
    'first': ['Pranav', 'Jane', 'John', 'Thor', 'Tony', 'Steve', 'Bruce', 'Clark'], 
    'last': ['Shirole', 'Doe', 'Doe', 'Odinson', 'Stark', 'Rogers', 'Wayne', 'Kent'], 
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com', 'thor@email.com', 
              'ironman@email.com', 'cap@email.com', 'thebatman@email.com', None],
    'age': [30, 35, 25, 90, 50, 75, 40, np.nan],
    'salary': [80000, 120000, 100000, 80000, 500000, 45000, 350000, None],
    'work': ['analyst', 'developer', 'developer', 'avenger', 'avenger', 'avenger', 'vigilante', 'superhero'],
    'country': ['India', 'India', 'India', 'USA', 'USA', 'USA', 'USA', 'Krypton']
}
df = pd.DataFrame(people4)
df
first last email age salary work country
0 Pranav Shirole 123pranav@email.com 30.0 80000.0 analyst India
1 Jane Doe janedoe@email.com 35.0 120000.0 developer India
2 John Doe johndoe@email.com 25.0 100000.0 developer India
3 Thor Odinson thor@email.com 90.0 80000.0 avenger USA
4 Tony Stark ironman@email.com 50.0 500000.0 avenger USA
5 Steve Rogers cap@email.com 75.0 45000.0 avenger USA
6 Bruce Wayne thebatman@email.com 40.0 350000.0 vigilante USA
7 Clark Kent None NaN NaN superhero Krypton

Aggregating

# ignore the NaN values
df['salary'].mean()
182142.85714285713
# ignore the NaN values
df['salary'].median()
100000.0
# only for numerical columns (newer pandas versions require numeric_only=True here)
df.median()
age           40.0
salary    100000.0
dtype: float64
df['salary'].count()
7
df['work']
0      analyst
1    developer
2    developer
3      avenger
4      avenger
5      avenger
6    vigilante
7    superhero
Name: work, dtype: object
df['work'].value_counts()
avenger      3
developer    2
vigilante    1
analyst      1
superhero    1
Name: work, dtype: int64

We can use the normalize=True argument for the value_counts method to view percentages.

df['work'].value_counts(normalize=True)
avenger      0.375
developer    0.250
vigilante    0.125
analyst      0.125
superhero    0.125
Name: work, dtype: float64

37.5% of the people are avengers. 25% are developers.

Grouping

Let's group the people by their country.

df.groupby(['country'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f30d4c6d610>

This DataFrameGroupBy object contains a bunch of groups.
Let's set the GroupBy object as a variable to reuse it later.

country_grp = df.groupby(['country'])
country_grp.get_group('USA')
first last email age salary work country
3 Thor Odinson thor@email.com 90.0 80000.0 avenger USA
4 Tony Stark ironman@email.com 50.0 500000.0 avenger USA
5 Steve Rogers cap@email.com 75.0 45000.0 avenger USA
6 Bruce Wayne thebatman@email.com 40.0 350000.0 vigilante USA

We can also use a filter to perform the same task as above.

filt = df['country'] == 'USA'
df.loc[filt]
first last email age salary work country
3 Thor Odinson thor@email.com 90.0 80000.0 avenger USA
4 Tony Stark ironman@email.com 50.0 500000.0 avenger USA
5 Steve Rogers cap@email.com 75.0 45000.0 avenger USA
6 Bruce Wayne thebatman@email.com 40.0 350000.0 vigilante USA
df.loc[filt]['work'].value_counts()
avenger      3
vigilante    1
Name: work, dtype: int64
filt = df['country'] == 'India'
df.loc[filt]
first last email age salary work country
0 Pranav Shirole 123pranav@email.com 30.0 80000.0 analyst India
1 Jane Doe janedoe@email.com 35.0 120000.0 developer India
2 John Doe johndoe@email.com 25.0 100000.0 developer India
df.loc[filt]['work'].value_counts()
developer    2
analyst      1
Name: work, dtype: int64
country_grp['work'].value_counts().loc['India']
work
developer    2
analyst      1
Name: work, dtype: int64

The DataFrameGroupBy object is useful because it lets us run one operation for every country in the dataset without writing a separate filter for each country.

country_grp['work'].value_counts().loc['USA']
work
avenger      3
vigilante    1
Name: work, dtype: int64
country_grp['salary'].median()
country
India      100000.0
Krypton         NaN
USA        215000.0
Name: salary, dtype: float64
country_grp['salary'].median().loc['India']
100000.0

Multiple aggregate functions

We can also get multiple aggregate functions at once using agg.

country_grp['salary'].agg(['median', 'mean'])
median mean
country
India 100000.0 100000.0
Krypton NaN NaN
USA 215000.0 243750.0
country_grp['salary'].agg(['median', 'mean']).loc['USA']
median    215000.0
mean      243750.0
Name: USA, dtype: float64
country_grp['salary'].agg(['median', 'mean']).loc[['USA', 'India']]
median mean
country
USA 215000.0 243750.0
India 100000.0 100000.0

We can calculate sums using sum. It works on numbers as well as Boolean data types (where it will take True as 1 and False as 0).

filt = df['country'] == 'India'
df.loc[filt]['work'].str.contains('deve').sum()
2
avengers = country_grp['work'].apply(lambda x: x.str.contains('aven').sum())
avengers
country
India      0
Krypton    0
USA        3
Name: work, dtype: int64
country_respondents = df['country'].value_counts()
country_respondents
USA        4
India      3
Krypton    1
Name: country, dtype: int64

We can combine more than one Series together using concat.

avengers_df = pd.concat([country_respondents, avengers], axis='columns', sort=True)
avengers_df
country work
India 3 0
Krypton 1 0
USA 4 3
avengers_df.rename(columns={'country':'number of people', 'work':'are avengers'}, inplace=True)
avengers_df
number of people are avengers
India 3 0
Krypton 1 0
USA 4 3
avengers_df['pct are avengers'] = (avengers_df['are avengers']/avengers_df['number of people'])*100
avengers_df
number of people are avengers pct are avengers
India 3 0 0.0
Krypton 1 0 0.0
USA 4 3 75.0
avengers_df.sort_values(by='pct are avengers', ascending=False, inplace=True)
avengers_df
number of people are avengers pct are avengers
USA 4 3 75.0
India 3 0 0.0
Krypton 1 0 0.0
avengers_df.loc['USA']
number of people     4.0
are avengers         3.0
pct are avengers    75.0
Name: USA, dtype: float64

Cleaning Data

people5 = {
    'first': ['Pranav', 'Jane', 'John', 'Bruce', np.nan, None, 'NA'], 
    'last': ['Shirole', 'Doe', 'Doe', 'Wayne', np.nan, None, 'Missing'], 
    'email': ['123pranav@email.com', 'janedoe@email.com', 'johndoe@email.com', 
              None, np.nan, 'anonymous@email.com', 'Missing'],
    'age': ['32', '38', '40', '45', None, None, 'Missing']
}
df = pd.DataFrame(people5)
df
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
3 Bruce Wayne None 45
4 NaN NaN NaN None
5 None None anonymous@email.com None
6 NA Missing Missing Missing
df.dropna()
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
6 NA Missing Missing Missing

Two of the default arguments of dropna are as follows:

df.dropna(axis='index', how='any')

The axis argument can be set to either index or columns: index tells pandas to drop rows that have missing values, while columns tells pandas to drop columns with missing values. The how argument can be set to either any or all: any drops rows (or columns) with one or more missing values, while all drops rows (or columns) in which every value is missing.

df.dropna(axis='columns', how='all')
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
3 Bruce Wayne None 45
4 NaN NaN NaN None
5 None None anonymous@email.com None
6 NA Missing Missing Missing
df.dropna(axis='columns', how='any')
0
1
2
3
4
5
6

We get an empty dataframe because any columns with even a single missing value are dropped.

df.dropna(axis='index', how='any', subset=['email'])
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
5 None None anonymous@email.com None
6 NA Missing Missing Missing
df.dropna(axis='index', how='all', subset=['last', 'email']) 
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
3 Bruce Wayne None 45
5 None None anonymous@email.com None
6 NA Missing Missing Missing
df.replace('NA', np.nan, inplace=True)
df.replace('Missing', np.nan, inplace=True)
df
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
3 Bruce Wayne None 45
4 NaN NaN NaN None
5 None None anonymous@email.com None
6 NaN NaN NaN NaN
df.dropna()
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
df.dropna(axis='index', how='all', subset=['last', 'email']) 
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
3 Bruce Wayne None 45
5 None None anonymous@email.com None

You can see which values would and would not be treated as NA by using the isna method.

df.isna()
first last email age
0 False False False False
1 False False False False
2 False False False False
3 False False True False
4 True True True True
5 True True False True
6 True True True True

To substitute NA values with a specific value, use the fillna method.

df.fillna('MISSING')
first last email age
0 Pranav Shirole 123pranav@email.com 32
1 Jane Doe janedoe@email.com 38
2 John Doe johndoe@email.com 40
3 Bruce Wayne MISSING 45
4 MISSING MISSING MISSING MISSING
5 MISSING MISSING anonymous@email.com MISSING
6 MISSING MISSING MISSING MISSING

Check the data types using the dtypes attribute.

df.dtypes
first    object
last     object
email    object
age      object
dtype: object

So if we wanted the average age, it wouldn't work with the current object data type.

type(np.nan)
float

As you can see above, NaN values have the data type float. This means that if your dataset has NaN values and you want to perform some math on the numbers, you need to convert the column data type to float (not int). Another option is to replace the missing values with a number like 0 and then convert the data type to int, but in most cases this is a bad idea (for example, when you want to find an average).
Note: If you have a dataframe in which all values share the same data type, and you want to convert all columns at once to another data type, you can use the astype method of the DataFrame object. For example, you can convert all int columns to float using df.astype(float).
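A minimal sketch of such a whole-frame cast (the frame here is made up for illustration):

```python
import pandas as pd

nums = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# astype on the DataFrame casts every column at once.
floats = nums.astype(float)

print(list(floats.dtypes))  # [dtype('float64'), dtype('float64')]
```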

df['age'] = df['age'].astype(float)
df.dtypes
first     object
last      object
email     object
age      float64
dtype: object
df['age'].mean()
38.75