Pandas

Introduction

Pandas is a high-level data manipulation tool built on top of the NumPy package. Its key data structure is called the DataFrame, which allows us to store and manipulate tabular data (we can think of the rows as different "observations" and the columns as the variables).

Note

Intuitively, pandas DataFrames can be thought of as a way to hold Excel spreadsheet data in a Python object.

Since Pandas is not part of the standard Python library, we will first need to install it in our virtual environment, for instance with pip:

pip install pandas

Once we have it installed, to use it in our code we will need to import it with

import pandas

# Code here
...

Note

Often, pandas is imported under the shorter alias pd, as:

import pandas as pd
This is of course not mandatory, but very common.

After importing the package, we can start using its most important object, the DataFrame. There are several ways to create a DataFrame, for example:

dict = {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
       "capital": ["Brasilia", "Moscow", "New Dehli", "Beijing", "Pretoria"],
       "area": [8.516, 17.10, 3.286, 9.597, 1.221],
       "population": [200.4, 143.5, 1252, 1357, 52.98] }

import pandas as pd
brics = pd.DataFrame(dict)
print(brics)
which outputs
        country    capital    area  population
0        Brazil   Brasilia   8.516      200.40
1        Russia     Moscow  17.100      143.50
2         India  New Delhi   3.286     1252.00
3         China    Beijing   9.597     1357.00
4  South Africa   Pretoria   1.221       52.98

Note

If you want a nice, complete introduction and walkthrough to Pandas, it is recommended to check the official documentation's own guide. The notes on this page are not comprehensive.

Pandas vs. Excel

Tasks such as data cleaning, data normalization, visualization, and statistical analysis can be performed in both Excel and Pandas. That being said, Pandas has some major benefits over Excel:

  • Size limits: Excel can handle around 1 million rows per worksheet, while Pandas can handle many millions of rows (the limit is the machine's computing power and memory).
  • Complex data transformations: memory-intensive computations in Excel can crash a workbook, while Python can handle complex computations without major problems.
  • Automation: Excel was not designed to automate tasks. You can create a macro or use VBA to simplify some tasks, but Python is a general-purpose programming language (meaning we can program almost anything).

Basic objects: Series and DataFrames

Fundamentally, Pandas has two main objects that we will be using:

  • Pandas Series are one-dimensional labeled arrays capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). We can think of a Pandas Series as a single "column" of a table.
  • Pandas DataFrames are two-dimensional labeled data structures with columns of potentially different types. We can think of them as a spreadsheet or SQL table, or as a dict of Series objects.

Note

In general we will mostly be using Pandas DataFrames, which are the most commonly used pandas object. However, since a DataFrame is a collection of Series, Series will often appear as the result of some DataFrame methods.

Series creation

In a Series object, the axis labels are collectively referred to as the index. The basic method to create a Series is to call:

s = pd.Series(data, index=index_name)

Here, data can be many different things, as long as it is one-dimensional (or a scalar). For example:

  • A Python dict or list
  • A scalar value
  • A numpy array

If data is a numpy array, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].

import numpy as np

s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

s
Out[4]: 
a    0.469112
b   -0.282863
c   -1.509059
d   -1.135632
e    1.212112
dtype: float64

s.index
Out[5]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

pd.Series(np.random.randn(5))
Out[6]: 
0   -0.173215
1    0.119209
2   -1.044236
3   -0.861849
4   -2.104569
dtype: float64
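
As a quick sketch of the other input types listed above: if data is a dict, its keys become the index; if it is a scalar, the value is repeated to match the index (which must then be provided):

d = {"b": 1, "a": 0, "c": 2}

pd.Series(d)
Out[7]: 
b    1
a    0
c    2
dtype: int64

pd.Series(5.0, index=["a", "b", "c"])
Out[8]: 
a    5.0
b    5.0
c    5.0
dtype: float64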

Note

The index of a Series is used to label and identify each element of the underlying data. We can access the index of a Series with the .index attribute:

cities = ['Kolkata', 'Chicago', 'Toronto', 'Lisbon']
populations = [14.85, 2.71, 2.93, 0.51]
city_series = pd.Series(populations, index=cities)
print(city_series.index)

# Output:
Index(['Kolkata', 'Chicago', 'Toronto', 'Lisbon'], dtype='object')

Dataframe creation

DataFrames are, in practice, a dictionary of columns; that's why column labels must be unique. Each column corresponds to a Series, and different columns can have different types. Rows are identified by an index, shared by all the columns (which, unlike column labels, may be non-unique).

To instantiate a DataFrame we may use different kinds of input, such as:

  • A dict of 1D objects: Numpy arrays, lists, dicts, or Series. In any case, the arrays must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the result will be range(n), where n is the array length:

    d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}
    
    pd.DataFrame(d)
    Out[46]: 
       one  two
    0  1.0  4.0
    1  2.0  3.0
    2  3.0  2.0
    3  4.0  1.0
    
    pd.DataFrame(d, index=["a", "b", "c", "d"])
    Out[47]: 
       one  two
    a  1.0  4.0
    b  2.0  3.0
    c  3.0  2.0
    d  4.0  1.0
    
  • From a list of dicts (where each dict has the same keys, which correspond to the DataFrame-to-be columns):

    In [53]: data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
    
    In [54]: pd.DataFrame(data2)
    Out[54]: 
       a   b     c
    0  1   2   NaN
    1  5  10  20.0
    
    In [55]: pd.DataFrame(data2, index=["first", "second"])
    Out[55]: 
            a   b     c
    first   1   2   NaN
    second  5  10  20.0
    
    In [56]: pd.DataFrame(data2, columns=["a", "b"])
    Out[56]: 
       a   b
    0  1   2
    1  5  10
    
  • Tables read from a .csv file or an Excel file (e.g., .xlsx)
  • A single Pandas Series
  • A 2-D numpy.ndarray

In any case, the DataFrame construction always follows the same syntax:

import pandas as pd

# suppose "a" is a Python object type that can be converted to a DataFrame
a = ...

# DataFrame object creation
df = pd.DataFrame(a)

Note

Although Series are 1D, nothing prevents us from converting a Pandas Series into a Pandas DataFrame. This is often very convenient since DataFrames and Series do not have the same methods available.
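
One way to do this conversion is the Series.to_frame() method (a minimal sketch; the name "values" is just an example):

s = pd.Series([1, 2, 3], name="values")
df = s.to_frame()  # a DataFrame with a single column named "values"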

Reading/writing dataframes

Dataframes are very useful, but how can we transfer tabular information between different Python sessions?

Note

Remember that information stored in Python objects disappears once the current execution is finished.

The most common way to achieve this is to save the tables in files. These files can be reused in future executions, or shared over the internet (this is mostly how we download datasets).

A simple way to store big data sets is to use .csv files (comma-separated values). CSV files contain plain text and are a well-known format that everyone (including Pandas) can read.

Reading from a CSV

Reading from a CSV is as easy as:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.to_string()) 

# Output

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4

If the file uses a different separator (instead of commas), or has no header row, the pandas documentation has several examples of how to configure the .read_csv() method.

Note

In the previous example, the to_string() method is not really necessary. This method is used to make sure that we are printing the whole DataFrame (otherwise, if it has many columns and rows, pandas might shorten it when we try to print).
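
For instance, a sketch of reading a semicolon-separated file with no header row (the file name and column names here are hypothetical; sep, header and names are standard .read_csv() parameters):

import pandas as pd

df = pd.read_csv('data_semicolon.csv', sep=';', header=None,
                 names=['Duration', 'Pulse', 'Maxpulse', 'Calories'])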

Saving to a CSV

Saving an already existing DataFrame df into a .csv can be accomplished with:

df.to_csv('file_name.csv')

If you want to export without the index, simply add index=False:

df.to_csv('file_name.csv', index=False)

If you get a UnicodeEncodeError, simply add encoding='utf-8':

df.to_csv('file_name.csv', encoding='utf-8')

Converting a dataframe to a dictionary

We can convert a dataframe to a dictionary with the .to_dict() method. This method has several options (controlled by its orient parameter). By default, it converts the dataframe to a nested dictionary, where the keys are the column names and each value is another dictionary mapping row index to cell value:

import pandas as pd

df = pd.DataFrame(
    {
        "first": ["John", "Mary"],
        "last": ["Doe", "Bo"],
        "job": ["Nurse", "Economist"],
        "height": [5.5, 6.0],
        "weight": [130, 150],
    }
)

print(df.to_dict())
# Output
{'first': {0: 'John', 1: 'Mary'}, 
 'last': {0: 'Doe', 1: 'Bo'}, 
 'job': {0: 'Nurse', 1: 'Economist'}, 
 'height': {0: 5.5, 1: 6.0}, 
 'weight': {0: 130, 1: 150}
 }

There are other options, such as converting the dataframe to a dictionary of lists (orient="list"), or to a list of dictionaries, one per row (orient="records").
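
For instance, with the same df as above:

print(df.to_dict(orient="list"))
# {'first': ['John', 'Mary'], 'last': ['Doe', 'Bo'], 'job': ['Nurse', 'Economist'],
#  'height': [5.5, 6.0], 'weight': [130, 150]}

print(df.to_dict(orient="records"))
# [{'first': 'John', 'last': 'Doe', 'job': 'Nurse', 'height': 5.5, 'weight': 130},
#  {'first': 'Mary', 'last': 'Bo', 'job': 'Economist', 'height': 6.0, 'weight': 150}]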

Viewing DataFrames

To view the contents of a DataFrame, we can use one of several options:

  • Using the print(some_dataframe) function
  • To view the top of a dataframe, use DataFrame.head()
  • To view the bottom of a dataframe, use DataFrame.tail()
In [13]: df.head()
Out[13]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

In [14]: df.tail(3)
Out[14]: 
                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988

We can also explore the data with an IDE such as PyCharm, or we can export the DataFrame to a .csv file and open it with a spreadsheet editor.

Accessing data in a DataFrame

There are 3 options to access the data of a pandas object:

  • [], the "standard getter" which is highly overloaded
  • .loc[], label based selector
  • .iloc[], integer location based selector

Note

The [] operator is highly overloaded and its use is discouraged. Use .loc[] and .iloc[] whenever possible.

Accessing data in one or more columns

We can either use the standard getter [] or .loc[].

With the standard getter it works like this (the return type is another DataFrame):

my_dataframe = pd.DataFrame(
  {"colA": [10, 20, 30], 4.5: ["Alice", "Eve", "Bob"]},
  index=["a", "b", "b"],
)
>>> my_dataframe[[4.5]]
     4.5
a  Alice
b    Eve
b    Bob
>>> my_dataframe[[4.5, "colA"]]
     4.5  colA
a  Alice    10
b    Eve    20
b    Bob    30

To select columns with the .loc[] operator we use the syntax my_dataframe.loc[:, col_selector]. The : indicates that we want to select all rows (more on that later), and col_selector should be a list of the columns that we want to select:

my_dataframe = pd.DataFrame(
  {"colA": [10, 20, 30, 40], 4.5: ["Alice", "Eve", "Bob", "Charlie"]},
  index=["a", "b", "b", "c"],
)
>>> my_dataframe.loc[:, [4.5]]  # returns a DataFrame
       4.5
a    Alice
b      Eve
b      Bob
c  Charlie
>>> my_dataframe.loc[:, [4.5, "colA"]]  # returns a DataFrame
       4.5  colA
a    Alice    10
b      Eve    20
b      Bob    30
c  Charlie    40

Note

In both examples we could have used a column name, instead of a list of column names, as a selector. In this case, the return type, instead of a DataFrame, would be a Pandas Series. In general, try to avoid working with Series whenever possible and only use methods that return DataFrames.
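
For example, with the my_dataframe defined above:

print(type(my_dataframe.loc[:, 4.5]))    # <class 'pandas.core.series.Series'>
print(type(my_dataframe.loc[:, [4.5]]))  # <class 'pandas.core.frame.DataFrame'>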

Slicing

Slices act on the row index, and are a way to access data from row X to row Y. Integer slices follow the same logic as explained before: a slice [1:3] returns rows 1 and 2 (i.e., the upper limit 3 is excluded). Label-based slices, like [:"b"] in the example below, are different: they include both endpoints. The return is a DataFrame:

my_dataframe = pd.DataFrame(
  {"colA": [10, 20, 30, 40], 4.5: ["Alice", "Eve", "Bob", "Charlie"]},
  index=["a", "b", "b", "c"],
)
>>> my_dataframe[1:3]
   colA  4.5
b    20  Eve
b    30  Bob
>>> my_dataframe[:"b"]
   colA    4.5
a    10  Alice
b    20    Eve
b    30    Bob

Selecting rows and columns at the same time

To select rows with the .loc[] operator we use the syntax my_dataframe.loc[row_selector, column_selector]. As before, column_selector can be :, which indicates that we want to select all columns (but we could also put in here any list of columns we want).

Note

my_dataframe.loc[row_selector, :] can also be written as my_dataframe.loc[row_selector], without the :. However, the first version is more explicit, and so its use is recommended.

We can select rows based on row name, row subset (by using a list, as with columns), a slice or, interestingly, based on boolean conditions:

my_dataframe = pd.DataFrame(
  {"colA": [10, 20, 30, 40], 4.5: ["Alice", "Eve", "Bob", "Charlie"]},
  index=["a", "b", "b", "c"],
)

>>> my_dataframe.loc[["c", "a"], :]
   colA      4.5
c    40  Charlie
a    10    Alice


>>> my_dataframe.loc["b":"c", :]
   colA      4.5
b    20      Eve
b    30      Bob
c    40  Charlie

>>> my_dataframe.loc[my_dataframe["colA"] > 30, [4.5]]
       4.5
c  Charlie

Note

In the last example we have used a simple "greater than" boolean condition to select data in the DataFrame, but we can make this boolean condition as sophisticated or complex as we want, for example

df.loc[(df["a"] > 0) & (df["b"] == df["d"]), ["a", "c", "g"]

To filter rows by the contents of a string column, we can use the .str accessor:

df.loc[df["a"].str.contains("hello"), :]

Accessing data from its location on the DataFrame

The .iloc[] operator works the same as .loc[], but on the integer positions of rows and columns. The syntax is the same, my_dataframe.iloc[row_selector, column_selector], and you can use : to indicate all rows or columns.

Note

iloc does not use the index labels to select rows and columns, but their integer location. Hence, my_dataframe.iloc[0, :] will always return the first row of the DataFrame, regardless of the index.

Column selectors
  • Selecting a single column (my_dataframe.iloc[:, 1]) always returns a Series.
  • Selecting a list of columns (my_dataframe.iloc[:, [1]]) always returns a DataFrame.
Row selectors
  • Selecting a single integer position (my_dataframe.iloc[1, :]) always returns a Series.
  • Selecting with a list of integer positions (my_dataframe.iloc[[1, 5], :]) always returns a DataFrame.
  • Selecting with a slice of integer positions (my_dataframe.iloc[1:4, :]) always returns a DataFrame. Recall that the upper limit is excluded.
  • Selecting with a list of booleans (my_dataframe.iloc[[True, False, True, False], :], one boolean per row) always returns a DataFrame (see the sketch below)
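
A short sketch illustrating these rules on the my_dataframe used above:

>>> my_dataframe.iloc[0, :]  # a single position: returns a Series
colA       10
4.5     Alice
Name: a, dtype: object
>>> my_dataframe.iloc[[0, 2], [1]]  # lists of positions: returns a DataFrame
     4.5
a  Alice
b    Bob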

Setting new values on a DataFrame

We can use the same methods that we have seen for accessing data to set new values in the DataFrame. We just need to put the dataframe cell selection on the left side of a Python assignment, for example:

# this assigns the value 10 to all cells in the column "D"
df.loc[:, "D"] = 10

# this assigns the value 0 to all cells with negative values
# (note the plain [] getter: .loc[] does not accept a two-dimensional boolean mask)
df[df < 0] = 0

Working with DataFrames

This section describes very common methods and tasks that you will perform with pandas DataFrames. Before we start, however, one word of caution: DataFrames are objects of the pandas.DataFrame class. As objects, they have many methods available, and we might expect that calling one of these methods modifies the original DataFrame (after all, methods can change the attributes of an instance). However, in pandas this is not the case.

Note

By default, in pandas, dataframe operations return a copy of the dataframe and leave the original dataframe data intact.

Let's see this with an example. Assume we have the following dataframe df:

print(df)

# Output
  col1  col2  col3 col4
0    A     2     0    a
1    A     1     1    B
2    B     9     9    c
3  NaN     8     4    D
4    D     7     2    e
5    C     4     3    F

If we sort it by col1 and then print df again, the result will not change:

df.sort_values(by=['col1'])
print(df)

# Output
  col1  col2  col3 col4
0    A     2     0    a
1    A     1     1    B
2    B     9     9    c
3  NaN     8     4    D
4    D     7     2    e
5    C     4     3    F

This is because these methods do not change the internal state (i.e., the attributes) of the df instance. Hence, if we want to use the output of one of these methods, we need to save its return value to another variable (or chain it):

new_df = df.sort_values(by=['col1'])
print(new_df)

# Output
  col1  col2  col3 col4
0    A     2     0    a
1    A     1     1    B
2    B     9     9    c
5    C     4     3    F
4    D     7     2    e
3  NaN     8     4    D

Hence, if we want to "update" the DataFrame after a change, we will need to assign the output of the method to the original variable, like:

new_df = df.sort_values(by=['col1'])

Note

We can make these functions modify the original dataframe with the optional parameter inplace=True (False is the default). However, in general its use is discouraged.

Common tasks

Finding the size of a DataFrame
  • To display the number of rows, columns, etc.: df.info()
  • To get the number of rows and columns: df.shape
  • To get the number of rows: len(df)
  • To get the number of columns: len(df.columns)

We also have access to the df.empty attribute, of bool type, which is True if the DataFrame is empty and False otherwise.
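
For example, a quick sketch:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

print(df.shape)         # (3, 2)
print(len(df))          # 3
print(len(df.columns))  # 2
print(df.empty)         # False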

Removing duplicates

The pandas drop_duplicates() method helps in removing duplicate rows from DataFrames.

import pandas as pd 

data = { 
    "A": ["TeamA", "TeamB", "TeamB", "TeamC", "TeamA"], 
    "B": [50, 40, 40, 30, 50], 
    "C": [True, False, False, False, True] 
} 

df = pd.DataFrame(data) 

print(df.drop_duplicates())

# Output
       A   B      C
0  TeamA  50   True
1  TeamB  40  False
3  TeamC  30  False

Filling missing data

We can use the .isna() method to detect cells without values and, combined with .loc[], to replace them if we need to:

df.loc[df["a"].isna(), "a"] = 0

Note

The .isnull() method is an alias for the .isna() method, so both are exactly the same.

Note

Pandas primarily uses the value np.nan to represent missing data. Missing data is by default not included in computations.
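
To fill missing values instead of (or in addition to) detecting them, the .fillna() method is the usual tool. A minimal sketch:

import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

df = df.fillna(0)  # replace every NaN in the DataFrame with 0
# or, for a single column, e.g. filling with the column mean:
# df["a"] = df["a"].fillna(df["a"].mean())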

Resetting the index

After dropping and filtering rows of a DataFrame, the original index values of the remaining rows are kept. If we want to re-create the index, dropping the original values, we can do it with

DataFrame.reset_index(drop=True)

Renaming columns

To rename specific columns of a DataFrame, we may use the df.rename() method:

df = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})

Only the columns in the dictionary passed as input will be renamed.

Sorting

We can sort a DataFrame by the values of one or multiple columns, like so:

df.sort_values(by=['col1', 'col2'], ascending=True)

# Output
  col1  col2  col3 col4
1    A     1     1    B
0    A     2     0    a
2    B     9     9    c
5    C     4     3    F
4    D     7     2    e
3  NaN     8     4    D

Creating new columns

There are several ways to create a new column, but the most common one follows this syntax:

df["new_column_name"] = ...

and in place of the ... we can put whatever we want. For example, if we have a list my_list with the same length as the DataFrame has rows, we could do:

df["new_column_name"] = my_list

Alternatively, we could create a new column using the values from other columns, for example:

df["new_column"] = df["A"] + df["B"]

Counting unique values

The unique() function in pandas is used to find the unique values of a Series (to simply count them, there is also the .nunique() method):

import pandas as pd

# Creating a dataframe
df = pd.DataFrame({'Sports': ['Football', 'Cricket', 'Baseball', 'Basketball',
                'Tennis', 'Table-tennis', 'Archery', 'Swimming', 'Boxing'], 
                'Player': ["Messi", "Afridi", "Chad", "Johnny", "Federer",
                 "Yong", "Mark", "Phelps", "Khan"],
                 'Country': ["Argentina", "Pakistan", "England", "England", "Switzerland",
                 "China", "China", "USA", "Pakistan" ],
                 'Rank': [1, 9, 7, 12, 1, 2, 11, 1, 1] })

# Finding unique countries
print(df["Country"].unique())
# Finding unique rankings
print(df["Rank"].unique())

# Output
['Argentina' 'Pakistan' 'England' 'Switzerland' 'China' 'USA']
[ 1  9  7 12  2 11]

Pivoting a dataframe with melt: untidy to tidy data

The pandas.melt() function is used to reshape the data in a dataframe, transforming it from wide format to long format (this is sometimes called unpivoting or melting). The function takes a set of columns and gathers them into two new columns: one with the former column names (the variable) and one with the values.

Example 1

>>> df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
...                    'B': {0: 1, 1: 3, 2: 5},
...                    'C': {0: 2, 1: 4, 2: 6}})
>>> df
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6


>>> pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
3  a        C      2
4  b        C      4
5  c        C      6

Example 2

import pandas as pd

df = pd.DataFrame(
    {
        "first": ["John", "Mary"],
        "last": ["Doe", "Bo"],
        "job": ["Nurse", "Economist"],
        "height": [5.5, 6.0],
        "weight": [130, 150],
    }
)
melt_df = df.melt(
    id_vars=["first", "last"], 
    var_name="quantity", 
    value_vars=["height", "weight"]
)

print("\n Unmelted: ")
print(df)
print("\n Melted: ")
print(melt_df)

# Output

 Unmelted: 
   first  last        job  height  weight
0   John   Doe      Nurse     5.5     130
1   Mary    Bo  Economist     6.0     150

 Melted: 
   first  last  quantity  value
0   John   Doe    height    5.5
1   Mary    Bo    height    6.0
2   John   Doe    weight  130.0
3   Mary    Bo    weight  150.0

Basic statistics

There are many statistical operations already implemented in pandas. We can use them by selecting a subset of the DataFrame (e.g., a few columns) and then calling these methods. Some of the most important are:

  • The mean: some_df.mean()
  • The median: some_df.median()
  • The standard deviation: some_df.std()
  • The min and max: .min() and .max()
  • The number of records for each category in a column:
    In [12]: titanic["Pclass"].value_counts()
    Out[12]: 
    Pclass
    3    491
    1    216
    2    184
    Name: count, dtype: int64
    

There is also a helpful .describe() method that gives you several of these at the same time.
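
A quick sketch on a toy DataFrame:

import pandas as pd

df = pd.DataFrame({"height": [5.5, 6.0, 5.8], "weight": [130, 150, 142]})

print(df["height"].mean())  # 5.766666666666667
print(df.describe())        # count, mean, std, min, quartiles and max per column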

Applying custom functions

The pandas DataFrame apply() function is used to apply a function along an axis of a DataFrame (it also works with Pandas Series). The function we apply can be an external one or one defined by us. It works like this:

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})
def square(x):
    return x * x

df1 = df.apply(square)

print(df)
print(df1)

# Output
   A   B
0  1  10
1  2  20

   A    B
0  1  100
1  4  400

Note

We don't need to apply the function to the whole dataframe. We can slice it and only apply the function to a subset of the columns.

If you look at the above example, our square() function is very simple, so we can easily convert it into a lambda function:

df1 = df.apply(lambda x: x * x)

The output will remain the same as in the last example.
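
The axis parameter of apply() controls whether the function receives whole columns (axis=0, the default) or whole rows (axis=1). For example, to combine values row by row (a sketch using the df from above):

df["total"] = df.apply(lambda row: row["A"] + row["B"], axis=1)

print(df)
# Output
   A   B  total
0  1  10     11
1  2  20     22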

Group by

The groupby() method is used to group the data according to categories and apply a function that aggregates each group. This is easier to see with an example. Suppose that we have the following dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
     "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
     "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
     "C": np.random.randn(8),
     "D": np.random.randn(8),
    }
)

print(df)

# Output
     A      B         C         D
0  foo    one  0.845112  2.525473
1  bar    one -0.485309  0.067261
2  foo    two  1.106288  0.205404
3  bar  three -0.958754  0.923104
4  foo    two  2.033509 -0.436023
5  bar    two -0.945948  0.869062
6  foo    one -0.288766 -1.497993
7  foo  three  0.344638 -0.786353

We can ask ourselves what happens if we combine the data from all the rows that share values in the columns A and B (e.g., all the rows that have foo in column A and one in column B).

For this we need what's known as an aggregating function. This function answers the following question: assume that I found all the rows that share values in A and B and, somehow, I want to combine the values of each remaining column into a single value. How do I do that? Do I calculate the mean of those values? The sum? The minimum?

With pandas' groupby() method we may use any aggregating function we want: the important thing to understand is that this function will be applied to each set of rows that share the same values in A and B.

Going back to the previous dataframe, and using the sum() function as an example, we could use it as:

df.groupby(["A", "B"], as_index=False).sum()

# Output
     A      B         C         D
0  bar    one -0.485309  0.067261
1  bar  three -0.958754  0.923104
2  bar    two -0.945948  0.869062
3  foo    one  0.556346  1.027480
4  foo  three  0.344638 -0.786353
5  foo    two  3.139796 -0.230619

The syntax is always the same: df.groupby(list_of_columns_used_to_classify, as_index=False).some_func(), where some_func() is the function that is used to aggregate the columns that are not in list_of_columns_used_to_classify.

Note

In the last example the as_index=False is important. If we don't use it, the group by will return a dataframe with a "strange" new index (a MultiIndex), created from the combinations of the values of A and B.
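
We are not limited to a single aggregating function for all columns: the .agg() method accepts a mapping from column to function. A sketch using the same df (the choice of "mean" and "sum" is just an example):

df.groupby(["A", "B"], as_index=False).agg({"C": "mean", "D": "sum"})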

Joins and merges

Pandas provides a single function, merge(), as the equivalent of standard SQL database join operations, but in this case between DataFrames.

Before getting into the details of how to use merge(), you should first understand the various forms of joins. The idea of a merge (or join, as is known in SQL) is that we have two tables, and we stitch them together on the basis of two columns (one from the first table, another from the second table) having the same value. The resulting merged table will be the first table glued with the second table, with the shared column acting as glue. This can be done in different ways:

  • inner merge: the resulting merged table contains only the rows from the original tables that found a match in the other table; rows (from either table) with no match are discarded.
  • outer merge: the opposite of the previous case: every row from either table appears in the merged table, and where there is no match we get null values in the columns coming from the other table.
  • left merge: an intermediate case between the inner and the outer merge. It behaves like the inner merge, but we also keep the rows from the first ("left") table that have no match.
  • right merge: like the left merge, but with the second table acting as the left one (we rarely need a right merge, since we can perform a left merge with the two tables in the opposite order).

These are some examples on how to use the merge() function:

import pandas as pd

# create DataFrames
df1 = pd.DataFrame({'team': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
                    'points': [18, 22, 19, 14, 14, 11, 20, 28]})

df2 = pd.DataFrame({'team': ['A', 'B', 'C', 'D', 'G', 'H', 'Z'],
                    'assists': [4, 9, 14, 13, 10, 8, 15]})

# view DataFrames
print(df1)

  team  points
0    A      18
1    B      22
2    C      19
3    D      14
4    E      14
5    F      11
6    G      20
7    H      28

print(df2)

  team  assists
0    A        4
1    B        9
2    C       14
3    D       13
4    G       10
5    H        8
6    Z       15


df1.merge(df2, on='team', how='outer')
# Output
  team  points  assists
0    A    18.0      4.0
1    B    22.0      9.0
2    C    19.0     14.0
3    D    14.0     13.0
4    E    14.0      NaN
5    F    11.0      NaN
6    G    20.0     10.0
7    H    28.0      8.0
8    Z     NaN     15.0

df1.merge(df2, on='team', how='left')
# Output
  team  points  assists
0    A      18      4.0
1    B      22      9.0
2    C      19     14.0
3    D      14     13.0
4    E      14      NaN
5    F      11      NaN
6    G      20     10.0
7    H      28      8.0

df1.merge(df2, on='team', how='inner')
# Output
  team  points  assists
0    A      18        4
1    B      22        9
2    C      19       14
3    D      14       13
4    G      20       10
5    H      28        8

Concatenating DataFrames

If we have 2 or more DataFrames which share exactly the same columns, we can concatenate them like this:

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)


df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)


df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    },
    index=[8, 9, 10, 11],
)

frames = [df1, df2, df3]

result = pd.concat(frames)
print(result)

# Output
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11
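
If the indexes of the original DataFrames overlap (or carry no meaning), we can ask concat() to build a fresh 0..n-1 index with ignore_index=True:

result = pd.concat(frames, ignore_index=True)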

Variable types and memory usage

A pandas DataFrame can have columns of different types. To find out what these are, we may use df.dtypes:

print(df.dtypes)

# Output
float              float64
int                  int64
datetime    datetime64[ns]
string              object
dtype: object

Note

For very large datasets, make sure that you are not using more bits than you need in each column. You can convert the column types, for example from int64 to int16, at any time.
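
A minimal sketch of such a conversion with .astype(), using the "int" column from the dtypes listing above:

df["int"] = df["int"].astype("int16")  # 2 bytes per value instead of 8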

To find out how much RAM (in bytes) a particular DataFrame is actually using, we may use the df.memory_usage(deep=True) method:

>>> df.memory_usage(deep=True)
(lists each column's full memory usage)

>>> df.memory_usage(deep=True).sum()
462432