DataFrame

I. What is DataFrame?

A DataFrame is a data type from pandas module. It represents spreadsheet-like data (rows and columns).

You can manually create a DataFrame or fill it with data from a CSV, an Excel spreadsheet, or a SQL query.

DataFrames can contain different data types: strings, ints, floats, tuples, etc.

Components of a DataFrame DataFrame objects consist of three components, stored as attributes:

  • .values: A two-dimensional NumPy array of values.
  • .columns: An index of columns contains the column names.
  • .index: An index for the rows contains either row numbers or row names.
df.columns
df.index

II. Creating a DataFrame

You can create DataFrame by passing in a dictionary or a list of lists to the pd.DataFrame() method, or by reading data from a CSV file pd.read_csv

1. List of dictionaries by row

Passing in a dictionary to pd.DataFrame()

list_of_dicts = [
  {"name": "Anna", "height_cm": 110, "weight_kg": 2, "date_of_birth": "2019-03-14"},
  {"name": "Jane", "height_cm": 121, "weight_kg": 2, "date_of_birth": "2019-05-09"}
]
new_babies = pd.DataFrame(dict_of_lists)

We can also create DataFrame this way:

import pandas as pd

df1 = pd.DataFrame({
  'Product ID': [1, 2, 3, 4],
  'Product Name': ['t-shirt', 't-shirt', 'skirt', 'skirt'],
  'Color': ['blue', 'green', 'red', 'black']
})

Note: the columns will appear in alphabetical order because dictionaries don’t have any inherent order for columns.

2. Dictionary of list - by column

When using this method, we need to go through the data column by column.

  • Key: column name
  • Value: lists of column values
dict_of_lists = {
  "name": ["Anna", "Jane"],
  "height_cm": [110, 121],
  "weight_kg": [2, 3],
  "date_of_birth": ["2019-03-14", "2019-05-09"]
}
new_babies = pd.DataFrame(dict_of_lists)

We can also pass in data this way:

import pandas as pd

df2 = pd.DataFrame([
  [1, 'San Diego', 100],
  [2, 'Los Angeles', 120],
  [3, 'San Francisco', 90],
  [4, 'Sacramento', 115]
],
  columns = [
    'Store ID', 'Location', 'Number of Employees'
  ])

print(df2)

III. Inspecting a DataFrame

1. head()

If it’s a small DataFrame, you can display it with print(df).

However, with a larger one, it’s better to display certain items without looking at the entire DataFrame.

.head() method selects the first 5 rows of a DataFrame:

df.head()
print(df.head())

If you want to see more rows, add the number. For example, df.head(8) shows the first 8 rows.

2. describe()

describe() method display the summary statistics of the table, like mean and median.

df.describe()

3. info()

The method info() displays a summary of the table, such as the data type and number of missing values.

The result includes all columns and their data types.

df.info()

It’s useful for DataFrame with many columns that are difficult to display using .head()

4. columns

.columns display the column names of the table

5. value_counts()

.value_counts() displays the distinct values for a column

6. shape

.shape attribute contains a tuple that holds the number of rows followed by the number of columns.

df.shape
# Output: (4, 6)

7. Looking at types

To check the types of each column of a DataFrame, we can use:

print(df.dtypes)

IV. Merge DataFrames

1. Inner Merge

To merge 2 or more DataFrames with matching rows, we use .merge()

  1. We can use pd.merge(df1, df2)

For example, to merge orders with customers, you could use:

new_df = pd.merge(orders, customers)
  1. We can also use df1.merge(df2)
new_df = orders.merge(customers)

2. Outer Merge

Outer Join would include all rows from both tables, even if they don’t match.

pd.merge(df1, df2, how='outer')

For example, company A has customer’s name and email. Company B has customer’s name and phone number.

We merge 2 companies’ data with:

data = pd.merge(company_a, company_b, how='outer')
print(data)

Any missing values are filled in with None or nan (which stands for “Not a Number”).

nameemailphone
Hanna Jonehjone@gmail.comnan
John Smithjsmith@yahoo.com023-012
Jane Mayjmay@gmail.com453-103
Niles Arrnan467-305

3. Left and Right Merge

3.1. Left merge includes all rows from the first (left) table, but only rows from the second (right) table that match the first table.

pd.merge(df1, df2, how='left')

3.2. Right merge includes all rows from the second (right) table, but only rows from the first (left) table that match the second table.

pd.merge(df1, df2, how='right')

V. Concatenate DataFrames

To reconstruct a single DataFrame from multiple smaller DataFrames, we can use:

pd.concat([df1, df2, df2, ...])

This method only works if all DataFrames have the same numbers of columns.