Aggregating Data

Aggregate statistics functions

We can calculate statistics on a column of a DataFrame with Pandas’ aggregate statistics functions.

These are some main functions

CommandMeaning
meanAverage of all values in column
stdStandard deviation
medianMedian
maxMaximum value in column
minMinimum value in column
countNumber of values in column
nuniqueNumber of unique values in column
uniqueList of unique values in column

For example, we can see a list of colors in the inventory: print(inventory.color.unique())

Mean and median

Summary statistics summarize and tell us about dataset.

# Print the mean of age
print(df["age"].mean())

# Print the median of age
print(df["age"].median())

.Efficient summaries

.agg() method is useful to compute multiple statistics on multiple variables.

Dropping duplicates

Removing duplicates is an essential skill to get accurate counts, because often you don’t want to count the same thing multiple times.

We can drop duplicate with this syntax:

df.drop_duplicates(subset="column-name")

Counting categorical variables

Counting is a great way to get an overview of your data.

# Count the number of stores of each type
store_counts = store_types["type"].value_counts()
print(store_counts)

# Get the proportion of stores of each type
store_props = store_types["type"].value_counts(normalize=True)
print(store_props)

# Count the number of each department number and sort
dept_counts_sorted = store_depts["department"].value_counts(sort=True)
print(dept_counts_sorted)

# Get the proportion of departments of each number and sort
dept_props_sorted = store_depts["department"].value_counts(sort=True, normalize=True)
print(dept_props_sorted)

Pivoting on one variable

Pivot tables are the standard way of aggregating data in spreadsheets.

.pivot_table() method is just an alternative to .groupby()

The .pivot_table() method has several useful arguments, including fill_value and margins.

  • fill_value replaces missing values with a real value (known as imputation).
  • margins is a shortcut for when you pivoted by two variables, but also wanted to pivot by each of those variables separately: it gives the row and column totals of the pivot table contents.

groupby()

Aggregate statistic functions can be applied across multiple rows by using a groupby function.

groupby function creates a new Series, not a DataFrame, so we need to add reset_index().

Generally, you’ll always see a groupby statement followed by reset_index:

df.groupby('column1').column2.function().reset_index()
  • column1 is the column that we want to group by
  • column2 is the column that we want to perform a measurement on
  • function we want to apply

For example, to check the most expensive shoes in the orders

pricey_shoes = orders.groupby('shoe_type').price.max().reset_index()

Note

We can use groupby to more than one column by adding a list of columns:

For example, we create a DataFrame with the total number of shoes of each shoe_type/shoe_color.

When we’re using count(), it doesn’t really matter which column we perform the calculation on. We can calculate based on id column.

shoe_counts = orders.groupby(['shoe_type', 'shoe_color']).id.count().reset_index()
shoe_typeshoe_colorid
0ballet flatsblack2
1ballet flatsbrown11
2ballet flatsnavy17
3ballet flatsred13
4ballet flatswhite7
5sandalsblack3
6sandalsbrown10
7sandalsnavy13
8sandalsred14
9sandalswhite10
10stilettosblack8
11stilettosbrown14
12stilettosnavy7
13stilettosred16
14stilettoswhite5
15wedgesbrown13
16wedgesnavy16
17wedgesred4
18wedgeswhite17

Pivot table

A pivot table is a table of statistics that summarizes the data of a more extensive table.

This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way.

In Pandas, the command for pivot is:

df.pivot(columns='ColumnToPivot',
         index='ColumnToBeRows',
         values='ColumnToBeValues')

For example, to change the above table into pivot table, we use:

shoe_counts = orders.groupby(['shoe_type', 'shoe_color']).id.count().reset_index()

shoe_counts_pivot = shoe_counts.pivot(
  columns = 'shoe_color',
  index = 'shoe_type',
  values = 'id').reset_index()

print(shoe_counts_pivot)
shoe_typeblackbrownnavyredwhite
0ballet flats2.011.017.013.07.0
1sandals3.010.013.014.010.0
2stilettos8.014.07.016.05.0
3wedgesnan13.016.04.017.0

Lambda function & apply()

If you’d like to perform more complicated than mean or, you can use the apply method and lambda functions.

apply() function

The Pandas apply() function can apply a function on every value in a column or row of a DataFrame.

For example, this code overwrites the existing ‘Name’ columns by making names uppercase.

df['Name'] = df.Name.apply(upper)

Calculations with .groupby()

# Group by type; calc total weekly sales
sales_by_type = sales.groupby("type")["weekly_sales"].sum()

# Get proportion for each type
sales_propn_by_type = sales_by_type / sum(sales_by_type)
print(sales_propn_by_type)