Dev Handbook > Python > Pandas > Aggregating Data

Aggregating Data

Aggregate statistics functions

We can calculate statistics on a column of a DataFrame with Pandas’ aggregate statistics functions.

These are some main functions

Command	Meaning
mean	Average of all values in column
std	Standard deviation
median	Median
max	Maximum value in column
min	Minimum value in column
count	Number of values in column
nunique	Number of unique values in column
unique	List of unique values in column

For example, we can see a list of colors in the inventory: print(inventory.color.unique())

Mean and median

Summary statistics summarize and tell us about dataset.

# Print the mean of age
print(df["age"].mean())

# Print the median of age
print(df["age"].median())

.Efficient summaries

.agg() method is useful to compute multiple statistics on multiple variables.

Dropping duplicates

Removing duplicates is an essential skill to get accurate counts, because often you don’t want to count the same thing multiple times.

We can drop duplicate with this syntax:

df.drop_duplicates(subset="column-name")

Counting categorical variables

Counting is a great way to get an overview of your data.

# Count the number of stores of each type
store_counts = store_types["type"].value_counts()
print(store_counts)

# Get the proportion of stores of each type
store_props = store_types["type"].value_counts(normalize=True)
print(store_props)

# Count the number of each department number and sort
dept_counts_sorted = store_depts["department"].value_counts(sort=True)
print(dept_counts_sorted)

# Get the proportion of departments of each number and sort
dept_props_sorted = store_depts["department"].value_counts(sort=True, normalize=True)
print(dept_props_sorted)

Pivoting on one variable

Pivot tables are the standard way of aggregating data in spreadsheets.

.pivot_table() method is just an alternative to .groupby()

The .pivot_table() method has several useful arguments, including fill_value and margins.

fill_value replaces missing values with a real value (known as imputation).
margins is a shortcut for when you pivoted by two variables, but also wanted to pivot by each of those variables separately: it gives the row and column totals of the pivot table contents.

groupby()

Aggregate statistic functions can be applied across multiple rows by using a groupby function.

groupby function creates a new Series, not a DataFrame, so we need to add reset_index().

Generally, you’ll always see a groupby statement followed by reset_index:

df.groupby('column1').column2.function().reset_index()

column1 is the column that we want to group by
column2 is the column that we want to perform a measurement on
function we want to apply

For example, to check the most expensive shoes in the orders

pricey_shoes = orders.groupby('shoe_type').price.max().reset_index()

Note

We can use groupby to more than one column by adding a list of columns:

For example, we create a DataFrame with the total number of shoes of each shoe_type/shoe_color.

When we’re using count(), it doesn’t really matter which column we perform the calculation on. We can calculate based on id column.

shoe_counts = orders.groupby(['shoe_type', 'shoe_color']).id.count().reset_index()

	shoe_type	shoe_color	id
0	ballet flats	black	2
1	ballet flats	brown	11
2	ballet flats	navy	17
3	ballet flats	red	13
4	ballet flats	white	7
5	sandals	black	3
6	sandals	brown	10
7	sandals	navy	13
8	sandals	red	14
9	sandals	white	10
10	stilettos	black	8
11	stilettos	brown	14
12	stilettos	navy	7
13	stilettos	red	16
14	stilettos	white	5
15	wedges	brown	13
16	wedges	navy	16
17	wedges	red	4
18	wedges	white	17

Pivot table

A pivot table is a table of statistics that summarizes the data of a more extensive table.

This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way.

In Pandas, the command for pivot is:

df.pivot(columns='ColumnToPivot',
         index='ColumnToBeRows',
         values='ColumnToBeValues')

For example, to change the above table into pivot table, we use:

shoe_counts = orders.groupby(['shoe_type', 'shoe_color']).id.count().reset_index()

shoe_counts_pivot = shoe_counts.pivot(
  columns = 'shoe_color',
  index = 'shoe_type',
  values = 'id').reset_index()

print(shoe_counts_pivot)

	shoe_type	black	brown	navy	red	white
0	ballet flats	2.0	11.0	17.0	13.0	7.0
1	sandals	3.0	10.0	13.0	14.0	10.0
2	stilettos	8.0	14.0	7.0	16.0	5.0
3	wedges	nan	13.0	16.0	4.0	17.0

Lambda function & apply()

If you’d like to perform more complicated than mean or, you can use the apply method and lambda functions.

apply() function

The Pandas apply() function can apply a function on every value in a column or row of a DataFrame.

For example, this code overwrites the existing ‘Name’ columns by making names uppercase.

df['Name'] = df.Name.apply(upper)

Calculations with .groupby()

# Group by type; calc total weekly sales
sales_by_type = sales.groupby("type")["weekly_sales"].sum()

# Get proportion for each type
sales_propn_by_type = sales_by_type / sum(sales_by_type)
print(sales_propn_by_type)