Histogram

A histogram visualizes the distribution of values in a dataset.

I. Create Histogram in Matplotlib

To plot data as a histogram, we pass it to plt.hist:

import matplotlib.pyplot as plt
plt.hist(data)
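As a minimal sketch of the call above, the following script builds a hypothetical sample (the array data and its normal distribution are assumptions made for illustration, not part of the original notes) and plots it. It also captures the values that hist returns, which are the bar heights, the bin edges, and the drawn patches.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample: 1000 values drawn from a normal distribution
rng = np.random.default_rng(0)
data = rng.normal(loc=170, scale=10, size=1000)

fig, ax = plt.subplots()
# hist returns the bar heights, the bin edges, and the drawn patches
counts, bin_edges, patches = ax.hist(data)
ax.set_xlabel("value")
ax.set_ylabel("count")

print(len(counts), len(bin_edges))  # number of bars and number of bin edges
plt.close(fig)
```

In an interactive session, we can drop the matplotlib.use("Agg") line and call plt.show() instead of plt.close(fig) to display the figure.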

II. Changing bins

bins sets the number of intervals (bars) that the data is divided into.

By default, matplotlib creates a histogram with 10 bins of equal size spanning from the smallest sample to the largest sample in our data.

To change the number of bins, we use the keyword bins:

plt.hist(data, bins=nbins)

For example, we can divide the histogram into 20 bins to see more details.

plt.hist(data, bins=20)
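To see what the bins keyword changes, here is a small sketch that plots the same hypothetical sample (the array data is an assumption for illustration) once with the default settings and once with bins=20, then compares the returned bin edges.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample used only for illustration
rng = np.random.default_rng(1)
data = rng.normal(loc=0, scale=1, size=500)

fig, (ax_default, ax_fine) = plt.subplots(1, 2, figsize=(10, 4))

# Default: 10 equal-width bins from the smallest to the largest sample
counts_default, edges_default, _ = ax_default.hist(data)
ax_default.set_title("default bins")

# 20 bins over the same span, so each bar covers half the width
counts_20, edges_20, _ = ax_fine.hist(data, bins=20)
ax_fine.set_title("bins=20")

print(len(counts_default), len(counts_20))
plt.close(fig)
```

Both histograms count every sample exactly once; only the width of the intervals differs.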

III. Changing range

range sets the minimum and maximum values that we will include in our histogram. Data points outside this interval are ignored.

plt.hist(data, range=(xmin, xmax))

For example:

plt.hist(df.height, range=(50, 180))
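The sketch below illustrates the effect of range on a hypothetical array of heights (the array heights and the two outlier values are assumptions for illustration; the original uses a DataFrame column df.height). It checks that the bin edges span exactly the requested interval and that the outliers are left out of the counts.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical heights in cm, plus two implausible values standing in for data-entry errors
rng = np.random.default_rng(2)
heights = np.concatenate([rng.normal(loc=150, scale=12, size=500), [5.0, 400.0]])

fig, ax = plt.subplots()
# Only values between 50 and 180 (inclusive) are binned
counts, bin_edges, _ = ax.hist(heights, range=(50, 180))

inside = np.sum((heights >= 50) & (heights <= 180))
print(bin_edges[0], bin_edges[-1])
print(int(counts.sum()), "of", heights.size, "values fall inside the range")
plt.close(fig)
```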

IV. Normalizing

Normalization scales the height of every bar by a constant factor so that the areas of the bars sum to one.

It makes two histograms comparable even if the sample sizes are different.

We can normalize a histogram with the keyword density=True. The area of each bar (its height times its bin width) then represents the proportion of the dataset that falls in that bin.

plt.hist(df.male_weight, density=True)
plt.hist(df.female_weight, density=True)
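As a quick check of the normalization described above, this sketch uses two hypothetical samples of different sizes (the arrays male_weight and female_weight are assumptions standing in for the DataFrame columns) and verifies that the bar areas of each normalized histogram add up to one.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical samples with very different sizes
rng = np.random.default_rng(3)
male_weight = rng.normal(loc=80, scale=10, size=2000)
female_weight = rng.normal(loc=65, scale=8, size=300)

fig, ax = plt.subplots()
heights_m, edges_m, _ = ax.hist(male_weight, density=True)
heights_f, edges_f, _ = ax.hist(female_weight, density=True)

# Area of each bar = height * bin width; the areas should sum to 1 for each sample
area_m = np.sum(heights_m * np.diff(edges_m))
area_f = np.sum(heights_f * np.diff(edges_f))
print(area_m, area_f)
plt.close(fig)
```

Because both histograms have a total area of one, their shapes can be compared directly even though one sample is much larger than the other.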

V. Multiple Histograms

When we plot multiple histograms on the same axes, the overlapping bars can be difficult to read.

To solve this problem, we can use:

  • The keyword alpha (between 0 and 1, where 0 is fully transparent and 1 is fully opaque) to set the transparency of the histogram
  • histtype='step' to draw just the outline of a histogram

plt.hist(budget1, bins=20, alpha=0.4)
plt.hist(budget2, bins=20, alpha=0.4)
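The two options above can be compared side by side. In this sketch, budget1 and budget2 are hypothetical arrays generated for illustration; the left panel uses alpha and the right panel uses histtype='step'.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical samples that overlap partially
rng = np.random.default_rng(4)
budget1 = rng.normal(loc=50, scale=10, size=800)
budget2 = rng.normal(loc=65, scale=12, size=800)

fig, (ax_alpha, ax_step) = plt.subplots(1, 2, figsize=(10, 4))

# Option 1: semi-transparent bars, so both histograms stay visible where they overlap
ax_alpha.hist(budget1, bins=20, alpha=0.4, label="budget1")
ax_alpha.hist(budget2, bins=20, alpha=0.4, label="budget2")
ax_alpha.set_title("alpha=0.4")
ax_alpha.legend()

# Option 2: draw only the outline of each histogram
counts1, edges1, _ = ax_step.hist(budget1, bins=20, histtype="step", label="budget1")
counts2, edges2, _ = ax_step.hist(budget2, bins=20, histtype="step", label="budget2")
ax_step.set_title("histtype='step'")
ax_step.legend()

plt.close(fig)
```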

VI. Different types of distribution

To classify a dataset, we can count the number of distinct peaks present in the graph and check whether the data is symmetric around them:

  • A unimodal dataset has only one distinct peak
  • A bimodal dataset has two distinct peaks. This often happens when the data contains two different populations
  • A multimodal dataset has more than two peaks
  • A uniform dataset doesn’t have any distinct peaks
  • A symmetric dataset has equal amounts of data on both sides of the peak. Both sides should look about the same
  • A skew-right dataset has a long tail on the right of the peak, but most of the data is on the left
  • A skew-left dataset has a long tail on the left of the peak, but most of the data is on the right
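Skew can also be checked numerically as a complement to reading the histogram. The sketch below is one possible approach and not part of the original notes: the helper sample_skewness is a hypothetical function that computes the third standardized moment, and the three samples are generated only for illustration. A clearly positive value suggests a long right tail, a clearly negative value a long left tail, and a value near zero a roughly symmetric dataset.

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment of a sample (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.std(values) ** 3

# Hypothetical samples, one per shape
rng = np.random.default_rng(5)
symmetric = rng.normal(loc=0, scale=1, size=5000)
skew_right = rng.exponential(scale=1.0, size=5000)   # long tail on the right
skew_left = -rng.exponential(scale=1.0, size=5000)   # mirrored: long tail on the left

for name, sample in [("symmetric", symmetric),
                     ("skew-right", skew_right),
                     ("skew-left", skew_left)]:
    # For skewed data the mean is pulled toward the tail, away from the median
    print(name, sample_skewness(sample), np.mean(sample), np.median(sample))
```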