Bailee Egan - Histogram with custom binning

Introduction

Histograms are one of the simplest yet most effective ways to visualize data. Most default histograms divide the data into evenly spaced bins, but what if you don't want that?

Take, for example, a distribution of exam grades.

[Figure: histogram of the exam grade distribution]

Perhaps this is not a totally accurate depiction of most exam grades (that depends heavily on the class). Either way, the distribution of grade percentages might not match the distribution of letter grades. What if we're interested in the number of F's, D's, C's, B's, A's, and A+'s?

Required libraries

You only need ggplot2, which is conveniently included in the tidyverse package.

library(tidyverse)
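The examples in this post use a numeric vector named grades, which the post never constructs. To follow along, here is a minimal sketch that simulates one. The seed, mean, and standard deviation are assumptions made purely for illustration, so the plots and counts you get will differ from the ones shown here. The sample size of 1000 matches the total of the counts printed later.

```r
# Hypothetical stand-in for the 'grades' vector used throughout this post
set.seed(42)                              # arbitrary seed, for reproducibility only
grades = rnorm(1000, mean = 70, sd = 8)   # assumed bell-shaped spread of percentages
grades = pmin(pmax(grades, 0), 100)       # clamp to the valid 0-100 range

summary(grades)
```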

First solution

For this example, letter grades are assigned as follows: F: 0-60, D: 60-70, C: 70-80, B: 80-90, A: 90-95, A+: 95-100.

We can set custom breaks using ggplot2's geom_histogram() function.

ggplot(data = data.frame(y=grades), aes(y)) +
  geom_histogram(breaks=c(0,60,70,80,90,95,100), color="black", fill="white")

[Figure: histogram of grades drawn with the custom breaks]

Depending on your goals, this might be fine, but if you want bars of equal width, then you will need to do a bit more coding.
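One detail worth checking with this approach: by default, geom_histogram() treats each bin as closed on the right, so a grade of exactly 60 is counted in 0-60 rather than 60-70. If the lower limit of each letter grade should be inclusive, passing closed = "left" is one way to get that. The sketch below uses a tiny made-up vector (edge_grades is hypothetical, not the data from this post) and layer_data() to compare the two settings.

```r
library(ggplot2)

# Made-up grades that sit exactly on a bin edge (illustrative only)
edge_grades = c(60, 60, 75)
grade_breaks = c(0, 60, 70, 80, 90, 95, 100)

p_right = ggplot(data.frame(y = edge_grades), aes(y)) +
  geom_histogram(breaks = grade_breaks)                    # default: closed = "right"
p_left = ggplot(data.frame(y = edge_grades), aes(y)) +
  geom_histogram(breaks = grade_breaks, closed = "left")   # lower limits inclusive

# layer_data() returns the computed bins; 'count' holds one value per bin
counts_right = layer_data(p_right)$count
counts_left = layer_data(p_left)$count
print(counts_right)
print(counts_left)
```

Comparing the two printed vectors shows which bin the two 60s land in under each setting.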

On with it

We're actually going to make a bar graph disguised as a histogram. For this, we need to create a table in the following format.

##   Category Count
## 1     0-60   100
## 2    60-70   400
## 3    70-80   380
## 4      ...   ...

We'll make a function that can be recycled for different datasets. values is the vector of your numeric data. breaks are the breaks in the histogram. The return value of this function will be the data frame in the format above.

bin = function(values, breaks) {
  categories = c()
  counts = c()

  # Main steps:
  #  1: Fill out categories
  #  2: Assign counts to each category

  return(data.frame(Category = categories, Count = counts))
}

Filling out the categories

To fill out the categories, we're just using paste() to concatenate each break value with the next one in the vector, separating the pair of strings by a dash. For each created category, we'll also initialize the count for that category, which is zero.

categories = c()
counts = c()

for (i in 1:(length(breaks) - 1)) {
  categories = c(categories, paste(breaks[i], breaks[i + 1], sep="-"))
  counts = c(counts, 0)
}

After creating the categories, we need them to be factors so that ggplot doesn't mix up the order. Because the categories are created in order, they can be used as the levels.

categories = factor(categories, levels=categories)
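To see why the explicit levels matter, here is a small sketch with a made-up set of breaks (example_breaks is hypothetical, chosen only because its labels sort badly as text). Without the levels argument, factor() orders the labels as text, which no longer matches the numeric order. The grade labels in this post happen to sort correctly as text, but that is not guaranteed for other breaks.

```r
# Hypothetical breaks chosen so that text order and numeric order disagree
example_breaks = c(0, 50, 100, 150)
labels = paste(head(example_breaks, -1), tail(example_breaks, -1), sep = "-")

default_order = levels(factor(labels))                 # levels sorted as text
kept_order = levels(factor(labels, levels = labels))   # levels in order of creation

print(default_order)
print(kept_order)
```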

Filling out the counts

We have our categories. Now we need the counts for each category. We'll loop through each value and, for each value, loop through the categories (specifically, the lower and upper limit of each category). If the value falls between the lower and upper limit, we increment the count for the corresponding category.

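for (r in seq_along(values)) { # 'values' here is the input vector of your data
  value = values[r]
  if (!is.na(value)) { # Skip missing values; comparing NA inside if() raises an error

    for (i in 1:(length(breaks) - 1)) {
      lower = breaks[i]
      upper = breaks[i + 1]
      # The last bin also includes its upper limit, so a value equal to
      # the final break (e.g. 100) is not silently dropped
      is_last = (i == length(breaks) - 1)
      if (value >= lower && (value < upper || (is_last && value == upper))) {
        counts[i] = counts[i] + 1 # counts[i] belongs to the bin breaks[i]-breaks[i + 1]
        break
      }
    }

  }
}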

Here's the completed function.

bin = function(values, breaks) {
  categories = c()
  counts = c()

  ### Fill out categories

  for (i in 1:(length(breaks) - 1)) {
    categories = c(categories, paste(breaks[i], breaks[i + 1], sep="-"))
    counts = c(counts, 0)
  }
  categories = factor(categories, levels=categories)

  ### Assign counts to each category

  for (r in seq_along(values)) {
    value = values[r]
    if (!is.na(value)) { # Skip missing values
      for (i in 1:(length(breaks) - 1)) {
        lower = breaks[i]
        upper = breaks[i + 1]
        # The last bin also includes its upper limit (e.g. a grade of 100)
        is_last = (i == length(breaks) - 1)
        if (value >= lower && (value < upper || (is_last && value == upper))) {
          counts[i] = counts[i] + 1
          break
        }
      }
    }
  }

  # Return the data frame

  return(data.frame(Category = categories, Count = counts))
}

And here is the output.

bin(grades, breaks=c(0,60,70,80,90,95,100))

##   Category Count
## 1     0-60    91
## 2    60-70   369
## 3    70-80   444
## 4    80-90    91
## 5    90-95     5
## 6   95-100     0
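As a cross-check on the loop, base R's cut() and table() can produce the same kind of table without explicit iteration. This is a sketch of an alternative, not the function developed above. The name bin_cut is hypothetical, and the choice of right = FALSE with include.lowest = TRUE (lower limits inclusive, with the final break also counted in the last bin) is an assumption about how edge values should be treated.

```r
# Hypothetical alternative to the loop-based function, built on cut() and table()
bin_cut = function(values, breaks) {
  # Same "lower-upper" labels as in the loop version
  labels = paste(head(breaks, -1), tail(breaks, -1), sep = "-")
  # right = FALSE makes each bin [lower, upper); include.lowest = TRUE
  # additionally closes the last bin so the final break is counted
  groups = cut(values, breaks = breaks, labels = labels,
               right = FALSE, include.lowest = TRUE)
  counts = table(groups)   # NA values and values outside the breaks are not counted
  data.frame(Category = factor(labels, levels = labels),
             Count = as.vector(counts))
}

# Made-up values, including edge cases and a missing value
check = bin_cut(c(10, 60, 69.9, 95, 100, NA), breaks = c(0, 60, 70, 80, 90, 95, 100))
print(check)
```

Because the result has the same Category and Count columns, it can be passed to the same plotting code.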

Finally, we can plot the data as a histogram.

grades.binned = bin(grades, breaks=c(0,60,70,80,90,95,100))
ggplot(grades.binned, aes(Category, Count)) +
  geom_bar(stat="identity")

[Figure: bar chart of the binned grade counts]

Touching up

We can afford to touch up this graph a bit. Change the x-axis title, and set width=1 if you want to remove the space between the bars.

ggplot(grades.binned, aes(Category, Count)) +
  geom_bar(stat="identity", width = 1, color="black", fill="white") +
  labs(x="Grade (%)")

[Figure: the same bar chart with no gaps between bars and the x-axis labeled "Grade (%)"]

Go wild if you want.

ggplot(cbind(grades.binned, Letter=c("F", "D", "C", "B", "A", "A+")), aes(Category, Count)) +
  geom_bar(stat="identity", width = 1, color="black", fill="white") +
  geom_text(aes(label=Letter, y=Count+20), size=8) +
  labs(x="Grade (%)") +
  theme_classic()

[Figure: the bar chart with the letter grade labeled above each bar]