Progress in Public Health

Data Visualization

Introduction

This document uses age-adjusted death rates from the National Center for Health Statistics for the years 1900-2018 to demonstrate good data visualization principles. Clutter is removed and a coherent message is communicated to the audience.

Load Data

Code

df = (
    pl.read_csv(here("projects/death_rates/death_rates.csv"))
    .with_columns(pl.col("Year")
    .cast(pl.Utf8)
    .str.to_date(format="%Y"))
)

df.glimpse()

Rows: 595
Columns: 3
$ Year                    <date> 1900-01-01, 1901-01-01, 1902-01-01, 1903-01-01, 1904-01-01, 1905-01-01, 1906-01-01, 1907-01-01, 1908-01-01, 1909-01-01
$ Cause                    <str> 'Heart Disease', 'Heart Disease', 'Heart Disease', 'Heart Disease', 'Heart Disease', 'Heart Disease', 'Heart Disease', 'Heart Disease', 'Heart Disease', 'Heart Disease'
$ Age Adjusted Death Rate  <f64> 265.4, 272.6, 285.2, 304.5, 331.5, 327.8, 325.5, 356.5, 328.6, 329.2

Visualization

Plot All Causes of Death

Code

fig, ax = plt.subplots()


for cause, group in df.group_by("Cause"):
    ax.plot(group["Year"], group["Age Adjusted Death Rate"], label=cause, linewidth=2.5)


# ax.spines["top"].set_visible(False)
# ax.spines["right"].set_visible(False)
# ax.tick_params(length=0)
ax.legend(title="Cause")
# ax.set_title("Initial Rough Plot")

plt.show()

There’s a lot going on here, not to mention a legend that is in the way. Instead of looking at all five causes of death in the data, let’s focus on a single disease.

Plot Heart Disease

Let’s subset for heart disease and see what that looks like.

Code

# Subset for heart disease
df_heart = (
    df
    .filter(pl.col("Cause") == "Heart Disease")
)


# Initialize the plot
fig, ax = plt.subplots()

# Plot the segment for years <= 2000 in skyblue
ax.plot(
    df_heart["Year"],
    df_heart["Age Adjusted Death Rate"],
    linewidth=2.5,
    label="Heart Disease",
    color="#7EC0EE"
)

ax.legend(title="Cause")

plt.show()

This makes it easier to see what is happening with heart disease specifically. There is an interesting pattern here, which is that death rates from heart disease are lower than they have ever been since 1900, but that progress was not linear.

Let’s make a graph that highlights how long it took to recover to the death rate in 1900, which was 265.4 per 100,000 U.S. residents.

The graph below utilizes several methods to tell this story:

The top and right spines are removed.
The axis tick marks are removed.
Color is used to distinguish between the period of recovery to the death rate in 1900 and the period afterwards, when death rates are below the level observed in 1900.
A horizontal line is added to further distinguish the recovery to 1900 death rates.
The legend is removed in favor of direct labeling in the graph and in the title of the graph itself.

Code

# Split the data into two parts: before (and including) 2000 and after 2000
cutoff = pl.lit("2000-01-01").str.to_date()
df_old = df_heart.filter(pl.col("Year") <= cutoff)
df_new = df_heart.filter(pl.col("Year") >= cutoff)




# Load the font for the plot
font = load_font(
    "https://github.com/google/fonts/blob/main/ofl/merriweather/Merriweather%5Bopsz%2Cwdth%2Cwght%5D.ttf?raw=true"
)



# Initialize the plot
fig, ax = plt.subplots()

# Plot the segment for years <= 2000 in skyblue
ax.plot(
    df_old["Year"],
    df_old["Age Adjusted Death Rate"],
    linewidth=2.5,
    color="#7EC0EE"
)

# Plot the segment for years > 2000 in lightgrey
ax.plot(
    df_new["Year"],
    df_new["Age Adjusted Death Rate"],
    linewidth=2.5,
    color="lightgrey"
)

# Extract the rate for the year 1900
rate_1900 = df_heart.filter(pl.col("Year") == datetime.date(1900, 1, 1))["Age Adjusted Death Rate"].item()

# Add a dashed horizontal line from 1900 to 2000 at rate_1900
ax.hlines(
    y=rate_1900,
    xmin=datetime.date(1900, 1, 1),
    xmax=datetime.date(2000, 1, 1),
    colors="black",
    linestyles="--"
)

# Style the plot
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.tick_params(length=0)
# ax.set_title()
# ax.set_ylabel()
ax.set_xlabel("Year", font = font)


ax.text(
    x = 2000,
    y = 475,
    size = 11,
    s = "Heart Disease\n Death Rate",
    color = "#7EC0EE",
    weight = "bold",
    ha="left"
)

# Adjust top margin
fig.subplots_adjust(top=0.8, bottom=0.05)


# Title with higlight_text

text = "It took 100 years for the <Heart Disease Death Rate>\nto recover to levels observed in 1900"
ax_text(
    x=-10**4.45, y=720, s=text,
    fontsize=16, color='black',
    font = font,
    highlight_textprops=[
      {"color": "#7EC0EE", 'fontweight': 'bold'}
    ]
)

# Subtitle
fig.text(
  x=.124,
  y=.825,
  s="Death Rate (per 100K)",
  size=10,
  font=font,
  color="gray"
)

# caption
fig.text(
  x=.6,
  y=-0.05,
  s="Source: National Center for Health Statistics",
  size=6,
  color="#6E6E6E"
)


plt.show()

This all makes it easier for the viewer to notice that it took 100 years to recover to the heart disease death rate observed in 1900. Visualizations that are easier to read are more likely to be understood and acted upon.