Introduction

This vignette walks through the steps needed to recreate a figure from Garrison and Rodgers (2016). The example uses kinship pair data prepared with the discord package and visualizes the results using updated ggplot2 tools.

Data Preparation

Data Cleaning

This section reuses the data preparation pipeline developed in the regression vignette. That vignette showed how to set up data for discordant regression analysis with the discord data processing tools, which help construct kinship links: identifying sibling pairs, merging sibling characteristics, and calculating pair-level variables.

Here, we apply the same kinship pairing, data merging, and cleaning procedures, but the focus is on visualizing patterns rather than fitting regression models.

The underlying dataset is the NLSY79, which includes measures of flu vaccination and socioeconomic status (SES) for kinship pairs. As in the regression vignette, we restrict the sample to pairs who were housemates (RelationshipPath == "Gen1Housemates") and have a relatedness coefficient of 0.5 (RFull == 0.5).

For full details on the data processing and kinship link construction, see the regression vignette.

# Setup: load the packages used for data preparation and plotting
library(discord)
library(NlsyLinks)

library(tidyverse)
library(ggplot2)
library(grid)
library(gridExtra)
library(janitor)

# Load the data
data(data_flu_ses)

# Get kinship links for individuals with the following variables:
link_vars <- c(
  "FLU_total", "FLU_2008", "FLU_2010",
  "FLU_2012", "FLU_2014", "FLU_2016",
  "S00_H40", "RACE", "SEX"
)

link_pairs <- Links79PairExpanded %>%
  filter(RelationshipPath == "Gen1Housemates" & RFull == 0.5)


df_link <- CreatePairLinksSingleEntered(
  outcomeDataset = data_flu_ses,
  linksPairDataset = link_pairs,
  outcomeNames = link_vars
)

consistent_kin <- df_link %>%
  group_by(SubjectTag_S1, SubjectTag_S2) %>%
  count(
    FLU_2008_S1, FLU_2010_S1,
    FLU_2012_S1, FLU_2014_S1,
    FLU_2016_S1, FLU_2008_S2,
    FLU_2010_S2, FLU_2012_S2,
    FLU_2014_S2, FLU_2016_S2
  ) %>%
  na.omit()

# Create the flu_modeling_data object with only consistent responders.
# Clean the column names with the {janitor} package.
flu_modeling_data <- semi_join(df_link,
  consistent_kin,
  by = c(
    "SubjectTag_S1",
    "SubjectTag_S2"
  )
) %>%
  clean_names()
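
To make the "consistent responders" filter concrete, the following is a minimal sketch on toy data. The tibble toy_links and its values are hypothetical (two survey waves instead of five); only the count(), na.omit(), and semi_join() pattern is taken from the code above.

```r
library(dplyr)

# Toy pair-level data (hypothetical values, not NLSY79 data)
toy_links <- tibble(
  SubjectTag_S1 = c(101, 201, 301),
  SubjectTag_S2 = c(102, 202, 302),
  FLU_2008_S1 = c(1, 0, NA),
  FLU_2008_S2 = c(0, 1, 1),
  FLU_2010_S1 = c(1, 1, 0),
  FLU_2010_S2 = c(1, NA, 0)
)

# Keep only pairs in which both siblings answered every wave
complete_pairs <- toy_links %>%
  group_by(SubjectTag_S1, SubjectTag_S2) %>%
  count(FLU_2008_S1, FLU_2008_S2, FLU_2010_S1, FLU_2010_S2) %>%
  na.omit()

# semi_join() retains the rows of toy_links that match a complete pair
toy_consistent <- semi_join(
  toy_links, complete_pairs,
  by = c("SubjectTag_S1", "SubjectTag_S2")
)
```

In this toy example, only the pair 101/102 has no missing flu responses, so it is the only row left in toy_consistent.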

Creating the Discord Data

With the data prepared, we restructure it using discord_data().

discord_flu <- discord_data(
  data = flu_modeling_data,
  outcome = "flu_total",
  predictors = "s00_h40",
  id = "extended_id",
  sex = "sex",
  race = "race",
  pair_identifiers = c("_s1", "_s2"),
  demographics = "both"
) %>%
  filter(!is.na(s00_h40_mean), !is.na(flu_total_mean))
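
The plots below rely on the pair-level _mean and _diff columns that appear in the filter above. As a rough illustration of what those columns represent, here is a hand-computed sketch for a single toy pair. It assumes the ordering convention described in the discord documentation, where the sibling with the higher outcome is treated as member 1 of the pair; the values and the object toy_discord are hypothetical, and this is not the package's implementation.

```r
# Toy sibling pair (hypothetical values, not NLSY79 data)
pair <- data.frame(
  flu_total_s1 = 2, flu_total_s2 = 5,
  s00_h40_s1 = 40, s00_h40_s2 = 70
)

flu <- c(pair$flu_total_s1, pair$flu_total_s2)
ses <- c(pair$s00_h40_s1, pair$s00_h40_s2)

# Assumption: the sibling with the higher outcome becomes member 1
if (flu[2] > flu[1]) {
  flu <- rev(flu)
  ses <- rev(ses)
}

# Pair-level differences (member 1 minus member 2) and means
toy_discord <- data.frame(
  flu_total_diff = flu[1] - flu[2],
  flu_total_mean = mean(flu),
  s00_h40_diff = ses[1] - ses[2],
  s00_h40_mean = mean(ses)
)
```

Under that assumption, the outcome difference is never negative, while the predictor difference can take either sign.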

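Because we are interested in differences between kin, we create a new variable, ses_diff_group, that classifies each pair's standardized SES difference into three categories: “Less Advantage” (more than 0.33 standard deviations below the mean difference), “Equally Advantage” (within 0.33 standard deviations of it), and “More Advantage” (more than 0.33 standard deviations above it). This variable is later used to group observations in the marginal density plots.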

discord_flu <- discord_flu %>%
  mutate(
    # Classify the standardized SES difference into three groups
    ses_diff_group = factor(
      case_when(
        scale(s00_h40_diff) > 0.33 ~ "More Advantage",
        scale(s00_h40_diff) < -0.33 ~ "Less Advantage",
        abs(scale(s00_h40_diff)) <= 0.33 ~ "Equally Advantage"
      ),
      levels = c(
        "Less Advantage",
        "Equally Advantage",
        "More Advantage"
      )
    )
  )
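
The thresholds can be checked by hand on a small vector. The sketch below uses base R only and hypothetical difference scores; it mirrors the 0.33 cutoffs above but is not the vignette's own code.

```r
# Hypothetical SES difference scores for five pairs
ses_diff <- c(-4, -0.5, 0, 0.5, 4)

# scale() centers and standardizes; as.numeric() drops the matrix attributes
z <- as.numeric(scale(ses_diff))

# Apply the same cutoffs as in the vignette
group <- ifelse(z > 0.33, "More Advantage",
  ifelse(z < -0.33, "Less Advantage", "Equally Advantage")
)
group <- factor(
  group,
  levels = c("Less Advantage", "Equally Advantage", "More Advantage")
)

group_counts <- table(group)
```

Here the two extreme pairs fall into the outer groups and the three pairs near zero are classified as "Equally Advantage".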

Setting Up Color Palette

To visually distinguish sibling differences in SES, we create a color palette that transitions from red (lower SES than the co-sibling) to blue (higher SES than the co-sibling). The scatter plots use this palette to color points by the SES difference between siblings, which is the predictor variable in our discord data.

# Create a color palette for the shading
shading <- c("firebrick4", "firebrick1", "dodgerblue1", "dodgerblue4")
max_val <- max(abs(discord_flu$s00_h40_diff), na.rm = TRUE)
values <- seq(-max_val, max_val, length.out = length(shading))
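
As a small check on how these breakpoints relate to the palette, the sketch below rescales a symmetric sequence to the 0 to 1 range that the values argument of scale_colour_gradientn() expects. The value of max_val is hypothetical, and color_breaks is a name introduced only for this illustration.

```r
shading <- c("firebrick4", "firebrick1", "dodgerblue1", "dodgerblue4")

# Hypothetical maximum absolute SES difference
max_val <- 30

# One breakpoint per color, symmetric around zero
values <- seq(-max_val, max_val, length.out = length(shading))

# Map the breakpoints onto [0, 1]
color_breaks <- scales::rescale(values)
```

With four colors, the rescaled breakpoints are evenly spaced at 0, 1/3, 2/3, and 1, which is also how the colors are spaced by default. Note that the plotting code in this vignette passes scales::rescale(c(-max_val, max_val)) rather than the values object itself.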

Plotting the Results

Individual Level Plot

This plot shows individual-level data rather than sibling pair means or differences, which gives context for the relationship between SES and flu vaccinations before we examine sibling differences.

The scatter plot places individual SES at age 40 on the x-axis and the individual's flu vaccination count on the y-axis. Point color indicates the SES difference between siblings.

The first step is to create the base plot with sibling 1 data. In the next code block, we add sibling 2 data to the plot.

# Individual level plot
# Both objects start from the same sibling 1 base plot
individual_plot <- individual_plot_sib1 <- ggplot(flu_modeling_data, aes(
  x = s00_h40_s1,
  y = flu_total_s1,
  color = s00_h40_s1 - s00_h40_s2
)) +
  geom_point(
    size = 0.8, alpha = 0.8, na.rm = TRUE,
    position = position_jitter(width = 0.2, height = 0.2)
  ) +
  geom_smooth(method = "lm", se = FALSE, color = "black")

Before adding sibling 2, we can render the plot for sibling 1 alone.

individual_plot_sib1 +
  scale_colour_gradientn(
    name = "Sibling\nDifferences\nin SES",
    colours = shading,
    na.value = "#AD78B6",
    values = scales::rescale(c(-max_val, max_val))
  ) +
  labs(
    x = "SES at Age 40",
    y = "Flu Vaccination Count"
  ) +
  theme_minimal() +
  ggtitle("Individual Level Plot: Sibling 1 Only") +
  theme(plot.title = element_text(hjust = 0.5))
#> Warning: Removed 3 rows containing non-finite outside the scale range
#> (`stat_smooth()`).

Now, we add sibling 2 to the plot, using the same color scheme to indicate SES differences between siblings.

individual_plot  <- individual_plot +
  # Add sibling 2 to the plot
  geom_point(
    size = 0.8, alpha = 0.8, na.rm = TRUE,
    position = position_jitter(width = 0.2, height = 0.2),
    aes(
      x = s00_h40_s2,
      y = flu_total_s2,
      # Reverse the sign of the difference so that each point's color reflects
      # the plotted sibling's SES relative to their co-sibling
      color = s00_h40_s2 - s00_h40_s1
    )
  ) +
  scale_colour_gradientn(
    name = "Sibling\nDifferences\nin SES",
    colours = shading,
    na.value = "#AD78B6",
    values = scales::rescale(c(-max_val, max_val))
  ) +
  labs(
    x = "SES at Age 40",
    y = "Flu Vaccination Count"
  ) +
  theme_minimal()

individual_plot +
  ggtitle("Individual Level Plot") +
  theme(plot.title = element_text(hjust = 0.5))
#> Warning: Removed 3 rows containing non-finite outside the scale range
#> (`stat_smooth()`).

Between Family Plots

This section creates a between-family plot that visualizes mean SES at age 40 against mean flu vaccinations for sibling pairs. Points are colored based on the SES difference between siblings. Each point represents a sibling pair, with the x-axis showing the average SES of the pair and the y-axis showing the average flu vaccinations.

Additionally, marginal density plots can be added to show the distribution of mean SES and mean flu vaccinations for different SES difference groups.

Like the individual level plot, this between-family plot shows a positive association between mean SES and mean flu vaccinations. Higher average SES among sibling pairs is associated with higher average flu vaccination rates.
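
The layout code later in this vignette refers to an object named main_plot. A minimal sketch of how such a between-family scatter plot could be built is shown below; it mirrors the structure of the within-family plot but uses the pair means. The helper make_between_plot(), the toy data, and the axis labels are assumptions made for illustration and are not taken from the original vignette.

```r
library(ggplot2)

# Hypothetical helper: between-family scatter plot of pair means,
# colored by the within-pair SES difference
make_between_plot <- function(data, shading) {
  ggplot(data, aes(
    x = s00_h40_mean,
    y = flu_total_mean,
    colour = s00_h40_diff
  )) +
    geom_point(
      size = 0.8, alpha = 0.8, na.rm = TRUE,
      position = position_jitter(width = 0.2, height = 0.2)
    ) +
    geom_smooth(method = "lm", se = FALSE, color = "black") +
    scale_colour_gradientn(
      name = "Sibling\nDifferences\nin SES",
      colours = shading,
      na.value = "#AD78B6"
    ) +
    theme_minimal() +
    theme(legend.position = "left") +
    labs(x = "Mean SES at Age 40", y = "Mean Flu Vaccinations")
}

# Toy pair-level data (hypothetical values) to show the helper runs
toy_pairs <- data.frame(
  s00_h40_mean = c(35, 48, 55, 62, 70),
  flu_total_mean = c(1, 2, 2.5, 3, 4),
  s00_h40_diff = c(-20, -5, 0, 8, 25)
)
toy_shading <- c("firebrick4", "firebrick1", "dodgerblue1", "dodgerblue4")
toy_plot <- make_between_plot(toy_pairs, toy_shading)

# With the objects from this vignette, the equivalent call would be:
# main_plot <- make_between_plot(discord_flu, shading)
```

Wrapping the plot in a function keeps the sketch runnable on its own while allowing the same code to be applied to discord_flu.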

Adding Marginal Density Plots

To enhance the visualization, we can add marginal density plots to the scatter plot. These plots show the distribution of mean SES and mean flu vaccinations, grouped by the SES difference category.

# Marginal X density (SES mean)
xdensity <- ggplot(discord_flu, aes(x = s00_h40_mean, group = ses_diff_group, color = ses_diff_group)) +
  geom_density(adjust = 2, linewidth = 1, fill = NA) +
  scale_colour_manual(
    name = "Sibling\nDifferences\nin SES",
    values = c("firebrick1", "#AD78B6", "dodgerblue1")
  ) +
  theme_minimal() +
  theme(
    legend.position = "left",
    axis.title.y = element_blank(),
    axis.text.y = element_blank(),
    legend.title = element_text(size = 8),
    legend.text = element_text(size = 6)
  ) +
  labs(x = NULL, y = NULL)

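The y-density plot follows the same pattern, with coord_flip() rotating it so that it runs along the vertical axis of the scatter plot: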


# Marginal Y density (Flu mean)
ydensity <- ggplot(discord_flu, aes(x = flu_total_mean, group = ses_diff_group, color = ses_diff_group)) +
  geom_density(adjust = 2, linewidth = 1, fill = NA) +
  scale_colour_manual(
    values = c("firebrick1", "#AD78B6", "dodgerblue1")
  ) +
  coord_flip() +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.text.x = element_blank()
  ) +
  labs(x = NULL, y = NULL)

The marginal plots carry no titles of their own; a single shared title is added when the panels are assembled.

Assembling the Final Plot

Finally, we arrange the main scatter plot (main_plot, the between-family scatter plot) and the marginal density plots into a single layout using gridExtra::grid.arrange(). The x-density plot sits above the main scatter plot, and the y-density plot sits to its right. A blank placeholder plot fills the otherwise empty top-right cell of the layout.

# Blank placeholder plot
blankPlot <- ggplot() +
  theme_void()

# Final layout
grid.arrange(
  arrangeGrob(xdensity, blankPlot, ncol = 2, widths = c(4, 1)),
  arrangeGrob(main_plot, ydensity, ncol = 2, widths = c(4, 1)),
  heights = c(1.5, 4),
  top = textGrob("Sibling Differences in SES and Flu Vaccinations",
    gp = gpar(fontsize = 20, font = 3)
  )
)
#> `geom_smooth()` using formula = 'y ~ x'

Bonus: Faceting by SES Difference Group

Another option is to use facet_wrap() to draw a separate panel for each SES difference group, which makes it easier to compare flu vaccination patterns across groups.
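
A minimal sketch of that faceting approach is given below. The toy data frame and the object name facet_sketch are hypothetical; with the vignette's data, the same facet_wrap() call could be added to a between-family scatter plot built from discord_flu.

```r
library(ggplot2)

# Toy pair-level data (hypothetical values), two pairs per group
toy_pairs <- data.frame(
  s00_h40_mean = c(35, 42, 50, 55, 63, 70),
  flu_total_mean = c(1, 1.5, 2, 2.5, 3, 4),
  ses_diff_group = factor(
    rep(c("Less Advantage", "Equally Advantage", "More Advantage"), each = 2),
    levels = c("Less Advantage", "Equally Advantage", "More Advantage")
  )
)

# One panel per SES difference group, arranged side by side
facet_sketch <- ggplot(toy_pairs, aes(x = s00_h40_mean, y = flu_total_mean)) +
  geom_point() +
  facet_wrap(~ses_diff_group, ncol = 3) +
  theme_minimal() +
  labs(x = "Mean SES at Age 40", y = "Mean Flu Vaccinations")
```

Setting the factor levels explicitly controls the panel order, so the groups read from less to more advantaged.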

Within Family Plots

This plot compares differences in SES at age 40 to differences in flu vaccinations, with points colored by SES difference. Marginal plots are omitted for simplicity but can be added using the same structure.

# Within Family
## Setup: Use discord_data

# Main scatter plot
main_plot_within <- ggplot(discord_flu, aes(
  x = s00_h40_diff,
  y = flu_total_diff, colour = s00_h40_diff
)) +
  geom_point(
    size = 0.8, alpha = 0.8, na.rm = TRUE,
    position = position_jitter(width = 0.2, height = 0.2)
  ) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  scale_colour_gradientn(
    name = "Sibling\nDifferences\nin SES",
    colours = shading,
    na.value = "#AD78B6",
    values = scales::rescale(c(
      min(discord_flu$s00_h40_diff, na.rm = TRUE),
      mean(discord_flu$s00_h40_diff, na.rm = TRUE),
      max(discord_flu$s00_h40_diff, na.rm = TRUE)
    ))
  ) +
  theme_minimal() +
  theme(legend.position = "left") +
  labs(x = "Diff SES at Age 40", y = "Diff Flu Vaccinations (2006–2016)")

main_plot_within

You can further facet this plot by the difference in SES between kin to see how the relationship varies across different groups. The following code does this and adds a title to the plot.

main_plot_within + facet_wrap(~ses_diff_group, ncol = 1) +
  theme(legend.position = "bottom") +
  labs(title = "Within Family Differences in SES and Flu Vaccinations")

Conclusion

This vignette demonstrated how to visualize sibling differences in SES and flu vaccinations using discord-structured data. Scatter and density plots highlight associations by SES difference group, and coloring points by the SES difference between kin adds a further layer of insight into the data.