5 Long Form Data
In contrast to wide form data, long form data, also known as “tidy data,” structures datasets where each row represents a single observation, and each column represents a variable. This format is highly beneficial for statistical modeling and data analysis because it simplifies the application of various data manipulation and analysis functions. As before, we’ll use the twinData
dataset from the OpenMx
package, but we’ll convert it to long form to illustrate handling and analyzing data in this format.
5.1 Converting from Wide to Long Form
library(tidyverse)
library(NlsyLinks)
library(discord)
library(BGmisc)
library(OpenMx)
library(conflicted) # to handle conflicts
conflicted::conflicts_prefer(OpenMx::vech,dplyr::filter) # Resolve conflicts
data(twinData)
df_long <- twinData %>% select(-age)
# Convert wide data to long form
df_long <- df_long %>%
pivot_longer(
cols = matches('1$|2$'), # Select columns ending in '1' or '2'
cols_vary = "slowest", # Specify that the columns are in the same order for each twin
names_to = c(".value", "twin"), # Split the column names into variable and twin number
names_pattern = "(.*)(1|2)" # Capture the variable and twin number
)
5.2 Data Structure
Let’s take a look at the structure of the dataset now that it’s in long form.
## [1] "tbl_df" "tbl" "data.frame"
## Rows: 7,616
## Columns: 11
## $ fam <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ zyg <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ part <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ cohort <chr> "younger", "younger", "younger", "younger", "younger", "young…
## $ zygosity <fct> MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, M…
## $ twin <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "…
## $ wt <int> 58, 54, 55, 66, 50, 60, 65, 40, 60, 76, 48, 70, 51, 53, 58, 4…
## $ ht <dbl> 1.7000, 1.6299, 1.6499, 1.5698, 1.6099, 1.5999, 1.7500, 1.559…
## $ htwt <dbl> 20.0692, 20.3244, 20.2020, 26.7759, 19.2894, 23.4375, 21.2245…
## $ bmi <dbl> 20.9943, 21.0828, 21.0405, 23.0125, 20.7169, 22.0804, 21.3861…
## $ age <int> 21, 24, 21, 21, 19, 26, 23, 29, 24, 28, 29, 19, 23, 22, 23, 2…
The dataset now contains a much larger number of observations, as each twin’s data is represented as a separate row. The zygosity variable indicates the zygosity of the twins, while other variables are split into variable and value pairs, reflecting measurements or characteristics of the twins.
5.3 Adding sex
and zyg
To facilitate analyses that depend on sex or zygosity type, we’ll add these as new columns derived from zygosity.
# Add 'sex' and 'zyg' columns based on 'zygosity'
df_long <- df_long %>%
mutate(sex = case_when(
zygosity %in% c("MZFF", "DZFF") ~ "F",
zygosity %in% c("MZMM", "DZMM") ~ "M",
TRUE ~ "OS"
),
zyg = case_when(
zygosity %in% c("MZFF", "MZMM") ~ "MZ",
zygosity %in% c("DZFF", "DZMM", "DZOS") ~ "DZ",
TRUE ~ NA_character_
))
Unfortunately, the data do not contain the information for the gender for each twin, so we will just have to settle for noting that the data is missing.
5.4 Summary Statistics
Once again, let’s calculate summary statistics for numeric variables across the full sample. This will provide a quick overview of central tendencies and variability in the dataset. When working with long form data, it is often helpful to start with summarizing by the data structure you already have. In this case, we will calculate summary statistics by specific measurement across all twins.
Data Structure and Summary Statistics Examine the transformed data and calculate summary statistics similar to those performed in the wide form.
summary_stats_long <- df_long %>%
summarise(across(where(is.numeric), list(
mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
median = ~median(., na.rm = TRUE),
min = ~min(., na.rm = TRUE),
max = ~max(., na.rm = TRUE),
IQR = ~IQR(., na.rm = TRUE)
), .names = "{col}_{fn}")) %>%
pivot_longer(
cols = everything(),
names_to = c("variable", "statistic"),
names_sep = "_"
) %>%
pivot_wider(
names_from = statistic,
values_from = value
)
summary_stats_long %>%
print(n = 15) # to see more rows
## # A tibble: 7 × 7
## variable mean sd median min max IQR
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 fam 1904. 1099. 1904. 1 3808 1904.
## 2 part 1.93 0.265 2 0 2 0
## 3 wt 63.9 11.7 62 34 127 16
## 4 ht 1.68 0.0958 1.68 1.34 1.99 0.150
## 5 htwt 22.6 3.18 22.2 13.3 46.2 3.83
## 6 bmi 21.8 0.941 21.7 18.1 26.8 1.20
## 7 age 34.5 14.2 30 17 88 19
Generate frequency tables for sex and zyg, paralleling the wide form analysis. # Frequency Tables
Paralleling the wide form analysis, let us create frequency tables for categorical variables like zygosity and sex. These tables should provide a clear picture of the distribution of these categories within the dataset. In some ways, these calculations are simpler in long form data because each row is already an individual observation.
# Counting 'zygosity' and calculating percentages
zygosity_summary_long <- df_long %>%
count(zyg, name = "count") %>%
mutate(percentage = count / sum(count) * 100) %>%
rename(category = zyg) %>% # Renaming the column for clarity
mutate(variable = "zygosity") # Adding a descriptor column for the variable
# Counting 'sex' and calculating percentages
sex_summary_long <- df_long %>%
count(sex, name = "count") %>%
mutate(percentage = count / sum(count) * 100) %>%
rename(category = sex) %>% # Renaming the column for clarity
mutate(variable = "sex") # Adding a descriptor column for the variable
# Combining both summaries into a single dataframe
combined_summary_long <- bind_rows(zygosity_summary_long, sex_summary_long) %>%
select(variable, category, everything()) # Reordering columns for clarity
combined_summary_long
## # A tibble: 5 × 4
## variable category count percentage
## <chr> <chr> <int> <dbl>
## 1 zygosity DZ 4018 52.8
## 2 zygosity MZ 3598 47.2
## 3 sex F 3966 52.1
## 4 sex M 1838 24.1
## 5 sex OS 1812 23.8
As you can see, the long form data structure allows for a straightforward calculation of frequency tables for categorical variables. The resulting tables provide a clear picture of the distribution, and it does not differ from the wide form analysis, as long as one remembers to that we’re now looking at individual twins rather than pairs.
5.5 Data Visualization
5.5.1 1. Univariate Distributions
Univariate distributions are used to examine the distribution of a single variable within a dataset. They provide insights into the central tendency, variability, and shape of the data distribution. Unlike wide form data, long form data can be directly visualized using ggplot2 without the need for additional data manipulation.
5.5.1.1 Histograms
Histograms are useful for visualizing the frequency distribution of a single variable. They help identify the distribution pattern, such as normal distribution, skewness, or the presence of outliers.
Histogram of Weight
ggplot(df_long, aes(x = wt)) +
geom_histogram(bins = 30, fill = "blue", color = "black") +
labs(x = "Weight", y = "Frequency", title = "Distribution of Weights") +
theme_minimal()
This histogram shows the distribution of weights across all twins in the dataset. The x-axis represents the weight, while the y-axis shows the frequency of each weight range.
5.5.1.2 Histograms
Histograms are useful for visualizing the frequency distribution of a single variable. They help identify the distribution pattern, such as normal distribution, skewness, or the presence of outliers.
Histogram of Weight
ggplot(df_long, aes(x = wt)) +
geom_histogram(bins = 30, fill = "blue", color = "black") +
labs(x = "Weight", y = "Frequency", title = "Distribution of Weights") +
theme_minimal()
This histogram shows the distribution of weights across all twins in the dataset. The x-axis represents the weight, while the y-axis shows the frequency of each weight range.
Histogram of Weight by Zygosity
ggplot(df_long, aes(x = wt, fill = zyg)) +
geom_histogram(bins = 30, color = "black", alpha = 0.5) +
labs(x = "Weight", y = "Frequency", title = "Distribution of Weights by Zygosity") +
theme_minimal()
This histogram shows the distribution of weights by zygosity type. The fill color distinguishes between monozygotic (MZ) and dizygotic (DZ) twins. The x-axis represents the weight, while the y-axis shows the frequency of each weight range. The transparency helps in visualizing the overlap between the two distributions.
5.5.2 Density Plots
Density plots provide a smooth representation of the distribution of a variable. They are useful for comparing distributions between groups.
Density Plot of Weights by Zygosity
ggplot(df_long, aes(x = wt, fill = zyg)) +
geom_density(alpha = 0.5) +
labs(x = "Weight", y = "Density", title = "Density Plot of Weights by Zygosity") +
theme_minimal()
This density plot shows the distribution of weights by zygosity type. The fill color distinguishes between monozygotic (MZ) and dizygotic (DZ) twins. The x-axis represents the weight, while the y-axis shows the density of each weight range. The plot provides a smooth representation of the distribution, highlighting the differences between the two zygosity groups.
Density Plot of Weights by Sex
ggplot(df_long, aes(x = wt, fill = sex)) +
geom_density(alpha = 0.5) +
labs(x = "Weight", y = "Density", title = "Density Plot of Weights by Sex") +
theme_minimal() +
facet_wrap(~sex)
This density plot shows the weight distributions for male (M) and female (F) twins and OS twins. The plot is faceted by sex to provide a clear comparison between the groups groups.
5.5.2.1 2. Box Plots
Box plots are useful for visualizing the distribution of a variable and identifying potential outliers. They display the median, quartiles, and extremes of the data.
Box Plot of Weights by Zygosity
ggplot(df_long, aes(x = zyg, y = wt, fill = zyg)) +
geom_boxplot() +
labs(x = "Zygosity", y = "Weight", title = "Box Plot of Weights by Zygosity") +
theme_minimal()
This box plot shows the distribution of weights by zygosity type. The x-axis represents the zygosity (monozygotic or dizygotic), while the y-axis shows the weight range. The plot displays the median, quartiles, and potential outliers for each zygosity group.
Violin Plot of Heights by Zygosity
ggplot(df_long, aes(x = zyg, y = ht, fill = zyg)) +
geom_violin() +
labs(x = "Zygosity", y = "Height", title = "Violin Plot of Heights by Zygosity") +
theme_minimal()
This violin plot shows the distribution of heights by zygosity type. The x-axis represents the zygosity (monozygotic or dizygotic), while the y-axis shows the height range. The plot provides a smooth representation of the distribution, highlighting the differences between the two zygosity groups.
5.5.3 2. Bivariate Distributions
Bivariate distributions are used to examine the relationship between two variables. They help in understanding the correlation and interaction between the variables.
5.5.3.1 Scatter Plots
Scatter plots are useful for visualizing the relationship between two continuous variables. They help identify patterns, trends, and potential outliers in the data.
Scatter Plot of Height vs. Weight
ggplot(df_long, aes(x = wt, y = ht, color = zyg)) +
geom_point(alpha = 0.5) +
labs(x = "Weight", y = "Height", title = "Scatter Plot of Weight vs. Height by Zygosity") +
theme_minimal() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
This scatter plot shows the relationship between weight and height, with points colored by zygosity. A linear regression line is added to highlight the trend in the data. The x-axis represents the weight, while the y-axis shows the height. The plot helps in understanding the correlation between weight and height, as well as the differences between monozygotic and dizygotic twins.
5.5.4 4. Marginal Density Plots
Marginal density plots are useful for adding distribution information to scatter plots, providing additional context on the variables’ distributions.
Marginal Density Plot of Height vs. Weight
library(ggExtra)
p <- ggplot(df_long, aes(x = wt, y = ht, color = zyg)) +
geom_point(alpha = 0.5) +
labs(x = "Weight", y = "Height", title = "Scatter Plot of Weight vs. Height by Zygosity") +
theme_minimal()
ggMarginal(p, type = "density")
This plot shows a scatter plot of weight vs. height with marginal density plots on the x and y axes, providing additional distribution information.
5.5.5 5. Correlations
Correlation matrices and correlograms are useful for visualizing the relationships between multiple variables. They show the strength and direction of correlations between variables.
** Correlation Matrix of Twin Data **
library(ggcorrplot)
# Select only the variables of interest
df_cor_long <- df_long %>% select(wt, ht)
# Compute correlation matrix
corr_long <- cor(df_cor_long ,use="pairwise.complete") %>% round(2)
ggcorrplot(corr_long, type = "lower", lab = TRUE,
lab_size = 3,
method = "circle",
colors = c("tomato2", "white", "springgreen3"),
title = "Correlation Matrix of Twin Data",
ggtheme = theme_bw)
This correlation matrix shows the relationships between weight and height. The values indicate the strength and direction of the correlations, with the color representing the correlation strength.
Correlation Matrix by Zygosity
corr_zyg_long <- df_long %>%
group_by(zyg) %>%
summarise(
cor_wt_ht = cor(wt, ht, use = "pairwise.complete")
) %>%
pivot_longer(-zyg, names_to = "pairs", values_to = "correlation") %>%
unite("pairs", pairs, zyg, sep = "_") %>%
pivot_wider(names_from = pairs, values_from = correlation)
combined_matrix_long <- matrix(1, nrow = 2, ncol = 2)
rownames(combined_matrix_long) <- colnames(combined_matrix_long) <- c("wt", "ht")
# Fill the lower triangle with MZ correlations
combined_matrix_long[lower.tri(combined_matrix_long)] <- c(corr_zyg_long$cor_wt_ht_MZ)
# Fill the upper triangle with DZ correlations
combined_matrix_long[upper.tri(combined_matrix_long)] <- c(corr_zyg_long$cor_wt_ht_DZ)
ggcorrplot(combined_matrix_long, show.diag = TRUE, lab = TRUE,
lab_size = 3, method = "circle",
colors = c("tomato2", "white", "springgreen3"),
title = "Correlation Matrix of Twin Data by Zygosity",
ggtheme = theme_bw) +
labs(caption = "MZ correlations in the lower triangle,\nDZ correlations in the upper triangle")
This plot shows a correlation matrix separated by zygosity. MZ correlations are displayed in the lower triangle, and DZ correlations are in the upper triangle, allowing for a comparison of correlation strengths between the two groups.