3 Wide Form Data

Wide form data is a common way to store data, especially when it comes from family data. surveys or experiments where each family has multiple members (twins, siblings etc) In this form, each row represents a family, and each column represents a variable or measurement for a specific family member. This format is easy to understand and work with, but it can be challenging for certain types of analyses or visualizations. In this section, demonstrate straightforward ways to import, summarize, and visualize wide form data using the twinData dataset from the OpenMx package.

3.1 Import Data

We’re going to use the twinData dataset from the OpenMx package. This dataset contains 3,808 pairs of Australian twins on the 12 variables.

library(tidyverse)
library(NlsyLinks)
library(discord)
library(BGmisc)
library(OpenMx)
library(conflicted) # to handle conflicts
conflicted::conflicts_prefer(OpenMx::vech,dplyr::filter) # Resolve conflicts


data(twinData)

df_wide <- twinData

3.2 Data Structure

Let’s take a look at the structure of the dataset using the class and glimpse functions.

class(df_wide)

## [1] "data.frame"

glimpse(df_wide)

## Rows: 3,808
## Columns: 16
## $ fam      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ age      <int> 21, 24, 21, 21, 19, 26, 23, 29, 24, 28, 29, 19, 23, 22, 23, 2…
## $ zyg      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ part     <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ wt1      <int> 58, 54, 55, 66, 50, 60, 65, 40, 60, 76, 48, 70, 51, 53, 58, 4…
## $ wt2      <int> 57, 53, 50, 76, 48, 60, 65, 39, 57, 64, 51, 67, 51, 55, 57, 5…
## $ ht1      <dbl> 1.7000, 1.6299, 1.6499, 1.5698, 1.6099, 1.5999, 1.7500, 1.559…
## $ ht2      <dbl> 1.7000, 1.6299, 1.6799, 1.6499, 1.6299, 1.5698, 1.7698, 1.529…
## $ htwt1    <dbl> 20.0692, 20.3244, 20.2020, 26.7759, 19.2894, 23.4375, 21.2245…
## $ htwt2    <dbl> 19.7232, 19.9481, 17.7154, 27.9155, 18.0662, 24.3418, 20.7476…
## $ bmi1     <dbl> 20.9943, 21.0828, 21.0405, 23.0125, 20.7169, 22.0804, 21.3861…
## $ bmi2     <dbl> 20.8726, 20.9519, 20.1210, 23.3043, 20.2583, 22.3454, 21.2270…
## $ cohort   <chr> "younger", "younger", "younger", "younger", "younger", "young…
## $ zygosity <fct> MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, MZFF, M…
## $ age1     <int> 21, 24, 21, 21, 19, 26, 23, 29, 24, 28, 29, 19, 23, 22, 23, 2…
## $ age2     <int> 21, 24, 21, 21, 19, 26, 23, 29, 24, 28, 29, 19, 23, 22, 23, 2…

As we can see, the dataset contains 3808 observations of 16 variables. Most of these variables include two entries, one for twin 1 and one for twin 2. The zygosity variable indicates the zygosity of the twins, while other variables represent different measurements or characteristics of the twins, and denote which twin they refer to by the suffixes 1 and 2.

We can also see that the dataset contains a mix of numeric and character variables. To improve the readability of the dataset, we are going to split the information from the zygosity variable into two separate variables: sex and zyg. This will make it easier to analyze and visualize the data later on.

df_wide <- df_wide %>% mutate(sex = 
                      case_when(zygosity %in% c("MZFF","DZFF") ~ "F", 
                                zygosity %in% c("MZMM","DZMM") ~ "M",
                                TRUE ~ "OS"),
                    zyg = case_when(zygosity %in% c("MZFF","MZMM") ~ "MZ", 
                                    zygosity %in% c("DZFF","DZMM","DZOS") ~ "DZ",
                                TRUE ~ NA_character_ ))

3.2.1 Annotated explainaton for the code snippet above:

The provided R code uses the tidyverse package’s dplyr library to manipulate a data frame named df_wide. The mutate function is used to create new variables or modify existing ones within the data frame. (In this case it is creating sex and over writing zyg. Here’s an annotation of what each part of the code is doing:

Data Frame Assignment: df_wide <- df_wide %>%

This line indicates that we are taking the existing data frame df_wide and using the %>% operator (pipe) to pass it through additional functions. The result will be stored back into the df_wide.

Mutate Function: mutate(sex = ..., zyg = ...)

The mutate function is used to add new columns to the data frame or change existing ones. In this case, one new column sex is being added and the existing column zyg is being modified.

Creating sex Column:

sex = case_when(...) creates a new column named sex based on conditions applied to the zygosity column. The case_when function is similar to a series of if-else statements. For each row, it checks the conditions in order and assigns a value to sex based on the first matching condition.

Conditions for sex Column:

zygosity %in% c("MZFF","DZFF") ~ "F" condition checks if the zygosity value is either “MZFF” or “DZFF”. If true, “F” (Female) is assigned to the sex column.
zygosity %in% c("MZMM","DZMM") ~ "M" checks if the zygosity value is either “MZMM” or “DZMM”. If true, “M” (Male) is assigned to the sex column.
TRUE ~ "OS" is a catch-all condition that assigns “OS” (Other or Unknown sex) if none of the above conditions are met.

Creating zyg Column:

Similar to the sex column, zyg = case_when(...) overwrites an old column zyg that categorizes zygosity into broader categories.

Conditions for zyg Column:

zygosity %in% c("MZFF","MZMM") ~ "MZ" checks if zygosity is either “MZFF” or “MZMM” and assigns “MZ” (Monozygotic) indicating identical twins.
zygosity %in% c("DZFF","DZMM","DZOS") ~ "DZ" checks if zygosity falls into any of “DZFF”, “DZMM”, or “DZOS”, assigning “DZ” (Dizygotic) indicating fraternal twins.
TRUE ~ NA_character_ assigns a missing value (NA) if none of the above conditions are met, possibly used to handle cases where zygosity data is not clearly defined or is missing.

This code effectively categorizes individuals in the dataset based on zygosity into more workable groups for sex and twin status, which can (and will) be useful for further analysis and visualization.

3.3 Summary Statistics (Wide Form)

Let’s calculate summary statistics for numeric variables across the full sample. This will provide a quick overview of central tendencies and variability in the dataset. When working with wide form data, it is often helpful to start with summarizing by the data structure you already have. In this case, we will calculate summary statistics by specific twin (twin 1 or twin 2).

3.3.1 Numeric Variables

Calculate summary statistics for numeric variables across the full sample. This helps provide a quick overview of central tendencies and variability.

# Calculate summary statistics for numeric variables across the full sample
summary_stats <- df_wide %>%
  summarise(across(where(is.numeric), list(
    mean = ~mean(., na.rm = TRUE),
    sd = ~sd(., na.rm = TRUE),
    median = ~median(., na.rm = TRUE),
    min = ~min(., na.rm = TRUE),
    max = ~max(., na.rm = TRUE),
    IQR = ~IQR(., na.rm = TRUE)
  ), .names = "{col}_{fn}")) %>%
  pivot_longer(
    cols = everything(),
    names_to = c("variable", "statistic"),
    names_sep = "_"
  ) %>%
  pivot_wider(
    names_from = statistic,
    values_from = value
  )

summary_stats

## # A tibble: 13 × 7
##    variable    mean        sd  median   min     max      IQR
##    <chr>      <dbl>     <dbl>   <dbl> <dbl>   <dbl>    <dbl>
##  1 fam      1904.   1099.     1904.    1    3808    1904.   
##  2 age        34.5    14.2      30    17      88      19    
##  3 part        1.93    0.265     2     0       2       0    
##  4 wt1        62.2    10.9      60    34     111      15    
##  5 wt2        65.6    12.2      65    35     127      16    
##  6 ht1         1.66    0.0896    1.65  1.34    1.98    0.130
##  7 ht2         1.69    0.0990    1.7   1.40    1.99    0.140
##  8 htwt1      22.5     3.16     21.9  13.3    41.9     3.84 
##  9 htwt2      22.8     3.19     22.3  14.2    46.2     3.78 
## 10 bmi1       21.7     0.940    21.6  18.1    26.1     1.21 
## 11 bmi2       21.8     0.939    21.7  18.6    26.8     1.18 
## 12 age1       34.5    14.2      30    17      88      19    
## 13 age2       34.5    14.2      30    17      88      19

As you can see in the table above, the summary statistics provide a quick overview of the central tendencies and variability in the dataset for numeric variables. The table includes the mean, standard deviation, median, minimum, maximum, and interquartile range (IQR) for each numeric variable. But they do so by the twin number, which is useful for our purposes, but… isn’t the only descriptive information we’ll need. We will need to calculate these statistics for the full sample, not by twin number. But we’ll still going to start with the twin number, and then we’ll calculate the summary statistics for the full sample.

3.3.2 Frequency Tables

Create frequency tables for categorical variables like zygosity and sex, providing a clear picture of the distribution of these categories within the dataset. These numbers are easier to calculation when data are wide like this because we’re actually interested in the dyad, not the individual.

# Counting 'zygosity' and calculating percentages
zygosity_summary <- df_wide %>%
  count(zyg, name = "count") %>%
  mutate(percentage = count / sum(count) * 100) %>%
  rename(category = zyg) %>%  # Renaming the column for clarity
  mutate(variable = "zygosity")  # Adding a descriptor column for the variable

# Counting 'sex' and calculating percentages
sex_summary <- df_wide %>%
  count(sex, name = "count") %>%
  mutate(percentage = count / sum(count) * 100) %>%
  rename(category = sex) %>%  # Renaming the column for clarity
  mutate(variable = "sex")  # Adding a descriptor column for the variable

# Combining both summaries into a single dataframe
combined_summary <- bind_rows(zygosity_summary, sex_summary) %>%
  select(variable, category, everything())  # Reordering columns for clarity

combined_summary

##   variable category count percentage
## 1 zygosity       DZ  2009   52.75735
## 2 zygosity       MZ  1799   47.24265
## 3      sex        F  1983   52.07458
## 4      sex        M   919   24.13340
## 5      sex       OS   906   23.79202

As you can see in the table above, the frequency tables provide a clear picture of the distribution of categories within the dataset for the zygosity and sex. Interestingly, there are many more same sex female twins, which make up 1983/3808 (52%) of the dataset.

What if you want to examine these wideform by cohort? You can use the group_by function to group the data by cohort and then calculate the summary statistics for each cohort.

library(tidyverse)

# Grouping by 'cohort' and calculating summary statistics for each group across values that are numeric
df_summary <- df_wide %>%
    group_by(cohort) %>%
    summarise(across(where(is.numeric), list(
        mean = ~mean(., na.rm = TRUE),
        sd = ~sd(., na.rm = TRUE),
        median = ~median(., na.rm = TRUE),
        IQR = ~IQR(., na.rm = TRUE)
    ), .names = "{.col}_{.fn}")) %>%
    # Ensure only statistic columns are being pivoted
    pivot_longer(
        cols = -cohort,  # Exclude 'cohort' from pivoting
        names_to = c("variable", "statistic"),
        names_sep = "_"
    )
# lock in the variable order
variable_order <- unique(df_summary$variable)

df_summary <- df_summary  %>% 
    pivot_wider(
        names_from = statistic,
        values_from = value,
        values_fn = list(value = mean)  # Aggregate if necessary
    )

df_summary  %>% mutate(variable = factor(variable, levels = variable_order)) %>%
    arrange(variable, cohort)

## # A tibble: 26 × 6
##    cohort  variable    mean      sd median   IQR
##    <chr>   <fct>      <dbl>   <dbl>  <dbl> <dbl>
##  1 older   fam      2859    548.      2859   949
##  2 younger fam       955    551.       955   954
##  3 older   age        45.4   12.2       42    19
##  4 younger age        23.5    3.66      23     7
##  5 older   part        1.95   0.224      2     0
##  6 younger part        1.92   0.299      2     0
##  7 older   wt1        63.8   11.2       62    15
##  8 younger wt1        60.5   10.3       59    14
##  9 older   wt2        67.0   12.7       66    19
## 10 younger wt2        64.1   11.6       64    17
## # ℹ 16 more rows

As you can see from the table, the summary statistics are calculated for each cohort across the numeric variables. This provides a quick overview of the central tendencies and variability in the dataset for each cohort. Now, these data are still in wide form, but we can easily convert them to long form if needed.

What about descriptive statistics by zygosity and sex?

library(tidyverse)

# Grouping by 'zyg,sex' and calculating summary statistics for each group
df_summary <- df_wide %>%
    group_by(zyg,sex) %>%
    summarise(across(where(is.numeric), list(
        mean = ~mean(., na.rm = TRUE),
        sd = ~sd(., na.rm = TRUE),
        median = ~median(., na.rm = TRUE),
        IQR = ~IQR(., na.rm = TRUE)
    ), .names = "{.col}_{.fn}")) %>%
    # Ensure only statistic columns are being pivoted
    pivot_longer(
        cols = -c(zyg,sex),  # Exclude 'zygosity' from pivoting
        names_to = c("variable", "statistic"),
        names_sep = "_"
    )

## `summarise()` has grouped output by 'zyg'. You can override using the `.groups`
## argument.

# lock in the variable order
variable_order <- unique(df_summary$variable)

df_summary <- df_summary  %>% 
    pivot_wider(
        names_from = statistic,
        values_from = value,
        values_fn = list(value = mean)  # Aggregate if necessary
    )

df_summary  %>% mutate(variable = factor(variable, levels = variable_order)) %>%
    arrange(variable,zyg,sex)

## # A tibble: 65 × 7
## # Groups:   zyg [2]
##    zyg   sex   variable   mean     sd median   IQR
##    <chr> <chr> <fct>     <dbl>  <dbl>  <dbl> <dbl>
##  1 DZ    F     fam      2109.  1029.   2891  2048 
##  2 DZ    M     fam      2144.  1009.   1370. 2042.
##  3 DZ    OS    fam      2509.   980.   1852. 1956.
##  4 MZ    F     fam      1338.   992.   1956. 1956.
##  5 MZ    M     fam      1750.  1010.   2583  2013 
##  6 DZ    F     age        35.4   14.3    32    20 
##  7 DZ    M     age        32.3   13.9    28    16 
##  8 DZ    OS    age        32.9   13.9    28    17 
##  9 MZ    F     age        35.7   14.3    32    21 
## 10 MZ    M     age        34.4   14.1    31    18 
## # ℹ 55 more rows

As you can see from the table, the summary statistics are calculated for each zygosity and sex across the numeric variables.