Skip to contents

Introduction

The {discord} package was originally developed for use with the National Longitudinal Survey of Youth (NLSY), but its functionality extends far beyond that. When paired with its sister package {BGmisc}, discord can be applied to any dataset containing basic family structure information, allowing researchers to explore genetic and environmental influences without requiring pre-constructed kinship links.

This vignette demonstrates how to:

  • Construct kinship links from simple family data (e.g., individual ID, mother ID, father ID).
  • Simulate phenotyipic data based on known genetic and environmental structures.
  • Fit a discordant-kinship regression model using the simulated data.

We use tools from {BGmisc} and a toy dataset to illustrate the workflow.

Loading Packages and Data

We begin by loading the required packages and a built-in dataset from {BGmisc}.

We rename the family ID column to avoid naming conflicts and generate a pedigree-encoded data frame.

df_potter <- potter

names(df_potter)[names(df_potter) == "famID"] <- "oldfam"

df_potter <- ped2fam(df_potter,
  famID = "famID",
  personID = "personID"
)

We also verify and repair sex coding to ensure compatibility with downstream pedigree operations.

df_potter <- checkSex(df_potter,
  code_male = 1,
  code_female = 0,
  verbose = FALSE, repair = TRUE
)
BGmisc::plotPedigree(df_potter, verbose = FALSE)
Pedigree plot of the Potter dataset

Pedigree plot of the Potter dataset

#> named list()

The pedigree plot provides a visual representation of the kinship structure in the dataset. Each node represents an individual, and the edges indicate familial relationships.

To extract the necessary kinship information, we need to compute two matrices: the additive genetic relatedness matrix and the shared environment matrix. These matrices are derived from the pedigree data and are essential for understanding the genetic and environmental relationships among individuals. Using {BGmisc}, we compute:

  • The additive genetic relatedness matrix (add).

  • The shared environment matrix (cn), indicating whether kin were raised together (1) or apart (0).

add <- ped2add(df_potter)
cn <- ped2cn(df_potter)

The ped2add() function computes the additive genetic relatedness matrix, which quantifies the genetic similarity between individuals based on their pedigree information. The ped2cn() function computes the shared environment matrix, which indicates whether individuals were raised in the same environment (1) or different environments (0).

The resulting matrices are symmetric, with diagonal elements representing self-relatedness (1.0). The off-diagonal elements represent the relatedness between pairs of individuals, with values ranging from 0 (no relatedness) to 0.5 (full siblings) to 1 (themselves).

We convert the component matrices into a long-form data frame of kin pairs using com2links(). Self-pairs and duplicate entries are removed.

df_links <- com2links(
  writetodisk = FALSE,
  ad_ped_matrix = add, cn_ped_matrix = cn,
  drop_upper_triangular = TRUE
) %>%
  filter(ID1 != ID2)

df_links %>%
  slice(1:10) %>%
  knitr::kable()
ID1 ID2 addRel cnuRel
1 2 0.500 1
3 4 0.500 1
1 6 0.500 0
2 6 0.250 0
3 6 0.500 0
4 6 0.250 0
3 7 0.250 0
4 7 0.500 0
5 7 0.500 0
6 7 0.125 0

We then extract two subsets:

  • Full siblings: additive relatedness = 0.5 and shared environment = 1

  • Cousins: additive relatedness = 0.125 and shared environment = 0

df_sim <- df_links %>%
  filter(addRel == .5) %>% # only full siblings %>%
  filter(cnuRel == 1) # only kin raised in the same home

df_cousin <- df_links %>%
  filter(addRel == .125) %>% # only cousins %>%
  filter(cnuRel == 0) # only kin raised in separate homes

Simulate Phenotypic Data

To simulate phenotypic data, we need to create a data frame that includes the kinship information and the outcome variables. We will simulate two outcome variables (y1 and y2) for each cousin pair in the dataset. The kinsim() function from {discord} is used to generate the simulated data based on the specified variance structure.

set.seed(1234)
syn_df <- discord::kinsim(
  mu_all = c(1, 1),
  cov_a = .4,
  cov_e = .4,
  c_all = 0,
  r_vector = df_cousin$addRel
) %>%
  select(-c(
    A1_1, A1_2, A2_1, A2_2,
    C1_1, C1_2, C2_1, C2_2,
    E1_1, E1_2, E2_1, E2_2,
    r
  ))

The simulated data reflect a known variance structure: additive genetic covariance = .4, genetic correlation = 0.125, no shared environment, and residual (unique environment) variance = 0.4. Latent component scores are excluded from the final dataset, but they can be useful for debugging and understanding the underlying structure of the data.

We bind the simulated outcome data to the cousin link data to prepare it for modeling.

data_demo <- cbind(df_cousin, syn_df)


summary(data_demo)
#>       ID1             ID2             addRel          cnuRel       y1_1        
#>  Min.   : 3.00   Min.   :  7.00   Min.   :0.125   Min.   :0   Min.   :-3.6557  
#>  1st Qu.:21.00   1st Qu.: 26.00   1st Qu.:0.125   1st Qu.:0   1st Qu.:-0.4007  
#>  Median :23.00   Median : 28.00   Median :0.125   Median :0   Median : 0.3537  
#>  Mean   :21.62   Mean   : 36.55   Mean   :0.125   Mean   :0   Mean   : 0.6055  
#>  3rd Qu.:24.00   3rd Qu.: 30.00   3rd Qu.:0.125   3rd Qu.:0   3rd Qu.: 1.6073  
#>  Max.   :28.00   Max.   :104.00   Max.   :0.125   Max.   :0   Max.   : 4.5506  
#>       y1_2              y2_1              y2_2               id      
#>  Min.   :-1.1505   Min.   :-2.2658   Min.   :-2.3618   Min.   : 1.0  
#>  1st Qu.:-0.0983   1st Qu.:-0.4629   1st Qu.: 0.3215   1st Qu.:12.5  
#>  Median : 1.1069   Median : 0.7087   Median : 1.4460   Median :24.0  
#>  Mean   : 1.1957   Mean   : 0.9443   Mean   : 1.1523   Mean   :24.0  
#>  3rd Qu.: 2.1121   3rd Qu.: 2.2073   3rd Qu.: 2.1726   3rd Qu.:35.5  
#>  Max.   : 5.0089   Max.   : 5.3431   Max.   : 4.2828   Max.   :47.0

The data_demo data frame now contains the kinship information along with the simulated outcome variables y1 and y2. Each row represents a pair of cousins, and the columns include the IDs of the individuals, their relatedness, and the simulated phenotypic data.

Fitting a Discordant-Kinship Regression Model

We then use discord_regression() to fit a discordant-kinship model, predicting y1 from y2. Based on the structure of the data, we expect that there will be a significant association between the two outcome variables, as there is a known overlapping non-shared environment covariance.

The model is fit using the discord_regression() function, which takes the following arguments:

model_output <- discord_regression(
  data = data_demo,
  outcome = "y1",
  predictors = "y2",
  id = "id",
  sex = NULL,
  race = NULL,
  pair_identifiers = c("_1", "_2")
)
summary(model_output)
#> 
#> Call:
#> stats::lm(formula = y1_diff ~ y1_mean + y2_diff + y2_mean, data = preppedData)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.6044 -0.9315 -0.1824  0.7204  2.7929 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  1.21796    0.24153   5.043  8.8e-06 ***
#> y1_mean      0.08375    0.13099   0.639   0.5260    
#> y2_diff      0.27022    0.08830   3.060   0.0038 ** 
#> y2_mean      0.11734    0.11812   0.993   0.3261    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.117 on 43 degrees of freedom
#> Multiple R-squared:  0.2033, Adjusted R-squared:  0.1477 
#> F-statistic: 3.657 on 3 and 43 DF,  p-value: 0.01959

The output of the model includes estimates of the regression coefficients, standard errors, and p-values for the association between the two outcome variables.

Conclusion

This vignette demonstrates how {BGmisc} and discord enable researchers to perform discordant-kinship analyses starting from simple family data. There’s no need for pre-constructed kinship links or specialized datasets like the NLSY—just basic family identifiers are sufficient to generate kinship structures and apply powerful behavior genetic models.

References