
Full Data Workflow
S. Mason Garrison
Source:vignettes/full_data_workflow.Rmd
full_data_workflow.Rmd
Introduction
This vignette provides a complete workflow for conducting discordant-kinship regression using the {discord} package. We encourage you to read the accompanying paper (Garrison et al., 2025) for a full explanation of the methods and their applications. Here, we demonstrate how to transform your data from its original structure into a format suitable for analysis, then walk through both standard OLS regression (for comparison) and discordant-kinship regression.
The tutorial covers three common starting data structures: wide format, long format, and pedigree data. We show how to prepare each for analysis using simulated data for reproducibility. You can adapt our analysis pipeline to your own data by substituting your variables and kinship links.
Data Structures and Preparation
Your data likely exists in one of three common formats (wide, long,
or pedigree). Regardless of starting point, the goal is to create
wide-format data where each row represents a kinship pair, with
variables for each member of the pair distinguished by suffixes (e.g.,
_1
and _2
). Understanding your data structure
determines which transformation steps you’ll need.
Setup
First, load the necessary packages:
# For easy data manipulation
library(dplyr)
library(tidyr)
# For kinship linkages
library(NlsyLinks)
# For discordant-kinship regression
library(discord)
# To clean data frames
library(janitor)
library(broom)
# For pipe
library(magrittr)
# For pedigree data manipulation
library(BGmisc)
# For pedigree plotting
library(ggpedigree)
library(ggplot2)
Wide Format Data
Wide format is the most direct structure for our analysis. Each row
represents one kinship pair, with variables appearing twice—once for
each member of the pair—using suffixes to distinguish the two
individuals. You’ll see columns like age_s1
and
age_s2
, where _s1
and _s2
identify which sibling’s data appears in each column.
Here’s an example with simulated sibling data:
df_wide <- data.frame(
pid = 1:5,
id_s1 = c(101, 201, 301, 401, 501),
id_s2 = c(102, 202, 302, 402, 502),
age_s1 = c(30, 27, 40, 36, 30),
age_s2 = c(28, 25, 38, 35, 28),
height_s1 = c(175, 160, 180, 170, 165),
height_s2 = c(170, 162, 178, 172, 168),
weight_s1 = c(70, 60, 80, 75, 65),
weight_s2 = c(68, 62, 78, 74, 66)
)
df_wide %>%
slice(1:5) %>%
knitr::kable()
pid | id_s1 | id_s2 | age_s1 | age_s2 | height_s1 | height_s2 | weight_s1 | weight_s2 |
---|---|---|---|---|---|---|---|---|
1 | 101 | 102 | 30 | 28 | 175 | 170 | 70 | 68 |
2 | 201 | 202 | 27 | 25 | 160 | 162 | 60 | 62 |
3 | 301 | 302 | 40 | 38 | 180 | 178 | 80 | 78 |
4 | 401 | 402 | 36 | 35 | 170 | 172 | 75 | 74 |
5 | 501 | 502 | 30 | 28 | 165 | 168 | 65 | 66 |
In this example, pid
is a unique identifier for each
sibling pair. Specifically, each row is anchored by a pair identifier.
Variables ending in _s1
refer to one sibling, while those
ending in _s2
refer to the other. In this case, the dataset
is already arranged so that the older sibling is _s1
and
the younger is _s2
, but this ordering will be adjusted
later based on the outcome variable. If your data already has this
structure, you can skip ahead to the “Ordering and Derived Variables”
section.
Long Format Data Example
Long format structures data with one row per individual. In other words, each person appears in their own row, with a pair identifier column linking siblings together. This format is common in many datasets, particularly those downloaded from data repositories, but requires transformation for our analysis. For example, the NLSY datasets by default provide data in long format.
We can demonstrate long format by reshaping our wide data from above:
df_long <- df_wide %>%
tidyr::pivot_longer(
cols = -pid, # keep the dyad identifier intact
names_to = c(".value", "sibling"), # split base names and the sibling marker
names_sep = "_s" # original suffix delimiter in column names
)
df_long %>%
slice(1:10) %>%
knitr::kable()
pid | sibling | id | age | height | weight |
---|---|---|---|---|---|
1 | 1 | 101 | 30 | 175 | 70 |
1 | 2 | 102 | 28 | 170 | 68 |
2 | 1 | 201 | 27 | 160 | 60 |
2 | 2 | 202 | 25 | 162 | 62 |
3 | 1 | 301 | 40 | 180 | 80 |
3 | 2 | 302 | 38 | 178 | 78 |
4 | 1 | 401 | 36 | 170 | 75 |
4 | 2 | 402 | 35 | 172 | 74 |
5 | 1 | 501 | 30 | 165 | 65 |
5 | 2 | 502 | 28 | 168 | 66 |
Notice how each individual now occupies their own row, with the
sibling
column indicating whether they were originally
labeled 1
or 2
. The pid
column
still identifies which individuals form a pair.
If your data does not already have a unique identifier for each kinship pair, you may need to construct your own based on the available information or download one from a kinship linkage resources. Ideally, you will also have information about the type of kinship (e.g., full siblings, half-siblings, cousins) and whether they were raised together or apart. (More on how to do this below for pedigree data.)
Converting Long to Wide
To convert long format data for discordant-kinship analysis, use
pivot_wider()
from the {tidyr} package:
df_long2wide <- df_long %>%
tidyr::pivot_wider(
names_from = sibling, # the column that indicates the sibling number
values_from = c(id, age, height, weight), # variables to spread into paired columns
names_sep = "_s" # ensures id_s1, id_s2,etc
)
df_long2wide %>%
slice(1:5) %>%
knitr::kable()
pid | id_s1 | id_s2 | age_s1 | age_s2 | height_s1 | height_s2 | weight_s1 | weight_s2 |
---|---|---|---|---|---|---|---|---|
1 | 101 | 102 | 30 | 28 | 175 | 170 | 70 | 68 |
2 | 201 | 202 | 27 | 25 | 160 | 162 | 60 | 62 |
3 | 301 | 302 | 40 | 38 | 180 | 178 | 80 | 78 |
4 | 401 | 402 | 36 | 35 | 170 | 172 | 75 | 74 |
5 | 501 | 502 | 30 | 28 | 165 | 168 | 65 | 66 |
The reshaped data now reproduces our original wide format. Each
analytic variable appears twice with _s1
and
_s2
suffixes, and the pair identifier pid
anchors each row.
Using existing kinship links
As mentioned before, if your long-format data lacks pair identifiers, you’ll need to create them using family IDs, household identifiers, or external kinship resources. This process can be straightforward if you have access to family or household identifiers, and are confident in the kinship relationships. However, it can be more complex if the relationships are not clearly defined or if there are multiple types of kinship (e.g., siblings, cousins, half-siblings) in your data.
For NLSY data specifically, the {NlsyLinks} package provides validated kinship links for the vast majority of dyads in the NLSY79 and NLSY97 cohorts. These links are provided in wide format, and can be used to left-join with the long-format data that the NLSY info website provides. See the {NlsyLinks} documentation for more details on how to access and use these datasets, or the reproducible source code for the analyses in the primary paper (Garrison et al., 2025).
library(NlsyLinks)
data(Links79PairExpanded)
Links79PairExpanded %>%
arrange(ExtendedID) %>%
filter(RelationshipPath == "Gen1Housemates" & RFull == 0.5) %>%
slice_head(n = 5) %>%
select(
ExtendedID,
SubjectTag_S1, SubjectTag_S2,
RelationshipPath, RFull, IsMz,
EverSharedHouse
) %>%
knitr::kable()
ExtendedID | SubjectTag_S1 | SubjectTag_S2 | RelationshipPath | RFull | IsMz | EverSharedHouse |
---|---|---|---|---|---|---|
3 | 300 | 400 | Gen1Housemates | 0.5 | No | TRUE |
5 | 500 | 600 | Gen1Housemates | 0.5 | No | TRUE |
13 | 1300 | 1400 | Gen1Housemates | 0.5 | No | TRUE |
17 | 1700 | 1800 | Gen1Housemates | 0.5 | No | TRUE |
20 | 2000 | 2100 | Gen1Housemates | 0.5 | No | TRUE |
As you can see, this dataset includes an extended family identifier
(ExtendedID
), individual IDs for each sibling in the pair
(R0000100
and R0000200
), and their
RelationshipPath (RelationshipPath
= “Gen1Housemates” and
genetic relatedness (RFull
= 0.5). You can merge this
dataset with your long-format data using the individual IDs.
At this stage the long‑origin data are indistinguishable from the
original wide example. Each analytic variable now appears exactly
twice—once for _s1
and once for _s2
—and the
dyad identifier pid
continues to anchor the row. You can
proceed directly to ordering and construction of *_mean
and
*_diff
with discord_data()
.
Pedigree Data
Pedigree data is common in genetic and family studies, where detailed family trees are available. It is long format data that provides rich information about familial relationships, such as (mother ID, father ID, spouse ID). Our research team has developed specialized tools in the {BGmisc} package to extract kinship links and transform these pedigree structures into a wide format suitable for analysis. Below is an example of how such a dataset might look in both tabular and graphical forms:
data(potter)
ggpedigree(potter, config = list(
label_method = "geom_text",
label_nudge_y = .25,
focal_fill_personID = 7,
focal_fill_include = TRUE,
focal_fill_force_zero = TRUE,
focal_fill_na_value = "grey50",
focal_fill_low_color = "darkred",
focal_fill_high_color = "gold",
focal_fill_mid_color = "orange",
focal_fill_scale_midpoint = .65,
focal_fill_component = "additive",
focal_fill_method = "steps", #
# focal_fill_method = "viridis_c",
focal_fill_use_log = FALSE,
focal_fill_n_breaks = 10,
sex_color_include = TRUE,
focal_fill_legend_title = "Genetic Relatives \nof Harry Potter"
)) +
labs(title = "Potter Pedigree Plot") +
theme(legend.position = "right")

Pedigree plot of the Potter dataset
The pedigree tree above illustrates the family relationships among individuals in the Potter dataset. Each node represents an individual, and the lines connecting them indicate familial relationships such as parent-child and sibling connections.
data(potter)
df_ped <- potter %>%
as.data.frame() %>%
select(
personID, sex, famID, momID, dadID, spouseID,
twinID, zygosity
) %>%
mutate(x_var = round(rnorm(nrow(.), mean = 0, sd = 1), digits = 2))
df_ped %>%
slice(1:5) %>%
knitr::kable(digits = 2)
personID | sex | famID | momID | dadID | spouseID | twinID | zygosity | x_var |
---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 101 | 102 | 3 | NA | NA | -1.40 |
2 | 0 | 1 | 101 | 102 | NA | NA | NA | 0.26 |
3 | 0 | 1 | 103 | 104 | 1 | NA | NA | -2.44 |
4 | 0 | 1 | 103 | 104 | 5 | NA | NA | -0.01 |
5 | 1 | 1 | NA | NA | 4 | NA | NA | 0.62 |
As you can see, each individual is represented in a separate row, with columns for their unique identifier, mother’s identifier, father’s identifier, and other relevant variables.
{BGmisc} is available on CRAN and GitHub. It can create kinship links existing pedigree information.
To transform this data into a wide format suitable for discordant-kinship regression, we need to create kinship links based on the pedigree information. To extract the necessary kinship information, we need to compute two matrices: the additive genetic relatedness matrix and the shared environment matrix. Other matrices, such as mitochondrial, can also be computed if needed.
Using {BGmisc}, we compute:
The additive genetic relatedness matrix (add).
The shared environment matrix (cn), indicating whether kin were raised together (1) or apart (0).
The ped2add()
function computes the additive genetic
relatedness matrix, which quantifies the genetic similarity between
individuals based on their pedigree information. The
ped2cn()
function computes the shared environment matrix,
which indicates whether individuals were raised in the same environment
(1) or different environments (0).
The resulting matrices are symmetric, with diagonal elements representing self-relatedness (1.0). The off-diagonal elements represent the relatedness between pairs of individuals, with values ranging from 0 (no relatedness) to 0.5 (full siblings) to 1 (themselves).
We convert the component matrices into a wide-form dataframe of kin
pairs using com2links()
. Self-pairs and duplicate entries
are removed.
df_links <- com2links(
writetodisk = FALSE,
ad_ped_matrix = add,
cn_ped_matrix = cn,
drop_upper_triangular = TRUE
) %>%
filter(ID1 != ID2)
df_links %>%
slice(1:5) %>%
knitr::kable(digits = 3)
ID1 | ID2 | addRel | cnuRel |
---|---|---|---|
1 | 2 | 0.50 | 1 |
3 | 4 | 0.50 | 1 |
1 | 6 | 0.50 | 0 |
2 | 6 | 0.25 | 0 |
3 | 6 | 0.50 | 0 |
As you can see, the df_links
data frame contains pairs
of individuals (ID1 and ID2) along with their additive genetic
relatedness (addRel
) and shared environment status
(cnuRel
). These data are in wide format, with each row
representing a unique pair of individuals.
Further, we can tally the number of pairs by relatedness and shared environment to understand the composition of the dataset.
addRel | cnuRel | n |
---|---|---|
0.0625 | 0 | 3 |
0.1250 | 0 | 47 |
0.2500 | 0 | 104 |
0.5000 | 0 | 50 |
0.5000 | 1 | 32 |
This table shows the number of kinship pairs for each combination of genetic relatedness and shared environment status. Although the discord regression model can be used with any kin group, it is most interpretable when there is a single kinship group.
For demonstration, we’ll focus on first cousins raised in separate homes from our pedigree data. This subset allows us to illustrate the process clearly.
set.seed(2024)
df_cousin <- df_links %>%
filter(addRel == .125) %>% # only cousins %>%
filter(cnuRel == 0) # only kin raised in separate homes
df_synthetic <- discord::kinsim(
mu_all = c(1, 1),
cov_a = .5,
cov_c = .1,
cov_e = .3,
c_vector = rep(df_cousin$cnuRel, 3),
r_vector = rep(df_cousin$addRel, 3)
) %>%
select(
pid = id,
r,
weight_s1 = y1_1,
weight_s2 = y1_2,
height_s1 = y2_1,
height_s2 = y2_2
) %>%
mutate(
age_s1 = round(rnorm(nrow(.), mean = 30, sd = 5), digits = 0),
age_s2 = age_s1 - sample(1:5, nrow(.), replace = TRUE)
)
df_synthetic %>%
slice(1:5) %>%
knitr::kable(digits = 2)
pid | r | weight_s1 | weight_s2 | height_s1 | height_s2 | age_s1 | age_s2 |
---|---|---|---|---|---|---|---|
1 | 0.12 | 2.09 | -1.30 | 4.54 | -0.66 | 31 | 30 |
2 | 0.12 | 1.21 | 0.40 | 1.20 | 0.26 | 32 | 28 |
3 | 0.12 | 1.03 | 2.53 | 2.76 | 0.58 | 29 | 27 |
4 | 0.12 | 1.46 | 4.92 | 0.93 | -0.67 | 30 | 29 |
5 | 0.12 | 2.69 | 2.78 | 0.95 | -0.54 | 23 | 18 |
Ordering and Derived Variables
Now that we have our data in wide format, we can proceed to order the pairs and create the derived variables needed for discordant-kinship regression. The key steps are:
-
Ordering the pairs: We need to ensure that within
each pair, the individual with the higher outcome value is consistently
labeled as
_1
and the other as_2
. This ordering is to ensure that we can extract meaningful difference scores. -
Creating derived variables: We will create
*_mean
and*_diff*
variables for both the outcome and predictor variables. The*_mean
variable represents the average of the two individuals’ values, while the*_diff
variable represents the difference between the two individuals’ values (i.e.,_1 - _2
).
These steps can be accomplished using the discord_data()
function from the {discord} package. This function takes care of
ordering the pairs and creating the derived variables, ensuring that the
data is ready for analysis. When using discord_data()
, you
will need to specify the outcome variable, predictor variables, and the
identifiers for the two members of each pair.
Below we default to the long path for reproducibility. Replace with your chosen source as needed.
# CHOOSE ONE based on your path
# source_wide <- df_wide
# source_wide <- df_long2wide
source_wide <- df_synthetic # if you followed the pedigree path
Now call discord_data()
specifying the outcome and
predictor present for both siblings. Here we use Y
as the
outcome and X
as the predictor, with your _S1
/ _S2
suffix convention.
df_discord_weight <- discord::discord_data(
data = source_wide,
outcome = "weight",
predictors = c("height", "age"),
demographics = "none",
pair_identifiers = c("_s1", "_s2"),
id = "pid" # or "famID" if you followed the pedigree path
)
df_discord_weight %>%
slice(1:5) %>%
knitr::kable(digits = 2, caption = "Transformed data ready for discordant-kinship regression")
id | weight_1 | weight_2 | weight_diff | weight_mean | height_1 | height_2 | height_diff | height_mean | age_1 | age_2 | age_diff | age_mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2.09 | -1.30 | 3.39 | 0.39 | 4.54 | -0.66 | 5.20 | 1.94 | 31 | 30 | 1 | 30.5 |
2 | 1.21 | 0.40 | 0.81 | 0.80 | 1.20 | 0.26 | 0.94 | 0.73 | 32 | 28 | 4 | 30.0 |
3 | 2.53 | 1.03 | 1.51 | 1.78 | 0.58 | 2.76 | -2.18 | 1.67 | 27 | 29 | -2 | 28.0 |
4 | 4.92 | 1.46 | 3.46 | 3.19 | -0.67 | 0.93 | -1.60 | 0.13 | 29 | 30 | -1 | 29.5 |
5 | 2.78 | 2.69 | 0.09 | 2.74 | -0.54 | 0.95 | -1.49 | 0.21 | 18 | 23 | -5 | 20.5 |
Understanding the Transformation
Let’s examine what discord_data()
did to our
variables:
df_discord_weight %>%
select(
id,
weight_1, weight_2, weight_mean, weight_diff,
height_1, height_2, height_mean, height_diff
) %>%
slice(1:5) %>%
knitr::kable(digits = 2)
id | weight_1 | weight_2 | weight_mean | weight_diff | height_1 | height_2 | height_mean | height_diff |
---|---|---|---|---|---|---|---|---|
1 | 2.09 | -1.30 | 0.39 | 3.39 | 4.54 | -0.66 | 1.94 | 5.20 |
2 | 1.21 | 0.40 | 0.80 | 0.81 | 1.20 | 0.26 | 0.73 | 0.94 |
3 | 2.53 | 1.03 | 1.78 | 1.51 | 0.58 | 2.76 | 1.67 | -2.18 |
4 | 4.92 | 1.46 | 3.19 | 3.46 | -0.67 | 0.93 | 0.13 | -1.60 |
5 | 2.78 | 2.69 | 2.74 | 0.09 | -0.54 | 0.95 | 0.21 | -1.49 |
Notice several important patterns in this output. First,
weight_1
is always greater than or equal to
weight_2
because discord_data()
ordered
siblings by the outcome variable. The sibling with higher weight becomes
_1
and the sibling with lower weight becomes
_2
. This ordering ensures that weight_diff
(calculated as weight_1 - weight_2
) is always
non-negative.
Second, the mean scores represent each pair’s average. For example,
weight_mean
equals the average of weight_1
and
weight_2
. These means capture between-family variation -
how pairs differ from each other in average weight.
Third, the predictor differences can be positive or negative. Even though weight differences are always positive (by construction), height differences vary. The taller sibling might be the heavier one, or the shorter sibling might be heavier. This variation is what we’ll test in our regression.
This structure lets us ask a key question: within sibling pairs, does the sibling with more of the predictor also have more of the outcome? If the taller sibling is systematically heavier even after controlling for family background, that suggests height may causally influence weight. If not, the association likely reflects familial confounding.
Analyzing the Data
With our data prepared, we can now perform both standard OLS regression and discordant-kinship regression to compare results.
Standard OLS Regression
First, let’s run a standard OLS regression using the original wide-format data. This will serve as a baseline for comparison with the family-based analyses.
For our baseline analysis, we select one sibling from each pair. In
the original dataset, s1 was defined as the older sibling, so we can
simply choose the sibling labeled _s1
. But, another way to
select the oldest sibling is to use the discord_data()
function with the outcome
argument set to the age variable.
This will ensure that _1
is always the oldest sibling
within the pair. In the case of ties, one sibling is randomly assigned
as _1
.
df_for_ols <- df_synthetic %>%
dplyr::select(
id = pid,
weight = weight_s1,
height = height_s1,
age = age_s1
)
df_discord_age <- discord::discord_data(
data = source_wide,
outcome = "age",
predictors = c("height", "weight"),
demographics = "none",
pair_identifiers = c("_s1", "_s2"),
id = "pid" # or "famID" if you followed the pedigree path
)
df_discord_age %>%
select(
id, age_1, age_2,
age_mean, age_diff, weight_1, height_1
) %>%
slice(1:5) %>%
knitr::kable(caption = "One sibling per pair for OLS regression, selecting by age", digits = 2)
id | age_1 | age_2 | age_mean | age_diff | weight_1 | height_1 |
---|---|---|---|---|---|---|
1 | 31 | 30 | 30.5 | 1 | 2.09 | 4.54 |
2 | 32 | 28 | 30.0 | 4 | 1.21 | 1.20 |
3 | 29 | 27 | 28.0 | 2 | 1.03 | 2.76 |
4 | 30 | 29 | 29.5 | 1 | 1.46 | 0.93 |
5 | 23 | 18 | 20.5 | 5 | 2.69 | 0.95 |
df_for_ols %>%
select(id, age, weight, height) %>%
slice(1:5) %>%
knitr::kable(caption = "One sibling per pair for OLS regression using original wide data", digits = 2)
id | age | weight | height |
---|---|---|---|
1 | 31 | 2.09 | 4.54 |
2 | 32 | 1.21 | 1.20 |
3 | 29 | 1.03 | 2.76 |
4 | 30 | 1.46 | 0.93 |
5 | 23 | 2.69 | 0.95 |
As you can see, df_for_ols
and
df_discord_age
both resulted in the same individuals being
selected for analysis. The df_discord_age
dataset was
created by ordering siblings based on age, ensuring that _1
is always the older sibling. The df_for_ols
dataset
directly selects the older sibling from the original wide data.
By selecting one sibling per pair, we’re analyzing the same number of observations as we have kinship pairs. This represents the standard individual-level approach used in most research:
$$ Y = \\beta_0 + \\beta_1 X1 + \\beta_2 X2 + \\epsilon $$
ols_model <- lm(weight ~ height + age, data = df_for_ols)
stargazer::stargazer(ols_model,
type = "html",
digits = 3, single.row = TRUE, title = "Standard OLS Regression Results"
)
Dependent variable: | |
weight | |
height | 0.287*** (0.078) |
age | -0.073** (0.029) |
Constant | 2.788*** (0.871) |
Observations | 141 |
R2 | 0.125 |
Adjusted R2 | 0.112 |
Residual Std. Error | 1.573 (df = 138) |
F Statistic | 9.857*** (df = 2; 138) |
Note: | p<0.1; p<0.05; p<0.01 |
This standard regression shows associations between our variables while controlling only for measured covariates.
Discordant Between-Family Regression
Next, we run a between-family regression using the discordant data. This model regresses the outcome mean on predictor means, capturing between-family variation:
$$ Y_{mean} = \\beta_0 + \\beta_1 X1_{mean} + \\beta_2 X2_{mean} + \\epsilon $$
between_model <- lm(
weight_mean ~ height_mean + age_mean,
data = df_discord_weight
)
stargazer::stargazer(between_model,
type = "html",
digits = 3, single.row = TRUE, title = "Between-Family Regression Results"
)
Dependent variable: | |
weight_mean | |
height_mean | 0.199** (0.077) |
age_mean | -0.044* (0.023) |
Constant | 2.064*** (0.666) |
Observations | 141 |
R2 | 0.067 |
Adjusted R2 | 0.054 |
Residual Std. Error | 1.284 (df = 138) |
F Statistic | 4.970*** (df = 2; 138) |
Note: | p<0.1; p<0.05; p<0.01 |
This between-family regression captures how differences in average height and age across families relate to average weight. However, it does not account for within-family differences, which is where discordant-kinship regression comes in.
It is important to note that the between-family regression results are not directly comparable to the standard OLS regression results. The between-family model uses mean scores, which represent family-level averages, while the OLS model uses individual-level data. Therefore, the coefficients from these two models may differ in magnitude and interpretation. But, they both provide useful information about the relationships among the variables.
Discordant-Kinship Regression
Now we return to our prepared data to run discordant-kinship
regression. We can specify the models manually or use the
discord_regression()
convenience function. Both approaches
produce identical results.
The discordant model regresses the outcome difference on outcome mean, predictor means, and predictor differences:
$$ Y_{diff} = \\beta_0 + \\beta_1 Y_{mean} + \\beta_2 X1_{mean} + \\beta_3 X1_{diff} + \\beta_4 X2_{mean} + \\beta_5 X2_{diff} + \\epsilon $$
discord_model_manual <- lm(
weight_diff ~ weight_mean + height_mean + height_diff + age_mean + age_diff,
data = df_discord_weight
)
tidy(discord_model_manual, conf.int = TRUE) %>%
knitr::kable(digits = 3, caption = "Discordant Regression (Manual)")
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 0.500 | 0.528 | 0.948 | 0.345 | -0.544 | 1.544 |
weight_mean | -0.043 | 0.065 | -0.660 | 0.511 | -0.171 | 0.086 |
height_mean | 0.041 | 0.060 | 0.680 | 0.498 | -0.078 | 0.159 |
height_diff | 0.141 | 0.043 | 3.254 | 0.001 | 0.055 | 0.226 |
age_mean | 0.031 | 0.018 | 1.742 | 0.084 | -0.004 | 0.066 |
age_diff | -0.026 | 0.024 | -1.092 | 0.277 | -0.074 | 0.021 |
stargazer::stargazer(between_model, discord_model_manual,
type = "html",
digits = 3, single.row = TRUE, title = "Between-Family and Discordant Regression Results"
)
Dependent variable: | ||
weight_mean | weight_diff | |
(1) | (2) | |
weight_mean | -0.043 (0.065) | |
height_mean | 0.199** (0.077) | 0.041 (0.060) |
height_diff | 0.141*** (0.043) | |
age_mean | -0.044* (0.023) | 0.031* (0.018) |
age_diff | -0.026 (0.024) | |
Constant | 2.064*** (0.666) | 0.500 (0.528) |
Observations | 141 | 141 |
R2 | 0.067 | 0.102 |
Adjusted R2 | 0.054 | 0.069 |
Residual Std. Error | 1.284 (df = 138) | 0.978 (df = 135) |
F Statistic | 4.970*** (df = 2; 138) | 3.060** (df = 5; 135) |
Note: | p<0.1; p<0.05; p<0.01 |
Note that we are using the df_discord_weight dataset created earlier
with discord_data()
. This dataset is ordered such that
within a pair, the sibling with the higher weight is always
_1
. This ordering ensures that weight_diff
is
always non-negative, which is a key feature of discordant-kinship
regression.
Using the Helper Function
Alternatively, we can use the discord_regression()
function, which simplifies the process:
By default, discord_regression()
orders siblings by the
outcome variable, creates the necessary derived variables, and fits the
discordant regression model in one step. This function also allows for
easy inclusion of demographic covariates if needed. See the function
documentation for more details on these options, as well as a vignette
focused on demographic controls.
discord_model <- discord_regression(
data = source_wide,
outcome = "weight",
predictors = c("height", "age"),
demographics = "none",
sex = NULL,
race = NULL,
pair_identifiers = c("_s1", "_s2"),
id = "pid"
)
tidy(discord_model, conf.int = TRUE) %>%
knitr::kable(digits = 3, caption = "Discordant Regression Results")
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 0.500 | 0.528 | 0.948 | 0.345 | -0.544 | 1.544 |
weight_mean | -0.043 | 0.065 | -0.660 | 0.511 | -0.171 | 0.086 |
height_diff | 0.141 | 0.043 | 3.254 | 0.001 | 0.055 | 0.226 |
age_diff | -0.026 | 0.024 | -1.092 | 0.277 | -0.074 | 0.021 |
height_mean | 0.041 | 0.060 | 0.680 | 0.498 | -0.078 | 0.159 |
age_mean | 0.031 | 0.018 | 1.742 | 0.084 | -0.004 | 0.066 |
glance(discord_model) %>%
select(r.squared, adj.r.squared, sigma, p.value, nobs) %>%
knitr::kable(digits = 3)
r.squared | adj.r.squared | sigma | p.value | nobs |
---|---|---|---|---|
0.102 | 0.069 | 0.978 | 0.012 | 141 |
stargazer::stargazer(discord_model,
discord_model_manual,
type = "html",
digits = 3, single.row = TRUE, title = "Discordant Regression Results Comparison"
)
% Error: Unrecognized object type.
Session Info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.5.1 (2025-06-13)
#> os Ubuntu 24.04.3 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language en
#> collate C.UTF-8
#> ctype C.UTF-8
#> tz UTC
#> date 2025-10-06
#> pandoc 3.1.11 @ /opt/hostedtoolcache/pandoc/3.1.11/x64/ (via rmarkdown)
#> quarto NA
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> backports 1.5.0 2024-05-23 [1] RSPM
#> BGmisc * 1.5.0 2025-07-19 [1] RSPM
#> broom * 1.0.10 2025-09-13 [1] RSPM
#> bslib 0.9.0 2025-01-30 [1] RSPM
#> cachem 1.1.0 2024-05-16 [1] RSPM
#> cli 3.6.5 2025-04-23 [1] RSPM
#> data.table 1.17.8 2025-07-10 [1] RSPM
#> desc 1.4.3 2023-12-10 [1] RSPM
#> digest 0.6.37 2024-08-19 [1] RSPM
#> discord * 1.3 2025-10-06 [1] local
#> dplyr * 1.1.4 2023-11-17 [1] RSPM
#> evaluate 1.0.5 2025-08-27 [1] RSPM
#> farver 2.1.2 2024-05-13 [1] RSPM
#> fastmap 1.2.0 2024-05-15 [1] RSPM
#> fs 1.6.6 2025-04-12 [1] RSPM
#> generics 0.1.4 2025-05-09 [1] RSPM
#> ggpedigree * 0.9.0 2025-09-12 [1] RSPM
#> ggplot2 * 4.0.0 2025-09-11 [1] RSPM
#> glue 1.8.0 2024-09-30 [1] RSPM
#> gtable 0.3.6 2024-10-25 [1] RSPM
#> htmltools 0.5.8.1 2024-04-04 [1] RSPM
#> htmlwidgets 1.6.4 2023-12-06 [1] RSPM
#> janitor * 2.2.1 2024-12-22 [1] RSPM
#> jquerylib 0.1.4 2021-04-26 [1] RSPM
#> jsonlite 2.0.0 2025-03-27 [1] RSPM
#> kinship2 1.9.6.2 2025-09-04 [1] RSPM
#> knitr 1.50 2025-03-16 [1] RSPM
#> labeling 0.4.3 2023-08-29 [1] RSPM
#> lattice 0.22-7 2025-04-02 [3] CRAN (R 4.5.1)
#> lifecycle 1.0.4 2023-11-07 [1] RSPM
#> lubridate 1.9.4 2024-12-08 [1] RSPM
#> magrittr * 2.0.4 2025-09-12 [1] RSPM
#> Matrix 1.7-3 2025-03-11 [3] CRAN (R 4.5.1)
#> NlsyLinks * 2.2.3 2025-08-31 [1] RSPM
#> pillar 1.11.1 2025-09-17 [1] RSPM
#> pkgconfig 2.0.3 2019-09-22 [1] RSPM
#> pkgdown 2.1.3 2025-05-25 [1] any (@2.1.3)
#> purrr 1.1.0 2025-07-10 [1] RSPM
#> quadprog 1.5-8 2019-11-20 [1] RSPM
#> R6 2.6.1 2025-02-15 [1] RSPM
#> ragg 1.5.0 2025-09-02 [1] RSPM
#> RColorBrewer 1.1-3 2022-04-03 [1] RSPM
#> rlang 1.1.6 2025-04-11 [1] RSPM
#> rmarkdown 2.30 2025-09-28 [1] RSPM
#> S7 0.2.0 2024-11-07 [1] RSPM
#> sass 0.4.10 2025-04-11 [1] RSPM
#> scales 1.4.0 2025-04-24 [1] RSPM
#> sessioninfo 1.2.3 2025-02-05 [1] any (@1.2.3)
#> snakecase 0.11.1 2023-08-27 [1] RSPM
#> stargazer 5.2.3 2022-03-04 [1] RSPM
#> stringi 1.8.7 2025-03-27 [1] RSPM
#> stringr 1.5.2 2025-09-08 [1] RSPM
#> systemfonts 1.3.1 2025-10-01 [1] RSPM
#> textshaping 1.0.3 2025-09-02 [1] RSPM
#> tibble 3.3.0 2025-06-08 [1] RSPM
#> tidyr * 1.3.1 2024-01-24 [1] RSPM
#> tidyselect 1.2.1 2024-03-11 [1] RSPM
#> timechange 0.3.0 2024-01-18 [1] RSPM
#> vctrs 0.6.5 2023-12-01 [1] RSPM
#> withr 3.0.2 2024-10-28 [1] RSPM
#> xfun 0.53 2025-08-19 [1] RSPM
#> yaml 2.3.10 2024-07-26 [1] RSPM
#>
#> [1] /home/runner/work/_temp/Library
#> [2] /opt/R/4.5.1/lib/R/site-library
#> [3] /opt/R/4.5.1/lib/R/library
#> * ── Packages attached to the search path.
#>
#> ──────────────────────────────────────────────────────────────────────────────