Validation tools for identifying and repairing errors in pedigrees
Introduction
The BGmisc R package offers a comprehensive suite of
functions tailored for extended behavior genetics analysis, including
model identification, calculating relatedness, pedigree conversion, and
pedigree simulation. This vignette provides an overview of the
validation tools available in the package, designed to identify and
repair errors in pedigrees.
In an ideal world, you would have perfect pedigrees with no errors.
However, in the real world, pedigrees are often incomplete, contain
errors, or are missing data. The BGmisc package provides
tools to identify these errors, which is particularly useful for large
pedigrees where manual inspection is not feasible. While the package can
repair some errors automatically, the vast majority require manual
inspection. Automatic repair is often not possible because the correct
solution may not be obvious, or may depend on additional information
that is not always available.
Identifying and Repairing Errors in Pedigrees
ID Validation
One common issue in pedigree data is the presence of duplicate IDs. There are two main types of ID duplication: within-row duplication and across-row duplication. Within-row duplication occurs when an individual’s parents’ IDs are incorrectly listed as their own ID. Across-row duplication occurs when two or more individuals share the same ID.
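To make the two kinds of duplication concrete, here is a minimal base R sketch on a hand-made toy pedigree. The data frame and its column names are assumptions chosen for illustration, and the code shows the general idea only; it is not how BGmisc implements its checks.

```r
# Toy pedigree (hypothetical data, not from the package)
ped <- data.frame(
  personID = c(1, 2, 2, 4),
  momID    = c(NA, NA, 1, 4),
  dadID    = c(NA, NA, NA, 2)
)

# Across-row duplication: the same personID appears on more than one row
dup_ids <- unique(ped$personID[duplicated(ped$personID)])

# Within-row duplication: a parent field repeats the row's own personID
# (missing parent IDs are never flagged)
own_mom <- !is.na(ped$momID) & ped$personID == ped$momID
own_dad <- !is.na(ped$dadID) & ped$personID == ped$dadID
own_parent_ids <- ped$personID[own_mom | own_dad]

dup_ids         # IDs shared by several rows
own_parent_ids  # individuals listed as their own parent
```

In this toy data, ID 2 is shared by two rows, and individual 4 is listed as their own mother.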
The checkIDs function in BGmisc helps identify both kinds
of duplicates. Here’s how to use it:
library(BGmisc)
# Create a sample dataset
df <- ped2fam(potter, famID = "newFamID", personID = "personID")
# Call the checkIDs function
result <- checkIDs(df, repair = FALSE)
print(result)
#> $all_unique_ids
#> [1] TRUE
#>
#> $total_non_unique_ids
#> [1] 0
#>
#> $total_own_father
#> [1] 0
#>
#> $total_own_mother
#> [1] 0
#>
#> $total_duplicated_parents
#> [1] 0
#>
#> $total_within_row_duplicates
#> [1] 0
#>
#> $within_row_duplicates
#> [1] FALSE
In this example, the checkIDs function returns a list with several
elements. The all_unique_ids element indicates whether all IDs in the
dataset are unique. The total_non_unique_ids element indicates the total
number of non-unique IDs. The total_own_father and total_own_mother
elements indicate the total number of individuals whose father’s and
mother’s IDs match their own ID, respectively. The
total_duplicated_parents element indicates the total number of
individuals with duplicated parent IDs. The total_within_row_duplicates
element indicates the total number of within-row duplicates. The
within_row_duplicates element indicates whether there are any within-row
duplicates in the dataset. As the output shows, there are no duplicates
in the sample dataset.
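Because the result is an ordinary list, it can be used to gate later analysis steps. The sketch below uses a hand-built stand-in list with the element names printed above, so that it runs without the package; ids_are_clean is a hypothetical helper, not part of BGmisc.

```r
# Hand-built stand-in for a checkIDs() result (element names copied from the output above)
result <- list(
  all_unique_ids = TRUE,
  total_non_unique_ids = 0,
  total_own_father = 0,
  total_own_mother = 0,
  total_duplicated_parents = 0,
  total_within_row_duplicates = 0,
  within_row_duplicates = FALSE
)

# Hypothetical helper: TRUE only when no ID problem was reported
ids_are_clean <- function(res) {
  isTRUE(res$all_unique_ids) && !isTRUE(res$within_row_duplicates)
}

# Stop early instead of analysing a pedigree with known ID problems
if (!ids_are_clean(result)) {
  stop("Pedigree has duplicate IDs; inspect before continuing.")
}
```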
Between-Person Duplicates
Let us now consider a scenario where there are between-person
duplicates in the dataset. The checkIDs function can identify these
duplicates and, if the repair argument is set to TRUE, attempt to repair
them. In the example below, we have created two between-person
duplicates. First, we have overwritten the personID of one person with
their sibling’s ID. Second, we have added a copy of Dudley Dursley to
the dataset.
# Create a sample dataset with duplicates
df <- ped2fam(potter, famID = "newFamID", personID = "personID")
# Sibling overwrite
df$personID[df$name == "Vernon Dursley"] <- df$personID[df$name == "Marjorie Dursley"]
# Add a copy of Dudley Dursley
df <- rbind(df, df[df$name == "Dudley Dursley",])
Now, let’s call the summarizeFamilies function to see
what the dataset looks like.
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
#> ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
#> ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
#> ✔ purrr 1.0.2
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
summarizeFamilies(df, famID = "newFamID", personID = "personID")$family_summary %>% glimpse()
#> Rows: 1
#> Columns: 17
#> $ newFamID <dbl> 1
#> $ count <int> 37
#> $ gen_mean <dbl> 1.756757
#> $ gen_median <dbl> 2
#> $ gen_min <dbl> 0
#> $ gen_max <dbl> 3
#> $ gen_sd <dbl> 1.038305
#> $ spouseID_mean <dbl> 38.2
#> $ spouseID_median <dbl> 15
#> $ spouseID_min <dbl> 1
#> $ spouseID_max <dbl> 106
#> $ spouseID_sd <dbl> 44.15118
#> $ sex_mean <dbl> 0.5135135
#> $ sex_median <dbl> 1
#> $ sex_min <dbl> 0
#> $ sex_max <dbl> 1
#> $ sex_sd <dbl> 0.5067117
If we didn’t know to look for duplicates, we might not notice the
issue. The only hint in the summary is the count of 37, one more than
the 36 people in the original data. Indeed, only one of the duplicates
was selected as a founder member. However, the checkIDs function can
help us identify and repair these errors:
# Call the checkIDs function
result <- checkIDs(df)
print(result)
#> $all_unique_ids
#> [1] FALSE
#>
#> $total_non_unique_ids
#> [1] 4
#>
#> $non_unique_ids
#> [1] 2 6
#>
#> $total_own_father
#> [1] 0
#>
#> $total_own_mother
#> [1] 0
#>
#> $total_duplicated_parents
#> [1] 0
#>
#> $total_within_row_duplicates
#> [1] 0
#>
#> $within_row_duplicates
#> [1] FALSE
As we can see from this output, 4 rows in the dataset carry a non-unique ID, and the IDs involved are 2 and 6. Let’s take a peek at the duplicates:
df %>%
  filter(personID %in% result$non_unique_ids) %>%
  arrange(personID)
#> personID newFamID famID name gen momID dadID spouseID sex
#> 1 2 1 1 Vernon Dursley 1 101 102 3 1
#> 2 2 1 1 Marjorie Dursley 1 101 102 NA 0
#> 6 6 1 1 Dudley Dursley 2 3 1 NA 1
#> 61 6 1 1 Dudley Dursley 2 3 1 NA 1
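Yep, these are definitely the duplicates. Next, we call checkIDs again with repair = TRUE and look at the same IDs in the repaired data. Note that the repaired data frame uses the column name ID rather than personID: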
df_repair <- checkIDs(df, repair = TRUE)
df_repair %>%
  filter(ID %in% result$non_unique_ids) %>%
  arrange(ID)
#> ID newFamID fam name gen momID dadID spID sex
#> 1 2 1 1 Vernon Dursley 1 101 102 3 1
#> 2 2 1 1 Marjorie Dursley 1 101 102 NA 0
#> 6 6 1 1 Dudley Dursley 2 3 1 NA 1
result <- checkIDs(df_repair)
print(result)
#> $all_unique_ids
#> [1] FALSE
#>
#> $total_non_unique_ids
#> [1] 2
#>
#> $non_unique_ids
#> [1] 2
#>
#> $total_own_father
#> [1] 0
#>
#> $total_own_mother
#> [1] 0
#>
#> $total_duplicated_parents
#> [1] 0
#>
#> $total_within_row_duplicates
#> [1] 0
#>
#> $within_row_duplicates
#> [1] FALSE
Great! The function was able to repair the full duplicate, without any manual intervention. That still leaves us with the sibling overwrite, but that’s a more complex issue that would require manual intervention. We’ll leave that for now.
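The difference between the two cases can be sketched in base R: a row that repeats another row in every column can be dropped safely, while two different people sharing an ID cannot be resolved without outside information. The toy data below is an assumption for illustration, and the code mirrors the outcome shown above rather than the internals of checkIDs.

```r
# Toy data with one ID collision (ID 2) and one exact copy (ID 6)
ped <- data.frame(
  personID = c(2, 2, 6, 6),
  name     = c("Vernon Dursley", "Marjorie Dursley",
               "Dudley Dursley", "Dudley Dursley"),
  momID    = c(101, 101, 3, 3),
  dadID    = c(102, 102, 1, 1)
)

# Rows that repeat an earlier row in every column are safe to drop
exact_copy <- duplicated(ped)
ped_dedup  <- ped[!exact_copy, ]

# IDs that still collide after dropping exact copies need a human decision
conflicting_ids <- unique(ped_dedup$personID[duplicated(ped_dedup$personID)])
conflicting_ids
```

Here only the second Dudley Dursley row is removed; ID 2 remains in conflict because the two rows describe different people.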
Handling Within-Row Duplicates
Sometimes, an individual’s parents’ IDs may be incorrectly listed as their own ID, leading to within-row duplicates. The checkIDs function can also identify these errors:
# Create a sample dataset with within-person duplicate parent IDs
df <- ped2fam(potter, famID = "newFamID", personID = "personID")
df$momID[df$name == "Vernon Dursley"] <- df$personID[df$name == "Vernon Dursley"]
# Check for within-row duplicates
result <- checkIDs(df, repair = FALSE)
print(result)
#> $all_unique_ids
#> [1] TRUE
#>
#> $total_non_unique_ids
#> [1] 0
#>
#> $total_own_father
#> [1] 0
#>
#> $total_own_mother
#> [1] 1
#>
#> $total_duplicated_parents
#> [1] 0
#>
#> $total_within_row_duplicates
#> [1] 1
#>
#> $within_row_duplicates
#> [1] TRUE
#>
#> $is_own_mother_ids
#> [1] 1
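In this example, we have created a within-row duplicate by setting
the momID of Vernon Dursley to his own ID. The checkIDs function
correctly identifies this error: total_own_mother and
total_within_row_duplicates are both 1, and is_own_mother_ids reports
the affected ID.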
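Once such a row has been found, one conservative manual fix is to blank the impossible parent link until the true parent ID can be recovered from the original records. The sketch below does this in base R on toy data; setting the value to NA is an assumption made here for illustration, not a repair that the package is documented to perform.

```r
# Toy data: individual 1 is listed as their own mother
ped <- data.frame(
  personID = c(1, 2, 3),
  momID    = c(1, 101, 103),
  dadID    = c(102, 102, 104)
)

# Flag parent links that point back at the individual
own_mom <- !is.na(ped$momID) & ped$momID == ped$personID
own_dad <- !is.na(ped$dadID) & ped$dadID == ped$personID

# Replace the impossible links with NA; all other values are left untouched
ped$momID[own_mom] <- NA
ped$dadID[own_dad] <- NA
ped
```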
Verifying Sex Coding
Another common issue in pedigree data is incorrect coding of
biological sex. In genetic studies, ensuring accurate recording of
biological sex in pedigree data is crucial for analyses that rely on
this information. The checkSex function in BGmisc helps identify and
repair errors related to biological sex coding, such as inconsistencies
where an individual’s sex is incorrectly recorded. An example of this
would be a parent who is biologically male, but listed as a mother. The
checkSex function can help identify and correct such errors.
It is essential to distinguish between biological sex (genotype) and
gender identity (phenotype). Biological sex is based on chromosomes and
other biological characteristics, while gender identity is a broader,
richer, personal, deeply-held sense of being male, female, a blend of
both, neither, or another gender entirely. While checkSex
focuses on biological sex necessary for genetic analysis, we respect and
recognize the full spectrum of gender identities beyond the binary. The
developers of this package affirm their support for folx in the LGBTQ+
community.
The checkSex function in BGmisc performs two main tasks: validating
the sex coding in a pedigree, which flags possible errors and
inconsistencies in variables related to biological sex, and optionally
repairing the sex coding based on specified logic. Here’s how you can
use the checkSex function to validate and optionally repair sex coding
in a pedigree dataset:
# Validate sex coding
results <- checkSex(potter, code_male = 1, code_female = 0, verbose = TRUE, repair = FALSE)
#> Step 1: Checking how many sexes/genders...
#> 2 unique values found.
#> 1 2 unique values found.
#> 0Checks Made:
#> $sex_unique
#> [1] 1 0
#>
#> $sex_length
#> [1] 2
#>
#> $all_sex_dad
#> [1] "1"
#>
#> $all_sex_mom
#> [1] "0"
#>
#> $most_frequent_sex_dad
#> [1] "1"
#>
#> $most_frequent_sex_mom
#> [1] "0"
print(results)
#> $sex_unique
#> [1] 1 0
#>
#> $sex_length
#> [1] 2
#>
#> $all_sex_dad
#> [1] "1"
#>
#> $all_sex_mom
#> [1] "0"
#>
#> $most_frequent_sex_dad
#> [1] "1"
#>
#> $most_frequent_sex_mom
#> [1] "0"
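In this example, the checkSex function checks the unique values in the
sex column and the sex coding of parents, and returns a list of
validation results. The sex_unique and sex_length elements give the
distinct sex codes and how many there are. The all_sex_dad and
all_sex_mom elements give the codes found among fathers and mothers, and
most_frequent_sex_dad and most_frequent_sex_mom give the most common
code in each group. Here every father is coded 1 and every mother is
coded 0, so no inconsistencies are reported.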
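The idea behind this kind of check can be sketched in base R: look up the recorded sex of everyone who appears as a father or mother, find the most common code in each role, and flag parents who deviate from it. The toy data and the most_common helper are assumptions for illustration; this is not the package’s implementation.

```r
# Toy pedigree: individual 4 is listed as a mother but coded 1
ped <- data.frame(
  ID    = 1:7,
  momID = c(NA, NA, NA, NA, 2, 3, 4),
  dadID = c(NA, NA, NA, NA, 1, 1, 1),
  sex   = c(1, 0, 0, 1, 1, 0, 0)
)

# Sex codes recorded for everyone who appears as a father or a mother
dad_sex <- ped$sex[ped$ID %in% ped$dadID]
mom_sex <- ped$sex[ped$ID %in% ped$momID]

# Hypothetical helper: most frequent code, returned as a character string
most_common <- function(x) names(which.max(table(x)))
most_common(dad_sex)
most_common(mom_sex)

# Mothers whose code differs from the majority code among mothers
is_mom <- ped$ID %in% ped$momID
odd_moms <- ped$ID[is_mom & ped$sex != as.numeric(most_common(mom_sex))]
odd_moms
```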
If incorrect sex codes are found, you can attempt to repair them automatically using the repair argument:
# Repair sex coding
df_fix <- checkSex(potter, code_male = 1, code_female = 0, verbose = TRUE, repair = TRUE)
#> Step 1: Checking how many sexes/genders...
#> 2 unique values found.
#> 1 2 unique values found.
#> 0Step 2: Attempting to repair sex coding...
#> Changes Made:
#> [[1]]
#> [1] "Recode sex based on most frequent sex in dads: 1. Total gender changes made: 36"
print(df_fix)
#> ID fam name gen momID dadID spID sex
#> 1 1 1 Vernon Dursley 1 101 102 3 M
#> 2 2 1 Marjorie Dursley 1 101 102 NA F
#> 3 3 1 Petunia Evans 1 103 104 1 F
#> 4 4 1 Lily Evans 1 103 104 5 F
#> 5 5 1 James Potter 1 NA NA 4 M
#> 6 6 1 Dudley Dursley 2 3 1 NA M
#> 7 7 1 Harry Potter 2 4 5 8 M
#> 8 8 1 Ginny Weasley 2 10 9 7 F
#> 9 9 1 Arthur Weasley 1 NA NA 10 M
#> 10 10 1 Molly Prewett 1 NA NA 9 F
#> 11 11 1 Ron Weasley 2 10 9 17 M
#> 12 12 1 Fred Weasley 2 10 9 NA M
#> 13 13 1 George Weasley 2 10 9 NA M
#> 14 14 1 Percy Weasley 2 10 9 20 M
#> 15 15 1 Charlie Weasley 2 10 9 NA M
#> 16 16 1 Bill Weasley 2 10 9 18 M
#> 17 17 1 Hermione Granger 2 NA NA 11 F
#> 18 18 1 Fleur Delacour 2 105 106 16 F
#> 19 19 1 Gabrielle Delacour 2 105 106 NA F
#> 20 20 1 Audrey UNKNOWN 2 NA NA 14 F
#> 21 21 1 James Potter II 3 8 7 NA M
#> 22 22 1 Albus Potter 3 8 7 NA M
#> 23 23 1 Lily Potter 3 8 7 NA F
#> 24 24 1 Rose Weasley 3 17 11 NA F
#> 25 25 1 Hugo Weasley 3 17 11 NA M
#> 26 26 1 Victoire Weasley 3 18 16 NA F
#> 27 27 1 Dominique Weasley 3 18 16 NA F
#> 28 28 1 Louis Weasley 3 18 16 NA M
#> 29 29 1 Molly Weasley 3 20 14 NA F
#> 30 30 1 Lucy Weasley 3 20 14 NA F
#> 31 101 1 Mother Dursley 0 NA NA 102 F
#> 32 102 1 Father Dursley 0 NA NA 101 M
#> 33 104 1 Father Evans 0 NA NA 103 M
#> 34 103 1 Mother Evans 0 NA NA 104 F
#> 35 106 1 Father Delacour 0 NA NA 105 M
#> 36 105 1 Mother Delacour 0 NA NA 106 F
When the repair argument is set to TRUE, the function attempts to repair the sex coding based on specified logic. It recodes the sex variable based on the most frequent sex values found among parents; in the output above, the sex column now holds M and F in place of 1 and 0. Consistent sex coding is essential for constructing valid genetic pedigrees.
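A recoding step of this kind can be sketched in base R as follows. The sketch assumes that the code most common among fathers is treated as male and the other supplied code as female, which is one plausible reading of the message printed above rather than a description of the package’s internals.

```r
# Numeric sex codes as they might appear before repair (toy values)
sex <- c(1, 0, 0, 1, 1)
code_male   <- 1  # assumed: the code most fathers carry
code_female <- 0

# Map the codes to "M"/"F" labels; any other value becomes NA
# rather than being guessed
sex_label <- ifelse(sex == code_male, "M",
                    ifelse(sex == code_female, "F", NA_character_))
sex_label
```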