Ingests a GEDCOM genealogy file, identifies individual records, and parses person-level identifiers, names, life events, attributes, and family relationships into a structured data frame. Optional post-processing can infer parental IDs from family relationships, reconcile redundant name fields, and remove uninformative columns from the parsed output.
Usage
readGedcom(
file_path,
verbose = FALSE,
add_parents = TRUE,
remove_empty_cols = TRUE,
combine_cols = TRUE,
skinny = FALSE,
parse_dates = FALSE,
update_rate = 1000,
post_process = TRUE,
...
)
readGed(
file_path,
verbose = FALSE,
add_parents = TRUE,
remove_empty_cols = TRUE,
combine_cols = TRUE,
skinny = FALSE,
parse_dates = FALSE,
update_rate = 1000,
post_process = TRUE,
...
)
readgedcom(
file_path,
verbose = FALSE,
add_parents = TRUE,
remove_empty_cols = TRUE,
combine_cols = TRUE,
skinny = FALSE,
parse_dates = FALSE,
update_rate = 1000,
post_process = TRUE,
...
)
Arguments
- file_path
Character string. Path to the GEDCOM file.
- verbose
Logical. If `TRUE`, print progress messages.
- add_parents
Logical. If `TRUE`, infer `momID` and `dadID` from `FAMC` and `FAMS` mappings during post-processing.
- remove_empty_cols
Logical. If `TRUE`, drop columns that are entirely `NA` during post-processing.
- combine_cols
Logical. If `TRUE`, combine redundant name columns, such as `name_given` with `name_given_pieces` and `name_surn` with `name_surn_pieces`, when their values do not conflict.
- skinny
Logical. If `TRUE`, return a slimmer data frame by dropping `FAMC`, `FAMS`, and columns that are entirely `NA` during post-processing.
- parse_dates
Logical. If `TRUE`, attempt to parse date columns (e.g., `birth_date`, `death_date`) into Date objects, after removing common GEDCOM date qualifiers like "ABT", "BEF", and "AFT".
- update_rate
Numeric. Intended rate at which progress messages should be printed. Currently unused.
- post_process
Logical. If `TRUE`, apply post-processing steps controlled by `add_parents`, `combine_cols`, `remove_empty_cols`, `skinny`, and `parse_dates`.
- ...
Additional arguments. Currently unused.
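The date handling described under `parse_dates` can be pictured with a small base R sketch. It is an illustration only, not the package's internal code: the helper names `strip_gedcom_qualifiers()` and `parse_gedcom_date()` are hypothetical, and the sketch assumes that only complete "day month year" values with English month abbreviations should become dates.

```r
# Hypothetical helpers for illustration; not part of the package.
strip_gedcom_qualifiers <- function(x) {
  # Drop a leading ABT/BEF/AFT qualifier, then trim whitespace.
  trimws(sub("^\\s*(ABT|BEF|AFT)\\.?\\s+", "", x, ignore.case = TRUE))
}

parse_gedcom_date <- function(x) {
  cleaned <- toupper(strip_gedcom_qualifiers(x))
  pattern <- "^(\\d{1,2}) ([A-Z]{3}) (\\d{4})$"
  months <- c("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
              "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")
  # Assumption: anything that is not a full day-month-year value stays NA.
  ok <- !is.na(cleaned) & grepl(pattern, cleaned)
  out <- rep(as.Date(NA), length(cleaned))
  day <- as.integer(sub(pattern, "\\1", cleaned[ok]))
  mon <- match(sub(pattern, "\\2", cleaned[ok]), months)
  year <- as.integer(sub(pattern, "\\3", cleaned[ok]))
  out[ok] <- as.Date(sprintf("%04d-%02d-%02d", year, mon, day),
                     format = "%Y-%m-%d")
  out
}

raw <- c("ABT 12 MAR 1901", "BEF 1 JAN 1850", "3 JUL 1776", "1900", NA)
parsed <- parse_gedcom_date(raw)
print(parsed)
```

Mapping month abbreviations by hand avoids depending on the session locale, which `%b` in `as.Date()` would do.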
Value
A data frame containing information about individuals, with the following potential columns:
- personID
Individual ID parsed from the `@ INDI` line.
- momID
ID of the individual's mother, if inferred.
- dadID
ID of the individual's father, if inferred.
- sex
Sex of the individual.
- name
Cleaned full name of the individual.
- name_given
Given name parsed from the `NAME` tag.
- name_given_pieces
Given name parsed from a separate `GIVN` tag, if present.
- name_surn
Surname parsed from the `NAME` tag.
- name_surn_pieces
Surname parsed from a separate `SURN` tag, if present.
- name_marriedsurn
Married surname parsed from `_MARNM`, if present.
- name_nick
Nickname parsed from `NICK`, if present.
- name_npfx
Name prefix parsed from `NPFX`, if present.
- name_nsfx
Name suffix parsed from `NSFX`, if present.
- birth_date
Birth date of the individual.
- birth_lat
Latitude of the birthplace.
- birth_long
Longitude of the birthplace.
- birth_place
Birthplace of the individual.
- death_caus
Cause of death.
- death_date
Death date of the individual.
- death_lat
Latitude of the place of death.
- death_long
Longitude of the place of death.
- death_place
Place of death of the individual.
- attribute_caste
Caste of the individual.
- attribute_children
Number of children of the individual.
- attribute_description
Description of the individual.
- attribute_education
Education of the individual.
- attribute_idnumber
Identification number of the individual.
- attribute_marriages
Number of marriages of the individual.
- attribute_nationality
Nationality of the individual.
- attribute_occupation
Occupation of the individual.
- attribute_property
Property owned by the individual.
- attribute_religion
Religion of the individual.
- attribute_residence
Residence of the individual.
- attribute_ssn
Social Security number of the individual.
- attribute_title
Title of the individual.
- FAMC
ID or IDs of the family or families in which the individual is a child.
- FAMS
ID or IDs of families in which the individual is a spouse.
If no individual records are found, the function returns `NULL` with a warning.
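A minimal usage sketch follows. It writes a tiny, made-up GEDCOM file to a temporary path and only calls `readGedcom()` if the function is available in the session; the file contents and the `ged_path` and `ped` names are illustrative assumptions. Because the function may return `NULL`, the result is checked before use.

```r
# A tiny, made-up GEDCOM file written to a temporary path.
ged_lines <- c(
  "0 HEAD",
  "0 @I1@ INDI",
  "1 NAME John /Smith/",
  "1 SEX M",
  "1 BIRT",
  "2 DATE 12 MAR 1901",
  "2 PLAC Springfield",
  "1 FAMS @F1@",
  "0 @F1@ FAM",
  "1 HUSB @I1@",
  "0 TRLR"
)
ged_path <- tempfile(fileext = ".ged")
writeLines(ged_lines, ged_path)

# Assumes the package that provides readGedcom() is already attached.
if (exists("readGedcom", mode = "function")) {
  ped <- readGedcom(ged_path, verbose = FALSE)
  if (is.null(ped)) {
    message("No individual records were found.")
  } else {
    # Which of the potential columns are present depends on the file
    # and on the post-processing arguments.
    print(names(ped))
  }
}
```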
Details
`readGedcom()` is a line-oriented parser tuned to common GEDCOM 5.5 and 5.5.1 structures. Individual records are identified from blocks that begin with an `@ INDI` line. Each individual block is passed to an internal parser that uses simple GEDCOM tag pattern matches to extract identifiers, names, life events, attributes, and family relationships.
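To make the idea of individual blocks concrete, here is a base R sketch that groups lines into level-0 records and keeps those whose first line is an `INDI` record. It is not the package's implementation: `split_indi_blocks()` is a hypothetical helper, and the regular expressions are assumptions about typical `0 @ID@ INDI` lines.

```r
# Hypothetical helper for illustration; not the package's internal code.
split_indi_blocks <- function(lines) {
  # Every level-0 line starts a new record.
  record_id <- cumsum(grepl("^0 ", lines))
  records <- split(lines, record_id)
  # Keep only records whose first line declares an individual.
  is_indi <- vapply(
    records,
    function(r) grepl("^0 @[^@]+@ INDI", r[1]),
    logical(1)
  )
  unname(records[is_indi])
}

lines <- c(
  "0 HEAD",
  "1 SOUR demo",
  "0 @I1@ INDI",
  "1 NAME John /Smith/",
  "0 @I2@ INDI",
  "1 NAME Mary /Jones/",
  "0 @F1@ FAM",
  "1 HUSB @I1@",
  "0 TRLR"
)
blocks <- split_indi_blocks(lines)

# Pull the identifier out of each block's first line.
ids <- vapply(
  blocks,
  function(b) sub("^0 @([^@]+)@ INDI.*$", "\\1", b[1]),
  character(1)
)
print(ids)
```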
Name information is parsed primarily from the GEDCOM `NAME` tag, which often encodes given names and surnames using slash-delimited surname notation, such as `NAME John /Smith/`. The parser extracts the given name, surname, and a cleaned full name. Additional name components are parsed when present, including name prefix, name suffix, nickname, and married surname.
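The slash-delimited notation can be split with a few regular expressions. The sketch below is one possible approach under simple assumptions (a single surname between one pair of slashes); `parse_name_line()` is a hypothetical helper and the package's own cleaning rules may differ.

```r
# Hypothetical helper for illustration; assumes at most one /surname/ pair.
parse_name_line <- function(line) {
  # Strip the level number and the NAME tag.
  value <- sub("^\\d+ NAME\\s*", "", line)
  # Surname is the text between the first pair of slashes, if any.
  surname <- if (grepl("/[^/]*/", value)) {
    sub("^[^/]*/([^/]*)/.*$", "\\1", value)
  } else {
    NA_character_
  }
  # Given name is everything before the first slash.
  given <- trimws(sub("/.*$", "", value))
  if (!nzchar(given)) given <- NA_character_
  # Cleaned full name: slashes removed, whitespace collapsed.
  full <- trimws(gsub("\\s+", " ", gsub("/", " ", value)))
  list(name = full, name_given = given, name_surn = surname)
}

res <- parse_name_line("1 NAME John /Smith/")
str(res)
```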
Birth and death events are recognized from `BIRT` and `DEAT` tags. Event details are currently parsed using fixed offsets within the individual block. For birth events, the parser expects `DATE` at `i + 1`, `PLAC` at `i + 2`, `LATI` at `i + 4`, and `LONG` at `i + 5`. For death events, the parser expects `DATE` at `i + 1`, `PLAC` at `i + 2`, `CAUS` at `i + 3`, `LATI` at `i + 4`, and `LONG` at `i + 5`. Missing elements leave the corresponding output fields as `NA`.
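The practical consequence of fixed offsets is easiest to see in code. The following sketch mimics the birth offsets listed above in base R; `parse_birth_fixed()` is a hypothetical helper written for this illustration, and the check that each offset line carries the expected tag is an assumption rather than documented behavior.

```r
# Hypothetical helper mirroring the documented birth offsets.
parse_birth_fixed <- function(block) {
  out <- c(birth_date = NA_character_, birth_place = NA_character_,
           birth_lat = NA_character_, birth_long = NA_character_)
  i <- grep("^1 BIRT", block)[1]
  if (is.na(i)) return(out)
  offsets <- c(birth_date = 1, birth_place = 2, birth_lat = 4, birth_long = 5)
  tags <- c(birth_date = "DATE", birth_place = "PLAC",
            birth_lat = "LATI", birth_long = "LONG")
  for (field in names(offsets)) {
    j <- i + offsets[[field]]
    if (j > length(block)) next
    pattern <- paste0("^\\d+ ", tags[[field]], "\\s+")
    # Assumption: only accept the line if it carries the expected tag.
    if (grepl(pattern, block[j])) out[[field]] <- sub(pattern, "", block[j])
  }
  out
}

# Layout that matches the offsets: DATE, PLAC, MAP, LATI, LONG.
typical <- c("0 @I1@ INDI", "1 BIRT", "2 DATE 12 MAR 1901",
             "2 PLAC Springfield", "3 MAP", "4 LATI N39.8", "4 LONG W89.6")
# Same information, but PLAC precedes DATE.
reordered <- c("0 @I1@ INDI", "1 BIRT", "2 PLAC Springfield",
               "2 DATE 12 MAR 1901")

res_typical <- parse_birth_fixed(typical)
res_reordered <- parse_birth_fixed(reordered)
print(res_typical)
print(res_reordered)
```

In this sketch the reordered block yields `NA` for both date and place, which shows why files that order event details differently may need manual checking.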
Attribute tags such as `OCCU`, `EDUC`, `RELI`, `CAST`, `NCHI`, `NMR`, `NATI`, `RESI`, `PROP`, `SSN`, `TITL`, `DSCR`, and `IDNO` are parsed directly into dedicated columns prefixed with `attribute_`.
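One way to picture this tag-to-column mapping is a named lookup vector. The sketch below is illustrative only: `attribute_map` and `parse_attributes()` are hypothetical names, the pairing of tags to columns is inferred from the column list in the Value section, and the sketch assumes attributes appear as level-1 lines and keeps only the first occurrence of each tag.

```r
# Hypothetical lookup inferred from the documented column names.
attribute_map <- c(
  CAST = "attribute_caste",       NCHI = "attribute_children",
  DSCR = "attribute_description", EDUC = "attribute_education",
  IDNO = "attribute_idnumber",    NMR  = "attribute_marriages",
  NATI = "attribute_nationality", OCCU = "attribute_occupation",
  PROP = "attribute_property",    RELI = "attribute_religion",
  RESI = "attribute_residence",   SSN  = "attribute_ssn",
  TITL = "attribute_title"
)

parse_attributes <- function(block) {
  out <- stats::setNames(rep(NA_character_, length(attribute_map)),
                         attribute_map)
  for (tag in names(attribute_map)) {
    pattern <- paste0("^1 ", tag, " +")
    hit <- grep(pattern, block, value = TRUE)
    # Assumption: keep the first matching line for each tag.
    if (length(hit) > 0) out[[attribute_map[[tag]]]] <- sub(pattern, "", hit[1])
  }
  out
}

block <- c("0 @I1@ INDI", "1 OCCU Farmer", "1 NCHI 4", "1 RELI Lutheran")
attrs <- parse_attributes(block)
print(attrs[!is.na(attrs)])
```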
Family relationships are parsed from `FAMC` and `FAMS` tags. `FAMC` identifies the family in which an individual is a child, and `FAMS` identifies families in which an individual is a spouse. These raw family identifiers are retained in the parsed output unless removed during post-processing. When `add_parents = TRUE`, they are also used to infer `momID` and `dadID`.
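The inference of parent IDs can be sketched as a lookup from each child's `FAMC` value to the spouses recorded for that family. This is a simplified illustration, not the package's algorithm: `infer_parents()` is a hypothetical helper, it assumes one family ID per cell, and it assumes that the spouse with `sex == "F"` is the mother and the spouse with `sex == "M"` is the father.

```r
# Hypothetical helper; assumes a single family ID per FAMC/FAMS cell.
infer_parents <- function(ped) {
  spouses <- ped[!is.na(ped$FAMS), c("personID", "sex", "FAMS")]
  # Assumption: sex decides which spouse is recorded as mother or father.
  dads <- spouses[spouses$sex %in% "M", ]
  moms <- spouses[spouses$sex %in% "F", ]
  # A child's FAMC points at the family where its parents are spouses.
  ped$dadID <- dads$personID[match(ped$FAMC, dads$FAMS)]
  ped$momID <- moms$personID[match(ped$FAMC, moms$FAMS)]
  ped
}

ped <- data.frame(
  personID = c("I1", "I2", "I3"),
  sex = c("M", "F", "F"),
  FAMC = c(NA, NA, "F1"),
  FAMS = c("F1", "F1", NA),
  stringsAsFactors = FALSE
)
ped_parents <- infer_parents(ped)
print(ped_parents[, c("personID", "momID", "dadID")])
```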
If `post_process = TRUE`, `readGedcom()` applies optional cleanup steps controlled by `add_parents`, `combine_cols`, `remove_empty_cols`, `skinny`, and `parse_dates`. These steps can infer parent IDs, collapse redundant name fields, remove columns that are entirely missing, drop raw family relationship columns for a slimmer output, and parse date columns into Date objects.
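Two of these cleanup steps, merging non-conflicting name columns and dropping all-`NA` columns, can be sketched in base R as follows. The helpers `combine_if_compatible()` and `drop_empty_cols()` are hypothetical and only approximate the documented behavior; in particular, leaving both columns untouched when any value conflicts is an assumption.

```r
# Hypothetical helpers for illustration; not the package's internal code.
combine_if_compatible <- function(a, b) {
  # A conflict is a row where both values are present and differ.
  conflict <- !is.na(a) & !is.na(b) & a != b
  if (any(conflict)) return(NULL)  # assumption: leave both columns alone
  ifelse(is.na(a), b, a)
}

drop_empty_cols <- function(df) {
  # Keep columns that have at least one non-missing value.
  df[, colSums(!is.na(df)) > 0, drop = FALSE]
}

ped <- data.frame(
  name_given = c("John", NA),
  name_given_pieces = c("John", "Mary"),
  attribute_caste = c(NA_character_, NA_character_),
  stringsAsFactors = FALSE
)

merged <- combine_if_compatible(ped$name_given, ped$name_given_pieces)
if (!is.null(merged)) {
  ped$name_given <- merged
  ped$name_given_pieces <- NULL
}
ped <- drop_empty_cols(ped)
print(ped)
```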