Introduction
I often find myself needing to parse the key=value pairs from the INFO string of a VCF or GFF/GTF file. There are some great dedicated tools for doing this (e.g. rtg-tools, bcftools), but sometimes it’s easier to parse this data in R, where I might already be doing other data wrangling. Luckily, this technique can be applied to a variety of key=value parsing tasks.
The INFO
column of VCF files, and the attribute
column of GFF/GTF files, contain semicolon delimited strings, with keys and values separated by an equals sign. Extracting these values in a clean (and safe) way can be tricky. I discovered that using a combination of lapply()
, stringr::str_split()
, and tidyr::unnest()
functions can help expand this data in a uniform way!
For reference, here are links to the VCF and GFF format specs:
First, let’s load some R libraries:
library(glue)
library(readr)
library(tidyr)
library(stringr)
Download an example VCF file
To demonstrate the method, let’s download some test data. There is a small VCF hosted on Ensembl that should work for our purposes.
vcf_url <- "http://ftp.ensembl.org/pub/release-106/variation/vcf/homo_sapiens/homo_sapiens_clinically_associated.vcf.gz"
vcf_filename <- stringr::str_remove(basename(vcf_url), "\\.gz$")
download.file(vcf_url, destfile = glue("{vcf_filename}.gz"), method = "wget")
# Uncompress the file
system(glue("gunzip -c {vcf_filename}.gz >{vcf_filename}"))
Read the VCF file into R as a tab-delimited table
Let’s make sure we get the column names and types correct by explicitly specifying them for a VCF. Many VCFs have a 9th, FORMAT
column, followed by one or more SAMPLE
columns. However, this example VCF is missing them. Make sure to check you are reading your VCF file correctly.
col_names <- c(
"CHROM",
"POS",
"ID",
"REF",
"ALT",
"QUAL",
"FILTER",
"INFO"
)
col_types <- cols(
CHROM = col_character(),
POS = col_double(),
ID = col_character(),
REF = col_character(),
ALT = col_character(),
QUAL = col_character(),
FILTER = col_character(),
INFO = col_character()
)
# Limit the number of rows being read to 500 (for demo purposes only)
vcf <- readr::read_tsv(vcf_filename,
quote = "",
comment = "#",
col_names = col_names,
col_types = col_types,
n_max = 500)
Create a simple function for parsing key=value strings
This function will be supplied to the lapply()
FUN
argument.
The purpose of this function is to take a list as input (x
) and split the strings based on a separator (e.g. =
by default). Make sure the data do not contain extra whitespace, and replace any empty strings ""
with NA
values. Finally, return the key, value data as a tibble.
# The key=value pairs in this column are linked with an "="
split_info <- function(x, sep = "=") {
clean <- str_trim(x)
clean_split <- str_split(clean, sep, simplify = TRUE)
clean_split[clean_split == ""] <- NA_character_
as_tibble(as.data.frame(clean_split)) %>%
dplyr::rename(keys = "V1", values = "V2")
}
# Create a list-column that contains a tibble data frame of all split INFO data
vcf_listcol <- vcf %>%
dplyr::mutate(INFO_list = lapply(X = str_split(INFO, ";"), FUN = split_info))
Expand the INFO_list
column into new columns, one for each key
Above, we created a new column in our VCF table: a “list column”. This is a special type of data structure that can store a list at every row location. Each row might have a different list length (each row only contains the keys and values found in the original INFO column/row). The tidyr::unnest()
function is very cool! It can take a list column as input, and expand all the data into regular columns (long format). Then these keys can be spread out as new columns with tidyr::pivot_wider()
. All keys are included, with NAs provided for missing values.
vcf_listcol_unnest <- vcf_listcol %>%
unnest(cols = INFO_list, keep_empty = TRUE) %>%
tidyr::pivot_wider(names_from = "keys", values_from = "values", names_prefix = "INFO.")
Check out the resulting table
vcf_listcol_unnest %>%
print(width = Inf, n = 6)
# A tibble: 500 Ă— 29
CHROM POS ID REF ALT QUAL FILTER
<chr> <dbl> <chr> <chr> <chr> <chr> <chr>
1 1 943995 rs761448939 C G,T . .
2 1 1014143 rs786201005 C T . .
3 1 1014317 rs672601345 G GG . .
4 1 1014359 rs672601312 G A,T . .
5 1 1020239 rs201073369 G A,C . .
6 1 1022225 rs756623659 G A . .
INFO
<chr>
1 dbSNP_154;TSA=SNV;E_Freq;E_Cited;E_Phenotype_or_Disease;E_ExAC;E_TOPMed;E_gno…
2 dbSNP_154;TSA=SNV;E_Cited;E_Phenotype_or_Disease;CLIN_pathogenic;AA=C
3 dbSNP_154;TSA=indel;E_Cited;E_Phenotype_or_Disease;CLIN_pathogenic;AA=G
4 dbSNP_154;TSA=SNV;E_Freq;E_Cited;E_Phenotype_or_Disease;E_ExAC;E_gnomAD;CLIN_…
5 dbSNP_154;TSA=SNV;E_Freq;E_1000G;E_Cited;E_Phenotype_or_Disease;E_ExAC;E_TOPM…
6 dbSNP_154;TSA=SNV;E_Freq;E_Cited;E_Phenotype_or_Disease;E_ExAC;E_gnomAD;CLIN_…
INFO.dbSNP_154 INFO.TSA INFO.E_Freq INFO.E_Cited INFO.E_Phenotype_or_Disease
<chr> <chr> <chr> <chr> <chr>
1 NA SNV NA NA NA
2 NA SNV NA NA NA
3 NA indel NA NA NA
4 NA SNV NA NA NA
5 NA SNV NA NA NA
6 NA SNV NA NA NA
INFO.E_ExAC INFO.E_TOPMed INFO.E_gnomAD INFO.CLIN_pathogenic INFO.AA
<chr> <chr> <chr> <chr> <chr>
1 NA NA NA NA C
2 NA NA NA NA C
3 NA NA NA NA G
4 NA NA NA NA G
5 NA NA NA NA G
6 NA NA NA NA NA
INFO.CLIN_uncertain_significance INFO.E_1000G INFO.CLIN_benign
<chr> <chr> <chr>
1 NA NA NA
2 NA NA NA
3 NA NA NA
4 NA NA NA
5 NA NA NA
6 NA NA NA
INFO.CLIN_likely_benign INFO.MA INFO.MAF INFO.MAC INFO.CLIN_likely_pathogenic
<chr> <chr> <chr> <chr> <chr>
1 NA NA NA NA NA
2 NA NA NA NA NA
3 NA NA NA NA NA
4 NA NA NA NA NA
5 NA C 0.008786 44 NA
6 NA NA NA NA NA
INFO.E_ESP INFO.CLIN_not_provided INFO.CLIN_risk_factor
<chr> <chr> <chr>
1 NA NA NA
2 NA NA NA
3 NA NA NA
4 NA NA NA
5 NA NA NA
6 NA NA NA
# â„ą 494 more rows
# â„ą Use `print(n = ...)` to see more rows