Skip to content

Parsing the VCF INFO string for key=value pairs with R tidyverse tools

Posted on:December 16, 2022 at 09:57 PM

Introduction

I often find myself needing to parse the key=value pairs from the INFO string of a VCF or GFF/GTF file. There are some great dedicated tools for doing this (e.g. rtg-tools, bcftools), but sometimes it’s easier to parse this data in R, where I might already be doing other data wrangling. Luckily, this technique can be applied to a variety of key=value parsing tasks.

The INFO column of VCF files, and the attribute column of GFF/GTF files, contain semicolon delimited strings, with keys and values separated by an equals sign. Extracting these values in a clean (and safe) way can be tricky. I discovered that using a combination of lapply(), stringr::str_split(), and tidyr::unnest() functions can help expand this data in a uniform way!

For reference, here are links to the VCF and GFF format specs:

First, let’s load some R libraries:

library(glue)
library(readr)
library(tidyr)
library(stringr)

Download an example VCF file

To demonstrate the method, let’s download some test data. There is a small VCF hosted on Ensembl that should work for our purposes.

vcf_url <- "http://ftp.ensembl.org/pub/release-106/variation/vcf/homo_sapiens/homo_sapiens_clinically_associated.vcf.gz"
vcf_filename <- stringr::str_remove(basename(vcf_url), "\\.gz$")

download.file(vcf_url, destfile = glue("{vcf_filename}.gz"), method = "wget")

# Uncompress the file
system(glue("gunzip -c {vcf_filename}.gz >{vcf_filename}"))

Read the VCF file into R as a tab-delimited table

Let’s make sure we get the column names and types correct by explicitly specifying them for a VCF. Many VCFs have a 9th, FORMAT column, followed by one or more SAMPLE columns. However, this example VCF is missing them. Make sure to check you are reading your VCF file correctly.

col_names <- c(
    "CHROM",
    "POS",
    "ID",
    "REF",
    "ALT",
    "QUAL",
    "FILTER",
    "INFO"
)

col_types <- cols(
    CHROM = col_character(),
    POS = col_double(),
    ID = col_character(),
    REF = col_character(),
    ALT = col_character(),
    QUAL = col_character(),
    FILTER = col_character(),
    INFO = col_character()
)

# Limit the number of rows being read to 500 (for demo purposes only)
vcf <- readr::read_tsv(vcf_filename,
    quote = "",
    comment = "#",
    col_names = col_names,
    col_types = col_types,
    n_max = 500)

Create a simple function for parsing key=value strings

This function will be supplied to the lapply() FUN argument.

The purpose of this function is to take a list as input (x) and split the strings based on a separator (e.g. = by default). Make sure the data do not contain extra whitespace, and replace any empty strings "" with NA values. Finally, return the key, value data as a tibble.

# The key=value pairs in this column are linked with an "="
split_info <- function(x, sep = "=") {
    clean <- str_trim(x)
    clean_split <- str_split(clean, sep, simplify = TRUE)
    clean_split[clean_split == ""] <- NA_character_
    as_tibble(as.data.frame(clean_split)) %>%
        dplyr::rename(keys = "V1", values = "V2")
}


# Create a list-column that contains a tibble data frame of all split INFO data
vcf_listcol <- vcf %>%
    dplyr::mutate(INFO_list = lapply(X = str_split(INFO, ";"), FUN = split_info))

Expand the INFO_list column into new columns, one for each key

Above, we created a new column in our VCF table: a “list column”. This is a special type of data structure that can store a list at every row location. Each row might have a different list length (each row only contains the keys and values found in the original INFO column/row). The tidyr::unnest() function is very cool! It can take a list column as input, and expand all the data into regular columns (long format). Then these keys can be spread out as new columns with tidyr::pivot_wider(). All keys are included, with NAs provided for missing values.

vcf_listcol_unnest <- vcf_listcol %>%
    unnest(cols = INFO_list, keep_empty = TRUE) %>%
    tidyr::pivot_wider(names_from = "keys", values_from = "values", names_prefix = "INFO.")

Check out the resulting table

vcf_listcol_unnest %>%
    print(width = Inf, n = 6)
# A tibble: 500 Ă— 29
  CHROM     POS ID          REF   ALT   QUAL  FILTER
  <chr>   <dbl> <chr>       <chr> <chr> <chr> <chr>
1 1      943995 rs761448939 C     G,T   .     .
2 1     1014143 rs786201005 C     T     .     .
3 1     1014317 rs672601345 G     GG    .     .
4 1     1014359 rs672601312 G     A,T   .     .
5 1     1020239 rs201073369 G     A,C   .     .
6 1     1022225 rs756623659 G     A     .     .
  INFO
  <chr>
1 dbSNP_154;TSA=SNV;E_Freq;E_Cited;E_Phenotype_or_Disease;E_ExAC;E_TOPMed;E_gno…
2 dbSNP_154;TSA=SNV;E_Cited;E_Phenotype_or_Disease;CLIN_pathogenic;AA=C
3 dbSNP_154;TSA=indel;E_Cited;E_Phenotype_or_Disease;CLIN_pathogenic;AA=G
4 dbSNP_154;TSA=SNV;E_Freq;E_Cited;E_Phenotype_or_Disease;E_ExAC;E_gnomAD;CLIN_…
5 dbSNP_154;TSA=SNV;E_Freq;E_1000G;E_Cited;E_Phenotype_or_Disease;E_ExAC;E_TOPM…
6 dbSNP_154;TSA=SNV;E_Freq;E_Cited;E_Phenotype_or_Disease;E_ExAC;E_gnomAD;CLIN_…
  INFO.dbSNP_154 INFO.TSA INFO.E_Freq INFO.E_Cited INFO.E_Phenotype_or_Disease
  <chr>          <chr>    <chr>       <chr>        <chr>
1 NA             SNV      NA          NA           NA
2 NA             SNV      NA          NA           NA
3 NA             indel    NA          NA           NA
4 NA             SNV      NA          NA           NA
5 NA             SNV      NA          NA           NA
6 NA             SNV      NA          NA           NA
  INFO.E_ExAC INFO.E_TOPMed INFO.E_gnomAD INFO.CLIN_pathogenic INFO.AA
  <chr>       <chr>         <chr>         <chr>                <chr>
1 NA          NA            NA            NA                   C
2 NA          NA            NA            NA                   C
3 NA          NA            NA            NA                   G
4 NA          NA            NA            NA                   G
5 NA          NA            NA            NA                   G
6 NA          NA            NA            NA                   NA
  INFO.CLIN_uncertain_significance INFO.E_1000G INFO.CLIN_benign
  <chr>                            <chr>        <chr>
1 NA                               NA           NA
2 NA                               NA           NA
3 NA                               NA           NA
4 NA                               NA           NA
5 NA                               NA           NA
6 NA                               NA           NA
  INFO.CLIN_likely_benign INFO.MA INFO.MAF INFO.MAC INFO.CLIN_likely_pathogenic
  <chr>                   <chr>   <chr>    <chr>    <chr>
1 NA                      NA      NA       NA       NA
2 NA                      NA      NA       NA       NA
3 NA                      NA      NA       NA       NA
4 NA                      NA      NA       NA       NA
5 NA                      C       0.008786 44       NA
6 NA                      NA      NA       NA       NA
  INFO.E_ESP INFO.CLIN_not_provided INFO.CLIN_risk_factor
  <chr>      <chr>                  <chr>
1 NA         NA                     NA
2 NA         NA                     NA
3 NA         NA                     NA
4 NA         NA                     NA
5 NA         NA                     NA
6 NA         NA                     NA
# â„ą 494 more rows
# â„ą Use `print(n = ...)` to see more rows