Miguel Liezun, Software Engineer - Apr 12, 2022

Fetching subsets of VCF files, a DIY approach

The vision of Gencove is that the universal application of genomics will result in a healthier and more sustainable civilization.

As we work towards that vision, we regularly process large amounts of genomic data, which is commonly accessed from remote locations such as object storage (e.g., AWS S3). For that purpose, we’ve written a tool called htsutil for subsetting a VCF file outside of the HTSlib ecosystem.

We are huge fans of HTSlib/bcftools and use them extensively, but in our hands they were somewhat unreliable when fetching data from remote network locations in high-volume production settings.

Htsutil takes a CSI index file and a region and returns the byte offsets that you need to read from a VCF file. Those offsets can be passed to downstream tools for efficient processing of VCF files.

This means we can now use downstream tools to:

  1. Robustly fetch information from our VCF files by streaming files from S3 using curl or AWS CLI.
  2. Efficiently fetch information from our VCF files by using the byte offsets that htsutil provides.

Also, we’ve released htsutil as an open source project that is available on GitLab.

How does htsutil work?

Imagine you have a VCF file stored in a remote location and you need to use it inside an EC2 instance to run some analysis.

Here is where htsutil comes in handy. Given a region and an index file (.csi), it returns the byte offsets that other tools (like the AWS CLI) can consume to retrieve the desired portion of the VCF file.
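As a rough illustration of how such offsets might be handed to a download tool, the sketch below turns a pair of byte offsets into an HTTP Range value and an `aws s3api get-object` invocation. The exact output format of htsutil is not described here, so the start (inclusive) and end (exclusive) parameters, the bucket and key names, and both helper functions are assumptions made for illustration.

```python
from typing import List


def byte_range_header(start: int, end: int) -> str:
    """Build an HTTP Range value for bytes start..end-1.

    Assumption: `start` is inclusive and `end` is exclusive. Check the
    htsutil readme for the convention its output really uses.
    """
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    # HTTP byte ranges are inclusive on both ends, hence the "- 1".
    return f"bytes={start}-{end - 1}"


def aws_cli_command(bucket: str, key: str, start: int, end: int,
                    out_path: str) -> List[str]:
    """Return an argument list for `aws s3api get-object` with a byte range."""
    return [
        "aws", "s3api", "get-object",
        "--bucket", bucket,
        "--key", key,
        "--range", byte_range_header(start, end),
        out_path,
    ]


if __name__ == "__main__":
    # Hypothetical bucket, key, and offsets, for illustration only.
    print(" ".join(aws_cli_command("my-bucket", "sample.vcf.gz",
                                   1024, 4096, "part.bin")))
```

The same range value can be sent by any HTTP client that supports the Range header, for example through curl's `--range` option (which takes the part after `bytes=`).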

Htsutil receives a region and an index file and returns the byte offsets of the portion of the VCF file that you need.
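For background, the chunk boundaries stored in a CSI index are BGZF virtual offsets, which pack the byte position of a compressed block together with a position inside that block once it is decompressed. The sketch below shows how a virtual offset splits into those two parts. It illustrates the file format only and is not necessarily how htsutil computes its output; the function name is hypothetical.

```python
from typing import Tuple


def split_virtual_offset(voffset: int) -> Tuple[int, int]:
    """Split a 64-bit BGZF virtual offset into its two components.

    Returns (compressed_offset, uncompressed_offset):
      - compressed_offset: byte position in the .vcf.gz file where a
        BGZF block starts (upper 48 bits)
      - uncompressed_offset: position inside that block after it has
        been decompressed (lower 16 bits)
    """
    if not 0 <= voffset < (1 << 64):
        raise ValueError("virtual offset must fit in 64 bits")
    return voffset >> 16, voffset & 0xFFFF


if __name__ == "__main__":
    # A made-up virtual offset: block at byte 1000, 42 bytes into the block.
    print(split_virtual_offset((1000 << 16) | 42))
```

Only the compressed offset matters when requesting bytes from remote storage; the within-block position becomes relevant after decompression.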

After downloading the VCF file portion to the EC2 instance, some post-processing is needed. We’ve built an example implementation that results in a valid VCF file that contains only the region you need.
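To give a sense of what such post-processing can involve, here is a minimal sketch and not the example implementation from the repository. It assumes the downloaded bytes consist of whole BGZF blocks (which are ordinary gzip members), that the VCF header text was obtained separately, and that regions are 1-based and inclusive. The function names are hypothetical, and records are matched on POS only, which ignores variants that start before the region and extend into it.

```python
import gzip
from typing import Iterable, Iterator


def records_in_region(lines: Iterable[str], chrom: str,
                      start: int, end: int) -> Iterator[str]:
    """Yield VCF record lines whose CHROM and POS fall inside the region."""
    for line in lines:
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t", 2)
        # A record cut off at a block boundary may lack a numeric POS.
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            yield line


def subset_vcf(header: str, compressed_blocks: bytes, chrom: str,
               start: int, end: int) -> str:
    """Decompress the downloaded blocks and keep only in-region records.

    `header` is assumed to be the complete VCF header ending in a newline.
    """
    # gzip.decompress handles a run of concatenated gzip members.
    body = gzip.decompress(compressed_blocks).decode("utf-8")
    kept = records_in_region(body.splitlines(), chrom, start, end)
    return header + "".join(line + "\n" for line in kept)
```

Filtering after decompression is what lets the download be slightly generous: any records outside the region that happen to share a block with the wanted ones are simply dropped.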

Use and contribute

First, let’s give a shoutout to the biogo project, which htsutil depends on.

Htsutil is released under an MIT License and can be found in the GitLab repository https://gitlab.com/gencove/htsutil, with binary releases for both macOS and Linux.

A guide on how to use htsutil can be found in the project README. Also remember to check the example implementation under the examples folder.

We encourage anyone who wants to report an error or contribute a fix to create an Issue or a Merge Request. As always, let us know what you think.