Forest Dussault, Senior Software Engineer - May 09, 2024

Flexible genomic data export with Gencove Explorer

Gencove Explorer supports several data export methods through the Explorer SDK, so data can be moved between environments and storage services. In this post, we’ll walk through exporting data from Gencove's systems: generating pre-signed URLs for sample deliverables, and exporting deliverables to AWS S3, Microsoft Azure, and Google Cloud Platform (GCP) storage.

Exporting data first requires retrieving samples of interest along with their respective deliverables. Below is an example of how you might retrieve all of the Sample objects for a particular Gencove project, identified by a UUID.

from gencove_explorer import get_samples

# Gencove Project ID
project_uuid = "4b8fcd7b-5670-49c5-9adb-f1636e35f295"

# Retrieve Sample objects with imputation deliverables
samples = get_samples(
    input_data=project_uuid,
    file_types=["impute-vcf", "impute-csi"],
)

# Grab a single sample for demonstration
example_sample = samples[0]

# Inspect the sample
print(example_sample)

# Abbreviated output
Sample(
    id="745cf34b-e0b8-4819-b3b2-fa3fc7147265",
    client_id="EX1110000",
    files={
        "impute-csi": GencoveFile(...),
        "impute-vcf": GencoveFile(...),
    },
    last_status=SampleStatus(...),
)

More details on retrieving samples can be found in the Querying section of the Explorer docs.

Generating pre-signed URLs from sample deliverables

Using our example_sample retrieved above, we can access various GencoveFile deliverables. Here, we focus on exporting the impute-vcf deliverable:

# Store VCF deliverable GencoveFile object
impute_vcf_file = example_sample.files["impute-vcf"]

# Generate a URL for the VCF file
print(impute_vcf_file.remote.url)

Accessing the .remote.url property generates a pre-signed URL on the fly. The URL grants temporary access (48 hours) to the deliverable over HTTPS.
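
Since a pre-signed URL is an ordinary HTTPS link, any HTTP client can fetch the file before the link expires. Below is a minimal sketch using only the Python standard library. The helper name download_presigned_url and the destination path are our own placeholders, not part of the Explorer SDK.

```python
import shutil
import urllib.request
from pathlib import Path


def download_presigned_url(url: str, destination: str, timeout: float = 60.0) -> Path:
    """Stream the content behind `url` to `destination` and return the local path.

    Hypothetical helper for illustration; not part of the Explorer SDK.
    """
    dest = Path(destination)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so a large VCF is never held fully in memory.
    with urllib.request.urlopen(url, timeout=timeout) as response, dest.open("wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

With the objects from earlier in this post, a call might look like download_presigned_url(impute_vcf_file.remote.url, "downloads/example.vcf.gz"), where the local file name is a placeholder.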

Exporting to external cloud providers

The Gencove Explorer Library facilitates exporting sample deliverables to AWS, Azure, and GCP. The brief examples below demonstrate this capability. For more information, see the official docs on Explorer shortcuts.

AWS

To export data to AWS, use the ExportSampleDeliverablesToS3 shortcut:

from gencove_explorer_library.shortcuts.export_sample_deliverables import ExportSampleDeliverablesToS3

# Wrap data in a list
data = [example_sample]

# Define ExportSampleDeliverablesToS3 shortcut
to_aws = ExportSampleDeliverablesToS3(
    s3_path="s3://bucket/path/",
    aws_session_configuration={
        "aws_access_key_id": "AKIA...",
        "aws_secret_access_key": "123..."
    },
    samples=data
)

# Execute export job via batch cluster
to_aws.run()

To monitor the export job, call the .status() method periodically.
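
Periodic polling can be wrapped in a small loop. The sketch below is generic on purpose: this post does not describe what .status() returns, so the caller supplies a predicate that decides when the job is finished. The helper name wait_until_done and the default intervals are assumptions for illustration.

```python
import time
from typing import Any, Callable


def wait_until_done(
    get_status: Callable[[], Any],
    is_done: Callable[[Any], bool],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> Any:
    """Call `get_status` until `is_done(status)` is true, then return that status.

    Hypothetical helper; raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if is_done(status):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job not finished after {timeout_seconds} seconds")
        # Sleep between checks to avoid hammering the service.
        time.sleep(poll_seconds)
```

For an export job, get_status could be the job's bound .status method, with is_done written against whatever values that method actually returns in your environment.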

Note that aws_session_configuration must contain your Access Key ID and Secret Access Key for the export to AWS S3 to succeed. These credentials must have IAM permissions that allow access to the destination S3 prefix.
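
To keep credentials out of notebooks, the configuration dictionary can be built from environment variables instead of literals. This is a minimal sketch: the dictionary keys match the example above, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the standard AWS variable names, and the helper name is our own.

```python
import os
from typing import Mapping, Optional


def aws_session_configuration_from_env(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Build the aws_session_configuration dict from environment variables.

    Hypothetical helper; pass a mapping to override os.environ (useful in tests).
    """
    env = os.environ if environ is None else environ
    required = {
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
    }
    # Fail early with a clear message rather than at export time.
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in required.items()}
```

The returned dictionary can then be passed as the aws_session_configuration argument in place of the hard-coded values.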

Microsoft Azure

Similarly, data can be exported to Azure storage using the ExportSampleDeliverablesToAzureStorage shortcut:

from gencove_explorer_library.shortcuts.export_sample_deliverables import ExportSampleDeliverablesToAzureStorage

# Wrap data in a list
data = [example_sample]

# Define ExportSampleDeliverablesToAzureStorage shortcut
to_azure = ExportSampleDeliverablesToAzureStorage(
    azure_container_name="example",
    azure_blob_path="foo/bar/baz",
    azure_connection_string="DefaultEndpointsProtocol=https;AccountName=accountname;AccountKey=accountkey",
    samples=data
)

# Execute export job via batch cluster
to_azure.run()

Note that azure_connection_string must be valid, and the storage account it identifies must have permission to write to the destination container and blob path.
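
A malformed connection string is an easy mistake to catch before submitting a job. The sketch below splits the string into key/value pairs and checks only the fields that appear in the example above (AccountName and AccountKey). Other connection string styles exist and would need different checks, and the function names here are hypothetical.

```python
def parse_azure_connection_string(connection_string: str) -> dict:
    """Split 'Key=Value;Key=Value' into a dict. Hypothetical helper."""
    parts = {}
    for segment in connection_string.split(";"):
        if not segment.strip():
            continue  # tolerate a trailing semicolon
        # Split on the first '=' only: base64 account keys often end in '='.
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError("Malformed connection string segment (no '=')")
        parts[key.strip()] = value
    return parts


def check_account_key_connection_string(connection_string: str) -> dict:
    """Raise ValueError if AccountName or AccountKey is absent or empty."""
    parts = parse_azure_connection_string(connection_string)
    missing = [k for k in ("AccountName", "AccountKey") if not parts.get(k)]
    if missing:
        raise ValueError("Connection string is missing: " + ", ".join(missing))
    return parts
```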

Google Cloud (GCP)

Exporting to GCP follows a similar process using the ExportSampleDeliverablesToGCPStorage shortcut:

from gencove_explorer_library.shortcuts.export_sample_deliverables import ExportSampleDeliverablesToGCPStorage

# Wrap data in a list
data = [example_sample]

# Define ExportSampleDeliverablesToGCPStorage shortcut
to_gcp = ExportSampleDeliverablesToGCPStorage(
    storage_bucket="example-bucket",
    storage_path="example",
    gcp_service_account_json_path="example.json",
    samples=data
)

# Execute export job via batch cluster
to_gcp.run()
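
As with the other providers, the credentials file needs to be valid and the service account needs write access to the destination bucket. A quick local check of the key file can catch a wrong path or the wrong kind of JSON before a job is submitted. This is a minimal sketch under the assumption that the file is a standard Google service account key. The helper name is our own, and it does not verify any permissions.

```python
import json
from pathlib import Path


def check_service_account_file(path: str) -> str:
    """Validate the basic shape of a service account key file.

    Hypothetical helper; returns the account's client_email on success.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if data.get("type") != "service_account":
        raise ValueError("Key file is not of type 'service_account'")
    # Fields present in standard service account key files.
    missing = [f for f in ("project_id", "client_email", "private_key") if not data.get(f)]
    if missing:
        raise ValueError("Key file is missing fields: " + ", ".join(missing))
    return data["client_email"]
```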

Final thoughts

The examples above can all be run interactively in Gencove Explorer through a demo notebook. These Explorer Library shortcuts make it convenient to export data to various service providers and allow users to leverage Gencove Explorer's built-in efficiency. If your deliverable destination of choice is not AWS, Azure, or GCP, please get in touch, and we’ll add a shortcut that meets your group’s needs.

Gencove Explorer is a cloud-based genomics computing platform that allows users to organize, analyze, and visualize their data all in one place. Significantly streamlining genomic data analysis and storage, Gencove Explorer enables scientists to design and execute complex data pipelines on scalable clusters within a familiar batch computing environment.