alleleTools.format.vcf module

VCF File Handling Module.

This module provides a VCF (Variant Call Format) class for reading, parsing, and manipulating VCF files containing genetic variant data, particularly optimized for HLA and KIR allele data.

class alleleTools.format.vcf.VCF(path)

Bases: object

A class for handling VCF (Variant Call Format) files.

This class provides methods to read, parse, and manipulate VCF files, with specific functionality for handling allele data from polymorphic genes like HLA and KIR.

Attributes:

metadata (str) – VCF header metadata lines

dataframe (pd.DataFrame) – Main VCF data with ID as index

Parameters:

path (str) – Path to the VCF file to read
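
To illustrate the split between header metadata and tabular data described above, here is a minimal standard-library sketch. It is not the module's implementation: the `split_vcf` helper, the sample names `S1`/`S2`, and the records are hypothetical, and the class itself stores the table as a pandas DataFrame rather than a dictionary.

```python
import io

# Hypothetical miniature VCF; allele IDs, positions, and sample names are made up.
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "##source=example\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "6\t29942470\tHLA_A*01:01\tA\tT\t.\tPASS\t.\tGT\t0|1\t0|0\n"
    "6\t29942470\tHLA_A*02:01\tA\tT\t.\tPASS\t.\tGT\t1|0\t0|1\n"
)


def split_vcf(handle):
    """Return (metadata, columns, rows) from a VCF-like text stream."""
    metadata_lines, columns, rows = [], [], []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines start with a double hash.
            metadata_lines.append(line)
        elif line.startswith("#"):
            # The single-hash line names the columns.
            columns = line.lstrip("#").split("\t")
        elif line:
            rows.append(dict(zip(columns, line.split("\t"))))
    return "\n".join(metadata_lines), columns, rows


metadata, columns, rows = split_vcf(io.StringIO(VCF_TEXT))

# Index the records by their ID field, mirroring "ID as index".
by_id = {row["ID"]: row for row in rows}
```

Here `metadata` plays the role of the `metadata` attribute and `by_id` stands in for the ID-indexed `dataframe`.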

get_format()

Extract the format field specification from the VCF.

Parses the FORMAT column to determine the structure of genotype information fields (e.g., GT:DS:AA:AB:BB).

Returns:

List of format field names in order

Return type:

List[str]

Example

>>> vcf.get_format()
['GT', 'DS', 'AA', 'AB', 'BB']
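
As a rough illustration of what the format specification is for, the sketch below splits a FORMAT string on colons and pairs the field names with one sample's values. The helpers `parse_format` and `decode_genotype` and the sample value are hypothetical and are not part of the module.

```python
from typing import Dict, List


def parse_format(format_value: str) -> List[str]:
    """Split a FORMAT string such as 'GT:DS:AA:AB:BB' into field names."""
    return format_value.split(":")


def decode_genotype(format_value: str, sample_value: str) -> Dict[str, str]:
    """Pair each FORMAT field name with the matching sample sub-field."""
    return dict(zip(parse_format(format_value), sample_value.split(":")))


fields = parse_format("GT:DS:AA:AB:BB")

# Made-up sample entry with one value per FORMAT field.
call = decode_genotype("GT:DS:AA:AB:BB", "0|1:1.0:0.0:1.0:0.0")
```

The order returned by `parse_format` matters because sample values are positional: the first sub-field belongs to the first FORMAT name, and so on.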

remove_id_prefix(prefix: str)

Remove a prefix from allele IDs in the dataframe.

This is commonly used to remove gene prefixes like ‘HLA_’ or ‘KIR_’ from allele identifiers to standardize naming.

Parameters:

prefix (str) – The prefix string to remove from allele IDs

Example

>>> vcf.remove_id_prefix("HLA_")
# "HLA_A*01:01" becomes "A*01:01"
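
The effect on individual identifiers can be sketched in plain Python. This assumes the prefix is removed only when it appears at the start of an ID, which the documentation does not state explicitly; `strip_prefix` is a hypothetical helper, not the method itself.

```python
def strip_prefix(allele_id: str, prefix: str) -> str:
    """Drop `prefix` from the start of `allele_id` if it is present."""
    if allele_id.startswith(prefix):
        return allele_id[len(prefix):]
    return allele_id


# Made-up identifiers; the KIR entry is left alone because it lacks the prefix.
ids = ["HLA_A*01:01", "HLA_B*07:02", "KIR_2DL1"]
stripped = [strip_prefix(allele_id, "HLA_") for allele_id in ids]
```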

samples()

Get the sample column names from the VCF.

Returns every column name that is not one of the standard VCF columns (such as CHROM, POS, or FORMAT), i.e., the sample-specific genotype columns.

Returns:

Set of sample column names

Return type:

set
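
One way to picture this is a set difference between the header's column names and the fixed VCF columns. The sketch below assumes the nine fixed columns defined by the VCF specification; the `sample_columns` helper and the sample names are hypothetical, and the module may determine the fixed columns differently.

```python
# Fixed columns defined by the VCF specification (FORMAT is present
# whenever genotype data is included).
FIXED_COLUMNS = {
    "CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
}


def sample_columns(columns):
    """Return the column names that are not fixed VCF columns."""
    return set(columns) - FIXED_COLUMNS


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2"
names = sample_columns(header.lstrip("#").split("\t"))
```

Because the result is a set, the original left-to-right order of the sample columns is not preserved.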

samples_dataframe()

Get a dataframe containing only the sample genotype columns.

Returns:

DataFrame with only sample columns, indexed by variant ID

Return type:

pd.DataFrame
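
The shape of the result can be sketched without pandas as a mapping from variant ID to that variant's sample values. The records, sample names, and `samples_table` helper below are hypothetical; the actual method returns a DataFrame.

```python
FIXED_COLUMNS = {
    "CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
}

# Made-up records, one dictionary per VCF data line.
rows = [
    {"CHROM": "6", "POS": "29942470", "ID": "A*01:01", "REF": "A", "ALT": "T",
     "QUAL": ".", "FILTER": "PASS", "INFO": ".", "FORMAT": "GT",
     "S1": "0|1", "S2": "0|0"},
    {"CHROM": "6", "POS": "29942470", "ID": "A*02:01", "REF": "A", "ALT": "T",
     "QUAL": ".", "FILTER": "PASS", "INFO": ".", "FORMAT": "GT",
     "S1": "1|0", "S2": "0|1"},
]


def samples_table(records):
    """Keep only the sample columns of each record, keyed by variant ID."""
    return {
        record["ID"]: {
            name: value
            for name, value in record.items()
            if name not in FIXED_COLUMNS
        }
        for record in records
    }


table = samples_table(rows)
```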

save(path: str)

Save the VCF data to a file.

Writes the metadata header followed by the dataframe in standard VCF format.

Parameters:

path (str) – Output file path

Note

This method modifies the internal dataframe structure during saving.
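
The output layout (metadata first, then a tab-separated table with a `#`-prefixed column line) can be sketched as follows. The `write_vcf` helper and the record are hypothetical and the sketch writes to an in-memory buffer; unlike the method described in the note, it leaves its inputs untouched.

```python
import io


def write_vcf(handle, metadata, columns, rows):
    """Write metadata lines, the column line, then one line per record."""
    if metadata:
        handle.write(metadata.rstrip("\n") + "\n")
    # The column line carries a single leading hash.
    handle.write("#" + "\t".join(columns) + "\n")
    for row in rows:
        handle.write("\t".join(row[name] for name in columns) + "\n")


columns = ["CHROM", "POS", "ID", "REF", "ALT",
           "QUAL", "FILTER", "INFO", "FORMAT", "S1"]
# Made-up record for illustration.
rows = [dict(zip(columns,
                 ["6", "29942470", "A*01:01", "A", "T",
                  ".", "PASS", ".", "GT", "0|1"]))]

buffer = io.StringIO()
write_vcf(buffer, "##fileformat=VCFv4.2", columns, rows)
output = buffer.getvalue()
```

If the object is needed again after a real `save` call, re-reading the file or working on a copy may be the safer choice, given the note above.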