-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support for a TR-based GenotypesPLINK class #73
Comments
trtools has lots of dependencies that we don't need as a workaround, let's just copy the code that we need from trtools try:
import trtools.utils.tr_harmonizer as trh
except ModuleNotFoundError:
from . import tr_harmonizer as trh |
PR #204 added support for TRs via a GenotypesTR class And Pgenlib v0.83.0 added support for multiallelic variants! So the remaining tasks are
Then, once we've finished that (so both of the tests pass), we can implement the code for TRs:
If PGEN files can store multiallelic dosages, we should consider creating a command called
Once we do all of this, we should be able to use PLINK2 to analyze TRs! For example, if we wanted to do an association analysis with TRs, we could do the following:
|
The
data.Genotypes
class only really supports bi-allelic, phased variants. (There is some limited support for unphased genotypes, but it isn't really something we care too much about at the moment.)I've designed the class so that it could easily be extended to support multi-allelic variants, so it would be great if we could add formal support for tandem repeats (TRs). Probably the best way to do this would be to use the existing
tr_harmonizer
library intrtools
.A potential implementation strategy
We could extend (sub-class) the
data.Genotypes
class to create a newdata.GenotypesTR
class.data.Genotypes._variant_arr()
method, which takes acyvcf2.Variant
instance as input and creates a np array containing ID, CHROM, POS, REF, and ALT fields. Instead, we could convert thecyvcf2.Variant
instance to atrtools.utils.tr_harmonizer.TRRecord
before calling that method. And within the method, we can store the alleles in a new column with dtype "object" in the np array, where the object is just a tuple of allele strings containing REF, ALT1, ALT2, etc.data.Genotypes.data
property is filled. Basically, we would want to store STR lengths instead of 0s and 1s here. Our plan is to do this conversion after we call the parent function. We might also be able to calltr_harmonizer
'sGetLengthGenotypes()
method to get the lengths.data.Genotypes
class still work indata.GenotypesTR
check_biallelic
andcheck_maf()
methodsAdditional methods
If we ever want to merge TRs and SNPs into one
data.Genotypes
instance, we might want to develop amerge_variants()
method which can merge one instance with another. Assuming both are sorted, we might want to develop the method so that it keeps the records sorted in the end.PLINK2 PGEN support
PLINK2 PGEN files support multi-allelic variants, but the
pgenlib
python library responsible for reading and writing PGEN files does not yet support them. Quoting from the documentation,Once support for multi-allelic variants is added to the
pgenlib
python library, support for reading and writing TRs to PGEN files should come with it, essentially automatically. We would just need to copy (and slightly adapt) our code from thedata.GenotypesTR._variant_arr()
method. We might also need to create an equivalent method to do the reverse, too.Compatability
In theory, once we've made these changes, our existing code should work out-of-the-box with the new
data.GenotypesTR
class because they utilize the same interface. To confirm, we can copy our tests and just change the input.The text was updated successfully, but these errors were encountered: