Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support for a TR-based GenotypesPLINK class #73

Closed
aryarm opened this issue Jul 28, 2022 · 2 comments · Fixed by #204 or #222
Closed

support for a TR-based GenotypesPLINK class #73

aryarm opened this issue Jul 28, 2022 · 2 comments · Fixed by #204 or #222
Labels
enhancement New feature or request idea

Comments

@aryarm
Copy link
Member

aryarm commented Jul 28, 2022

The data.Genotypes class only really supports bi-allelic, phased variants. (There is some limited support for unphased genotypes, but it isn't really something we care too much about at the moment.)

I've designed the class so that it could easily be extended to support multi-allelic variants, so it would be great if we could add formal support for tandem repeats (TRs). Probably the best way to do this would be to use the existing tr_harmonizer library in trtools.

A potential implementation strategy

We could extend (sub-class) the data.Genotypes class to create a new data.GenotypesTR class.

  1. First, we would just need to override the data.Genotypes._variant_arr() method, which takes a cyvcf2.Variant instance as input and creates a np array containing ID, CHROM, POS, REF, and ALT fields. Instead, we could convert the cyvcf2.Variant instance to a trtools.utils.tr_harmonizer.TRRecord before calling that method. And within the method, we can store the alleles in a new column with dtype "object" in the np array, where the object is just a tuple of allele strings containing REF, ALT1, ALT2, etc.
  2. Next, we would have to somehow inherit the way that the data.Genotypes.data property is filled. Basically, we would want to store STR lengths instead of 0s and 1s here. Our plan is to do this conversion after we call the parent function. We might also be able to call tr_harmonizer's GetLengthGenotypes() method to get the lengths.
  3. Lastly, we should copy all of our test cases and make sure all of the other methods in the data.Genotypes class still work in data.GenotypesTR
  4. We might need to override or disable the check_biallelic and check_maf() methods

Additional methods

If we ever want to merge TRs and SNPs into one data.Genotypes instance, we might want to develop a merge_variants() method which can merge one instance with another. Assuming both are sorted, we might want to develop the method so that it keeps the records sorted in the end.

PLINK2 PGEN support

PLINK2 PGEN files support multi-allelic variants, but the pgenlib python library responsible for reading and writing PGEN files does not yet support them. Quoting from the documentation,

Multiallelic variants aren't fully supported yet. Instead, all ALT alleles are effectively collapsed into one.

Once support for multi-allelic variants is added to the pgenlib python library, support for reading and writing TRs to PGEN files should come with it, essentially automatically. We would just need to copy (and slightly adapt) our code from the data.GenotypesTR._variant_arr() method. We might also need to create an equivalent method to do the reverse, too.

Compatability

In theory, once we've made these changes, our existing code should work out-of-the-box with the new data.GenotypesTR class because they utilize the same interface. To confirm, we can copy our tests and just change the input.

@aryarm aryarm added enhancement New feature or request idea labels Sep 4, 2022
@aryarm aryarm changed the title support for TRs support for TRs in the Genotypes class Jan 14, 2023
@aryarm aryarm changed the title support for TRs in the Genotypes class support for a TR-based Genotypes class Jan 14, 2023
@aryarm
Copy link
Member Author

aryarm commented Feb 7, 2023

trtools has lots of dependencies that we don't need

as a workaround, let's just copy the code that we need from trtools
if we can import trtools, we'll use tr_harmonizer from the imported code
and otherwise, we'll just use the code that we copied

try:
    import trtools.utils.tr_harmonizer as trh
except ModuleNotFoundError:
    from . import tr_harmonizer as trh

@aryarm
Copy link
Member Author

aryarm commented Apr 25, 2023

PR #204 added support for TRs via a GenotypesTR class

And Pgenlib v0.83.0 added support for multiallelic variants! So the remaining tasks are

  • update our pyproject.toml to use the most recent version of Pgenlib (done! see 296e75d)
  • write tests in TestGenotypesPLINK in tests/test_data.py:
    • test_load_genotypes_multiallelic()
    • test_load_genotypes_multiallelic_tr() - should do the same as test_load_genotypes_multiallelic() but with simple-tr.pgen instead of simple-multiallelic.pgen. (Refer to the section about TRs below)
    • test_iter_multiallelic_tr()
    • test_write_multiallelic() - should try to generate a file containing multiallelic variants, then read it again and check that it matches what we wrote
    • test_write_multiallelic_tr() - should do the same as test_write_multiallelic() but with simple-tr.pgen instead of simple-multiallelic.pgen. (Refer to the section about TRs below)
      • we should also try to add a line to the simple-tr.vcf, simple-tr.pvar, and simple-tr.pgen files to replicate the situation described here, where there are 4 ALT alleles in the .pvar file, but the last one is not actually observed
  • Enable reading of multiallelic variants in our GenotypesPLINK class (See here)
    • we will need to provide a PvarReader object as a parameter when initializing the PgenReader class (See here)
    • we will also need to split the ALT allele by the comma delimiter in the _variant_arr() method
  • Enable writing of multiallelic variants in our GenotypesPLINK class
    • we will need to provide an allele_cts parameter to the append_alleles_batch() and append_partially_phased_batch() methods. (See here). It should be a np.uint32 array of each of the allele counts for each variant record.
      Note that it will be important that we provide the number of observed alleles and not just the number of alleles that we write to the PVAR file. (See here) We can probably obtain the number of observed alleles using np.unique(). I'm not yet sure whether empty genotypes should be included in that count, though
    • we will also need to provide an allele_ct_limit parameter to the PgenWriter class (See here). It should be the maximum number of alleles among all of the variants

Then, once we've finished that (so both of the tests pass), we can implement the code for TRs:

  • create test files (simple-tr.pgen, simple-tr.pvar, and simple-tr.psam) for TestGenotypesPLINKTR. We can convert from VCF using this command:

    plink2 --vcf-half-call m --make-pgen 'pvar-cols=vcfheader,qual,filter,info' \
    --vcf tests/data/simple-tr.vcf --out tests/data/simple-tr
    
  • write tests in a new class called TestGenotypesPLINKTR within tests/test_data.py:

    • _fake_genotypes_multiallelic() - should create a fake GenotypesPLINKTR class containing the contents of the simple-tr.pgen file
    • test_iter() - should work sorta like test_load_genotypes_iterate() in the TestGenotyesPLINK class
    • test_read()- should be super simple. Just check the properties match what we expect
  • Create a new GenotypesPLINKTR class (in genotypes.py) that extends the GenotypesPLINK classes

    • __init__() - this should be very similar to the GenotypesTR.__init__() method
      • it should call the parent's __iter__() and then add a vcftype property (from an extra parameter) just as in the GenotypesTR class
    • load() - this should be very similar to the GenotypesTR.load() method
    • _iter_TRRecords() - should open the pvar file (again), read each line into a cyvcf2.Variant object, convert it to a custom TRRecord object, and then yield it
    • read() - should call the parent's read() method, iterate over the TRRecord objects yielded by _iter_TRRecords(), and then use the ref and alt allele lengths stored in properties of the record to override each variant's array values in the data property
    • _iterate() - should call the parent's _iterate() method, iterate over the TRRecord objects yielded by _iter_TRRecords(), and then call GetLengthGenotypes() to override each variant's array values, and then yield them
    • write(), write_variants(), check_biallelic(), and check_maf() - we should simply override and disable these methods for now. We can raise a NotImplementedError like in the GenotypesTR class
      I can create test files for us to use for TestGenotypesPLINKTR
  • After we've finished GenotypesPLINKTR.read(), we can benchmark it against GenotypesVCF.

    • We'll need to create a script that can generate TR VCFs of different sizes for us. We can base it off of the script in tests/bench_genotypes.py. But since the GenotypesTR and GenotypesPLINKTR classes don't have write() methods, we'll need to create our own function within the benchmarking script in order to do that. This shouldn't be too difficult. We can just copy the header of tests/data/simple-tr.vcf and then use pysam to create TRs with random sets of alleles. The only major thing to note is that we will need to fill the START, END, and PERIOD INFO fields accordingly.

Benchmarking of TRs


If PGEN files can store multiallelic dosages, we should consider creating a command called dosage that can append dosage info computed using haptools' Genotypes classes to an existing PGEN file:

  • It can use GenotypesPLINKTR to compute the genotypes, sum them to compute the dosages, possibly divide by a number so that they remain in the range [0, 2], and then write them to the existing file using PgenWriter.append_dosages_batch(). It should only change the PGEN file. The PVAR and PSAM files can remain the same
  • If --out is specified, we should create a new PGEN file instead of overwriting the existing one. In the former case, we should also copy (or symlink?) the PVAR and PSAM files.

Once we do all of this, we should be able to use PLINK2 to analyze TRs! For example, if we wanted to do an association analysis with TRs, we could do the following:

  1. Convert a VCF into a PGEN file
    plink2 --vcf-half-call m --make-pgen 'pvar-cols=vcfheader,qual,filter,info' --vcf input.vcf --out output
    
  2. Run haptools dosage on the PGEN file
    haptools dosage output.pgen
    
  3. Use the PGEN file as input to plink2 --glm: https://www.cog-genomics.org/plink/2.0/assoc#glm

@aryarm aryarm changed the title support for a TR-based Genotypes class support for a TR-based GenotypesPLINK class Jun 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request idea
Projects
None yet
1 participant