Merge pull request #64 from gymrek-lab/ref/simphenotype
ref: simphenotype
Showing 27 changed files with 1,500 additions and 822 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,81 @@ | ||
.. _commands-simphenotype: | ||
|
||
.. include:: simphenotype.md | ||
:parser: myst_parser.sphinx_ | ||
|
||
simphenotype | ||
============ | ||
|
||
Simulates a complex trait, taking into account haplotype- or local-ancestry-specific effects as well as traditional variant-level effects. The user designates the causal variables for the simulation by specifying them in a ``.hap`` file. | ||
|
||
The implementation is based on the `GCTA GWAS Simulation <https://yanglab.westlake.edu.cn/software/gcta/#GWASSimulation>`_ utility. | ||
|
||
Usage | ||
~~~~~ | ||
.. code-block:: bash | ||
haptools simphenotype \ | ||
--replications INT \ | ||
--heritability FLOAT \ | ||
--prevalence FLOAT \ | ||
--region TEXT \ | ||
--sample SAMPLE \ | ||
--samples-file FILENAME \ | ||
--output PATH \ | ||
--verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \ | ||
GENOTYPES HAPLOTYPES | ||
Model | ||
~~~~~ | ||
Each normalized haplotype :math:`\vec{Z_j}` is encoded as an independent causal variable in a linear model: | ||
|
||
.. math:: | ||
\vec{y} = \sum_j \beta_j \vec{Z_j} + \vec \epsilon | ||
where | ||
|
||
.. math:: | ||
\epsilon_i \sim N(0, \sigma^2) | ||
.. math:: | ||
\sigma^2 = \operatorname{Var}\left[\sum_j \beta_j \vec{Z_j}\right] \left(\frac{1}{h^2} - 1\right) | ||
The heritability :math:`h^2` is user-specified. If it is not provided, :math:`\sigma^2` is instead computed purely from the effect sizes: | ||
|
||
.. math:: | ||
\sigma^2 = \begin{cases} 1 - \sum_j \beta_j^2 & \text{if } \sum_j \beta_j^2 \le 1 \\ 0 & \text{otherwise} \end{cases} | ||
If a prevalence for the disease is specified, the final :math:`\vec{y}` value will be thresholded to produce a binary case/control trait with the desired fraction of diseased individuals. | ||
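The model above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the haptools implementation: the function names ``noise_variance`` and ``simulate_phenotype`` are hypothetical, the population variance is used for :math:`Var[\cdot]`, and the prevalence threshold is assumed to label the samples with the largest :math:`y` values as cases.

```python
import random
import statistics


def noise_variance(betas, genetic_values, heritability=None):
    """Return sigma^2 for the noise term (hypothetical helper)."""
    if heritability is not None:
        # sigma^2 = Var[sum_j beta_j Z_j] * (1/h^2 - 1)
        # Assumption: population variance; the real tool may differ.
        return statistics.pvariance(genetic_values) * (1 / heritability - 1)
    # No heritability given: 1 - sum(beta_j^2), or 0 if the sum exceeds 1
    return max(0.0, 1 - sum(b * b for b in betas))


def simulate_phenotype(Z, betas, heritability=None, prevalence=None, seed=0):
    """Z[j][i] is the normalized value of haplotype j in sample i."""
    rng = random.Random(seed)
    n = len(Z[0])
    # Genetic component: sum_j beta_j * Z_j for every sample
    genetic = [sum(b * z[i] for b, z in zip(betas, Z)) for i in range(n)]
    sigma = noise_variance(betas, genetic, heritability) ** 0.5
    y = [g + rng.gauss(0, sigma) for g in genetic]
    if prevalence is None:
        return y
    # Assumption: the top `prevalence` fraction of samples become cases (1)
    n_cases = round(prevalence * n)
    ranked = sorted(range(n), key=lambda i: y[i], reverse=True)
    cases = set(ranked[:n_cases])
    return [1 if i in cases else 0 for i in range(n)]
```

Calling ``simulate_phenotype(Z, betas, heritability=0.8, prevalence=0.6)`` on five samples would return a list of 0s and 1s containing three cases; omitting ``prevalence`` returns the quantitative trait.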
|
||
Output | ||
~~~~~~ | ||
Phenotypes are output in the PLINK2-style ``.pheno`` file format. If ``--replications`` is set to a value greater than 1, an additional column is output for each simulated trait. | ||
|
||
Note that case/control phenotypes are encoded as 0 (control) and 1 (case), **not** 1 (control) and 2 (case). PLINK2 expects the latter encoding unless the ``--1`` flag is used (see `the PLINK2 docs <https://www.cog-genomics.org/plink/2.0/input#input_missing_phenotype>`_). Therefore, you must use ``--1`` when providing our ``.pheno`` files to PLINK. | ||
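If passing ``--1`` is not convenient, the two encodings differ only by an offset of one, so the values can be recoded instead. The helper below is a hypothetical sketch (``to_plink_default_coding`` is not part of haptools) and it deliberately rejects anything other than 0 or 1 rather than guessing how missing values should be handled.

```python
def to_plink_default_coding(values):
    """Recode 0 (control) / 1 (case) as 1 (control) / 2 (case)."""
    recoded = []
    for value in values:
        if value not in (0, 1):
            # Missing or quantitative values are out of scope for this sketch
            raise ValueError(f"expected 0 or 1, got {value!r}")
        recoded.append(int(value) + 1)
    return recoded
```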
|
||
Examples | ||
~~~~~~~~ | ||
.. code-block:: bash | ||
haptools simphenotype -o simulated.pheno tests/data/example.vcf.gz tests/data/simphenotype.hap | ||
Simulate two replicates of a case/control trait that occurs in 60% of your samples with a heritability of 0.8. Encode all of the haplotypes in ``tests/data/example.hap.gz`` as independent causal variables. | ||
|
||
.. code-block:: bash | ||
haptools simphenotype \ | ||
--replications 2 \ | ||
--heritability 0.8 \ | ||
--prevalence 0.6 \ | ||
--output simulated.pheno \ | ||
tests/data/example.vcf.gz tests/data/example.hap.gz | ||
Detailed Usage | ||
~~~~~~~~~~~~~~ | ||
|
||
.. click:: haptools.__main__:main | ||
:prog: haptools | ||
:show-nested: | ||
:commands: simphenotype |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
.. _formats-genotypes: | ||
|
||
|
||
Genotypes | ||
========= | ||
|
||
Genotype files must be specified as VCF or BCF files. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
.. _formats-phenotypes: | ||
|
||
|
||
Phenotypes and Covariates | ||
========================= | ||
|
||
Phenotype file format | ||
--------------------- | ||
Phenotypes are expected to follow `the PLINK2 .pheno file format <https://www.cog-genomics.org/plink/2.0/input#pheno>`_. This is a | ||
tab-separated format where the first column corresponds to the sample ID, and | ||
subsequent columns contain each of your phenotypes. | ||
|
||
The first line of the file is the header and must begin with ``#IID``. | ||
The names of your phenotypes belong in the subsequent columns of the header. | ||
|
||
See `tests/data/simple.pheno <https://github.com/gymrek-lab/haptools/blob/main/tests/data/simple.pheno>`_ for an example of a phenotype file: | ||
|
||
.. include:: ../../tests/data/simple.pheno | ||
:literal: | ||
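As a rough illustration of the layout described above, the following sketch parses a ``.pheno`` file with Python's standard ``csv`` module. The function name ``read_pheno`` is hypothetical, and the sketch assumes every phenotype value is numeric; it is not the parser that haptools itself uses.

```python
import csv


def read_pheno(handle):
    """Parse a tab-separated .pheno file from an open text handle.

    Returns (phenotype_names, {sample_id: {phenotype_name: value}}).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # The header must begin with #IID; the remaining columns name the phenotypes
    if not header or header[0] != "#IID":
        raise ValueError("the header line must begin with #IID")
    names = header[1:]
    phenotypes = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        sample_id, values = row[0], row[1:]
        if len(values) != len(names):
            raise ValueError(f"wrong number of columns for sample {sample_id}")
        # Assumption: all phenotype values can be read as floats
        phenotypes[sample_id] = dict(zip(names, map(float, values)))
    return names, phenotypes
```

Because covariate files follow the same format, the same sketch would apply to them as well. With a real file, one would call it as ``with open(path, newline="") as f: names, data = read_pheno(f)``.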
|
||
Covariate file format | ||
--------------------- | ||
Covariates follow the same format as phenotypes. | ||
|
||
See `tests/data/simple.covar <https://github.com/gymrek-lab/haptools/blob/main/tests/data/simple.covar>`_ for an example of a covariate file: | ||
|
||
.. include:: ../../tests/data/simple.covar | ||
:literal: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
.. _formats-sample_info: | ||
|
||
|
||
Sample Info | ||
=========== | ||
|
||
1000 Genomes sample_info file format | ||
------------------------------------ | ||
The ``haptools simgenotype`` subcommand uses a sample info file to map each sample in the | ||
reference to its population, as listed in the model file. It is expected to work | ||
out of the box with 1000 Genomes data: we provide a pre-computed mapping file, and the | ||
command below shows how to generate a different one if you need it. | ||
|
||
``cut -f 1,4 igsr-1000\ genomes\ on\ grch38.tsv | sed '1d' | sed -e 's/ /\t/g' > 1000genomes_sampleinfo.tsv`` | ||
|
||
See ``example-files/1000genomes_sampleinfo.tsv`` for an example of the 1000genomes | ||
GRCh38 samples mapped to their subpopulations. |
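For readers who prefer not to rely on ``cut`` and ``sed``, the same transformation can be sketched in Python: keep the first and fourth tab-separated fields, drop the header line, and turn any remaining spaces into tabs. The function name ``sample_info_lines`` is hypothetical, and the column layout of the input is assumed to match whatever the shell command above was written for.

```python
def sample_info_lines(lines):
    """Mimic: cut -f 1,4 | sed '1d' | sed -e 's/ /\\t/g'."""
    output = []
    for index, line in enumerate(lines):
        if index == 0:
            continue  # sed '1d': drop the header line
        fields = line.rstrip("\n").split("\t")
        # cut -f 1,4: keep fields 1 and 4 (when present), joined by a tab
        picked = [fields[i] for i in (0, 3) if i < len(fields)]
        # sed 's/ /\t/g': replace every space with a tab
        output.append("\t".join(picked).replace(" ", "\t"))
    return output
```

To write the mapping file, one could open the downloaded TSV, pass the file handle to ``sample_info_lines``, and write the returned lines (joined by newlines) to ``1000genomes_sampleinfo.tsv``.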