Skip to content

Commit

Permalink
Merge pull request #64 from gymrek-lab/ref/simphenotype
Browse files Browse the repository at this point in the history
ref: simphenotype
  • Loading branch information
aryarm authored Jul 20, 2022
2 parents e6022c7 + 0359ced commit c5c1339
Show file tree
Hide file tree
Showing 27 changed files with 1,500 additions and 822 deletions.
12 changes: 8 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,10 @@ Homepage: https://haptools.readthedocs.io/

## Installation

UNDER CONSTRUCTION
We have not officially published `haptools` yet, but in the meantime, you can install it directly from our Github repository.
```bash
pip install git+/~https://github.com/gymrek-lab/haptools.git
```

## Haptools utilities

Expand All @@ -21,12 +24,13 @@ Haptools consists of multiple utilities listed below. Click on a utility to see

* [`haptools karyogram`](docs/commands/karyogram.md): Visualize a "chromosome painting" of local ancestry labels based on breakpoints output by `haptools simgenome`.

* [`haptools transform`](https://haptools.readthedocs.io/en/latest/commands/transform.html): Transform a set of genotypes via a list of haplotypes. Create a new VCF containing haplotypes instead of variants.

Outputs produced by these utilities are compatible with each other. For example
`haptools simgenome` outputs a VCF file with local ancestry information annotated for each variant. The output VCF file can be used as input to `haptools simphenotype` to simulate phenotype information. `haptools simgenome` also outputs a list of local ancestry breakpoints which can be visualized using `haptools karyogram`.

## Contributing

If you are interested in contributing to `haptools`, please get in touch by submitting a Github issue or contacting us at mlamkin@ucsd.edu.


We gladly welcome any contributions to `haptools`!

Please read [our contribution guidelines](https://haptools.readthedocs.io/en/latest/project_info/contributing.html) and then submit a [Github issue](/~https://github.com/gymrek-lab/haptools/issues).
10 changes: 9 additions & 1 deletion docs/api/haptools.rst
Original file line number Diff line number Diff line change
Expand Up @@ -52,8 +52,16 @@ haptools.data.haplotypes module
:undoc-members:
:show-inheritance:

haptools.sim_phenotype module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: haptools.sim_phenotype
:members:
:undoc-members:
:show-inheritance:

haptools.sim_genotype module
~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: haptools.sim_genotype
:members:
Expand Down
2 changes: 1 addition & 1 deletion docs/commands/karyogram.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Haptools karyogram
# karyogram

`haptools karyogram` takes as input a breakpoints file (e.g. as output by `haptools simgenotype`) and a sample name, and plots a karyogram depicting local ancestry tracks.

Expand Down
76 changes: 0 additions & 76 deletions docs/commands/simphenotype.md

This file was deleted.

81 changes: 79 additions & 2 deletions docs/commands/simphenotype.rst
Original file line number Diff line number Diff line change
@@ -1,4 +1,81 @@
.. _commands-simphenotype:

.. include:: simphenotype.md
:parser: myst_parser.sphinx_

simphenotype
============

Simulates a complex trait, taking into account haplotype- or local-ancestry- specific effects as well as traditional variant-level effects. The user denotes causal variables to use within the simulation by specifying them in a ``.hap`` file.

The implementation is based on the `GCTA GWAS Simulation <https://yanglab.westlake.edu.cn/software/gcta/#GWASSimulation>`_ utility.

Usage
~~~~~
.. code-block:: bash
haptools simphenotype \
--replications INT \
--heritability FLOAT \
--prevalence FLOAT \
--region TEXT \
--sample SAMPLE \
--samples-file FILENAME \
--output PATH \
--verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
GENOTYPES HAPLOTYPES
Model
~~~~~
Each normalized haplotype :math:`\vec{Z_j}` is encoded as an independent causal variable in a linear model:

.. math::
\vec{y} = \sum_j \beta_j \vec{Z_j} + \vec \epsilon
where

.. math::
\epsilon_i \sim N(0, \sigma^2)
.. math::
\sigma^2 = Var[\sum_j \beta_j \vec{Z_j}] * (\frac 1 {h^2} - 1)
The heritability :math:`h^2` is user-specified, but if it is not provided, then :math:`\sigma^2` will be computed purely from the effect sizes, instead:

.. math::
\sigma^2 = \Biggl \lbrace {1 - \sum \beta_j^2 \quad \quad {\sum \beta_j^2 \le 1} \atop 0 \quad \quad \quad \quad \quad \text{ otherwise }}
If a prevalence for the disease is specified, the final :math:`\vec{y}` value will be thresholded to produce a binary case/control trait with the desired fraction of diseased individuals.

Output
~~~~~~
Phenotypes are output in the PLINK2-style ``.pheno`` file format. If ``--replications`` was set to greater than 1, additional columns are output for each simulated trait.

Note that case/control phenotypes are encoded as 0 (control) + 1 (case) **not** 1 (control) + 2 (case). The latter is used by PLINK2 unless the ``--1`` flag is used (see `the PLIN2 docs <https://www.cog-genomics.org/plink/2.0/input#input_missing_phenotype>`_). Therefore, you must use ``--1`` when providing our ``.pheno`` files to PLINK.

Examples
~~~~~~~~
.. code-block:: bash
haptools simphenotype -o simulated.pheno tests/data/example.vcf.gz tests/data/simphenotype.hap
Simulate two replicates of a case/control trait that occurs in 60% of your samples with a heritability of 0.8. Encode all of the haplotypes in ``tests/data/example.hap.gz`` as independent causal variables.

.. code-block:: bash
haptools simphenotype \
--replications 2 \
--heritability 0.8 \
--prevalence 0.6 \
--output simulated.pheno \
tests/data/example.vcf.gz tests/data/example.hap.gz
Detailed Usage
~~~~~~~~~~~~~~

.. click:: haptools.__main__:main
:prog: haptools
:show-nested:
:commands: simphenotype
2 changes: 1 addition & 1 deletion docs/commands/transform.rst
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ Examples
~~~~~~~~
.. code-block:: bash
haptools transform tests/data/example.vcf.gz tests/data/example.hap.gz | less
haptools transform tests/data/example.vcf.gz tests/data/basic.hap.gz | less
.. code-block:: bash
Expand Down
7 changes: 7 additions & 0 deletions docs/formats/genotypes.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
.. _formats-genotypes:


Genotypes
=========

Genotype files must be specified as VCF or BCF files.
4 changes: 2 additions & 2 deletions docs/formats/haplotypes.rst
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
.. _formats-haplotypes:


.hap
====
Haplotypes
==========

This document describes our custom file format specification for haplotypes: the ``.hap`` file.

Expand Down
43 changes: 0 additions & 43 deletions docs/formats/inputs.rst

This file was deleted.

28 changes: 28 additions & 0 deletions docs/formats/phenotypes.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
.. _formats-phenotypes:


Phenotypes and Covariates
=========================

Phenotype file format
---------------------
Phenotypes are expected to follow `the PLINK2 .pheno file format <https://www.cog-genomics.org/plink/2.0/input#pheno>`_. This is a
tab-separated format where the first column corresponds to the sample ID, and
subsequent columns contain each of your phenotypes.

The first line of the file corresponds with the header and must begin with ``#IID``.
The names of each of your phenotypes belong in the subbsequent columns of the header.

See `tests/data/simple.pheno </~https://github.com/gymrek-lab/haptools/blob/main/tests/data/simple.pheno>`_ for an example of a phenotype file:

.. include:: ../../tests/data/simple.pheno
:literal:

Covariate file format
---------------------
Covariates follow the same format as phenotypes.

See `tests/data/simple.covar </~https://github.com/gymrek-lab/haptools/blob/main/tests/data/simple.covar>`_ for an example of a covariate file:

.. include:: ../../tests/data/simple.covar
:literal:
17 changes: 17 additions & 0 deletions docs/formats/sample_info.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
.. _formats-sample_info:


Sample Info
===========

1000 Genomes sample_info file format
------------------------------------
Within the subcommand ``haptools simgenotype`` we use a file to map samples in the
reference to their population listed in the model file. This code is expected to work
out of the box with 1000 genomes data and we have pre-computed this mapping file as
well as given the command to how to compute it if you desire another as well.

``cut -f 1,4 igsr-1000\ genomes\ on\ grch38.tsv | sed '1d' | sed -e 's/ /\t/g' > 1000genomes_sampleinfo.tsv``

See ``example-files/1000genomes_sampleinfo.tsv`` for an example of the 1000genomes
GRCh38 samples mapped to their subpopulations.
5 changes: 4 additions & 1 deletion docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -9,8 +9,10 @@
:hidden:
:maxdepth: 1

formats/inputs.rst
formats/genotypes.rst
formats/haplotypes.rst
formats/phenotypes.rst
formats/sample_info.rst

.. toctree::
:caption: Commands
Expand All @@ -20,6 +22,7 @@

commands/simgenotype.rst
commands/simphenotype.rst
commands/karyogram.rst
commands/transform.rst

.. toctree::
Expand Down
Loading

0 comments on commit c5c1339

Please sign in to comment.