feat: .hap file format IO #43

Merged
merged 71 commits into from
May 14, 2022
dc3c3ce
copy variant module from happler
aryarm Apr 14, 2022
9a0fa20
start on work for haplotype parser
aryarm Apr 14, 2022
1391b0c
continue implementing Haplotypes.read and Haplotypes.iterate methods
aryarm Apr 17, 2022
c51467b
create Haplotype and Variant classes for storing lines from .haps files
aryarm Apr 19, 2022
c8428fa
create specific section in docs for file formats
aryarm Apr 19, 2022
2015215
fix issues with commands not appearing in toc of docs
aryarm Apr 19, 2022
1879e04
add docs for .hap haplotypes file format
aryarm Apr 19, 2022
0a2af60
copy variant module from happler
aryarm Apr 14, 2022
7c8d182
start on work for haplotype parser
aryarm Apr 14, 2022
99071f8
continue implementing Haplotypes.read and Haplotypes.iterate methods
aryarm Apr 17, 2022
c5500fe
create Haplotype and Variant classes for storing lines from .haps files
aryarm Apr 19, 2022
32bd815
create specific section in docs for file formats
aryarm Apr 19, 2022
600e032
fix issues with commands not appearing in toc of docs
aryarm Apr 19, 2022
8cb274a
add docs for .hap haplotypes file format
aryarm Apr 19, 2022
dc63ed2
Merge branch 'feat/haplotypes' of github.com:gymrek-lab/haptools into…
aryarm Apr 19, 2022
8f856b0
rename hap data files
aryarm Apr 19, 2022
91856b4
create new example hap files with beta added
aryarm Apr 19, 2022
5aa0deb
change allele to str in hap format spec
aryarm Apr 19, 2022
a62a03b
correct type-hinting of return of Haplotypes.iterate
aryarm Apr 19, 2022
a784e6b
use fname property in Haplotypes.write
aryarm Apr 19, 2022
cf82d4f
start handling extras in Haplotypes class
aryarm Apr 21, 2022
555deba
store variants as tuple intead of list in Haplotype class
aryarm Apr 21, 2022
ec69ae7
rewrite from_hap_spec to automatically use properties from subclasses
aryarm Apr 21, 2022
0eff78d
define new haplotype class for haptools
aryarm Apr 21, 2022
5ba8f78
check header lines in Haplotypes.read
aryarm Apr 21, 2022
54a0617
add docs for usage of the .hap file
aryarm Apr 21, 2022
4f7c7fa
fmt with black
aryarm Apr 21, 2022
b196736
rebuild api docs with haplotypes.py
aryarm Apr 21, 2022
3e2a426
add examples for Haplotypes class
aryarm Apr 22, 2022
7e86aaf
validate that all extras are there in Haplotypes.check_ex_header
aryarm Apr 22, 2022
6232929
make _fmt a private field
aryarm Apr 23, 2022
62ab36b
convert iterate to __iter__ in data module
aryarm Apr 23, 2022
b369539
add more examples and docs to haplotypes class
aryarm Apr 23, 2022
f2fe5ac
add example hap files to docs
aryarm Apr 23, 2022
effb035
create smaller hap example files
aryarm Apr 23, 2022
b3d05ff
add HaplotypeTests class to testing module
aryarm Apr 23, 2022
fb27999
call __iter__ from read in Haplotypes class
aryarm Apr 23, 2022
3cca45f
use basic.hap in haplotypes examples
aryarm Apr 23, 2022
054a01b
add indexed basic hap and test example.hap.gz
aryarm Apr 23, 2022
5080728
test Haplotypes.write() method
aryarm Apr 23, 2022
e2bf695
add header lines to example.hap
aryarm Apr 23, 2022
4480b19
reformat with black -- oops
aryarm Apr 24, 2022
6d3b598
require sorting of line type symbols for indexed hap files
aryarm Apr 24, 2022
6a7c5ee
add Extra object encoding extra fields in Haplotypes module
aryarm May 9, 2022
2edcb52
revise hap test data files to pass tests
aryarm May 9, 2022
122f062
Merge branch 'feat/haplotypes' of github.com:gymrek-lab/haptools into…
aryarm May 9, 2022
68ba119
add docs for new extra field declarations in header
aryarm May 9, 2022
001a28e
Preallocate np array when loading genotypes
aryarm May 9, 2022
daaeadf
retest genotypes module after changes
aryarm May 10, 2022
eb641c0
create transform subcommand
aryarm May 10, 2022
95a1619
create TestGenotypes class in testing module
aryarm May 10, 2022
2e33f4a
test variant selection in Genotypes class
aryarm May 10, 2022
e54143a
refmt with black
aryarm May 10, 2022
60bda2b
create Data.unset() to check if data is unset
aryarm May 10, 2022
56ea690
add variants param to Genotypes.load()
aryarm May 10, 2022
e830119
output from a file path in transform subcommand
aryarm May 11, 2022
c1b55ff
create Genotypes class that also stores REF/ALT
aryarm May 11, 2022
db74659
create Haplotype.transform function
aryarm May 11, 2022
b72d1d3
create Haplotypes.transform function and add tests
aryarm May 11, 2022
e084ea8
write Haplotypes to a VCF
aryarm May 11, 2022
2c1dc3c
refmt with black and get rid of HaplotypesGT class
aryarm May 11, 2022
9e83254
clean up transform docs
aryarm May 11, 2022
6bad9d8
warn against importing at the top of __main__
aryarm May 11, 2022
4384cb8
clean up duplicated code in Genotypes class
aryarm May 13, 2022
259aaee
add Genotypes._prephased attr to ignore phasing while debugging
aryarm May 13, 2022
e72f2d3
allow for discarding samples that are missing genotypes
aryarm May 13, 2022
13c06e7
add more docs and messages to Genotypes and Haplotypes classes
aryarm May 13, 2022
1410315
require GenotypeRefAlt instance as input to Haplotypes.transform
aryarm May 13, 2022
8ccb7d2
refmt with black
aryarm May 14, 2022
75e75be
prelim code for other gts readers
aryarm May 14, 2022
34a839d
Merge pull request #45 from gymrek-lab/feat/transform
aryarm May 14, 2022
15 changes: 15 additions & 0 deletions docs/api/haptools.rst
Original file line number Diff line number Diff line change
Expand Up @@ -36,3 +36,18 @@ haptools.data.phenotypes module
:undoc-members:
:show-inheritance:

haptools.data.covariates module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: haptools.data.covariates
:members:
:undoc-members:
:show-inheritance:

haptools.data.haplotypes module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: haptools.data.haplotypes
:members:
:undoc-members:
:show-inheritance:
2 changes: 1 addition & 1 deletion docs/commands/simgenotype.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Haptools simgenotype
# simgenotype

`haptools simgenotype` takes as input a reference set of haplotypes in VCF format and a user-specified admixture model. It outputs a VCF file with simulated genotype information for admixed genotypes, as well as a breakpoints file that can be used for visualization.

2 changes: 1 addition & 1 deletion docs/commands/simgenotype.rst
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
.. _subcommands-simgenotype:
.. _commands-simgenotype:

.. include:: simgenotype.md
:parser: myst_parser.sphinx_
6 changes: 3 additions & 3 deletions docs/commands/simphenotype.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Haptools simphenotype
# simphenotype

`haptools simphenotype` simulates a complex trait, taking into account haplotype-specific or local-ancestry-specific effects as well as traditional variant-level effects. It takes causal effects and genotypes as input and outputs simulated phenotypes.

Expand All @@ -19,9 +19,9 @@ haptools simphenotype \

Required parameters:

* `--vcf <string>`: A bgzipped, tabix-indexed, phased VCF file. If you are simulating local-ancestry effects, the VCF file must contain the `FORMAT/LA` tag included in output of `haptools simgenotype`. See [haptools file formats](../../docs/project_info/haptools_file_formats.rst) for more details.
* `--vcf <string>`: A bgzipped, tabix-indexed, phased VCF file. If you are simulating local-ancestry effects, the VCF file must contain the `FORMAT/LA` tag included in output of `haptools simgenotype`. See [haptools file formats](../../docs/formats/inputs.rst) for more details.

* `--hap <string>`: A bgzipped, tabix-indexed HAP file, which specifies causal effects. This is a custom format described in more detail in [haptools file formats](../../docs/project_info/haptools_file_formats.rst). The HAP format enables flexible specification of a range of effect types including traditional variant-level effects, haplotype-level effects, associations with repeat lengths at short tandem repeats, and interaction of these effects with local ancestry labels. See [Examples](#examples) below for detailed examples of how to specify effects.
* `--hap <string>`: A bgzipped, tabix-indexed HAP file, which specifies causal effects. This is a custom format described in more detail in [haptools file formats](../../docs/formats/haplotypes.rst). The HAP format enables flexible specification of a range of effect types including traditional variant-level effects, haplotype-level effects, associations with repeat lengths at short tandem repeats, and interaction of these effects with local ancestry labels. See [Examples](#examples) below for detailed examples of how to specify effects.

* `--out <string>`: Prefix to name output files.

2 changes: 1 addition & 1 deletion docs/commands/simphenotype.rst
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
.. _subcommands-simphenotype:
.. _commands-simphenotype:

.. include:: simphenotype.md
:parser: myst_parser.sphinx_
35 changes: 35 additions & 0 deletions docs/commands/transform.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
.. _commands-transform:


transform
=========

Transform a set of genotypes via a list of haplotypes. Create a new VCF containing haplotypes instead of variants.

The ``transform`` command takes as input a set of genotypes in VCF and a list of haplotypes (specified as a :doc:`.hap file </formats/haplotypes>`) and outputs a set of haplotype "genotypes" in VCF.
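
As a rough illustration of what a haplotype "genotype" could mean, the sketch below assumes that a haplotype is present on a chromosome copy when that copy carries every allele listed for the haplotype. That rule, the function name ``haplotype_genotype``, and the dictionary layout are assumptions made for illustration; they are not confirmed to match the tool's implementation.

```python
from typing import Dict, List, Tuple


def haplotype_genotype(
    hap_alleles: Dict[str, str],
    sample_alleles: Tuple[Dict[str, str], Dict[str, str]],
) -> List[int]:
    """Return one 0/1 value per chromosome copy of a phased sample.

    hap_alleles maps variant ID -> allele required by the haplotype.
    sample_alleles holds, for each of the two copies, variant ID -> allele.
    (Hypothetical data layout, chosen only for this sketch.)
    """
    return [
        int(all(copy.get(vid) == allele for vid, allele in hap_alleles.items()))
        for copy in sample_alleles
    ]


# Hypothetical data: a haplotype made of two variants
hap = {"rs1": "A", "rs2": "T"}
sample = ({"rs1": "A", "rs2": "T"}, {"rs1": "A", "rs2": "C"})
print(haplotype_genotype(hap, sample))  # [1, 0]
```

Under this assumption, each haplotype becomes a single pseudo-variant whose phased genotype records which chromosome copies carry it.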

Usage
~~~~~
.. code-block:: bash

    haptools transform \
    --region TEXT \
    --sample SAMPLE \
    --samples-file FILENAME \
    --output PATH \
    --verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
    GENOTYPES HAPLOTYPES

Examples
~~~~~~~~
.. code-block:: bash

    haptools transform tests/data/example.vcf.gz tests/data/example.hap.gz | less

Detailed Usage
~~~~~~~~~~~~~~

.. click:: haptools.__main__:main
:prog: haptools
:show-nested:
:commands: transform
168 changes: 168 additions & 0 deletions docs/formats/haplotypes.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,168 @@
.. _formats-haplotypes:


.hap
====

This document describes our custom file format specification for haplotypes: the ``.hap`` file.

This is a tab-separated file composed of different types of lines. The first field of each line is a single, uppercase character denoting the type of line. The following line types are supported.

.. list-table::
:widths: 25 25
:header-rows: 1

* - Type
- Description
* - #
- Comment
* - H
- Haplotype
* - V
- Variant

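Each line type (besides ``#``) has a set of mandatory fields, described below. Additional "extra" fields can be appended after the mandatory fields to customize the file. Column numbers in the tables below count from the field that follows the line type character.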

``#`` Comment line
~~~~~~~~~~~~~~~~~~
Comment lines begin with ``#`` and are ignored. Consecutive comment lines that appear at the beginning of the file are treated as part of the header.

Extra fields must be declared in the header. The declaration must be a tab-separated line containing the following fields:

1. Line type (ex: ``H`` or ``V``)
2. Name
3. Python format string (ex: ``d`` for int, ``s`` for string, or ``.3f`` for a float with 3 decimals)
4. Description

Note that the first field must follow the ``#`` symbol immediately (ex: ``#H`` or ``#V``).
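
The declaration rules above can be sketched in a few lines of Python. The names ``ExtraField`` and ``parse_declaration`` are hypothetical and are not part of the haptools API; the sketch only assumes the four tab-separated fields listed above.

```python
from typing import NamedTuple, Optional


class ExtraField(NamedTuple):
    line_type: str    # "H" or "V"
    name: str
    fmt: str          # Python format spec, e.g. "d", "s", ".3f"
    description: str


def parse_declaration(line: str) -> Optional[ExtraField]:
    """Parse a header line such as '#H<TAB>beta<TAB>.2f<TAB>Effect size'.

    Returns None for ordinary comment lines.
    """
    fields = line.rstrip("\n").split("\t")
    # The line type must follow '#' immediately, so the first field is '#H' or '#V'
    if len(fields) != 4 or fields[0] not in ("#H", "#V"):
        return None
    return ExtraField(fields[0][1:], fields[1], fields[2], fields[3])


decl = parse_declaration("#H\tbeta\t.2f\tEffect size in linear model")
print(decl.name, format(0.5, decl.fmt))  # beta 0.50
```

Storing the format string alongside the name lets the same declaration drive both parsing and writing of the extra field.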


``H`` Haplotype
~~~~~~~~~~~~~~~
Haplotypes contain the following attributes:

.. list-table::
:widths: 25 25 25 50
:header-rows: 1

* - Column
- Field
- Type
- Description
* - 1
- Chromosome
- string
- The contig that this haplotype belongs to
* - 2
- Start Position
- int
- The start position of this haplotype on this contig
* - 3
- End Position
- int
- The end position of this haplotype on this contig
* - 4
- Haplotype ID
- string
- Uniquely identifies a haplotype
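
A minimal sketch of reading one ``H`` line into a typed record, assuming the line type occupies the first tab-separated field and the four mandatory fields follow it. ``HaplotypeLine`` and ``parse_haplotype_line`` are hypothetical names used for illustration, and the example line is made up.

```python
from dataclasses import dataclass


@dataclass
class HaplotypeLine:
    chrom: str
    start: int
    end: int
    id: str


def parse_haplotype_line(line: str) -> HaplotypeLine:
    """Convert the mandatory fields of an 'H' line to their declared types."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "H":
        raise ValueError(f"not a haplotype line: {line!r}")
    # Any extra fields after the fourth mandatory field are ignored here
    chrom, start, end, hap_id = fields[1:5]
    return HaplotypeLine(chrom, int(start), int(end), hap_id)


hap = parse_haplotype_line("H\t1\t100\t200\tH1")
print(hap.id, hap.end - hap.start)  # H1 100
```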

``V`` Variant
~~~~~~~~~~~~~
Each variant line belongs to a particular haplotype. These lines contain the following attributes:

.. list-table::
:widths: 25 25 25 50
:header-rows: 1

* - Column
- Field
- Type
- Description
* - 1
- Haplotype ID
- string
- Identifies the haplotype to which this variant belongs
* - 2
- Start Position
- int
- The start position of this variant on its contig
* - 3
- End Position
- int
- The end position of this variant on its contig

Usually the same as the Start Position
* - 4
- Variant ID
- string
- The unique ID for this variant, as defined in the genotypes file
* - 5
- Allele
- string
- The allele of this variant within the haplotype
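
Because each ``V`` line names its haplotype in the first mandatory field, variants can be grouped under their haplotype ID while reading. The sketch below assumes tab-separated lines with the line type in the first field; ``group_variants`` and the example lines are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_variants(lines: List[str]) -> Dict[str, List[Tuple[int, int, str, str]]]:
    """Map each haplotype ID to its (start, end, variant ID, allele) tuples."""
    variants = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "V":
            continue  # skip comments and haplotype lines
        hap_id, start, end, var_id, allele = fields[1:6]
        variants[hap_id].append((int(start), int(end), var_id, allele))
    return dict(variants)


lines = [
    "H\t1\t100\t200\tH1",
    "V\tH1\t100\t100\trs1\tA",
    "V\tH1\t200\t200\trs2\tT",
]
print(group_variants(lines))
# {'H1': [(100, 100, 'rs1', 'A'), (200, 200, 'rs2', 'T')]}
```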

Examples
~~~~~~~~
You can find an example of a ``.hap`` file without any extra fields in `tests/data/basic.hap <https://github.com/gymrek-lab/haptools/blob/main/tests/data/basic.hap>`_:

.. include:: ../../tests/data/basic.hap
:literal:

You can find an example with extra fields added within `tests/data/simphenotype.hap <https://github.com/gymrek-lab/haptools/blob/main/tests/data/simphenotype.hap>`_:

.. include:: ../../tests/data/simphenotype.hap
:literal:


Compressing and indexing
~~~~~~~~~~~~~~~~~~~~~~~~
We encourage you to compress your ``.hap`` file with bgzip and index it with tabix whenever possible. Doing so reduces both disk usage and the time required to parse the file.

.. code-block:: bash

    sort -k1,4 -o file.hap file.hap
    bgzip file.hap
    tabix -s 2 -b 3 -e 4 file.hap.gz

To index the file properly, the IDs in the haplotype lines must differ from the chromosome names. You must also sort on the first field (i.e. the line type symbol) in addition to the next three.
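
The requirement that haplotype IDs differ from chromosome names can be checked before indexing. The likely reason, stated here as an assumption, is that ``tabix -s 2`` reads the second field of every line as the sequence name, and that field holds the chromosome on ``H`` lines but the haplotype ID on ``V`` lines. ``colliding_ids`` is a hypothetical helper, not part of haptools.

```python
from typing import Iterable, Set


def colliding_ids(lines: Iterable[str]) -> Set[str]:
    """Return haplotype IDs that are also used as chromosome names."""
    chroms, hap_ids = set(), set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "H":
            chroms.add(fields[1])   # chromosome
            hap_ids.add(fields[4])  # haplotype ID
    return chroms & hap_ids


# Hypothetical lines: the second haplotype is named "1", like the first chromosome
lines = [
    "H\t1\t100\t200\tH1",
    "H\t2\t5\t10\t1",
]
print(colliding_ids(lines))  # {'1'}
```

An empty result means the file satisfies this particular constraint; it does not check the sort order.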

Extra fields
~~~~~~~~~~~~
Additional fields can be appended to the ends of the haplotype and variant lines as long as they are declared in the header.

haptools extras
---------------
The following extra fields should be declared for your ``.hap`` file to be compatible with ``simphenotype``.

.. code-block::

    #H	ancestry	s	Local ancestry
    #H	beta	.2f	Effect size in linear model

..
_TODO: figure out how to tab this code block so that the tabs get copied when someone copies from it


``H`` Haplotype
+++++++++++++++

.. list-table::
:widths: 25 25 25 50
:header-rows: 1

* - Column
- Field
- Type
- Description
* - 5
- Local Ancestry
- string
- A population code denoting this haplotype's ancestral origins
* - 6
- Effect Size
- float
- The effect size of this haplotype; for use in ``simphenotype``
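
To show how the declared format strings (``s`` for ancestry, ``.2f`` for beta) could be applied when writing a line, here is a small sketch. ``format_haplotype_line`` is a hypothetical helper, and the population code ``YRI`` and all values are made up for the example.

```python
def format_haplotype_line(
    chrom: str, start: int, end: int, hap_id: str, ancestry: str, beta: float
) -> str:
    """Build a tab-separated 'H' line with the two extra fields appended."""
    # Apply the format strings declared in the header: 's' and '.2f'
    extras = [format(ancestry, "s"), format(beta, ".2f")]
    return "\t".join(["H", chrom, str(start), str(end), hap_id, *extras])


line = format_haplotype_line("1", 100, 200, "H1", "YRI", 0.5)
print(line.split("\t"))  # ['H', '1', '100', '200', 'H1', 'YRI', '0.50']
```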

``V`` Variant
+++++++++++++
No extra fields are required here.
2 changes: 1 addition & 1 deletion docs/executing/inputs.rst → docs/formats/inputs.rst
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
.. _executing-inputs:
.. _formats-inputs:


Inputs
9 changes: 6 additions & 3 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,12 +4,13 @@
:parser: myst_parser.sphinx_

.. toctree::
:caption: Execution
:name: executing
:caption: File Formats
:name: formats
:hidden:
:maxdepth: 1

executing/inputs.rst
formats/inputs.rst
formats/haplotypes.rst

.. toctree::
:caption: Commands
Expand All @@ -18,6 +19,8 @@
:maxdepth: 1

commands/simgenotype.rst
commands/simphenotype.rst
commands/transform.rst

.. toctree::
:caption: API
3 changes: 0 additions & 3 deletions docs/project_info/haptools_file_formats.rst

This file was deleted.
