Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Functions to process, filter, annotate and aggregate variants by transcript expression (get the pext scores per variant) #651

Merged
merged 41 commits into from
Jan 20, 2024
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
77279e8
function to annotate variant with transcript expression
KoalaQin Dec 8, 2023
e9cfeb3
Aggregate mean_proportion
KoalaQin Dec 8, 2023
223e130
merge from base branch
KoalaQin Dec 18, 2023
373c83e
integrate pull_out_worst_from_tx_annotatate into the main function
KoalaQin Dec 18, 2023
21b5519
Edit docstring
KoalaQin Dec 19, 2023
0f898e5
Suggested changes to `tx_annotate_variants`
jkgoodrich Dec 19, 2023
ba3a1c4
Merge branch 'qh/tx_annotate_mt' of /~https://github.com/broadinstitute…
jkgoodrich Dec 19, 2023
9e9a283
A little more clean up
jkgoodrich Dec 19, 2023
9e9e60f
Merge pull request #654 from broadinstitute/jg/tx_annotate_mt_suggest…
KoalaQin Dec 20, 2023
1fb32ea
merge changes from base branch
KoalaQin Dec 20, 2023
42714b8
remove redundant content
KoalaQin Dec 20, 2023
5a09617
split into small functions: preprocess, annotate and aggregate
KoalaQin Dec 21, 2023
df86088
add TODOs
KoalaQin Dec 21, 2023
86852b1
add option when 'freq' is not in Table
KoalaQin Dec 21, 2023
81f9106
Merge branch 'qh/get_expression_proportion' of /~https://github.com/bro…
jkgoodrich Dec 22, 2023
0cf1829
Suggestions to tx_annotate_mt
jkgoodrich Dec 26, 2023
a98fc1f
Use `process_consequences` if ignore_splicing is True
jkgoodrich Jan 2, 2024
ff26879
Fix keep_csqs -> keep_genes
jkgoodrich Jan 2, 2024
0cac355
Changes from testing
jkgoodrich Jan 2, 2024
6957027
Merge branch 'qh/get_expression_proportion' of /~https://github.com/bro…
jkgoodrich Jan 2, 2024
1a1fadb
Merge branch 'qh/tx_annotate_mt' of /~https://github.com/broadinstitute…
jkgoodrich Jan 2, 2024
1fdd370
Remove extra tab in docstring
jkgoodrich Jan 2, 2024
0e2aca0
Merge pull request #655 from broadinstitute/jg/tx_annotate_mt_suggest…
KoalaQin Jan 4, 2024
3f0ca81
filter to CDS & redefine grouping
KoalaQin Jan 8, 2024
a7e791d
add CDS filtering in preprocess
KoalaQin Jan 8, 2024
c3e6cf3
merge from base branch
KoalaQin Jan 8, 2024
203f8ad
add filter to CDS function
KoalaQin Jan 8, 2024
259af2b
Merge branch 'main' into qh/tx_annotate_mt
KoalaQin Jan 8, 2024
559c5d8
apply review suggestions
KoalaQin Jan 13, 2024
6f7cdcb
add gencode v39 resource
KoalaQin Jan 13, 2024
e2366ad
add tissues to global
KoalaQin Jan 14, 2024
705be2f
delete annotate tx_ht back to variant ht because key changed
KoalaQin Jan 16, 2024
6f2a4fb
address review comments
KoalaQin Jan 16, 2024
7ce7eaa
add warnings to filter_to_cds and option to filter by amino_acids
KoalaQin Jan 19, 2024
2274f6a
option to filter by valid amino_acids in transcript_consequences
KoalaQin Jan 19, 2024
aa67ca1
address review comments
KoalaQin Jan 19, 2024
09c3a8e
combine filtering criteria with concatenation
KoalaQin Jan 20, 2024
f3764a8
rename pipeline function
KoalaQin Jan 20, 2024
baff663
changes requested by reviewer
KoalaQin Jan 20, 2024
7462733
small formatting
KoalaQin Jan 20, 2024
bcbc433
small formatting * 2
KoalaQin Jan 20, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion gnomad/utils/filtering.py
Original file line number Diff line number Diff line change
Expand Up @@ -410,7 +410,8 @@ def filter_to_gencode_cds(
filtering to CDS intervals by transcript, an additional step must be taken.
For example, when filtering VEP transcript consequences, there may be cases
where a variant is retained with this filter, but is considered outside the
CDS intervals of some of the transcripts in the VEP annotation.
CDS intervals of the transcript per the VEP predicted consequence of the
variant.

:param t: Input Table/MatrixTable to filter.
:param gencode_ht: Gencode Table to use for filtering the input Table/MatrixTable
Expand Down
31 changes: 21 additions & 10 deletions gnomad/utils/transcript_annotation.py
Original file line number Diff line number Diff line change
Expand Up @@ -230,12 +230,20 @@ def tx_filter_variants_by_csqs(
"""
Prepare a Table of variants with VEP transcript consequences for annotation.

.. note::

When `filter_to_cds` is set to True, the returned Table will be further
filtered by defined 'amino_acids' annotation, which is to filter out certain
consequences, such as 'stop_retained_variant', that are kept by all CDS
intervals but don't belong to CDS of the transcript they fall on.

:param ht: Table of variants with 'vep' annotations.
:param gencode_ht: Optional Gencode resource Table containing CDS interval
KoalaQin marked this conversation as resolved.
Show resolved Hide resolved
information. This is only used when `filter_to_cds` is set to True. Default is
None, which will use the default version of the Gencode Table resource for
the reference build of the input Table `ht`.
:param filter_to_cds: Whether to filter to CDS regions. Default is True.
:param filter_to_cds: Whether to filter to CDS regions. Default is True. And it
will be further filtered by defined 'amino_acids' annotation.
:param filter_to_genes: Optional list of genes to filter to. Default is None.
:param match_by_gene_symbol: Whether to match by gene symbol instead of gene ID.
Default is False.
Expand All @@ -247,9 +255,13 @@ def tx_filter_variants_by_csqs(
:return: Table of variants with preprocessed/filtered transcript consequences
prepared for annotation.
"""
additional_filtering_criteria = None
if filter_to_cds:
KoalaQin marked this conversation as resolved.
Show resolved Hide resolved
logger.info("Filtering to CDS regions...")
ht = filter_to_gencode_cds(ht, gencode_ht=gencode_ht)
additional_filtering_criteria = [
lambda csq: hl.is_defined(csq.amino_acids) & (csq.amino_acids != "*")
]

keep_csqs = True
if ignore_splicing:
Expand All @@ -273,9 +285,7 @@ def tx_filter_variants_by_csqs(
keep_csqs=keep_csqs,
genes=filter_to_genes,
match_by_gene_symbol=match_by_gene_symbol,
additional_filtering_criteria=(
lambda csq: (csq.amino_acids is not None) & (csq.amino_acids != "*")
),
additional_filtering_criteria=additional_filtering_criteria,
)


Expand All @@ -290,11 +300,12 @@ def tx_annotate_variants(
Annotate variants with transcript-based expression values or expression proportion from GTEx.

:param ht: Table of variants to annotate, it should contain the nested fields:
`vep.transcript_consequences`.
`{vep_root}.{vep_annotation}`.
:param tx_ht: Table of transcript expression information.
:param tissues_to_filter: Optional list of tissues to exclude from the output.
Default is None.
:param vep_root: Name used for root VEP annotation. Default is 'vep'.
:param vep_annotation: Name of annotation in `vep_root`, one of the processed
:param vep_annotation: Name of annotation under vep_root, one of the processed
consequences: ["transcript_consequences", "worst_csq_by_gene",
"worst_csq_for_variant", "worst_csq_by_gene_canonical",
"worst_csq_for_variant_canonical"]. For example, if you want to annotate
Expand Down Expand Up @@ -385,13 +396,13 @@ def perform_tx_annotation_pipeline(
"""
One-stop usage of `tx_filter_variants_by_csqs`, `tx_annotate_variants` and `tx_aggregate_variants`.

:param ht: Table of variants to annotate, it should contain at least the
following nested fields: `vep.transcript_consequences`, `freq`.
:param ht: Table of variants to annotate, it should contain the nested fields:
`{vep_root}.{vep_annotation}`.
:param tx_ht: Table of transcript expression information.
:param tissues_to_filter: Optional list of tissues to exclude from the output.
KoalaQin marked this conversation as resolved.
Show resolved Hide resolved
:param vep_root: Name used for root VEP annotation. Default is 'vep'.
:param vep_annotation: Name of annotation in 'vep' annotation, refer to the
function where it is used for more details.
:param vep_annotation: Name of annotation under vep_root. Default is
'transcript_consequences'.
:param filter_to_csqs: Optional list of consequences to filter to. Default is None.
:param additional_group_by: Optional list of additional fields to group by before
sum aggregation. If None, the returned Table will be grouped by only "locus"
Expand Down
12 changes: 6 additions & 6 deletions gnomad/utils/vep.py
Original file line number Diff line number Diff line change
Expand Up @@ -456,12 +456,12 @@ def filter_vep_to_gene_list(

.. note::

Filtering to a list of genes by their gene_id or gene_symbol will filter to
Filtering to a list of genes by their 'gene_id' or 'gene_symbol' will filter to
all variants that are annotated to the gene, including
['upstream_gene_variant', 'downstream_gene_variant'], which will not be the
same as if you filter to a gene interval. If you only want variants inside
certain gene boundaries and filter faster, you can filter the variant first to an
interval list and further filter to specific genes.
certain gene boundaries and a faster filter, you can first filter `t` to an
interval list and then apply this filter.

:param t: Input Table or MatrixTable.
:param genes: Genes of interest to filter VEP transcript consequences to.
Expand Down Expand Up @@ -720,7 +720,7 @@ def filter_vep_transcript_csqs(
genes: Optional[List[str]] = None,
keep_genes: bool = True,
match_by_gene_symbol: bool = False,
additional_filtering_criteria: Optional[Callable] = None,
additional_filtering_criteria: Optional[List[Callable]] = None,
) -> Union[hl.Table, hl.MatrixTable]:
"""
Filter VEP transcript consequences based on specified criteria, and optionally filter to variants where transcript consequences is not empty after filtering.
Expand Down Expand Up @@ -761,7 +761,7 @@ def filter_vep_transcript_csqs(
Default is True.
:param match_by_gene_symbol: Whether to match values in `genes` to VEP transcript
consequences by 'gene_symbol' instead of 'gene_id'. Default is False.
:param additional_filtering_criteria: Optional filtering criteria to apply to.
:param additional_filtering_criteria: Optional list of additional filtering criteria to apply to the VEP transcript consequences.
KoalaQin marked this conversation as resolved.
Show resolved Hide resolved
:return: Table or MatrixTable filtered to specified criteria.
"""
if not synonymous and not (canonical or mane_select) and not filter_empty_csq:
Expand Down Expand Up @@ -801,7 +801,7 @@ def filter_vep_transcript_csqs(
criteria.append(lambda csq: ~genes.contains(csq[gene_field]))
if additional_filtering_criteria is not None:
logger.info("Filtering to variants with additional criteria...")
criteria = criteria + [additional_filtering_criteria]
criteria = criteria + additional_filtering_criteria

transcript_csqs = transcript_csqs.filter(lambda x: combine_functions(criteria, x))
is_mt = isinstance(t, hl.MatrixTable)
Expand Down