-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minor GKS formatting changes and addition of gnomAD flags to annotation #617
Merged
klaricch
merged 35 commits into
broadinstitute:main
from
theferrit32:kf/gks-cg-annotations
Feb 28, 2024
+177
−157
Merged
Changes from all commits
Commits
Show all changes
35 commits
Select commit
Hold shift + click to select a range
79e2b1c
Add qcFilters to gks annotation
theferrit32 aecfa27
Add lowComplexityRegion to gks annotations
theferrit32 aabb709
Add heterozygous allele balance flag count to gks annotations
theferrit32 f55b0e4
Remove prior method of summing ab_hist_alt.bin_freq in hail expression
theferrit32 222c22a
Make popFreqId casing match schema. And make single string, not tuple.
theferrit32 c3b1744
Make group ids match the dot syntax of the top level popmax pop id
theferrit32 d94f3a7
Better validation for allele balance flag in gks annotation
theferrit32 a07ab99
Add gnomad-id.NO_COHORTS as popFreqId when no popmaxFAF95
theferrit32 e14c28a
Comment changes from review
theferrit32 bc2cb83
Nullable popMaxFAF95 & pop->grp
matren395 d5d89c4
Rename heterozygousAlleleBalanceFlagged to heterozygousSkewedAlleleCount
theferrit32 0a48f11
Add hemizygote count to overall and by_sex subcohorts
theferrit32 f402b30
Include hemizygote count for unsexed subcohorts
theferrit32 bd173fe
Now that we need to sometimes retrieve other group freq inside _creat…
theferrit32 9c2f6ed
Remove freq_index_key, not worth it for only a couple values
theferrit32 06c172b
Rename tpl->x
theferrit32 2b9d2a9
Add handling for joint groupmax faf in v4 tables. Add handling for lc…
theferrit32 6247036
Add 'remaining' to POP_NAMES and POP_COLORS
theferrit32 a07b71f
Fix FAF key for v3. Fix when jointGrpMaxFAF95 is null
theferrit32 35f18ae
Add docstrings to add_grpMaxFAF95* functions
theferrit32 3c86062
Move qualityMeasures to top level of cohort freq statement. Omit grpM…
theferrit32 0261dbd
Add monoallelic flag using AC/AN
theferrit32 9dbd418
Use monoallelic flag from table, which is based on raw AF
theferrit32 13a0ed2
Use version field to determine fafmax struct. Handle fafmax subsets i…
theferrit32 50797cd
Replace None alleleFrequency due to 0/0, with 0.0
theferrit32 6c60dca
Use v4 exomes coverage if table passed is v4.
theferrit32 ca9cffb
Update coverage table determination for gks annotations
theferrit32 91b67f7
Add fractionCoverage20x annotation to qualityMeasures
theferrit32 4350015
Remove duplicate 'remaining' key from ancestry.py
theferrit32 cd66874
Apply suggestions from code review
theferrit32 ed2312c
Apply formatting
theferrit32 b6c5101
Delete add_grpMaxFAF95_v3 and gks_compute_seqloc_digest.
theferrit32 2c8c99e
lint
theferrit32 0ae91f2
Apply suggestions from code review
theferrit32 caabf24
Move high level version check above attempt to read table
theferrit32 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -14,7 +14,6 @@ | |
) | ||
from gnomad.sample_qc.ancestry import POP_NAMES | ||
from gnomad.utils.annotations import add_gks_va, add_gks_vrs | ||
from gnomad.utils.vcf import FAF_POPS | ||
|
||
logging.basicConfig( | ||
format="%(asctime)s (%(name)s %(lineno)s): %(message)s", | ||
|
@@ -479,6 +478,32 @@ def release_vcf_path(data_type: str, version: str, contig: str) -> str: | |
return f"gs://gcp-public-data--gnomad/release/{version}/vcf/{data_type}/gnomad.{data_type}.{version_prefix}{version}.sites{contig}.vcf.bgz" | ||
|
||
|
||
def add_grpMaxFAF95_v4(ht: hl.Table) -> hl.Table: | ||
""" | ||
Add a grpMaxFAF95 struct with 'popmax' and 'popmax_population'. | ||
|
||
Also includes a jointGrpMaxFAF95 annotation using the v4 fafmax and joint_fafmax structures. | ||
|
||
:param ht: Input hail table. | ||
:return: Annotated hail table. | ||
""" | ||
if "gnomad" in ht.fafmax: | ||
fafmax_field = ht.fafmax.gnomad | ||
else: | ||
fafmax_field = ht.fafmax | ||
ht = ht.annotate( | ||
grpMaxFAF95=hl.struct( | ||
popmax=fafmax_field.faf95_max, | ||
popmax_population=fafmax_field.faf95_max_gen_anc, | ||
), | ||
jointGrpMaxFAF95=hl.struct( | ||
popmax=ht.joint_fafmax.faf95_max, | ||
popmax_population=ht.joint_fafmax.faf95_max_gen_anc, | ||
), | ||
) | ||
return ht | ||
|
||
|
||
def gnomad_gks( | ||
locus_interval: hl.IntervalExpression, | ||
version: str, | ||
|
@@ -510,21 +535,29 @@ def gnomad_gks( | |
:return: List of dictionaries containing VRS information | ||
(and freq info split by ancestry groups and sex if desired) for specified variant. | ||
""" | ||
# Read public_release table if no custom table provided | ||
# Obtain the high level version number and verify that it is 4. | ||
high_level_version = f"v{version.split('.')[0]}" | ||
if high_level_version != "v4": | ||
raise NotImplementedError( | ||
"gnomad_gks() is currently only implemented for gnomAD v4." | ||
) | ||
|
||
# Read public_release table if no custom table provided. | ||
if custom_ht: | ||
ht = custom_ht | ||
else: | ||
ht = hl.read_table(public_release(data_type).versions[version].path) | ||
|
||
high_level_version = f"v{version.split('.')[0]}" | ||
# Read coverage statistics if requested. | ||
coverage_version_3_0_1 = "3.0.1" # v4 genomes coverage | ||
coverage_version_4_0 = "4.0" # v4 exomes coverage | ||
|
||
# Read coverage statistics if requested | ||
if high_level_version == "v3": | ||
coverage_version = "3.0.1" | ||
# In v4, exomes have coverage in v4 coverage table, | ||
# genomes have coverage in v3 coverage table. | ||
if data_type == "genomes": | ||
coverage_version = coverage_version_3_0_1 | ||
else: | ||
theferrit32 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
raise NotImplementedError( | ||
"gnomad_gks() is currently only implemented for gnomAD v3." | ||
) | ||
coverage_version = coverage_version_4_0 | ||
|
||
coverage_ht = None | ||
|
||
|
@@ -533,12 +566,17 @@ def gnomad_gks( | |
coverage_ht = custom_coverage_ht | ||
else: | ||
coverage_ht = hl.read_table( | ||
coverage("genomes").versions[coverage_version].path | ||
coverage(data_type).versions[coverage_version].path | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @klaricch this will use |
||
) | ||
ht = ht.annotate(mean_depth=coverage_ht[ht.locus].mean) | ||
ht = ht.annotate(fraction_cov_over_20=coverage_ht[ht.locus].over_20) | ||
|
||
# Retrieve ancestry groups from the imported POPS dictionary. | ||
pops_list = list(POPS[high_level_version]) if by_ancestry_group else None | ||
pops_list = ( | ||
list(POPS["v4" if data_type == "exomes" else "v3"]) | ||
if by_ancestry_group | ||
else None | ||
) | ||
|
||
# Throw warnings if contradictory arguments are passed. | ||
if by_ancestry_group and vrs_only: | ||
|
@@ -552,43 +590,36 @@ def gnomad_gks( | |
" please also specify 'by_ancestry_group' to stratify by." | ||
) | ||
|
||
# Call and return add_gks_vrs and add_gks_va for chosen arguments. | ||
|
||
# Select relevant fields, checkpoint, and filter to interval before adding | ||
# annotations | ||
ht = ht.annotate( | ||
faf95=hl.rbind( | ||
hl.sorted( | ||
hl.array( | ||
[ | ||
hl.struct( | ||
faf=ht.faf[ht.faf_index_dict[f"{pop}-adj"]].faf95, | ||
population=pop, | ||
) | ||
for pop in FAF_POPS | ||
] | ||
), | ||
key=lambda f: (-f.faf, f.population), | ||
), | ||
lambda fafs: hl.if_else( | ||
hl.len(fafs) > 0, | ||
hl.struct( | ||
popmax=fafs[0].faf, | ||
popmax_population=hl.if_else( | ||
fafs[0].faf == 0, hl.missing(hl.tstr), fafs[0].population | ||
), | ||
), | ||
hl.struct( | ||
popmax=hl.missing(hl.tfloat), popmax_population=hl.missing(hl.tstr) | ||
), | ||
), | ||
) | ||
) | ||
# annotations. | ||
|
||
# Pull up LCR flag and make referrable in the same field. | ||
ht = ht.annotate(lcr=ht.region_flags.lcr) | ||
|
||
# Pull up allele balance histogram arrays. | ||
ht = ht.annotate(ab_hist_alt=ht.histograms.qual_hists.ab_hist_alt) | ||
|
||
keep_fields = [ht.freq, ht.info.vrs, ht.faf95] | ||
ht = add_grpMaxFAF95_v4(ht) | ||
|
||
ht = ht.annotate(in_autosome_or_par=ht.locus.in_autosome_or_par()) | ||
|
||
keep_fields = [ | ||
ht.freq, | ||
ht.info.vrs, | ||
ht.info.monoallelic, | ||
ht.grpMaxFAF95, | ||
ht.filters, | ||
ht.lcr, | ||
ht.ab_hist_alt, | ||
ht.in_autosome_or_par, | ||
] | ||
|
||
if not skip_coverage: | ||
keep_fields.append(ht.mean_depth) | ||
keep_fields.append(ht.fraction_cov_over_20) | ||
|
||
if "jointGrpMaxFAF95" in ht.row: | ||
keep_fields.append(ht.jointGrpMaxFAF95) | ||
|
||
ht = ht.select(*keep_fields) | ||
|
||
|
@@ -598,9 +629,12 @@ def gnomad_gks( | |
ht = ht.checkpoint(hl.utils.new_temp_file("vrs_checkpoint", extension="ht")) | ||
|
||
# Collect all variants as structs, so all dictionary construction can be | ||
# done in native Python | ||
# done in native Python. | ||
variant_list = ht.collect() | ||
ht_freq_index_dict = ht.freq_index_dict.collect()[0] | ||
# gnomad v4 renamed freq_index_dict keys to use underscores instead of dashes. | ||
# Use underscores for v3 as well. | ||
ht_freq_index_dict = {k.replace("-", "_"): v for k, v in ht_freq_index_dict.items()} | ||
|
||
# Assemble output dicts with VRS and optionally frequency, append to list, | ||
# then return list | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -82,6 +82,7 @@ | |
"oeu": "#6AA5CD", | ||
"onf": "#6AA5CD", | ||
"unk": "#ABB9B9", | ||
"remaining": "#ABB9B9", | ||
"": "#ABB9B9", | ||
} | ||
|
||
|
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the function is still written according to our original plan to have a v3 minor release with vrs annotations, but that's no longer the case - the highest version for v3 is 3.1.2 and that doesn't contain vrs annotations. Right now supplying either version=3.1.2/data_type=genomes or version=4.0/data_type=genomes is failing so we're only able to pull out v4 exomes. I think we should change the function to refer to v3 as little as possible (for examples, refer to data_type rather than high_level_version). I opened a PR (#671) so that this line (/~https://github.com/theferrit32/gnomad_methods/blob/803eb50ecd0df0350dca426dc5b2cf434425a401/gnomad/resources/grch38/gnomad.py#L532) will be compatible with v4 genomes. We also may want to add/change the error message to mention the version needs to be a v4 version.