Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor GKS formatting changes and addition of gnomAD flags to annotation #617

Merged
merged 35 commits into from
Feb 28, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
79e2b1c
Add qcFilters to gks annotation
theferrit32 Oct 4, 2023
aecfa27
Add lowComplexityRegion to gks annotations
theferrit32 Oct 4, 2023
aabb709
Add heterozygous allele balance flag count to gks annotations
theferrit32 Oct 4, 2023
f55b0e4
Remove prior method of summing ab_hist_alt.bin_freq in hail expression
theferrit32 Oct 4, 2023
222c22a
Make popFreqId casing match schema. And make single string, not tuple.
theferrit32 Oct 5, 2023
c3b1744
Make group ids match the dot syntax of the top level popmax pop id
theferrit32 Oct 5, 2023
d94f3a7
Better validation for allele balance flag in gks annotation
theferrit32 Oct 5, 2023
a07ab99
Add gnomad-id.NO_COHORTS as popFreqId when no popmaxFAF95
theferrit32 Oct 18, 2023
e14c28a
Comment changes from review
theferrit32 Oct 18, 2023
bc2cb83
Nullable popMaxFAF95 & pop->grp
matren395 Oct 20, 2023
d5d89c4
Rename heterozygousAlleleBalanceFlagged to heterozygousSkewedAlleleCount
theferrit32 Oct 19, 2023
0a48f11
Add hemizygote count to overall and by_sex subcohorts
theferrit32 Oct 19, 2023
f402b30
Include hemizygote count for unsexed subcohorts
theferrit32 Oct 19, 2023
bd173fe
Now that we need to sometimes retrieve other group freq inside _creat…
theferrit32 Oct 19, 2023
9c2f6ed
Remove freq_index_key, not worth it for only a couple values
theferrit32 Oct 20, 2023
06c172b
Rename tpl->x
theferrit32 Oct 20, 2023
2b9d2a9
Add handling for joint groupmax faf in v4 tables. Add handling for lc…
theferrit32 Oct 31, 2023
6247036
Add 'remaining' to POP_NAMES and POP_COLORS
theferrit32 Nov 1, 2023
a07b71f
Fix FAF key for v3. Fix when jointGrpMaxFAF95 is null
theferrit32 Nov 1, 2023
35f18ae
Add docstrings to add_grpMaxFAF95* functions
theferrit32 Nov 1, 2023
3c86062
Move qualityMeasures to top level of cohort freq statement. Omit grpM…
theferrit32 Nov 1, 2023
0261dbd
Add monoallelic flag using AC/AN
theferrit32 Nov 1, 2023
9dbd418
Use monoallelic flag from table, which is based on raw AF
theferrit32 Nov 2, 2023
13a0ed2
Use version field to determine fafmax struct. Handle fafmax subsets i…
theferrit32 Nov 14, 2023
50797cd
Replace None alleleFrequency due to 0/0, with 0.0
theferrit32 Nov 14, 2023
6c60dca
Use v4 exomes coverage if table passed is v4.
theferrit32 Nov 14, 2023
ca9cffb
Update coverage table determination for gks annotations
theferrit32 Jan 10, 2024
91b67f7
Add fractionCoverage20x annotation to qualityMeasures
theferrit32 Jan 11, 2024
4350015
Remove duplicate 'remaining' key from ancestry.py
theferrit32 Jan 11, 2024
cd66874
Apply suggestions from code review
theferrit32 Feb 7, 2024
ed2312c
Apply formatting
theferrit32 Feb 7, 2024
b6c5101
Delete add_grpMaxFAF95_v3 and gks_compute_seqloc_digest.
theferrit32 Feb 7, 2024
2c8c99e
lint
theferrit32 Feb 7, 2024
0ae91f2
Apply suggestions from code review
theferrit32 Feb 27, 2024
caabf24
Move high level version check above attempt to read table
theferrit32 Feb 27, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
122 changes: 78 additions & 44 deletions gnomad/resources/grch38/gnomad.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,6 @@
)
from gnomad.sample_qc.ancestry import POP_NAMES
from gnomad.utils.annotations import add_gks_va, add_gks_vrs
from gnomad.utils.vcf import FAF_POPS

logging.basicConfig(
format="%(asctime)s (%(name)s %(lineno)s): %(message)s",
Expand Down Expand Up @@ -479,6 +478,32 @@ def release_vcf_path(data_type: str, version: str, contig: str) -> str:
return f"gs://gcp-public-data--gnomad/release/{version}/vcf/{data_type}/gnomad.{data_type}.{version_prefix}{version}.sites{contig}.vcf.bgz"


def add_grpMaxFAF95_v4(ht: hl.Table) -> hl.Table:
"""
Add a grpMaxFAF95 struct with 'popmax' and 'popmax_population'.

Also includes a jointGrpMaxFAF95 annotation using the v4 fafmax and joint_fafmax structures.

:param ht: Input hail table.
:return: Annotated hail table.
"""
if "gnomad" in ht.fafmax:
fafmax_field = ht.fafmax.gnomad
else:
fafmax_field = ht.fafmax
ht = ht.annotate(
grpMaxFAF95=hl.struct(
popmax=fafmax_field.faf95_max,
popmax_population=fafmax_field.faf95_max_gen_anc,
),
jointGrpMaxFAF95=hl.struct(
popmax=ht.joint_fafmax.faf95_max,
popmax_population=ht.joint_fafmax.faf95_max_gen_anc,
),
)
return ht


def gnomad_gks(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the function is still written according to our original plan to have a v3 minor release with vrs annotations, but that's no longer the case - the highest version for v3 is 3.1.2 and that doesn't contain vrs annotations. Right now supplying either version=3.1.2/data_type=genomes or version=4.0/data_type=genomes is failing so we're only able to pull out v4 exomes. I think we should change the function to refer to v3 as little as possible (for examples, refer to data_type rather than high_level_version). I opened a PR (#671) so that this line (/~https://github.com/theferrit32/gnomad_methods/blob/803eb50ecd0df0350dca426dc5b2cf434425a401/gnomad/resources/grch38/gnomad.py#L532) will be compatible with v4 genomes. We also may want to add/change the error message to mention the version needs to be a v4 version.

locus_interval: hl.IntervalExpression,
version: str,
Expand Down Expand Up @@ -510,21 +535,29 @@ def gnomad_gks(
:return: List of dictionaries containing VRS information
(and freq info split by ancestry groups and sex if desired) for specified variant.
"""
# Read public_release table if no custom table provided
# Obtain the high level version number and verify that it is 4.
high_level_version = f"v{version.split('.')[0]}"
if high_level_version != "v4":
raise NotImplementedError(
"gnomad_gks() is currently only implemented for gnomAD v4."
)

# Read public_release table if no custom table provided.
if custom_ht:
ht = custom_ht
else:
ht = hl.read_table(public_release(data_type).versions[version].path)

high_level_version = f"v{version.split('.')[0]}"
# Read coverage statistics if requested.
coverage_version_3_0_1 = "3.0.1" # v4 genomes coverage
coverage_version_4_0 = "4.0" # v4 exomes coverage

# Read coverage statistics if requested
if high_level_version == "v3":
coverage_version = "3.0.1"
# In v4, exomes have coverage in v4 coverage table,
# genomes have coverage in v3 coverage table.
if data_type == "genomes":
coverage_version = coverage_version_3_0_1
else:
theferrit32 marked this conversation as resolved.
Show resolved Hide resolved
raise NotImplementedError(
"gnomad_gks() is currently only implemented for gnomAD v3."
)
coverage_version = coverage_version_4_0

coverage_ht = None

Expand All @@ -533,12 +566,17 @@ def gnomad_gks(
coverage_ht = custom_coverage_ht
else:
coverage_ht = hl.read_table(
coverage("genomes").versions[coverage_version].path
coverage(data_type).versions[coverage_version].path
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@klaricch this will use genomes version 3.0.1 coverage table for v3 and v4 genomes tables, and exomes version 4.0 coverage table for v4 exomes tables. If the input is a v3 table with data_type=exomes, line 562 will throw an exception KeyError: '3.0.1', which is expected.

)
ht = ht.annotate(mean_depth=coverage_ht[ht.locus].mean)
ht = ht.annotate(fraction_cov_over_20=coverage_ht[ht.locus].over_20)

# Retrieve ancestry groups from the imported POPS dictionary.
pops_list = list(POPS[high_level_version]) if by_ancestry_group else None
pops_list = (
list(POPS["v4" if data_type == "exomes" else "v3"])
if by_ancestry_group
else None
)

# Throw warnings if contradictory arguments are passed.
if by_ancestry_group and vrs_only:
Expand All @@ -552,43 +590,36 @@ def gnomad_gks(
" please also specify 'by_ancestry_group' to stratify by."
)

# Call and return add_gks_vrs and add_gks_va for chosen arguments.

# Select relevant fields, checkpoint, and filter to interval before adding
# annotations
ht = ht.annotate(
faf95=hl.rbind(
hl.sorted(
hl.array(
[
hl.struct(
faf=ht.faf[ht.faf_index_dict[f"{pop}-adj"]].faf95,
population=pop,
)
for pop in FAF_POPS
]
),
key=lambda f: (-f.faf, f.population),
),
lambda fafs: hl.if_else(
hl.len(fafs) > 0,
hl.struct(
popmax=fafs[0].faf,
popmax_population=hl.if_else(
fafs[0].faf == 0, hl.missing(hl.tstr), fafs[0].population
),
),
hl.struct(
popmax=hl.missing(hl.tfloat), popmax_population=hl.missing(hl.tstr)
),
),
)
)
# annotations.

# Pull up LCR flag and make referrable in the same field.
ht = ht.annotate(lcr=ht.region_flags.lcr)

# Pull up allele balance histogram arrays.
ht = ht.annotate(ab_hist_alt=ht.histograms.qual_hists.ab_hist_alt)

keep_fields = [ht.freq, ht.info.vrs, ht.faf95]
ht = add_grpMaxFAF95_v4(ht)

ht = ht.annotate(in_autosome_or_par=ht.locus.in_autosome_or_par())

keep_fields = [
ht.freq,
ht.info.vrs,
ht.info.monoallelic,
ht.grpMaxFAF95,
ht.filters,
ht.lcr,
ht.ab_hist_alt,
ht.in_autosome_or_par,
]

if not skip_coverage:
keep_fields.append(ht.mean_depth)
keep_fields.append(ht.fraction_cov_over_20)

if "jointGrpMaxFAF95" in ht.row:
keep_fields.append(ht.jointGrpMaxFAF95)

ht = ht.select(*keep_fields)

Expand All @@ -598,9 +629,12 @@ def gnomad_gks(
ht = ht.checkpoint(hl.utils.new_temp_file("vrs_checkpoint", extension="ht"))

# Collect all variants as structs, so all dictionary construction can be
# done in native Python
# done in native Python.
variant_list = ht.collect()
ht_freq_index_dict = ht.freq_index_dict.collect()[0]
# gnomad v4 renamed freq_index_dict keys to use underscores instead of dashes.
# Use underscores for v3 as well.
ht_freq_index_dict = {k.replace("-", "_"): v for k, v in ht_freq_index_dict.items()}

# Assemble output dicts with VRS and optionally frequency, append to list,
# then return list
Expand Down
1 change: 1 addition & 0 deletions gnomad/sample_qc/ancestry.py
Original file line number Diff line number Diff line change
Expand Up @@ -82,6 +82,7 @@
"oeu": "#6AA5CD",
"onf": "#6AA5CD",
"unk": "#ABB9B9",
"remaining": "#ABB9B9",
"": "#ABB9B9",
}

Expand Down
Loading
Loading