Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestions on PR: Calculate per callset stats on gnomAD v4.1 per sample variant counts #614

Merged

Conversation

jkgoodrich
Copy link
Contributor

Relies on broadinstitute/gnomad_methods#701

This reorganizes the code to move building most of the filtering expressions in gnomad_methods. There might be more that we can move over to gnomad_methods, but I think we should get these suggestions merged into the main PR first.

I added some more combinations and a general filter for most of the groups to include pass filters and capture filters, but I would like to discuss this more to see what is best for us to do.

I added metadata similar to what we do in frequencies because I think it will be easier to filter to the specific grouping that is wanted.

I also switched to doing the aggregate over an array of the filter groups to prevent a class too large error, and I think also a nice speed-up

@jkgoodrich jkgoodrich requested review from KoalaQin and matren395 May 10, 2024 16:21
@jkgoodrich jkgoodrich changed the base branch from main to dm/per_sample_counts_4_1 May 10, 2024 16:21
@matren395
Copy link
Contributor

makes sense enough, I think, but let me take another look at it in a bit.

and this isn't supposed to add functionality for finding whole callset stats, yeah? just reorganizing and filter changes yeah ?

@jkgoodrich
Copy link
Contributor Author

and this isn't supposed to add functionality for finding whole callset stats, yeah? just reorganizing and filter changes yeah ?

Yeah, I haven't gotten to that PR yet, I will add my thoughts for the callset-wide stats to that PR

Copy link
Contributor

@matren395 matren395 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense enough to me - left some comments where some more documentation is needed, and I might wanna try and run this all in a notebook tomorrow

gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
Copy link
Contributor

@KoalaQin KoalaQin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sending my first round back for now.

gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
@jkgoodrich jkgoodrich requested review from matren395 and KoalaQin May 20, 2024 23:01
Change to use the metadata combinations to determine grouping instead of doing it separately
Don't convert the summary stats array to rows, just keep as an array with associated metadata like we do for frequency
Copy link
Contributor

@KoalaQin KoalaQin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sent these back so far.

gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
@jkgoodrich jkgoodrich requested a review from KoalaQin May 23, 2024 18:17
Copy link
Contributor

@KoalaQin KoalaQin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't finished the whole thing, now at the getting meta function.

gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
@jkgoodrich jkgoodrich requested a review from KoalaQin May 23, 2024 22:26
Copy link
Contributor

@KoalaQin KoalaQin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it makes sense. I think we can move the first two functions to gnomad_methods, with a bit change on the docstring.

gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
Comment on lines 210 to 212
- The `filter_group_key_rename` parameter can be used to rename keys in the
`all_sum_stat_filters`, `common_combo_override`, or `lof_combo_override`
after creating all combinations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be:

Suggested change
- The `filter_group_key_rename` parameter can be used to rename keys in the
`all_sum_stat_filters`, `common_combo_override`, or `lof_combo_override`
after creating all combinations.
- The `filter_group_key_rename` parameter can be used to rename keys in the
generated filter combinations to a different set of keys.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, it renames any key including in the override dictionaries

@jkgoodrich jkgoodrich requested a review from KoalaQin May 24, 2024 15:34
Copy link
Contributor

@KoalaQin KoalaQin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, we're very close.

Comment on lines 193 to 194
common_filter_combos: List[List[str]] = None,
common_combo_override: Dict[str, List[str]] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of renaming these parameters? I found common_combo cute but it's not very clear.
Maybe like this:

    summary_stat_filters: Dict[str, List[str]],
    common_filter_combinations: List[List[str]] = None,
    common_filter_overrides: Dict[str, List[str]] = None,
    lof_filter_combinations: Optional[List[List[str]]] = None,
    lof_filter_overrides: Dict[str, List[str]] = None,
    filter_key_renames: Dict[str, str] = None,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am keeping combinations as combos, it's a common way to shorten combinations similar to how we shorten frequency to freq, and we use "combo" or "combos" in other areas of our code to keep names shorter. I changed to use filter in the override names

gnomad_qc/v4/assessment/calculate_per_sample_stats.py Outdated Show resolved Hide resolved
Comment on lines +514 to +515
# filter expression from a struct.
f_expr = filter_exprs.get(f"{k}_{v}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give me an example for f_struct? I get it for loftee_no_flags as f_expr.
Would f_struct for something like maf_af?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, if you look at the gnomad_methods code, it can return a BooleanExpression or a StuctExpression of BooleanExpressions, max_af, csq_set,...

jkgoodrich and others added 3 commits May 24, 2024 09:57
@jkgoodrich jkgoodrich requested a review from KoalaQin May 24, 2024 16:14
Copy link
Contributor

@KoalaQin KoalaQin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jkgoodrich jkgoodrich merged commit afed864 into dm/per_sample_counts_4_1 May 24, 2024
2 checks passed
@jkgoodrich jkgoodrich deleted the jg/move_some_stats_functionality_to_methods branch May 24, 2024 17:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants