New ingest one snakefile #10
Closed
9be3cc9
Ingest: Copy ingest from monkeypox repo
d91180e
Ingest: Ignore ingest cache directories
j23414 0700db2
Ingest: Remove Nextclade
6ce79cb
Remove bin/scripts duplication
af25286
fix: Use curl for downloading files
j23414 1400d2f
Ingest: Replace monkeypox text and parameters with dengue
9ada261
Dengue-specific-ingest: Add dengue serotype wildcards
8c9763f
Dengue-specific-ingest: Add download filters
15f335d
Add post processing script
4f57aed
Replace post processing R script with python
938feaf
[ingest] Simplify finding strain name
j23414 f7bc33b
zstd compress output files
3fccb81
fix: makes the compress rule more generic
j23414 7ecc29a
Build: Index by genbank accession instead of duplicate strain names
d6eadd0
fix: remove entries where accession is not found
j23414 6c21a39
Ingest: Compromise by duplicating scripts
360b383
Ingest: Replace monkeypox text and parameters with dengue in scripts
j23414 b724b96
Ingest: Compromise by allowing redundant data pull by serotype
b8f3b8d
[wip] attempt at limiting concurrent deploys
j23414 755d287
Build: parameterize threads in align rule
j23414 a78ac90
docs: Add documentation on running ingest
j23414 4e28f33
fix: wildcards paired with optional.yaml
j23414 87c80a9
refactor: move post_process_metadata to rule transform
j23414 4823f72
cleanup some unused metadata columns
j23414 247b2fd
mark temp intermediate files
j23414 d008d5a
fix: annotations is a file, not a param
j23414 77d1a07
refactor: parameterize data s3 source
j23414 df024cb
Ingest: Using Snakemake modules for ingesting data into the Nextstrai…
j23414 8d3ef7d
ingest: add zstd support
joverlee521 351b3ec
Snakefile: allow configfile overrides
joverlee521 dafdd12
Update workaround for ingest `shell`
joverlee521
@@ -0,0 +1,36 @@
import pandas as pd
import json, argparse

def replace_name_recursive(node, lookup):
    if node["name"] in lookup:
        node["name"] = lookup[node["name"]]

    if "children" in node:
        for child in node["children"]:
            replace_name_recursive(child, lookup)

if __name__=="__main__":
    parser = argparse.ArgumentParser(
        description="Swaps out the strain names in the Auspice JSON with the final strain name",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument('--input-auspice-json', type=str, required=True, help="input auspice_json")
    parser.add_argument('--metadata', type=str, required=True, help="input data")
    parser.add_argument('--display-strain-name', type=str, required=True, help="field to use as strain name in auspice")
    parser.add_argument('--output', type=str, metavar="JSON", required=True, help="output Auspice JSON")
    args = parser.parse_args()

    metadata = pd.read_csv(args.metadata, sep='\t')
    name_lookup = {}
    for ri, row in metadata.iterrows():
        strain_id = row['strain']
        name_lookup[strain_id] = args.display_strain_name if pd.isna(row[args.display_strain_name]) else row[args.display_strain_name]

    with open(args.input_auspice_json, 'r') as fh:
        data = json.load(fh)

    replace_name_recursive(data['tree'], name_lookup)

    with open(args.output, 'w') as fh:
        json.dump(data, fh)
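The lookup-and-rename step in the script above can be tried on a toy input. The following is only a sketch, not the PR's code: it reads the metadata with the standard-library `csv` module instead of pandas, and it assumes that a row with an empty display name should keep its strain ID (the script above substitutes the field name itself in that case). The function names, the IDs `ACC0001`/`ACC0002`, and the tiny tree are all hypothetical.

```python
import csv
import io


def build_name_lookup(metadata_tsv, display_field, id_field="strain"):
    """Map each strain ID to its display name, falling back to the ID itself."""
    lookup = {}
    for row in csv.DictReader(metadata_tsv, delimiter="\t"):
        strain_id = row[id_field]
        display = (row.get(display_field) or "").strip()
        # Assumption: an empty display name falls back to the strain ID.
        lookup[strain_id] = display or strain_id
    return lookup


def rename_nodes(node, lookup):
    """Rename a tree node and all of its descendants in place."""
    if node.get("name") in lookup:
        node["name"] = lookup[node["name"]]
    for child in node.get("children", []):
        rename_nodes(child, lookup)


# Hypothetical two-row metadata table and a tiny tree, for illustration only.
metadata = io.StringIO(
    "strain\tstrain_original\n"
    "ACC0001\tExample_Strain_A\n"
    "ACC0002\t\n"
)
tree = {"name": "NODE_0000001", "children": [{"name": "ACC0001"}, {"name": "ACC0002"}]}

lookup = build_name_lookup(metadata, display_field="strain_original")
rename_nodes(tree, lookup)
print([child["name"] for child in tree["children"]])
```

Internal nodes such as `NODE_0000001` are absent from the lookup, so they keep their original names.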
@@ -0,0 +1,3 @@
strain_id_field: "accession"
display_strain_field: "strain_original"
# s3_src: 'https://data.nextstrain.org/files/dengue'
@@ -1,69 +1,41 @@
-DENV/SPAIN/EEB17/2009
-DENV1/FRANCE/00475/2008
-DENV1/MALAYSIA/P1244/1972
-DENV1/VIETNAM/BIDV3990/2008
-DENV1/VIETNAM/BIDV992/2006
-DENV2/AUSTRALIA/QML22/2015
-DENV2/BURKINA_FASO/DAKAR2039/1980
-DENV2/BURKINA_FASO/DAKARA2022/1980
-DENV2/COTE_D_IVOIRE/DAKAR510/1980
-DENV2/COTE_D_IVOIRE/DAKAR578/1980
-DENV2/COTE_D_IVOIRE/DAKARA1247/1980
-DENV2/GUINEA/PM33974/1981
-DENV2/HAITI/DENGUEVIRUS2HOMOSAPIENS1/2016
-DENV2/MALAYSIA/DKD811/2008
-DENV2/MALAYSIA/P81407/1970
-DENV2/MALAYSIA/SAB/2015
-DENV2/NIGERIA/IBH11208/1966
-DENV2/NIGERIA/IBH11234/1966
-DENV2/NIGERIA/IBH11664/1966
-DENV2/SENEGAL/0674/1970
-DENV2/SENEGAL/DAKAR0761/1974
-DENV2/SENEGAL/DAKAR141069/1999
-DENV2/SENEGAL/DAKAR141070/1999
-DENV2/SENEGAL/DAKARD75505/1999
-DENV2/TRINIDAD_AND_TOBAGO/NA/1953
-DENV4/MALAYSIA/P215/1975
-DENV4/MALAYSIA/P514/1975
-DENV4/MALAYSIA/P731120/1973
-D2Sab2015 # miscategorized
-QML22 # miscategorized
-DAK_Ar_A1247 # sylvatic
-Dak_Ar_2039 # sylvatic
-Dak_Ar_578 # sylvatic
-DAK_Ar_510 # sylvatic
-PM33974 # sylvatic
-Dak_Ar_A2022 # sylvatic
-Dak_Ar_141069 # sylvatic
-Dak_Ar_141070 # sylvatic
-Dak_Ar_D75505 # sylvatic
-Dak_HD_10674 # sylvatic
-Dak_Ar_D20761 # sylvatic
-IBH11664 # sylvatic
-IBH11208 # sylvatic
-IBH11234 # sylvatic
-P8_1407 # sylvatic
-P75_514 # sylvatic
-P73_1120 # sylvatic
-P75_215 # sylvatic
-DKD811 # sylvatic
-ZS01/01 # metadata issue
-Vero # cell line
-MS13002673 # too divergent
-MS11011405 # too divergent
-V43257 # too divergent
-KDC0574A2_06/02/2011 # too divergent
-00178/03 # too divergent
-00759/12 # too divergent
-00988/11 # too divergent
-01113/10 # too divergent
-01224/04 # too divergent
-01231/10 # too divergent
-01488/09 # too divergent
-01542/04 # too divergent
-dev1 # too divergent
-DKE_121 # too divergent
-SENDAK_HD_10674 # sylvatic
-DENV2_1_DAK_HD_76395 # sylvatic
-DENV3/PUERTORICO/1963/PRS_228762_AC27 # too divergent
-PR_6 # too divergent
+KY923048 # D2Sab2015 # miscategorized
+KX274130 # QML22 # miscategorized
+EF105383 # DAK_Ar_A1247 # sylvatic
+EF105382 # Dak_Ar_2039 # sylvatic
+EF105380 # Dak_Ar_578 # sylvatic
+EF105381 # DAK_Ar_510 # sylvatic
+EF105378 # PM33974 # sylvatic
+EF105386 # Dak_Ar_A2022 # sylvatic
+EF105389 # Dak_Ar_141069 # sylvatic
+EF105390 # Dak_Ar_141070 # sylvatic
+EF457904 # Dak_Ar_D75505 # sylvatic
+EF105384 # Dak_HD_10674 # sylvatic
+EF105385 # Dak_Ar_D20761 # sylvatic
+EF105388 # IBH11664 # sylvatic
+EF105387 # IBH11208 # sylvatic
+EU003591 # IBH11234 # sylvatic
+EF105379 # P8_1407 # sylvatic
+JF262779 # P75_514 # sylvatic
+JF262780 # P73_1120 # sylvatic
+EF457906 # P75_215 # sylvatic
+FJ467493 # DKD811 # sylvatic
+EF051521 # ZS01/01 # metadata issue
+MT929160 # Vero # cell line
+MH048676 # MS13002673 # too divergent
+MH048674 # MS11011405 # too divergent
+MT597439 # V43257 # too divergent
+MN448607 # KDC0574A2_06/02/2011 # too divergent
+ON046268 # 00178/03 # too divergent
+ON046278 # 00759/12 # too divergent
+ON046276 # 00988/11 # too divergent
+ON046273 # 01113/10 # too divergent
+ON046270 # 01224/04 # too divergent
+ON046274 # 01231/10 # too divergent
+ON046272 # 01488/09 # too divergent
+ON046271 # 01542/04 # too divergent
+MZ284953 # dev1 # too divergent
+MZ215848 # DKE_121 # too divergent
+MW946564 # SENDAK_HD_10674 # sylvatic
+OK605757 # DENV2_1_DAK_HD_76395 # sylvatic
+MW945427 # DENV3/PUERTORICO/1963/PRS_228762_AC27 # too divergent
+OM258630 # PR_6 # too divergent
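The accession-keyed entries above put the identifier first and carry the old strain name and the exclusion reason as trailing `#` comments. The sketch below splits that layout apart. It is an illustration only: the helper name `parse_exclude_line` is hypothetical, and how the workflow itself reads this file is not shown in the diff.

```python
def parse_exclude_line(line):
    """Split 'ID # note # note' into (id, [notes]).

    Returns None for blank lines and for lines that contain only a comment.
    """
    fields = [part.strip() for part in line.split("#")]
    identifier = fields[0]
    notes = [note for note in fields[1:] if note]
    if not identifier:
        return None
    return identifier, notes


# Three entries copied from the list above, plus a blank and a comment-only line.
example = """\
KY923048 # D2Sab2015 # miscategorized
EF051521 # ZS01/01 # metadata issue

# a comment-only line
MT929160 # Vero # cell line
"""

entries = [e for e in map(parse_exclude_line, example.splitlines()) if e]
excluded_ids = [identifier for identifier, _ in entries]
print(excluded_ids)
```

Because only the first field is treated as the identifier, strain names that contain slashes (such as `ZS01/01`) stay in the notes and cannot collide with an accession.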
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
I haven't used modules myself, so I'm interested to know how you and @joverlee521 found them to work with, especially for debugging and developing.
As a counterexample, I've been playing with an HBV build which uses our convention of separating out ingest data and code into `./ingest/`, but still runs everything from a top-level `./Snakefile` (which calls `include: "ingest/ingest.smk"`).
The advantage of modules (as used in this context) is that they continue to allow `./ingest/` to function in a completely stand-alone way, which has obvious advantages. Perhaps using modules also enforces the separation of concerns between ingest + phylo in a desirable way? The cost is increased complexity. After seeing how complex the `ncov` snakemake workflow became over the years, I'm trying to gauge whether the trade-offs here are worth it.