
Question about generating "depth.txt" for metabat2 #20

Closed
hongzhonglu opened this issue Mar 21, 2021 · 5 comments

@hongzhonglu

Hi Francisco,
Could I ask a question about the usage of metabat2 in the binning analysis? I found that it needs a "depth.txt" file, but I cannot get that file from the previous step ("megahit"). Do you know how to prepare "depth.txt" as the input for metabat2?

Thanks a lot!

Best regards,
Hongzhong

@franciscozorrilla
Owner

Hi Hongzhong,

There are a few ways to do this, and the best approach depends on what you are trying to do.
Two important questions to ask yourself are:

  1. Do you just want to use metabat2 for binning, or are you using all 3 binners + metaWRAP? (The latter gives the best performance.)
  2. Are you cross-mapping each set of short reads to each assembly? (Yes = best performance.)

You can look at the tutorial/demo to get an idea of how to use the metaGEM.sh parser to interface with the Snakefile for job submissions; that section in particular shows the cross-mapping step. I just uncommented 3 lines of the Snakefile in the last commit, so you should now be able to submit crossMap jobs to generate depth files exactly as described in the tutorial. For example, to submit 2 jobs with 24 cores, 120 GB of RAM, and a 24 hour max runtime each:

bash metaGEM.sh -t crossMap -j 2 -c 24 -m 120 -h 24

Note that by default this will run the Snakefile rule crossMap, which will submit one job per sample. Within each of these jobs, a for loop maps each set of paired-end reads in your dataset to the focal sample's assembly. These mapping files are then used to generate your coverage inputs for CONCOCT, MetaBAT2, and MaxBin2.
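
For reference, if you ever want to produce a depth file manually outside of the Snakefile, the table metabat2 expects is typically generated from sorted BAM files with the jgi_summarize_bam_contig_depths utility that ships with MetaBAT2. A minimal sketch, assuming bwa and samtools are available and using placeholder file names (not the actual metaGEM paths):

# Index the focal sample's assembly and map one set of paired-end reads to it
bwa index assembly.fa
bwa mem -t 24 assembly.fa sample_R1.fastq.gz sample_R2.fastq.gz | samtools sort -@ 24 -o sample.sorted.bam
samtools index sample.sorted.bam

# Summarize contig coverage across all sorted BAM files into the depth table metabat2 reads
jgi_summarize_bam_contig_depths --outputDepth depth.txt *.sorted.bam

# Run metabat2 with the assembly and the depth table
metabat2 -i assembly.fa -a depth.txt -o metabat_bins/bin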

I should mention, as a note of caution, that this approach works well for small-to-medium-sized datasets (roughly ≤ 150 medium-sized samples), but it may become impractical for large datasets, both in terms of runtime and computational load. This is because each job needs to generate N sorted BAM files to create the CONCOCT coverage table, where N = number of samples. For example, with a dataset of 300 samples and BAM files of ~10 GB each, you would need around 3 TB of temporary storage per job, and up to ~900 TB if you ran all jobs in parallel.

In the metaGEM manuscript we processed the TARA Oceans dataset, which was quite large (~246 samples). For these larger datasets we recommend running a slightly modified workflow where each individual mapping operation is submitted as its own job and mapped using kallisto. I am now working on adding support for this alternative branch of the workflow to the metaGEM.sh parser (issue #22).
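
To give a rough idea, each individual kallisto mapping job boils down to something like the following (a general sketch, not the exact metaGEM implementation; file names are placeholders):

# Build a kallisto index for the focal sample's assembly
kallisto index -i assembly.idx assembly.fa

# Pseudoalign one set of paired-end reads against that index
kallisto quant -i assembly.idx -o sampleA_vs_focal sampleA_R1.fastq.gz sampleA_R2.fastq.gz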

Please let me know if you have further questions.
Best wishes,
Francisco

@hongzhonglu
Author

hongzhonglu commented Mar 23, 2021

Hi Francisco,
Thanks a lot for your kind help! For now I just want to use metabat2 for binning, since this is my first time running a metagenome analysis, so I am starting with the simple things. I can find the steps to generate the depth.txt file in your pipeline, and I will study how to run it.

Best regards,
Hongzhong

@franciscozorrilla
Owner

Hi Hongzhong,

In that case I recommend looking at the metabat rule on line 512 of the Snakefile.
Note that its output is currently commented out, since this is a "backup"/alternative way of running metabat2.
You will need to uncomment the output of the metabat rule on line 518 so that it looks like this:

directory(f'{config["path"]["root"]}/{config["folder"]["metabat"]}/{{IDs}}/{{IDs}}.metabat-bins')

and then comment out the output of the main metabat rule, metabatCross, on line 581 so that it looks like this:

#directory(f'{config["path"]["root"]}/{config["folder"]["metabat"]}/{{IDs}}/{{IDs}}.metabat-bins')

You need to do this so that Snakemake knows exactly which rule to execute to generate your desired files.
After making sure that only your desired metabat2 rule has an uncommented output, you can submit metabat2 jobs to the cluster using:

bash metaGEM.sh -t metabat -j N_JOBS -c N_CORES -m MEMORY -h RUN_TIME
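
For example, with the same resources as the crossMap example above (2 jobs, 24 cores, 120 GB of RAM, and a 24 hour max runtime):

bash metaGEM.sh -t metabat -j 2 -c 24 -m 120 -h 24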

I have also recently expanded the metaGEM wiki, so please check it out if you want to learn more about usage and implementation of metaGEM.

Also, just so you know, from personal experience I have found that CONCOCT tends to outperform maxbin2 and metabat2 in most cases. For reference, you can look at Supplementary Figure 2 of the metaGEM paper:

[Screenshot: Supplementary Figure 2 from the metaGEM paper]

Hope this helps and let me know if you have any other questions.
Best wishes,
Francisco

@hongzhonglu
Author

Hi Francisco,
Thanks so much! This is a very good reference for me to study.

Best regards,
Hongzhong

@franciscozorrilla
Owner

Closing this due to inactivity but please reopen/comment if you have further questions.

Repository owner locked and limited conversation to collaborators May 10, 2021
@franciscozorrilla added the question (Further information is requested) label May 22, 2021
@franciscozorrilla self-assigned this May 22, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
