SAM_Processing

Skylar Wyant edited this page Mar 9, 2018 · 33 revisions

Basic Usage

The SAM_Processing handler sorts, deduplicates, and adds read groups to the SAM files produced by Read_Mapping to create finished BAM files. It uses either Picard or SAMtools (user's choice) to process the SAM files, and it generates mapping statistics with the flagstat function of SAMtools.

To run SAM_Processing, all common variables and handler-specific variables must be defined within the configuration file. Once the variables have been defined, SAM_Processing can be submitted to a job scheduler with the following command (assuming that you are in the directory containing sequence_handling):

./sequence_handling SAM_Processing Config

Where Config is the full file path to the configuration file.

Handler-Specific Variables

The following is a list of variables that need to be defined within Config. In addition to these handler-specific variables, all common variables must be defined.

| Variable | Function | Method |
| --- | --- | --- |
| METHOD | Which program should be used to process the SAM files. Choose from 'picard' (recommended) or 'samtools'. | Picard and SAMtools |
| SP_QSUB | QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00". | Picard and SAMtools |
| MAPPED_LIST | A list of full file paths to the read-mapped samples. This is not created by Read_Mapping, but can be generated using sample_list_generator.sh. | Picard and SAMtools |
| PICARD_JAR | The full file path for the Picard jar file. | Picard |
| MAX_FILES | The maximum number of file handles that can be used. For UNIX systems, the per-process maximum number of files that can be open may be found with ulimit -n. Set slightly under this value. Default is 1000. | Picard |
| TMP | An optional variable that tells Picard where to store temporary files. Use if you've had issues running out of temp space. Otherwise, leave blank. | Picard |

Note: If using SAMtools to process the SAM files (METHOD=samtools), then the last three variables may be left blank since they are only used for processing with Picard.
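
As a rough illustration, the handler-specific part of Config for a Picard run might look like the sketch below. It assumes that the configuration file uses shell-style VARIABLE=value assignments. Every path is a hypothetical placeholder; only the variable names, the recommended SP_QSUB settings, and the MAX_FILES default come from the table above.

```shell
# Hypothetical example values; replace every path with your own.

# 'picard' (recommended) or 'samtools'
METHOD=picard

# Recommended QSub settings for batch submission
SP_QSUB="mem=22gb,nodes=1:ppn=16,walltime=24:00:00"

# Placeholder path to the list of read-mapped SAM files
MAPPED_LIST=/path/to/project/mapped_sam_list.txt

# Picard only: placeholder path to the Picard jar file
PICARD_JAR=/path/to/picard.jar

# Picard only: default is 1000; keep slightly under the value of `ulimit -n`
MAX_FILES=1000

# Picard only: optional; leave blank unless temp space has run out before
TMP=
```

With METHOD=samtools, the last three assignments could be left blank.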

Output

SAM_Processing creates sorted, deduplicated BAM files that have read groups marked. The finished .bam files and index .bai files will be generated at

${OUT_DIR}/SAM_Processing/${METHOD}/${SAMPLE}.bam

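where ${OUT_DIR} is specified in the configuration file, ${METHOD} is either SAMtools or Picard, and ${SAMPLE} is the sample name. ${PROJECT}, which appears in the statistics file name below, is also specified in the configuration file.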

Metrics about the finished BAM files, including percent mapped and percent paired, can be found under the Statistics directory. Intermediate files should be deleted automatically to save space, but the empty directory that held them may still exist. The two locations are:

${OUT_DIR}/SAM_Processing/${METHOD}/Statistics/${PROJECT}_mapping_statistics.txt
${OUT_DIR}/SAM_Processing/${METHOD}/Intermediates
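
If the leftover Intermediates directory is unwanted, one cautious way to clean it up is sketched below. remove_if_empty is a hypothetical helper, not part of sequence_handling. It relies on rmdir, which refuses to delete a directory that still contains files, so nothing is lost if intermediate files were not actually removed.

```shell
# Remove a leftover directory only if it is empty.
# Returns 0 if the directory is gone afterwards, non-zero if it still holds files.
remove_if_empty() {
    # Nothing to do if the directory does not exist.
    [ -d "$1" ] || return 0
    # rmdir succeeds only for an empty directory.
    rmdir "$1" 2>/dev/null
}

# Example call (OUT_DIR and METHOD as described above):
# remove_if_empty "${OUT_DIR}/SAM_Processing/${METHOD}/Intermediates" \
#     || echo "Intermediates still contains files"
```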

Note that a list of finished BAM files is not generated from SAM_Processing with Picard. However, sample_list_generator.sh can be used to make one.
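
The arguments that sample_list_generator.sh expects are not described on this page, so the sketch below uses plain find instead to build such a list. make_bam_list is a hypothetical helper, not part of sequence_handling, and it assumes a find that supports -maxdepth (both GNU and BSD find do).

```shell
# Write the full paths of all finished BAM files in a directory to a list file.
# Usage: make_bam_list <bam_directory> <output_list>
make_bam_list() {
    # Resolve the directory to an absolute path so the list holds full file paths.
    bam_dir=$(cd "$1" && pwd) || return 1
    # -maxdepth 1 skips the Statistics and Intermediates subdirectories;
    # sort keeps the order reproducible between runs.
    find "$bam_dir" -maxdepth 1 -type f -name '*.bam' | sort > "$2"
}

# Example call (the output file name is a placeholder):
# make_bam_list "${OUT_DIR}/SAM_Processing/Picard" "${OUT_DIR}/finished_bam_list.txt"
```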

For processing with SAMtools (not Picard), a reference genome is necessary. If your reference genome is not indexed, SAM_Processing generates an index file for the reference genome in the same directory as the reference genome. Please make sure you have write permissions for said directory. After indexing, SAM_Processing will exit, so you will need to run SAM_Processing again to process SAM files.
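
To find out ahead of time whether the first run will only index the reference, a check along the following lines may help. It is a sketch under an assumption: that the index in question is the .fai file that samtools faidx writes next to the FASTA file. check_reference is a hypothetical helper, not part of sequence_handling.

```shell
# Report whether a reference FASTA already has an index next to it.
# Assumes the index is named <reference>.fai, as written by `samtools faidx`.
# Usage: check_reference <path_to_reference_fasta>
check_reference() {
    ref=$1
    ref_dir=$(dirname "$ref")
    if [ -f "${ref}.fai" ]; then
        echo "indexed"
    elif [ -w "$ref_dir" ]; then
        # An index can be written here, but a second run will be needed afterwards.
        echo "not indexed; directory is writable"
    else
        echo "not indexed; no write permission for $ref_dir"
        return 1
    fi
}

# Example call (the path is a placeholder):
# check_reference /path/to/reference.fasta
```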

After running SAM_Processing, there are two options for further processing.

  1. Quality_Assessment can be used for more complete quality assurance.
  2. Coverage_Mapping can be used to generate coverage statistics for each BAM file.

Dependencies

SAM_Processing depends on either Picard (which requires Java) or SAMtools for processing. SAMtools is also used to generate the alignment statistics for both methods. In addition, PBS and GNU Parallel are required for basic operation. Please check the dependencies page to ensure that you are using the required version of each dependency.

Alternatively, use Realigner_Target_Creator to prepare your BAM files for ANGSD-wrapper. To learn more about using your finished BAM files to compute population genetics descriptive statistics without performing SNP calls, visit the ANGSD-wrapper GitHub repository.
