Collection of Python scripts for parsing/analysis of reduced representation sequencing data (e.g. RAD-seq, nextRAD). While many of the scripts are functional, some still need considerable cleaning up and more thorough testing - and this repository therefore very much represents a work in progress.
These scripts all require Python 3, with some requiring additional packages (BioPython and NumPy - both of which can be easily installed using the Miniconda or Anaconda installers, or PyVCF - which can be installed using e.g. pip install PyVCF
). Usage information for each script can be obtained using the -h
or --help
flag (e.g. python3 name_of_script.py -h
, or is also listed in this README.
This documentation is dynamically generated using the listed README_compile.py script, extracting purpose, usage and links to example files from the argparse information of each script.
vcf_remap2genome.py - script to remap VCF from de novo RAD assembly back to a reference genome
pyrad_find_caps_markers.py - search PyRAD output file for diagnostic CAPS loci that can distinguish two groups (or one group and all other samples)
vcf_clone_detect.py - script to facilitate identification of clones in dataset
vcf_remap.py - Remaps variants in VCF format to new CHROM and POS as obtained through the
mapping_get_bwa_matches.py
scripts. Positions are rough estimates because:
(1) new position is simply an offset of the mapping position + 0-based
position in locus (and e.g. do not take into account reference insertions),
(2) one standard contig length is used to determine pos in reverse mapping
reads (flag 16). [File did not pass PEP8 check]
usage: vcf_remap.py [-h] vcf_file mapping_file locus_length
positional arguments:
vcf_file vcf input file
mapping_file file with mapping results
locus_length length of query loci
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf.
vcf_missing_data.py - Outputs list of missing data (# and % of SNPs) for each sample in VCF, to identify poor-performing samples to eliminate prior to SNP filtering. Takes vcf_filename as argument. Outputs to STDOUT (no output file). [File did not pass PEP8 check]
usage: vcf_missing_data.py [-h] vcf_file
positional arguments:
vcf_file input file with SNP data (`.vcf`)
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf.
vcf_rename_loci.py - Renames CHROMS in .vcf
file according to list with old/new names, and only
outputs those loci that are listed. [File did not pass PEP8 check]
usage: vcf_rename_loci.py [-h] vcf_file locusnames_file
positional arguments:
vcf_file input file with SNP data (`.vcf`)
locusnames_file text file (tsv or csv) with old and new name for each locus
(/CHROM)
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf, locusnames_file.txt.
vcf_find_clones.py - Script compares the allelic similarity of individuals in a VCF, and outputs
all pairwise comparisons. This can be used to detect potential clones based on
percentage match. Note: highest matches can be assessed in the output file by
using $ sort -rn --key=5 output_file.txt | head -n 50
in the terminal. [File did not pass PEP8 check]
usage: vcf_find_clones.py [-h] vcf_file
positional arguments:
vcf_file input file with SNP data (`.vcf`)
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf.
vcf_get_chrom_pos_from_number.py - Translates sequential marker numbers back to CHROM/POS from original .vcf
file. Several programs only allow for integers to identify markers, this
script is to restore the original CHROM/POS for markers that were identified. [File did not pass PEP8 check]
usage: vcf_get_chrom_pos_from_number.py [-h] vcf_file markernumbers_file
positional arguments:
vcf_file input file with SNP data (`.vcf`)
markernumbers_file text file with SNP numbers that were identified
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf, markernumbers_file.txt.
vcf_spider.py - Wrapper for PGDspider on Mac OS to convert .vcf
files to various formats.
Note : set PGDSPIDER_PATH constant before use, and make script executable in
terminal with $ chmod +x vcf_spider.py
.
usage: vcf_spider.py [-h] vcf_filename pop_filename output_filename
positional arguments:
vcf_filename original vcf file
pop_filename pop filename (.txt)
output_filename output filename (extension used to determine file format
(.genepop, .bayescan, .structure or .arlequin)
optional arguments:
-h, --help show this help message and exit
vcf_clone_detect.py - Attempts to identify groups of clones in a dataset. The script (1) conducts
pairwise comparisons (allelic similarity) for all individuals in a .vcf
file, (2) produces a histogram of genetic similarities, (3) lists the highest
matches to assess for a potential clonal threshold, (4) clusters the groups of
clones based on a particular threshold (supplied or roughly inferred), and (5)
lists the clonal individuals that can be removed from the dataset (so that one
individual with the least amount of missing data remains). If optional popfile
is given, then clonal groups are sorted by population. Note: Firstly, the
script is run with a .vcf
file and an optional popfile to produce an output
file (e.g. python3 vcf_clone_detect.py.py --vcf vcf_file.vcf --pop pop_file.txt --output compare_file.csv
). Secondly, it can be rerun using the
precalculated similarities under different thresholds (e.g. python3 vcf_clone_detect.py.py --input compare_file.csv --threshold 94.5
) [File did not pass PEP8 check]
usage: vcf_clone_detect.py [-h] [-v vcf_file] [-p pop_file] [-i compare_file]
[-o compare_file] [-t threshold]
optional arguments:
-h, --help show this help message and exit
-v vcf_file, --vcf vcf_file
input file with SNP data (`.vcf`)
-p pop_file, --pop pop_file
text file (tsv or csv) with individuals and
populations (to accompany `.vcf` file)
-i compare_file, --input compare_file
input file (csv) with previously calculated pairwise
comparisons (using the `--outputfile` option)
-o compare_file, --output compare_file
output file (csv) for all pairwise comparisons (can
later be used as input with `--inputfile`)
-t threshold, --threshold threshold
manual similarity threshold (e.g. `94.5` means at
least 94.5 percent allelic similarity for individuals
to be considered clones)
vcf_minrep_filter_abs.py - Filters .vcf
file for SNPs that are genotyped for a minimum number of
individuals in each of the populations (rather than overall proportion of
individuals). This can help to guarantee a minimum number of individuals to
calculate population-based statistics, and eliminate loci that might be
suffering from locus drop-out in particular populations. Note: only
individuals that are listed in popfile are taken into account to determine
number of individuals genotyped (but all indivs are outputted). [File did not pass PEP8 check]
usage: vcf_minrep_filter_abs.py [-h]
vcf_file pop_file min_proportion
output_filename
positional arguments:
vcf_file input file with SNP data (`.vcf`)
pop_file text file (tsv or csv) with individuals and populations
min_proportion proportion of individuals required to be genotyped in each
population for a SNP to be included (e.g `0.8` for 80
percent of individuals)
output_filename name of output file (`.vcf`)
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf, pop_file.txt.
vcf_minrep_filter.py - Filters .vcf
file for SNPs that are genotyped for a minimum proportion of
individuals in each of the populations (rather than overall proportion of
individuals). This can help to guarantee a minimum number of individuals to
calculate population-based statistics, and eliminate loci that might be
suffering from locus drop-out in particular populations. Note: only
individuals that are listed in popfile are taken into account to determine
proportion of individuals genotyped (but all indivs are outputted). [File did not pass PEP8 check]
usage: vcf_minrep_filter.py [-h]
vcf_file pop_file min_proportion output_filename
positional arguments:
vcf_file input file with SNP data (`.vcf`)
pop_file text file (tsv or csv) with individuals and populations
min_proportion proportion of individuals required to be genotyped in each
population for a SNP to be included (e.g `0.8` for 80
percent of individuals)
output_filename name of output file (`.vcf`)
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf, pop_file.txt.
vcf_remove_chrom.py - Excludes those loci (/CHROMs) in .vcf
that are listed in exclusion list.
Also outputs a logfile with loci that were listed but not present in .vcf
. [File did not pass PEP8 check]
usage: vcf_remove_chrom.py [-h] vcf_file exclusion_file
positional arguments:
vcf_file input file with SNP data (`.vcf`)
exclusion_file text file loci to be excluded
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf, exclusion_file.txt.
vcf_remap2genome.py - Remap VCF to genome. * Currently only works with single-line FASTA files (but easy to change) * Works best when using the optional alignment output in sam2tsv [File did not pass PEP8 check]
usage: vcf_remap2genome.py [-h] [-v vcf_file] [-f fasta_file] [-t samtsv_file]
[-o output_vcf_file] [-pid pid_threshold]
optional arguments:
-h, --help show this help message and exit
-v vcf_file, --vcf vcf_file
original vcf file, where the CHROM correspond to the
sequence name in the supplied fasta (perfect match),
and the POS the position within that sequence (also
counting gaps if present).
-f fasta_file, --fasta fasta_file
fasta file representing a single sequence for each
locus/CHROM in the original vcf (including gaps if
present in the original aligment that the fasta is
based on). Note that these gaps need to be removed
before mapping these sequences back to the genome with
bwa mem, but need to be present in this file.
-t samtsv_file, --t samtsv_file
The sam file converted to tsv. The sam file represents
the mapping outcome of the supplied fasta (but then
without gaps) to the genome with bwa mem, e.g.
through: bwa mem ref_genome fasta_file_no_gaps.fa >
samfile.sam. This samfile then needs to be converted
to a tsv, using sam2tsv from the jVarKit toolkit
(http://lindenb.github.io/jvarkit/Sam2Tsv.html).
-o output_vcf_file, --output_vcf output_vcf_file
remapped vcf file with genome scaffold/chroms and
positions within those scaffold/chroms.
-pid pid_threshold, --pid pid_threshold
optional pid alignment threshold, to exclude loci
aligning to the genome with a percent id (PID) score
below the indicated value.
vcf_append_simulated_crosses.py - Generates artificial crosses between individuals from two indicated (in a
popfile) parentalgroups, and appends crossed individuals to .vcf
file. Note:
individual SNPs on a single CHROM are independently crossed as if they are not
physically linked - therefore only use when subsampling a single SNP / CHROM. [File did not pass PEP8 check]
usage: vcf_append_simulated_crosses.py [-h] [--n_crosses n_crosses]
[--prefix prefix] [--parentalnames]
vcf_file pop_file
positional arguments:
vcf_file input file with SNP data (`.vcf`)
pop_file text file (tsv or csv) with the names of the
individuals used for the simulated crosses, and in the
second column which parental population they belong to
(any name can be chosen - as long as there are exactly
two distinct values)
optional arguments:
-h, --help show this help message and exit
--n_crosses n_crosses, -n n_crosses
number of crosses to simulate (should be no higher
than the number of individuals in each of the two
parental populations)
--prefix prefix prefix for crosses (used only if --parentalnames is
not set)
--parentalnames set flag to use names of both parents for cross
Example input file(s): vcf_file.vcf, pop_file.txt.
vcf2hapmatrix.py - Converts .vcf
file to Tag Haplotype Matrix (with Chrom), with order of
individuals as indicated in optional file. Note: not yet properly tested. SNPs
of same CHROM (first column) in .vcf
should be grouped
together/sequentially, and all individuals need to be listed in order_file. [File did not pass PEP8 check]
usage: vcf2hapmatrix.py [-h] [-o order_file] vcf_file
positional arguments:
vcf_file input file with SNP data (`.vcf`)
optional arguments:
-h, --help show this help message and exit
-o order_file, --order_file order_file
text file with preferred output order of individuals
Example input file(s): vcf_file.vcf.
vcf_genotype_freqs.py - Outputs genotype frequencies for specific SNPs in each population, organised by group. [File did not pass PEP8 check]
usage: vcf_genotype_freqs.py [-h] vcf_file factor_file SNP_file
positional arguments:
vcf_file input file with SNP data (`.vcf`)
factor_file text file (tsv or csv) with individuals, their population
assignment and group assignment
SNP_file text file (tsv or csv) with CHROM/POS of each SNP to be
outputted
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf, factor_file.txt.
popfile_match_vcf.py - Cleans up popfile by eliminating any individuals that are not in .vcf
file. [File did not pass PEP8 check]
usage: popfile_match_vcf.py [-h] vcf_file pop_file
positional arguments:
vcf_file input file with SNP data (`.vcf`)
pop_file text file (tsv or csv) with individuals and populations
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf, pop_file.txt.
vcf2introgress.py - Converts .vcf
file to INTROGRESS input files (4 files). Splits data into
three categories: parental1, parental2 and admixed based on cluster assignment
(provided in separate file; e.g. STRUCTURE output) and given threshold, and
outputs data for loc that exceed a certain frequency difference between the
two 'parental' categories. Note: not yet properly tested. also see similar
vcf_ancestry_matrix.py
script. I use the formatted CLUMPP output
(clumpp_K2.out.csv
) from the structure_mp
wrapper as assignment file (max.
of 2 clusters). [File did not pass PEP8 check]
usage: vcf2introgress.py [-h] [--include]
vcf_file assignment_file assign_cut_off freq_cut_off
output_prefix
positional arguments:
vcf_file input file with SNP data (`.vcf`)
assignment_file text file (tsv or csv) with assignment values for each
individual (max. 2 clusters); e.g. a reformatted STRUCTURE
output file
assign_cut_off min. assignment value for an individual to be included in
the allele frequency calculation (i.e. putative purebred)
freq_cut_off min. allele frequency difference between the 2 clusters for
a locus to be included in the output
output_prefix prefix for output files
optional arguments:
-h, --help show this help message and exit
--include, -i set this flag if parental pops need to be included in
output
Example input file(s): vcf_file.vcf, assignment_file.csv.
vcf_pos_count.py - Counts SNP occurrence frequency for each POS in .vcf
file. [File did not pass PEP8 check]
usage: vcf_pos_count.py [-h] vcf_file
positional arguments:
vcf_file input file with SNP data (`.vcf`)
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf.
vcf_reference_loci.py - Lists all loci (using CHROM column) in .vcf
that are genotyped for at least
one of the indicated samples/individuals. This can be used to reduce the
dataset to loci matching an included reference (e.g. aposymbiotic) samples.
Note: vcf
can subsequently be filtered by using the output as inclusion_file
for vcf_include_chrom.py
. [File did not pass PEP8 check]
usage: vcf_reference_loci.py [-h]
vcf_file
[reference_samples [reference_samples ...]]
positional arguments:
vcf_file input file with SNP data (`.vcf`)
reference_samples sample(s) against which the remainder of the dataset will
be compared
optional arguments:
-h, --help show this help message and exit
vcf_contrast_samples.py - Contrast all samples in .vcf
file against certain reference sample(s) (e.g.
outgroup samples), to assess for fixed / private alleles. [File did not pass PEP8 check]
usage: vcf_contrast_samples.py [-h]
vcf_file
[reference_samples [reference_samples ...]]
positional arguments:
vcf_file input file with SNP data (`.vcf`)
reference_samples sample(s) against which the remainder of the dataset will
be compared
optional arguments:
-h, --help show this help message and exit
vcf_gdmatrix.py - Calculates Genetic Distance (Hamming / p-distance) for each pair of
individuals in a .vcf
file and outputs as matrix. Popfile is supplied to
indicate order in matrix. [File did not pass PEP8 check]
usage: vcf_gdmatrix.py [-h] vcf_file pop_file
positional arguments:
vcf_file input file with SNP data (`.vcf`)
pop_file text file (tsv or csv) with individuals and populations
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf, pop_file.txt.
vcf_single_snp.py - Reduces .vcf
file to a single 'random' SNP per locus/chrom. Use for analyses
that require SNPs that are not physically linked. (although note that they of
course still may be - particularly so when dealing with short loci) [File did not pass PEP8 check]
usage: vcf_single_snp.py [-h] [-d distance_threshold] vcf_file
positional arguments:
vcf_file input file with SNP data (`.vcf`)
optional arguments:
-h, --help show this help message and exit
-d distance_threshold, --distance distance_threshold
optional custom distance threshold between SNPs
(default is 2500; not relevant for short de-novo loci
not mapped to reference scaffolds)
Example input file(s): vcf_file.vcf.
vcf_splitfst.py - Filter original SNP dataset (.vcf
) for a particular Fst percentile bin.
Note: order in Fst file needs to correspond with (.vcf
) file, currently set
(see script CONSTANTS) to work with LOSITAN output file, and output filename
automatically generated from percentile bins. [File did not pass PEP8 check]
usage: vcf_splitfst.py [-h] vcf_file fst_file min_percentile max_percentile
positional arguments:
vcf_file input file with SNP data (`.vcf`)
fst_file text file (tsv or csv) with Fst values for each SNP (same
order as in vcf) - currently set to work with LOSITAN output
file
min_percentile min. Fst value for a SNP to be included
max_percentile max. Fst value for a SNP to be included
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf, fst_file.txt.
vcf_ragoo_order.py - Adds ordering indexes to a file with CHROM and POS columns, based on RaGOO mapping to a chromosome-level assembly. These ordering indexes (chromosome, scaffold and SNP order) can be used for Manhattan-style plots, without having to remap coordinates of the vcf. CHROM and POS columns should be the first and second column in the input file. In the output, three extra columns are inserted (after the CHROM and POS columns) that correspond to the chromosome ID, scaffold order on that chromosome, and position order within the scaffold (based on the RaGOO mapping orientation). [File did not pass PEP8 check]
usage: vcf_ragoo_order.py [-h] filename ragoo_orderings_path
positional arguments:
filename input file
ragoo_orderings_path path with ragoo orderings files
optional arguments:
-h, --help show this help message and exit
popfile_from_vcf.py - Creates tab-separated popfile from .vcf
, using a subset of the sample name
as population. For example, to use the substring MGD
from AFMGD6804H
as
population designation, run script as python3 popfile_from_vcf vcf_file 3 5
. [File did not pass PEP8 check]
usage: popfile_from_vcf.py [-h] vcf_file start_pos end_pos
positional arguments:
vcf_file input file with SNP data (`.vcf`)
start_pos character start position in sample name to be used for
population name (one-based)
end_pos character end position in sample name to be used for population
name (one-based)
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf.
vcf_read_trim.py - Removes SNPs from .vcf
that are above a certain POS value. [File did not pass PEP8 check]
usage: vcf_read_trim.py [-h] vcf_file highest_pos
positional arguments:
vcf_file input file with SNP data (`.vcf`)
highest_pos max. POS value allowed in `.vcf`
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf.
vcf_ancestry_matrix.py - Creates a genotype matrix for loci that have a large allele frequency
difference between two genetic clusters (as identified with e.g. STRUCTURE).
The script takes both a .vcf
file and a text file with the assignment
probabilities as input. An assignment threshold (e.g. 0.98) needs to be
supplied to identify the reference individuals in the two clusters, and an
allele frequency cut-off needs to be supplied to identify divergent loci. An
optional file can be supplied with a list of loci that need to be included
regardless (e.g. previously identified outliers). Note: I use the formatted
CLUMPP output (clumpp_K2.out.csv
) from the structure_mp
wrapper as
assignment file (max. of 2 clusters). [File did not pass PEP8 check]
usage: vcf_ancestry_matrix.py [-h] [--include inclusion_file]
vcf_file assignment_file assign_cut_off
freq_cut_off
positional arguments:
vcf_file input file with SNP data (`.vcf`)
assignment_file text file (tsv or csv) with assignment values for each
individual (max. 2 clusters); e.g. a reformatted
STRUCTURE output file
assign_cut_off min. assignment value for an individual to be included
in the allele frequency calculation (i.e. putative
purebred
freq_cut_off min. allele frequency difference between the 2
clusters for a locus to be included in the output
optional arguments:
-h, --help show this help message and exit
--include inclusion_file, -i inclusion_file
text file with loci to be included in output
regardless of allele frequency differences
Example input file(s): vcf_file.vcf, assignment_file.csv.
vcf_include_chrom.py - Retains only those loci (/CHROMs) in .vcf
that are given in file. [File did not pass PEP8 check]
usage: vcf_include_chrom.py [-h] vcf_file inclusion_file
positional arguments:
vcf_file input file with SNP data (`.vcf`)
inclusion_file text file with loci to be retained
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf, inclusion_file.txt.
vcf_afd_filter.py - Calculate allele frequency differentials between groups, and flag those loci that have AFDs exceeding threshold between all subgroups of those groups. Note: so far only used with 3 groups - yet to be tested for more. [File did not pass PEP8 check]
usage: vcf_afd_filter.py [-h] vcf_file group_file afd_threshold
positional arguments:
vcf_file input file with SNP data (`.vcf`)
group_file text file (tsv or csv) separating individuals (first column)
into groups (second column)
afd_threshold allele frequency differential threshold
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf.
vcf2tess.py - Converts .vcf
file to TESS input files (genotypes and coordinates). Requires
a popfile and a file with coordinates for each population (decimal ]degrees),
a then simulates individual coordinates by adding a certain amount of noise.
Note: outputs individuals in the same order as popfile. [File did not pass PEP8 check]
usage: vcf2tess.py [-h] [--noise noise]
vcf_file pop_file coord_file output_prefix
positional arguments:
vcf_file input file with SNP data (`.vcf`)
pop_file text file (tsv or csv) with individuals and
populations
coord_file text file (tsv or csv) with populations and their lats
and longs (in decimal degrees)
output_prefix name prefix for output files
optional arguments:
-h, --help show this help message and exit
--noise noise, -n noise
max. amount of noise to be added (default = 1e-10)
Example input file(s): vcf_file.vcf, pop_file.txt.
vcf_rename_samples.py - Renames sample in .vcf
file according to list with old/new names; also
outputs samples that are not listed in name_change file. [File did not pass PEP8 check]
usage: vcf_rename_samples.py [-h] vcf_file samplenames_file
positional arguments:
vcf_file input file with SNP data (`.vcf`)
samplenames_file text file (tsv or csv) with old and new name for each
sample (not all samples have to be listed)
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf, samplenames_file.txt.
pyradclust2fasta.py - Creates one large FASTA from all PyRAD clustS files in directory. Only outputs clusters that exceed size threshold (min. number of sequences in cluster). First sequence of each cluster is outputted (together with size of overall cluster - note: not of that specific sequence). Prints the outputted and total number of clusters to STDOUT. [File did not pass PEP8 check]
usage: pyradclust2fasta.py [-h] path cluster_threshold output_file
positional arguments:
path path that contains PyRAD `.clustS` files
cluster_threshold minimum size of cluster to be included
output_file name of output FASTA file
optional arguments:
-h, --help show this help message and exit
pyrad_find_caps_markers.py - Search PyRAD output file for diagnostic CAPS loci that can distinguish two groups (or one group and all other samples). [File did not pass PEP8 check]
usage: pyrad_find_caps_markers.py [-h] [-i pyrad_file] [-g group_file]
[-r re_site_file] [-g1 group1] [-g2 group2]
[-m min_samples] [-o output_folder]
optional arguments:
-h, --help show this help message and exit
-i pyrad_file, --input pyrad_file
input pyrad .loci or .alleles file
-g group_file, --groups group_file
text file (tsv or csv) separating individuals (first
column) into groups (second column)
-r re_site_file, --re re_site_file
text file (tsv or csv) listing restriction site names
(first column) and their recognition sequences (second
column)
-g1 group1, --group1 group1
first group that is targeted in marker search
-g2 group2, --group2 group2
second group (optional) that is targeted in marker
search (if none given, group1 is contrasted against
all other samples in group_file
-m min_samples, --min_samples min_samples
minimum number of genotyped samples in each group for
a marker to be considered
-o output_folder, --output output_folder
name of output folder for individual seqs of each
diagnostic locus
pyrad_shared_loci.py - Calculates the mean/min/max of shared loci for each sample [File did not pass PEP8 check]
usage: pyrad_shared_loci.py [-h] loci_file
positional arguments:
loci_file (i)pyrad .loci file
optional arguments:
-h, --help show this help message and exit
Example input file(s): loci_file.txt.
pyrad2fasta.py - Create FASTA file with a representative sequence (using first sample) or all
sequences (when --all_seqs flag is set) for each locus in pyRAD/ipyrad .loci
or .allele
file. [File did not pass PEP8 check]
usage: pyrad2fasta.py [-h] [-a] pyrad_file
positional arguments:
pyrad_file PyRAD allele file (`.loci` or `.allele`)
optional arguments:
-h, --help show this help message and exit
-a, --all_seqs set flag to output all sequences
Example input file(s): pyrad_file.loci.
pyrad2concat_fasta.py - Concatenates PyRAD/ipyrad sequences from .loci
file for each individual and
outputs as FASTA (order by popfile). Note: missing data are filled with gaps
(N
) [File did not pass PEP8 check]
usage: pyrad2concat_fasta.py [-h] pyrad_file pop_file
positional arguments:
pyrad_file PyRAD file (`.loci`)
pop_file text file (tsv or csv) with individuals and populations
optional arguments:
-h, --help show this help message and exit
Example input file(s): pyrad_file.loci, pop_file.txt.
pyrad_filter.py - Filters PyRAD output file (.loci
) for those loci (1) present or absent
(using --exclude flag) in supplied list, (2) genotyped for at least X number
of samples, and (3) with at least Y number of informative sites. Note: can
also be used for .alleles
file but then 2x the number of samples should be
given (assuming diploid individual). [File did not pass PEP8 check]
usage: pyrad_filter.py [-h] [-e]
pyrad_file loci_file sample_threshold snp_threshold
positional arguments:
pyrad_file PyRAD file (`.loci`)
loci_file text file with PyRAD loci to be included
sample_threshold min. number of samples genotyped for a locus to be
included
snp_threshold min. number of SNPs for a locus to be included
optional arguments:
-h, --help show this help message and exit
-e, --exclude use the loci in loci_file as exclusion list
Example input file(s): pyrad_file.loci, loci_file.txt.
pyrad_include.py - Reduces (i)pyrad file to only those samples listed in supplied text file. [File did not pass PEP8 check]
usage: pyrad_include.py [-h] loci_file inclusion_file min_samples
positional arguments:
loci_file samples input file
inclusion_file text file with names of samples to be included
min_samples minimum number of samples for locus to be included
optional arguments:
-h, --help show this help message and exit
Example input file(s): loci_file.txt, inclusion_file.txt.
pyrad_trim.py - Trims sequence length in PyRAD/ipyrad .alleles
or .loci
file. [File did not pass PEP8 check]
usage: pyrad_trim.py [-h] pyrad_file seq_length
positional arguments:
pyrad_file PyRAD allele file (`.loci` or `.allele`)
seq_length length to which sequences are trimmed
optional arguments:
-h, --help show this help message and exit
Example input file(s): pyrad_file.loci.
pyrad2migrate.py - Converts PyRAD .allele
file to migrate-n input file (population designated
indicated in supplied popfile). Note: only appropriate for PyRAD .allele
file (not .loci
). [File did not pass PEP8 check]
usage: pyrad2migrate.py [-h] allele_file pop_file
positional arguments:
allele_file PyRAD allele file (.allele)
pop_file text file (tsv or csv) with individuals and populations
optional arguments:
-h, --help show this help message and exit
Example input file(s): pop_file.txt.
fastq_barcodes2samplenames.py - Renames barcoded .fastq.gz files in a folder to sample/individual names. Takes relative or absolute path as first argument (e.g. 'samples'; without trailing slash) and a text file as second argument. The latter should be a tab- separated text file, with the barcode in the first column, and the new sample/individual name in the second column. The script conducts a trial run first, listing the files to be renamed, and then asks for confirmation before doing the actual renaming. [File did not pass PEP8 check]
usage: fastq_barcodes2samplenames.py [-h] path barcode_file
positional arguments:
path path (for current directory use `.`)
barcode_file text file (tsv or csv) with barcodes and sample names
optional arguments:
-h, --help show this help message and exit
fastq_seqcount.py - Outputs number of reads for each fastq.gz sample to text file, and prints
mean/min/max to STDOUT. Note: Does not work with all FASTQ formats, and
correct depending on OS zcat
or gzcat
needs to be set in COMPRESS_UTIL
constant. [File did not pass PEP8 check]
usage: fastq_seqcount.py [-h] path output_filename
positional arguments:
path path (for current directory use `.`)
output_filename name of text output file
optional arguments:
-h, --help show this help message and exit
fastq_bin_paired_reads.py - Clusters reads of paired-end RAD-seq data for downstream contig assembly. It
maps R1 reads to a reference, and then outputs those reads and the
corresponding R2 reads to a separate 'shuffled' FASTQ file per locus. Note:
when using an existing output folder, reads are being appended to existing
files (use this to append data from multiple samples). BWA needs to be
installed and accessible through PATH
environmental variable. [File did not pass PEP8 check]
usage: fastq_bin_paired_reads.py [-h]
r1_fastq_file r2_fastq_file ref_fasta_file
threads output_folder
positional arguments:
r1_fastq_file file in FASTQ format with R1 reads
r2_fastq_file file in FASTQ format with R2 reads
ref_fasta_file file in FASTA format with reference contigs
threads number of threads to be used by BWA
output_folder name of output folder
optional arguments:
-h, --help show this help message and exit
mapping_get_bwa_matches.py - Extracts a list of succesfully mapped loci from .sam
file (produced with
bwa mem
). Successfully mapped loci are identified by default identified as
those with flags 0 and 16 (can be adjusted in MATCH_FLAGS constant), and a
mapping quality of >=20. Configured for use with single-end reads. [File did not pass PEP8 check]
usage: mapping_get_bwa_matches.py [-h] sam_file
positional arguments:
sam_file `bwa mem` output file (`.sam`)
optional arguments:
-h, --help show this help message and exit
Example input file(s): sam_file.sam.
mapping_identify_blast_matches.py - Extracts a list of loci that have a blastn e-value below a certain threshold,
and outputs the (first) matching reference locus, as well as the alignment
length, nident, e-value and bitscore. It also compiles a set of all tax_ids,
which it uses to connect with the NCBI taxonomy database to get phylum ids for
each match using Entrez. Results are outputted to file with the chosen e-value
as post-fix, and STDOUT gives minimum alignment stats for filtered loci. Note:
fields in input file should be (in this order): query id, subject id,
alignment length, identity, perc. identity, evalue, bitscore, staxids, stitle.
This can be achieved by using blastn
with the following argument: -outfmt 7 qseqid sseqid length nident pident evalue bitscore staxids stitle
. [File did not pass PEP8 check]
usage: mapping_identify_blast_matches.py [-h] blastn_file evalue_cut_off email
positional arguments:
blastn_file blastn output file with the following fields (in that
order): query id, subject id, alignment length, identity,
perc. identity, evalue, bitscore
evalue_cut_off maximum e-value for match to be included
email email address to be used for NCBI connection
optional arguments:
-h, --help show this help message and exit
Example input file(s): blastn_file.txt.
mapping_get_blastn_matches.py - Extracts a list of loci that have a blastn e-value below a certain threshold,
and outputs the (first) matching reference locus, as well as the alignment
length, nident, and pident. Results are outputted to file with the chosen
e-value as post-fix, and STDOUT gives minimum alignment stats for filtered
loci. Note: fields in input file should be (in this order): query id, subject
id, alignment length, identity, perc. identity, evalue, bitscore (additional
fields beyond that are fine). This can be achieved by using blastn
with the
following argument: -outfmt 7 qseqid sseqid length nident pident evalue bitscore
. [File did not pass PEP8 check]
usage: mapping_get_blastn_matches.py [-h] blastn_file evalue_cut_off
positional arguments:
blastn_file blastn output file with the following fields (in that
order): query id, subject id, alignment length, identity,
perc. identity, evalue, bitscore
evalue_cut_off maximum e-value for match to be included
optional arguments:
-h, --help show this help message and exit
Example input file(s): blastn_file.txt.
popfile_toggleassign.py - Shuffle the assignment of individuals to populations by assigning the indivs sequentially to the different pops. The assignment is not completely random, but does generate equal population sizes (which otherwise differ substantially when using random assignment under originally small population sizes). [File did not pass PEP8 check]
usage: popfile_toggleassign.py [-h] pop_file
positional arguments:
pop_file text file (tsv or csv) with individuals and populations
optional arguments:
-h, --help show this help message and exit
Example input file(s): pop_file.txt.
popfile_randomize.py - Pseudo-randomize the assignment of individuals to populations. Note: with
small population sizes, this can lead to very uneven simulated population
sizes. See also the alternative: popfile_toggleassign.py
. Individuals in
original popfile should be ordered by population. [File did not pass PEP8 check]
usage: popfile_randomize.py [-h] pop_file
positional arguments:
pop_file text file (tsv or csv) with individuals and populations
optional arguments:
-h, --help show this help message and exit
Example input file(s): pop_file.txt.
popfile_from_clusters.py - Outputs a popfile based on cluster assignment file (from e.g. STRUCTURE) and
outputs a popfile based on those assigments (using supplied assignment
threshold). Note: I use the formatted CLUMPP output (e.g. clumpp_K4.out.csv
)
from the structure_mp
wrapper as assignment file. [File did not pass PEP8 check]
usage: popfile_from_clusters.py [-h] [-p pop_filename]
assignment_file assign_cut_off
positional arguments:
assignment_file text file (tsv or csv) with assignment values for each
individual (max. 2 clusters); e.g. a reformatted
STRUCTURE output file
assign_cut_off min. assignment value for an individual to be assigned
to a cluster
optional arguments:
-h, --help show this help message and exit
-p pop_filename, --popfile pop_filename
optional popfile: use original popnames as assignment
prefix
Example input file(s): assignment_file.csv.
structure_mp_plot.py - Plots all the results from a structure_mp run to a multi-page PDF. [File did not pass PEP8 check]
usage: structure_mp_plot.py [-h] [-o order_filename] [-p] [-c] path
positional arguments:
path path to structure_mp results
optional arguments:
-h, --help show this help message and exit
-o order_filename, --orderfile order_filename
optional file specifying the output order of samples
-p, --popnames set flag to output population names
-c, --clumpp_only set flag to only plot CLUMPP summary
nexus_set_label_colors.py - Set the color of each label in a NEXUS tree file.
usage: nexus_set_label_colors.py [-h] nexus_filename color_filename
positional arguments:
nexus_filename nexus input file)
color_filename file with samples and corresponding colors
optional arguments:
-h, --help show this help message and exit
nexus_append_label_groups.py - Appends group to each label in a NEXUS tree file. [File did not pass PEP8 check]
usage: nexus_append_label_groups.py [-h] nexus_filename group_filename
positional arguments:
nexus_filename nexus input file)
group_filename file with samples and corresponding groups
optional arguments:
-h, --help show this help message and exit
gdmatrix2tree.py - Creates NJ tree from a genetic distance matrix. Outputs ASCII format to STDOUT
and a nexus-formatted tree to output file. Note: distance matrix can be
created from vcf
using vcf_gdmatrix.py
. [File did not pass PEP8 check]
usage: gdmatrix2tree.py [-h] matrix_file tree_output_file
positional arguments:
matrix_file text file (tsv or csv) with genetic distance matrix
tree_output_file nexus file with output tree
optional arguments:
-h, --help show this help message and exit
Example input file(s): matrix_file.txt.
README_compile.py - Compiles README markdown file for this repository (/~https://github.com/pimbongaerts/radseq). Categories are assigned based on prefix, usage information is extracted from argparse, and example input files are assigned based on argument names. [File did not pass PEP8 check]
usage: README_compile.py [-h]
optional arguments:
-h, --help show this help message and exit
fasta_exclude.py - Reduces FASTA file to those loci not listed in supplied text file. [File did not pass PEP8 check]
usage: fasta_exclude.py [-h] fasta_file exclusion_file
positional arguments:
fasta_file FASTA input file (`.fasta`/ `.fa`)
exclusion_file text file with names of loci to be excluded
optional arguments:
-h, --help show this help message and exit
Example input file(s): fasta_file.fa, exclusion_file.txt.
fasta_include.py - Reduces FASTA file to only those loci listed in supplied text file. [File did not pass PEP8 check]
usage: fasta_include.py [-h] fasta_file inclusion_file
positional arguments:
fasta_file FASTA input file (`.fasta`/ `.fa`)
inclusion_file text file with names of loci to be included
optional arguments:
-h, --help show this help message and exit
Example input file(s): fasta_file.fa, inclusion_file.txt.
bwa_distance_filter.py - Filters list of loci mapped to genome scaffolds, so that they are spaced at least a certain distance. Input file should be tab-separated with columns in the following order (no header): rad_locus, ref_scaffold, ref_start_pos, flag. Required spacing between POS will be spacing + max_locus_length. [File did not pass PEP8 check]
usage: bwa_distance_filter.py [-h] [-l loci_file]
filename spacing max_locus_length
positional arguments:
filename input file
spacing desired spacing between loci
max_locus_length max. length of loci
optional arguments:
-h, --help show this help message and exit
-l loci_file, --loci loci_file
file with loci to be considered
goterms_from_uniprot_blast.py - Creates a list of GO terms for each annotated gene. [File did not pass PEP8 check]
usage: goterms_from_uniprot_blast.py [-h]
gene2uniprot_filename uniprot2go_filename
positional arguments:
gene2uniprot_filename
input file (tsv) with the custom gene ids (first
column) and the corresponding UniProt IDs (second
column); when multiple UniProt IDs are given for each
gene, they should be sorted by highest match)
uniprot2go_filename input file (tsv) with UniProt gene ids (first column)
and the corresponding GO terms (second column)
separated by semi-colons (;)
optional arguments:
-h, --help show this help message and exit
itertools_combinations.py - Generate list with all unique pairwise combinations of values in file. Short script meant to allow use of itertools.combinations in bash. [File did not pass PEP8 check]
usage: itertools_combinations.py [-h] filename
positional arguments:
filename input file with values
optional arguments:
-h, --help show this help message and exit
structure_mp.py - Multi-processing STRUCTURE (Pritchard et al 2000) wrapper for RAD-seq data.
Takes a .vcf
as input file and then creates a number of replicate datasets,
each with a different pseudo-random subsampling of one SNP per RAD contig.
Then, it runs the replicate datasets through STRUCTURE across multiple
threads, and summarises the outcome with CLUMPP (Jakobsson and Rosenberg
2007). Finally, it assesses the number of potential clusters using the
Puechmaille 2016 method (only suitable for certain datasets). Note: STILL
NEEDS TO BE MODIFIED FOR GENERAL USE. Input file (.vcf
) should be sorted by
CHROM. The mainparams' and
extraparamsfile for STRUCTURE need to be present in the current directory (with the desired settings - although
NUMINDSand
NUMLOCI` can be set to 0 as these will be supplied to STRUCTURE
by the script). The paramfile for CLUMPP will be generated by the script and
does not need to be supplied. [File did not pass PEP8 check]
usage: structure_mp.py [-h] vcf_file pop_file maxK replicates threads
positional arguments:
vcf_file input file with SNP data (`.vcf`)
pop_file population file (.txt)
maxK maximum number of K (expected clusters)
replicates number of replicate runs for each K
threads number of parallel threads
optional arguments:
-h, --help show this help message and exit
Example input file(s): vcf_file.vcf, pop_file.txt.