AllelePipe

Identifying alleles in population genomic datasets without a reference genome

Allelic variation within species provides fundamental insights into the evolution and ecology of organisms, and information about this variation is becoming increasingly available in sequence datasets of multiple and/or outbred individuals. Unfortunately, identifying true allelic variants poses a number of challenges, given the presence of both sequencing errors and alleles from other closely related loci. Particularly tricky is the case where alleles must be identified without mapping them to a fully resolved reference genome, and where sequence depth information cannot be used to infer the putative number of loci sharing the same sequence. This situation is commonly found in transcriptome and publicly available post-assembly datasets. The AllelePipe takes in assembled sequence contigs from one or more individuals and passes them through the following steps to extract allelic variation at putative individual loci:
  • 1) Similarity is assessed among all sequences from all individuals using SSAHA2 according to user-defined minimum similarity and alignment length thresholds.

  • 2) Alignment throughout the region of overlap is verified.

  • 3) Sequences are clustered by either single-linkage clustering or MCL as desired, with the option of re-starting the clustering with alternative methods/granularities.

  • 4) Multiple alignments are created for the sequences within each cluster, and a consensus sequence is generated for each, using CAP3. A single consensus genomic reference fasta file is generated for the whole dataset, which can be reused in other analyses.

  • 5) Optionally, putatively chimeric clusters are removed, assuming that these are clusters where only one sequence bridges an internal region of the multiple alignment. This step is only appropriate for datasets with many individuals and good coverage of loci, where many sequences should be aligning across the length of each locus.

  • 6) SNPs (currently excluding indels) are identified using ssahaSNP against the reference sequence for the same or different sets of individuals, as desired. The program can be restarted from this step for additional analyses with different parameters and/or new individuals.

  • 7) Clusters are sorted as being single- or multi-locus, based upon user settings for the maximum number of alleles allowed per individual.
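
Steps 3 and 7 can be illustrated with a minimal Python sketch. It is not the AllelePipe code: it assumes the similarity hits have already been reduced to pairs of sequence IDs, it treats each sequence in a cluster as one allele, and the names `single_linkage`, `classify`, `individual_of`, and `max_alleles` are hypothetical.

```python
from collections import Counter, defaultdict

def single_linkage(pairs):
    """Single-linkage clustering: two IDs joined by any chain of hits share a cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = defaultdict(set)
    for seq in list(parent):
        clusters[find(seq)].add(seq)
    return sorted(sorted(members) for members in clusters.values())

def classify(cluster, individual_of, max_alleles=2):
    """Label a cluster by the largest number of sequences any one individual contributes.

    Assumption: each sequence counts as one allele; max_alleles=2 suits a diploid.
    """
    counts = Counter(individual_of[seq] for seq in cluster)
    return "single-locus" if max(counts.values()) <= max_alleles else "multi-locus"

# Toy data: sequence IDs and their source individuals are invented for illustration.
pairs = [("a1", "b1"), ("b1", "a2"), ("c1", "c2"), ("c2", "c3")]
individual_of = {"a1": "indA", "a2": "indA", "b1": "indB",
                 "c1": "indC", "c2": "indC", "c3": "indC"}
for cluster in single_linkage(pairs):
    print(cluster, classify(cluster, individual_of))
```

Single linkage merges clusters through any one shared hit, which is why the pipeline also offers MCL and a chimera-removal step for cases where a single bridging sequence joins unrelated loci.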


Citing AllelePipe

  • For the AllelePipe itself:
    • Dlugosch KM, et al. In prep. -- stay tuned

    The pipeline uses the following programs, which should also be cited:
    • Huang X, Madan A (1999) CAP3: a DNA sequence assembly program. Genome Res 9:868-877.

    • Ning Z, Cox AJ, Mullikin JC (2001) SSAHA: a fast search method for large DNA databases. Genome Res 11:1725-1729.

  • If you use the MCL option, you should ALSO cite:
    • Li L, Stoeckert CJ, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13:2178-2189.

ChromEvol

Model the evolution of chromosome numbers across a phylogeny

Chromosome number is a remarkably dynamic feature of eukaryotic evolution. Chromosome numbers can change by a duplication of the whole genome (a process termed polyploidy), or by gaining or losing single chromosomes. Of the various mechanisms of chromosome number change, polyploidy has received significant attention because of the impact such an event may have on the organism. Polyploids often differ markedly from their progenitors in morphological, physiological, or life history characteristics, and these differences may contribute to the establishment and success of a polyploid species in novel ecological settings.

ChromEvol implements a series of likelihood models for the evolution of chromosome numbers. By comparing the fit of the different models to biological data, it may be possible to gain insight regarding the pathways by which the evolution of chromosome number proceeds. For each model, the program infers the set of ancestral chromosome numbers and estimates the location along the tree for which polyploidy events (and other chromosome number changes) occurred.
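
One common way to compare the fit of competing likelihood models is the Akaike information criterion (AIC = 2k - 2 lnL). The description above does not say which criterion ChromEvol itself reports, so the following Python sketch is only an illustration under that assumption; the model names, log-likelihoods, and parameter counts are invented.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood

def rank_models(fits):
    """fits maps model name -> (log-likelihood, number of free parameters).

    Returns (name, AIC, delta AIC) tuples sorted from best to worst.
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
    best = min(scores.values())
    return sorted(((name, score, score - best) for name, score in scores.items()),
                  key=lambda row: row[1])

# Hypothetical fits for two models of chromosome number change.
fits = {
    "gain_loss_only": (-120.0, 2),
    "gain_loss_plus_duplication": (-110.5, 3),
}
for name, score, delta in rank_models(fits):
    print(f"{name}\tAIC={score:.1f}\tdelta={delta:.1f}")
```

In this invented example the model that allows duplication wins despite its extra parameter, which is the kind of comparison that would suggest polyploidy has contributed to the observed chromosome numbers.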


Citing ChromEvol

DupPipe

A Pipeline to Infer Gene Family Phylogenies and Summarize the Age of Duplication Events

Phylogenies reveal the evolutionary history of organisms and genes, and phylogenetic analyses of genome-scale data can uncover genome-wide events that occurred in the past. For example, examining the overall shape of the distribution of all gene family duplications for an organism reveals the patterns of birth and death of gene copies through time. Large-scale birth events, such as ancient polyploidy, are apparent in these distributions as peaks, and by comparison with other species may be placed in phylogenetic context. Such analyses are also useful for evaluating the resolution of close paralogs to assess assembly quality: overassembled or short-read assemblies without sufficient power will not resolve recent paralogs. The DupPipe provides an online server to estimate gene family phylogenies and plot the distribution of duplications for subsequent evolutionary analyses and assembly quality assessment.

Gene family members are identified as sequences that demonstrate at least 40% sequence similarity over at least 300 bp from a discontiguous MegaBlast (Zhang et al. 2000; Ma et al. 2002). Reading frames for each sequence pair are identified by searching against a set of proteins provided by the user or available on GenBank (Wheeler et al. 2007) using BlastX (Altschul et al. 1997). Best hit proteins are paired with each gene at a minimum cutoff of 30% sequence similarity over at least 150 sites. Genes that do not have a best hit protein at this level are removed. To determine reading frame and generate estimated amino acid sequences, each gene is aligned against its best hit protein by Genewise 2.2.2 (Birney et al. 2004). Using the highest scoring Genewise DNA-protein alignments, custom Perl scripts are used to remove stop and 'N' containing codons and produce estimated amino acid sequences for each gene. Amino acid sequences for each duplicate pair are then aligned using MUSCLE 3.6 (Edgar 2004). The aligned amino acids are subsequently used to align their corresponding DNA sequences using RevTrans 1.4 (Wernersson and Pedersen 2003). Ks values (synonymous substitution rates) for each duplicate pair are calculated using the maximum likelihood method implemented in codeml of the PAML package (Yang 1997) under the F3x4 model (Goldman and Yang 1994).
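
The first filter in this paragraph (at least 40% similarity over at least 300 bp) can be sketched in Python as follows. This is not the DupPipe's own parser: it assumes tab-delimited BLAST output whose first four columns are query ID, subject ID, percent identity, and alignment length, and the function name `filter_hits` is hypothetical.

```python
def filter_hits(lines, min_identity=40.0, min_length=300):
    """Return unique gene pairs whose hits pass both thresholds.

    Assumes tab-delimited rows: query, subject, percent identity, alignment length, ...
    """
    pairs = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        if query == subject:
            continue  # a sequence always matches itself
        if identity >= min_identity and length >= min_length:
            pairs.add(tuple(sorted((query, subject))))  # A-B and B-A count once
    return sorted(pairs)

# Invented example rows.
rows = [
    "g1\tg2\t85.0\t450",
    "g2\tg1\t85.0\t450",   # same pair, reverse direction
    "g1\tg3\t35.0\t500",   # fails the identity threshold
    "g1\tg4\t90.0\t120",   # fails the length threshold
]
print(filter_hits(rows))
```

The same pattern, with thresholds of 30% and 150 sites, would apply to the protein best-hit filter described in the same paragraph.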

Further cleaning of the dataset is conducted to remove duplication events that could bias the results. To reduce the possibility that identical genes are represented in the dataset, but missed by the TGICL clustering due to alternative splicing, all Ks values from one member of a duplicate pair with Ks = 0 are removed. Further, to reduce the multiplicative effects of multicopy gene families on Ks values, simple hierarchical clustering is used to construct phylogenies for each gene family, identified as single-linked clusters, and to calculate the node Ks values.
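
To see why node Ks values reduce the multiplicative effect of large families, consider that a family of n genes has n(n-1)/2 pairwise Ks values but only n-1 duplication nodes. The Python sketch below uses average-linkage clustering to derive node values. The text says only "simple hierarchical clustering", so the choice of average linkage and the function name `node_ks` are assumptions, and the Ks values are invented.

```python
from itertools import combinations

def node_ks(pairwise):
    """pairwise maps (gene_a, gene_b) -> Ks. Returns one Ks value per merge (node)."""
    dist = {frozenset(pair): ks for pair, ks in pairwise.items()}
    genes = sorted({gene for pair in pairwise for gene in pair})
    clusters = [frozenset([gene]) for gene in genes]
    nodes = []

    def average(c1, c2):
        # Mean Ks over all known gene pairs spanning the two clusters.
        values = [dist[frozenset((a, b))] for a in c1 for b in c2
                  if frozenset((a, b)) in dist]
        return sum(values) / len(values) if values else float("inf")

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pr: average(*pr))
        ks = average(c1, c2)
        if ks == float("inf"):
            break  # remaining clusters share no measured pairs
        nodes.append(ks)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return nodes

# Three genes: three pairwise values collapse to two node values.
family = {("g1", "g2"): 0.125, ("g1", "g3"): 1.0, ("g2", "g3"): 0.5}
print(node_ks(family))
```

Here the recent g1/g2 duplication contributes one node and the older split from g3 contributes one more, rather than the older event being counted twice.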

Run the DupPipe on the EvoPipes Server Now!

Output Files

  • final_ks_values: Contains the node Ks values for each gene family cluster to plot the distribution of gene ages

  • pamloutput: Pairwise Ka, Ks, and Ka/Ks for each duplicate gene pair used to generate the node Ks values for each gene family cluster

  • genewise_dnas.fasta: Cleaned up DNAs for PAML (e.g., removed stop codons) and placed in reading frame by best blast hit to known proteins with genewise.

  • genewise_prots.fasta: Estimated protein sequences that correspond to the reading frame of the same sequence number in genewise_dnas.fasta.

  • sequence: Original DNA fasta file.

  • indices: The index of each fasta header used by the DupPipe with the original corresponding header.

Citing the DupPipe

findSSR

Identify SSRs in a collection of DNA sequences

findSSR is a pipeline that identifies all genes containing SSRs or microsatellites, including di-, tri-, tetra- and penta-nucleotide repeats of at least five repeats. Similar in scope to the approach of Temnykh et al. (2001), findSSR was designed to pull out SSRs that would be useful for genotyping. For this purpose, unlike other programs, findSSR only reports repeats found greater than 20 nucleotides from the ends of the sequence, leaving room for primers on either side. Output includes a list of each sequence name for genes containing an SSR, the location of the SSR, repeated motif, number of repeats, and the total length of the sequence examined. The program has been used to identify microsatellites that have been developed into markers used in several publications (Kane et al. 2009, Kane and Rieseberg 2007, Kane and Rieseberg 2008, Kawakami et al. 2010, Yatabe et al. 2007).
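
The search criteria described above can be approximated with a regular expression. The Python sketch below is an illustration, not the findSSR source: it interprets "greater than 20 nucleotides from the ends" as requiring more than 20 flanking bases on each side, it skips homopolymer runs, and the names `find_ssrs` and `min_flank` are hypothetical.

```python
import re

# A motif of 2 to 5 bases followed by at least 4 more copies, i.e. 5 or more repeats.
# The lazy quantifier tries the shortest motif first, so ACAC... is reported as AC.
SSR_RE = re.compile(r"([ACGT]{2,5}?)\1{4,}")

def find_ssrs(seq, min_flank=20):
    """Return (motif, n_repeats, 1-based start, sequence length) for each usable SSR."""
    seq = seq.upper()
    hits = []
    for match in SSR_RE.finditer(seq):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # a run of a single base is not a di- to penta-nucleotide SSR
        left_flank = match.start()
        right_flank = len(seq) - match.end()
        if left_flank <= min_flank or right_flank <= min_flank:
            continue  # too close to an end to leave room for primers
        n_repeats = (match.end() - match.start()) // len(motif)
        hits.append((motif, n_repeats, match.start() + 1, len(seq)))
    return hits

# Invented test sequence: 25 bases of flank, (AC)6, 25 bases of flank.
print(find_ssrs("G" * 25 + "AC" * 6 + "T" * 25))
```

The returned fields mirror the columns listed for the output file (motif, number of repeats, position, total sequence length), although the actual file layout may differ.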

Run findSSR on the EvoPipes Server Now!

Output Files

  • out.ssr*: Row by row listing of SSRs for each fasta sequence with the repeat motif, number of repeats, position of repeat, and total length of the fasta sequence.

Citing findSSRs


Simulate gene and genome evolution with your data using NU-IN and EvolSimulator

NU-IN is an adaptation and expansion of the EvolSimulator 2.1.0 genome evolution simulation program by Beiko and Charlebois (2007, http://bioinformatics.org.au/evolsim/). NU-IN was designed to expand EvolSimulator in two fundamental ways: 1) Allow synonymous and non-synonymous nucleotide evolution and 2) Permit input of genomes, gene family membership, and gene 'usefulness' (the selective retention of particular loci in particular environments). With these changes, the user has the ability to use real genomic (coding) sequence data to initiate a simulation of one or more lineages, generate mutations through SNPs and copy number variation (as well as horizontal gene transfer), evolve genomes by drift and selection, and use output of previous simulations as starting points for further evolution.
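
The distinction between synonymous and non-synonymous changes, which NU-IN adds to the simulator, can be shown with a short Python sketch. It is not taken from NU-IN: it assumes the standard genetic code, and the function name `classify_substitution` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_substitution(codon, position, new_base):
    """Classify a single-base change at a 0-based position within a codon."""
    mutant = codon[:position] + new_base + codon[position + 1:]
    same_residue = CODON_TABLE[codon] == CODON_TABLE[mutant]
    return "synonymous" if same_residue else "non-synonymous"

print(classify_substitution("CTT", 2, "C"))  # leucine stays leucine
print(classify_substitution("ATG", 2, "A"))  # methionine becomes isoleucine
```

A simulator that tracks this distinction can let selection act differently on the two classes of mutation, which is what makes coding sequence input meaningful.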


Citing NU-IN

RBH Orthologs

A pipeline to identify reciprocal best BLAST hits among a set of fasta files.

The RBH Orthologs pipeline is designed to provide a list of reciprocal best blast hits for a given set of files. This is one of many approaches to identifying putatively orthologous sequences in a collection of genomic data, an important step in searching for candidate genes under selection or constructing phylogenies with only orthologous sequences. Note that lineage specific duplications and missing data pose problems for the identification of orthologs using all approaches, but the stringent requirements of the RBH algorithm should minimize these errors.

For each fasta formatted data set uploaded, reciprocal BLAST searches are conducted for all pairwise combinations using a discontiguous megablast (Zhang et al. 2000; Ma et al. 2002) with an initial filter of at least 50% sequence identity and e-value of 0.1. Each BLAST search is further parsed to keep hits with at least 70% sequence identity over at least 100 base pairs. The top hits among each of these pairwise BLAST searches are then examined for reciprocal best BLAST hits among all uploaded fasta files. A file containing the names of these RBH orthologs, listed by row, is provided as output. Names of each sequence are coded with the three letter file id provided by the user and the index number for each sequence, based on its position in the original fasta file.
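
The reciprocal best hit test for one pair of files can be sketched in Python as follows. This is an illustration rather than the pipeline's code: it assumes hits have already been filtered and reduced to (query, subject, score) tuples, it uses the score alone to pick the top hit, and the sequence IDs (a three letter file id plus an index) are invented examples of the naming scheme.

```python
def best_hits(hits):
    """hits: iterable of (query, subject, score). Keep the top-scoring subject per query."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)

# Invented hits between two files with IDs "AAA" and "BBB".
a_vs_b = [("AAA_1", "BBB_1", 200), ("AAA_1", "BBB_2", 150), ("AAA_2", "BBB_2", 90)]
b_vs_a = [("BBB_1", "AAA_1", 198), ("BBB_2", "AAA_1", 150), ("BBB_2", "AAA_2", 80)]
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

AAA_2 is dropped because its best hit, BBB_2, prefers AAA_1. This is the stringency that limits false orthologs when lineage-specific duplicates are present.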

Run RBH Orthologs on the EvoPipes Server Now!

Output Files

  • orthologs.id1.id2...: Row by row listing of RBH orthologs, with each fasta sequence in an ortholog group identified by the user-provided ID and the position of the sequence in the original fasta file.

Citing the RBH Ortholog Pipeline


Scaffolded and Corrected Assembly of Roche 454

A next-gen sequence assembly tool for evolutionary genomics. Designed especially for assembling 454 EST sequences against high quality reference sequences from related species.

SCARF was created in order to knit together low-coverage 454 contigs that do not assemble during traditional de novo assembly, using a reference sequence library to orient the 454 sequences. SCARF is especially well suited for non-contiguous or low depth data sets such as EST (expressed sequence tag) libraries. SCARF can also be used to sort and assemble a pool of 454 sequence data according to a set of reference sequences (e.g. for metagenomics). See the documentation for a full description of the methodology behind SCARF.

Run SCARF on the EvoPipes Server Now!

Download SCARF

Citing SCARF

SnoWhite

An Aggressive Cleaning Pipeline for DNA Sequence Reads

SnoWhite is a pipeline designed to flexibly and aggressively clean sequence reads (gDNA or cDNA) prior to assembly. It takes in and returns fasta formatted sequence and (optionally) quality files. It employs several steps:

  • 1) Adapter Clipping: SnoWhite can clip a user-specified number of bases from the beginning of each sequence.

  • 2&4) Seqclean: SnoWhite passes files to TGI's Seqclean, a relatively old but still excellent tool for trimming polyA/T tails, primer contaminants, and uninformative sequences (Ns).

  • 3) PolyA/T Trimming: SnoWhite provides additional trimming governed by many tunable parameters. In short, users can set tolerances for what constitutes a polyA/T, where to look for it in the sequence, and how much error to allow.

  • 5) TagDust: SnoWhite optionally implements TagDust, which is designed to find sequences that are composed almost entirely of primer/adapter fragments. These primer 'multimers' or 'concatemers' are a persistent low-abundance feature of many datasets, and are extremely difficult to remove using traditional contaminant searches.
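
Steps 1 and 3 can be illustrated with a deliberately simplified Python sketch. It is not SnoWhite's implementation: it allows no errors inside a poly-A/T run, it only looks for a trailing poly-A and a leading poly-T, and the names `clip_start`, `trim_poly_tails`, and `min_run` are hypothetical.

```python
import re

def clip_start(seq, n_bases):
    """Remove a fixed number of bases (e.g. an adapter) from the start of a read."""
    return seq[n_bases:]

def trim_poly_tails(seq, min_run=8):
    """Strip a trailing poly-A and a leading poly-T of at least min_run bases.

    Simplification: runs must be exact; SnoWhite's own trimming tolerates errors
    and is controlled by more parameters than shown here.
    """
    seq = re.sub(r"A{%d,}$" % min_run, "", seq)  # trailing poly-A
    seq = re.sub(r"^T{%d,}" % min_run, "", seq)  # leading poly-T
    return seq

read = "NNNN" + "GATTACG" + "A" * 10
print(trim_poly_tails(clip_start(read, 4)))
```

Note that an exact-match rule like this also removes any genuine A bases adjacent to the tail, one reason a tunable tolerance is useful in practice.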

Data Types

  • 454: SnoWhite was written for Roche 454 data, and is ideal for this.

  • Illumina & SOLiD: May require large amounts of RAM.

  • Sanger: Note that TagDust evaluates only the first 999bp of sequence,
    and TagDust does not tolerate vector sequences >2000nt.


Citing SnoWhite

  • For SnoWhite:
    Dlugosch KM, Rieseberg LH. SnoWhite: A pipeline for aggressive cleaning of next-generation sequence reads. In prep.

  • If you use the TagDust option, you should ALSO cite:
    Lassmann T, Hayashizaki Y, Daub CO. 2009. TagDust - A program to eliminate artifacts from next generation sequencing data. Bioinformatics 25: 2839-2840.

TransPipe

A pipeline to translate a collection of genomic or cDNA sequences to protein and place them in the corresponding reading frame

TransPipe provides bulk translation and reading frame identification for a set of fasta formatted sequences. The identification of reading frames and translated protein sequences are crucial steps for codon based analyses, such as testing for evidence of accelerated amino acid evolution, Ka/Ks ratios, or protein guided DNA alignments. Using GeneWise's HMM algorithm, the TransPipe leaves only the regions of genes or EST reads that align to the protein sequence and starts them in-frame. An added benefit of this approach is that any missed vectors and adapters are also removed, as well as all non-coding DNA from the data set, such as introns and UTRs, simplifying the comparison of transcriptome, whole genome shotgun, and genespace data.

The reading frame and protein translation for each sequence are identified by comparison to available protein sequences provided by the user or available on GenBank (Wheeler et al. 2007). Using BlastX (Altschul et al. 1997), best hit proteins are paired with each gene at a minimum cutoff of 30% sequence similarity over at least 150 sites. Genes that do not have a best hit protein at this level are removed. To determine reading frame and generate estimated amino acid sequences, each gene is aligned against its best hit protein by Genewise 2.2.2 (Birney et al. 2004). Using the highest scoring Genewise DNA-protein alignments, custom Perl scripts are used to remove stop and 'N' containing codons and produce estimated amino acid sequences for each gene. Output includes paired DNA and protein sequences with the DNA sequence's reading frame corresponding to the protein sequence.
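
The codon cleaning step can be sketched in Python (the pipeline itself uses custom Perl scripts that are not shown here). The sketch assumes the input is already in frame, uses the stop codons of the standard genetic code, and drops any trailing partial codon; the function name `clean_codons` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code

def clean_codons(dna):
    """Drop stop codons and codons containing N from an in-frame DNA sequence."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    codons = (dna[i:i + 3] for i in range(0, usable, 3))
    return "".join(c for c in codons if c not in STOP_CODONS and "N" not in c)

# Codons: ATG, NNA (removed), GCT, TAA (removed), GGC
print(clean_codons("ATGNNAGCTTAAGGC"))
```

Removing whole codons rather than single bases keeps the remaining sequence in frame, which is what downstream codon-based tools such as PAML require.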

Run the TransPipe on the EvoPipes Server Now!

Output Files

  • genewise_dnas.fasta: Translated DNAs placed in reading frame by alignment to best blast hit to known proteins with genewise. Sequences are also cleaned of any non-aligned regions and stop codons, and are ready for use in other software tools such as PAML.

  • genewise_prots.fasta: Estimated protein sequences that correspond to the reading frame of the same sequence number in genewise_dnas.fasta.

  • sequence: Original DNA fasta file.

  • fasta_index: The index of each fasta sequence used by the TransPipe with the original corresponding header in column 2.

  • dna_protein_index: The index of each fasta sequence (column 1) with the corresponding best hit protein index and header (column 2 and 3).

Citing the TransPipe

Last updated June 2, 2011