TaxMan Information

TaxMan is an interactive web server that produces rRNA gene subsequences (amplicons) based on your primers. You can download all data and analyse it interactively, also off-line, by browsing the taxonomic tree or plotting the coverage in pie charts. This lets you check the taxonomic coverage your primers give you.

Even when the selected reference rRNA gene database is non-redundant, PCR can result in identical sub-sequences; therefore a highly redundant set of amplicons can be formed. The taxonomic lineages belonging to these sequences can be the same, but can also be different. Depending on your primers, different taxonomic phyla may form identical sub-sequences.

TaxMan calculates the amplicon sequences, makes them non-redundant and summarizes the lineages, which are taken from the selected rRNA gene database. Each sequence receives a unique FASTA header line. If there are multiple headers with the same lineage (but different sequences), a count will be appended. The example primers on this site form 1045 sequences from CORE, but only 796 are unique (so 141 (17.7%) amplicon sequences are identical). With larger sequence databases in particular, the redundancy after "PCR" can be significant.

The header line of the FASTA sequences contains the highest level at which the taxonomy starts to differ. This level is present so that it is immediately clear which taxonomic categories now have the same (amplicon) sequence. The example output using the CORE database contains several Streptococci for which the lineages are summarized. For example:
Bacteria;Firmicutes;Bacillales;Lactobacillales;Streptococcaceae;Streptococcus;(cristatus/oligofermentans sinensis).
As TaxMan is sequence oriented, (different) sequences with the same lineage in the non-redundant amplicon set get a count. For example:
Bacteria;Firmicutes;Bacillales;Lactobacillales;Streptococcaceae;Streptococcus;(cristatus/oligofermentans sinensis)_2
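
The summarizing step described above can be sketched as follows. This is only an illustration, not TaxMan's actual code: the function names, the "/" separator inside the parentheses and the rule that the first header stays unnumbered while later ones get "_2", "_3", ... are assumptions, and the lineages in the usage example are shortened.

```python
from collections import defaultdict


def summarize_lineages(lineages):
    """Merge lineages into one, grouping names at the first level that differs."""
    split = [lineage.split(";") for lineage in lineages]
    common = []
    for names in zip(*split):
        if len(set(names)) > 1:
            break
        common.append(names[0])
    depth = len(common)
    # Names found at the first level where the lineages disagree.
    differing = sorted({parts[depth] for parts in split if len(parts) > depth})
    if len(differing) > 1:
        common.append("(" + "/".join(differing) + ")")
    return ";".join(common)


def nonredundant_amplicons(amplicons):
    """amplicons: iterable of (lineage, sequence) pairs.

    Returns one (header, sequence) pair per unique sequence.
    """
    lineages_by_sequence = defaultdict(list)
    for lineage, sequence in amplicons:
        lineages_by_sequence[sequence].append(lineage)

    times_seen = defaultdict(int)
    result = []
    for sequence, lineages in lineages_by_sequence.items():
        header = summarize_lineages(lineages)
        times_seen[header] += 1
        if times_seen[header] > 1:
            # Same summarized lineage, different sequence: append a count.
            header = f"{header}_{times_seen[header]}"
        result.append((header, sequence))
    return result


example = [
    ("Bacteria;Firmicutes;Streptococcus;cristatus", "ACGT"),
    ("Bacteria;Firmicutes;Streptococcus;oligofermentans", "ACGT"),
    ("Bacteria;Firmicutes;Streptococcus;cristatus", "ACGA"),
    ("Bacteria;Firmicutes;Streptococcus;oligofermentans", "ACGA"),
]
for header, sequence in nonredundant_amplicons(example):
    print(">" + header)
    print(sequence)
```

In this sketch the four input amplicons collapse into two unique sequences that share one summarized lineage, so the second header receives the "_2" suffix.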

This web server

TaxMan input

Primers

Paste in your primer sequences: supply the forward and reverse primers. They may contain IUPAC ambiguity codes (see base codes and their reverse complement). The reverse primer needs to be in reverse complement orientation, which is common for PCR, but this may not be the case in your analysis pipeline. If you click the "Example" button, a primer set targeting the V5-V6 hypervariable region of the 16S rRNA gene is shown.
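
If your pipeline stores the reverse primer in the other orientation, it can be converted with a small helper such as the one below. This is a minimal sketch using the standard IUPAC complement pairs; the function name and the example primer are made up for illustration.

```python
# Standard IUPAC complements: R<->Y, K<->M, B<->V, D<->H; S, W and N are
# their own complement.
COMPLEMENT = str.maketrans("ACGTRYKMSWBVDHN", "TGCAYRMKSWVBHDN")


def reverse_complement(primer):
    """Return the reverse complement of a primer that may hold IUPAC codes."""
    return primer.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ACGTR"))  # YACGT
```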

The primers may contain any of the IUPAC ambiguity codes. Sequences may also contain these codes, which means, for example, "R" should match G or A or R. Therefore, we expand these codes in the following way:

You cannot use ambiguity codes in so-called character classes, like [GAR]. However, this is not needed as we expand the codes. We currently use a version of EMBOSS primersearch, adapted to use the table above, to calculate the amplicons. For each database sequence, we only keep the longest amplicon sequence.

Note: The table above does not contain the bases (A,C,G,T). Thus, when your primer contains, for example, an "A" but the sequence contains an "R" at the corresponding position, a mismatch results. This occurs only rarely. Under options, you can either allow mismatches and include the entire primer in the 3' mismatch window or tick the check box for "Allow bases (A,C,G,T) to include all their ambiguities". The latter option only works when the mismatch percentage remains set to zero.
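
The matching behaviour described in this section can be illustrated with the sketch below. It is not the adapted primersearch code. The base sets are the standard IUPAC definitions, and the rule that an ambiguity code in the primer matches only itself and the plain bases it stands for is an assumption made for this example; the function and parameter names are hypothetical.

```python
# Standard IUPAC nucleotide codes and the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def position_matches(primer_base, seq_base, bases_include_ambiguities=False):
    """Compare one primer position with one sequence position."""
    if primer_base == seq_base:
        return True
    if primer_base in "ACGT":
        # A plain base in the primer matches an ambiguity code in the
        # sequence only when the option is switched on (assumed behaviour).
        return bases_include_ambiguities and primer_base in IUPAC.get(seq_base, "")
    # An ambiguity code in the primer matches the plain bases it stands for.
    return seq_base in "ACGT" and seq_base in IUPAC.get(primer_base, "")


def count_mismatches(primer, site, bases_include_ambiguities=False):
    """Count mismatching positions between a primer and an equally long site."""
    return sum(
        not position_matches(p, s, bases_include_ambiguities)
        for p, s in zip(primer, site)
    )


print(count_mismatches("GAR", "GAA"))  # 0: "R" in the primer matches "A"
print(count_mismatches("GAA", "GAR"))  # 1: "A" in the primer does not match "R"
print(count_mismatches("GAA", "GAR", bases_include_ambiguities=True))  # 0
```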

Targeted region

Not all rRNA gene sequences in the databases are full-length sequences. In all cases, the reported taxonomic coverage relates to the produced amplicon sequences. However, when you use this option, the number of sequences that did not produce amplicons because they are too short to include your primers is reported. Note that if you allow a high mismatch percentage (e.g. 30%), your primers may produce an amplicon in a different region than the one you target. Since an amplicon is produced in this case, the related sequence is not included in this number of incomplete sequences.

You can supply the begin and end position of the region targeted in the E. coli reference sequence (see below). If you enter these coordinates, the number of sequences that do not have sequence information for the entire targeted region will be reported. These are sequences that lack nucleotide data at the begin and/or end of the rRNA gene. This option relies on the multiple sequence alignment provided by the different databases.

If you use this option, both begin and end coordinates have to be supplied. These coordinates correspond to the targeted region. For example, if your forward primer starts at position 250 and your reverse primer ends at 1200 (on the forward strand), the begin/end values are 250/1200.

The currently implemented reference sequences are the E. coli 16S rRNA and 23S rRNA gene sequences. The identifiers of these reference sequences are from conservation diagrams of the Gutell lab.

16S rRNA gene reference:
>gb|J01695.2|ECORGNB:1268-2809 E.coli rRNA operon (rrnB) coding for Glu-tRNA-2, 5S, 16S and 23S rRNA

23S rRNA gene reference (for SILVA LSU):
>gb|J01695.2|ECORGNB:3250-6153 E.coli rRNA operon (rrnB) coding for Glu-tRNA-2, 5S, 16S and 23S rRNA

Implementation details

We have indexed the coordinates of the sequences in the multiple sequence alignments (MSAs) that were available. These MSAs are produced by the respective databases. The CORE and vaginal reference MSAs did not include the exact E. coli reference we use. We have used pyNAST v1.1 to align this reference sequence to the existing MSA.

For each sequence, its begin and end position in the MSA are stored. The positions that you provide are coordinates in the reference sequence (e.g. 895 as begin when you use the 895F primer). We map these positions to the coordinates in the MSA using the reference and the aligned reference sequence. Next, we retrieve the identifiers of the sequences that have a begin coordinate larger than, or an end coordinate smaller than, the positions in the reference. Thus, the number of sequences that do not contain sequence information spanning the targeted region can be calculated. The identifiers of the missing sequences are checked against the identifiers of the amplicons. This means that if your primers target the region from 967 to 1046, but you supply the begin/end positions as 10/1500, the reported number of sequences with missing information is large, but the corrected coverage is 100%.
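
The coordinate mapping can be sketched as follows. This is an illustration under stated assumptions rather than the server's implementation: gaps are taken to be "-" or ".", reference positions are 1-based, MSA columns are 0-based, and the names `reference_to_msa_columns`, `incomplete_sequences` and `spans` are hypothetical.

```python
def reference_to_msa_columns(aligned_reference, gap_chars="-."):
    """Entry i holds the MSA column (0-based) of reference position i + 1."""
    return [
        column
        for column, char in enumerate(aligned_reference)
        if char not in gap_chars
    ]


def incomplete_sequences(spans, aligned_reference, begin, end, amplicon_ids=()):
    """Identifiers of sequences that do not span the targeted region.

    spans: dict of sequence id -> (first, last) MSA column holding data.
    begin, end: 1-based positions in the unaligned reference sequence.
    amplicon_ids: ids that produced an amplicon; these are not reported.
    """
    columns = reference_to_msa_columns(aligned_reference)
    begin_column = columns[begin - 1]
    end_column = columns[end - 1]
    missing = {
        seq_id
        for seq_id, (first, last) in spans.items()
        if first > begin_column or last < end_column
    }
    return missing - set(amplicon_ids)


aligned_reference = "--AC-GT-A"  # reference positions 1..5 sit in columns 2, 3, 5, 6, 8
spans = {"s1": (0, 8), "s2": (4, 8), "s3": (0, 5), "s4": (3, 6)}
print(sorted(incomplete_sequences(spans, aligned_reference, begin=2, end=4)))
```

Here the region 2 to 4 maps to MSA columns 3 to 6, so "s2" (starts too late) and "s3" (ends too early) are reported.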

Databases

Building details

Data files have been downloaded and where applicable the taxonomy has been added to the FASTA header. The building process depends on the database:

CORE: The CORE database was downloaded on 23-08-2011 from downloads. We downloaded the EXCEL version and, using the data in this file, we created FASTA headers that include the taxonomic lineage. The headers contain the CORE accession id followed by the complete lineage information from the same EXCEL file.
HOMD: The Human Oral Microbiome Database (HOMD) was downloaded from the HOMD site. We downloaded version 10.1 of the 16S rRNA gene database and the taxon table in text format. The data is linked through the HOT id. The headers from the HOMD data start with the HOT id merged with the strain synonym (as in the HOMD file), followed by the lineage information from the taxon table file. A single HOT id (735) had no lineage information, although there is a sequence for it in the 16S rRNA file. This sequence was added to TaxMan with the lineage "Bacteria;unclassified".
Greengenes: The Greengenes data was downloaded after the update on October 2, 2011 (file: current_GREENGENES_gg16S_unaligned.fasta). The (taxa) information contains both Greengenes accession codes (merged with an underscore) followed by the complete lineage as is given by Greengenes. In the Greengenes files, information on the lineage is present in the form of a letter followed by two underscores. For kingdom information this is "k__"; this information is stripped off.
Example:
Original header
>14 AF068820.2 hydrothermal vent clone VC2.1 Arc13 k__Archaea; p__Euryarchaeota; c__Thermoplasmata; o__Thermoplasmatales; f__Aciduliprofundaceae; otu_204
TaxMan header (same format as NCBI Taxonomy or SILVA)
>14_AF068820.2 Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Aciduliprofundaceae;otu_204
SILVA: Data was downloaded and not changed/extended.
Vaginal 16S reference: The vaginal dataset was last updated on 20-05-2011 and was downloaded on 22-12-2011. The files contain information on the NCBI accession number and taxon id. We recursively looped through the NCBI taxonomy files (downloaded on 22-12-2011) to create the complete lineages. The sequence headers start with the accession number of the vaginal set followed by the lineage information. If the lineage was not known up to species level, we added the label "unclassified" after the last known level.
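
As an illustration of the Greengenes header conversion shown in the example above, the sketch below reproduces the TaxMan header from the original header. It is not the script used to build the database; the function name is hypothetical and the parsing rules (first two fields are the accession codes, the lineage starts at "k__") are assumptions derived from that single example.

```python
import re

# Rank prefixes such as "k__", "p__", "c__": one letter and two underscores.
RANK_PREFIX = re.compile(r"^[a-z]__")


def greengenes_to_taxman(header):
    """Convert a Greengenes FASTA header to the TaxMan header format."""
    body = header.lstrip(">").strip()
    fields = body.split()
    accession = fields[0] + "_" + fields[1]

    # Assumption: the lineage is everything from the kingdom field onwards.
    match = re.search(r"\sk__", body)
    if match is None:
        return ">" + accession
    levels = [
        RANK_PREFIX.sub("", level.strip())
        for level in body[match.start():].split(";")
    ]
    lineage = ";".join(level for level in levels if level)
    return ">" + accession + " " + lineage


original = (
    ">14 AF068820.2 hydrothermal vent clone VC2.1 Arc13 k__Archaea; "
    "p__Euryarchaeota; c__Thermoplasmata; o__Thermoplasmatales; "
    "f__Aciduliprofundaceae; otu_204"
)
print(greengenes_to_taxman(original))
```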

TaxMan options

E-mail address

TaxMan output

Overview section

Download section

Taxonomic tree section

An expandable tree is shown here. You can click on the plus graphic or the name to expand a part of the tree. You can change the height of the tree by dragging the bottom or right border or the bottom-left corner of the Tree area. You can also change the height by typing a (natural) number in the field for "Height of Tree viewer" and pressing enter. Internet Explorer users can only use the latter option.

You can use "Find" to search for a name (case-insensitive). After entering at least three characters, press enter or click the "Find" button. The part of the tree containing a name that starts with your search query will expand. The entire match is set in bold and will be scrolled into view. You can press "Find Next" to jump to a (possible) next match.

The numbers in the tree refer to the number of sequences found by your primers and the number in the original reference database used, respectively. For convenience, the percentage is also shown. If this percentage is "--", the sequence was assigned an ambiguous lineage not present in the reference. One unique amplicon sequence can have several different lineages associated with it (see above). For the tree viewer only, an "extra" node is created, named "ambiguous". Ambiguous taxonomic assignments are collected under this node (at the appropriate level) so as to separate these clearly from the other assignments.
Example: Streptococcus;(cristatus/oligofermentans) is present in the tree viewer as Streptococcus;ambiguous;(cristatus/oligofermentans).
Note that this is not relevant if you used the option "Use only identical part of lineage".
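
The placement of the "ambiguous" node can be pictured with the small sketch below. It only mirrors the example given above; the function name and the handling of a trailing count such as "_2" are assumptions, not the tree viewer's actual code.

```python
import re

# A summarized level looks like "(name/name)" with an optional count suffix.
AMBIGUOUS_LEVEL = re.compile(r"^\(.*\)(_\d+)?$")


def tree_path(lineage):
    """Return the node names under which a lineage is placed in the tree."""
    levels = lineage.split(";")
    if AMBIGUOUS_LEVEL.match(levels[-1]):
        # Collect ambiguous assignments under an extra "ambiguous" node.
        levels.insert(len(levels) - 1, "ambiguous")
    return levels


print(";".join(tree_path("Streptococcus;(cristatus/oligofermentans)")))
# Streptococcus;ambiguous;(cristatus/oligofermentans)
```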

Pie chart section

The pie charts show the distribution of the sequences over the taxonomy. If you mouse over a slice or over the legend, the numbers of sequences will be shown in the pie. The numbers are the same as in the tree.

By clicking on a slice of the pie, you can plot the taxonomic categories (children) of this slice. If you click Bacteria in the Root pie, a new pie will show all groups in Bacteria. Note that this new plot contains the subgroups of Bacteria (phyla) and not a lineage like Bacteria;unclassified. The latter is present (and counted) in the Bacteria slice of the Root pie. Now, the counts of the numbers of sequences are also present in the legend (e.g. Actinobacteria |61/72).

You can plot the taxonomic coverage data in different ways and supply a percentage threshold. This threshold is applied to amplicon and reference charts. For example, with ≥ 2% the pie charts will only contain taxa (e.g. Firmicutes) that occur at least 2% in your set, relative to all taxa at their parent level (e.g. Bacteria). Please note:

Example for threshold: The Proteobacteria count is 89/106. The count of Bacteria, its parent, is 504/618. By selecting the threshold of >15%, the Proteobacteria will be filtered out, since 89/618 = 14.4%. The first number refers to the number of sequences of this taxon targeted by your primers, the second to the number of sequences of this taxon in the reference database used.
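
The arithmetic of this example can be checked with a few lines of code. The sketch follows the calculation as given in the example (amplicon count of the taxon divided by the reference count of its parent); the function names are hypothetical, and treating the threshold as "at least" is an assumption.

```python
def coverage_percentage(taxon_count, parent_count):
    """Share of a taxon relative to the count at its parent level, in percent."""
    return 100.0 * taxon_count / parent_count


def passes_threshold(taxon_count, parent_count, threshold_percent):
    """True if the taxon is kept in the pie chart at the given threshold."""
    return coverage_percentage(taxon_count, parent_count) >= threshold_percent


# Proteobacteria: 89 amplicon sequences; Bacteria: 618 reference sequences.
print(round(coverage_percentage(89, 618), 1))  # 14.4
print(passes_threshold(89, 618, 15))           # False: filtered out
print(passes_threshold(89, 618, 2))            # True: kept
```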

In the Pie charts for amplicon sequences, you can plot all taxa or make a "difference" plot that only includes taxa that are missing, to some percentage, from your amplicon set. If you check "Plot differences", the new plot will have a pink header to indicate that this is a different type of plot. These "difference" plots may include all missing taxa or only the taxa missing at least a certain percentage. This allows zooming in on the taxa of interest. This percentage is relative to the number of sequences of a taxon in the reference. For example, 30% means that 30% of the sequences are missing from your set relative to the reference. Accordingly, 100% means that only taxa that are entirely absent from your set are shown.
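
The "missing" percentage used for difference plots can be sketched like this, using the Actinobacteria count 61/72 mentioned earlier as input. The function names are hypothetical and the selection rule is an interpretation of the description above, not the server's code.

```python
def missing_percentage(amplicon_count, reference_count):
    """Percentage of a taxon's reference sequences absent from the amplicon set."""
    return 100.0 * (reference_count - amplicon_count) / reference_count


def in_difference_plot(amplicon_count, reference_count, min_missing_percent=0):
    """True if the taxon is missing at least the requested percentage."""
    missing = missing_percentage(amplicon_count, reference_count)
    return missing > 0 and missing >= min_missing_percent


print(round(missing_percentage(61, 72), 1))  # 15.3
print(in_difference_plot(61, 72, 30))        # False: less than 30% is missing
print(in_difference_plot(0, 72, 100))        # True: taxon entirely absent
```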

The Pie charts for reference database offer a different view of the data: instead of the counts for your amplicon set, the counts of the reference data are plotted here. Mousing over the pie slices will now show the same count twice (the count in the pie is now the same as in the reference). For a fast visualization of what is present in your set, you can mouse over the names in the legend to show the counts in your set and the reference.

Example: In CORE, there are 1045 bacterial sequences. So, mousing over the pie will show "Bacteria; cnt:1045/1045", while mousing over Bacteria in the legend will show "Bacteria; cnt:655/1045". Here, 655 is the number of sequences in Bacteria in your set and 1045 is the total number of Bacteria sequences in CORE.