
TaxMan Information

TaxMan description

What is TaxMan?

TaxMan is an interactive web server that produces rRNA gene subsequences (amplicons) based on your primers, so that you can check the taxonomic coverage your primers give you. You can analyse the data interactively by browsing the taxonomic tree or plotting the coverage in pie charts, and you can download all data for off-line analysis.

Even when the selected reference rRNA gene database is non-redundant, PCR can result in identical sub-sequences; therefore a highly redundant set of amplicons can be formed. The taxonomic lineages belonging to these sequences can be the same, but can also be different. Depending on your primers, different taxonomic phyla may form identical sub-sequences.

TaxMan calculates the amplicon sequences, makes them non-redundant and summarizes the lineages, which are taken from the selected rRNA gene database. Each sequence receives a unique FASTA header line. If there are multiple headers with the same lineage (but different sequences), a count will be appended. The example primers on this site form 796 amplicon sequences from the 1045 sequences in CORE, but only 655 of these are unique (so 141 (17.7%) of the amplicon sequences are duplicates). Especially with larger sequence databases, the redundancy after "PCR" can be significant.

The header line of the FASTA sequences contains the highest level at which the taxonomy starts to differ. This level is present so that it is immediately clear which taxonomic categories now have the same (amplicon) sequence. The example output using the CORE database contains several Streptococci for which the lineages are summarized. For example:
Bacteria;Firmicutes;Bacillales;Lactobacillales;Streptococcaceae;Streptococcus;(cristatus/oligofermentans sinensis).
As TaxMan is sequence oriented, (different) sequences with the same lineage in the non-redundant amplicon set get a count. For example:
Bacteria;Firmicutes;Bacillales;Lactobacillales;Streptococcaceae;Streptococcus;(cristatus/oligofermentans sinensis)_2
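
The summarizing and counting described above can be sketched in a few lines of Python. This is a minimal illustration, not TaxMan's actual code: the function names are hypothetical, and joining the differing names with "/" in first-seen order is an assumption based on the examples on this page.

```python
def summarize_lineages(lineages):
    """Summarize the lineages of identical amplicons (hypothetical helper).

    Keeps the shared part of the lineages and adds the first level at
    which they differ as "(name1/name2)". If one lineage is a prefix of
    another, only the shared part is returned.
    """
    split = [lineage.split(";") for lineage in lineages]
    summary = []
    for levels in zip(*split):
        names = list(dict.fromkeys(levels))  # unique names, first-seen order
        if len(names) == 1:
            summary.append(names[0])
        else:
            summary.append("(" + "/".join(names) + ")")
            break  # stop at the first differing level
    return ";".join(summary)


def make_unique(headers):
    """Append _2, _3, ... to repeated lineages so every header is unique."""
    seen = {}
    unique = []
    for header in headers:
        seen[header] = seen.get(header, 0) + 1
        unique.append(header if seen[header] == 1 else f"{header}_{seen[header]}")
    return unique


print(summarize_lineages(["Bacteria;Firmicutes;Bacilli",
                          "Bacteria;Bacteroidetes;Bacteroidia"]))
print(make_unique(["Bacteria;Firmicutes", "Bacteria;Firmicutes"]))
```

The first lineage with a given text keeps its plain form; only later occurrences receive a suffix, so the counts start at _2.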

This web server

With this server, you can:
  • find sequences matching your primers
  • select different rRNA gene target databases (contact us if you would like us to add your database of choice)
  • interactively browse and search the taxonomic tree of your sequences
  • interactively plot the distribution of the related taxa
  • download your FASTA amplicon sequences with a choice between two FASTA headers.
  • download a lineage file that includes the counts for all taxa based on your inputs.
For more information, see the sections below.

TaxMan input


Paste in your primer sequences: supply the forward and reverse primers. They may contain IUPAC ambiguity codes (see base codes and their reverse complement). The reverse primer needs to be in reverse complement orientation, which is common for PCR, but this may not be the case in your analysis pipeline. If you click the "Example" button, a primer set targeting the V5-V6 hypervariable region of the 16S rRNA gene is shown.
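
If your pipeline stores the reverse primer in the other orientation, a small helper like the one below can convert it before you paste it in. This is a sketch using the standard IUPAC complement pairs; the function name is hypothetical and not part of TaxMan.

```python
# Standard IUPAC complements: A-T, C-G, R-Y, K-M, B-V, D-H; S, W and N
# are their own complement.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(primer):
    """Return the reverse complement of a primer with IUPAC codes."""
    return primer.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ACGR"))
```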

The primers may contain any of the IUPAC ambiguity codes. Sequences may also contain these codes, which means, for example, "R" should match G or A or R. Therefore, we expand these codes in the following way:

Ambiguity code	Expansion

You cannot use ambiguity codes in so-called character classes, like [GAR]. However, this is not needed, as we expand the codes for you. We currently calculate the amplicons with a version of EMBOSS primersearch that we adapted to use the table above. For each database sequence, we only keep the longest amplicon sequence.

Note: The table above does not contain the bases (A,C,G,T). Thus, when your primer contains, for example, an "A" but the sequence contains an "R" at the corresponding position, a mismatch results. This occurs only rarely. Under options, you can either allow mismatches and include the entire primer in the 3' mismatch window or tick the check box for "Allow bases (A,C,G,T) to include all their ambiguities". The latter option only works when the mismatch percent remains set to zero.
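
The expansion can be illustrated by turning a primer into a regular expression. The sketch below is an assumption about how such an expansion could look, not the adapted primersearch code: each ambiguity code is expanded to its bases plus the code itself (so "R" matches G, A or R), and the optional expand_bases flag mimics the "Allow bases (A,C,G,T) to include all their ambiguities" option. The function name and flag are hypothetical.

```python
import re

# Standard IUPAC ambiguity codes and the bases they stand for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer, expand_bases=False):
    """Expand a primer into a regular expression (illustrative only)."""
    parts = []
    for char in primer.upper():
        if char in IUPAC:
            # An ambiguity code matches its bases or the same code literally.
            parts.append("[" + IUPAC[char] + char + "]")
        elif char in "ACGT":
            if expand_bases:
                # A base also matches every ambiguity code that includes it.
                codes = "".join(sorted(c for c, bases in IUPAC.items() if char in bases))
                parts.append("[" + char + codes + "]")
            else:
                parts.append(char)
        else:
            raise ValueError(f"not an IUPAC nucleotide code: {char!r}")
    return "".join(parts)


print(primer_to_regex("AR"))                     # A[AGR]
print(primer_to_regex("A", expand_bases=True))   # [ADHMNRVW]
print(bool(re.fullmatch(primer_to_regex("A"), "R")))                     # False
print(bool(re.fullmatch(primer_to_regex("A", expand_bases=True), "R")))  # True
```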

Targeted region

Not all rRNA gene sequences in the databases are full-length sequences. In all cases, the reported taxonomic coverage relates to the produced amplicon sequences. However, when you use this option, the number of sequences that did not produce amplicons because they are too short to include your primers is also reported. Note that if you allow a high mismatch percentage (e.g. 30%), your primers may produce an amplicon in a different region than the one you target. Since an amplicon is produced in this case, the related sequences are not included in the number of incomplete sequences.

You can supply the begin and end position of the region targeted in the E. coli reference sequence (see below). If you enter these coordinates, the number of sequences that do not have sequence information for the entire targeted region will be reported. These are sequences that lack nucleotide data at the beginning and/or end of the rRNA gene. This option relies on the multiple sequence alignment provided by the different databases.

If you use this option, both begin and end coordinates have to be supplied. These coordinates correspond to the targeted region. For example, if your forward primer starts at position 250 and your reverse primer ends at 1200 (on the forward strand), the begin/end values are 250/1200.
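
How such a check could work on an aligned sequence is sketched below. This is only an illustration under stated assumptions: gaps are written as "-" or ".", the E. coli reference is part of the same alignment, and both function names are hypothetical. TaxMan's own implementation is not documented here.

```python
GAP_CHARS = "-."


def reference_to_column(aligned_reference, position):
    """Map a 1-based position in the ungapped reference to a 0-based alignment column."""
    count = 0
    for column, char in enumerate(aligned_reference):
        if char not in GAP_CHARS:
            count += 1
            if count == position:
                return column
    raise ValueError("position lies beyond the end of the reference")


def covers_region(aligned_sequence, aligned_reference, begin, end):
    """True if the sequence has nucleotide data on both sides of the targeted region."""
    first = reference_to_column(aligned_reference, begin)
    last = reference_to_column(aligned_reference, end)
    data = [i for i, char in enumerate(aligned_sequence) if char not in GAP_CHARS]
    return bool(data) and data[0] <= first and data[-1] >= last


reference = "AC-GTAC"
sequence = "-C-GTA-"   # lacks data at the first reference position
print(covers_region(sequence, reference, 2, 5))   # True
print(covers_region(sequence, reference, 1, 5))   # False
```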

The reference sequences currently implemented are the E. coli 16S rRNA and 23S rRNA gene sequences. The identifiers of these reference sequences are taken from the conservation diagrams of the Gutell lab.

16S rRNA gene reference:
>gb|J01695.2|ECORGNB:1268-2809 E.coli rRNA operon (rrnB) coding for Glu-tRNA-2, 5S, 16S and 23S rRNA

23S rRNA gene reference (for SILVA LSU):
>gb|J01695.2|ECORGNB:3250-6153 E.coli rRNA operon (rrnB) coding for Glu-tRNA-2, 5S, 16S and 23S rRNA

Implementation details


You can choose to search a variety of rRNA gene databases:
  • CORE: OSU CORE database for the core oral microbiome.
  • Greengenes
  • HOMD: Human Oral Microbiome Database (16S rRNA RefSeq)
  • SILVA, comprehensive ribosomal RNA databases, with the following sections:
    • SSU: small subunit
    • SSU NR: small subunit with human skin (HSM) and mouse wound microbiome (MWM) added.
    • LSU: large subunit
  • Vaginal: Vaginal 16S reference database

Building details

TaxMan options

The following options are available:
  • Primers:
    Mismatch percent
    You may select a primer mismatch percentage. We limit the mismatch percentage to 30% to prevent non-usable output. If your primer is 20 bp long, a 20% mismatch will allow 4 mismatching bases.
    3' mismatch window
    If the primer has a mismatch in the last few 3' nucleotides, where the primer should be extended, extension is hampered. You can select the size of the window where no mismatches may occur. If a mismatch occurs within this window, no amplicon is reported. The window can be set to an integer value larger than zero.
    Allow bases (A,C,G,T) to include all their ambiguities
    You can choose to expand all A, C, G and T's in your primer to include all their corresponding IUPAC ambiguity codes (as literals; see table below). This option only works when the mismatch percent remains set to zero. When you check this option, an "A" in your primer will match an "R" in a possible target sequence. This way, such literal mismatches will not be regarded as mismatches when the mismatch percent is set to zero. See above for more information on ambiguities in primers.
    Remove Primer
    You can choose to remove the forward primer, reverse primer or both. The primers will be removed during amplicon calculation. In that case, the taxonomy data and downloadable FASTA files will not include the removed primer(s).
  • Taxonomy / Headers
    These options relate to the newly formed amplicons only. If the original source database was redundant (in either sequences or lineages), you may see the Reference lineages as if these options were switched off.
    Use only identical part of lineage
    The first taxonomic level where the lineages for a sequence start to differ is not included (see "header line" above). Only the part of the lineages that is identical for all (redundant) sequences is used in the sequence header. Thus, "Bacteria;Firmicutes;Bacillales;Lactobacillales;Streptococcaceae;Streptococcus;(cristatus/oligofermentans sinensis)" now becomes "Bacteria;Firmicutes;Bacillales;Lactobacillales;Streptococcaceae;Streptococcus".
    No unique FASTA headers
    If a lineage is identical for multiple (different) sequences, a count is appended (starting with 2, as _2, _3, _4) (see above).
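
The interplay of the mismatch percent and the 3' mismatch window can be sketched as follows. This is a simplified illustration, not the primersearch logic: it compares plain characters (no ambiguity expansion), assumes the primer and the binding site are given in the same 5' to 3' orientation, and assumes that fractional mismatch counts are rounded down. All names are hypothetical.

```python
def allowed_mismatches(primer_length, mismatch_percent):
    """Number of mismatching bases permitted (rounding down is an assumption)."""
    if not 0 <= mismatch_percent <= 30:
        raise ValueError("mismatch percent must be between 0 and 30")
    return primer_length * mismatch_percent // 100


def primer_binds(primer, site, mismatch_percent=0, window=1):
    """Check a primer against an equally long binding site."""
    if len(primer) != len(site):
        raise ValueError("primer and site must have the same length")
    mismatches = [i for i, (p, s) in enumerate(zip(primer, site)) if p != s]
    # No mismatch may fall inside the 3' window (the last `window` bases).
    if any(i >= len(primer) - window for i in mismatches):
        return False
    return len(mismatches) <= allowed_mismatches(len(primer), mismatch_percent)


print(allowed_mismatches(20, 20))                                   # 4
print(primer_binds("ACGTACGT", "TCGTACGT", mismatch_percent=20, window=3))  # True
print(primer_binds("ACGTACGT", "ACGTACGA", mismatch_percent=20, window=3))  # False
```

In the second call the single mismatch lies at the 5' end and is tolerated; in the third it lies in the 3' window, so no amplicon would be reported.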

E-mail address

If you supply your e-mail address, the URL of the results page is sent to you when your job is finished. Providing your e-mail address is optional. You can also bookmark the results page to access the results at a later time. Results are generally kept for two weeks.

TaxMan output

The example output can be regenerated by running the example input. Please note that for several rRNA gene databases, taxonomic orders or families can be missing. This results in FASTA headers in which several ";" follow each other directly. In the Tree and Pie chart sections, we show "noname" when a taxonomic category is missing (or is an empty string).
Below, we describe the different sections of the output page.

Overview section

This section mainly provides links to quickly jump to the section of your interest.

Download section

The download section provides links to the three files generated based on your primers. These files are gzipped. If you are a Windows user and cannot open the files, you could download 7-zip or gzip. The following files are available:
  • The Lineage file (tab-delimited) provides counts for all taxa in your set (first column) as well as the original database (second column). This file can be used locally (e.g. in a spreadsheet) to analyse the taxa. The first lines in this file can be, for example:
    #Amplicon count		Reference  count	Lineage
    655			1045			Bacteria
    1			1			Bacteria;Acidobacteria
    1			1			Bacteria;Acidobacteria;Acidobacteria
    1			1			Bacteria;Acidobacteria;Acidobacteria;Acidobacteriales
  • The Amplicon FASTA file contains the amplicon sequences. When more than one amplicon can be formed from a single sequence, only the longest is included. As amplicons are sub-sequences of the reference sequences, several amplicons can have identical sequences and can originate from sequences of different species. The lineage in the FASTA header line is summarized up to the first taxonomic category that differs. For example, if two species cannot be distinguished based on their amplicon, the lineage could become "Bacteria;(Firmicutes/Bacteroidetes)". We do not summarize this to "Bacteria", as this would not show which groups or species were merged. Remember that headers are unique (see above). The FASTA header line contains the sequence id, followed by all ids with this exact sequence (as id1|id2|id3) and the lineage.
  • The Amplicon NRDB FASTA file is similar to the previous, but here all headers of identical sequences are concatenated and not summarized. The concatenation character is Start-of-Header (SOH, ^A) as used by the NCBI in their non-redundant FASTA files.
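
As an alternative to a spreadsheet, the downloaded files can be read with a short script. The sketch below assumes the column order shown in the lineage file example above and the SOH separator described for the NRDB file; the function names and the file name in the usage note are hypothetical.

```python
import gzip
import re


def read_lineage_file(path):
    """Read a gzipped, tab-delimited lineage file into (amplicon, reference, lineage) tuples."""
    rows = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and the header line
            # Tolerate runs of tabs between the columns.
            amplicon, reference, lineage = re.split(r"\t+", line, maxsplit=2)
            rows.append((int(amplicon), int(reference), lineage))
    return rows


def coverage_percent(amplicon_count, reference_count):
    """Share of the reference sequences of a taxon that is present in your set."""
    return 100.0 * amplicon_count / reference_count


def split_nrdb_header(header):
    """Split an NRDB FASTA header into the concatenated original headers."""
    return header.lstrip(">").split("\x01")  # \x01 is Start-of-Header (^A)
```

For example, calling read_lineage_file("lineage.txt.gz") on a downloaded file (the file name here is made up) returns one tuple per taxon, and coverage_percent(655, 1045) gives the coverage for the Bacteria line of the example.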

Taxonomic tree section

An expandable tree is shown here. You can click on the plus graphic or the name to expand a part of the tree. You can change the height of the tree by dragging the bottom or right border or the bottom-left corner of the Tree area. You can also change the height by typing a (natural) number in the field for "Height of Tree viewer" and pressing enter. Internet Explorer users can only use the latter option.

You can use "Find" to search for a name (case-insensitive). After entering at least three characters, press enter or click the "Find" button. The part of the tree that contains a name starting with your search query will expand. The entire match is set in bold and scrolled into view. You can press "Find Next" to jump to a (possible) next match.

The numbers in the tree refer to the number of sequences found by your primers and the number in the original reference database used, respectively. For convenience, the percentage is also shown. If this percentage is "--", the sequence was assigned an ambiguous lineage not present in the reference. One unique amplicon sequence can have several different lineages associated with it (see above). For the tree viewer only, an "extra" node is created, named "ambiguous". Ambiguous taxonomic assignments are collected under this node (at the appropriate level) to separate them clearly from the other assignments.
Example: Streptococcus;(cristatus/oligofermentans) is present in the tree viewer as Streptococcus;ambiguous;(cristatus/oligofermentans).
Note that this is not relevant if you used the option "Use only identical part of lineage".
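
The placement of the "ambiguous" node and the "--" percentage can be sketched as below. This is a guess at the logic for illustration only: it assumes that a summarized level is recognizable by its parentheses and that "--" is shown when the reference count is zero. The function names and the exact label format are hypothetical.

```python
def tree_path(lineage):
    """Return the tree viewer path, with an 'ambiguous' node before a summarized level."""
    path = []
    for level in lineage.split(";"):
        if level.startswith("(") and level.endswith(")"):
            path.append("ambiguous")
        path.append(level)
    return path


def count_label(amplicon_count, reference_count):
    """Format the counts and percentage for a tree node (format is an assumption)."""
    if reference_count == 0:
        return f"{amplicon_count}/{reference_count} (--)"
    percent = 100.0 * amplicon_count / reference_count
    return f"{amplicon_count}/{reference_count} ({percent:.1f}%)"


print(";".join(tree_path("Streptococcus;(cristatus/oligofermentans)")))
```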

Pie chart section

The pie charts show the distribution of the sequences over the taxonomy. If you mouse over a slice or over the legend, the numbers of sequences will be shown in the pie. The numbers are the same as in the tree.

By clicking on a slice of the pie, you can plot the taxonomic categories (children) of this slice. If you click Bacteria in the Root pie, a new pie will show all groups in Bacteria. Note that this new plot contains the subgroups of Bacteria (phyla) and not a lineage like Bacteria;unclassified. The latter is present (and counted) in the Bacteria slice of the Root pie. In the new pie, the sequence counts are also shown in the legend (e.g. Actinobacteria |61/72).

You can plot the taxonomic coverage data in different ways and supply a percentage threshold. This threshold is applied to amplicon and reference charts. For example, with ≥ 2% the pie charts will only contain taxa (e.g. Firmicutes) that occur at least 2% in your set, relative to all taxa at their parent level (e.g. Bacteria). Please note:

  • changing this threshold does not change the existing pie charts: the threshold is applied to new charts only. This makes plotting most flexible, as each pie can have a different threshold, which is shown in the pie area.
  • TaxMan is sequence oriented. The counts (therefore threshold) refer to the number of sequences in a certain taxon.
  • this percentage is calculated using the sequence counts in the reference database as denominator.

Example for threshold: The Proteobacteria count is 89/106 and the count of Bacteria, its parent, is 504/618. In each pair, the first number is the number of sequences of the taxon targeted by your primers and the second is the number of sequences of that taxon in the reference database used. With a threshold of >15%, the Proteobacteria will be filtered out, since 89/618 = 14.4%.
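
The threshold arithmetic of this example can be reproduced as follows. It is a minimal sketch: the function name is hypothetical, and whether the comparison is strict (>) or inclusive (≥) is an assumption, which makes no difference for the numbers used here.

```python
def passes_threshold(taxon_amplicon_count, parent_reference_count, threshold_percent):
    """True if a taxon reaches the threshold relative to its parent in the reference."""
    percent = 100.0 * taxon_amplicon_count / parent_reference_count
    return percent >= threshold_percent


# Proteobacteria 89/106, parent Bacteria 504/618: 89/618 is about 14.4%
print(round(100.0 * 89 / 618, 1))
print(passes_threshold(89, 618, 15))   # filtered out at 15%
print(passes_threshold(89, 618, 2))    # kept at 2%
```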

In the Pie charts for amplicon sequences, you can plot all taxa or make a "difference" plot that only includes taxa that are missing, to some percentage, from your amplicon set. If you check "Plot differences", the new plot will have a pink header to indicate that this is a different type of plot. These "difference" plots may include all missing taxa or only the taxa missing by at least a certain percentage. This allows zooming in on the taxa of interest to you. This percentage is relative to the number of sequences of a taxon in the reference. For example, 30% means that 30% of the sequences are missing from your set relative to the reference. Thus, 100% means that only taxa that are entirely absent from your set are shown.
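
The percentage used for "difference" plots can be written out as below. This is a sketch with hypothetical function names; treating "all missing taxa" as every taxon with at least one missing sequence is an assumption.

```python
def missing_percent(amplicon_count, reference_count):
    """Percentage of a taxon's reference sequences that is absent from your set."""
    return 100.0 * (reference_count - amplicon_count) / reference_count


def in_difference_plot(amplicon_count, reference_count, min_missing_percent=0):
    """True if a taxon would appear in a 'difference' plot at the given percentage."""
    missing = missing_percent(amplicon_count, reference_count)
    return missing > 0 and missing >= min_missing_percent


# Actinobacteria 61/72: 11 of the 72 reference sequences are missing
print(round(missing_percent(61, 72), 1))
print(in_difference_plot(61, 72, 30))
print(in_difference_plot(0, 10, 100))
```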

The Pie charts for reference database allow a different view of the data: instead of the counts for your amplicon set, the counts in the reference database are plotted. Mousing over the pie slices will now show the same count twice (the count in the pie is now the same as the reference). For a fast visualization of what is present in your set, you can mouse over the names in the legend to show the counts in your set and the reference.

Example: In CORE, there are 1045 bacterial sequences. So, mousing over the pie will show "Bacteria; cnt:1045/1045", while mousing over Bacteria in the legend will show "Bacteria; cnt:655/1045". Here, 655 is the number of sequences in Bacteria in your set and 1045 is the total number of Bacteria sequences in CORE.

(c) IBIVU 2017. If you are experiencing problems with the site, please contact the webmaster.