TaxMan description
What is TaxMan?
TaxMan is an interactive web server for the production of rRNA gene subsequences based on your primers.
You can download all data and analyse it interactively, also off-line, by browsing the tree or plotting the coverage in pie charts.
Thus, you can check the taxonomic coverage your primers give you.
Even when the selected reference rRNA gene database is non-redundant, PCR can result in identical sub-sequences;
therefore a highly redundant set of amplicons can be formed.
The taxonomic lineages belonging to these sequences can be the same, but can also be different.
Depending on your primers, different taxonomic phyla may form identical sub-sequences.
TaxMan calculates the amplicon sequences, makes them non-redundant and summarizes the lineages, which are taken from the selected rRNA gene database.
Each sequence receives a unique FASTA header line.
If there are multiple headers with the same lineage (but different sequences), a count will be appended.
The example primers on this site form 937 sequences from CORE, but only 796 are unique
(so 141 amplicon sequences, or 17.7% relative to the 796 unique ones, are identical to another amplicon).
Especially with larger sequence databases, the redundancy after "PCR" can be significant.
The header line of the FASTA sequences contains the highest level at which the taxonomy starts to differ.
This level is present so that it is immediately clear which taxonomic categories now have the same (amplicon) sequence.
The example output using the CORE database contains several Streptococci for which the lineages are summarized.
For example:
Bacteria;Firmicutes;Bacillales;Lactobacillales;Streptococcaceae;Streptococcus;(cristatus/oligofermentans sinensis).
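The lineage summary described above can be sketched as follows. This is only an illustration, not the TaxMan implementation: the function names, the toy sequences and the use of "/" to join the differing names are assumptions made for this example.

```python
from collections import defaultdict


def summarize_lineages(lineages, identical_only=False):
    """Summarize the lineages that share one identical amplicon sequence.

    The shared part of the lineages is kept. Unless identical_only is set,
    the names found at the first differing level are appended in brackets.
    """
    split = [lineage.split(";") for lineage in lineages]
    common = []
    for levels in zip(*split):
        if len(set(levels)) > 1:
            break  # first level at which the taxonomy starts to differ
        common.append(levels[0])
    depth = len(common)
    # Names at the first differing level, duplicates removed, order kept.
    differing = list(dict.fromkeys(p[depth] for p in split if len(p) > depth))
    if differing and len(set(lineages)) > 1 and not identical_only:
        common.append("(" + "/".join(differing) + ")")  # "/" is an assumption
    return ";".join(common)


def collapse_amplicons(records):
    """Map each unique amplicon sequence to one summarized lineage.

    records is an iterable of (lineage, sequence) pairs.
    """
    by_sequence = defaultdict(list)
    for lineage, sequence in records:
        by_sequence[sequence].append(lineage)
    return {seq: summarize_lineages(lins) for seq, lins in by_sequence.items()}


# Toy input: the sequences are made up for illustration.
prefix = "Bacteria;Firmicutes;Bacillales;Lactobacillales;Streptococcaceae;Streptococcus"
records = [
    (prefix + ";cristatus", "ACGTACGT"),
    (prefix + ";oligofermentans", "ACGTACGT"),
    (prefix + ";mutans", "TTGACCGA"),
]
for sequence, lineage in collapse_amplicons(records).items():
    print(sequence, lineage)
```

Passing identical_only=True mimics the option "Use only identical part of lineage": the bracketed level is then left out.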
As TaxMan is sequence oriented, (different) sequences with the same lineage in the non-redundant amplicon set get a count.
For example:
Bacteria;Firmicutes;Bacillales;Lactobacillales;Streptococcaceae;Streptococcus;(cristatus/oligofermentans sinensis)_2
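A minimal sketch of how such counts could be appended is given below. It assumes that the first sequence with a given lineage keeps the plain lineage and that later ones receive _2, _3 and so on; the function name is hypothetical.

```python
def unique_headers(lineages):
    """Return one header per sequence; repeated lineages get _2, _3, ..."""
    seen = {}
    headers = []
    for lineage in lineages:
        seen[lineage] = seen.get(lineage, 0) + 1
        count = seen[lineage]
        # The first occurrence stays unchanged; counting starts with 2.
        headers.append(lineage if count == 1 else f"{lineage}_{count}")
    return headers


lineage = "Streptococcus;(cristatus/oligofermentans sinensis)"
print(unique_headers([lineage, "Streptococcus;mutans", lineage]))
```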
This web server
With this server, you can:
- find sequences matching your primers
- select different rRNA gene target databases (contact us if you would like us to add your database of choice)
- interactively browse and search the taxonomic tree of your sequences
- interactively plot the distribution of the related taxa
- download your FASTA amplicon sequences with a choice between two FASTA headers.
- download a lineage file that includes the counts for all taxa based on your inputs.
For more information, see sections below.
TaxMan input
Primers
Paste in your primer sequences: supply the forward and reverse primers. They may contain IUPAC ambiguity codes
(see base codes and their reverse complement).
The reverse primer needs to be in reverse complement orientation, which is common for PCR, but this may not be the case in your
analysis pipeline. If you click the "Example" button, a primer set targeting the V5-V6 hypervariable region of the 16S rRNA gene is shown.
The primers may contain any of the IUPAC ambiguity codes. Sequences may also contain these codes,
which means, for example, "R" should match G or A or R.
Therefore, we expand these codes in the following way:
Ambiguity code | Expansion |
B | TGC BKSY |
D | TGA DKRW |
H | TCA HMWY |
K | TG K |
M | CA M |
R | GA R |
S | GC S |
V | GCA VMRS |
W | TA W |
Y | TC Y |
You cannot use ambiguity codes in so-called character classes, like [GAR]. However, this is not needed as we expand the codes.
We currently use an adapted version, using the table above, of EMBOSS primersearch to calculate the amplicons.
For each database sequence, we only keep the longest amplicon sequence.
Note: The table above does not contain the bases (A,C,G,T). Thus, when your primer contains, for example, an "A" but the sequence
contains an "R" at the corresponding position, a mismatch results. This occurs only rarely.
Under options, you can either allow mismatches and include the entire primer in the 3' mismatch window or tick the check box for
"Allow bases (A,C,G,T) to include all their ambiguities".
The latter option only works when the mismatch percent remains set to zero.
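The effect of both expansion tables can be illustrated with a regular expression. Note that TaxMan itself uses an adapted EMBOSS primersearch; the sketch below only mimics the matching rules, and the function name and the expand_bases parameter are assumptions. The two dictionaries copy the ambiguity table above and the base table given under "TaxMan options". Characters that are in neither table are matched literally.

```python
import re

# Ambiguity code -> characters it may match (from the ambiguity table).
AMBIGUITY = {
    "B": "TGCBKSY", "D": "TGADKRW", "H": "TCAHMWY", "K": "TGK", "M": "CAM",
    "R": "GAR", "S": "GCS", "V": "GCAVMRS", "W": "TAW", "Y": "TCY",
}

# Base -> base plus the ambiguity codes that include it (from the base table).
BASES = {"A": "ADHMRVW", "C": "CBHMSVY", "G": "GBDKRSV", "T": "TBDHKWY"}


def primer_to_regex(primer, expand_bases=False):
    """Build a regular expression that matches a primer with zero mismatches."""
    parts = []
    for char in primer.upper():
        if char in AMBIGUITY:
            parts.append("[" + AMBIGUITY[char] + "]")
        elif expand_bases and char in BASES:
            parts.append("[" + BASES[char] + "]")
        else:
            parts.append(re.escape(char))  # literal match only
    return re.compile("".join(parts))


# "R" in the primer matches G, A or R in the sequence.
print(bool(primer_to_regex("GAR").fullmatch("GAR")))
# An "A" in the primer matches an "R" in the sequence only when bases are expanded.
print(bool(primer_to_regex("GA").fullmatch("GR")))
print(bool(primer_to_regex("GA", expand_bases=True).fullmatch("GR")))
```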
Targeted region
Not all rRNA gene sequences in the databases are full-length sequences.
In all cases, the reported taxonomic coverage relates to the produced amplicon sequences.
However, when you use this option, the number of sequences that did not produce amplicons, because they are too short to include your primers,
is reported. Note that if you allow a high mismatch percentage (e.g. 30%), your primers may produce an amplicon in a different region than you target.
Since an amplicon is produced in this case, the related sequence is not included in this number of incomplete sequences.
You can supply the begin and end position of the region targeted in the E. coli reference sequence (see below).
If you enter these coordinates, the number of sequences that do not have sequence information for the entire targeted region
will be reported. These are sequences that lack nucleotide data at the begin and/or end of the rRNA gene.
This option relies on the multiple sequence alignment provided by the different databases.
If you use this option, both begin and end coordinates have to be supplied.
These coordinates correspond to the targeted region.
For example, if your forward primer starts at position 250 and your reverse primer ends at 1200 (on the forward strand),
the begin/end values are 250/1200.
The reference sequences currently implemented are the E. coli 16S rRNA and 23S rRNA gene sequences.
The identifiers of these reference sequences are from conservation diagrams of the Gutell lab.
16S rRNA gene reference:
>gb|J01695.2|ECORGNB:1268-2809 E.coli rRNA operon (rrnB) coding for Glu-tRNA-2, 5S, 16S and 23S rRNA
23S rRNA gene reference (for SILVA LSU):
>gb|J01695.2|ECORGNB:3250-6153 E.coli rRNA operon (rrnB) coding for Glu-tRNA-2, 5S, 16S and 23S rRNA
We have indexed the coordinates of the sequences in the multiple sequence alignments (MSAs) that were available.
These MSAs are produced by the respective databases.
The CORE and vaginal reference MSAs did not include the exact E. coli reference we use.
We have used pyNAST v1.1 to align this reference sequence to the existing MSA.
For each sequence its begin and end position in the MSA are stored.
The positions that you provide are coordinates in the reference sequence (e.g. 895 as begin when you use the 895F primer).
We map these positions to the coordinates in the MSA using the reference and the aligned reference sequence.
Next, we retrieve the sequence identifiers of the sequences that have a
begin coordinate larger than or end coordinate smaller than the positions in the reference.
Thus, the number of sequences that do not contain sequence information spanning the targeted region can be calculated.
The identifiers of the missing sequences are checked against the identifiers of the amplicons.
This means that if your primers target the region from 967 to 1046, but you supply the begin/end positions as 10/1500,
the reported number of sequences with missing information is large, but the corrected coverage is 100%.
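The coordinate mapping and the check against the amplicon identifiers can be sketched as follows. This is an illustration under assumptions, not the TaxMan code: gaps are assumed to be written as "-" or ".", columns are 0-based, reference positions are 1-based, and all names and the toy alignment are made up.

```python
def reference_to_msa_column(aligned_reference, position):
    """Return the 0-based MSA column that holds a 1-based reference position."""
    seen = 0
    for column, char in enumerate(aligned_reference):
        if char not in "-.":  # assumed gap characters
            seen += 1
            if seen == position:
                return column
    raise ValueError("position lies outside the reference sequence")


def incomplete_sequences(spans, aligned_reference, begin, end):
    """Identifiers of sequences that do not span the targeted region.

    spans maps a sequence identifier to the (first, last) MSA column
    in which that sequence has nucleotide data.
    """
    col_begin = reference_to_msa_column(aligned_reference, begin)
    col_end = reference_to_msa_column(aligned_reference, end)
    return {
        seq_id
        for seq_id, (first, last) in spans.items()
        if first > col_begin or last < col_end
    }


def incomplete_without_amplicon(incomplete_ids, amplicon_ids):
    """Drop sequences that produced an amplicon despite being incomplete."""
    return set(incomplete_ids) - set(amplicon_ids)


# Toy alignment of the reference: positions 1..5 sit in columns 2, 3, 5, 6, 8.
aligned_reference = "--AC-GT-A"
spans = {"s1": (0, 8), "s2": (4, 8), "s3": (2, 5)}
incomplete = incomplete_sequences(spans, aligned_reference, begin=2, end=4)
print(sorted(incomplete))
print(sorted(incomplete_without_amplicon(incomplete, {"s3"})))
```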
Databases
You can choose to search a variety of rRNA gene databases:
- CORE: OSU CORE database for the core oral microbiome.
- Greengenes
- HOMD: Human Oral Microbiome Database (16S rRNA RefSeq)
- SILVA, comprehensive ribosomal RNA databases,
with the following sections:
- SSU: small subunit
- SSU NR: small subunit with human skin (HSM) and mouse wound microbiome (MWM) added.
- LSU: large subunit
- Vaginal: Vaginal 16S reference database
Data files have been downloaded and where applicable the taxonomy has been added to the FASTA header.
The building process depends on the database:
- CORE: The CORE database was downloaded on 23-08-2011 from the CORE downloads page.
We downloaded the EXCEL version and, using the data in this file, we created FASTA headers that include the taxonomic lineage.
The headers contain the CORE accession id followed by the complete lineage information from the same EXCEL file.
- HOMD:
The Human Oral Microbiome Database (HOMD) was downloaded from the HOMD site.
We downloaded version 10.1 of the 16S rRNA gene database
and the taxon table in text format.
The data is linked through the HOT id.
The headers from the HOMD data start with the HOT id merged with the strain synonym (like in the HOMD file),
followed by the lineage information from the taxon table file.
A single HOT id had no lineage information (HOT id: 735), while there is a sequence in the 16S rRNA file.
This sequence is added to TaxMan with the lineage "Bacteria;unclassified".
- Greengenes:
The Greengenes data was downloaded after the update on October 2, 2011
(file: current_GREENGENES_gg16S_unaligned.fasta).
The (taxa) information contains both Greengenes accession codes (merged with an underscore) followed by the complete lineage as is given by Greengenes.
In the Greengenes files, information on the lineage is present in the form of a letter followed by two underscores.
For kingdom information this is "k__"; this information is stripped off.
Example:
Original header
>14 AF068820.2 hydrothermal vent clone VC2.1 Arc13 k__Archaea; p__Euryarchaeota; c__Thermoplasmata; o__Thermoplasmatales; f__Aciduliprofundaceae; otu_204
TaxMan header (same format as NCBI Taxonomy or SILVA)
>14_AF068820.2 Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Aciduliprofundaceae;otu_204
- SILVA: Data was downloaded and not changed/extended.
- Vaginal 16S reference:
The vaginal dataset was last updated on 20-05-2011 and was downloaded on 22-12-2011.
The files contain information on the NCBI accession number and taxon id.
We recursively looped through the NCBI taxonomy files (downloaded on 22-12-2011) to create the complete lineages.
The sequence headers start with the accession number of the vaginal set followed by the lineage information.
If the lineage was not known up to species level we added the label "unclassified" after the last known level.
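The Greengenes header conversion shown in the example above can be sketched as follows. This is a minimal illustration, not the script used to build the database: it assumes that the lineage starts at the "k__" field, that the two accession codes are the first two words of the header, and the function name is hypothetical.

```python
import re

# A rank prefix is one lower-case letter followed by two underscores, e.g. "k__".
RANK_PREFIX = re.compile(r"^[a-z]__")


def greengenes_to_taxman(header):
    """Convert an original Greengenes FASTA header into a TaxMan-style header."""
    description, marker, lineage = header.lstrip(">").partition("k__")
    if not marker:
        raise ValueError("no k__ (kingdom) field found in header")
    first_id, second_id = description.split()[:2]
    levels = [
        RANK_PREFIX.sub("", level.strip())
        for level in (marker + lineage).split(";")
    ]
    return f">{first_id}_{second_id} " + ";".join(levels)


original = (
    ">14 AF068820.2 hydrothermal vent clone VC2.1 Arc13 k__Archaea; "
    "p__Euryarchaeota; c__Thermoplasmata; o__Thermoplasmatales; "
    "f__Aciduliprofundaceae; otu_204"
)
print(greengenes_to_taxman(original))
# >14_AF068820.2 Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Aciduliprofundaceae;otu_204
```

In this sketch, a level that consists only of a prefix (such as "g__") becomes an empty string, which yields consecutive ";" characters in the header.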
TaxMan options
The following options are available:
- Primers:
- Mismatch percent
- You may select a primer mismatch percentage. We limit the mismatch percentage to 30% to prevent non-usable output.
If your primer is 20 bp long, a 20% mismatch will allow 4 mismatching bases.
- 3' mismatch window:
- If the primer has a mismatch in the last few 3' nucleotides, where the primer should be extended, extension is hampered.
You can select the size of the window where no mismatches may occur. If a mismatch occurs within this window no amplicon is reported.
The window can be set to an integer value larger than zero.
- Allow bases (A,C,G,T) to include all their ambiguities
- You can choose to expand all A,C,G and T's in your primer to include all
their corresponding IUPAC ambiguity codes (as literals; see table below).
This option only works when the mismatch percent remains set to zero.
When you check this option an "A" in your primer will match with an "R" in a possible target sequence.
This way, such literal mismatches will not be regarded as mismatches when mismatch percent is set to zero.
See above for more information on ambiguities in primers.
Bases | Expansion |
A | A DHMRVW |
C | C BHMSVY |
G | G BDKRSV |
T | T BDHKWY |
- Remove Primer
- You can choose to remove the forward primer, reverse primer or both.
The primers will be removed during amplicon calculation.
In that case, the taxonomy data and downloadable FASTA files will not include the removed primer(s).
- Taxonomy / Headers
These options relate to the newly formed amplicons only.
If the original source database was redundant (in either sequences or lineages), you may see the Reference lineages as if these options were switched off.
- Use only identical part of lineage
- The first taxonomic level where
the lineages for a sequence start to differ is not included (see "header line" above).
Only the part of the lineages that is identical for all (redundant) sequences is used in the sequence header.
Thus, "Bacteria;Firmicutes;Bacillales;Lactobacillales;Streptococcaceae;Streptococcus;(cristatus/oligofermentans sinensis)"
now becomes "Bacteria;Firmicutes;Bacillales;Lactobacillales;Streptococcaceae;Streptococcus".
- No unique FASTA headers
- If a lineage is identical for multiple (different) sequences,
a count is appended (starting with 2, as _2, _3, _4) (see above).
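The interplay of the mismatch percentage and the 3' mismatch window can be sketched as follows. This is a simplified illustration, not the adapted primersearch: it compares characters literally (no ambiguity codes), assumes that the primer and the candidate site are written 5' to 3' on the same strand, and assumes that the number of allowed mismatches is rounded down. The function names are hypothetical.

```python
def allowed_mismatches(primer_length, mismatch_percent):
    """Number of mismatching bases allowed for a primer (rounded down)."""
    if not 0 <= mismatch_percent <= 30:
        raise ValueError("mismatch percent must be between 0 and 30")
    return primer_length * mismatch_percent // 100


def primer_binds(primer, site, mismatch_percent=0, window=1):
    """Check a candidate site against the mismatch budget and the 3' window."""
    if len(primer) != len(site):
        return False
    mismatches = [i for i, (p, s) in enumerate(zip(primer, site)) if p != s]
    # No mismatch may fall inside the last `window` nucleotides (3' end).
    if any(i >= len(primer) - window for i in mismatches):
        return False
    return len(mismatches) <= allowed_mismatches(len(primer), mismatch_percent)


primer = "ACGTACGTACGTACGTACGT"          # 20 bp, made up for illustration
print(allowed_mismatches(len(primer), 20))  # 4
five_prime_mismatch = "T" + primer[1:]
three_prime_mismatch = primer[:-1] + "A"
print(primer_binds(primer, five_prime_mismatch, mismatch_percent=20, window=3))
print(primer_binds(primer, three_prime_mismatch, mismatch_percent=20, window=3))
```

Setting window to the primer length corresponds to including the entire primer in the 3' mismatch window.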
E-mail address
If you supply your e-mail address, the URL of the results page is sent to you when your job is finished.
Providing your e-mail address is optional.
You can also bookmark the results page to access the results at a later time.
Results are generally kept for two weeks.
TaxMan output
The example output can be regenerated by running the example input.
Please note that for several rRNA gene databases taxonomic orders or families can be missing.
This results in FASTA headers where multiple ";" are present. In the Tree and Pie chart sections,
we show "noname" in case a taxonomic category was missing (or was an empty string).
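As a small illustration of this substitution, the sketch below splits a lineage on ";" and labels empty levels as "noname". The function name and the example lineage are made up.

```python
def lineage_levels(lineage):
    """Split a lineage into levels, naming missing levels "noname"."""
    return [level.strip() or "noname" for level in lineage.split(";")]


print(lineage_levels("Bacteria;Firmicutes;;Streptococcaceae"))
```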
Below, we describe the different sections of the output page.
Overview section
This section mainly provides links to quickly jump to the section of your interest.
Download section
The download section provides links to the three files generated based on your primers.
These files are gzipped. If you are a Windows user and cannot open the files, you could download
7-zip or gzip.
The following files are available:
Taxonomic tree section
An expandable tree is shown here. You can click on the
graphic or the name to expand a part of the tree.
You can change the height of the tree by
dragging the bottom or right border or the bottom-left corner of the Tree area.
You can also change the height by typing a (natural) number in the field for "Height of Tree viewer" and pressing enter.
Internet Explorer users can only use the latter option.
You can use "Find" to search for a name (case-insensitive). After entering at least three characters, press enter or click the "Find" button.
The tree expands at the part that contains a name starting with your search query.
The entire match is set in bold and scrolled into view.
You can press "Find Next" to jump to a (possible) next match.
The numbers in the tree refer to the number of sequences found by your primers and the number in the original reference database used, respectively.
For convenience, also the percentage is shown. If this percentage is "--", the sequence was assigned an ambiguous lineage not present in the reference.
One unique amplicon sequence can have several different lineages associated with it (see above).
For the tree viewer only, an "extra" node is created, named "ambiguous".
Ambiguous taxonomic assignments are collected under this node (at the appropriate level) to separate these clearly from the other assignments.
Example: Streptococcus;(cristatus/oligofermentans) is present in the tree viewer as Streptococcus;ambiguous;(cristatus/oligofermentans).
Note that this is not relevant if you used the option "Use only identical part of lineage".
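The placement of the "ambiguous" node can be sketched as follows. This is an illustration only: it assumes that an ambiguous assignment is recognised by a last level written in brackets, optionally followed by a count such as _2, and the function name is hypothetical.

```python
import re

# Last level in brackets, optionally followed by a count like "_2".
AMBIGUOUS_LEVEL = re.compile(r"^\(.*\)(_\d+)?$")


def tree_path(lineage):
    """Return the tree viewer path for a lineage as a list of node names."""
    levels = lineage.split(";")
    if AMBIGUOUS_LEVEL.match(levels[-1]):
        # Collect ambiguous assignments under an extra "ambiguous" node.
        return levels[:-1] + ["ambiguous", levels[-1]]
    return levels


print(";".join(tree_path("Streptococcus;(cristatus/oligofermentans)")))
# Streptococcus;ambiguous;(cristatus/oligofermentans)
```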
Pie chart section
The pie charts show the distribution of the sequences over the taxonomy.
If you mouse over a slice or over the legend, the numbers of sequences will be shown in the pie.
The numbers are the same as in the tree.
By clicking on a slice of the pie, you can plot the taxonomic categories (children) of this slice.
If you click Bacteria in the Root pie, a new pie will show all groups in Bacteria.
Note that this new plot contains the sub groups of Bacteria (phyla) and not a lineage like Bacteria;unclassified.
The latter is present (and counted) in the Bacteria slice of the Root pie.
Now, the counts of the numbers of sequences are also present in the legend (e.g. Actinobacteria |61/72).
You can plot the taxonomic coverage data in different ways and supply a percentage threshold.
This threshold is applied to amplicon and reference charts.
For example, with ≥ 2% the pie charts will only contain taxa (e.g. Firmicutes) that occur at least 2% in your set,
relative to all taxa at their parent level (e.g. Bacteria).
Please note:
- changing this threshold does not change existing pie charts: the threshold is applied to new charts only.
This way plotting is most flexible as each pie can have a different threshold, which is shown in the pie area.
- TaxMan is sequence oriented. The counts (therefore threshold) refer to the number of sequences in a certain taxon.
- this percentage is calculated using the sequence counts in the reference database as denominator.
Example for threshold: The Proteobacteria count is 89/106. The count of Bacteria, its parent, is 504/618.
By selecting the threshold of >15%, the Proteobacteria will be filtered out, since 89/618 = 14.4%.
The first number refers to the number of sequences of this taxon targeted by your primers, the second to the number of sequences of this taxon in the reference database used.
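The threshold example can be reproduced with a few lines of code. The sketch below follows the example above, using the amplicon count of the taxon and the reference count of its parent; the function names and the inclusive parameter (for "≥" versus ">" thresholds) are assumptions.

```python
def share_of_parent(taxon_amplicons, parent_reference):
    """Percentage of a taxon relative to the reference count of its parent."""
    return 100.0 * taxon_amplicons / parent_reference


def keep_in_pie(taxon_amplicons, parent_reference, threshold_percent, inclusive=False):
    """Decide whether a taxon passes the pie chart threshold."""
    share = share_of_parent(taxon_amplicons, parent_reference)
    if inclusive:
        return share >= threshold_percent
    return share > threshold_percent


# Proteobacteria 89/106 under Bacteria 504/618, threshold >15%.
print(round(share_of_parent(89, 618), 1))  # 14.4
print(keep_in_pie(89, 618, 15))            # False: filtered out
```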
In the pie charts for amplicon sequences, you can plot all taxa or make a "difference"
plot that only includes taxa that are missing, to some percentage, from your amplicon set.
If you check "Plot differences", the new plot will have a pink header to indicate that this is a different type of plot.
These "difference" plots may include all missing taxa or only the taxa missing at least a certain percentage.
This allows zooming into taxa of your interest.
This percentage is relative to the number of sequences of a taxon in the reference.
For example, 30% means that 30% of sequences are missing from your set relative to the reference.
Accordingly, 100% means that only taxa that are entirely absent from your set are shown.
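The percentage used in a "difference" plot can be sketched as follows. The function names are hypothetical, and the rule that a taxon must miss at least one sequence to appear in the plot is an assumption based on the description above.

```python
def percent_missing(amplicon_count, reference_count):
    """Percentage of reference sequences of a taxon missing from your set."""
    return 100.0 * (reference_count - amplicon_count) / reference_count


def in_difference_plot(amplicon_count, reference_count, min_missing_percent=0):
    """Check whether a taxon is shown in a "difference" plot."""
    missing = percent_missing(amplicon_count, reference_count)
    return missing > 0 and missing >= min_missing_percent


# Actinobacteria with counts 61/72: 11 of 72 reference sequences are missing.
print(round(percent_missing(61, 72), 1))  # 15.3
print(in_difference_plot(61, 72, 30))     # False: less than 30% is missing
print(in_difference_plot(0, 72, 100))     # True: entirely absent from your set
```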
The pie charts for the reference database allow a different view of the data.
Instead of the counts for your amplicon set, the counts of the reference data are plotted here.
Mousing over the pie slices will now show the same count twice (the count in the pie is now the same as the reference).
For a fast visualization of what is present in your set, you can mouse over the names in the legend to show the counts in your set and the reference.
Example
In CORE, there are 1045 bacterial sequences. So, mousing over the pie will show "Bacteria; cnt:1045/1045", while mousing over Bacteria in the legend will show
"Bacteria; cnt:655/1045". Here, 655 is the number of sequences in Bacteria in your set and 1045 is the total number of Bacteria sequences in CORE.