See the INSTALL
file included with GenomeHistory for installation instructions.
Conant, G. C. and Wagner, A. (2002) GenomeHistory: A software tool and
its application to fully sequenced genomes, Nucleic Acids Research,
30(15): 3378-3386 [Reprint]
GenomeHistory can be invoked either with command-line options or
through the use of a command file. An
example command file is included. The command-line options
be obtained by typing % GenomeHistory.pl with no
Possible options are:
||=<protein sequences file>
||=<nucleotide sequences file>
||=<Gene name file>
|Gene Name Field
||=<Gene name field #>
|BLAST matrix location
|Only Top BLAST match
|Minimum ORF Translation length
|Minimum Number Aligned Residues
|Percent identity threshold
|Ordered Gene List
Any combinations of options may be used. When used on the
line, only the short option format is allowed. The command file
the same set of options, but allows either format. Note that in
cases the equal (=) sign is required. Options are not
(but filenames may be).
Example command-line invocation:
% GenomeHistory.pl pa=my_genome.fasta na=my_genome_nuc.fasta
Example command file invocation:
% GenomeHistory.pl genome_analysis.txt
Generating the example dataset:
GenomeHistory includes a small example dataset (a portion of the yeast
genome) that can be used to test your installation of
From the GenomeHistory directory, type (% is your prompt)
% GenomeHistory test_analysis.txt
This will create two new files, test_ks.txt and test_ks.err.
These files should be identical to the example files provided, example_ks.txt
You can compare them using the UNIX "diff"command:
% diff test_ks.txt example_ks.txt
% diff test_ks.err example_ks.err
In theory, "diff" should not print
anything. However, if you are using a different version of clustalw (not 1.82), there may be (mostly minor) differences.
Notes and Tips:
Locating nucleotide sequences corresponding to protein sequences:
One of the chief difficulties with the sort of genome analysis
by GenomeHistory is matching nucleotide sequences to protein
By default, GenomeHistory does this using keywords in the FASTA header
files (see the Gene Name Field option). Unfortuntately, some genomes do
not provide identifiers that allow one to one matching between protein
sequences and their corresponding nucleotide sequences.
problem can be overcome with a second matching method if the genome's
and protein sequences are listed in a one to one order (meaning protein
sequence number 1 is the translation of nucleotide sequence number 1
This second method is invoked with the Ordered gene list =
option and simply uses the order that genes are listed in the two files
to identify nucleotide sequences. Microbial genomes seem
suited for use with this option, as alternative splicing and gene
are less of a problem for these genomes.
Using the Genename field to locate nucleotide sequences:
If you are experiencing numerous errors where GenomeHistory is not able
to find nucleotide sequences for genes, you may want to experiment with
the "Gene Name Field =" option. Sequences in the nucleotide files
may have slightly different names than those in the protein sequence
but the unquie identifiers should match. For instance, the
yeast genome stores unique names in field #2, while Arabidopsis uses
field 3. Note that this option will only be effective if there is
a unquie identifier common to both protein and nucleotide sequences
instance the YXX### format yeast gene identifiers). If you cannot
find such an identifier, you may want to experiment with the Ordered
gene list = yes option described above.
The running time of GenomeHistory is in large part dictated by the
of BLAST hits retained for each gene. In general, the lower
E-value) the BLAST threshold is to, the fewer gene pairs will be
and the faster the analysis will be. The trade-off, of course, is
the potential to miss significant hits. We have found values
E < 0.00001 to be quite conservative for the yeast genome.
BLAST in non-default locations:
Washington University BLAST expects to be installed at
If this is not where you have installed BLAST, you will need to use the
"BLAST matrix location" option to locate your installation. For
The Genetic code option takes on the following possible values:
- U: Universal
- VM: Vertebrate mitochondrial
- YM: Yeast mitochondrial
- MM: Mold mitochondrial
- IM: Invertebrate mitochondrial
- CN: ciliate nuclear
- EM: Echinoderm mitochondrial.
Using the "Only Top BLAST match=YES" option:
When this option is used, Ks and Ka is calculated
for each gene and its top BLAST hit, no matter what the E-value of that
hit. Thus, the "BLAST threshold" option is meaningless in this
Repeated analysis of gene pairs:
Gene pairs could potentially represented (and analyzed) in two ways
1, Gene 2) and (Gene 2, Gene 1). Under normal circumstances,
will only analyze one of these two combinations (the first one that
However if the option "Only Top BLAST match = YES" is set, or if the
"Gene file = <filename>" is used, both comparisions may be made,
these two options may be used in analyses where it is not desirable to
exclude reciprocal comparisions.
Each sequence pair for which Ks/Ka are calculated
is first checked to be sure that the two sequences are not
If they are, the word IDENTICAL is entered in the output rather than
Saturation in Ks values:
If the divergence time between two gene pairs is sufficiently long,
may no longer be sufficient information in the two sequences to
Ks. Such sequences are said to be saturated.
numerical algorithms for estimating Ks under likelihood
directly determine that sequences are saturated. Instead we have
implemented a simple test: After the likelihood maximization is
the Ks value for the pair is multiplied by 10 and the
is recalculated. If this likelihood is as high or higher than the
original likelihood, that indicates that the sequence pair has
Ks, and that Ks value in the output is marked
an *. Of course, this test does not address the problem of the
variances associated with high, but unsaturated, Ks
Researchers will want to remember this issue when selecting duplicate
for analyses. A forthcoming manscript will discuss this issue in
more detail (Hahn, Conant and Wagner, Journal of Molecular
When the "Checkpoint" option is ON, GenomeHistory will look for a
file called .GenomeHistory.chkpnt.(argfile) (where argfile is the
file) or .GenomeHistory.chkpnt.cmdline (if invoking from the command
If this file is found, GenomeHistory will restart from the point listed
in this file. This option is useful for stopping jobs and
them at a later time. For checkpointing to work
you must specify the same output and error filenames on the restart as
were originally used.
Note that the checkpoint file is created whether or not Checkpointing
is turned on.
You can use any combination of absolute and relative paths when
filenames to GenomeHistory.
By default, any gene pair that do not meet the three analysis criteria
after pairwise alignment are discarded without further warnings.
By specifying "Warnings=on", you can output WARNING entries in the
file that indicate why a particular pair was discarded. This
however, has the potential to make the error file very large.
By default, GenomeHistory uses the observed base frequencies at each
position in the calculation of Ks and Ka (in other words, the
composition at each codon position is averaged between the two
In some cases, it may be preferable to use the over-all average base
(OBSERVED) which does not break down the composition by codon
It is also possible to use equal frequencies for all bases (EQUAL).
As written, GenomeHistory removes all gap positions in alignments
calculating Ks and Ka. This is done
gap characters can tend to downwardly bias estimates of sequence
(Yang, Z., 2000, Mol. Bio. Evol. 51:423-432). It is
to disable this feature of GenomeHistory by editting the
script and removing the "-nogap" option from the calc_percent_dist
and like_pair_dist program options.
By default, both BLAST and Clustalw use the BLOSUM62 amino acid
matrix for sequence alignments. The BLAST matrix can be changed
the "BLAST matrix location= " option, provided you have another
matrix in the appropriate format. To use a different matrix in
you will need to edit GenomeHistory.pl to find the line that reads:
$output=`nice clustalw $tempseqfile /output=pir`;
Change this to read
$output=`nice clustalw $tempseqfile /output=pir
where <NAME> is a matrix file or one of the default clustal
For further details on clustal options, see http://www-igbmc.u-strasbg.fr/BioInfo/ClustalW/help9.html.
Comments on basic installation process:
To build and use the package, you will need a c and a c++ compiler
(preferably the GNu gcc and g++),
the Washington University implementation of BLAST, and clustalw.
The make command will build liblapack
(a partial lapack library), libf2c (because lapack was originally built
in Fortran), liblikelihood and
libdnafuncs (two of my libraries that have many generic functions),
like_pair_dist (which calculates Ks/Ka),
aligndna_new (which creates a nucleotide alignment from a protein
and calc_percent_diff (which
calculates the percentage difference for all pairs of sequences in
You will need to add the GenomeHistory directory to your path.
You can then run the program
by typing GenomeHistory.pl <filename> where <filename> is a
file with the parameters for the run.
An example input file with descriptions of each command is included
This package was developed and tested on a RedHat 7.1 Linux machines
the gcc compilers. It should work on
other platforms, but has not been tested on them. Use
or edit the file GenomeHistory/make.inc to
change the name of your c and c++ compiler.
7-29-02: Fixed a bug that caused incorrect nucleotide sequences
be returned in genomes (such as Drosophila) where gene names
be substrings of each other (e.g. CT1187 and CT11871).
with this problem were still analyzed correctly, but also generated an
10-27-03: Fixed a bug that resulted in some gene pairs where Ks
= 0 (reported as Ks <10-4) being reported as
saturated in synonymous substitutions.
01-07-04: Fixed a bug preventing the "Ordered Gene list=YES" option
from working properly when a "Gene File" was provided.
10-08-04: Fixed a bug which caused GenomeHistory to fail to create
nucleotide alignments for calculating Ks and Ka
when compiled with gcc 3.0.
10-17-04: Fixed a bug in the GenomeHistory perl script which prevented the
analysis of gene pairs with E values of 0.0 when using some versions of WU-BLAST. Special thanks to Luke Hakes and Ignazio Carbone for help finding these last two bugs.
Included c++ packages:
The actual computational work of GenomeHistory is performed by
c++ executables. In addition to BLAST and Clustalw, this includes
3 programs we have written:
- calc_percent_dist: This program takes a sequence alignment
or nucleotide) and returns the pairwise percent similarity for all
of sequences in the alignment. The "-nogap" option causes it to
similarity based only on non-gap positions
- algndna_new: This is a complex program for deducing
alignments from protein sequence alignments. We use only a
of it here: it also has the ablity to parse GenBank files to locate
and exons from protein-coding genes
- like_pair_dist: The "meat" of GenomeHistory: this program
optimization to find maximum likelihood estimates of Ks and
Structure of the c++ codes (Advanced users):
The three c++ programs included are based on a
set of classes. Of these, some of the more important are:
- Exchange: By analogy to a telephone exchange, this
the central repository of global parameters that all of the other
need to know about. It mostly consists of public variables and
functions to those variables. Improper settings in this class can
give strange results in programs. For instance, it is important
set to model of evolution in the exchange, as certain Boolean flags
be set correctly for codon-based models to work.
- Sequence_dataset: A simple data structure class
protein or nucleotide sequences. It is primarily intended for
sequences (so that each sequence is of the same length).
- Read_Sequence: This is actually the base class for a
designed to read various sequence file formats. Currently
Support for these formats may not be complete.
- PAUP NEXUS
- PHYLIP (Interleaved only)
This is the base class for the
Models of evolution are implemented as derived classes. All
models are multiply inherited, with one side of the inheritance
whether the model is nucleotide based or codon-based and the other side
describing the model of nucleotide substitution (for instance, Kimura
- Tree: This class implements a bifurcating tree
used for phylogenetic trees. In GenomeHistory, the only tree used
is a degenerate two-tip tree.
- Genetic_Code: This is the base class that
genetic codes. To be compatible with some of my earlier codes,
base class Genetic_code functions equivalently to the derived