Understanding Bioinformatics

Table of Contents

Cover
Half Title
Dedication
Title Page
Copyright Page
Preface
A Note to the Reader
Organization of this Book
Applications and Theory Chapters
Part 1: Background Basics
Part 2: Sequence Alignments
Part 3: Evolutionary Processes
Part 4: Genome Characteristics
Part 5: Secondary Structures
Part 6: Tertiary Structures
Part 7: Cells and Organisms
Appendices
Organization of the Chapters
Learning Outcomes
Flow Diagrams
Mind Maps
Illustrations
Further Reading
List of Symbols
Glossary
Garland Science Website
Artwork
Additional Material
List of Reviewers
Contents In Brief
Table of Contents
Part 1 Background Basics
1 The Nucleic Acid World
1.1 The Structure of DNA and RNA
DNA is a linear polymer of only four different bases
Two complementary DNA strands interact by base-pairing to form a double helix
RNA molecules are mostly single stranded but can also have base-pair structures
1.2 DNA, RNA, and Protein: The Central Dogma
DNA is the information store, but RNA is the messenger
Messenger RNA is translated into protein according to the genetic code
Translation involves transfer RNAs and RNA-containing ribosomes
1.3 Gene Structure and Control
RNA polymerase binds to specific sequences that position it and identify where to begin transcription
The signals initiating transcription in eukaryotes are generally more complex than those in bacteria
Eukaryotic mRNA transcripts undergo several modifications prior to their use in translation
The control of translation
1.4 The Tree of Life and Evolution
A brief survey of the basic characteristics of the major forms of life
Nucleic acid sequences can change as a result of mutation
Summary
Further Reading
General References
1.1 The Structure of DNA and RNA
1.2 DNA, RNA, and Protein: The Central Dogma
1.3 Gene Structure and Control
1.4 The Tree of Life and Evolution
Box 1.2
2 Protein Structure
2.1 Primary and Secondary Structure
Protein structure can be considered on several different levels
Amino acids are the building blocks of proteins
The differing chemical and physical properties of amino acids are due to their side chains
Amino acids are covalently linked together in the protein chain by peptide bonds
Secondary structure of proteins is made up of α-helices and β-strands
Several different types of β-sheet are found in protein structures
Turns, hairpins, and loops connect helices and strands
2.2 Implication for Bioinformatics
Certain amino acids prefer a particular structural unit
Evolution has aided sequence analysis
Visualization and computer manipulation of protein structures
2.3 Proteins Fold to Form Compact Structures
The tertiary structure of a protein is defined by the path of the polypeptide chain
The stable folded state of a protein represents a state of low energy
Many proteins are formed of multiple subunits
Summary
Further Reading
3 Dealing with Databases
3.1 The Structure of Databases
Flat-file databases store data as text files
Relational databases are widely used for storing biological information
XML has the flexibility to define bespoke data classifications
Many other database structures are used for biological data
Databases can be accessed locally or online and often link to each other
3.2 Types of Database
There’s more to databases than just data
Primary and derived data
How we define and connect things is important: Ontologies
3.3 Looking for Databases
Sequence databases
Microarray databases
Protein interaction databases
Structural databases
3.4 Data Quality
Nonredundancy is especially important for some applications of sequence databases
Automated methods can be used to check for data consistency
Initial analysis and annotation is usually automated
Human intervention is often required to produce the highest quality annotation
The importance of updating databases and entry identifier and version numbers
Summary
Further Reading
3.1 The Structure of Databases
3.2 Types of Database
How we define and connect things is important: Ontologies
3.3 Looking for Databases
Sequence databases
Microarray databases
Protein interaction databases
Structural databases
3.4 Data Quality
MIAME
Part 2 Sequence Alignments
4 Producing and Analyzing Sequence Alignments
4.1 Principles of Sequence Alignment
Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity
Alignment can reveal homology between sequences
It is easier to detect homology when comparing protein sequences than when comparing nucleic acid sequences
4.2 Scoring Alignments
The quality of an alignment is measured by giving it a quantitative score
The simplest way of quantifying similarity between two sequences is percentage identity
The dot-plot gives a visual assessment of similarity based on identity
Genuine matches do not have to be identical
There is a minimum percentage identity that can be accepted as significant
There are many different ways of scoring an alignment
4.3 Substitution Matrices
Substitution matrices are used to assign individual scores to aligned sequence positions
The PAM substitution matrices use substitution frequencies derived from sets of closely related protein sequences
The BLOSUM substitution matrices use mutation data from highly conserved local regions of sequence
The choice of substitution matrix depends on the problem to be solved
4.4 Inserting Gaps
Gaps inserted in a sequence to maximize similarity with another require a scoring penalty
Dynamic programming algorithms can determine the optimal introduction of gaps
4.5 Types of Alignment
Different kinds of alignments are useful in different circumstances
Multiple sequence alignments enable the simultaneous comparison of a set of similar sequences
Multiple alignments can be constructed by several different techniques
Multiple alignments can improve the accuracy of alignment for sequences of low similarity
ClustalW can make global multiple alignments of both DNA and protein sequences
Multiple alignments can be made by combining a series of local alignments
Alignment can be improved by incorporating additional information
4.6 Searching Databases
Fast yet accurate search algorithms have been developed
FASTA is a fast database-search method based on matching short identical segments
BLAST is based on finding very similar short segments
Different versions of BLAST and FASTA are used for different problems
PSI-BLAST enables profile-based database searches
Ssearch is a rigorous alignment method
4.7 Searching with Nucleic Acid or Protein Sequences
DNA or RNA sequences can be used either directly or after translation
The quality of a database match has to be tested to ensure that it could not have arisen by chance
Choosing an appropriate E-value threshold helps to limit a database search
Low-complexity regions can complicate homology searches
Different databases can be used to solve particular problems
4.8 Protein Sequence Motifs or Patterns
Creation of pattern databases requires expert knowledge
The BLOCKS database contains automatically compiled short blocks of conserved multiply aligned protein sequences
4.9 Searching Using Motifs and Patterns
The PROSITE database can be searched for protein motifs and patterns
The pattern-based program PHI-BLAST searches for both homology and matching motifs
Patterns can be generated from multiple sequences using PRATT
The PRINTS database consists of fingerprints representing sets of conserved motifs that describe a protein family
The Pfam database defines profiles of protein families
4.10 Patterns and Protein Function
Searches can be made for particular functional sites in proteins
Sequence comparison is not the only way of analyzing protein sequences
Summary
Further Reading
4.1 Principles of Sequence Alignment
4.2 Scoring Alignments
Twilight zone and midnight zone
4.3 Substitution Matrices
4.4 Inserting Gaps
4.5 Types of Alignment
ClustalW
DIALIGN
4.6 Searching Databases
BLAST
FASTA
PSI-BLAST
4.8 Protein Sequence Motifs or Patterns
MEME
4.9 Searching Using Motifs and Patterns
MOTIF
PRATT
4.10 Patterns and Protein Function
HCA
5 Pairwise Sequence Alignment and Database Searching
5.1 Substitution Matrices and Scoring
Alignment scores attempt to measure the likelihood of a common evolutionary ancestor
The PAM (MDM) substitution scoring matrices were designed to trace the evolutionary origins of proteins
The BLOSUM matrices were designed to find conserved regions of proteins
Scoring matrices for nucleotide sequence alignment can be derived in similar ways
The substitution scoring matrix used must be appropriate to the specific alignment problem
Gaps are scored in a much more heuristic way than substitutions
5.2 Dynamic Programming Algorithms
Optimal global alignments are produced using efficient variations of the Needleman-Wunsch algorithm
Local and suboptimal alignments can be produced by making small modifications to the dynamic programming algorithm
Time can be saved with a loss of rigor by not calculating the whole matrix
5.3 Indexing Techniques and Algorithmic Approximations
Suffix trees locate the positions of repeats and unique sequences
Hashing is an indexing technique that lists the starting positions of all k-tuples
The FASTA algorithm uses hashing and chaining for fast database searching
The BLAST algorithm makes use of finite-state automata
Comparing a nucleotide sequence directly with a protein sequence requires special modifications to the BLAST and FASTA algorithms
5.4 Alignment Score Significance
The statistics of gapped local alignments can be approximated by the same theory
5.5 Aligning Complete Genome Sequences
Indexing and scanning whole genome sequences efficiently is crucial for the sequence alignment of higher organisms
The complex evolutionary relationships between the genomes of even closely related organisms require novel alignment algorithms
Summary
Further Reading
5.1 Substitution Matrices and Scoring
The PAM (MDM) substitution scoring matrices were designed to trace the evolutionary origins of proteins
The BLOSUM matrices were designed to find conserved regions of proteins
Scoring matrices for nucleotide sequence alignment can be derived in similar ways
The substitution scoring matrix used must be appropriate to the specific alignment problem
Gaps are scored in a much more heuristic way than substitutions
5.2 Dynamic Programming Algorithms
Optimal global alignments are produced using efficient variations of the Needleman-Wunsch algorithm
Local and suboptimal alignments can be produced by making small modifications to the dynamic programming algorithm
Time can be saved with a loss of rigor by not calculating the whole matrix
5.3 Indexing Techniques and Algorithmic Approximations
Suffix trees locate the positions of repeats and unique sequences; Hashing is an indexing technique that lists the starting positions of all k-tuples
The FASTA algorithm uses hashing and chaining for fast database searching; The BLAST algorithm makes use of finite-state automata
Box 5.2: Sometimes things just aren’t complex enough
5.4 Alignment Score Significance
5.5 Aligning Complete Genome Sequences
6 Patterns, Profiles, and Multiple Alignments
6.1 Profiles and Sequence Logos
Position-specific scoring matrices are an extension of substitution scoring matrices
Methods for overcoming a lack of data in deriving the values for a PSSM
PSI-BLAST is a sequence database searching program
Representing a profile as a logo
6.2 Profile Hidden Markov Models
The basic structure of HMMs used in sequence alignment to profiles
Estimating HMM parameters using aligned sequences
Scoring a sequence against a profile HMM: The most probable path and the sum over all paths
Estimating HMM parameters using unaligned sequences
6.3 Aligning Profiles
Comparing two PSSMs by alignment
Aligning profile HMMs
6.4 Multiple Sequence Alignments by Gradual Sequence Addition
The order in which sequences are added is chosen based on the estimated likelihood of incorporating errors in the alignment
Many different scoring schemes have been used in constructing multiple alignments
The multiple alignment is built using the guide tree and profile methods and may be further refined
6.5 Other Ways of Obtaining Multiple Alignments
The multiple sequence alignment program DIALIGN aligns ungapped blocks
The SAGA method of multiple alignment uses a genetic algorithm
6.6 Sequence Pattern Discovery
Discovering patterns in a multiple alignment: eMOTIF and AACC
Probabilistic searching for common patterns in sequences: Gibbs and MEME
Searching for more general sequence patterns
Summary
Further Reading
6.1 Profiles and Sequence Logos
6.2 Profile Hidden Markov Models
6.3 Aligning Profiles
6.4 Multiple Sequence Alignments by Gradual Sequence Addition
6.5 Other Ways of Obtaining Multiple Alignments
6.6 Sequence Pattern Discovery
Part 3 Evolutionary Processes
7 Recovering Evolutionary History
7.1 The Structure and Interpretation of Phylogenetic Trees
Phylogenetic trees reconstruct evolutionary relationships
Tree topology can be described in several ways
Consensus and condensed trees report the results of comparing tree topologies
7.2 Molecular Evolution and its Consequences
Most related sequences have many positions that have mutated several times
The rate of accepted mutation is usually not the same for all types of base substitution
Different codon positions have different mutation rates
Only orthologous genes should be used to construct species phylogenetic trees
Major changes affecting large regions of the genome are surprisingly common
7.3 Phylogenetic Tree Reconstruction
Small ribosomal subunit rRNA sequences are well suited to reconstructing the evolution of species
The choice of the method for tree reconstruction depends to some extent on the size and quality of the dataset
A model of evolution must be chosen to use with the method
All phylogenetic analyses must start with an accurate multiple alignment
Phylogenetic analyses of a small dataset of 16S RNA sequence data
Building a gene tree for a family of enzymes can help to identify how enzymatic functions evolved
Summary
Further Reading
General
7.1 The Structure and Interpretation of Phylogenetic Trees
Phylogenetic trees reconstruct evolutionary relationships
Tree topology can be described in several ways
Consensus and condensed trees report the results of comparing tree topologies
7.2 Molecular Evolution and its Consequences
The rate of accepted mutation is usually not the same for all types of base substitution
Different codon positions have different mutation rates
Only orthologous genes should be used to construct species phylogenetic trees
Major changes affecting large regions of the genome are surprisingly common
7.3 Phylogenetic Tree Reconstruction
Small ribosomal subunit rRNA sequences are well suited to reconstructing the evolution of species
The choice of the method for tree reconstruction depends to some extent on the size and quality of the dataset
A model of evolution must be chosen to use with the method
Phylogenetic analyses of a small dataset of 16S RNA sequence data
Building a gene tree for a family of enzymes can help to identify how enzymatic functions evolved
8 Building Phylogenetic Trees
8.1 Evolutionary Models and the Calculation of Evolutionary Distance
A simple but inaccurate measure of evolutionary distance is the p-distance
The Poisson distance correction takes account of multiple mutations at the same site
The Gamma distance correction takes account of mutation rate variation at different sequence positions
The Jukes-Cantor model reproduces some basic features of the evolution of nucleotide sequences
More complex models distinguish between the relative frequencies of different types of mutation
There is a nucleotide bias in DNA sequences
Models of protein-sequence evolution are closely related to the substitution matrices used for sequence alignment
8.2 Generating Single Phylogenetic Trees
Clustering methods produce a phylogenetic tree based on evolutionary distances
The UPGMA method assumes a constant molecular clock and produces an ultrametric tree
The Fitch-Margoliash method produces an unrooted additive tree
The neighbor-joining method is related to the concept of minimum evolution
Stepwise addition and star-decomposition methods are usually used to generate starting trees for further exploration, not the final tree
8.3 Generating Multiple Tree Topologies
The branch-and-bound method greatly improves the efficiency of exploring tree topology
Optimization of tree topology can be achieved by making a series of small changes to an existing tree
Finding the root gives a phylogenetic tree a direction in time
8.4 Evaluating Tree Topologies
Functions based on evolutionary distances can be used to evaluate trees
Unweighted parsimony methods look for the trees with the smallest number of mutations
Mutations can be weighted in different ways in the parsimony method
Trees can be evaluated using the maximum likelihood method
The quartet-puzzling method also involves maximum likelihood in the standard implementation
Bayesian methods can also be used to reconstruct phylogenetic trees
8.5 Assessing the Reliability of Tree Features and Comparing Trees
The long-branch attraction problem can arise even with perfect data and methodology
Tree topology can be tested by examining the interior branches
Tests have been proposed for comparing two or more alternative trees
Summary
Further Reading
8.1 Evolutionary Models and the Calculation of Evolutionary Distance
The Gamma distance correction takes account of mutation rate variation at different sequence positions
More complex models distinguish between the relative frequencies of different types of mutation
There is a nucleotide bias in DNA sequences
Models of protein-sequence evolution are closely related to the substitution matrices used for sequence alignment
8.2 Generating Single Phylogenetic Trees
The UPGMA method assumes a constant molecular clock and produces an ultrametric tree
The Fitch-Margoliash method produces an unrooted additive tree
The neighbor-joining method is related to the concept of minimum evolution
8.3 Generating Multiple Tree Topologies
Optimization of tree topology can be achieved by making a series of small changes to an existing tree
Finding the root gives a phylogenetic tree a direction in time
8.4 Evaluating Tree Topologies
Functions based on evolutionary distances can be used to evaluate trees
Trees can be evaluated using the maximum likelihood method
The quartet-puzzling method also involves maximum likelihood in the standard implementation
Bayesian methods can also be used to reconstruct phylogenetic trees
8.5 Assessing the Reliability of Tree Features and Comparing Trees
The long-branch attraction problem can arise even with perfect data and methodology
Tree topology can be tested by examining the interior branches
Tests have been proposed for comparing two or more alternative trees
Part 4 Genome Characteristics
9 Revealing Genome Features
9.1 Preliminary Examination of Genome Sequence
Whole genome sequences can be split up to simplify gene searches
Structural RNA genes and repeat sequences can be excluded from further analysis
Homology can be used to identify genes in both prokaryotic and eukaryotic genomes
9.2 Gene Prediction in Prokaryotic Genomes
9.3 Gene Prediction in Eukaryotic Genomes
Programs for predicting exons and introns use a variety of approaches
Gene predictions must preserve the correct reading frame
Some programs search for exons using only the query sequence and a model for exons
Some programs search for genes using only the query sequence and a gene model
Genes can be predicted using a gene model and sequence similarity
Genomes of related organisms can be used to improve gene prediction
9.4 Splice Site Detection
Splice sites can be detected independently by specialized programs
9.5 Prediction of Promoter Regions
Prokaryotic promoter regions contain relatively well-defined motifs
Eukaryotic promoter regions are typically more complex than prokaryotic promoters
A variety of promoter-prediction methods are available online
Promoter prediction results are not very clear-cut
9.6 Confirming Predictions
There are various methods for calculating the accuracy of gene-prediction programs
Translating predicted exons can confirm the correctness of the prediction
Constructing the protein and identifying homologs
9.7 Genome Annotation
Genome annotation is the final step in genome analysis
Gene ontology provides a standard vocabulary for gene annotation
9.8 Large Genome Comparisons
Summary
Further Reading
9.1 Preliminary Examination of Genome Sequence
9.2 Gene Prediction in Prokaryotic Genomes
9.3 Gene Prediction in Eukaryotic Genomes
9.4 Splice Site Detection
9.5 Prediction of Promoter Regions
9.6 Confirming Predictions
9.7 Genome Annotation
9.8 Large Genome Comparisons
Box 9.5
10 Gene Detection and Genome Annotation
10.1 Detection of Functional RNA Molecules Using Decision Trees
Detection of tRNA genes using the tRNAscan algorithm
Detection of tRNA genes in eukaryotic genomes
10.2 Features Useful for Gene Detection in Prokaryotes
10.3 Algorithms for Gene Detection in Prokaryotes
GeneMark uses inhomogeneous Markov chains and dicodon statistics
GLIMMER uses interpolated Markov models of coding potential
Orpheus uses homology, codon statistics, and ribosome-binding sites
GeneMark.hmm uses explicit state duration hidden Markov models
EcoParse is an HMM gene model
10.4 Features Used in Eukaryotic Gene Detection
Differences between prokaryotic and eukaryotic genes
Introns, exons, and splice sites
Promoter sequences and binding sites for transcription factors
10.5 Predicting Eukaryotic Gene Signals
Detection of core promoter binding signals is a key element of some eukaryotic gene-prediction methods
A set of models has been designed to locate the site of core promoter sequence signals
Predicting promoter regions from general sequence properties can reduce the numbers of false-positive results
Predicting eukaryotic transcription and translation start sites
Translation and transcription stop signals complete the gene definition
10.6 Predicting Exons and Introns
Exons can be identified using general sequence properties
Splice-site prediction
Splice sites can be predicted by sequence patterns combined with base statistics
GenScan uses a combination of weight matrices and decision trees to locate splice sites
GeneSplicer predicts splice sites using first-order Markov chains
NetPlantGene combines neural networks with intron and exon predictions to predict splice sites
Other splicing features may yet be exploited for splice-site prediction
Specific methods exist to identify initial and terminal exons
Exons can be defined by searching databases for homologous regions
10.7 Complete Eukaryotic Gene Models
10.8 Beyond the Prediction of Individual Genes
Functional annotation
Comparison of related genomes can help resolve uncertain predictions
Evaluation and reevaluation of gene-detection methods
Summary
Further Reading
10.1 Detection of Functional RNA Molecules Using Decision Trees
tRNA detection
Detection of other RNA genes
10.2 Features Useful for Gene Detection in Prokaryotes
Identifying protein-coding regions using base statistics
10.3 Algorithms for Gene Detection in Prokaryotes
GeneMark, GeneMark.hmm, and further developments
Glimmer
Orpheus
EcoParse
Prokaryotic genomes
Markov models
10.4 Features Used in Eukaryotic Gene Detection
10.4 Preliminary analysis for human genes
10.5 Predicting Eukaryotic Gene Signals
Initial analysis of core promoter sequences
Algorithms of core promoter detection
Promoter recognition
10.6 Predicting Exons and Introns
10.7 Complete Eukaryotic Gene Models
10.8 Beyond the Prediction of Individual Genes
Detailed reexamination of the annotation of a complete genome
Large-scale changes in chromosomes
Box 10.1 Measures of gene prediction accuracy at the nucleotide level
Box 10.2 Sequencing many genomes at once
Box 10.3 Measures of gene prediction accuracy at the exon level
Part 5 Secondary Structures
11 Obtaining Secondary Structure from Sequence
11.1 Types of Prediction Methods
Statistical methods are based on rules that give the probability that a residue will form part of a particular secondary structure
Nearest-neighbor methods are statistical methods that incorporate additional information about protein structure
Machine-learning approaches to secondary structure prediction mainly make use of neural networks and HMM methods
11.2 Training and Test Databases
There are several ways to define protein secondary structures
11.3 Assessing the Accuracy of Prediction Programs
Q3 measures the accuracy of individual residue assignments
Secondary structure predictions should not be expected to reach 100% residue accuracy
The Sov value measures the prediction accuracy for whole elements
CAFASP/CASP: Unbiased and readily available protein prediction assessments
11.4 Statistical and Knowledge-Based Methods
The GOR method uses an information theory approach
The program Zpred includes multiple alignment of homologous sequences and residue conservation information
There is an overall increase in prediction accuracy using multiple sequence information
The nearest-neighbor method: The use of multiple nonhomologous sequences
Predator is a combined statistical and knowledge-based program that includes the nearest-neighbor approach
11.5 Neural Network Methods of Secondary Structure Prediction
Assessing the reliability of neural net predictions
Several examples of Web-based neural network secondary structure prediction programs
PROF: Protein forecasting
Psipred
Jnet: Using several alternative representations of the sequence alignment
11.6 Some Secondary Structures Require Specialized Prediction Methods
Transmembrane proteins
Quantifying the preference for a membrane environment
11.7 Prediction of Transmembrane Protein Structure
Multi-helix membrane proteins
A selection of prediction programs to predict transmembrane helices
Statistical methods
Knowledge-based prediction
Evolutionary information from protein families improves the prediction
Neural nets in transmembrane prediction
Predicting transmembrane helices with hidden Markov models
Comparing the results: What to choose
What happens if a non-transmembrane protein is submitted to transmembrane prediction programs
Prediction of transmembrane structure containing β-strands
11.8 Coiled-Coil Structures
The COILS prediction program
PAIRCOIL and MULTICOIL are an extension of the COILS algorithm
Zipping the leucine zipper: A specialized coiled coil
11.9 RNA Secondary Structure Prediction
Summary
Further Reading
11.1 Types of Prediction Methods
11.2 Training and Test Databases
PDB
STRIDE
DSSP
DEFINE
11.3 Assessing the Accuracy of Prediction Programs
11.4 Statistical and Knowledge-Based Methods
11.5 Neural Network Methods of Secondary Structure Prediction
11.7 Prediction of Transmembrane Protein Structure
11.8 Coiled-Coil Structures
11.9 RNA Secondary Structure Prediction
Box 11.1
Box 11.3
12 Predicting Secondary Structures
12.1 Defining Secondary Structure and Prediction Accuracy
The definitions used for automatic protein secondary structure assignment do not give identical results
There are several different measures of the accuracy of secondary structure prediction
12.2 Secondary Structure Prediction Based on Residue Propensities
Each structural state has an amino acid preference which can be assigned as a residue propensity
The simplest prediction methods are based on the average residue propensity over a sequence window
Residue propensities are modulated by nearby sequence
Predictions can be significantly improved by including information from homologous sequences
12.3 The Nearest-Neighbor Methods are Based on Sequence Segment Similarity
Short segments of similar sequence are found to have similar structure
Several sequence similarity measures have been used to identify nearest-neighbor segments
A weighted average of the nearest-neighbor segment structures is used to make the prediction
A nearest-neighbor method has been developed to predict regions with a high potential to misfold
12.4 Neural Networks Have Been Employed Successfully for Secondary Structure Prediction
Layered feed-forward neural networks can transform a sequence into a structural prediction
Inclusion of information on homologous sequences improves neural network accuracy
More complex neural nets have been applied to predict secondary and other structural features
12.5 Hidden Markov Models Have Been Applied to Structure Prediction
HMM methods have been found especially effective for transmembrane proteins
Nonmembrane protein secondary structures can also be successfully predicted with HMMs
12.6 General Data Classification Techniques can Predict Structural Features
Support vector machines have been successfully used for protein structure prediction
Discriminants, SOMs, and other methods have also been used
Summary
Further Reading
12.1 Defining Secondary Structure and Prediction Accuracy
DSSP
PALSSE
β-Spider
Limits of prediction accuracy
TM properties and accuracy measures
PSIPRED (Q3 and Sov variation)
Matthews correlation coefficient
Sov
SS assignment comparison
Length distributions (non-TM)
12.2 Secondary Structure Prediction Based on Residue Propensities
PDB_SELECT
In-depth analysis of unbiased structural datasets
Hydrophobicity scales
Chou-Fasman
COILS
MEMSAT
Local and nonlocal effects
β-turn propensities
β-turn propensities and use of PSSMs
AAindex
GOR theory
GOR I
GOR II
GOR III
GOR IV
GOR V (includes other sequences)
Using consecutive pairs of structural states
Zpred
Treatment of gaps in multiple alignments during prediction
12.3 The Nearest-Neighbor Methods are Based on Sequence Segment Similarity
NNSSP
SSPAL
I-sites
SIMPA96
Correlation between amino acid composition and protein structural class
HβP
12.4 Neural Networks Have Been Employed Successfully for Secondary Structure Prediction
PHDsec
PHDpsi
PSIPRED
Back-propagation learning
BTPRED
Betaturns
DESTRUCT
SSpro
12.5 Hidden Markov Models Have Been Applied to Structure Prediction
HMMTOP
TMHMM
Phobius
PROftmb
PRED-TMBB
YASPIN
MARCOIL
12.6 General Data Classification Techniques can Predict Structural Features
DSC
FoldIndex
GPI-SOM
PSIMLR
Part 6 Tertiary Structures
13 Modeling Protein Structure
13.1 Potential Energy Functions and Force Fields
The conformation of a protein can be visualized in terms of a potential energy surface
Conformational energies can be described by simple mathematical functions
Similar force fields can be used to represent conformational energies in the presence of averaged environments
Potential energy functions can be used to assess a modeled structure
Energy minimization can be used to refine a modeled structure and identify local energy minima
Molecular dynamics and simulated annealing are used to find global energy minima
13.2 Obtaining a Structure by Threading
The prediction of protein folds in the absence of known structural homologs
Libraries or databases of nonredundant protein folds are used in threading
Two distinct types of scoring schemes have been used in threading methods
Dynamic programming methods can identify optimal alignments of target sequences and structural folds
Several methods are available to assess the confidence to be put on the fold prediction
The C2-like domain from the Dictyostelia: A practical example of threading
13.3 Principles of Homology Modeling
Closely related target and template sequences give better models
Significant sequence identity depends on the length of the sequence
Homology modeling has been automated to deal with the numbers of sequences that can now be modeled
Model building is based on a number of assumptions
13.4 Steps in Homology Modeling
Structural homologs to the target protein are found in the PDB
Accurate alignment of target and template sequences is essential for successful modeling
The structurally conserved regions of a protein are modeled first
The modeled core is checked for misfits before proceeding to the next stage
Sequence realignment and remodeling may improve the structure
Insertions and deletions are usually modeled as loops
Nonidentical amino acid side chains are modeled mainly by using rotamer libraries
Energy minimization is used to relieve structural errors
Molecular dynamics can be used to explore possible conformations for mobile loops
Models need to be checked for accuracy
How far can homology models be trusted?
13.5 Automated Homology Modeling
The program MODELLER models by satisfying protein structure constraints
COMPOSER uses fragment-based modeling to automatically generate a model
Automated methods available on the Web for comparative modeling
Assessment of structure prediction
13.6 Homology Modeling of PI3 Kinase p110α
Swiss-Pdb Viewer can be used for manual or semi-manual modeling
Alignment, core modeling, and side-chain modeling are carried out all in one
The loops are modeled from a database of possible structures
Energy minimization and quality inspection can be carried out within Swiss-Pdb Viewer
MolIDE is a downloadable semi-automatic modeling package
Automated modeling on the Web illustrated with p110α kinase
Modeling a functionally related but sequentially dissimilar protein: mTOR
Generating a multidomain three-dimensional structure from sequence
Summary
Further Reading
Modeling: General
13.1 Potential Energy Functions and Force Fields
Ab initio modeling
13.2 Obtaining a Structure by Threading
123D+
GenTHREADER
3D-PSSM
FUGUE
LIBRA
LOOPP
LIBELLULA
SCOP
13.3 Automated Homology Modeling
MolIDE
SCWRL
PSIPRED
LOOPY
MODELLER
Assessing models
MolProbity
PROCHECK
CASP
Antibody and HIV examples
14 Analyzing Structure-Function Relationships
14.1 Functional Conservation
Functional regions are usually structurally conserved
Similar biochemical function can be found in proteins with different folds
Fold libraries identify structurally similar proteins regardless of function
14.2 Structure Comparison Methods
Finding domains in proteins aids structure comparison
Structural comparisons can reveal conserved functional elements not discernible from a sequence comparison
The CE method builds up a structural alignment from pairs of aligned protein segments
The Vector Alignment Search Tool (VAST) aligns secondary structural elements
DALI identifies structure superposition without maintaining segment order
FATCAT introduces rotations between rigid segments
14.3 Finding Binding Sites
Highly conserved, strongly charged, or hydrophobic surface areas may indicate interaction sites
Searching for protein-protein interactions using surface properties
Surface calculations highlight clefts or holes in a protein that may serve as binding sites
Looking at residue conservation can identify binding sites
14.4 Docking Methods and Programs
Simple docking procedures can be used when the structure of a homologous protein bound to a ligand analog is known
Specialized docking programs will automatically dock a ligand to a structure
Scoring functions are used to identify the most likely docked ligand
The DOCK program is a semirigid-body method that analyzes shape and chemical complementarity of ligand and binding site
Fragment docking identifies potential substrates by predicting types of atoms and functional groups in the binding area
GOLD is a flexible docking program, which utilizes a genetic algorithm
The water molecules in binding sites should also be considered
Summary
Further Reading
General
14.1 Functional Conservation
Structure-function relationships
Fold recognition programs
Domain identification
Viewing programs
14.3 Finding Binding Sites
Evolutionary trace methods
14.4 Docking Methods and Programs
Part 7 Cells and Organisms
15 Proteome and Gene Expression Analysis
15.1 Analysis of Large-scale Gene Expression
The expression of large numbers of different genes can be measured simultaneously by DNA microarrays
Gene expression microarrays are mainly used to detect differences in gene expression in different conditions
Serial analysis of gene expression (SAGE) is also used to study global patterns of gene expression
Digital differential display uses bioinformatics and statistics to detect differential gene expression in different tissues
Facilitating the integration of data from different places and experiments
The simplest method of analyzing gene expression microarray data is hierarchical cluster analysis
Techniques based on self-organizing maps can be used for analyzing microarray data
Self-organizing tree algorithms (SOTAs) cluster from the top down by successive subdivision of clusters
Clustered gene expression data can be used as a tool for further research
15.2 Analysis of Large-scale Protein Expression
Two-dimensional gel electrophoresis is a method for separating the individual proteins in a cell
Measuring the expression levels shown in 2D gels
Differences in protein expression levels between different samples can be detected by 2D gels
Clustering methods are used to identify protein spots with similar expression patterns
Principal component analysis (PCA) is an alternative to clustering for analyzing microarray and 2D gel data
The changes in a set of protein spots can be tracked over a number of different samples
Databases and online tools are available to aid the interpretation of 2D gel data
Protein microarrays allow the simultaneous detection of the presence or activity of large numbers of different proteins
Mass spectrometry can be used to identify the proteins separated and purified by 2D gel electrophoresis or other means
Protein-identification programs for mass spectrometry are freely available on the Web
Mass spectrometry can be used to measure protein concentration
Summary
Further Reading
15.1 Analysis of Large-scale Gene Expression
Functional genomics
Microarray quality control focus
MIAME
15.2 Analysis of Large-scale Protein Expression
Proteomics
Protein microarrays
MASCOT
ProteinProspector
Databases and software
16 Clustering Methods and Statistics
16.1 Expression Data Require Preparation Prior to Analysis
Data normalization is designed to remove systematic experimental errors
Expression levels are often analyzed as ratios and are usually transformed by taking logarithms
Sometimes further normalization is useful after the data transformation
Principal component analysis is a method for combining the properties of an object
16.2 Cluster Analysis Requires Distances to be Defined Between all Data Points
Euclidean distance is the measure used in everyday life
The Pearson correlation coefficient measures distance in terms of the shape of the expression response
The Mahalanobis distance takes account of the variation and correlation of expression responses
16.3 Clustering Methods Identify Similar and Distinct Expression Patterns
Hierarchical clustering produces a related set of alternative partitions of the data
k-means clustering groups data into several clusters but does not determine a relationship between clusters
Self-organizing maps (SOMs) use neural network methods to cluster data into a predetermined number of clusters
Evolutionary clustering algorithms use selection, recombination, and mutation to find the best possible solution to a problem
The self-organizing tree algorithm (SOTA) determines the number of clusters required
Biclustering identifies a subset of similar expression level patterns occurring in a subset of the samples
The validity of clusters is determined by independent methods
16.4 Statistical Analysis can Quantify the Significance of Observed Differential Expression
T-tests can be used to estimate the significance of the difference between two expression levels
Nonparametric tests are used to avoid making assumptions about the data sampling
Multiple testing of differential expression requires special techniques to control error rates
16.5 Gene and Protein Expression Data Can be Used to Classify Samples
Many alternative methods have been proposed that can classify samples
Support vector machines are another form of supervised learning algorithm that can produce classifiers
Summary
Further Reading
Monograph
16.1 Expression Data Require Preparation Prior to Analysis
Variance-stabilizing transformation
Lowess normalization
General review of normalization and transformation of data
Data used to generate Figures 16.2 to 16.4
Principal component analysis
16.2 Cluster Analysis Requires Distances to be Defined Between all Data Points
Distance definitions
16.3 Clustering Methods Identify Similar and Distinct Expression Patterns
Self-organizing maps (SOMs)
Clustering using genetic algorithms
SOTA
Biclustering techniques
Validation of partitions produced by clustering methods
16.4 Statistical Analysis can Quantify the Significance of Observed Differential Expression
Bayesian estimation of variance
Nonparametric tests
Multiple hypothesis testing
16.5 Gene and Protein Expression Data Can be Used to Classify Samples
Support vector machines
17 Systems Biology
17.1 What is a System?
A system is more than the sum of its parts
A biological system is a living network
Databases are useful starting points in constructing a network
To construct a model more information is needed than a network
There are three possible approaches to constructing a model
Kinetic models are not the only way in systems biology
17.2 Structure of the Model
Control circuits are an essential part of any biological system
The interactions in networks can be represented as simple differential equations
17.3 Robustness of Biological Systems
Robustness is a distinct feature of complexity in biology
Modularity plays an important part in robustness
Redundancy in the system can provide robustness
Living systems can switch from one state to another by means of bistable switches
17.4 Storing and Running System Models
Specialized programs make simulating systems easier
Standardized systems descriptions aid their storage and reuse
Summary
Further Reading
17.1 What is a System?
Databases
17.2 Structure of the Model
Topological modeling
17.3 Robustness of Biological Systems
Molecular systems modeling
Appendix A: Probability, Information, and Bayesian Analysis
Probability Theory, Entropy, and Information
Mutually exclusive events
Occurrence of two events
Occurrence of two random variables
Bayesian Analysis
Bayes’ theorem
Inference of parameter values
Further Reading
Appendix B: Molecular Energy Functions
Force Fields for Calculating Intra- and Intermolecular Interaction Energies
Bonding terms
Nonbonding terms
Potentials used in Threading
Potentials of mean force
Potential terms relating to solvent effects
Further Reading
Appendix C: Function Optimization
Full Search Methods
Dynamic programming and branch-and-bound
Local Optimization
The downhill simplex method
The steepest descent method
The conjugate gradient method
Methods using second derivatives
Thermodynamic Simulation and Global Optimization
Monte Carlo and genetic algorithms
Molecular dynamics
Simulated annealing
Summary
Further Reading
List of Symbols
Notation concepts
Glossary
Index
