Table of Contents
- Cover
- Half Title
- Dedication
- Title Page
- Copyright Page
- Preface
- A Note to the Reader
- Organization of this Book
- Applications and Theory Chapters
- Part 1: Background Basics
- Part 2: Sequence Alignments
- Part 3: Evolutionary Processes
- Part 4: Genome Characteristics
- Part 5: Secondary Structures
- Part 6: Tertiary Structures
- Part 7: Cells and Organisms
- Appendices
- Organization of the Chapters
- Learning Outcomes
- Flow Diagrams
- Mind Maps
- Illustrations
- Further Reading
- List of Symbols
- Glossary
- Garland Science Website
- Artwork
- Additional Material
- List of Reviewers
- Contents In Brief
- Table of Contents
- Part 1 Background Basics
- 1 The Nucleic Acid World
- 1.1 The Structure of DNA and RNA
- DNA is a linear polymer of only four different bases
- Two complementary DNA strands interact by base-pairing to form a double helix
- RNA molecules are mostly single stranded but can also have base-pair structures
- 1.2 DNA, RNA, and Protein: The Central Dogma
- DNA is the information store, but RNA is the messenger
- Messenger RNA is translated into protein according to the genetic code
- Translation involves transfer RNAs and RNA-containing ribosomes
- 1.3 Gene Structure and Control
- RNA polymerase binds to specific sequences that position it and identify where to begin transcription
- The signals initiating transcription in eukaryotes are generally more complex than those in bacteria
- Eukaryotic mRNA transcripts undergo several modifications prior to their use in translation
- The control of translation
- 1.4 The Tree of Life and Evolution
- A brief survey of the basic characteristics of the major forms of life
- Nucleic acid sequences can change as a result of mutation
- Summary
- Further Reading
- General References
- 1.1 The Structure of DNA and RNA
- 1.2 DNA, RNA, and Protein: The Central Dogma
- 1.3 Gene Structure and Control
- 1.4 The Tree of Life and Evolution
- Box 1.2
- 2 Protein Structure
- 2.1 Primary and Secondary Structure
- Protein structure can be considered on several different levels
- Amino acids are the building blocks of proteins
- The differing chemical and physical properties of amino acids are due to their side chains
- Amino acids are covalently linked together in the protein chain by peptide bonds
- Secondary structure of proteins is made up of α-helices and β-strands
- Several different types of β-sheet are found in protein structures
- Turns, hairpins, and loops connect helices and strands
- 2.2 Implication for Bioinformatics
- Certain amino acids prefer a particular structural unit
- Evolution has aided sequence analysis
- Visualization and computer manipulation of protein structures
- 2.3 Proteins Fold to Form Compact Structures
- The tertiary structure of a protein is defined by the path of the polypeptide chain
- The stable folded state of a protein represents a state of low energy
- Many proteins are formed of multiple subunits
- Summary
- Further Reading
- 3 Dealing with Databases
- 3.1 The Structure of Databases
- Flat-file databases store data as text files
- Relational databases are widely used for storing biological information
- XML has the flexibility to define bespoke data classifications
- Many other database structures are used for biological data
- Databases can be accessed locally or online and often link to each other
- 3.2 Types of Database
- There’s more to databases than just data
- Primary and derived data
- How we define and connect things is important: Ontologies
- 3.3 Looking for Databases
- Sequence databases
- Microarray databases
- Protein interaction databases
- Structural databases
- 3.4 Data Quality
- Nonredundancy is especially important for some applications of sequence databases
- Automated methods can be used to check for data consistency
- Initial analysis and annotation is usually automated
- Human intervention is often required to produce the highest quality annotation
- The importance of updating databases and entry identifier and version numbers
- Summary
- Further Reading
- 3.1 The Structure of Databases
- 3.2 Types of Database
- How we define and connect things is important: Ontologies
- 3.3 Looking for Databases
- Sequence databases
- Microarray databases
- Protein interaction databases
- Structural databases
- 3.4 Data Quality
- MIAME
- Part 2 Sequence Alignments
- 4 Producing and Analyzing Sequence Alignments
- 4.1 Principles of Sequence Alignment
- Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity
- Alignment can reveal homology between sequences
- It is easier to detect homology when comparing protein sequences than when comparing nucleic acid sequences
- 4.2 Scoring Alignments
- The quality of an alignment is measured by giving it a quantitative score
- The simplest way of quantifying similarity between two sequences is percentage identity
- The dot-plot gives a visual assessment of similarity based on identity
- Genuine matches do not have to be identical
- There is a minimum percentage identity that can be accepted as significant
- There are many different ways of scoring an alignment
- 4.3 Substitution Matrices
- Substitution matrices are used to assign individual scores to aligned sequence positions
- The PAM substitution matrices use substitution frequencies derived from sets of closely related protein sequences
- The BLOSUM substitution matrices use mutation data from highly conserved local regions of sequence
- The choice of substitution matrix depends on the problem to be solved
- 4.4 Inserting Gaps
- Gaps inserted in a sequence to maximize similarity with another require a scoring penalty
- Dynamic programming algorithms can determine the optimal introduction of gaps
- 4.5 Types of Alignment
- Different kinds of alignments are useful in different circumstances
- Multiple sequence alignments enable the simultaneous comparison of a set of similar sequences
- Multiple alignments can be constructed by several different techniques
- Multiple alignments can improve the accuracy of alignment for sequences of low similarity
- ClustalW can make global multiple alignments of both DNA and protein sequences
- Multiple alignments can be made by combining a series of local alignments
- Alignment can be improved by incorporating additional information
- 4.6 Searching Databases
- Fast yet accurate search algorithms have been developed
- FASTA is a fast database-search method based on matching short identical segments
- BLAST is based on finding very similar short segments
- Different versions of BLAST and FASTA are used for different problems
- PSI-BLAST enables profile-based database searches
- Ssearch is a rigorous alignment method
- 4.7 Searching with Nucleic Acid or Protein Sequences
- DNA or RNA sequences can be used either directly or after translation
- The quality of a database match has to be tested to ensure that it could not have arisen by chance
- Choosing an appropriate E-value threshold helps to limit a database search
- Low-complexity regions can complicate homology searches
- Different databases can be used to solve particular problems
- 4.8 Protein Sequence Motifs or Patterns
- Creation of pattern databases requires expert knowledge
- The BLOCKS database contains automatically compiled short blocks of conserved multiply aligned protein sequences
- 4.9 Searching Using Motifs and Patterns
- The PROSITE database can be searched for protein motifs and patterns
- The pattern-based program PHI-BLAST searches for both homology and matching motifs
- Patterns can be generated from multiple sequences using PRATT
- The PRINTS database consists of fingerprints representing sets of conserved motifs that describe a protein family
- The Pfam database defines profiles of protein families
- 4.10 Patterns and Protein Function
- Searches can be made for particular functional sites in proteins
- Sequence comparison is not the only way of analyzing protein sequences
- Summary
- Further Reading
- 4.1 Principles of Sequence Alignment
- 4.2 Scoring Alignments
- Twilight zone and midnight zone
- 4.3 Substitution Matrices
- 4.4 Inserting Gaps
- 4.5 Types of Alignment
- ClustalW
- DIALIGN
- 4.6 Searching Databases
- BLAST
- FASTA
- PSI-BLAST
- 4.8 Protein Sequence Motifs or Patterns
- MEME
- 4.9 Searching Using Motifs and Patterns
- MOTIF
- PRATT
- 4.10 Patterns and Protein Function
- HCA
- 5 Pairwise Sequence Alignment and Database Searching
- 5.1 Substitution Matrices and Scoring
- Alignment scores attempt to measure the likelihood of a common evolutionary ancestor
- The PAM (MDM) substitution scoring matrices were designed to trace the evolutionary origins of proteins
- The BLOSUM matrices were designed to find conserved regions of proteins
- Scoring matrices for nucleotide sequence alignment can be derived in similar ways
- The substitution scoring matrix used must be appropriate to the specific alignment problem
- Gaps are scored in a much more heuristic way than substitutions
- 5.2 Dynamic Programming Algorithms
- Optimal global alignments are produced using efficient variations of the Needleman-Wunsch algorithm
- Local and suboptimal alignments can be produced by making small modifications to the dynamic programming algorithm
- Time can be saved with a loss of rigor by not calculating the whole matrix
- 5.3 Indexing Techniques and Algorithmic Approximations
- Suffix trees locate the positions of repeats and unique sequences
- Hashing is an indexing technique that lists the starting positions of all k-tuples
- The FASTA algorithm uses hashing and chaining for fast database searching
- The BLAST algorithm makes use of finite-state automata
- Comparing a nucleotide sequence directly with a protein sequence requires special modifications to the BLAST and FASTA algorithms
- 5.4 Alignment Score Significance
- The statistics of gapped local alignments can be approximated by the same theory
- 5.5 Aligning Complete Genome Sequences
- Indexing and scanning whole genome sequences efficiently is crucial for the sequence alignment of higher organisms
- The complex evolutionary relationships between the genomes of even closely related organisms require novel alignment algorithms
- Summary
- Further Reading
- 5.1 Substitution Matrices and Scoring
- The PAM (MDM) substitution scoring matrices were designed to trace the evolutionary origins of proteins
- The BLOSUM matrices were designed to find conserved regions of proteins
- Scoring matrices for nucleotide sequence alignment can be derived in similar ways
- The substitution scoring matrix used must be appropriate to the specific alignment problem
- Gaps are scored in a much more heuristic way than substitutions
- 5.2 Dynamic Programming Algorithms
- Optimal global alignments are produced using efficient variations of the Needleman-Wunsch algorithm
- Local and suboptimal alignments can be produced by making small modifications to the dynamic programming algorithm
- Time can be saved with a loss of rigor by not calculating the whole matrix
- 5.3 Indexing Techniques and Algorithmic Approximations
- Suffix trees locate the positions of repeats and unique sequences; Hashing is an indexing technique that lists the starting positions of all k-tuples
- The FASTA algorithm uses hashing and chaining for fast database searching; The BLAST algorithm makes use of finite-state automata
- Box 5.2: Sometimes things just aren’t complex enough
- 5.4 Alignment Score Significance
- 5.5 Aligning Complete Genome Sequences
- 6 Patterns, Profiles, and Multiple Alignments
- 6.1 Profiles and Sequence Logos
- Position-specific scoring matrices are an extension of substitution scoring matrices
- Methods for overcoming a lack of data in deriving the values for a PSSM
- PSI-BLAST is a sequence database searching program
- Representing a profile as a logo
- 6.2 Profile Hidden Markov Models
- The basic structure of HMMs used in sequence alignment to profiles
- Estimating HMM parameters using aligned sequences
- Scoring a sequence against a profile HMM: The most probable path and the sum over all paths
- Estimating HMM parameters using unaligned sequences
- 6.3 Aligning Profiles
- Comparing two PSSMs by alignment
- Aligning profile HMMs
- 6.4 Multiple Sequence Alignments by Gradual Sequence Addition
- The order in which sequences are added is chosen based on the estimated likelihood of incorporating errors in the alignment
- Many different scoring schemes have been used in constructing multiple alignments
- The multiple alignment is built using the guide tree and profile methods and may be further refined
- 6.5 Other Ways of Obtaining Multiple Alignments
- The multiple sequence alignment program DIALIGN aligns ungapped blocks
- The SAGA method of multiple alignment uses a genetic algorithm
- 6.6 Sequence Pattern Discovery
- Discovering patterns in a multiple alignment: eMOTIF and AACC
- Probabilistic searching for common patterns in sequences: Gibbs and MEME
- Searching for more general sequence patterns
- Summary
- Further Reading
- 6.1 Profiles and Sequence Logos
- 6.2 Profile Hidden Markov Models
- 6.3 Aligning Profiles
- 6.4 Multiple Sequence Alignments by Gradual Sequence Addition
- 6.5 Other Ways of Obtaining Multiple Alignments
- 6.6 Sequence Pattern Discovery
- Part 3 Evolutionary Processes
- 7 Recovering Evolutionary History
- 7.1 The Structure and Interpretation of Phylogenetic Trees
- Phylogenetic trees reconstruct evolutionary relationships
- Tree topology can be described in several ways
- Consensus and condensed trees report the results of comparing tree topologies
- 7.2 Molecular Evolution and its Consequences
- Most related sequences have many positions that have mutated several times
- The rate of accepted mutation is usually not the same for all types of base substitution
- Different codon positions have different mutation rates
- Only orthologous genes should be used to construct species phylogenetic trees
- Major changes affecting large regions of the genome are surprisingly common
- 7.3 Phylogenetic Tree Reconstruction
- Small ribosomal subunit rRNA sequences are well suited to reconstructing the evolution of species
- The choice of the method for tree reconstruction depends to some extent on the size and quality of the dataset
- A model of evolution must be chosen to use with the method
- All phylogenetic analyses must start with an accurate multiple alignment
- Phylogenetic analyses of a small dataset of 16S RNA sequence data
- Building a gene tree for a family of enzymes can help to identify how enzymatic functions evolved
- Summary
- Further Reading
- General
- 7.1 The Structure and Interpretation of Phylogenetic Trees
- Phylogenetic trees reconstruct evolutionary relationships
- Tree topology can be described in several ways
- Consensus and condensed trees report the results of comparing tree topologies
- 7.2 Molecular Evolution and its Consequences
- The rate of accepted mutation is usually not the same for all types of base substitution
- Different codon positions have different mutation rates
- Only orthologous genes should be used to construct species phylogenetic trees
- Major changes affecting large regions of the genome are surprisingly common
- 7.3 Phylogenetic Tree Reconstruction
- Small ribosomal subunit rRNA sequences are well suited to reconstructing the evolution of species
- The choice of the method for tree reconstruction depends to some extent on the size and quality of the dataset
- A model of evolution must be chosen to use with the method
- Phylogenetic analyses of a small dataset of 16S RNA sequence data
- Building a gene tree for a family of enzymes can help to identify how enzymatic functions evolved
- 8 Building Phylogenetic Trees
- 8.1 Evolutionary Models and the Calculation of Evolutionary Distance
- A simple but inaccurate measure of evolutionary distance is the p-distance
- The Poisson distance correction takes account of multiple mutations at the same site
- The Gamma distance correction takes account of mutation rate variation at different sequence positions
- The Jukes-Cantor model reproduces some basic features of the evolution of nucleotide sequences
- More complex models distinguish between the relative frequencies of different types of mutation
- There is a nucleotide bias in DNA sequences
- Models of protein-sequence evolution are closely related to the substitution matrices used for sequence alignment
- 8.2 Generating Single Phylogenetic Trees
- Clustering methods produce a phylogenetic tree based on evolutionary distances
- The UPGMA method assumes a constant molecular clock and produces an ultrametric tree
- The Fitch-Margoliash method produces an unrooted additive tree
- The neighbor-joining method is related to the concept of minimum evolution
- Stepwise addition and star-decomposition methods are usually used to generate starting trees for further exploration, not the final tree
- 8.3 Generating Multiple Tree Topologies
- The branch-and-bound method greatly improves the efficiency of exploring tree topology
- Optimization of tree topology can be achieved by making a series of small changes to an existing tree
- Finding the root gives a phylogenetic tree a direction in time
- 8.4 Evaluating Tree Topologies
- Functions based on evolutionary distances can be used to evaluate trees
- Unweighted parsimony methods look for the trees with the smallest number of mutations
- Mutations can be weighted in different ways in the parsimony method
- Trees can be evaluated using the maximum likelihood method
- The quartet-puzzling method also involves maximum likelihood in the standard implementation
- Bayesian methods can also be used to reconstruct phylogenetic trees
- 8.5 Assessing the Reliability of Tree Features and Comparing Trees
- The long-branch attraction problem can arise even with perfect data and methodology
- Tree topology can be tested by examining the interior branches
- Tests have been proposed for comparing two or more alternative trees
- Summary
- Further Reading
- 8.1 Evolutionary Models and the Calculation of Evolutionary Distance
- The Gamma distance correction takes account of mutation rate variation at different sequence positions
- More complex models distinguish between the relative frequencies of different types of mutation
- There is a nucleotide bias in DNA sequences
- Models of protein-sequence evolution are closely related to the substitution matrices used for sequence alignment
- 8.2 Generating Single Phylogenetic Trees
- The UPGMA method assumes a constant molecular clock and produces an ultrametric tree
- The Fitch-Margoliash method produces an unrooted additive tree
- The neighbor-joining method is related to the concept of minimum evolution
- 8.3 Generating Multiple Tree Topologies
- Optimization of tree topology can be achieved by making a series of small changes to an existing tree
- Finding the root gives a phylogenetic tree a direction in time
- 8.4 Evaluating Tree Topologies
- Functions based on evolutionary distances can be used to evaluate trees
- Trees can be evaluated using the maximum likelihood method
- The quartet-puzzling method also involves maximum likelihood in the standard implementation
- Bayesian methods can also be used to reconstruct phylogenetic trees
- 8.5 Assessing the Reliability of Tree Features and Comparing Trees
- The long-branch attraction problem can arise even with perfect data and methodology
- Tree topology can be tested by examining the interior branches
- Tests have been proposed for comparing two or more alternative trees
- Part 4 Genome Characteristics
- 9 Revealing Genome Features
- 9.1 Preliminary Examination of Genome Sequence
- Whole genome sequences can be split up to simplify gene searches
- Structural RNA genes and repeat sequences can be excluded from further analysis
- Homology can be used to identify genes in both prokaryotic and eukaryotic genomes
- 9.2 Gene Prediction in Prokaryotic Genomes
- 9.3 Gene Prediction in Eukaryotic Genomes
- Programs for predicting exons and introns use a variety of approaches
- Gene predictions must preserve the correct reading frame
- Some programs search for exons using only the query sequence and a model for exons
- Some programs search for genes using only the query sequence and a gene model
- Genes can be predicted using a gene model and sequence similarity
- Genomes of related organisms can be used to improve gene prediction
- 9.4 Splice Site Detection
- Splice sites can be detected independently by specialized programs
- 9.5 Prediction of Promoter Regions
- Prokaryotic promoter regions contain relatively well-defined motifs
- Eukaryotic promoter regions are typically more complex than prokaryotic promoters
- A variety of promoter-prediction methods are available online
- Promoter prediction results are not very clear-cut
- 9.6 Confirming Predictions
- There are various methods for calculating the accuracy of gene-prediction programs
- Translating predicted exons can confirm the correctness of the prediction
- Constructing the protein and identifying homologs
- 9.7 Genome Annotation
- Genome annotation is the final step in genome analysis
- Gene ontology provides a standard vocabulary for gene annotation
- 9.8 Large Genome Comparisons
- Summary
- Further Reading
- 9.1 Preliminary Examination of Genome Sequence
- 9.2 Gene Prediction in Prokaryotic Genomes
- 9.3 Gene Prediction in Eukaryotic Genomes
- 9.4 Splice Site Detection
- 9.5 Prediction of Promoter Regions
- 9.6 Confirming Predictions
- 9.7 Genome Annotation
- 9.8 Large Genome Comparisons
- Box 9.5
- 10 Gene Detection and Genome Annotation
- 10.1 Detection of Functional RNA Molecules Using Decision Trees
- Detection of tRNA genes using the tRNAscan algorithm
- Detection of tRNA genes in eukaryotic genomes
- 10.2 Features Useful for Gene Detection in Prokaryotes
- 10.3 Algorithms for Gene Detection in Prokaryotes
- GeneMark uses inhomogeneous Markov chains and dicodon statistics
- GLIMMER uses interpolated Markov models of coding potential
- Orpheus uses homology, codon statistics, and ribosome-binding sites
- GeneMark.hmm uses explicit state duration hidden Markov models
- EcoParse is an HMM gene model
- 10.4 Features Used in Eukaryotic Gene Detection
- Differences between prokaryotic and eukaryotic genes
- Introns, exons, and splice sites
- Promoter sequences and binding sites for transcription factors
- 10.5 Predicting Eukaryotic Gene Signals
- Detection of core promoter binding signals is a key element of some eukaryotic gene-prediction methods
- A set of models has been designed to locate the site of core promoter sequence signals
- Predicting promoter regions from general sequence properties can reduce the numbers of false-positive results
- Predicting eukaryotic transcription and translation start sites
- Translation and transcription stop signals complete the gene definition
- 10.6 Predicting Exons and Introns
- Exons can be identified using general sequence properties
- Splice-site prediction
- Splice sites can be predicted by sequence patterns combined with base statistics
- GenScan uses a combination of weight matrices and decision trees to locate splice sites
- GeneSplicer predicts splice sites using first-order Markov chains
- NetPlantGene combines neural networks with intron and exon predictions to predict splice sites
- Other splicing features may yet be exploited for splice-site prediction
- Specific methods exist to identify initial and terminal exons
- Exons can be defined by searching databases for homologous regions
- 10.7 Complete Eukaryotic Gene Models
- 10.8 Beyond the Prediction of Individual Genes
- Functional annotation
- Comparison of related genomes can help resolve uncertain predictions
- Evaluation and reevaluation of gene-detection methods
- Summary
- Further Reading
- 10.1 Detection of Functional RNA Molecules Using Decision Trees
- tRNA detection
- Detection of other RNA genes
- 10.2 Features Useful for Gene Detection in Prokaryotes
- Identifying protein-coding regions using base statistics
- 10.3 Algorithms for Gene Detection in Prokaryotes
- GeneMark, GeneMark.hmm, and further developments
- Glimmer
- Orpheus
- EcoParse
- Prokaryotic genomes
- Markov models
- 10.4 Features Used in Eukaryotic Gene Detection
- 10.4 Preliminary analysis for human genes
- 10.5 Predicting Eukaryotic Gene Signals
- Initial analysis of core promoter sequences
- Algorithms of core promoter detection
- Promoter recognition
- 10.6 Predicting Exons and Introns
- 10.7 Complete Eukaryotic Gene Models
- 10.8 Beyond the Prediction of Individual Genes
- Detailed reexamination of the annotation of a complete genome
- Large-scale changes in chromosomes
- Box 10.1 Measures of gene prediction accuracy at the nucleotide level
- Box 10.2 Sequencing many genomes at once
- Box 10.3 Measures of gene prediction accuracy at the exon level
- Part 5 Secondary Structures
- 11 Obtaining Secondary Structure from Sequence
- 11.1 Types of Prediction Methods
- Statistical methods are based on rules that give the probability that a residue will form part of a particular secondary structure
- Nearest-neighbor methods are statistical methods that incorporate additional information about protein structure
- Machine-learning approaches to secondary structure prediction mainly make use of neural networks and HMM methods
- 11.2 Training and Test Databases
- There are several ways to define protein secondary structures
- 11.3 Assessing the Accuracy of Prediction Programs
- Q3 measures the accuracy of individual residue assignments
- Secondary structure predictions should not be expected to reach 100% residue accuracy
- The Sov value measures the prediction accuracy for whole elements
- CAFASP/CASP: Unbiased and readily available protein prediction assessments
- 11.4 Statistical and Knowledge-Based Methods
- The GOR method uses an information theory approach
- The program Zpred includes multiple alignment of homologous sequences and residue conservation information
- There is an overall increase in prediction accuracy using multiple sequence information
- The nearest-neighbor method: The use of multiple nonhomologous sequences
- Predator is a combined statistical and knowledge-based program that includes the nearest-neighbor approach
- 11.5 Neural Network Methods of Secondary Structure Prediction
- Assessing the reliability of neural net predictions
- Several examples of Web-based neural network secondary structure prediction programs
- PROF: Protein forecasting
- Psipred
- Jnet: Using several alternative representations of the sequence alignment
- 11.6 Some Secondary Structures Require Specialized Prediction Methods
- Transmembrane proteins
- Quantifying the preference for a membrane environment
- 11.7 Prediction of Transmembrane Protein Structure
- Multi-helix membrane proteins
- A selection of prediction programs to predict transmembrane helices
- Statistical methods
- Knowledge-based prediction
- Evolutionary information from protein families improves the prediction
- Neural nets in transmembrane prediction
- Predicting transmembrane helices with hidden Markov models
- Comparing the results: What to choose
- What happens if a non-transmembrane protein is submitted to transmembrane prediction programs
- Prediction of transmembrane structure containing β-strands
- 11.8 Coiled-Coil Structures
- The COILS prediction program
- PAIRCOIL and MULTICOIL are an extension of the COILS algorithm
- Zipping the leucine zipper: A specialized coiled coil
- 11.9 RNA Secondary Structure Prediction
- Summary
- Further Reading
- 11.1 Types of Prediction Methods
- 11.2 Training and Test Databases
- PDB
- STRIDE
- DSSP
- DEFINE
- 11.3 Assessing the Accuracy of Prediction Programs
- 11.4 Statistical and Knowledge-Based Methods
- 11.5 Neural Network Methods of Secondary Structure Prediction
- 11.7 Prediction of Transmembrane Protein Structure
- 11.8 Coiled-Coil Structures
- 11.9 RNA Secondary Structure Prediction
- Box 11.1
- Box 11.3
- 12 Predicting Secondary Structures
- 12.1 Defining Secondary Structure and Prediction Accuracy
- The definitions used for automatic protein secondary structure assignment do not give identical results
- There are several different measures of the accuracy of secondary structure prediction
- 12.2 Secondary Structure Prediction Based on Residue Propensities
- Each structural state has an amino acid preference which can be assigned as a residue propensity
- The simplest prediction methods are based on the average residue propensity over a sequence window
- Residue propensities are modulated by nearby sequence
- Predictions can be significantly improved by including information from homologous sequences
- 12.3 The Nearest-Neighbor Methods are Based on Sequence Segment Similarity
- Short segments of similar sequence are found to have similar structure
- Several sequence similarity measures have been used to identify nearest-neighbor segments
- A weighted average of the nearest-neighbor segment structures is used to make the prediction
- A nearest-neighbor method has been developed to predict regions with a high potential to misfold
- 12.4 Neural Networks Have Been Employed Successfully for Secondary Structure Prediction
- Layered feed-forward neural networks can transform a sequence into a structural prediction
- Inclusion of information on homologous sequences improves neural network accuracy
- More complex neural nets have been applied to predict secondary and other structural features
- 12.5 Hidden Markov Models Have Been Applied to Structure Prediction
- HMM methods have been found especially effective for transmembrane proteins
- Nonmembrane protein secondary structures can also be successfully predicted with HMMs
- 12.6 General Data Classification Techniques can Predict Structural Features
- Support vector machines have been successfully used for protein structure prediction
- Discriminants, SOMs, and other methods have also been used
- Summary
- Further Reading
- 12.1 Defining Secondary Structure and Prediction Accuracy
- DSSP
- PALSSE
- β-Spider
- Limits of prediction accuracy
- TM properties and accuracy measures
- PSIPRED (Q3 and Sov variation)
- Matthews correlation coefficient
- Sov
- SS assignment comparison
- Length distributions (non-TM)
- 12.2 Secondary Structure Prediction Based on Residue Propensities
- PDB_SELECT
- In-depth analysis of unbiased structural datasets
- Hydrophobicity scales
- Chou-Fasman
- COILS
- MEMSAT
- Local and nonlocal effects
- β-turn propensities
- β-turn propensities and use of PSSMs
- AAindex
- GOR theory
- GOR I
- GOR II
- GOR III
- GOR IV
- GOR V (includes other sequences)
- Using consecutive pair of structural states
- Zpred
- Treatment of gaps in multiple alignments during prediction
- 12.3 The Nearest-Neighbor Methods are Based on Sequence Segment Similarity
- NNSSP
- SSPAL
- I-sites
- SIMPA96
- Correlation between amino acid composition and protein structural class
- HβP
- 12.4 Neural Networks Have Been Employed Successfully for Secondary Structure Prediction
- PHDsec
- PHDpsi
- PSIPRED
- Back-propagation learning
- BTPRED
- Betaturns
- DESTRUCT
- SSpro
- 12.5 Hidden Markov Models Have Been Applied to Structure Prediction
- HMMTOP
- TMHMM
- Phobius
- PROftmb
- PRED-TMBB
- YASPIN
- MARCOIL
- 12.6 General Data Classification Techniques can Predict Structural Features
- DSC
- FoldIndex
- GPI-SOM
- PSIMLR
- Part 6 Tertiary Structures
- 13 Modeling Protein Structure
- 13.1 Potential Energy Functions and Force Fields
- The conformation of a protein can be visualized in terms of a potential energy surface
- Conformational energies can be described by simple mathematical functions
- Similar force fields can be used to represent conformational energies in the presence of averaged environments
- Potential energy functions can be used to assess a modeled structure
- Energy minimization can be used to refine a modeled structure and identify local energy minima
- Molecular dynamics and simulated annealing are used to find global energy minima
- 13.2 Obtaining a Structure by Threading
- The prediction of protein folds in the absence of known structural homologs
- Libraries or databases of nonredundant protein folds are used in threading
- Two distinct types of scoring schemes have been used in threading methods
- Dynamic programming methods can identify optimal alignments of target sequences and structural folds
- Several methods are available to assess the confidence to be put on the fold prediction
- The C2-like domain from the Dictyostelia: A practical example of threading
- 13.3 Principles of Homology Modeling
- Closely related target and template sequences give better models
- Significant sequence identity depends on the length of the sequence
- Homology modeling has been automated to deal with the numbers of sequences that can now be modeled
- Model building is based on a number of assumptions
- 13.4 Steps in Homology Modeling
- Structural homologs to the target protein are found in the PDB
- Accurate alignment of target and template sequences is essential for successful modeling
- The structurally conserved regions of a protein are modeled first
- The modeled core is checked for misfits before proceeding to the next stage
- Sequence realignment and remodeling may improve the structure
- Insertions and deletions are usually modeled as loops
- Nonidentical amino acid side chains are modeled mainly by using rotamer libraries
- Energy minimization is used to relieve structural errors
- Molecular dynamics can be used to explore possible conformations for mobile loops
- Models need to be checked for accuracy
- How far can homology models be trusted?
- 13.5 Automated Homology Modeling
- The program MODELLER models by satisfying protein structure constraints
- COMPOSER uses fragment-based modeling to automatically generate a model
- Automated methods available on the Web for comparative modeling
- Assessment of structure prediction
- 13.6 Homology Modeling of PI3 Kinase p110α
- Swiss-Pdb Viewer can be used for manual or semi-manual modeling
- Alignment, core modeling, and side-chain modeling are carried out all in one
- The loops are modeled from a database of possible structures
- Energy minimization and quality inspection can be carried out within Swiss-Pdb Viewer
- MolIDE is a downloadable semi-automatic modeling package
- Automated modeling on the Web illustrated with p110α kinase
- Modeling a functionally related but sequentially dissimilar protein: mTOR
- Generating a multidomain three-dimensional structure from sequence
- Summary
- Further Reading
- Modeling: General
- 13.1 Potential Energy Functions and Force Fields
- Ab initio modeling
- 13.2 Obtaining a Structure by Threading
- 123D+
- GenTHREADER
- 3D-PSSM
- FUGUE
- LIBRA
- LOOPP
- LIBELLULA
- SCOP
- 13.3 Automated Homology Modeling
- MolIDE
- SCWRL
- PSIPRED
- LOOPY
- MODELLER
- Assessing models
- MolProbity
- PROCHECK
- CASP
- Antibody and HIV examples
- 14 Analyzing Structure-Function Relationships
- 14.1 Functional Conservation
- Functional regions are usually structurally conserved
- Similar biochemical function can be found in proteins with different folds
- Fold libraries identify structurally similar proteins regardless of function
- 14.2 Structure Comparison Methods
- Finding domains in proteins aids structure comparison
- Structural comparisons can reveal conserved functional elements not discernible from a sequence comparison
- The CE method builds up a structural alignment from pairs of aligned protein segments
- The Vector Alignment Search Tool (VAST) aligns secondary structural elements
- DALI identifies structure superposition without maintaining segment order
- FATCAT introduces rotations between rigid segments
- 14.3 Finding Binding Sites
- Highly conserved, strongly charged, or hydrophobic surface areas may indicate interaction sites
- Searching for protein-protein interactions using surface properties
- Surface calculations highlight clefts or holes in a protein that may serve as binding sites
- Looking at residue conservation can identify binding sites
- 14.4 Docking Methods and Programs
- Simple docking procedures can be used when the structure of a homologous protein bound to a ligand analog is known
- Specialized docking programs will automatically dock a ligand to a structure
- Scoring functions are used to identify the most likely docked ligand
- The DOCK program is a semirigid-body method that analyzes shape and chemical complementarity of ligand and binding site
- Fragment docking identifies potential substrates by predicting types of atoms and functional groups in the binding area
- GOLD is a flexible docking program, which utilizes a genetic algorithm
- The water molecules in binding sites should also be considered
- Summary
- Further Reading
- General
- 14.1 Functional Conservation
- Structure-function relationships
- Fold recognition programs
- Domain identification
- Viewing programs
- 14.3 Finding Binding Sites
- Evolutionary trace methods
- 14.4 Docking Methods and Programs
- Part 7 Cells and Organisms
- 15 Proteome and Gene Expression Analysis
- 15.1 Analysis of Large-scale Gene Expression
- The expression of large numbers of different genes can be measured simultaneously by DNA microarrays
- Gene expression microarrays are mainly used to detect differences in gene expression in different conditions
- Serial analysis of gene expression (SAGE) is also used to study global patterns of gene expression
- Digital differential display uses bioinformatics and statistics to detect differential gene expression in different tissues
- Facilitating the integration of data from different places and experiments
- The simplest method of analyzing gene expression microarray data is hierarchical cluster analysis
- Techniques based on self-organizing maps can be used for analyzing microarray data
- Self-organizing tree algorithms (SOTAs) cluster from the top down by successive subdivision of clusters
- Clustered gene expression data can be used as a tool for further research
- 15.2 Analysis of Large-scale Protein Expression
- Two-dimensional gel electrophoresis is a method for separating the individual proteins in a cell
- Measuring the expression levels shown in 2D gels
- Differences in protein expression levels between different samples can be detected by 2D gels
- Clustering methods are used to identify protein spots with similar expression patterns
- Principal component analysis (PCA) is an alternative to clustering for analyzing microarray and 2D gel data
- The changes in a set of protein spots can be tracked over a number of different samples
- Databases and online tools are available to aid the interpretation of 2D gel data
- Protein microarrays allow the simultaneous detection of the presence or activity of large numbers of different proteins
- Mass spectrometry can be used to identify the proteins separated and purified by 2D gel electrophoresis or other means
- Protein-identification programs for mass spectrometry are freely available on the Web
- Mass spectrometry can be used to measure protein concentration
- Summary
- Further Reading
- 15.1 Analysis of Large-scale Gene Expression
- Functional genomics
- Microarray quality control focus
- MIAME
- 15.2 Analysis of Large-scale Protein Expression
- Proteomics
- Protein microarrays
- MASCOT
- ProteinProspector
- Databases and software
- 16 Clustering Methods and Statistics
- 16.1 Expression Data Require Preparation Prior to Analysis
- Data normalization is designed to remove systematic experimental errors
- Expression levels are often analyzed as ratios and are usually transformed by taking logarithms
- Sometimes further normalization is useful after the data transformation
- Principal component analysis is a method for combining the properties of an object
- 16.2 Cluster Analysis Requires Distances to be Defined Between all Data Points
- Euclidean distance is the measure used in everyday life
- The Pearson correlation coefficient measures distance in terms of the shape of the expression response
- The Mahalanobis distance takes account of the variation and correlation of expression responses
- 16.3 Clustering Methods Identify Similar and Distinct Expression Patterns
- Hierarchical clustering produces a related set of alternative partitions of the data
- k-means clustering groups data into several clusters but does not determine a relationship between clusters
- Self-organizing maps (SOMs) use neural network methods to cluster data into a predetermined number of clusters
- Evolutionary clustering algorithms use selection, recombination, and mutation to find the best possible solution to a problem
- The self-organizing tree algorithm (SOTA) determines the number of clusters required
- Biclustering identifies a subset of similar expression level patterns occurring in a subset of the samples
- The validity of clusters is determined by independent methods
- 16.4 Statistical Analysis can Quantify the Significance of Observed Differential Expression
- T-tests can be used to estimate the significance of the difference between two expression levels
- Nonparametric tests are used to avoid making assumptions about the data sampling
- Multiple testing of differential expression requires special techniques to control error rates
- 16.5 Gene and Protein Expression Data Can be Used to Classify Samples
- Many alternative methods have been proposed that can classify samples
- Support vector machines are another form of supervised learning algorithm that can produce classifiers
- Summary
- Further Reading
- Monograph
- 16.1 Expression Data Require Preparation Prior to Analysis
- Variance-stabilizing transformation
- Lowess normalization
- General review of normalization and transformation of data
- Data used to generate Figures 16.2 to 16.4
- Principal component analysis
- 16.2 Cluster Analysis Requires Distances to be Defined Between all Data Points
- Distance definitions
- 16.3 Clustering Methods Identify Similar and Distinct Expression Patterns
- Self-organizing maps (SOMs)
- Clustering using genetic algorithms
- SOTA
- Biclustering techniques
- Validation of partitions produced by clustering methods
- 16.4 Statistical Analysis can Quantify the Significance of Observed Differential Expression
- Bayesian estimation of variance
- Nonparametric tests
- Multiple hypothesis testing
- 16.5 Gene and Protein Expression Data Can be Used to Classify Samples
- Support vector machines
- 17 Systems Biology
- 17.1 What is a System?
- A system is more than the sum of its parts
- A biological system is a living network
- Databases are useful starting points in constructing a network
- To construct a model more information is needed than a network
- There are three possible approaches to constructing a model
- Kinetic models are not the only way in systems biology
- 17.2 Structure of the Model
- Control circuits are an essential part of any biological system
- The interactions in networks can be represented as simple differential equations
- 17.3 Robustness of Biological Systems
- Robustness is a distinct feature of complexity in biology
- Modularity plays an important part in robustness
- Redundancy in the system can provide robustness
- Living systems can switch from one state to another by means of bistable switches
- 17.4 Storing and Running System Models
- Specialized programs make simulating systems easier
- Standardized systems descriptions aid their storage and reuse
- Summary
- Further Reading
- 17.1 What is a System?
- Databases
- 17.2 Structure of the Model
- Topological modeling
- 17.3 Robustness of Biological Systems
- Molecular systems modeling
- Appendix A: Probability, Information, and Bayesian Analysis
- Probability Theory, Entropy, and Information
- Mutually exclusive events
- Occurrence of two events
- Occurrence of two random variables
- Bayesian Analysis
- Bayes’ theorem
- Inference of parameter values
- Further Reading
- Appendix B: Molecular Energy Functions
- Force Fields for Calculating Intra- and Intermolecular Interaction Energies
- Bonding terms
- Nonbonding terms
- Potentials used in Threading
- Potentials of mean force
- Potential terms relating to solvent effects
- Further Reading
- Appendix C: Function Optimization
- Full Search Methods
- Dynamic programming and branch-and-bound
- Local Optimization
- The downhill simplex method
- The steepest descent method
- The conjugate gradient method
- Methods using second derivatives
- Thermodynamic Simulation and Global Optimization
- Monte Carlo and genetic algorithms
- Molecular dynamics
- Simulated annealing
- Summary
- Further Reading
- List of Symbols
- Notation concepts
- Glossary
- Index