Understanding Bioinformatics

Author Marketa Zvelebil; Jeremy O. Baum

Publisher Garland Science

Format ePub

Print ISBN 9780815340249

Edition 1

Publication year 2008

15,790 kr.

Table of Contents

  • Cover
  • Half Title
  • Dedication
  • Title Page
  • Copyright Page
  • Preface
  • A Note to the Reader
  • Organization of this Book
  • Applications and Theory Chapters
  • Part 1: Background Basics
  • Part 2: Sequence Alignments
  • Part 3: Evolutionary Processes
  • Part 4: Genome Characteristics
  • Part 5: Secondary Structures
  • Part 6: Tertiary Structures
  • Part 7: Cells and Organisms
  • Appendices
  • Organization of the Chapters
  • Learning Outcomes
  • Flow Diagrams
  • Mind Maps
  • Illustrations
  • Further Reading
  • List of Symbols
  • Glossary
  • Garland Science Website
  • Artwork
  • Additional Material
  • List of Reviewers
  • Contents In Brief
  • Table of Contents
  • Part 1 Background Basics
  • 1 The Nucleic Acid World
  • 1.1 The Structure of DNA and RNA
  • DNA is a linear polymer of only four different bases
  • Two complementary DNA strands interact by base-pairing to form a double helix
  • RNA molecules are mostly single stranded but can also have base-pair structures
  • 1.2 DNA, RNA, and Protein: The Central Dogma
  • DNA is the information store, but RNA is the messenger
  • Messenger RNA is translated into protein according to the genetic code
  • Translation involves transfer RNAs and RNA-containing ribosomes
  • 1.3 Gene Structure and Control
  • RNA polymerase binds to specific sequences that position it and identify where to begin transcription
  • The signals initiating transcription in eukaryotes are generally more complex than those in bacteria
  • Eukaryotic mRNA transcripts undergo several modifications prior to their use in translation
  • The control of translation
  • 1.4 The Tree of Life and Evolution
  • A brief survey of the basic characteristics of the major forms of life
  • Nucleic acid sequences can change as a result of mutation
  • Summary
  • Further Reading
  • General References
  • 1.1 The Structure of DNA and RNA
  • 1.2 DNA, RNA, and Protein: The Central Dogma
  • 1.3 Gene Structure and Control
  • 1.4 The Tree of Life and Evolution
  • Box 1.2
  • 2 Protein Structure
  • 2.1 Primary and Secondary Structure
  • Protein structure can be considered on several different levels
  • Amino acids are the building blocks of proteins
  • The differing chemical and physical properties of amino acids are due to their side chains
  • Amino acids are covalently linked together in the protein chain by peptide bonds
  • Secondary structure of proteins is made up of α-helices and β-strands
  • Several different types of β-sheet are found in protein structures
  • Turns, hairpins, and loops connect helices and strands
  • 2.2 Implication for Bioinformatics
  • Certain amino acids prefer a particular structural unit
  • Evolution has aided sequence analysis
  • Visualization and computer manipulation of protein structures
  • 2.3 Proteins Fold to Form Compact Structures
  • The tertiary structure of a protein is defined by the path of the polypeptide chain
  • The stable folded state of a protein represents a state of low energy
  • Many proteins are formed of multiple subunits
  • Summary
  • Further Reading
  • 3 Dealing with Databases
  • 3.1 The Structure of Databases
  • Flat-file databases store data as text files
  • Relational databases are widely used for storing biological information
  • XML has the flexibility to define bespoke data classifications
  • Many other database structures are used for biological data
  • Databases can be accessed locally or online and often link to each other
  • 3.2 Types of Database
  • There’s more to databases than just data
  • Primary and derived data
  • How we define and connect things is important: Ontologies
  • 3.3 Looking for Databases
  • Sequence databases
  • Microarray databases
  • Protein interaction databases
  • Structural databases
  • 3.4 Data Quality
  • Nonredundancy is especially important for some applications of sequence databases
  • Automated methods can be used to check for data consistency
  • Initial analysis and annotation is usually automated
  • Human intervention is often required to produce the highest quality annotation
  • The importance of updating databases and entry identifier and version numbers
  • Summary
  • Further Reading
  • 3.1 The Structure of Databases
  • 3.2 Types of Database
  • How we define and connect things is important: Ontologies
  • 3.3 Looking for Databases
  • Sequence databases
  • Microarray databases
  • Protein interaction databases
  • Structural databases
  • 3.4 Data Quality
  • MIAME
  • Part 2 Sequence Alignments
  • 4 Producing and Analyzing Sequence Alignments
  • 4.1 Principles of Sequence Alignment
  • Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity
  • Alignment can reveal homology between sequences
  • It is easier to detect homology when comparing protein sequences than when comparing nucleic acid sequences
  • 4.2 Scoring Alignments
  • The quality of an alignment is measured by giving it a quantitative score
  • The simplest way of quantifying similarity between two sequences is percentage identity
  • The dot-plot gives a visual assessment of similarity based on identity
  • Genuine matches do not have to be identical
  • There is a minimum percentage identity that can be accepted as significant
  • There are many different ways of scoring an alignment
  • 4.3 Substitution Matrices
  • Substitution matrices are used to assign individual scores to aligned sequence positions
  • The PAM substitution matrices use substitution frequencies derived from sets of closely related protein sequences
  • The BLOSUM substitution matrices use mutation data from highly conserved local regions of sequence
  • The choice of substitution matrix depends on the problem to be solved
  • 4.4 Inserting Gaps
  • Gaps inserted in a sequence to maximize similarity with another require a scoring penalty
  • Dynamic programming algorithms can determine the optimal introduction of gaps
  • 4.5 Types of Alignment
  • Different kinds of alignments are useful in different circumstances
  • Multiple sequence alignments enable the simultaneous comparison of a set of similar sequences
  • Multiple alignments can be constructed by several different techniques
  • Multiple alignments can improve the accuracy of alignment for sequences of low similarity
  • ClustalW can make global multiple alignments of both DNA and protein sequences
  • Multiple alignments can be made by combining a series of local alignments
  • Alignment can be improved by incorporating additional information
  • 4.6 Searching Databases
  • Fast yet accurate search algorithms have been developed
  • FASTA is a fast database-search method based on matching short identical segments
  • BLAST is based on finding very similar short segments
  • Different versions of BLAST and FASTA are used for different problems
  • PSI-BLAST enables profile-based database searches
  • Ssearch is a rigorous alignment method
  • 4.7 Searching with Nucleic Acid or Protein Sequences
  • DNA or RNA sequences can be used either directly or after translation
  • The quality of a database match has to be tested to ensure that it could not have arisen by chance
  • Choosing an appropriate E-value threshold helps to limit a database search
  • Low-complexity regions can complicate homology searches
  • Different databases can be used to solve particular problems
  • 4.8 Protein Sequence Motifs or Patterns
  • Creation of pattern databases requires expert knowledge
  • The BLOCKS database contains automatically compiled short blocks of conserved multiply aligned protein sequences
  • 4.9 Searching Using Motifs and Patterns
  • The PROSITE database can be searched for protein motifs and patterns
  • The pattern-based program PHI-BLAST searches for both homology and matching motifs
  • Patterns can be generated from multiple sequences using PRATT
  • The PRINTS database consists of fingerprints representing sets of conserved motifs that describe a protein family
  • The Pfam database defines profiles of protein families
  • 4.10 Patterns and Protein Function
  • Searches can be made for particular functional sites in proteins
  • Sequence comparison is not the only way of analyzing protein sequences
  • Summary
  • Further Reading
  • 4.1 Principles of Sequence Alignment
  • 4.2 Scoring Alignments
  • Twilight zone and midnight zone
  • 4.3 Substitution Matrices
  • 4.4 Inserting Gaps
  • 4.5 Types of Alignment
  • ClustalW
  • DIALIGN
  • 4.6 Searching Databases
  • BLAST
  • FASTA
  • PSI-BLAST
  • 4.8 Protein Sequence Motifs or Patterns
  • MEME
  • 4.9 Searching Using Motifs and Patterns
  • MOTIF
  • PRATT
  • 4.10 Patterns and Protein Function
  • HCA
  • 5 Pairwise Sequence Alignment and Database Searching
  • 5.1 Substitution Matrices and Scoring
  • Alignment scores attempt to measure the likelihood of a common evolutionary ancestor
  • The PAM (MDM) substitution scoring matrices were designed to trace the evolutionary origins of proteins
  • The BLOSUM matrices were designed to find conserved regions of proteins
  • Scoring matrices for nucleotide sequence alignment can be derived in similar ways
  • The substitution scoring matrix used must be appropriate to the specific alignment problem
  • Gaps are scored in a much more heuristic way than substitutions
  • 5.2 Dynamic Programming Algorithms
  • Optimal global alignments are produced using efficient variations of the Needleman-Wunsch algorithm
  • Local and suboptimal alignments can be produced by making small modifications to the dynamic programming algorithm
  • Time can be saved with a loss of rigor by not calculating the whole matrix
  • 5.3 Indexing Techniques and Algorithmic Approximations
  • Suffix trees locate the positions of repeats and unique sequences
  • Hashing is an indexing technique that lists the starting positions of all k-tuples
  • The FASTA algorithm uses hashing and chaining for fast database searching
  • The BLAST algorithm makes use of finite-state automata
  • Comparing a nucleotide sequence directly with a protein sequence requires special modifications to the BLAST and FASTA algorithms
  • 5.4 Alignment Score Significance
  • The statistics of gapped local alignments can be approximated by the same theory
  • 5.5 Aligning Complete Genome Sequences
  • Indexing and scanning whole genome sequences efficiently is crucial for the sequence alignment of higher organisms
  • The complex evolutionary relationships between the genomes of even closely related organisms require novel alignment algorithms
  • Summary
  • Further Reading
  • 5.1 Substitution Matrices and Scoring
  • The PAM (MDM) substitution scoring matrices were designed to trace the evolutionary origins of proteins
  • The BLOSUM matrices were designed to find conserved regions of proteins
  • Scoring matrices for nucleotide sequence alignment can be derived in similar ways
  • The substitution scoring matrix used must be appropriate to the specific alignment problem
  • Gaps are scored in a much more heuristic way than substitutions
  • 5.2 Dynamic Programming Algorithms
  • Optimal global alignments are produced using efficient variations of the Needleman-Wunsch algorithm
  • Local and suboptimal alignments can be produced by making small modifications to the dynamic programming algorithm
  • Time can be saved with a loss of rigor by not calculating the whole matrix
  • 5.3 Indexing Techniques and Algorithmic Approximations
  • Suffix trees locate the positions of repeats and unique sequences; Hashing is an indexing technique that lists the starting positions of all k-tuples
  • The FASTA algorithm uses hashing and chaining for fast database searching; The BLAST algorithm makes use of finite-state automata
  • Box 5.2: Sometimes things just aren’t complex enough
  • 5.4 Alignment Score Significance
  • 5.5 Alignments Involving Complete Genome Sequences
  • 6 Patterns, Profiles, and Multiple Alignments
  • 6.1 Profiles and Sequence Logos
  • Position-specific scoring matrices are an extension of substitution scoring matrices
  • Methods for overcoming a lack of data in deriving the values for a PSSM
  • PSI-BLAST is a sequence database searching program
  • Representing a profile as a logo
  • 6.2 Profile Hidden Markov Models
  • The basic structure of HMMs used in sequence alignment to profiles
  • Estimating HMM parameters using aligned sequences
  • Scoring a sequence against a profile HMM: The most probable path and the sum over all paths
  • Estimating HMM parameters using unaligned sequences
  • 6.3 Aligning Profiles
  • Comparing two PSSMs by alignment
  • Aligning profile HMMs
  • 6.4 Multiple Sequence Alignments by Gradual Sequence Addition
  • The order in which sequences are added is chosen based on the estimated likelihood of incorporating errors in the alignment
  • Many different scoring schemes have been used in constructing multiple alignments
  • The multiple alignment is built using the guide tree and profile methods and may be further refined
  • 6.5 Other Ways of Obtaining Multiple Alignments
  • The multiple sequence alignment program DIALIGN aligns ungapped blocks
  • The SAGA method of multiple alignment uses a genetic algorithm
  • 6.6 Sequence Pattern Discovery
  • Discovering patterns in a multiple alignment: eMOTIF and AACC
  • Probabilistic searching for common patterns in sequences: Gibbs and MEME
  • Searching for more general sequence patterns
  • Summary
  • Further Reading
  • 6.1 Profiles and Sequence Logos
  • 6.2 Profile Hidden Markov Models
  • 6.3 Aligning Profiles
  • 6.4 Multiple Sequence Alignments by Gradual Sequence Addition
  • 6.5 Other Ways of Obtaining Multiple Alignments
  • 6.6 Sequence Pattern Discovery
  • Part 3 Evolutionary Processes
  • 7 Recovering Evolutionary History
  • 7.1 The Structure and Interpretation of Phylogenetic Trees
  • Phylogenetic trees reconstruct evolutionary relationships
  • Tree topology can be described in several ways
  • Consensus and condensed trees report the results of comparing tree topologies
  • 7.2 Molecular Evolution and its Consequences
  • Most related sequences have many positions that have mutated several times
  • The rate of accepted mutation is usually not the same for all types of base substitution
  • Different codon positions have different mutation rates
  • Only orthologous genes should be used to construct species phylogenetic trees
  • Major changes affecting large regions of the genome are surprisingly common
  • 7.3 Phylogenetic Tree Reconstruction
  • Small ribosomal subunit rRNA sequences are well suited to reconstructing the evolution of species
  • The choice of the method for tree reconstruction depends to some extent on the size and quality of the dataset
  • A model of evolution must be chosen to use with the method
  • All phylogenetic analyses must start with an accurate multiple alignment
  • Phylogenetic analyses of a small dataset of 16S RNA sequence data
  • Building a gene tree for a family of enzymes can help to identify how enzymatic functions evolved
  • Summary
  • Further Reading
  • General
  • 7.1 The Structure and Interpretation of Phylogenetic Trees
  • Phylogenetic trees reconstruct evolutionary relationships
  • Tree topology can be described in several ways
  • Consensus and condensed trees report the results of comparing tree topologies
  • 7.2 Molecular Evolution and its Consequences
  • The rate of accepted mutation is usually not the same for all types of base substitution
  • Different codon positions have different mutation rates
  • Only orthologous genes should be used to construct species phylogenetic trees
  • Major changes affecting large regions of the genome are surprisingly common
  • 7.3 Phylogenetic Tree Reconstruction
  • Small ribosomal subunit rRNA sequences are well suited to reconstructing the evolution of species
  • The choice of the method for tree reconstruction depends to some extent on the size and quality of the dataset
  • A model of evolution must be chosen to use with the method
  • Phylogenetic analyses of a small dataset of 16S RNA sequence data
  • Building a gene tree for a family of enzymes can help to identify how enzymatic functions evolved
  • 8 Building Phylogenetic Trees
  • 8.1 Evolutionary Models and the Calculation of Evolutionary Distance
  • A simple but inaccurate measure of evolutionary distance is the p-distance
  • The Poisson distance correction takes account of multiple mutations at the same site
  • The Gamma distance correction takes account of mutation rate variation at different sequence positions
  • The Jukes-Cantor model reproduces some basic features of the evolution of nucleotide sequences
  • More complex models distinguish between the relative frequencies of different types of mutation
  • There is a nucleotide bias in DNA sequences
  • Models of protein-sequence evolution are closely related to the substitution matrices used for sequence alignment
  • 8.2 Generating Single Phylogenetic Trees
  • Clustering methods produce a phylogenetic tree based on evolutionary distances
  • The UPGMA method assumes a constant molecular clock and produces an ultrametric tree
  • The Fitch-Margoliash method produces an unrooted additive tree
  • The neighbor-joining method is related to the concept of minimum evolution
  • Stepwise addition and star-decomposition methods are usually used to generate starting trees for further exploration, not the final tree
  • 8.3 Generating Multiple Tree Topologies
  • The branch-and-bound method greatly improves the efficiency of exploring tree topology
  • Optimization of tree topology can be achieved by making a series of small changes to an existing tree
  • Finding the root gives a phylogenetic tree a direction in time
  • 8.4 Evaluating Tree Topologies
  • Functions based on evolutionary distances can be used to evaluate trees
  • Unweighted parsimony methods look for the trees with the smallest number of mutations
  • Mutations can be weighted in different ways in the parsimony method
  • Trees can be evaluated using the maximum likelihood method
  • The quartet-puzzling method also involves maximum likelihood in the standard implementation
  • Bayesian methods can also be used to reconstruct phylogenetic trees
  • 8.5 Assessing the Reliability of Tree Features and Comparing Trees
  • The long-branch attraction problem can arise even with perfect data and methodology
  • Tree topology can be tested by examining the interior branches
  • Tests have been proposed for comparing two or more alternative trees
  • Summary
  • Further Reading
  • 8.1 Evolutionary Models and the Calculation of Evolutionary Distance
  • The Gamma distance correction takes account of mutation rate variation at different sequence positions
  • More complex models distinguish between the relative frequencies of different types of mutation
  • There is a nucleotide bias in DNA sequences
  • Models of protein-sequence evolution are closely related to the substitution matrices used for sequence alignment
  • 8.2 Generating Single Phylogenetic Trees
  • The UPGMA method assumes a constant molecular clock and produces an ultrametric tree
  • The Fitch-Margoliash method produces an unrooted additive tree
  • The neighbor-joining method is related to the concept of minimum evolution
  • 8.3 Generating Multiple Tree Topologies
  • Optimization of tree topology can be achieved by making a series of small changes to an existing tree
  • Finding the root gives a phylogenetic tree a direction in time
  • 8.4 Evaluating Tree Topologies
  • Functions based on evolutionary distances can be used to evaluate trees
  • Trees can be evaluated using the maximum likelihood method
  • The quartet-puzzling method also involves maximum likelihood in the standard implementation
  • Bayesian methods can also be used to reconstruct phylogenetic trees
  • 8.5 Assessing the Reliability of Tree Features and Comparing Trees
  • The long-branch attraction problem can arise even with perfect data and methodology
  • Tree topology can be tested by examining the interior branches
  • Tests have been proposed for comparing two or more alternative trees
  • Part 4 Genome Characteristics
  • 9 Revealing Genome Features
  • 9.1 Preliminary Examination of Genome Sequence
  • Whole genome sequences can be split up to simplify gene searches
  • Structural RNA genes and repeat sequences can be excluded from further analysis
  • Homology can be used to identify genes in both prokaryotic and eukaryotic genomes
  • 9.2 Gene Prediction in Prokaryotic Genomes
  • 9.3 Gene Prediction in Eukaryotic Genomes
  • Programs for predicting exons and introns use a variety of approaches
  • Gene predictions must preserve the correct reading frame
  • Some programs search for exons using only the query sequence and a model for exons
  • Some programs search for genes using only the query sequence and a gene model
  • Genes can be predicted using a gene model and sequence similarity
  • Genomes of related organisms can be used to improve gene prediction
  • 9.4 Splice Site Detection
  • Splice sites can be detected independently by specialized programs
  • 9.5 Prediction of Promoter Regions
  • Prokaryotic promoter regions contain relatively well-defined motifs
  • Eukaryotic promoter regions are typically more complex than prokaryotic promoters
  • A variety of promoter-prediction methods are available online
  • Promoter prediction results are not very clear-cut
  • 9.6 Confirming Predictions
  • There are various methods for calculating the accuracy of gene-prediction programs
  • Translating predicted exons can confirm the correctness of the prediction
  • Constructing the protein and identifying homologs
  • 9.7 Genome Annotation
  • Genome annotation is the final step in genome analysis
  • Gene ontology provides a standard vocabulary for gene annotation
  • 9.8 Large Genome Comparisons
  • Summary
  • Further Reading
  • 9.1 Preliminary Examination of Genome Sequence
  • 9.2 Gene Prediction in Prokaryotic Genomes
  • 9.3 Gene Prediction in Eukaryotic Genomes
  • 9.4 Splice Site Detection
  • 9.5 Prediction of Promoter Regions
  • 9.6 Confirming Predictions
  • 9.7 Genome Annotation
  • 9.8 Large Genome Comparisons
  • Box 9.5
  • 10 Gene Detection and Genome Annotation
  • 10.1 Detection of Functional RNA Molecules Using Decision Trees
  • Detection of tRNA genes using the tRNAscan algorithm
  • Detection of tRNA genes in eukaryotic genomes
  • 10.2 Features Useful for Gene Detection in Prokaryotes
  • 10.3 Algorithms for Gene Detection in Prokaryotes
  • GeneMark uses inhomogeneous Markov chains and dicodon statistics
  • GLIMMER uses interpolated Markov models of coding potential
  • Orpheus uses homology, codon statistics, and ribosome-binding sites
  • GeneMark.hmm uses explicit state duration hidden Markov models
  • EcoParse is an HMM gene model
  • 10.4 Features Used in Eukaryotic Gene Detection
  • Differences between prokaryotic and eukaryotic genes
  • Introns, exons, and splice sites
  • Promoter sequences and binding sites for transcription factors
  • 10.5 Predicting Eukaryotic Gene Signals
  • Detection of core promoter binding signals is a key element of some eukaryotic gene-prediction methods
  • A set of models has been designed to locate the site of core promoter sequence signals
  • Predicting promoter regions from general sequence properties can reduce the numbers of false-positive results
  • Predicting eukaryotic transcription and translation start sites
  • Translation and transcription stop signals complete the gene definition
  • 10.6 Predicting Exons and Introns
  • Exons can be identified using general sequence properties
  • Splice-site prediction
  • Splice sites can be predicted by sequence patterns combined with base statistics
  • GenScan uses a combination of weight matrices and decision trees to locate splice sites
  • GeneSplicer predicts splice sites using first-order Markov chains
  • NetPlantGene combines neural networks with intron and exon predictions to predict splice sites
  • Other splicing features may yet be exploited for splice-site prediction
  • Specific methods exist to identify initial and terminal exons
  • Exons can be defined by searching databases for homologous regions
  • 10.7 Complete Eukaryotic Gene Models
  • 10.8 Beyond the Prediction of Individual Genes
  • Functional annotation
  • Comparison of related genomes can help resolve uncertain predictions
  • Evaluation and reevaluation of gene-detection methods
  • Summary
  • Further Reading
  • 10.1 Detection of Functional RNA Molecules Using Decision Trees
  • tRNA detection
  • Detection of other RNA genes
  • 10.2 Features Useful for Gene Detection in Prokaryotes
  • Identifying protein-coding regions using base statistics
  • 10.3 Algorithms for Gene Detection in Prokaryotes
  • GeneMark, GeneMark.hmm, and further developments
  • Glimmer
  • Orpheus
  • EcoParse
  • Prokaryotic genomes
  • Markov models
  • 10.4 Features Used in Eukaryotic Gene Detection
  • 10.4 Preliminary analysis for human genes
  • 10.5 Predicting Eukaryotic Gene Signals
  • Initial analysis of core promoter sequences
  • Algorithms of core promoter detection
  • Promoter recognition
  • 10.6 Predicting Exons and Introns
  • 10.7 Complete Eukaryotic Gene Models
  • 10.8 Beyond the Prediction of Individual Genes
  • Detailed reexamination of the annotation of a complete genome
  • Large-scale changes in chromosomes
  • Box 10.1 Measures of gene prediction accuracy at the nucleotide level
  • Box 10.2 Sequencing many genomes at once
  • Box 10.3 Measures of gene prediction accuracy at the exon level
  • Part 5 Secondary Structures
  • 11 Obtaining Secondary Structure from Sequence
  • 11.1 Types of Prediction Methods
  • Statistical methods are based on rules that give the probability that a residue will form part of a particular secondary structure
  • Nearest-neighbor methods are statistical methods that incorporate additional information about protein structure
  • Machine-learning approaches to secondary structure prediction mainly make use of neural networks and HMM methods
  • 11.2 Training and Test Databases
  • There are several ways to define protein secondary structures
  • 11.3 Assessing the Accuracy of Prediction Programs
  • Q3 measures the accuracy of individual residue assignments
  • Secondary structure predictions should not be expected to reach 100% residue accuracy
  • The Sov value measures the prediction accuracy for whole elements
  • CAFASP/CASP: Unbiased and readily available protein prediction assessments
  • 11.4 Statistical and Knowledge-Based Methods
  • The GOR method uses an information theory approach
  • The program Zpred includes multiple alignment of homologous sequences and residue conservation information
  • There is an overall increase in prediction accuracy using multiple sequence information
  • The nearest-neighbor method: The use of multiple nonhomologous sequences
  • Predator is a combined statistical and knowledge-based program that includes the nearest-neighbor approach
  • 11.5 Neural Network Methods of Secondary Structure Prediction
  • Assessing the reliability of neural net predictions
  • Several examples of Web-based neural network secondary structure prediction programs
  • PROF: Protein forecasting
  • Psipred
  • Jnet: Using several alternative representations of the sequence alignment
  • 11.6 Some Secondary Structures Require Specialized Prediction Methods
  • Transmembrane proteins
  • Quantifying the preference for a membrane environment
  • 11.7 Prediction of Transmembrane Protein Structure
  • Multi-helix membrane proteins
  • A selection of prediction programs to predict transmembrane helices
  • Statistical methods
  • Knowledge-based prediction
  • Evolutionary information from protein families improves the prediction
  • Neural nets in transmembrane prediction
  • Predicting transmembrane helices with hidden Markov models
  • Comparing the results: What to choose
  • What happens if a non-transmembrane protein is submitted to transmembrane prediction programs
  • Prediction of transmembrane structure containing β-strands
  • 11.8 Coiled-Coil Structures
  • The COILS prediction program
  • PAIRCOIL and MULTICOIL are an extension of the COILS algorithm
  • Zipping the leucine zipper: A specialized coiled coil
  • 11.9 RNA Secondary Structure Prediction
  • Summary
  • Further Reading
  • 11.1 Types of Prediction Methods
  • 11.2 Training and Test Databases
  • PDB
  • STRIDE
  • DSSP
  • DEFINE
  • 11.3 Assessing the Accuracy of Prediction Programs
  • 11.4 Statistical and Knowledge-Based Methods
  • 11.5 Neural Network Methods of Secondary Structure Prediction
  • 11.7 Prediction of Transmembrane Protein Structure
  • 11.8 Coiled-Coil Structures
  • 11.9 RNA Secondary Structure Prediction
  • Box 11.1
  • Box 11.3
  • 12 Predicting Secondary Structures
  • 12.1 Defining Secondary Structure and Prediction Accuracy
  • The definitions used for automatic protein secondary structure assignment do not give identical results
  • There are several different measures of the accuracy of secondary structure prediction
  • 12.2 Secondary Structure Prediction Based on Residue Propensities
  • Each structural state has an amino acid preference which can be assigned as a residue propensity
  • The simplest prediction methods are based on the average residue propensity over a sequence window
  • Residue propensities are modulated by nearby sequence
  • Predictions can be significantly improved by including information from homologous sequences
  • 12.3 The Nearest-Neighbor Methods are Based on Sequence Segment Similarity
  • Short segments of similar sequence are found to have similar structure
  • Several sequence similarity measures have been used to identify nearest-neighbor segments
  • A weighted average of the nearest-neighbor segment structures is used to make the prediction
  • A nearest-neighbor method has been developed to predict regions with a high potential to misfold
  • 12.4 Neural Networks Have Been Employed Successfully for Secondary Structure Prediction
  • Layered feed-forward neural networks can transform a sequence into a structural prediction
  • Inclusion of information on homologous sequences improves neural network accuracy
  • More complex neural nets have been applied to predict secondary and other structural features
  • 12.5 Hidden Markov Models Have Been Applied to Structure Prediction
  • HMM methods have been found especially effective for transmembrane proteins
  • Nonmembrane protein secondary structures can also be successfully predicted with HMMs
  • 12.6 General Data Classification Techniques can Predict Structural Features
  • Support vector machines have been successfully used for protein structure prediction
  • Discriminants, SOMs, and other methods have also been used
  • Summary
  • Further Reading
  • 12.1 Defining Secondary Structure and Prediction Accuracy
  • DSSP
  • PALSSE
  • β-Spider
  • Limits of prediction accuracy
  • TM properties and accuracy measures
  • PSIPRED (Q3 and Sov variation)
  • Matthews correlation coefficient
  • Sov
  • SS assignment comparison
  • Length distributions (non-TM)
  • 12.2 Secondary Structure Prediction Based on Residue Propensities
  • PDB_SELECT
  • In-depth analysis of unbiased structural datasets
  • Hydrophobicity scales
  • Chou-Fasman
  • COILS
  • MEMSAT
  • Local and nonlocal effects
  • β-turn propensities
  • β-turn propensities and use of PSSMs
  • AAindex
  • GOR theory
  • GOR I
  • GOR II
  • GOR III
  • GOR IV
  • GOR V (includes other sequences)
  • Using consecutive pair of structural states
  • Zpred
  • Treatment of gaps in multiple alignments during prediction
  • 12.3 The Nearest-neighbor Methods are Based on Sequence Segment Similarity
  • NNSSP
  • SSPAL
  • I-sites
  • SIMPA96
  • Correlation between amino acid composition and protein structural class
  • HβP
  • 12.4 Neural Networks Have Been Employed Successfully for Secondary Structure Prediction
  • PHDsec
  • PHDpsi
  • PSIPRED
  • Back-propagation learning
  • BTPRED
  • Betaturns
  • DESTRUCT
  • SSpro
  • 12.5 Hidden Markov Models Have Been Applied to Structure Prediction
  • HMMTOP
  • TMHMM
  • Phobius
  • PROftmb
  • PRED-TMBB
  • YASPIN
  • MARCOIL
  • 12.6 General Data Classification Techniques can Predict Structural Features
  • DSC
  • FoldIndex
  • GPI-SOM
  • PSIMLR
  • Part 6 Tertiary Structures
  • 13 Modeling Protein Structure
  • 13.1 Potential Energy Functions and Force Fields
  • The conformation of a protein can be visualized in terms of a potential energy surface
  • Conformational energies can be described by simple mathematical functions
  • Similar force fields can be used to represent conformational energies in the presence of averaged environments
  • Potential energy functions can be used to assess a modeled structure
  • Energy minimization can be used to refine a modeled structure and identify local energy minima
  • Molecular dynamics and simulated annealing are used to find global energy minima
  • 13.2 Obtaining a Structure by Threading
  • The prediction of protein folds in the absence of known structural homologs
  • Libraries or databases of nonredundant protein folds are used in threading
  • Two distinct types of scoring schemes have been used in threading methods
  • Dynamic programming methods can identify optimal alignments of target sequences and structural folds
  • Several methods are available to assess the confidence to be put on the fold prediction
  • The C2-like domain from the Dictyostelia: A practical example of threading
  • 13.3 Principles of Homology Modeling
  • Closely related target and template sequences give better models
  • Significant sequence identity depends on the length of the sequence
  • Homology modeling has been automated to deal with the numbers of sequences that can now be modeled
  • Model building is based on a number of assumptions
  • 13.4 Steps in Homology Modeling
  • Structural homologs to the target protein are found in the PDB
  • Accurate alignment of target and template sequences is essential for successful modeling
  • The structurally conserved regions of a protein are modeled first
  • The modeled core is checked for misfits before proceeding to the next stage
  • Sequence realignment and remodeling may improve the structure
  • Insertions and deletions are usually modeled as loops
  • Nonidentical amino acid side chains are modeled mainly by using rotamer libraries
  • Energy minimization is used to relieve structural errors
  • Molecular dynamics can be used to explore possible conformations for mobile loops
  • Models need to be checked for accuracy
  • How far can homology models be trusted?
  • 13.5 Automated Homology Modeling
  • The program MODELLER models by satisfying protein structure constraints
  • COMPOSER uses fragment-based modeling to automatically generate a model
  • Automated methods available on the Web for comparative modeling
  • Assessment of structure prediction
  • 13.6 Homology Modeling of PI3 Kinase p110α
  • Swiss-Pdb Viewer can be used for manual or semi-manual modeling
  • Alignment, core modeling, and side-chain modeling are carried out all in one
  • The loops are modeled from a database of possible structures
  • Energy minimization and quality inspection can be carried out within Swiss-Pdb Viewer
  • MolIDE is a downloadable semi-automatic modeling package
  • Automated modeling on the Web illustrated with p110α kinase
  • Modeling a functionally related but sequentially dissimilar protein: mTOR
  • Generating a multidomain three-dimensional structure from sequence
  • Summary
  • Further Reading
  • Modeling: General
  • 13.1 Potential Energy Functions and Force Fields
  • Ab initio modeling
  • 13.2 Obtaining a Structure by Threading
  • 123D+
  • GenTHREADER
  • 3D-PSSM
  • FUGUE
  • LIBRA
  • LOOPP
  • LIBELLULA
  • SCOP
  • 13.3 Automated Homology Modeling
  • MolIDE
  • SCWRL
  • PSIPRED
  • LOOPY
  • MODELLER
  • Assessing models
  • MolProbity
  • PROCHECK
  • CASP
  • Antibody and HIV examples
  • 14 Analyzing Structure-Function Relationships
  • 14.1 Functional Conservation
  • Functional regions are usually structurally conserved
  • Similar biochemical function can be found in proteins with different folds
  • Fold libraries identify structurally similar proteins regardless of function
  • 14.2 Structure Comparison Methods
  • Finding domains in proteins aids structure comparison
  • Structural comparisons can reveal conserved functional elements not discernible from a sequence comparison
  • The CE method builds up a structural alignment from pairs of aligned protein segments
  • The Vector Alignment Search Tool (VAST) aligns secondary structural elements
  • DALI identifies structure superposition without maintaining segment order
  • FATCAT introduces rotations between rigid segments
  • 14.3 Finding Binding Sites
  • Highly conserved, strongly charged, or hydrophobic surface areas may indicate interaction sites
  • Searching for protein-protein interactions using surface properties
  • Surface calculations highlight clefts or holes in a protein that may serve as binding sites
  • Looking at residue conservation can identify binding sites
  • 14.4 Docking Methods and Programs
  • Simple docking procedures can be used when the structure of a homologous protein bound to a ligand analog is known
  • Specialized docking programs will automatically dock a ligand to a structure
  • Scoring functions are used to identify the most likely docked ligand
  • The DOCK program is a semirigid-body method that analyzes shape and chemical complementarity of ligand and binding site
  • Fragment docking identifies potential substrates by predicting types of atoms and functional groups in the binding area
  • GOLD is a flexible docking program, which utilizes a genetic algorithm
  • The water molecules in binding sites should also be considered
  • Summary
  • Further Reading
  • General
  • 14.1 Functional Conservation
  • Structure-function relationships
  • Fold recognition programs
  • Domain identification
  • Viewing programs
  • 14.3 Finding Binding Sites
  • Evolutionary trace methods
  • 14.4 Docking Methods and Programs
  • Part 7 Cells and Organisms
  • 15 Proteome and Gene Expression Analysis
  • 15.1 Analysis of Large-scale Gene Expression
  • The expression of large numbers of different genes can be measured simultaneously by DNA microarrays
  • Gene expression microarrays are mainly used to detect differences in gene expression in different conditions
  • Serial analysis of gene expression (SAGE) is also used to study global patterns of gene expression
  • Digital differential display uses bioinformatics and statistics to detect differential gene expression in different tissues
  • Facilitating the integration of data from different places and experiments
  • The simplest method of analyzing gene expression microarray data is hierarchical cluster analysis
  • Techniques based on self-organizing maps can be used for analyzing microarray data
  • Self-organizing tree algorithms (SOTAs) cluster from the top down by successive subdivision of clusters
  • Clustered gene expression data can be used as a tool for further research
  • 15.2 Analysis of Large-scale Protein Expression
  • Two-dimensional gel electrophoresis is a method for separating the individual proteins in a cell
  • Measuring the expression levels shown in 2D gels
  • Differences in protein expression levels between different samples can be detected by 2D gels
  • Clustering methods are used to identify protein spots with similar expression patterns
  • Principal component analysis (PCA) is an alternative to clustering for analyzing microarray and 2D gel data
  • The changes in a set of protein spots can be tracked over a number of different samples
  • Databases and online tools are available to aid the interpretation of 2D gel data
  • Protein microarrays allow the simultaneous detection of the presence or activity of large numbers of different proteins
  • Mass spectrometry can be used to identify the proteins separated and purified by 2D gel electrophoresis or other means
  • Protein-identification programs for mass spectrometry are freely available on the Web
  • Mass spectrometry can be used to measure protein concentration
  • Summary
  • Further Reading
  • 15.1 Analysis of Large-scale Gene Expression
  • Functional genomics
  • Microarray quality control focus
  • MIAME
  • 15.2 Analysis of Large-scale Protein Expression
  • Proteomics
  • Protein microarrays
  • MASCOT
  • ProteinProspector
  • Databases and software
  • 16 Clustering Methods and Statistics
  • 16.1 Expression Data Require Preparation Prior to Analysis
  • Data normalization is designed to remove systematic experimental errors
  • Expression levels are often analyzed as ratios and are usually transformed by taking logarithms
  • Sometimes further normalization is useful after the data transformation
  • Principal component analysis is a method for combining the properties of an object
  • 16.2 Cluster Analysis Requires Distances to be Defined Between all Data Points
  • Euclidean distance is the measure used in everyday life
  • The Pearson correlation coefficient measures distance in terms of the shape of the expression response
  • The Mahalanobis distance takes account of the variation and correlation of expression responses
  • 16.3 Clustering Methods Identify Similar and Distinct Expression Patterns
  • Hierarchical clustering produces a related set of alternative partitions of the data
  • k-means clustering groups data into several clusters but does not determine a relationship between clusters
  • Self-organizing maps (SOMs) use neural network methods to cluster data into a predetermined number of clusters
  • Evolutionary clustering algorithms use selection, recombination, and mutation to find the best possible solution to a problem
  • The self-organizing tree algorithm (SOTA) determines the number of clusters required
  • Biclustering identifies a subset of similar expression level patterns occurring in a subset of the samples
  • The validity of clusters is determined by independent methods
  • 16.4 Statistical Analysis can Quantify the Significance of Observed Differential Expression
  • T-tests can be used to estimate the significance of the difference between two expression levels
  • Nonparametric tests are used to avoid making assumptions about the data sampling
  • Multiple testing of differential expression requires special techniques to control error rates
  • 16.5 Gene and Protein Expression Data Can be Used to Classify Samples
  • Many alternative methods have been proposed that can classify samples
  • Support vector machines are another form of supervised learning algorithm that can produce classifiers
  • Summary
  • Further Reading
  • Monograph
  • 16.1 Expression Data Require Preparation Prior to Analysis
  • Variance-stabilizing transformation
  • Lowess normalization
  • General review of normalization and transformation of data
  • Data used to generate Figures 16.2 to 16.4
  • Principal component analysis
  • 16.2 Cluster Analysis Requires Distances to be Defined Between all Data Points
  • Distance definitions
  • 16.3 Clustering Methods Identify Similar and Distinct Expression Patterns
  • Self-organizing maps (SOMs)
  • Clustering using genetic algorithms
  • SOTA
  • Biclustering techniques
  • Validation of partitions produced by clustering methods
  • 16.4 Statistical Analysis can Quantify the Significance of Observed Differential Expression
  • Bayesian estimation of variance
  • Nonparametric tests
  • Multiple hypothesis testing
  • 16.5 Gene and Protein Expression Data Can be Used to Classify Samples
  • Support vector machines
  • 17 Systems Biology
  • 17.1 What is a System?
  • A system is more than the sum of its parts
  • A biological system is a living network
  • Databases are useful starting points in constructing a network
  • To construct a model more information is needed than a network
  • There are three possible approaches to constructing a model
  • Kinetic models are not the only way in systems biology
  • 17.2 Structure of the Model
  • Control circuits are an essential part of any biological system
  • The interactions in networks can be represented as simple differential equations
  • 17.3 Robustness of Biological Systems
  • Robustness is a distinct feature of complexity in biology
  • Modularity plays an important part in robustness
  • Redundancy in the system can provide robustness
  • Living systems can switch from one state to another by means of bistable switches
  • 17.4 Storing and Running System Models
  • Specialized programs make simulating systems easier
  • Standardized systems descriptions aid their storage and reuse
  • Summary
  • Further Reading
  • 17.1 What is a System?
  • Databases
  • 17.2 Structure of the Model
  • Topological modeling
  • 17.3 Robustness of Biological Systems
  • Molecular systems modeling
  • Appendix A: Probability, Information, and Bayesian Analysis
  • Probability Theory, Entropy, and Information
  • Mutually exclusive events
  • Occurrence of two events
  • Occurrence of two random variables
  • Bayesian Analysis
  • Bayes’ theorem
  • Inference of parameter values
  • Further Reading
  • Appendix B: Molecular Energy Functions
  • Force Fields for Calculating Intra- and Intermolecular Interaction Energies
  • Bonding terms
  • Nonbonding terms
  • Potentials used in Threading
  • Potentials of mean force
  • Potential terms relating to solvent effects
  • Further Reading
  • Appendix C: Function Optimization
  • Full Search Methods
  • Dynamic programming and branch-and-bound
  • Local Optimization
  • The downhill simplex method
  • The steepest descent method
  • The conjugate gradient method
  • Methods using second derivatives
  • Thermodynamic Simulation and Global Optimization
  • Monte Carlo and genetic algorithms
  • Molecular dynamics
  • Simulated annealing
  • Summary
  • Further Reading
  • List of Symbols
  • Notation concepts
  • Glossary
  • Index
