Understanding Bioinformatics

Author Marketa Zvelebil; Jeremy O. Baum

Publisher Garland Science

Format ePub

Print ISBN 9780815340249

Edition 1

Publication year 2008

15.290 kr.

Table of Contents

  • Cover
  • Half Title
  • Dedication
  • Title Page
  • Copyright Page
  • Preface
  • A Note to the Reader
  • Organization of this Book
  • Applications and Theory Chapters
  • Part 1: Background Basics
  • Part 2: Sequence Alignments
  • Part 3: Evolutionary Processes
  • Part 4: Genome Characteristics
  • Part 5: Secondary Structures
  • Part 6: Tertiary Structures
  • Part 7: Cells and Organisms
  • Appendices
  • Organization of the Chapters
  • Learning Outcomes
  • Flow Diagrams
  • Mind Maps
  • Illustrations
  • Further Reading
  • List of Symbols
  • Glossary
  • Garland Science Website
  • Artwork
  • Additional Material
  • List of Reviewers
  • Contents In Brief
  • Table of Contents
  • Part 1 Background Basics
  • 1 The Nucleic Acid World
  • 1.1 The Structure of DNA and RNA
  • DNA is a linear polymer of only four different bases
  • Two complementary DNA strands interact by base-pairing to form a double helix
  • RNA molecules are mostly single stranded but can also have base-pair structures
  • 1.2 DNA, RNA, and Protein: The Central Dogma
  • DNA is the information store, but RNA is the messenger
  • Messenger RNA is translated into protein according to the genetic code
  • Translation involves transfer RNAs and RNA-containing ribosomes
  • 1.3 Gene Structure and Control
  • RNA polymerase binds to specific sequences that position it and identify where to begin transcription
  • The signals initiating transcription in eukaryotes are generally more complex than those in bacteria
  • Eukaryotic mRNA transcripts undergo several modifications prior to their use in translation
  • The control of translation
  • 1.4 The Tree of Life and Evolution
  • A brief survey of the basic characteristics of the major forms of life
  • Nucleic acid sequences can change as a result of mutation
  • Summary
  • Further Reading
  • General References
  • 1.1 The Structure of DNA and RNA
  • 1.2 DNA, RNA, and Protein: The Central Dogma
  • 1.3 Gene Structure and Control
  • 1.4 The Tree of Life and Evolution
  • Box 1.2
  • 2 Protein Structure
  • 2.1 Primary and Secondary Structure
  • Protein structure can be considered on several different levels
  • Amino acids are the building blocks of proteins
  • The differing chemical and physical properties of amino acids are due to their side chains
  • Amino acids are covalently linked together in the protein chain by peptide bonds
  • Secondary structure of proteins is made up of α-helices and β-strands
  • Several different types of β-sheet are found in protein structures
  • Turns, hairpins, and loops connect helices and strands
  • 2.2 Implication for Bioinformatics
  • Certain amino acids prefer a particular structural unit
  • Evolution has aided sequence analysis
  • Visualization and computer manipulation of protein structures
  • 2.3 Proteins Fold to Form Compact Structures
  • The tertiary structure of a protein is defined by the path of the polypeptide chain
  • The stable folded state of a protein represents a state of low energy
  • Many proteins are formed of multiple subunits
  • Summary
  • Further Reading
  • 3 Dealing with Databases
  • 3.1 The Structure of Databases
  • Flat-file databases store data as text files
  • Relational databases are widely used for storing biological information
  • XML has the flexibility to define bespoke data classifications
  • Many other database structures are used for biological data
  • Databases can be accessed locally or online and often link to each other
  • 3.2 Types of Database
  • There’s more to databases than just data
  • Primary and derived data
  • How we define and connect things is important: Ontologies
  • 3.3 Looking for Databases
  • Sequence databases
  • Microarray databases
  • Protein interaction databases
  • Structural databases
  • 3.4 Data Quality
  • Nonredundancy is especially important for some applications of sequence databases
  • Automated methods can be used to check for data consistency
  • Initial analysis and annotation is usually automated
  • Human intervention is often required to produce the highest quality annotation
  • The importance of updating databases and entry identifier and version numbers
  • Summary
  • Further Reading
  • 3.1 The Structure of Databases
  • 3.2 Types of Database
  • How we define and connect things is important: Ontologies
  • 3.3 Looking for Databases
  • Sequence databases
  • Microarray databases
  • Protein interaction databases
  • Structural databases
  • 3.4 Data Quality
  • MIAME
  • Part 2 Sequence Alignments
  • 4 Producing and Analyzing Sequence Alignments
  • 4.1 Principles of Sequence Alignment
  • Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity
  • Alignment can reveal homology between sequences
  • It is easier to detect homology when comparing protein sequences than when comparing nucleic acid sequences
  • 4.2 Scoring Alignments
  • The quality of an alignment is measured by giving it a quantitative score
  • The simplest way of quantifying similarity between two sequences is percentage identity
  • The dot-plot gives a visual assessment of similarity based on identity
  • Genuine matches do not have to be identical
  • There is a minimum percentage identity that can be accepted as significant
  • There are many different ways of scoring an alignment
  • 4.3 Substitution Matrices
  • Substitution matrices are used to assign individual scores to aligned sequence positions
  • The PAM substitution matrices use substitution frequencies derived from sets of closely related protein sequences
  • The BLOSUM substitution matrices use mutation data from highly conserved local regions of sequence
  • The choice of substitution matrix depends on the problem to be solved
  • 4.4 Inserting Gaps
  • Gaps inserted in a sequence to maximize similarity with another require a scoring penalty
  • Dynamic programming algorithms can determine the optimal introduction of gaps
  • 4.5 Types of Alignment
  • Different kinds of alignments are useful in different circumstances
  • Multiple sequence alignments enable the simultaneous comparison of a set of similar sequences
  • Multiple alignments can be constructed by several different techniques
  • Multiple alignments can improve the accuracy of alignment for sequences of low similarity
  • ClustalW can make global multiple alignments of both DNA and protein sequences
  • Multiple alignments can be made by combining a series of local alignments
  • Alignment can be improved by incorporating additional information
  • 4.6 Searching Databases
  • Fast yet accurate search algorithms have been developed
  • FASTA is a fast database-search method based on matching short identical segments
  • BLAST is based on finding very similar short segments
  • Different versions of BLAST and FASTA are used for different problems
  • PSI-BLAST enables profile-based database searches
  • Ssearch is a rigorous alignment method
  • 4.7 Searching with Nucleic Acid or Protein Sequences
  • DNA or RNA sequences can be used either directly or after translation
  • The quality of a database match has to be tested to ensure that it could not have arisen by chance
  • Choosing an appropriate E-value threshold helps to limit a database search
  • Low-complexity regions can complicate homology searches
  • Different databases can be used to solve particular problems
  • 4.8 Protein Sequence Motifs or Patterns
  • Creation of pattern databases requires expert knowledge
  • The BLOCKS database contains automatically compiled short blocks of conserved multiply aligned protein sequences
  • 4.9 Searching Using Motifs and Patterns
  • The PROSITE database can be searched for protein motifs and patterns
  • The pattern-based program PHI-BLAST searches for both homology and matching motifs
  • Patterns can be generated from multiple sequences using PRATT
  • The PRINTS database consists of fingerprints representing sets of conserved motifs that describe a protein family
  • The Pfam database defines profiles of protein families
  • 4.10 Patterns and Protein Function
  • Searches can be made for particular functional sites in proteins
  • Sequence comparison is not the only way of analyzing protein sequences
  • Summary
  • Further Reading
  • 4.1 Principles of Sequence Alignment
  • 4.2 Scoring Alignments
  • Twilight zone and midnight zone
  • 4.3 Substitution Matrices
  • 4.4 Inserting Gaps
  • 4.5 Types of Alignment
  • ClustalW
  • DIALIGN
  • 4.6 Searching Databases
  • BLAST
  • FASTA
  • PSI-BLAST
  • 4.8 Protein Sequence Motifs or Patterns
  • MEME
  • 4.9 Searching Using Motifs and Patterns
  • MOTIF
  • PRATT
  • 4.10 Patterns and Protein Function
  • HCA
  • 5 Pairwise Sequence Alignment and Database Searching
  • 5.1 Substitution Matrices and Scoring
  • Alignment scores attempt to measure the likelihood of a common evolutionary ancestor
  • The PAM (MDM) substitution scoring matrices were designed to trace the evolutionary origins of proteins
  • The BLOSUM matrices were designed to find conserved regions of proteins
  • Scoring matrices for nucleotide sequence alignment can be derived in similar ways
  • The substitution scoring matrix used must be appropriate to the specific alignment problem
  • Gaps are scored in a much more heuristic way than substitutions
  • 5.2 Dynamic Programming Algorithms
  • Optimal global alignments are produced using efficient variations of the Needleman-Wunsch algorithm
  • Local and suboptimal alignments can be produced by making small modifications to the dynamic programming algorithm
  • Time can be saved with a loss of rigor by not calculating the whole matrix
  • 5.3 Indexing Techniques and Algorithmic Approximations
  • Suffix trees locate the positions of repeats and unique sequences
  • Hashing is an indexing technique that lists the starting positions of all k-tuples
  • The FASTA algorithm uses hashing and chaining for fast database searching
  • The BLAST algorithm makes use of finite-state automata
  • Comparing a nucleotide sequence directly with a protein sequence requires special modifications to the BLAST and FASTA algorithms
  • 5.4 Alignment Score Significance
  • The statistics of gapped local alignments can be approximated by the same theory
  • 5.5 Aligning Complete Genome Sequences
  • Indexing and scanning whole genome sequences efficiently is crucial for the sequence alignment of higher organisms
  • The complex evolutionary relationships between the genomes of even closely related organisms require novel alignment algorithms
  • Summary
  • Further Reading
  • 5.1 Substitution Matrices and Scoring
  • The PAM (MDM) substitution scoring matrices were designed to trace the evolutionary origins of proteins
  • The BLOSUM matrices were designed to find conserved regions of proteins
  • Scoring matrices for nucleotide sequence alignment can be derived in similar ways
  • The substitution scoring matrix used must be appropriate to the specific alignment problem
  • Gaps are scored in a much more heuristic way than substitutions
  • 5.2 Dynamic Programming Algorithms
  • Optimal global alignments are produced using efficient variations of the Needleman-Wunsch algorithm
  • Local and suboptimal alignments can be produced by making small modifications to the dynamic programming algorithm
  • Time can be saved with a loss of rigor by not calculating the whole matrix
  • 5.3 Indexing Techniques and Algorithmic Approximations
  • Suffix trees locate the positions of repeats and unique sequences; Hashing is an indexing technique that lists the starting positions of all k-tuples
  • The FASTA algorithm uses hashing and chaining for fast database searching; The BLAST algorithm makes use of finite-state automata
  • Box 5.2: Sometimes things just aren’t complex enough
  • 5.4 Alignment Score Significance
  • 5.5 Alignments Involving Complete Genome Sequences
  • 6 Patterns, Profiles, and Multiple Alignments
  • 6.1 Profiles and Sequence Logos
  • Position-specific scoring matrices are an extension of substitution scoring matrices
  • Methods for overcoming a lack of data in deriving the values for a PSSM
  • PSI-BLAST is a sequence database searching program
  • Representing a profile as a logo
  • 6.2 Profile Hidden Markov Models
  • The basic structure of HMMs used in sequence alignment to profiles
  • Estimating HMM parameters using aligned sequences
  • Scoring a sequence against a profile HMM: The most probable path and the sum over all paths
  • Estimating HMM parameters using unaligned sequences
  • 6.3 Aligning Profiles
  • Comparing two PSSMs by alignment
  • Aligning profile HMMs
  • 6.4 Multiple Sequence Alignments by Gradual Sequence Addition
  • The order in which sequences are added is chosen based on the estimated likelihood of incorporating errors in the alignment
  • Many different scoring schemes have been used in constructing multiple alignments
  • The multiple alignment is built using the guide tree and profile methods and may be further refined
  • 6.5 Other Ways of Obtaining Multiple Alignments
  • The multiple sequence alignment program DIALIGN aligns ungapped blocks
  • The SAGA method of multiple alignment uses a genetic algorithm
  • 6.6 Sequence Pattern Discovery
  • Discovering patterns in a multiple alignment: eMOTIF and AACC
  • Probabilistic searching for common patterns in sequences: Gibbs and MEME
  • Searching for more general sequence patterns
  • Summary
  • Further Reading
  • 6.1 Profiles and Sequence Logos
  • 6.2 Profile Hidden Markov Models
  • 6.3 Aligning Profiles
  • 6.4 Multiple Sequence Alignments by Gradual Sequence Addition
  • 6.5 Other Ways of Obtaining Multiple Alignments
  • 6.6 Sequence Pattern Discovery
  • Part 3 Evolutionary Processes
  • 7 Recovering Evolutionary History
  • 7.1 The Structure and Interpretation of Phylogenetic Trees
  • Phylogenetic trees reconstruct evolutionary relationships
  • Tree topology can be described in several ways
  • Consensus and condensed trees report the results of comparing tree topologies
  • 7.2 Molecular Evolution and its Consequences
  • Most related sequences have many positions that have mutated several times
  • The rate of accepted mutation is usually not the same for all types of base substitution
  • Different codon positions have different mutation rates
  • Only orthologous genes should be used to construct species phylogenetic trees
  • Major changes affecting large regions of the genome are surprisingly common
  • 7.3 Phylogenetic Tree Reconstruction
  • Small ribosomal subunit rRNA sequences are well suited to reconstructing the evolution of species
  • The choice of the method for tree reconstruction depends to some extent on the size and quality of the dataset
  • A model of evolution must be chosen to use with the method
  • All phylogenetic analyses must start with an accurate multiple alignment
  • Phylogenetic analyses of a small dataset of 16S RNA sequence data
  • Building a gene tree for a family of enzymes can help to identify how enzymatic functions evolved
  • Summary
  • Further Reading
  • General
  • 7.1 The Structure and Interpretation of Phylogenetic Trees
  • Phylogenetic trees reconstruct evolutionary relationships
  • Tree topology can be described in several ways
  • Consensus and condensed trees report the results of comparing tree topologies
  • 7.2 Molecular Evolution and its Consequences
  • The rate of accepted mutation is usually not the same for all types of base substitution
  • Different codon positions have different mutation rates
  • Only orthologous genes should be used to construct species phylogenetic trees
  • Major changes affecting large regions of the genome are surprisingly common
  • 7.3 Phylogenetic Tree Reconstruction
  • Small ribosomal subunit rRNA sequences are well suited to reconstructing the evolution of species
  • The choice of the method for tree reconstruction depends to some extent on the size and quality of the dataset
  • A model of evolution must be chosen to use with the method
  • Phylogenetic analyses of a small dataset of 16S RNA sequence data
  • Building a gene tree for a family of enzymes can help to identify how enzymatic functions evolved
  • 8 Building Phylogenetic Trees
  • 8.1 Evolutionary Models and the Calculation of Evolutionary Distance
  • A simple but inaccurate measure of evolutionary distance is the p-distance
  • The Poisson distance correction takes account of multiple mutations at the same site
  • The Gamma distance correction takes account of mutation rate variation at different sequence positions
  • The Jukes-Cantor model reproduces some basic features of the evolution of nucleotide sequences
  • More complex models distinguish between the relative frequencies of different types of mutation
  • There is a nucleotide bias in DNA sequences
  • Models of protein-sequence evolution are closely related to the substitution matrices used for sequence alignment
  • 8.2 Generating Single Phylogenetic Trees
  • Clustering methods produce a phylogenetic tree based on evolutionary distances
  • The UPGMA method assumes a constant molecular clock and produces an ultrametric tree
  • The Fitch-Margoliash method produces an unrooted additive tree
  • The neighbor-joining method is related to the concept of minimum evolution
  • Stepwise addition and star-decomposition methods are usually used to generate starting trees for further exploration, not the final tree
  • 8.3 Generating Multiple Tree Topologies
  • The branch-and-bound method greatly improves the efficiency of exploring tree topology
  • Optimization of tree topology can be achieved by making a series of small changes to an existing tree
  • Finding the root gives a phylogenetic tree a direction in time
  • 8.4 Evaluating Tree Topologies
  • Functions based on evolutionary distances can be used to evaluate trees
  • Unweighted parsimony methods look for the trees with the smallest number of mutations
  • Mutations can be weighted in different ways in the parsimony method
  • Trees can be evaluated using the maximum likelihood method
  • The quartet-puzzling method also involves maximum likelihood in the standard implementation
  • Bayesian methods can also be used to reconstruct phylogenetic trees
  • 8.5 Assessing the Reliability of Tree Features and Comparing Trees
  • The long-branch attraction problem can arise even with perfect data and methodology
  • Tree topology can be tested by examining the interior branches
  • Tests have been proposed for comparing two or more alternative trees
  • Summary
  • Further Reading
  • 8.1 Evolutionary Models and the Calculation of Evolutionary Distance
  • The Gamma distance correction takes account of mutation rate variation at different sequence positions
  • More complex models distinguish between the relative frequencies of different types of mutation
  • There is a nucleotide bias in DNA sequences
  • Models of protein-sequence evolution are closely related to the substitution matrices used for sequence alignment
  • 8.2 Generating Single Phylogenetic Trees
  • The UPGMA method assumes a constant molecular clock and produces an ultrametric tree
  • The Fitch-Margoliash method produces an unrooted additive tree
  • The neighbor-joining method is related to the concept of minimum evolution
  • 8.3 Generating Multiple Tree Topologies
  • Optimization of tree topology can be achieved by making a series of small changes to an existing tree
  • Finding the root gives a phylogenetic tree a direction in time
  • 8.4 Evaluating Tree Topologies
  • Functions based on evolutionary distances can be used to evaluate trees
  • Trees can be evaluated using the maximum likelihood method
  • The quartet-puzzling method also involves maximum likelihood in the standard implementation
  • Bayesian methods can also be used to reconstruct phylogenetic trees
  • 8.5 Assessing the Reliability of Tree Features and Comparing Trees
  • The long-branch attraction problem can arise even with perfect data and methodology
  • Tree topology can be tested by examining the interior branches
  • Tests have been proposed for comparing two or more alternative trees
  • Part 4 Genome Characteristics
  • 9 Revealing Genome Features
  • 9.1 Preliminary Examination of Genome Sequence
  • Whole genome sequences can be split up to simplify gene searches
  • Structural RNA genes and repeat sequences can be excluded from further analysis
  • Homology can be used to identify genes in both prokaryotic and eukaryotic genomes
  • 9.2 Gene Prediction in Prokaryotic Genomes
  • 9.3 Gene Prediction in Eukaryotic Genomes
  • Programs for predicting exons and introns use a variety of approaches
  • Gene predictions must preserve the correct reading frame
  • Some programs search for exons using only the query sequence and a model for exons
  • Some programs search for genes using only the query sequence and a gene model
  • Genes can be predicted using a gene model and sequence similarity
  • Genomes of related organisms can be used to improve gene prediction
  • 9.4 Splice Site Detection
  • Splice sites can be detected independently by specialized programs
  • 9.5 Prediction of Promoter Regions
  • Prokaryotic promoter regions contain relatively well-defined motifs
  • Eukaryotic promoter regions are typically more complex than prokaryotic promoters
  • A variety of promoter-prediction methods are available online
  • Promoter prediction results are not very clear-cut
  • 9.6 Confirming Predictions
  • There are various methods for calculating the accuracy of gene-prediction programs
  • Translating predicted exons can confirm the correctness of the prediction
  • Constructing the protein and identifying homologs
  • 9.7 Genome Annotation
  • Genome annotation is the final step in genome analysis
  • Gene ontology provides a standard vocabulary for gene annotation
  • 9.8 Large Genome Comparisons
  • Summary
  • Further Reading
  • 9.1 Preliminary Examination of Genome Sequence
  • 9.2 Gene Prediction in Prokaryotic Genomes
  • 9.3 Gene Prediction in Eukaryotic Genomes
  • 9.4 Splice Site Detection
  • 9.5 Prediction of Promoter Regions
  • 9.6 Confirming Predictions
  • 9.7 Genome Annotation
  • 9.8 Large Genome Comparisons
  • Box 9.5
  • 10 Gene Detection and Genome Annotation
  • 10.1 Detection of Functional RNA Molecules Using Decision Trees
  • Detection of tRNA genes using the tRNAscan algorithm
  • Detection of tRNA genes in eukaryotic genomes
  • 10.2 Features Useful for Gene Detection in Prokaryotes
  • 10.3 Algorithms for Gene Detection in Prokaryotes
  • GeneMark uses inhomogeneous Markov chains and dicodon statistics
  • GLIMMER uses interpolated Markov models of coding potential
  • Orpheus uses homology, codon statistics, and ribosome-binding sites
  • GeneMark.hmm uses explicit state duration hidden Markov models
  • EcoParse is an HMM gene model
  • 10.4 Features Used in Eukaryotic Gene Detection
  • Differences between prokaryotic and eukaryotic genes
  • Introns, exons, and splice sites
  • Promoter sequences and binding sites for transcription factors
  • 10.5 Predicting Eukaryotic Gene Signals
  • Detection of core promoter binding signals is a key element of some eukaryotic gene-prediction methods
  • A set of models has been designed to locate the site of core promoter sequence signals
  • Predicting promoter regions from general sequence properties can reduce the numbers of false-positive results
  • Predicting eukaryotic transcription and translation start sites
  • Translation and transcription stop signals complete the gene definition
  • 10.6 Predicting Exons and Introns
  • Exons can be identified using general sequence properties
  • Splice-site prediction
  • Splice sites can be predicted by sequence patterns combined with base statistics
  • GenScan uses a combination of weight matrices and decision trees to locate splice sites
  • GeneSplicer predicts splice sites using first-order Markov chains
  • NetPlantGene combines neural networks with intron and exon predictions to predict splice sites
  • Other splicing features may yet be exploited for splice-site prediction
  • Specific methods exist to identify initial and terminal exons
  • Exons can be defined by searching databases for homologous regions
  • 10.7 Complete Eukaryotic Gene Models
  • 10.8 Beyond the Prediction of Individual Genes
  • Functional annotation
  • Comparison of related genomes can help resolve uncertain predictions
  • Evaluation and reevaluation of gene-detection methods
  • Summary
  • Further Reading
  • 10.1 Detection of Functional RNA Molecules Using Decision Trees
  • tRNA detection
  • Detection of other RNA genes
  • 10.2 Features Useful for Gene Detection in Prokaryotes
  • Identifying protein-coding regions using base statistics
  • 10.3 Algorithms for Gene Detection in Prokaryotes
  • GeneMark, GeneMark.hmm, and further developments
  • Glimmer
  • Orpheus
  • EcoParse
  • Prokaryotic genomes
  • Markov models
  • 10.4 Features Used in Eukaryotic Gene Detection
  • 10.4 Preliminary analysis for human genes
  • 10.5 Predicting Eukaryotic Gene Signals
  • Initial analysis of core promoter sequences
  • Algorithms of core promoter detection
  • Promoter recognition
  • 10.6 Predicting Exons and Introns
  • 10.7 Complete Eukaryotic Gene Models
  • 10.8 Beyond the Prediction of Individual Genes
  • Detailed reexamination of the annotation of a complete genome
  • Large-scale changes in chromosomes
  • Box 10.1 Measures of gene prediction accuracy at the nucleotide level
  • Box 10.2 Sequencing many genomes at once
  • Box 10.3 Measures of gene prediction accuracy at the exon level
  • Part 5 Secondary Structures
  • 11 Obtaining Secondary Structure from Sequence
  • 11.1 Types of Prediction Methods
  • Statistical methods are based on rules that give the probability that a residue will form part of a particular secondary structure
  • Nearest-neighbor methods are statistical methods that incorporate additional information about protein structure
  • Machine-learning approaches to secondary structure prediction mainly make use of neural networks and HMM methods
  • 11.2 Training and Test Databases
  • There are several ways to define protein secondary structures
  • 11.3 Assessing the Accuracy of Prediction Programs
  • Q3 measures the accuracy of individual residue assignments
  • Secondary structure predictions should not be expected to reach 100% residue accuracy
  • The Sov value measures the prediction accuracy for whole elements
  • CAFASP/CASP: Unbiased and readily available protein prediction assessments
  • 11.4 Statistical and Knowledge-Based Methods
  • The GOR method uses an information theory approach
  • The program Zpred includes multiple alignment of homologous sequences and residue conservation information
  • There is an overall increase in prediction accuracy using multiple sequence information
  • The nearest-neighbor method: The use of multiple nonhomologous sequences
  • Predator is a combined statistical and knowledge-based program that includes the nearest-neighbor approach
  • 11.5 Neural Network Methods of Secondary Structure Prediction
  • Assessing the reliability of neural net predictions
  • Several examples of Web-based neural network secondary structure prediction programs
  • PROF: Protein forecasting
  • Psipred
  • Jnet: Using several alternative representations of the sequence alignment
  • 11.6 Some Secondary Structures Require Specialized Prediction Methods
  • Transmembrane proteins
  • Quantifying the preference for a membrane environment
  • 11.7 Prediction of Transmembrane Protein Structure
  • Multi-helix membrane proteins
  • A selection of prediction programs to predict transmembrane helices
  • Statistical methods
  • Knowledge-based prediction
  • Evolutionary information from protein families improves the prediction
  • Neural nets in transmembrane prediction
  • Predicting transmembrane helices with hidden Markov models
  • Comparing the results: What to choose
  • What happens if a non-transmembrane protein is submitted to transmembrane prediction programs
  • Prediction of transmembrane structure containing β-strands
  • 11.8 Coiled-Coil Structures
  • The COILS prediction program
  • PAIRCOIL and MULTICOIL are an extension of the COILS algorithm
  • Zipping the leucine zipper: A specialized coiled coil
  • 11.9 RNA Secondary Structure Prediction
  • Summary
  • Further Reading
  • 11.1 Types of Prediction Methods
  • 11.2 Training and Test Databases
  • PDB
  • STRIDE
  • DSSP
  • DEFINE
  • 11.3 Assessing the Accuracy of Prediction Programs
  • 11.4 Statistical and Knowledge-Based Methods
  • 11.5 Neural Network Methods of Secondary Structure Prediction
  • 11.7 Prediction of Transmembrane Protein Structure
  • 11.8 Coiled-Coil Structures
  • 11.9 RNA Secondary Structure Prediction
  • Box 11.1
  • Box 11.3
  • 12 Predicting Secondary Structures
  • 12.1 Defining Secondary Structure and Prediction Accuracy
  • The definitions used for automatic protein secondary structure assignment do not give identical results
  • There are several different measures of the accuracy of secondary structure prediction
  • 12.2 Secondary Structure Prediction Based on Residue Propensities
  • Each structural state has an amino acid preference which can be assigned as a residue propensity
  • The simplest prediction methods are based on the average residue propensity over a sequence window
  • Residue propensities are modulated by nearby sequence
  • Predictions can be significantly improved by including information from homologous sequences
  • 12.3 The Nearest-Neighbor Methods are Based on Sequence Segment Similarity
  • Short segments of similar sequence are found to have similar structure
  • Several sequence similarity measures have been used to identify nearest-neighbor segments
  • A weighted average of the nearest-neighbor segment structures is used to make the prediction
  • A nearest-neighbor method has been developed to predict regions with a high potential to misfold
  • 12.4 Neural Networks Have Been Employed Successfully for Secondary Structure Prediction
  • Layered feed-forward neural networks can transform a sequence into a structural prediction
  • Inclusion of information on homologous sequences improves neural network accuracy
  • More complex neural nets have been applied to predict secondary and other structural features
  • 12.5 Hidden Markov Models Have Been Applied to Structure Prediction
  • HMM methods have been found especially effective for transmembrane proteins
  • Nonmembrane protein secondary structures can also be successfully predicted with HMMs
  • 12.6 General Data Classification Techniques can Predict Structural Features
  • Support vector machines have been successfully used for protein structure prediction
  • Discriminants, SOMs, and other methods have also been used
  • Summary
  • Further Reading
  • 12.1 Defining Secondary Structure and Prediction Accuracy
  • DSSP
  • PALSSE
  • β-Spider
  • Limits of prediction accuracy
  • TM properties and accuracy measures
  • PSIPRED (Q3 and Sov variation)
  • Matthews correlation coefficient
  • Sov
  • SS assignment comparison
  • Length distributions (non-TM)
  • 12.2 Secondary Structure Prediction Based on Residue Propensities
  • PDB_SELECT
  • In-depth analysis of unbiased structural datasets
  • Hydrophobicity scales
  • Chou-Fasman
  • COILS
  • MEMSAT
  • Local and nonlocal effects
  • β-turn propensities
  • β-turn propensities and use of PSSMs
  • AAindex
  • GOR theory
  • GOR I
  • GOR II
  • GOR III
  • GOR IV
  • GOR V (includes other sequences)
  • Using consecutive pair of structural states
  • Zpred
  • Treatment of gaps in multiple alignments during prediction
  • 12.3 The Nearest-Neighbor Methods are Based on Sequence Segment Similarity
  • NNSSP
  • SSPAL
  • I-sites
  • SIMPA96
  • Correlation between amino acid composition and protein structural class
  • HβP
  • 12.4 Neural Networks Have Been Employed Successfully for Secondary Structure Prediction
  • PHDsec
  • PHDpsi
  • PSIPRED
  • Back-propagation learning
  • BTPRED
  • Betaturns
  • DESTRUCT
  • SSpro
  • 12.5 Hidden Markov Models Have Been Applied to Structure Prediction
  • HMMTOP
  • TMHMM
  • Phobius
  • PROftmb
  • PRED-TMBB
  • YASPIN
  • MARCOIL
  • 12.6 General Data Classification Techniques can Predict Structural Features
  • DSC
  • FoldIndex
  • GPI-SOM
  • PSIMLR
  • Part 6 Tertiary Structures
  • 13 Modeling Protein Structure
  • 13.1 Potential Energy Functions and Force Fields
  • The conformation of a protein can be visualized in terms of a potential energy surface
  • Conformational energies can be described by simple mathematical functions
  • Similar force fields can be used to represent conformational energies in the presence of averaged environments
  • Potential energy functions can be used to assess a modeled structure
  • Energy minimization can be used to refine a modeled structure and identify local energy minima
  • Molecular dynamics and simulated annealing are used to find global energy minima
  • 13.2 Obtaining a Structure by Threading
  • The prediction of protein folds in the absence of known structural homologs
  • Libraries or databases of nonredundant protein folds are used in threading
  • Two distinct types of scoring schemes have been used in threading methods
  • Dynamic programming methods can identify optimal alignments of target sequences and structural folds
  • Several methods are available to assess the confidence to be put on the fold prediction
  • The C2-like domain from the Dictyostelia: A practical example of threading
  • 13.3 Principles of Homology Modeling
  • Closely related target and template sequences give better models
  • Significant sequence identity depends on the length of the sequence
  • Homology modeling has been automated to deal with the numbers of sequences that can now be modeled
  • Model building is based on a number of assumptions
  • 13.4 Steps in Homology Modeling
  • Structural homologs to the target protein are found in the PDB
  • Accurate alignment of target and template sequences is essential for successful modeling
  • The structurally conserved regions of a protein are modeled first
  • The modeled core is checked for misfits before proceeding to the next stage
  • Sequence realignment and remodeling may improve the structure
  • Insertions and deletions are usually modeled as loops
  • Nonidentical amino acid side chains are modeled mainly by using rotamer libraries
  • Energy minimization is used to relieve structural errors
  • Molecular dynamics can be used to explore possible conformations for mobile loops
  • Models need to be checked for accuracy
  • How far can homology models be trusted?
  • 13.5 Automated Homology Modeling
  • The program MODELLER models by satisfying protein structure constraints
  • COMPOSER uses fragment-based modeling to automatically generate a model
  • Automated methods available on the Web for comparative modeling
  • Assessment of structure prediction
  • 13.6 Homology Modeling of PI3 Kinase p110α
  • Swiss-Pdb Viewer can be used for manual or semi-manual modeling
  • Alignment, core modeling, and side-chain modeling are carried out all in one
  • The loops are modeled from a database of possible structures
  • Energy minimization and quality inspection can be carried out within Swiss-Pdb Viewer
  • MolIDE is a downloadable semi-automatic modeling package
  • Automated modeling on the Web illustrated with p110α kinase
  • Modeling a functionally related but sequentially dissimilar protein: mTOR
  • Generating a multidomain three-dimensional structure from sequence
  • Summary
  • Further Reading
  • Modeling: General
  • 13.1 Potential Energy Functions and Force Fields
  • Ab initio modeling
  • 13.2 Obtaining a Structure by Threading
  • 123D+
  • GenTHREADER
  • 3D-PSSM
  • FUGUE
  • LIBRA
  • LOOPP
  • LIBELLULA
  • SCOP
  • 13.3 Automated Homology Modeling
  • MolIDE
  • SCWRL
  • PSIPRED
  • LOOPY
  • MODELLER
  • Assessing models
  • MolProbity
  • PROCHECK
  • CASP
  • Antibody and HIV examples
  • 14 Analyzing Structure-Function Relationships
  • 14.1 Functional Conservation
  • Functional regions are usually structurally conserved
  • Similar biochemical function can be found in proteins with different folds
  • Fold libraries identify structurally similar proteins regardless of function
  • 14.2 Structure Comparison Methods
  • Finding domains in proteins aids structure comparison
  • Structural comparisons can reveal conserved functional elements not discernible from a sequence comparison
  • The CE method builds up a structural alignment from pairs of aligned protein segments
  • The Vector Alignment Search Tool (VAST) aligns secondary structural elements
  • DALI identifies structure superposition without maintaining segment order
  • FATCAT introduces rotations between rigid segments
  • 14.3 Finding Binding Sites
  • Highly conserved, strongly charged, or hydrophobic surface areas may indicate interaction sites
  • Searching for protein-protein interactions using surface properties
  • Surface calculations highlight clefts or holes in a protein that may serve as binding sites
  • Looking at residue conservation can identify binding sites
  • 14.4 Docking Methods and Programs
  • Simple docking procedures can be used when the structure of a homologous protein bound to a ligand analog is known
  • Specialized docking programs will automatically dock a ligand to a structure
  • Scoring functions are used to identify the most likely docked ligand
  • The DOCK program is a semirigid-body method that analyzes shape and chemical complementarity of ligand and binding site
  • Fragment docking identifies potential substrates by predicting types of atoms and functional groups in the binding area
  • GOLD is a flexible docking program, which utilizes a genetic algorithm
  • The water molecules in binding sites should also be considered
  • Summary
  • Further Reading
  • General
  • 14.1 Functional Conservation
  • Structure-function relationships
  • Fold recognition programs
  • Domain identification
  • Viewing programs
  • 14.3 Finding Binding Sites
  • Evolutionary trace methods
  • 14.4 Docking Methods and Programs
  • Part 7 Cells and Organisms
  • 15 Proteome and Gene Expression Analysis
  • 15.1 Analysis of Large-scale Gene Expression
  • The expression of large numbers of different genes can be measured simultaneously by DNA microarrays
  • Gene expression microarrays are mainly used to detect differences in gene expression in different conditions
  • Serial analysis of gene expression (SAGE) is also used to study global patterns of gene expression
  • Digital differential display uses bioinformatics and statistics to detect differential gene expression in different tissues
  • Facilitating the integration of data from different places and experiments
  • The simplest method of analyzing gene expression microarray data is hierarchical cluster analysis
  • Techniques based on self-organizing maps can be used for analyzing microarray data
  • Self-organizing tree algorithms (SOTAs) cluster from the top down by successive subdivision of clusters
  • Clustered gene expression data can be used as a tool for further research
  • 15.2 Analysis of Large-scale Protein Expression
  • Two-dimensional gel electrophoresis is a method for separating the individual proteins in a cell
  • Measuring the expression levels shown in 2D gels
  • Differences in protein expression levels between different samples can be detected by 2D gels
  • Clustering methods are used to identify protein spots with similar expression patterns
  • Principal component analysis (PCA) is an alternative to clustering for analyzing microarray and 2D gel data
  • The changes in a set of protein spots can be tracked over a number of different samples
  • Databases and online tools are available to aid the interpretation of 2D gel data
  • Protein microarrays allow the simultaneous detection of the presence or activity of large numbers of different proteins
  • Mass spectrometry can be used to identify the proteins separated and purified by 2D gel electrophoresis or other means
  • Protein-identification programs for mass spectrometry are freely available on the Web
  • Mass spectrometry can be used to measure protein concentration
  • Summary
  • Further Reading
  • 15.1 Analysis of Large-scale Gene Expression
  • Functional genomics
  • Microarray quality control focus
  • MIAME
  • 15.2 Analysis of Large-scale Protein Expression
  • Proteomics
  • Protein microarrays
  • MASCOT
  • ProteinProspector
  • Databases and software
  • 16 Clustering Methods and Statistics
  • 16.1 Expression Data Require Preparation Prior to Analysis
  • Data normalization is designed to remove systematic experimental errors
  • Expression levels are often analyzed as ratios and are usually transformed by taking logarithms
  • Sometimes further normalization is useful after the data transformation
  • Principal component analysis is a method for combining the properties of an object
  • 16.2 Cluster Analysis Requires Distances to be Defined Between all Data Points
  • Euclidean distance is the measure used in everyday life
  • The Pearson correlation coefficient measures distance in terms of the shape of the expression response
  • The Mahalanobis distance takes account of the variation and correlation of expression responses
  • 16.3 Clustering Methods Identify Similar and Distinct Expression Patterns
  • Hierarchical clustering produces a related set of alternative partitions of the data
  • k-means clustering groups data into several clusters but does not determine a relationship between clusters
  • Self-organizing maps (SOMs) use neural network methods to cluster data into a predetermined number of clusters
  • Evolutionary clustering algorithms use selection, recombination, and mutation to find the best possible solution to a problem
  • The self-organizing tree algorithm (SOTA) determines the number of clusters required
  • Biclustering identifies a subset of similar expression level patterns occurring in a subset of the samples
  • The validity of clusters is determined by independent methods
  • 16.4 Statistical Analysis can Quantify the Significance of Observed Differential Expression
  • T-tests can be used to estimate the significance of the difference between two expression levels
  • Nonparametric tests are used to avoid making assumptions about the data sampling
  • Multiple testing of differential expression requires special techniques to control error rates
  • 16.5 Gene and Protein Expression Data Can be Used to Classify Samples
  • Many alternative methods have been proposed that can classify samples
  • Support vector machines are another form of supervised learning algorithm that can produce classifiers
  • Summary
  • Further Reading
  • Monograph
  • 16.1 Expression Data Require Preparation Prior to Analysis
  • Variance-stabilizing transformation
  • Lowess normalization
  • General review of normalization and transformation of data
  • Data used to generate Figures 16.2 to 16.4
  • Principal component analysis
  • 16.2 Cluster Analysis Requires Distances to be Defined Between all Data Points
  • Distance definitions
  • 16.3 Clustering Methods Identify Similar and Distinct Expression Patterns
  • Self-organizing maps (SOMs)
  • Clustering using genetic algorithms
  • SOTA
  • Biclustering techniques
  • Validation of partitions produced by clustering methods
  • 16.4 Statistical Analysis can Quantify the Significance of Observed Differential Expression
  • Bayesian estimation of variance
  • Nonparametric tests
  • Multiple hypothesis testing
  • 16.5 Gene and Protein Expression Data Can be Used to Classify Samples
  • Support vector machines
  • 17 Systems Biology
  • 17.1 What is a System?
  • A system is more than the sum of its parts
  • A biological system is a living network
  • Databases are useful starting points in constructing a network
  • To construct a model more information is needed than a network
  • There are three possible approaches to constructing a model
  • Kinetic models are not the only way in systems biology
  • 17.2 Structure of the Model
  • Control circuits are an essential part of any biological system
  • The interactions in networks can be represented as simple differential equations
  • 17.3 Robustness of Biological Systems
  • Robustness is a distinct feature of complexity in biology
  • Modularity plays an important part in robustness
  • Redundancy in the system can provide robustness
  • Living systems can switch from one state to another by means of bistable switches
  • 17.4 Storing and Running System Models
  • Specialized programs make simulating systems easier
  • Standardized systems descriptions aid their storage and reuse
  • Summary
  • Further Reading
  • 17.1 What is a System?
  • Databases
  • 17.2 Structure of the Model
  • Topological modeling
  • 17.3 Robustness of Biological Systems
  • Molecular systems modeling
  • Appendix A: Probability, Information, and Bayesian Analysis
  • Probability Theory, Entropy, and Information
  • Mutually exclusive events
  • Occurrence of two events
  • Occurrence of two random variables
  • Bayesian Analysis
  • Bayes’ theorem
  • Inference of parameter values
  • Further Reading
  • Appendix B: Molecular Energy Functions
  • Force Fields for Calculating Intra- and Intermolecular Interaction Energies
  • Bonding terms
  • Nonbonding terms
  • Potentials used in Threading
  • Potentials of mean force
  • Potential terms relating to solvent effects
  • Further Reading
  • Appendix C: Function Optimization
  • Full Search Methods
  • Dynamic programming and branch-and-bound
  • Local Optimization
  • The downhill simplex method
  • The steepest descent method
  • The conjugate gradient method
  • Methods using second derivatives
  • Thermodynamic Simulation and Global Optimization
  • Monte Carlo and genetic algorithms
  • Molecular dynamics
  • Simulated annealing
  • Summary
  • Further Reading
  • List of Symbols
  • Notation concepts
  • Glossary
  • Index