Table of Contents
- Introduction
- About this Book
- Foolish Assumptions
- How This Book Is Organized
- Part I: Getting Started with Hadoop
- Part II: How Hadoop Works
- Part III: Hadoop and Structured Data
- Part IV: Administering and Configuring Hadoop
- Part V: The Part of Tens: Getting More Out of Your Hadoop Cluster
- Icons Used in This Book
- Beyond the Book
- Where to Go from Here
- Part I: Getting Started with Hadoop
- Chapter 1: Introducing Hadoop and Seeing What It’s Good For
- Big Data and the Need for Hadoop
- Exploding data volumes
- Varying data structures
- A playground for data scientists
- The Origin and Design of Hadoop
- Distributed processing with MapReduce
- Apache Hadoop ecosystem
- Examining the Various Hadoop Offerings
- Comparing distributions
- Working with in-database MapReduce
- Looking at the Hadoop toolbox
- Chapter 2: Common Use Cases for Big Data in Hadoop
- The Keys to Successfully Adopting Hadoop (Or, “Please, Can We Keep Him?”)
- Log Data Analysis
- Data Warehouse Modernization
- Fraud Detection
- Risk Modeling
- Social Sentiment Analysis
- Image Classification
- Graph Analysis
- To Infinity and Beyond
- Chapter 3: Setting Up Your Hadoop Environment
- Choosing a Hadoop Distribution
- Choosing a Hadoop Cluster Architecture
- Pseudo-distributed mode (single node)
- Fully distributed mode (a cluster of nodes)
- The Hadoop For Dummies Environment
- The Hadoop For Dummies distribution: Apache Bigtop
- Setting up the Hadoop For Dummies environment
- The Hadoop For Dummies Sample Data Set: Airline on-time performance
- Your First Hadoop Program: Hello Hadoop!
- Part II: How Hadoop Works
- Chapter 4: Storing Data in Hadoop: The Hadoop Distributed File System
- Data Storage in HDFS
- Taking a closer look at data blocks
- Replicating data blocks
- Slave node and disk failures
- Sketching Out the HDFS Architecture
- Looking at slave nodes
- Keeping track of data blocks with NameNode
- Checkpointing updates
- HDFS Federation
- HDFS High Availability
- Chapter 5: Reading and Writing Data
- Compressing Data
- Managing Files with the Hadoop File System Commands
- Ingesting Log Data with Flume
- Chapter 6: MapReduce Programming
- Thinking in Parallel
- Seeing the Importance of MapReduce
- Doing Things in Parallel: Breaking Big Problems into Many Bite-Size Pieces
- Looking at MapReduce application flow
- Understanding input splits
- Seeing how key/value pairs fit into the MapReduce application flow
- Writing MapReduce Applications
- Getting Your Feet Wet: Writing a Simple MapReduce Application
- The FlightsByCarrier driver application
- The FlightsByCarrier mapper
- The FlightsByCarrier reducer
- Running the FlightsByCarrier application
- Chapter 7: Frameworks for Processing Data in Hadoop: YARN and MapReduce
- Running Applications Before Hadoop 2
- Tracking JobTracker
- Tracking TaskTracker
- Launching a MapReduce application
- Seeing a World beyond MapReduce
- Scouting out the YARN architecture
- Launching a YARN-based application
- Real-Time and Streaming Applications
- Chapter 8: Pig: Hadoop Programming Made Easier
- Admiring the Pig Architecture
- Going with the Pig Latin Application Flow
- Working through the ABCs of Pig Latin
- Uncovering Pig Latin structures
- Looking at Pig data types and syntax
- Evaluating Local and Distributed Modes of Running Pig Scripts
- Checking Out the Pig Script Interfaces
- Scripting with Pig Latin
- Chapter 9: Statistical Analysis in Hadoop
- Pumping Up Your Statistical Analysis
- The limitations of sampling
- Factors that increase the scale of statistical analysis
- Running statistical models in MapReduce
- Machine Learning with Mahout
- Collaborative filtering
- Clustering
- Classifications
- R on Hadoop
- The R language
- Hadoop Integration with R
- Chapter 10: Developing and Scheduling Application Workflows with Oozie
- Getting Oozie in Place
- Developing and Running an Oozie Workflow
- Writing Oozie workflow definitions
- Configuring Oozie workflows
- Running Oozie workflows
- Scheduling and Coordinating Oozie Workflows
- Time-based scheduling for Oozie coordinator jobs
- Time and data availability-based scheduling for Oozie coordinator jobs
- Running Oozie coordinator jobs
- Part III: Hadoop and Structured Data
- Chapter 11: Hadoop and the Data Warehouse: Friends or Foes?
- Comparing and Contrasting Hadoop with Relational Databases
- NoSQL data stores
- ACID versus BASE data stores
- Structured data storage and processing in Hadoop
- Modernizing the Warehouse with Hadoop
- The landing zone
- A queryable archive of cold warehouse data
- Hadoop as a data preprocessing engine
- Data discovery and sandboxes
- Chapter 12: Extremely Big Tables: Storing Data in HBase
- Say Hello to HBase
- Sparse
- It’s distributed and persistent
- It has a multidimensional sorted map
- Understanding the HBase Data Model
- Understanding the HBase Architecture
- RegionServers
- MasterServer
- ZooKeeper and HBase reliability
- Taking HBase for a Test Run
- Creating a table
- Working with ZooKeeper
- Getting Things Done with HBase
- Working with an HBase Java API client example
- HBase and the RDBMS world
- Knowing when HBase makes sense for you
- ACID Properties in HBase
- Transitioning from an RDBMS model to HBase
- Deploying and Tuning HBase
- Hardware requirements
- Deployment considerations
- Tuning prerequisites
- Understanding your data access patterns
- Pre-splitting your regions
- The importance of row key design
- Tuning major compactions
- Chapter 13: Applying Structure to Hadoop Data with Hive
- Saying Hello to Hive
- Seeing How the Hive is Put Together
- Getting Started with Apache Hive
- Examining the Hive Clients
- The Hive CLI client
- The web browser as Hive client
- SQuirreL as Hive client with the JDBC Driver
- Working with Hive Data Types
- Creating and Managing Databases and Tables
- Managing Hive databases
- Creating and managing tables with Hive
- Seeing How the Hive Data Manipulation Language Works
- LOAD DATA examples
- INSERT examples
- Create Table As Select (CTAS) examples
- Querying and Analyzing Data
- Joining tables with Hive
- Improving your Hive queries with indexes
- Windowing in HiveQL
- Other key HiveQL features
- Chapter 14: Integrating Hadoop with Relational Databases Using Sqoop
- The Principles of Sqoop Design
- Scooping Up Data with Sqoop
- Connectors and Drivers
- Importing Data with Sqoop
- Importing data into HDFS
- Importing data into Hive
- Importing data into HBase
- Importing incrementally
- Benefiting from additional Sqoop import features
- Sending Data Elsewhere with Sqoop
- Exporting data from HDFS
- Sqoop exports using the Insert approach
- Sqoop exports using the Update and Update Insert approach
- Sqoop exports using stored procedure calls
- Sqoop exports and transactions
- Looking at Your Sqoop Input and Output Formatting Options
- Getting down to brass tacks: An example of output line-formatting and input-parsing
- Sqoop 2.0 Preview
- Chapter 15: The Holy Grail: Native SQL Access to Hadoop Data
- SQL’s Importance for Hadoop
- Looking at What SQL Access Actually Means
- SQL Access and Apache Hive
- Solutions Inspired by Google Dremel
- Apache Drill
- Cloudera Impala
- IBM Big SQL
- Pivotal HAWQ
- Hadapt
- The SQL Access Big Picture
- Part IV: Administering and Configuring Hadoop
- Chapter 16: Deploying Hadoop
- Working with Hadoop Cluster Components
- Rack considerations
- Master nodes
- Slave nodes
- Edge nodes
- Networking
- Hadoop Cluster Configurations
- Small
- Medium
- Large
- Alternate Deployment Form Factors
- Virtualized servers
- Cloud deployments
- Sizing Your Hadoop Cluster
- Chapter 17: Administering Your Hadoop Cluster
- Achieving Balance: A Big Factor in Cluster Health
- Mastering the Hadoop Administration Commands
- Understanding Factors for Performance
- Hardware
- MapReduce
- Benchmarking
- Tolerating Faults and Data Reliability
- Putting Apache Hadoop’s Capacity Scheduler to Good Use
- Setting Security: The Kerberos Protocol
- Expanding Your Toolset Options
- Hue (Hadoop User Experience)
- Ambari
- The Hadoop shell
- Basic Hadoop Configuration Details
- Part V: The Part of Tens
- Chapter 18: Ten Hadoop Resources Worthy of a Bookmark
- Central Nervous System: Apache.org
- Tweet This
- Hortonworks University
- Cloudera University
- BigDataUniversity.com
- Planet Big Data Blog Aggregator
- Quora’s Apache Hadoop Forum
- The IBM Big Data Hub
- Conferences Not to Be Missed
- The Google Papers That Started It All
- The Bonus Resource: What Did We Ever Do B.G.?
- Chapter 19: Ten Reasons to Adopt Hadoop
- Hadoop Is Relatively Inexpensive
- Hadoop Has an Active Open Source Community
- Hadoop Is Being Widely Adopted in Every Industry
- Hadoop Can Easily Scale Out As Your Data Grows
- Traditional Tools Are Integrating with Hadoop
- Hadoop Can Store Data in Any Format
- Hadoop Is Designed to Run Complex Analytics
- Hadoop Can Process a Full Data Set (As Opposed to Sampling)
- Hardware Is Being Optimized for Hadoop
- Hadoop Can Increasingly Handle Flexible Workloads (No Longer Just Batch)
- About the Authors
- Cheat Sheet
- More Dummies Products