Table of Contents
- Introduction
- About this Book
- Foolish Assumptions
- How This Book Is Organized
- Part I: Getting Started with Hadoop
- Part II: How Hadoop Works
- Part III: Hadoop and Structured Data
- Part IV: Administering and Configuring Hadoop
- Part V: The Part of Tens: Getting More Out of Your Hadoop Cluster
- Icons Used in This Book
- Beyond the Book
- Where to Go from Here
- Part I: Getting Started with Hadoop
- Chapter 1: Introducing Hadoop and Seeing What It’s Good For
- Big Data and the Need for Hadoop
- Exploding data volumes
- Varying data structures
- A playground for data scientists
- The Origin and Design of Hadoop
- Distributed processing with MapReduce
- Apache Hadoop ecosystem
- Examining the Various Hadoop Offerings
- Comparing distributions
- Working with in-database MapReduce
- Looking at the Hadoop toolbox
- Chapter 2: Common Use Cases for Big Data in Hadoop
- The Keys to Successfully Adopting Hadoop (Or, “Please, Can We Keep Him?”)
- Log Data Analysis
- Data Warehouse Modernization
- Fraud Detection
- Risk Modeling
- Social Sentiment Analysis
- Image Classification
- Graph Analysis
- To Infinity and Beyond
- Chapter 3: Setting Up Your Hadoop Environment
- Choosing a Hadoop Distribution
- Choosing a Hadoop Cluster Architecture
- Pseudo-distributed mode (single node)
- Fully distributed mode (a cluster of nodes)
- The Hadoop For Dummies Environment
- The Hadoop For Dummies distribution: Apache Bigtop
- Setting up the Hadoop For Dummies environment
- The Hadoop For Dummies Sample Data Set: Airline on-time performance
- Your First Hadoop Program: Hello Hadoop!
- Part II: How Hadoop Works
- Chapter 4: Storing Data in Hadoop: The Hadoop Distributed File System
- Data Storage in HDFS
- Taking a closer look at data blocks
- Replicating data blocks
- Slave node and disk failures
- Sketching Out the HDFS Architecture
- Looking at slave nodes
- Keeping track of data blocks with NameNode
- Checkpointing updates
- HDFS Federation
- HDFS High Availability
- Chapter 5: Reading and Writing Data
- Compressing Data
- Managing Files with the Hadoop File System Commands
- Ingesting Log Data with Flume
- Chapter 6: MapReduce Programming
- Thinking in Parallel
- Seeing the Importance of MapReduce
- Doing Things in Parallel: Breaking Big Problems into Many Bite-Size Pieces
- Looking at MapReduce application flow
- Understanding input splits
- Seeing how key/value pairs fit into the MapReduce application flow
- Writing MapReduce Applications
- Getting Your Feet Wet: Writing a Simple MapReduce Application
- The FlightsByCarrier driver application
- The FlightsByCarrier mapper
- The FlightsByCarrier reducer
- Running the FlightsByCarrier application
- Chapter 7: Frameworks for Processing Data in Hadoop: YARN and MapReduce
- Running Applications Before Hadoop 2
- Tracking JobTracker
- Tracking TaskTracker
- Launching a MapReduce application
- Seeing a World beyond MapReduce
- Scouting out the YARN architecture
- Launching a YARN-based application
- Real-Time and Streaming Applications
- Chapter 8: Pig: Hadoop Programming Made Easier
- Admiring the Pig Architecture
- Going with the Pig Latin Application Flow
- Working through the ABCs of Pig Latin
- Uncovering Pig Latin structures
- Looking at Pig data types and syntax
- Evaluating Local and Distributed Modes of Running Pig Scripts
- Checking Out the Pig Script Interfaces
- Scripting with Pig Latin
- Chapter 9: Statistical Analysis in Hadoop
- Pumping Up Your Statistical Analysis
- The limitations of sampling
- Factors that increase the scale of statistical analysis
- Running statistical models in MapReduce
- Machine Learning with Mahout
- Collaborative filtering
- Clustering
- Classifications
- R on Hadoop
- The R language
- Hadoop Integration with R
- Chapter 10: Developing and Scheduling Application Workflows with Oozie
- Getting Oozie in Place
- Developing and Running an Oozie Workflow
- Writing Oozie workflow definitions
- Configuring Oozie workflows
- Running Oozie workflows
- Scheduling and Coordinating Oozie Workflows
- Time-based scheduling for Oozie coordinator jobs
- Time and data availability-based scheduling for Oozie coordinator jobs
- Running Oozie coordinator jobs
- Part III: Hadoop and Structured Data
- Chapter 11: Hadoop and the Data Warehouse: Friends or Foes?
- Comparing and Contrasting Hadoop with Relational Databases
- NoSQL data stores
- ACID versus BASE data stores
- Structured data storage and processing in Hadoop
- Modernizing the Warehouse with Hadoop
- The landing zone
- A queryable archive of cold warehouse data
- Hadoop as a data preprocessing engine
- Data discovery and sandboxes
- Chapter 12: Extremely Big Tables: Storing Data in HBase
- Say Hello to HBase
- Sparse
- It’s distributed and persistent
- It has a multidimensional sorted map
- Understanding the HBase Data Model
- Understanding the HBase Architecture
- RegionServers
- MasterServer
- ZooKeeper and HBase reliability
- Taking HBase for a Test Run
- Creating a table
- Working with ZooKeeper
- Getting Things Done with HBase
- Working with an HBase Java API client example
- HBase and the RDBMS world
- Knowing when HBase makes sense for you
- ACID Properties in HBase
- Transitioning from an RDBMS model to HBase
- Deploying and Tuning HBase
- Hardware requirements
- Deployment considerations
- Tuning prerequisites
- Understanding your data access patterns
- Pre-splitting your regions
- The importance of row key design
- Tuning major compactions
- Chapter 13: Applying Structure to Hadoop Data with Hive
- Saying Hello to Hive
- Seeing How the Hive is Put Together
- Getting Started with Apache Hive
- Examining the Hive Clients
- The Hive CLI client
- The web browser as Hive client
- SQuirreL as Hive client with the JDBC Driver
- Working with Hive Data Types
- Creating and Managing Databases and Tables
- Managing Hive databases
- Creating and managing tables with Hive
- Seeing How the Hive Data Manipulation Language Works
- LOAD DATA examples
- INSERT examples
- Create Table As Select (CTAS) examples
- Querying and Analyzing Data
- Joining tables with Hive
- Improving your Hive queries with indexes
- Windowing in HiveQL
- Other key HiveQL features
- Chapter 14: Integrating Hadoop with Relational Databases Using Sqoop
- The Principles of Sqoop Design
- Scooping Up Data with Sqoop
- Connectors and Drivers
- Importing Data with Sqoop
- Importing data into HDFS
- Importing data into Hive
- Importing data into HBase
- Importing incrementally
- Benefiting from additional Sqoop import features
- Sending Data Elsewhere with Sqoop
- Exporting data from HDFS
- Sqoop exports using the Insert approach
- Sqoop exports using the Update and Update Insert approach
- Sqoop exports using stored procedure calls
- Sqoop exports and transactions
- Looking at Your Sqoop Input and Output Formatting Options
- Getting down to brass tacks: An example of output line-formatting and input-parsing
- Sqoop 2.0 Preview
- Chapter 15: The Holy Grail: Native SQL Access to Hadoop Data
- SQL’s Importance for Hadoop
- Looking at What SQL Access Actually Means
- SQL Access and Apache Hive
- Solutions Inspired by Google Dremel
- Apache Drill
- Cloudera Impala
- IBM Big SQL
- Pivotal HAWQ
- Hadapt
- The SQL Access Big Picture
- Part IV: Administering and Configuring Hadoop
- Chapter 16: Deploying Hadoop
- Working with Hadoop Cluster Components
- Rack considerations
- Master nodes
- Slave nodes
- Edge nodes
- Networking
- Hadoop Cluster Configurations
- Small
- Medium
- Large
- Alternate Deployment Form Factors
- Virtualized servers
- Cloud deployments
- Sizing Your Hadoop Cluster
- Chapter 17: Administering Your Hadoop Cluster
- Achieving Balance: A Big Factor in Cluster Health
- Mastering the Hadoop Administration Commands
- Understanding Factors for Performance
- Hardware
- MapReduce
- Benchmarking
- Tolerating Faults and Data Reliability
- Putting Apache Hadoop’s Capacity Scheduler to Good Use
- Setting Security: The Kerberos Protocol
- Expanding Your Toolset Options
- Hue (Hadoop User Experience)
- Ambari
- The Hadoop shell
- Basic Hadoop Configuration Details
- Part V: The Part of Tens
- Chapter 18: Ten Hadoop Resources Worthy of a Bookmark
- Central Nervous System: Apache.org
- Tweet This
- Hortonworks University
- Cloudera University
- BigDataUniversity.com
- Planet Big Data Blog Aggregator
- Quora’s Apache Hadoop Forum
- The IBM Big Data Hub
- Conferences Not to Be Missed
- The Google Papers That Started It All
- The Bonus Resource: What Did We Ever Do B.G.?
- Chapter 19: Ten Reasons to Adopt Hadoop
- Hadoop Is Relatively Inexpensive
- Hadoop Has an Active Open Source Community
- Hadoop Is Being Widely Adopted in Every Industry
- Hadoop Can Easily Scale Out As Your Data Grows
- Traditional Tools Are Integrating with Hadoop
- Hadoop Can Store Data in Any Format
- Hadoop Is Designed to Run Complex Analytics
- Hadoop Can Process a Full Data Set (As Opposed to Sampling)
- Hardware Is Being Optimized for Hadoop
- Hadoop Can Increasingly Handle Flexible Workloads (No Longer Just Batch)
- About the Authors
- Cheat Sheet
- More Dummies Products