Hadoop For Dummies

Author Dirk deRoos

Publisher Wiley Professional Development (P&T)

Format ePub

Print ISBN 9781118607558

Edition 1

Publication year 2014

2.190 kr.

Table of Contents

  • Introduction
  • About this Book
  • Foolish Assumptions
  • How This Book Is Organized
  • Part I: Getting Started with Hadoop
  • Part II: How Hadoop Works
  • Part III: Hadoop and Structured Data
  • Part IV: Administering and Configuring Hadoop
  • Part V: The Part of Tens: Getting More Out of Your Hadoop Cluster
  • Icons Used in This Book
  • Beyond the Book
  • Where to Go from Here
  • Part I: Getting Started with Hadoop
  • Chapter 1: Introducing Hadoop and Seeing What It’s Good For
  • Big Data and the Need for Hadoop
  • Exploding data volumes
  • Varying data structures
  • A playground for data scientists
  • The Origin and Design of Hadoop
  • Distributed processing with MapReduce
  • Apache Hadoop ecosystem
  • Examining the Various Hadoop Offerings
  • Comparing distributions
  • Working with in-database MapReduce
  • Looking at the Hadoop toolbox
  • Chapter 2: Common Use Cases for Big Data in Hadoop
  • The Keys to Successfully Adopting Hadoop (Or, “Please, Can We Keep Him?”)
  • Log Data Analysis
  • Data Warehouse Modernization
  • Fraud Detection
  • Risk Modeling
  • Social Sentiment Analysis
  • Image Classification
  • Graph Analysis
  • To Infinity and Beyond
  • Chapter 3: Setting Up Your Hadoop Environment
  • Choosing a Hadoop Distribution
  • Choosing a Hadoop Cluster Architecture
  • Pseudo-distributed mode (single node)
  • Fully distributed mode (a cluster of nodes)
  • The Hadoop For Dummies Environment
  • The Hadoop For Dummies distribution: Apache Bigtop
  • Setting up the Hadoop For Dummies environment
  • The Hadoop For Dummies Sample Data Set: Airline on-time performance
  • Your First Hadoop Program: Hello Hadoop!
  • Part II: How Hadoop Works
  • Chapter 4: Storing Data in Hadoop: The Hadoop Distributed File System
  • Data Storage in HDFS
  • Taking a closer look at data blocks
  • Replicating data blocks
  • Slave node and disk failures
  • Sketching Out the HDFS Architecture
  • Looking at slave nodes
  • Keeping track of data blocks with NameNode
  • Checkpointing updates
  • HDFS Federation
  • HDFS High Availability
  • Chapter 5: Reading and Writing Data
  • Compressing Data
  • Managing Files with the Hadoop File System Commands
  • Ingesting Log Data with Flume
  • Chapter 6: MapReduce Programming
  • Thinking in Parallel
  • Seeing the Importance of MapReduce
  • Doing Things in Parallel: Breaking Big Problems into Many Bite-Size Pieces
  • Looking at MapReduce application flow
  • Understanding input splits
  • Seeing how key/value pairs fit into the MapReduce application flow
  • Writing MapReduce Applications
  • Getting Your Feet Wet: Writing a Simple MapReduce Application
  • The FlightsByCarrier driver application
  • The FlightsByCarrier mapper
  • The FlightsByCarrier reducer
  • Running the FlightsByCarrier application
  • Chapter 7: Frameworks for Processing Data in Hadoop: YARN and MapReduce
  • Running Applications Before Hadoop 2
  • Tracking JobTracker
  • Tracking TaskTracker
  • Launching a MapReduce application
  • Seeing a World beyond MapReduce
  • Scouting out the YARN architecture
  • Launching a YARN-based application
  • Real-Time and Streaming Applications
  • Chapter 8: Pig: Hadoop Programming Made Easier
  • Admiring the Pig Architecture
  • Going with the Pig Latin Application Flow
  • Working through the ABCs of Pig Latin
  • Uncovering Pig Latin structures
  • Looking at Pig data types and syntax
  • Evaluating Local and Distributed Modes of Running Pig scripts
  • Checking Out the Pig Script Interfaces
  • Scripting with Pig Latin
  • Chapter 9: Statistical Analysis in Hadoop
  • Pumping Up Your Statistical Analysis
  • The limitations of sampling
  • Factors that increase the scale of statistical analysis
  • Running statistical models in MapReduce
  • Machine Learning with Mahout
  • Collaborative filtering
  • Clustering
  • Classifications
  • R on Hadoop
  • The R language
  • Hadoop Integration with R
  • Chapter 10: Developing and Scheduling Application Workflows with Oozie
  • Getting Oozie in Place
  • Developing and Running an Oozie Workflow
  • Writing Oozie workflow definitions
  • Configuring Oozie workflows
  • Running Oozie workflows
  • Scheduling and Coordinating Oozie Workflows
  • Time-based scheduling for Oozie coordinator jobs
  • Time and data availability-based scheduling for Oozie coordinator jobs
  • Running Oozie coordinator jobs
  • Part III: Hadoop and Structured Data
  • Chapter 11: Hadoop and the Data Warehouse: Friends or Foes?
  • Comparing and Contrasting Hadoop with Relational Databases
  • NoSQL data stores
  • ACID versus BASE data stores
  • Structured data storage and processing in Hadoop
  • Modernizing the Warehouse with Hadoop
  • The landing zone
  • A queryable archive of cold warehouse data
  • Hadoop as a data preprocessing engine
  • Data discovery and sandboxes
  • Chapter 12: Extremely Big Tables: Storing Data in HBase
  • Say Hello to HBase
  • Sparse
  • It’s distributed and persistent
  • It has a multidimensional sorted map
  • Understanding the HBase Data Model
  • Understanding the HBase Architecture
  • RegionServers
  • MasterServer
  • Zookeeper and HBase reliability
  • Taking HBase for a Test Run
  • Creating a table
  • Working with Zookeeper
  • Getting Things Done with HBase
  • Working with an HBase Java API client example
  • HBase and the RDBMS world
  • Knowing when HBase makes sense for you?
  • ACID Properties in HBase
  • Transitioning from an RDBMS model to HBase
  • Deploying and Tuning HBase
  • Hardware requirements
  • Deployment Considerations
  • Tuning prerequisites
  • Understanding your data access patterns
  • Pre-Splitting your regions
  • The importance of row key design
  • Tuning major compactions
  • Chapter 13: Applying Structure to Hadoop Data with Hive
  • Saying Hello to Hive
  • Seeing How the Hive is Put Together
  • Getting Started with Apache Hive
  • Examining the Hive Clients
  • The Hive CLI client
  • The web browser as Hive client
  • SQuirreL as Hive client with the JDBC Driver
  • Working with Hive Data Types
  • Creating and Managing Databases and Tables
  • Managing Hive databases
  • Creating and managing tables with Hive
  • Seeing How the Hive Data Manipulation Language Works
  • LOAD DATA examples
  • INSERT examples
  • Create Table As Select (CTAS) examples
  • Querying and Analyzing Data
  • Joining tables with Hive
  • Improving your Hive queries with indexes
  • Windowing in HiveQL
  • Other key HiveQL features
  • Chapter 14: Integrating Hadoop with Relational Databases Using Sqoop
  • The Principles of Sqoop Design
  • Scooping Up Data with Sqoop
  • Connectors and Drivers
  • Importing Data with Sqoop
  • Importing data into HDFS
  • Importing data into Hive
  • Importing data into HBase
  • Importing incrementally
  • Benefiting from additional Sqoop import features
  • Sending Data Elsewhere with Sqoop
  • Exporting data from HDFS
  • Sqoop exports using the Insert approach
  • Sqoop exports using the Update and Update Insert approach
  • Sqoop exports using call stored procedures
  • Sqoop exports and transactions
  • Looking at Your Sqoop Input and Output Formatting Options
  • Getting down to brass tacks: An example of output line-formatting and input-parsing
  • Sqoop 2.0 Preview
  • Chapter 15: The Holy Grail: Native SQL Access to Hadoop Data
  • SQL’s Importance for Hadoop
  • Looking at What SQL Access Actually Means
  • SQL Access and Apache Hive
  • Solutions Inspired by Google Dremel
  • Apache Drill
  • Cloudera Impala
  • IBM Big SQL
  • Pivotal HAWQ
  • Hadapt
  • The SQL Access Big Picture
  • Part IV: Administering and Configuring Hadoop
  • Chapter 16: Deploying Hadoop
  • Working with Hadoop Cluster Components
  • Rack considerations
  • Master nodes
  • Slave nodes
  • Edge nodes
  • Networking
  • Hadoop Cluster Configurations
  • Small
  • Medium
  • Large
  • Alternate Deployment Form Factors
  • Virtualized servers
  • Cloud deployments
  • Sizing Your Hadoop Cluster
  • Chapter 17: Administering Your Hadoop Cluster
  • Achieving Balance: A Big Factor in Cluster Health
  • Mastering the Hadoop Administration Commands
  • Understanding Factors for Performance
  • Hardware
  • MapReduce
  • Benchmarking
  • Tolerating Faults and Data Reliability
  • Putting Apache Hadoop’s Capacity Scheduler to Good Use
  • Setting Security: The Kerberos Protocol
  • Expanding Your Toolset Options
  • Hue
  • Ambari
  • Hadoop User Experience (Hue)
  • The Hadoop shell
  • Basic Hadoop Configuration Details
  • Part V: The Part of Tens
  • Chapter 18: Ten Hadoop Resources Worthy of a Bookmark
  • Central Nervous System: Apache.org
  • Tweet This
  • Hortonworks University
  • Cloudera University
  • BigDataUniversity.com
  • planet Big Data Blog Aggregator
  • Quora’s Apache Hadoop Forum
  • The IBM Big Data Hub
  • Conferences Not to Be Missed
  • The Google Papers That Started It All
  • The Bonus Resource: What Did We Ever Do B.G.?
  • Chapter 19: Ten Reasons to Adopt Hadoop
  • Hadoop Is Relatively Inexpensive
  • Hadoop Has an Active Open Source Community
  • Hadoop Is Being Widely Adopted in Every Industry
  • Hadoop Can Easily Scale Out As Your Data Grows
  • Traditional Tools Are Integrating with Hadoop
  • Hadoop Can Store Data in Any Format
  • Hadoop Is Designed to Run Complex Analytics
  • Hadoop Can Process a Full Data Set (As Opposed to Sampling)
  • Hardware Is Being Optimized for Hadoop
  • Hadoop Can Increasingly Handle Flexible Workloads (No Longer Just Batch)
  • About the Authors
  • Cheat Sheet
  • More Dummies Products
