thecloud

Hadoop Introduction

 

Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is:

Accessible - Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon's Elastic Compute Cloud (EC2).

Robust - Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.

Scalable - Hadoop scales linearly to handle larger data sets by adding more nodes to the cluster.

Simple - Hadoop allows users to quickly write efficient parallel code.

SQL (Structured Query Language) is by design targeted at structured data. Many of Hadoop's initial applications deal with unstructured data such as text. From this perspective, Hadoop provides a more general paradigm than SQL.

Hadoop also scales differently. A machine with four times the power of a standard PC costs a lot more than putting four such PCs in a cluster. Hadoop is designed as a scale-out architecture operating on a cluster of commodity PCs, so adding more resources means adding more machines to the cluster. Hadoop clusters of ten to hundreds of machines are standard. In fact, other than for development purposes, there's no reason to run Hadoop on a single server.

Large data sets are often unstructured or semistructured. Hadoop uses key/value pairs as its basic data unit, which is flexible enough to work with these less-structured data types. In Hadoop, data can originate in any form, but it is eventually transformed into (key/value) pairs for the processing functions to work on.
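To make the key/value idea concrete, here is a minimal sketch in plain Python (not Hadoop code) that turns raw text into (key/value) pairs. It assumes the key is the byte offset at which a line starts and the value is the line itself, which is similar to how line-oriented text input is commonly presented to a mapper. The function name `lines_to_pairs` is hypothetical and chosen only for illustration.

```python
from typing import Iterator, Tuple


def lines_to_pairs(text: str) -> Iterator[Tuple[int, str]]:
    """Yield (byte_offset, line) pairs for each line of the input text.

    Assumption for illustration: key = offset of the line's first byte,
    value = the line without its trailing newline.
    """
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\r\n")
        # Advance by the encoded length so the key is a byte offset.
        offset += len(line.encode("utf-8"))


sample = "hello world\nhadoop stores data\n"
pairs = list(lines_to_pairs(sample))
print(pairs)
```

The input here is unstructured text, yet once it is expressed as pairs, any processing function that accepts (key/value) records can work on it.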

Hadoop is best used as a write-once, read-many-times type of data store. In this respect it's similar to data warehouses in the SQL world.

MapReduce is best understood alongside other data processing models such as pipelines and message queues. Pipelines help with the reuse of processing primitives; simple chaining of existing modules creates new ones. Message queues help with the synchronization of processing primitives. The programmer writes her data processing task as processing primitives in the form of either a producer or a consumer, and the timing of their execution is managed by the system. Similarly, MapReduce is also a data processing model. Its greatest advantage is the easy scaling of data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers.
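The pipeline idea, where chaining existing modules creates a new one, can be sketched in a few lines of plain Python. This is only an illustration of the model, not part of Hadoop; the names `chain`, `lowercase`, and `drop_empty` are hypothetical.

```python
from functools import reduce
from typing import Callable, Iterable

# A stage consumes a stream of records and produces a new stream.
Stage = Callable[[Iterable[str]], Iterable[str]]


def chain(*stages: Stage) -> Stage:
    """Build a new stage by feeding each stage's output into the next."""
    def pipeline(records: Iterable[str]) -> Iterable[str]:
        return reduce(lambda data, stage: stage(data), stages, records)
    return pipeline


def lowercase(records: Iterable[str]) -> Iterable[str]:
    """Existing primitive: normalize case."""
    return (r.lower() for r in records)


def drop_empty(records: Iterable[str]) -> Iterable[str]:
    """Existing primitive: discard blank records."""
    return (r for r in records if r.strip())


# Chaining two existing primitives yields a new, reusable one.
clean = chain(lowercase, drop_empty)
result = list(clean(["Hadoop", "", "MapReduce"]))
print(result)
```

Neither primitive knows about the other, which is what makes them reusable in different chains.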

If all the documents are stored on one central storage server, the bandwidth of that server becomes the bottleneck.

In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. In simple terms, the mapper is meant to filter and transform the input into something that the reducer can aggregate over.
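The two phases can be sketched as a single-process word count in plain Python. This is a toy model under stated assumptions, not the Hadoop API: `mapper`, `reducer`, and `run_job` are hypothetical names, and the grouping of mapper output by key is done here with an in-memory dictionary rather than across cluster nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(document: str) -> Iterator[Tuple[str, int]]:
    """Transform one input element into (word, 1) pairs."""
    for word in document.lower().split():
        yield word, 1


def reducer(word: str, counts: List[int]) -> Tuple[str, int]:
    """Aggregate all values emitted for one key."""
    return word, sum(counts)


def run_job(documents: Iterable[str]) -> Dict[str, int]:
    """Run the map phase, group by key, then run the reduce phase."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for doc in documents:                 # mapping phase
        for key, value in mapper(doc):
            grouped[key].append(value)
    # reducing phase: one reducer call per distinct key
    return dict(reducer(k, v) for k, v in sorted(grouped.items()))


counts = run_job(["Hadoop runs MapReduce", "MapReduce scales out"])
print(counts)
```

Because each mapper call looks at only one input element and each reducer call at only one key, both phases could in principle be split across many machines without changing these two functions.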

