A Quick Overview of the Apache Hadoop Framework
Views: 2517
Published: 2019-05-11



Hadoop, now known as Apache Hadoop, was named after a toy elephant that belonged to co-founder Doug Cutting’s son. Doug chose the name for the open-source project as it was easy to spell, pronounce, and find in search results. The original yellow stuffed elephant that inspired the name appears in Hadoop’s logo.


What is Apache Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

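The "detect and handle failures at the application layer" idea can be sketched as a scheduler that simply re-runs a failed task on another node. This is a toy illustration, not Hadoop's actual scheduler; the node names and the `attempt_on_node` callback are invented for the example:

```python
from itertools import cycle

def run_with_retries(task, nodes, attempt_on_node, max_attempts=3):
    """Run `task` on some node; if that node fails, retry on the next one.

    Availability comes from the framework rescheduling work, not from
    any single machine being reliable.
    """
    node_iter = cycle(nodes)
    for _ in range(max_attempts):
        node = next(node_iter)
        try:
            return attempt_on_node(task, node)
        except RuntimeError:
            continue  # node failed: move on to another node
    raise RuntimeError(f"task {task!r} failed on {max_attempts} nodes")

# Simulated cluster where node-1 is down but the others work.
def attempt_on_node(task, node):
    if node == "node-1":
        raise RuntimeError("node down")
    return f"{task} done on {node}"

result = run_with_retries("map-0001", ["node-1", "node-2", "node-3"], attempt_on_node)
# The task fails on node-1 and is transparently retried on node-2.
```

The caller never sees the failure on node-1; the framework absorbs it, which is the sense in which high availability is provided above unreliable hardware.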

In 2003 Google released their paper on the Google File System (GFS). It detailed a proprietary distributed file system intended to provide efficient access to large amounts of data using commodity hardware. A year later, Google released another paper entitled "MapReduce: Simplified Data Processing on Large Clusters." These papers inspired the distributed storage and processing components Doug built for his open-source web-crawler project, Apache Nutch. In 2006, Doug joined Yahoo, and those components moved out of Apache Nutch and were released as Hadoop.


Why is Hadoop useful?

Every day, billions of gigabytes of data are created in a variety of forms. Some examples of frequently created data are:


  • Metadata from phone usage
  • Website logs
  • Credit card purchase transactions
  • Social media posts
  • Videos
  • Information gathered from medical devices

“Big data” refers to data sets that are too large or complex to process using traditional software applications. Factors that contribute to the complexity of data are the size of the data set, speed of available processors, and the data’s format.


At the time of its release, Hadoop was capable of processing data on a larger scale than traditional software.


Core Hadoop

Data is stored in the Hadoop Distributed File System (HDFS). Using MapReduce, Hadoop processes data in parallel chunks (processing several parts at the same time) rather than in a single queue. This reduces the time needed to process large data sets.

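The canonical MapReduce example is a word count. Here is a minimal sketch in plain Python, with a thread pool standing in for the cluster's parallel map tasks; the function names are illustrative, not Hadoop's API:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in this chunk of input.
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped):
    # Shuffle: group values by key across every map task's output.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a single count.
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data big clusters", "big clusters of machines"]
with ThreadPoolExecutor() as pool:       # map tasks run in parallel
    mapped = list(pool.map(map_phase, chunks))
counts = reduce_phase(shuffle(mapped))
# counts == {"big": 3, "data": 1, "clusters": 2, "of": 1, "machines": 1}
```

Real Hadoop additionally moves each map task to a node that already holds its chunk of data ("data locality"), which this sketch omits.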

HDFS works by splitting large files into blocks and replicating those blocks across many servers. Having multiple copies of each block creates redundancy, which protects against data loss.

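The block-and-replica scheme can be sketched as follows. The 128 MB block size matches the HDFS default, but the round-robin placement is a simplification; real HDFS placement is rack-aware, and `place_blocks` and the node names are invented for the example:

```python
def place_blocks(file_size, block_size, datanodes, replication=3):
    """Split a file into fixed-size blocks and assign each block's
    replicas to distinct datanodes (round-robin, a toy stand-in for
    HDFS's rack-aware placement policy)."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

BLOCK = 128 * 1024 * 1024          # HDFS's default block size is 128 MB
nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = place_blocks(file_size=300 * 1024 * 1024, block_size=BLOCK, datanodes=nodes)
# A 300 MB file becomes 3 blocks; each block gets 3 replicas on distinct nodes.
```

With three replicas of every block, losing any single datanode leaves at least two copies of each block intact, and the system can re-replicate from them.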

Hadoop Ecosystem

Many other software packages exist to complement Hadoop. These programs make up the Hadoop Ecosystem. Some make it easier to load data into a Hadoop cluster, while others make Hadoop easier to use.


The Hadoop Ecosystem includes:


  • Apache Hive
  • Apache Pig
  • Apache HBase
  • Apache Phoenix
  • Apache Spark
  • Apache ZooKeeper
  • Cloudera Impala
  • Apache Flume
  • Apache Sqoop
  • Apache Oozie

