A Quick Overview of the Apache Hadoop Framework
Views: 2517
Published: 2019-05-11



Hadoop, now known as Apache Hadoop, was named after a toy elephant that belonged to co-founder Doug Cutting’s son. Doug chose the name for the open-source project as it was easy to spell, pronounce, and find in search results. The original yellow stuffed elephant that inspired the name appears in Hadoop’s logo.


What is Apache Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

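The "detect and handle failures at the application layer" idea can be sketched as a scheduler that simply re-runs a failed task on another node. This is a toy illustration, not Hadoop's actual scheduler; the node names and the `attempt_on_node` callback are invented for the example:

```python
from itertools import cycle

def run_with_retries(task, nodes, attempt_on_node, max_attempts=3):
    """Run `task` on some node; if that node fails, retry on the next one.

    Availability comes from the framework rescheduling work, not from
    any single machine being reliable.
    """
    node_iter = cycle(nodes)
    for _ in range(max_attempts):
        node = next(node_iter)
        try:
            return attempt_on_node(task, node)
        except RuntimeError:
            continue  # node failed: move on to another node
    raise RuntimeError(f"task {task!r} failed on {max_attempts} nodes")

# Simulated cluster where node-1 is down but the others work.
def attempt_on_node(task, node):
    if node == "node-1":
        raise RuntimeError("node down")
    return f"{task} done on {node}"

result = run_with_retries("map-0001", ["node-1", "node-2", "node-3"], attempt_on_node)
# The task fails on node-1 and is transparently retried on node-2.
```

The caller never sees the failure on node-1; the framework absorbs it, which is the sense in which high availability is provided above unreliable hardware.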

In 2003 Google released their paper on the Google File System (GFS). It detailed a proprietary distributed file system intended to provide efficient access to large amounts of data using commodity hardware. A year later, Google released another paper entitled "MapReduce: Simplified Data Processing on Large Clusters." These papers inspired the distributed storage and processing components Doug built for his open-source web-crawler project, Apache Nutch. In 2006, Doug joined Yahoo, and those components moved out of Apache Nutch and were released as Hadoop.


Why is Hadoop useful?

Every day, billions of gigabytes of data are created in a variety of forms. Some examples of frequently created data are:


  • Metadata from phone usage
  • Website logs
  • Credit card purchase transactions
  • Social media posts
  • Videos
  • Information gathered from medical devices

“Big data” refers to data sets that are too large or complex to process using traditional software applications. Factors that contribute to the complexity of data are the size of the data set, speed of available processors, and the data’s format.


At the time of its release, Hadoop was capable of processing data on a larger scale than traditional software.


Core Hadoop

Data is stored in the Hadoop Distributed File System (HDFS). Using MapReduce, Hadoop processes data in parallel chunks (processing several parts at the same time) rather than in a single queue. This reduces the time needed to process large data sets.

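The canonical MapReduce example is a word count. Here is a minimal sketch in plain Python, with a thread pool standing in for the cluster's parallel map tasks; the function names are illustrative, not Hadoop's API:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in this chunk of input.
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped):
    # Shuffle: group values by key across every map task's output.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a single count.
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data big clusters", "big clusters of machines"]
with ThreadPoolExecutor() as pool:       # map tasks run in parallel
    mapped = list(pool.map(map_phase, chunks))
counts = reduce_phase(shuffle(mapped))
# counts == {"big": 3, "data": 1, "clusters": 2, "of": 1, "machines": 1}
```

Real Hadoop additionally moves each map task to a node that already holds its chunk of data ("data locality"), which this sketch omits.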

HDFS works by splitting large files into blocks and replicating those blocks across many servers. Having multiple copies of each block creates redundancy, which protects against data loss.

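The block-and-replica scheme can be sketched as follows. The 128 MB block size matches the HDFS default, but the round-robin placement is a simplification; real HDFS placement is rack-aware, and `place_blocks` and the node names are invented for the example:

```python
def place_blocks(file_size, block_size, datanodes, replication=3):
    """Split a file into fixed-size blocks and assign each block's
    replicas to distinct datanodes (round-robin, a toy stand-in for
    HDFS's rack-aware placement policy)."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

BLOCK = 128 * 1024 * 1024          # HDFS's default block size is 128 MB
nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = place_blocks(file_size=300 * 1024 * 1024, block_size=BLOCK, datanodes=nodes)
# A 300 MB file becomes 3 blocks; each block gets 3 replicas on distinct nodes.
```

With three replicas of every block, losing any single datanode leaves at least two copies of each block intact, and the system can re-replicate from them.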

Hadoop Ecosystem

Many other software packages exist to complement Hadoop. These programs make up the Hadoop Ecosystem. Some make it easier to load data into a Hadoop cluster, while others make Hadoop easier to use.


The Hadoop Ecosystem includes:


  • Apache Hive
  • Apache Pig
  • Apache HBase
  • Apache Phoenix
  • Apache Spark
  • Apache ZooKeeper
  • Cloudera Impala
  • Apache Flume
  • Apache Sqoop
  • Apache Oozie

