Source: 燈塔大數據  Date: 2017-06-21 10:29:25
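As the big data market has grown steadily, more and more companies have adopted data-driven strategies.
Apache Hadoop is currently the most mature big data analytics tool, but the market has no shortage of other excellent ones. Thousands of tools now promise to save you time and money and to give you a fresh view of your industry.
The following introduces 18 practical big data tools: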
Avro: Developed by Doug Cutting, Avro is a data serialization system used to encode the schemas of Hadoop files.
Cassandra: A distributed, open-source database designed to handle large amounts of data spread across commodity servers while providing a highly available service. It is a NoSQL solution originally developed at Facebook.
Many organizations use it today, including Netflix, Cisco, and Twitter.
Drill: An open-source distributed system for interactive analysis of large-scale datasets. Drill is similar to Google's Dremel and is a project of the Apache Software Foundation.
Elasticsearch: An open-source search engine built on Apache Lucene. Elasticsearch is written in Java and delivers very fast search to support your data discovery work.
Flume: A framework for populating Hadoop with data from web servers, application servers, and mobile devices. It is the plumbing between data sources and Hadoop.
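HCatalog: A centralized metadata management and sharing service for Apache Hadoop. It gives a unified view of all data in a Hadoop cluster and lets diverse tools, including Pig and Hive, process any data element without knowing where in the cluster the data is physically stored.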
Impala: Runs fast, interactive SQL queries directly on Apache Hadoop data stored in HDFS or HBase, using the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax) as Apache Hive.
This gives batch-oriented and real-time queries a familiar, unified platform.
JSON: Many of today's NoSQL databases store data in JSON (JavaScript Object Notation), a format that is popular with web developers.
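To make the JSON entry concrete, here is a minimal Python sketch of a nested record being serialized to JSON text and parsed back. The record and its field names are hypothetical, chosen only for illustration; the sketch uses Python's standard `json` module and does not reproduce any particular database's storage format.

```python
import json

# Hypothetical user-activity record, shaped like a document
# that a NoSQL store might hold (field names are assumptions).
record = {
    "user": "alice",
    "action": "page_view",
    "tags": ["big-data", "tools"],
    "device": {"type": "mobile", "os": "android"},
}

# Serialize to JSON text; sort_keys makes the output deterministic.
encoded = json.dumps(record, sort_keys=True)

# Parse the text back into Python objects; nesting survives the round trip.
decoded = json.loads(encoded)

print(encoded)
print(decoded["device"]["os"])
```

Nested objects and arrays map directly onto dictionaries and lists, which is a large part of why the format suits schema-flexible document stores.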
Kafka: A distributed publish-subscribe messaging system capable of handling all the data-flow activity of a consumer-facing website and processing that data.
This kind of data (page views, searches, and other user actions) is a key ingredient of today's social web.
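The publish-subscribe idea behind the Kafka entry can be sketched with a toy in-memory broker. This is not Kafka's client API: `ToyBroker`, the topic names, and the message fields are all hypothetical, and a real deployment adds persistence, partitioning, and network transport that this sketch ignores.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Message = Dict[str, str]
Handler = Callable[[Message], None]


class ToyBroker:
    """In-memory stand-in for a message broker (hypothetical, not Kafka)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self._log: DefaultDict[str, List[Message]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        # Register a callback that receives every later message on the topic.
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: Message) -> int:
        # Append to the per-topic log, then deliver to each subscriber.
        self._log[topic].append(message)
        for handler in self._subscribers[topic]:
            handler(message)
        # Return the message's position (offset) in the topic log.
        return len(self._log[topic]) - 1


broker = ToyBroker()
seen_by_analytics: List[Message] = []
seen_by_search: List[Message] = []

# Two independent consumers subscribe to the same topic.
broker.subscribe("page_views", seen_by_analytics.append)
broker.subscribe("page_views", seen_by_search.append)

offset0 = broker.publish("page_views", {"user": "alice", "url": "/home"})
offset1 = broker.publish("page_views", {"user": "bob", "url": "/pricing"})

print(offset0, offset1)
print(len(seen_by_analytics), len(seen_by_search))
```

The publisher never refers to its consumers directly; it only names a topic. That decoupling is what lets many downstream systems consume the same stream of user actions.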
MongoDB: A document-oriented NoSQL database developed as open source. It offers full index support, the flexibility to index any attribute, and horizontal scaling without loss of functionality.
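As a rough illustration of what "index any attribute" means for a document store, the sketch below builds a simple lookup index over a list of dictionaries. It is a plain-Python toy, not MongoDB's API or its index implementation; the documents, the `build_index` helper, and the field names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical documents; note that not every document has every field.
documents: List[Dict[str, Any]] = [
    {"_id": 1, "name": "alice", "city": "Taipei"},
    {"_id": 2, "name": "bob", "city": "Berlin"},
    {"_id": 3, "name": "carol", "city": "Taipei", "plan": "pro"},
]


def build_index(docs: List[Dict[str, Any]], field: str) -> Dict[Any, List[int]]:
    """Map each value of `field` to the ids of the documents that contain it."""
    index: Dict[Any, List[int]] = defaultdict(list)
    for doc in docs:
        if field in doc:  # documents lacking the field are simply skipped
            index[doc[field]].append(doc["_id"])
    return dict(index)


# Any attribute can be indexed, including one that only some documents carry.
city_index = build_index(documents, "city")
plan_index = build_index(documents, "plan")

print(city_index.get("Taipei", []))
print(plan_index.get("pro", []))
```

Once an index exists, a lookup by value no longer needs to scan every document.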
Neo4j: A graph database said to deliver performance gains of up to 1000x or more over relational databases.
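Oozie: A workflow processing system that lets users define a series of jobs written in different languages, such as MapReduce, Pig, and Hive, and then link them together intelligently. Oozie also lets users specify dependencies between jobs.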
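The dependency handling described in the Oozie entry amounts to ordering jobs so that each runs only after its prerequisites. Below is a minimal sketch using Python's standard `graphlib` module (Python 3.9 or later). The job names are hypothetical, and this is not Oozie's own workflow format, which is defined in XML.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "import_logs": set(),
    "clean_with_pig": {"import_logs"},
    "aggregate_with_hive": {"clean_with_pig"},
    "sessionize_with_mapreduce": {"clean_with_pig"},
    "publish_report": {"aggregate_with_hive", "sessionize_with_mapreduce"},
}

# static_order() yields jobs so that every dependency precedes its dependents.
order = list(TopologicalSorter(workflow).static_order())

print(order)
```

The two middle jobs depend only on the cleaning step, so a scheduler is free to run them in either order or in parallel. A cycle in the dependencies would raise `graphlib.CycleError` instead of producing an order.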
Pig: A Hadoop-based language developed by Yahoo. It is relatively easy to learn and is well suited to very deep, very long data pipelines.
Storm: A free, open-source system for real-time distributed computation. Storm makes it easy to process unstructured data streams reliably and in real time.
The system is fault-tolerant and works with nearly any programming language, although Java is the most common choice. Storm was originally developed at BackType, which Twitter acquired; it is now an Apache project.
Tableau: A data visualization tool focused mainly on business intelligence. Users can create maps, bar charts, scatter plots, and other visualizations without programming.
Tableau recently released a web connector that lets users connect to a database or an API and so pull live data into a visualization.
ZooKeeper: An open-source service that provides centralized configuration and name registration for large distributed systems.
New big data tools appear every day, and trying to learn every one of them is both very difficult and pointless. Picking a few tools you can use proficiently and continuing to build your technical knowledge is the better approach.
English original
18 Big Data Tools You Need To Know About
Use these tools to get ahead
In today's digital transformation, big data has given organizations an edge in analyzing customer behavior and hyper-personalizing every interaction, which results in cross-selling, improved customer experience, and ultimately more revenue.
The market for big data has grown steadily as more and more enterprises have implemented a data-driven strategy.
While Apache Hadoop is the most well-established tool for analyzing big data, there are thousands of big data tools out there.
All of them promise to save you time and money and to help you uncover never-before-seen business insights.
I have selected a few to get you going.
Avro: Developed by Doug Cutting and used for data serialization, encoding the schema of Hadoop files.
Cassandra: A distributed, open-source database designed to handle large amounts of distributed data across commodity servers while providing a highly available service.
It is a NoSQL solution that was initially developed by Facebook. It is used by many organizations, including Netflix, Cisco, and Twitter.
Drill: An open source distributed system for performing interactive analysis on large-scale datasets. It is similar to Google's Dremel, and is managed by Apache.
Elasticsearch: An open source search engine built on Apache Lucene. It is written in Java and can power extremely fast searches that support your data discovery applications.
Flume: A framework for populating Hadoop with data from web servers, application servers, and mobile devices. It is the plumbing between sources and Hadoop.
HCatalog: A centralized metadata management and sharing service for Apache Hadoop.
It allows for a unified view of all data in Hadoop clusters and lets diverse tools, including Pig and Hive, process any data element without needing to know where in the cluster the data is physically stored.
Impala: Provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase, using the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax) as Apache Hive.
This provides a familiar and unified platform for batch-oriented or real-time queries.
JSON: Many of today's NoSQL databases store data in the JSON (JavaScript Object Notation) format, which has become popular with web developers.
Kafka: A distributed publish-subscribe messaging system that offers a solution capable of handling all data flow activity and processing this data on a consumer website.
This type of data (page views, searches, and other user actions) is a key ingredient in the current social web.
MongoDB: A document-oriented NoSQL database, developed under the open source concept. It comes with full index support and the flexibility to index any attribute and scale horizontally without affecting functionality.
Editor: 陳近梅