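The 15 Python Libraries Top Data Analysts Rely On (Part 1)

Source: 燈塔大數據 (Dengta Big Data) Time: 2017-07-05 11:33:06

Because all of the libraries mentioned here are open source, we also list each library's commit count, contributor count, and other metrics as a supporting indication of how popular each Python library is.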

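1. NumPy

(Commits: 15980; Contributors: 522)

When you first start working with Python, you will inevitably turn to Python's SciPy Stack, a collection of software designed specifically for scientific computing in Python, so any discussion of Python libraries has to mention it. The SciPy Stack is very broad, however, comprising more than a dozen libraries, so our task is to pick out the most important packages.

NumPy (short for Numerical Python) is the most fundamental package of the scientific computation stack. It is feature-rich and covers operations on n-dimensional arrays and matrices in Python. The library provides vectorized mathematical operations on the NumPy array type, which improves performance and speeds up execution.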

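2. SciPy

(Commits: 17213; Contributors: 489)

SciPy is a library of software for engineering and science. You also need to understand the difference between the SciPy Stack and the SciPy library. SciPy contains modules for linear algebra, optimization, integration, and statistics. The SciPy library's main functionality is built on NumPy, so its arrays make heavy use of NumPy. It provides efficient numerical routines, such as numerical integration and optimization, through its specific submodules. The functions in all SciPy submodules are well documented, which is another of its strengths.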

3. Pandas

(Commits: 15089; Contributors: 762)

Pandas is a Python package that makes working with “labeled” and “relational” data simple and intuitive. Pandas is a perfect tool for data wrangling. With it, users can carry out data manipulation, aggregation, and visualization quickly and easily.

Pandas has two main data structures:

“Series”: one-dimensional

“Data Frames”: two-dimensional

For example, if you append a row to a DataFrame by passing a Series, you get a new DataFrame from these two data structures.

With Pandas you can:

Easily delete or add columns in a DataFrame

Convert data structures to DataFrame objects

Handle missing data, represented as NaNs

Group data with powerful group-by functionality

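4. Matplotlib

(Commits: 21754; Contributors: 588)

Matplotlib is another core package of the SciPy Stack and another Python library, built for easily generating simple yet powerful visualizations. This top-notch package lets Python (with some help from NumPy, SciPy, and Pandas) compete with scientific tools such as MATLAB or Mathematica.

However, the library is fairly low-level, which means you need to write more code to reach advanced visualizations, and it generally takes more effort than higher-level tools, but overall it is worth it.

You can use it to produce many kinds of visualization:

Line plots;

Scatter plots;

Bar charts and histograms;

Pie charts;

Stem plots;

Contour plots;

Quiver plots;

Spectrograms

Matplotlib can also create labels, grids, legends, and many other formatting elements. Basically, everything is customizable.

The library is supported on many platforms and uses different GUI toolkits to render the resulting visualizations. Many IDEs (such as IPython) support Matplotlib's functionality.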

5. Seaborn

(Commits: 1699; Contributors: 71)

Seaborn focuses mainly on visualizing statistical models, for example heat maps, which summarize the data while still depicting its overall distribution. Seaborn is built on Matplotlib and depends heavily on it.

6. Bokeh

(Commits: 15724; Contributors: 223)

Bokeh is another powerful visualization library, aimed at interactive visualizations. Unlike the previous library, it is independent of Matplotlib. Bokeh's main focus is interactivity, so it presents its output in modern browsers in the style of Data-Driven Documents (d3.js).

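7. Plotly

(Commits: 2486; Contributors: 33)

Plotly is a web-based toolbox for building visualizations that exposes APIs to several programming languages (Python among them). The plot.ly website offers a number of robust, out-of-the-box graphics. To use Plotly, you need to set up your API key. The graphics are processed server-side and posted on the internet, although you can choose not to publish them.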

English original

　　Top 15 Python Libraries for Data Science in 2017

As Python has gained a lot of traction in the data science industry in recent years, I wanted to outline some of its most useful libraries for data scientists and engineers, based on recent experience.

And, since all of the libraries are open source, we have added commit counts, contributor counts, and other metrics from GitHub, which can serve as proxy metrics for library popularity.

Core Libraries.

1. NumPy (Commits: 15980, Contributors: 522)

When starting to work on scientific tasks in Python, one inevitably turns for help to Python’s SciPy Stack, which is a collection of software specifically designed for scientific computing in Python (not to be confused with the SciPy library, which is part of this stack, or with the community around the stack). That is why we start with a look at it. However, the stack is pretty vast, with more than a dozen libraries in it, so we focus on the core packages, particularly the most essential ones.

The most fundamental package, around which the scientific computation stack is built, is NumPy (which stands for Numerical Python). It provides an abundance of useful features for operations on n-dimensional arrays and matrices in Python. The library provides vectorization of mathematical operations on the NumPy array type, which improves performance and accordingly speeds up execution.
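As a minimal sketch of what vectorization means here, the example below applies the same arithmetic once with an element-by-element Python loop and once as a single array expression. The function names `scale_loop` and `scale_vectorized` and the sample values are hypothetical and chosen only for illustration.

```python
import numpy as np


def scale_loop(xs):
    """Apply 2*x + 1 element by element in the Python interpreter."""
    return [2.0 * x + 1.0 for x in xs]


def scale_vectorized(xs):
    """Apply 2*x + 1 to the whole array in one NumPy expression."""
    return 2.0 * xs + 1.0


# Illustrative input only; any numeric array would do.
small = np.array([0.0, 0.5, 1.0])
looped = scale_loop(small)
vectorized = scale_vectorized(small)

# Matrix operations use the same array type: a 2x3 matrix times its transpose.
matrix = np.arange(6).reshape(2, 3)
product = matrix @ matrix.T

print(looped, vectorized)
print(product)
```

Both functions return the same values; the vectorized form simply hands the loop over to NumPy's compiled routines, which is where the speed-up on large arrays comes from.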

2. SciPy (Commits: 17213, Contributors: 489)

SciPy is a library of software for engineering and science. Again, you need to understand the difference between the SciPy Stack and the SciPy library. SciPy contains modules for linear algebra, optimization, integration, and statistics. The main functionality of the SciPy library is built upon NumPy, and its arrays thus make substantial use of NumPy. It provides efficient numerical routines, such as numerical integration, optimization, and many others, via its specific submodules. The functions in all submodules of SciPy are well documented, which is another point in its favor.
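The submodules mentioned above can be sketched roughly as follows, assuming `scipy.integrate`, `scipy.optimize`, and `scipy.linalg` are the ones of interest. The integrand, the quadratic objective, and the small linear system are made-up examples, not taken from the article.

```python
import numpy as np
from scipy import integrate, linalg, optimize

# Numerical integration: the integral of sin(x) from 0 to pi is exactly 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Optimization: minimize a simple quadratic whose minimum lies at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)

# Linear algebra: solve A @ x = b for a small system stored in NumPy arrays.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
solution = linalg.solve(A, b)

print(area, result.x, solution)
```

Note that every input and output is a NumPy array or a plain float, which is the sense in which SciPy builds on NumPy.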

3. Pandas (Commits: 15089, Contributors: 762)

Pandas is a Python package designed to make working with “labeled” and “relational” data simple and intuitive. Pandas is a perfect tool for data wrangling. It is designed for quick and easy data manipulation, aggregation, and visualization.

There are two main data structures in the library:

“Series”: one-dimensional

“Data Frames”: two-dimensional

For example, appending a single row to a DataFrame by passing a Series gives you a new DataFrame built from these two types of structures.
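A minimal sketch of appending one row, held in a Series, to a DataFrame. It uses `pd.concat` rather than `DataFrame.append`, because `append` was removed in pandas 2.0; the column names and values are illustrative, borrowed from the commit counts quoted in this article.

```python
import pandas as pd

# Two-dimensional structure: a DataFrame with two rows.
frame = pd.DataFrame(
    {"library": ["NumPy", "SciPy"], "commits": [15980, 17213]}
)

# One-dimensional structure: a Series holding a single new row.
new_row = pd.Series({"library": "Pandas", "commits": 15089})

# Turn the Series into a one-row DataFrame and concatenate it.
# The result is a new DataFrame; `frame` itself is left unchanged.
extended = pd.concat(
    [frame, new_row.to_frame().T], ignore_index=True
).infer_objects()

print(extended)
```

The call to `infer_objects()` is optional; it only restores numeric column types after the Series has been transposed into a row.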

Here is just a small list of things that you can do with Pandas:

Easily delete and add columns in a DataFrame

Convert data structures to DataFrame objects

Handle missing data, represented as NaNs

Powerful group-by functionality
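The four operations in this list can be sketched in a few lines. The records below are illustrative: the `group` labels are invented, and one commit count is deliberately set to NaN to show how missing data is handled.

```python
import numpy as np
import pandas as pd

# Convert a plain data structure (a list of dicts) to a DataFrame object.
records = [
    {"group": "core", "library": "NumPy", "commits": 15980},
    {"group": "core", "library": "SciPy", "commits": 17213},
    {"group": "viz", "library": "Seaborn", "commits": 1699},
    {"group": "viz", "library": "Plotly", "commits": np.nan},  # deliberately missing
]
frame = pd.DataFrame(records)

# Add a column, then delete another one.
frame["contributors"] = [522, 489, 71, 33]
frame = frame.drop(columns=["library"])

# Missing data is represented as NaN; count it, then fill it with 0.
missing_count = int(frame["commits"].isna().sum())
filled = frame.fillna({"commits": 0})

# Group-by: total commits per group.
totals = filled.groupby("group")["commits"].sum()

print(missing_count)
print(totals)
```

Filling with 0 is only one possible policy for missing values; dropping the rows with `dropna` would be an equally valid choice depending on the analysis.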

4. Matplotlib (Commits: 21754, Contributors: 588)

Matplotlib is another SciPy Stack core package and another Python library, tailored for generating simple and powerful visualizations with ease. It is a top-notch piece of software that makes Python (with some help from NumPy, SciPy, and Pandas) a serious competitor to scientific tools such as MATLAB or Mathematica.

However, the library is pretty low-level, meaning that you will need to write more code to reach advanced levels of visualization, and you will generally put in more effort than with higher-level tools, but the overall effort is worth it.

With a bit of effort you can make just about any visualization:

Line plots;

Scatter plots;

Bar charts and Histograms;

Pie charts;

Stem plots;

Contour plots;

Quiver plots;

Spectrograms

There are also facilities for creating labels, grids, legends, and many other formatting entities with Matplotlib. Basically, everything is customizable.

The library is supported on different platforms and makes use of different GUI toolkits to render the resulting visualizations. Various IDEs (like IPython) support Matplotlib's functionality.
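To illustrate the low-level style described above, here is a minimal sketch that draws a line plot and a scatter plot and then adds labels, a grid, and a legend by hand. It assumes the non-interactive `Agg` backend so that no GUI toolkit is needed, and it writes the image to an in-memory buffer; the data and titles are made up for the example.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no GUI toolkit is required

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 100)

# One figure with two side-by-side axes.
fig, (left, right) = plt.subplots(1, 2, figsize=(9, 3.5))

# Line plot with labels, a grid, and a legend, each set explicitly.
left.plot(x, np.sin(x), label="sin(x)")
left.plot(x, np.cos(x), linestyle="--", label="cos(x)")
left.set_xlabel("x")
left.set_ylabel("value")
left.set_title("Line plot")
left.grid(True)
left.legend()

# Scatter plot on the second axes.
right.scatter(x, np.sin(x) ** 2, s=10)
right.set_title("Scatter plot")

# Render the figure to PNG bytes in memory.
fig.tight_layout()
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

To write the image to disk instead, a file path can be passed to `fig.savefig` in place of the buffer.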

There are also some additional libraries that can make visualization even easier.

5. Seaborn (Commits: 1699, Contributors: 71)

Seaborn is mostly focused on the visualization of statistical models; such visualizations include heat maps, which summarize the data but still depict the overall distributions. Seaborn is based on Matplotlib and highly dependent on it.

6. Bokeh (Commits: 15724, Contributors: 223)

Another great visualization library is Bokeh, which is aimed at interactive visualizations. In contrast to the previous library, this one is independent of Matplotlib. The main focus of Bokeh, as we already mentioned, is interactivity, and it presents its output via modern browsers in the style of Data-Driven Documents (d3.js).

7. Plotly (Commits: 2486, Contributors: 33)

Finally, a word about Plotly. It is a web-based toolbox for building visualizations that exposes APIs to several programming languages (Python among them). There are a number of robust, out-of-the-box graphics on the plot.ly website. In order to use Plotly, you will need to set up your API key. The graphics will be processed server-side and posted on the internet, but there is a way to avoid that.

Editor in charge: 陳近梅
