-->

Thursday, January 5, 2017

Python and Core Python Packages for Data Science

Python

Python is an interpreted, interactive, object-oriented programming language. It incorporates modules, exceptions, dynamic typing, very high level dynamic data types, and classes. Python combines remarkable power with very clear syntax. It has interfaces to many system calls and libraries, as well as to various window systems, and is extensible in C or C++. It is also usable as an extension language for applications that need a programmable interface. Finally, Python is portable: it runs on many Unix variants, on the Mac, and on PCs under MS-DOS, Windows, Windows NT, and OS/2.

From <https://docs.python.org/2.7/faq/general.html>
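A few of the features named above (dynamic typing, classes, exceptions) can be seen in a short sketch. The `Counter` class and its `add` method are made up for illustration and are not part of any library; the snippet assumes Python 3.

```python
# Dynamic typing: the same name can be rebound to values of different types.
value = 42
value = "forty-two"


class Counter:
    """A tiny, hypothetical class that keeps a running count."""

    def __init__(self):
        self.count = 0

    def add(self, amount):
        # Exceptions signal invalid input instead of returning error codes.
        if amount < 0:
            raise ValueError("amount must be non-negative")
        self.count += amount
        return self.count


counter = Counter()
counter.add(3)

try:
    counter.add(-1)
except ValueError as err:
    message = str(err)

print(type(value).__name__, counter.count, message)
```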

Enthought Canopy

Enthought Canopy is a comprehensive Python analysis environment that provides easy installation of over 450 core scientific analytic and Python packages, creating a robust platform you can explore, develop, and visualize on. In addition to its pre-built, tested Python distribution, Enthought Canopy has valuable tools for iterative data analysis, visualization, and application development.

From <https://www.enthought.com/products/canopy/>

Spark

Spark is a fast, general engine for large-scale data processing. It lets you load data into RDDs (Resilient Distributed Datasets) and automatically distributes the data, and the work done on it, across a cluster of machines.

numpy

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
At the core of the NumPy package, is the ndarray object. This encapsulates n-dimensional arrays of homogeneous data types, with many operations being performed in compiled code for performance.

From <https://docs.scipy.org/doc/numpy/user/whatisnumpy.html>
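As a rough illustration of the ndarray described above, the sketch below builds a small array and applies a few of the operation families mentioned (statistics, elementwise arithmetic, shape manipulation, sorting). The values are arbitrary and chosen only for the example.

```python
import numpy as np

# A 2-D ndarray; every element shares one dtype (homogeneous data).
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

col_means = a.mean(axis=0)                 # basic statistics along an axis
doubled = a * 2                            # elementwise arithmetic, no Python loop
reshaped = a.reshape(3, 2)                 # shape manipulation: same data, new shape
sorted_desc = np.sort(a, axis=None)[::-1]  # sort the flattened array, then reverse

print(a.dtype, a.shape)
print(col_means)
```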

matplotlib

matplotlib is a library for making 2D plots of arrays in Python. Although it has its origins in emulating the MATLAB® graphics commands, it is independent of MATLAB, and can be used in a Pythonic, object oriented way. Although matplotlib is written primarily in pure Python, it makes heavy use of NumPy and other extension code to provide good performance even for large arrays.
matplotlib is designed with the philosophy that you should be able to create simple plots with just a few commands, or just one! If you want to see a histogram of your data, you shouldn’t need to instantiate objects, call methods, set properties, and so on; it should just work.

From <http://matplotlib.org/users/intro.html>
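The "just a few commands" idea can be sketched with a histogram. The data here is randomly generated for illustration, and the `Agg` backend and in-memory buffer are assumptions made so the snippet runs without a display; in an interactive session you would more likely call `plt.show()` or pass a file name to `plt.savefig`.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: 1000 samples from a standard normal distribution.
rng = np.random.default_rng(0)
data = rng.normal(size=1000)

# One call draws the histogram; it also returns the counts and bin edges.
counts, bin_edges, _ = plt.hist(data, bins=20)
plt.xlabel("value")
plt.ylabel("count")

# Render the figure as PNG into memory (a file name would work here too).
buf = io.BytesIO()
plt.savefig(buf, format="png")
```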

SciPy

SciPy is a collection of mathematical algorithms and convenience functions built on the NumPy extension of Python. It adds significant power to the interactive Python session by providing the user with high-level commands and classes for manipulating and visualizing data. With SciPy an interactive Python session becomes a data-processing and system-prototyping environment rivaling systems such as MATLAB, IDL, Octave, R-Lab, and SciLab.

From <https://docs.scipy.org/doc/scipy/reference/tutorial/general.html>
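To give a flavor of the high-level commands mentioned above, here is a minimal sketch using two SciPy submodules, `scipy.integrate` and `scipy.optimize`. The functions being integrated and minimized are arbitrary examples with known exact answers.

```python
import numpy as np
from scipy import integrate, optimize

# Numerically integrate sin(x) over [0, pi]; the exact answer is 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Minimize a simple quadratic; the exact minimizer is x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2 + 1.0)

print(area, result.x)
```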

pandas

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

From <http://pandas.pydata.org/pandas-docs/stable/?v=20161222083801>
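A small sketch of what working with labeled data looks like in practice. The table below is invented purely for illustration; the column names and values are not from any real dataset.

```python
import pandas as pd

# A DataFrame is a table whose columns are addressed by label.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Oslo", "Lima", "Lima"],
        "year": [2015, 2016, 2015, 2016],
        "sales": [10.0, 12.0, 7.0, 9.0],
    }
)

# Filter rows with a boolean condition on a labeled column.
oslo = df[df["city"] == "Oslo"]

# Group by a label and aggregate, similar to SQL's GROUP BY.
mean_sales = df.groupby("city")["sales"].mean()

print(mean_sales)
```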

urllib3

urllib3 is a powerful, sanity-friendly HTTP client for Python. It brings many features that are missing from the Python standard library:
  • Thread safety.
  • Connection pooling.
  • Client-side SSL/TLS verification.
  • File uploads with multipart encoding.
  • Helpers for retrying requests and dealing with HTTP redirects.
  • Support for gzip and deflate encoding.
  • Proxy support for HTTP and SOCKS.
  • 100% test coverage.

From <https://urllib3.readthedocs.io/en/latest/>
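Two of the features listed above, connection pooling and retry helpers, can be sketched as follows. The `fetch_status` helper is a hypothetical name introduced for this example, and the retry settings are arbitrary; nothing is sent over the network until `fetch_status` is actually called with a URL.

```python
import urllib3
from urllib3.util import Retry

# Retry a failed request up to 3 times, backing off between attempts.
retries = Retry(total=3, backoff_factor=0.5)

# A PoolManager reuses connections across requests (connection pooling).
http = urllib3.PoolManager(retries=retries)


def fetch_status(url):
    """Hypothetical helper: return the HTTP status code of a GET request."""
    response = http.request("GET", url)
    return response.status
```

Calling `fetch_status` with the address of a reachable server would return its status code, such as 200 for a successful response.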

Installing Python Packages

Intent

How to install Python packages in PyCharm, Canopy, and potentially other environments.

PyCharm

  1. Press 'Ctrl+Alt+S' to open Settings.
  2. Navigate to Project Interpreter.
  3. Click the + symbol and browse for packages.

  PITFALL: Don't forget to upgrade packages. In this case, matplotlib could not be installed until pip had been upgraded.
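Outside the IDE, the same upgrade-then-install order can be sketched with pip itself. The helpers `pip_command` and `pip_install` are hypothetical names for this example; they wrap the standard `python -m pip install` invocation and assume pip is available to the interpreter running the script.

```python
import subprocess
import sys


def pip_command(package, upgrade=False):
    """Build the pip command line for the interpreter running this script."""
    cmd = [sys.executable, "-m", "pip", "install"]
    if upgrade:
        cmd.append("--upgrade")
    cmd.append(package)
    return cmd


def pip_install(package, upgrade=False):
    """Run pip; raises CalledProcessError if the installation fails."""
    subprocess.check_call(pip_command(package, upgrade=upgrade))


# Order that avoids the pitfall: upgrade pip first, then install the package.
# Uncomment to actually run the installation.
# pip_install("pip", upgrade=True)
# pip_install("matplotlib")
```

Using `sys.executable` rather than a bare `pip` keeps the installation tied to the interpreter that is running the script, which matters when several Python environments are installed side by side.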

Canopy

  1. From the Enthought Canopy welcome screen, select 'Package Manager'.
  2. Use the search bar to find packages and install or upgrade them.
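After installing or upgrading, one way to confirm which versions an environment actually has is to ask Python directly. This is a minimal sketch assuming Python 3.8 or later (for `importlib.metadata`); `installed_version` is a hypothetical helper name, and the package list simply mirrors the packages discussed in this post.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ["numpy", "scipy", "pandas", "matplotlib", "urllib3"]:
    print(name, installed_version(name) or "not installed")
```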