AI News, Useful libraries for data science in Python

Useful libraries for data science in Python

updated: 10/22/2014

This is not meant to be a complete list of all Python libraries related to scientific computing and data analysis: printed on paper and stacked one on top of the other, the stack could easily reach a height of 238,857 miles, the distance from the Earth to the Moon.

IPython Notebooks are browser-based documents that provide a great environment for scientific computing: they let you not only execute code but also add informative documentation via Markdown, HTML, LaTeX, embedded images, and inline data plots via, e.g., matplotlib.

It includes a broad range of different classifiers, cross-validation and other model selection methods, dimensionality reduction techniques, modules for regression and clustering analysis, and a useful data-preprocessing module.

PyLearn2 is a machine learning research library (a library to study machine learning) focused on deep and convolutional neural networks, restricted Boltzmann machines, and auto-encoders.

To produce interactive plots, plotly requires an internet connection to stream data to the plotly servers; however, plots can also be saved in common image formats for offline use.

Prettyplotlib is a nice enhancement library that turns matplotlib's default styles into beautiful, presentation-ready plots based on information design and color perception studies.

Seaborn is based on matplotlib's core functionality and adds features (e.g., violin plots) and visual enhancements to create even more beautiful plots.

SQLite is an open-source SQL database engine that is ideal for smaller workgroups because it is a single locally stored database file (up to 140 TB in size) that, in contrast to server-based SQL databases, does not require any server infrastructure.
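As a minimal sketch of the "no server" point, the snippet below uses Python's built-in sqlite3 module to create a database in a single local file and query it with plain SQL. The file name, table name, and rows are made up for illustration.

```python
import os
import sqlite3
import tempfile

# Hypothetical file name; the whole database lives in this one local file.
db_path = os.path.join(tempfile.mkdtemp(), "example.sqlite")

conn = sqlite3.connect(db_path)  # creates the file if it does not exist
conn.execute("CREATE TABLE measurements (sample TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.5), ("b", 2.5), ("c", 4.0)],  # made-up rows
)
conn.commit()

# Plain SQL works without any server process running.
mean_value = conn.execute("SELECT AVG(value) FROM measurements").fetchone()[0]
conn.close()

print(mean_value)
```

Because the database is an ordinary file, it can be copied or shared like any other file.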

Top 15 Python Libraries for Data Science in 2017

As Python has gained a lot of traction in the data science industry in recent years, I wanted to outline some of its most useful libraries for data scientists and engineers, based on recent experience.

When starting to deal with scientific tasks in Python, one inevitably turns to Python’s SciPy Stack, a collection of software specifically designed for scientific computing in Python (not to be confused with the SciPy library, which is part of this stack, or with the community around the stack).

However, the stack is pretty vast: it contains more than a dozen libraries, and we want to focus on the core packages (particularly the most essential ones).

There are two main data structures in the library: “Series”, which is one-dimensional, and “Data Frames”, which are two-dimensional.

For example, you can obtain a new DataFrame from these two types of structures by appending a single row to a DataFrame by passing a Series.

Here is just a small list of things that you can do with Pandas:

Another SciPy Stack core package, and another Python library tailored for generating simple and powerful visualizations with ease, is Matplotlib.
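Returning to the pandas structures described above, here is a minimal sketch of adding a Series to a DataFrame as a new row. The column names and values are made up. Older tutorials often use DataFrame.append for this, but that method was removed in pandas 2.0, so the sketch uses pd.concat instead.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"name": ["a", "b"], "score": [1.0, 2.0]})
row = pd.Series({"name": "c", "score": 3.0})

# Turn the Series into a one-row DataFrame, then concatenate it
# below the existing rows and renumber the index.
new_df = pd.concat([df, row.to_frame().T], ignore_index=True)

print(new_df.shape)
```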

However, the library is pretty low-level, meaning that you will need to write more code to reach advanced levels of visualization and will generally put in more effort than with higher-level tools, but the overall effort is worth it.

Efficiency and stability tweaks allow for much more precise results even with very small values; for example, computing log(1+x) will give sensible results for even the smallest values of x.
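To see why log(1+x) needs special care, here is a small sketch using Python's standard math module (not the library discussed in the text). The value of x is chosen for illustration: it is so small that 1 + x rounds to exactly 1.0 in double precision.

```python
import math

x = 1e-20  # far below the spacing of floating-point numbers near 1.0

naive = math.log(1 + x)   # the addition discards x before the log is taken
stable = math.log1p(x)    # evaluates log(1 + x) without forming 1 + x

print(naive, stable)
```

The naive form returns zero, while the dedicated function keeps the small value, which is the kind of stability tweak the paragraph refers to.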

The functionality of NLTK allows a lot of operations, such as text tagging, classification, tokenizing, named-entity identification, building a corpus tree that reveals inter- and intra-sentence dependencies, stemming, and semantic reasoning.

Gensim implements algorithms such as hierarchical Dirichlet processes (HDP), latent semantic analysis (LSA), and latent Dirichlet allocation (LDA), as well as tf-idf, random projections, word2vec, and doc2vec, which facilitate the examination of texts for recurring patterns of words in a set of documents (often referred to as a corpus).
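To give a feel for one of these techniques, the sketch below computes tf-idf weights in plain Python. It uses one common weighting variant (term frequency times the natural log of inverse document frequency) and is not Gensim's implementation, whose defaults may differ. The tiny corpus and the function name tf_idf are made up for illustration.

```python
import math
from collections import Counter

# Made-up, already tokenized corpus.
corpus = [
    ["data", "science", "python"],
    ["python", "library"],
    ["data", "analysis", "data"],
]

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(corpus)
print(weights[1])
```

Terms that occur in few documents get a higher weight than terms spread across the corpus.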

It was originally designed strictly for scraping, as its name indicates, but it has evolved into a full-fledged framework with the ability to gather data from APIs and act as a general-purpose crawler.

The library follows the famous Don’t Repeat Yourself principle in its interface design: it prompts its users to write general, universal code that is going to be reusable, thus making it easier to build and scale large crawlers.

As you have probably guessed from the name, statsmodels is a Python library that enables its users to conduct data exploration by estimating statistical models and performing statistical tests and analysis.

Among its many useful features are descriptive and result statistics via linear regression models, generalized linear models, discrete choice models, robust linear models, time series analysis models, and various estimators.

The library also provides extensive plotting functions that are designed specifically for use in statistical analysis and tuned for good performance with large sets of statistical data.

And here are the detailed stats of GitHub activity for each of those libraries (source: Google Spreadsheet).

Of course, this is not a fully exhaustive list, and there are many other libraries and frameworks that are also worthy and deserve proper attention for particular tasks.

A Complete Tutorial to Learn Data Science with Python from Scratch

After working on SAS for more than 5 years, I decided to move out of my comfort zone. Being a data scientist, my hunt for other useful tools was ON!

But over the years, with strong community support, the language gained dedicated libraries for data analysis and predictive modeling.

Due to the lack of resources on Python for data science, I decided to create this tutorial to help many others learn Python faster. In this tutorial, we will take bite-sized information about how to use Python for data analysis, chew it till we are comfortable, and practice it at our own end.

The limitation of this approach is that you have to wait for the entire package to be upgraded, even if you are interested in the latest version of a single library.

It provides a lot of good features for documenting while writing the code itself, and you can choose to run the code in blocks (rather than line-by-line execution).

We will use the iPython environment for this complete tutorial.

The most commonly used construct is if-else, with the following syntax:

For instance, if we want to print whether the number N is even or odd:

Now that you are familiar with Python fundamentals, let’s take a step further.
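The if-else construct mentioned above, applied to the even-or-odd case, might look like the following sketch. The function name parity and the example value of N are made up for illustration.

```python
def parity(n):
    """Return 'Even' or 'Odd' for an integer n."""
    if n % 2 == 0:
        return "Even"
    else:
        return "Odd"

N = 7  # example value
print(parity(N))
```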

What if you have to perform the following tasks:

If you try to write code from scratch, it’s going to be a nightmare and you won’t stay on Python for more than 2 days!

Following is a list of libraries you will need for any scientific computation and data analysis:

Additional libraries you might need:

Now that we are familiar with Python fundamentals and additional libraries, let’s take a deep dive into problem solving through Python.

We will now use Pandas to read a data set from an Analytics Vidhya competition, perform exploratory analysis and build our first basic categorization algorithm for solving this problem.

The essential difference is that column names and row numbers are known as the column and row index in the case of dataframes.

To begin, start the iPython interface in Inline Pylab mode by typing the following on your terminal / Windows command prompt:

This opens up an iPython notebook in the pylab environment, which has a few useful libraries already imported.

You can check whether the environment has loaded correctly, by typing the following command (and getting the output as seen in the figure below):

The describe() function provides count, mean, standard deviation (std), min, quartiles, and max in its output. (Read this article to refresh basic statistics to understand population distribution.)

Here are a few inferences you can draw by looking at the output of the describe() function:

Please note that we can get an idea of a possible skew in the data by comparing the mean to the median, i.e.

The frequency table can be printed by the following command:

Similarly, we can look at unique values of credit history.
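The summary statistics and frequency table described above can be sketched with pandas as follows. The data are made up, and the column names ApplicantIncome and Credit_History are assumptions standing in for the real data set's columns.

```python
import pandas as pd

# Made-up stand-in for the loan data; column names are assumptions.
df = pd.DataFrame({
    "ApplicantIncome": [2500, 3000, 3200, 4000, 4500, 5000, 40000],
    "Credit_History": [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
})

summary = df["ApplicantIncome"].describe()

# The "50%" row of describe() is the median. A mean well above the
# median hints at a right-skewed distribution.
mean_income = summary["mean"]
median_income = summary["50%"]

# Frequency table: how often each distinct value occurs.
credit_counts = df["Credit_History"].value_counts()

print(summary)
print(credit_counts)
```

In this toy data a single very large income pulls the mean far above the median, which is the kind of skew the comparison is meant to reveal.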

Now we will look at the steps required to generate a similar insight using Python. Please refer to this article for getting a hang of the different data manipulation techniques in Pandas.

If you have not realized already, we have just created two basic classification algorithms here: one based on credit history, and the other on 2 categorical variables (including gender).

Next let’s explore ApplicantIncome and LoanStatus variables further, perform data munging and create a dataset for applying various modeling techniques.

Let us look at missing values in all the variables because most of the models don’t work with missing data and even if they do, imputing them helps more often than not.

So, let us check the number of nulls / NaNs in the dataset.

This command should tell us the number of missing values in each column, as isnull() returns 1 if the value is null.

The simplest is replacement by the mean, which can be done with the following code:

The other extreme could be to build a supervised learning model to predict loan amount on the basis of other variables and then use the predicted loan amount along with other variables to predict loan status.
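A minimal sketch of the two steps just described, counting missing values per column and filling loan amounts with the mean, could look like this. The data are made up; the column names follow the ones used in the text.

```python
import pandas as pd

nan = float("nan")

# Made-up data with some missing entries.
df = pd.DataFrame({
    "LoanAmount": [100.0, nan, 140.0, nan, 120.0],
    "Self_Employed": ["No", "Yes", None, "No", "No"],
})

# isnull() marks missing cells as True (counted as 1), so summing
# each column gives the number of missing values per column.
missing_per_column = df.isnull().sum()

# Simplest imputation: replace missing loan amounts with the column mean.
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].mean())

print(missing_per_column)
print(df)
```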

Since the purpose now is to bring out the steps in data munging, I’ll rather take an approach that lies somewhere in between these 2 extremes.

This can be done using the following code:

Now, we will create a pivot table, which provides us median values for all the groups of unique values of the Self_Employed and Education features.

Next, we define a function that returns the values of these cells and apply it to fill the missing values of loan amount:

This should provide you a good way to impute missing values of loan amount.
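The pivot-table imputation described above can be sketched as follows. This is an illustration under stated assumptions rather than the tutorial's original code: the data are made up, and the helper name group_median is hypothetical.

```python
import pandas as pd

nan = float("nan")

# Made-up data; the last two loan amounts are missing.
df = pd.DataFrame({
    "Self_Employed": ["No", "No", "Yes", "Yes", "No", "Yes"],
    "Education": ["Graduate", "Graduate", "Graduate",
                  "Not Graduate", "Graduate", "Not Graduate"],
    "LoanAmount": [100.0, 120.0, 200.0, 150.0, nan, nan],
})

# Median loan amount for each (Self_Employed, Education) combination.
table = df.pivot_table(values="LoanAmount", index="Self_Employed",
                       columns="Education", aggfunc="median")

def group_median(row):
    """Look up the median of the group this row belongs to."""
    return table.loc[row["Self_Employed"], row["Education"]]

# Fill only the rows where the loan amount is missing.
missing = df["LoanAmount"].isnull()
df.loc[missing, "LoanAmount"] = df[missing].apply(group_median, axis=1)

print(df)
```

Each missing value is filled with the median of similar applicants instead of a single global figure.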

So instead of treating them as outliers, let’s try a log transformation to nullify their effect:

Looking at the histogram again:
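As a small illustration of what the log transformation does to extreme values, the sketch below applies the natural log to a handful of made-up loan amounts using only the standard library.

```python
import math

# Made-up loan amounts with one extreme value.
loan_amounts = [100, 120, 150, 180, 700]

log_amounts = [math.log(value) for value in loan_amounts]

# On the raw scale the largest value is 7 times the smallest.
raw_ratio = max(loan_amounts) / min(loan_amounts)

# On the log scale that sevenfold gap becomes a difference of log(7).
log_gap = max(log_amounts) - min(log_amounts)

print(raw_ratio, log_gap)
```

The extreme value stays in the data but sits much closer to the rest on the log scale.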

For example, creating a column for LoanAmount/TotalIncome might make sense, as it gives an idea of how well the applicant is suited to pay back the loan.

Now that we have made the data useful for modeling, let’s look at the Python code to create a predictive model on our data set.

One way would be to take all the variables into the model but this might result in overfitting (don’t worry if you’re unaware of this terminology yet).

In simple words, taking all variables might result in the model understanding complex relations specific to the data and will not generalize well.

Accuracy : 80.945% Cross-Validation Score : 80.946%

Accuracy : 80.945% Cross-Validation Score : 80.946%

Generally we expect the accuracy to increase on adding variables.

Accuracy : 81.930% Cross-Validation Score : 76.656%

Here the model based on categorical variables is unable to have an impact because Credit History is dominating over them.

Let’s try a few numerical variables:

Accuracy : 92.345% Cross-Validation Score : 71.009%

Here we observed that although the accuracy went up on adding variables, the cross-validation score went down.

Also, we will modify the parameters of the random forest model a little bit:

Accuracy : 82.899% Cross-Validation Score : 81.461%

Notice that although the accuracy reduced, the cross-validation score is improving, showing that the model is generalizing well.

You would have noticed that even after some basic parameter tuning on random forest, we have reached a cross-validation accuracy only slightly better than the original logistic regression model.

I am sure this not only gave you an idea about basic data analysis methods but it also showed you how to implement some of the more sophisticated techniques available today.

If you come across any difficulty while practicing Python, or you have any thoughts / suggestions / feedback on the post, please feel free to post them through comments below.

Your First Machine Learning Project in Python Step-By-Step

Do you want to do machine learning using Python, but you’re having trouble getting started?

In this step-by-step tutorial you will:

If you are a machine learning beginner looking to finally get started using Python, this tutorial was designed for you.

The best way to learn machine learning is by designing and completing small projects.

A machine learning project may not be linear, but it has a number of well-known steps:

The best way to really come to terms with a new platform or tool is to work through a machine learning project end-to-end and cover the key steps.

You can fill in the gaps such as further data preparation and improving result tasks later, once you have more confidence.

The best small project to start with on a new tool is the classification of iris flowers (e.g.

The scipy installation page provides excellent instructions for installing the above libraries on multiple platforms, such as Linux, Mac OS X, and Windows.

I recommend working directly in the interpreter or writing your scripts and running them on the command line rather than using big editors and IDEs.

If you do have network problems, you can download the file into your working directory and load it using the same method, changing URL to the local file name.

In this step we are going to take a look at the data in a few different ways:

Don’t worry, each look at the data is one command.

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.
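As a minimal sketch, the shape property of a pandas DataFrame returns a (rows, columns) tuple. The tiny table below is a made-up stand-in for the loaded dataset, and its column names are assumptions.

```python
import pandas as pd

# Made-up miniature stand-in for the loaded dataset.
dataset = pd.DataFrame({
    "sepal-length": [5.1, 4.9, 6.3],
    "sepal-width": [3.5, 3.0, 3.3],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

# shape is a property, so there are no parentheses after it.
rows, columns = dataset.shape
print(rows, columns)
```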

We are going to look at two types of plots:

We start with some univariate plots, that is, plots of each individual variable.

This gives us a much clearer idea of the distribution of the input attributes: We can also create a histogram of each input variable to get an idea of the distribution.

We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.

That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.

We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.
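The 80/20 hold-out split described above can be sketched in plain Python as follows. This is a library-free illustration, not the tutorial's own code; the function name train_validation_split and the seed value are made up.

```python
import random

def train_validation_split(rows, validation_fraction=0.2, seed=7):
    """Shuffle a copy of rows and split it into (train, validation)."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]

rows = list(range(150))  # stand-in for 150 data rows
train, validation = train_validation_split(rows)
print(len(train), len(validation))
```

The validation rows are set aside and never shown to the models during training.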

This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all combinations of train-test splits.
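To make the 10-fold procedure concrete, here is a library-free sketch that generates the train and test indices for each fold. The function name k_fold_indices is made up, and for simplicity the sketch does not shuffle the data first, which a real evaluation usually would.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(120, k=10))
for train, test in folds:
    print(len(train), len(test))
```

Every sample is used for testing exactly once and for training in the other nine folds.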

The specific random seed does not matter; learn more about pseudorandom number generators here:

We are using the metric of ‘accuracy’

This is a ratio of the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g.

We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.

We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits.

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model.

There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10 fold cross validation).

It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak.

We can run the KNN model directly on the validation set and summarize the results as a final accuracy score, a confusion matrix and a classification report.

Finally, the classification report provides a breakdown of each class by precision, recall, f1-score and support showing excellent results (granted the validation dataset was small).
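The accuracy score, confusion matrix, and per-class precision and recall mentioned above can be computed by hand, as in the following plain-Python sketch. The function name evaluate and the true and predicted labels are made up for illustration and are not the tutorial's results.

```python
from collections import Counter

def evaluate(y_true, y_pred):
    """Return (accuracy, confusion matrix, per-class report)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))
    # Rows are true classes, columns are predicted classes.
    matrix = [[pairs[(t, p)] for p in labels] for t in labels]
    accuracy = sum(pairs[(label, label)] for label in labels) / len(y_true)
    report = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total (support)
        report[label] = {
            "precision": true_positives / predicted if predicted else 0.0,
            "recall": true_positives / actual if actual else 0.0,
            "support": actual,
        }
    return accuracy, matrix, report

# Made-up labels for illustration.
y_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

accuracy, matrix, report = evaluate(y_true, y_pred)
print(accuracy)
print(matrix)
```

Off-diagonal entries of the matrix show which classes are confused with each other, which a single accuracy figure hides.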

You can learn about the benefits and limitations of various algorithms later, and there are plenty of posts that you can read later to brush up on the steps of a machine learning project and the importance of evaluating accuracy using cross validation.

You discovered that completing a small end-to-end project from loading the data to making predictions is the best way to get familiar with a new platform.

Python NumPy Tutorial | NumPy Array | Python Tutorial For Beginners | Python Training | Edureka

This Edureka Python Numpy tutorial explains what exactly is .

Visualize Machine learning data - Histogram, Density plot in pandas, MatplotLib

This tutorial will explain how to visualize a sample Indian diabetes patient database with Python pandas and the matplotlib plotting library in the form of a histogram and ...

Visualize Machine learning data - Box and correlation plot , Density plot in pandas, MatplotLib

This tutorial will explain how to visualize a sample Indian diabetes patient database with Python pandas and the matplotlib plotting library in the form of Box and ...

Predicting Stock Prices - Learn Python for Data Science #4

In this video, we build an Apple Stock Prediction script in 40 lines of Python using the scikit-learn library and plot the graph using the matplotlib library.

Popular Python Libraries for Data Visualization

It's not just Matplotlib any more... there are more options than ever for creating graphs and visualizations in Python. Libraries now come with beautiful styling, ...

Install Python, Numpy, Matplotlib, Scipy on Windows

See for a more recent video on Python 3.6 with NumPy, SciPy, and Matplotlib. This tutorial covers how to download and install ..

K Nearest Neighbors Application - Practical Machine Learning Tutorial with Python p.14

In the last part we introduced Classification, which is a supervised form of machine learning, and explained the K Nearest Neighbors algorithm intuition. In this ...

Cython: Speed up Python and NumPy, Pythonize C, C++, and Fortran, SciPy2013 Tutorial, Part 1 of 4

Presenter: Kurt Smith Description Cython is a flexible and multi-faceted tool that brings down the barrier between Python and other languages. With cython, you ...

Geospatial Data with Open Source Tools in Python | SciPy 2015 Tutorial | Kelsey Jordahl

Making Predictions with Data and Python : Plotting with Matplotlib |
