
A Complete Tutorial to Learn Data Science with Python from Scratch

After working on SAS for more than 5 years, I decided to move out of my comfort zone. Being a data scientist, my hunt for other useful tools was ON!

But, over the years, with strong community support, Python acquired dedicated libraries for data analysis and predictive modeling.

Due to the lack of resources on Python for data science, I decided to create this tutorial to help others learn Python faster. In this tutorial, we will take bite-sized pieces of information about how to use Python for data analysis, chew on them until we are comfortable, and practice them on our own.

There are two approaches to installing Python: you can install Python and the individual components and libraries separately, or you can download a pre-packaged distribution that bundles them. The second method provides a hassle-free installation, and hence I recommend it to beginners.

The limitation of this approach is that you have to wait for the entire package to be upgraded, even if you are only interested in the latest version of a single library.

It provides a lot of good features for documenting while writing the code itself, and you can choose to run the code in blocks (rather than line-by-line execution). We will use the iPython environment for this complete tutorial.

The most commonly used construct is if-else. For instance, we can use it to print whether a number N is even or odd; a sketch follows below. Now that you are familiar with Python fundamentals, let's take a step further.
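A minimal sketch of that even/odd check:

```python
# Print whether the number N is even or odd using if-else
N = 7
if N % 2 == 0:
    print("Even")
else:
    print("Odd")
```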

What if you have to perform the following tasks: If you try to write code from scratch, it's going to be a nightmare and you won't stay on Python for more than 2 days!

Following is a list of libraries you will need for any scientific computation and data analysis: Additional libraries you might need: Now that we are familiar with Python fundamentals and these additional libraries, let's take a deep dive into problem solving through Python.

We will now use Pandas to read a data set from an Analytics Vidhya competition, perform exploratory analysis and build our first basic categorization algorithm for solving this problem.

The essential difference is that, in the case of dataframes, column names and row numbers are known as the column and row index.

To begin, start the iPython interface in Inline Pylab mode by typing the following on your terminal / Windows command prompt: This opens up an iPython notebook in the pylab environment, which has a few useful libraries already imported.

You can check whether the environment has loaded correctly, by typing the following command (and getting the output as seen in the figure below):

The describe() function provides count, mean, standard deviation (std), min, quartiles and max in its output (read this article to refresh basic statistics and understand population distributions). Here are a few inferences you can draw by looking at the output of the describe() function. Please note that we can get an idea of a possible skew in the data by comparing the mean to the median, i.e. the 50% figure.
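As a minimal sketch of this step, assuming the competition data has been saved as train.csv (the file name is an assumption):

```python
import pandas as pd

# Read the training data into a dataframe and summarize the numeric columns
df = pd.read_csv("train.csv")
print(df.describe())   # count, mean, std, min, quartiles, max
```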

The frequency table can be printed with the following command. Similarly, we can look at the unique values of credit history.
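A sketch of such a frequency table, using the Credit_History column from the loan dataset:

```python
# Frequency table of a categorical column
print(df["Credit_History"].value_counts())
```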

Now we will look at the steps required to generate a similar insight using Python. Please refer to this article to get the hang of the different data manipulation techniques in Pandas.

If you have not realized it already, we have just created two basic classification algorithms here: one based on credit history, and the other on two categorical variables (including gender).

Next let’s explore ApplicantIncome and LoanStatus variables further, perform data munging and create a dataset for applying various modeling techniques.

Let us look at missing values in all the variables because most of the models don’t work with missing data and even if they do, imputing them helps more often than not.

So, let us check the number of nulls / NaNs in the dataset. This command should tell us the number of missing values in each column, as isnull() returns True (counted as 1) wherever a value is null.
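A sketch of an equivalent check:

```python
# Count missing values per column: isnull() flags missing entries,
# and summing each column counts them
print(df.isnull().sum())
```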

The simplest option is replacement by the mean, which can be done with the following code. The other extreme would be to build a supervised learning model that predicts loan amount on the basis of the other variables, and then use those predictions to fill in the missing values.
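A sketch of the mean-imputation option (LoanAmount is the loan-prediction dataset's column name):

```python
# Replace missing LoanAmount values with the column mean
# (shown only as the simplest option; the tutorial goes on to use a
#  pivot-table based fill instead)
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].mean())
```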

Since the purpose here is to bring out the steps in data munging, I'll instead take an approach that lies somewhere in between these two extremes.

This can be done using the following code. Now, we will create a pivot table, which gives us the median values for all groups of unique values of the Self_Employed and Education features.
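A sketch of that pivot table (column names follow the loan-prediction dataset):

```python
# Median LoanAmount for each Self_Employed / Education combination
table = df.pivot_table(values="LoanAmount", index="Self_Employed",
                       columns="Education", aggfunc="median")
print(table)
```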

Next, we define a function which returns the values of these cells, and apply it to fill the missing values of loan amount. This should give you a good way to impute missing values of loan amount.
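A sketch of that step, reusing the pivot table `table` from the previous snippet (it assumes Self_Employed and Education themselves have no missing values):

```python
# Look up the median for each row's Self_Employed / Education combination
def median_loan_amount(row):
    return table.loc[row["Self_Employed"], row["Education"]]

missing = df["LoanAmount"].isnull()
df.loc[missing, "LoanAmount"] = df[missing].apply(median_loan_amount, axis=1)
```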

So instead of treating them as outliers, let's try a log transformation to nullify their effect, and then look at the histogram again:
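A sketch of the transformation:

```python
import numpy as np

# Log-transform LoanAmount to damp the effect of extreme values, then re-plot
df["LoanAmount_log"] = np.log(df["LoanAmount"])
df["LoanAmount_log"].hist(bins=20)
```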

For example, creating a column for LoanAmount/TotalIncome might make sense as it gives an idea of how well the applicant is suited to pay back his loan.
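A sketch of that feature (CoapplicantIncome is assumed to be present, as in the loan-prediction data):

```python
# Combine applicant and co-applicant income, then take the ratio described above
df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
df["LoanAmount_by_TotalIncome"] = df["LoanAmount"] / df["TotalIncome"]
```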

After we have made the data useful for modeling, let's now look at the Python code to create a predictive model on our data set.

One way would be to take all the variables into the model but this might result in overfitting (don’t worry if you’re unaware of this terminology yet).

In simple words, taking all the variables may lead the model to learn complex relations specific to the training data, so it will not generalize well.
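As a minimal sketch of the modeling step (scikit-learn usage here is an assumption, not the tutorial's exact helper code; missing values and categorical encodings are assumed to be handled as above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

predictors = ["Credit_History"]              # start with a single variable
X, y = df[predictors], df["Loan_Status"]

model = LogisticRegression()
model.fit(X, y)
print("Accuracy : %.3f%%" % (100 * accuracy_score(y, model.predict(X))))
print("Cross-Validation Score : %.3f%%" % (100 * cross_val_score(model, X, y, cv=5).mean()))
```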

Accuracy : 80.945% Cross-Validation Score : 80.946% Trying a different combination of variables gives the same figures: Accuracy : 80.945% Cross-Validation Score : 80.946% Generally, we expect the accuracy to increase on adding variables, but here it does not change.

Accuracy : 81.930% Cross-Validation Score : 76.656% Here the model based on categorical variables is unable to have an impact because Credit History is dominating over them.

Let's try a few numerical variables: Accuracy : 92.345% Cross-Validation Score : 71.009% Here we observe that although the accuracy went up on adding variables, the cross-validation score went down.

Also, we will modify the parameters of the random forest model a little bit: Accuracy : 82.899% Cross-Validation Score : 81.461% Notice that although the accuracy reduced, the cross-validation score improved, showing that the model is generalizing well.
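A sketch of that kind of light tuning (the parameter values are illustrative, not the tutorial's exact settings; the predictors are assumed to be numeric or already label-encoded, and LoanAmount_log / TotalIncome refer to the engineered columns sketched earlier):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

predictors = ["Credit_History", "LoanAmount_log", "TotalIncome"]   # example subset
X, y = df[predictors], df["Loan_Status"]

# Fewer, shallower trees reduce overfitting compared to the defaults
rf = RandomForestClassifier(n_estimators=25, min_samples_split=25,
                            max_depth=7, max_features=1)
print("Cross-Validation Score : %.3f%%" % (100 * cross_val_score(rf, X, y, cv=5).mean()))
```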

You would have noticed that even after some basic parameter tuning on random forest, we have reached a cross-validation accuracy only slightly better than the original logistic regression model.

I am sure this not only gave you an idea about basic data analysis methods but it also showed you how to implement some of the more sophisticated techniques available today.

If you come across any difficulty while practicing Python, or you have any thoughts / suggestions / feedback on the post, please feel free to post them through comments below.

12 Useful Pandas Techniques in Python for Data Manipulation

Python is fast becoming the preferred language for data scientists: it provides the larger ecosystem of a programming language and the depth of good scientific computation libraries.

I would recommend that you look at the code for data exploration before going ahead. To help you understand it better, I've taken a data set to perform these operations and manipulations on.

What do you do if you want to filter the values of a column based on conditions on another set of columns? For instance, suppose we want a list of all females who are not graduates and got a loan.
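A sketch using boolean indexing with .loc (column names follow the loan dataset):

```python
# All female applicants who are not graduates and whose loan was approved
filtered = df.loc[(df["Gender"] == "Female") &
                  (df["Education"] == "Not Graduate") &
                  (df["Loan_Status"] == "Y"),
                  ["Gender", "Education", "Loan_Status"]]
print(filtered)
```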

apply() returns a value after passing each row or column of a data frame through a given function.
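For example, a sketch that counts missing values per column and per row:

```python
def num_missing(x):
    return x.isnull().sum()

print(df.apply(num_missing, axis=0))          # one count per column
print(df.apply(num_missing, axis=1).head())   # one count per row (first few shown)
```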

Other, more sophisticated techniques include modeling the missing values or imputing them with grouped averages (mean/mode/median).

Now it is evident that people with a credit history have a much higher chance of getting a loan: about 80% of people with a credit history got a loan, compared to only 9% of those without one.

Since I know that having a credit history is super important, what if I predict loan status to be Y for applicants with a credit history and N otherwise?
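A sketch of that rule, measured against the training labels:

```python
import numpy as np

# Predict "Y" when the applicant has a credit history, "N" otherwise
pred = np.where(df["Credit_History"] == 1, "Y", "N")
print("Rule accuracy: %.3f%%" % (100 * (pred == df["Loan_Status"]).mean()))
```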

Also, I hope this gives some intuition into why even a 0.05% increase in accuracy can result in a jump of 500 ranks on the Kaggle leaderboard.

Consider a hypothetical case where the average property rate (INR per sq. meter) is available for different property types.
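A sketch of bringing that information in with a merge (the rate values below are made up for illustration):

```python
import pandas as pd

# Hypothetical table of average property rates per area, merged onto the main dataframe
prop_rates = pd.DataFrame({
    "Property_Area": ["Rural", "Semiurban", "Urban"],
    "rates": [1000, 5000, 12000],   # hypothetical INR per sq. meter
})
df_merged = df.merge(prop_rates, on="Property_Area", how="left")
print(df_merged[["Property_Area", "rates"]].head())
```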

Many of you might be unaware that boxplots and histograms can be directly plotted in Pandas and calling matplotlib separately is not necessary.
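For instance, a sketch of both kinds of plot made directly from the dataframe:

```python
# Boxplot of income by loan status, and a histogram of income;
# matplotlib is still used under the hood, but we don't call it explicitly
df.boxplot(column="ApplicantIncome", by="Loan_Status")
df["ApplicantIncome"].hist(bins=30)
```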

This shows that income is not a big deciding factor on its own, as there is no appreciable difference in income between the people who received the loan and those who were denied.

For example, suppose we're trying to model traffic (the number of cars on the road) against the time of day (in minutes).

The exact minute of the hour might not be that relevant for predicting traffic compared to the actual period of the day, such as "Morning", "Afternoon", "Evening", "Night" or "Late Night".
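A sketch of that kind of binning with pd.cut (the bin edges and labels are illustrative):

```python
import pandas as pd

# Minutes since midnight, binned into coarse periods of the day
minutes = pd.Series([30, 480, 830, 1140, 1395])
periods = pd.cut(minutes,
                 bins=[0, 360, 720, 1020, 1260, 1440],
                 labels=["Late Night", "Morning", "Afternoon", "Evening", "Night"])
print(periods)
```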

This can be due to various reasons. Here I've defined a generic function which takes a dictionary as input and codes the values using 'replace':
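A sketch of such a function (the Loan_Status mapping is just an example):

```python
import pandas as pd

def coding(col, code_dict):
    # Return a copy of the column with values recoded via Series.replace
    col_coded = pd.Series(col, copy=True)
    for key, value in code_dict.items():
        col_coded = col_coded.replace(key, value)
    return col_coded

df["Loan_Status_Coded"] = coding(df["Loan_Status"], {"N": 0, "Y": 1})
```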

This generally happens when: So it’s generally a good idea to manually define the column types.

A good way to tackle such issues is to create a csv file with column names and types.

This way, we can make a generic function to read the file and assign column data types.
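A sketch of that idea ("datatypes.csv" is a hypothetical file with one row per column, listing its name and desired type):

```python
import pandas as pd

col_types = pd.read_csv("datatypes.csv")   # columns: feature, type

for _, row in col_types.iterrows():
    if row["type"] == "categorical":
        df[row["feature"]] = df[row["feature"]].astype(object)
    elif row["type"] == "continuous":
        df[row["feature"]] = df[row["feature"]].astype(float)
print(df.dtypes)
```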

In this article, we covered various functions of Pandas which can make our life easy while performing data exploration and feature engineering.

2.2. Values and Data Types

We often refer to these values as objects and we will use the words value and object interchangeably.

Not surprisingly, strings belong to the class str and integers belong to the class int.

You may have used function notation in a math class, like y = f(x), likely only for functions that act on a single numeric value, and produce a single numeric value.

In the Python shell, it is not necessary to use the print function to see the values shown above.

The shell evaluates the Python function and automatically prints the result.

For example, consider the shell session shown below.

When we ask the shell to evaluate type("Hello, World!"), it responds with the appropriate answer and then goes on to display the prompt for the next interaction.
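A short shell session illustrating this:

```python
>>> type("Hello, World!")
<class 'str'>
>>> type(17)
<class 'int'>
```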

Continuing with our discussion of data types, numbers with a decimal point belong to a class called float.

Strings in Python can be enclosed in either single quotes (the single quote character), double quotes (the double quote character), or three of the same separate quote characters (''' or """).

They can contain either single or double quotes, and triple quoted strings can even span multiple lines. Python doesn't care whether you use single or double quotes or the three-of-a-kind quotes to surround your strings.
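A few illustrative examples:

```python
s1 = "He said, 'hello'"        # double-quoted string containing single quotes
s2 = 'She replied, "hi"'       # single-quoted string containing double quotes
s3 = """A triple-quoted string
can span multiple lines."""
print(type(s1), type(s2), type(s3))
```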

In fact, the print function can print any number of values as long as you separate them with commas.
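For example:

```python
print("Hello", "World", 17, 3.2)   # several comma-separated values in one call
```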

Variables are nothing but reserved memory locations to store values.

Based on the data type of a variable, the interpreter allocates memory and decides what can be stored in the reserved memory.

The operand to the left of the = operator is the name of the variable and the operand to the right of the = operator is the value stored in the variable.

Here, 100, 1000.0 and 'John' are the values assigned to counter, miles, and name variables, respectively.
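The assignments the text refers to look like this:

```python
counter = 100      # an integer assignment
miles = 1000.0     # a floating point number
name = 'John'      # a string

print(counter)
print(miles)
print(name)
```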

Here, two integer objects with values 1 and 2 are assigned to variables a and b respectively, and one string object with the value 'john' is assigned to the variable c.
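That multiple assignment looks like this:

```python
a, b, c = 1, 2, 'john'
```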

For example, a person's age is stored as a numeric value and his or her address is stored as alphanumeric characters.

Subsets of strings can be taken using the slice operator ([ ] and [:] ) with indexes starting at 0 in the beginning of the string and working their way from -1 at the end.
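For example:

```python
s = "Hello World!"
print(s[0])     # 'H'   (first character, index 0)
print(s[2:5])   # 'llo' (index 2 up to, but not including, 5)
print(s[-1])    # '!'   (last character)
```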

The values stored in a list can be accessed using the slice operator ([ ] and [:]) with indexes starting at 0 in the beginning of the list and working their way to end -1.
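For example:

```python
nums = [10, 20, 30, 40, 50]
print(nums[0])     # 10
print(nums[1:3])   # [20, 30]
print(nums[-1])    # 50
```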

The main differences between lists and tuples are: Lists are enclosed in brackets ( [ ] ) and their elements and size can be changed, while tuples are enclosed in parentheses ( ( ) ) and cannot be updated.
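A small illustration of that difference:

```python
items = [1, 2, 3]
items[0] = 99          # lists can be changed in place

point = (1, 2, 3)
# point[0] = 99        # would raise a TypeError: tuples cannot be updated
```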

Dictionaries are enclosed by curly braces ({ }) and values can be assigned and accessed using square braces ([]).
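For example:

```python
person = {"name": "John", "age": 30}   # enclosed in curly braces
person["city"] = "Paris"               # assign a new key/value with square brackets
print(person["name"])                  # access a value by its key
```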

