AI News

Which one is best: R, SAS or Python, for data science?

Though things have changed, I consider R mostly a tool for ad-hoc analysis and exploratory data analysis (EDA), rather than a component of enterprise analytic applications or production code running in batch mode or accessed via APIs.

My favorite would be Python, but since I code my own applications (as opposed to working with a team), I still use Perl for its automated memory allocation, its nice string-processing features (though many languages now do as well as Perl for NLP and regular expressions), and its high flexibility.

Also, speed of execution (fast C versus relatively slow Perl, R or Python) is no longer a big issue with big data: most of the computing time is spent not on running algorithms (provided they are well optimized) but on data transfers.

Python vs. R (vs. SAS) – which tool should I learn?

Note: This article was originally published on Mar 27th, 2014 and updated on Sept 12th, 2017.

We love comparisons! From Windows among mobile operating systems to candidates for upcoming elections or the choice of captain for the World Cup team, comparisons and discussions enrich our lives.

If you love discussions, all you need to do is pop a relevant question into the middle of a passionate community and then watch it explode!

The beauty of the process is that everyone in the room walks away as a more knowledgeable person.

Python is one of the fastest-growing languages today and has come a long way since its inception.

The reason for me to start this discussion is not to watch it explode (though that would be fun as well). I still feel a discussion is needed, so, without any further delay, let the combat begin!

So, if you are looking to purchase a tool for your company, you may not get a complete answer here.

SAS is easy to learn and provides an easy option (PROC SQL) for people who already know SQL.

In terms of resources, tutorials are available on various university websites, and SAS has comprehensive documentation.

R is a low-level programming language, and hence simple procedures can require longer code.

R computes everything in memory (RAM), so computations were limited by the amount of RAM on 32-bit machines.

All three languages have good data handling capabilities and options for parallel computations.

Since R has long been used widely in academia, new techniques tend to be developed for it quickly.

R and Python, on the other hand, are better options for start-ups and companies looking for cost efficiency.

Python jobs for data analysis show a similar or higher growth trend than R jobs. (The original article included a job-trend graph here, with R in blue and SAS in orange.)

It would be premature to place bets on which will prevail, given the dynamic nature of the industry.

Depending on your circumstances (career stage, finances, etc.) you can add your own weights and come up with what might be suitable for you.

Here are a few specific scenarios. Strategically, corporate setups that require more hands-on assistance and training tend to choose SAS.

Python has been the obvious choice for startups today due to its lightweight nature and growing community.

R Data Import/Export

This is a guide to importing and exporting data to and from R.

This manual is for R version 3.6.0, under development (2018-10-15).


Reading data into a statistical system for analysis and exporting the results to some other system for report writing can be frustrating tasks that can take far more time than the statistical analysis itself, even though most readers will find the latter far more appealing.

This manual describes the import and export facilities available either in R itself or via packages available from CRAN or elsewhere. Unless otherwise stated, everything described in this manual is (at least in principle) available on all platforms running R.

In general, statistical systems like R are not particularly well suited to manipulations of large-scale data, and it can pay to let other tools do that work. There are packages to allow functionality developed in languages such as Java, perl and python to be integrated directly with R code. It is also worth remembering that R comes from the Unix tradition of small re-usable tools, and it can be rewarding to use tools such as awk and perl to manipulate data before import or after export. The case study in Becker, Chambers & Wilks (1988, Chapter 9) is an example of this, where Unix tools were used to check and manipulate the data before input to S.

This manual was first written in 2000, and the number and scope of R packages have grown enormously since; for specialist data formats it is worth searching to see if a suitable package already exists.

The easiest form of data to import into R is a simple text file, and this is often acceptable for problems of small or medium scale. The primary function to import from a text file is scan, and this underlies most of the more convenient functions discussed later. However, statistical consultants are familiar with being presented by a client with a memory stick (formerly, a floppy disc or CD-R) of data in some proprietary binary format, for example 'an Excel spreadsheet' or 'an SPSS file'. Often the simplest thing to do is to use the originating application to export the data as a text file (and statistical consultants will have copies of the most common applications on their computers for that purpose). When that is not possible, facilities are available to access some such files directly from R.

In a few cases, data have been stored in a binary form for compactness and speed of access. One example is imaging data, which is normally stored as a stream of bytes as represented in memory, possibly preceded by a header.

For much larger databases it is common to handle the data using a database management system (DBMS); interfacing R to relational databases is covered in a later chapter. Importing data via network connections is discussed in the chapter on network interfaces.

It is not possible to detect with certainty which 8-bit encoding a text file uses (although guesses may be possible, as noted above), so you may simply have to ask the originator for some clues (e.g. 'Russian on Windows'). It can also help to look at the file with the command-line utility od or a hex editor. Note that utf8 is not a valid encoding name (UTF-8 is), and macintosh is the most portable name for what is sometimes called 'Mac Roman' encoding.

Exporting results from R is usually a less contentious task, though there are still pitfalls. Function cat underlies the export functions: it takes a file argument, and its append argument allows a text file to be written via successive calls to cat.
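
For example, a minimal sketch of building a file with successive calls to cat (the file name is illustrative):

## Build a text file with successive calls to cat().
cat("x y\n", file = "results.txt")                        # create/overwrite
cat(1.5, 2.3, "\n", file = "results.txt", append = TRUE)  # append a row
cat(4.1, 0.7, "\n", file = "results.txt", append = TRUE)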

The most common task is to write a matrix or data frame to file as a rectangular grid of numbers, possibly with row and column labels. Function write just writes out a matrix or vector in a specified number of columns. Function write.table is more convenient, and writes out a data frame (or an object that can be coerced to a data frame) with row and column labels.
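
A minimal write.table sketch (the data frame and file name are illustrative):

## Write a small data frame as a tab-separated text file with a header row.
df <- data.frame(id = 1:3, group = c("a", "b", "a"), score = c(5.1, 4.8, 6.2))
write.table(df, file = "scores.txt", sep = "\t",
            row.names = FALSE, quote = FALSE)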

There are a number of issues that need to be considered in writing out a data frame to a text file, such as precision, quoting and the representation of missing values. Function write.matrix in package MASS provides a specialized interface for writing matrices, with the option of writing them in blocks and thereby reducing memory usage.

It is possible to use sink to divert the standard R output to a file, and thereby capture the output of (possibly implicit) print statements. Function write.foreign in package foreign uses write.table to produce a text file and also writes a code file that will read this text file into another statistical package.
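
As a minimal illustration of the sink mechanism described above (the file name is illustrative):

## Divert printed output to a file, then restore it to the console.
sink("fit_summary.txt")
print(summary(lm(dist ~ speed, data = cars)))  # output goes to the file
sink()                                         # back to the console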

When reading data from text files, it is the responsibility of the user to know and to specify the conventions used to create that file, e.g. the comment character, whether a header line is present, the value separator, the representation for missing values, and so on.

XML (extensible markup language) is becoming extremely popular and is emerging as a standard for general data markup and exchange; it is used by different communities to describe data such as geographical maps and graphical displays. XML provides a way to specify the file's encoding, e.g. <?xml version="1.0" encoding="UTF-8"?>. The XML package provides general facilities for reading and writing XML documents within R. yaml is another system for structuring text data, with an emphasis on human readability.

In Export to text files we saw a number of variations on the format of a spreadsheet-like text file, in which the data are presented in a rectangular grid, possibly with row and column labels. The function read.table is the most convenient way to read in such a rectangular grid of data, and there are convenience wrappers (such as read.csv and read.delim) that call read.table but change a group of default arguments. Beware that read.table is an inefficient way to read in very large numerical matrices: see scan below.
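
A minimal sketch ('mydata.csv' is a hypothetical file with a header row):

## read.csv is read.table with header = TRUE and sep = "," preset.
d <- read.csv("mydata.csv")
str(d)  # check that the columns arrived with sensible types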

Variants read.csv2 and read.delim2 exist for use in those locales where the comma is used for the decimal point and (for read.csv2) the semicolon as field separator. If the options to read.table are specified incorrectly, the resulting error message may give enough information to find the problem, and the auxiliary function count.fields can be useful to investigate further.

Efficiency can be important when reading large data grids. It helps to specify comment.char = "", to give colClasses as one of the atomic vector types (logical, integer, numeric, complex, character or perhaps raw) for each column, and to give nrows, the number of rows to be read (a mild over-estimate is better than not specifying this at all).
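
Putting those hints together (the file name and column types are illustrative; adjust colClasses to the actual columns):

## Faster read.table for a large regular file of roughly a million rows.
d <- read.table("big.txt", header = TRUE, comment.char = "",
                colClasses = c("integer", "numeric", "character"),
                nrows = 1100000)  # mild over-estimate of the row count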

Sometimes data files have no field delimiters but have fields in pre-specified columns. This was very common in the days of punched cards, and is still sometimes used to save file space. Function read.fwf provides a simple way to read such files, specifying a vector of field widths. The function reads the file into memory as whole lines, splits the resulting character strings, writes out a temporary tab-separated file and then calls read.table. This is adequate for small files, but for anything more complicated we recommend using the facilities of a language like perl to pre-process the file. Function read.fortran is a similar function for fixed-format files, using Fortran-style column specifications.
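A read.fwf sketch under an assumed layout (file name and field widths are illustrative):

## Fixed-width records: a 3-character id, a 5-character name and a
## 4-digit year, with no delimiters between the fields.
d <- read.fwf("fixed.dat", widths = c(3, 5, 4),
              col.names = c("id", "name", "year"))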

An old format sometimes used for spreadsheet-like data is DIF, or Data Interchange Format.

Function read.DIF provides a simple way to read such files.

It takes arguments similar to read.table for assigning types to each of the columns.

On Windows, spreadsheet programs often store spreadsheet data copied to the clipboard in this format, and read.DIF("clipboard") can read it from there directly.

Both read.table and read.fwf use scan to read the file and then process the results. They are very convenient, but sometimes it is better to use scan directly. The what argument of scan specifies a list of modes of variables to be read from the file. If the list is named, the names are used for the components of the returned list, and a NULL component skips the corresponding field, so one can return a list with three components and discard a fourth column in the file. If instead all you want is to read whole lines into R for further processing, use readLines.
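
A small sketch ('input.dat' is a hypothetical file with three fields per record):

## The names of the 'what' list become the names of the result.
inp <- scan("input.dat", what = list(id = "", x = 0, y = 0))
str(inp)   # a list with character component 'id' and numeric 'x', 'y'

## If all you need is the raw lines:
lines <- readLines("input.dat")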

One common use of scan is to read in a large matrix. Suppose file matrix.dat just contains the numbers for a 200 x 2000 matrix. On one test, reading it with scan took 1 second (under Linux, 3 seconds under Windows on the same machine), whereas read.table was considerably slower. Had the file instead held 2000 separate short columns of length 2000, scan took 9 seconds whereas read.table took 18 even if used efficiently. Note that timings can depend on the type read and the data: reading a million distinct integers behaves differently from reading a million examples of a small set of codes. These timings also depend heavily on the operating system (the basic reads are slower on Windows than on Linux) and on the state of the machine.
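
The idiom these timings refer to looks like this (matrix.dat as described above: just the numbers of a 200 x 2000 matrix, written row by row):

## Supplying n avoids re-allocations while scan reads.
A <- matrix(scan("matrix.dat", n = 200 * 2000),
            nrow = 200, ncol = 2000, byrow = TRUE)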

Sometimes spreadsheet data is in a compact format that gives the covariates for each subject followed by all the observations on that subject; R's modelling functions, however, need the observations in a single column. Consider, for example, a sample of data from repeated MRI brain measurements. We can use stack to help manipulate these data to give a single response column. The reshape function has a more complicated syntax than stack but can handle data where the 'long' form has more than one column. Some people prefer the tools in packages reshape, reshape2 and plyr.
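
A small stack sketch; the 'wide' data frame and its column names are illustrative:

## One row per subject, one column per visit.
wide <- data.frame(subject = 1:3,
                   visit1  = c(1.2, 1.4, 1.1),
                   visit2  = c(1.3, 1.5, 1.2))
## stack() yields a single response column ('values') plus an
## indicator of origin ('ind').
long <- data.frame(subject = rep(wide$subject, times = 2),
                   stack(wide, select = c(visit1, visit2)))
long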

Displaying higher-dimensional contingency tables in array form is typically rather inconvenient. In categorical data analysis, such information is often represented in the form of bordered two-dimensional arrays with leading rows and columns specifying the combination of factor levels corresponding to the cell counts. These rows and columns are typically 'ragged', in the sense that labels are only displayed when they change, with the obvious convention that rows are read from top to bottom and columns from left to right. Such 'flat' contingency tables can be created in R with ftable. As a simple example, consider the R standard data set UCBAdmissions, a 3-dimensional contingency table of applicants to the six largest departments at UC Berkeley in 1973, classified by admission and sex. The printed flat representation is clearly more useful than displaying the data as a 3-dimensional array. There is also a function read.ftable for reading flat contingency tables from files; it has additional arguments for dealing with variants on how exactly the information on row and column variable names and levels is represented. The flat tables can be converted to standard contingency tables in array form using as.table. If instead the full grid of levels of the row variables is given, one should use read.table to read in the data, and create the contingency table from this using xtabs.
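
For example (UCBAdmissions ships with R; the file name in the commented line is hypothetical):

## Flatten the 3-way UCBAdmissions table for printing.
ftable(UCBAdmissions, row.vars = c("Dept", "Gender"), col.vars = "Admit")

## A flat table read from a file can be converted to a standard
## contingency table in array form:
# tab <- as.table(read.ftable("ucb.txt"))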

In this chapter we consider the problem of reading a binary data file written by another statistical system. In all cases the facilities described were written for data files from specific versions of the other system (often in the early 2000s), and have not necessarily been updated for the most recent versions of the other system. The recommended package foreign provides import facilities for files produced by several such systems; in some cases these functions may require substantially less memory than read.table would. write.foreign (see Export to text files) provides an export mechanism, with support currently for SAS, SPSS and Stata.

EpiInfo versions 5 and 6 stored data in a self-describing fixed-width text format that read.epiinfo can read. read.ssd can be used to create and run a SAS script that saves a SAS permanent dataset in transport format for reading into R. Function read.S can read binary objects produced by S-PLUS 3.x, 4.x or 2000 on (32-bit) Unix or Windows (and can read them on a different OS); it can read vectors, matrices and data frames, and lists containing those. Function data.restore reads S-PLUS data dumps (created by data.dump) with similar restrictions; to read in very large objects it may be preferable to use the dump file as a batch script instead.

Function read.spss can read files created by the 'save' and 'export' commands in SPSS; variables with value labels are optionally converted to R factors. Note that some SPSS products by default create data files with extra formatting information that read.spss may not handle, and some third-party applications claim to produce data 'in SPSS format' with subtle differences, which read.spss may or may not be able to read.

Stata .dta files are a binary file format. Files from versions 5 up to 12 of Stata can be read and written by functions read.dta and write.dta. Function read.systat can read those Systat SAVE files that are rectangular data files (mtype = 1) written on little-endian machines.
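
A hedged sketch of importing from other systems with package foreign; the file names here are illustrative:

library(foreign)
spss_df  <- read.spss("survey.sav", to.data.frame = TRUE)  # SPSS 'save' file
stata_df <- read.dta("panel.dta")                          # Stata versions 5-12
write.dta(stata_df, "panel_copy.dta")                      # export back to Stata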

foreign can read in files in Octave text data format created using the Octave command save -ascii, with support for most of the common types of variables, including the standard atomic (real and complex scalars, matrices, and N-d arrays, strings, ranges, and boolean scalars and matrices) and recursive (structs, cells, and lists) types.

Since all data being manipulated by R are resident in memory, and several copies of the data can be created during execution of a function, R is not well suited to extremely large data sets: objects occupying a large fraction of available RAM can cause R to run out of memory, particularly on a 32-bit operating system. R also does not easily support concurrent access to data: if more than one user is accessing, and perhaps updating, the same data, the changes made by one user will not be visible to the others. R does support persistence of data, in that you can save a data object or an entire worksheet from one session and restore it at the subsequent session, but the format of the stored data is specific to R and not easily manipulated by other systems. Database management systems (DBMSs), and in particular relational DBMSs, address all these issues.

The sort of statistical applications for which a DBMS might be used are to extract a 10% sample of the data, to cross-tabulate data to produce a multi-dimensional contingency table, and to extract data group by group from a database for separate analysis. There are both commercial and freely available DBMSs (e.g. MySQL, PostgreSQL, SQLite, …), the former marked out by much greater emphasis on data security features.

There are other commonly used data sources, including spreadsheets and non-relational databases. All of the packages described later in this chapter provide clients to client/server databases. The more comprehensive R interfaces generate SQL behind the scenes for common operations, but direct use of SQL is needed for complex operations in all of them. Conventionally SQL is written in upper case, but many users will find it more convenient to use lower case in the R interface functions.

A relational DBMS stores data as a database of tables (or relations), which are rather similar to R data frames in that they are made up of columns or fields of one type (numeric, character, date, currency, …) and rows or records containing the observations for one entity. A typical query extracts columns from a table, subject to a condition on a third column, and asks for the results to be sorted; a database join combines information from two tables, say student and school. SELECT queries use FROM to select the table, WHERE to specify a condition for inclusion (one or more conditions combined with AND or OR), and ORDER BY to sort the result. SELECT DISTINCT queries will only return one copy of each distinct row in the selected table. The GROUP BY clause selects subgroups of the rows according to the criterion, and a HAVING clause allows the query to include or exclude groups depending on the aggregated value. If the SELECT statement contains an ORDER BY statement that produces a unique ordering, a LIMIT clause can be added to select (by number) a contiguous block of output rows. There are also queries to create a table (CREATE TABLE, but usually one copies a data frame to the database in these interfaces), to insert (INSERT) or delete (DELETE) data, and to update (UPDATE) data.
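
A minimal client sketch using DBI with the RSQLite driver (an in-memory database, so it runs without a server; the table and column names are illustrative):

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "student", data.frame(id = 1:3,
                                        school = c("A", "B", "A"),
                                        score  = c(60, 75, 82)))
## SQL keywords in upper case by convention; lower case also works.
dbGetQuery(con, "SELECT school, AVG(score) AS mean_score
                 FROM student GROUP BY school ORDER BY school")
dbDisconnect(con)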

Data can be stored in a database in various data types; the range of types is DBMS-specific, but typically includes integer and floating-point numbers, fixed- and variable-length character fields, and text and blob types for large blocks of text and binary data, respectively. The R interface packages provide means to copy whole data frames to and from databases, as well as functions to select data within the database via SQL queries and retrieve the result, as a whole or in pieces.

The description here applies to versions 0.5-0 and later of the interface package: earlier versions used a somewhat different interface. The case of table names is preserved only where the operating file system is case-sensitive, so not on Windows. dbWriteTable copies an R data frame to the database, mapping the row names of the data frame to the field row_names.

Which file formats are supported depends on the versions of the ODBC drivers installed. Function odbcConnect opens a connection (on Windows the GUI allows a database to be selected via dialog boxes) and returns a connection object. sqlCopy sends a query to the database and saves the result as a table in the database. A finer level of control is attained by first calling odbcQuery and then sqlGetResults to fetch the results; the latter can be used within a loop to retrieve a limited number of rows at a time. RODBC can convert column and data frame names to lower case. In the examples we connect to the database testdb we created earlier, having had the DSN (data source name) set up in the ODBC driver manager.
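
A sketch along those lines, assuming the DSN testdb is configured in the driver manager (column-name case in the query depends on the DBMS):

library(RODBC)
channel <- odbcConnect("testdb")            # returns a connection object
sqlSave(channel, USArrests, rownames = "state", addPK = TRUE)  # copy a data frame
sqlQuery(channel, "select state, murder from USArrests where rape > 30")
close(channel)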

Binary data formats such as HDF5 and netCDF store data together with metadata, including descriptions, labels, formats, units, and so on. The HDF5 interface can handle HDF5 groups, and can write numeric and character vectors and matrices. NetCDF's version 4 format (confusingly, implemented in netCDF 4.1.1 and later, but not in 4.0.1) builds on HDF5 storage. dBase used a binary flat-file format that became popular, with file extension .dbf, supporting character, logical and numeric fields (and other types in later versions). Functions read.dbf and write.dbf in package foreign provide ways to read and write basic DBF files.

A particular class of binary files are those representing images, and a not uncommon request is to read such a file into R as a matrix. There are many formats for image files (most with lots of variants), and specialized packages or external conversion tools may be needed. Connections are used in R in the sense of Chambers (1998) and Ripley (2001): a set of functions to replace the use of file names by a flexible interface to file-like objects.

Files compressed via the algorithm used by gzip can be used as connections via the function gzfile. Unix programmers are used to dealing with the special files stdin, stdout and stderr; these exist as terminal connections in R. Even where the command-line R interface is used, stdin refers to the lines submitted from readline rather than to a file. The three terminal connections are always open, and cannot be opened or closed. stdout and stderr are conventionally used for normal output and error messages respectively; they may normally go to the same place, but whereas normal output can be re-directed by a call to sink, error output is sent to stderr unless re-directed by sink(type = "message").

One can also open a pipe connection for writing (it makes no sense to append to a pipe). It is natural to think of cat, write, write.table and sink as writing to a file, possibly appending to a file if argument append = TRUE, and this is what they did prior to R version 1.2.0. The current behaviour is equivalent, but what actually happens is that when the file argument is a character string, a file connection is opened (for writing or appending) and closed again at the end of the function call. If you want to write repeatedly to the same file, it is more efficient to explicitly declare and open the connection, and pass the connection object to each call. This also makes it possible to write to pipes, which was implemented earlier in a limited way via the syntax file = "|cmd" (which can still be used).
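
A minimal sketch of the explicit-connection pattern (file name and pipe command are illustrative):

## Open the connection once, write to it several times, close it.
zz <- file("log.txt", open = "w")
for (i in 1:3) cat("step", i, "done\n", file = zz)
close(zz)

## A pipe connection for writing (appending to a pipe makes no sense):
# zz <- pipe("sort > sorted.txt", open = "w")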

Functions such as file and gzfile take a character string argument and open a file connection. Opening a file connection allows a file to be read sequentially in different formats. Lines of text can be pushed back onto a connection via a call to pushBack. Pushbacks operate as a stack, so a read request first uses each line from the most recently pushed-back text. The number of lines pushed back can be found via a call to pushBackLength. Pushback is only available for connections opened for input in text mode. A summary of all the connections currently opened by the user can be found by calling showConnections().
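
A small self-contained pushBack sketch (using a text connection so no file is needed):

con <- textConnection(c("alpha", "beta"))
readLines(con, n = 1)             # "alpha"
pushBack(c("gamma", "delta"), con)
readLines(con)                    # "gamma" "delta" "beta"
pushBackLength(con)               # 0: the pushed-back lines were consumed
close(con)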

The generic function seek can be used to read and (on some connections) reset the current position for reading or writing; the function truncate can be used to truncate a file opened for writing at the current position. Functions writeBin and readBin transfer binary data: writeBin writes data to the file as a stream of bytes exactly as it is represented in memory, and readBin reads a stream of bytes from the file and interprets them as a vector of a specified mode. Argument n specifies the maximum number of vector elements to read from the connection, and signed allows 1-byte and 2-byte integers to be read as signed (the default) or unsigned. The remaining two arguments, size and endian, are used to write or read data for interchange with another program or another platform, for example to read 16-bit integers or write single-precision real numbers: size allows sizes 1, 2, 4, 8 for integers and logicals, and sizes 4 and 8 (and perhaps 12 or 16) for reals. IEEE 754 arithmetic is used, so standard C facilities can be used to test for and set NaN and infinite values.
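
A round-trip sketch with readBin/writeBin (the file name is illustrative):

## Write integers as 2-byte values with an explicit endianness,
## then read them back the same way.
con <- file("ints.bin", open = "wb")
writeBin(1:5, con, size = 2, endian = "little")
close(con)

con <- file("ints.bin", open = "rb")
readBin(con, what = "integer", n = 5, size = 2,
        signed = TRUE, endian = "little")
close(con)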

Some limited facilities are available to exchange data at a lower level across network connections. The earlier low-level interface is given by functions make.socket, read.socket, write.socket and close.socket. Finally, the most common R data import/export question seems to be 'how do I read an Excel spreadsheet?'

What is a HashTable Data Structure - Introduction to Hash Tables, Part 0

This tutorial is an introduction to hash tables: a hash table is a data structure used to implement an associative array.

R tutorial: connecting to a database

Part two of a series on importing data in R: learn more about connecting to databases with R.

Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science Training | Edureka

This Edureka Random Forest tutorial covers the basics of Random Forest in R.

REST API concepts and examples

This video introduces API concepts by making example calls to Facebook's Graph API, Google Maps' API and Instagram's Media Search API.

Faster Processing than SAS

Learn how organizations have improved their productivity by switching from SAS to Alteryx.

Bar Charts and Pie Charts in R (R Tutorial 2.1)

How to produce "bar charts" and "pie charts" in R, add titles, change axes labels, and many other modifications to these plots. This tutorial explains how to use ...

Python Programming Tutorial - How to Make a Stock Screener

This video teaches you how to create a stock screener based on any indicator you have built in Python.

Bubble sort algorithm

Part of a series on sorting algorithms (in progress).

R in Bioinformatics (2 of 2)

