AI News, Which one is best: R, SAS or Python, for data science?
Though things have changed, I consider R mostly a tool for ad-hoc analysis or EDA (exploratory data analysis) rather than a component of enterprise analytic applications or production code running in batch mode or accessed via APIs.
My favorite would be Python, but since I code my own applications (as opposed to working with a team), I still use Perl for its automated memory allocation, nice string processing features (though many languages now do as well as Perl with NLP and regular expressions), and high flexibility.
Also, speed of execution (fast C versus relatively slow Perl, R or Python) is no longer a big issue with big data, as most of the computer time is spent not in running algorithms (if the algorithms are well optimized) but in data transfers.
Python vs. R (vs. SAS) – which tool should I learn?
Note: This article was originally published on Mar 27th, 2014 and updated on Sept 12th, 2017. We love comparisons!
Windows in mobile OS to comparing candidates for upcoming elections or selecting a captain for the world cup team, comparisons and discussions enrich our lives.
If you love discussions, all you need to do is pose a relevant question in the middle of a passionate community and then watch it explode!
The beauty of the process is that everyone in the room walks away as a more knowledgeable person.
Python is one of the fastest-growing languages now and has come a long way since its inception.
The reason for me to start this discussion is not to watch it explode (that would be fun as well though).
But I still feel the need for discussion for the following reasons: So, without any further delay, let the combat begin!
So, if you are looking to purchase a tool for your company, you may not get a complete answer here.
SAS is easy to learn and provides an easy option (PROC SQL) for people who already know SQL.
In terms of resources, there are tutorials available on the websites of various universities, and SAS has comprehensive documentation.
R is a low-level programming language, and hence simple procedures can require longer code.
R computes everything in memory (RAM), and hence computations were limited by the amount of RAM on 32-bit machines.
All three languages have good data handling capabilities and options for parallel computations.
Since R has been widely used in academia in the past, development of new techniques is fast.
R and Python, on the other hand, are better options for start-ups and companies looking for cost efficiency.
Python jobs for data analysis will show a similar or higher trend than R jobs. The graph below shows R in blue and SAS in orange.
It would be premature to place bets on what will prevail, given the dynamic nature of the industry.
Depending on your circumstances (career stage, financials etc.) you can add your own weights and come up with what might be suitable for you.
Here are a few specific scenarios: Strategically, corporate setups that require more hands-on assistance and training choose SAS as an option.
Python has been the obvious choice for startups today due to its lightweight nature and growing community.
R Data Import/Export
This is a guide to importing and exporting data to and from R.
This manual is for R, version 3.6.0 Under development (2018-10-15).
The relational databases part of this manual is based in part on an earlier
Many volunteers have contributed to the packages used here.
Reading data into a statistical system for analysis and exporting the results
to some other system for report writing can be frustrating tasks that
can take far more time than the statistical analysis itself, even though
most readers will find the latter far more appealing.
This manual describes the import and export facilities available either in
Unless otherwise stated, everything described in this manual is (at least
In general, statistical systems like R are not particularly well suited
There are packages to allow functionality developed in languages such as Java,
of small re-usable tools, and it can be rewarding to use tools such
as awk and perl to manipulate data before import or after
is an example of this, where Unix tools were used to check and manipulate
This manual was first written in 2000, and the number and scope of R packages
is worth searching to see if a suitable package already exists.
The easiest form of data to import into R is a simple text file, and this
primary function to import from a text file is scan, and this underlies
a client with a memory stick (formerly, a floppy disc or CD-R) of data
in some proprietary binary format, for example ‘an Excel spreadsheet’
the originating application to export the data as a text file (and statistical
facilities are available to access such files directly from R. For
In a few cases, data have been stored in a binary form for compactness and
is imaging data, which is normally stored as a stream of bytes as represented
For much larger databases it is common to handle the data using a database
Importing data via network connections is discussed
detect with certainty which 8-bit encoding (although guesses
above), so you may simply have to ask the originator for some clues
at the file with the command-line utility od or a hex editor
Note that utf8 is not a valid encoding name (UTF-8 is), and
Exporting results from R is usually a less contentious task, but there
a file argument, and the append argument allows a text
file to be written via successive calls to cat.
The most common task is to write a matrix or data frame to file as a rectangular
grid of numbers, possibly with row and column labels.
write just writes out a matrix or vector in a specified number
is more convenient, and writes out a data frame (or an
object that can be coerced to a data frame) with row and column labels.
There are a number of issues that need to be considered in writing out a data
interface for writing matrices, with the option of writing them
in blocks and thereby reducing memory usage.
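The idea of writing a matrix in blocks to limit memory usage can be sketched outside R as well. The following is a minimal Python illustration, not the R interface the manual refers to; the function name `write_matrix_in_blocks` and the `block_size` parameter are hypothetical, chosen only for this example.

```python
import csv
import io


def write_matrix_in_blocks(rows, out, block_size=2):
    """Write an iterable of numeric rows to `out`, `block_size` rows at a time.

    Only one block is held in memory at once, so `rows` may be a generator
    over a matrix that would not fit in RAM. Returns the number of rows written.
    """
    writer = csv.writer(out, delimiter=" ", lineterminator="\n")
    block, written = [], 0
    for row in rows:
        block.append(row)
        if len(block) == block_size:
            writer.writerows(block)
            written += len(block)
            block = []
    if block:  # flush the final, possibly short, block
        writer.writerows(block)
        written += len(block)
    return written


# Hypothetical 5 x 2 matrix produced lazily, one row at a time.
buffer = io.StringIO()
count = write_matrix_in_blocks(([i, i * i] for i in range(5)), buffer)
```

A larger `block_size` means fewer write calls at the cost of holding more rows in memory at once.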
It is possible to use sink to divert the standard R output to a
file, and thereby capture the output of (possibly implicit) print
to produce a text file and also writes a code file that
will read this text file into another statistical package.
When reading data from text files, it is the responsibility of the user to
know and to specify the conventions used to create that file, e.g.
the comment character, whether a header line is present, the value separator,
is becoming extremely popular and is emerging as a standard
communities to describe geographical data such as maps, graphical
XML provides a way to specify the file’s encoding, e.g.
The XML package provides general facilities for reading and writing
yaml is another system for structuring text data, with emphasis
In Export to text files we saw a number of variations on the format
of a spreadsheet-like text file, in which the data are presented in
a rectangular grid, possibly with row and column labels.
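Because the reader of such a file has to state its conventions explicitly, a small sketch may help. This is a Python analogue using the standard `csv` module, not R's `read.table`; the helper `read_grid`, its parameters, and the sample text are assumptions made for illustration, and comment handling is simplified to whole-line comments only.

```python
import csv


def read_grid(text, sep=",", comment="#", header=True):
    """Parse a spreadsheet-like text block under explicitly stated conventions.

    sep     -- the value separator used by whoever wrote the file
    comment -- lines starting with this character are skipped
    header  -- whether the first remaining line holds column labels
    """
    lines = [
        line for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith(comment)
    ]
    rows = list(csv.reader(lines, delimiter=sep))
    names = rows.pop(0) if header else None
    return names, rows


# Hypothetical export that uses ';' as separator and '#' for comments.
sample = "# exported for illustration\nid;score\na;1.5\nb;2.0\n"
names, rows = read_grid(sample, sep=";")
```

Values come back as strings; converting each column to its intended type is a separate, equally explicit step.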
The function read.table is the most convenient way to read in a rectangular
other functions that call read.table but change a group of
Beware that read.table is an inefficient way to read in very
large numerical matrices: see scan below.
in those locales where the comma is used for the decimal point and (for
If the options to read.table are specified incorrectly, the error message
or This may give enough information to find the problem, but the auxiliary function
Efficiency can be important when reading large data grids.
specify comment.char = "", colClasses as one of the atomic
vector types (logical, integer, numeric, complex, character or perhaps
raw) for each column, and to give nrows, the number of rows
to be read (and a mild over-estimate is better than not specifying this
Sometimes data files have no field delimiters but have fields in pre-specified
and is still sometimes used to save file space.
Function read.fwf provides a simple way to read such files, specifying
as whole lines, splits the resulting character strings, writes out
is adequate for small files, but for anything more complicated we recommend
using the facilities of a language like perl to pre-process
Function read.fortran is a similar function for fixed-format files, using
An old format sometimes used for spreadsheet-like data is DIF, or Data Interchange format.
Function read.DIF provides a simple way to read such files.
It takes arguments similar to read.table for assigning types to each of the columns.
On Windows, spreadsheet programs often store spreadsheet data copied to the
Both read.table and read.fwf use scan to read the file,
which specifies a list of modes of variables to be read from
If the list is named, the names are used for the components
returns a list with three components and discards the fourth column in the
you want is to read whole lines into R for further processing.
One common use of scan is to read in a large matrix.
matrix.dat just contains the numbers for a 200 x 2000 matrix.
On one test this took 1 second (under Linux, 3 seconds under Windows on the
reading 2000 separate short columns: were they of length 2000, scan
took 9 seconds whereas read.table took 18 if used efficiently
Note that timings can depend on the type read and the data.
and a million examples of a small set of codes:
Note that these timings depend heavily on the operating system (the basic
Sometimes spreadsheet data is in a compact format that gives the covariates
Consider the following sample of data from repeated MRI brain measurements
We can use stack to help manipulate these data to give a single response.
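The reshaping that stacking performs, turning several measurement columns into one response column plus an indicator of the source column, can be sketched in Python as follows. This is an illustration of the idea rather than R's `stack`; the column names and values are hypothetical and are not the MRI data the manual uses.

```python
def stack(columns):
    """Turn {column_name: [values]} into a list of (value, column_name) pairs.

    Each pair is one row of the 'long' layout: a single response value and
    a label recording which original column it came from.
    """
    return [(value, name) for name, values in columns.items() for value in values]


# Hypothetical wide data: one column per visit, one entry per subject.
wide = {"visit1": [1.0, 2.0], "visit2": [3.0, 4.0]}
long_rows = stack(wide)
```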
The reshape function has a more complicated syntax than stack
Some people prefer the tools in packages reshape, reshape2
Displaying higher-dimensional contingency tables in array form typically is
often represented in the form of bordered two-dimensional arrays with leading
rows and columns specifying the combination of factor levels corresponding
the obvious convention that rows are read from top to bottom and columns
As a simple example, consider the R standard data set UCBAdmissions
the six largest departments in 1973 classified by admission and sex.
The printed representation is clearly more useful than displaying the data
has additional arguments for dealing with variants on how exactly the
information on row and column variables names and levels is represented.
The flat tables can be converted to standard contingency tables
variables is given, one should instead use read.table to read in
the data, and create the contingency table from this using xtabs.
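Building a contingency table from one-record-per-case data amounts to counting each combination of factor levels. A minimal Python sketch of that cross-tabulation follows; it is not R's `xtabs`, and the records are made up for illustration (they are not the UCBAdmissions counts).

```python
from collections import Counter


def cross_tabulate(records):
    """Count (row_level, column_level) pairs into a nested dict table."""
    counts = Counter(records)
    row_levels = sorted({row for row, _ in records})
    col_levels = sorted({col for _, col in records})
    # Counter returns 0 for combinations that never occur.
    return {r: {c: counts[(r, c)] for c in col_levels} for r in row_levels}


# Hypothetical case-level records: (admission outcome, sex).
records = [
    ("Admitted", "Male"),
    ("Rejected", "Female"),
    ("Admitted", "Male"),
    ("Admitted", "Female"),
]
table = cross_tabulate(records)
```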
In this chapter we consider the problem of reading a binary data file written
In all cases the facilities described were written for data files from specific
versions of the other system (often in the early 2000s), and have
not necessarily been updated for the most recent versions of the other
The recommended package foreign provides import facilities for files
cases these functions may require substantially less memory than read.table
write.foreign (see Export to text files) provides an export mechanism with support currently for SAS,
EpiInfo versions 5 and 6 stored data in a self-describing fixed-width text
read.ssd can be used to create and run a SAS script that saves
4.x or 2000 on (32-bit) Unix or Windows (and can read them on a different
it can read vectors, matrices and data frames and lists containing
Function data.restore reads S-PLUS data dumps (created by data.dump)
to read in very large objects it may be preferable to use the dump file
Function read.spss can read files created by the ‘save’
with value labels are optionally converted to R factors.
By default it creates data files with extra formatting information
Some third-party applications claim to produce data ‘in SPSS format’
Stata .dta files are a binary file format.
to 12 of Stata can be read and written by functions read.dta and
data files (mtype = 1) written on little-endian machines
foreign can read in files in Octave text data format created
using the Octave command save -ascii, with support for most
of the common types of variables, including the standard atomic (real
and complex scalars, matrices, and N-d arrays, strings, ranges,
and boolean scalars and matrices) and recursive (structs, cells, and
all data being manipulated by R are resident in memory, and several
copies of the data can be created during execution of a function,
R is not well suited to extremely large data sets.
to run out of memory, particularly on a 32-bit operating system.
does not easily support concurrent access to data.
does support persistence of data, in that you can save a data object
or an entire worksheet from one session and restore it at the subsequent
session, but the format of the stored data is specific to R
The sort of statistical applications for which DBMS might be used are to extract
a 10% sample of the data, to cross-tabulate data to produce a multi-dimensional
contingency table, and to extract data group by group from
…), the former marked out by much greater emphasis on data security
There are other commonly used data sources, including spreadsheets, non-relational
All of the packages described later in this chapter provide clients to client/server
The more comprehensive R interfaces generate SQL behind the scenes
upper case, but many users will find it more convenient to use lower case
relational DBMS stores data as a database of tables (or relations)
character, date, currency, …) and rows or records
on a third column and asks the results be sorted.
a database join on two tables student and school
SELECT queries use FROM to select the table, WHERE to specify a condition
SELECT DISTINCT queries will only return one copy of each distinct row in
The GROUP BY clause selects subgroups of the rows according to the criterion.
or exclude groups depending on the aggregated value.
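The clauses described here can be tried against an in-memory SQLite database from Python's standard `sqlite3` module. This is only a sketch: the `student` table layout and its rows are hypothetical, and `HAVING` is assumed to be the clause that includes or excludes groups by aggregated value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, school TEXT, score REAL)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [
        ("Ann", "North", 90.0),
        ("Bob", "North", 70.0),
        ("Cy", "South", 80.0),
        ("Di", "South", 60.0),
        ("Ed", "East", 95.0),
        ("Flo", "North", 40.0),
    ],
)

# WHERE filters individual rows, GROUP BY forms one subgroup per school,
# HAVING keeps only groups whose aggregate passes the test, and
# ORDER BY sorts the surviving groups.
rows = conn.execute(
    "SELECT school, COUNT(*) AS n, AVG(score) AS mean_score "
    "FROM student "
    "WHERE score >= 60 "
    "GROUP BY school "
    "HAVING COUNT(*) >= 2 "
    "ORDER BY mean_score DESC"
).fetchall()
conn.close()
```

Here Flo is dropped by `WHERE` before grouping, while East is dropped by `HAVING` after grouping because only one of its rows remains.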
If the SELECT statement contains an ORDER BY statement that produces a unique
ordering, a LIMIT clause can be added to select (by number) a contiguous
There are queries to create a table (CREATE TABLE, but usually one copies
Data can be stored in a database in various data types.
for large blocks of text and binary data, respectively.
provide means to copy whole data frames to and from databases.
functions to select data within the database via SQL queries,
here applies to versions 0.5-0 and later: earlier versions
of names where the operating file system is case-sensitive, so not on
R data frame, mapping the row names of the data frame to the field row_names
Which file formats are supported depends on the versions of the
GUI allows a database to be selected via dialog boxes) which returns
sends a query to the database and saves the result as a table
in the database.) A finer level of control is attained by first calling
The latter can be used within a loop to retrieve a limited number
column and data frame names to lower case.
we created earlier, and had the DSN (data source name) set up
including descriptions, labels, formats, units, ….
HDF5 groups, and can write numeric and character vectors and matrices.
NetCDF’s version 4 format (confusingly, implemented in netCDF 4.1.1 and later,
which has a binary flat-file format that became popular, with file
logical and numeric fields, and other types in later versions (see
Functions read.dbf and write.dbf provide ways to read and write
particular class of binary files are those representing images, and a not
There are many formats for image files (most with lots of variants), and it
(2001), a set of functions to replace the use of file names by a flexible
Files compressed via the algorithm used by gzip can be used as connections
Unix programmers are used to dealing with special files stdin, stdout
R interface, stdin refers to the lines submitted from readline
The three terminal connections are always open, and cannot be opened or closed.
same place, but whereas normal output can be re-directed by a call to
sink, error output is sent to stderr unless re-directed by
a pipe connection for writing (it makes no sense to append to a pipe)
sink as writing to a file, possibly appending to a file if argument append = TRUE, and this is what they did prior to R version
The current behaviour is equivalent, but what actually happens is that when
the file argument is a character string, a file connection is
opened (for writing or appending) and closed again at the end of the function
efficient to explicitly declare and open the connection, and pass the
it possible to write to pipes, which was implemented earlier in a limited
way via the syntax file = "|cmd"
These take a character string argument and open a file
a file connection allows a file to be read sequentially in different
of lines of text can be pushed back onto a connection via a call to
Pushbacks operate as a stack, so a read request first uses each line from
lines pushed back can be found via a call to pushBackLength.
Pushback is only available for connections opened for input in text mode.
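The stack behaviour of pushbacks can be illustrated with a small Python class. This is a sketch of the concept under the assumption that the most recent pushback is consumed first and that lines within one pushback are read in the order given; the class and method names are hypothetical and are not R's connection API.

```python
class PushbackReader:
    """Line reader with a pushback stack in front of the underlying source."""

    def __init__(self, lines):
        self._source = iter(lines)
        self._pushed = []  # stack; the next line to read is at the end

    def push_back(self, lines):
        # Reverse so the first line of this call is the first one read back.
        self._pushed.extend(reversed(lines))

    def read_line(self):
        """Return the next line, or None when everything is exhausted."""
        if self._pushed:
            return self._pushed.pop()
        return next(self._source, None)

    def push_back_length(self):
        return len(self._pushed)


reader = PushbackReader(["x", "y"])
reader.push_back(["a", "b"])
reader.push_back(["c"])  # most recent pushback is read first
pending = reader.push_back_length()
seen = [reader.read_line() for _ in range(6)]
```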
summary of all the connections currently opened by the user can be found
The generic function seek can be used to read and (on some connections)
The function truncate can be used to truncate a file opened for writing
to the file as a stream of bytes exactly as it is represented in memory.
readBin reads a stream of bytes from the file and interprets them as
the maximum number of vector elements to read from the connection:
signed allows 1-byte and 2-byte integers to be read
The remaining two arguments are used to write or read data for interchange
for example to read 16-bit integers or write single-precision real numbers.
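Reading 16-bit integers and writing single-precision reals with an explicit size, signedness and byte order can be sketched with Python's standard `struct` module. This is an analogue of the idea, not the R functions themselves; the helper names and the sample bytes are made up for illustration.

```python
import struct


def write_floats32(values, endian="<"):
    """Pack numbers as 4-byte single-precision reals ('<' little, '>' big endian)."""
    return struct.pack(f"{endian}{len(values)}f", *values)


def read_int16(data, count, signed=True, endian="<"):
    """Interpret the first 2 * count bytes of `data` as 16-bit integers."""
    code = "h" if signed else "H"
    return list(struct.unpack_from(f"{endian}{count}{code}", data))


raw = bytes([0xFF, 0xFE, 0x00, 0x01])
little_signed = read_int16(raw, 2)                            # [-257, 256]
big_unsigned = read_int16(raw, 2, signed=False, endian=">")   # [65534, 1]

packed = write_floats32([1.0, 0.5])
restored = struct.unpack("<2f", packed)
```

The same four bytes yield different numbers depending on the declared signedness and byte order, which is why both must be known when exchanging binary data between systems.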
allows sizes 1, 2, 4, 8 for integers and logicals, and sizes 4,
arithmetic is used, so standard C facilities can be used to test for
Some limited facilities are available to exchange data at a lower level across
The earlier low-level interface is given by functions make.socket, read.socket,
The most common R data import/export question seems to be ‘how do I read an
- On Thursday, June 4, 2020
What is a HashTable Data Structure - Introduction to Hash Tables , Part 0
This tutorial is an introduction to hash tables. A hash table is a data structure that is used to implement an associative array. This video explains some of the ...
R tutorial: connecting to a database
Learn more about connecting to databases with R: Welcome to part two of importing data in R! The ..
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science Training | Edureka
This Edureka Random Forest tutorial will help you understand all the basics of Random Forest ..
REST API concepts and examples
This video introduces the viewer to some API concepts by making example calls to Facebook's Graph API, Google Maps' API, Instagram's Media Search API, ...
Bar Charts and Pie Charts in R (R Tutorial 2.1)
How to produce "bar charts" and "pie charts" in R, add titles, change axes labels, and many other modifications to these plots. This tutorial explains how to use ...
Python Programming Tutorial - How to Make a Stock Screener
This video teaches you how to create a stock screener based on any indicator you have built in Python. Don't know how to build indicators in Python?
Bubble sort algorithm
See complete series on sorting algorithms here: This series is in progress, ..
R in Bioinformatics (2 of 2)