Data Science Cheat Sheet

Even if you work heavily on the cloud (AWS, or in my case, a few remote servers used mostly to store data, receive data from clients, and keep backups), your laptop is your core device for connecting to all external services (via the Internet).

When processing such a file (fortunately, they are rather rare), you'll first need to clean it and standardize it to traditional ASCII (one byte = one character).  The best text format that you can use is tab-separated: each column or field is separated by a TAB, an invisible character represented by \t in many programming languages.

The reason is that some fields contain commas, and thus using csv (comma-separated text files) results in broken fields and data that looks like garbage, and is hard to process (requiring a laborious cleaning step first, or talking to your client to receive tab-separated format instead).
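Python's standard csv module makes the difference concrete; a minimal sketch (the sample field values are made up):

```python
import csv
import io

# A field containing a comma breaks naive comma-splitting,
# but survives intact in a tab-separated file.
row = ["user123", "Smith, John", "42"]

buf = io.StringIO()
csv.writer(buf, delimiter="\t").writerow(row)

# Reading the TSV back recovers exactly the original three fields.
parsed = next(csv.reader(io.StringIO(buf.getvalue()), delimiter="\t"))
print(parsed)
```

The comma inside "Smith, John" stays inside one field, with no quoting or cleaning step needed.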

File management

Filenames should be designed carefully (no spaces or special characters in a filename), especially when you have thousands or millions of files across thousands of directories and sub-directories, and across dozens of servers (the cloud).

It has never been an issue on Windows for me, but on a true UNIX operating system (not Cygwin), you might need to set the right permissions: for example, Perl scripts (despite being text files) must be made executable with the UNIX command chmod 755, applied to your Perl script.
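If you prefer doing this programmatically, Python's os.chmod can set the same permission bits; a minimal sketch (the script name is hypothetical):

```python
import os
import stat
import tempfile

# Create a throwaway script file (hypothetical name) and mark it executable,
# the programmatic equivalent of `chmod 755 script.pl`.
path = os.path.join(tempfile.mkdtemp(), "script.pl")
with open(path, "w") as f:
    f.write('print "hello\\n";\n')

os.chmod(path, 0o755)  # rwxr-xr-x, same as chmod 755

mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))
```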

Scripting language

You can get started in data science with just a few Unix commands, a tool for statistical analyses such as R (unless you write your own algorithms to get more robust and simple tools), and a scripting programming language such as Perl or Python.

Our choice of Perl in this tutorial is based on its ease of use. In short, it's a great language for learning data science, though not so great if you work in a big team and have to share, integrate and update various pieces of Perl code from various coders.

Perl used to be the only language with great string processing functions, and able to handle regular expressions easily - an advantage over other languages, for text processing or text mining.

This can potentially slow down execution a little bit, but in my experience, most of what I developed in Perl runs 10 to 100 times faster (without loss of accuracy) than what I've seen in the corporate world, mostly thanks to developing better algorithms and using fewer (but better, more predictive) metrics, and fewer observations (samples).

These algorithms are listed at the bottom of this article; an example (in the context of feature selection) is testing dozens of features at once rather than in parallel, using smaller samples thanks to better use of data science.

Core elements of scripting languages

Some basic constructs are used in pretty much any program. The easiest way to learn how to code is to look at simple, well-written sample programs of increasing complexity, and to become an expert at Google search to find solutions to coding questions - many answers can be found on StackOverflow.

There's some basic string processing here; for instance, $ip=~s/\n//g substitutes each carriage return / line feed (the special character \n) with nothing (the empty string) in the variable $ip.
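A rough Python equivalent of that Perl substitution, using the re module (the sample IP value is made up):

```python
import re

ip = "151.193.220.223\n"
# Equivalent of Perl's $ip =~ s/\n//g: replace every newline with nothing.
ip = re.sub(r"\n", "", ip)
print(repr(ip))
```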

Now, you can download big logfiles for free (see section 10), extract IP addresses and traffic statistics per IP address, and run the above script (using a distributed architecture, with 20 copies of your script running on your laptop) to extract domain names attached to IP addresses.
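The IP-extraction-and-counting step can be sketched in Python with a regular expression over a few made-up log lines (the log format here is hypothetical):

```python
import re
from collections import Counter

# Hypothetical web-log lines; real logs have many more fields.
log_lines = [
    '66.249.66.1 - - [01/Jan/2015] "GET /index.html"',
    '66.249.66.1 - - [01/Jan/2015] "GET /about.html"',
    '157.55.39.84 - - [01/Jan/2015] "GET /index.html"',
]

# Match four dot-separated groups of 1-3 digits (a loose IPv4 pattern).
ip_pattern = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

traffic = Counter()  # hits per IP address
for line in log_lines:
    match = ip_pattern.search(line)
    if match:
        traffic[match.group()] += 1

print(traffic.most_common())
```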

Exercise

Write a Perl script that accesses all the text files on your laptop, then counts the number of occurrences of each word (broken down by file creation year) across these files, using a hash table.
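A hash-table word count of this kind can be sketched in Python with nested dictionaries (the file contents and years are made up):

```python
from collections import defaultdict

# Hypothetical (year, text) pairs standing in for files and their creation years.
files = [
    (2013, "the quick brown fox"),
    (2014, "the lazy dog"),
]

# Nested hash table: counts[year][word] -> number of occurrences.
counts = defaultdict(lambda: defaultdict(int))
for year, text in files:
    for word in text.split():
        counts[year][word] += 1

print(counts[2013]["the"], counts[2014]["the"])
```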

Also check this list of references; many are about R or Python, including analytic libraries such as Pandas (Python).

Hadoop

Hadoop is a file management system used to perform tasks in a distributed environment, across multiple servers if necessary, by splitting files into sub-files, performing the analysis on each sub-file separately, and summarizing the results (by collecting the various outputs associated with each sub-file and putting them together).
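The split/analyze/summarize pattern can be sketched in plain Python as a toy word count: each chunk is processed independently (as if on a separate server), then the partial results are merged (the input strings are made up):

```python
from collections import defaultdict

# Toy MapReduce-style word count over two "sub-files".
chunks = ["to be or", "not to be"]

def map_chunk(chunk):
    """Analyze one sub-file independently: count its words."""
    partial = defaultdict(int)
    for word in chunk.split():
        partial[word] += 1
    return partial

# Map step: could run on separate servers; here, a simple loop.
partials = [map_chunk(c) for c in chunks]

# Reduce step: collect the various outputs and put them together.
totals = defaultdict(int)
for partial in partials:
    for word, n in partial.items():
        totals[word] += n

print(dict(totals))
```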

The following articles are starting points for understanding Hadoop:

SQL

Finally, don't forget that SQL is still a widely used language that we should all know: at least the basics, up to joining multiple tables efficiently and playing with indexes and keys.
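A minimal join-plus-aggregate example, sketched with Python's built-in sqlite3 module (the table and column names are hypothetical):

```python
import sqlite3

# In-memory database to practice a basic join.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders  (id INTEGER PRIMARY KEY, client_id INTEGER, amount REAL);
    INSERT INTO clients VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders  VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 75.0);
""")

# Join the two tables on the key, then aggregate per client.
rows = con.execute("""
    SELECT c.name, SUM(o.amount)
    FROM clients c
    JOIN orders o ON o.client_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)
```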

It will focus on some advanced Excel functions such as LINEST (linear regression), VLOOKUP, quantiles, ranks, and random numbers, and on some data science applications that can easily be performed with Excel, for instance the following analyses (offered with a nice Excel spreadsheet).  Also, a list of articles about data science with Excel can be found here.
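For comparison, the fit that Excel's LINEST performs is ordinary least squares, which can be sketched in a few lines of plain Python (the sample points are made up):

```python
# Fit y = slope * x + intercept by ordinary least squares,
# the same calculation Excel's LINEST performs for one predictor.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.1, 8.0]  # roughly y = 2x

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# slope = covariance(x, y) / variance(x)
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

print(slope, intercept)
```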

Installed files

When a package is installed, everything in inst/ is copied into the top-level package directory.

This means that you should avoid inst/build, inst/data, inst/demo, inst/exec, inst/help, inst/html, inst/inst, inst/libs, inst/Meta, inst/man, inst/po, inst/R, inst/src, inst/tests, inst/tools and inst/vignettes.

Calling citation() without any arguments tells you how to cite base R. Calling it with a package name tells you how to cite that package. To customise the citation for your package, add an inst/CITATION file that looks like this:

(This field is meant to be read by humans, so don’t worry about exactly how you specify it.) Java is a special case because you need to include both the source code (which should go in java/ and be listed in .Rinstignore) and the compiled jar files (which should go in inst/java).

Reading and Importing Excel Files into R

After saving your data set in Excel and making some adjustments to your workspace, you can finally start with the real importing of your file into R!

This means that you can also just write the file’s name as an argument of the read.table() function without specifying the file’s location, just like this: Note that the field separator character for this function defaults to "", which stands for white space: one or more spaces, tabs, newlines or carriage returns.

You can easily indicate this by adding the sep argument to the read.table() function: The strip.white argument allows you to indicate whether you want the white spaces from unquoted character fields stripped.

You see that the extra white space before the class BEST in the second row has been removed, that the columns are perfectly separated thanks to the sep argument, and that the empty value, denoted with “EMPTY” in row three, was replaced with NA.

Note that if you ever come across a warning like “incomplete final line found by readTableHeader on…”, try adding an End Of Line (EOL) character by moving your cursor to the end of the last line in your file and pressing enter.

To read .csv files that use a comma as separator symbol, you can use the read.csv() function, like this: Note that the quote argument denotes which symbol your file uses for quotes: in the command above, you pass \" to indicate that values are quoted with double quotes.

The command above imports the following data set: You see that the columns and rows are given names through the col.names and row.names arguments, that all fields are clearly separated, with the third unequal row filled in with a blank field, thanks to fill = TRUE.
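For comparison, Python's standard csv module follows the same separator-and-quote conventions; a minimal sketch (the data is made up):

```python
import csv
import io

# CSV text where a quoted field contains the separator itself.
raw = 'name,score\n"Smith, John",10\n"Doe, Jane",12\n'

# delimiter plays the role of sep, quotechar the role of quote.
rows = list(csv.reader(io.StringIO(raw), delimiter=",", quotechar='"'))
header, data = rows[0], rows[1:]
print(header, data)
```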

Remember that they are also almost identical to the read.table() function, except for the fact that they assume that the first line that is being read in is a header with the attribute names, while they use a tab as a separator instead of a whitespace, comma or semicolon.

Lastly, the as.is argument is used to suppress factor conversion for a subset of the variables in your data, if they weren’t otherwise specified: just supply the argument with a vector of indices of the columns that you don’t want to convert, as in the command above, or give it a logical vector with a length equal to the number of columns that are read.

In this case, the data set has five columns: the first two of type “integer” (replicating the class “integer” twice), the third of “date”, the fourth of “numeric” and, lastly, the fifth of “character”.

The read.delim2() function that was defined above was applied to the following data set, which you also used in the exercise above: However, you will get an error when you try to force the third column to be read in as a date.

This is why it is better to first read the column in as a character, by replacing “date” with “character” in the colClasses argument, and then run the following command: Note that the as.POSIXct() function allows you to specify your own format, in case you decided to use a specific time and date notation, just like in the data set above.
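The same explicit-format idea exists in Python's datetime.strptime; a minimal sketch (the date string and format are made up):

```python
from datetime import datetime

# Analogue of as.POSIXct(x, format = ...): parse a custom
# date/time notation by spelling out its format explicitly.
raw = "2014/07/31 22:15"
parsed = datetime.strptime(raw, "%Y/%m/%d %H:%M")
print(parsed.year, parsed.month, parsed.hour)
```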

Why You Need Perl/Python If You Know R/Shell [NGS Data Analysis]

In my opinion, you must learn Perl/Python/Ruby/(add your favorite here) to stay ahead of the data deluge that never ends if you work in a large lab.

In all of these cases, you would start with a search for the Bio* package containing the methods you need, and then decide if you have to extend that functionality or write something from scratch if it doesn't exist.

The main reasons for learning one of these scripting languages (in a few words) are that you have direct programmatic access to local and remote databases and analysis tools, and a large user community that has already solved most of the common tasks.

R tutorial: connecting to a database

Learn more about connecting to databases with R: welcome to part two of importing data in R!

R Tutorial: How to Read an Excel file into R

Regular Expressions (Regex) Tutorial: How to Match Any Pattern of Text

In this regular expressions (regex) tutorial, we're going to be learning how to match patterns of text. Regular expressions are extremely useful for matching ...

Beginner Perl Maven tutorial: 13.6 - Reading Excel file in Perl


Perl part 4: Regular Expressions

Dr. Rob Edwards from San Diego State University discusses an introduction to using regular expressions in Perl.



Perl Data Language

NLPW::2014 Nederlandse Perl Workshop, 25-April, Utrecht by: Jan Hoogenraad.

StartR 01-Install R for Windows and ActivePerl

On a Windows 8 system, this demonstrates the R installation, basic usage, as well as the importation of an Excel spreadsheet using the gdata package's read.xls ...

StartR 02-Editors for R in Windows: Emacs (ESS), RStudio, Notepad++

Installing and using programmer's file editors to interact with an R session. Demonstrates the installation, configuration, and usage to offer the viewer a clear ...

Learn Perl 5 By Doing It : Writing Files and Replacing Text

Learn Perl by actually creating useful, working Perl programs for everything.