AI News, Useful Unix commands for data science

Useful Unix commands for data science

It took less than two minutes to run on the entire file - much faster than the other options, and in far fewer characters.

For those who are a bit newer to the command line than the rest of this post assumes, Hilary previously wrote a nice introduction to it.

I tend to avoid regular expressions when possible, but still find grep to be invaluable when searching through log files for a particular event.

There's an assortment of extra parameters you can use with grep, but the ones I tend to use the most are -i (ignore case), -r (recursively search directories), -B N (N lines before), -A N (N lines after).
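A minimal sketch of those flags in action (the log file name and its contents are invented for illustration):

```shell
# Create a small sample log (illustrative data only).
printf 'INFO start\nERROR disk full\nINFO retry\nerror timeout\n' > app.log

# -i ignores case, so both "ERROR" and "error" match.
grep -i 'error' app.log

# -B 1 / -A 1 also print one line of context before/after each match.
grep -B 1 -A 1 'disk full' app.log
```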

Sed is similar to grep and awk in many ways; however, I find that I most often use it when I need to do some find-and-replace magic on a very large file.
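A sketch of that kind of find-and-replace (the file name and contents are made up):

```shell
# Sample file; names and contents are invented for illustration.
printf 'foo bar\nfoo baz\n' > big.txt

# Replace every occurrence of "foo" with "qux". sed's -i flag would edit
# in place, but printing to stdout is safer while testing the expression.
sed 's/foo/qux/g' big.txt
# → qux bar
#   qux baz
```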

If a key isn't specified, sort will treat each line as a single concatenated string and sort on the entire line.
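A small illustration of the difference (the data file is invented); the -k option is how you sort on a particular field instead:

```shell
printf 'b 2\na 3\nc 1\n' > data.txt

# Without a key, sort compares each whole line as a string: a 3, b 2, c 1.
sort data.txt

# With -k2, sort compares from the second whitespace-separated field on,
# so the lines come out ordered by the numbers: c 1, b 2, a 3.
sort -k2 data.txt
```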

While it's sometimes difficult to remember all of the parameters for the Unix commands, getting familiar with them has been beneficial to my productivity and allowed me to avoid many headaches when working with large text files.


The Select-String cmdlet searches for text and text patterns in input strings and files. By default, Select-String finds the first match in each line and, for each match, displays the file name, line number, and all text in the line containing the match. However, you can direct it to detect multiple matches per line, display text before and after the match, or display only a Boolean value (True or False) that indicates whether a match is found.

Select-String uses regular expression matching, but it can also perform a simple match that searches the input for the text that you specify.

Example 2: Find matches in XML files only

This command searches through all files with the .xml file name extension in the current directory and displays the lines in those files that include the string "the the".

Example 3: Find a pattern match

This command searches the Windows PowerShell conceptual Help files (about_*.txt) for information about the use of the at sign (@).

Example 6: Find a string in subdirectories

This command examines all files in the subdirectories of C:\Windows\System32 with the .txt file name extension and searches for the string "Microsoft".

Example 7: Find strings that do not match a pattern

This command finds lines of text in the Process.txt file that do not include the words "idle"

The second command uses the Count property of object arrays to display the number of matches found, in this case, 2.

The third command uses array notation to indicate the first match (match 0 in a zero-based array), and it uses the Format-List cmdlet to display the value of the Context property as a list.

The Matches property of the first command contains just one match (that is, one System.Text.RegularExpressions.Match object), whereas the Matches property of the second command contains objects for both of the matches in the line.

How to Grep for Text in Files

If you need a more expressive regular expression syntax, grep is capable of accepting patterns in alternate formats via additional flags. Grep also provides a number of powerful options to control its output. In addition to reading content from files, grep can read and filter text from standard input.
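For instance, the -E and -F flags switch grep into extended-regex and fixed-string modes respectively (the sample input here is invented):

```shell
# -E enables extended regular expressions: c[ao]t matches "cat" and "cot".
printf 'cat\ncot\nc.t\n' | grep -E 'c[ao]t'

# -F treats the pattern as a fixed string, with no regex interpretation,
# so "c.t" matches only the literal line "c.t".
printf 'cat\ncot\nc.t\n' | grep -F 'c.t'
```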

For instance, grep can filter the output of the ls command’s help text, look for appearances of “dired”, and print the matching lines to standard out. While straightforward pattern matching is sufficient for some filtering tasks, the true power of grep lies in its ability to use regular expressions for complex pattern matching.
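The elided command was presumably along these lines (assuming GNU ls, whose --help output mentions its --dired option):

```shell
# grep reads standard input when no file is given, so it can filter
# the output of another command.
ls --help | grep dired
```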

Within a regular expression, most characters match themselves; however, there are some sequences that carry special significance. One popular use of grep is to extract useful information from system logs. For example, grep can filter an Apache access log for all lines that begin with an IP address, followed by a number of characters, a space, and then the characters 200 (where 200 represents a successful HTTP connection).
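One possible shape for such a filter, shown here against two fabricated log lines (the exact pattern from the original article is not preserved in this excerpt):

```shell
# Two made-up access-log lines, roughly in Apache's combined format.
printf '10.0.0.1 - - [01/Jan/2024] "GET / HTTP/1.1" 200 512\n' > access.log
printf '10.0.0.2 - - [01/Jan/2024] "GET /x HTTP/1.1" 404 0\n' >> access.log

# Keep lines that start with an IP address and contain a 200 status.
grep -E '^[0-9.]+ .* 200 ' access.log
```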

One such command searches the most recent /var/log/auth.log file for invalid login attempts, and it can be split into two layers to output a list of IP addresses with failed login attempts on your system. grep can also filter the output of commands such as tail -F to provide real-time monitoring of specific log events; in that usage, tail follows a log file (here, ~/procmail/procmail.log) and grep filters each new line as it arrives.
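A sketch of the two-layer idea against fabricated auth-log lines (the messages and field positions are illustrative, not the article's exact command):

```shell
# Fabricated auth-log entries for illustration only.
cat > auth.log <<'EOF'
Jan  1 sshd[1]: Failed password for invalid user bob from 203.0.113.7 port 22 ssh2
Jan  1 sshd[2]: Accepted password for alice from 192.0.2.9 port 22 ssh2
Jan  1 sshd[3]: Failed password for invalid user eve from 203.0.113.7 port 22 ssh2
EOF

# First layer: keep only the failed attempts.
grep 'Failed password' auth.log

# Second layer: extract the IP addresses, then count duplicates.
grep 'Failed password' auth.log |
    grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | sort | uniq -c

# For live monitoring, the same filter can follow a growing log:
#   tail -F /var/log/auth.log | grep 'Failed password'
```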

A similar command filters the tar help text to more efficiently find the options for dealing with bzip files, and grep is also useful for filtering the output of ls when listing the contents of directories with a large number of files. The zgrep command functions identically to the grep command above, except that it operates on gzip-compressed files.
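A small zgrep illustration (file name and contents invented):

```shell
printf 'alpha\nbeta\n' > words.txt
gzip -f words.txt            # produces words.txt.gz

# zgrep searches inside the compressed file just as grep would on
# the uncompressed text.
zgrep 'beta' words.txt.gz
```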

Linux grep command

grep, which stands for 'global regular expression print,' processes text line by line and prints any lines which match a specified pattern.

Grep is a powerful tool for matching a regular expression against text in a file, multiple files, or a stream of input.

The line is longer than our terminal width so the text wraps around to the following lines, but this output corresponds to exactly one line in our FILE.

In the above example, all the characters we used (letters and a space) are interpreted literally in regular expressions, so only the exact phrase will be matched.

When the command is executed, the shell will expand the asterisk to the name of any file it finds (within the current directory) which ends in '.html'.
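For example (file names and contents invented; note it is the shell, not grep, that expands the glob):

```shell
mkdir -p pages
printf '<p>hope</p>\n' > pages/a.html
printf '<p>none</p>\n' > pages/b.html

# The shell expands pages/*.html to both file names before grep runs,
# so grep receives multiple files and prefixes each match with its file.
grep 'hope' pages/*.html
```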

Recursively searching subdirectories

We can extend our search to subdirectories and any files they contain using the -r option, which tells grep to perform its search recursively.
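A minimal recursive-search sketch (the directory layout is invented):

```shell
mkdir -p site/sub
printf 'hope\n' > site/sub/page.txt

# -r descends into site/ and any nested directories.
grep -r 'hope' site
```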

Using regular expressions to perform more powerful searches

The true power of grep is that it can be used to match regular expressions.

Grep is a powerful tool that can help you work with text files, and it gets even more powerful when you become comfortable using regular expressions.

grep searches the named input FILEs (or standard input if no files are named, or if a single dash ('-') is given as the file name) for lines containing a match to the given PATTERN.

Also, three variant programs egrep, fgrep and rgrep are available. In older operating systems, egrep, fgrep and rgrep were distinct programs with their own executables.

Within a bracket expression, a range expression such as [a-z] matches any single character that sorts between the two characters, inclusive, using the locale's collating sequence and character set.

Certain named classes of characters are predefined within bracket expressions. Their names are self-explanatory: [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:].

(Note that the brackets in these class names are part of the symbolic names, and must be included in addition to the brackets delimiting the bracket expression.) Most meta-characters lose their special meaning inside bracket expressions.

The caret ^ and the dollar sign $ are meta-characters that respectively match the empty string at the beginning and end of a line.

The symbol \b matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the edge of a word.
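A quick illustration of \b (the sample file is invented; \b is a GNU grep extension):

```shell
printf 'cart\ncar\nscar\n' > cars.txt

# \b anchors the match to word edges, so only the standalone word
# "car" matches; "cart" and "scar" do not.
grep '\bcar\b' cars.txt
```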

The back-reference \n, where n is a single digit, matches the substring previously matched by the nth parenthesized subexpression of the regular expression.
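A back-reference sketch in basic regular expression syntax (the sample file is invented):

```shell
printf 'abcabc\nabcdef\n' > pairs.txt

# \(abc\) is parenthesized subexpression 1; \1 matches whatever that
# group matched, so this finds lines where "abc" appears twice in a row.
grep '\(abc\)\1' pairs.txt
```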

In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).

Traditional versions of egrep did not support the { meta-character, and some egrep implementations support \{ instead, so portable scripts should avoid { in grep -E patterns and should use [{] to match a literal {.

GNU grep -E attempts to support traditional usage by assuming that { is not special if it would be the start of an invalid interval specification.

For example, the command grep -E '{1' searches for the two-character string {1 instead of reporting a syntax error in the regular expression.

The C locale is used if none of these environment variables are set, if the locale catalog is not installed, or if grep was not compiled with national language support (NLS).

Display the filenames (but not the matching lines themselves) of any files in /www/ (but not its subdirectories) whose contents include the string 'hope'.
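The command being described is presumably of this form; the sketch below uses a local www/ directory in place of /www/:

```shell
mkdir -p www
printf 'hope springs\n' > www/a.txt
printf 'nothing here\n' > www/b.txt

# -l prints only the names of files containing a match, not the matching
# lines; the glob stays at the top level, so subdirectories are skipped.
grep -l 'hope' www/*
```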

BASH scripting lesson 10 working with CSV files

We now have some more great fun and see how much we can use the shell for: creating reports ...

The param column extracting function

On Sept 28th, 2012 I posted a function and alias generating loop created by @brimston3 on Twitter. This video explains how it works. I missed this link when ...

Becoming a Command Line Expert with the AWS CLI (TLS304) | AWS re:Invent 2013

The AWS CLI is a command line interface that allows you to control the full set of AWS services. You learn how to perform quick ad hoc service operations, and ...

RailsConf 2017: Practical Debugging by Kevin Deisz

People give Ruby a bad reputation for speed, efficiency, weak typing, etc. But one of the biggest benefits of ...

Building Dynamic Websites at Harvard - Lecture 8

Section 1: More Comfortable