# AI News, How to Analyze Big Data with Excel

## How to Analyze Big Data with Excel

We, the marketers, should defend our role as strategic decision-makers by staying in control of the data analysis function, which we are losing to a new generation of software coders and data managers.

Perhaps the most appealing motive, from a career standpoint, is to reaffirm our value in the new world of highly engineered, relentlessly growing and often inflexible IT systems, filled with data that someone thinks could be very useful if only it were appropriately analyzed.

These programmers, specializing in different kinds of software, are in some cases already bypassing collaboration with the marketers and going straight into the development of applications used for business analytics purposes.

Suppose you have a large dataset, such as 20 million rows from visitors to your website, 200 million rows of tweets, or 2 billion rows of daily option prices. Suppose also that you want to investigate this data to search for associations, clusters, trends, differences or anything else that might be of interest to you.

The same concept applies to data records too, and in both cases there are at least three legitimate questions to ask.

For our example we will use a database holding 200,184,345 records containing data from the purchase orders of one product line of a given company over 12 months.

A sample of 66,327 randomly selected records can approximate the underlying characteristics of the dataset it comes from at the 99% confidence level and 0.5% error level.

The confidence level tells us that if we extract 100 random samples of 66,327 records each from the same population, about 99 of them can be expected to reproduce the underlying characteristics of the dataset they come from.

The 0.5% error level says the values we obtain should be read within a plus or minus 0.5% interval, for instance after transforming the records into contingency tables.
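The text does not say how the 66,327 figure was derived. The following is a minimal sketch, assuming Cochran's sample-size formula for a proportion with the most conservative share p = 0.5 and a finite population correction. Both choices, and the function name `sample_size`, are assumptions for illustration, not something the source states.

```python
import math
from statistics import NormalDist


def sample_size(population: int, confidence: float = 0.99,
                margin: float = 0.005, p: float = 0.5) -> int:
    """Cochran's sample size for a proportion, with finite population correction.

    Assumed formula, not taken from the article.
    """
    # Two-sided critical value of the standard normal (about 2.576 for 99%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Required size if the population were effectively infinite.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Shrink it slightly because the population is finite.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


n = sample_size(200_184_345)
print(n)
```

Under these assumptions the result lands very close to the sample size quoted above. Note how little the population size matters at this scale: the correction removes only a few dozen records from the infinite-population figure.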

To reduce the risk of extracting records biased by a lack of randomness, it is a good habit to sort the main list before extracting the sample, for instance alphabetically by the person's first name or by any other variable that is not directly related to the values of the variable(s) under study.

The records can be read as follows: Record number 1 (row 2) is a purchase order from North America, received in September 2007, concerning one single item priced at USD 13,159 and sold for USD 11,800.

Cell B1, for instance, tells us that, on average, one purchase order (a record) of the main dataset accounts for a sales “Volume” equal to 1.865 items with an “Average Sales” value of USD 10,418 and an “Average Discounted Sales” value of USD 5,841.

For sample values that depart severely from the control values in row 2, there is a high probability that a Z-test at the 99% probability level will capture the anomaly.

In this case too, the difference between the main dataset and the samples is quite small, and the Z-test (columns E:G) shows no evidence of bias, with the exception of slight deviations in Europe for samples 5, 8, 9, and 18-20.

However, because the former comes from a sample, we need to verify from a statistical point of view the probability that the difference between the two values is caused by a bias in the sampling method.

With small sample sizes (30), the 90% probability threshold can still be used, although this implies a higher risk of erroneously considering two values equal when in fact they are different.
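The exact test statistic used in the spreadsheet is not shown. As a minimal sketch, assuming a one-sample Z-test that compares a share observed in the sample against the known share in the main dataset, the check could look like this. The function name and the figures in the usage lines are hypothetical and are not taken from the article's data.

```python
import math
from statistics import NormalDist


def proportion_z_test(sample_share: float, population_share: float,
                      sample_size: int, confidence: float = 0.99):
    """Return (z, significant) for a sample share against a known population share."""
    # Standard error of a proportion, computed from the population value.
    se = math.sqrt(population_share * (1 - population_share) / sample_size)
    z = (sample_share - population_share) / se
    # Two-sided critical value for the chosen level (about 2.576 at 99%).
    critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z, abs(z) > critical


# Hypothetical shares: 40.0% in the main dataset, 40.2% and 41.0% in two samples.
z_small, flagged_small = proportion_z_test(0.402, 0.40, 66_327)
z_large, flagged_large = proportion_z_test(0.410, 0.40, 66_327)
print(round(z_small, 2), flagged_small)
print(round(z_large, 2), flagged_large)
```

Lowering `confidence` to 0.90 narrows the acceptance band, which is the trade-off the paragraph above describes for the choice of threshold.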

This also means that the shares of purchase orders coming from the three continents, as reproduced by the random samples, do not show evidence of dramatic differences outside the expected boundaries.

No sampled value differs from the corresponding value in the main dataset with a probability larger than 75%, and only a small number of values have a probability larger than 70%.

To verify whether this could have happened by chance, we repeated the test using two non-random samples: the first time taking the very first 66,327 records of the main dataset and the second time taking the very last 66,327 records.

## Random Samples in Excel Using the RAND Function

By Angela Henderson, Director of Institutional Research and Effectiveness, Stetson University

This tip provides two methods for selecting a simple random sample from an Excel data file using the RAND() function.

To prevent the random values from changing when the worksheet recalculates, select and copy all values in column E, then paste them back as values (Paste Special > Values). This replaces the formulas in column E with the calculated values and prevents the data from changing.

Select all the data columns, click on “Sort & Filter”, and sort the rows by the random values in column E.

For example, if your data file contains 1,000 records, simply select the first 250 records for a random sample of 25% of the population.
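The sort-based method amounts to attaching a random number to every record, sorting on that number and keeping the top rows. A minimal Python sketch of the same logic follows; it is an analogue of the spreadsheet steps rather than part of the original tip, and the function name and the `seed` parameter are assumptions added for reproducibility.

```python
import random


def sample_by_random_sort(records, fraction, seed=None):
    """Shuffle records with a random key and keep the first `fraction` of them."""
    rng = random.Random(seed)
    # Pair each record with a random number, like the RAND() helper column.
    keyed = [(rng.random(), index, record) for index, record in enumerate(records)]
    # Sort by the random key; the index breaks ties so records are never compared.
    keyed.sort()
    keep = round(len(records) * fraction)
    return [record for _, _, record in keyed[:keep]]


records = list(range(1, 1001))  # stand-in for a 1,000-row data file
sample = sample_by_random_sort(records, 0.25, seed=42)
print(len(sample))
```

Because the rows are taken from the top of a shuffled list, this approach always returns exactly the requested number of records and never selects the same record twice.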

If seeking a different sample size, for example 50%, simply change the value in the formula to reflect the desired proportion: =RAND()<0.50.

As in the first method, replace the formulas in column E with their calculated values so that the data does not change. To add filters to the columns, select all the data columns and click on “Sort & Filter”, then “Filter”.

Once filters are applied, click the filter arrow in the header of the TRUE/FALSE indicator column (column E) and uncheck FALSE and (Blanks) so only TRUE remains selected.
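The threshold method keeps each record independently whenever its random draw falls below the chosen proportion. A minimal Python sketch of that logic is shown here as an analogue of the TRUE/FALSE column and filter; the function name and the `seed` parameter are assumptions, not part of the original tip.

```python
import random


def sample_by_threshold(records, proportion, seed=None):
    """Keep each record with probability `proportion`, like =RAND()<proportion."""
    rng = random.Random(seed)
    # A record is kept when its draw is below the threshold (the TRUE rows).
    return [record for record in records if rng.random() < proportion]


records = list(range(10_000))  # stand-in for the rows of a data file
sample = sample_by_threshold(records, 0.25, seed=1)
print(len(sample))
```

Unlike the sort-based method, the resulting sample size is only approximately the requested share of the file, because every row is decided by its own random draw.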

## Subsetting Data

R has powerful indexing features for accessing object elements.

The following code snippets demonstrate ways to keep or delete variables and observations and to take random samples from a dataset.

In the following example, we select all rows that have a value of age greater than or equal to 20 or age less than 10.

    # using subset function
    newdata <- subset(mydata, age >= 20 | age < 10)

In the next example, we select all men over the age of 25 and we keep variables weight through income (weight, income and all columns between them).

    newdata <- subset(mydata, sex == "m" & age > 25, select = weight:income)
