AI News, Setting expectations in data science projects
Setting expectations in data science projects
How is it even possible to set expectations and launch data science projects?
That is, you may be called on to produce visualizations, analytics, data mining, statistics, machine learning, method research or method invention.
Given the wide range of wants, diverse data sources, required levels of innovation, and methods, it often feels like you cannot even set goals for data science projects.
To complete the expectation setting project we need reusable methods to set useful, realistic goals that really measure if a data science project is on track.
We outline a few methods to generate prior estimates for two of the important data science project measures: model performance and business utility.
This unknown result determines if the project even has a chance to succeed, so it makes sense to try and eliminate the hidden project risk it represents by determining success criteria as a separate project.
But this is good: nobody wants to start doomed projects; they instead want to know what changes to implement so that a successful project can be launched later.
You can in fact run data science projects as you would run any development project (all projects have risks and unknowns, so these problems are not in fact unique to data science projects).
Specific measurement, control and feedback in a data science project requires running a few cars across the bridge (but won’t require all lanes be ready at the start).
The expectation setting part of a data science project is to estimate how well a very good model would perform without paying the time and cost of producing the model.
Some ways to estimate this in advance are given here. What the business really needs to know is whether the promised increase in classifier performance leads to a desired increase in business (customers, revenue or anything else, as long as it is specific and measurable).
The first project itself should have a specific description like: “be confident in the estimates of the following measures of possible model quality, data availability and probable business impact by this date.”
Data Science For Business: 3 Reasons You Need To Learn The Expected Value Framework
One of the most difficult and most critical parts of implementing data science in business is quantifying the return-on-investment or ROI.
For those that are wondering what the threshold at max F1 is, it’s the threshold that harmonically balances the precision and recall (in other words, it optimally aims to reduce both the false positives and the false negatives finding a threshold that achieves a relative balance).
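To make the "threshold at max F1" idea concrete, here is a minimal Python sketch (the article itself works in R with H2O). The function names, the toy labels and scores, and the convention that a score at or above the threshold counts as a positive prediction are all assumptions made for illustration.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 score when every score >= threshold is predicted positive (1)."""
    pairs = list(zip(y_true, scores))
    tp = sum(1 for y, s in pairs if s >= threshold and y == 1)
    fp = sum(1 for y, s in pairs if s >= threshold and y == 0)
    fn = sum(1 for y, s in pairs if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


def max_f1_threshold(y_true, scores, thresholds):
    """Return the first candidate threshold with the highest F1."""
    return max(thresholds, key=lambda t: f1_at_threshold(y_true, scores, t))


# Hypothetical toy data: 1 = leaves, 0 = stays
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.80, 0.90]
thresholds = [i / 20 for i in range(1, 20)]

best = max_f1_threshold(y_true, scores, thresholds)
best_f1 = f1_at_threshold(y_true, scores, best)
```

Note that this criterion uses only error counts, not what each kind of error costs the business.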
Calculating Expected Attrition Cost From H2O + LIME Results
We develop a proposal to reduce overtime using our H2O classification model, which by default uses the threshold that maximizes F1 (treats Type 1 and Type 2 errors equally).
We end up misclassifying people that leave as stay (Type 2 error) at roughly the same rate as we misclassify people that stay as leave (Type 1 error).
Here lies the problem: the cost of incorrectly reducing overtime for someone who stays (Type 1 error) is only 30% of the cost of missing the opportunity to reduce overtime for an employee who is predicted to stay but leaves (Type 2 error).
When we have a calculation to determine the expected value using business costs, we can perform the calculation iteratively to find the optimal threshold that maximizes the expected profit or savings of the business problem.
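The iterative search can be sketched in a few lines of Python. This is not the article's R code: the toy data, the function names, and the normalized costs (a missed leaver costs 1.0, a wrongly targeted stayer costs 0.3, echoing the 30% ratio mentioned above) are assumptions for illustration.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total misclassification cost when score >= threshold means 'positive'."""
    pairs = list(zip(y_true, scores))
    fp = sum(1 for y, s in pairs if s >= threshold and y == 0)
    fn = sum(1 for y, s in pairs if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn


def best_threshold(y_true, scores, thresholds, cost_fp, cost_fn):
    """Try every candidate threshold and keep the cheapest one."""
    return min(
        thresholds,
        key=lambda t: total_cost(y_true, scores, t, cost_fp, cost_fn),
    )


# Hypothetical toy data: 1 = leaves, 0 = stays
y_true = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.35, 0.45, 0.55, 0.75, 0.90]
thresholds = [i / 20 for i in range(1, 20)]

# Equal error costs versus a false positive costing 30% of a false negative
symmetric = best_threshold(y_true, scores, thresholds, cost_fp=1.0, cost_fn=1.0)
asymmetric = best_threshold(y_true, scores, thresholds, cost_fp=0.3, cost_fn=1.0)
```

On this toy data the cost-aware threshold comes out lower than the equal-cost one, which is the same direction as the shift the article reports.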
In the example below, we can see in the threshold optimization results that the maximum savings ($546K) occurs at a threshold of 0.149, which is 16% more than the savings at the max F1 threshold ($470K).
It’s worth mentioning that the threshold that maximizes F1 was 0.280, and that, for a test set containing 15% of the total population, using it cost $76K in forgone savings ($546K - $470K).
In the human resources example below, we tested a range of values for average overtime percentage and net revenue per employee because our estimates for the future may be off.
In the Sensitivity Analysis Results shown below, we can see in the profitability heat map that as long as the average overtime percentage is less than or equal to 25%, implementing a targeted overtime policy saves the organization money.
Not only can we test for the optimal threshold that maximizes the business case, we can use expected value to test for a range of inputs that are variable from year to year and person to person.
Here’s a 20-minute tutorial on the Expected Value Framework that applies to the Data Science For Business (DS4B 201) course where we use H2O Automated Machine Learning to develop high performance models identifying those likely to leave and then the expected value framework to calculate the savings due to various policy changes.
A face-to-face meeting will cost your organization $5,000 for travel costs, time to develop a professional quotation, and time and materials involved in the sales meeting.
If our total spend for the face-to-face meeting and quotation process is $5,000, then it makes sense to take the meeting because that is lower than the expected benefit of $10,000.
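The arithmetic behind this decision is a probability times a payoff, compared against a cost. In the sketch below, the $5,000 cost and the $10,000 expected benefit come from the text; the 20% win probability and $50,000 deal value are hypothetical numbers chosen only because they multiply out to $10,000.

```python
def expected_benefit(p_win, value_if_won):
    """Expected value of an uncertain payoff: probability times value."""
    return p_win * value_if_won


def should_take_meeting(p_win, value_if_won, meeting_cost):
    """Take the meeting only if the expected benefit exceeds its cost."""
    return expected_benefit(p_win, value_if_won) > meeting_cost


# Hypothetical inputs (not from the article): 20% chance of a $50,000 sale
benefit = expected_benefit(0.20, 50_000)
decision = should_take_meeting(0.20, 50_000, meeting_cost=5_000)
```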
It enables us to combine: We can use this combination to target based on positive class probability (think employee flight risk quantified as a probability) to gain even greater expected savings than an “all-or-none” approach without the framework.
It’s the dividing line that we (the data scientist) select between which values for the positive class probability (Yes in the example shown below) we convert to a positive versus a negative.
For the Employee Churn problem, one way to think of the cost benefit is analyzing two states: an initial state (baseline) and a new state (after a policy change to reduce overtime).
Like the initial state, there are four scenarios we need to account for, each with probabilities of its own. The expected cost is $38K for this scenario. At an overtime percentage of 20%, the savings is negative (-$13K) versus the baseline.
Our first Data Science For Business Virtual Workshop teaches you how to solve this employee attrition problem in four courses that are fully integrated: The Virtual Workshop is code intensive (like these articles) but also teaches you fundamentals of data science consulting including CRISP-DM and the Business Science Problem Framework and many data science tools in an integrated fashion.
Sales Analytics: How to Use Machine Learning to Predict and Optimize Product Backorders
Sales, customer service, supply chain and logistics, manufacturing… no matter which department you’re in, you more than likely care about backorders.
We implement a special technique for dealing with unbalanced data sets called SMOTE (synthetic minority over-sampling technique) that improves modeling accuracy and efficiency (win-win).
It’s a fine line: Too much supply increases inventory costs while too little supply increases the risk that customers may cancel orders.
For most retailers and manufacturers, this strategy will drive inventory costs through the roof considering they likely have a large number of SKUs (unique product IDs).
A predictive analytics program can identify which products are most likely to experience backorders, giving the organization information and time to adjust.
If backorders are very infrequent but highly important, it can be very difficult to predict the minority class accurately because of the imbalance between backorders and non-backorders within the data set.
The challenge is to accurately predict future backorder risk using predictive analytics and machine learning and then to identify the optimal strategy for inventorying products with high backorder risk.
We can make a pre-process function that drops unnecessary columns, deals with NA values, converts Yes/No data to 1/0, and converts the target to a factor.
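A rough Python equivalent of such a pre-process function is sketched below (the article's own version is in R). The column names `sku` and `went_on_backorder`, the choice to fill missing values with 0, and the representation of rows as dictionaries are all assumptions; R's factor conversion has no direct counterpart here, so the target is simply left as a 0/1 label.

```python
def preprocess(rows, drop=("sku",), na_fill=0):
    """Clean a list of row dictionaries.

    - drops the columns named in `drop`
    - replaces missing values (None) with `na_fill`
    - converts "Yes"/"No" strings to 1/0
    """
    cleaned = []
    for row in rows:
        out = {}
        for key, value in row.items():
            if key in drop:
                continue
            if value is None:
                value = na_fill
            elif value in ("Yes", "No"):
                value = 1 if value == "Yes" else 0
            out[key] = value
        cleaned.append(out)
    return cleaned


# Hypothetical raw rows
raw = [
    {"sku": 101, "lead_time": None, "deck_risk": "Yes", "went_on_backorder": "No"},
    {"sku": 102, "lead_time": 8, "deck_risk": "No", "went_on_backorder": "Yes"},
]
clean = preprocess(raw)
```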
A step may be needed to transform (normalize, center, scale, apply PCA, etc.) the data, especially if using deep learning as the classification method.
To deal with this class imbalance, we’ll implement a technique called SMOTE (synthetic minority over-sampling technique), which oversamples the minority class by generating synthetic minority examples in the neighborhood of observed ones.
In other words, it shrinks the prevalence of the majority class (under sampling) while simultaneously synthetically increasing the prevalence of the minority class using a k-nearest neighbors approach (over sampling via knn).
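The over-sampling half of SMOTE can be sketched in plain Python as follows. This is a simplified illustration, not the R implementation the article uses: the function name, the toy points, and the fixed random seed are assumptions, and the majority-class under-sampling step is not shown.

```python
import math
import random


def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class feature vectors (two features each)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_oversample(minority, n_synthetic=6, k=2)
```

Because every synthetic point lies on a segment between two real minority points, the new examples stay inside the region the minority class already occupies.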
The great thing is that SMOTE can improve classifier performance (due to better classifier balance) and improve efficiency (due to smaller but focused training set) at the same time (win-win)!
We can also check the new balance of Yes/No: It’s now 43% Yes to 57% No, which in theory should enable the classifier to better detect relationships with the positive (Yes) class.
The Receiver Operating Characteristic (ROC) curve is a graphical method that pits the true positive rate (y-axis) against the false positive rate (x-axis).
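As a small illustration of how the curve and its area are built, the Python sketch below sweeps the cutoff from the highest score to the lowest and integrates with the trapezoid rule. The function names and toy data are assumptions, and the sketch assumes no two observations share the same score (ties would need to be grouped).

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, starting at
    (0, 0) and lowering the cutoff one observation at a time."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in sorted(zip(scores, y_true), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical toy data: 1 = backorder, 0 = no backorder
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.7, 0.9]

curve = roc_points(y_true, scores)
area = auc(curve)
```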
At AUC = 0.92, our automatic machine learning model is in the same ballpark as the Kaggle competitors, which is quite impressive considering the minimal effort to get to this point.
In the business context we need to decide what cutoff (threshold of probability to assign yes/no) to use: Is the “p1” cutoff >= 0.63 (the probability at or above which we predict “Yes”) adequate?
The answer lies in the balance between the cost of inventorying incorrect product (low precision) versus the cost of the lost customer (low recall). By shifting the cutoff, we can control the precision and recall, and this has a major effect on the business strategy.
The cost-benefit matrix is a business assessment of the cost and benefit for each of the four potential outcomes (true positive, true negative, false positive, false negative). The cost-benefit information is needed for each decision pair.
If, hypothetically, the value of a True Positive (benefit) is $400/unit in profit from correctly predicting a backorder, and the cost of a False Positive (accidentally inventorying an item that was not backordered) is $10/unit, then a data frame can be structured like so.
The expected value equation generalizes to the sum, over every possible outcome, of the probability of that outcome times its value. The general form isn’t very useful on its own, but from it we can create an Expected Profit equation using a basic rule of probability, p(x,y) = p(y)*p(x|y), that combines both the Expected Rates (a 2x2 matrix of probabilities obtained by normalizing the Confusion Matrix) and a Cost/Benefit Matrix (a 2x2 matrix with expected costs and benefits).
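Written out, the two forms could look as follows. The notation is an assumption made here for illustration: o_i are the possible outcomes, v their values, Y/N the predicted classes, p/n the actual classes, and b and c the benefit and cost entries of the cost/benefit matrix.

```latex
% General expected value: probability-weighted sum of outcome values
EV = \sum_{i} p(o_i)\, v(o_i)

% Expected profit, using p(x, y) = p(y)\, p(x \mid y) to factor out the class priors
E[\text{profit}] =
    p(\mathrm{p}) \left[ p(Y \mid \mathrm{p})\, b(Y, \mathrm{p}) + p(N \mid \mathrm{p})\, c(N, \mathrm{p}) \right]
  + p(\mathrm{n}) \left[ p(N \mid \mathrm{n})\, b(N, \mathrm{n}) + p(Y \mid \mathrm{n})\, c(Y, \mathrm{n}) \right]
```

Here p(Y | p) is the true positive rate and p(Y | n) is the false positive rate, so the bracketed terms are exactly the normalized confusion matrix weighted by the cost/benefit entries.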
We create a function to calculate the expected profit using the probability of a positive case (positive prior, p1), the cost/benefit of a true positive (cb_tp), and the cost/benefit of a false positive (cb_fp).
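A Python sketch of such a function is shown below (the article's version is in R). The argument names `p1`, `cb_tp` and `cb_fp` follow the text; `tpr` and `fpr` are added here for the rates at a given threshold, and valuing false negatives and true negatives at $0 is an assumption. The prior of 0.01 is a hypothetical low-probability item.

```python
def expected_profit(p1, tpr, fpr, cb_tp, cb_fp):
    """Expected profit per unit at one threshold.

    p1     : probability of the positive class (backorder)
    tpr    : true positive rate at the threshold
    fpr    : false positive rate at the threshold
    cb_tp  : benefit of a true positive (e.g. 400)
    cb_fp  : cost of a false positive, as a negative number (e.g. -10)

    Assumption: false negatives and true negatives are valued at 0.
    """
    p0 = 1.0 - p1
    return p1 * tpr * cb_tp + p0 * fpr * cb_fp


# Threshold = 0 predicts "Yes" for everything, so tpr = fpr = 1
inventory_all = expected_profit(0.01, tpr=1.0, fpr=1.0, cb_tp=400, cb_fp=-10)

# Threshold = 1 predicts "Yes" for nothing, so tpr = fpr = 0
inventory_nothing = expected_profit(0.01, tpr=0.0, fpr=0.0, cb_tp=400, cb_fp=-10)
```

With these hypothetical inputs, stocking everything loses a little under $6 per unit and stocking nothing breaks even.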
We’ll take advantage of the expected_rates data frame we previously created, which contains the true positive rate and false positive rate for each threshold (400 thresholds between 0 and 1).
Note that an “inventory all items” strategy (threshold = 0) would cause the company to lose money on items with a low probability of backorder (-$6/unit), while an “inventory nothing” strategy would result in no benefit but no loss ($0/unit).
Conversely, if we investigate a hypothetical item with a high probability of backorder, we can see that it’s much more advantageous to have a looser, less conservative inventory strategy.
Units with low probability of backorder (the majority class) will tend to increase the cutoff while units with high probability will tend to lower the cutoff.
In addition, we’ll include a backorder purchase quantity with logic of 100% safety stock (meaning items believed to be backordered will have an additional quantity purchased equal to that of the safety stock level).
We’ll investigate optimal stocking level for this subset of items to illustrate scaling the analysis to find the global optimized cutoff (threshold).
We then “extend” (multiply the unit expected profit by the backorder-prevention purchase quantity, which is 100% of safety stock level per our logic) to get total expected profit per unit.
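The "extend and sum" step can be sketched in Python as below. Everything numeric here is hypothetical: the five-row `expected_rates` table stands in for the 400-threshold data frame, and the two items (probability of backorder, cost/benefit values, safety stock) are invented to show how low- and high-probability items pull the global cutoff in opposite directions.

```python
def expected_profit(p1, tpr, fpr, cb_tp, cb_fp):
    """Unit expected profit; false/true negatives are assumed to be worth 0."""
    return p1 * tpr * cb_tp + (1.0 - p1) * fpr * cb_fp


# Hypothetical (threshold, true positive rate, false positive rate) rows
expected_rates = [
    (0.00, 1.00, 1.00),
    (0.25, 0.90, 0.30),
    (0.50, 0.70, 0.10),
    (0.75, 0.40, 0.02),
    (1.00, 0.00, 0.00),
]

# Hypothetical items: (p1, cb_tp, cb_fp, safety_stock)
items = [
    (0.01, 400, -10, 100),  # low probability of backorder
    (0.60, 400, -10, 20),   # high probability of backorder
]


def total_profit_by_threshold(items, expected_rates):
    """Extend each item's unit profit by its purchase quantity (100% of
    safety stock) and sum across items for every threshold."""
    totals = {}
    for threshold, tpr, fpr in expected_rates:
        totals[threshold] = sum(
            expected_profit(p1, tpr, fpr, cb_tp, cb_fp) * qty
            for p1, cb_tp, cb_fp, qty in items
        )
    return totals


totals = total_profit_by_threshold(items, expected_rates)
global_cutoff = max(totals, key=totals.get)
```

In this toy setup the low-probability item alone would prefer a cutoff of 0.50 and the high-probability item alone would prefer 0, so the summed curve peaks in between.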
We can visualize the expected profit curves for each item extended for backorder-prevention quantity to be purchased and sold (note that selling 100% is a simplifying assumption).
We spent a considerable amount of effort optimizing the cutoff (threshold) selection to maximize expected profit, which ultimately matters most to the bottom line.
You learn everything you need to know about how to apply data science in a business context: “If you’ve been looking for a program like this, I’m happy to say it’s finally here!
It’s why I created Business Science University.” Matt Dancho, Founder of Business Science
Did you know that an organization that loses 200 high performing employees per year is essentially losing $15M/year in lost productivity?
Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in HR 301
Our first Data Science For Business (HR 201) Virtual Workshop teaches you how to solve this employee attrition problem in four courses that are fully integrated: The Virtual Workshop is intended for intermediate and advanced R users.
Business Science works with clients primarily in small to medium size businesses, guiding these organizations in expanding predictive analytics while executing on ROI generating projects.
DS4B 201-R: Data Science For Business With R
It's an in-depth study of one churn / binary classification problem that goes into every facet of how to solve it.
You begin with the problem overview and tool introduction covering how employee churn affects the organization, our toolbox to combat the problem, and code setup.
Next, you prepare the data for both humans and machines with the goal of making sure you have good features prior to moving into modeling.
Next, you use the recipes package to create a “machine readable” processing pipeline that is used to create a pre-modeling correlation analysis visualization.
We then teach how to optimize the threshold using purrr for iteration to maximize expected savings of a targeted policy.
We then teach you Sensitivity Analysis again using purrr to show a heatmap that covers confidence ranges that you can explain to executives.
- On Saturday, December 7, 2019
How to Become a Data Scientist 2018 | Skills Required to Become a Data Scientist in 2018
Use forward and backward pass to determine project duration and critical path
Check out for more free engineering tutorials and math lessons! Project Management Tutorial: Use forward and backward pass to ..
Data Science Project
In my previous post we kicked-off the Global Super Store Project. We formed the questions we will try to answer and collected / cleansed the data we will be ...
Data Science & Machine Learning - Support Confidence Lift - Apriori- DIY- 36 -of-50
Do it yourself Tutorial by Bharati DW Consultancy ...
Elena Grewal: A data scientist is measured by the value of the problems she solves
Data science plays a critical role in shaping Airbnb's business strategies and helping it craft a more satisfying experience for its customers. Here's how the ...
Research Methodology (Part 1 of 3): 5 Steps, 4 Types and 7 Ethics in Research
This lecture by Dr. Manishika Jain explains the basics of research methodology, steps in scientific research, correlation and experimental research, variables ...
9 Cool Deep Learning Applications | Two Minute Papers #35
Machine learning provides us an incredible set of tools. If you have a difficult problem at hand, you don't need to hand craft an algorithm for it. It finds out by itself ...
Linear Regression Analysis | Linear Regression in Python | Machine Learning Algorithms | Simplilearn
This Linear Regression in Machine Learning video will help you understand the basics of Linear Regression algorithm - what is Linear Regression, why is it ...
Decision Tree Tutorial in 7 minutes with Decision Tree Analysis & Decision Tree Example (Basic)
Determine the Early Start (ES) and Early Finish (EF) of activities in a PDM network diagram
Check out for more free engineering tutorials and math lessons! Project Management Tutorial: Determine the ES and EF of ..