# AI News: Why feature weights in a machine learning model are meaningless

- On Saturday, September 8, 2018

## Why feature weights in a machine learning model are meaningless

As I see our customers fall in love with BigQuery ML, an old problem rears its head: I find that they cannot resist the temptation to assign meaning to feature weights.

“The largest weight in my model to predict customer lifetime value,” they might remark, “is whether or not the customer received a thank you call from an executive.” Or they might look at negative weights and draw a dire conclusion: “Stores located in urban areas lead to negative satisfaction scores.” Please don’t do that.

Categorical variables, in other words, give the model a lot of leeway in how it assigns its weights. For example, we can move part of a weight into the “bias” term and create an equivalent model.
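As a minimal sketch of this point (the numbers are made up, not from the article): when a categorical feature is one-hot encoded, every row of its columns sums to 1, so shifting a constant from all of its weights into the bias leaves the predictions unchanged.

```python
import numpy as np

# One-hot encoding of a two-level categorical feature (e.g. urban vs. rural).
# Each row contains exactly one 1, so the columns always sum to 1 per row.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0]])

# Model A: arbitrary weights and bias.
w_a, b_a = np.array([2.0, 5.0]), 1.0

# Model B: subtract a constant from every weight and add it to the bias.
shift = 3.0
w_b, b_b = w_a - shift, b_a + shift

pred_a = X @ w_a + b_a
pred_b = X @ w_b + b_b
print(np.allclose(pred_a, pred_b))  # True: same predictions, different weights
```

Both models fit the data identically, yet their weights tell completely different "stories", which is exactly why reading meaning into individual weights is unsafe.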

Neither the magnitude of a feature's weight (“executive sales calls have a huge weight”) nor its sign (“urban stores lead to poor satisfaction”) should be used to draw conclusions.

- On Saturday, September 8, 2018

## 7 methods to perform Time Series forecasting (with Python codes)

Besides cryptocurrencies, there are many other important areas where time series forecasting is used, for example: forecasting sales, call volume in a call center, solar activity, ocean tides, stock market behaviour, and many others.

Assume the manager of a hotel wants to predict how many visitors to expect next year, so that he can adjust the hotel’s inventory accordingly and make a reasonable estimate of the hotel’s revenue.

We are provided with a Time Series problem involving prediction of number of commuters of JetRail, a new high speed rail service by Unicorn Investors.

We are provided with two years of data (Aug 2012–Sept 2014), and using this data we have to forecast the number of commuters for the next 7 months.

As seen from the print statements above, we are given two years of data (2012–2014) at the hourly level, with the number of commuters travelling, and we need to estimate the number of commuters for the future.

If we want to forecast the price for the next day, we can simply take the last day value and estimate the same value for the next day.
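This naive method can be sketched in a few lines (the prices below are made up for illustration):

```python
# Naive forecast: repeat the last observed value for every future step.
prices = [112.0, 118.0, 132.0, 129.0, 121.0]

def naive_forecast(series, horizon):
    """Forecast `horizon` steps ahead by repeating the last observation."""
    return [series[-1]] * horizon

print(naive_forecast(prices, 3))  # [121.0, 121.0, 121.0]
```

Despite its simplicity, the naive forecast is a useful baseline that more elaborate methods should beat.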

We can infer from the graph that the price of the coin is increasing and decreasing randomly by a small margin, such that the average remains constant.

Often we are given a dataset that varies by a small margin throughout its time period, while the average at each time period remains constant.

A forecasting technique that sets the expected value equal to the average of all previously observed points is called the Simple Average technique.
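A minimal sketch of the simple average technique, using the same made-up prices as above:

```python
# Simple average forecast: expected value equals the mean of all past points.
prices = [112.0, 118.0, 132.0, 129.0, 121.0]

def simple_average_forecast(series):
    """Forecast the next value as the mean of every observation so far."""
    return sum(series) / len(series)

print(simple_average_forecast(prices))  # 122.4
```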

We can infer from the graph that the prices of the coin increased by a big margin some time periods ago but are now stable. Often we are given a dataset in which the prices or sales of an object increased or decreased sharply some time periods ago.

Using a simple moving average model, we forecast the next value(s) in a time series based on the average of a fixed finite number ‘p’ of the previous values.
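A simple moving average forecast can be sketched like this (toy prices, window size ‘p’ = 3):

```python
# Moving average forecast: mean of the last p observations only, so old,
# no-longer-representative values drop out of the window.
prices = [112.0, 118.0, 132.0, 129.0, 121.0]

def moving_average_forecast(series, p):
    """Forecast the next value as the mean of the last p observations."""
    window = series[-p:]
    return sum(window) / p

print(moving_average_forecast(prices, 3))  # mean of the last 3 observations
```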

A weighted moving average is a moving average where within the sliding window values are given different weights, typically so that more recent points matter more.

For example, if we pick [0.40, 0.25, 0.20, 0.15] as weights, we would give 40%, 25%, 20% and 15% weight to the last 4 points respectively.
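Using those example weights, a weighted moving average sketch looks like this (again with made-up prices):

```python
# Weighted moving average: the most recent point gets the largest weight.
prices = [112.0, 118.0, 132.0, 129.0, 121.0]
weights = [0.40, 0.25, 0.20, 0.15]  # newest to oldest, summing to 1

def weighted_moving_average_forecast(series, weights):
    """Forecast the next value as a weighted mean of the most recent points,
    pairing weights with observations from newest backwards."""
    recent = series[::-1][:len(weights)]
    return sum(w * x for w, x in zip(weights, recent))

print(weighted_moving_average_forecast(prices, weights))
```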

Forecasts are calculated using weighted averages where the weights decrease exponentially as observations come from further in the past; the smallest weights are associated with the oldest observations.
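This is simple exponential smoothing; written recursively, each new level is a blend of the latest observation and the previous level, so older observations fade with factor (1 − α). A minimal sketch with a made-up series and α = 0.5:

```python
# Simple exponential smoothing: level(t) = alpha*y(t) + (1-alpha)*level(t-1).
# Observations k steps old implicitly receive weight alpha*(1-alpha)**k.
def ses_forecast(series, alpha):
    """Return the final smoothed level, used as the one-step-ahead forecast."""
    level = series[0]                 # initialise with the first observation
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

prices = [112.0, 118.0, 132.0, 129.0, 121.0]
print(ses_forecast(prices, alpha=0.5))  # 123.625
```

Larger α makes the forecast react faster to recent changes; smaller α smooths more aggressively.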

If we use any of the above methods, it won’t take into account this trend. Trend is the general pattern of prices that we observe over a period of time.

The naive method would assume that the trend between the last two points will stay the same; alternatively, we could average the slopes between all consecutive points to get an average trend, use a moving trend average, or apply exponential smoothing.

To express this in mathematical notation we now need three equations: one for the level, one for the trend, and one to combine the level and trend to get the expected forecast ŷ.

As with simple exponential smoothing, the level equation shows that the level is a weighted average of the observation and the within-sample one-step-ahead forecast. The trend equation shows that the trend is a weighted average of the estimated trend at time t, based on ℓ(t) − ℓ(t−1), and b(t−1), the previous estimate of the trend.

When the trend increases or decreases linearly, the additive equation is used, whereas when the trend increases or decreases exponentially, the multiplicative equation is used. Practice shows that the multiplicative method is a more stable predictor; the additive method, however, is simpler to understand.
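A minimal sketch of Holt’s additive (linear) trend method, with made-up demand data and illustrative smoothing parameters; a real implementation would fit α and β and initialise the states more carefully:

```python
def holt_linear_forecast(series, alpha, beta, horizon):
    """Holt's additive trend method: smooth a level and a trend,
    then extrapolate linearly for `horizon` steps ahead."""
    level = series[0]
    trend = series[1] - series[0]          # crude initial trend estimate
    for y in series[1:]:
        prev_level = level
        # Level: weighted average of the observation and the one-step forecast.
        level = alpha * y + (1 - alpha) * (level + trend)
        # Trend: weighted average of the latest level change and the old trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]

demand = [10.0, 12.0, 13.0, 15.0, 16.0, 18.0]
print(holt_linear_forecast(demand, alpha=0.8, beta=0.2, horizon=3))
```

Because the forecast extrapolates the final trend, successive forecast steps differ by a constant amount.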

Datasets which show a similar pattern after fixed intervals of a time period exhibit seasonality.

The idea behind triple exponential smoothing (Holt-Winters) is to apply exponential smoothing to the seasonal component in addition to the level and trend.

The Holt-Winters seasonal method comprises the forecast equation and three smoothing equations — one for the level ℓt, one for trend bt and one for the seasonal component denoted by st, with smoothing parameters α, β and γ.

Here s is the length of the seasonal cycle, for 0 ≤ α ≤ 1, 0 ≤ β ≤ 1 and 0 ≤ γ ≤ 1.

The level equation shows a weighted average between the seasonally adjusted observation and the non-seasonal forecast for time t.

The seasonal equation shows a weighted average between the current seasonal index, and the seasonal index of the same season last year (i.e., s time periods ago).

The additive method is preferred when the seasonal variations are roughly constant through the series, while the multiplicative method is preferred when the seasonal variations are changing proportional to the level of the series.
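The equations above can be sketched for the additive case as follows (toy data with period-4 seasonality; the initialisation is deliberately crude, and real code, such as statsmodels’ ExponentialSmoothing, fits the initial states and parameters properly):

```python
def holt_winters_additive(series, s, alpha, beta, gamma, horizon):
    """Additive Holt-Winters: smooth a level, a trend, and s seasonal indices.
    Requires at least two full seasons of data for the crude initialisation."""
    level = sum(series[:s]) / s
    trend = (sum(series[s:2 * s]) - sum(series[:s])) / (s * s)
    seasonal = [series[i] - level for i in range(s)]

    for t in range(s, len(series)):
        y = series[t]
        prev_level = level
        # Level: seasonally adjusted observation vs. non-seasonal forecast.
        level = alpha * (y - seasonal[t % s]) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        # Seasonal: current index vs. the index from one season ago.
        seasonal[t % s] = gamma * (y - level) + (1 - gamma) * seasonal[t % s]

    n = len(series)
    return [level + h * trend + seasonal[(n + h - 1) % s]
            for h in range(1, horizon + 1)]

# Toy series: rising level with a repeating low-to-high pattern of period 4.
data = [10, 20, 30, 40, 12, 22, 32, 42, 14, 24, 34, 44]
print(holt_winters_additive(data, s=4, alpha=0.3, beta=0.1, gamma=0.2, horizon=4))
```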

ARIMA stands for Autoregressive Integrated Moving Average. While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data.

I suggest you take different kinds of problem statements and take your time to solve them using the above-mentioned techniques.

You can also explore the forecast package, built for time series modelling in the R language, including its double-seasonality models.

- On Saturday, September 8, 2018

## Predicting Cryptocurrency Price With Tensorflow and Keras

Cryptocurrencies, especially Bitcoin, have been one of the top hits in social media and search engines recently.

Their high volatility creates great potential for high profit if intelligent investing strategies are used.

Since the original data ranges from 0 to over 10000, scaling is needed to make the data easier for the neural network to learn from.
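The article doesn’t show its preprocessing code; a minimal min-max scaling sketch (with made-up prices; the original may well use a library helper such as scikit-learn’s MinMaxScaler instead) could look like:

```python
import numpy as np

# Min-max scaling: map raw prices (0 .. 10000+) into [0, 1] so the
# network sees inputs on a comparable, bounded scale.
prices = np.array([6500.0, 7200.0, 10050.0, 9800.0, 8400.0])

lo, hi = prices.min(), prices.max()
scaled = (prices - lo) / (hi - lo)

# Invert the transform to read predictions back in the original units.
restored = scaled * (hi - lo) + lo
print(scaled.min(), scaled.max())  # 0.0 1.0
```

The same (lo, hi) pair fitted on the training split must be reused for the test split, otherwise the model is evaluated on a different scale than it was trained on.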

One callback helps me track all the training and validation progress, while the other stores the model’s weights after each epoch.

An LSTM is relatively easier to implement than a CNN, as you don’t need to worry about the relationships among kernel size, strides, input size and output size.

(This is also true in this blog, as the LSTM takes around 45 secs/epoch while the GRU takes less than 40 secs/epoch.) To switch, simply replace the second line of the model-building code in the LSTM version with a GRU layer. Since the result plots are similar for the three models, I will only show the CNN’s version.

The blue line in the graph below represents the ground truth (actual data), whereas the red dots represent the predicted Bitcoin price.

Each row of the above table is the model that achieves the best validation loss over the 100 training epochs.

However, the 4-layered CNN with Leaky ReLU as activation function yields a large validation loss; this may be due to a faulty model configuration, which might require re-validation.

The best model seems to be LSTM with tanh and Leaky ReLU as activation function, though 3-layered CNN seems to be better in capturing local temporal dependency of data.

To visualize the comparison, we can use a boxplot. According to the comparison, it seems that an L2 regularizer with coefficient 0.01 on the bias vector gives the best outcome.

To find the best combination among all the regularizers (activation, bias, kernel, and recurrent matrix), it would be necessary to test them one by one, which does not seem practical with my current hardware configuration.

Future work for this blog would be finding the best hyper-parameters for the best model, and possibly using social media data to help predict the trend more accurately.
