This is a continuously updated repository that documents a personal journey of learning data science and machine learning related topics.

For those interested, there's also a PySpark RDD cheatsheet and a PySpark DataFrame cheatsheet that may come in handy.

Choosing the optimal cutoff value for logistic regression when mistakes are cost-sensitive (that is, when the cost of misclassification differs between the two classes) and the dataset consists of unbalanced binary classes.

Unbalanced here means that the majority of the data points in the dataset have a positive outcome while only a few have a negative one, or vice versa.

The notion can be extended to any other classification algorithm that can predict class probabilities; this documentation just uses logistic regression for illustration purposes.
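
As a rough illustration of the idea, the sketch below picks the cutoff that minimizes total misclassification cost on a small, made-up set of labels and predicted probabilities. It is not the code from the documentation itself: the function names `total_cost` and `best_cutoff`, the parameters `cost_fp` and `cost_fn`, and the toy data are all assumptions made for this example, and the probabilities could come from any model that outputs them.

```python
def total_cost(y_true, y_prob, cutoff, cost_fp, cost_fn):
    """Total misclassification cost when predicting 1 for prob >= cutoff."""
    cost = 0.0
    for label, prob in zip(y_true, y_prob):
        pred = 1 if prob >= cutoff else 0
        if pred == 1 and label == 0:
            cost += cost_fp  # false positive
        elif pred == 0 and label == 1:
            cost += cost_fn  # false negative
    return cost


def best_cutoff(y_true, y_prob, cost_fp=1.0, cost_fn=1.0):
    """Return the candidate cutoff with the lowest total cost (hypothetical helper)."""
    # Candidates: every distinct predicted probability, plus a value above 1
    # so that "predict everything as negative" is also considered.
    candidates = sorted(set(y_prob)) + [1.01]
    return min(
        candidates,
        key=lambda c: total_cost(y_true, y_prob, c, cost_fp, cost_fn),
    )


# Toy, made-up data: 8 negatives and 2 positives (unbalanced classes).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.1, 0.15, 0.2, 0.3, 0.35, 0.45, 0.6, 0.4, 0.7]

# Equal costs favor a high cutoff; a costly false negative pulls it down.
print(best_cutoff(y_true, y_prob, cost_fp=1, cost_fn=1))  # 0.7
print(best_cutoff(y_true, y_prob, cost_fp=1, cost_fn=5))  # 0.4
```

In this toy example, making a false negative five times as costly as a false positive moves the chosen cutoff from 0.7 down to 0.4, so the rare positive class is caught at the price of two extra false positives. In practice the cutoff would be chosen on held-out data rather than on the data used to fit the model.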

