Machine Learning for Hackers Chapter 4: Priority e-mail ranking

For example, here’s the code in my chapter for a program that identifies e-mails that are part of a thread by looking for “re:”-like prefixes on the subjects.

In thread_flag, if the input is a pandas Series of e-mail subject lines, the function uses the vectorized string method .str.contains() to check whether each subject contains a pattern matching a reply-type prefix.
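
As a rough illustration of that behavior, here is a minimal sketch of a thread_flag-style function. The regular expression is an assumption (it only looks for a leading “re:”, case-insensitively), and the handling of single strings is a guess at how a non-Series input might be treated, so this is not necessarily the pattern or logic used in the chapter’s code.

```python
import re
import pandas as pd

# Assumed pattern: the subject starts with "re:" (any case, optional spaces).
# A non-capturing group keeps pandas from warning about match groups.
REPLY_PATTERN = r"^\s*(?:re)\s*:"

def thread_flag(subjects):
    """Return True where a subject looks like a reply."""
    if isinstance(subjects, pd.Series):
        # Vectorized check over every subject; missing subjects count as False.
        return subjects.str.contains(REPLY_PATTERN, case=False, regex=True, na=False)
    # Single-string fallback using the re module.
    return re.search(REPLY_PATTERN, subjects, flags=re.IGNORECASE) is not None
```

Calling thread_flag on a Series returns a boolean Series aligned with the input, which can be used directly as a mask or stored as a new column.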

The function clean_subjects, if given a pandas Series input, will use the vectorized string methods .str.replace() and .str.strip() to clean the re- and fwd-like patterns out of the subjects.
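
A sketch of what a clean_subjects-style function could look like follows. The prefix pattern (any run of leading “re:”, “fw:”, or “fwd:” markers) is an assumption made for illustration, not necessarily the pattern in the chapter’s code.

```python
import re
import pandas as pd

# Assumed pattern: one or more leading "re:", "fw:" or "fwd:" markers.
PREFIX_PATTERN = r"^\s*(?:(?:re|fwd?)\s*:\s*)+"

def clean_subjects(subjects):
    """Strip reply/forward prefixes and surrounding whitespace."""
    if isinstance(subjects, pd.Series):
        # Vectorized regex replacement, then vectorized whitespace trimming.
        cleaned = subjects.str.replace(PREFIX_PATTERN, "", case=False, regex=True)
        return cleaned.str.strip()
    # Single-string fallback using re.sub and str.strip.
    return re.sub(PREFIX_PATTERN, "", subjects, flags=re.IGNORECASE).strip()
```

Passing regex=True explicitly avoids depending on the default, which has differed between pandas versions.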

Notice there are some differences between the naming of pandas string methods and the base string methods or re module functions that perform similar operations on single strings.
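
To make the naming differences concrete, here is a small side-by-side comparison. The subject line is made up for illustration.

```python
import re
import pandas as pd

subject = "Re: status update "      # made-up example subject
subjects = pd.Series([subject])

# Pattern search: re.search() on one string, .str.contains() on a Series.
single_has_re = re.search(r"^re:", subject, flags=re.IGNORECASE) is not None
series_has_re = subjects.str.contains(r"^re:", case=False, regex=True)

# Regex substitution: re.sub() on one string, .str.replace() on a Series.
single_clean = re.sub(r"^re:", "", subject, flags=re.IGNORECASE)
series_clean = subjects.str.replace(r"^re:", "", case=False, regex=True)

# Whitespace trimming: the name is the same, str.strip() and .str.strip().
single_stripped = single_clean.strip()
series_stripped = series_clean.str.strip()
```

So “contains” corresponds to re.search, the pandas “replace” can act like re.sub rather than the literal str.replace, and “strip” keeps its name.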

In the code for that chapter, I built a TDM function that wrapped the term-document matrix function in the textmining package, adding some options that tried to mimic the tdm function in R’s tm package.
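
For readers who have not seen one, the sketch below builds a small term-document matrix with plain pandas. It does not use the textmining package and is not the TDM function described above. The name simple_tdm and its stopwords and min_length options are hypothetical, chosen only to show the kind of options such a wrapper might expose.

```python
from collections import Counter
import pandas as pd

def simple_tdm(docs, stopwords=(), min_length=1):
    """Build a term-document matrix: rows are terms, columns are documents."""
    stop = {w.lower() for w in stopwords}
    counts = []
    for doc in docs:
        # Naive whitespace tokenization; real code would handle punctuation.
        terms = [t for t in doc.lower().split()
                 if len(t) >= min_length and t not in stop]
        counts.append(Counter(terms))
    # One row per document, then transpose so terms are rows (as in R's tm).
    tdm = pd.DataFrame(counts).T.fillna(0).astype(int)
    return tdm.sort_index()
```

For instance, simple_tdm(["the cat sat", "the cat and the dog"], stopwords=["the"]) gives a table with one row per remaining term and one column per document.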

For example, the following thread has two e-mails: If you ignore the timezones, it looks like 763 comes three hours after 734.
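
The effect of ignoring timezones can be sketched with the standard library. The two Date headers below are made up for illustration and are not the timestamps of e-mails 734 and 763. They are chosen so that a timezone-blind comparison shows a three-hour gap.

```python
from email.utils import parsedate_to_datetime

# Hypothetical Date headers, not taken from the thread discussed above.
first = "Wed, 04 Sep 2002 10:00:00 -0700"
second = "Wed, 04 Sep 2002 13:00:00 -0400"

a = parsedate_to_datetime(first)    # timezone-aware datetime
b = parsedate_to_datetime(second)   # timezone-aware datetime

# Dropping the offsets compares wall-clock times only.
naive_gap = b.replace(tzinfo=None) - a.replace(tzinfo=None)

# Keeping the offsets compares actual instants.
aware_gap = b - a

print(naive_gap, aware_gap)
```

In this made-up case the naive gap is three hours while the timezone-aware gap is zero, so any ordering or time-difference feature computed from naive timestamps can be misleading.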

Even though the authors scale the individual feature weights (typically with log scales), calculating the final rank as a product means you can get big rank differences from features that might seem practically similar (even without any bugs). For example, in some cases it doesn’t take a big difference to double a feature’s weight, which then doubles the e-mail’s rank. So it seems to me the ranking procedure in the book is not very stable.
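
A small numeric sketch shows the doubling effect. The weighting function (natural log of count plus one) and the other feature weights are assumptions chosen for illustration, not the book’s exact formulas.

```python
import math

def log_weight(count):
    # Assumed log scaling for illustration: ln(count + 1).
    return math.log(count + 1)

def product_rank(weights):
    # Final rank as the product of the individual feature weights.
    return math.prod(weights)

# Two hypothetical e-mails that differ in one feature only:
# the sender has 1 prior message in one case and 3 in the other.
other_weights = [1.5, 2.0]
rank_a = product_rank([log_weight(1)] + other_weights)
rank_b = product_rank([log_weight(3)] + other_weights)

ratio = rank_b / rank_a
print(ratio)
```

Because ln(4) is exactly twice ln(2), going from one prior message to three doubles that feature’s weight, and the product carries the doubling straight through to the final rank.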

