piątek, 25 listopada 2016

Marketing with R, very interesting articles.

My search started with the question, how to reduce the huge amount of information about the history of purchases, contacts of hundred thousands of customers to manageable few variables.
I came across  Recency, Frequency, Monetary value (RFM) analysis.


piątek, 4 listopada 2016

Cleaning data with Regex in Excel (then in R)...

I have a dirty column that should contain duration in months as integer, but it contains variants of strings: due to end on dd-mm-yy, duration x days, from dd.mm.yyyy through/to dd.mm.yyyy y.
Headache.

This is the day I started playing with regex.

First, how to find a date after a specific word:
http://stackoverflow.com/questions/19193251/regex-to-get-the-words-after-matching-string

After some reading (eg. http://stackoverflow.com/questions/18139493/finding-first-word-after-regex-match, http://stackoverflow.com/questions/546220/how-to-match-the-first-word-after-an-expression-with-regex


modifications and experimentations I reached the following compact solution:

[\n\r].*[skończy się:|until:|do:|zakończenia:]\s*(\w+.\w+.\w+)

Tested with: "The contract will last until: dd.mm.yyy or to the [...]" it will return just the date "dd.mm.yyy

Cute!

Macros, that work with Excel:

http://analystcave.com/excel-regex-tutorial/ (I used RegexExecute macro with success)
http://www.vbaexpress.com/kb/getarticle.php?kb_id=841
http://stackoverflow.com/questions/22542834/how-to-use-regular-expressions-regex-in-microsoft-excel-both-in-cell-and-loops
http://www.vbaexpress.com/kb/getarticle.php?kb_id=841
http://superuser.com/questions/978532/how-do-i-use-regex-using-instr-in-vba
http://stackoverflow.com/questions/19481175/general-purpose-udfs-for-using-regular-expressions-in-excel (set of very useful tools)



R code to test:

http://stackoverflow.com/questions/34804708/matching-a-word-after-another-word-in-r-regex
https://renkun.me/blog/2014/02/20/extract-information-from-texts-with-regular-expressions-in-r.html

In R the following worked with me:

library(stringr)

str_extract(string=df$duration, pattern= "(?<=(dnia:|zakończenia:|until:)\\s)(\\w+.\\w+.\\w+)" )