piątek, 4 listopada 2016

Cleaning data with Regex in Excel (then in R)...

I have a dirty column that should contain duration in months as integer, but it contains variants of strings: due to end on dd-mm-yy, duration x days, from dd.mm.yyyy through/to dd.mm.yyyy y.
Headache.

This is the day I started playing with regex.

First, how to find a date after a specific word:
http://stackoverflow.com/questions/19193251/regex-to-get-the-words-after-matching-string

After some reading (eg. http://stackoverflow.com/questions/18139493/finding-first-word-after-regex-match, http://stackoverflow.com/questions/546220/how-to-match-the-first-word-after-an-expression-with-regex


modifications and experimentations I reached the following compact solution:

[\n\r].*[skończy się:|until:|do:|zakończenia:]\s*(\w+.\w+.\w+)

Tested with: "The contract will last until: dd.mm.yyy or to the [...]" it will return just the date "dd.mm.yyy

Cute!

Macros, that work with Excel:

http://analystcave.com/excel-regex-tutorial/ (I used RegexExecute macro with success)
http://www.vbaexpress.com/kb/getarticle.php?kb_id=841
http://stackoverflow.com/questions/22542834/how-to-use-regular-expressions-regex-in-microsoft-excel-both-in-cell-and-loops
http://www.vbaexpress.com/kb/getarticle.php?kb_id=841
http://superuser.com/questions/978532/how-do-i-use-regex-using-instr-in-vba
http://stackoverflow.com/questions/19481175/general-purpose-udfs-for-using-regular-expressions-in-excel (set of very useful tools)



R code to test:

http://stackoverflow.com/questions/34804708/matching-a-word-after-another-word-in-r-regex
https://renkun.me/blog/2014/02/20/extract-information-from-texts-with-regular-expressions-in-r.html

In R the following worked with me:

library(stringr)

str_extract(string=df$duration, pattern= "(?<=(dnia:|zakończenia:|until:)\\s)(\\w+.\\w+.\\w+)" )


Brak komentarzy:

Prześlij komentarz