I have a dirty column that should contain duration in months as integer, but it contains variants of strings: due to end on dd-mm-yy, duration x days, from dd.mm.yyyy through/to dd.mm.yyyy y.
Headache.
This is the day I started playing with regex.
First, how to find a date after a specific word:
http://stackoverflow.com/questions/19193251/regex-to-get-the-words-after-matching-string
After some reading (eg. http://stackoverflow.com/questions/18139493/finding-first-word-after-regex-match, http://stackoverflow.com/questions/546220/how-to-match-the-first-word-after-an-expression-with-regex
modifications and experimentations I reached the following compact solution:
[\n\r].*[skończy się:|until:|do:|zakończenia:]\s*(\w+.\w+.\w+)
Tested with: "The contract will last until: dd.mm.yyy or to the [...]" it will return just the date "dd.mm.yyy
Cute!
Macros, that work with Excel:
http://analystcave.com/excel-regex-tutorial/ (I used RegexExecute macro with success)
http://www.vbaexpress.com/kb/getarticle.php?kb_id=841
http://stackoverflow.com/questions/22542834/how-to-use-regular-expressions-regex-in-microsoft-excel-both-in-cell-and-loops
http://www.vbaexpress.com/kb/getarticle.php?kb_id=841
http://superuser.com/questions/978532/how-do-i-use-regex-using-instr-in-vba
http://stackoverflow.com/questions/19481175/general-purpose-udfs-for-using-regular-expressions-in-excel (set of very useful tools)
R code to test:
http://stackoverflow.com/questions/34804708/matching-a-word-after-another-word-in-r-regex
https://renkun.me/blog/2014/02/20/extract-information-from-texts-with-regular-expressions-in-r.html
In R the following worked with me:
library(stringr)
str_extract(string=df$duration, pattern= "(?<=(dnia:|zakończenia:|until:)\\s)(\\w+.\\w+.\\w+)" )
Brak komentarzy:
Prześlij komentarz