Saturday, 9 July 2016

R: Braindump for this week: classification, ETL, etc.

1. I am pondering the use of clustering to classify unknown data.
  • first prepare the data (what to do with categorical and ordinal variables? mutate to create new variables?)
    • for this I found some interesting tutorials:
      http://www.r-bloggers.com/clustering-mixed-data-types-in-r-2/
      http://www.sthda.com/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning
    • I also found an algorithm that proved to be very precise:
  • then run the data through a clustering algorithm (k-means, local density, other?)
    • first I chose to play with various algorithms to see how well they detect known clusters,
      e.g. the three species in the famous iris dataset; I played with "cclust", "cluster", and "densityclust", and discovered "factoextra" for quick graphs/diagnostics
  • finally, fit a decision tree (or another non-black-box tool) to show the rules and dependencies.
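
The three steps above can be sketched roughly as below. This is only a minimal sketch, not a tested recipe: it assumes the "cluster" and "rpart" packages (both ship with R as recommended packages), picks k = 3 because iris has three known species, and uses k-means and Gower + PAM as stand-ins for whichever algorithm ends up working best. The object names are my own.

```r
# Sketch only: cluster iris without its labels, then explain the clusters with a tree.
library(cluster)  # daisy(), pam()
library(rpart)    # rpart()

set.seed(42)                 # k-means starts from random centers
features <- iris[, 1:4]      # drop Species so the clustering is unsupervised

# Option A: k-means on standardized numeric columns
km <- kmeans(scale(features), centers = 3, nstart = 25)

# Option B: Gower dissimilarity + PAM, which also copes with mixed column types
gower <- daisy(features, metric = "gower")
pm <- pam(gower, k = 3, diss = TRUE)

# How well do the clusters recover the known species?
print(table(cluster = km$cluster, species = iris$Species))
print(table(cluster = pm$clustering, species = iris$Species))

# Fit a decision tree on the cluster labels to get human-readable rules
tree_data <- cbind(features, cluster = factor(km$cluster))
tree <- rpart(cluster ~ ., data = tree_data, method = "class")
print(tree)
```

The tree is trained on the cluster labels, not on Species, so its splits describe what the clustering found rather than the ground truth.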
2. Getting tidy output from models: broom (in .R scripts), pander (in .Rmd documents), memisc::mtable + pander to compare linear models (http://stackoverflow.com/questions/24342162/regression-tables-in-markdown-format-for-flexible-use-in-r-markdown-v2)
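
A small sketch of the broom + pander part, assuming both packages are installed; the model formula is just an arbitrary example on iris, and the object names are mine.

```r
# Sketch only: turn a fitted linear model into tidy data frames.
library(broom)   # tidy(), glance()
library(pander)  # pander()

fit <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

coefs <- tidy(fit)          # one row per model term (estimate, std.error, ...)
model_stats <- glance(fit)  # a single row of model-level statistics

print(coefs)
print(model_stats)

# In an .Rmd chunk this renders the coefficient table as markdown
pander(as.data.frame(coefs))
```

Because tidy() returns a plain data frame, the result can be filtered, joined, or plotted like any other table.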

3. ETL concept:
https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html
https://github.com/beanumber/etl
http://www.r-bloggers.com/r-and-sqlite-part-1/
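
A minimal extract-transform-load sketch along the lines of the links above. It assumes DBI, RSQLite, dplyr, and dbplyr are installed, uses an in-memory SQLite database, and uses mtcars as a stand-in for a real source; the table and column names are made up for illustration. It does not use the etl package itself.

```r
# Sketch only: load a data frame into SQLite and query it with dplyr.
library(DBI)
library(dplyr)  # database backends also need the dbplyr package installed

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Extract + transform: keep the car names as a regular column
cars <- mtcars
cars$model <- rownames(mtcars)
rownames(cars) <- NULL

# Load
dbWriteTable(con, "cars", cars)

# Query lazily: dplyr translates the pipeline to SQL, collect() fetches the result
by_cyl <- tbl(con, "cars") %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl) %>%
  collect()

print(by_cyl)
dbDisconnect(con)
```

Nothing is pulled into R until collect() runs, so the aggregation happens inside the database.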

