środa, 30 grudnia 2015

Day 2 with R. Ideas for diving in deep waters


Start with example here: https://stat.ethz.ch/R-manual/R-devel/library/class/html/knn.html
Then go through: http://blog.datacamp.com/machine-learning-in-r/
========================
Additional reading:
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/kNN
https://www3.nd.edu/~steve/computing_with_data/17_Refining_kNN/refining_knn.html
http://blog.webagesolutions.com/archives/1164
http://www.analyticsvidhya.com/blog/2015/08/learning-concept-knn-algorithms-programming/

K means vs K nearest neighbors are two different things.
https://www.quora.com/Is-the-k-Means-algorithm-related-to-the-k-Nearest-Neighbors-algorithm
https://www.quora.com/How-is-the-k-nearest-neighbor-algorithm-different-from-k-means-clustering

What next:
http://rayli.net/blog/data/top-10-data-mining-algorithms-in-plain-r/

Additional topic:
R and SQL/MS SQL
http://www.burns-stat.com/translating-r-sql-basics/
http://www.r-bloggers.com/make-r-speak-sql-with-sqldf/ - use SQL in R
https://www.simple-talk.com/sql/reporting-services/making-data-analytics-simpler-sql-server-and-r/  make R talk to MS SQL

Ideas for day 3: http://courses.had.co.nz/10-tokyo/
http://www.analyticsvidhya.com/learning-paths-data-science-business-analytics-business-intelligence-big-data/learning-path-r-data-science/

Picturen Lite: best app for viewing lectures on android.

This app appears to be the best choice for viewing video lectures (esp. downloaded khanacademy lessons, R coding lectures) on small devices. One can zoom in to the code being written by the lecturer with fingers. Great. Big thanks to the authors. One fault though (bug? please correct) After pausing and ocking/unlocking the screen it does not remember the last position and starts playing the video from the beginning.

If you know a better one, please let me know.

poniedziałek, 28 grudnia 2015

Day 1 with R

What I have learned today:
> setwd("C:\R")  - set the directory I will put my files,
> getwd() check if it is set correctly

Let's get some data ...

I created a csv file with notepad:
Nazwisko,Wartosc,Wartosc2
Kowalski,23.22,32.12
Nowak,21.21,34.12
Sikorowski,20.2,11.3
Then I read it with R

> b <- read.csv ("test.csv") 

Play with charts:

> plot(b$Wartosc, b$Wartosc2)

> plot(density(b$Wartosc)) - basic density plot

> plot(boxplot(b$Wartosc)) - basic boxplot


Play with k-means clusters
clusters <- kmeans(b[-1],2)

Get data with cluster no. column
 > b <- cbind (b, clusters$cluster)

Plot x y with clusters distinguished by color:
> plot (b$Wartosc, b$Wartosc2, col=b[,3])

To add labels I tried:
> text(b$Wartosc, b$Wartosc2, labels=b$Nazwisko, cex= 0.7, pos=3)

Then I played with http://stats.stackexchange.com/questions/109273/creating-a-cluster-analysis-on-multiple-variables

> numbers_only <- b[c(-1,-4)]
> rownames(numbers_only) <- b$Nazwisko
> d <- dist(numbers_only, method="euclidean")
> fit <- hclust(d, method = "ward.D")
> plot(fit)

What if the tree is too large, and I want to generalize clusters. I need to cut a tree? http://stackoverflow.com/questions/6518133/clustering-list-for-hclust-function

> cutree(fit, h=10)
  Kowalski      Nowak Sikorowski
         1          1          2

Shit, it is 01:00 AM. Good night




Code for Poland

http://kodujdlapolski.pl/projekty/

poniedziałek, 21 grudnia 2015

Publication on closed gov system in Poland

http://www.isppan.waw.pl/ksiegarnia/system_zamkniety.htm

Politico: Sober look at Polish politics

http://www.politico.eu/article/give-pis-a-chance-poland-kaczynski-andrzej-duda-szydlo/

CoffeeBreak: Outliers2, what about skewness

My new version of outlier's highlighting VBA

Sub outliers_IQR()

Dim Rng As Range, rTest As Range: Set rTest = Selection
Dim rQ1 As Double: rQ1 = WorksheetFunction.Quartile(rTest, 1)
Dim rQ3 As Double: rQ3 = WorksheetFunction.Quartile(rTest, 3)
Dim IQR As Double: IQR = Q3 - Q1
Dim Factor As Double: Factor = 2.2


For Each Rng In rTest
    If Rng > (rQ3 + Factor * IQR) Or Rng < (rQ1 - Factor * IQR) Then
        Rng.Interior.Color = RGB(255, 0, 0)        '.Value = "Outlier" 'or delete the data with Rng.clearcontents
    Else
        Rng.Interior.Color = xlNone
    End If
Next

End Sub


What about skewness? The proposal to deal with it can be found here: https://wis.kuleuven.be/stat/robust/papers/2008/outlierdetectionskeweddata-revision.pdf
I need to decipher this math and implement it above.


Some related reading:
http://www.real-statistics.com/students-t-distribution/identifying-outliers-using-t-distribution/grubbs-test/
http://datapigtechnologies.com/blog/index.php/highlighting-outliers-in-your-data-with-the-tukey-method/
http://stats.stackexchange.com/questions/60235/how-accurate-is-iqr-for-detecting-outliers

Excel mark duplicates.

Simple and interesting:
- in one helper column concatenate the columns you are checking for duplicates.
- in another =COUNTIF([concatenated_colum_range];checked_concatenated_value)>1
It returns TRUE when there are more than 1.

Found here: http://www.ozgrid.com/forum/showthread.php?t=59302 and http://www.ozgrid.com/Excel/highlight-duplicates.htm and http://blog.contextures.com/archives/2013/04/11/highlight-duplicate-records-in-an-excel-list/

czwartek, 17 grudnia 2015

CoffeeBreak: Outliers, Kernel Density

Reading on using IQR to identify outliers in Excel:
http://datapigtechnologies.com/blog/index.php/highlighting-outliers-in-your-data-with-the-tukey-method/
http://brownmath.com/stat/nchkxl.htm
esp. see worksheet: http://brownmath.com/stat/prog/normalitycheck.xlsm

Using SD, not a good idea but a good macro to start with:
http://www.mrexcel.com/forum/excel-questions/732424-how-remove-outliers-data-set-2.html
Sub outliers_mod2()
Dim dblAverage As Double, dblStdDev As Double
Dim NoStdDevs As Integer
Dim rTest As Range, Rng As Range
'Application.ScreenUpdating = False
NoStdDevs = 3 'adjust to your outlier preference of sigma
Set rTest = Selection 'Application.InputBox("Select a range", "Get Range", Type:=8)
dblAverage = WorksheetFunction.Average(rTest)
dblStdDev = WorksheetFunction.StDev(rTest)
For Each Rng In rTest
    If Rng > dblAverage + NoStdDevs * dblStdDev Or Rng < dblAverage - NoStdDevs * dblStdDev Then
        Rng.Interior.Color = RGB(255, 0, 0)        '.Value = "Outlier" 'or delete the data with Rng.clearcontents
    End If
Next
'Application.ScreenUpdating = True
End Sub 
 
====
Normalize(simple linear normalize) data in Excel
http://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range?newreg=f530ce192d144b109ec26a077cab00af 
 
Kernel Density/Regression, alternatives to histogram.
==== 
good article: http://www.stat-d.si/mz/mz4.1/vidmar.pdf
http://people.revoledu.com/kardi/tutorial/index.html
2d: http://www.r-bloggers.com/recipe-for-computing-and-sampling-multivariate-kernel-density-estimates-and-plotting-contours-for-2d-kdes/
 
Density plugin: http://www.prodomosua.eu/ppage02.html 

Good  VBA example for density plots (incl UDF function). 
http://www.iimahd.ernet.in/~jrvarma/software.php
(found in this page: http://www.mathfinance.cn/category/vba/1/5/)
 
Some R code http://www.wessa.net/rwasp_density.wasp#output
Another article: http://www.rsc.org/images/data-distributions-kernel-density-technical-brief-4_tcm18-214836.pdf

 
Some plugin/vba to check: http://www.rsc.org/Membership/Networking/InterestGroups/Analytical/AMC/Software/RobustStatistics.asp

Cheatsheets for R: distributions: http://www.r-bloggers.com/ggplot2-cheatsheet-for-visualizing-distributions/

Good Wikipedia entry: https://en.wikipedia.org/wiki/Outlier
Advanced article: http://d-scholarship.pitt.edu/7948/1/Seo.pdf
For beginners: https://www.dataz.io/display/Public/2013/03/20/Describing+Data%3A+Why+median+and+IQR+are+often+better+than+mean+and+standard+deviation

====
Simple MAD solution with Excel formulas: http://www.codeproject.com/Tips/214330/Statistical-Outliers-detection

wtorek, 8 grudnia 2015

Coffee break: MSAccess Transform/Pivot syntax.

http://blogannath.blogspot.com/2010/02/microsoft-access-tips-tricks-crosstab.html
http://stackoverflow.com/questions/27710313/ms-access-sql-transform-aggregate-manipluation-of-values-for-pivot

pivot datepart (aby otrzymać quarters)
https://msdn.microsoft.com/en-us/library/bb208956%28v=office.12%29.aspx
http://weblogs.sqlteam.com/jeffs/archive/2007/09/10/group-by-month-sql.aspx

sum if
http://allenbrowne.com/ser-67.html  
http://www.blueclaw-db.com/accessquerysql/crosstab.htm

Crosstabs on two columns
http://stackoverflow.com/questions/24706235/sql-pivot-cross-tab-and-multiple-columns

piątek, 4 grudnia 2015

Stay secure for free

Talk & instant messages: https://whispersystems.org/
Disc: https://veracrypt.codeplex.com/
Files and mail: http://www.gpg4usb.org/

środa, 2 grudnia 2015

Coffee break - Access transform-pivot scavenging.

Looking for articles on Transform - Pivot, ie. crosstab queries in Access for my VBA ACE SQL solutions. I need it because my Excel files need to work with 100k+ data, and a lot of filtering, joining etc. Interesting: http://blogannath.blogspot.com/2010/02/microsoft-access-tips-tricks-crosstab.html#ixzz3t3Pb0e6j And more. http://datapigtechnologies.com/blog/index.php/running-crosstab-queries-in-excel/ http://www.fmsinc.com/MicrosoftAccess/query/crosstab-report/ http://allenbrowne.com/ser-67.html http://www.mrexcel.com/forum/microsoft-access/261348-help-transform-pivot-crosstab-querys-access.html http://www.fmsinc.com/MicrosoftAccess/query/crosstab-report/

Coffee Break ListobjectTables Totals in VBA

I found good tutorial on ListObject Tables in VBA here:
http://www.thespreadsheetguru.com/blog/2014/6/20/the-vba-guide-to-listobject-excel-tables

I wanted to insert totals in selected columns, my playground looked like this:

Function to see total of selected column:
Function sumColumnCenaNetto()
    sumColumnCenaNetto = WorksheetFunction.sum _
    (Sheets("Sheet1").ListObjects("Table1").ListColumns("CenaNetto").DataBodyRange)
End Function

Insert a total formula under the table.
Sub InsertTotal()
   With Sheets("Sheet1").ListObjects("Table1")
      .ListColumns("CenaNetto").TotalsCalculation = 1
      .ShowTotals = True
End Sub