środa, 30 grudnia 2015
Day 2 with R. Ideas for diving in deep waters
Start with example here: https://stat.ethz.ch/R-manual/R-devel/library/class/html/knn.html
Then go through: http://blog.datacamp.com/machine-learning-in-r/
========================
Additional reading:
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/kNN
https://www3.nd.edu/~steve/computing_with_data/17_Refining_kNN/refining_knn.html
http://blog.webagesolutions.com/archives/1164
http://www.analyticsvidhya.com/blog/2015/08/learning-concept-knn-algorithms-programming/
K means vs K nearest neighbors are two different things.
https://www.quora.com/Is-the-k-Means-algorithm-related-to-the-k-Nearest-Neighbors-algorithm
https://www.quora.com/How-is-the-k-nearest-neighbor-algorithm-different-from-k-means-clustering
What next:
http://rayli.net/blog/data/top-10-data-mining-algorithms-in-plain-r/
Additional topic:
R and SQL/MS SQL
http://www.burns-stat.com/translating-r-sql-basics/
http://www.r-bloggers.com/make-r-speak-sql-with-sqldf/ - use SQL in R
https://www.simple-talk.com/sql/reporting-services/making-data-analytics-simpler-sql-server-and-r/ make R talk to MS SQL
Ideas for day 3: http://courses.had.co.nz/10-tokyo/
http://www.analyticsvidhya.com/learning-paths-data-science-business-analytics-business-intelligence-big-data/learning-path-r-data-science/
Picturen Lite: best app for viewing lectures on android.
This app appears to be the best choice for viewing video lectures (esp.
downloaded khanacademy lessons, R coding lectures) on small devices.
One can zoom in to the code being written by the lecturer with fingers.
Great. Big thanks to the authors. One fault though (bug? please
correct) After pausing and ocking/unlocking the screen it does not
remember the last position and starts playing the video from the
beginning.
If you know a better one, please let me know.
If you know a better one, please let me know.
wtorek, 29 grudnia 2015
Free books from Grean Tea Press
Think Stats, Think Python, Think Bayes. Visit: http://greenteapress.com/wp/
poniedziałek, 28 grudnia 2015
Day 1 with R
What I have learned today:
> setwd("C:\R") - set the directory I will put my files,
> getwd() check if it is set correctly
Let's get some data ...
I created a csv file with notepad:
> b <- read.csv ("test.csv")
Play with charts:
> plot(b$Wartosc, b$Wartosc2)
> plot(density(b$Wartosc)) - basic density plot
> plot(boxplot(b$Wartosc)) - basic boxplot
Play with k-means clusters
clusters <- kmeans(b[-1],2)
Get data with cluster no. column
> b <- cbind (b, clusters$cluster)
Plot x y with clusters distinguished by color:
> plot (b$Wartosc, b$Wartosc2, col=b[,3])
To add labels I tried:
> text(b$Wartosc, b$Wartosc2, labels=b$Nazwisko, cex= 0.7, pos=3)
Then I played with http://stats.stackexchange.com/questions/109273/creating-a-cluster-analysis-on-multiple-variables
> numbers_only <- b[c(-1,-4)]
> rownames(numbers_only) <- b$Nazwisko
> d <- dist(numbers_only, method="euclidean")
> fit <- hclust(d, method = "ward.D")
> plot(fit)
What if the tree is too large, and I want to generalize clusters. I need to cut a tree? http://stackoverflow.com/questions/6518133/clustering-list-for-hclust-function
> cutree(fit, h=10)
Kowalski Nowak Sikorowski
1 1 2
Shit, it is 01:00 AM. Good night
> setwd("C:\R") - set the directory I will put my files,
> getwd() check if it is set correctly
Let's get some data ...
I created a csv file with notepad:
Nazwisko,Wartosc,Wartosc2Then I read it with R
Kowalski,23.22,32.12
Nowak,21.21,34.12
Sikorowski,20.2,11.3
> b <- read.csv ("test.csv")
Play with charts:
> plot(b$Wartosc, b$Wartosc2)
> plot(density(b$Wartosc)) - basic density plot
> plot(boxplot(b$Wartosc)) - basic boxplot
Play with k-means clusters
clusters <- kmeans(b[-1],2)
Get data with cluster no. column
> b <- cbind (b, clusters$cluster)
Plot x y with clusters distinguished by color:
> plot (b$Wartosc, b$Wartosc2, col=b[,3])
To add labels I tried:
> text(b$Wartosc, b$Wartosc2, labels=b$Nazwisko, cex= 0.7, pos=3)
Then I played with http://stats.stackexchange.com/questions/109273/creating-a-cluster-analysis-on-multiple-variables
> numbers_only <- b[c(-1,-4)]
> rownames(numbers_only) <- b$Nazwisko
> d <- dist(numbers_only, method="euclidean")
> fit <- hclust(d, method = "ward.D")
> plot(fit)
What if the tree is too large, and I want to generalize clusters. I need to cut a tree? http://stackoverflow.com/questions/6518133/clustering-list-for-hclust-function
> cutree(fit, h=10)
Kowalski Nowak Sikorowski
1 1 2
Shit, it is 01:00 AM. Good night
wtorek, 22 grudnia 2015
Is there a Python-Studio like R-Studio?
I visited R-Bloggers and they redirected me today to http://blog.yhathq.com/posts/rodeo-native.html
need to check it out.
need to check it out.
poniedziałek, 21 grudnia 2015
Politico: Sober look at Polish politics
http://www.politico.eu/article/give-pis-a-chance-poland-kaczynski-andrzej-duda-szydlo/
CoffeeBreak: Outliers2, what about skewness
My new version of outlier's highlighting VBA
Sub outliers_IQR()
Dim Rng As Range, rTest As Range: Set rTest = Selection
Dim rQ1 As Double: rQ1 = WorksheetFunction.Quartile(rTest, 1)
Dim rQ3 As Double: rQ3 = WorksheetFunction.Quartile(rTest, 3)
Dim IQR As Double: IQR = Q3 - Q1
Dim Factor As Double: Factor = 2.2
For Each Rng In rTest
If Rng > (rQ3 + Factor * IQR) Or Rng < (rQ1 - Factor * IQR) Then
Rng.Interior.Color = RGB(255, 0, 0) '.Value = "Outlier" 'or delete the data with Rng.clearcontents
Else
Rng.Interior.Color = xlNone
End If
Next
End Sub
What about skewness? The proposal to deal with it can be found here: https://wis.kuleuven.be/stat/robust/papers/2008/outlierdetectionskeweddata-revision.pdf
I need to decipher this math and implement it above.
Some related reading:
http://www.real-statistics.com/students-t-distribution/identifying-outliers-using-t-distribution/grubbs-test/
http://datapigtechnologies.com/blog/index.php/highlighting-outliers-in-your-data-with-the-tukey-method/
http://stats.stackexchange.com/questions/60235/how-accurate-is-iqr-for-detecting-outliers
Sub outliers_IQR()
Dim Rng As Range, rTest As Range: Set rTest = Selection
Dim rQ1 As Double: rQ1 = WorksheetFunction.Quartile(rTest, 1)
Dim rQ3 As Double: rQ3 = WorksheetFunction.Quartile(rTest, 3)
Dim IQR As Double: IQR = Q3 - Q1
Dim Factor As Double: Factor = 2.2
For Each Rng In rTest
If Rng > (rQ3 + Factor * IQR) Or Rng < (rQ1 - Factor * IQR) Then
Rng.Interior.Color = RGB(255, 0, 0) '.Value = "Outlier" 'or delete the data with Rng.clearcontents
Else
Rng.Interior.Color = xlNone
End If
Next
End Sub
What about skewness? The proposal to deal with it can be found here: https://wis.kuleuven.be/stat/robust/papers/2008/outlierdetectionskeweddata-revision.pdf
I need to decipher this math and implement it above.
Some related reading:
http://www.real-statistics.com/students-t-distribution/identifying-outliers-using-t-distribution/grubbs-test/
http://datapigtechnologies.com/blog/index.php/highlighting-outliers-in-your-data-with-the-tukey-method/
http://stats.stackexchange.com/questions/60235/how-accurate-is-iqr-for-detecting-outliers
Excel mark duplicates.
Simple and interesting:
- in one helper column concatenate the columns you are checking for duplicates.
- in another =COUNTIF([concatenated_colum_range];checked_concatenated_value)>1
It returns TRUE when there are more than 1.
Found here: http://www.ozgrid.com/forum/showthread.php?t=59302 and http://www.ozgrid.com/Excel/highlight-duplicates.htm and http://blog.contextures.com/archives/2013/04/11/highlight-duplicate-records-in-an-excel-list/
- in one helper column concatenate the columns you are checking for duplicates.
- in another =COUNTIF([concatenated_colum_range];checked_concatenated_value)>1
It returns TRUE when there are more than 1.
Found here: http://www.ozgrid.com/forum/showthread.php?t=59302 and http://www.ozgrid.com/Excel/highlight-duplicates.htm and http://blog.contextures.com/archives/2013/04/11/highlight-duplicate-records-in-an-excel-list/
czwartek, 17 grudnia 2015
CoffeeBreak: Outliers, Kernel Density
Reading on using IQR to identify outliers in Excel:
http://datapigtechnologies.com/blog/index.php/highlighting-outliers-in-your-data-with-the-tukey-method/
http://brownmath.com/stat/nchkxl.htm
esp. see worksheet: http://brownmath.com/stat/prog/normalitycheck.xlsm
Using SD, not a good idea but a good macro to start with:
http://www.mrexcel.com/forum/excel-questions/732424-how-remove-outliers-data-set-2.html
Cheatsheets for R: distributions: http://www.r-bloggers.com/ggplot2-cheatsheet-for-visualizing-distributions/
Good Wikipedia entry: https://en.wikipedia.org/wiki/Outlier
Advanced article: http://d-scholarship.pitt.edu/7948/1/Seo.pdf
For beginners: https://www.dataz.io/display/Public/2013/03/20/Describing+Data%3A+Why+median+and+IQR+are+often+better+than+mean+and+standard+deviation
====
Simple MAD solution with Excel formulas: http://www.codeproject.com/Tips/214330/Statistical-Outliers-detection
http://datapigtechnologies.com/blog/index.php/highlighting-outliers-in-your-data-with-the-tukey-method/
http://brownmath.com/stat/nchkxl.htm
esp. see worksheet: http://brownmath.com/stat/prog/normalitycheck.xlsm
Using SD, not a good idea but a good macro to start with:
http://www.mrexcel.com/forum/excel-questions/732424-how-remove-outliers-data-set-2.html
Sub outliers_mod2() Dim dblAverage As Double, dblStdDev As Double Dim NoStdDevs As Integer Dim rTest As Range, Rng As Range 'Application.ScreenUpdating = False NoStdDevs = 3 'adjust to your outlier preference of sigma Set rTest = Selection 'Application.InputBox("Select a range", "Get Range", Type:=8) dblAverage = WorksheetFunction.Average(rTest) dblStdDev = WorksheetFunction.StDev(rTest) For Each Rng In rTest If Rng > dblAverage + NoStdDevs * dblStdDev Or Rng < dblAverage - NoStdDevs * dblStdDev Then Rng.Interior.Color = RGB(255, 0, 0) '.Value = "Outlier" 'or delete the data with Rng.clearcontents End If Next 'Application.ScreenUpdating = True End Sub
====
Normalize(simple linear normalize) data in Excel
http://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range?newreg=f530ce192d144b109ec26a077cab00af
Kernel Density/Regression, alternatives to histogram.
====
good article: http://www.stat-d.si/mz/mz4.1/vidmar.pdf http://people.revoledu.com/kardi/tutorial/index.html
2d: http://www.r-bloggers.com/recipe-for-computing-and-sampling-multivariate-kernel-density-estimates-and-plotting-contours-for-2d-kdes/
Density plugin: http://www.prodomosua.eu/ppage02.html
Good VBA example for density plots (incl UDF function).
http://www.iimahd.ernet.in/~jrvarma/software.php
(found in this page: http://www.mathfinance.cn/category/vba/1/5/)
Some R code http://www.wessa.net/rwasp_density.wasp#output
Another article: http://www.rsc.org/images/data-distributions-kernel-density-technical-brief-4_tcm18-214836.pdf
Some plugin/vba to check: http://www.rsc.org/Membership/Networking/InterestGroups/Analytical/AMC/Software/RobustStatistics.asp
Cheatsheets for R: distributions: http://www.r-bloggers.com/ggplot2-cheatsheet-for-visualizing-distributions/
Good Wikipedia entry: https://en.wikipedia.org/wiki/Outlier
Advanced article: http://d-scholarship.pitt.edu/7948/1/Seo.pdf
For beginners: https://www.dataz.io/display/Public/2013/03/20/Describing+Data%3A+Why+median+and+IQR+are+often+better+than+mean+and+standard+deviation
====
Simple MAD solution with Excel formulas: http://www.codeproject.com/Tips/214330/Statistical-Outliers-detection
wtorek, 8 grudnia 2015
Coffee break: MSAccess Transform/Pivot syntax.
http://blogannath.blogspot.com/2010/02/microsoft-access-tips-tricks-crosstab.html
http://stackoverflow.com/questions/27710313/ms-access-sql-transform-aggregate-manipluation-of-values-for-pivot
pivot datepart (aby otrzymać quarters)
https://msdn.microsoft.com/en-us/library/bb208956%28v=office.12%29.aspx
http://weblogs.sqlteam.com/jeffs/archive/2007/09/10/group-by-month-sql.aspx
sum if
http://allenbrowne.com/ser-67.html
http://www.blueclaw-db.com/accessquerysql/crosstab.htm
Crosstabs on two columns
http://stackoverflow.com/questions/24706235/sql-pivot-cross-tab-and-multiple-columns
http://stackoverflow.com/questions/27710313/ms-access-sql-transform-aggregate-manipluation-of-values-for-pivot
pivot datepart (aby otrzymać quarters)
https://msdn.microsoft.com/en-us/library/bb208956%28v=office.12%29.aspx
http://weblogs.sqlteam.com/jeffs/archive/2007/09/10/group-by-month-sql.aspx
sum if
http://allenbrowne.com/ser-67.html
http://www.blueclaw-db.com/accessquerysql/crosstab.htm
Crosstabs on two columns
http://stackoverflow.com/questions/24706235/sql-pivot-cross-tab-and-multiple-columns
piątek, 4 grudnia 2015
Stay secure for free
Talk & instant messages: https://whispersystems.org/
Disc: https://veracrypt.codeplex.com/
Files and mail: http://www.gpg4usb.org/
Disc: https://veracrypt.codeplex.com/
Files and mail: http://www.gpg4usb.org/
środa, 2 grudnia 2015
Coffee break - Access transform-pivot scavenging.
Looking for articles on Transform - Pivot, ie. crosstab queries in Access for my VBA ACE SQL solutions.
I need it because my Excel files need to work with 100k+ data, and a lot of filtering, joining etc.
Interesting:
http://blogannath.blogspot.com/2010/02/microsoft-access-tips-tricks-crosstab.html#ixzz3t3Pb0e6j
And more.
http://datapigtechnologies.com/blog/index.php/running-crosstab-queries-in-excel/
http://www.fmsinc.com/MicrosoftAccess/query/crosstab-report/
http://allenbrowne.com/ser-67.html
http://www.mrexcel.com/forum/microsoft-access/261348-help-transform-pivot-crosstab-querys-access.html
http://www.fmsinc.com/MicrosoftAccess/query/crosstab-report/
Coffee Break ListobjectTables Totals in VBA
I found good tutorial on ListObject Tables in VBA here:
http://www.thespreadsheetguru.com/blog/2014/6/20/the-vba-guide-to-listobject-excel-tables
I wanted to insert totals in selected columns, my playground looked like this:
Function to see total of selected column:
http://www.thespreadsheetguru.com/blog/2014/6/20/the-vba-guide-to-listobject-excel-tables
I wanted to insert totals in selected columns, my playground looked like this:
Function to see total of selected column:
Function sumColumnCenaNetto() sumColumnCenaNetto = WorksheetFunction.sum _ (Sheets("Sheet1").ListObjects("Table1").ListColumns("CenaNetto").DataBodyRange) End FunctionInsert a total formula under the table.
Sub InsertTotal() With Sheets("Sheet1").ListObjects("Table1") .ListColumns("CenaNetto").TotalsCalculation = 1 .ShowTotals = True End Sub
Subskrybuj:
Posty (Atom)