hclust: an attempt to analyze the answers from the WhyR conference registration form. Column: "Chcesz.podzielić.się.odpowiedzią.na.to.pytanie..Chętnie.przedstawimy.najciekawsze.odpowiedzi" (roughly: "Do you want to share your answer to this question? We will gladly present the most interesting answers")
- because it is a lingua franca,
- because of the huge possibilities,
- because of ggplot2,
- because it is not Python, because it is not SAS,
- because it is free,
- because Data Science...
Subjective interpretation in Paint :-)
R gives you huge possibilities for free and lets you grow and communicate (lingua franca), especially thanks to ggplot2. It is free. It is an alternative to SAS and Python.
Edited in EzGif.com
Tuesday, 12 September 2017
Friday, 9 June 2017
Introductory Python ML, a short 2-day course.
Conclusions:
1. R is so much easier, portable, supercool. Jupyter Notebooks are far behind RStudio R Notebooks.
2. Python can be learned and is similar; just remember that indentation is part of the syntax :-)
3. I must learn pandas! scikit-learn! seaborn! Maybe by comparison to dplyr and ggplot2.
Microsoft, please allow more R/Excel VBA/Python interoperability for all.
Excel: conditional number format strings...
Why had I never discovered this until today?!
[=0]"";rrrr-mm-dd
(rrrr is the year code in the Polish version of Excel; the English version uses yyyy.)
"If there is a 0, do not show an erroneous date (Excel otherwise insists on displaying 0 as 1900-01-00)." Is there more? Like coloring the font depending on which conditions are met? Supercool!
Saturday, 6 May 2017
Big Data on a laptop, some options.
Data mining on streams.
http://moa.cms.waikato.ac.nz/rmoa-massive-online-data-stream-classifications-with-r-moa/
http://jwijffels.github.io/RMOA/
https://cran.r-project.org/web/packages/stream/stream.pdf
Database light backend.
https://www.monetdb.org/blog/monetdblite-r
Data mining
https://rdrr.io/cran/ffbase/man/bigglm.ffdf.html
https://cran.r-project.org/web/packages/speedglm/speedglm.pdf
https://cran.r-project.org/web/packages/randomForest.ddR/randomForest.ddR.pdf
Friday, 5 May 2017
Bayes for beginners videos.
How to explain it in plain English:
https://www.khanacademy.org/math/statistics-probability/probability-library/conditional-probability-independence/v/calculating-conditional-probability
http://www.watchknowlearn.org/Video.aspx?VideoID=16751&CategoryID=4457
https://www.khanacademy.org/math/ap-statistics/probability-ap/stats-conditional-probability/v/bayes-theorem-visualized
https://www.khanacademy.org/partner-content/wi-phi/wiphi-critical-thinking/wiphi-fundamentals/v/bayes-theorem
https://www.youtube.com/watch?v=Y-V4rfdl3NI
https://brilliant.org/wiki/bayes-theorem/
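To make the formula from these videos concrete, here is a small worked example in Python. It is only a sketch: the numbers (1% prior, 90% true-positive rate, 5% false-positive rate) are invented for illustration, and `posterior` is a hypothetical helper name.

```python
def posterior(prior: float, p_pos_given_h: float, p_pos_given_not_h: float) -> float:
    """Bayes' theorem: P(H | +) = P(+ | H) * P(H) / P(+)."""
    # Total probability of a positive result, summed over both hypotheses.
    p_pos = p_pos_given_h * prior + p_pos_given_not_h * (1 - prior)
    return p_pos_given_h * prior / p_pos

# Hypothetical numbers: rare condition, fairly good test.
p = posterior(prior=0.01, p_pos_given_h=0.90, p_pos_given_not_h=0.05)
print(round(p, 3))  # 0.154
```

Even with a fairly accurate test, the posterior stays low because the prior is low, which is the point most of the beginner videos try to get across.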
Friday, 24 March 2017
Risk matrix in R (interesting readings)
1) Risk matrix examples: http://davidmeza1.github.io/2015/12/17/2015-12-17-Creating-a-Risk-Matrix-in-R.html
2) Use ggrepel instead of jitter for multiple points: https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html
3) Is there a ggrepel for Excel? http://stackoverflow.com/questions/30294041/excel-bubble-chart-overlapping-data-label
Tuesday, 21 March 2017
Stats Day2: Covariance
Covariance
Cov (sample) = sum[(x - mean(x)) * (y - mean(y))] / (n - 1) ... we are interested in the sign. It does not indicate the strength of the relationship. It is not standardized.
Covariance matrix: the diagonal shows the variance of each variable, the off-diagonal entries show the covariance between each pair of variables.
Correlation (Pearson, r)
r = Cov(x, y) / [SD(x) * SD(y)]
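The two formulas above can be sketched in plain Python to show why covariance is not standardized while Pearson's r is. The data are hypothetical and the function names are my own, not from any library.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sample_cov(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def sample_sd(x):
    """Sample standard deviation: square root of the covariance of x with itself."""
    return sqrt(sample_cov(x, x))

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of both SDs."""
    return sample_cov(x, y) / (sample_sd(x) * sample_sd(y))

x = [1, 2, 3, 4]                  # hypothetical data
y = [2, 4, 6, 8]                  # y = 2x, a perfect positive linear relation
y_scaled = [v * 100 for v in y]   # the same relation in different units

# Covariance changes with the units, so only its sign is informative.
print(sample_cov(x, y), sample_cov(x, y_scaled))
# Correlation is standardized, so it stays at (about) 1 in both cases.
print(pearson_r(x, y), pearson_r(x, y_scaled))
```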
Monday, 20 March 2017
Stats Day1
On the first day of spring I am starting a regular introductory statistics study.
My notes from day 1:
Standard Error of the Mean (SD of the sampling distribution of the mean):
SD / sqrt(n): the larger the sample size, the smaller the error.
Z-Score: (x - mean) / SD
https://en.wikipedia.org/wiki/Standard_score
T-Score:
Conversion Score = z * NewSD + NewMean
NewSD for T-Score = 10 and NewMean = 50
T-Score = Z-Score * 10 + 50
More: Essentials of Testing and Assessment: A Practical Guide for Counselors ... by Edward S. Neukrug, R. Charles Fawcett, p. 133 (fragment available online via Google)
Coefficient of Variation
CV% = (SD / Mean) * 100
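A quick way to check these day 1 formulas is to run them on a tiny sample. The sketch below uses only Python's standard library; the `scores` list is hypothetical data made up for the example.

```python
from math import sqrt
from statistics import mean, stdev

scores = [4, 8, 6, 5, 7]              # hypothetical sample

m = mean(scores)                      # sample mean
sd = stdev(scores)                    # sample SD (n - 1 in the denominator)
sem = sd / sqrt(len(scores))          # standard error of the mean

z = [(x - m) / sd for x in scores]    # z-scores
t = [zi * 10 + 50 for zi in z]        # T-scores: NewSD = 10, NewMean = 50
cv = sd / m * 100                     # coefficient of variation in percent

print(m, round(sd, 3), round(sem, 3), round(cv, 1))
print([round(ti, 1) for ti in t])
```

A value equal to the mean (here 6) gets a z-score of 0 and therefore a T-score of exactly 50.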
ExcelVBA: From recorded macro to reusable function.
Problem: with each change in a slicer, the pivot chart formatting is reset.
I want all my labels vertical (and with each change the damn thing returns to horizontal). My recorded macro did not return anything sensible. Reading up on chart label properties brought me to this solution:
Sub Macro1()
    '
    ' "Recorder" version
    '
    With ActiveSheet.ChartObjects("chValueofOffers").Chart
        .SeriesCollection(1).DataLabels.Orientation = xlUpward
        .SeriesCollection(2).DataLabels.Orientation = xlUpward
    End With
End Sub
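Next I wanted to iterate through the series collection so that I do not need to change the macro when there are more series:
Sub Macro2()
    '
    ' "Object-oriented loop" version
    '
    Dim seriesCol As SeriesCollection
    Dim mySeries As Series

    Set seriesCol = ActiveSheet.ChartObjects("chValueofOffers").Chart.SeriesCollection
    For Each mySeries In seriesCol
        mySeries.DataLabels.Orientation = xlUpward
    Next mySeries

    ' Repeat for the second chart
    Set seriesCol = ActiveSheet.ChartObjects("chValueofOffersTotal").Chart.SeriesCollection
    For Each mySeries In seriesCol
        mySeries.DataLabels.Orientation = xlUpward
    Next mySeries
End Sub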
Finally... what if I have several charts on the same sheet that need vertical labels? Why not make it a reusable function that takes the sheet name and the chart name as arguments? Now, with each change in the slicer, the labels in the two charts get corrected to vertical.
Private Function DataLabelsVertical(mySheet As String, chtName As String)
    Dim seriesCol As SeriesCollection
    Dim mySeries As Series

    ' Turn the data labels of every series in the named chart upward
    Set seriesCol = Worksheets(mySheet).ChartObjects(chtName).Chart.SeriesCollection
    For Each mySeries In seriesCol
        mySeries.DataLabels.Orientation = xlUpward
    Next mySeries
End Function

Private Sub Worksheet_Change(ByVal Target As Range)
    DataLabelsVertical "Value_tenders", "chValueofOffers"
    DataLabelsVertical "Value_tenders", "chValueofOffersTotal"
    DataLabelsVertical "Status", "chStatusValue"
    DataLabelsVertical "Status", "chStatusValueTotal"
End Sub
Sources:
http://www.java2s.com/Code/VBA-Excel-Access-Word/Excel/Loopthrougheachseriesinchartandaltermarkercolors.htm
http://stackoverflow.com/questions/21165581/vba-looping-through-all-series-within-all-charts
Friday, 17 February 2017
R: self organizing maps
Interesting topics, see
https://www.r-bloggers.com/self-organising-maps-for-customer-segmentation-using-r/
http://www.slideshare.net/shanelynn/2014-0117-dublin-r-selforganising-maps-for-customer-segmentation-shane-lynn
Other interesting reading
https://www.r-bloggers.com/r-an-integrated-statistical-programming-environment-and-gis/
https://www.r-bloggers.com/how-to-perform-pca-with-r/
Thursday, 16 February 2017
SQL playground: remove duplicates and count students who passed an exam.
A trick for removing duplicates from a table "in place", i.e. without overwriting it and without dropping it and inserting another one.
To be tested:
Columns: ID (unique); imie, nazwisko, dane (may contain duplicates).
DELETE *
FROM Tabela1
WHERE [id] NOT IN
    (SELECT Max(Tabela1.id) AS id
     FROM Tabela1
     GROUP BY Tabela1.imie, Tabela1.nazwisko, Tabela1.dane);
The GROUP BY line determines which columns are checked for duplicates.
In place of Select Max... you can use First() or Min() to decide which of the duplicated records to keep.
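Since the query above is marked as still to be tested, here is one way to try the same idea quickly. This is a sketch in Python with an in-memory SQLite database, not the Access/Excel setup from the post: SQLite needs a plain DELETE without the "*", and the sample rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Tabela1 (id INTEGER PRIMARY KEY, imie TEXT, nazwisko TEXT, dane TEXT)"
)
con.executemany(
    "INSERT INTO Tabela1 (id, imie, nazwisko, dane) VALUES (?, ?, ?, ?)",
    [
        (1, "Jan", "Kowalski", "a"),
        (2, "Jan", "Kowalski", "a"),   # duplicate of id 1
        (3, "Anna", "Nowak", "b"),
        (4, "Jan", "Kowalski", "a"),   # duplicate of id 1
        (5, "Jan", "Kowalski", "c"),   # differs in dane, so not a duplicate
    ],
)

# Keep only the row with the highest id in each (imie, nazwisko, dane) group.
con.execute(
    """
    DELETE FROM Tabela1
    WHERE id NOT IN (
        SELECT MAX(id) FROM Tabela1 GROUP BY imie, nazwisko, dane
    )
    """
)

remaining_ids = [row[0] for row in con.execute("SELECT id FROM Tabela1 ORDER BY id")]
print(remaining_ids)  # [3, 4, 5]
```

Swapping MAX for MIN in this sketch keeps the oldest row of each group instead (ids 1, 3 and 5).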
There are three tables:
1. Studenci (_indeks_, imie, nazwisko)
2. Kursy (_id_, tytul, godzin, punktow), where punktow is the number of points awarded for passing the course
3. Szkolenia (osoba, kurs, zaliczenie), where osoba is the student's index number, zaliczenie is the date the course was passed, and kurs is the course number
and the first task from the matura exam reads: How many students passed at the first attempt (by 30 June 2016)?
Select count([_indeks_Stud]) AS ilu_zdalo From
    (Select [Studenci$].[_indeks_Stud]
     From ([Kursy$]
     Inner Join [Szkolenia$] On [Kursy$].[_id_kurs] = [Szkolenia$].kurs)
     Inner Join [Studenci$] On [Studenci$].[_indeks_Stud] = [Szkolenia$].osoba
     Where [Szkolenia$].zaliczenie <= #2016-06-30#
     Group By [Studenci$].[_indeks_Stud]
     Having Sum([Kursy$].punktow) >= 15)
The code was written in the FlySpeed SQL Query IDE against a database built from Excel worksheet tabs, hence the dollar signs, so it should also work when pasted into Microsoft Query.
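For anyone without the Excel workbook, the logic of the counting query can be checked with a sketch in Python and an in-memory SQLite database. The assumptions are mine: the table names drop the "$", dates are stored as ISO text instead of Access date literals, and all rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE Studenci (_indeks_Stud INTEGER PRIMARY KEY, imie TEXT, nazwisko TEXT);
    CREATE TABLE Kursy (_id_kurs INTEGER PRIMARY KEY, tytul TEXT, godzin INTEGER, punktow INTEGER);
    CREATE TABLE Szkolenia (osoba INTEGER, kurs INTEGER, zaliczenie TEXT);

    -- Hypothetical rows, invented for this sketch
    INSERT INTO Studenci VALUES (1, 'Jan', 'Kowalski'), (2, 'Anna', 'Nowak'), (3, 'Piotr', 'Lis');
    INSERT INTO Kursy VALUES (10, 'SQL', 20, 10), (11, 'R', 30, 5), (12, 'VBA', 10, 5);
    INSERT INTO Szkolenia VALUES
        (1, 10, '2016-05-10'), (1, 11, '2016-06-30'),
        (2, 10, '2016-06-01'), (2, 11, '2016-07-15'),
        (3, 10, '2016-03-01'), (3, 11, '2016-04-01'), (3, 12, '2016-04-02');
    """
)

# Inner query: students whose courses passed by the deadline sum to >= 15 points.
# Outer query: count those students.
query = """
SELECT COUNT(*) FROM (
    SELECT Studenci._indeks_Stud
    FROM Kursy
    INNER JOIN Szkolenia ON Kursy._id_kurs = Szkolenia.kurs
    INNER JOIN Studenci ON Studenci._indeks_Stud = Szkolenia.osoba
    WHERE Szkolenia.zaliczenie <= '2016-06-30'
    GROUP BY Studenci._indeks_Stud
    HAVING SUM(Kursy.punktow) >= 15
)
"""
passed = con.execute(query).fetchone()[0]
print(passed)  # 2
```

In this made-up data, student 2 collects only 10 points before the deadline (the second course was passed in July), so only students 1 and 3 are counted.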
Python in RStudio
Data used:
Titanic data: https://www.kaggle.com/c/titanic/data
and tutorial: http://nbviewer.jupyter.org/github/savarin/pyconuk-introtutorial/blob/master/notebooks/Section%201-0%20-%20First%20Cut.ipynb
Flights data: http://ucl.ac.uk/~uctqiax/data/flights.csv
Software used:
Portable scientific WinPython (with pandas and scikit-learn):
https://sourceforge.net/projects/winpython/?source=typ_redirect
To work, it needed a Windows update (my OS is Windows 7):
https://www.microsoft.com/en-us/download/confirmation.aspx?id=49093
To install packages from source it needed:
http://landinghub.visualstudio.com/visual-cpp-build-tools
I needed the feather package, so I downloaded it and used the Python command pip install,
as taught here: https://github.com/winpython/winpython/wiki/Installing-Additional-Packages
and installed it from source: https://github.com/wesm/feather/tree/master/python
To learn how to use other languages in RStudio: http://rmarkdown.rstudio.com/authoring_knitr_engines.html
I also wanted to check whether a portable version of bash would work. No problem:
http://win-bash.sourceforge.net/
Code for my playground.
---
title: "R Notebook"
output: html_notebook
---

## Bash

```{bash, engine.path="C:\\Users\\jkotows2\\Desktop\\shell.w32-ix86\\bash.exe"}
cat flights1.csv flights2.csv flights3.csv > flights.csv
```

## Python

http://rmarkdown.rstudio.com/authoring_knitr_engines.html

```{python, engine.path="C:\\Users\\jkotows2\\Desktop\\WinPython\\python-3.6.0.amd64\\python.exe"}
import pandas
import feather

# Read flights data and select flights to O'Hare
flights = pandas.read_csv("C:\\Users\\jkotows2\\Desktop\\_flights\\flights.csv")
flights = flights[flights['dest'] == "ORD"]

# Select carrier and delay columns and drop rows with missing values
flights = flights[['carrier', 'dep_delay', 'arr_delay']]
flights = flights.dropna()
print(flights.head(10))

# Write to feather file for reading from R
feather.write_dataframe(flights, "C:\\Users\\jkotows2\\Desktop\\_flights\\flights.feather")
```

## Back to R

```{r}
library(feather)
library(ggplot2)

# Read from feather and plot
flights <- read_feather("C:\\Users\\jkotows2\\Desktop\\_flights\\flights.feather")
ggplot(flights, aes(carrier, arr_delay)) + geom_point() + geom_jitter()
```
Wednesday, 15 February 2017
lpsolve - solver in R
To study:
http://flovv.github.io/From_descritpive_to_prescriptive/
https://icyrock.com/blog/2013/12/linear-programming-in-r-using-lpsolve/
http://lpsolve.r-forge.r-project.org/
http://horicky.blogspot.co.uk/2013/01/optimization-in-r.html
http://lpsolve.sourceforge.net/5.5/R.htm
Thursday, 19 January 2017
R packages for the lazy.
Data entry
Datapasta: copy data from Excel or HTML into an R file with keyboard shortcuts, in a nicely readable text "tribble" format. https://github.com/MilesMcBain/datapasta
Quick overview
Quickly plot data when there is no time to write nice ggplot2 code. https://github.com/stefan-schroedl/plotluck
Quick ensemble predictions: the ensembleR package