Jacek Kotowski's toolbox.: 2017

Tuesday, 12 September 2017

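Why R? hclust

hclust: an attempt to analyse the answers given in the registration form for the WhyR conference. Column: "Chcesz.podzielić.się.odpowiedzią.na.to.pytanie..Chętnie.przedstawimy.najciekawsze.odpowiedzi" ("Do you want to share your answer to this question? We will gladly present the most interesting answers")
- because it is a lingua franca,
- because of the enormous possibilities,
- because of ggplot2,
- because it is not Python, because it is not SAS,
- because it is free,
- because of Data Science...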

Subjective interpretation, done in Paint :-)

R gives you enormous possibilities for free and lets you grow and communicate (lingua franca), especially thanks to ggplot2. It is free. It is an alternative to SAS and Python.

Edited in EzGif.com

Friday, 9 June 2017

Introductory Python ML, short 2-day course.

Conclusions:
1. R is so much easier, portable, supercool. Jupyter Notebooks are far behind RStudio's R Notebooks.
2. Python can be learned and is similar; just remember that indentation is part of the syntax :-)
3. I must learn pandas! scikit-learn! seaborn! Maybe by comparison with dplyr and ggplot2.

Microsoft, please allow more R/Excel VBA/Python interoperability for all.

Excel: conditional number format strings...

Why did I never discover this until today!!!

[=0]"";rrrr-mm-dd

It means: "If the value is 0, show nothing instead of an erroneous date (Excel formats 0 as the nonexistent date 1900-01-00)". Here rrrr is the year code in Polish-locale Excel (yyyy in the English one). Is there more? Like coloring the font depending on which conditions are met? Supercool!
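
To make the logic of that format string explicit, here is a minimal Python sketch of what it does to a cell value. The function name `format_cell` is hypothetical, and the conversion assumes Excel's default 1900 date system, where serial numbers from 61 upwards can be counted in days from 1899-12-30.

```python
from datetime import date, timedelta

# Assumption: Excel's 1900 date system; valid for serial numbers >= 61.
EXCEL_EPOCH = date(1899, 12, 30)


def format_cell(serial):
    """Mimic the number format [=0]"";yyyy-mm-dd for a date serial number."""
    if serial == 0:
        # First section of the format: the condition [=0] maps to an empty string.
        return ""
    # Second section: every other value is shown as an ISO-style date.
    return (EXCEL_EPOCH + timedelta(days=serial)).isoformat()


print(repr(format_cell(0)))
print(format_cell(42736))
```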

Saturday, 6 May 2017

Big Data on a laptop, some options.

Data mining on streams.
http://moa.cms.waikato.ac.nz/rmoa-massive-online-data-stream-classifications-with-r-moa/
http://jwijffels.github.io/RMOA/
https://cran.r-project.org/web/packages/stream/stream.pdf

Database light backend.
https://www.monetdb.org/blog/monetdblite-r

Data mining
https://rdrr.io/cran/ffbase/man/bigglm.ffdf.html
https://cran.r-project.org/web/packages/speedglm/speedglm.pdf
https://cran.r-project.org/web/packages/randomForest.ddR/randomForest.ddR.pdf

Friday, 5 May 2017

Bayes for beginners videos.

How to explain it in plain English:
https://www.khanacademy.org/math/statistics-probability/probability-library/conditional-probability-independence/v/calculating-conditional-probability
http://www.watchknowlearn.org/Video.aspx?VideoID=16751&CategoryID=4457

https://www.khanacademy.org/math/ap-statistics/probability-ap/stats-conditional-probability/v/bayes-theorem-visualized
https://www.khanacademy.org/partner-content/wi-phi/wiphi-critical-thinking/wiphi-fundamentals/v/bayes-theorem
https://www.youtube.com/watch?v=Y-V4rfdl3NI
https://brilliant.org/wiki/bayes-theorem/

Friday, 24 March 2017

Risk matrix in R (interesting readings)

1) Risk matrix examples http://davidmeza1.github.io/2015/12/17/2015-12-17-Creating-a-Risk-Matrix-in-R.html

2) For multiple overlapping points, use ggrepel instead of jitter https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html

3) Is there a ggrepel for Excel? http://stackoverflow.com/questions/30294041/excel-bubble-chart-overlapping-data-label

Tuesday, 21 March 2017

Stats Day2: Covariance

Covariance
Cov (sample) = sum[(x - mean(x)) * (y - mean(y))] / (n - 1) ... we are interested in the sign. It does not show the strength of the relationship and is not standardized.

Covariance matrix: the diagonal shows the variance of each variable; the off-diagonal entries show the covariance between each pair of variables.

Correlation (Pearson, r)
r = Cov(x, y) / [SD(x) * SD(y)]
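
As a quick check of the two formulas, here is a minimal Python sketch using only the standard library. The function names and the sample data are made up for illustration; the covariance uses the sample version (dividing by n - 1).

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_covariance(x, y):
    """sum[(x - mean(x)) * (y - mean(y))] / (n - 1)"""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def sample_sd(values):
    # The variance is the covariance of a variable with itself.
    return math.sqrt(sample_covariance(values, values))


def pearson_r(x, y):
    """r = Cov(x, y) / [SD(x) * SD(y)]"""
    return sample_covariance(x, y) / (sample_sd(x) * sample_sd(y))


# Made-up data: y is exactly 2 * x, so the relationship is perfectly linear.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

print(sample_covariance(x, y))  # positive sign, but the size depends on the units
print(pearson_r(x, y))          # standardized, very close to 1 for this data
```

Rescaling y (say, multiplying it by 100) changes the covariance but leaves r unchanged, which is what "not standardized" versus "standardized" means here.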

Monday, 20 March 2017

Stats Day1

On the first day of spring I am starting a regular study of introductory statistics:

My notes from day 1:
Standard Error of the Mean (the SD of the sampling distribution of the mean):
SD / sqrt(n): the larger the sample size, the smaller the error.
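
A small Python sketch of the standard error formula, with made-up numbers, showing that a larger sample with the same values gives a smaller error. The helper name `standard_error` is my own; `statistics.stdev` is the sample standard deviation from the standard library.

```python
import math
import statistics


def standard_error(sample):
    """SD / sqrt(n), using the sample standard deviation."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


small = [4.0, 6.0, 8.0, 10.0]   # n = 4
large = small * 4               # the same values repeated, n = 16

print(standard_error(small))
print(standard_error(large))    # smaller, because n is larger
```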

Z-Score: (x-mean)/SD
https://en.wikipedia.org/wiki/Standard_score

T-Score:
Conversion Score = z * (NewSD)+NewMean
NewSD for T-Score = 10 and NewMean = 50
T-Score = Z-Score*10+50
More: Essentials of Testing and Assessment: A Practical Guide for Counselors ... by Edward S. Neukrug and R. Charles Fawcett, p. 133 (fragment available online via Google)
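
The conversion above can be sketched in a few lines of Python. The function names and the example values (a raw score of 115 on a scale with mean 100 and SD 15) are made up for illustration.

```python
def z_score(x, mean, sd):
    """Z-Score: (x - mean) / SD"""
    return (x - mean) / sd


def convert_score(z, new_mean, new_sd):
    """Conversion Score = z * NewSD + NewMean"""
    return z * new_sd + new_mean


def t_score(x, mean, sd):
    """T-Score: a conversion score with NewMean = 50 and NewSD = 10."""
    return convert_score(z_score(x, mean, sd), new_mean=50, new_sd=10)


# One SD above the mean gives z = 1 and therefore T = 60.
print(z_score(115, mean=100, sd=15))  # 1.0
print(t_score(115, mean=100, sd=15))  # 60.0
```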

Coefficient of Variation
CV% = (SD/Mean)*100
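
A minimal Python sketch of the coefficient of variation, with made-up samples: both have the same SD, but the one with the larger mean has the smaller CV%. The function name is my own choice.

```python
import statistics


def coefficient_of_variation(sample):
    """CV% = (SD / Mean) * 100, using the sample standard deviation."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100


low_mean = [4.0, 6.0, 8.0, 10.0]
high_mean = [104.0, 106.0, 108.0, 110.0]  # same spread, shifted by 100

print(coefficient_of_variation(low_mean))
print(coefficient_of_variation(high_mean))
```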

ExcelVBA: From recorded macro to reusable function.

Problem: with each change in slicer, pivot chart formatting changes.

So I want all my labels vertical (and with each change the damn thing returns to horizontal). My recorded macro did not return anything sensible. Reading up on chart label properties brought me to this solution:

Sub Macro1()
'
' "Recorder" version
'
With ActiveSheet.ChartObjects("chValueofOffers").Chart

    ' Turn the data labels of the first two series vertical
    .SeriesCollection(1).DataLabels.Orientation = xlUpward
    .SeriesCollection(2).DataLabels.Orientation = xlUpward

End With

End Sub

Next I wanted to iterate through the series collection so that I do not need to change the macro when there are more series:

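Sub Macro2()
'
' "Object-oriented, looping" version
'
Dim seriesCol As SeriesCollection
Dim mySeries As Series

' Loop over every series, however many the chart has
Set seriesCol = ActiveSheet.ChartObjects("chValueofOffers").Chart.SeriesCollection

For Each mySeries In seriesCol
    mySeries.DataLabels.Orientation = xlUpward
Next

' Same again for the second chart
Set seriesCol = ActiveSheet.ChartObjects("chValueofOffersTotal").Chart.SeriesCollection

For Each mySeries In seriesCol
    mySeries.DataLabels.Orientation = xlUpward
Next

End Sub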

Finally... what if I have several charts on the same sheet that need vertical labels? Why not make it a reusable function with the sheet name and chart name as arguments? Now... with each change in the slicer the labels in the charts get corrected to vertical.

Private Function DataLabelsVertical(mySheet As String, chtName As String)
' Turn the data labels of every series in the named chart vertical

Dim seriesCol As SeriesCollection
Dim mySeries As Series
Set seriesCol = Worksheets(mySheet).ChartObjects(chtName).Chart.SeriesCollection

    For Each mySeries In seriesCol
       mySeries.DataLabels.Orientation = xlUpward
    Next

End Function

Private Sub Worksheet_Change(ByVal Target As Range)
   ' Re-apply vertical labels to every chart after each change
   DataLabelsVertical "Value_tenders", "chValueofOffers"
   DataLabelsVertical "Value_tenders", "chValueofOffersTotal"
   DataLabelsVertical "Status", "chStatusValue"
   DataLabelsVertical "Status", "chStatusValueTotal"
End Sub

Sources:
http://www.java2s.com/Code/VBA-Excel-Access-Word/Excel/Loopthrougheachseriesinchartandaltermarkercolors.htm
http://stackoverflow.com/questions/21165581/vba-looping-through-all-series-within-all-charts

Friday, 17 February 2017

R: self organizing maps

Interesting topics, see

https://www.r-bloggers.com/self-organising-maps-for-customer-segmentation-using-r/
http://www.slideshare.net/shanelynn/2014-0117-dublin-r-selforganising-maps-for-customer-segmentation-shane-lynn

Other interesting reading
https://www.r-bloggers.com/r-an-integrated-statistical-programming-environment-and-gis/

https://www.r-bloggers.com/how-to-perform-pca-with-r/

Thursday, 16 February 2017

SQL playground, remove duplicates and count students that passed exam.

A trick for removing duplicates from a table "in place", i.e. without overwriting it and without dropping it and inserting another one:

To be tested:

Columns: ID (never repeats); imie, nazwisko, dane (first name, surname, data; these may contain duplicates).

DELETE *
FROM Tabela1
WHERE [id] NOT IN
  (SELECT Max(Tabela1.id) AS id
    FROM Tabela1
    GROUP BY Tabela1.imie, Tabela1.nazwisko, Tabela1.dane);

In the GROUP BY line you can specify which columns are checked for duplicates.
Instead of Select Max... you can use First() or Min() to decide which of the duplicated records to keep.
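
To try the idea without Access, here is a minimal Python sketch that runs the same pattern against an in-memory SQLite database. The sample rows are made up, and the statement is written as `DELETE FROM` because SQLite does not accept the Access-style `DELETE *`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Tabela1 (id INTEGER PRIMARY KEY, imie TEXT, nazwisko TEXT, dane TEXT)"
)
# Made-up rows: ids 1, 2 and 4 are duplicates of each other.
conn.executemany(
    "INSERT INTO Tabela1 (id, imie, nazwisko, dane) VALUES (?, ?, ?, ?)",
    [
        (1, "Anna", "Nowak", "a"),
        (2, "Anna", "Nowak", "a"),
        (3, "Jan", "Kowalski", "b"),
        (4, "Anna", "Nowak", "a"),
        (5, "Jan", "Kowalski", "c"),
    ],
)

# Keep only the row with the highest id in each (imie, nazwisko, dane) group.
conn.execute(
    """
    DELETE FROM Tabela1
    WHERE id NOT IN (
        SELECT MAX(id) FROM Tabela1 GROUP BY imie, nazwisko, dane
    )
    """
)

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM Tabela1 ORDER BY id")]
print(remaining_ids)  # [3, 4, 5]
```

Swapping `MAX(id)` for `MIN(id)` would keep the oldest row of each group instead.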

There are three tables:
1. Studenci (_indeks_, imie, nazwisko)
2. Kursy (_id_, tytul, godzin, punktow), where punktow is the number of points awarded for passing the course
3. Szkolenia (osoba, kurs, zaliczenie), where osoba is the student's index number, zaliczenie is the date the course was passed, and kurs is the course number
The first task from the matura exam reads: how many students passed at the first deadline (by 30 June 2016)?

Select Count([_indeks_Stud]) As ilu_zdalo
From
    (Select [Studenci$].[_indeks_Stud]
     From ([Kursy$]
         Inner Join [Szkolenia$] On [Kursy$].[_id_kurs] = [Szkolenia$].kurs)
         Inner Join [Studenci$] On [Studenci$].[_indeks_Stud] = [Szkolenia$].osoba
     Where [Szkolenia$].zaliczenie <= #2016-06-30#
     Group By [Studenci$].[_indeks_Stud]
     Having Sum([Kursy$].punktow) >= 15)
The code was written in the FlySpeed SQL Query IDE against a database built from worksheets in Excel, hence the dollar signs... so the code should work when pasted into Microsoft Query.
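
For experimenting outside Excel, the same query shape can be sketched in Python with an in-memory SQLite database. Everything here is an assumption for illustration: the simplified column names (indeks, id), the sample students and courses, and ISO-formatted text dates in place of the Access #...# date literal. The 15-point threshold is taken from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Studenci (indeks INTEGER PRIMARY KEY, imie TEXT, nazwisko TEXT);
    CREATE TABLE Kursy (id INTEGER PRIMARY KEY, tytul TEXT, godzin INTEGER, punktow INTEGER);
    CREATE TABLE Szkolenia (osoba INTEGER, kurs INTEGER, zaliczenie TEXT);

    INSERT INTO Studenci VALUES (1, 'Anna', 'Nowak'), (2, 'Jan', 'Kowalski'), (3, 'Ewa', 'Lis');
    INSERT INTO Kursy VALUES (10, 'SQL', 30, 10), (20, 'R', 20, 5), (30, 'VBA', 10, 5);
    INSERT INTO Szkolenia VALUES
        (1, 10, '2016-05-10'), (1, 20, '2016-06-30'),  -- 15 points before the deadline
        (2, 10, '2016-06-01'), (2, 20, '2016-07-15'),  -- only 10 points before the deadline
        (3, 20, '2016-04-01'), (3, 30, '2016-04-02');  -- 10 points in total
    """
)

# Sum the points of courses passed by the deadline per student,
# then count the students who reach at least 15 points.
query = """
    SELECT COUNT(*) AS ilu_zdalo FROM (
        SELECT s.indeks
        FROM Szkolenia AS z
        JOIN Kursy AS k ON k.id = z.kurs
        JOIN Studenci AS s ON s.indeks = z.osoba
        WHERE z.zaliczenie <= '2016-06-30'
        GROUP BY s.indeks
        HAVING SUM(k.punktow) >= 15
    )
"""
ilu_zdalo = conn.execute(query).fetchone()[0]
print(ilu_zdalo)  # 1
```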

Python in RStudio

Data used:
Titanic data: https://www.kaggle.com/c/titanic/data
and tutorial: http://nbviewer.jupyter.org/github/savarin/pyconuk-introtutorial/blob/master/notebooks/Section%201-0%20-%20First%20Cut.ipynb

Flights data: http://ucl.ac.uk/~uctqiax/data/flights.csv

Software used:
Portable scientific WinPython (with pandas and scikit-learn):
https://sourceforge.net/projects/winpython/?source=typ_redirect
To run, it needed a Windows update (my OS is Windows 7):
https://www.microsoft.com/en-us/download/confirmation.aspx?id=49093

To install packages from source it also needed:
http://landinghub.visualstudio.com/visual-cpp-build-tools
I needed the feather package, so I downloaded it and used the Python command: pip install
as taught here: https://github.com/winpython/winpython/wiki/Installing-Additional-Packages
and installed from source: https://github.com/wesm/feather/tree/master/python

To learn how to use other languages in RStudio: http://rmarkdown.rstudio.com/authoring_knitr_engines.html

I also wanted to check whether a portable version of bash would work. No problem:
http://win-bash.sourceforge.net/

Code for my playground.

---
title: "R Notebook"
output: html_notebook
---

## Bash

```{bash, engine.path="C:\\Users\\jkotows2\\Desktop\\shell.w32-ix86\\bash.exe"}
cat flights1.csv flights2.csv flights3.csv > flights.csv
```

## Python

http://rmarkdown.rstudio.com/authoring_knitr_engines.html

```{python, engine.path="C:\\Users\\jkotows2\\Desktop\\WinPython\\python-3.6.0.amd64\\python.exe"}
import pandas
import feather

# Read flights data and select flights to O'Hare
flights = pandas.read_csv("C:\\Users\\jkotows2\\Desktop\\_flights\\flights.csv")
flights = flights[flights['dest'] == "ORD"]

# Select carrier and delay columns and drop rows with missing values
flights = flights[['carrier', 'dep_delay', 'arr_delay']]
flights = flights.dropna()
print(flights.head(10))

# Write to feather file for reading from R
feather.write_dataframe(flights, "C:\\Users\\jkotows2\\Desktop\\_flights\\flights.feather")
```

## Back to R

```{r}
library(feather)
library(ggplot2)

# Read from feather and plot
flights <- read_feather("C:\\Users\\jkotows2\\Desktop\\_flights\\flights.feather")
# geom_jitter() alone is enough; adding geom_point() as well would draw every flight twice
ggplot(flights, aes(carrier, arr_delay)) + geom_jitter()
```

Wednesday, 15 February 2017

lpsolve - solver in R

To study:
http://flovv.github.io/From_descritpive_to_prescriptive/
https://icyrock.com/blog/2013/12/linear-programming-in-r-using-lpsolve/
http://lpsolve.r-forge.r-project.org/
http://horicky.blogspot.co.uk/2013/01/optimization-in-r.html
http://lpsolve.sourceforge.net/5.5/R.htm

Thursday, 19 January 2017

R packages for the lazy.

Data entry
Datapasta: copy data from Excel or HTML into an R file with keyboard shortcuts, in a nice, readable text "tribble" format. https://github.com/MilesMcBain/datapasta

Quick overview
Plotluck: quickly plot data when there is no time to write nice ggplot2 code. https://github.com/stefan-schroedl/plotluck

Quick ensemble predictions: ensembleR package