
Titanic – Machine Learning from Disaster (Part 1)


(This article was first published on R – Networkx, and kindly contributed to R-bloggers)

Synopsis

In the Kaggle challenge Titanic – Machine Learning from Disaster, you need to predict which kinds of passengers were likely to survive the disaster. In particular, Kaggle asks you to apply the tools of machine learning to predict which passengers survived the tragedy.

I’ve split this up into two separate parts.

Part 1 – Data Exploration and basic Model Building
Part 2 – Creating own variables

Data Exploration

I’ve downloaded the train and test data from Kaggle. On that page you can also find the variable descriptions.

Import the training and testing set into R.

train <- read.csv("train.csv")
test <- read.csv("test.csv")

Let’s have a look at the data.

summary(train)
##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked
##             :687    :  2   
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:644   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186
head(train,2)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
##   Parch    Ticket    Fare Cabin Embarked
## 1     0 A/5 21171  7.2500              S
## 2     0  PC 17599 71.2833   C85        C
head(test,2)
##   PassengerId Pclass                             Name    Sex  Age SibSp
## 1         892      3                 Kelly, Mr. James   male 34.5     0
## 2         893      3 Wilkes, Mrs. James (Ellen Needs) female 47.0     1
##   Parch Ticket   Fare Cabin Embarked
## 1     0 330911 7.8292              Q
## 2     0 363272 7.0000              S
dim(train)
## [1] 891  12
dim(test)
## [1] 418  11
str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
str(test)
## 'data.frame':    418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...

The training set has 891 observations and 12 variables, and the testing set has 418 observations and 11 variables. The training set has one extra variable; let’s check which one is missing from the test set. With a dataset this small we could simply eyeball it, but with larger datasets it pays to compare the column names programmatically.

colnames_check <- colnames(train) %in% colnames(test)
colnames(train[colnames_check==FALSE])
## [1] "Survived"

As we can see, we are missing Survived in the test set. That is correct, because predicting it is exactly our challenge: we must build a model for it.

Let’s look deeper into the training set and check how many passengers survived versus how many did not.

table(train$Survived)
## 
##   0   1 
## 549 342

Of the 891 passengers, only 342 survived. Let’s also look at the proportions.

prop.table(table(train$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

A little more than one-third of the passengers survived the disaster. Now let’s see whether survival differed between males and females.

table(train$Sex, train$Survived)
##         
##            0   1
##   female  81 233
##   male   468 109
prop.table(table(train$Sex, train$Survived),margin = 1)
##         
##                  0         1
##   female 0.2579618 0.7420382
##   male   0.8110919 0.1889081

As we can see, most of the females survived and most of the males did not.

Model Building

After this exploratory look at the data, let’s make a first prediction before digging deeper.

First prediction – All Female Survived

Create a copy of test called test_female, initialize a Survived column to 0, and set Survived to 1 where Sex equals “female”.

test_female <- test
test_female$Survived <- 0
test_female$Survived[test_female$Sex == "female"] <- 1

Create a data frame with two columns, PassengerId and Survived, and write the solution to a CSV file.

my_solution <- data.frame(PassengerId = test_female$PassengerId, Survived = test_female$Survived)
write.csv(my_solution, file =  "all_female.csv", row.names = FALSE)

That’s our first submission to Kaggle and it’s good for a score of 0.76555. That’s not so bad, but we want more!! :)

Clean up the dataset

Now we need to clean the dataset before building our models. Note that it is important to explore the data first so that we understand which elements need cleaning.
For example, we have noticed that there are missing values in the data set, especially in the Age column of the training set. Let’s show which columns have missing values in the training and test sets.

colSums(is.na(train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
colSums(is.na(test))
## PassengerId      Pclass        Name         Sex         Age       SibSp 
##           0           0           0           0          86           0 
##       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           1           0           0

As we can see, we have missing values in Age in the training set, and in Age and Fare in the test set.

To tackle the missing values, I’m going to predict them using the full data set. First we need to combine the training and test sets.

train2 <- train
test2 <- test
test2$Survived <- NA
full <- rbind(train2, test2)

First we tackle the missing Fare, because it is only a single value. Let’s see in which row it is missing.

full[!complete.cases(full$Fare),]
##      PassengerId Survived Pclass               Name  Sex  Age SibSp Parch
## 1044        1044       NA      3 Storey, Mr. Thomas male 60.5     0     0
##      Ticket Fare Cabin Embarked
## 1044   3701   NA              S

As we can see, the passenger in row 1044 has an NA Fare value. Let’s replace it with the median fare.

full$Fare[1044] <- median(full$Fare, na.rm = TRUE)

How do we fill in the missing Age values? We predict a passenger’s Age from the other variables using a decision tree model.
This time we use method = "anova", since we are predicting a continuous variable.

library(rpart)
predicted_age <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked,
                       data = full[!is.na(full$Age),], method = "anova")
full$Age[is.na(full$Age)] <- predict(predicted_age, full[is.na(full$Age),])
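
As a quick sanity check (not in the original post), we can confirm that no missing Age values remain after the imputation:

# Age was the only remaining column with NAs (Cabin and Embarked use empty
# strings rather than NA), so every column should now report zero missing values.
colSums(is.na(full))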

Since we know that the training set has 891 observations and the test set 418, we can split the combined data back into a training set and a test set.

train2 <- full[1:891,]
test2 <- full[892:1309,]

Build a Decision Tree with rpart

Build the decision tree with rpart to predict Survived with the variables Pclass, Sex, Age, SibSp, Parch, Fare and Embarked.

my_dt1 <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, 
                     data = train2, 
                     method = "class")

Load the packages needed to create a nicely visualized version of the tree.

library(rattle)
library(rpart.plot)
library(RColorBrewer)

Visualize the decision tree using fancyRpartPlot().

fancyRpartPlot(my_dt1)

Plot of the my_dt1 decision tree

At the top we can see that the root node votes 0, so at this level everyone is predicted to die. Below that we see that 62% of passengers die, while 38% survive (because the majority die, the node votes that everyone dies). Going down one level to the male/female split, 81% of males and 26% of females die, while 19% and 74% survive, respectively; these proportions exactly match the ones we found earlier. Let’s look at those proportions again, rounded to two decimals.

round(prop.table(table(train2$Survived)),2)
## 
##    0    1 
## 0.62 0.38
round(prop.table(table(train2$Sex, train2$Survived),margin = 1),2)
##         
##             0    1
##   female 0.26 0.74
##   male   0.81 0.19

Those are indeed the same numbers. :)

Make the prediction using the test2 set.

my_prediction <- predict(my_dt1, newdata = test2, type = "class")

Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions.

my_solution <- data.frame(PassengerId = test2$PassengerId, Survived = my_prediction)

Check that your data frame has 418 entries.

nrow(my_solution)
## [1] 418

Write your solution to a csv file with the name my_dt1.csv.

write.csv(my_solution, file =  "my_dt1.csv", row.names = FALSE)

This gives us a score of 0.77512, a little better than our first submission.

Create a new decision tree my_dt2 with some control parameters: cp, the complexity parameter that determines when splitting stops, and minsplit, the minimum number of observations a node must have before a split is attempted.

my_dt2 <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, 
                       data = train2, 
                       method = "class",
                       control = rpart.control(minsplit = 50, cp = 0))

Visualize your new decision tree.

fancyRpartPlot(my_dt2)

Plot of the my_dt2 decision tree

Make the prediction using test2, create the two-column data frame, check the number of rows, and save it to my_dt2.csv.

my_prediction <- predict(my_dt2, newdata = test2, type = "class")
my_solution <- data.frame(PassengerId = test2$PassengerId, Survived = my_prediction)
nrow(my_solution)
## [1] 418
write.csv(my_solution, file =  "my_dt2.csv", row.names = FALSE)

This gives us a score of 0.74163, so this is not an improvement.

In part two I will create my own variables for the model.


It’s not the p-values’ fault – reflections on the recent ASA statement (+relevant R resources)


(This article was first published on R – R-statistics blog, and kindly contributed to R-bloggers)

Joint post by Yoav Benjamini and Tal Galili. The post highlights points raised by Yoav in his official response to the ASA statement (available on page 4 of the ASA supplemental material), as well as offering a list of relevant R resources.

Summary

The ASA statement about the misuses of the p-value singles it out, yet it is just as relevant to the use of most other statistical methods: context matters, no single statistical measure suffices, specific thresholds should be avoided, and reporting should not be done selectively. The latter problem is discussed mainly in relation to omitted inferences. We argue that selective reporting of inferences is a serious enough problem in our current industrialized science even when no omission takes place. Many R tools are available to address it, but they are mainly used in very large problems and are grossly underused in areas where lack of replicability hits hard.

Image: p-values comic. Source: xkcd

Preface – the ASA released a statement about the p-value

A few days ago the ASA released a statement titled “on p-values: context, process, and purpose”. It was a way for the ASA to address the concerns about the role of Statistics in the Reproducibility and Replicability (R&R) crisis. In the discussions about R&R the p-value, being such a widely used statistical method, has become a scapegoat. The ASA statement made an effort to clarify various misinterpretations and to point at misuses of the p-value, but we fear that the result is a statement that might be read by the target readers as expressing a very negative attitude towards the p-value. And indeed, just two days after the release of the ASA statement, a blog post titled “After 150 Years, the ASA Says No to p-values” was published (by Norman Matloff), even though the ASA (as far as we read it) did not say “no to p-values” anywhere in the statement. Thankfully, other online reactions to the ASA statement, such as the article in Nature, and other posts in the blogosphere (see [1], [2], [3], [4]), did not use an anti-p-value rhetoric.

Why the p-value was (and still is) valuable

In spite of its misinterpretations, the p-value served science well over the 20th century. Why? Because in some sense the p-value offers a first line of defense against being fooled by randomness, separating signal from noise. It requires simpler (or fewer) models than those needed by other statistical tools. The p-value requires only a statistical model for the behavior of a statistic under the null hypothesis to hold. Even if a model of an alternative hypothesis is used for choosing a “good” statistic (which would be used for constructing the p-value), this alternative model does not have to be correct in order for the p-value to be valid and useful (i.e. to control the type I error at the desired level while offering some power to detect a real effect). In contrast, other (wonderful and useful) statistical methods such as likelihood ratios, effect size estimation, confidence intervals, or Bayesian methods all need the assumed models to hold over a wider range of situations, not merely under the tested null. And most importantly, the model needed for the calculation of the p-value may be guaranteed to hold under an appropriately designed and executed randomized experiment.

The p-value is a very valuable tool, but it should be complemented, not replaced, by confidence intervals and effect size estimators (where possible in the specific setting). The ends of a 95% confidence interval indicate the range of potential null hypotheses that could be rejected. An estimator of effect size (supported by an assessment of uncertainty) is crucial for interpretation and for assessing the scientific significance of the results.

While useful, all these types of inference are affected by problems similar to those of the p-value. What level of likelihood ratio in favor of the research hypothesis will be acceptable to the journal? Or should scientific discoveries be based on whether posterior odds pass a specific threshold? Does either of them measure the size of the effect? Finally, 95% confidence intervals or credible intervals offer no protection against selection when only those that do not cover 0 are selected into the abstract. The properties each method has on average for a single parameter (level, coverage or unbiasedness) will not necessarily hold, even on average, when a selection is made.

The p-value (and other methods) in the new era of “industrialized science”

What, then, went wrong in the last decade or two? The scale of scientific work changed, brought about by high-throughput experimentation methodologies, the availability of large databases, and the ease of computation; a change that parallels the industrialization that production processes have already gone through. In Genomics, Proteomics, Brain Imaging and the like, the number of potential discoveries scanned is enormous, so the selection of the interesting ones for highlighting is a must. It has by now been recognized in these fields that merely “full reporting and transparency” (as recommended by the ASA) is not enough, and methods should be used to control the effect of the unavoidable selection. Therefore, in those same areas, the p-value bright line is not set at the traditional 5% level. Methods for adaptively setting it to directly control a variety of false discovery rates or other error rates are commonly used.

Addressing the effect of selection on inference (be it when using p-values or other methods) has been a very active research area; new strategies and sophisticated selective inference tools for testing, confidence intervals, and effect size estimation in different setups are being offered. Much of this still remains outside the practitioners’ active toolset, even though many of the tools are already available in R, as we describe below. The appendix of this post contains a partial list of R packages that support simultaneous and selective inference.

In summary, when discussing the impact of statistical practices on R&R, the p-value should not be singled out nor its usage discouraged: it’s more likely the fault of selection, and not the p-values’ fault.

Appendix – R packages for Simultaneous and Selective Inference (“SASI” R packages)

Extended support for classical and modern adjustment for Simultaneous and Selective Inference (also known as “multiple comparisons”) is available in R and in various R packages. Traditional concern in these areas has been on properties holding simultaneously for all inferences. More recent concerns are on properties holding on the average over the selected, addressed by varieties of false discovery rates, false coverage rates and conditional approaches. The following is a list of relevant R resources. If you have more, please mention them in the comments.

Every R installation offers functions (from the {stats} package) for dealing with multiple comparisons, such as: 

  • p.adjust – takes a set of p-values as input and returns p-values adjusted using one of several methods: Bonferroni, Holm (1979), Hochberg (1988), Hommel (1988), FDR by Benjamini & Hochberg (1995), and Benjamini & Yekutieli (2001).
  • pairwise.t.test, pairwise.wilcox.test, and pairwise.prop.test – all rely on p.adjust and calculate pairwise comparisons between group levels with corrections for multiple testing.
  • TukeyHSD – creates a set of confidence intervals on the differences between the means of the levels of a factor, with the specified family-wise probability of coverage. The intervals are based on the Studentized range statistic (Tukey’s ‘Honest Significant Difference’ method).
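
As a quick illustration (not part of the original post) of the base tools listed above:

# Benjamini-Hochberg (FDR) adjustment of a vector of raw p-values
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)
p.adjust(p, method = "BH")

# Tukey's Honest Significant Difference intervals after a one-way ANOVA
TukeyHSD(aov(count ~ spray, data = InsectSprays))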

 

Once we venture outside of the core R functions, we are introduced to a wealth of R packages and statistical procedures. What follows is a partial list (if you wish to contribute and extend this list, please leave your comment to this post):

  • multcomp – Simultaneous tests and confidence intervals for general linear hypotheses in parametric models, including linear, generalized linear, linear mixed effects, and survival models. The package includes demos reproducing analyzes presented in the book “Multiple Comparisons Using R” (Bretz, Hothorn, Westfall, 2010, CRC Press).
  • coin (+RcmdrPlugin.coin)- Conditional inference procedures for the general independence problem including two-sample, K-sample (non-parametric ANOVA), correlation, censored, ordered and multivariate problems.
  • SimComp – Simultaneous tests and confidence intervals are provided for one-way experimental designs with one or many normally distributed, primary response variables (endpoints).
  • PMCMR – Calculate Pairwise Multiple Comparisons of Mean Rank Sums
  • mratios – perform (simultaneous) inferences for ratios of linear combinations of coefficients in the general linear model.
  • mutoss (and accompanying mutossGUI) – are designed to ease the application and comparison of multiple hypothesis testing procedures.
  • nparcomp – compute nonparametric simultaneous confidence intervals for relative contrast effects in the unbalanced one way layout. Moreover, it computes simultaneous p-values.
  • ANOM – The package takes results from multiple comparisons with the grand mean (obtained with ‘multcomp’, ‘SimComp’, ‘nparcomp’, or ‘MCPAN’) or corresponding simultaneous confidence intervals as input and produces ANOM decision charts that illustrate which group means deviate significantly from the grand mean.
  • gMCP – Functions and a graphical user interface for graphical described multiple test procedures.
  • MCPAN – Multiple contrast tests and simultaneous confidence intervals based on normal approximation.
  • mcprofile – Calculation of signed root deviance profiles for linear combinations of parameters in a generalized linear model. Multiple tests and simultaneous confidence intervals are provided.
  • factorplot – Calculate, print, summarize and plot pairwise differences from GLMs, GLHT or Multinomial Logit models. Relies on stats::p.adjust
  • multcompView – Convert a logical vector or a vector of p-values or a correlation, difference, or distance matrix into a display identifying the pairs for which the differences were not significantly different. Designed for use in conjunction with the output of functions like TukeyHSD, dist{stats}, simint, simtest, csimint, csimtest{multcomp}, friedmanmc, kruskalmc{pgirmess}.
  • discreteMTP – Multiple testing procedures for discrete test statistics, that use the known discrete null distribution of the p-values for simultaneous inference.
  • someMTP – a collection of functions for Multiplicity Correction and Multiple Testing.
  • hdi – Implementation of multiple approaches to perform inference in high-dimensional models
  • ERP – Significance Analysis of Event-Related Potentials Data
  • TukeyC – Perform the conventional Tukey test from aov and aovlist objects
  • qvalue – offers a function which takes a list of p-values resulting from the simultaneous testing of many hypotheses and estimates their q-values and local FDR values. (reading this discussion thread might be helpful)
  • fdrtool – Estimates both tail area-based false discovery rates (Fdr) as well as local false discovery rates (fdr) for a variety of null models (p-values, z-scores, correlation coefficients, t-scores).
  • cp4p – Functions to check whether a vector of p-values respects the assumptions of FDR (false discovery rate) control procedures and to compute adjusted p-values.
  • multtest – Non-parametric bootstrap and permutation resampling-based multiple testing procedures (including empirical Bayes methods) for controlling the family-wise error rate (FWER), generalized family-wise error rate (gFWER), tail probability of the proportion of false positives (TPPFP), and false discovery rate (FDR).
  • selectiveInference – New tools for post-selection inference, for use with forward stepwise regression, least angle regression, the lasso, and the many means problem.
  • PoSI (site) – Valid Post-Selection Inference for Linear LS Regression
  • HWBH – A shiny app for hierarchical weighted FDR testing of primary and secondary endpoints in medical research. By Benjamini Y & Cohen R, 2013.
  • repfdr (@github) – estimation of Bayes and local Bayes false discovery rates for replicability analysis. Heller R, Yekutieli D, 2014.
  • SelectiveCI – An R package for computing confidence intervals for selected parameters, as described in Asaf Weinstein, William Fithian & Yoav Benjamini, 2013 and Yoav Benjamini, Daniel Yekutieli, 2005.
  • Rvalue – Software for FDR testing for replicability in primary and follow-up endpoints. Heller R, Bogomolov M, Benjamini Y, 2014, “Deciding whether follow-up studies have replicated findings in a preliminary large-scale ‘omics’ study”, under review and available upon request from the first author. Bogomolov M, Heller R, 2013.
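
For illustration (not part of the original post), here is a minimal sketch of simultaneous inference with multcomp, using the base R warpbreaks data:

library(multcomp)
# One-way model, then all pairwise (Tukey-type) contrasts
# with a 5% family-wise error rate
fit  <- aov(breaks ~ tension, data = warpbreaks)
comp <- glht(fit, linfct = mcp(tension = "Tukey"))
summary(comp)   # adjusted p-values
confint(comp)   # simultaneous confidence intervals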

Other than simultaneous and selective inference, one should also mention that there are many R packages for reproducible research, i.e. the connecting of data, R code, analysis output, and interpretation, so that scholarship can be recreated, better understood and verified; as well as for meta-analysis, i.e. the combining of findings from independent studies in order to make a more general claim.

 


R 3.2.4 is released


(This article was first published on R – R-statistics blog, and kindly contributed to R-bloggers)

R 3.2.4 (codename “Very Secure Dishes”) was released today. You can get the latest binary versions from here (or the .tar.gz source code from here). The full list of new features and bug fixes is provided below.

Upgrading to R 3.2.4 on Windows

If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

install.packages("installr") # install 
setInternet2(TRUE)
installr::updateR() # updating R.

Running “updateR()” will detect if there is a new R version available, and if so it will download+install it (etc.). There is also a step by step tutorial (with screenshots) on how to upgrade R on Windows, using the installr package.

I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package, you are invited to open an issue on the GitHub page.

NEW FEATURES

  • install.packages() and related functions now give a more informative warning when an attempt is made to install a base package.
  • summary(x) now prints with less rounding when x contains infinite values. (Request of PR#16620.)
  • provideDimnames() gets an optional unique argument.
  • shQuote() gains type = "cmd2" for quoting in cmd.exe in Windows. (Response to PR#16636.)
  • The data.frame method of rbind() gains an optional argument stringsAsFactors (instead of only depending on getOption("stringsAsFactors")).
  • smooth(x, *) now also works for long vectors.
  • tools::texi2dvi() has a workaround for problems with the texi2dvi script supplied by texinfo 6.1.

    It extracts more error messages from the LaTeX logs when in emulation mode.

UTILITIES

  • R CMD check will leave a log file ‘build_vignettes.log’ from the re-building of vignettes in the ‘.Rcheck’ directory if there is a problem, and always if the environment variable _R_CHECK_ALWAYS_LOG_VIGNETTE_OUTPUT_ is set to a true value.

DEPRECATED AND DEFUNCT

  • Use of SUPPORT_OPENMP from header ‘Rconfig.h’ is deprecated in favour of the standard OpenMP define _OPENMP.

    (This has been the recommendation in the manual for a while now.)

  • The make macro AWK which is long unused by R itself but recorded in file ‘etc/Makeconf’ is deprecated and will be removed in R 3.3.0.
  • The C header file ‘S.h’ is no longer documented: its use should be replaced by ‘R.h’.

BUG FIXES

  • kmeans(x, centers = <1-row>) now works. (PR#16623)
  • Vectorize() now checks for clashes in argument names. (PR#16577)
  • file.copy(overwrite = FALSE) would signal a successful copy when none had taken place. (PR#16576)
  • ngettext() now uses the same default domain as gettext(). (PR#14605)
  • array(.., dimnames = *) now warns about non-list dimnames and, from R 3.3.0, will signal the same error for invalid dimnames as matrix() has always done.
  • addmargins() now adds dimnames for the extended margins in all cases, as always documented.
  • heatmap() evaluated its add.expr argument in the wrong environment. (PR#16583)
  • require() etc now give the correct entry of lib.loc in the warning about an old version of a package masking a newer required one.
  • The internal deparser did not add parentheses when necessary, e.g. before [] or [[]]. (Reported by Lukas Stadler; additional fixes included as well).
  • as.data.frame.vector(*, row.names=*) no longer produces ‘corrupted’ data frames from row names of incorrect length, but rather warns about them. This will become an error.
  • url connections with method = "libcurl" are destroyed properly. (PR#16681)
  • withCallingHandler() now (again) handles warnings even during S4 generic’s argument evaluation. (PR#16111)
  • deparse(..., control = "quoteExpressions") incorrectly quoted empty expressions. (PR#16686)
  • format()ting datetime objects ("POSIX[cl]?t") could segfault or recycle wrongly. (PR#16685)
  • plot.ts(<matrix>, las = 1) now does use las.
  • saveRDS(*, compress = "gzip") now works as documented. (PR#16653)
  • (Windows only) The Rgui front end did not always initialize the console properly, and could cause R to crash. (PR#16998)
  • dummy.coef.lm() now works in more cases, thanks to a proposal by Werner Stahel (PR#16665). In addition, it now works for multivariate linear models ("mlm", manova) thanks to a proposal by Daniel Wollschlaeger.
  • The as.hclust() method for "dendrogram"s failed often when there were ties in the heights.
  • reorder() and midcache.dendrogram() now are non-recursive and hence applicable to somewhat deeply nested dendrograms, thanks to a proposal by Suharto Anggono in PR#16424.
  • cor.test() now calculates very small p values more accurately (affecting the result only in extreme not statistically relevant cases). (PR#16704)
  • smooth(*, do.ends=TRUE) did not always work correctly in R versions between 3.0.0 and 3.2.3.
  • pretty(D) for date-time objects D now also works well if range(D) is (much) smaller than a second. In the case of only one unique value in D, the pretty range now is more symmetric around that value than previously.
    Similarly, pretty(dt) no longer returns a length 5 vector with duplicated entries for Date objects dt which span only a few days.
  • The figures in help pages such as ?points were accidentally damaged, and did not appear in R 3.2.3. (PR#16708)
  • available.packages() sometimes deleted the wrong file when cleaning up temporary files. (PR#16712)
  • The X11() device sometimes froze on Red Hat Enterprise Linux 6. It now waits for MapNotify events instead of Expose events, thanks to Siteshwar Vashisht. (PR#16497)
  • [dpqr]nbinom(*, size=Inf, mu=.) now works as limit case, for ‘dpq’ as the Poisson. (PR#16727)
    pnbinom() no longer loops infinitely in border cases.
  • approxfun(*, method="constant") and hence ecdf() which calls the former now correctly “predict” NaN values as NaN.
  • summary.data.frame() now displays NAs in Date columns in all cases. (PR#16709)

 



R for Publication by Page Piccinini


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

The goal of this course is to give you the skills to do the statistics that are in current published papers, and make pretty figures to show off your results. While we will go over the mathematical concepts behind the statistics, this is NOT meant to be a classical statistics class. We will focus more on making the connection between the mathematical equation and the R code, and what types of variables fit into each slot of the equation.

Much of the R code will come from the Hadleyverse, including the well-known ggplot2, the less well-known dplyr, and the even less well-known (but still very useful!) purrr. If you already have experience with R but are less familiar with these packages, this course will help you make your R pipeline more readable and efficient. Moreover, you can read the dplyr tutorial and ggplot2 tutorial published here at DataScience+.

In addition to statistics and figure making, this course will get you acquainted with other aspects of R and RStudio to allow for more productive data analysis and management, including R Projects, Git, and Bitbucket.

Pre-course To Do

To begin you will need to have a few things pre-installed or set up:

  • Install R. If you already have R installed, be sure it is the newest version.
  • Install RStudio.
  • Make sure tex (e.g. LaTeX) is installed.
  • Set up Git on your local computer.
  • Make a Bitbucket account.

    After that you’re ready to go!

    Syllabus

    The course is set up to follow a certain order, with each lesson building on the previous one. However, you can also use the links below to jump to a specific topic. All videos for the lessons so far are also provided below. New material will be added throughout the course and this post will be updated frequently. To be alerted of new content, subscribe to my YouTube channel.

    The upcoming video lessons will be: Analysis of Variance (ANOVA); Linear Mixed Effects Models, Part 1; Linear Mixed Effects Models, Part 2.

    Related Post

    1. Assessing significance of slopes in regression models with interaction
    2. First steps with Non-Linear Regression in R
    3. Standard deviation vs Standard error
    4. Introduction to Circular Statistics – Rao’s Spacing Test
    5. Introduction to bootstrap with applications to mixed-effect models


    Additive modelling global temperature time series: revisited


    (This article was first published on From the Bottom of the Heap - R, and kindly contributed to R-bloggers)

    Quite some time ago, back in 2011, I wrote a post that used an additive model to fit a smooth trend to the then-current Hadley Centre/CRU global temperature time series data set. Since then, the media and scientific papers have been full of reports of record warm temperatures in the past couple of years, of (imagined) controversies regarding data changes to suit the hypothesis of human-induced global warming, and of the brouhaha over whether global warming had stalled; the great global warming hiatus or pause. So it seemed like a good time to revisit that analysis and update it using the latest HadCRUT data.

    A further motivation was my reading Cahill, Rahmstorf, and Parnell (2015), in which the authors use a Bayesian change point model for global temperatures. This model is essentially piece-wise linear but with smooth transitions between the piece-wise linear components. I don’t immediately see where in their Bayesian model the smooth transitions come from, but that’s what they show. My gut reaction was why piece-wise linear with smooth transitions? Why not smooth everywhere? And that’s what the additive model I show here assumes.

    First, I grab the data (Morice et al. 2012) from the Hadley Centre’s website and load it into R

    library("curl")
    tmpf <- tempfile()
    curl_download("http://www.metoffice.gov.uk/hadobs/hadcrut4/data/current/time_series/HadCRUT.4.4.0.0.annual_ns_avg.txt", tmpf)
    gtemp <- read.table(tmpf, colClasses = rep("numeric", 12))[, 1:2] # only want some of the variables
    names(gtemp) <- c("Year", "Temperature")

    The values in Temperature are anomalies relative to 1961–1990, in degrees C.

    The model I fitted in the last post was

    \[ y = \beta_0 + f(\mathrm{Year}) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 \Lambda) \]

    where we have a smooth function of Year as the trend, and allow for possibly correlated residuals via the correlation matrix \( \Lambda \).

    The data set contains a partial set of observations for 2016, but seeing as that year is (at the time of writing) incomplete, I delete that observation.

    gtemp <- head(gtemp, -1)                # -1 drops the last row

    The data are shown below

    library("ggplot2")
    theme_set(theme_bw())
    p1 <- ggplot(gtemp, aes(x = Year, y = Temperature)) +
        geom_point()
    p1 + geom_line()
    HadCRUT4 global mean temperature anomaly

    The model described above can be fitted using the gamm() function in the mgcv package. There are other options that allow one to use gam(), or even bam() in the same package, which are simpler, but I want to keep this post consistent with the one from a few years ago, so gamm() it is. Recall that gamm() represents the additive model as a mixed effects model via the well-known equivalence between random effects and splines, and fits the model using lme(). This allows for correlation structures in the residuals. Previously we saw that an AR(1) process in the residuals was the best fitting of the models tried, so we start with that and then try a model with AR(2) errors.

    library("mgcv")
    Loading required package: nlme
    This is mgcv 1.8-12. For overview type 'help("mgcv-package")'.
    m1 <- gamm(Temperature ~ s(Year), data = gtemp, correlation = corARMA(form = ~ Year, p = 1))
    m2 <- gamm(Temperature ~ s(Year), data = gtemp, correlation = corARMA(form = ~ Year, p = 2))

    A generalised likelihood ratio test suggests little support for the more complex AR(2) errors model

    anova(m1$lme, m2$lme)
           Model df       AIC       BIC   logLik   Test L.Ratio p-value
    m1$lme     1  5 -277.7465 -262.1866 143.8733                       
    m2$lme     2  6 -278.2519 -259.5799 145.1259 1 vs 2 2.50538  0.1135

    The AR(1) has successfully modelled most of the residual correlation

    ACF <- acf(resid(m1$lme, type = "normalized"), plot = FALSE)
    ACF <- setNames(data.frame(unclass(ACF)[c("acf", "lag")]), c("ACF","Lag"))
    ggplot(ACF, aes(x = Lag, y = ACF)) +
        geom_hline(aes(yintercept = 0)) +
        geom_segment(mapping = aes(xend = Lag, yend = 0))
    Autocorrelation function of residuals from the additive model with AR(1) errors

    Before drawing the fitted trend, I want to put a simultaneous confidence interval around the estimate. mgcv makes this very easy to do via posterior simulation. To simulate from the fitted model, I have written a simulate.gamm() method for the simulate() generic that ships with R. The code below downloads the Gist containing the simulate.gamm() code and then uses it to simulate from the model at 200 locations over the time period of the observations. I’ve written about posterior simulation from GAMs before, so if the code below or the general idea isn’t clear, I suggest you check out the earlier post.

    tmpf <- tempfile()
    curl_download("https://gist.githubusercontent.com/gavinsimpson/d23ae67e653d5bfff652/raw/25fd719c3ab699e48927e286934045622d33b3bf/simulate.gamm.R", tmpf)
    source(tmpf)
    
    set.seed(10)
    newd <- with(gtemp, data.frame(Year = seq(min(Year), max(Year), length.out = 200)))
    sims <- simulate(m1, nsim = 10000, newdata = newd)
    
    ci <- apply(sims, 1L, quantile, probs = c(0.025, 0.975))
    newd <- transform(newd,
                      fitted = predict(m1$gam, newdata = newd),
                      lower  = ci[1, ],
                      upper  = ci[2, ])

    Having arranged the fitted values and the upper and lower simultaneous confidence intervals tidily, they can easily be added to the existing plot of the data.

    p1 + geom_ribbon(data = newd, aes(ymin = lower, ymax = upper, x = Year, y = fitted),
                     alpha = 0.2, fill = "grey") +
        geom_line(data = newd, aes(y = fitted, x = Year))
    Estimated trend in global mean temperature plus 95% simultaneous confidence interval

    Whilst the simultaneous confidence interval shows the uncertainty in the fitted trend, it isn’t as clear about what form this uncertainty takes; for example, periods where there is little change or large uncertainty are often characterised by a wide range of functional forms, not just flat, smooth functions. To get a sense of the uncertainty in the shapes of the simulated trends we can plot some of the draws from the posterior distribution of the model

    set.seed(42)
    S <- 50
    sims2 <- setNames(data.frame(sims[, sample(10000, S)]), paste0("sim", seq_len(S)))
    sims2 <- setNames(stack(sims2), c("Temperature", "Simulation"))
    sims2 <- transform(sims2, Year = rep(newd$Year, S))
    
    ggplot(sims2, aes(x = Year, y = Temperature, group = Simulation)) +
        geom_line(alpha = 0.3)
    50 random simulated trends drawn from the posterior distribution of the fitted model

    If you look closely at the period 1850–1900, you’ll notice a wide range of trends through this period, each of which is consistent with the fitted model but illustrates the uncertainty in the estimates of the spline coefficients. An additional factor is that these splines have a global amount of smoothness; once the smoothness parameter(s) are estimated, the smoothness allowance this affords is spread evenly over the fitted function. Adaptive splines would solve this problem, as they in effect allow you to spread the smoothness allowance unevenly, using it sparingly where there is no smooth variation in the data and applying it liberally where there is.
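
    As an aside (not in the original post), mgcv also ships an adaptive smoother basis; a minimal sketch of such a fit, ignoring the AR(1) residual structure handled by gamm() above, might look like this:

    # Adaptive smooth of Year: bs = "ad" lets the wiggliness vary along the covariate.
    # This sketch ignores the residual autocorrelation modelled in m1.
    m_ad <- gam(Temperature ~ s(Year, bs = "ad"), data = gtemp, method = "REML")
    plot(m_ad, shade = TRUE)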

    An instructive visualisation for the period of the purported pause or hiatus in global warming is to look at the shapes of the posterior simulations and the slopes of the trends for each year. I first look at the posterior simulations:

    ggplot(sims2, aes(x = Year, y = Temperature, group = Simulation)) +
        geom_line(alpha = 0.5) + xlim(c(1995, 2015)) + ylim(c(0.2, 0.75))
    Warning: Removed 8750 rows containing missing values (geom_path).
    50 random simulated trends drawn from the posterior distribution of the fitted model: 1995–2015

    Whilst the plot only shows 50 of the 10,000 posterior draws, it’s pretty clear that, in these data at least, there is little or no support for the pause hypothesis; most of the posterior simulations are linearly increasing over the period of interest. Only one or two show a marked shallowing of the slope of the simulated trend through the period.

    The first derivatives of the fitted trend can be used to determine where temperatures are increasing or decreasing. Using the standard error of the derivative or posterior simulation we can also say where the confidence interval on the derivative doesn’t include 0 — suggesting statistically significant change in temperature.

    The code below uses some functions I wrote to compute the first derivatives of GAM(M) model terms via posterior simulation. I’ve written about this method before, so I suggest you check out that post if any of this isn’t clear.

    tmpf <- tempfile()
    curl_download("https://gist.githubusercontent.com/gavinsimpson/ca18c9c789ef5237dbc6/raw/295fc5cf7366c831ab166efaee42093a80622fa8/derivSimulCI.R", tmpf)
    source(tmpf)
    
    fd <- derivSimulCI(m1, samples = 10000, n = 200)
    Loading required package: MASS
    CI <- apply(fd[[1]]$simulations, 1, quantile, probs = c(0.025, 0.975))
    sigD <- signifD(fd[["Year"]]$deriv, fd[["Year"]]$deriv, CI[2, ], CI[1, ],
                    eval = 0)
    newd <- transform(newd,
                      derivative = fd[["Year"]]$deriv[, 1], # computed first derivative
                      fdUpper = CI[2, ],                    # upper CI on first deriv
                      fdLower = CI[1, ],                    # lower CI on first deriv
                      increasing = sigD$incr,               # where is curve increasing?
                      decreasing = sigD$decr)               # ... or decreasing?

    A ggplot2 version of the derivatives is produced using the code below. The two additional geom_line() calls add thick lines over sections of the derivative plot to illustrate those points where zero is not contained within the confidence interval of the first derivative.

    ggplot(newd, aes(x = Year, y = derivative)) +
        geom_ribbon(aes(ymax = fdUpper, ymin = fdLower), alpha = 0.3, fill = "grey") +
        geom_line() +
        geom_line(aes(y = increasing), size = 1.5) +
        geom_line(aes(y = decreasing), size = 1.5) +
        ylab(expression(italic(hat(f) * "'") * (Year))) +
        xlab("Year")
    Warning: Removed 74 rows containing missing values (geom_path).
    Warning: Removed 190 rows containing missing values (geom_path).
    First derivative of the fitted trend plus 95% simultaneous confidence interval

    Looking at this plot, despite the large (and expected) uncertainty in the derivative of the fitted trend towards the end of the observation period, the first derivatives of at least 95% of the 10,000 posterior simulations are all bounded well above zero. I’ll take a closer look at this now, plotting kernel density estimates of the posterior distribution of first derivatives evaluated at each year for the period of interest.

    First I generate another 10,000 simulations from the posterior of the fitted model, this time for each year in the interval 1998–2015. Then I do a little processing to get the derivatives into a format suitable for plotting with ggplot and finally create kernel density estimate plots faceted by Year.

    set.seed(123)
    nsim <- 10000
    pauseD <- derivSimulCI(m1, samples = nsim,
                           newdata = data.frame(Year = seq(1998, 2015, by = 1)))
    
    annSlopes <- setNames(stack(setNames(data.frame(pauseD$Year$simulations),
                                         paste0("sim", seq_len(nsim)))),
                          c("Derivative", "Simulations"))
    annSlopes <- transform(annSlopes, Year = rep(seq(1998, 2015, by = 1), each = nsim))
    
    ggplot(annSlopes, aes(x = Derivative, group = Year)) +
        geom_line(stat = "density", trim = TRUE) + facet_wrap(~ Year)
    Kernel density estimates of the first derivative of posterior simulations from the fitted trend model for selected years

    We can also look at the smallest derivative for each year over all of the 10,000 posterior simulations

    minD <- aggregate(Derivative ~ Year, data = annSlopes, FUN = min)
    ggplot(minD, aes(x = Year, y = Derivative)) +
        geom_point()
    Dotplot showing the minimum first derivative over 10,000 posterior simulations from the fitted additive model

    Only 4 of the 18 years have a single simulation with a derivative less than 0. We can also plot all the kernel density estimates on the same plot to see if there is much variation between years (there doesn’t appear to be much going on from the previous figures).

    library("viridis")
    ggplot(annSlopes, aes(x = Derivative, group = Year, colour = Year)) +
        geom_line(stat = "density", trim = TRUE) + scale_color_viridis(option = "magma") +
        theme(legend.position = "top", legend.key.width = unit(3, "cm"))
    Kernel density estimates of the first derivative of posterior simulations from the fitted trend model for selected years. The colour of each density estimate differentiates individual years

    As anticipated, there’s very little between-year shift in the slopes of the trends simulated from the posterior distribution of the model.

    Returning to Cahill, Rahmstorf, and Parnell (2015) for a moment; the fitted trend from their Bayesian change point model is very similar to the fitted spline. There are some differences in the early part of the series; where their model has a single piecewise linear function through 1850–1900, the additive model suggests a small decrease in global temperatures leading up to 1900. Thereafter the models are very similar, with the exception that the smooth transitions between periods of increase are somewhat longer with the additive model than the one of Cahill, Rahmstorf, and Parnell (2015).

    References

    Cahill, Niamh, Stefan Rahmstorf, and Andrew C Parnell. 2015. “Change Points of Global Temperature.” Environmental Research Letters: ERL [Web Site] 10 (8). IOP Publishing: 084002. doi:http://doi.org/10.1088/1748-9326/10/8/084002.

    Morice, Colin P, John J Kennedy, Nick A Rayner, and Phil D Jones. 2012. “Quantifying Uncertainties in Global and Regional Temperature Change Using an Ensemble of Observational Estimates: The HadCRUT4 Data Set.” J. Geophys. Res. 117 (D8): D08101.


    Football by the numbers


    (This article was first published on RSS Feed, and kindly contributed to R-bloggers)

    Salvino A. Salvaggio [1] [2] [3]


    In this blog I publish data analysis cases based on the R statistical language. No statistical or mathematical theory here, no discussions of the R language, no software tutorials, but only concrete case studies using existing R tools.

    To download R code and dataset, click here (4.0 MB).


    Over the last 40-50 years, the international spread of the passion for football has proven to be one of the most pandemic social phenomena. Something that was considered a fun form of national craziness typical of Brazilian, British and Italian people in the 1960s and ’70s is now commonly shared by a vast majority of the Earth’s population (including orbiting astronauts, who are regularly kept informed of match results).

    As I am an absolute outsider to that trend, I randomly scraped the web[4] in search of results and scores, and ended up with a dataset of approx. 400,000 first-league matches (381,257 after a bit of cleaning) which I don’t really have a precise idea of what to do with. A clear advantage of this outsider position is that I can dig deeply into something without having an ounce of positive or negative preconceived ideas on the topic. However, a clear disadvantage is that I may not even think of analytical approaches that would be obvious to a football fan or expert.

    …The dataset is very international, comprising matches from 60 different countries spread over 6 continents and representing all FIFA regions.

    continent       matches     FIFA region   matches     top countries    matches
    Africa            16748     AFC             28065     united kingdom     66454
    Asia              25062     CAF             16748     france             26417
    Europe           292733     CONCACAF        13853     italy              24716
    North America     10649     CONMEBOL        25475     spain              23140
    Oceania            7386     OFC               560     netherlands        17790
    South America     28679     UEFA           296556     germany            15854

    From 65 matches in 1888 to more than 15,000 matches per annum from 2006 onwards,[5] the dataset shows a sort of exponential growth in the number of matches logged annually (with the exception of the two World Wars). Actually, this is not only due to an overall trend in the football industry but also to the way the original data sources I tapped are fed.

    As a matter of fact, since the 1950s a growing number of countries have had an official championship whose results are made available (by their respective federations or fan communities).

    …Summary statistics confirm what most fans and non-fans say:

    …Football is a low-scoring sport. The mean total number of goals per match is 2.77, with an average scoring difference between winner and loser of only 0.57 goals. To put it differently, there was roughly 1 goal every 32 minutes 30 seconds across the dataset (90 minutes / 2.77 goals ≈ 32.5 minutes).

    …The pattern of match results is quite predictable, with almost twice as many home wins as visiting wins or draws.

    Win Frequency
    Home 189886
    Draw 97563
    Visiting 93808

    However, the continent where matches are played does seem to affect the distribution of home wins, draws and visiting wins. The over-representation of Europe in the dataset (76.8% of all cases) calls for more caution when comparing subsets; for example, the under-representation of visiting wins in Africa compared to the rest contributes strongly to the chi-square statistic, even though it represents only a very small proportion (0.87%) of the whole dataset.

    continent         Home     Draw   Visiting
    Africa            8736     4706       3306
    Asia             11141     6828       7093
    Europe          148413    73557      70763
    North America     4976     2693       2980
    Oceania           3394     1755       2237
    South America    13226     8024       7429

    ChiSquare: 985.26   —   df: 10   —   p.value: 2.797101e-205
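
    As a cross-check (not in the original post), a minimal sketch of how this chi-square test could be reproduced in R from the table above:

    results <- matrix(c(  8736,  4706,  3306,
                         11141,  6828,  7093,
                        148413, 73557, 70763,
                          4976,  2693,  2980,
                          3394,  1755,  2237,
                         13226,  8024,  7429),
                      nrow = 6, byrow = TRUE,
                      dimnames = list(c("Africa", "Asia", "Europe",
                                        "North America", "Oceania", "South America"),
                                      c("Home", "Draw", "Visiting")))
    # Should reproduce the X-squared of about 985 on 10 degrees of freedom reported above
    chisq.test(results)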

    …On average, more goals are scored by home teams than by visiting teams. Overall, 636,034 goals in the dataset were scored by home teams and 419,775 by visiting teams. Not only is the sum significantly different,[6] but so is the shape of the distribution.

    …As a further confirmation of the perception of football as a low-scoring sport, approximately two thirds of all the results in the dataset (67.8%) are within a 2:2 score (i.e., 0:0, 1:0, 2:0, 0:1, 0:2, 1:1, 2:2), and 86.4% if we consider all matches with scores up to 3:3.[7]

    …Many times I have heard football fans but, mostly, newsreaders and commentators state that football is more offensive and more goals are scored in some specific countries, which makes the games more entertaining overall. According to the same commentators, other countries seem to have a mainly defensive football tradition characterized by a lower number of goals per match and, ultimately, less fun watching the games. As topical examples of these two extreme ways of playing football, Brazil and Italy were always mentioned: a dynamic and high-scoring football in Brazil, a chilly and defensive one in Italy. … Football in New Zealand, the Scandinavian countries, Germany, Holland, Canada and the UK offers its fans more goals scored overall, while Brazil and Italy both belong to a lower-scoring category, with an average total number of goals per match in the 2.51 to 3.00 range.[8]

    …From a historical perspective, the average total number of goals per match tends to decrease over time. From the dataset, I filtered out all the countries which have less than 50 years (football seasons) of data and was left with 10 countries for a total of 226,671 matches. Plotting the average total number of goals per match against the season shows that “younger championships” generate more goals but go through quite a steep fall over the initial 25 years, then a slower decrease over the following 50 years.[9]
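
    A hypothetical sketch of that filtering and plotting step, assuming a data frame matches with columns country, season, home_goals and visiting_goals (these column names are mine, not the original dataset's):

    library(dplyr)
    library(ggplot2)

    long_running <- matches %>%
      group_by(country) %>%
      filter(n_distinct(season) >= 50) %>%            # keep championships with 50+ seasons
      group_by(country, season) %>%
      summarise(avg_goals = mean(home_goals + visiting_goals), .groups = "drop")

    ggplot(long_running, aes(x = season, y = avg_goals, colour = country)) +
      geom_line() +
      labs(x = "Season", y = "Average total goals per match")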

     


    [1] This document is the result of an analysis conducted by the author and exclusively reflects the author’s positions. Therefore, this document does not involve, either directly or indirectly, any of the employers, past and present, of the author. The author also declares not to have any conflict of interest with companies, institutions, organizations, authorities related to the football eco-system.  

    [2] Contact: salvino [dot] salvaggio [at] gmail [dot] com  

    [3] In this document, football refers to the European definition, which is soccer in the USA.  

    [4] Sites such as http://www.calciostoria.it/ or http://www.calcio.com/  

    [5] Current football season is still ongoing, which explains the substantial drop in the number of matches of the last available year in the dataset.  

    [6] p-value of t.test < 2.2e-16

    [7] If no colored tile is shown in the graph, it means no matches in the dataset ended with such score. If a colored tile reporting 0% is shown, it means that less than 0.005% (but more than 0) of all the matches ended with such score.

    [8] p-value of one-way ANOVA < 2e-16.

    [9] Stabilization in the average total number of goals per match after the 75th year does not mean a lot in this case because only one national football championship, the UK's, has such longevity.

    To leave a comment for the author, please follow the link and comment on their blog: RSS Feed.


    Missing Value Treatment


    (This article was first published on DataScience+, and kindly contributed to R-bloggers)

    Missing values in data are a common phenomenon in real-world problems. Knowing how to handle them effectively is a necessary step to reduce bias and produce powerful models. Let's explore various options for dealing with missing values and how to implement them.

    Data prep and pattern

    Let's use the BostonHousing dataset in the mlbench package to discuss the various approaches to treating missing values. Though the original BostonHousing data doesn't have missing values, I am going to randomly introduce some. This way, we can validate the imputed missing values against the actuals, so that we know how effective the approaches are at reproducing the actual data. Let's begin by importing the data from the mlbench package and randomly inserting missing values (NA).

    # initialize the data
    data ("BostonHousing", package="mlbench")
    original <- BostonHousing  # backup original data
    
    # Introduce missing values
    set.seed(100)
    BostonHousing[sample(1:nrow(BostonHousing), 40), "rad"] <- NA
    BostonHousing[sample(1:nrow(BostonHousing), 40), "ptratio"]       
    
    #>      crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat medv
    #> 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
    #> 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
    #> 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
    #> 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
    #> 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
    #> 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7
    

    The missing values have been injected. Though we know where the missing values are, let's quickly check the pattern of missingness using mice::md.pattern.

    # Pattern of missing values
    library(mice)
    md.pattern(BostonHousing)  # pattern of missing values in data.
    
    #>     crim zn indus chas nox rm age dis tax b lstat medv rad ptratio   
    #> 431    1  1     1    1   1  1   1   1   1 1     1    1   1       1  0
    #>  35    1  1     1    1   1  1   1   1   1 1     1    1   0       1  1
    #>  35    1  1     1    1   1  1   1   1   1 1     1    1   1       0  1
    #>   5    1  1     1    1   1  1   1   1   1 1     1    1   0       0  2
    #>        0  0     0    0   0  0   0   0   0 0     0    0  40      40 80
    

    There are really four ways you can handle missing values:

    1. Deleting the observations

    If you have a large number of observations in your dataset, and all the classes to be predicted are sufficiently represented in the training data, then try deleting (or not including during model building, for example by setting na.action=na.omit) those observations (rows) that contain missing values. After deleting the observations, make sure that you:

    1. Still have sufficient data points, so the model doesn't lose power.
    2. Have not introduced bias (meaning, a disproportionate representation or non-representation of classes).

    # Example
    lm(medv ~ ptratio + rad, data=BostonHousing, na.action=na.omit)
    

    2. Deleting the variable

    If a particular variable has more missing values than the rest of the variables in the dataset, and if by removing that one variable you can save many observations, then I would suggest removing that particular variable, unless it is a really important predictor that makes a lot of business sense. It is a matter of weighing the importance of the variable against the number of observations you would otherwise lose.
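
    A minimal illustration with the BostonHousing data from above, purely to show the trade-off:

    # how many complete rows do we gain by dropping 'ptratio' altogether?
    reduced <- BostonHousing[, setdiff(names(BostonHousing), "ptratio")]
    sum(complete.cases(BostonHousing))  # rows usable if ptratio is kept
    sum(complete.cases(reduced))        # rows usable after dropping it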

    3. Imputation with mean / median / mode

    Replacing the missing values with the mean, median or mode is a crude way of treating missing values. Depending on the context, for example if the variation is low or if the variable has low leverage over the response, such a rough approximation is acceptable and could possibly give satisfactory results.

    library(Hmisc)
    impute(BostonHousing$ptratio, mean)  # replace with mean
    impute(BostonHousing$ptratio, median)  # median
    impute(BostonHousing$ptratio, 20)  # replace specific number
    # or if you want to impute manually
    BostonHousing$ptratio[is.na(BostonHousing$ptratio)] <- mean(BostonHousing$ptratio, na.rm = T)  # not run
    

    Let's compute the accuracy when the missing values are imputed with the mean.

    library(DMwR)
    actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
    predicteds <- rep(mean(BostonHousing$ptratio, na.rm=T), length(actuals))
    regr.eval(actuals, predicteds)
    
    #>        mae        mse       rmse       mape 
    #> 1.62324034 4.19306071 2.04769644 0.09545664
    

    4. Prediction

    Prediction is the most advanced method to impute missing values and includes different approaches such as kNN imputation, rpart, and mice.

    4.1. kNN Imputation

    DMwR::knnImputation uses the k-Nearest Neighbours approach to impute missing values. In simpler terms, what kNN imputation does is as follows: for every observation to be imputed, it identifies the 'k' closest observations based on Euclidean distance and computes the weighted average (weighted by distance) of these 'k' observations.

    The advantage is that you can impute all the missing values in all variables with one call to the function. It takes the whole data frame as the argument and you don't even have to specify which variable you want to impute. But be cautious not to include the response variable while imputing, because when imputing in a test or production environment the response is unknown, so an imputation model that relied on it could not be used.

    library(DMwR)
    knnOutput <- knnImputation(BostonHousing[, !names(BostonHousing) %in% "medv"])  # perform knn imputation.
    anyNA(knnOutput)
    #> FALSE
    

    Let's compute the accuracy.

    actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
    predicteds <- knnOutput[is.na(BostonHousing$ptratio), "ptratio"]
    regr.eval(actuals, predicteds)
    #>        mae        mse       rmse       mape 
    #> 1.00188715 1.97910183 1.40680554 0.05859526 
    

    The mean absolute percentage error (mape) has improved by ~ 39% compared to the imputation by mean.
    Good.

    4.2 rpart

    The limitation of DMwR::knnImputation is that it may not always be appropriate to use when the missing value comes from a factor variable. Both rpart and mice have the flexibility to handle that scenario. The advantage of rpart is that you need only one of the predictor variables to be non-NA.

    The idea here is to use rpart to predict the missing values instead of kNN. To handle a factor variable, we set method="class" when calling rpart(). For a numeric variable, we use method="anova". Here again, we need to make sure not to train rpart on the response variable (medv).

    library(rpart)
    class_mod <- rpart(rad ~ . - medv, data=BostonHousing[!is.na(BostonHousing$rad), ], method="class", na.action=na.omit)  # since rad is a factor
    anova_mod <- rpart(ptratio ~ . - medv, data=BostonHousing[!is.na(BostonHousing$ptratio), ], method="anova", na.action=na.omit)  # since ptratio is numeric.
    rad_pred <- predict(class_mod, BostonHousing[is.na(BostonHousing$rad), ])
    ptratio_pred <- predict(anova_mod, BostonHousing[is.na(BostonHousing$ptratio), ])
    

    Let's compute the accuracy for ptratio.

    actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
    predicteds <- ptratio_pred
    regr.eval(actuals, predicteds)
    #>        mae        mse       rmse       mape 
    #> 0.71061673 0.99693845 0.99846805 0.04099908 
    

    The mean absolute percentage error (mape) has improved by another ~30% compared to knnImputation. Very good.

    Accuracy for rad

    actuals <- original$rad[is.na(BostonHousing$rad)]
    predicteds <- as.numeric(colnames(rad_pred)[apply(rad_pred, 1, which.max)])
    mean(actuals != predicteds)  # compute misclass error.
    #> 0.25  
    

    This yields a mis-classification error of 25%. Not bad for a factor variable!

    4.3 mice

    mice, short for Multivariate Imputation by Chained Equations, is an R package that provides advanced features for missing value treatment. It uses a slightly uncommon 2-step approach to imputation: mice() builds the model and complete() generates the completed data. The mice(df) function produces multiple complete copies of df, each with different imputations of the missing data. The complete() function returns one or several of these data sets, with the default being the first. Let's see how to impute 'rad' and 'ptratio':

    library(mice)
    miceMod <- mice(BostonHousing[, !names(BostonHousing) %in% "medv"], method="rf")  # perform mice imputation, based on random forests.
    miceOutput <- complete(miceMod)  # generate the completed data.
    anyNA(miceOutput)
    #> FALSE
    

    Let's compute the accuracy of ptratio.

    actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
    predicteds <- miceOutput[is.na(BostonHousing$ptratio), "ptratio"]
    regr.eval(actuals, predicteds)
    #>        mae        mse       rmse       mape 
    #> 0.36500000 0.78100000 0.88374204 0.02121326
    

    The mean absolute percentage error (mape) has improved by a further ~48% compared to rpart. Excellent!

    Let's compute the accuracy of rad.

    actuals <- original$rad[is.na(BostonHousing$rad)]
    predicteds <- miceOutput[is.na(BostonHousing$rad), "rad"]
    mean(actuals != predicteds)  # compute misclass error.
    #> 0.15
    

    The mis-classification error reduced to 15%, which is 6 out of 40 observations. This is a good improvement compared to rpart’s 25%.

    If you'd like to dig deeper, have a look at the mice manual or this other post about mice from DataScience+.

    Though we have an idea of how each method performs, there is not enough evidence to conclude which method is better or worse. But these are definitely worth testing out the next time you impute missing values.

    If you have any questions leave a comment below or contact me on LinkedIn.

      Related Post

      1. R for Publication by Page Piccinini
      2. Assessing significance of slopes in regression models with interaction
      3. First steps with Non-Linear Regression in R
      4. Standard deviation vs Standard error
      5. Introduction to Circular Statistics – Rao’s Spacing Test

      To leave a comment for the author, please follow the link and comment on their blog: DataScience+.


      The five element ninjas approach to teaching design matrices


      (This article was first published on Maxwell B. Joseph, and kindly contributed to R-bloggers)

      Design matrices unite seemingly disparate statistical methods, including linear regression, ANOVA, multiple regression, ANCOVA, and generalized linear modeling.
      As part of a hierarchical Bayesian modeling course that we offered this semester, we wanted our students to learn about design matrices to facilitate model specification and parameter interpretation.
      Naively, I thought that I could spend a few minutes in class reviewing matrix multiplication and a design matrix for simple linear regression, and if students wanted more, they might end up on Wikipedia’s Design matrix page.

      It quickly became clear that this approach was not effective, so I started to think about how students could construct their own understanding of design matrices.
      About the same time, I watched a pretty incredible kung fu movie called Five Element Ninjas, and it occurred to me that the “five elements” concept could be an effective device for getting my students to think about model specification and design matrices.

      Learning goals

      Students should be able to specify design matrices for many different types of models (e.g., linear models and generalized linear models), and they should be able to interpret the parameters.

      Approach

      The broad idea was to get the students to think about model specification from five perspectives:

      1. Model specification via a design matrix
      2. Model specification via R syntax (e.g., the formula argument to lm)
      3. Model specification via “long form” equations
      4. Graphical model specification
      5. Verbal model specification (along with an interpretation of each of the parameter estimates)

      This leverages what students already know, and encourages them to connect new concepts to their existing knowledge.
      In our case, the participants were all students in CU Boulder's Ecology and Evolutionary Biology graduate program.
      Most of them had a strong grasp of perspective 2 (model specification in R syntax), but relatively weak understanding of the remaining perspectives.

      Getting the students started

      Before we asked them to do anything, I demonstrated this five elements approach on a simple model: the model of the mean.

      1. Design matrix specification

      2. R syntax

      The formula for a model of the mean is y ~ 1

      3. Long form equations

      4. Graphical interpretation

      5. Verbal description

      I asked for a student to take a stab at a verbal description of the model specification, and also to explain the interpretation of the parameter β.
      If they’re having a hard time understanding the task, you can tell them to pretend that they are talking to a classmate on the phone and trying to describe the model.
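
      A minimal sketch of what this looks like in R (not part of the original course materials): for y ~ 1 the design matrix is a single column of ones, the long-form equation is y_i = β + ε_i, and the lone coefficient is simply the sample mean.

      toy <- data.frame(y = c(2, 4, 6, 8))
      model.matrix(y ~ 1, data = toy)   # a single column of 1s
      coef(lm(y ~ 1, data = toy))       # (Intercept) equals mean(toy$y) = 5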

      The activity

      We provided students with a very simple data set that does not include the “response” variable.
      This was printed ahead of time, so that each student had a paper copy that they could also use as scratch paper.

      Covariate 1 Covariate 2
      1.0 A
      2.0 B
      3.0 A
      4.0 B

      The omission of the response variable is deliberate, reinforcing the idea that one can construct a design matrix without knowing the outcome variable (this is useful later in our class for prior and posterior predictive simulations).

      We organized the students into groups of three or four and had each group come up to the blackboard, which we partitioned ahead of time to have a space for each group to work.
      Then, we proceeded to work through incrementally more complex models with our five-pronged approach:

      1. A model that includes an effect of covariate 1.
      2. A model that includes an effect of covariate 2.
      3. A model that includes additive effects of covariate 1 and 2 (no interactions).
      4. A model that includes additive effects and an interaction between covariate 1 and 2.

      Each of these exercises took about 15 minutes, and once all the groups were done we checked in with each group as a class to see what they came up with.
      Some groups opted for effects parameterizations, while others opted for means parameterizations, which led to a useful discussion of the default treatment of intercepts in R model formulas and the manual suppression of intercepts (e.g., y ~ 0 + x).
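
      A small sketch of that contrast, using the covariate table above (the column names are mine):

      dat <- data.frame(cov1 = c(1, 2, 3, 4),
                        cov2 = factor(c("A", "B", "A", "B")))
      model.matrix(~ cov1 * cov2, data = dat)  # effects parameterization (default intercept)
      model.matrix(~ 0 + cov2, data = dat)     # means parameterization via intercept suppression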

      The outcome

      This in-class activity was surprisingly well-received, and it seemed to provide the context and practice necessary for the students to understand design matrices on a deeper level.
      Throughout the rest of the semester, model matrices were preferred over other specifications by many of the students – a far cry from the widespread confusion at the beginning of the semester.

      To leave a comment for the author, please follow the link and comment on their blog: Maxwell B. Joseph.


      Introduction to R for Data Science :: Session 1


      (This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers)


      Welcome to Introduction to R for Data Science Session 1! The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.

      [in Serbian]

      Lecturers

      Summary of Session 1, 28 April 2016 :: Introduction to R

      Elementary data structures, data.frames + an illustrative example of a simple linear regression model.  An introduction to basic R data types and objects (vectors, lists, data.frame objects). Examples: subsetting and coercion. Getting to know RStudio. What can R do and how to make it perform the most elementary tricks needed in Data Science? What is CRAN and how to install R packages? R graphics: simple linear regression with plot(), abline(), and fancy with ggplot().

      Intro to R for Data Science SlideShare :: Session 1



      R script + Data Set :: Session 1

      ########################################################
      # Introduction to R for Data Science
      # SESSION 1 :: 28 April, 2016
      # Data Science Community Serbia + Startit
      # :: Branko Kovač and Goran S. Milovanović ::
      ########################################################
       
      # This is an R comment: it begins with "#" and ends with nothing 🙂
      # data source: http://www.stat.ufl.edu/~winner/datasets.html (modified, from .dat to .csv)
      # from the website of Mr. Larry Winner, Department of Statistics, University of Florida
       
      # Data set: RKO Films Costs and Revenues 1930-1941
      # More on RKO Films: https://en.wikipedia.org/wiki/RKO_Pictures
       
      # First question: where are we?
      getwd(); # this will tell you the path to the R working directory
       
      # Where are my files?
      # NOTE: Here you need to change filesDir to match your local path
      filesDir <- "/home/goran/Desktop/__IntroR_Session1/";
      class(filesDir); # now filesDir is of a character type; there are classes and types in R
      typeof(filesDir);
      # By the way, you do not need to use the semicolon to separate lines of code:
      class(filesDir)
      typeof(filesDir)
      # point R to where your files are stored
      setwd(filesDir); # set working directory
      getwd(); # check
       
      # Read some data in csv (comma separated values
      # - it might turn out that you will be using these very often)
      fileName <- "rko_film_1930-1941.csv";
      dataSet <- read.csv(fileName,
                          header=T,
                          check.names=F,
                          stringsAsFactors=F,
                          row.names=NULL);
       
      # read.csv is for reading comma separated values
      # type ? in front of any R function for help
      ?read.csv
      # to find out that read.csv is a member of a wider read* family of functions
      # of which read.table is the most generic one
       
      # now, dataSet is of type...
      typeof(dataSet); # in type semantics, dataSet is a list. In R we use lists a lot.
      class(dataSet); # in object semantics, dataSet is a data.frame!
       
      # what is the first member of the dataSet list?
      dataSet[[1]];
      # what are the first two members?
      dataSet[1:2];
      # mind the difference between subsetting a list with [[]] and []
      # does a single member of dataSet have a name?
      names(dataSet[[1]]);
      # of what type is it?
      typeof(dataSet[[1]]);
      class(dataSet[[1]]);
      # do first two elements have names?
      names(dataSet[1:2]); # wow
      typeof(dataSet[1:2]);
      # the first element of dataSet, understood as a character vector, does not have a name
      # however, elements OF A list do have names
      # can we subset a data.frame object by names?
      dataSet$movie;
      dataSet$movie[1:10];
      dataSet$movie[[1]];
      class(dataSet$movie[[1]]);
      typeof(dataSet$movie[[1]]);
      # thus, a character vector is the first member = the first column of the dataSet data.frame
      testWord <- c("Ana", "voli", "Milovana"); # a character vector (example values)
      testWord[[1]];
      testWord[[1:2]]; # error
      testWord[1:2];
      # similar
      dataSet[1:2]; # first two columns of a dataSet
      # back to characters
      tW <- testWord[1];
      tW[1]
      tW[2] # NA
      # from a viewpoint of a statistical spreadsheet user, NA is used for missing data in R
      # what is the second letter in tW == 'Ana'
      substring(tW,2,2); # there are functions in R to deal with characters as strings!
      # finding elements of vectors
      w <- which(testWord == "Ana"); # which() returns the positions of matching elements
      testWord[w];
      # how many elements in testWord?
      length(testWord);
      # subsetting testWord, again
      testWord[2:length(testWord)]; # length is another important function, like which() or substring()
      tail(testWord,2); # vectors have tails, yay!
      head(testWord,3); # and heads as well
      # a data.frame has a head too, and that knowledge often comes handy...
      head(dataSet,5); # ... especially when dealing with large data sets
      # of course...
      tail(dataSet,10);
      # another two functions: tail() and head()
      # further subsetting of a data.frame object
      dataSet$reRelease # columns can have names; reRelease is the name of the 2nd column of dataSet
      typeof(dataSet$reRelease);
      class(dataSet$reRelease);
      # automatic type conversion in R: from numeric to logical
      is.numeric(dataSet$reRelease);
      reRelease <- as.logical(dataSet$reRelease); # coerce numeric (0/1) to logical
      reRelease
      is.logical(reRelease);
      # vectors, sequences...
      # automatic type conversion (coercing) in R: from real to integer
      x <- 2:10;
      # is the same as...
      x <- seq(2,10,by=1);
      # multiples of 3.1415927...
      multipliPi <- x*pi;
      multipliPi
      # NOTE multiplication * in R operates element-wise
      # This is one of the reasons we call it a vector programming language...
      is.double(multipliPi);
      # type conversion in R: from double to integer
      as.integer(multipliPi)
      is.integer(multipliPi)
      is.integer(as.integer(multipliPi))
      # rounding
      round(multipliPi,1)
      round(multipliPi,2)
      # carefully!
      as.integer(multipliPi) == round(multipliPi,0) # check documentation
      ?as.integer # enjoy...
      # more coercion...
      num <- as.numeric("123");
      is.numeric(num)
      ch <- as.character(num)
      is.character(ch)
       
      # What do we all love in Data Science and Statistics? Random numbers..!
      runif(100,0,1) # one hundred uniformly distributed random numbers on a range 0 .. 1
      rnorm(100, mean=0, sd=1) # one hundred random deviates from the standard Gaussian
      # all probability density and mass functions in R have similar r* functions to generate random deviates
       
      # Enough! Let's do something for real...
      # Q: Is it possible to predict the total revenue from movie production cost?
      # Are these two related at all?
      # What is the size of the data set?
      n <- nrow(dataSet); # size of the data set
      n
      # any missing data?
      sum(!(is.na(dataSet$productionCost)));
      sum(!(is.na(dataSet$totalRevenue)));
      # plot dataSet$productionCost on x-axis and dataSet$totalRevenue on y-axis
      plot(dataSet$productionCost, dataSet$totalRevenue);
      # are these two correlated?
      cPearson <- cor(dataSet$productionCost, dataSet$totalRevenue,method="pearson");
      cPearson


      [Figures: Session1-Fig1, Session1-Fig2]

      # hm, maybe I should use non-parametric correlation instead
      cSpearman <- cor(dataSet$productionCost, dataSet$totalRevenue,method="spearman");
      cSpearman
      # log-transform will not help much in this case...
      hist(log(dataSet$productionCost),20); # the default base of log in R is e (natural)
      hist(log(dataSet$totalRevenue),20);


      [Figures: Session1-Fig3, Session1-Fig4]

      # However, who in the World tests the assumptions of the linear model... Kick it!
      reg <- lm(dataSet$totalRevenue ~ dataSet$productionCost);
      summary(reg);
      # get residuals
      reg$residuals
      # get coefficients
      reg$coefficients 
      # some functions to inspect the simple linear model
      coefficients(reg) # model coefficients
      confint(reg, level=0.95) # CIs for model parameters 
      fitted(reg) # predicted values
      residuals(reg) # residuals
      anova(reg) # anova table 
      vcov(reg) # covariance matrix for model parameters 
       
      # plot model
      intercept <- reg$coefficients[1];
      slope <- reg$coefficients[2];
      plot(dataSet$productionCost, dataSet$totalRevenue);
      abline(reg$coefficients); # as simple as that; abline() is a generic function, check it out ?abline


      [Figure: Session1-Fig5]

      # and now for a nice plot
      library(ggplot2); # first do: install.packages("ggplot2");not now - it can take a while
      # library() is a call to use any R package
      # of which the powerful ggplot2 is among the most popular
      g <- ggplot(data=dataSet,
                  aes(x = productionCost,
                      y = totalRevenue)) +
        geom_point() +
        geom_smooth(method=lm,
                    se=TRUE) +
        xlab("nProduction Cost") +
        ylab("Total Revenuen") +
        ggtitle("Linear Regressionn"); 
      print(g);


      [Figure: Session1-Fig6]

      # Q1: Is this model any good?
      # Q2: Are there any truly dangerous outliers present in the data set?
       
      # print is also a generic function in R: for example,
      print("Doviđenja i uživajte u praznicima uz gomilu materijala za čitanje i vežbu!")
       
      # P.S. Play with:
      reg <- lm(dataSet$totalRevenue ~ dataSet$productionCost + dataSet$domesticRevenue);
      summary(reg) # etc.



      Readings :: Session 2 [5. May, 2016, @Startit.rs, 19h CET]

      Chapters 1 – 5, The Art of R Programming, Norman Matloff

      • Intro to R
      • Vectors and Matrices
      • Lists

      Session 1 Photos

      [Photos: 20160428_204815, 20160428_193859]

      To leave a comment for the author, please follow the link and comment on their blog: The Exactness of Mind.


      Bike Rental Demand Estimation with Microsoft R Server


      (This article was first published on Revolutions, and kindly contributed to R-bloggers)

      by Katherine Zhao, Hong Lu, Zhongmou Li, Data Scientists at Microsoft

      Bicycle rental has become popular as a convenient and environmentally friendly transportation option. Accurate estimation of bike demand at different locations and different times would help bicycle-sharing systems better meet rental demand and allocate bikes to locations.

      In this blog post, we walk through how to use Microsoft R Server (MRS) to build a regression model to predict bike rental demand. In the example below, we demonstrate an end-to-end machine learning solution development process in MRS, including data importing, data cleaning, feature engineering, parameter sweeping, and model training and evaluation.

      Data

      The Bike Rental UCI dataset is used as the input raw data for this sample. This dataset is based on real-world data from the Capital Bikeshare company, which operates a bike rental network in Washington DC in the United States.

      The dataset contains 17,379 rows and 17 columns, with each row representing the number of bike rentals within a specific hour of a day in the years 2011 or 2012. Weather conditions (such as temperature, humidity, and wind speed) are included in the raw data, and the dates are categorized as holiday vs. weekday, etc.

      The field to predict is cnt, which contains a count value ranging from 1 to 977, representing the number of bike rentals within a specific hour.

      Model Overview

      In this example, we use historical bike rental counts as well as the weather condition data to predict the number of bike rentals within a specific hour in the future. We approach this problem as a regression problem, since the label column (number of rentals) contains continuous real numbers.

      Along this line, we split the raw data into two parts: data records in year 2011 to learn the regression model, and data records in year 2012 to score and evaluate the model. Specifically, we employ the Decision Forest Regression algorithm as the regression model and build two models on different feature sets. Finally, we evaluate their prediction performance. We will elaborate the details in the following sections.

      Microsoft R Server

      We build the models using the RevoScaleR library in MRS. The RevoScaleR library provides extremely fast statistical analysis on terabyte-class datasets without needing specialized hardware. Thanks to its distributed computing capabilities, the same RevoScaleR commands used to manage and analyze local data can be run in a different (possibly remote) compute context. A wide range of rx-prefixed functions provide functionality for:

      • Accessing external data sets (SAS, SPSS, ODBC, Teradata, and delimited and fixed format text) for analysis in R.
      • Efficiently storing and retrieving data in a high-performance data file.
      • Cleaning, exploring, and manipulating data.
      • Fast, basic statistical analysis.
      • Training and scoring advanced machine learning models.

      Running the Experiment

      Overall, there are five major steps of building this example using Microsoft R Server:

      Step 1: Import and Clean Data

      First, we import the Bike Rental UCI dataset. Since there is a small portion of missing records in the dataset, we use rxDataStep() to replace the missing records with the latest non-missing observations. rxDataStep() is a commonly used function for data manipulation. It transforms the input dataset chunk by chunk and saves the results to the output dataset.

      # Define the transformation function for the rxDataStep.
      xform <- function(dataList) {
        # Identify the features with missing values.
        featureNames <- c("weathersit", "temp", "atemp", "hum", "windspeed", "cnt")
        # Use "na.locf" function to carry forward last observation.
        dataList[featureNames] <- lapply(dataList[featureNames], zoo::na.locf)
        # Return the data list.
        return(dataList)
      }
       
      # Use rxDataStep to replace missings with the latest non-missing observations.
      cleanXdf <- rxDataStep(inData = mergeXdf, outFile = outFileClean, overwrite = TRUE,
                             # Apply the "last observation carried forward" operation.
                             transformFunc = xform,  
                             # Identify the features to apply the transformation.
                             transformVars = c("weathersit", "temp", "atemp", "hum", "windspeed", "cnt"),
                             # Drop the "dteday" feature.
                             varsToDrop = "dteday")

      Step 2: Perform Feature Engineering

      In addition to the original features in the raw data, we add the number of bikes rented in each of the previous 12 hours as features to provide better predictive power. We create a computeLagFeatures() helper function to compute the 12 lag features and use it as the transformation function in rxDataStep().

      Note that rxDataStep() processes data chunk by chunk and lag feature computation requires data from previous rows. In computeLagFeatures(), we use the internal function .rxSet() to save the last n rows of a chunk to a variable lagData. When processing the next chunk, we use another internal function .rxGet() to retrieve the values stored in lagData and compute the lag features.

      # Add number of bikes that were rented in each of
      # the previous 12 hours as 12 lag features.
      computeLagFeatures <- function(dataList) {
        # Total number of lags that need to be added.
        numLags <- length(nLagsVector)
        # Lag feature names as lagN.
        varLagNameVector <- paste("cnt_", nLagsVector, "hour", sep="")
       
        # Set the value of an object "storeLagData" in the transform environment.
        if (!exists("storeLagData"))
        {
          lagData <- mapply(rep, dataList[[varName]][1], times = nLagsVector)
          names(lagData) <- varLagNameVector
          .rxSet("storeLagData", lagData)
        }
       
        if (!.rxIsTestChunk)
        {
          for (iL in 1:numLags)
          {
            # Number of rows in the current chunk.
            numRowsInChunk <- length(dataList[[varName]])
            nlags <- nLagsVector[iL]
            varLagName <- paste("cnt_", nlags, "hour", sep = "")
            # Retrieve lag data from the previous chunk.
            lagData <- .rxGet("storeLagData")
            # Concatenate lagData and the "cnt" feature.
            allData <- c(lagData[[varLagName]], dataList[[varName]])
            # Take the first N rows of allData, where N is
            # the total number of rows in the original dataList.
            dataList[[varLagName]] <- allData[1:numRowsInChunk]
            # Save last nlag rows as the new lagData to be used
            # to process in the next chunk.
            lagData[[varLagName]] <- tail(allData, nlags)
            .rxSet("storeLagData", lagData)
          }
        }
        return(dataList)
      }
       
      # Apply the "computeLagFeatures" on the bike data.
      lagXdf <- rxDataStep(inData = cleanXdf, outFile = outFileLag,
                           transformFunc = computeLagFeatures,
                           transformObjects = list(varName = "cnt",
                                                   nLagsVector = seq(12)),
                           transformVars = "cnt", overwrite = TRUE)

      Step 3: Prepare Training, Test and Score Datasets

      Before training the regression model, we split data into two parts: data records in year 2011 to learn the regression model, and data records in year 2012 to score and evaluate the model. In order to obtain the best combination of parameters for regression models, we further divide year 2011 data into training and test datasets: 80% of the records are randomly selected to train regression models with various combinations of parameters, and the remaining 20% are used to evaluate the models obtained and determine the optimal combination.

      # Split data by "yr" so that the training and test data contains records
      # for the year 2011 and the score data contains records for 2012.
      rxSplit(inData = lagXdf,
              outFilesBase = paste0(td, "/modelData"),
              splitByFactor = "yr",
              overwrite = TRUE,
              reportProgress = 0,
              verbose = 0)
       
      # Point to the .xdf files for the training & test and score set.
      trainTest <- RxXdfData(paste0(td, "/modelData.yr.0.xdf"))
      score <- RxXdfData(paste0(td, "/modelData.yr.1.xdf"))
       
      # Randomly split records for the year 2011 into training and test sets
      # for sweeping parameters.
      # 80% of data as training and 20% as test.
      rxSplit(inData = trainTest,
              outFilesBase = paste0(td, "/sweepData"),
              outFileSuffixes = c("Train", "Test"),
              splitByFactor = "splitVar",
              overwrite = TRUE,
              transforms = list(splitVar = factor(sample(c("Train", "Test"),
                                                         size = .rxNumRows,
                                                         replace = TRUE,
                                                         prob = c(.80, .20)),
                                                  levels = c("Train", "Test"))),
              rngSeed = 17,
              consoleOutput = TRUE)
       
      # Point to the .xdf files for the training and test set.
      train <- RxXdfData(paste0(td, "/sweepData.splitVar.Train.xdf"))
      test <- RxXdfData(paste0(td, "/sweepData.splitVar.Test.xdf"))
       

      Step 4: Sweep Parameters and Train Regression Models

      In this step, we construct two training datasets based on the same raw input data, but with different sets of features:

      • Set A = weather + holiday + weekday + weekend features for the predicted day
      • Set B = Set A + number of bikes rented in each of the previous 12 hours, which captures very recent demand for the bikes

      In order to perform parameter sweeping, we create a helper function to evaluate the performance of a model trained with a given combination of number of trees and maximum depth. We use Root Mean Squared Error (RMSE) as the evaluation metric.

      # Define a function to train and test models with given parameters
      # and then return Root Mean Squared Error (RMSE) as the performance metric.
      TrainTestDForestfunction <- function(trainData, testData, form, numTrees, maxD)
      {
        # Build decision forest regression models with given parameters.
        dForest <- rxDForest(form, data = trainData,
                             method = "anova",
                             maxDepth = maxD,
                             nTree = numTrees,
                             seed = 123)
        # Predict the number of bike rentals on the test data.
        rxPredict(dForest, data = testData,
                  predVarNames = "cnt_Pred",
                  residVarNames = "cnt_Resid",
                  overwrite = TRUE,
                  computeResiduals = TRUE)
        # Calculate the RMSE.
        result <- rxSummary(~ cnt_Resid,
                            data = testData,
                            summaryStats = "Mean",
                            transforms = list(cnt_Resid = cnt_Resid^2)
        )$sDataFrame
        # Return lists of number of trees, maximum depth and RMSE.
        return(c(numTrees, maxD, sqrt(result[1,2])))
      }

      The following is another helper function to sweep and select the optimal parameter combination. Under local parallel compute context (rxSetComputeContext(RxLocalParallel())), rxExec() executes multiple runs of model training and evaluation with different parameters in parallel, which significantly speeds up parameter sweeping. When used in a compute context with multiple nodes, e.g. high-performance computing clusters and Hadoop, rxExec() can be used to distribute a large number of tasks to the nodes and run the tasks in parallel.

      # Define a function to sweep and select the optimal parameter combination.
      findOptimal <- function(DFfunction, train, test, form, nTreeArg, maxDepthArg) {
        # Sweep different combination of parameters.
        sweepResults <- rxExec(DFfunction, train, test, form, rxElemArg(nTreeArg), rxElemArg(maxDepthArg))
        # Sort the nested list by the third element (RMSE) in the list in ascending order.
        sortResults <- sweepResults[order(unlist(lapply(sweepResults, `[[`, 3)))]
        # Select the optimal parameter combination.
        nTreeOptimal <- sortResults[[1]][1]
        maxDepthOptimal <- sortResults[[1]][2]
        # Return the optimal values.
        return(c(nTreeOptimal, maxDepthOptimal))
      }
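
      As noted above, the sweep runs in parallel once the local parallel compute context is set; the call named in the text is simply:

      # switch to the local parallel compute context before calling findOptimal()
      rxSetComputeContext(RxLocalParallel())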

      A large number of parameter combinations are usually swept through in the modeling process. For demonstration purposes, we use 9 combinations of parameters in this example.

      # Define a list of parameters to sweep through.
      # To save time, we only sweep 9 combinations of number of trees and max tree depth.
      numTreesToSweep <- rep(seq(20, 60, 20), times = 3)
      maxDepthToSweep <- rep(seq(10, 30, 10), each = 3)

      Next, we find the best parameter combination and get the optimal regression model for each training dataset. For simplicity, we only present the process for Set A.

      # Set A = weather + holiday + weekday + weekend features for the predicted day.
      # Build a formula for the regression model and remove the "yr",
      # which is used to split the training and test data.
      newHourFeatures <- paste("cnt_", seq(12), "hour", sep = "")  # Define the hourly lags.
      formA <- formula(train, depVars = "cnt", varsToDrop = c("splitVar", newHourFeatures))
       
      # Find the optimal parameters for Set A.
      optimalResultsA <- findOptimal(TrainTestDForestfunction,
                                     train, test, formA,
                                     numTreesToSweep,
                                     maxDepthToSweep)
       
      # Use the optimal parameters to fit a model for feature Set A.
      nTreeOptimalA <- optimalResultsA[[1]]
      maxDepthOptimalA <- optimalResultsA[[2]]
      dForestA <- rxDForest(formA, data = trainTest,
                            method = "anova",
                            maxDepth = maxDepthOptimalA,
                            nTree = nTreeOptimalA,
                            importance = TRUE, seed = 123)

       

      Finally, we plot the dot charts of the variable importance and the out-of-bag error rates for the two optimal decision forest models.

      [Figures 1 and 2: variable importance and out-of-bag error rates for the two optimal decision forest models]
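
      A rough sketch of how the variable importance dot chart could be drawn with base graphics, assuming the fitted rxDForest object exposes an importance component when importance = TRUE (inspect with str(dForestA) in your RevoScaleR version):

      imp <- dForestA$importance                      # assumed component, analogous to randomForest
      dotchart(sort(imp[, 1]),
               labels = rownames(imp)[order(imp[, 1])],
               main = "Variable importance, feature Set A")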

      Step 5: Test, Evaluate, and Compare Models

      In this step, we use the rxPredict() function to predict the bike rental demand on the score dataset, and compare the two regression models over three performance metrics – Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Relative Absolute Error (RAE).

      # Set A: Predict the probability on the test dataset.
      rxPredict(dForestA, data = score,
                predVarNames = "cnt_Pred_A",
                residVarNames = "cnt_Resid_A",
                overwrite = TRUE, computeResiduals = TRUE)
       
      # Set B: Predict the probability on the test dataset.
      rxPredict(dForestB, data = score,
                predVarNames = "cnt_Pred_B",
                residVarNames = "cnt_Resid_B",
                overwrite = TRUE, computeResiduals = TRUE)
       
      # Calculate three statistical metrics:
      # Mean Absolute Error (MAE),
      # Root Mean Squared Error (RMSE), and
      # Relative Absolute Error (RAE).
      sumResults <- rxSummary(~ cnt_Resid_A_abs + cnt_Resid_A_2 + cnt_rel_A +
                         cnt_Resid_B_abs + cnt_Resid_B_2 + cnt_rel_B,
                       data = score,
                       summaryStats = "Mean",
                       transforms = list(cnt_Resid_A_abs = abs(cnt_Resid_A),
                                         cnt_Resid_A_2 = cnt_Resid_A^2,
                                         cnt_rel_A = abs(cnt_Resid_A)/cnt,
                                         cnt_Resid_B_abs = abs(cnt_Resid_B),
                                         cnt_Resid_B_2 = cnt_Resid_B^2,
                                         cnt_rel_B = abs(cnt_Resid_B)/cnt)
      )$sDataFrame
       
      # Add row names.
      features <- c("baseline: weather + holiday + weekday + weekend features for the predicted day",
                    "baseline + previous 12 hours demand")
       
      # List all metrics in a data frame.
      metrics <- data.frame(Features = features,
                             MAE = c(sumResults[1, 2], sumResults[4, 2]),
                             RMSE = c(sqrt(sumResults[2, 2]), sqrt(sumResults[5, 2])),
                             RAE = c(sumResults[3, 2], sumResults[6, 2]))
       

      Based on all three metrics listed below, the regression model built on feature set B outperforms the one built on feature set A. This result is not surprising since, as we can see from the variable importance chart, the lag features play a critical part in the regression model. Adding this set of features leads to better performance.

      Feature Set MAE RMSE RAE
      A 101.34848 146.9973 0.9454142
      B 62.48245 105.6198 0.3737669

      Follow this link for source code and datasets: Bike Rental Demand Estimation with Microsoft R Server

      To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


      Absence of evidence is not evidence of absence: Testing for equivalence


      (This article was first published on The 20% Statistician, and kindly contributed to R-bloggers)
      When you find p > 0.05, you did not observe surprising data, assuming there is no true effect. You can often read in the literature how p > 0.05 is interpreted as ‘no effect’ but due to a lack of power the data might not be surprising if there was an effect. In this blog I’ll explain how to test for equivalence, or the lack of a meaningful effect, using equivalence hypothesis testing. I’ve created easy to use R functions that allow you to perform equivalence hypothesis tests. Warning: If you read beyond this paragraph, you will never again be able to write “as predicted, the interaction revealed there was an effect for participants in the experimental condition (p< 0.05) but there was no effect in the control condition (F < 1).” If you prefer the veil of ignorance, here’s a nice site with cute baby animals to spend the next 9 minutes on instead.
      Any science that wants to be taken seriously needs to be able to provide support for the null hypothesis. I often see people switch over from Frequentist statistics, which they use when effects are significant, to Bayes Factors when they want to provide support for the null hypothesis. But it is possible to test for the lack of an effect using p-values (why no one ever told me this in the 11 years I worked in science is beyond me). It's as easy as doing a t-test, or more precisely, as doing two t-tests.
      The practice of Equivalence Hypothesis Testing (EHT) is used in medicine, for example to test whether a new cheaper drug isn’t worse (or better) than the existing more expensive option. A very simple EHT approach is the ‘two-one-sided t-tests’ (TOST) procedure (Schuirmann, 1987). Its simplicity makes it wonderfully easy to use.
      The basic idea of the test is to flip things around: In Equivalence Hypothesis Testing the null hypothesis is that there is a true effect larger than a Smallest Effect Size of Interest (SESOI; Lakens, 2014). That's right – the null-hypothesis is now that there IS an effect, and we are going to try to reject it (with a p < 0.05). The alternative hypothesis is that the effect is smaller than a SESOI, anywhere in the equivalence range – any effect you think is too small to matter, or too small to feasibly examine. For example, a Cohen's d of 0.5 is a medium effect, so you might set d = 0.5 as your SESOI, and the equivalence range goes from d = -0.5 to d = 0.5. In the TOST procedure, you first decide upon your SESOI: anything smaller than your smallest effect size of interest is considered smaller than small, and will allow you to reject the null-hypothesis that there is a true effect. You perform two t-tests, one testing if the effect is smaller than the upper bound of the equivalence range, and one testing whether the effect is larger than the lower bound of the equivalence range. Yes, it's that simple.
      Let’s visualize this. Below on the left axis is a scale for the effect size measure Cohen’s d. On the left is a single 90% confidence interval (the crossed circles indicate the endpoints of the 95% confidence interval) with an effect size of d = 0.13. On the right is the equivalence range. It is centered on 0, and ranges from d = -0.5 to d = 0.5.
      We see from the 95% confidence interval around d = 0.13 (again, the endpoints of which are indicated by the crossed circles) that the lower bound overlaps with 0. This means the effect (d = 0.13, from an independent t-test with two conditions of 90 participants each) is not statistically different from 0 at an alpha of 5%, and the p-value of the normal NHST is 0.384 (the title provides the exact numbers for the 95% CI around the effect size). But is this effect statistically smaller than my smallest effect size of interest?
      Rejecting the presence of a meaningful effect
      There are two ways to test the lack of a meaningful effect that yield the same result. The first is to perform two one-sided t-tests, testing the observed effect size against the 'null' of your SESOI (0.5 and -0.5). These t-tests show that d = 0.13 is significantly larger than d = -0.5, and significantly smaller than d = 0.5. The higher of these two p-values is p = 0.007. We can conclude that there is support for the lack of a meaningful effect (where meaningful is defined by your choice of the SESOI). The second approach (which is easier to visualize) is to calculate a 90% CI around the effect (indicated by the vertical line in the figure), and check whether this 90% CI falls completely within the equivalence range. You can see a line from the upper and lower limit of the 90% CI around d = 0.13 on the left to the equivalence range on the right, and the 90% CI is completely contained within the equivalence range. This means we can reject the null of an effect that is larger than d = 0.5 or smaller than d = -0.5 and conclude this effect is smaller than what we find meaningful (and you'll be right 95% of the time, in the long run).
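      To make the mechanics concrete, here is a minimal sketch of the two one-sided tests computed from summary statistics (my own illustration, not the functions released below). With means of 0.26 and 0, standard deviations of 2, and 90 participants per group (so d = 0.13), raw equivalence bounds of ±1 correspond to d = ±0.5, and the larger of the two p-values comes out at the 0.007 reported above.
      tost_raw <- function(m1, m2, sd1, sd2, n1, n2, low, high) {
        sdp <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))  # pooled SD
        se  <- sdp * sqrt(1 / n1 + 1 / n2)
        df  <- n1 + n2 - 2
        t_low  <- ((m1 - m2) - low)  / se   # H0: true difference <= lower bound
        t_high <- ((m1 - m2) - high) / se   # H0: true difference >= upper bound
        c(p_lower = pt(t_low,  df, lower.tail = FALSE),
          p_upper = pt(t_high, df, lower.tail = TRUE))
      }
      tost_raw(m1 = 0.26, m2 = 0, sd1 = 2, sd2 = 2, n1 = 90, n2 = 90, low = -1, high = 1)
      # equivalence is declared at alpha = 0.05 if both p-values are below 0.05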
      [Technical note: The reason we are using a 90% confidence interval, and not a 95% confidence interval, is because the two one-sided tests are completely dependent. You could actually just perform one test, if the effect size is positive against the upper bound of the equivalence range, and if the effect size is negative against the lower bound of the equivalence range. If this one test is significant, so is the other. Therefore, we can use a 90% confidence interval, even though we perform two one-sided tests. This is also why the crossed circles are used to mark the 95% CI.].
      So why were we not using these tests in the psychological literature? It’s the same old, same old. Your statistics teacher didn’t tell you about them. SPSS doesn’t allow you to do an equivalence test. Your editors and reviewers were always OK with your statements such as “as predicted, the interaction revealed there was an effect for participants in the experimental condition (p < 0.05) but there was no effect in the control condition (F< 1).” Well, I just ruined that for you. Absence of evidence is not evidence of absence!
      We can't use p > 0.05 as evidence of a lack of an effect. You can switch to Bayesian statistics if you want to support the null, but the default priors are only useful in research areas where very large effects are examined (e.g., some areas of cognitive psychology), and are not appropriate for most other areas in psychology, so you will have to be able to quantify your prior belief yourself. You can teach yourself how, but there might be researchers who prefer to provide support for the lack of an effect within a Frequentist framework. Given that most people think about the effect size they expect when designing their study, defining the SESOI at this moment is straightforward. After choosing the SESOI, you can even design your study to have sufficient power to reject the presence of a meaningful effect. Controlling your error rates is thus straightforward in equivalence hypothesis tests, while it is not that easy in Bayesian statistics (although it can be done through simulations).
      One thing I noticed while reading this literature is that TOST procedures, and power analyses for TOST, are not created to match the way psychologists design studies and think about meaningful effects. In medicine, equivalence is based on the raw data (a decrease of 10% compared to the default medicine), while we are more used to thinking in terms of standardized effect sizes (correlations or Cohen's d). Biostatisticians are fine with estimating the pooled standard deviation for a future study when performing power analysis for TOST, but psychologists use standardized effect sizes to perform power analyses. Finally, the packages that exist in R (e.g., equivalence) or the software that does equivalence hypothesis tests (e.g., Minitab, which has TOST for t-tests, but not correlations) require that you use the raw data. In my experience (Lakens, 2013) researchers find it easier to use their own preferred software to handle their data, and then calculate additional statistics not provided by the software they use by typing in summary statistics in a spreadsheet (means, standard deviations, and sample sizes per condition). So my functions don't require access to the raw data (which is good for reviewers as well). Finally, the functions make a nice picture such as the one above so you can see what you are doing.
      R Functions
I created R functions for TOST for independent t-tests, paired samples t-tests, and correlations, where you can set the equivalence thresholds using Cohen's d, Cohen's dz, and r. I adapted the equation for power analysis to be based on d, and I created the equation for power analyses for a paired-sample t-test from scratch because I couldn't find it in the literature. If it is not obvious: None of this is peer-reviewed (yet), and you should use it at your own risk. I checked the independent and paired t-test formulas against the results from Minitab software and reproduced examples in the literature, and checked the power analyses against simulations, and all yielded the expected results, so that's comforting. On the other hand, I had never heard of equivalence testing until 9 days ago (thanks 'Bum Deggy'), so that's less comforting I guess. Send me an email if you want to use these formulas for anything serious like a publication. If you find a mistake or misbehaving functions, let me know.
      If you load (select and run) the functions (see GitHub gist below), you can perform a TOST by entering the correct numbers and running the single line of code:
      TOSTd(d=0.13,n1=90,n2=90,eqbound_d=0.5)
      You don’t know how to calculate Cohen’s d in an independent t-test? No problem. Use the means and standard deviations in each group instead, and type:
      TOST(m1=0.26,m2=0.0,sd1=2,sd2=2,n1=90,n2=90,eqbound_d=0.5)
You'll get the figure above, and it calculates Cohen's d and the 95% CI around the effect size for free. You are welcome. Note that TOST and TOSTd differ slightly (TOST relies on the t-distribution, TOSTd on the z-distribution). If possible, use TOST – but TOSTd (and especially TOSTdpaired) will be very useful for readers of the scientific literature who want to quickly check the claim that there is a lack of effect when means or standard deviations are not available. If you prefer to set the equivalence range in raw difference scores (e.g., 10% of the mean in the control condition, as is common in medicine) you can use the TOSTraw function.
      Are you wondering if your design was well powered? Or do you want to design a study well-powered to reject a meaningful effect? No problem. For an alpha (Type 1 error rate) of 0.05, 80% power (or a beta or Type 2 error rate of 0.2), and a SESOI of 0.4, just type:
      powerTOST(alpha=0.05, beta=0.2, eqbound_d=0.4) #Returns n (for each condition)
You will see you need 107 participants in each condition to have 80% power to reject an effect larger than d = 0.4, and accept the null (or an effect smaller than your smallest effect size of interest). Note that this function is based on the z-distribution; it does not use the iterative approach based on the t-distribution that would make it exact, so it is an approximation, but it should work well enough in practice.
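For the curious, the z-based approximation is short enough to write down directly. The sketch below is my own reading of that approximation (assuming a true effect of exactly zero and symmetric bounds), not the powerTOST() code itself:

power_tost_approx <- function(alpha, beta, eqbound_d) {
  # approximate per-group n for TOST, assuming the true effect is zero
  2 * (qnorm(1 - alpha) + qnorm(1 - beta / 2))^2 / eqbound_d^2
}

power_tost_approx(alpha = 0.05, beta = 0.2, eqbound_d = 0.4)  # about 107 per condition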
      TOSTr will perform these calculations for correlations, and TOSTdpaired will allow you to use Cohen’s dz to perform these calculations for within designs. powerTOSTpaired can be used when designing within subject design studies well-powered to test if data is in line with the lack of a meaningful effect.
      Choosing your SESOI
      How should you choose your SESOI? Let me quote myself (Lakens, 2014, p. 707):
      In applied research, practical limitations of the SESOI can often be determined on the basis of a cost–benefit analysis. For example, if an intervention costs more money than it saves, the effect size is too small to be of practical significance. In theoretical research, the SESOI might be determined by a theoretical model that is detailed enough to make falsifiable predictions about the hypothesized size of effects. Such theoretical models are rare, and therefore, researchers often state that they are interested in any effect size that is reliably different from zero. Even so, because you can only reliably examine an effect that your study is adequately powered to observe, researchers are always limited by the practical limitation of the number of participants that are willing to participate in their experiment or the number of observations they have the resources to collect.
      Let’s say you collect 50 participants in two independent conditions, and plan to do a t-test with an alpha of 0.05. You have 80% power to detect an effect with a Cohen’s d of 0.57. To have 80% power to reject an effect of d = 0.57 or larger in TOST you would need 66 participants in each condition.
      Let’s say your SESOI is actually d = 0.35. To have 80% power in TOST you would need 169 participants in each condition (you’d need 130 participants in each condition to have 80% power to reject the null of d = 0 in NHST).
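As a point of reference, the corresponding NHST power analysis can be run with base R's power.t.test(); for d = 0.35 it returns roughly 129 participants per group, which rounds up to the 130 mentioned above:

power.t.test(delta = 0.35, sd = 1, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")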
      Conclusion
We see you always need somewhat more participants to reject a meaningful effect than to reject the null for the same meaningful effect. Remember that since TOST can be performed based on Cohen's d, you can use it in meta-analyses as well (Rogers, Howard, & Vessey, 1993). This is a great place to use equivalence hypothesis testing and reject a small effect (e.g., d = 0.2, or even d = 0.1), for which you need quite a lot of observations (i.e., 517, or even 2069).
Equivalence testing has many benefits. It fixes the dichotomous nature of NHST. You can now 1) reject the null, and fail to reject the null of equivalence (there is probably something, of the size you find meaningful), 2) reject the null, and reject the null of equivalence (there is something, but it is not large enough to be meaningful), 3) fail to reject the null, and reject the null of equivalence (the effect is smaller than anything you find meaningful), and 4) fail to reject the null, and fail to reject the null of equivalence (undetermined: you don't have enough data to say there is an effect, and you don't have enough data to say there is a lack of a meaningful effect). These four situations are visualized below.
      There are several papers throughout the scientific disciplines telling us to use equivalence testing. I’m definitely not the first. But in my experience, the trick to get people to use better statistical approaches is to make it easy to do so. I’ll work on a manuscript that tries to make these tests easy to use (if you read this post this far, and work for a journal that might be interested in this, drop me a line – I’ll throw in an easy to use spreadsheet just for you). Thinking about meaningful effects in terms of standardized effect sizes and being able to perform these test based on summary statistics might just do the trick. Try it. 


      Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4. http://doi.org/10.3389/fpsyg.2013.00863
      Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses: Sequential analyses. European Journal of Social Psychology, 44(7), 701–710. http://doi.org/10.1002/ejsp.2023
      Rogers, J. L., Howard, K. I., & Vessey, J. T. (1993). Using significance tests to evaluate equivalence between two experimental groups. Psychological Bulletin, 113(3), 553.
      Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15(6), 657–680.


      R profiling


      (This article was first published on R – ipub, and kindly contributed to R-bloggers)

      Profiling in R

R has a built-in performance and memory profiling facility: Rprof. Type ?Rprof into your console to learn more.

      The way the profiler works is as follows:

      1. you start the profiler by calling Rprof, providing a filename where the profiling data should be stored
      2. you call the R functions that you want to analyse
      3. you call Rprof(NULL) to stop the profiler
4. you analyse the file created by Rprof, typically using summaryRprof

      For example:

Rprof(tmp <- tempfile())  # start profiling into a temporary file
example(glm)              # the code we want to profile
Rprof(NULL)               # stop the profiler
summaryRprof(tmp)         # summarise the collected samples
unlink(tmp)               # remove the temporary file

      The output looks like this:

      $by.self
                     self.time self.pct total.time total.pct
      "str.default"       0.02    14.29       0.10     71.43
      "deparse"           0.02    14.29       0.04     28.57
      "as.name"           0.02    14.29       0.02     14.29
      "formals"           0.02    14.29       0.02     14.29
      "make.names"        0.02    14.29       0.02     14.29
      "parent.frame"      0.02    14.29       0.02     14.29
      "pmatch"            0.02    14.29       0.02     14.29
      
      $by.total
                          total.time total.pct self.time self.pct
      "eval"                    0.14    100.00      0.00     0.00
      "withVisible"             0.14    100.00      0.00     0.00
      "str.default"             0.10     71.43      0.02    14.29
      "<Anonymous>"             0.10     71.43      0.00     0.00
      "capture.output"          0.10     71.43      0.00     0.00
      "doTryCatch"              0.10     71.43      0.00     0.00
      "evalVis"                 0.10     71.43      0.00     0.00
      ".rs.valueContents"       0.10     71.43      0.00     0.00
      ".rs.valueFromStr"        0.10     71.43      0.00     0.00
      "str"                     0.10     71.43      0.00     0.00
      "try"                     0.10     71.43      0.00     0.00
      "tryCatch"                0.10     71.43      0.00     0.00
      "tryCatchList"            0.10     71.43      0.00     0.00
      "tryCatchOne"             0.10     71.43      0.00     0.00
      "do.call"                 0.08     57.14      0.00     0.00
      "strSub"                  0.08     57.14      0.00     0.00
      "deparse"                 0.04     28.57      0.02    14.29
      "example"                 0.04     28.57      0.00     0.00
      "FUN"                     0.04     28.57      0.00     0.00
      "lapply"                  0.04     28.57      0.00     0.00
      "match"                   0.04     28.57      0.00     0.00
      "source"                  0.04     28.57      0.00     0.00
      "as.name"                 0.02     14.29      0.02    14.29
      "formals"                 0.02     14.29      0.02    14.29
      "make.names"              0.02     14.29      0.02    14.29
      "parent.frame"            0.02     14.29      0.02    14.29
      "pmatch"                  0.02     14.29      0.02    14.29
      "anova"                   0.02     14.29      0.00     0.00
      "anova.glm"               0.02     14.29      0.00     0.00
      "data.frame"              0.02     14.29      0.00     0.00
      "deParse"                 0.02     14.29      0.00     0.00
      ".deparseOpts"            0.02     14.29      0.00     0.00
      ".getXlevels"             0.02     14.29      0.00     0.00
      "glm"                     0.02     14.29      0.00     0.00
      "%in%"                    0.02     14.29      0.00     0.00
      "match.call"              0.02     14.29      0.00     0.00
      "mode"                    0.02     14.29      0.00     0.00
      "NextMethod"              0.02     14.29      0.00     0.00
      "paste"                   0.02     14.29      0.00     0.00
      "sapply"                  0.02     14.29      0.00     0.00
      "str.data.frame"          0.02     14.29      0.00     0.00
      "summary"                 0.02     14.29      0.00     0.00
      "%w/o%"                   0.02     14.29      0.00     0.00
      
      $sample.interval
      [1] 0.02
      
      $sampling.time
      [1] 0.14

      A lot of information!

As a side note, the sample.interval of 0.02 indicates the frequency with which Rprof samples the call stack and takes its measurements. So Rprof works entirely through polling. As a result, summaryRprof may look different each time you profile the same code: not only slight differences in the numbers, but also, for example, missing elements, because in one run a measurement happened to fall while, say, mode was executing, whereas in another run mode slipped between two measurements.
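If you want a finer-grained (though noisier and slightly more intrusive) picture, you can lower the sampling interval when starting the profiler. The 0.005 below is just an illustrative value; very small intervals may not be honoured on every platform:

Rprof(tmp <- tempfile(), interval = 0.005)
example(glm)
Rprof(NULL)
summaryRprof(tmp)
unlink(tmp)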

      Stylized profiling example

      Let’s look at a stylized example. Assume you have the following functions:

      Do_1 <- function() {
        combn(1:20, 5)
        for (i in 1:15) Do_2()
        for (i in 1:25) Do_4()
      }
      
      Do_2 <- function() {
        combn(1:15, 5)
        for (i in 1:5) Do_3()
      }
      
      Do_3 <- function() {
        combn(1:14, 5)
        for (i in 1:20) Do_4()
      }
      
      Do_4 <- function() {
        paste(1:1000)
        combn(1:11, 5)
      }

Ugly and pointless, true, but for the sake of this example they serve their purpose: they take some time to execute.

So, again, let's profile Do_1:

      Rprof(tmp <- tempfile())
      Do_1()
      Rprof(NULL)
      summaryRprof(tmp)

      Which looks like this:

      $by.self
               self.time self.pct total.time total.pct
      "combn"       1.24    71.26       1.28     73.56
      "paste"       0.46    26.44       0.46     26.44
      "matrix"      0.04     2.30       0.04      2.30
      
      $by.total
               total.time total.pct self.time self.pct
      "Do_1"         1.74    100.00      0.00     0.00
      "Do_2"         1.72     98.85      0.00     0.00
      "Do_3"         1.68     96.55      0.00     0.00
      "Do_4"         1.48     85.06      0.00     0.00
      "combn"        1.28     73.56      1.24    71.26
      "paste"        0.46     26.44      0.46    26.44
      "matrix"       0.04      2.30      0.04     2.30

Nice. We see that combn uses about three quarters of the computing time, while paste uses only about a quarter.

      But hang on: matrix? Where does that come from? Must be either combn or paste calling that internally. No big deal here, as matrix only uses 2.3% of total time. But still, would be interesting to understand this, right?

      Analysis of profiling data with prof.tree

Luckily, the prof.tree package by Artem Klevtsov, which is available from CRAN or from github, provides an alternative way of analyzing that data. It displays profiling information as a tree:

      library(prof.tree)
      pr <- prof.tree(tmp)
      print(pr, limit = NULL)

      This will print like so:

      levelName real percent         env
      1  calls                          1.74 100.0 %            
      2   °--Do_1                       1.74 100.0 % R_GlobalEnv
      3       ¦--Do_2                   1.72  98.9 % R_GlobalEnv
      4       ¦   ¦--Do_3               1.68  96.6 % R_GlobalEnv
      5       ¦   ¦   ¦--combn          0.22  12.6 %       utils
      6       ¦   ¦   ¦   °--matrix     0.02   1.1 %        base
      7       ¦   ¦   °--Do_4           1.46  83.9 % R_GlobalEnv
      8       ¦   ¦       ¦--combn      1.02  58.6 %       utils
      9       ¦   ¦       ¦   °--matrix 0.02   1.1 %        base
      10      ¦   ¦       °--paste      0.44  25.3 %        base
      11      ¦   °--combn              0.04   2.3 %       utils
      12      °--Do_4                   0.02   1.1 % R_GlobalEnv
      13          °--paste              0.02   1.1 %        base

      Surprise! matrix was called from combn, and not from paste!

      Note that pr is a data.tree structure, so all data.tree operations are available. For example, we can sum up specific functions by name:

      library(data.tree)
      SumByFunction <- function(name) {
        sum(pr$Get("real", filterFun = function(node) node$name == name))/pr$real
      }
      
      SumByFunction("combn")

      And, just as above, this gives us 73.56%.

Also, we can limit the nodes that are printed by pruning away all subtrees that account for less than, say, 5% of the total time:

      print(pr, limit = NULL, pruneFun = function(x) x$percent > 0.05)

      Et voilà:

      levelName real percent         env
      1 calls                     1.74 100.0 %            
      2  °--Do_1                  1.74 100.0 % R_GlobalEnv
      3      °--Do_2              1.72  98.9 % R_GlobalEnv
      4          °--Do_3          1.68  96.6 % R_GlobalEnv
      5              ¦--combn     0.22  12.6 %       utils
      6              °--Do_4      1.46  83.9 % R_GlobalEnv
      7                  ¦--combn 1.02  58.6 %       utils
      8                  °--paste 0.44  25.3 %        base

      Or we can use the data.tree plot facility to visualize this:

      cols <- colorRampPalette(c("green", "red"))(101)
      SetNodeStyle(pr, 
                   style = "filled,rounded", 
                   shape = "box",
                   fontname = "helvetica",
                   fillcolor = function(node) cols[round(node$percent * 100) + 1],
                   tooltip = function(node) node$real)
      
      plot(pr)

      This will plot like so:

[plot: the profiling call tree, with nodes shaded from green to red according to their share of total time]

      If you like what you see, make sure you show some love by starring Artem’s package on github.


      Introduction to R for Data Science :: Session 7 [Multiple Linear Regression Model in R  + Categorical Predictors, Partial and Part Correlation]


      (This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers)


Welcome to Introduction to R for Data Science Session 7: Multiple Regression + Dummy Coding, Partial and Part Correlations [Multiple Linear Regression in R. Dummy coding: various ways to do it in R. Factors. Inspecting the multiple regression model: regression coefficients and their interpretation, confidence intervals, predictions. Introducing {lattice} plots + ggplot2. Assumptions: multicollinearity and testing it with the {car} package. Predictive models with categorical and continuous predictors. Influence plot. Partial and part (semi-partial) correlation in R.]

      The course is co-organized by Data Science Serbia and Startit. You will find all course material (R scripts, data sets, SlideShare presentations, readings) on these pages.


Check out the Course Overview to access the learning material presented thus far.

      Data Science Serbia Course Pages [in Serbian]

      Startit Course Pages [in Serbian]

      Lecturers

      Summary of Session 7, 09. June 2016 :: Multiple Regression + Dummy Coding, Partial and Part Correlations.

Multiple Regression + Dummy Coding, Partial and Part Correlations. Multiple Linear Regression in R. Dummy coding: various ways to do it in R. Factors. Inspecting the multiple regression model: regression coefficients and their interpretation, confidence intervals, predictions. Introducing {lattice} plots + ggplot2. Assumptions: multicollinearity and testing it with the {car} package. Predictive models with categorical and continuous predictors. Influence plot. Partial and part (semi-partial) correlation in R.

      Intro to R for Data Science SlideShare :: Session 7



      R script :: Session 7

      ########################################################
      # Introduction to R for Data Science
      # SESSION 7 :: 9 June, 2016
      # Multiple Linear Regression in R
      # Data Science Community Serbia + Startit
      # :: Goran S. Milovanović and Branko Kovač ::
      ########################################################
       
      # clear
      rm(list=ls())
       
      #### read data
      library(datasets)
      library(broom)
      library(ggplot2)
      library(lattice)
      library(QuantPsyc)
       
      #### load
      data(iris)
      str(iris)
       
#### simple linear regression: Sepal Length vs Petal Length
      # Predictor vs Criterion {ggplot2}
      ggplot(data = iris,
             aes(x = Sepal.Length, y = Petal.Length)) +
        geom_point(size = 2, colour = "black") +
        geom_point(size = 1, colour = "white") +
        geom_smooth(aes(colour = "black"),
                    method='lm') +
        ggtitle("Sepal Length vs Petal Length") +
        xlab("Sepal Length") + ylab("Petal Length") +
        theme(legend.position = "none")

      # What is wrong here?
      # let's see...
      reg <- lm(Petal.Length ~ Sepal.Length, data=iris) 
      summary(reg)
      # Hm, everything seems fine to me...
       
# And now for something completely different (but in R)...
       
      #### Problems with linear regression in iris
      # Predictor vs Criterion {ggplot2} - group separation
      ggplot(data = iris, 
             aes(x = Sepal.Length,
                 y = Petal.Length,
                 color = Species)) + 
        geom_point(size = 2) +
        ggtitle("Sepal Length vs Petal Length") +
        xlab("Sepal Length") + ylab("Petal Length")

      # Predictor vs Criterion {ggplot2} - separate regression lines
      ggplot(data = iris, 
             aes(x = Sepal.Length,
                 y = Petal.Length,
                 colour=Species)) + 
        geom_smooth(method=lm) + 
        geom_point(size = 2) +
        ggtitle("Sepal Length vs Petal Length") +
        xlab("Sepal Length") + ylab("Petal Length")

      ### Ooops...
      ### overview and considerations
      plot(iris[,c(1,3,5)],
           main = "Inspect: Sepal vs. Petal Length \nfollowing the discovery of the Species...",
           cex.main = .75,
           cex = .6)

      ### better... {lattice}
xyplot(Petal.Length ~ Sepal.Length | Species, # {lattice} xyplot
             data = iris,
             xlab = "Sepal Length", ylab = "Petal Length"
             )

### Petal.Length and Sepal.Length: EDA and distributions
      par(mfcol = c(2,2))
      # boxplot Petal.Length
      boxplot(iris$Petal.Length,
              horizontal = TRUE, 
              xlab="Petal Length")
      # histogram: Petal.Length
      hist(iris$Petal.Length, 
           main="", 
           xlab="Petal Length", 
           prob=T)
      lines(density(iris$Petal.Length),
            lty="dashed", 
            lwd=2.5, 
            col="red")
      # boxplot Sepal.Length
      boxplot(iris$Sepal.Length,
              horizontal = TRUE, 
              xlab="Sepal Length")
      # histogram: Sepal.Length
      hist(iris$Sepal.Length, 
           main="", 
           xlab="Sepal Length", 
           prob=T)
      lines(density(iris$Sepal.Length),
            lty="dashed", 
            lwd=2.5, 
            col="blue")

      # Petal Length and Sepal Length: Conditional Densities
densityplot(~ Petal.Length | Species, # {lattice} densityplot
             data = iris,
             plot.points=FALSE,
             xlab = "Petal Length", ylab = "Density",
             main = "P(Petal Length|Species)",
             col.line = 'red'
      )

densityplot(~ Sepal.Length | Species, # {lattice} densityplot
                  data = iris,
                  plot.points=FALSE,
                  xlab = "Sepal Length", ylab = "Density",
                  main = "P(Sepal Length|Species)",
                  col.line = 'blue'
      )

      # Linear regression in subgroups
      species <- unique(iris$Species)
      w1 <- which(iris$Species == species[1]) # setosa
      reg <- lm(Petal.Length ~ Sepal.Length, data=iris[w1,]) 
      tidy(reg)
      w2 <- which(iris$Species == species[2]) # versicolor
      reg <- lm(Petal.Length ~ Sepal.Length, data=iris[w2,]) 
      tidy(reg)
      w3 <- which(iris$Species == species[3]) # virginica
      reg <- lm(Petal.Length ~ Sepal.Length, data=iris[w3,]) 
      tidy(reg)
       
      #### Dummy Coding: Species in the iris dataset
      is.factor(iris$Species)
      levels(iris$Species)
      reg <- lm(Petal.Length ~ Species, data=iris) 
      tidy(reg)
      glance(reg)
      # Never forget what the regression coefficient for a dummy variable means:
# It tells us about the effect of moving from the baseline (reference) level to the respective level!
      # Here: baseline = setosa (cmp. levels(iris$Species) vs. the output of tidy(reg))
      # NOTE: watch for the order of levels!
      levels(iris$Species) # Levels: setosa versicolor virginica
      iris$Species <- factor(iris$Species, 
                             levels = c("versicolor", 
                                        "virginica",
                                        "setosa"))
      levels(iris$Species)
      # baseline is now: versicolor
      reg <- lm(Petal.Length ~ Species, data=iris) 
      tidy(reg) # The regression coefficents (!): figure out what has happened!
       
      ### another way to do dummy coding
      rm(iris); data(iris) # ...just to fix the order of Species back to default
      levels(iris$Species)
      contrasts(iris$Species) = contr.treatment(3, base = 1)
contrasts(iris$Species) # this is probably what you remember from your stats class...
      iris$Species <- factor(iris$Species, 
                             levels = c ("virginica","versicolor","setosa"))
      levels(iris$Species)
      contrasts(iris$Species) = contr.treatment(3, base = 1)
      # baseline is now: virginica
      contrasts(iris$Species) # consider carefully what you need to do
       
      ### Petal.Length ~ Species (Dummy Coding) + Sepal.Length 
      rm(iris); data(iris) # ...just to fix the order of Species back to default
      reg <- lm(Petal.Length ~ Species + Sepal.Length, data=iris)
      # BTW: since is.factor(iris$Species)==T, R does the dummy coding in lm() for you
      regSum <- summary(reg)
      regSum$r.squared
      regSum$coefficients
      # compare w. Simple Linear Regression
      reg <- lm(Petal.Length ~ Sepal.Length, data=iris) 
      regSum <- summary(reg)
      regSum$r.squared
      regSum$coefficients
       
      ### Comparing nested models
      reg1 <- lm(Petal.Length ~ Sepal.Length, data=iris)
      reg2 <- lm(Petal.Length ~ Species + Sepal.Length, data=iris) # reg1 is nested under reg2
      # terminology: reg2 is a "full model"
      # this terminology will be used quite often in Logistic Regression
       
      # NOTE: Nested models
      # There is a set of coefficients for the nested model (reg1) such that it
      # can be expressed in terms of the full model (reg2); in our case it is simple 
      # HOME: - figure it out.
       
      anova(reg1, reg2) # partial F-test; Species certainly has an effect beyond Sepal.Length
      # NOTE: for partial F-test, see:
      # http://pages.stern.nyu.edu/~gsimon/B902301Page/CLASS02_24FEB10/PartialFtest.pdf
       
      # Influence Plot
      regFrame <- augment(reg2)
      ## Influence plot
# influence data (for the full model reg2)
infReg <- as.data.frame(influence.measures(reg2)$infmat)
      # data.frame for ggplot2
      plotFrame <- data.frame(residual = regFrame$.std.resid,
                              leverage = regFrame$.hat,
                              cookD = regFrame$.cooksd)
       
      ggplot(plotFrame,
             aes(y = residual,
                 x = leverage)) +
        geom_point(size = plotFrame$cookD*100, shape = 1) +
        ggtitle("Influence Plot\nSize of the circle corresponds to Cook's distance") +
        theme(plot.title = element_text(size=8, face="bold")) +
        ylab("Standardized Residual") + xlab("Leverage")

      #### Multiple Regression - by the book
      # Following: http://www.r-tutor.com/elementary-statistics/multiple-linear-regression
      # (that's from your reading list, to remind you...)
      data(stackloss)
      str(stackloss)
      # Data set description
      # URL: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/stackloss.html
      # Air Flow represents the rate of operation of the plant. 
      # Water Temp is the temperature of cooling water circulated through coils in the absorption tower. 
      # Acid Conc. is the concentration of the acid circulating.
      # stack.loss (the dependent variable) is 10 times the percentage of the ingoing ammonia to 
      # the plant that escapes from the absorption column unabsorbed;
      # that is, an (inverse) measure of the over-all efficiency of the plant.
      stacklossModel = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., 
                          data=stackloss)
       
      # let's see:
      summary(stacklossModel)
      glance(stacklossModel) # {broom}
      tidy(stacklossModel) # {broom}
       
      # predict new data
      obs = data.frame(Air.Flow=72, Water.Temp=20, Acid.Conc.=85)
      predict(stacklossModel, obs)
       
      # confidence intervals
      confint(stacklossModel, level=.95) # 95% CI
      confint(stacklossModel, level=.99) # 99% CI
      # 95% CI for Acid.Conc. only
      confint(stacklossModel, "Acid.Conc.", level=.95)
       
      # default regression plots in R
      plot(stacklossModel)

      # multicolinearity
      library(car) # John Fox's car package
      VIF <- vif(stacklossModel)
      VIF
      sqrt(VIF)
      # Variance Inflation Factor (VIF)
# The increase in the ***variance*** of a regression coeff. due to collinearity
      # NOTE: sqrt(VIF) = how much larger the ***SE*** of a reg.coeff. vs. what it would be
      # if there were no correlations with the other predictors in the model
      # NOTE: lower_bound(VIF) = 1; no upper bound; VIF > 2 --> (Concerned == TRUE)
      Tolerance <- 1/VIF # obviously, tolerance and VIF are redundant
      Tolerance
# NOTE: you can inspect multicollinearity in the multiple regression model
      # by conducting a Principal Component Analysis over the predictors;
      # when the time is right.
       
      #### R for partial and part (semi-partial) correlations
      library(ppcor) # a good one; there are many ways to do this in R
       
      #### partial correlation in R
      dataSet <- iris
      str(dataSet)
      dataSet$Species <- NULL
      irisPCor <- pcor(dataSet, method="pearson")
      irisPCor$estimate # partial correlations
      irisPCor$p.value # results of significance tests
      irisPCor$statistic # t-test on n-2-k degrees of freedom ; k = num. of variables conditioned
      # partial correlation between x and y while controlling for z
      partialCor <- pcor.test(dataSet$Sepal.Length, dataSet$Petal.Length,
                           dataSet$Sepal.Width,
                           method = "pearson")
      partialCor$estimate
      partialCor$p.value
      partialCor$statistic
       
      # NOTE:
      # Formally, the partial correlation between X and Y given a set of n 
      # controlling variables Z = {Z1, Z2, ..., Zn}, written ρXY·Z, is the 
      # correlation between the residuals RX and RY resulting from the 
      # linear regression of X with Z and of Y with Z, respectively. 
      # The first-order partial correlation (i.e. when n=1) is the difference 
      # between a correlation and the product of the removable correlations 
      # divided by the product of the coefficients of alienation of the 
      # removable correlations. 
      # NOTE: coefficient of alienation = 1 - R2 (R2 = "r-squared")
      # coefficient of alienation = the proportion of variance "unaccounted for"
       
      #### semi-partial correlation in R
      # NOTE: ... Semi-partial correlation is the correlation of two variables 
      # with variation from a third or more other variables removed only 
      # from the ***second variable***
      # NOTE: The first variable <- rows, the second variable <-columns
      # cf. ppcor: An R Package for a Fast Calculation to Semi-partial Correlation Coefficients (2015)
      # Seongho Kim, Biostatistics Core, Karmanos Cancer Institute, Wayne State University
      # URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4681537/
      irisSPCor <- spcor(dataSet, method = "pearson")
      irisSPCor$estimate
      irisSPCor$p.value
      irisSPCor$statistic
       
      # NOTE: ... Semi-partial correlation is the correlation of two variables 
      # with variation from a third or more other variables removed only 
      # from the ***second variable***
      partCor <- spcor.test(dataSet$Sepal.Length, dataSet$Petal.Length,
                             dataSet$Sepal.Width,
                             method = "pearson")
      # NOTE: this is a correlation of dataSet$Sepal.Length w. dataSet$Petal.Length
      # when the variance of dataSet$Petal.Length (2nd variable) due to dataSet$Sepal.Width
      # is removed!
      partCor$estimate
      partCor$p.value
      partCor$statistic
       
      # NOTE: In multiple regression, this is the semi-partial (or part) correlation
      # that you need to inspect:
      # assume a model with X1, X2, X3 as predictors, and Y as a criterion
      # You need a semi-partial of X1 and Y following the removal of X2 and X3 from Y
      # It goes like this: in Step 1, you perform a multiple regression Y ~ X2 + X3;
      # In Step 2, you take the residuals of Y, call them RY; in Step 3, you regress (correlate)
      # RY ~ X1: the correlation coefficient that you get from Step 3 is the part correlation
      # that you're looking for.
       
      # Give a thought to the following discussion on categorical predictors:
      # http://stats.stackexchange.com/questions/133203/partial-correlation-and-multiple-regression-controlling-for-categorical-variable
      # What's your take on this?



      Readings :: Session 8: Logistic Regression [16. June, 2016, @Startit.rs, 19h CET]


      R for Publication by Page Piccinini: Lesson 4 – Multiple Regression


      (This article was first published on DataScience+, and kindly contributed to R-bloggers)

      Introduction

      Today we’ll see what happens when you have not one, but two variables in your model. We will also continue to use some old and new dplyr calls, as well as another parameter for our ggplot2 figure. I’ll be taking for granted some of the set-up steps from Lesson 1, so if you haven’t done that yet be sure to go back and do it.

      By the end of this lesson you will:

      • Have learned the math of multiple regression.
      • Be able to make a figure to present data for a multiple regression.
      • Be able to run a multiple regression and interpret the results.
      • Have an R Markdown document to summarise the lesson.

There is a video at the end of this post which provides the background on the math of multiple regression and introduces the data set we'll be using today. There is also some extra explanation of some of the new code we'll be writing. For all of the coding please see the text below. A PDF of the slides can be downloaded here. Before beginning please download these text files, they are the data we will use for the lesson. We'll be using data from the "Star Trek" universe (both "Star Trek: The Original Series" and "Star Trek: The Next Generation"), collected from The Star Trek Project. All of the data and completed code for the lesson can be found here.

      Lab Problem

      As mentioned, the lab portion of the lesson uses data from the television franchise “Star Trek”. Specifically, we’ll be looking at data about the alien species on the show, and whether they are expected to become extinct or not. We’ll be testing three questions using logistic regression, looking at both the main effects of these variables and seeing if there is an interaction between the variables.

• Series: Is a given species more or less likely to become extinct in "Star Trek: The Original Series" or "Star Trek: The Next Generation"?
      • Alignment: Is a given species more or less likely to become extinct if it is a friend or foe of the Enterprise (the main starship on “Star Trek”)?
      • Series x Alignment: Is there an interaction between these variables?

      Setting up Your Work Space

      As we did for Lesson 1 complete the following steps to create your work space. If you want more details on how to do this refer back to Lesson 1:

      • Make your directory (e.g. “rcourse_lesson4”) with folders inside (e.g. “data”, “figures”, “scripts”, “write_up”).
      • Put the data files for this lesson in your “data” folder.
      • Make an R Project based in your main directory folder (e.g. “rcourse_lesson4”).
      • Commit to Git.
      • Create the repository on Bitbucket and push your initial commit to Bitbucket.

      Okay you’re all ready to get started!

      Cleaning Script

      Make a new script from the menu. We start the same way we usually do, by having a header line talking about loading any necessary packages and then listing the packages we’ll be using. Today in addition to loading dplyr we’ll also be using the package purrr. If you haven’t used the package purrr before be sure to install it first using the code below. Note, this is a one time call, so you can type the code directly into the console instead of saving it in the script.

      install.packages("purrr")

Once you have the package installed, copy the code below to your script and run it.

      ## LOAD PACKAGES ####
      library(dplyr)
      library(purrr)

Reading in our data is a little more complicated than it has been in the past. Here we have three data sets, one for each series ("The Original Series", "The Animated Series", and "The Next Generation")*. We want to read in each file at the same time and then combine them into a single data frame. This is where purrr comes in. We first get the file names with the list.files() call, then perform the same action on each file with purrr's map() call, in this case reading in the file, and finally we use the reduce() call to combine them all into a single data frame. Our read.table() call is also a little different than usual. I've added na.strings = c("", NA) to make sure that any empty cells are coded as "NA"; this will come in handy later. For a more detailed explanation of what the code is doing watch the video. Note, this call assumes that all files have the same number of columns and the same column names. Copy and run the code below to read in the three files.

      ## READ IN DATA ####
      data = list.files(path = "data", full.names = T) %>%
             map(read.table, header = T, sep = "\t", na.strings = c("", NA)) %>%
             reduce(rbind)

      As always we now need to clean our data. We’ll start with a couple filter() calls to get rid of unwanted data based on our variables of interest. First, we only want to look at data from “The Original Series” and “The Next Generation”, so we’re going to drop any data from “The Animated Series”, coded as “tas”. Next the “alignment” column has several values, but we only want to include species that are labeled as a “friend” or a “foe”. We’ll also include a couple mutate() and factor() calls so that R drops the now filtered out levels for each of our independent variables.

      ## CLEAN DATA ####
      data_clean = data %>%
                   filter(series != "tas") %>%
                   mutate(series = factor(series)) %>%
                   filter(alignment == "foe" | alignment == "friend") %>%
                   mutate(alignment = factor(alignment))

      Our column for our dependent variable is a little more complicated. Currently there is a column called “conservation”, which is coded for the likelihood of a species becoming extinct. The codings are: 1) LC – least concern, 2) NT – near threatened, 3) VU – vulnerable, 4) EN – endangered, 5) CR – critically endangered, 6) EW – extinct in the wild, and 7) EX – extinct. If you look at the data you’ll see that most species have the classification of “LC”, so for our analysis we’re going to look at “LC” species versus all other species as our dependent variable. First we’re going to filter out any data where “conservation” is an “NA”, as we can’t know if it should be labeled as “LC” or something else. We can do this with the handy !is.na() call. Recall that an ! means “is not” so what we’re saying is “if it’s not an “NA” keep it”, this was why we wanted to make sure empty cells were read in as “NA”s earlier. Next we’ll make a new column called “extinct” for our logistic regression using the mutate() call, where an “LC” species gets a “0”, not likely to become extinct, and all other species a “1”, for possible to become extinct. Copy and run the updated code below.

      data_clean = data %>%
                   filter(series != "tas") %>%
                   mutate(series = factor(series)) %>%
                   filter(alignment == "foe" | alignment == "friend") %>%
                   mutate(alignment = factor(alignment)) %>%
                   filter(!is.na(conservation)) %>%
                   mutate(extinct = ifelse(conservation == "LC", 0, 1))

      There’s still one more thing we need to do in our cleaning script. The data reports all species that appear or are discussed in a given episode. As a result, some species occur more than others if they are in several episodes. We don’t want to bias our data towards species that appear on the show a lot, so we’re only going to include each species once per series. To do this we’ll do a group_by() call including “series”, “alignment”, and “alien”, we then do an arrange() call to order the data by episode number, and finally we use a filter() call with row_number() to pull out only the first row, or the first occurrence of a given species within our other variables. For a more detailed explanation of the code watch the video. The last line ungroups our data. Copy and run the updated code below.

      data_clean = data %>%
                   filter(series != "tas") %>%
                   mutate(series = factor(series)) %>%
                   filter(alignment == "foe" | alignment == "friend") %>%
                   mutate(alignment = factor(alignment)) %>%
                   filter(!is.na(conservation)) %>%
                   mutate(extinct = ifelse(conservation == "LC", 0, 1)) %>%
                   group_by(series, alignment, alien) %>%
                   arrange(episode) %>%
                   filter(row_number() == 1) %>%
                   ungroup()

      The data is clean and ready to go to make a figure! Before we move to our figures script be sure to save your script in the “scripts” folder and use a name ending in “_cleaning”, for example mine is called “rcourse_lesson4_cleaning”. Once the file is saved commit the change to Git. My commit message will be “Made cleaning script.”. Finally, push the commit up to Bitbucket.

      Figures Script

      Open a new script in RStudio. You can close the cleaning script or leave it open, we’re done with it for this lesson. This new script is going to be our script for making all of our figures. We’ll start with using our source() call to read in our cleaning script, and then we’ll load our packages, in this case ggplot2. For a reminder of what source() does go back to Lesson 2. Assuming you ran all of the code in the cleaning script there’s no need to run the source() line of code, but do load ggplot2. Copy the code below and run as necessary.

      ## READ IN DATA ####
      source("scripts/rcourse_lesson4_cleaning.R")
      
      ## LOAD PACKAGES ####
      library(ggplot2)

      Now we’ll clean our data specifically for our figures. There’s only one change I’m going to make for “data_figs” from “data_clean”. Since R codes variables alphabetically, currently “tng”, for “The Next Generation”, will be plotted before “tos”, for “The Original Series”, which is not desirable since chronologically it is the reverse. So, using the mutate() and factor() calls I’m going to change the order of the levels so that it’s “tos” and then “tng”. I’m also going to update the actual text with the “labels” setting so that the labels are more informative and complete. Copy and run the code below.

      ## ORGANIZE DATA ####
      data_figs = data_clean %>%
                  mutate(series = factor(series, levels=c("tos", "tng"),
                                  labels = c("The Original Series", "The Next Generation")))

      Just as in Lesson 3 when we summarised our “0”s and “1”s for our logistic regression into a percentage, we’ll do the same thing here. In this example we group by our two independent variables, “series” and “alignment”, and then get the mean of our dependent variable, “extinct”. Finally, we end our call with ungroup(). Copy and run the code below.

      # Summarise data by series and alignment
      data_figs_sum = data_figs %>%
                      group_by(series, alignment) %>%
                      summarise(perc_extinct = mean(extinct) * 100) %>%
                      ungroup()

Now that our data frame for the figure is ready we can make our barplot. Remember, because we only have four values in "data_figs_sum", 1) "tos" and "foe", 2) "tos" and "friend", 3) "tng" and "foe", and 4) "tng" and "friend", we can't make a boxplot of the data because there is no spread. The first few and last few lines of the code below should be familiar to you. We have our header comment and then we write the code for "extinct.plot" with the attributes for the x- and y-axes. Something new is the fill attribute. This is how we get grouped barplots. So, first there will be separate bars for each series, and then two bars within "series", one for each "alignment" level. The fill attribute says to use the fill color of the bars to show which is which level. The geom_bar() call we've used before, but the addition of the position = "dodge" tells R to put the bars side by side instead of on top of each other in the grouped portion of the plot. The next line we used last time to set the range of the y-axis, but the final two lines of the plot are new. The call geom_hline() is used to draw a horizontal line on the plot. I've chosen to draw a line where y is 50 to show chance, thus yintercept = 50. The final line of code manually sets the colors. I've decided to go with "red" and "yellow" as they are the most common "Star Trek" uniform colors. The end of the code block prints the figure, and, if you uncomment the pdf() and dev.off() lines, will save it to a PDF. To learn more about the new lines of code watch the video. Copy and run the code to make the figure.

      ## MAKE FIGURES ####
      extinct.plot = ggplot(data_figs_sum, aes(x = series, y = perc_extinct, fill = alignment)) +
                     geom_bar(stat = "identity", position = "dodge") +
                     ylim(0, 100) +
                     geom_hline(yintercept = 50) +
                     scale_fill_manual(values = c("red", "yellow"))
      
      # pdf("figures/extinct.pdf")
      extinct.plot
      # dev.off()

As you can see in the figure below, it looks like there is an interaction between "series" and "alignment". In "The Original Series" a "foe" was more likely to go extinct than a "friend", whereas in "The Next Generation" the effect is reversed and the difference is also much larger.

[figure: grouped barplot of the percentage of species likely to become extinct, by series and alignment]

      In the script on Github you’ll see I’ve added several other parameters to my figures, such as adding a title, customizing how my axes are labeled, and changing where the legend is placed. Play around with those to get a better idea of how to use them in your own figures.

      Save your script in the “scripts” folder and use a name ending in “_figures”, for example mine is called “rcourse_lesson4_figures”. Once the file is saved commit the change to Git. My commit message will be “Made figures script.”. Finally, push the commit up to Bitbucket.

      Statistics Script

      Open a new script and on the first few lines write the following, same as for our figures script. Note, just as in previous lessons we’ll add a header for packages, but we won’t be loading any for this script.

      ## READ IN DATA ####
      source("scripts/rcourse_lesson4_cleaning.R")
      
      ## LOAD PACKAGES ####
      # [none currently needed]

      We’ll also make a header for organizing our data. Just as I changed the order of “series” for the figure, I’m going to do the same thing in my data frame for the statistics so the model coefficients are easier to interpret. There’s no need for me to change the names of the levels though since they are clear enough as is for the analysis. Copy and run the code below.

      ## ORGANIZE DATA ####
      data_stats = data_clean %>%
                   mutate(series = factor(series, levels = c("tos", "tng")))

      We’re going to build several logistic regressions, working up to our full model with the interaction. We’ll add a header for our code and then a comment describing our first model. The first model will use just our one variable “series”. This code should be familiar from Lesson 3. Copy and run the code below.

      ## BUILD MODELS ####
      # One variable (series)
      extinct_series.glm = glm(extinct ~ series, family = "binomial", data = data_stats)
      
      extinct_series.glm_sum = summary(extinct_series.glm)
      extinct_series.glm_sum

The summary of the model is provided below. Looking first at the estimate for the intercept we see that it is positive (0.48551). This means that in "The Original Series" a given species was likely to be headed towards extinction (since 0 is chance, positive numbers are above chance, negative numbers below chance). Looking at the p-value for the intercept (0.0613) we can see there was a trending effect of the intercept. So we can't say that in "The Original Series" species were significantly likely to become extinct. More importantly though let's look at the estimate for our variable of interest, "series". Our estimate is negative (-0.05264), which suggests species were less likely to become extinct in "The Next Generation" than in "The Original Series". But is this difference significant? Our p-value (0.8689) would suggest no.

[screenshot: summary() output for extinct_series.glm]

      Next let’s look at our other single variable, “alignment”. The code is provided below. It is the same as the code above only using a different variable. Copy and run the code below.

      # One variable (alignment)
      extinct_alignment.glm = glm(extinct ~ alignment, family = "binomial", data = data_stats)
      
      extinct_alignment.glm_sum = summary(extinct_alignment.glm)
      extinct_alignment.glm_sum

The summary of the model is provided below. Our baseline here is "foe" and the intercept is negative (-0.1112), suggesting that foes are likely to not become extinct, but the intercept is not significant (p = 0.63753). However, we do get a significant effect of "alignment" (p = 0.00228). Our estimate is positive (0.9543), which means friends are more likely to become extinct than foes.

[screenshot: summary() output for extinct_alignment.glm]

Now we can put all of this together in a single model, but first without an interaction. To do this we build our same model but use the + symbol to string together our variables. Copy and run the code below.

      # Two variables additive
      extinct_seriesalignment.glm = glm(extinct ~ series + alignment, family = "binomial", data = data_stats)
      
      extinct_seriesalignment.glm_sum = summary(extinct_seriesalignment.glm)
      extinct_seriesalignment.glm_sum

      The summary of the model is provided below. We’re not going to try and interpret the intercept because it’s not totally transparent what it means, but the estimates and significance tests for our variables match our single variable models: there is no effect of “series” but there is an effect of “alignment”. Note, the estimates aren’t exactly the same as in our single variable models. This is because our data set is unbalanced, and our additive model takes this into account when computing the estimates for both variables at the same time. If our data set were fully balanced we would have the same estimates across the single variable models and the additive model.

[screenshot: summary() output for extinct_seriesalignment.glm]
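If you want to see that imbalance for yourself, a quick cross-tabulation of the two predictors (plain base R, not part of the lesson's scripts) shows the unequal cell counts:

# counts of species per combination of series and alignment in the cleaned data
table(data_stats$series, data_stats$alignment)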

      Our final model takes our additive model but adds an interaction. To do this we just change the + symbol connecting our two variables to a * symbol. When saving the model I added a x between the variables in the name. Copy and run the code below.

      # Two variables interaction (pre-determined baselines)
      extinct_seriesxalignment.glm = glm(extinct ~ series * alignment, family = "binomial", data = data_stats)
      
      extinct_seriesxalignment.glm_sum = summary(extinct_seriesxalignment.glm)
      extinct_seriesxalignment.glm_sum

The summary of the model is provided below. Now our intercept is meaningful: it is the mean for our two baselines, foes in "The Original Series". We see that it has a positive estimate (0.7985) and is significant (p = 0.04666), suggesting that foes in "The Original Series" are likely headed towards extinction. Now, for the first time we also have a significant effect of "series" (p = 0.00313). Remember though, this is specifically for the data on foes, the baseline of "alignment". So, foes in "The Next Generation" were significantly less likely to become extinct (estimate = -1.5267) than in "The Original Series". We still have no effect of alignment, but again this is only in reference to the data from "The Original Series", our baseline for "series". Finally, we have a significant interaction of "series" and "alignment" (p = 0.00030) as expected based on our figure. The estimate is a little hard to interpret on its own; an easier way to understand it is to look at other baseline comparisons in the data and see where the results differ. For example, there is no effect of "alignment" for "The Original Series", but we don't know if this holds for "The Next Generation".

[screenshot: summary() output for extinct_seriesxalignment.glm]

In order to look at other baseline comparisons we're going to change the baseline of our model within the code for the model itself. We changed the baseline for "series" earlier when we made "data_figs", but changing it within the model gives us a little more flexibility, as we don't have to make an entirely new data frame. In the code below I've changed the baseline of "series" to "tng". Copy and run the code below.

      # Two variables interaction (change baseline for series)
      extinct_seriesxalignment_tng.glm = glm(extinct ~ relevel(series, "tng") * alignment, family = "binomial", data = data_stats)
      
      extinct_seriesxalignment_tng.glm_sum = summary(extinct_seriesxalignment_tng.glm)
      extinct_seriesxalignment_tng.glm_sum

      The summary of the model is provided below. Now the intercept is in reference to data for foes from “The Next Generation”. The intercept is still significant (p = 0.02524) but now the estimate is negative (-0.7282) suggesting that unlike in “The Original Series”, in “The Next Generation” foes are likely to not be headed towards extinction. Also interesting to note, the effect of “alignment” is now significant (p = 7.23e-06) with a positive estimate (1.8781) suggesting that in “The Next Generation” friends are significantly more likely to be headed towards extinction than foes. Looking at our other two effects, “series” and the interaction of “series” and “alignment”, they have exactly the same coefficients and significance values as our previous model. The only difference is the sign of the coefficient, positive or negative, is switched, since we switched the baseline value for “series”.

[Screenshot: summary output of the interaction model with “series” releveled to “tng”]
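If you want to confirm what relevel() is doing, a quick optional check (assuming “series” is stored as a factor in data_stats, which it must be for relevel() to work) is to compare the level ordering before and after; the first level listed is the baseline the model uses.

# The first level listed is the baseline for the model
levels(data_stats$series)
levels(relevel(data_stats$series, "tng"))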

      We could also relevel the variable “alignment” but keep “series” set to the original level, “tos”. Copy and run the code below.

      # Two variables interaction (change baseline for alignment)
      extinct_seriesxalignment_friend.glm = glm(extinct ~ series * relevel(alignment, "friend"), family = "binomial", data = data_stats)
      
      extinct_seriesxalignment_friend.glm_sum = summary(extinct_seriesxalignment_friend.glm)
      extinct_seriesxalignment_friend.glm_sum

The summary of the model is provided below. Now our intercept isn’t significant (p = 0.4937), so friends in “The Original Series” are not significantly more or less likely to become extinct (don’t forget, the baseline for “series” is back to “tos”!). Our effect for “series” continues to be significant (p = 0.0354), but now in the reverse direction from before (estimate = 0.9135): friends are significantly more likely to become extinct in “The Next Generation” than in “The Original Series”. As before, the values for our remaining variables, “alignment” and the interaction of “series” and “alignment”, are the same as in the original interaction model, just with reversed signs.

[Screenshot: summary output of the interaction model with “alignment” releveled to “friend”]

In the end our expectations based on the figure were confirmed statistically: there was an interaction of “series” and “alignment”. Breaking it down a bit more, we found that foes were significantly less likely to become extinct in “The Next Generation” than in “The Original Series”, but friends were significantly more likely to become extinct in “The Next Generation” than in “The Original Series”. Within each series, there was no difference between foes and friends in “The Original Series”, but there was in “The Next Generation”, with friends being more likely to become extinct.
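One optional way to see this whole pattern at once (not covered in the lesson itself) is to ask the interaction model for its predicted probability of extinction in each series-by-alignment cell. The sketch below assumes the factor levels are named "tos"/"tng" and "foe"/"friend"; check levels() on your own data if the names differ.

# Build a small grid with every combination of series and alignment
new_data = expand.grid(series = c("tos", "tng"),
                       alignment = c("foe", "friend"))

# Predicted probability of extinction for each cell of the design,
# on the probability (not log-odds) scale
cbind(new_data,
      prob_extinct = predict(extinct_seriesxalignment.glm,
                             newdata = new_data, type = "response"))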

You’ve now run a logistic regression with two variables and an interaction in R! Save your script in the “scripts” folder and use a name ending in “_statistics”; for example, mine is called “rcourse_lesson4_statistics”. Once the file is saved, commit the change to Git. My commit message will be “Made statistics script.”. Finally, push the commit up to Bitbucket.

      Write-up

Let’s make our write-up to summarise what we did today. First save your current working environment to a file such as “rcourse_lesson4_environment” in your “write_up” folder. If you forgot how to do this go back to Lesson 1. Open a new R Markdown document and follow the steps to get a new script. As before, delete everything below the chunk of script enclosed in the two sets of ---. Then on the first line use the following code to load our environment.

      ```{r, echo=FALSE}
      load("rcourse_lesson4_environment.RData")
      ```
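As a reminder from Lesson 1 (in case you don’t want to flip back), saving the environment can be done from your statistics script with something like the line below; the exact path is an assumption here and depends on how your project folders are set up relative to your working directory.

# Save everything currently in the workspace so the write-up can load it later
save.image("write_up/rcourse_lesson4_environment.RData")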

      Let’s make our sections for the write-up. I’m going to have three: 1) Introduction, 2) Results, and 3) Conclusion. See below for structure.

      # Introduction
      
      
      # Results
      
      
      # Conclusion

      In each of my sections I can write a little bit for any future readers. For example below is my Introduction.

      # Introduction
      
I analyzed alien species data from two "Star Trek" series, "Star Trek: The Original Series" and "Star Trek: The Next Generation". Specifically, I looked at whether series ("The Original Series", "The Next Generation") and species alignment to the Enterprise (foe, friend) could predict whether the species was classified as likely to become extinct in the near future or not. Note, for this analysis only species with a classification of "least concerned" in a more nuanced classification system were labeled as "not likely"; the rest were labeled as "likely".

      Turning to the Results section, I can include both my figure and my model results. For example, below is the code to include my figure and my full model with the interaction.

      # Results
      
I tested whether an alien species' likelihood of becoming extinct could be predicted by the series in which the species appeared and by whether the species was a friend or a foe. Initial visual examination of the data suggests that there is an interaction, where the relative likelihood of becoming extinct for friends versus foes flips between the two series.
      
      ```{r, echo=FALSE, fig.align='center'}
      extinct.plot
      ```
      
      To test this effect I ran a logistic regression with "not likely to become extinct" (0) or "likely to become extinct" (1) as the dependent variable and series and alignment as independent variables. There was a significant effect of series and a significant interaction of series and alignment.
      
      ```{r}
      extinct_seriesxalignment.glm_sum
      ```

Go ahead and fill out the rest of the document, including the releveled models to fully explain the interaction, and write a short conclusion; you can also look at the full version of my write-up via the link provided at the top of the lesson. When you are ready, save the script to your “write_up” folder (for example, my file is called “rcourse_lesson4_writeup”) and compile the HTML or PDF file. Once your write-up is made, commit the changes to Git. My commit message will be “Made write-up.”. Finally, push the commit up to Bitbucket. If you are done with the lesson you can go to your Projects menu and click “Close Project”.
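If you prefer to compile from the console rather than the RStudio “Knit” button, a minimal sketch is shown below; the file path assumes the folder and file name used above, with an .Rmd extension, so adjust it to match your own setup.

# Compile the R Markdown write-up to HTML (or PDF, depending on the
# output format set in the document's header)
rmarkdown::render("write_up/rcourse_lesson4_writeup.Rmd")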

      Congrats! You can now do multiple regression in R!

      Conclusion and Next Steps

Today you learned how to take a statistical test you already knew (logistic regression) and expand it to the case of two variables (multiple regression). You were also introduced to the package purrr for reading in multiple files at once, and expanded your knowledge of dplyr and ggplot2 calls. One issue you may have noticed is that with baselines we lose our ability to see general main effects across the data. For example, in our model with the interaction we never got to see whether there was an effect of “series” regardless of “alignment”, only within one “alignment” level or the other. Next time we’ll be able to get around this issue with an analysis of variance (ANOVA).
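If you’re curious before then, one common way to get an overall test for each term, rather than a test at a particular baseline, is a likelihood ratio table. The line below is just a preview sketch using base R, and not necessarily the approach the next lesson will take; note that anova() on a glm gives sequential (Type I) tests, so the order of terms matters.

# Sequential (Type I) likelihood ratio tests for each term in the model
anova(extinct_seriesxalignment.glm, test = "Chisq")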

      * Data for the rest of the series is not currently available in full on The Star Trek Project.

      Related Post

      1. R for Publication by Page Piccinini: Lesson 3 – Logistic Regression
      2. R for Publication by Page Piccinini: Lesson 2 – Linear Regression
      3. How to detect heteroscedasticity and rectify it?
      4. Using Linear Regression to Predict Energy Output of a Power Plant
      5. How to Perform a Logistic Regression in R
