
Connecting R to Everything with IFTTT


(This article was first published on Brian Connelly » R, and kindly contributed to R-bloggers)

IFTTT (“if this then that”) is one of my favorite tools. I use it to keep and share articles, turn on my home’s lights at sundown, alert me when certain keywords are mentioned on Twitter/Reddit/etc., and many other things. Recently, the great people at IFTTT announced the Maker Channel, which allows recipes to make and receive web requests. This second option caught my interest as a nice way to do all kinds of things from R. For example, you could set the temperature with a NEST Thermostat, blink your HUE lightbulbs, write some data to a Google Drive document, or do a whole lot (https://ifttt.com/channels) of other things.

For the sake of demonstration, I’m going to use the Maker Channel to send notifications to my phone (even though I’m partial to the pushoverr package for that).

If you don’t already have one, you’ll first need to create an IFTTT account. To receive notifications on your phone, you’ll also need to install the IF app for iOS or Android. The instructions that follow can either be done from the IF app or from the IFTTT website. I’m going to be using the website.

Setting Up the Maker Channel

First, we’ll need to visit the Channels page to enable the Maker Channel. In the search bar, type “maker”. Select it by clicking on the fancy M in the results.

Finding the Maker channel

Now activate the Maker Channel by clicking on the Connect button.

Connecting the Maker Channel

This will get the channel ready to go and generate a secret key. You can see in the picture below that my secret key is ci740p2XeuKq35nfHohG9Z. It’s called a secret key for a reason, so don’t share this with anyone, or they can use it to trigger whatever actions you define. If you’ve managed to leak yours like I have here, you can generate a new secret key by pressing the Reconnect Channel button (which I have done following this post, so don’t try to send me your ANOVA results).

Channel is now activated

Now we’re ready to go. If you’d like to see exactly how to trigger events with the Maker Channel, follow the How to Trigger Events link (if you’ve clicked the link here, you’ll need to replace REPLACE_ME with your actual secret key).

Creating a Recipe for Notifications

Now we’re going to create a recipe to send you a notification on your phone whenever you (or R!) connects to the Maker Channel. Go to My Recipes and click on the Create a Recipe button.

If you’re new to IFTTT, the goal is to connect some output from one channel (i.e., the this) with some other channel (i.e., the that).

Select THIS

Click on the this link, and then select the fancy M for the Maker Channel.

Choosing the Maker channel

For the this part of a recipe, we can only choose to receive web requests, so follow that option. We’re now going to pick a name for the event that is triggered whenever a web request is received. For now, let’s call it r_status. When you’re a Maker Channel pro, you can create multiple different events and have them do different things.

Entering the Event name

Now it’s time to choose the that part of the recipe, which is what IFTTT will do whenever it receives r_status events. Click on the that link, and select either Android Notifications or iOS Notifications, depending on which type of device you have. I’ll be going with the iOS option, so the remaining screenshots may look slightly different for you.

Select THAT

No matter which route you go, the next option is to select Send a Notification.

Here, we can get creative with what the notification says. You can include the name of the event ({{EventName}}), when the web request was received ({{OccurredAt}}), and up to three text values that are given in the web request ({{Value1}} through {{Value3}}). So if you’re fitting a linear model with lm, you could notify yourself of the slope, the intercept, and the amount of time it took to run, which we’ll do later.

Creating the notification message

Once you’ve crafted your message, hit the Create Action button, edit your recipe title (optional), and hit the Create Recipe button.

Giving R a Voice

Now fire up R or RStudio. We’re going to need the httr package to send web requests. If you don’t already have it (or think you might be out of date), run install.packages("httr").
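
If you’d rather not reinstall unconditionally, here’s a small optional sketch that installs httr only when it’s missing:

# install httr only if it isn't already available
if (!requireNamespace("httr", quietly = TRUE)) {
  install.packages("httr")
}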

Before we send our first notification, I’m going to save my event name and secret key in variables. If you do the same (but with your secret values), you’ll be able to easily copy and run all of the other code that follows.

my_event <- 'r_status'
my_key <- 'ci740p2XeuKq35nfHohG9Z'

Now let’s send a first message! First, we’ll build the URL where we’ll issue the request, and then we’ll issue the request with httr’s POST:

maker_url <- paste('https://maker.ifttt.com/trigger', my_event, 'with/key',
                   my_key, sep='/')
httr::POST(maker_url)
## Response [https://maker.ifttt.com/trigger/r_status/with/key/ci740p2XeuKq35nfHohG9Z]
##   Date: 2015-06-18 19:01
##   Status: 200
##   Content-Type: text/html; charset=utf-8
##   Size: 48 B

If all went well, your device should be beeping or buzzing. We see in the results that the status was 200, which indicates a success.
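
If you’d rather check the outcome programmatically than read the printed response, httr can do that too. This is just a sketch; res is simply a name I’ve chosen for the saved response object:

res <- httr::POST(maker_url)   # keep the response object around
httr::status_code(res)         # 200 means the request succeeded
httr::stop_for_status(res)     # or raise an R error if anything went wrong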

Our first notification

Sending Values

You may have noticed that your notification didn’t contain any data. We can add data by specifying values for value1, value2, and value3 (note the lowercase) in the body of our message.

httr::POST(maker_url, body=list(value1='hola', value2='mundo', value3=7))
## Response [https://maker.ifttt.com/trigger/r_status/with/key/ci740p2XeuKq35nfHohG9Z]
##   Date: 2015-06-18 19:01
##   Status: 200
##   Content-Type: text/html; charset=utf-8
##   Size: 48 B

Notification from our hola mundo example

Now that ¡Hola, mundo! is out of the way, we can send it some data. I mentioned fitting a linear model earlier, so let’s do that. First, we’ll borrow some example code from lm’s help page and fit a linear model:

# Create some data
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)

# Fit the model
lm.D9 <- lm(weight ~ group)

Now, we’ll send a notification containing the model’s slope and intercept:

httr::POST(maker_url, body=list(value1='lm complete',
                                value2=coefficients(lm.D9)[[2]],
                                value3=coefficients(lm.D9)[[1]]))
## Response [https://maker.ifttt.com/trigger/r_status/with/key/ci740p2XeuKq35nfHohG9Z]
##   Date: 2015-06-18 19:01
##   Status: 200
##   Content-Type: text/html; charset=utf-8
##   Size: 48 B

Here, we’ve sent a little message as value1, the slope as value2, and the intercept as value3.
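
If you find yourself sending these notifications often, it can be handy to wrap the request in a small helper. notify_ifttt() below is just a hypothetical name for such a wrapper around the same POST call shown above:

# hypothetical convenience wrapper around the Maker Channel request
notify_ifttt <- function(value1 = '', value2 = '', value3 = '',
                         event = my_event, key = my_key) {
  url <- paste('https://maker.ifttt.com/trigger', event, 'with/key', key, sep = '/')
  httr::POST(url, body = list(value1 = value1, value2 = value2, value3 = value3))
}

# the same lm notification as above, in a single call
notify_ifttt('lm complete', coefficients(lm.D9)[[2]], coefficients(lm.D9)[[1]])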

Notification from our lm example

Hopefully, this little example has demonstrated how IFTTT’s Maker Channel can be used to connect R to a whole lot of online services. Have at it!

To leave a comment for the author, please follow the link and comment on his blog: Brian Connelly » R.


Count data: To Log or Not To Log


(This article was first published on biologyforfun » R, and kindly contributed to R-bloggers)

Count data are widely collected in ecology, for example when one counts the number of birds or the number of flowers. These data naturally follow a Poisson or negative binomial distribution and are therefore sometimes tricky to fit with standard LMs. A traditional approach has been to log-transform such data and then fit LMs to the transformed values. Recently a paper advocated against the use of such transformations, since they lead to high bias in the estimated coefficients. More recently another paper argued that log-transforming count data and then fitting an LM leads to a lower type I error rate (i.e. saying that an effect is significant when it is not) than GLMs. What should we do then?
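
To make the two strategies concrete before diving into the simulations, here is a minimal sketch (with made-up data) of what “log-transform then LM” versus “fit a count GLM” looks like in code; the constant added before taking the log is exactly the kind of choice explored below:

library(MASS)                                        # for glm.nb
set.seed(1)
x <- runif(100)                                      # a made-up predictor
y <- rnbinom(100, mu = exp(0.5 + x), size = 1)       # made-up negative binomial counts

m_lm <- lm(log(y + 1) ~ x)                           # log-transform (here +1) then LM
m_nb <- glm.nb(y ~ x)                                # negative binomial GLM
coef(m_lm); coef(m_nb)                               # note the slopes live on different scales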

Using a slightly changed version of the code published in the Ives 2015 paper, let’s explore the impact of these different modelling strategies on the rejection of the null hypothesis and on the bias of the estimated coefficients.


#load the libraries
library(MASS)
library(lme4)
library(ggplot2)
library(RCurl)
library(plyr)

#download and load the functions that will be used
URL<-"https://raw.githubusercontent.com/Lionel68/Jena_Exp/master/stats/ToLog_function.R"
download.file(URL,destfile=paste0(getwd(),"/ToLog_function.R"),method="curl")
source("ToLog_function.R")

Following the Ives paper and the code therein, I simulated a predictor (x) and a response (y) that follows a negative binomial distribution and is linearly related to x.
In the first case I look at the impact of varying the sample size on the rejection of the null hypothesis and on the bias in the estimated coefficient between y and x.


######univariate NB case############
base_theme<-theme(title=element_text(size=20),text=element_text(size=18))
#range over n
output<-compute.stats(NRep=500,n.range = c(10,20,40,80,160,320,640,1280))
ggplot(output,aes(x=n,y=Reject,color=Model))+geom_path(size=2)+scale_x_log10()+labs(x="Sample Size (log-scale)",y="Proportion of Rejected H0")+base_theme
ggplot(output,aes(x=n,y=Bias,color=Model))+geom_path(size=2)+scale_x_log10()+labs(x="Sample Size (log-scale)",y="Bias in the slope coefficient")+base_theme

Figure 1

For this simulation round the coefficient of the slope (b1) was set to 0 (no effect of x on y), and only the sample size varied. The top panel shows the average proportion of times that the p-value of the slope coefficient was lower than 0.05 (H0: b1 = 0 rejected). We see that for low sample sizes (<40) the Negative Binomial model has a higher proportion of rejected H0 (type I error rate), but this difference between the models disappears once sample sizes exceed 100. The bottom panel shows the bias (estimated value – true value) in the estimated coefficient. For very low sample sizes (n = 10), Log001, Negative Binomial and Quasipoisson have higher bias than Log1 and LogHalf. For larger sample sizes the difference between the GLM team (NB and QuasiP) and the LM one (Log1 and LogHalf) gradually decreases, and both teams converge to a bias around 0. Only Log0001 behaves very badly. From what we see here, Log1 and LogHalf seem to be good choices for count data: they have low type I error and low bias along the whole sample size gradient.

The issue is that an effect of exactly 0 never exists in real life, where most effects are small (but non-zero), so the null hypothesis will never be true. Let’s look now at how the different models behave when we vary b1, first on its own and then crossed with variation in sample size.


#range over b1
outpu<-compute.stats(NRep=500,b1.range = seq(-2,2,length=17))
ggplot(outpu,aes(x=b1,y=Reject,color=Model))+geom_path(size=2)+base_theme+labs(y="Proportion of Rejected H0")
ggplot(outpu,aes(x=b1,y=Bias,color=Model))+geom_path(size=2)+base_theme+labs(y="Bias in the slope coefficient")

Figure 2

Here the sample size was set to 100. What we see in the top graph is that for a slope of exactly 0 all models have a similar average proportion of rejections of the null hypothesis. As b1 becomes smaller or bigger, the average proportion of rejections shows a very similar increase for all models except Log0001, which increases more slowly. These curves basically represent the power of the models to detect an effect and are very similar to Fig. 2 in the Ives 2015 paper. The bottom panel shows that all the LM models behave badly in terms of bias: they have small bias only for coefficients very close to 0, and as the coefficient gets bigger the absolute bias increases. This means that even if the LM models can detect an effect with similar power, the estimated coefficient is wrong. This is due to the value added to the untransformed count data in order to avoid -Inf for 0s. I have no idea how one could account for these added values arithmetically and then remove their effect …
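
To make that last point concrete, here is a tiny illustration of why a constant has to be added before log-transforming counts; the exact constants used by the Log1, LogHalf and Log0001 models are defined in the simulation code, so treat the values below as assumptions for illustration:

y <- c(0, 1, 3, 10)    # made-up counts, including a zero
log(y)                 # log(0) is -Inf, so lm() on log(y) would not work
log(y + 1)             # add 1 before logging (a "Log1"-style transformation)
log(y + 0.5)           # add 0.5 (a "LogHalf"-style transformation)
log(y + 0.0001)        # a tiny constant drags the zeros far down the scale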

Next let’s cross variation in the coefficient with sample size variation:

#range over n and b1
output3<-compute.stats(NRep=500,b1.range=seq(-1.5,1.5,length=9),n.range=c(10,20,40,80,160,320,640,1280))
ggplot(output3,aes(x=n,y=Reject,color=Model))+geom_path(size=2)+scale_x_log10()+facet_wrap(~b1)+base_theme+labs(x="Sample size (log-scale)",y="Proportion of rejected H0")
ggplot(output3,aes(x=n,y=Bias,color=Model))+geom_path(size=2)+scale_x_log10()+facet_wrap(~b1)+base_theme+labs(x="Sample size (log-scale)",y="Bias in the slope coefficient")

Figure 3

The top panel shows one big issue with focusing only on the significance level: rejection of H0 depends not only on the size of the effect but also on the sample size. For example, for b1 = 0.75 (a rather large value since we work on the exponential scale) fewer than 50% of all models rejected the null hypothesis at a sample size of 10. Of course, as the effect size gets larger the impact of the sample size on the rejection of the null hypothesis is reduced. However, most effects in the real world are small, so we need big sample sizes to be able to “detect” them using null hypothesis testing. The top graph also shows that NB is slightly better than the other models and that Log0001 again has the worst performance. The bottom graphs show something interesting: the bias is quasi-constant over the sample size gradient (maybe if we looked closer we would see some variation). Irrespective of how many data points you collect, the LMs will always have bigger bias than the GLMs (except for the artificial case of b1 = 0).

To finish, the big surprise in Ives 2015 was the explosion of type I error rates with increasing variation in individual-level random error (adding a random, normally distributed value to the linear predictor of each data point and varying the standard deviation of these random values), as can be seen in Fig. 3 of the paper.


#range over b1 and sd.eps
output4<-compute.statsGLMM(NRep=500,b1.range=seq(-1.5,1.5,length=9),sd.eps.range=seq(0.01,2,length=10))
ggplot(output4,aes(x=sd.eps,y=Reject,color=Model))+geom_path(size=2)+facet_wrap(~b1)+base_theme+labs(x="",y="Proportion of rejected H0")
ggplot(output4,aes(x=sd.eps,y=Bias,color=Model))+geom_path(size=2)+facet_wrap(~b1)+base_theme+labs(x="Standard Deviation of the random error",y="Bias of the slope coefficient")

Figure 4

Before looking at the figure in detail, please note that a standard deviation of 2 is very high in this context; remember that these values are added to the linear predictor, which is then exponentiated, so we end up with very large deviations. In the top panel there are two surprising results: the sign of the coefficient affects the pattern of null hypothesis rejection, and I do not see the explosion of rejection rates for NB or QuasiP that is presented in Ives 2015. In his paper Ives reported the LRT test for the NB models, whereas I am reporting the p-values from the model summary directly (Wald test). If some people around have computing power, it would be interesting to see whether changing the seed and/or increasing the number of replications leads to different patterns … The bottom panel shows again that the LM biases are big; the NB and QuasiP models show an increase in bias with the standard deviation, but only when the coefficient is negative (I suspect some issue with exponentiating large positive random errors). As expected, the GLMM performs best in this context.

Pre-conclusion: in real life, of course, we would rarely have a model with only one predictor; most of the time we build larger models with complex interaction structures between the predictors. This will of course affect both H0 rejection and bias, but that is material for a next post :)

Let’s wrap it up. We’ve seen that even if log-transformation plus LM seems to be a good choice for having a lower type I error rate than GLMs, this advantage will be rather minimal when using empirical data (no effect is exactly 0) and is potentially dangerous (large bias). Ecologists sometimes have the bad habit of turning their analysis into a star hunt (R’s standard model output gives stars to significant effects), and focusing only on the models that behave best with respect to significance (but have large bias) does not seem a good strategy to me. More and more people call for increasing the predictive power of ecological models; we therefore need modelling techniques that can estimate effects precisely (with low bias). In this context, transforming the data to make them somehow fit normal assumptions is sub-optimal; it is much more natural to think about what type of process generated the data (normal, Poisson, negative binomial, with or without hierarchical structure) and then model it accordingly. There is extensive discussion nowadays about the use and abuse of p-values in science, and I think that in our analyses we should slowly but surely shift our focus from “significance / p-values < 0.05 / star hunting” alone to a more balanced mix of effect sizes (or standardized slopes), p-values and R-squared.

Filed under: Biological Stuff, R and Stat Tagged: ecology, GLM, LM, R, Statistics

To leave a comment for the author, please follow the link and comment on his blog: biologyforfun » R.


Evaluating Logistic Regression Models


(This article was first published on Mathew Analytics » R, and kindly contributed to R-bloggers)

Logistic regression is a technique that is well suited for examining the relationship between a categorical response variable and one or more categorical or continuous predictor variables. The model is generally presented in the following format, where β refers to the parameters and x represents the independent variables.

log(odds) = β0 + β1x1 + ... + βnxn

The log(odds), or log-odds ratio, is defined as ln[p/(1 − p)] and expresses the natural logarithm of the ratio of the probability that an event will occur, p(Y=1), to the probability that it will not occur. We are usually concerned with the predicted probability of an event occurring, which is given by p = 1/(1 + exp(−z)), where z = β0 + β1x1 + ... + βnxn.
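
As a quick sanity check on that relationship, you can convert a linear predictor value into a probability directly in R; plogis() is the built-in logistic function, and the value of z below is made up purely for illustration:

z <- 0.85              # a made-up linear predictor value
1 / (1 + exp(-z))      # predicted probability from the formula above
plogis(z)              # the same result using R's built-in logistic function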

Logistic Regression Example

We will use the GermanCredit dataset in the caret package for this example. It contains 62 characteristics and 1,000 observations, with a target variable (Class) that is already defined. The response variable is coded 0 for a bad consumer and 1 for a good one. It’s always recommended to look at the coding of the response variable to ensure that it’s a factor variable coded accurately with a 0/1 scheme or two factor levels in the right order. The first step is to partition the data into training and testing sets.

library(caret)
data(GermanCredit)
Train <- createDataPartition(GermanCredit$Class, p=0.6, list=FALSE)
training <- GermanCredit[ Train, ]
testing <- GermanCredit[ -Train, ]

Using the training dataset, which contains 600 observations, we will use logistic regression to model Class as a function of five predictors.

mod_fit <- train(Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own + 
                   CreditHistory.Critical,  data=training, method="glm", family="binomial")

Bear in mind that the estimates from logistic regression characterize the relationship between the predictors and the response variable on a log-odds scale. For example, this model suggests that for every one-unit increase in Age, the log-odds of the consumer having good credit increases by 0.018. Because this isn’t of much practical value, we’ll usually want to use the exponential function to calculate the odds ratio for each predictor.

exp(coef(mod_fit$finalModel))
##            (Intercept)                    Age          ForeignWorker 
##              1.1606762              1.0140593              0.5714748 
##    Property.RealEstate            Housing.Own CreditHistory.Critical 
##              1.8214566              1.6586940              2.5943711

This informs us that for every one unit increase in Age, the odds of having good credit increases by a factor of 1.01. In many cases, we often want to use the model parameters to predict the value of the target variable in a completely new set of observations. That can be done with the predict function. Keep in mind that if the model was created using the glm function, you’ll need to add type="response" to the predict command.

predict(mod_fit, newdata=testing)
predict(mod_fit, newdata=testing, type="prob")

Model Evaluation and Diagnostics

A logistic regression model has been built and the coefficients have been examined. However, some critical questions remain. Is the model any good? How well does the model fit the data? Which predictors are most important? Are the predictions accurate? The rest of this document will cover techniques for answering these questions and provide R code to conduct that analysis.

For the following sections, we will primarily work with the logistic regression that I created with the glm() function. While I prefer utilizing the Caret package, many functions in R will work better with a glm object.

mod_fit_one <- glm(Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own + 
                     CreditHistory.Critical, data=training, family="binomial")

mod_fit_two <- glm(Class ~ Age + ForeignWorker, data=training, family="binomial")

Goodness of Fit

Likelihood Ratio Test

A logistic regression is said to provide a better fit to the data if it demonstrates an improvement over a model with fewer predictors. This is performed using the likelihood ratio test, which compares the likelihood of the data under the full model against the likelihood of the data under a model with fewer predictors. Removing predictor variables from a model will almost always make the model fit less well (i.e. a model will have a lower log likelihood), but it is necessary to test whether the observed difference in model fit is statistically significant. Given that H0 holds that the reduced model is true, a p-value for the overall model fit statistic that is less than 0.05 would compel us to reject the null hypothesis. It would provide evidence against the reduced model in favor of the current model. The likelihood ratio test can be performed in R using the lrtest() function from the lmtest package or using the anova() function in base.

anova(mod_fit_one, mod_fit_two, test ="Chisq")

library(lmtest)
lrtest(mod_fit_one, mod_fit_two)

Pseudo R^2

Unlike linear regression with ordinary least squares estimation, there is no R2 statistic which explains the proportion of variance in the dependent variable that is explained by the predictors. However, there are a number of pseudo R2 metrics that could be of value. Most notable is McFadden’s R2, which is defined as 1 − [ln(LM)/ln(L0)], where ln(LM) is the log likelihood value for the fitted model and ln(L0) is the log likelihood for the null model with only an intercept as a predictor. The measure ranges from 0 to just under 1, with values closer to zero indicating that the model has no predictive power.

library(pscl)
pR2(mod_fit_one)  # look for 'McFadden'
##           llh       llhNull            G2      McFadden          r2ML 
## -344.42107079 -366.51858123   44.19502089    0.06029029    0.07101099 
##          r2CU 
##    0.10068486
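
If you prefer not to pull in another package, the same quantity can be computed directly from the definition above. This is only a sketch: the null model is refit with just an intercept, and the result should agree with the McFadden value reported by pscl up to rounding:

null_mod <- glm(Class ~ 1, data = training, family = "binomial")      # intercept-only model
1 - as.numeric(logLik(mod_fit_one)) / as.numeric(logLik(null_mod))    # McFadden's pseudo R^2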

Hosmer-Lemeshow Test

Another approach to assessing goodness of fit is the Hosmer-Lemeshow test, which is computed after the observations have been segmented into groups based on having similar predicted probabilities. It examines whether the observed proportions of events are similar to the predicted probabilities of occurrence in subgroups of the data set, using a Pearson chi-square test. Small values with large p-values indicate a good fit to the data, while large values with p-values below 0.05 indicate a poor fit. The null hypothesis holds that the model fits the data; in the example below we would reject H0.

library(MKmisc)
HLgof.test(fit = fitted(mod_fit_one), obs = training$Class)
library(ResourceSelection)
hoslem.test(training$Class, fitted(mod_fit_one), g=10)

Statistical Tests for Individual Predictors

Wald Test

A Wald test is used to evaluate the statistical significance of each coefficient in the model and is calculated by taking the ratio of the square of the regression coefficient to the square of the standard error of the coefficient. The idea is to test the hypothesis that the coefficient of an independent variable in the model is significantly different from zero. If the test fails to reject the null hypothesis, this suggests that removing the variable from the model will not substantially harm the fit of that model.

library(survey)
regTermTest(mod_fit_one, "ForeignWorker")
## Wald test for ForeignWorker
##  in glm(formula = Class ~ Age + ForeignWorker + Property.RealEstate + 
##     Housing.Own + CreditHistory.Critical, family = "binomial", 
##     data = training)
## F =  0.949388  on  1  and  594  df: p= 0.33027
regTermTest(mod_fit_one, "CreditHistory.Critical")
## Wald test for CreditHistory.Critical
##  in glm(formula = Class ~ Age + ForeignWorker + Property.RealEstate + 
##     Housing.Own + CreditHistory.Critical, family = "binomial", 
##     data = training)
## F =  16.67828  on  1  and  594  df: p= 5.0357e-05
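
The same statistic can be computed by hand from the fitted model, since the Wald z value is just each coefficient divided by its standard error, and squaring it gives the chi-square form described above. This is only a sketch showing where the numbers come from:

coefs <- summary(mod_fit_one)$coefficients                  # Estimate, Std. Error, z value, Pr(>|z|)
wald_z <- coefs[, "Estimate"] / coefs[, "Std. Error"]       # Wald z statistic for each term
data.frame(z = wald_z,
           chisq = wald_z^2,                                # squared (chi-square) form
           p = 2 * pnorm(abs(wald_z), lower.tail = FALSE))  # two-sided p-values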

Variable Importance

To assess the relative importance of individual predictors in the model, we can also look at the absolute value of the t-statistic for each model parameter. This technique is utilized by the varImp function in the caret package for general and generalized linear models.

varImp(mod_fit)
## glm variable importance
## 
##                        Overall
## CreditHistory.Critical  100.00
## Property.RealEstate      57.53
## Housing.Own              50.73
## Age                      22.04
## ForeignWorker             0.00

Validation of Predicted Values

Classification Rate

When developing models for prediction, the most critical metric is how well the model predicts the target variable on out-of-sample observations. The process involves using the model estimates to predict values on the testing set and then comparing the predicted values of the target variable against the observed values for each observation. In the example below, you’ll notice that our model accurately predicted about 70% of the observations in the testing set.

pred = predict(mod_fit, newdata=testing)
accuracy <- table(pred, testing[,"Class"])
sum(diag(accuracy))/sum(accuracy)
## [1] 0.705
pred = predict(mod_fit, newdata=testing)
confusionMatrix(data=pred, testing$Class)

ROC Curve

The receiver operating characteristic (ROC) curve is a measure of classifier performance. Using the proportion of positive data points that are correctly classified as positive and the proportion of negative data points that are mistakenly classified as positive, we generate a graphic that shows the trade-off between the rate at which you can correctly predict something and the rate of incorrectly predicting something. Ultimately, we’re concerned with the area under the ROC curve, or AUROC. That metric ranges from 0.50 to 1.00, and values above 0.80 indicate that the model does a good job of discriminating between the two categories that comprise our target variable. Bear in mind that ROC curves can examine both target-versus-predictor pairings and target-versus-model performance. An example of each is presented below.

library(pROC)
# Compute AUC for predicting Class with the variable CreditHistory.Critical
f1 = roc(Class ~ CreditHistory.Critical, data=training) 
plot(f1, col="red")
## 
## Call:
## roc.formula(formula = Class ~ CreditHistory.Critical, data = training)
## 
## Data: CreditHistory.Critical in 180 controls (Class Bad) < 420 cases (Class Good).
## Area under the curve: 0.5944
library(ROCR)
# Compute AUC for predicting Class with the model
prob <- predict(mod_fit_one, newdata=testing, type="response")
pred <- prediction(prob, testing$Class)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.6540625

K-Fold Cross Validation

When evaluating models, we often want to assess how well they perform in predicting the target variable on different subsets of the data. One such technique is k-fold cross-validation, which partitions the data into k equally sized segments (called ‘folds’). One fold is held out for validation while the other k − 1 folds are used to train the model, which is then used to predict the target variable in the held-out fold. This process is repeated k times, with the performance of each model in predicting the hold-out set tracked using a performance metric such as accuracy. The most common variant is 10-fold cross-validation.

ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE)

mod_fit <- train(Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own + 
                   CreditHistory.Critical,  data=GermanCredit, method="glm", family="binomial",
                 trControl = ctrl, tuneLength = 5)

pred = predict(mod_fit, newdata=testing)
confusionMatrix(data=pred, testing$Class)

There you have it. A high level review of evaluating logistic regression models in R. If you have any feedback or suggestions, please comment in the section below.

 

To leave a comment for the author, please follow the link and comment on his blog: Mathew Analytics » R.


Two-Way ANOVA with Repeated Measures


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

NOTE: This post only contains information on repeated measures ANOVAs, and not how to conduct a comparable analysis using a linear mixed model. For that, be on the lookout for an upcoming post!

When I was studying psychology as an undergraduate, one of my biggest frustrations with R was the lack of quality support for repeated measures ANOVAs. They’re a pretty common thing to run into in much psychological research, and having to wade through incomplete and often contradictory advice for conducting them was (and still is) a pain, to put it mildly.

Thankfully, though, they’re not too tricky to set up once you figure out what you’re doing.

To get started, let’s construct a phony data set where we’re measuring participant stress on a 100-point scale. Higher numbers mean the participant is more stressed out. For our experimental manipulation, let’s say that participants are exposed to a series of images presented with various background music playing. The images can depict scenes that are happy or angry. The background music can be a Disney soundtrack or music from a horror movie. Each participant sees multiple images and listens to multiple music samples. (Your variables can have more than 2 levels, and you can include more than 2 IVs. We’re just keeping it simple for the purposes of explanation!)

First, here’s the code we’ll use to generate our phony data:

set.seed(5250)

myData <- data.frame(PID = rep(seq(from = 1,
                               to = 50, by = 1), 20),
                     stress = sample(x = 1:100,
                                     size = 1000,
                                     replace = TRUE),
                     image = sample(c("Happy", "Angry"),
                                    size = 1000,
                                    replace = TRUE),
                     music = sample(c("Disney", "Horror"),
                                    size = 1000,
                                    replace = TRUE)
)

myData <- within(myData, {
  PID   <- factor(PID)
  image <- factor(image)
  music <- factor(music)
})

myData <- myData[order(myData$PID), ]
head(myData)

PID stress image  music
  1     90 Happy Disney
  1     70 Angry Horror
  1     61 Angry Horror
  1     87 Happy Horror
  1     79 Happy Disney
  1     95 Happy Horror

So we see that we have one row per observation per participant. If your dataset is in wide form rather than long, I’d suggest checking out our article on converting between wide and long since everything from this point out assumes that your data look like what’s shown above!
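
If your data do happen to be in wide form (one column per condition), a minimal sketch of the conversion using reshape2 might look like the following; the wide data frame and its column names here are hypothetical, invented just to show the reshaping step:

library(reshape2)

# hypothetical wide data: one stress column per music x image condition
wide <- data.frame(PID = 1:3,
                   Disney_Happy = c(40, 55, 60),
                   Disney_Angry = c(39, 50, 65),
                   Horror_Happy = c(70, 60, 58),
                   Horror_Angry = c(66, 52, 71))

long <- melt(wide, id.vars = "PID",
             variable.name = "condition", value.name = "stress")

# split the combined condition label back into separate music and image factors
long$music <- factor(sapply(strsplit(as.character(long$condition), "_"), `[`, 1))
long$image <- factor(sapply(strsplit(as.character(long$condition), "_"), `[`, 2))
head(long)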

Extracting Condition Means

Before we can run our ANOVA, we need to find the mean stress value for each participant for each combination of conditions. We’ll do that with:

myData.mean <- aggregate(myData$stress,
                      by = list(myData$PID, myData$music,
                              myData$image),
                      FUN = 'mean')

colnames(myData.mean) <- c("PID","music","image","stress")

myData.mean <- myData.mean[order(myData.mean$PID), ]
head(myData.mean)

PID  music   image   stress
  1 Disney   Angry 39.33333
  1 Horror   Angry 65.50000
  1 Disney   Happy 68.00000
  1 Horror   Happy 69.57143
  1 Disney Neutral 40.00000
  1 Horror Neutral 52.66667

So now we’ve gone from one row per participant per observation to one row per participant per condition. At this point we’re ready to actually construct our ANOVA!

Building the ANOVA

Now, our actual ANOVA is going to look something like this:

stress.aov <- with(myData.mean,
                   aov(stress ~ music * image +
                       Error(PID / (music * image)))
)

But what’s all that mean? What’s with that funky Error() term we threw in there? Pretty simple: what we’re saying is that we want to look at how stress changes as a function of the music and image that participants were shown. (Thus the stress ~ music * image) The asterisk specifies that we want to look at the interaction between the two IVs as well. But since this was a repeated measures design, we need to specify an error term that accounts for natural variation from participant to participant. (E.g., I might react a little differently to scary music than you do because I love zombie movies and you hate them!) We do this with the Error() function: specifically, we are saying that we want to control for that between-participant variation over all of our within-subjects variables.

Now that we’ve specified our model, we can go ahead and look at the results:

summary(stress.aov)

Error: PID
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 49   8344   170.3               

Error: PID:music
          Df Sum Sq Mean Sq F value Pr(>F)
music      1      1    0.78   0.003  0.954
Residuals 49  11524  235.19               

Error: PID:image
          Df Sum Sq Mean Sq F value Pr(>F)
image      1     61   61.11   0.296  0.589
Residuals 49  10127  206.66               

Error: PID:music:image
            Df Sum Sq Mean Sq F value Pr(>F)
music:image  1    564   563.8   2.626  0.112
Residuals   49  10520   214.7  

We see that there is no main effect of either music:

F(1, 49) = 0.003; p-value = 0.954

or image:

F(1, 49) = 0.296; p-value = 0.589

on participant stress. Likewise, we see that there is not a significant interaction effect between the two independent variables:

F(1, 49) = 2.626; p-value = 0.112

What do I do with my Between-Subjects Effects?

This has all been fine and good, but what if you have an independent variable that’s between-subjects? To continue our previous example, let’s say that some participants could only come in during the day and some could only come in at night. Our data might instead look like this:

set.seed(5250)

myData <- data.frame(PID = rep(seq(from = 1,
                               to = 50, by = 1), 20),
                     stress = sample(x = 1:100,
                                     size = 1000,
                                     replace = TRUE),
                     image = sample(c("Happy", "Angry"),
                                    size = 1000,
                                    replace = TRUE),
                     music = sample(c("Disney", "Horror"),
                                    size = 1000,
                                    replace = TRUE),
                     time = rep(sample(c("Day", "Night"),
                                       size = 50,
                                       replace = TRUE), 2))

head(myData)


PID stress image  music  time
  1     66 Happy Disney   Day
  2     21 Happy Disney Night
  3     25 Angry Horror   Day
  4     61 Happy Disney   Day
  5     11 Angry Disney Night
  6     85 Angry Horror   Day

From there, the steps we take look pretty similar to before:

myData <- within(myData, {
  PID   <- factor(PID)
  image <- factor(image)
  music <- factor(music)
  time  <- factor(time)
})

myData <- myData[order(myData$PID), ]
head(myData)

myData.mean <- aggregate(myData$stress,
                         by = list(myData$PID, myData$music,
                                 myData$image, myData$time),
                         FUN = 'mean')

colnames(myData.mean) <- c("PID", "music", "image",
                           "time", "stress")
myData.mean <- myData.mean[order(myData.mean$PID), ]

stress.aov <- with(myData.mean, aov(stress ~ time * music *
                                    image + Error(PID /
                                    (music * image))))
summary(stress.aov)

Error: PID
          Df Sum Sq Mean Sq F value Pr(>F)
time       1      6     5.7   0.033  0.857
Residuals 48   8338   173.7               

Error: PID:music
           Df Sum Sq Mean Sq F value Pr(>F)
music       1      1    0.78   0.003  0.955
time:music  1     22   21.96   0.092  0.763
Residuals  48  11502  239.63               

Error: PID:image
           Df Sum Sq Mean Sq F value Pr(>F)
image       1     61   61.11   0.292  0.591
time:image  1     81   80.77   0.386  0.537
Residuals  48  10046  209.29               

Error: PID:music:image
                 Df Sum Sq Mean Sq F value Pr(>F)
music:image       1    564   563.8   2.578  0.115
time:music:image  1     24    23.7   0.109  0.743
Residuals        48  10496   218.7

The only big difference is that we don’t include our between-subjects factor (time) in the Error() term. In any case, we see that there are no significant main effects (of time, music, or image) nor any significant interactions (between time and music, time and image, music and image, or time, music, and image).

Dealing with “Error() model is singular”

Sometimes you might be unlucky enough to get this error when you try to specify your aov() object. It’s not the end of the world, it just means that you don’t have an observation for every between-subjects condition for every participant. This can happen due to a bug in your programming, a participant being noncompliant, data trimming after the fact, or a whole host of other reasons. The moral of the story, though, is that you need to find the participant that is missing data and drop him or her from this analysis for the error to go away. Or, if the idea of dropping a participant entirely rubs you the wrong way, you could look into conducting the analysis as a linear mixed model. We don’t have a tutorial for that (yet!), but keep your eyes peeled: as soon as it’s written, we’ll update this post and link you to it!
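
One way to track down the offending participant is to count how many condition means each participant contributes and look for empty cells. This is just a sketch using the aggregated data from above:

# counts of condition means per participant (a three-dimensional table)
cell_counts <- table(myData.mean$PID, myData.mean$music, myData.mean$image)

# participants with any empty cell are the ones causing the singular Error() model
bad_pids <- dimnames(cell_counts)[[1]][apply(cell_counts == 0, 1, any)]
bad_pids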

To leave a comment for the author, please follow the link and comment on his blog: DataScience+.


Ensuring R Generates the Same ANOVA F-values as SPSS


(This article was first published on Stats Can Be Fun, and kindly contributed to R-bloggers)

When switching to R from SPSS, a common concern among psychology researchers is whether R gives the “correct” ANOVA F-values. By “correct” they simply mean F-values that match those generated by SPSS. Because ANOVA F-values in R do not match those in SPSS by default, it often appears that R is “doing something wrong”. This is not the case. R simply has a different default configuration than SPSS.

The difference between SPSS and R becomes evident when there is an unequal number of participants across the factorial ANOVA cells. There are a few simple steps that can be followed to ensure that R ANOVA values do indeed match those generated by SPSS. These steps involve using Type III sums of squares for the ANOVA, but there is more to it than that. I will detail the complete process in R here, but a deeper discussion of the related statistical issues is provided in the excellent free e-book Learning Statistics Using R by Dan Navarro.
 

Initial R Data

> my.data <- read.csv("goggles.csv")
> my.data
   gender alcohol attractiveness
1       1       1             65
2       1       1             70
3       1       1             60
4       1       1             60
5       1       1             60
6       1       1             55
7       1       1             60
8       1       1             55
9       1       2             70
10      1       2             65
11      1       2             60
12      1       2             70
13      1       2             65
14      1       2             60
15      1       2             60
16      1       2             50
17      1       3             55
18      1       3             65
19      1       3             70
20      1       3             55
21      1       3             55
22      1       3             60
23      1       3             50
24      1       3             50
25      2       1             50
26      2       1             55
27      2       1             80
28      2       1             65
29      2       1             70
30      2       1             75
31      2       1             75
32      2       1             65
33      2       2             45
34      2       2             60
35      2       2             85
36      2       2             65
37      2       2             70
38      2       2             70
39      2       2             80
40      2       2             60
41      2       3             30
42      2       3             30
43      2       3             30
44      2       3             55
45      2       3             35
46      2       3             20
47      2       3             45
48      2       3             40

SPSS Analysis: The numbers below are the ones we desire:

You can see the F-values for gender, alcohol, and the interaction are 2.0323, 20.065, and 11.911, respectively.

Outline of R Steps

There are three things you need to do to ensure ANOVA F-values in R match those in SPSS. I will briefly list these three steps and then provide a more detailed description of each.

1. Set each independent variable as a factor
2. Set the default contrast to helmert
3. Conduct analysis using Type III Sums of Squares

Step 1. Set each independent variable as a factor

By default R assumes variables are not categorical. If you have a categorical variable (as you do with ANOVA independent variables) you need to indicate the nature of the variable to R; you do this with the as.factor function. In the example below I work with a goggles data set (from Discovering Statistics Using SPSS) that investigates the effect of alcohol consumption (None, 2-pints, 4-pints) and gender (male/female) on attractiveness ratings. The categorical variables have been entered into the data file numerically such that for gender 1 is Female and 2 is Male. Likewise, for alcohol 1 is None, 2 is two pints, and 3 is four pints. Before running the ANOVA I need to let R know that gender and alcohol are factors and what the levels of those factors are labeled.

# Set the variables to factors
> my.data$gender <- as.factor(my.data$gender)
> my.data$alcohol <- as.factor(my.data$alcohol)

# Label the levels of each factor
> levels(my.data$gender) <- list("Female"=1,"Male"=2)
> levels(my.data$alcohol) <- list("None"=1,"2-pints"=2,"4-pints"=3)

Step 2. Set the default contrast to helmert

When an ANOVA is conducted in R it’s done using the general linear model. Consequently, the contrasts need to be specified in the same way as SPSS if the values are to match.

You can see the default contrasts in R with the command below:

> options("contrasts")
$contrasts
        unordered           ordered 
"contr.treatment"      "contr.poly" 

We need to change the default contrast for unordered factors from “contr.treatment” to “contr.helmert”. We do this with the command below:

> options(contrasts = c("contr.helmert", "contr.poly"))

You can verify that the contrast has changed by using the options command again:

> options("contrasts")
$contrasts
[1] "contr.helmert" "contr.poly"   

Step 3. Conduct Analysis Using Type III Sums of Squares

Conduct your analysis:

> crf.lm <- lm(attractiveness~gender*alcohol,data=my.data)

Now you want traditional ANOVA statistics using Type III Sums of Squares. These can be provided by the car package (car: Companion to Applied Regression). The first time (and only the first time) you use the car package you need to install it. The package gives you the “Anova” function; note that the capitalization in this function name is critical.

> install.packages("car",dependencies = TRUE)

Once the package is installed you only need the code below:

> crf.lm <- lm(attractiveness~gender*alcohol,data=my.data)
> library(car)
> Anova(crf.lm,type=3)
Anova Table (Type III tests)

Response: attractiveness
               Sum Sq Df   F value    Pr(>F)    
(Intercept)    163333  1 1967.0251 < 2.2e-16 ***
gender            169  1    2.0323    0.1614    
alcohol          3332  2   20.0654 7.649e-07 ***
gender:alcohol   1978  2   11.9113 7.987e-05 ***
Residuals        3488 42                        
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

You can see the F-values for gender, alcohol, and the interaction are 2.0323, 20.065, and 11.911, respectively. These match the SPSS values presented above.
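
One small housekeeping note: setting options(contrasts = ...) changes the contrasts for the rest of your R session, so if you go on to fit other models you may want to restore R’s defaults afterwards. A minimal sketch:

# Restore R's default contrasts once the SPSS-matching analysis is done
> options(contrasts = c("contr.treatment", "contr.poly"))
> options("contrasts")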

Quick Summary

> my.data <- read.csv("goggles.csv")

> my.data$gender <- as.factor(my.data$gender)
> my.data$alcohol <- as.factor(my.data$alcohol)
> levels(my.data$gender) <- list("Female"=1,"Male"=2)
> levels(my.data$alcohol) <- list("None"=1,"2-pints"=2,"4-pints"=3)

> options(contrasts = c("contr.helmert", "contr.poly"))

> crf.lm <- lm(attractiveness~gender*alcohol,data=my.data)
> library(car)
> Anova(crf.lm,type=3)

Anova Table (Type III tests)

Response: attractiveness
               Sum Sq Df   F value    Pr(>F)    
(Intercept)    163333  1 1967.0251 < 2.2e-16 ***
gender            169  1    2.0323    0.1614    
alcohol          3332  2   20.0654 7.649e-07 ***
gender:alcohol   1978  2   11.9113 7.987e-05 ***
Residuals        3488 42                        
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


To leave a comment for the author, please follow the link and comment on his blog: Stats Can Be Fun.


Getting started in applied statistics / datascience


(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Where to start to start?

I was recently asked by a manager colleague from another organisation what direction they could give to a staff member interested in building skills in the whole “big data” thing. A search of the web shows hundreds if not thousands of sites and blog posts aimed at budding data scientists, but most of them seem (to my admittedly very non-rigorous glance) to be collections of resources and techniques; too detailed and specific for my purposes, and aimed at people already a bit into the journey. So here’s something oriented a bit more to someone who’s still wondering what this thing is they might be getting into.

Why is data suddenly so sexy?

First, while a lot of the publicity you hear is about “big data”, the real revolution in recent decades is bigger than big data. It’s about data creation, storage, access, analytical techniques and tools:

  1. In the last third of the twentieth century there were big advances in applied statistics, with new methods like robust statistics, bootstrapping, a bigger range of graphics, and mixed effects and additive models to deal with a bunch of situations beyond the crude assumptions needed for the previous generations of techniques (the world of ANOVA, linear regression and t statistics which unfortunately is still the impression many people have from those one or two mandatory stats papers);
  2. Roughly overlapping with that and still ongoing, the rise in computing power has made those methods practicable and cheap;
  3. Then, the last 10 years has seen an explosion of data capture and storage, as our digital traces are increasingly being logged somewhere and more and more of our lives create those traces.

A lot of the new data is web-related, but the overall cumulative impact of those three things above is not necessarily just enormous piles of Twitter and Facebook data. So I’d be careful not to focus on “big data” from the start but instead build a core of statistics and computing skills which can then be applied to larger data. The techniques actually specific to big data are relatively small compared to the core skillset, and should be fairly easy to learn if they get a solid grounding.

What’s it take to learn?

Second, I would emphasise that this stuff is hard and there’s lots of it to learn. It can’t be learnt by a few two day courses and a brief apprenticeship, although both those things can help. To be successful you need specialist tertiary education or its equivalent, plus a commitment to continuous creative destruction of your knowledge and skills and to life long learning. My team for example comprises mostly people with quantitative PhDs or Masters degrees, and we have two training sessions per week at which everyone is continually learning new stuff (and teaching it to the others). We start each weekly team meeting reporting back one thing each of us learnt in the last week; often it’s some tool or technique that didn’t even exist six months ago.

My thinking is heavily influenced by Drew Conway’s data science Venn diagram, a modified version of which is below (at the bottom of this post is the R code that drew this):
Datascience Venn diagram

Basically, a good applied statistician or datascientist (I’m not going to argue about language here) needs to combine computing, statistical and content knowledge skills. It’s the growth in computing power that’s changing capabilities in the field, but knowing stuff and techniques is important too.

However, as we’ve only got one lifetime each, developing specialist knowledge in particular domain areas is expensive. My advice on the content knowledge circle of the Venn diagram is to get good at quickly understanding issues and questions that others can bring to you, rather than try to be a domain specialist. This could be controversial; for example, I was once criticised by statisticians and others for recruiting team members on the basis of statistical and data management skills rather than domain knowledge in XXX. The reality is, we work with others who are the specialists in XXX and its policy problems, but need help in the data area. I look for people with data skills (or potential skills) who can quickly build up familiarity with the domain, rather than limit the range of an already difficult job search.

Getting started on statistical computing

That leaves hacking and statistics. In my thinking I break this down into four pragmatic areas where skills need to be developed. I say pragmatic because I don’t have some theory dictating why these four areas, it’s more that when we’re planning training or other skills development, it seems to fall into these categories:

  • statistics
  • computer languages and generally getting the computer to do new stuff
  • reproducible research
  • databases and data management

This series of Johns Hopkins Coursera online courses has had good recommendations and covers the full range of things, using up-to-date tools. It’s a commitment, but the fact is there’s a lot to learn. I’d suggest at some point early in the journey getting enrolled in that or a similar course to see if you’ve got the stomach for it. If you haven’t written computer code before, for example, there’s probably a particular psychological hurdle to overcome before you decide this is for you (and you can’t handle data properly without doing it in code).

The “range of things” as I would see them (which is pretty much similar to the curriculum of that course linked above) would be:

Statistics

Learning statistics properly takes effort, and mathematics, and lots of time in front of a computer practicing. One problem is that a lot of the statistics taught at university in non-statistics degrees covers techniques rather than principles, and often dated ones at that. To get an idea of what you’re getting into:

One day I’ll do a more extended post on other books-I-love dealing with topics like modelling strategies, time series, surveys, etc.

A computing tool for statistics

A choice needs to be made for a computer language in which to start learning. If you spend more than an hour thinking about R v SAS, R v Python, or R v Julia you’re wasting your time because the reality is if you’re going to get any good at this, you need to be multilingual. However, you have to start somewhere and my recommendation is for R. It’s free, easy to get, forms the lingua franca in the academy, and its open source approach means new techniques get operationalised in it quicker than in SAS, as do bindings to other languages like JavaScript (pretty much essential for fancy modern data presentation). Ideally you learn R and statistics together – R is a computer language written by statisticians for statisticians, so it helps you fall into a statistical way of thinking.

Down the track you need to get familiar with other more general languages – like HTML and JavaScript for web dissemination, LaTeX for static reports and presentations, and probably a general purpose language like Python for generally doing Stuff to data. You also need to get familiar with the basics of the computer’s operating system and using a shell session to get it to do stuff. But other than the minimum that can wait until you’ve broken the ice (I’m assuming people are starting from non-familiarity with coding) with a statistically-oriented language.

Reproducible research (e.g. version control, making things reproducible end to end, etc.)

  • Download Git and once you’ve started to get comfortable with writing code, learn to use Git to do version control. It integrates well with RStudio.
  • Read about reproducible research and find ways to set up your code so others can repeat what you’ve done – for peer review, quality control, scalability, and updating stuff.
  • Install LaTeX and learn how to use it.
  • As you get further into larger projects you need to start borrowing techniques from software developers, and read up on software development methods like Extreme Programming. When analysts realise they are writing computer programs, not just interacting with a statistical tool, they move up a level in the power of what they can do.

Databases, data management, SQL, tidying and cleaning data

  • Down the track, you need database skills. This can wait until you’re familiar with stats and R, but at some point you need to be able to set up a database. MySQL is the easiest one to start with on a home system.
  • Within R, {dplyr} and {tidyr} are relatively new but are king for data tidying, reshaping and accessing. Most recent online courses and blogs will use these (a minimal example follows this list).
  • Only after everything else is sorted and you are familiar with medium-sized data (e.g. traditional relational databases like MySQL) should you think about big-data-specific things like Hadoop clusters.
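
As a flavour of what those two packages look like in practice, here is a minimal sketch of mine on the built-in mtcars data (the variable choices are illustrative, not part of the original post):

library(dplyr)
library(tidyr)

# summarise fuel efficiency by number of cylinders with dplyr
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n())

# reshape a few columns from wide to long with tidyr
mtcars %>%
  select(mpg, disp, hp) %>%
  gather(key = "variable", value = "value")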

More

So that’s only a beginning. It’s an exciting area. Hopefully I’ve given some indicators that might be useful for someone out there, wondering if they (or someone else) should get into this stuff, and what it will take.

Drawing that diagram

Finally, here’s the code that drew my own version of the Drew Conway data science diagram. I wanted to tweak his original

  • to avoid the argument “that’s just what applied statisticians do”.
  • to make clearer than Conway’s original that computer skills in combination with statistics, but without content knowledge, are still a danger zone
library(showtext)
library(grid)
library(RColorBrewer)

font.add.google("Poppins", "myfont")
showtext.auto()
palette <- brewer.pal(3, "Set1")

radius <- 0.3
strokecol <- "grey50"
linewidth <- 4
fs <- 11

draw_diagram <- function(){
grid.newpage()
grid.circle(0.33, 0.67, radius, gp = 
               gpar(col = strokecol,
                    fill = palette[1],
                    alpha = 0.2,
                    lwd = linewidth))

grid.circle(0.67, 0.67, radius, gp =
            gpar(col = strokecol,
                 fill = palette[2],
                 alpha = 0.2,
                 lwd = linewidth))

grid.circle(0.5, 0.33, radius, gp =
               gpar(col = strokecol,
                    fill = palette[3],
                    alpha = 0.2,
                    lwd = linewidth))

grid.text("Hacking", 0.25, 0.75, rot = 45, gp =
             gpar(fontfamily = "myfont",
                  fontsize = fs * 2.3,
                  col = palette[1],
                  fontface = "bold"))

grid.text("Statistics", 0.75, 0.75, rot = -45, gp =
             gpar(fontfamily = "myfont",
                  fontsize = fs * 2.3,
                  col = palette[2],
                  fontface = "bold"))

grid.text("Contentnknowledge", 0.5, 0.25, rot = 0, gp =
             gpar(fontfamily = "myfont",
                  fontsize = fs * 2.3,
                  col = palette[3],
                  fontface = "bold"))

grid.text("Danger:nno context", 0.5, 0.75,
          gp = gpar(fontfamily = "myfont",
                    fontsize = fs))

grid.text("Danger: nonunderstandingnof probability", 0.32, 0.48, rot = 45,
          gp = gpar(fontfamily = "myfont",
                    fontsize = fs))

grid.text("Traditionalnresearch", 0.66, 0.46, rot = -45,
          gp = gpar(fontfamily = "myfont",
                    fontsize = fs))


grid.text("Data science /nappliednstatistics", 0.5, 0.55, 
          gp = gpar(fontfamily = "myfont",
                    fontsize = fs * 1.2,
                    fontface = "bold"))
}

draw_diagram()

To leave a comment for the author, please follow the link and comment on his blog: Peter's stats stuff - R.


Analysing longitudinal data: Multilevel growth models (II)


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

This is the third post in the longitudinal data series. Previously, we introduced what longitudinal data is, how we can convert between long and wide format data-sets, and a basic multilevel model for analysis. Apparently, the basic multilevel model is not quite enough to analyse our imaginary randomised controlled trial (RCT) data-set. This post is going to continue our analysis and introduce a proper way to handle treatment effects in multilevel models.

Generate a longitudinal dataset and convert it into long format

As usual, we start by generating our longitudinal data-set and convert it into long format.

library(MASS)

dat.tx.a <- mvrnorm(n=250, mu=c(30, 20, 28), 
                    Sigma=matrix(c(25.0, 17.5, 12.3, 
                                   17.5, 25.0, 17.5, 
                                   12.3, 17.5, 25.0), nrow=3, byrow=TRUE))

dat.tx.b <- mvrnorm(n=250, mu=c(30, 20, 22), 
                    Sigma=matrix(c(25.0, 17.5, 12.3, 
                                   17.5, 25.0, 17.5, 
                                   12.3, 17.5, 25.0), nrow=3, byrow=TRUE))

dat <- data.frame(rbind(dat.tx.a, dat.tx.b))
names(dat) <- c('measure.1', 'measure.2', 'measure.3')

dat <- data.frame(subject.id = factor(1:500), tx = rep(c('A', 'B'), each = 250), dat)

rm(dat.tx.a, dat.tx.b)

dat <- reshape(dat, varying = c('measure.1', 'measure.2', 'measure.3'), 
               idvar = 'subject.id', direction = 'long')

A multilevel growth model considering treatment effect

This is an RCT data-set, implying that there should be some potential differences between the two treatment groups. Last time we ignored this heterogeneity and specified only a common time effect across the two groups. Intuitively, we could add the treatment variable as a fixed effect in the model to capture the between-group difference.

library(lmerTest)
m1 <- lmer(measure ~ time + tx + (1 | subject.id), data=dat)

We still use the lmerTest package because it allows testing of the fixed effects using approximate degrees of freedom. The model formula is the same as the one we used last time, except we added the treatment variable tx as an independent variable.

summary(m1)

Linear mixed model fit by REML t-tests use Satterthwaite approximations to degrees of
  freedom [lmerMod]
Formula: measure ~ time + tx + (1 | subject.id)
   Data: dat

REML criterion at convergence: 9788.5

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.68035 -0.66131  0.07007  0.66151  2.87378 

Random effects:
 Groups     Name        Variance Std.Dev.
 subject.id (Intercept)  9.918   3.149   
 Residual               32.114   5.667   
Number of obs: 1500, groups:  subject.id, 500

Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)   30.5932     0.4593 1474.4000  66.610  < 2e-16 ***
time          -2.2201     0.1792  999.0000 -12.389  < 2e-16 ***
txB           -2.3377     0.4062  498.0000  -5.755 1.51e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
     (Intr) time  
time -0.780       
txB  -0.442  0.000

It is clear from the model summary that both time and tx effects are statistically significant and participants in treatment B had a depression score 2.34 points lower than those in treatment A. Should we conclude that treatment B is more effective? Definitely no. The model m1 is not a proper way to compare group differences in the context of longitudinal data. The ‘treatment effect’ in m1 is in fact the average treatment effect along time. In other words, the coefficient -2.34 is the difference in depression score between treatments A and B averaged over times 1, 2, and 3. Since this is an RCT, we expect to see no (or very little) difference at time 1 (pre-intervention) between the participants in the two treatments. We are actually not interested in knowing the average difference between the two treatments but how the trajectories differ at times 2 and 3.

The concept may be a bit tricky for first-timers but the execution is fairly simple: we just add the interaction term between time and tx.

m2 <- lmer(measure ~ time * tx + (1 | subject.id), data=dat)

In R, a pure interaction term is indicated by the : operator, so we could specify the model as time + tx + time:tx. But there is a shorthand: time * tx. The * operator asks R to include both the main effects and the interaction, so we can use it instead of spelling out every effect separately. Please note that the * operator also works for higher-dimensional interactions. For instance, A * B * C asks R to include the three main effects, the three two-way interaction effects, and the one three-way interaction effect.
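
To see the equivalence concretely, the spelled-out formula gives the same fit as m2 above (a small check of my own; the object name m2.explicit is not from the original post):

# explicit main effects plus interaction: identical to measure ~ time * tx
m2.explicit <- lmer(measure ~ time + tx + time:tx + (1 | subject.id), data = dat)
all.equal(fixef(m2.explicit), fixef(m2))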

summary(m2)

Linear mixed model fit by REML t-tests use Satterthwaite approximations to degrees of
  freedom [lmerMod]
Formula: measure ~ time * tx + (1 | subject.id)
   Data: dat

REML criterion at convergence: 9721.9

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.71431 -0.65906  0.08873  0.65358  2.63778 

Random effects:
 Groups     Name        Variance Std.Dev.
 subject.id (Intercept) 10.60    3.256   
 Residual               30.06    5.483   
Number of obs: 1500, groups:  subject.id, 500

Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)   27.7106     0.5683 1456.6000  48.757  < 2e-16 ***
time          -0.7788     0.2452  998.0000  -3.176  0.00154 ** 
txB            3.4275     0.8037 1456.6000   4.264 2.13e-05 ***
time:txB      -2.8826     0.3468  998.0000  -8.312 4.44e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
         (Intr) time   txB   
time     -0.863              
txB      -0.707  0.610       
time:txB  0.610 -0.707 -0.863

So we now have three fixed effects in the model m2. As mentioned, the time effect is averaged across groups and the txB effect is averaged along time, which provide little information about the trajectory difference between the two treatments. The effect of interest is the time:txB interaction. This term is a little bit tricky to interpret: the coefficient value indicates the difference between the two treatments per unit time increment, conditional on the average time and treatment effects. In words that humans can understand, this indicates how the two treatment groups diverge with the progression of time. For each unit increase in time (e.g. from time 1 to time 2), participants in treatment B could expect a 2.88-point lower depression score than those in treatment A after accounting for the average treatment effect difference. So at time 2, participants in treatment B could expect a -2.88 x 2 + 3.43 = -2.33 score difference, and at time 3, a -2.88 x 3 + 3.43 = -5.21 difference. We should always consider both the treatment main effect and the treatment-time interaction effect when comparing treatment differences at a given time. But why? Try to see what the treatment difference at time 1 would be if you omit the treatment main effect.
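
One way to check this arithmetic against the fitted model is to combine the estimated fixed effects directly (a small sketch reusing the m2 object above; the name diff.by.time is mine):

fe <- fixef(m2)
# treatment difference at times 1, 2 and 3: tx main effect plus interaction times time
diff.by.time <- fe["txB"] + fe["time:txB"] * 1:3
round(diff.by.time, 2)

At time 1 the difference is close to zero, as we would hope for an RCT at baseline.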

To choose the best model

I can tell you model m2 is better than m1 and m0 (the model in the last post) because the data-set is generated by me. But in real life, we could have a hard time deciding which model is better. One way to choose is to see the statistical significance of the additional variables using the summary() function as we did above. Other common approaches are Analysis of Variance (ANOVA) table (only for nested models) and model fit indices.

m0 <- lmer(measure ~ time + (1 | subject.id), data=dat)
anova(m0, m1, m2)

Data: dat
Models:
object: measure ~ time + (1 | subject.id)
..1: measure ~ time + tx + (1 | subject.id)
..2: measure ~ time * tx + (1 | subject.id)
       Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
object  4 9825.8 9847.0 -4908.9   9817.8                             
..1     5 9795.6 9822.2 -4892.8   9785.6 32.197      1  1.393e-08 ***
..2     6 9730.6 9762.5 -4859.3   9718.6 66.944      1  2.794e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Here we include m0, m1, and m2 in the anova() function because m0 is nested within m1, which is in turn nested within m2. A model is nested within another if its terms are a subset of the other’s. In other words, a model with independent variables A and B is nested within another model with independent variables A, B, and C.

This ANOVA table should be quite familiar to you if you have experience with ANOVA analysis. The Df column indicates the degrees of freedom associated with the model, which simply means the number of parameters estimated in this case. For example, the 4 degrees of freedom in m0 indicates there were 4 parameters estimated in the model (1. Coefficient of the fixed intercept effect, 2. Coefficient of time effect, 3. Variance of the random intercept, 4. Variance of the residual).

The AIC and BIC are two commonly used fit indices. Both consider two factors: how well the model fits the data and how simple the model is. The difference between AIC and BIC is just the penalisation on model complexity (BIC penalises complex models more heavily). I personally find AIC easier to interpret because it asymptotically (as n approaches infinity) converges to the leave-one-out cross-validation (LOOCV) prediction performance. We will cover the concepts of cross-validation and prediction performance later, so just leave it if you do not yet understand what they are.
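
If you just want the fit indices, R can pull them out directly (a quick check of my own; note that m0, m1, and m2 were fitted with REML while anova() refits them with maximum likelihood, so these numbers will not match the table above exactly):

AIC(m0, m1, m2)
BIC(m0, m1, m2)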

The logLik column is the log-likelihood of each estimated model; basically it is an indicator of how well the model fits the data. Deviance is simply -2 times the log-likelihood. Why do we need that? Because under the null hypothesis (no difference in fit between two nested models) the difference in deviance between two nested models follows a Chi-squared distribution, which we can use as a test of the difference in model fit. The difference in deviance is listed under the Chisq column. The associated degrees of freedom of the Chi-squared statistic are listed under Chi Df (which is simply the number of additional parameters). The last column should be quite clear now: it gives the p-value for testing the difference in model fit between two nested models. Nonetheless, please be cautious about the ANOVA test results because the type I error rate inflates quickly if there are many candidate models.
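
As a quick sanity check of my own, the p-values can be reproduced from the Chisq and Chi Df columns; for the m1 vs m2 comparison above:

# deviance difference of 66.944 on 1 degree of freedom
pchisq(66.944, df = 1, lower.tail = FALSE)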

As m2 is strictly better than the other two candidate models in AIC, BIC, and the ANOVA tests, we could now conclude it to be our best model so far. Of course there are still some ways to improve further. We will cover the possible improvements and the plotting of the predicted outcomes later. Stay tuned.

To leave a comment for the author, please follow the link and comment on his blog: DataScience+.


How to perform a Logistic Regression in R


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Logistic regression is a method for fitting a regression curve, y = f(x), when y is a categorical variable. The typical use of this model is predicting y given a set of predictors x. The predictors can be continuous, categorical or a mix of both.

The categorical variable y can, in general, assume different values. In the simplest case y is binary, meaning that it can assume either the value 1 or 0. A classical example used in machine learning is email classification: given a set of attributes for each email, such as the number of words, links and pictures, the algorithm should decide whether the email is spam (1) or not (0). In this post we call the model “binomial logistic regression”, since the variable to predict is binary; however, logistic regression can also be used to predict a dependent variable which can assume more than 2 values. In this second case we call the model “multinomial logistic regression”. A typical example would be classifying films as “entertaining”, “borderline” or “boring”.

Logistic regression implementation in R

R makes it very easy to fit a logistic regression model. The function to be called is glm() and the fitting process is not so different from the one used in linear regression. In this post I am going to fit a binary logistic regression model and explain each step.

The dataset

We’ll be working on the Titanic dataset. There are different versions of this dataset freely available online; however, I suggest using the one available at Kaggle, since it is almost ready to be used (in order to download it you need to sign up to Kaggle).
The dataset (training) is a collection of data about some of the passengers (889 to be precise), and the goal of the competition is to predict the survival (either 1 if the passenger survived or 0 if they did not) based on some features such as the class of service, the sex, the age etc. As you can see, we are going to use both categorical and continuous variables.

The data cleaning process

When working with a real dataset we need to take into account the fact that some data might be missing or corrupted, therefore we need to prepare the dataset for our analysis. As a first step we load the csv data using the read.csv() function.
Make sure that the parameter na.strings is equal to c("") so that each missing value is coded as a NA. This will help us in the next steps.

training.data.raw <- read.csv('train.csv',header=T,na.strings=c(""))

Now we need to check for missing values and look how many unique values there are for each variable using the sapply() function which applies the function passed as argument to each column of the dataframe.

sapply(training.data.raw,function(x) sum(is.na(x)))

PassengerId    Survived      Pclass        Name         Sex 
          0           0           0           0           0 
        Age       SibSp       Parch      Ticket        Fare 
        177           0           0           0           0 
      Cabin    Embarked 
        687           2 

sapply(training.data.raw, function(x) length(unique(x)))

PassengerId    Survived      Pclass        Name         Sex 
        891           2           3         891           2 
        Age       SibSp       Parch      Ticket        Fare 
         89           7           7         681         248 
      Cabin    Embarked 
        148           4

A visual take on the missing values might be helpful: the Amelia package has a special plotting function missmap() that will plot your dataset and highlight missing values:

library(Amelia)
missmap(training.data.raw, main = "Missing values vs observed")

Rplot
The variable Cabin has too many missing values, so we will not use it. We will also drop PassengerId, since it is only an index, and Ticket.
Using the subset() function we subset the original dataset selecting the relevant columns only.

data <- subset(training.data.raw,select=c(2,3,5,6,7,8,10,12))

Taking care of the missing values

Now we need to account for the other missing values. R can easily deal with them when fitting a generalized linear model by setting a parameter inside the fitting function. However, I personally prefer to replace the NAs “by hand” when possible. There are different ways to do this; a typical approach is to replace the missing values with the average, the median or the mode of the existing values. I’ll be using the average.

data$Age[is.na(data$Age)] <- mean(data$Age,na.rm=T)

As far as categorical variables are concerned, read.table() and read.csv() will by default encode them as factors. A factor is how R deals with categorical variables.
We can check the encoding using the following lines of code

is.factor(data$Sex)
TRUE

is.factor(data$Embarked)
TRUE

For a better understanding of how R is going to deal with the categorical variables, we can use the contrasts() function. This function will show us how the variables have been dummyfied by R and how to interpret them in a model.

contrasts(data$Sex)
       male
female    0
male      1

contrasts(data$Embarked)
  Q S
C 0 0
Q 1 0
S 0 1

For instance, you can see that in the variable Sex, female will be used as the reference level. As for the missing values in Embarked, since there are only two, we will discard those two rows (we could also have replaced the missing values with the mode and kept the data points).

data <- data[!is.na(data$Embarked),]
rownames(data) <- NULL

Before proceeding to the fitting process, let me remind you how important cleaning and formatting the data is. This preprocessing step is often crucial for obtaining a good fit of the model and better predictive ability.

Model fitting

We split the data into two chunks: a training and a testing set. The training set will be used to fit our model, which we will then test on the testing set.

train <- data[1:800,]
test <- data[801:889,]

Now, let’s fit the model. Be sure to specify the parameter family=binomial in the glm() function.

model <- glm(Survived ~.,family=binomial(link='logit'),data=train)

By using function summary() we obtain the results of our model:

summary(model)

Call:
glm(formula = Survived ~ ., family = binomial(link = "logit"), 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6064  -0.5954  -0.4254   0.6220   2.4165  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  5.137627   0.594998   8.635  < 2e-16 ***
Pclass      -1.087156   0.151168  -7.192 6.40e-13 ***
Sexmale     -2.756819   0.212026 -13.002  < 2e-16 ***
Age         -0.037267   0.008195  -4.547 5.43e-06 ***
SibSp       -0.292920   0.114642  -2.555   0.0106 *  
Parch       -0.116576   0.128127  -0.910   0.3629    
Fare         0.001528   0.002353   0.649   0.5160    
EmbarkedQ   -0.002656   0.400882  -0.007   0.9947    
EmbarkedS   -0.318786   0.252960  -1.260   0.2076    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1065.39  on 799  degrees of freedom
Residual deviance:  709.39  on 791  degrees of freedom
AIC: 727.39

Number of Fisher Scoring iterations: 5

Interpreting the results of our logistic regression model

Now we can analyze the fitting and interpret what the model is telling us.
First of all, we can see that SibSp, Fare and Embarked are not statistically significant. As for the statistically significant variables, sex has the lowest p-value suggesting a strong association of the sex of the passenger with the probability of having survived. The negative coefficient for this predictor suggests that all other variables being equal, the male passenger is less likely to have survived. Remember that in the logit model the response variable is log odds: ln(odds) = ln(p/(1-p)) = a*x1 + b*x2 + … + z*xn. Since male is a dummy variable, being male reduces the log odds by 2.75 while a unit increase in age reduces the log odds by 0.037.

Now we can run the anova() function on the model to analyze the table of deviance

anova(model, test="Chisq")

Analysis of Deviance Table
Model: binomial, link: logit
Response: Survived
Terms added sequentially (first to last)

         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                       799    1065.39              
Pclass    1   83.607       798     981.79 < 2.2e-16 ***
Sex       1  240.014       797     741.77 < 2.2e-16 ***
Age       1   17.495       796     724.28 2.881e-05 ***
SibSp     1   10.842       795     713.43  0.000992 ***
Parch     1    0.863       794     712.57  0.352873    
Fare      1    0.994       793     711.58  0.318717    
Embarked  2    2.187       791     709.39  0.334990    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better. Analyzing the table we can see the drop in deviance when adding each variable one at a time. Again, adding Pclass, Sex and Age significantly reduces the residual deviance. The other variables seem to improve the model less, even though SibSp has a low p-value. A large p-value here indicates that the model without the variable explains more or less the same amount of variation. Ultimately what you would like to see is a significant drop in deviance and in the AIC.

While no exact equivalent to the R2 of linear regression exists, the McFadden R2 index can be used to assess the model fit.

library(pscl)
pR2(model)

         llh      llhNull           G2     McFadden         r2ML         r2CU 
-354.6950111 -532.6961008  356.0021794    0.3341513    0.3591775    0.4880244

Assessing the predictive ability of the model

In the steps above, we briefly evaluated the fitting of the model; now we would like to see how the model does when predicting y on a new set of data. By setting the parameter type='response', R will output probabilities in the form of P(y=1|X). Our decision boundary will be 0.5. If P(y=1|X) > 0.5 then y = 1, otherwise y = 0. Note that for some applications different thresholds could be a better option.

fitted.results <- predict(model,newdata=subset(test,select=c(2,3,4,5,6,7,8)),type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)

misClasificError <- mean(fitted.results != test$Survived)
print(paste('Accuracy',1-misClasificError))

"Accuracy 0.842696629213483"

The 0.84 accuracy on the test set is quite a good result. However, keep in mind that this result is somewhat dependent on the manual split of the data that I made earlier, so if you wish for a more reliable score, you would be better off running some kind of cross-validation, such as k-fold cross-validation.
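
For instance, here is a minimal k-fold cross-validation sketch (reusing the cleaned data object from above; the seed, the number of folds, and the 0.5 threshold are illustrative choices of mine, not part of the original analysis):

set.seed(123)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(data)))
cv.acc <- numeric(k)
for (i in 1:k) {
  # fit on all folds except fold i, then predict on fold i
  fit <- glm(Survived ~ ., family = binomial(link = 'logit'), data = data[folds != i, ])
  p <- predict(fit, newdata = data[folds == i, ], type = 'response')
  cv.acc[i] <- mean(ifelse(p > 0.5, 1, 0) == data$Survived[folds == i])
}
mean(cv.acc)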

As a last step, we are going to plot the ROC curve and calculate the AUC (area under the curve) which are typical performance measurements for a binary classifier.
The ROC is a curve generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings while the AUC is the area under the ROC curve. As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (1 is ideal) than to 0.5.

library(ROCR)
p <- predict(model, newdata=subset(test,select=c(2,3,4,5,6,7,8)), type="response")
pr <- prediction(p, test$Survived)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc

0.8647186

And here is the ROC plot:
Rplot01

I hope this post will be useful. A gist with the full code for this example can be found here.

Thank you for reading this post, leave a comment below if you have any question.

To leave a comment for the author, please follow the link and comment on his blog: DataScience+.


BayesFactor version 0.9.12-2 released to CRAN


(This article was first published on BayesFactor: Software for Bayesian inference, and kindly contributed to R-bloggers)

I’ve released BayesFactor 0.9.12-2 to CRAN; it should be available on all platforms now. The changes include:

  • Added feature allowing fine-tuning of priors on a per-effect basis: see new argument rscaleEffects of lmBF, anovaBF, and generalTestBF (a small illustration follows this list)
  • Fixed bug that disallowed logical indexing of probability objects
  • Fixed minor typos in documentation
  • Fixed bug causing regression Bayes factors to fail for very small R^2
  • Fixed bug disallowing expansion of dot (.) in generalTestBF model specifications
  • Fixed bug preventing cancelling of all analyses with interrupt
  • Restricted contingency prior to values >=1
  • All BFmodel objects have additional “analysis” slot giving details of analysis
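
As a rough illustration of the new rscaleEffects argument, here is a sketch of mine on the built-in ToothGrowth data (not an example from the release notes; see ?anovaBF for the authoritative details):

library(BayesFactor)
data(ToothGrowth)
ToothGrowth$dose <- factor(ToothGrowth$dose)
# set the prior scale for the supp main effect only; effects not named keep their defaults
bf <- anovaBF(len ~ supp * dose, data = ToothGrowth,
              rscaleEffects = c(supp = 1))
bf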

To leave a comment for the author, please follow the link and comment on their blog: BayesFactor: Software for Bayesian inference.


Fast food, fast publication


(This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers)

The following article is getting quite a lot of press right now: David Just and Brian Wansink (2015). Fast Food, Soft Drink, and Candy Intake is Unrelated to Body Mass Index for 95% of American Adults. Obesity Science & Practice, forthcoming (upcoming in a new pay for placement journal). Obviously it is a popular contrary position (some coverage: here, here, and here).

I thought I would take a peek to learn about the statistical methodology (see here for some commentary). I would say the kindest thing you can say about the paper is: its problems are not statistical.

At this time the authors don’t seem to have supplied their data preparation or analysis scripts and the paper “isn’t published yet” (though they have had time for a press release), so we have to rely on their pre-print. Read on for excerpts from the work itself (with commentary).

After excluding the clinically underweight and morbidly obese, consumption of fast food, soft drinks or candy was not positively correlated with measures of BMI.

(Eliminate enough outcome variation and there is no variation to measure/explain.)

We restrict our sample to adults, defined as age 18 or older, who completed two 24-hour dietary recall surveys.

(It plausibly takes more than two days of measurements to get a good image of long term eating habits. Also most “food regulation”, a topic these authors have written on, is targeted at children. So for a useful public policy analysis it would have been nice to leave them in.)

We focus on eating episode rather than amount eaten because it is less subject to recall bias.

(Breaking the actual relation between eating and health, by leaving out amount. Also some effective diets advise more sittings of much smaller portions.)

We compare average eating episodes within food and across BMI categories.

(I am guessing this means they are modeling BMI category code instead of the BMI number. There are only about 3 BMI category codes left after “excluding the clinically underweight and morbidly obese.” Again eliminate variation in the measured outcome, and nothing will correlate to it.)

Missing data were omitted from the analysis …

(Just dropping missing data is not likely to work with interview data, unless you truly believe censoring is completely independent of health, diet, and health/diet interactions.)

Likewise, those with normal BMIs consume an average of 1.1 salty snacks over two days, while overweight, obese, and morbidly obese consume an average 0.9, 1.0, and 0.9 salty snacks, respectively.

(Uh, I thought we were “excluding the clinically underweight and morbidly obese.” I guess this is a different analysis. But here is a statistical issue: it really doesn’t look like the independent variable (“salty snacks”) is varying. So you are not going to be able to see if it drives an outcome. And since there isn’t a complete methods section I really wonder if the analysis is really looking at the claimed underlying data, or just looking at aggregate values.)


From: Table 1. Average Instances of Consumption in 48 Hours of Various Food Items, Sorted by BMI

(I’m not a statistician, but a negative p-value? Maybe that is some variation of z? But the weird values are not just in one column. Is all this just off one ANOVA table? Also, why not try a linear regression on BMI score using non-grouped data, or a logistic regression on BMI category?)

Also, when the input (or “independent”) variables are not known to be independent of each other, ANOVA is variable-order dependent! Usually this is handled by experimental design, but in this case we are observing eating patterns, not assigning them.

Some R code showing the effect is given below. Notice all of the x’s have the same relation to y, but the ANOVA analysis assigns effect in variable order. It does not make any sense to say “x1 is significant, but x10 is not” as the F-scores are not about each variable in isolation.

set.seed(6326)
d <- data.frame(y=rnorm(100))
for(i in 1:10) {
  d[[paste('x',i,sep='')]] <- d$y + rnorm(nrow(d))
}
anova(lm(y~x1+x2+x3+x4+x5+x6+x7+x9+x10,data=d))
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq  F value    Pr(>F)    
## x1         1 70.643  70.643 640.0173 < 2.2e-16 ***
## x2         1 22.647  22.647 205.1824 < 2.2e-16 ***
## x3         1  5.285   5.285  47.8821 6.425e-10 ***
## x4         1  6.588   6.588  59.6906 1.491e-11 ***
## x5         1  2.382   2.382  21.5771 1.155e-05 ***
## x6         1  3.027   3.027  27.4269 1.063e-06 ***
## x7         1  0.494   0.494   4.4757   0.03714 *  
## x9         1  1.914   1.914  17.3441 7.137e-05 ***
## x10        1  0.376   0.376   3.4048   0.06830 .  
## Residuals 90  9.934   0.110                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I am sure I got a few points wrong, but I just don’t see a strong result here.

I’ll just end with: it is of course difficult to prove a non-effect, but a single analysis failing to find an effect is not strong evidence against an effect. A single study not finding a relation doesn’t make two things unrelated. This analysis (seemingly entirely driven off one or two aggregated ANOVA tables, evidently without also trying the simple standard techniques of regression or logistic regression) does not in fact seem sensitive enough to see effects even if there are any.

To leave a comment for the author, please follow the link and comment on their blog: Win-Vector Blog » R.


Correlation and Linear Regression


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Before going into complex model building, looking at data relations is a sensible step to understand how your different variables interact together. Correlation looks at trends shared between two variables, while regression looks at the causal relation between a predictor (independent variable) and a response (dependent) variable.

Correlation

As mentioned above, correlation looks at the global movement shared between two variables: for example, when one variable increases and the other increases as well, the two variables are said to be positively correlated. Conversely, when one variable increases and the other decreases, the two variables are negatively correlated. In the case of no correlation, no pattern will be seen between the two variables.

Let’s look at some code before introducing correlation measure:

x<-sample(1:20,20)+rnorm(10,sd=2)
y<-x+rnorm(10,sd=3)
z<-(sample(1:20,20)/2)+rnorm(20,sd=5)
df<-data.frame(x,y,z)
plot(df[,1:3])

Here is the plot:
Rplot

From the plot we see that when we plot the variable y against x, the points form some kind of line; when the value of x gets bigger, the value of y gets proportionally bigger too, so we can suspect a positive correlation between x and y.

The measure of this correlation is called the coefficient of correlation and can be calculated in different ways. The most usual measure is the Pearson coefficient: the covariance of the two variables divided by the product of their standard deviations. It ranges from 1 (a perfect positive correlation) to -1 (a perfect negative correlation), with 0 indicating no linear relationship. We can get the Pearson coefficient of correlation using the function cor():

cor(df,method="pearson")

        x          y          z
x  1.0000000  0.8736874 -0.2485967
y  0.8736874  1.0000000 -0.2376243
z -0.2485967 -0.2376243  1.0000000

cor(df[,1:3],method="spearman")

        x          y          z
x  1.0000000  0.8887218 -0.3323308
y  0.8887218  1.0000000 -0.2992481
z -0.3323308 -0.2992481  1.0000000
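
To connect these outputs to the definition above, we can also compute the Pearson coefficient by hand and square it (a small check of mine using the df created above):

# covariance divided by the product of the standard deviations
cov(df$x, df$y) / (sd(df$x) * sd(df$y))
cor(df$x, df$y)

# squared, it gives the proportion of variance in y shared with x
cor(df$x, df$y)^2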

From these outputs our suspicion is confirmed: x and y have a high positive correlation. As always in statistics, we can test whether this coefficient is significant, either under parametric assumptions (Pearson: dividing the coefficient by its standard error gives a value that follows a t-distribution) or, when the data violate parametric assumptions, using the Spearman rank coefficient.

cor.test(df$x,df$y,method="pearson")

        Pearson's product-moment correlation

data:  df$x and df$y 
t = 7.6194, df = 18, p-value = 4.872e-07
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 0.7029411 0.9492172 
sample estimates:
      cor 
0.8736874 

cor.test(df$x,df$y,method="spearman")

        Spearman's rank correlation rho

data:  df$x and df$y 
S = 148, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0 
sample estimates:
      rho 
0.8887218 

An extension of the Pearson coefficient of correlation is that when we square it we obtain the proportion of the variation in y explained by x (this is not true for the Spearman rank-based coefficient, where squaring has no statistical meaning). In our case around 76% of the variance in y is explained by x.

However, such results do not allow any causal explanation of the effect of x on y; indeed, x could act on y in various ways that are not always direct. All we can say from the correlation is that these two variables are linked somehow. To really explain and measure the causal effect of x on y we need to use regression methods, which come next.

Linear Regression

Regression is different from correlation because it tries to put the variables into an equation and thus explain the causal relationship between them. For example, the simplest linear equation is written Y = aX + b, so for every unit change in X, the value of Y changes by a. Because we are trying to explain natural processes by equations that represent only part of the whole picture, we are actually building a model; that’s why linear regression is also called linear modelling.

In R we can build and test the significance of linear models.

m1<-lm(mpg~cyl,data=mtcars)
summary(m1)

Call:
lm(formula = mpg ~ cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9814 -2.1185  0.2217  1.0717  7.5186 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.8846     2.0738   18.27  < 2e-16 ***
cyl          -2.8758     0.3224   -8.92 6.11e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262,     Adjusted R-squared: 0.7171 
F-statistic: 79.56 on 1 and 30 DF,  p-value: 6.113e-10

The basic way to build a linear model (linear regression) in R is to use the lm() function: you provide it a formula in the form y ~ x and, optionally, a data argument.

Using the summary() function we get all the information about our model: the formula called, the distribution of the residuals (the errors of our model), the values of the coefficients and their significance, plus information on the overall model performance via the adjusted R-squared (0.71 in our case), which represents the amount of variation in y explained by x; so 71% of the variation in ‘mpg’ can be explained by the variable ‘cyl’.

Before shouting ‘Eureka’ we should first check that the model’s assumptions are met. Linear models make a few assumptions about your data: the first is that the residuals are normally distributed, the second is that the variance in y is homogeneous over all x values (sometimes called homoscedasticity), and the third is independence, which means that a y value at a certain x value should not influence other y values.

There is a marvelous built-in methods to check all this with linear models:

par(mfrow=c(2,2))
plot(m1)

The par() call is just to put all four graphs in one window; the plot() function does the real work.

Here is the plot:
Rplot01

The graphs in the first column look at variance homogeneity, among other things; normally you should see no pattern in the dots, just a random cloud of points. In this example this is clearly not the case, since the spread of the dots increases with higher values of cyl. Our homogeneity assumption is violated, so we have to go back to the beginning and build a new model; this one cannot be interpreted… Sorry m1, you looked so great…

For the record, the graph on the top right checks the normality assumption: if the residuals are normally distributed, the points should fall (more or less) on a straight line; in this case they look normal.

The final graph shows how each observation influences the model: each point is removed in turn and the new model is compared to the one with the point; if the point is very influential it will have a high leverage value. Points with too high a leverage value should be removed from the dataset to remove their outlying effect on the model.

Transforming the data

There are a few basic mathematical transformations that can be applied to non-normal or heterogeneous data; usually it is a trial-and-error process;

mtcars$Mmpg<-log(mtcars$mpg)
plot(Mmpg~cyl,mtcars)

Here is the plot we get:
Rplot02

In our case this looks OK, but we can still remove the two outliers in the ‘cyl’ category 8;

n<-rownames(mtcars)[mtcars$Mmpg!=min(mtcars$Mmpg[mtcars$cyl==8])]
mtcars2<-subset(mtcars,rownames(mtcars)%in%n)

The first line asks for the row names of ‘mtcars’ (rownames(mtcars)), but only returns the ones where the value of the variable ‘Mmpg’ is not equal (!=) to the minimum value of ‘Mmpg’ falling in the 8-cylinder category. The vector ‘n’ then contains all these row names, and the next step is to make a new data frame that only contains rows whose names are present in ‘n’.

At this stage of transforming and removing outliers from the data you should use and abuse plots to help you through the process.

Now let’s look back at our bivariate linear regression model from this new dataset;

model<-lm(Mmpg~cyl,mtcars2)
summary(model)

Call:
lm(formula = Mmpg ~ cyl, data = mtcars2)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.19859 -0.08576 -0.01887  0.05354  0.26143 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.77183    0.08328  45.292  < 2e-16 ***
cyl         -0.12746    0.01319  -9.664 2.04e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.1264 on 28 degrees of freedom
Multiple R-squared: 0.7693,     Adjusted R-squared: 0.7611 
F-statistic: 93.39 on 1 and 28 DF,  p-value: 2.036e-10 

plot(model)

Here is the plot for the model:
Rplot03

Again we have a highly significant intercept and slope; the model explains 76% of the variance in log(mpg) and is significant overall. Now, we biologists are trained to love and worship ANOVA tables; in R there are several ways to get one (as always, an easy and straightforward way and another with more possibilities for tuning);

anova(model)

Analysis of Variance Table

Response: Mmpg
          Df  Sum Sq Mean Sq F value    Pr(>F)    
cyl        1 1.49252 1.49252  93.393 2.036e-10 ***
Residuals 28 0.44747 0.01598   

library(car)
Loading required package: MASS
Loading required package: nnet

Anova(model)

Anova Table (Type II tests)

Response: Mmpg
           Sum Sq Df F value    Pr(>F)    
cyl       1.49252  1  93.393 2.036e-10 ***
Residuals 0.44747 28

The second function, Anova(), allows you to define which type of sum of squares you want to calculate (here is a nice explanation of their different assumptions) and also to correct for variance heterogeneity;

Anova(model,white.adjust=TRUE)

Analysis of Deviance Table (Type II tests)

Response: Mmpg
          Df      F    Pr(>F)    
cyl        1 69.328 4.649e-09 ***
Residuals 28

You will have noticed that the p-value is a bit higher. This function is very useful for unbalanced datasets (which is our case), but you need to take care when formulating the model, especially when there is more than one predictor variable, since the type II sum of squares assumes that there is no interaction between the predictors.

Concluding comments

To sum up, correlation is a nice first step in data exploration before going into more serious analysis and for selecting variables that might be of interest (and anyway it always produces sexy, easy-to-interpret graphs which will make your supervisor happy). The next step is to model the relationship between the variables; the most basic model is the bivariate linear regression, which puts the relation between the response variable and the predictor variable into an equation and tests it using the summary() and anova() functions. Since linear regression makes several assumptions about the data, before interpreting the results of the model you should use the plot() function to check that the residuals are normally distributed and that the variance is homogeneous (no pattern in the residuals vs fitted values plot), and remove outliers when necessary.

Next step will be using multiple predictors and looking at generalized linear models.

To leave a comment for the author, please follow the link and comment on their blog: DataScience+.


Climate change and spline interactions


(This article was first published on From the Bottom of the Heap - R, and kindly contributed to R-bloggers)

In a series of irregular posts1 I’ve looked at how additive models can be used to fit non-linear models to time series. Up to now I’ve looked at models that included a single non-linear trend, as well as a model that included a within-year (or seasonal) part and a trend part. In this trend plus season model it is important to note that the two terms are purely additive; no matter which January you are predicting for in a long time series, the seasonal effect for that month will always be the same. The trend part might shift this seasonal contribution up or down a bit, but all Januarys are the same. In this post I want to introduce a different type of spline interaction model that will allow us to relax this additivity assumption and fit a model that allows the seasonal part of the model to change in time along with the trend.

As with previous posts, I’ll be using the Central England Temperature time series as an example. The data require a bit of processing to get them into a format useful for modelling, so I’ve written a little function, loadCET(), that downloads the data and processes it for you. To load the function into R, run the following

source(con <- url("http://bit.ly/loadCET", method = "libcurl"))
close(con)
cet <- loadCET()

We also need a couple of packages for model fitting and plotting

library("mgcv")
Loading required package: nlme
This is mgcv 1.8-9. For overview type 'help("mgcv-package")'.
library("ggplot2")
Loading required package: methods

OK, let’s begin…

As previously, if we think about a time series where observations were made on a number of occasions within any given year over a number of years, we may want to model the following features of the data

  1. any trend or long term change in the level of the time series, and
  2. any seasonal or within-year variation, and
  3. any variation in, or interaction between, the trend and seasonal features of the data.

In a previous post I tackled features 1 and 2, but it is feature 3 that is of interest now. Our model for features 1 and 2 was

\[ y = \beta_0 + f_{\text{seasonal}}(x_1) + f_{\text{trend}}(x_2) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2) \]

where \(\beta_0\) is the intercept, \(f_{\text{seasonal}}\) and \(f_{\text{trend}}\) are smooth functions for the seasonal and trend features we’re interested in, and \(x_1\) and \(x_2\) are two covariates providing some form of time indicators for the within-year and between-year times.

To allow for an interaction between \(f_{\text{seasonal}}\) and \(f_{\text{trend}}\) we will need to fit the following model instead

\[ y = \beta_0 + f(x_1, x_2) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2) \]

Notice now that \(f()\) is a smooth function of our two time variables, and for simplicity’s sake let’s say that the within-year variable will just be the numeric month indicator (1, 2, …, 12) and the between-year variable will be the calendar year of the observation. In previous posts I’ve used a derived time variable instead of calendar year for the trend, but doing that here is largely redundant; the data seem well modelled even if we don’t allow for a trend within-year, and doing some useful or interesting things with the model once fitted is much simplified if we just use observation year for the trend.

In pseudo mgcv code we are going to fit the following model

mod <- gam(y ~ te(x1, x2), data = foo)

The te() represents a tensor product smooth of the indicated variables. We won’t be using s() because our two time variables are unrelated, and we want to allow for more variation in one of the variables than the other; multivariate s() smooths are isotropic, so they’re good for things like spatial coordinates but not things measured in different units or having more variation in one variable than the other. I’m not going to go into the detail of tensor product smooths; that’s covered in Simon Wood’s rather excellent book.

Another detail that we need to consider is knot placement. Previously I used a cyclic spline for the within-year term and allowed gam() to select the knots for the spline from the data. This meant that boundary knots were at months 1 and 12. This worked OK where I’ve been modelling daily data, so the within-year term is in Julian day, say, as the knots would be at 1 and 366 and it didn’t matter much if December 31st was exactly the same as January 1st. But with monthly data like this it is a bit of a problem; we don’t expect December and January to be exactly the same. This problem was anticipated in the comments of the previous post by a reader and I sort of dismissed it. Well, I was wrong, and it took me until I set about interrogating the model that I’ll fit shortly to realise it.

What we need to do is place boundary knots just beyond the data, such that the distance between December and January is the same as the distance between any other month. Placing boundary knots at (0.5, 12.5) achieves this. We then have 10 more interior knots to play with (assuming 12 knots overall, which is what I specify for k below), so I just place those, spread evenly between 1 and 12 (the inner seq() call).

knots <- list(nMonth = c(0.5, seq(1, 12, length = 10), 12.5))

Having dealt with those details, we can fit some models; here I fit models with the same fixed effects parts (the spline interaction) but with differing stochastic trend models in the residuals.

To assist our selection of the stochastic model in the residuals, we fit a naive model that assumes independence of observations

m0 <- gamm(Temperature ~ te(Year, nMonth, bs = c("cr","cc"), k = c(10,12)),
           data = cet, method = "REML", knots = knots)

Plotting the autocorrelation function (ACF) of the normalized residuals from the $lme part of this model fit, we can start to think about plausible models for the residuals. Remember though that we are going to nest this within year, so we're only going to be able to do anything about the first 12 lags, even though I'll still show the default number of lags.

plot(acf(resid(m0$lme, type = "normalized")))

ACF for model m0, a naive additive model assuming conditional independence of observations, fitted to the CET time series

In the ACF we see lingering correlations out to lag 7 or 8 and then longer-range lags out beyond a year. These latter lags are the between-year temporal signal that we aren’t capturing perfectly with the temporal trend component of the model fit. We’re going to ignore these, for now at least — I may return to look at these in a future post.

From the ACF (and a bit of fiddling, err… EDA) it looks like AR terms are needed to model this residual autocorrelation. Hence the stochastic trend models are AR(p), for p in {1, 2, …, 8}. The ARMA is nested within year, as previously; with the switch to modelling using calendar year for the trend term, I would anticipate stronger within-year autocorrelation in residuals, or possibly a more complex structure, than observed in earlier fits[2].

If you want to fit all the models, great; I'll get to you in a moment, just don't look at the value of p in the chunk below! If you just want to skip ahead, fit the following model and then move right along to the next section, thus saving yourself in the region of 10 minutes (on a fast-as-hell Xeon workstation) of thumb twiddling

ctrl <- list(niterEM = 0, optimMethod="L-BFGS-B", maxIter = 100, msMaxIter = 100)
m <- gamm(Temperature ~ te(Year, nMonth, bs = c("cr","cc"), k = c(10,12)),
          data = cet, method = "REML", control = ctrl, knots = knots,
          correlation = corARMA(form = ~ 1 | Year, p = 7))

For those of you in for the long haul, here's a loop[3] that will fit the models with varying AR terms for us

ctrl <- list(niterEM = 0, optimMethod="L-BFGS-B", maxIter = 100, msMaxIter = 100)
for (i in 1:8) {
    m <- gamm(Temperature ~ te(Year, nMonth, bs = c("cr","cc"), k = c(10,12)),
              data = cet, method = "REML", control = ctrl, knots = knots,
              correlation = corARMA(form = ~ 1 | Year, p = i))
    assign(paste0("m", i), m) 
}

A generalised likelihood ratio test can be used to test for which correlation structure fits best

anova(m1$lme, m2$lme, m3$lme, m4$lme, m5$lme, m6$lme, m7$lme, m8$lme)
Model df      AIC      BIC    logLik   Test   L.Ratio p-value
m1$lme     1  6 14849.98 14888.13 -7418.988                         
m2$lme     2  7 14836.78 14881.29 -7411.389 1 vs 2 15.197206  0.0001
m3$lme     3  8 14810.73 14861.60 -7397.365 2 vs 3 28.047345  <.0001
m4$lme     4  9 14784.63 14841.86 -7383.314 3 vs 4 28.101617  <.0001
m5$lme     5 10 14778.35 14841.95 -7379.177 4 vs 5  8.275739  0.0040
m6$lme     6 11 14776.49 14846.44 -7377.244 5 vs 6  3.865917  0.0493
m7$lme     7 12 14762.45 14838.77 -7369.227 6 vs 7 16.032363  0.0001
m8$lme     8 13 14764.33 14847.01 -7369.167 7 vs 8  0.119909  0.7291

Lo and behold, the AR(7) turns out to have the best fit as assessed by a range of metrics. If we now look at the ACF of the normalized residuals for this model we see that all the within-year autocorrelation has been accounted for, leaving a little bit of correlation at lags just longer than a year.

plot(acf(resid(m7$lme, type = "normalized")))
ACF for model m7, an additive model with an AR(7) process in the residuals, fitted to the CET time series

At this stage we can probably proceed without too much worry — although an AR(7) is quite a complex model to fit, so we should remain a little cautious.

Before we move on, to bring us up to speed with the people that jumped ahead, copy m7 into object m so the code in the next section works for you too.

m <- m7

Interrogating the fitted model

I’m going to cut to the chase and look at the fitted model and use it to ask some questions about how temperature has changed both within and between years over the last 100 years. In part 2 of this post I’ll look at doing inference on the fitted model, but for now I’ll skip that.

First, let’s visualise the fitted spline; this requires a 3D plot so it gets somewhat tricky to really see what’s going on, but here goes

plot(m$gam, pers = TRUE)
Fitted bivariate spline

This is quite a useful visualisation as it illustrates how the model represents longer term trends, seasonal cycles, and how these vary in relation to one another. Viewed one way, we have estimates of trends over years for each month. Alternatively, we could see the model as giving an estimate of the seasonal cycle for each year. Each year can have a different seasonal cycle and each month a different trend. If there was no interaction, there would be no change in the seasonal pattern over time — or all months would have the same trend over years. This figure also sucks; it's 3D but static and the scale of the trend and any change in seasonal cycle over time is swamped by the magnitude of the seasonal cycle itself.

Predict monthly temperature for the years 1914 and 2014

In the first illustrative use of the fitted model, I'll predict within-year temperatures for two years — 1914 and 2014 — to look at how different the seasonal cycle is after 100 years[4] of climate change (time). The first step is to produce the values of the covariates that we want to predict at. In the snippet below I generate 100 1914s followed by 100 2014s for Year, and within these years we have 100 evenly-spaced values on the interval (1,12) for nMonth.

pdat <- with(cet,
             data.frame(Year = rep(c(1914, 2014), each = 100),
                        nMonth = rep(seq(1, 12, length = 100), times = 2)))

Next, the predict() method generates predicted values for the new data pairs, with standard errors for each predicted value

pred <- predict(m$gam, newdata = pdat, se.fit = TRUE)
crit <- qt(0.975, df = df.residual(m$gam)) # ~95% interval critical t
pdat <- transform(pdat, fitted = pred$fit, se = pred$se.fit, fYear = as.factor(Year))
pdat <- transform(pdat,
                  upper = fitted + (crit * se),
                  lower = fitted - (crit * se))

The first transform() adds fitted, se, and fYear variables to pdat for the predictions, their standard errors, and a factor for Year that I’ll use in plotting shortly. The second transform() call adds upper and lower variables containing the upper and lower pointwise confidence bounds, here for an approximate 95% interval.

A plot, using the ggplot2 package, of the predicted monthly temperatures for 1914 and 2014 is created in the next chunk. It’s a little involved as I wanted to modify a few things and change the name of the legend to make it look nice — I’ve commented the lines to indicate what they do

p1 <- ggplot(pdat, aes(x = nMonth, y = fitted, group = fYear)) +
    geom_ribbon(mapping = aes(ymin = lower, ymax = upper,
                              fill = fYear), alpha = 0.2) + # confidence band
    geom_line(aes(colour = fYear)) +    # predicted temperatures
    theme_bw() +                        # minimal theme
    theme(legend.position = "top") +    # push legend to the top
    labs(y = expression(Temperature ~ (degree*C)), x = NULL) +
    scale_fill_discrete(name = "Year") + # correct legend name
    scale_colour_discrete(name = "Year") +
    scale_x_continuous(breaks = 1:12,   # tweak where the x-axis ticks are
                       labels = month.abb, # & with what labels
                       minor_breaks = NULL)
p1
Predicted monthly temperature for 1914 and 2014

Looking at the plot, most of the action appears in the autumn and winter months.

The second use of the fitted model will be to predict trends in temperature for each month over the period 1914–2014. For this we need a different set of new values to predict at than before; here I repeat each of the years 1914–2014 twelve times, and the sequence 1, 2, …, 12 a total of 101 times, once per year of the period of interest.

pdat2 <- with(cet,
              data.frame(Year = rep(1914:2014, each = 12),
                         nMonth = rep(1:12, times = 101)))

Next we repeat the earlier steps to predict from the model and set up an object for plotting with ggplot()

pred2 <- predict(m$gam, newdata = pdat2, se.fit = TRUE)
## add predictions & SEs to the new data ready for plotting
pdat2 <- transform(pdat2,
                   fitted = pred2$fit,  # predicted values
                   se = pred2$se.fit,   # standard errors
                   fMonth = factor(month.abb[nMonth], # month as a factor
                                   levels = month.abb))
pdat2 <- transform(pdat2,
                   upper = fitted + (crit * se), # upper and...
                   lower = fitted - (crit * se)) # lower confidence bounds

The first plot we’ll produce using these data is a plot of the trends faceted by fMonth

p2 <- ggplot(pdat2, aes(x = Year, y = fitted, group = fMonth)) +
    geom_line(aes(colour = fMonth)) +   # draw trend lines
    theme_bw() +                        # minimal theme
    theme(legend.position = "none") +   # no legend
    labs(y = expression(Temperature ~ (degree*C)), x = NULL) +
    facet_wrap(~ fMonth, ncol = 6) +    # facet on month
    scale_y_continuous(breaks = seq(4, 17, by = 1),
                       minor_breaks = NULL) # nicer ticks
p2
Predicted trends in monthly temperature, 1914–2014.

The impression that most of the action is in the autumn and winter is again very apparent.

Another visualisation of the same predictions is to group the data by quarter/season. For that we set up a variable Quarter in the pdat2 data frame and assign particular months to the seasons.

pdat2$Quarter <- NA
pdat2$Quarter[pdat2$nMonth %in% c(12,1,2)] <- "Winter"
pdat2$Quarter[pdat2$nMonth %in% 3:5] <- "Spring"
pdat2$Quarter[pdat2$nMonth %in% 6:8] <- "Summer"
pdat2$Quarter[pdat2$nMonth %in% 9:11] <- "Autumn"
pdat2 <- transform(pdat2,
                   Quarter = factor(Quarter,
                                    levels = c("Spring","Summer","Autumn","Winter")))

Then we facet on Quarter; and because we need a legend to help identify the months, we do a little fiddling to get a nice legend name

p3 <- ggplot(pdat2, aes(x = Year, y = fitted, group = fMonth)) +
    geom_line(aes(colour = fMonth)) +   # draw trend lines
    theme_bw() +                        # minimal theme
    theme(legend.position = "top") +    # legend on top
    scale_fill_discrete(name = "Month") + # nicer legend title
    scale_colour_discrete(name = "Month") +
    labs(y = expression(Temperature ~ (degree*C)), x = NULL) +
    facet_grid(Quarter ~ ., scales = "free_y") # facet by Quarter
p3
Predicted trends in monthly temperature, 1914–2014, by quarter.

Summary

In this post I've looked at how we can fit smooth models with smooth interactions between two variables. This allows the smooth effect of one variable to vary as a smooth function of the second variable. This approach can be extended to additional variables as needed.

One of the things I’m not very happy with is the rather complex AR process in the model residuals. The AR(7) mopped up all the within-year residual autocorrelation but it appears that there is a trade-off here between fitting a more complex seasonal smooth or a more complex within-year AR process.

An important aspect that I haven’t covered in this post is whether the interaction model is an improvement in fit over a purely additive model of a trend in temperature with the same seasonal cycle superimposed. I’ll look at how we can do that in part 2.


  1. here, here, and here

  2. Note that this code assumes that samples are provided in the data in their time order within year. This is the case here, but if it isn’t, you could do form = ~ nMonth | Year to tell gamm() about the correct ordering.

  3. I’m just being lazy; I could fit these models in parallel with the parallel package, but I’m caching this code chunk so, meh…

  4. Yes, yes, yes, I know it’s 101 years…

To leave a comment for the author, please follow the link and comment on their blog: From the Bottom of the Heap - R.


Published — Introductory Fisheries Analyses with R


(This article was first published on fishR Blog, and kindly contributed to R-bloggers)

I am pleased to announce that my Introductory Fisheries Analyses with R (IFAR) book has been published, almost two weeks ahead of schedule. Details about the book (and companion website) are here and it can be purchased from CRC Press (at a 20% discount through the end of the year). A brief description and table of contents for the book are below.


Brief Description


Introductory Fisheries Analyses with R provides detailed instructions on how to perform basic fisheries stock assessment analyses in the R environment. The analyses covered are typical analyses for many working fisheries scientists and, thus, also occur in most upper-level undergraduate and graduate level fisheries science, analysis, or management courses. The book begins with three foundational chapters (R basics, data manipulation, and plotting) that help the reader become familiar with R within the context of basic fisheries analyses and examples. The remaining chapters build upon these foundational skills with analytical techniques specific to fisheries stock assessments.

Table of Contents

  1. (Very Brief) Introduction to R Basics
    • The bare fundamentals of R that are required for the remainder of the book.
  2. Loading Data and Basic Manipulations
    • Load data into R from external files and perform typical manipulations including filtering, sorting, aggregating, joining, and converting between wide- and long-formats.
  3. Plotting Fundamentals
    • The bare fundamentals for constructing basic plots using base R.
  4. Age Comparisons
    • Compare two or more estimates of age for the same fish with precision and bias metrics and plots.
  5. Age-Length Keys
    • Assign ages to unaged fish from their length and an age-length-key.
  6. Size Structure
    • Assess size structure through length frequencies and the proportional size distribution (PSD) metric.
  7. Weight-Length Relationships
    • Introduction to simple linear regression through examination of weight-length relationships.
  8. Condition
    • Compute condition metrics from observed length and weights. Introduction to one-way ANOVA.
  9. Abundance from Capture-Recapture Data
    • Estimate abundance from capture-recapture data for closed (single and multiple recapture events) and open populations.
  10. Abundance from Depletion Data
    • Estimate abundance from removal or depletion samplings (Leslie, DeLury, k-pass).
  11. Mortality Rates
    • Estimate total mortality rates from catch curve and capture-recapture data. Estimate fishing and natural mortality with empirical models.
  12. Individual Growth
    • Estimate parameters for the von Bertalanffy growth function and compare growth parameters among populations.
  13. Recruitment
    • Estimate parameters for the Beverton-Holt and Ricker stock-recruitment models, compute spawning potential ratios, and estimate year-class strengths from catch data.

To leave a comment for the author, please follow the link and comment on their blog: fishR Blog.


Baby Boomers


(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

By Chris Campbell

As another new baby card is passed around the office, and the latest cute baby pictures are emailed out, a discussion is underway. Could it be true? Something in the water? An elixir of fertility that should be bottled and sold to desperate couples for enormous profit? Is Mango having some crazy baby boom?!

Mango has been having something of a population explosion. In the first seven years of Mango there were only a couple of babies. Since moving to Methuen Park it seems like there’s been one every few months. So is there really something magical happening here?

Of course not. Now demonstrating this kind of data is an absolute HR nightmare, so please forgive me if I'm somewhat coy with revealing the raw datasets. However, I'd like to share a small analysis I did as an example of how to set up a dataset for survival analysis with multiple events. To investigate the hypothesis that there'd been a change in the rate of baby production, I had two datasets. One, babies, has the Date of birth of the child, as well as some covariates. Splitting births into events in the first 3432 days (9.4 years) and the most recent 1268 days (3.5 years) shows that the number of babies born before moving to our current home, Methuen Park, was only 40% of the number born since the move. In fact, the rate jumps from 0.4 babies/year to 2.9 babies/year!

names(babies)
# [1] "Name"   "Date"   "Sex"    "Parent"
library(dplyr)
day1 <- as.Date("2002-10-01")
summarize(
      .data = group_by(x = babies, Methuen = Date - day1 > 3432),
      Number = length(Date))
# Source: local data frame [2 x 2]
#
#   Methuen Number
#     (lgl)  (int)
# 1   FALSE      4
# 2    TRUE     10
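The jump in rates quoted above follows directly from these counts and the lengths of the two periods; a quick back-of-the-envelope check:

# 4 births in the first 3432 days, 10 in the most recent 1268 days
c(before = 4 / (3432 / 365.25), after = 10 / (1268 / 365.25))
# before ~0.43 and after ~2.88 babies/year, matching the rates quoted above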

I helped Francis Smart develop his stick man script into a small package, stick, that you can install from GitHub. This allows me to demonstrate graphically how absolutely remarkable this step change in birth rate truly is.

There is a plotStick function in the package, but since I want to demonstrate which time period is at Methuen without obscuring the sticks, I added them to a blank plot with pointsStick. Each stick represents one additional birth. The sticks wear dresses if they are female, and are coloured to show unique parents. The mood of the face is random, to represent the changeable mood of a newborn. Sleeping stick is not yet implemented.


R Code for Stick Plot
library(devtools)
# install stick package
install_github("EconometricsBySimulation/R-Graphics/Stick-Figures/stick")
library(stick)

# blank plot
plot(x = babies$Date, y = seq_len(nrow(babies)), 
    main = "Cumulative Baby Numbers for Mango Solutions Employees",
    xlab = "Date", ylab = "Total Babies", type = "n", 
    ylim = c(0, 15))

# highlight time at Methuen Park
methuen <- rainbow(1, s = 0.5, start = 0.2)
polygon(x = day1 + c(3432, 4700, 4700, 3432), 
    y = c(-2, -2, 17, 17), 
    border = methuen, col = methuen)
box()

# add sticks
pointsStick(x = babies$Date, y = seq_len(nrow(babies)), 
    # female babies wear a dress I guess
    gender = babies$Sex, 
    # colour by parent
    col = rainbow(length(unique(babies$Parent)), 
        s = 0.9)[factor(babies$Parent)], 
    # baby mood is unpredictable at best
    face = sample(x = c("happy", "neutral", "surprised", 
            "annoyed", "sad"), 
        size = nrow(babies), 
        replace = TRUE, prob = (5:1) / 15), 
    # babies rarely, if ever keep hats on
    hat = FALSE)

# add labels
text(x = as.Date(c("2006-03-30", "2012-05-29")), 
    y = c(14, 14), 
    labels = c("Before Methuen", "At Methuen"), 
    pos = 4)


mango babies

Plotting the number of babies born, it certainly looks like there's a correlation between Methuen Park and babies. However, examining employee head count during the two periods trivially shows that the number of births correlates with the number of employees: both births and employee numbers before the move were roughly 40% of their levels at Methuen Park.

names(employeeno)
# [1] "Date"  "Employees"

summarize(
      .data = group_by(x = employeeno, Methuen = Date - day1 > 3432),
      Number = median(Employees))
# Source: local data frame [2 x 2]
#
#   Methuen Number
#     (lgl)  (int)
# 1   FALSE     22
# 2    TRUE     52

However, for the sake of fun, we can see whether there is a difference in birth rates between these two periods. One way of measuring the occurrence of events is survival analysis. A lot of tools for survival analysis are available in the survival package. Following the common modelling idiom in R, the model is defined as a formula. Events go on the left hand side of the formula, and are coded as a Surv object (a matrix of time and event status). For data where only one event can be observed for a subject, only one event time is observed. But people who have had a baby are still at risk of having another baby!

For multiple events in the Surv object, the data need to be shaped so that the time at risk of each individual is marked as start time, end time and status: event observed or subject censored (no event observed). Subjects can therefore appear in multiple records.
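To make that shape concrete, here is a minimal sketch of such a counting-process response, using made-up interval values in the style of the table below (the real data are not shared in this post):

library(survival)
# one row per interval at risk: (start, stop] plus an event indicator
Surv(time  = c(3031, 4408),
     time2 = c(4408, 4595),
     event = c(1, 0))
# [1] (3031,4408] (4408,4595+]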

library(data.table)
tab2 <- data.table(tab)
setkey(tab2, Employee)
tab2
#         Employee Start Stop Status
#1:  Andy Nicholls  3031 4408      1
#2:  Andy Nicholls  4408 4595      0
#3: Chris Campbell  3361 3893      1
#4: Chris Campbell  3893 4479      1
#5: Chris Campbell  4479 4595      0
#6: Chris Musselle  4124 4595      0

In this experiment, subjects are at risk during two time periods. To divide this dataset into two groups based on site change at 3432 days since the start of Mango, we can use the foverlaps function from package data.table.


R Code for Splitting into Time Intervals
# split date
dmethuen <- data.table(Start = 3432, Stop = 3432,
     Status = 0, key = c("Start", "Stop"))
setkey(tab2, Start, Stop)

# records where split occurs flagged in Start/Stop columns
tab2 <- foverlaps(x = tab2, y = dmethuen, type = "any")
setkey(tab2, Employee)
tab2
#    Start Stop Status       Employee i.Start i.Stop i.Status
# 1:  3432 3432      0  Andy Nicholls    3031   4408        1
# 2:    NA   NA     NA  Andy Nicholls    4408   4595        0
# 3:  3432 3432      0 Chris Campbell    3361   3893        1
# 4:    NA   NA     NA Chris Campbell    3893   4479        1
# 5:    NA   NA     NA Chris Campbell    4479   4595        0
# 6:    NA   NA     NA Chris Musselle    4124   4595        0

# bind new columns as new rows, using new Status
tab2 <- rbindlist(
      list(
            tab2[!is.na(Start), 
                list(Employee, Start = i.Start, 
                    Stop, Status = Status)], 
            tab2[!is.na(Start), 
                list(Employee, Start = Start, 
                    Stop = i.Stop, Status = i.Status)], 
            tab2[is.na(Start), 
                list(Employee, Start = i.Start, 
                    Stop = i.Stop, Status = i.Status)]))

# add flag column for modelling
tab2[, Location := factor(x = Start >= 3432, 
      levels = c(FALSE, TRUE), labels = c("Greenways", "Methuen"))]
setkey(tab2, Employee)
tab2
#         Employee Start Stop Status  Location
#1:  Andy Nicholls  3031 3432      0 Greenways
#2:  Andy Nicholls  3432 4408      1   Methuen
#3:  Andy Nicholls  4408 4595      0   Methuen
#4: Chris Campbell  3361 3432      0 Greenways
#5: Chris Campbell  3432 3893      1   Methuen
#6: Chris Campbell  3893 4479      1   Methuen
#7: Chris Campbell  4479 4595      0   Methuen
#8: Chris Musselle  4124 4595      0   Methuen


This allowed me to use Methuen vs. not-Methuen as a possible covariate. I created a Cox proportional hazards fit of event status with Location as an explanatory variable. There was no improvement in the likelihood of the model from adding a Location effect. The rate of baby creation for a person at risk is constant, whether at Methuen Park or the old offices. As we suspected, the only difference, which we can reasonably infer to be causative, is the number of person-days at risk in the two groups.

 

library(survival)
newBabyLocs <- coxph(
    formula = Surv(Start, Stop, Status) ~ Location, 
    data = tab2)
anova(newBabyLocs)
# Analysis of Deviance Table
#  Cox model: response is Surv(Start, Stop, Status)
# Terms added sequentially (first to last)
#
#           loglik Chisq Df Pr(>|Chi|)
# NULL     -48.341                    
# Location -48.341     0  1          1

This approach suggests that Location doesn’t influence risk of babies. Absence of evidence does not constitute evidence of absence, so we can’t completely rule out an effect. However, supporting evidence strongly suggests that there is a causal relationship between moving to larger offices at Methuen Park and Mango population size. The increased number of babies born simply reflects the increased number of subjects at risk. So if there is something in the water, the effect size must be very small.

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.


Error Control in Exploratory ANOVA’s: The How and the Why


(This article was first published on The 20% Statistician, and kindly contributed to R-bloggers)

In a 2X2X2 design, there are three main effects, three two-way interactions, and one three-way interaction to test. That's 7 statistical tests. The probability of making at least one Type 1 error in a single ANOVA is 1-(0.95)^7 = 30%.
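That arithmetic is easy to verify in R (the 15-test figure for a 2x2x2x2 design comes up again below):

1 - 0.95^7    # 2x2x2 design, 7 tests: ~0.30
1 - 0.95^15   # 2x2x2x2 design, 15 tests: ~0.54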

There are earlier blog posts on this, but my eyes were not opened until I read this paper by Angelique Cramer and colleagues (put it on your reading list, if you haven't read it yet). Because I prefer to provide solutions to problems, I want to show how to control Type 1 error rates in ANOVA's in R, and repeat why it's necessary if you don't want to fool yourself. Please be aware that if you continue reading, you will lose the bliss of ignorance if you hadn't thought about this issue before now, and it will reduce the number of p < 0.05 results you'll find in exploratory ANOVA's.

Simulating Type 1 errors in 3-way ANOVA’s

Let's simulate 250,000 2x2x2 ANOVAs where all factors are manipulated between individuals, with 50 participants in each condition, and without any true effect (all group means are equal). The R code is at the bottom of this page. We store the p-values of the 7 tests. The total p-value distribution has the by now familiar uniform shape we see if the null hypothesis is true.
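The full simulation code referenced above is not reproduced here, but a minimal sketch of the idea (far fewer replications than the 250,000 in the post, and hypothetical factor names A, B and C) looks something like this:

set.seed(1)
n_sim <- 2000   # the post uses 250,000; reduced here so it runs quickly
n     <- 50     # participants per cell
cells <- expand.grid(A = factor(1:2), B = factor(1:2), C = factor(1:2))

p_vals <- replicate(n_sim, {
  dat   <- cells[rep(seq_len(nrow(cells)), each = n), ]
  dat$y <- rnorm(nrow(dat))                        # no true effects anywhere
  tab   <- summary(aov(y ~ A * B * C, data = dat))[[1]]
  tab[["Pr(>F)"]][1:7]                             # p-values of the 7 tests
})

mean(apply(p_vals, 2, function(p) any(p < 0.05)))  # familywise Type 1 error, ~0.30
mean(p_vals < 0.05)                                # per-test error rate, ~0.05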
 
 
If we count the number of significant findings (even though there is no real effect), we see that from 250,000 2x2x2 ANOVA's, approximately 87,500 p-values were smaller than 0.05 (the left-most bar in the Figure). This equals 250,000 ANOVA's x 0.05 Type 1 errors x 7 tests. If we split up the p-values for each of the 7 tests, we see in the table below that, as expected, each test has its own 5% error rate, which together add up to a 30% error rate due to multiple testing. With a 2x2x2x2 ANOVA, the Type 1 error rate you'll face is a massive 54%, making you about as accurate as a scientist as a coin-flipping toddler.
 
Let's fix this. We need to adjust the error rate. The Bonferroni correction (divide your alpha level by the number of tests, so for 7 tests and alpha = 0.05 use 0.05/7 = 0.007 for each test) communicates the basic idea very well, but the Holm-Bonferroni correction is slightly better. In fields outside of psychology (e.g., economics, gene discovery) work on optimal Type 1 error control procedures continues. I've used the mutoss package in R in my simulations to check a wide range of corrections, and came to the conclusion that unless the number of tests is huge, we don't need anything more fancy than the Holm-Bonferroni (or sequential Bonferroni) correction (please correct me if I'm wrong in the comments!). It orders p-values from lowest to highest, and tests them sequentially against an increasingly more lenient alpha level. If you prefer a spreadsheet, go here.
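In R the Holm-Bonferroni procedure is available out of the box via p.adjust(); a small sketch with made-up p-values for the 7 tests:

p <- c(0.004, 0.012, 0.022, 0.045, 0.110, 0.430, 0.780)  # hypothetical p-values
p.adjust(p, method = "holm")        # Holm-Bonferroni adjusted p-values
p.adjust(p, method = "bonferroni")  # plain Bonferroni, for comparison
# adjusted values below 0.05 remain significant at a 5% familywise error rate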
 
In a 2x2x2 ANOVA, we can test three main effects, three 2-way interactions, and one 3-way interaction. The table below shows the error rate for each of these 7 tests is 5% (for a total of 1-0.95^7 = 30%), but after the Holm-Bonferroni correction the Type 1 error rate is nicely controlled.


However, another challenge is to not let Type 1 error control increase the Type 2 errors too much. To examine this, I’ve simulated 2x2x2 ANOVA’s where there is a true effect. One of the eight cells has a small positive difference, and one has a small negative difference. As a consequence, with sufficient power, we should find 4 significant effects (a main effect, two 2-way interactions, and the 3-way interaction). 
 
Let’s first look at the p-value distribution. I’ve added a horizontal and vertical line. The horizontal line indicates the null-distribution caused by the four null-effects. The vertical line indicates the significance level of 0.05. The two lines create four quarters. Top left are the true positives, bottom left are the false positives, top right are the false negatives (not significant due to a lack of power) and the bottom right are the true negatives.
 
Now let’s plot the adjusted p-values using Holm’s correction (instead of changing the alpha level for each test, we can also keep the alpha fixed, but adjust the p-value).
 
 
We see a substantial drop in the left-most column, and this drop is larger than the height attributable to false positives. We also see a peculiarly high bar on the right, caused by the Holm correction adjusting a large number of p-values to 1. We can see this drop in power in the Table below as well. It's substantial: from 87% power to 68% power.
 
If you perform a 2x2x2 ANOVA, we might expect you are not really interested in the main effects (if you were, a simple t-test would have sufficed). The power cost is already much lower if the exploratory analysis focusses on only four tests: the three 2-way interactions and the 3-way interaction (see the third row in the Table below). Even exploratory 2x2x2 ANOVA's are typically not 100% exploratory. If so, preregistering the subset of all tests you are interested in, and controlling the error rate for this subset of tests, provides an important boost in power.

Oh come on you silly methodological fetishist!
 
If you think Type 1 error control should not endanger the discovery of true effects, here's what you should not do. You should not wave your hands at controlling Type 1 error rates, saying it is ‘methodological fetishism’ (Ellemers, 2013). It ain't gonna work. If you choose to report p-values (by all means, don't), and want to do quantitative science (by all means, don't), then the formal logic you are following (even if you don't realize this) is the Neyman-Pearson approach. It allows you to say: ‘In the long run, I'm not saying there's something, when there is nothing, more than X% of the time’. If you don't control error rates, your epistemic foundation for making statements reduces to ‘In the long run, I'm not saying there's something, when there is nothing, more than … uhm … nice weather for the time of the year, isn't it?’.
 
Now just because you need to control error rates, doesn't mean you need to use a Type 1 error rate of 5%. If you plan to replicate any effect you find in an exploratory study, and you set the alpha to 0.2, the probability of making a Type 1 error twice in a row is 0.2*0.2 = 0.04. If you want to explore four different interactions in a 2x2x2 ANOVA you intend to replicate in any case, setting your overall Type 1 error across two studies to 0.2, and then using an alpha of 0.05 for each of the 4 tests, might be a good idea. If some effects would be costlier to miss, but others less costly, you can use an alpha of 0.08 for two effects, and an alpha of 0.02 for the other two. This is just one example. It's your party. You can easily pre-register the choices you make at the OSF or AsPredicted to transparently communicate them.
 
You can also throw error control out of the window. There are approximately 1,950,000 hits in Google Scholar when I search for ‘An Exploratory Analysis Of’. Put these words in the title, list all your DV's in the main text (e.g., in a table), add Bayesian statistics and effect sizes with their confidence intervals, and don't draw strong conclusions (Bender & Lange, 2001).
 
Obviously, the tricky thing is always what to do if your prediction was not confirmed. I think you enter a Lakatosian degenerative research line (as opposed to the progressive research line you'd be in if your predictions were confirmed). With some luck, there's an easy fix. The same study, but using a larger sample (or, if you designed a study using sequential analyses, simply continuing the data collection after the first look at the data, Lakens, 2014), might get you back in a progressive research line after an update in the predicted effect size. Try again, with a better manipulation or dependent variable. Giving up on a research idea after a single failed confirmation is not how science works, in general. Statistical inferences tell you how to interpret the data without fooling yourself. Type 1 error control matters, and in most psychology experiments, is relatively easy to do. But it's only one aspect of the things you take into account when you decide which research you want to do.
 
My main point here is that there are many possible solutions, and all you have to do is choose one that best fits your goals. Since your goal is very unlikely to be a 30% Type 1 error rate in a single study which you interpret as a 5% Type 1 error rate, you have to do something. There’s a lot of room between 100% exploratory and 100% confirmatory research, and there are many reasonable ideas about what the ‘family’ of errors is you want to control (for a good discussion on this, see Bender & Lange, 2001). I fully support their conclusion (p. 344): “Whatever the decision is, it should clearly be stated why and how the chosen analyses are performed, and which error rate is controlled for”. Clear words, no hand waving.
 

Thanks to @RogierK for correcting an error in an earlier version of this blog post.

 
Bender, R., & Lange, S. (2001). Adjusting for multiple testing—when and how? Journal of Clinical Epidemiology, 54(4), 343–349.
Cramer, A. O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., … Wagenmakers, E.-J. (2014). Hidden multiplicity in multiway ANOVA: Prevalence, consequences, and remedies. arXiv Preprint arXiv:1412.3416. Retrieved from http://arxiv.org/abs/1412.3416

 

Ellemers, N. (2013). Connecting the dots: Mobilizing theory to reveal the big picture in social psychology (and why we should do this): The big picture in social psychology. European Journal of Social Psychology, 43(1), 1–8. http://doi.org/10.1002/ejsp.1932
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses: Sequential analyses. European Journal of Social Psychology, 44(7), 701–710. http://doi.org/10.1002/ejsp.2023

To leave a comment for the author, please follow the link and comment on their blog: The 20% Statistician.


A simple ANOVA


(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

I was browsing Davies' Design and Analysis of Industrial Experiments (second edition, 1967), published for ICI in times when industry did that kind of thing. It is quite an applied book. On page 107 there is an example where the variance of a process is estimated.

Data

Data are from nine batches; from each batch three samples were selected (A, B and C), and each sample was measured in duplicate. I am not sure about the copyright of these data, so I will not reprint them here. The problem is to determine the measurement and sampling error in a chemical process.
library(ggplot2)
ggplot(r4, aes(x = Sample, y = x)) +
    geom_point() +
    facet_wrap(~ batch)

Analysis

At the time of writing the book, the only approach was to do a classical ANOVA and calculate the estimates from there.
library(magrittr)
aov(x ~ batch + batch:Sample, data = r4) %>%
  anova
Analysis of Variance Table

Response: x
             Df Sum Sq Mean Sq  F value  Pr(>F)    
batch         8 792.88  99.110 132.6710 < 2e-16 ***
batch:Sample 18  25.30   1.406   1.8818 0.06675 .  
Residuals    27  20.17   0.747                     

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In this case the residual variation is 0.75. The batch:Sample mean square is, due to the design, twice the sampling variation plus the residual variation; hence the sampling variation is estimated as 0.33. How lucky we are to have tools (lme4) which can do this estimate directly. In this case, as it was a well designed experiment, these estimates are the same as from the ANOVA.
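A quick check of that arithmetic, using the mean squares from the ANOVA table above, before handing the job over to lme4:

ms_sample   <- 1.406              # batch:Sample mean square
ms_residual <- 0.747              # residual mean square
(ms_sample - ms_residual) / 2     # sampling variance estimate, ~0.33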
library(lme4)
l1 <- lmer(x ~ 1 + (1 | batch) + (1 | batch:Sample), data = r4)

summary(l1)
Linear mixed model fit by REML [‘lmerMod’]
Formula: x ~ 1 + (1 | batch) + (1 | batch:Sample)
   Data: r4

REML criterion at convergence: 189.4

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-1.64833 -0.50283 -0.06649  0.55039  1.57801 

Random effects:
 Groups       Name        Variance Std.Dev.
 batch:Sample (Intercept)  0.3294  0.5739  
 batch        (Intercept) 16.2841  4.0354  
 Residual                  0.7470  0.8643  
Number of obs: 54, groups: batch:Sample, 27; batch, 9

Fixed effects:
            Estimate Std. Error t value

(Intercept)   47.148      1.355    34.8

A next step is confidence intervals around the estimates. Davies uses limits from a Chi-squared distribution for the residual variation, leading to a 90% interval of 0.505 to 1.25. In contrast, lme4 has two estimators: profile (computing a likelihood profile and finding the appropriate cutoffs based on the likelihood ratio test) and bootstrap (performing parametric bootstrapping, with confidence intervals computed from the bootstrap distribution according to boot.type). Each of these takes a second or a few on my laptop, not feasible in the pre-computer age. The estimates are different and, to my surprise, narrower:
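The calls producing the intervals below are not shown in the post; they are presumably along these lines (90% intervals, to match Davies):

confint(l1, level = 0.9)                   # profile likelihood intervals
confint(l1, method = "boot", level = 0.9)  # parametric bootstrap intervals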
Computing profile confidence intervals …
                   5 %       95 %
.sig01       0.0000000  0.9623748
.sig02       2.6742109  5.9597328
.sigma       0.7017849  1.1007261
(Intercept) 44.8789739 49.4173227

Computing bootstrap confidence intervals …
                                  5 %       95 %
sd_(Intercept)|batch:Sample  0.000000  0.8880414
sd_(Intercept)|batch         2.203608  5.7998348
sigma                        0.664149  1.0430984

(Intercept)                 45.140652 49.4931109
Davies continues by estimating the ratio of the sampling variation to the residual variation, which was the best available approach at the time. This I won't repeat.

To leave a comment for the author, please follow the link and comment on their blog: Wiekvoet.


R trends in 2015 (based on cranlogs)


(This article was first published on R – G-Forge, and kindly contributed to R-bloggers)
What are the current tRends? The image is CC from coco + kelly.

It is always fun to look back and reflect on the past year. Inspired by Christoph Safferling's post on top packages published in 2015, I decided to have my own go at the top R trends of 2015. Contrary to Safferling's post I'll try to also (1) look at packages from previous years that hit the big league, (2) see what top R coders we have in the community, and then (3) round up with my own 2015 R experience.

Everything in this post is based on the CRANberries reports. To harvest the information I've borrowed shamelessly from Safferling's post, with some modifications. He used the number of downloads as a proxy for the package release date, while I decided to use the release date itself; if that wasn't available I scraped it off the CRAN servers. The script now also retrieves package author(s) and description (see code below for details).

library(rvest)
library(dplyr)
# devtools::install_github("hadley/multidplyr")
library(multidplyr)
library(magrittr)
library(lubridate)
 
getCranberriesElmnt <- function(txt, elmnt_name){
  desc <- grep(sprintf("^%s:", elmnt_name), txt)
  if (length(desc) == 1){
    txt <- txt[desc:length(txt)]
    end <- grep("^[A-Za-z/@]{2,}:", txt[-1])
    if (length(end) == 0)
      end <- length(txt)
    else
      end <- end[1]
 
    desc <-
      txt[1:end] %>% 
      gsub(sprintf("^%s: (.+)", elmnt_name),
           "\\1", .) %>% 
      paste(collapse = " ") %>% 
      gsub("[ ]{2,}", " ", .) %>% 
      gsub(" , ", ", ", .)
  }else if (length(desc) == 0){
    desc <- paste("No", tolower(elmnt_name))
  }else{
    stop("Could not find ", elmnt_name, " in text: \n",
         paste(txt, collapse = "\n"))
  }
  return(desc)
}
 
convertCharset <- function(txt){
  if (grepl("Windows", Sys.info()["sysname"]))
    txt <- iconv(txt, from = "UTF-8", to = "cp1252")
  return(txt)
}
 
getAuthor <- function(txt, package){
  author <- getCranberriesElmnt(txt, "Author")
  if (grepl("No author|See AUTHORS file", author)){
    author <- getCranberriesElmnt(txt, "Maintainer")
  }
 
  if (grepl("(No m|M)aintainer|(No a|A)uthor|^See AUTHORS file", author) || 
      is.null(author) ||
      nchar(author)  <= 2){
    cran_txt <- read_html(sprintf("http://cran.r-project.org/web/packages/%s/index.html",
                                  package))
    author <- cran_txt %>% 
      html_nodes("tr") %>% 
      html_text %>% 
      convertCharset %>% 
      gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>% 
      .[grep("^Author", .)] %>% 
      gsub(".*\n", "", .)
 
    # If not found then the package has probably been
    # removed from the repository
    if (length(author) == 1)
      author <- author
    else
      author <- "No author"
  }
 
  # Remove stuff such as:
  # [cre, auth]
  # (worked on the...)
  # <my@email.com>
  # "John Doe"
  author %<>% 
    gsub("^Author: (.+)", 
         "\\1", .) %>% 
    gsub("[ ]*\\[[^]]{3,}\\][ ]*", " ", .) %>% 
    gsub("\\([^)]+\\)", " ", .) %>% 
    gsub("([ ]*<[^>]+>)", " ", .) %>% 
    gsub("[ ]*\\[[^]]{3,}\\][ ]*", " ", .) %>% 
    gsub("[ ]{2,}", " ", .) %>% 
    gsub("(^[ '\"]+|[ '\"]+$)", "", .) %>% 
    gsub(" , ", ", ", .)
  return(author)
}
 
getDate <- function(txt, package){
  date <- 
    grep("^Date/Publication", txt)
  if (length(date) == 1){
    date <- txt[date] %>% 
      gsub("Date/Publication: ([0-9]{4,4}-[0-9]{2,2}-[0-9]{2,2}).*",
           "\\1", .)
  }else{
    cran_txt <- read_html(sprintf("http://cran.r-project.org/web/packages/%s/index.html",
                                  package))
    date <- 
      cran_txt %>% 
      html_nodes("tr") %>% 
      html_text %>% 
      convertCharset %>% 
      gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>% 
      .[grep("^Published", .)] %>% 
      gsub(".*\n", "", .)
 
 
    # The main page doesn't contain the original date if 
    # new packages have been submitted, we therefore need
    # to check first entry in the archives
    if(cran_txt %>% 
       html_nodes("tr") %>% 
       html_text %>% 
       gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>% 
       grepl("^Old.{1,4}sources", .) %>% 
       any){
      archive_txt <- read_html(sprintf("http://cran.r-project.org/src/contrib/Archive/%s/",
                                       package))
      pkg_date <- 
        archive_txt %>% 
        html_nodes("tr") %>% 
        lapply(function(x) {
          nodes <- html_nodes(x, "td")
          if (length(nodes) == 5){
            return(nodes[3] %>% 
                     html_text %>% 
                     as.Date(format = "%d-%b-%Y"))
          }
        }) %>% 
        .[sapply(., length) > 0] %>% 
        .[!sapply(., is.na)] %>% 
        head(1)
 
      if (length(pkg_date) == 1)
        date <- pkg_date[[1]]
    }
  }
  date <- tryCatch({
    as.Date(date)
  }, error = function(e){
    "Date missing"
  })
  return(date)
}
 
getNewPkgStats <- function(published_in){
  # The parallel is only for making cranlogs requests
  # we can therefore have more cores than actual cores
  # as this isn't processor intensive while there is
  # considerable wait for each http-request
  cl <- create_cluster(parallel::detectCores() * 4)
  parallel::clusterEvalQ(cl, {
    library(cranlogs)
  })
  set_default_cluster(cl)
  on.exit(stop_cluster())
 
  berries <- read_html(paste0("http://dirk.eddelbuettel.com/cranberries/", published_in, "/"))
  pkgs <- 
    # Select the divs of the package class
    html_nodes(berries, ".package") %>% 
    # Extract the text
    html_text %>% 
    # Split the lines
    strsplit("[\n]+") %>% 
    # Now clean the lines
    lapply(.,
           function(pkg_txt) {
             pkg_txt[sapply(pkg_txt, function(x) { nchar(gsub("^[ \t]+", "", x)) > 0}, 
                            USE.NAMES = FALSE)] %>% 
               gsub("^[ \t]+", "", .) 
           })
 
  # Now we select the new packages
  new_packages <- 
    pkgs %>% 
    # The first line is key as it contains the text "New package"
    sapply(., function(x) x[1], USE.NAMES = FALSE) %>% 
    grep("^New package", .) %>% 
    pkgs[.] %>% 
    # Now we extract the package name and the date that it was published
    # and merge everything into one table
    lapply(function(txt){
      txt <- convertCharset(txt)
      ret <- data.frame(
        name = gsub("^New package ([^ ]+) with initial .*", 
                     "\\1", txt[1]),
        stringsAsFactors = FALSE
      )
 
      ret$desc <- getCranberriesElmnt(txt, "Description")
      ret$author <- getAuthor(txt, ret$name)
      ret$date <- getDate(txt, ret$name)
 
      return(ret)
    }) %>% 
    rbind_all %>% 
    # Get the download data in parallel
    partition(name) %>% 
    do({
      down <- cran_downloads(.$name[1], 
                             from = max(as.Date("2015-01-01"), .$date[1]), 
                             to = "2015-12-31")$count 
      cbind(.[1,],
            data.frame(sum = sum(down), 
                       avg = mean(down))
      )
    }) %>% 
    collect %>% 
    ungroup %>% 
    arrange(desc(avg))
 
  return(new_packages)
}
 
pkg_list <- 
  lapply(2010:2015,
         getNewPkgStats)
 
pkgs <- 
  rbind_all(pkg_list) %>% 
  mutate(time = as.numeric(as.Date("2016-01-01") - date),
         year = format(date, "%Y"))

Downloads and time on CRAN

The longer a package has been on CRAN, the more it gets downloaded. We can illustrate this using simple linear regression; slightly surprisingly, the relationship behaves mostly linearly:

pkgs %<>% 
  mutate(time_yrs = time/365.25)
fit <- lm(avg ~ time_yrs, data = pkgs)
 
# Test for non-linearity
library(splines)
anova(fit,
      update(fit, .~.-time_yrs+ns(time_yrs, 2)))
Analysis of Variance Table

Model 1: avg ~ time
Model 2: avg ~ ns(time, 2)
  Res.Df       RSS Df Sum of Sq      F Pr(>F)
1   7348 189661922                           
2   7347 189656567  1    5355.1 0.2075 0.6488

The average number of downloads increases by about 5 downloads per year on CRAN. It can easily be argued that the average number of downloads isn't that interesting since the data are skewed; we can therefore also look at the upper quantiles using quantile regression:

library(quantreg)
library(htmlTable)
lapply(c(.5, .75, .95, .99),
       function(tau){
         rq_fit <- rq(avg ~ time_yrs, data = pkgs, tau = tau)
         rq_sum <- summary(rq_fit)
         c(Estimate = txtRound(rq_sum$coefficients[2, 1], 1), 
           `95 % CI` = txtRound(rq_sum$coefficients[2, 1] + 
                                        c(1,-1) * rq_sum$coefficients[2, 2], 1) %>% 
             paste(collapse = " to "))
       }) %>% 
  do.call(rbind, .) %>% 
  htmlTable(rnames = c("Median",
                       "Upper quartile",
                       "Top 5%",
                       "Top 1%"))
                 Estimate   95 % CI
Median                0.6   0.6 to 0.6
Upper quartile        1.2   1.2 to 1.1
Top 5%                9.7   11.9 to 7.6
Top 1%              182.5   228.2 to 136.9

The above table conveys a slightly more interesting picture. Most packages don’t get that much attention while the top 1% truly reach the masses.

Top downloaded packages

In order to investigate what packages R users have been using during 2015 I've looked at all new packages since the turn of the decade. Since each year of CRAN-presence increases the download rates, I've split the table by the package release dates. The results are available for browsing below (yes – it is the brand new interactive htmlTable that allows you to collapse cells – note it may not work if you are reading this on R-bloggers, and the link is lost under certain circumstances).

Name  Author  Total downloads  Average/day  Description
Top 10 packages published in 2015
xml2 Hadley Wickham, Jeroen Ooms, RStudio, R Foundation 348,222 1635 Work with XML files …
rversions Gabor Csardi 386,996 1524 Query the main R SVN…
git2r Stefan Widgren 411,709 1303 Interface to the lib…
praise Gabor Csardi, Sindre Sorhus 96,187 673 Build friendly R pac…
readxl David Hoerl 99,386 379 Import excel files i…
readr Hadley Wickham, Romain Francois, R Core Team, RStudio 90,022 337 Read flat/tabular te…
DiagrammeR Richard Iannone 84,259 236 Create diagrams and …
visNetwork Almende B.V. (vis.js library in htmlwidgets/lib, 41,185 233 Provides an R interf…
plotly Carson Sievert, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, Pedro Despouy 9,745 217 Easily translate ggp…
DT Yihui Xie, Joe Cheng, jQuery contributors, SpryMedia Limited, Brian Reavis, Leon Gersen, Bartek Szopka, RStudio Inc 24,806 120 Data objects in R ca…
Top 10 packages published in 2014
stringi Marek Gagolewski and Bartek Tartanus ; IBM and other contributors ; Unicode, Inc. 1,316,900 3608 stringi allows for v…
magrittr Stefan Milton Bache and Hadley Wickham 1,245,662 3413 Provides a mechanism…
mime Yihui Xie 1,038,591 2845 This package guesses…
R6 Winston Chang 920,147 2521 The R6 package allow…
dplyr Hadley Wickham, Romain Francois 778,311 2132 A fast, consistent t…
manipulate JJ Allaire, RStudio 626,191 1716 Interactive plotting…
htmltools RStudio, Inc. 619,171 1696 Tools for HTML gener…
curl Jeroen Ooms 599,704 1643 The curl() function …
lazyeval Hadley Wickham, RStudio 572,546 1569 A disciplined approa…
rstudioapi RStudio 515,665 1413 This package provide…
Top 10 packages published in 2013
jsonlite Jeroen Ooms, Duncan Temple Lang 906,421 2483 This package is a fo…
BH John W. Emerson, Michael J. Kane, Dirk Eddelbuettel, JJ Allaire, and Romain Francois 691,280 1894 Boost provides free …
highr Yihui Xie and Yixuan Qiu 641,052 1756 This package provide…
assertthat Hadley Wickham 527,961 1446 assertthat is an ext…
httpuv RStudio, Inc. 310,699 851 httpuv provides low-…
NLP Kurt Hornik 270,682 742 Basic classes and me…
TH.data Torsten Hothorn 242,060 663 Contains data sets u…
NMF Renaud Gaujoux, Cathal Seoighe 228,807 627 This package provide…
stringdist Mark van der Loo 123,138 337 Implements the Hammi…
SnowballC Milan Bouchet-Valat 104,411 286 An R interface to th…
Top 10 packages published in 2012
gtable Hadley Wickham 1,091,440 2990 Tools to make it eas…
knitr Yihui Xie 792,876 2172 This package provide…
httr Hadley Wickham 785,568 2152 Provides useful tool…
markdown JJ Allaire, Jeffrey Horner, Vicent Marti, and Natacha Porte 636,888 1745 Markdown is a plain-…
Matrix Douglas Bates and Martin Maechler 470,468 1289 Classes and methods …
shiny RStudio, Inc. 427,995 1173 Shiny makes it incre…
lattice Deepayan Sarkar 414,716 1136 Lattice is a powerfu…
pkgmaker Renaud Gaujoux 225,796 619 This package provide…
rngtools Renaud Gaujoux 225,125 617 This package contain…
base64enc Simon Urbanek 223,120 611 This package provide…
Top 10 packages published in 2011
scales Hadley Wickham 1,305,000 3575 Scales map data to a…
devtools Hadley Wickham 738,724 2024 Collection of packag…
RcppEigen Douglas Bates, Romain Francois and Dirk Eddelbuettel 634,224 1738 R and Eigen integrat…
fpp Rob J Hyndman 583,505 1599 All data sets requir…
nloptr Jelmer Ypma 583,230 1598 nloptr is an R inter…
pbkrtest Ulrich Halekoh Søren Højsgaard 536,409 1470 Test in linear mixed…
roxygen2 Hadley Wickham, Peter Danenberg, Manuel Eugster 478,765 1312 A Doxygen-like in-so…
whisker Edwin de Jonge 413,068 1132 logicless templating…
doParallel Revolution Analytics 299,717 821 Provides a parallel …
abind Tony Plate and Richard Heiberger 255,151 699 Combine multi-dimens…
Top 10 packages published in 2010
reshape2 Hadley Wickham 1,395,099 3822 Reshape lets you fle…
labeling Justin Talbot 1,104,986 3027 Provides a range of …
evaluate Hadley Wickham 862,082 2362 Parsing and evaluati…
formatR Yihui Xie 640,386 1754 This package provide…
minqa Katharine M. Mullen, John C. Nash, Ravi Varadhan 600,527 1645 Derivative-free opti…
gridExtra Baptiste Auguie 581,140 1592 misc. functions
memoise Hadley Wickham 552,383 1513 Cache the results of…
RJSONIO Duncan Temple Lang 414,373 1135 This is a package th…
RcppArmadillo Romain Francois and Dirk Eddelbuettel 410,368 1124 R and Armadillo inte…
xlsx Adrian A. Dragulescu 401,991 1101 Provide R functions …


Just as Safferling et al. noted, there is a dominance of technical packages. This is hardly surprising since the majority of the work is data munging. Among these technical packages there are quite a few that are used for developing other packages, e.g. roxygen2, pkgmaker, devtools, and more.

R-star authors

Just for fun I decided to look at who has the most downloads. By splitting multi-author packages into their individual authors, and also splitting the downloads between them, we find that in 2015 the top R coders were:

top_coders <- list(
  "2015" = 
    pkgs %>% 
    filter(format(date, "%Y") == 2015) %>% 
    partition(author) %>% 
    do({
      authors <- strsplit(.$author, "[ ]*([,;]| and )[ ]*")[[1]]
      authors <- authors[!grepl("^[ ]*(Inc|PhD|Dr|Lab).*[ ]*$", authors)]
      if (length(authors) >= 1){
        # If there are multiple authors, the statistic is split among
        # them, but with an added 20% for the extra collaboration
        # effort that a multi-author environment calls for
        .$sum <- round(.$sum/length(authors)*1.2)
        .$avg <- .$avg/length(authors)*1.2
        ret <- .
        ret$author <- authors[1]
        for (m in authors[-1]){
          tmp <- .
          tmp$author <- m
          ret <- rbind(ret, tmp)
        }
        return(ret)
      }else{
        return(.)
      }
    }) %>% 
    collect() %>% 
    group_by(author) %>% 
    summarise(download_ave = round(sum(avg)),
              no_packages = n(),
              packages = paste(name, collapse = ", ")) %>% 
    select(author, download_ave, no_packages, packages) %>% 
    collect() %>% 
    arrange(desc(download_ave)) %>% 
    head(10),
  "all" =
    pkgs %>% 
    partition(author) %>% 
    do({
      authors <- strsplit(.$author, "[ ]*([,;]| and )[ ]*")[[1]]
      authors <- authors[!grepl("^[ ]*(Inc|PhD|Dr|Lab).*[ ]*$", authors)]
      if (length(authors) >= 1){
        # If there are multiple authors, the statistic is split among
        # them, but with an added 20% for the extra collaboration
        # effort that a multi-author environment calls for
        .$sum <- round(.$sum/length(authors)*1.2)
        .$avg <- .$avg/length(authors)*1.2
        ret <- .
        ret$author <- authors[1]
        for (m in authors[-1]){
          tmp <- .
          tmp$author <- m
          ret <- rbind(ret, tmp)
        }
        return(ret)
      }else{
        return(.)
      }
    }) %>% 
    collect() %>% 
    group_by(author) %>% 
    summarise(download_ave = round(sum(avg)),
              no_packages = n(),
              packages = paste(name, collapse = ", ")) %>% 
    select(author, download_ave, no_packages, packages) %>% 
    collect() %>% 
    arrange(desc(download_ave)) %>% 
    head(30))
 
interactiveTable(
  do.call(rbind, top_coders) %>% 
    mutate(download_ave = txtInt(download_ave)),
  align = "lrr",
  header = c("Coder", "Total ave. downloads per day", "No. of packages", "Packages"),
  tspanner = c("Top coders 2015",
               "Top coders 2010-2015"),
  n.tspanner = sapply(top_coders, nrow),
  minimized.columns = 4, 
  rnames = FALSE, 
  col.rgroup = c("white", "#F0F0FF"))
Coder Total ave. downloads No. of packages Packages
Top coders 2015
Gabor Csardi 2,312 11 sankey, franc, rvers…
Stefan Widgren 1,563 1 git2r
RStudio 781 16 shinydashboard, with…
Hadley Wickham 695 12 withr, cellranger, c…
Jeroen Ooms 541 10 rjade, js, sodium, w…
Richard Cotton 501 22 assertive.base, asse…
R Foundation 490 1 xml2
David Hoerl 455 1 readxl
Sindre Sorhus 409 2 praise, clisymbols
Richard Iannone 294 2 DiagrammeR, stationa…
Top coders 2010-2015
Hadley Wickham 32,115 55 swirl, lazyeval, ggp…
Yihui Xie 9,739 18 DT, Rd2roxygen, high…
RStudio 9,123 25 shinydashboard, lazy…
Jeroen Ooms 4,221 25 JJcorr, gdtools, bro…
Justin Talbot 3,633 1 labeling
Winston Chang 3,531 17 shinydashboard, font…
Gabor Csardi 3,437 26 praise, clisymbols, …
Romain Francois 2,934 20 int64, LSD, RcppExam…
Duncan Temple Lang 2,854 6 RMendeley, jsonlite,…
Adrian A. Dragulescu 2,456 2 xlsx, xlsxjars
JJ Allaire 2,453 7 manipulate, htmlwidg…
Simon Urbanek 2,369 15 png, fastmatch, jpeg…
Dirk Eddelbuettel 2,094 33 Rblpapi, RcppSMC, RA…
Stefan Milton Bache 2,069 3 import, blatr, magri…
Douglas Bates 1,966 5 PKPDmodels, RcppEige…
Renaud Gaujoux 1,962 6 NMF, doRNG, pkgmaker…
Jelmer Ypma 1,933 2 nloptr, SparseGrid
Rob J Hyndman 1,933 3 hts, fpp, demography
Baptiste Auguie 1,924 2 gridExtra, dielectri…
Ulrich Halekoh Søren Højsgaard 1,764 1 pbkrtest
Martin Maechler 1,682 11 DescTools, stabledis…
Mirai Solutions GmbH 1,603 3 XLConnect, XLConnect…
Stefan Widgren 1,563 1 git2r
Edwin de Jonge 1,513 10 tabplot, tabplotGTK,…
Kurt Hornik 1,476 12 movMF, ROI, qrmtools…
Deepayan Sarkar 1,369 4 qtbase, qtpaint, lat…
Tyler Rinker 1,203 9 cowsay, wakefield, q…
Yixuan Qiu 1,131 12 gdtools, svglite, hi…
Revolution Analytics 1,011 4 doParallel, doSMP, r…
Torsten Hothorn 948 7 MVA, HSAUR3, TH.data…

It is worth mentioning that two of the top coders are companies, RStudio and Revolution Analytics. While I like the fact that R is free and open-source, I doubt that the community would have grown as quickly as it has without these companies. It is also symptomatic of 2015 that companies are paying attention to R; it will be interesting to see what the R Consortium will bring to the community. I think r-hub is incredibly interesting and will hopefully make my life as an R package developer easier.

My own 2015-R-experience

My own personal R experience has been dominated by magrittr and dplyr, as seen in the code above. Like most users, I find that magrittr makes things a little easier to read, and unless I have some really large dataset the overhead is small. It does have some downsides related to debugging, but these are negligible.

When I originally tried dplyr I came from the plyr environment and was disappointed by the lack of parallelization, and the concepts felt a little odd when approached the plyr way. I had been using sqldf a lot for my data munging and merging, but when I found left_join, inner_join, and the brilliant anti_join I was completely sold. Combined with RStudio, I find the dplyr workflow both more intuitive and more productive than my previous one.
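
For readers who haven't made the same switch, here is a minimal sketch of the three join verbs mentioned above, using a tiny made-up pair of data frames (the names and values are illustrative only, nothing to do with the data used elsewhere in this post):

library(dplyr)

# toy tables: every person, and the purchases some of them made
people    <- data_frame(id = 1:4, name = c("Ann", "Bob", "Cat", "Dan"))
purchases <- data_frame(id = c(1, 1, 3), amount = c(100, 250, 80))

left_join(people, purchases, by = "id")   # keep every person, add purchases where they exist
inner_join(people, purchases, by = "id")  # only people who made at least one purchase
anti_join(people, purchases, by = "id")   # people with no purchase at all - the one sqldf made hard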

When looking at those packages (including more than just the top 10 here) I did find some additional gems that I intend to look into when I have the time:

  • DiagrammeR An interesting new way of producing diagrams. I’ve used it for gantt charts but it allows for much more.
  • checkmate A neat package for checking function arguments.
  • covr An excellent package for measuring how much of a package’s code is exercised by its tests.
  • rex A package for making regular expressions easier.
  • openxlsx I wish I didn’t have to but I still get a lot of things in Excel-format – perhaps this package solves the Excel-import inferno…
  • R6 The successor to reference classes – after working with the Gmisc::Transition-class I appreciate the need for a better system.


To leave a comment for the author, please follow the link and comment on their blog: R – G-Forge.


Filling in the gaps – highly granular estimates of income and population for New Zealand from survey data


(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Individual-level estimates from survey data

I was motivated by web apps like the British Office of National Statistics’ How well do you know your area? and How well does your job pay? to see if I could turn the New Zealand Income Survey into an individual-oriented estimate of income given age group, qualification, occupation, ethnicity, region and hours worked. My tentative go at this is embedded below, and there’s also a full screen version available.

The job’s a tricky one because the survey data available doesn’t go to anywhere near that level of granularity. It could be done with census data of course, but any such effort to publish would come up against confidentiality problems – there are just too few people in any particular combination of category to release real data there. So some kind of modelling is required that can smooth over the actual data but still give a plausible and realistic estimate.

I also wanted to emphasise the distribution of income, not just a single measure like the mean or median – something I think we statisticians should do much more often, with all sorts of variables. And in particular I wanted to find a good way of dealing with the significant number of people in many categories (particularly but not only “no occupation”) who have zero income; and also the people who have negative income in any given week.

My data source is the New Zealand Income Survey 2011 simulated record file published by Statistics New Zealand. An earlier post by me describes how I accessed this, normalised it and put it into a database. I’ve also written several posts about dealing with the tricky distribution of individual incomes, listed here under the “NZIS2011” heading.

This is a longer post than usual, with a digression into the use of Random Forests ™ to predict continuous variables, an attempt at producing a more polished plot of a regression tree than is usually available, and some reflections on the strengths and weaknesses of several different approaches to estimating distributions.

Data import and shape

I begin by setting up the environment and importing the data I’d placed in the data base in that earlier post. There’s a big chunk of R packages needed for all the things I’m doing here. I also re-create some helper functions for transforming skewed continuous variables that include zero and negative values, which I first created in another post back in September 2015.

#------------------setup------------------------
library(showtext)
library(RMySQL)
library(ggplot2)
library(scales)
library(MASS) # for stepAIC.  Needs to be before dplyr to avoid "select" namespace clash
library(dplyr)
library(tidyr)
library(stringr)
library(gridExtra)
library(GGally)

library(rpart)
library(rpart.plot)   # for prp()
library(caret)        # for train()
library(partykit)     # for plot(as.party())
library(randomForest)

# library(doMC)         # for multicore processing with caret, on Linux only

library(h2o)


library(xgboost)
library(Matrix)
library(data.table)


library(survey) # for rake()


font.add.google("Poppins", "myfont")
showtext.auto()
theme_set(theme_light(base_family = "myfont"))

PlayPen <- dbConnect(RMySQL::MySQL(), username = "analyst", dbname = "nzis11")


#------------------transformation functions------------
# helper functions for transformations of skewed data that crosses zero.  See 
# http://ellisp.github.io/blog/2015/09/07/transforming-breaks-in-a-scale/
.mod_transform <- function(y, lambda){
   if(lambda != 0){
      yt <- sign(y) * (((abs(y) + 1) ^ lambda - 1) / lambda)
   } else {
      yt = sign(y) * (log(abs(y) + 1))
   }
   return(yt)
}


.mod_inverse <- function(yt, lambda){
   if(lambda != 0){
      y <- ((abs(yt) * lambda + 1)  ^ (1 / lambda) - 1) * sign(yt)
   } else {
      y <- (exp(abs(yt)) - 1) * sign(yt)
      
   }
   return(y)
}

# parameter for reshaping - equivalent to sqrt:
lambda <- 0.5
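
As a quick sanity check (my own addition, not in the original post), the two helpers should invert each other for values on both sides of zero:

x  <- c(-500, -1, 0, 1, 750)
xt <- .mod_transform(x, lambda)
.mod_inverse(xt, lambda)   # recovers -500, -1, 0, 1, 750 (up to floating point error)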

Importing the data is a straightforward SQL query, with some reshaping required because survey respondents were allowed to specify either one or two ethnicities. This means I need an indicator column for each individual ethnicity if I’m going to include ethnicity in any meaningful way (for example, an “Asian” column with “Yes” or “No” for each survey respondent). Wickham’s {dplyr} and {tidyr} packages handle this sort of thing easily.

#---------------------------download and transform data--------------------------
# This query will include double counting of people with multiple ethnicities
sql <-
"SELECT sex, agegrp, occupation, qualification, region, hours, income, 
         a.survey_id, ethnicity FROM
   f_mainheader a                                               JOIN
   d_sex b           on a.sex_id = b.sex_id                     JOIN
   d_agegrp c        on a.agegrp_id = c.agegrp_id               JOIN
   d_occupation e    on a.occupation_id = e.occupation_id       JOIN
   d_qualification f on a.qualification_id = f.qualification_id JOIN
   d_region g        on a.region_id = g.region_id               JOIN
   f_ethnicity h     on h.survey_id = a.survey_id               JOIN
   d_ethnicity i     on h.ethnicity_id = i.ethnicity_id
   ORDER BY a.survey_id, ethnicity"

orig <- dbGetQuery(PlayPen, sql) 
dbDisconnect(PlayPen)

# ...so we spread into wider format with one column per ethnicity
nzis <- orig %>%
   mutate(ind = TRUE) %>%
   spread(ethnicity, ind, fill = FALSE) %>%
   select(-survey_id) %>%
   mutate(income = .mod_transform(income, lambda = lambda))

for(col in unique(orig$ethnicity)){
   nzis[ , col] <- factor(ifelse(nzis[ , col], "Yes", "No"))
}

# in fact, we want all characters to be factors
for(i in 1:ncol(nzis)){
   if(class(nzis[ , i]) == "character"){
      nzis[ , i] <- factor(nzis[ , i])
   }
}

names(nzis)[11:14] <- c("MELAA", "Other", "Pacific", "Residual")

After reshaping ethnicity and transforming the income data into something a little less skewed (so measures of prediction accuracy like root mean square error are not going to be dominated by the high values), I split my data into training and test sets, with 80 percent of the sample in the training set.

set.seed(234)
nzis$use <- ifelse(runif(nrow(nzis)) > 0.8, "Test", "Train")
trainData <- nzis %>% filter(use == "Train") %>% select(-use)
trainY <- trainData$income
testData <- nzis %>% filter(use == "Test") %>% select(-use)
testY <- testData$income

Modelling income

The first job is to get a model that can estimate income for any arbitrary combination of the explanatory variables: hours worked, occupation, qualification, age group, ethnicity (x 7) and region. I worked through five or six different ways of doing this before eventually settling on Random Forests, which had the right combination of convenience and accuracy.

Regression tree

My first crude baseline is a single regression tree. I didn’t seriously expect this to work particularly well, but treated it as an interim measure before moving to a random forest. I use the train() function from the {caret} package to determine the best value for the complexity parameter (cp) – the minimum improvement in overall R-squared needed before a split is made. The best single tree is shown below.

One nice feature of regression trees – so long as they aren’t too large to see all at once – is usually their easy interpretability. Unfortunately this goes a bit by the wayside because I’m using a transformed version of income, and the tree is returning the mean of that transformed version. When I reverse the transform back into dollars I get a dollar number that is in effect the squared mean of the square root of the original income in a particular category; which happens to generally be close to the median, hence the somewhat obscure footnote in the bottom right corner of the plot above. It’s a reasonable measure of the centre in any particular group, but not one I’d relish explaining to a client.
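
To see why that back-transformed figure sits well below the ordinary mean, here is a small simulation (my own illustration with made-up log-normal “incomes” and a spike at zero, not the NZIS data), reusing the .mod_transform()/.mod_inverse() helpers and lambda defined in the setup section:

set.seed(42)
fake_income <- c(rep(0, 3000), rlnorm(7000, meanlog = 6, sdlog = 1))  # right-skewed, 30% zeros

mean(fake_income)      # ordinary mean, pulled up by the long right tail
median(fake_income)
# mean taken on the transformed scale, then back-transformed:
.mod_inverse(mean(.mod_transform(fake_income, lambda)), lambda)

On this simulated data the back-transformed value lands far below the ordinary mean and close to the median; exactly where it falls relative to the median depends on the shape of the distribution.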

Following the tree through, we see that

  • the overall centre of the data is $507 income per week
  • for people who work less than 23 hours, it goes down to $241; and those who work 23 or more hours receive $994.
  • of those who work few hours, if they are a community and personal service worker, labourer, of no occupation, or in a residual category occupation, their average income is $169; for all other occupations it is $477.
  • of those people who work few hours and are in the low paying occupations (including no occupation), those aged 15 – 19 receive $28 per week and those in other categories $214 per week.
  • and so on.

It takes a bit of effort to look at this plot and work out what is going on (and the abbreviated occupation labels don’t help sorry), but it’s possible once you’ve got the hang of it. Leftwards branches always receive less income than rightwards branches; the split is always done on only one variable at a time, and the leftwards split label is slightly higher on the page than the rightwards split label.

Trees are a nice tool for this sort of data because they can capture fairly complex interactions in a very flexible way. Where they’re weaker is in dealing with relationships between continuous variables that can be smoothly modelled by simple arithmetic – that’s when more traditional regression methods, or model-tree combinations, prove useful.

The code that fitted and plotted this tree (using the wonderful and not-used-enough prp() function that allows considerable control and polish of rpart trees) is below.

#---------------------modelling with a single tree---------------
# single tree, with factors all grouped together
set.seed(234)

# Determine the best value of cp via cross-validation
# set up parallel processing to make this faster, for this and future use of train()
# registerDoMC(cores = 3) # linux only
rpartTune <- train(income ~., data = trainData,
                     method = "rpart",
                     tuneLength = 10,
                     trControl = trainControl(method = "cv"))

rpartTree <- rpart(income ~ ., data = trainData, 
                   control = rpart.control(cp = rpartTune$bestTune),
                   method = "anova")


node.fun1 <- function(x, labs, digits, varlen){
   paste0("$", round(.mod_inverse(x$frame$yval, lambda = lambda), 0))
}

# exploratory plot only - not for dissemination:
# plot(as.party(rpartTree))

svg("..http://ellisp.github.io/img/0026-polished-tree.svg", 12, 10)
par(fg = "blue", family = "myfont")

prp(rpartTree, varlen = 5, faclen = 7, type = 4, extra = 1, 
    under = TRUE, tweak = 0.9, box.col = "grey95", border.col = "grey92",
    split.font = 1, split.cex = 0.8, eq = ": ", facsep = " ",
    branch.col = "grey85", under.col = "lightblue",
    node.fun = node.fun1)

grid.text("New Zealanders' income in one week in 2011", 0.5, 0.89,
          gp = gpar(fontfamily = "myfont", fontface = "bold"))  

grid.text("Other factors considered: qualification, region, ethnicity.",
          0.8, 0.2, 
          gp = gpar(fontfamily = "myfont", cex = 0.8))

grid.text("$ numbers in blue are 'average' weekly income:nsquared(mean(sign(sqrt(abs(x)))))nwhich is a little less than the median.",
          0.8, 0.1, 
          gp = gpar(fontfamily = "myfont", cex = 0.8, col = "blue"))

dev.off()

(Note: while working on this post I used several different machines at different times, including a Linux server, which is much easier than Windows for parallel processing. I’ve commented out the Linux-only bits of code so it should all be fully portable.)
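
If you do want the train() calls to run in parallel on Windows as well, one cross-platform option (my suggestion, not something used in the original post) is the {doParallel} backend with an explicit cluster; caret picks up whatever foreach backend is registered:

library(doParallel)
cl <- makeCluster(3)      # three worker processes; adjust for your machine
registerDoParallel(cl)
# ... run the train() calls from this post here ...
stopCluster(cl)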

The success rates of the various modelling methods in predicting income in the test data I put aside will be shown all in one part of this post, later.

A home-made random spinney (not forest…)

Regression trees have high variance. Basically, they are unstable, and vulnerable to influential small pockets of data changing them quite radically. The solution to this problem is to generate an ensemble of different trees and take the average prediction. The two most commonly used methods are:

  • “bagging” or bootstrap aggregation, which involves resampling from the data and fitting trees to the resamples
  • Random Forests (trademark of Breiman and Cutler), which resamples rows from the data and also restricts the number of variables to a different subset of variables for each split.

Gradient boosting can also be seen as a variant in this class of solutions, but I think it takes a sufficiently different approach that I leave it until further down the post.

Bagging is probably an appropriate method here given the relatively small number of explanatory variables, but to save space in an already grossly over-long post I’ve left it out.
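
For completeness, here is a bare-bones sketch of what bagging would look like with the objects already defined above (my own illustration, not code from the post; the number of trees and the cp value are arbitrary choices that would normally be tuned, for example with train() as earlier):

# bagging: B trees, each fitted to a bootstrap resample of the training data,
# with predictions averaged across the trees
B <- 25
bagged_trees <- lapply(1:B, function(b){
   idx <- sample(1:nrow(trainData), nrow(trainData), replace = TRUE)
   rpart(income ~ ., data = trainData[idx, ], method = "anova",
         control = rpart.control(cp = 0.001))
})
bagged_preds <- rowMeans(sapply(bagged_trees, predict, newdata = testData))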

Random Forests ™ are a subset of the broader group of ensemble tree techniques known as “random decision forests”, and I set out to explore one variant of random decision forests visually (I’m a very visual person – if I can’t make a picture or movie of something happening I can’t understand it). The animation below shows an ensemble of 50 differing trees, where each tree was fitted to a set of data sampled with replacement from the original data, and each tree was also restricted to just three randomly chosen variables. Note that this differs from a Random Forest, where the restriction differs for each split within a tree, rather than being a restriction for the tree as a whole.

Here’s how I generated my spinney of regression trees. Some of this code depends on a particular folder structure. The basic strategy is to

  • work out which variables have the most crude explanatory power
  • subset the data
  • subset the variables, choosing those with good explanatory power more often than the weaker ones
  • use cross-validation to work out the best tuning for the complexity parameter
  • fit the best tree possible with our subset of data and variables
  • draw an image, with appropriate bits of commentary and labelling added to it, and save it for later
  • repeat the above 50 times, and then knit all the images into an animated GIF using ImageMagick.
#----------home made random decision forest--------------
# resample both rows and columns, as in a random decision forest,
# and draw a picture for each fitted tree.  Knit these
# into an animation.  Note this isn't quite the same as a random forest (tm).

# define the candidate variables
variables <- c("sex", "agegrp", "occupation", "qualification",
               "region", "hours", "Maori")

# estimate the value of the individual variables, one at a time

var_weights <- data_frame(var = variables, r2 = 0)
for(i in 1:length(variables)){
   tmp <- trainData[ , c("income", variables[i])]
   if(variables[i] == "hours"){
      tmp$hours <- sqrt(tmp$hours)
   }
   tmpmod <- lm(income ~ ., data = tmp)
   var_weights[i, "r2"] <- summary(tmpmod)$adj.r.squared
}

svg("..http://ellisp.github.io/img/0026-variables.svg", 8, 6)
print(
   var_weights %>%
   arrange(r2) %>%
   mutate(var = factor(var, levels = var)) %>%
   ggplot(aes(y = var, x = r2)) +
   geom_point() +
   labs(x = "Adjusted R-squared from one-variable regression",
        y = "",
        title = "Effectiveness of one variable at a time in predicting income")
)
dev.off()


n <- nrow(trainData)

home_made_rf <- list()
reps <- 50

commentary <- str_wrap(c(
   "This animation illustrates the use of an ensemble of regression trees to improve estimates of income based on a range of predictor variables.",
   "Each tree is fitted on a resample with replacement from the original data; and only three variables are available to the tree.",
   "The result is that each tree will have a different but still unbiased forecast for a new data point when a prediction is made.  Taken together, the average prediction is still unbiased and has less variance than the prediction of any single tree.",
   "This method is similar but not identical to a Random Forest (tm).  In a Random Forest, the choice of variables is made at each split in a tree rather than for the tree as a whole."
   ), 50)


set.seed(123)
for(i in 1:reps){
   
   these_variables <- sample(var_weights$var, 3, replace = FALSE, prob = var_weights$r2)
   
   this_data <- trainData[
      sample(1:n, n, replace = TRUE),
      c(these_variables, "income")
   ]
   
   
   
   this_rpartTune <- train(this_data[,1:3], this_data[,4],
                      method = "rpart",
                      tuneLength = 10,
                      trControl = trainControl(method = "cv"))
   
   
   
   home_made_rf[[i]] <- rpart(income ~ ., data = this_data, 
                      control = rpart.control(cp = this_rpartTune$bestTune),
                      method = "anova")
 
   png(paste0("_output/0026_random_forest/", 1000 + i, ".png"), 1200, 1000, res = 100)  
      par(fg = "blue", family = "myfont")
      prp(home_made_rf[[i]], varlen = 5, faclen = 7, type = 4, extra = 1, 
          under = TRUE, tweak = 0.9, box.col = "grey95", border.col = "grey92",
          split.font = 1, split.cex = 0.8, eq = ": ", facsep = " ",
          branch.col = "grey85", under.col = "lightblue",
          node.fun = node.fun1, mar = c(3, 1, 5, 1))
      
      grid.text(paste0("Variables available to this tree: ", 
                      paste(these_variables, collapse = ", "))
                , 0.5, 0.90,
                gp = gpar(fontfamily = "myfont", cex = 0.8, col = "darkblue"))
      
      grid.text("One tree in a random spinney - three randomly chosen predictor variables for weekly income,
resampled observations from New Zealand Income Survey 2011", 0.5, 0.95,
                gp = gpar(fontfamily = "myfont", cex = 1))
      
      grid.text(i, 0.05, 0.05, gp = gpar(fontfamily = "myfont", cex = 1))
      
      grid.text("$ numbers in blue are 'average' weekly income:nsquared(mean(sign(sqrt(abs(x)))))nwhich is a little less than the median.",
                0.8, 0.1, 
                gp = gpar(fontfamily = "myfont", cex = 0.8, col = "blue"))
      
      comment_i <- min(floor(i / 12.5) + 1, length(commentary))  # cap at the last commentary entry (avoids an NA at i = 50)
      
      grid.text(commentary[comment_i], 
                0.3, 0.1,
                gp = gpar(fontfamily = "myfont", cex = 1.2, col = "orange"))
      
      dev.off()

}   

# knit into an actual animation
old_dir <- setwd("_output/0026_random_forest")
# combine images into an animated GIF
system('"C:\Program Files\ImageMagick-6.9.1-Q16\convert" -loop 0 -delay 400 *.png "rf.gif"') # Windows
# system('convert -loop 0 -delay 400 *.png "rf.gif"') # linux
# move the asset over to where needed for the blog
file.copy("rf.gif", "../../..http://ellisp.github.io/img/0026-rf.gif", overwrite = TRUE)
setwd(old_dir)

Random Forest

Next model to try is a genuine Random Forest ™. As mentioned above, a Random Forest is an ensemble of regression trees, where each tree is a resample with replacement (variations are possible) of the original data, and each split in the tree is only allowed to choose from a subset of the variables available. To do this I used the {randomForest} R package, but it’s not efficiently written and is really pushing its limits with data of this size on modest hardware like mine. For classification problems the amazing open source H2O (written in Java but binding nicely with R) gives super-efficient and scalable implementations of Random Forests and of deep learning neural networks, but it doesn’t work with a continuous response variable.

Training a Random Forest requires you to specify how many explanatory variables to make available for each individual tree, and the best way to decide this is via cross-validation.

Cross-validation is all about splitting the data into a number of different training and testing sets, to get around the problem of using a single hold-out test set for multiple purposes. It’s better to give each bit of the data a turn as the hold-out test set. In the tuning exercise below, I divide the data into ten folds so I can try different values of the “mtry” parameter in my randomForest fitting and see the average root mean square error of the ten fits for each value of mtry. “mtry” defines the number of variables the tree-building algorithm has available to it at each split of the tree. For forests with a continuous response variable like mine, the default value is the number of variables divided by three, and I have 10 variables, so I try a range of options from 1 to 6 as the subset of variables for the tree to choose from at each split. It turns out the conventional default value of mtry = 3 is in fact the best:

rf-tuning

Here’s the code for this home-made cross-validation of randomForest:

#-----------------random forest----------
# Hold ntree constant and try different values of mtry
# values of m to try for mtry for cross-validation tuning
m <- c(1, 2, 3, 4, 5, 6)

folds <- 10

cvData <- trainData %>%
   mutate(group = sample(1:folds, nrow(trainData), replace = TRUE))

results <- matrix(numeric(length(m) * folds), ncol = folds)



# Cross validation, done by hand with single processing - not very efficient or fast:
for(i in 1:length(m)){
   message(i)
   for(j in 1:folds){
      
      cv_train <- cvData %>% filter(group != j) %>% select(-group)
      cv_test <- cvData %>% filter(group == j) %>% select(-group)

      tmp <- randomForest(income ~ ., data = cv_train, ntree = 100, mtry = m[i], 
                          nodesize = 10, importance = FALSE, replace = FALSE)
      tmp_p <- predict(tmp, newdata = cv_test)
      
      results[i, j] <- RMSE(tmp_p, cv_test$income)
      print(paste("mtry", m[i], j, round(results[i, j], 2), sep = " : "))
   }
}

results_df <- as.data.frame(results)
results_df$mtry <- m

svg("..http://ellisp.github.io/img/0026-rf-cv.svg", 6, 4)
print(
   results_df %>% 
   gather(trial, RMSE, -mtry) %>% 
   ggplot() +
   aes(x = mtry, y = RMSE) +
   geom_point() +
   geom_smooth(se = FALSE) +
   ggtitle(paste0(folds, "-fold cross-validation for random forest;\ndiffering values of mtry"))
)
dev.off()

Having determined a value for mtry of three variables to use for each tree in the forest, we re-fit the Random Forest with the full training dataset. It’s interesting to see the “importance” of the different variables – which ones make the most contribution to the most trees in the forest. This is the best way of relating a Random Forest to a theoretical question; otherwise their black-box nature makes them harder to interpret than a more traditional regression, with its t tests and confidence intervals for each explanatory variable.

It’s also good to note that after the first 300 or so trees, increasing the size of the forest seems to have little impact.

final-forest

Here’s the code that fits this forest to the training data and draws those plots:

# refit model with full training data set
rf <- randomForest(income ~ ., 
                    data = trainData, 
                    ntree = 500,   # randomForest's arguments are ntree and mtry (singular)
                    mtry = 3,
                    importance = TRUE,
                    replace = FALSE)


# importances
ir <- as.data.frame(importance(rf))
ir$variable  <- row.names(ir)

p1 <- ir %>%
   arrange(IncNodePurity) %>%
   mutate(variable = factor(variable, levels = variable)) %>%
   ggplot(aes(x = IncNodePurity, y = variable)) + 
   geom_point() +
   labs(x = "Importance of contribution tonestimating income", 
        title = "Variables in the random forest")

# changing RMSE as more trees added
tmp <- data_frame(ntrees = 1:500, RMSE = sqrt(rf$mse))
p2 <- ggplot(tmp, aes(x = ntrees, y = RMSE)) +
   geom_line() +
   labs(x = "Number of trees", y = "Root mean square error",
        title = "Improvement in predictionnwith increasing number of trees")

grid.arrange(p1, p2, ncol = 2)

Extreme gradient boosting

I wanted to check out extreme gradient boosting as an alternative prediction method. Like Random Forests, this method is based on a forest of many regression trees, but in the case of boosting each tree is relatively shallow (not many layers of branch divisions), and the trees are not independent of each other. Instead, successive trees are built specifically to explain the observations poorly explained by previous trees – this is done by giving extra weight to outliers from the prediction to date.

Boosting is prone to over-fitting, and if you let it run long enough it will memorize the entire training set (and be useless for new data), so it’s important to use cross-validation to work out how many iterations are worth using – the point at which it stops picking up general patterns and starts fitting the idiosyncrasies of the training sample. The excellent {xgboost} R package by Tianqi Chen, Tong He and Michael Benesty applies gradient boosting algorithms super-efficiently and comes with built-in cross-validation functionality. In this case it becomes clear that 15 or 16 rounds is the maximum boosting before overfitting takes place, so my final boosting model is fit to the full training data set with that number of rounds.

#-------xgboost------------
sparse_matrix <- sparse.model.matrix(income ~ . -1, data = trainData)

# boosting with different levels of rounds.  After 16 rounds it starts to overfit:
xgb.cv(data = sparse_matrix, label = trainY, nrounds = 25, objective = "reg:linear", nfold = 5)

mod_xg <- xgboost(sparse_matrix, label = trainY, nrounds = 16, objective = "reg:linear")

Two stage Random Forests

My final serious candidate for a predictive model is a two stage Random Forest. One of my problems with this data is the big spike at $0 income per week, and this suggests modelling it in two steps:

  • first, fit a classification model to predict the probability of an individual, based on their characteristics, having any income at all
  • fit a regression model, conditional on them getting any income and trained only on those observations with non-zero income, to predict the size of their income (which may be positive or negative).

The individual models could be chosen from many options but I’ve opted for Random Forests in both cases. Because the first stage is a classification problem, I can use the more efficient H2O platform to fit it – much faster.

#---------------------two stage approach-----------
# this is the only method that preserves the bimodal structure of the response
# Initiate an H2O instance that uses 4 processors and up to 2GB of RAM
h2o.init(nthreads = 4, max_mem_size = "2G")

var_names <- names(trainData)[!names(trainData) == "income"]

trainData2 <- trainData %>%
   mutate(income = factor(income != 0)) %>%
   as.h2o()

mod1 <- h2o.randomForest(x = var_names, y = "income",
                         training_frame = trainData2,
                         ntrees = 1000)

trainData3 <- trainData %>% filter(income != 0) 
mod2 <- randomForest(income ~ ., 
                     data = trainData3, 
                     ntree = 250, 
                     mtry = 3, 
                     nodesize = 10, 
                     importance = FALSE, 
                     replace = FALSE)

Traditional regression methods

As a baseline, I also fit three more traditional linear regression models:

  • one with all variables
  • one with all variables and many of the obvious two way interactions
  • a stepwise selection model.

I’m not a big fan of stepwise selection for all sorts of reasons, but if it is done carefully, and you refrain from interpreting the final model as though it had been specified in advance (which virtually everyone gets wrong), it has its place. It’s certainly a worthwhile comparison point, as stepwise selection still prevails in many fields despite the development in recent decades of much better methods of model building.

Here’s the code that fit those ones:

#------------baseline linear models for reference-----------
lin_basic <- lm(income ~ sex + agegrp + occupation + qualification + region +
                   sqrt(hours) + Asian + European + Maori + MELAA + Other + Pacific + Residual, 
                data = trainData)          # first order only
lin_full  <- lm(income ~ (sex + agegrp + occupation + qualification + region +
                   sqrt(hours) + Asian + European + Maori + MELAA + Other + Pacific + Residual) ^ 2, 
                data = trainData)  # second order interactions and polynomials
lin_fullish <- lm(income ~ (sex + Maori) * (agegrp + occupation + qualification + region +
                     sqrt(hours)) + Asian + European + MELAA + 
                     Other + Pacific + Residual,
                  data = trainData) # selected interactions only

lin_step <- stepAIC(lin_fullish, k = log(nrow(trainData))) # bigger penalisation for parameters given large dataset

Results – predictive power

I used root mean square error of the predictions of (transformed) income in the hold-out test set – which had not been touched so far in the model-fitting – to get an assessment of how well the various methods perform. The results are shown in the plot below. Extreme gradient boosting and my two stage Random Forest approaches are neck and neck, followed by the single tree and the random decision forest, with the traditional linear regressions making up the “also rans”.

rmses

I was surprised to see that a humble single regression tree out-performed my home made random decision forest, but concluded that this is probably something to do with the relatively small number of explanatory variables to choose from, and the high performance of “hours worked” and “occupation” in predicting income. A forest (or spinney…) that excludes those variables from whole trees at a time will be dragged down by trees with very little predictive power. In contrast, Random Forests choose from a random subset of variables at each split, so excluding hours from the choice in one split doesn’t deny it to future splits in the tree, and the tree as a whole still makes a good contribution.

It’s useful to compare at a glance the individual-level predictions of all these different models on some of the hold-out set, and I do this in the scatterplot matrix below. The predictions from different models are highly correlated with each other (correlation of well over 0.9 in all cases), and less strongly correlated with the actual income. This difference is caused by the fact that the observed income includes individual-level random variance, whereas all the models are predicting some kind of centre value for income given the various demographic values. This is something I come back to in the next stage, when I want to predict a full distribution.

pairs

Here’s the code that produces the predicted values of all the models on the test set and produces those summary plots:

#---------------compare predictions on test set--------------------
# prediction from tree
tree_preds <- predict(rpartTree, newdata = testData)

# prediction from the random decision forest
rdf_preds <- rep(NA, nrow(testData))
for(i in 1:reps){
   tmp <- predict(home_made_rf[[i]], newdata = testData)
   rdf_preds <- cbind(rdf_preds, tmp)
}
rdf_preds <- apply(rdf_preds, 1, mean, na.rm= TRUE)

# prediction from random forest
rf_preds <- as.vector(predict(rf, newdata = testData))

# prediction from linear models
lin_basic_preds <- predict(lin_basic, newdata = testData)
lin_full_preds <- predict(lin_full, newdata = testData)
lin_step_preds <-  predict(lin_step, newdata = testData)

# prediction from extreme gradient boosting
xgboost_pred <- predict(mod_xg, newdata = sparse.model.matrix(income ~ . -1, data = testData))

# prediction from two stage approach
prob_inc <- predict(mod1, newdata = as.h2o(select(testData, -income)), type = "response")[ , "TRUE"]
pred_inc <- predict(mod2, newdata = testData)
pred_comb <- as.vector(prob_inc > 0.5)  * pred_inc
h2o.shutdown(prompt = F) 

rmse <- rbind(
   c("BasicLinear", RMSE(lin_basic_preds, obs = testY)), # 21.31
   c("FullLinear", RMSE(lin_full_preds, obs = testY)),  # 21.30
   c("StepLinear", RMSE(lin_step_preds, obs = testY)),  # 21.21
   c("Tree", RMSE(tree_preds, obs = testY)),         # 20.96
   c("RandDecForest", RMSE(rdf_preds, obs = testY)),       # 21.02 - NB *worse* than the single tree!
   c("randomForest", RMSE(rf_preds, obs = testY)),        # 20.85
   c("XGBoost", RMSE(xgboost_pred, obs = testY)),    # 20.78
   c("TwoStageRF", RMSE(pred_comb, obs = testY))       # 21.11
   )

rmse %>%
   as.data.frame(stringsAsFactors = FALSE) %>%
   mutate(V2 = as.numeric(V2)) %>%
   arrange(V2) %>%
   mutate(V1 = factor(V1, levels = V1)) %>%
   ggplot(aes(x = V2, y = V1)) +
   geom_point() +
   labs(x = "Root Mean Square Error (smaller is better)",
        y = "Model type",
        title = "Predictive performance on hold-out test set of different models of individual income")

#------------comparing results at individual level------------
pred_results <- data.frame(
   BasicLinear = lin_basic_preds,
   FullLinear = lin_full_preds,
   StepLinear = lin_step_preds,
   Tree = tree_preds,
   RandDecForest = rdf_preds,
   randomForest = rf_preds,
   XGBoost = xgboost_pred,
   TwoStageRF = pred_comb,
   Actual = testY
)

pred_res_small <- pred_results[sample(1:nrow(pred_results), 1000),]

ggpairs(pred_res_small)
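
To put a number on the “well over 0.9” claim, the correlations behind that scatterplot matrix can be printed directly (same sample of 1,000 hold-out observations as above):

# pairwise correlations between the models' predictions and the observed values
round(cor(pred_res_small), 2)
# each model's correlation with the actual (transformed) incomes, sorted
sort(round(cor(pred_res_small)[, "Actual"], 2))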

Building the Shiny app

There’s a few small preparatory steps now before I can put the results of my model into an interactive web app, which will be built with Shiny.

I opt for the two stage Random Forest model as the best way of re-creating the income distribution. It will let me create simulated data with a spike at zero dollars of income in a way none of the other models (which focus just on averages) will do; plus it is equal best (with extreme gradient boosting) in overall predictive power.

Adding back in individual level variation

After refitting my final model to the full dataset, my first substantive problem is to recreate the full distribution, with individual-level randomness, not just a predicted value at each point. On my transformed scale for income, the residuals from the models are fairly homoskedastic, so I decide that the Shiny app will simulate a population at any point by sampling with replacement from the residuals of the second-stage model (a sketch of that simulation step follows the code below).

I save the models, the residuals, and the various dimension variables for my Shiny app.

#----------------shiny app-------------
# dimension variables for the user interface:
d_sex <- sort(as.character(unique(nzis$sex)))
d_agegrp <- sort(as.character(unique(nzis$agegrp)))
d_occupation <- sort(as.character(unique(nzis$occupation)))
d_qualification <- sort(as.character(unique(nzis$qualification)))
d_region <- sort(as.character(unique(nzis$region)))

save(d_sex, d_agegrp, d_occupation, d_qualification, d_region,
     file = "_output/0026-shiny/dimensions.rda")

# tidy up data of full dataset, combining various ethnicities into an 'other' category:     
nzis_shiny <- nzis %>% 
   select(-use) %>%
   mutate(Other = factor(ifelse(Other == "Yes" | Residual == "Yes" | MELAA == "Yes",
                         "Yes", "No"))) %>%
   select(-MELAA, -Residual)
   
for(col in c("European", "Asian", "Maori", "Other", "Pacific")){
   nzis_shiny[ , col]   <- ifelse(nzis_shiny[ , col] == "Yes", 1, 0)
   }

# Refit the models to the full dataset
# income a binomial response for first model
nzis_rf <- nzis_shiny %>%  mutate(income = factor(income !=0))
mod1_shiny <- randomForest(income ~ ., data = nzis_rf,
                           ntree = 500, importance = FALSE, mtry = 3, nodesize = 5)
save(mod1_shiny, file = "_output/0026-shiny/mod1.rda")

nzis_nonzero <- subset(nzis_shiny, income != 0) 

mod2_shiny <- randomForest(income ~ ., data = nzis_nonzero, ntree = 500, mtry = 3, 
                           nodesize = 10, importance = FALSE, replace = FALSE)

res <- predict(mod2_shiny) - nzis_nonzero$income  # residuals on the transformed scale
nzis_skeleton <- nzis_shiny[0, ]
all_income <- nzis$income

save(mod2_shiny, res, nzis_skeleton, all_income, nzis_shiny,
   file = "_output/0026-shiny/models.rda")

Contextual information – how many people are like “that” anyway?

After my first iteration of the web app, I realised that it could be badly misleading by giving a full distribution for a non-existent combination of demographic variables. For example, Maori female managers aged 15-19 with Bachelor or Higher qualification and living in Southland (predicted to have median weekly income of $932 for what it’s worth).

I realised that for meaningful context I needed a model that estimated the number of people in New Zealand with the particular combination of demographics selected. This is something that traditional survey estimation methods don’t provide, because individuals in the sample are weighted to represent a discrete number of exactly similar people in the population; there’s no “smoothing” impact allowing you to widen inferences to similar but not-identical people.

Fortunately this problem is simpler than the income modelling problem above, and I use a straightforward generalized linear model with a Poisson response to create the seeds of such a model, with smoothed estimates of the number of people for each combination of demographics. I can then use iterative proportional fitting to force the marginal totals for each explanatory variable to match the population totals that were used to weight the original New Zealand Income Survey. Explaining this probably deserves a post of its own, but no time for that now.

#---------------population--------

nzis_pop <- expand.grid(d_sex, d_agegrp, d_occupation, d_qualification, d_region,
                        c(1, 0), c(1, 0), c(1, 0), c(1, 0), c(1, 0))
names(nzis_pop) <-  c("sex", "agegrp", "occupation", "qualification", "region",
                      "European", "Maori", "Asian", "Pacific", "Other")
nzis_pop$count <- 0
for(col in c("European", "Asian", "Maori", "Other", "Pacific")){
 nzis_pop[ , col]   <- as.numeric(nzis_pop[ , col])
}

nzis_pop <- nzis_shiny %>%
   select(-hours, -income) %>%
   mutate(count = 1) %>%
   rbind(nzis_pop) %>%
   group_by(sex, agegrp, occupation, qualification, region, 
             European, Maori, Asian, Pacific, Other) %>%
   summarise(count = sum(count)) %>%
   ungroup() %>%
   mutate(Ethnicities = European + Maori + Asian + Pacific + Other) %>%
   filter(Ethnicities %in% 1:2) %>%
   select(-Ethnicities)

# this pushes my little 4GB of memory to its limits:
 mod3 <- glm(count ~ (sex + Maori) * (agegrp + occupation + qualification) + region + 
                Maori:region + occupation:qualification + agegrp:occupation +
                agegrp:qualification, 
             data = nzis_pop, family = poisson)
 
 nzis_pop$pop <- predict(mod3, type = "response")

# total population should be (1787 + 1410) * 1000 = 319700.  But we also want
# the marginal totals (eg all men, or all women) to match the sum of weights
# in the NZIS (where wts = 319700 / 28900 = 1174).  So we use the raking method
# for iterative proportional fitting of survey weights

wt <- 1174

sex_pop <- nzis_shiny %>%
   group_by(sex) %>%
   summarise(freq = length(sex) * wt)

agegrp_pop <- nzis_shiny %>%
   group_by(agegrp) %>%
   summarise(freq = length(agegrp) * wt)

occupation_pop <- nzis_shiny %>%
   group_by(occupation) %>%
   summarise(freq = length(occupation) * wt)

qualification_pop <- nzis_shiny %>%
   group_by(qualification) %>%
   summarise(freq = length(qualification) * wt)

region_pop <- nzis_shiny %>%
   group_by(region) %>%
   summarise(freq = length(region) * wt)

European_pop <- nzis_shiny %>%
   group_by(European) %>%
   summarise(freq = length(European) * wt)

Asian_pop <- nzis_shiny %>%
   group_by(Asian) %>%
   summarise(freq = length(Asian) * wt)

Maori_pop <- nzis_shiny %>%
   group_by(Maori) %>%
   summarise(freq = length(Maori) * wt)

Pacific_pop <- nzis_shiny %>%
   group_by(Pacific) %>%
   summarise(freq = length(Pacific) * wt)

Other_pop <- nzis_shiny %>%
   group_by(Other) %>%
   summarise(freq = length(Other) * wt)

nzis_svy <- svydesign(~1, data = nzis_pop, weights = ~pop)

nzis_raked <- rake(nzis_svy,
                   sample = list(~sex, ~agegrp, ~occupation, 
                                 ~qualification, ~region, ~European,
                                 ~Maori, ~Pacific, ~Asian, ~Other),
                   population = list(sex_pop, agegrp_pop, occupation_pop,
                                     qualification_pop, region_pop, European_pop,
                                     Maori_pop, Pacific_pop, Asian_pop, Other_pop),
                   control = list(maxit = 20, verbose = FALSE))

nzis_pop$pop <- weights(nzis_raked)

save(nzis_pop, file = "_output/0026-shiny/nzis_pop.rda")

The final shiny app

To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R.


New Course! A hands-on introduction to statistics with R by A. Conway (Princeton University)


(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

The best way to learn is at your own pace. Combining the interactive R learning environment of DataCamp and the expertise of Prof. Conway of Princeton, we offer you an extensive online course on introductory statistics with R. Start learning now… Whether you are a professional using statistics in your job, an academic wanting a refresher on specific statistical topics, or a student taking statistics classes, this new DataCamp course will match your needs. It is a comprehensive and friendly course that requires no background knowledge in statistics or R. The aim is to provide you with a solid foundation for future learning, as well as the ability to put your work into context. All this takes place in your browser thanks to the DataCamp online learning environment. Try it for free!

Statistics with R

So, how does it all work? You can choose to subscribe to the course as a whole, or to take individual modules according to your own specific needs. The course consists of 7 modules, ranging from Student’s t-test and ANOVA to simple and multiple linear regression, and ending with a final module on moderation and mediation. In total there are more than 250 interactive R exercises, accompanied by videos and slides, adding up to 24 hours of material on statistics with R. Interested? To give you a taste of the course content and the DataCamp learning experience, the first module is free. Furthermore, if you are a student, you get a 75% discount on the whole course. So what are you waiting for? Grab this learning opportunity and check out the course! Remember that the first module is free, that you can buy separate modules according to your needs, and that if you buy all 7 modules at once you get a significant discount. On top of that, students can get a 75% reduction on the whole statistics with R course.

On Professor Andrew Conway

Prof. Conway is a Senior Lecturer at Princeton and has been teaching undergraduate and graduate students for 20 years. His experience is reflected in the quality of this course. The content of this course previously ran on Coursera, where more than 200,000 individuals followed it, making it the second most popular Coursera course using R. Psychology students at Princeton are already following the DataCamp course this semester.

On DataCamp

The course is set up in DataCamp’s interactive platform, which aims to enhance the learning experience through a learning-by-doing approach. The material is presented in short videos and slides that explain the major elements, and every section ends with interactive exercises that let you practice the covered concepts while giving you tailored feedback. You will discover R’s capabilities, and how they interplay with each other, step by step. You can learn at your own pace, stopping to take a break or replaying a segment at any time. The system tracks your progress so you can stop at any time; it will start up where you left off. This way, you learn effectively instead of losing time with one-speed-fits-all solutions like a four-hour screencast or webinar.

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog.
