Data analysis with R

Although it is possible to perform an analysis by typing all commands at the R prompt, it is advisable to collect all commands in a separate text file, created with your favorite text editor. When all steps of a data analysis are saved in such an R transcript file, for example called ‘analysis.R’, the analysis can be reproduced at any time, possibly with updated data. The R transcript file can be sourced into R using the source() function.

 

> source('analysis.R',echo=TRUE)
 

RStudio offers an easy way to create and source an R transcript file. As pointed out earlier, it is recommended that you set different working directories for different analysis projects. To create a new project in RStudio, click on ‘File > New Project…’. Next, click on ‘New Directory’ and then click on ‘New Project’. After choosing the directory name (e.g., Wiki) and the directory path (e.g., your desktop), click on ‘Create Project’.

Now we can create a new R script file within the project. Go to ‘File > New File’ and select ‘R Script’.

Let us begin with a very simple R script in which we define two numeric vectors and compute the sum of all their elements. Instead of typing our commands in the RStudio console, we will type them in the RStudio script section that is located just above the console.

 
# Defining two vectors
x <- 1:5
y <- 11:15
# Computing the sum of all the elements of x and y
z <- sum(x,y)
 

Note that the ‘#’ character allows you to add comments to the script.

The ‘Run’ button in the top right corner of the script window executes the selected lines of code. The difference between ‘Run’ and ‘Source’ is that, when running a selection, all lines are copied directly to the console, whereas with ‘Source’ the file is saved to a temporary location and then sourced into the console, which keeps the console less cluttered.

To save the script, click on ‘File > Save’ and enter ‘easyExample’ as the ‘File name’. 

If you quit RStudio (‘File > Quit Session…’) and reopen it, you can run the script as it is, or you can modify it, for example, by adding new commands.
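
You can also source the saved script directly from the console (assuming RStudio added the default ‘.R’ extension when saving the file):

 
> source('easyExample.R',echo=TRUE)
 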

In what follows, we will illustrate how to conduct the most common statistical tests in R.

 

Z test

The two-proportion z test compares two population proportions (percentages). For instance, you may want to test two different drugs. Drug A works on 12 people in a sample of 27 (~44% = 12/27). Drug B works on 4 people in a sample of 25 (16% = 4/25). Is there a significant difference in the proportions of cured people?

In R, the function prop.test() can be used to compare two or more proportions. The arguments of the function should be given as two vectors: the first vector contains the number of positive outcomes; the second one contains the total number for each group. However, before writing the code, it is convenient to create a new R script file named ‘zTestExample’.

 
# z test for two independent proportions ----
# Vector containing the number of positive outcomes
cured <- c(12,4)
# Vector containing the total number in each sample
total <- c(27,25)

# z test (two-tailed test)
prop.test(cured,total,alternative='two.sided')
 

If you want to test whether the proportion observed with drug A is greater than the proportion observed with drug B, type

 
# z test (one-tailed test)
prop.test(cured,total,alternative='greater')
 

Otherwise, if you want to test whether the proportion observed with drug A is less than the proportion observed with drug B, type

 
prop.test(cured,total,alternative='less')
 

Notice that, for a 2 x 2 table, the two-sided z test is exactly equivalent to the standard chi-square test. To perform a chi-square test in R, type the following commands

 
# chi-square test
uncured <- total-cured
chisq.test(matrix(c(cured,uncured),2))
 
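
As a quick check of this equivalence, you can compare the test statistics returned by the two functions (both apply the Yates continuity correction by default).

 
# Verifying that prop.test() and chisq.test() agree on a 2 x 2 table
prop.test(cured,total)$statistic
chisq.test(matrix(c(cured,uncured),2))$statistic
 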

The chisq.test() function requires data in matrix form. The second column of the matrix contains the number of negative outcomes and not the total number of observations. More details on the chi-square test are provided in a dedicated section.
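
To see this layout explicitly, you can print the matrix, here with purely illustrative dimension names added for readability.

 
# Displaying the 2 x 2 table: cured in column 1, uncured in column 2
matrix(c(cured,uncured),2,dimnames=list(c("drugA","drugB"),c("cured","uncured")))
 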

Now, let us go back to the z test. Note that the z test is performed with the Yates continuity correction by default. If you want to perform the test without the Yates correction, add the argument ‘correct=FALSE’. The Yates correction makes the confidence interval for the difference in proportions somewhat wider. However, it is worth emphasizing that the confidence interval (0.0097, 0.5591) does not contain zero, thus contradicting the two-tailed test, which reports no significant difference between the two groups (p = 0.055). This discrepancy is due to the different approximation algorithms involved, which can become relevant for tables as sparse as the present one. If you want to compute at least an exact p-value, you can use Fisher’s exact test.
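
For instance, to repeat the two-tailed test on the drug data without the correction, type

 
# z test without the Yates continuity correction
prop.test(cured,total,alternative='two.sided',correct=FALSE)
 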

The fisher.test() function requires data in matrix form, like chisq.test() does.

 
# Fisher's exact test
fisher.test(matrix(c(cured,uncured),2))
 

Sometimes it is useful to compare more than two proportions. For example, you might want to extend the previous analysis by including drug C, which works on 19 people in a sample of 31 (~61% = 19/31).

 
# z test for more than two independent proportions ----
# Vector containing the number of positive outcomes
cured <- c(12,4,19)
# Vector containing the total number in each sample
total <- c(27,25,31)

# z test (two-tailed test)
prop.test(cured,total,alternative='two.sided')
 

Chi-square test

A contingency table is a very useful tool for representing the joint distribution of two or more categorical variables. The chi-square test of independence evaluates whether the rows and columns of a contingency table are independent of each other. As an example, let us consider the contingency table reporting caffeine consumption as a function of marital status. Before starting to write the code, let us create a new R script file and save it as ‘chiSquareTestExample’.

 
# Chi-square test ----
# Defining the contingency table
caffeineMarital <- matrix(c(648,1543,592,250,35,49,36,24,212,329,107,64),
                          nrow=3,byrow=TRUE)
colnames(caffeineMarital) <- c("0","1-150","151-300",">300")
rownames(caffeineMarital) <- c("Married","PreviousMarried","Single")
caffeineMarital

# Chi-square test
chisq.test(caffeineMarital)
 

In this example, the test is highly significant; therefore, we can reject the hypothesis of independence.
Generally, you will also want to inspect the deviations between the expected and observed values in order to highlight the cells that contribute most to the total chi-square statistic.

 
# Expected values
E <- chisq.test(caffeineMarital)$expected
# Observed values
O <- chisq.test(caffeineMarital)$observed
E
(O-E)^2/E
 

Note that the formula (O-E)/sqrt(E) returns the so-called Pearson (or standardized) residuals r.

 
# Pearson residuals
r <- chisq.test(caffeineMarital)$residuals
r^2
 
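
As a quick check, you can verify that these residuals coincide with (O-E)/sqrt(E), using the E and O matrices computed earlier.

 
# Verifying the identity between the Pearson residuals and (O-E)/sqrt(E)
all.equal(r,(O-E)/sqrt(E))
 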

To visualize Pearson residuals, you can use the package ‘corrplot’.

 
# Plotting ----
requiredPackages <- c("corrplot")
# Installing packages that are not installed
install.packages(setdiff(requiredPackages,rownames(installed.packages())),
                 dependencies=TRUE)

library(corrplot)
corrplot(r,is.corr=FALSE)
 
[Figure: corrplot of the Pearson residuals]

For each cell, the size of the circle is proportional to the magnitude of the cell’s contribution. Positive residuals are shown in blue and indicate a positive association (attraction) between the corresponding row and column variables. Conversely, negative residuals are shown in red and indicate a negative association (repulsion) between the corresponding variables. You might also want to compute and plot the percentage contribution of a given cell to the total chi-square statistic.

 
# Computing the contribution in percentage (%)
contribPc <- 100*r^2/chisq.test(caffeineMarital)$statistic
round(contribPc,3)

# Visualizing the percentage contribution
corrplot(contribPc,is.corr=FALSE)
 
[Figure: corrplot of the percentage contributions]

Student's t-test

To begin with, let us create a new R script file and save it as ‘tTestExample’. Now, we can start coding.

 
# Loading the data ----
requiredPackages <- c("HSAUR2")
# Installing packages that are not installed
install.packages(setdiff(requiredPackages,rownames(installed.packages())))

# Loading the data
library(HSAUR2)
data("USmelanoma",package="HSAUR2")
str(USmelanoma)
 

We would like to test the hypothesis that the mean mortality rate of the population living in ocean States is equal to the mean mortality rate of the population living in non-ocean States. We will do this first by applying an independent-samples t-test. In general, it is good practice to check – at least informally – the normality and equal variance assumptions.

 
# Checking (informally) the normality and equal variance assumptions ----
# Getting the summary statistics separately for each group given by the factor 'ocean'
tapply(USmelanoma$mortality,USmelanoma$ocean,summary)
# Getting the standard deviations separately for each group given by the factor 'ocean'
tapply(USmelanoma$mortality,USmelanoma$ocean,sd)

# Displaying boxplots and normal Q-Q plots for each group
layout(matrix(c(1,2,1,3),nrow=2,ncol=2,byrow=FALSE))
boxplot(mortality ~ ocean,data=USmelanoma,ylab="Mortality rate",varwidth=TRUE)
qqnorm(USmelanoma$mortality[USmelanoma$ocean=='no'],ylab="Mortality rate (land)")
qqline(USmelanoma$mortality[USmelanoma$ocean=='no'])
qqnorm(USmelanoma$mortality[USmelanoma$ocean=='yes'],ylab="Mortality rate (ocean)")
qqline(USmelanoma$mortality[USmelanoma$ocean=='yes'])
 

Note that we set the argument ‘varwidth=TRUE’ in the boxplot. This means that the boxes are displayed with widths proportional to the square roots of the number of observations in each group.

The boxplots seem to indicate that the mortality rates in the ocean States are a bit more variable and slightly more skewed than those in the land States, a point also suggested by the numerical summary statistics. To confirm the intuition suggested by the graphics, one should conduct equality of variance and normality tests.

 
# Checking (formally) the normality and equal variance assumptions ----
# Shapiro-Wilk's test
shapiro.test(USmelanoma$mortality[USmelanoma$ocean=='no'])
shapiro.test(USmelanoma$mortality[USmelanoma$ocean=='yes'])

# F-test is very sensitive to departure from the normal assumption 
var.test(mortality ~ ocean,data=USmelanoma,alternative='two.sided')
# Levene's test is not sensitive to departure from the normal assumption
# (the 'car' package may need to be installed first)
install.packages(setdiff("car",rownames(installed.packages())))
library(car)
leveneTest(mortality ~ ocean,data=USmelanoma)
 

The results of all statistical tests indicate that both the normality and the equal variance assumptions are met. Now, we can conduct the independent t-test. Just for didactic purposes, we will also show how to perform the Welch test, which relaxes the equal variance assumption, and the Wilcoxon-Mann-Whitney test, which relaxes the normality assumption.

 
# Performing the independent t-test ----
# Two sample t-test
t.test(mortality ~ ocean,data=USmelanoma,var.equal=TRUE)

# Welch two sample t-test
t.test(mortality ~ ocean,data=USmelanoma,var.equal=FALSE)

# Wilcoxon Mann-Whitney test
wilcox.test(mortality ~ ocean,data=USmelanoma,
            alternative='two.sided',exact=FALSE)
 

To be thorough, we will also illustrate how to conduct a paired t-test in R. Let ‘pre’ and ‘post’ be two measurements taken on the same experimental units.

 
# Performing the paired t-test ----
# Weight before diet
pre <- c(100.1, 95.9, 97.7, 107, 120.4, 98.9, 86.2, 93.5, 102.2, 94.7)
# Weight after diet
post <- c(89.7, 98, 88.7, 102.7, 104.8, 89.9, 81.8, 81.7, 99.5, 87.1)
# Creating a data frame in wide format (one row per experimental unit)
weightPrePost <- data.frame(pre, post)

# Checking the normality assumption
# Computing the differences
difference <- pre - post
# Shapiro-Wilk normality test for the differences
shapiro.test(difference) # => p-value = 0.9738 => we can assume normality

# Paired t-test
t.test(pre, post, paired = TRUE)
# or, equivalently, using the Pair() formula interface
# (the old 'weight ~ group' form with paired=TRUE is no longer supported)
t.test(Pair(pre, post) ~ 1, data = weightPrePost)

# Matched-pairs Wilcoxon test (note the argument 'paired = TRUE')
wilcox.test(pre, post, paired = TRUE,
            alternative = 'two.sided', exact = FALSE)
 