Data analysis with R

Although it is possible to perform an analysis by typing all commands at the R prompt, it is advisable to collect all commands in a separate text file, created with your favorite text editor. When all steps of a data analysis are saved in such an R transcript file, for example called ‘analysis.R’, the analysis can be reproduced at any time, possibly with updated data. The R transcript file can be sourced into R using the source() function.

> source('analysis.R',echo=TRUE) 
RStudio offers an easy way to create and source an R transcript file. As pointed out earlier, it is recommended that you set different working directories for different analysis projects. To create a new project in RStudio, click on ‘File > New Project…’. Next, click on ‘New Directory’ and then click on ‘New Project’. After choosing the directory name (e.g., Wiki) and the directory path (e.g., your desktop), click on ‘Create Project’. Now we can create a new R script file within the project. Go to ‘File > New File’ and select ‘R Script’. Let us begin with a very simple R script in which we define two numeric vectors and sum them up. Instead of typing our commands in the RStudio console, we will type them in the RStudio script section that is located just above the console. 

# Defining two vectors
x <- 1:5
y <- 11:15
# Creating the sum
z <- sum(x,y)

Note that the character ‘#’ allows adding comments to the script. The ‘Run’ button in the top right corner of the script window executes the selected lines of code. The difference between ‘Run’ and ‘Source’ is that, when running a selection, all lines are copied directly to the console, whereas with ‘Source’ the file is saved to a temporary location and then sourced into the console, thus keeping the console less cluttered. To save the script, click on ‘File > Save’ and enter ‘easyExample’ as the ‘File name’. If you quit RStudio (‘File > Quit Session…’) and reopen it, you can run the script as it is, or you can modify it, for example by adding new commands. In what follows, we will illustrate how to conduct the most common statistical tests in R.
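To make the round trip concrete, the saved script can be reproduced from the console at any time with source(). The sketch below writes the example script to a temporary file only so that it is self-contained; in RStudio you would simply source the ‘easyExample.R’ file you saved.

```r
# Illustrative only: write the example script to a temporary file
# (in practice, this is the 'easyExample.R' file saved from RStudio)
scriptFile <- tempfile(fileext = ".R")
writeLines(c("# Defining two vectors",
             "x <- 1:5",
             "y <- 11:15",
             "# Creating the sum",
             "z <- sum(x,y)",
             "z"),
           scriptFile)
# echo=TRUE prints each command before its output, like 'Source with Echo'
source(scriptFile, echo = TRUE)
```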

Z test

The two-proportion z test allows comparing differences between two population proportions (percentages). For instance, you may want to test two different drugs. Drug A works on 12 people in a sample of 27 (~44% = 12/27). Drug B works on 4 people in a sample of 25 (~16% = 4/25). Is there a significant difference in the proportions of cured people? In R, the function prop.test() can be used to compare two or more proportions. The arguments of the function should be given as two vectors: the first vector contains the number of positive outcomes; the second one contains the total number for each group. However, before writing the code, it is convenient to create a new R script file named ‘zTestExample’.

# z test for two independent proportions
# Vector containing the number of positive outcomes
cured <- c(12,4)
# Vector containing the total number in each sample
total <- c(27,25)
# z test (two-tailed test)
prop.test(cured,total,alternative='two.sided')
If you want to test whether the proportion observed with drug A is greater than the proportion observed with drug B, type 

# z test (one-tailed test)
prop.test(cured,total,alternative='greater')

Otherwise, if you want to test whether the proportion observed with drug A is less than the proportion observed with drug B, type 

prop.test(cured,total,alternative='less') 
Notice that, for a 2 x 2 table, the two-sided z test is exactly equivalent to the standard chi-square test. To perform a chi-square test in R, type the following commands

# chi-square test
uncured <- total - cured
chisq.test(matrix(c(cured,uncured),2))

The chisq.test() function requires data in matrix form. The second column of the matrix contains the number of negative outcomes, not the total number of observations. More details on the chi-square test are provided in a dedicated section. Now, let us go back to the z test. Note that, by default, the z test is performed with the Yates continuity correction. If you want to perform the test without the Yates correction, add the argument ‘correct=FALSE’. The Yates correction makes the confidence interval for the difference in proportions somewhat wider. However, it is worth emphasizing that the confidence interval (0.0097, 0.5591) does not contain zero, thus contradicting the test, which says there is no significant difference between the two groups with a two-tailed test (p = 0.055). This is due to different approximation algorithms, whose discrepancies can become relevant for tables as sparse as the present one. If you want to compute an exact p-value, you can use Fisher’s exact test. The fisher.test() function requires data in matrix form, like chisq.test() does.
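To see the effect of the continuity correction directly, the two versions of the test can be run side by side. This is a minimal sketch reusing the ‘cured’ and ‘total’ vectors defined earlier.

```r
# Numbers of positive outcomes and sample sizes for drugs A and B
cured <- c(12,4)
total <- c(27,25)
# z test with the Yates continuity correction (the default)
prop.test(cured,total,alternative='two.sided')
# z test without the Yates continuity correction
# (narrower confidence interval, smaller p-value)
prop.test(cured,total,alternative='two.sided',correct=FALSE)
```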

# Fisher's exact test
fisher.test(matrix(c(cured,uncured),2))

Sometimes it is useful to compare more than two proportions. For example, you might want to extend the previous analysis by including drug C, which works on 19 people out of a sample of 31 (~61% = 19/31).

# z test for more than two independent proportions
# Vector containing the number of positive outcomes
cured <- c(12,4,19)
# Vector containing the total number in each sample
total <- c(27,25,31)
# z test (two-tailed test)
prop.test(cured,total,alternative='two.sided')
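If the overall test is significant, a natural follow-up is to ask which pairs of drugs differ. This is not part of the original example, but the stats package provides pairwise.prop.test(), which runs all pairwise two-proportion tests with a correction for multiple comparisons (Holm's method by default); a brief sketch:

```r
# Numbers of positive outcomes and sample sizes for drugs A, B, and C
cured <- c(12,4,19)
total <- c(27,25,31)
# All pairwise comparisons of proportions, with Holm's adjustment
# for multiple testing (the default p.adjust.method)
pairwise.prop.test(cured,total)
```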
Chi-square test

A contingency table is a very useful tool to represent the joint distribution of two or more categorical variables. The chi-square test of independence evaluates whether the rows and columns of a contingency table are independent of each other. As an example, let us consider the contingency table reporting caffeine consumption as a function of marital status. Before starting to write the code, let us create a new R script file and save it as ‘chiSquareTestExample’.

# Chi-square test
# Defining the contingency table
caffeineMarital <- matrix(c(648,1543,592,250,35,49,36,24,212,329,107,64),
                          nrow=3,byrow=T)
colnames(caffeineMarital) <- c("0","1-150","151-300",">300")
rownames(caffeineMarital) <- c("Married","PreviousMarried","Single")
caffeineMarital
# Chi-square test
chisq.test(caffeineMarital)

In this example, the test is highly significant; therefore, we can reject the hypothesis of independence. To see which cells drive the association, we can compare the observed counts with the counts expected under independence.

# Expected values
E <- chisq.test(caffeineMarital)$expected
# Observed values
O <- chisq.test(caffeineMarital)$observed
E
(O-E)^2/E
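For reference, the expected counts returned by chisq.test() can also be computed by hand from the marginal totals (row total × column total / grand total). The following is just an illustrative check of that identity:

```r
caffeineMarital <- matrix(c(648,1543,592,250,35,49,36,24,212,329,107,64),
                          nrow=3,byrow=TRUE)
# Expected count for each cell: row total * column total / grand total
Emanual <- outer(rowSums(caffeineMarital),
                 colSums(caffeineMarital)) / sum(caffeineMarital)
# These match the values in chisq.test(caffeineMarital)$expected
all.equal(unname(chisq.test(caffeineMarital)$expected), Emanual)
```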
Note that the formula (O-E)/sqrt(E) returns the so-called Pearson (or standardized) residuals r.

# Pearson residuals
r <- chisq.test(caffeineMarital)$residuals
r^2
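Note also that the squared Pearson residuals sum to the chi-square statistic, which is why they can be read as per-cell contributions to the overall test. A quick illustrative check:

```r
caffeineMarital <- matrix(c(648,1543,592,250,35,49,36,24,212,329,107,64),
                          nrow=3,byrow=TRUE)
r <- chisq.test(caffeineMarital)$residuals
# The chi-square statistic equals the sum of the squared residuals
sum(r^2)
chisq.test(caffeineMarital)$statistic
```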

To visualize Pearson residuals, you can use the package ‘corrplot’. 

# Plotting
requiredPackages <- c("corrplot")
# Installing packages that are not installed
install.packages(setdiff(requiredPackages,rownames(installed.packages())),
                 dependencies=TRUE)
library(corrplot)
corrplot(r,is.corr=FALSE)
For each cell, the size of the circle is proportional to the cell’s contribution. Positive residuals are in blue and indicate a positive association (attraction) between the corresponding row and column variables. Conversely, negative residuals are in red and indicate a negative association (repulsion) between the corresponding variables. You might also want to compute and plot the percentage contribution of a given cell to the total chi-square score.

# Computing contribution in percentage (%)
contribPc <- 100*r^2/chisq.test(caffeineMarital)$statistic
round(contribPc,3)
# Visualizing the percentage contribution
corrplot(contribPc,is.corr=FALSE)
Student's t-test

To begin with, let us create a new R script file and save it as ‘tTestExample’. Now, we can start coding.

# Loading the data
requiredPackages <- c("HSAUR2")
# Installing packages that are not installed
install.packages(setdiff(requiredPackages,rownames(installed.packages())))
# Loading the package and the data
library(HSAUR2)
data("USmelanoma",package="HSAUR2")
str(USmelanoma)
We would like to test the hypothesis that the mean mortality rate of the population living in ocean States is equal to the mean mortality rate of the population living in non-ocean States. We will do this first by applying an independent two-sample t-test. In general, it is good practice to check – at least informally – the normality and equal-variance assumptions.

# Checking (informally) the normality and equal variance assumptions
# Getting the summary statistics separately for each group given by the factor 'ocean'
tapply(USmelanoma$mortality,USmelanoma$ocean,summary)
# Getting the standard deviations separately for each group given by the factor 'ocean'
tapply(USmelanoma$mortality,USmelanoma$ocean,sd)
# Displaying boxplots and normal Q-Q plots for each group
layout(matrix(c(1,2,1,3),nrow=2,ncol=2,byrow=FALSE))
boxplot(mortality ~ ocean,data=USmelanoma,ylab="Mortality rate",varwidth=TRUE)
qqnorm(USmelanoma$mortality[USmelanoma$ocean=='no'],ylab="Mortality rate (land)")
qqline(USmelanoma$mortality[USmelanoma$ocean=='no'])
qqnorm(USmelanoma$mortality[USmelanoma$ocean=='yes'],ylab="Mortality rate (ocean)")
qqline(USmelanoma$mortality[USmelanoma$ocean=='yes'])

Note that we set the argument varwidth=TRUE in the boxplot. This means that the boxes are displayed with widths proportional to the square roots of the number of observations in each group. The boxplots seem to indicate that the mortality rates in the ocean States are a bit more variable and slightly more skewed than those in the land States, a point also suggested by the numerical summary statistics. To confirm the intuition suggested by the graphics, one should conduct equality-of-variance and normality tests.

# Checking (formally) the normality and equal variance assumptions
# Shapiro-Wilk's test
shapiro.test(USmelanoma$mortality[USmelanoma$ocean=='no'])
shapiro.test(USmelanoma$mortality[USmelanoma$ocean=='yes'])
# The F test is very sensitive to departures from normality
var.test(mortality ~ ocean,data=USmelanoma,alternative='two.sided')
# Levene's test is not sensitive to departures from normality
library(car)
leveneTest(mortality ~ ocean,data=USmelanoma)
The results of all statistical tests indicate that both the normality and the equal-variance assumptions are met. Now, we can conduct the independent t-test. Just for didactic purposes, we will also show how to perform the Welch test, which relaxes the equal-variance assumption, and the Wilcoxon-Mann-Whitney test, which relaxes the normality assumption.

# Performing the independent t-test
# Two-sample t-test
t.test(mortality ~ ocean,data=USmelanoma,var.equal=TRUE)
# Welch two-sample t-test
t.test(mortality ~ ocean,data=USmelanoma,var.equal=FALSE)
# Wilcoxon-Mann-Whitney test
wilcox.test(mortality ~ ocean,data=USmelanoma,
            alternative='two.sided',exact=FALSE)

To be thorough, we will also illustrate how to conduct a paired t-test in R. Let ‘pre’ and ‘post’ be two measurements on the same experimental unit.

# Performing the paired t-test
# Weight before diet
pre <- c(100.1, 95.9, 97.7, 107, 120.4, 98.9, 86.2, 93.5, 102.2, 94.7)
# Weight after diet
post <- c(89.7, 98, 88.7, 102.7, 104.8, 89.9, 81.8, 81.7, 99.5, 87.1)
# Creating a data frame
weightPrePost <- data.frame(group = rep(c("pre", "post"), each = 10),
                            weight = c(pre, post))
# Checking the normality assumption
# Computing the differences
difference <- pre - post
# Shapiro-Wilk normality test for the differences
shapiro.test(difference) # => p-value = 0.9738 => we can assume normality
# Paired t-test
t.test(pre, post, paired = TRUE)
# or, equivalently, using the formula interface
# (deprecated for paired tests in recent versions of R)
t.test(weight ~ group, data = weightPrePost, paired = TRUE)
# Matched-pairs Wilcoxon test
wilcox.test(pre, post, paired = TRUE,
            alternative = 'two.sided', exact = FALSE)