Graphics in R

The first step in the analysis of most datasets is to understand the general characteristics of the data by graphing them in some informative manner. Graphical exploration of data usually begins with well-known methods, such as scatterplots, histograms, bar charts and boxplots. Let us start by examining the ‘USmelanoma’ dataset contained in the ‘HSAUR2’ package.

> install.packages("HSAUR2")
> data("USmelanoma", package = "HSAUR2")
> USmelanoma
> str(USmelanoma)

The data frame ‘USmelanoma’ contains, for each U.S. state, the mortality rate due to malignant melanoma (deaths of white males in 1950 to 1969 per million inhabitants), the longitude and latitude of the geographic centre of the state, and a categorical variable ‘ocean’ indicating contiguity to an ocean. To begin with, we might construct a histogram of the mortality rates.

> hist(USmelanoma$mortality)

In general, the high-level graphical function ‘hist(x)’ creates a histogram of the frequencies of x. Other examples of high-level graphical functions in R are ‘plot(x)’, ‘plot(x,y)’, ‘pie(x)’, ‘boxplot(x)’, ‘barplot(x)’ and ‘qqplot(x,y)’. Each high-level function accepts options that let you, for instance, edit the main title, annotate the axes, or specify the lower and upper limits of the axes.

> hist(USmelanoma$mortality, prob = TRUE,
+      xlim = range(USmelanoma$mortality)*c(0.9,1.1),
+      xlab = 'Mortality', main = '', axes = FALSE, col = 'skyblue')
> axis(2)
> axis(1,at=seq(80,240,20))

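The axis() calls above set only the positions of the tick marks. As a minimal sketch of how the labels themselves can be replaced, the following uses simulated values (the vector x and the label texts are hypothetical and merely stand in for the mortality rates, so that the code runs without HSAUR2):

```r
set.seed(1)                                   # reproducible simulated data (an assumption)
x <- rnorm(48, mean = 150, sd = 30)           # hypothetical stand-in for the mortality rates

hist(x, axes = FALSE, main = "", xlab = "x")  # high-level call, axes suppressed
axis(2)                                       # default y-axis
# custom tick positions and custom labels on the x-axis
tick_pos <- axis(1, at = c(100, 150, 200),
                 labels = c("low", "medium", "high"))
```

The ‘labels’ argument must have the same length as ‘at’; when it is omitted, the numeric positions are used as labels.
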
Note that the axis() function lets you control where the tick marks are placed and how they are labelled, whereas the ‘col’ option specifies the color used to fill the bars. To get more information about the complete option list of a graphical function, refer to the help facilities in R (for example, ?hist). The axis() function is one of the graphical functions that affect an already existing graph. Such functions are called low-level plotting functions. Other examples of low-level plotting functions are ‘points(x,y)’, ‘lines(x,y)’, ‘title()’, ‘legend()’ and ‘rug(x)’. In particular, the rug() function draws the data x on the x-axis as small vertical lines.

> rug(USmelanoma$mortality)

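We can also add a density curve estimated by the density() function. Because the histogram was drawn with prob = TRUE, the bars and the curve share the same vertical scale.

> lines(density(USmelanoma$mortality))
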
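The density curve is comparable with the histogram only when the bars are drawn on the density scale, so that their areas sum to one. A minimal sketch of this check, assuming simulated values (the vector x is hypothetical and stands in for the mortality rates, so that the code runs without HSAUR2):

```r
set.seed(1)                                  # reproducible simulated data (an assumption)
x <- rnorm(48, mean = 150, sd = 30)          # hypothetical stand-in for the mortality rates

h <- hist(x, prob = TRUE, main = "", xlab = "x")  # bars on the density scale
lines(density(x))                            # kernel density estimate, same scale
rug(x)                                       # raw observations along the x-axis

bar_area <- sum(h$density * diff(h$breaks))  # total area covered by the bars
bar_area
```

With the default prob = FALSE the bar heights are counts, and a density curve added by lines() would lie almost flat along the bottom of the plot.
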
In addition to low-level plotting commands, we can use the par() function to improve the presentation of graphics. Specifically, the par() function changes graphical parameters persistently: a change affects all subsequent plots on the current device, not just the next one. For instance, to change the background color and the line type, use the following commands.

> par(bg='salmon')
> par(lty=2)
> hist(USmelanoma$mortality)

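Because changes made with par() persist, it can help to save the previous values and restore them once the plot is done. par() returns the old values of the parameters it changes, so a minimal sketch could look as follows (simulated data; the names old_par, lty_before and lty_after are illustrative only):

```r
set.seed(1)                              # reproducible simulated data (an assumption)
x <- rnorm(48, mean = 150, sd = 30)      # hypothetical stand-in for the mortality rates

lty_before <- par("lty")                 # line type before any change

old_par <- par(bg = "salmon", lty = 2)   # par() returns the previous values invisibly
hist(x, main = "")                       # drawn with the modified settings
par(old_par)                             # restore the values saved above

lty_after <- par("lty")                  # same as lty_before again
```

Unlike closing the device, this keeps the existing plot on screen and only affects the parameters that were changed.
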
To reset par() to the default values, simply issue the command

> dev.off()

which closes the current graphics device and discards its plots; the next plot opens a new device with the default settings. We will now illustrate how to create a boxplot.

> boxplot(USmelanoma$mortality)

As before, we will change some options.

> boxplot(USmelanoma$mortality,
+         ylim = range(USmelanoma$mortality)*c(0.9,1.1),
+         horizontal = TRUE, xlab = 'Mortality', axes = FALSE)
> axis(1,at=seq(80,240,20))

Sometimes we may need to combine several high-level graphical functions in a single figure. For instance, we may want to draw a boxplot and a histogram one above the other. For this purpose, we will use the layout() function, which arranges independent plots on the same plotting device.

> layout(matrix(1:2, nrow = 2))
> par(mar = par("mar") * c(0.8, 1, 1, 1))
> boxplot(USmelanoma$mortality,
+         ylim = range(USmelanoma$mortality)*c(0.9,1.1),
+         horizontal = TRUE, xlab = 'Mortality', axes = FALSE)
> axis(1,at=seq(80,240,20))
> hist(USmelanoma$mortality, prob = TRUE,
+      xlim = range(USmelanoma$mortality)*c(0.9,1.1),
+      xlab = '', main = '', ylab = '', axes = FALSE)
> axis(1,at=seq(80,240,20))

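In the layout above both panels receive the same height. One possible refinement, sketched here under the assumption of simulated data (the vector x is hypothetical), is to use the ‘heights’ argument of layout() so that the boxplot takes up less room than the histogram:

```r
set.seed(1)                                  # reproducible simulated data (an assumption)
x <- rnorm(48, mean = 150, sd = 30)          # hypothetical stand-in for the mortality rates

# two rows, one column; the first panel gets one third of the device height
n_panels <- layout(matrix(1:2, nrow = 2), heights = c(1, 2))
boxplot(x, horizontal = TRUE, axes = FALSE, ylim = range(x))  # panel 1
hist(x, main = "", xlab = "x", xlim = range(x))               # panel 2
layout(matrix(1))                            # back to a single panel
```

layout() returns the number of panels it has set up, which is stored in n_panels here only to make that visible.
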
Because their appearance depends on the number of classes chosen, histograms can be misleading for displaying distributions. A useful alternative is to estimate and display the density function of a variable. For the melanoma data, we might be interested in comparing the mortality rates for ocean and non-ocean states.

> oceanYesDensity <- density(USmelanoma$mortality[USmelanoma$ocean=='yes'])
> oceanNoDensity <- density(USmelanoma$mortality[USmelanoma$ocean=='no'])
> plot(oceanYesDensity,lty=1,ylim=c(0,0.018),main='')
> lines(oceanNoDensity,lty=2)
> legend("topright",lty = 1:2,legend=c("Coastal State","Land State"),bty="n")

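The dependence of a histogram on the number of classes can be seen by binning the same values in two ways. The sketch below assumes a simulated sample drawn from two groups (x and the class counts 3 and 12 are illustrative choices, not taken from the melanoma data):

```r
set.seed(1)                                           # reproducible simulated data (an assumption)
x <- c(rnorm(30, 120, 10), rnorm(18, 180, 15))        # hypothetical two-group sample

breaks_coarse <- seq(min(x), max(x), length.out = 4)  # 4 break points, 3 classes
breaks_fine   <- seq(min(x), max(x), length.out = 13) # 13 break points, 12 classes

h_coarse <- hist(x, breaks = breaks_coarse, plot = FALSE)  # compute only, no drawing
h_fine   <- hist(x, breaks = breaks_fine, plot = FALSE)

layout(matrix(1:2, ncol = 2))                         # the two histograms side by side
plot(h_coarse, main = "3 classes", xlab = "x")
plot(h_fine, main = "12 classes", xlab = "x")
layout(matrix(1))                                     # back to a single panel
```

Both histograms summarise the same observations, yet only the finer one has enough classes to show the two groups separately.
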
We can also construct two parallel boxplots displaying the conditional distributions of the mortality rates in the two groups given by the categorical variable ‘ocean’. A simple way to do this is to use a so-called ‘model formula’, specifying the dependent variable on the left-hand side of the tilde and the independent variable on the right-hand side.

> plot(mortality ~ ocean, data = USmelanoma, xlab = "Contiguity to an ocean", ylab = "Mortality")

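The same formula notation is accepted by boxplot(), which can also return the group summaries without drawing anything. A minimal sketch, assuming a small simulated data frame (d, y and g are hypothetical names standing in for USmelanoma, mortality and ocean):

```r
set.seed(1)                                        # reproducible simulated data (an assumption)
d <- data.frame(
  y = c(rnorm(10, 120, 10), rnorm(10, 160, 10)),   # hypothetical response
  g = factor(rep(c("no", "yes"), each = 10))       # hypothetical two-level factor
)

plot(y ~ g, data = d)                              # parallel boxplots, one per level of g
b <- boxplot(y ~ g, data = d, plot = FALSE)        # same summaries, nothing drawn
b$names                                            # group labels, taken from the factor levels
b$n                                                # number of observations per group
```

Each column of b$stats holds the five values drawn for one group: lower whisker, lower hinge, median, upper hinge and upper whisker.
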
Now we might be interested in looking at how mortality rates are related to the latitude and longitude of the centre of each state. The natural display for this is the scatterplot.

> layout(matrix(1:2, ncol = 2))
> plot(mortality ~ longitude, data = USmelanoma)
> plot(mortality ~ latitude, data = USmelanoma)

Because latitude, but not longitude, seems to be related to mortality, we can inspect the influence of latitude separately for ocean and non-ocean states. Instead of displaying two parallel scatterplots, we will use different plotting symbols for ocean and non-ocean states. To achieve this, we pass a vector of integers or characters to the parameter ‘pch’. We therefore convert the factor ‘ocean’ to an integer vector, so that the number 1 identifies land states (level ‘no’) and the number 2 identifies ocean states (level ‘yes’). As a consequence, land states are drawn as open circles, whereas ocean states are drawn as triangles.

> plot(mortality ~ latitude, data = USmelanoma, pch = as.integer(USmelanoma$ocean))
> legend("topright", legend = c("Land state", "Coast state"), pch = 1:2, bty = "n")

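The conversion from factor to plotting symbol relies on the order of the factor levels, which is alphabetical by default. A minimal sketch with a hypothetical four-element factor (not the actual ‘ocean’ column) shows the mapping:

```r
ocean <- factor(c("no", "yes", "yes", "no"))   # hypothetical miniature of the 'ocean' column
levels(ocean)                                  # "no" comes first, "yes" second
codes <- as.integer(ocean)                     # position of each value's level: 1 or 2
codes

plot(1:4, c(2, 3, 1, 4), pch = codes)          # pch 1 = open circle, pch 2 = triangle
legend("topleft", legend = levels(ocean),
       pch = seq_along(levels(ocean)), bty = "n")
```

Building the legend from levels(ocean) rather than from hand-typed labels keeps the symbols and the labels in the same order.
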
It is worth pointing out that the plot() function is a ‘generic’ function, that is, the action it performs depends on the class (i.e., type of object) of its arguments. For instance, if x and y are numeric vectors, plot(x,y) produces a scatterplot of y against x. If f is a factor and y is a numeric vector, plot(f,y) produces boxplots of y for each level of f. So far, we have illustrated some useful commands for creating commonly used graphics. To get an idea of the remarkable variety of graphics that one can construct in R, type

> demo(graphics)

or

> demo(persp)

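The class-dependent behaviour of plot() described above can be tried on simulated data. In this sketch the objects x, y and f are hypothetical, and getS3method() is used only to show that a separate method for factors exists:

```r
set.seed(1)                                  # reproducible simulated data (an assumption)
x <- runif(20)                               # numeric vector
y <- 2 * x + rnorm(20, sd = 0.2)             # numeric vector related to x
f <- factor(rep(c("a", "b"), each = 10))     # factor with two levels

layout(matrix(1:2, ncol = 2))                # two panels side by side
plot(x, y)                                   # numeric, numeric: scatterplot
plot(f, y)                                   # factor, numeric: one boxplot per level
layout(matrix(1))                            # back to a single panel

plot_factor_method <- getS3method("plot", "factor")  # method chosen for a factor
```

Typing methods(plot) at the prompt lists the classes for which a dedicated plot method is currently available.
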
A very powerful and flexible R package for producing elegant graphics is ‘ggplot2’. The interested reader is referred to specialized books, courses and online materials.