A First Practical Session with R



Begin by starting R. From the Start menu, go to

Start → Programs → Teaching → Maths

Click on Maths. You will then find various icons, including one for R. Double-click on this icon. You should then find that the screen displays a window headed

R Console

After the red > prompt, type

data(faithful)

to load the faithful data frame, which has 272 rows and 2 columns: the duration of each eruption (eruptions) and the waiting time until the next eruption (waiting) for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. Typing

names(faithful)

will give you the names of the variables contained in this data frame, while typing

?faithful

will give a more detailed description of the data frame. Typing

faithful

will display all the data (but you will need to scroll back with the right-hand scroll bar to see it all). You can refer to the first column alone as faithful[,1] or as faithful[,"eruptions"], and similarly for the second column, but it makes life a bit easier to type

attach(faithful)

after which you can refer to the first column simply as eruptions and to the second simply as waiting. You can refer to the third element of waiting simply as waiting[3], and this is in fact the same as faithful[3,2] since waiting constitutes the second column of faithful.
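The equivalence between these ways of indexing can be checked directly. The following is only a small sketch; it avoids attach so that it runs on its own, and the $ form is an extra not used in the text above.

```r
# faithful is supplied with R, so nothing needs to be downloaded.
data(faithful)

# Three ways of picking out the second column (waiting).
w1 <- faithful[, 2]
w2 <- faithful[, "waiting"]
w3 <- faithful$waiting        # the $ form is an addition to the text above

# The third waiting time, by vector indexing and by row/column indexing.
w1[3]
faithful[3, 2]
identical(w1[3], faithful[3, 2])
```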

Try finding some simple descriptive statistics. For example, if you simply type

mean(waiting)

the mean waiting time will be displayed, whereas if you type

m <- mean(waiting)

then the variable m will be given a value equal to this mean. You can then type m to see its value, but you can also use it in further calculations, which may make formulae look less cumbersome than if you use mean(waiting) every time. (You may care to note that older versions of S and R allowed _ to be used instead of <- for assignment; this is no longer possible in current versions of R, where _ is simply a character that can appear in names. In the words of Venables and Ripley, "We regard the use of _ for assignment as unreadable". They also point out that, "Assignments using the right-hand pointing combination -> are also allowed to make assignments in the opposite direction, but these are never needed and are little used in practice.")
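As one possible illustration of reusing a stored value, the sketch below computes the deviations of the waiting times from their mean. The name dev is made up for this example, and waiting is extracted explicitly so that the lines run without attach.

```r
data(faithful)
waiting <- faithful$waiting   # the same vector that attach(faithful) exposes

m <- mean(waiting)            # store the mean once ...
dev <- waiting - m            # ... and reuse it: deviations from the mean

sum(dev)                      # zero, apart from rounding error
```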

Some other descriptive statistics, the meanings of which are pretty obvious, are given by median, var and sd. You may also care to see what results from quantile. Naturally, you can refer to the second of the numbers which are printed out when you type quantile(waiting) as quantile(waiting)[2]. You may also care to try summary or fivenum. The number of observations can be found from length.
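A short sketch of these functions applied to waiting follows. Picking out a quantile by name as well as by position is an extra not mentioned above, and waiting is extracted explicitly so that the lines run without attach.

```r
data(faithful)
waiting <- faithful$waiting

median(waiting)
var(waiting)
sd(waiting)                 # the square root of var(waiting)

q <- quantile(waiting)      # named vector: 0%, 25%, 50%, 75%, 100%
q[2]                        # the lower quartile, picked out by position
q["25%"]                    # the same element, picked out by name

summary(waiting)
fivenum(waiting)            # Tukey's five-number summary
length(waiting)             # number of observations
```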

Sometimes you want not just the mean of one variable, but the means of several. In this connection it is worth looking at the result of

apply(faithful,2,mean)

(or, e.g., apply(faithful,2,sum) or apply(faithful,1,mean)). The second argument says whether the function is applied to each row (1) or to each column (2).
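The calls above can be compared side by side as in the sketch below. The variable names are made up for this example, and colMeans is a shortcut that the text does not mention. Note that the row means mix minutes of eruption with minutes of waiting, so they are useful here only to show what the 1 does.

```r
data(faithful)

col_means <- apply(faithful, 2, mean)   # one mean per column
col_sums  <- apply(faithful, 2, sum)    # one sum per column
row_means <- apply(faithful, 1, mean)   # one mean per row, so 272 values

col_means
length(row_means)

# colMeans() is a built-in shortcut for the column means.
colMeans(faithful)
```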

You should then try some simple graphical techniques. Investigate the results of boxplot(waiting) or stem(waiting) (you can vary the display of the latter by, e.g., typing stem(waiting,scale=2)).
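A minimal sketch of these displays is given below. Storing the result of boxplot in a variable (here called b, a name chosen for this example) is optional; it simply lets you see the numbers behind the picture.

```r
data(faithful)
waiting <- faithful$waiting

# boxplot() draws the plot and also returns the numbers it used.
b <- boxplot(waiting, ylab = "Waiting time (min)")
b$stats                     # the five values that define the box and whiskers

# Stem-and-leaf displays are printed in the console, not in a plot window.
stem(waiting)
stem(waiting, scale = 2)    # a larger scale spreads the display over more lines
```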

Sometimes we want to investigate a subset of the data. We can display the values of waiting for which eruptions is less than 3 by waiting[eruptions < 3], and we can similarly refer to eruptions[eruptions < 3]. If you need two conditions they can be joined by | for 'or' as in

waiting[(eruptions<3)|(eruptions>5)]

or by & for 'and' (and for that matter you can use ! for 'not').
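The logical operators can be tried out as in the following sketch. The names short, extreme and middle are made up for this example, and the columns are extracted explicitly so that the lines run without attach.

```r
data(faithful)
eruptions <- faithful$eruptions
waiting   <- faithful$waiting

short <- eruptions < 3                  # TRUE/FALSE, one value per eruption
sum(short)                              # how many eruptions lasted under 3 minutes
mean(waiting[short])                    # mean waiting time after those eruptions
mean(waiting[!short])                   # ... and after all the others

# 'or': eruptions that were either shorter than 3 or longer than 5 minutes.
extreme <- waiting[(eruptions < 3) | (eruptions > 5)]

# 'and': eruptions between 3 and 5 minutes inclusive.
middle <- waiting[(eruptions >= 3) & (eruptions <= 5)]

# The two subsets together account for every observation.
length(extreme) + length(middle) == length(waiting)
```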

An Exploration of the Data Set

A very similar data set is examined by S Chatterjee, M S Handcock and J S Simonoff, A Casebook for a First Course in Statistics and Data Analysis, New York, etc: Wiley 1995 (SF 2 CHA).

Try to examine the data in the way that they do. Begin by looking at histograms and box-and-whisker plots of intereruption times and scatter plots of intereruption times (waiting times) against eruption duration times. Look for a suitable definition of a 'short' duration, in the sense that it appears that the scatterplot falls into two distinct parts depending on whether the eruption concerned is short or long. Then try a parallel box-and-whisker plot showing such plots of intereruption times for short and long eruption times side by side.
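One way of producing these plots is sketched below. The cut-off d = 3 is only a placeholder, not a recommended answer; choose your own value after looking at the scatter plot. The name len and the axis labels are likewise made up for this example.

```r
data(faithful)
eruptions <- faithful$eruptions
waiting   <- faithful$waiting

# Histogram and box-and-whisker plot of the intereruption (waiting) times.
hist(waiting, main = "Intereruption times", xlab = "Waiting time (min)")
boxplot(waiting, ylab = "Waiting time (min)")

# Scatter plot of waiting time against eruption duration.
plot(eruptions, waiting,
     xlab = "Eruption duration (min)", ylab = "Waiting time (min)")

d <- 3                        # placeholder cut-off between short and long
abline(v = d, lty = 2)        # mark the cut-off on the scatter plot

# Parallel box-and-whisker plots for short and long eruptions.
len <- factor(ifelse(eruptions < d, "short", "long"),
              levels = c("short", "long"))
boxplot(waiting ~ len, xlab = "Eruption duration", ylab = "Waiting time (min)")
```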

Try to guess a simple prediction rule of the form that short eruptions will be followed by intereruption intervals of length x while long ones will be followed by intervals of length y, for suitable values of x and y. You can then define a variable representing your prediction by using a construction of the form

predwaiting <- ifelse(eruptions < d, x, y)

If d is the value you choose to distinguish short from long eruption times, then such a variable takes the value x if the condition eruptions < d is true and otherwise ('else') takes the value y.

You can then find the error with your prediction rule by

error <- predwaiting - waiting

You could then try a histogram or a boxplot of the errors or a Q-Q plot to see how closely they appear to be normally distributed.
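Putting these steps together might look like the sketch below. The values of d, x and y are arbitrary placeholders chosen only so that the code runs; replace them with your own guesses.

```r
data(faithful)
eruptions <- faithful$eruptions
waiting   <- faithful$waiting

# Placeholder values only; substitute your own cut-off and predictions.
d <- 3      # eruptions shorter than d minutes count as 'short'
x <- 55     # predicted waiting time after a short eruption
y <- 80     # predicted waiting time after a long eruption

predwaiting <- ifelse(eruptions < d, x, y)
error <- predwaiting - waiting

summary(error)

hist(error, main = "Prediction errors", xlab = "Predicted minus observed (min)")
boxplot(error, ylab = "Predicted minus observed (min)")

# Normal Q-Q plot: points close to the line suggest approximate normality.
qqnorm(error)
qqline(error)
```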

Try to see how well your rule is obeyed by the two data sets provided by Chatterjee et al., which can be found at

geyser1.dat

and

geyser2.dat

Try plotting the errors using your rule with the original data frame faithful and with the two data sets provided by Chatterjee et al.

This is just a suggestion. Other ways of exploring the data may occur to you; at this stage the most important thing is to get used to using R to carry out simple exploration of data. You may like to know that there are quite a lot of data sets supplied with R; for a list, type

data()