
Guide to R



Installation

R is installed on the network at the University of York. Alternatively, you can install it on your own computer; for details, click here.

Files and Directories

All data files relevant to the course on Statistics I can be found on the web starting from

../st1/welcome.htm [Broken link SPE 2017/06/19]

You can then copy files of mine. For example, Anscombe's file of four different data sets leading to exactly the same regression is held in the sub-directory of st1 called progs (for programs relating to Statistics I) under the name anscombe.dat. You can find a list of programs in this area by looking at the web area

../st1/progs/ [Broken link SPE 2017/06/19]

and that particular file can be found by looking at the web area

../st1/progs/anscombe.dat [Broken link SPE 2017/06/19]

You can copy the data over to your own area and name it by clicking File then Save as... (or equivalently by pressing ALT and F either in succession or together, followed by S). The File name will then show up as anscombe_dat. Alter this to just anscombe and alter Save as type to Text file (*.txt). When you click Save, the file should be saved as anscombe.txt in the directory or subdirectory you have chosen to save it in.

All of the data files used in the courses on Applied Statistics and Multivariate Analysis can be found on the web. Data files usually have the extension .dat, while R programs have the extension .r and Genstat programs have the extension .gst.

A first session with R

If you know nothing about subdirectories (folders), it will do no harm if you leave all your files in the top directory of your M: drive. If you do want to use a subdirectory, decide on a suitable subdirectory in which you are going to keep programs in R. In what follows, we shall suppose that you are going to keep your programs in a subdirectory of the top level directory on your M drive called rprogs (and that such a subdirectory already exists).

Get to the teaching programs installed by the Mathematics Department by

Start → Programs → Teaching → Maths

Click on Maths. You will then find various icons including one for R. Double click on this icon. You should then find that the screen displays a window headed

R Console

with text in blue beginning something like

R version 2.9.1 (2009-06-26)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

and ending with a > sign in red. At the top of the window are the words

File Edit Misc Packages Windows Help

Click on File (or alternatively press ALT and F either in succession or together) and a menu appears which includes Change dir.... Click on this (or alternatively press C) and a window appears headed

Change working directory to:

Change the working directory to read

M:/rprogs

(or if you are leaving files in your top directory to M:). This step can alternatively be achieved by going

setwd("M:/rprogs")

or by going

setwd("M:\\rprogs")

next to the red > sign which the machine has typed in the R Console window (yes, it does have to be a forward slash in setwd, although actually you can use two backslashes instead). You can check that you have got the right directory by typing

getwd()

after the red > sign which the machine has typed, and you can find what files are in the directory by typing the command

dir()

(list.files() can be used instead of dir()).

You can see the content of any file in your working directory in a new window by the command

file.show("fred.txt")

(substituting the right name for fred.txt, of course).
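
As a rough sketch of how these directory commands fit together, the following uses a throw-away folder under tempdir() in place of M:/rprogs. The folder and the one-line fred.txt are stand-ins invented for illustration and are not part of the course files.

```r
# Stand-in for M:/rprogs: a scratch folder that is safe to create anywhere.
target <- file.path(tempdir(), "rprogs")
if (!dir.exists(target)) dir.create(target)

old <- setwd(target)                 # setwd() invisibly returns the previous directory
print(getwd())                       # check that the change took effect

writeLines("10 8.04", "fred.txt")    # a made-up one-line file to look at
print(dir())                         # list.files() gives the same listing
print(readLines("fred.txt"))         # non-interactive alternative to file.show()

setwd(old)                           # return to the original working directory
```

Saving the value returned by setwd() makes it easy to go back to where you started once you have finished.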

Next to the red > sign which the machine has typed, type x <- 6 (followed by carriage return) to assign the value 6 to the variable x (the two symbols <- are meant to indicate an arrow showing that the value 6 is put into the variable x, and in fact you could have written 6 -> x). To print out the value of a variable once it has been assigned you can go print(x), but if you are working interactively it suffices simply to type the name of the variable (followed, as always, by carriage return). If, on the other hand, you are running a program from a file using source as described below, then it is necessary to use print(x) rather than just x.

Usually we want variables that take a large number of values, and it is not much harder to give a string of values to a variable. So we can let x be a variable taking the 5 values 9, 11, 1.1, 2 and 3.3 by x <- c(9,11,1.1,2,3.3). If you want to put in a large number of values, it is useful to keep everything visible at once, and for this purpose it is worth noting that if you end a line in such a way that it is "obvious" that something more is to come, so in this case with a comma, then the program knows to expect more data on the next line and indicates this by beginning that line with a + sign instead of a > sign, as for example

> u <- c(14.40,15.20,11.30, 2.50,22.70,14.90,
+ 1.41,15.81, 4.19,15.39,17.25, 9.52)

(where > and + are, of course, typed by the machine).

You can then produce simple statistics such as the mean and variance by mean(x) (followed, as always, by carriage return) and var(x). The standard deviation results from sd(x) or from sqrt(var(x)). If you want to store these values you might care to go mu <- mean(x) and s <- sd(x). After that, s/mu would give the value of the coefficient of variation (i.e. the standard deviation divided by the mean).
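
A minimal sketch of these summary statistics, using the five values given earlier and taking the coefficient of variation to be the standard deviation divided by the mean (the names v and cv are introduced here purely for illustration):

```r
x <- c(9, 11, 1.1, 2, 3.3)

mu <- mean(x)      # arithmetic mean
v  <- var(x)       # sample variance (divisor n - 1)
s  <- sd(x)        # sample standard deviation

# sd(x) and sqrt(var(x)) agree up to rounding error
print(all.equal(s, sqrt(v)))

cv <- s / mu       # coefficient of variation: standard deviation over mean
print(c(mean = mu, sd = s, cv = cv))
```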

Simple graphics are easily obtained. The command boxplot(x) prints out a standard box and whisker plot in a new window headed

R Graphics: Device 2 (ACTIVE)

The command stem(x) prints out a stem-and-leaf plot (in this case in the R Console window). If you want to be able to go back to previous plots, it is helpful to ensure that R keeps a record of them. For this purpose, you should note that the R Graphics window has at the top

File History Resize Windows

If you click on History (or alternatively press ALT and H either in succession or together) you see a menu including Recording. Click on this (or alternatively press R) and all plots from then on are stored so that they can be retrieved. Once recording is started you can go to earlier and later plots by using the PgUp and PgDown keys. Recording can also be started by use of the command windows(record=TRUE).

If you have two variables, then boxplot(x,y) gives boxplots for the two variables side by side. If they have the same length (i.e. the value of length(x) equals that of length(y)), then you can plot one against the other by plot(x,y). Actually plot(x) is meaningful - it plots x against the numbers 1, 2, 3, ....
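
The graphics commands so far can be tried together along the following lines. This is only a sketch; the vectors x1 and y1 hold the first two columns of Anscombe's data (the file anscombe.dat mentioned earlier), typed in directly so that the example stands alone.

```r
# First two columns of Anscombe's data, entered by hand
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y1 <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)

stem(y1)                 # stem-and-leaf plot, printed in the console
boxplot(x1, y1)          # two box-and-whisker plots side by side

# A scatter plot needs vectors of equal length
if (length(x1) == length(y1)) {
  plot(x1, y1)
}
plot(y1)                 # y1 against the index 1, 2, 3, ...
```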

There are some useful abbreviations. Thus lots <- rep(2.2,99) produces a variable lots with 99 values, all equal to 2.2. Also if x is c(5,6,7) and y is c(3,4,5) then xandy <- c(x,y) results in a variable xandy consisting of 5, 6, 7, 3, 4, 5. Further sxandy <- sort(xandy) results in sorting the values into increasing order.
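
These abbreviations can be checked directly. The last line is an addition for illustration only; it uses the decreasing argument of sort, which the text above does not mention.

```r
lots <- rep(2.2, 99)          # 99 copies of 2.2
print(length(lots))

x <- c(5, 6, 7)
y <- c(3, 4, 5)
xandy  <- c(x, y)             # x followed by y
sxandy <- sort(xandy)         # increasing order
print(xandy)
print(sxandy)

# Illustration only: the same values in decreasing order
print(sort(xandy, decreasing = TRUE))
```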

While most books tend to talk about rep, I find gl (short for 'generate levels') easier to use for many purposes. The command x <- gl(l,m,n) sets x to equal a sequence of total length n consisting of the first l integers each repeated m times. Thus x <- gl(3,4,14) is equivalent to

x <- c(1,1,1,1,2,2,2,2,3,3,3,3,1,1)

while x <- gl(4,2,9) is equivalent to

x <- c(1,1,2,2,3,3,4,4,1)

For future reference, note that any result of gl() is a factor and if a numeric variable is required you would need

x <- as.numeric(gl(l,m,n))
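
The difference between the factor and its numeric version can be seen in a short sketch such as the following. The final line shows what I take to be the corresponding call to rep; it is included for comparison only.

```r
f <- gl(3, 4, 14)             # factor with levels "1", "2", "3"
print(f)
print(is.factor(f))

# as.numeric() returns the level codes, which coincide with the
# labels here because gl() labels its levels 1, 2, ..., l
x <- as.numeric(f)
print(x)

# Arithmetic works on the numeric version but not on the factor
print(mean(x))

# Comparable pattern with rep(): each value four times, cut to length 14
print(rep(1:3, each = 4, length.out = 14))
```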

Reading data from files

If you copied Anscombe's data to a file in the working directory you are using (rprogs, or whatever you have called it) and called this file fred.txt, then

d <- read.table("fred.txt")

results in a value for d which if you print it out turns out to look like

   V1    V2 V3   V4 V5    V6 V7    V8
1  10  8.04 10 9.14 10  7.46  8  6.58
2   8  6.95  8 8.14  8  6.77  8  5.76
3  13  7.58 13 8.74 13 12.74  8  7.71
4   9  8.81  9 8.77  9  7.11  8  8.84
5  11  8.33 11 9.26 11  7.81  8  8.47
6  14  9.96 14 8.10 14  8.84  8  7.04
7   6  7.24  6 6.13  6  6.08  8  5.25
8   4  4.26  4 3.10  4  5.39 19 12.50
9  12 10.84 12 9.13 12  8.15  8  5.56
10  7  4.82  7 7.26  7  6.42  8  7.91
11  5  5.68  5 4.74  5  5.73  8  6.89

Strictly speaking, d is not a matrix but a data frame, but for many purposes it behaves rather like a matrix. The assignment e <- d[3,4] sets e equal to the element in row 3, column 4, namely 8.74, while r <- d[3,] sets r equal to the third row, namely

  V1   V2 V3   V4 V5    V6 V7   V8
3 13 7.58 13 8.74 13 12.74  8 7.71

Similarly cc <- d[,4] sets cc (note it is best to avoid calling a variable c so that we do not confuse it with the function c that constructs vectors) equal to the fourth column, namely

9.14 8.14 8.74 8.77 9.26 8.10 6.13 3.10 9.13 7.26 4.74

Actually you can refer to the fourth column as d$V4 or d[["V4"]] if you so wish (d["V4"] also works, but gives a data frame with one column rather than a plain vector).
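
The following sketch illustrates these ways of picking out parts of a data frame. So that it does not depend on fred.txt being present, it reads the first three rows of Anscombe's data from a text string; the text argument of read.table is used here only for that convenience.

```r
# First three rows of Anscombe's data, read from text rather than from fred.txt
d <- read.table(text = "
10 8.04 10 9.14 10  7.46 8 6.58
 8 6.95  8 8.14  8  6.77 8 5.76
13 7.58 13 8.74 13 12.74 8 7.71
")

e  <- d[3, 4]        # single element: row 3, column 4
r  <- d[3, ]         # third row, still a data frame
cc <- d[, 4]         # fourth column as a plain numeric vector

print(e)
print(r)
print(cc)

# Three ways of naming the fourth column
print(is.numeric(d$V4))        # a vector
print(is.numeric(d[["V4"]]))   # a vector
print(is.data.frame(d["V4"]))  # a one-column data frame
```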

If we take x1 as the first column and y1 as the second (e.g. by x1 <- d[,1] and y1 <- d[,2]), we can get the plot of one against the other by plot(x1,y1) as described above. You will notice that the default is to place an o at each point; if, e.g., you prefer stars, you can get this by plot(x1,y1,pch="*").

Sometimes you want to have a line plot rather than a dot plot. Thus the population of England and Wales in census years can be entered by

pop <- c(8.89, 10.16, 12.00, 13.90, 15.91, 17.93, 20.07, 22.71, 25.97, 29.00, 32.53, 36.07, 37.89, 39.95)

You can plot this by plot(year,pop), where year holds the corresponding census years, but you may prefer a line plot, which you can obtain by plot(year,pop,type="l"), the "l" being for line.
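
The variable year is not defined above. A sketch under the assumption that the fourteen figures belong to the ten-yearly censuses from 1801 to 1931 would be as follows; the option type="b" is an extra shown for comparison.

```r
# Assumption: one figure per ten-yearly census from 1801 to 1931
year <- seq(1801, 1931, by = 10)
pop  <- c(8.89, 10.16, 12.00, 13.90, 15.91, 17.93, 20.07,
          22.71, 25.97, 29.00, 32.53, 36.07, 37.89, 39.95)

stopifnot(length(year) == length(pop))   # plot() needs matching lengths

plot(year, pop)                # points
plot(year, pop, type = "l")    # line
plot(year, pop, type = "b")    # both points and lines
```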

Some data sets come automatically with R. If you include the line

data("morley")

then morley or print(morley) will result in the printing of all Michelson and Morley's speed of light data, names(morley) will result in the printing of the names of the variables involved (in this case "Expt", "Run" and "Speed"), while morley$Expt or print(morley$Expt) will result in the printing of the variable Expt alone. (Actually, if you include the line attach(morley), then it will suffice to refer to Expt alone rather than to morley$Expt.)
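
A brief sketch of working with this data set follows. The use of head, with and tapply is my own choice for illustration; with gives access to the variables by name for a single expression without the need for attach.

```r
data("morley")                      # load the data set that ships with R

print(names(morley))                # names of the variables
print(head(morley, 3))              # first few rows only

# Mean speed within each experiment, without needing attach()
print(with(morley, tapply(Speed, Expt, mean)))
```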

As a minor point you may find it useful that if you highlight a command or commands and then go CTRL-C this command is stored and can be repeated by CTRL-V. You can also repeat commands by using the up arrow key.

You can select the entire contents of the R console window for pasting into a file for saving or printing by clicking Edit and then Select all (or equivalently by pressing ALT and E either successively or together and then S). This selection can then be copied by clicking Edit and then Copy (or equivalently by pressing ALT and E either in succession or together and then C). You can then use the appropriate command to paste the selection into your chosen file.

One way of dealing with the R Graphics window when active is to click File, then Copy to clipboard, then as a Bitmap (or equivalently to press ALT and F either in succession or together, then C, then B). These contents can then be placed in a wordprocessor. (The contents can also be written to the clipboard as a Metafile, but this is less likely to be useful.)

The contents of the R Graphics window if that is active can be saved as a Metafile, Postscript, PDF, Png, Bmp, TIFF or Jpeg file by clicking File, then Save as and then choosing as required. In the case of a Postscript file called, say, plot.ps, the result can be incorporated in a LaTeX file which contains the line

\usepackage{epsfig}

before \begin{document} by inserting

\epsfig{file=plot.ps,width=8cm,height=8cm}

at the place where you want the figure to appear (the width and height should, of course, be altered as desired, but if they are not specified the resulting size will probably not be what you want).

Running Programs from Files

If you copy a file of mine, e.g.

../st1/progs/scarletfever.txt [Broken link SPE 2017/06/19]

on to your rprogs subdirectory, you can run it simply by typing

source("scarletfever.txt")

(or whatever name you have given it - it may turn out to be simplest to copy it as a text file with extension .txt). You may need to copy a corresponding data file (in this case scarletfever.dat). You can arrange to store the resulting output in a file called scarletfever.out by

sink("scarletfever.out")
source("scarletfever.txt")
sink()

Note the file scarletfever.out will not be available for inspection until it is closed and output returned to the R console window by the command sink(). For some reason I am unable to explain, the version available on the network causes problems if you attempt to reuse the same file (and it is impossible to delete a file used as a sink by R by employing the functions file.remove or unlink), but this is not likely to cause a problem in practice.
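
The interplay of source, sink and print can be tried out without any of the course files. In the sketch below both the three-line script and the file names are invented for illustration and are created in temporary files.

```r
# A made-up three-line program, written to a temporary file
script <- tempfile(fileext = ".txt")
writeLines(c("x <- c(9, 11, 1.1, 2, 3.3)",
             "mean(x)",          # value is not shown when the file is sourced
             "print(sd(x))"),    # shown, because print() is called explicitly
           script)

out <- tempfile(fileext = ".out")
sink(out)            # divert output from the console to the file
source(script)
sink()               # close the file and return output to the console

captured <- readLines(out)
print(captured)      # only the line produced by print() was stored
```

Running source(script, echo=TRUE) instead would show each command together with its value, which can be convenient while a program is being developed.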

Of course, you can make files of your own using an editor and run them in a similar manner.

Emergency Interruption

If the program seems to be doing nothing for a long time, it can be interrupted by pressing ESC or CTRL-[ (i.e. CONTROL and left square bracket simultaneously).

Help

If you want to know about any particular R function, you can get a description simply by typing its name preceded by a question mark in the R Console window. Thus

?lm

will result in a new window with a description of the lm command (used for fitting linear models).

There is a fuller online help system which is obtainable when the R Console window is active by clicking HTML help (or equivalently by pressing ALT and H either in succession or together).

It is worth knowing about the function demo, which is a user-friendly interface to running some demonstration R scripts; demo() gives the list of available topics. It may also be worth knowing that example(topic) runs all the R code from the Examples part of R's online help for the given topic; for example, example(mean) runs the R code occurring under Help for 'Mean'.

You can also refer to books on S and S-plus which can be found in the University Library at SK 59 S. Probably the best one to start with is P Dalgaard, Introductory Statistics with R, New York, etc.: Springer-Verlag 2002 (SK 59 R/D); others worth knowing about include A Krause and M Olson, The Basics of S and S-PLUS (2nd edn), New York, etc.: Springer-Verlag 2000 (SK 59 S/K), B S Everitt, Statistical Analyses using S-plus, Boca Raton, FL, etc.: Chapman and Hall/CRC 1994 (SK 59 S/E), J Fox, An R and S-Plus Companion to Applied Regression, Thousand Oaks, CA: Sage (SF 2.5 FOX), and W N Venables and B D Ripley, Modern Applied Statistics with S (4th edn), New York, etc.: Springer 2002 (SK 59 S/V). There is also a supplement called 'R' Complements to Modern Applied Statistics with S-Plus which is on the web at

http://www.stats.ox.ac.uk/pub/MASS4/

Those of you taking the course on Bayesian statistics may find J Albert, Bayesian Computation with R, New York, etc.: Springer-Verlag 2007 (SF 4 ALB) of use. The definitive reference guide is M J Crawley, The R Book, Chichester: Wiley 2007 (SK 59 R/C).

Basically, R does very nearly all of the things which can be done by S-plus (but not all) and has the immense advantage that it is completely free. Almost always the examples in books about S and S-plus work without any alteration in R. Information about S-plus can be found on the web at

http://www.insightful.com/products/splus/ [Broken link SPE 2017/06/19]

The current version number of R is printed out when the session starts; otherwise it can be obtained by clicking Help and then About (or equivalently by pressing ALT and H either in succession or together and then b).

A first practical session using R can be found on the web at

firstpractical.htm

It may also be useful to know that a list of data sets available in R can be found by going

data()

Some of the basic commands are described in the file basic_r.htm while a sample session can be found as a pdf file on the web at samplesession.pdf. The file samplesession.htm contains the LaTeX source and samplesession.txt contains the unadorned R code; however, it is anticipated that the firstpract files will be more useful.

Some useful links

Peter M Lee

Revised 21 August 2009