
Basic Concepts of R



Files and Directories

All data files relevant to the course on Statistical Theory I can be found on the web starting from

../st1/progs/ [Broken link SPE 2017/06/19]

You can then copy files of mine. For example, Anscombe's four different data sets, which all lead to exactly the same regression, are in a file called anscombe.dat in the sub-directory of st1 called progs (for programs relating to Applied Statistics). You can find a list of programs in this area by looking at the web area mentioned above, and that particular file can be found at

../st1/progs/anscombe.dat [Broken link SPE 2017/06/19]

You can copy the data over to your own area and name it by clicking File then Save as... (or equivalently by pressing ALT and F either in succession or together, followed by S). The File name box will then show anscombe_dat. Alter this to just anscombe and alter Save as type to Text file (*.txt). When you click Save, the file should be saved as anscombe.txt in the directory or subdirectory you have chosen to save it in.

All of the data files used in the courses on Applied Statistics and Multivariate Analysis can be found on the web. Data files usually have the extension .dat, while R programs have the extension .r and Genstat programs have the extension .gst.

A first session with R

If you know nothing about subdirectories (folders), it will do no harm if you leave all your files in the top directory of your M: drive. If you do want to use a subdirectory, decide on a suitable subdirectory in which you are going to keep programs in R. In what follows, we shall suppose that you are going to keep your programs in a subdirectory of the top level directory on your M drive called rprogs (and that such a subdirectory already exists).

Get to the teaching programs installed by the Mathematics Department by

Start → Programs → Teaching → Maths

Click on Maths. You will then find various icons including one for R. Double click on this icon. You should then find that the screen displays a window headed

R Console

with text in blue beginning

R version 2.9.1 (2009-06-26)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

and ending with a > sign in red. At the top of the window are the words

File Edit Misc Windows Help

Click on File (or alternatively press ALT and F either in succession or together) and a menu appears which includes Change dir. Click on this (or alternatively press C) and a window appears headed

Question

Change the working directory to read

M:/rprogs

(or, if you are leaving files in your top directory, to M:) and click on OK. This step can alternatively be achieved by typing

setwd("M:/rprogs")

next to the red > sign which the machine has typed in the R Console window (yes, it does have to be a forward slash in setwd, although actually you can use two backslashes instead). You can check that you have got the right directory by typing

getwd()

after the red > sign which the machine has typed, and you can find what files are in the directory by typing the command

dir()

(list.files can be used instead of dir).

You can see the content of any file in your working directory in a new window by the command

file.show("fred.txt")

(substituting the right name for fred.txt, of course).
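
As a rough sketch, the directory commands above can be tried out as follows. A temporary directory stands in for M:/rprogs so that the code runs on any machine; that substitution, and the contents written to fred.txt, are assumptions made purely for illustration.

```r
# A temporary directory stands in for M:/rprogs (an assumption for illustration)
old <- getwd()                        # remember the starting directory
setwd(tempdir())                      # change the working directory
getwd()                               # check that the change has taken effect
writeLines("some text", "fred.txt")   # create a small file so there is something to list
dir()                                 # list the files; list.files() does the same
found <- "fred.txt" %in% list.files() # TRUE if the file is in the working directory
# file.show("fred.txt")               # would display the file's contents
setwd(old)                            # go back to where we started
```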

Next to the red > sign which the machine has typed, type x <- 6 (followed by carriage return) to assign the value 6 to the variable x (the two symbols <- are meant to indicate an arrow showing that the value 6 is put into the variable x, and in fact you could have written 6 -> x). It is no longer permissible to use an underscore, as in x _ 6. To print out the value of a variable once it has been assigned you can type print(x), but if you are working interactively it suffices simply to type the name of the variable (followed, as always, by carriage return). If, on the other hand, you are running a program from a file using source as described below, then it is necessary to use print(x) rather than just x.

Usually we want variables that take a large number of values, and it is not much harder to give a string of values to a variable. So we can let x be a variable taking the 5 values 9, 11, 1.1, 2 and 3.3 by x <- c(9,11,1.1,2,3.3). If you want to put in a large number of values, it is useful to keep everything visible at once, and for this purpose it is worth noting that if you end a line in such a way that it is "obvious" that something more is to come, so in this case with a comma, then the program knows to expect more data on the next line and indicates this by beginning that line with a + sign instead of a > sign, as for example

> u <- c(14.40,15.20,11.30, 2.50,22.70,14.90,
+ 1.41,15.81, 4.19,15.39,17.25, 9.52)

(where > and + are, of course, typed by the machine).
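
The following short sketch collects the assignments described above in a form that could be pasted into the console or saved in a file (without the > and + prompts). The name y is introduced here only for illustration.

```r
x <- 6          # assign the value 6 to x
6 -> y          # the same assignment written the other way round
print(x)        # explicit printing, needed in a file run by source()
x               # at the prompt, typing the name is enough

# A variable taking five values, built with c()
x <- c(9, 11, 1.1, 2, 3.3)
length(x)       # number of values in x

# A longer vector split over two lines; the trailing comma shows more is to come
u <- c(14.40, 15.20, 11.30,  2.50, 22.70, 14.90,
        1.41, 15.81,  4.19, 15.39, 17.25,  9.52)
length(u)
```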

You can then produce simple statistics such as the mean and variance by mean(x) (followed, as always, by carriage return) and var(x). The standard deviation results from sd(x) or from sqrt(var(x)). If you want to store these values you might care to type mu <- mean(x) and s <- sd(x). After that, s/mu would give the value of the coefficient of variation (i.e. the standard deviation divided by the mean).
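
A small worked example, taking the coefficient of variation in its usual sense of the standard deviation divided by the mean. The name cv is an illustrative choice, not one used elsewhere in these notes.

```r
x  <- c(9, 11, 1.1, 2, 3.3)
mu <- mean(x)                 # arithmetic mean
v  <- var(x)                  # sample variance (divisor n - 1)
s  <- sd(x)                   # sample standard deviation
all.equal(s, sqrt(v))         # sd(x) agrees with sqrt(var(x))
cv <- s / mu                  # coefficient of variation
print(c(mean = mu, sd = s, cv = cv))
```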

Simple graphics are easily obtained. The command boxplot(x) prints out a standard box and whisker plot in a new window headed

R Graphics: Device 2 (ACTIVE)

The command stem(x) prints out a stem-and-leaf plot (in this case in the R Console window). If you want to be able to go back to previous plots, it is helpful to ensure that R keeps a record of them. For this purpose, you should note that the R Graphics window has at the top

File History Resize Windows

If you click on History (or alternatively press ALT and H either in succession or together) you see a menu including Recording. Click on this (or alternatively press R) and all plots from then on are stored so that they can be retrieved. Once recording is started you can go to earlier and later plots by using the PgUp and PgDown keys.

If you have two variables, then boxplot(x,y) gives boxplots for the two variables side by side. If they have the same length (i.e. the value of length(x) equals that of length(y)), then you can plot one against the other by plot(x,y). Actually plot(x) is meaningful: it plots x against the numbers 1, 2, 3, ....
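
The graphics commands above can be sketched together as follows. The values given to y are made up for illustration; any numeric variable of the same length as x would do.

```r
x <- c(9, 11, 1.1, 2, 3.3)
y <- c(3, 4, 5, 6, 7)        # hypothetical second variable, same length as x

boxplot(x)                   # box and whisker plot of x alone
stem(x)                      # stem-and-leaf plot, printed in the console
boxplot(x, y)                # boxplots of x and y side by side

same_length <- length(x) == length(y)
if (same_length) plot(x, y)  # scatter plot of y against x
plot(x)                      # x against the index 1, 2, 3, ...
```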

There are some useful abbreviations. Thus lots <- rep(2.2,99) produces a variable lots with 99 values, all equal to 2.2. Also if x is c(5,6,7) and y is c(3,4,5) then xandy <- c(x,y) results in a variable xandy consisting of 5, 6, 7, 3, 4, 5. Further sxandy <- sort(xandy) results in sorting the values into increasing order.

While most books tend to talk about rep, I find gl (short for 'generate levels') easier to use for many purposes. The command x <- gl(l,m,n) sets x to equal a sequence of total length n consisting of the first l integers each repeated m times (cycling round again if necessary). Thus x <- gl(3,4,14) is equivalent to

x <- c(1,1,1,1,2,2,2,2,3,3,3,3,1,1)

while x <- gl(4,2,9) is equivalent to

x <- c(1,1,2,2,3,3,4,4,1)

For future reference, note that any result of gl() is a factor and if a numeric variable is required you would need

x <- as.numeric(gl(l,m,n))
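
A brief sketch checking the abbreviations described above; the name g is introduced only for this illustration.

```r
lots <- rep(2.2, 99)         # 99 values, all equal to 2.2
x <- c(5, 6, 7)
y <- c(3, 4, 5)
xandy  <- c(x, y)            # 5 6 7 3 4 5
sxandy <- sort(xandy)        # 3 4 5 5 6 7

g <- gl(3, 4, 14)            # levels 1..3, each repeated 4 times, length 14
is.factor(g)                 # TRUE: gl() always returns a factor
as.numeric(g)                # 1 1 1 1 2 2 2 2 3 3 3 3 1 1
```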

Reading data from files

If you copied Anscombe's data to a file in the working directory rprogs (or whatever you have called it) you are using and called this file fred.txt then

d <- read.table("fred.txt")

results in a value for d which if you print it out turns out to look like

   V1    V2 V3   V4 V5    V6 V7    V8
1  10  8.04 10 9.14 10  7.46  8  6.58
2   8  6.95  8 8.14  8  6.77  8  5.76
3  13  7.58 13 8.74 13 12.74  8  7.71
4   9  8.81  9 8.77  9  7.11  8  8.84
5  11  8.33 11 9.26 11  7.81  8  8.47
6  14  9.96 14 8.10 14  8.84  8  7.04
7   6  7.24  6 6.13  6  6.08  8  5.25
8   4  4.26  4 3.10  4  5.39 19 12.50
9  12 10.84 12 9.13 12  8.15  8  5.56
10  7  4.82  7 7.26  7  6.42  8  7.91
11  5  5.68  5 4.74  5  5.73  8  6.89

Strictly speaking, d is not a matrix but a data frame, but for many purposes it behaves rather like a matrix. The assignment e <- d[3,4] sets e equal to the element in row 3, column 4, namely 8.74, while r <- d[3,] sets r equal to the third row, namely

  V1   V2 V3   V4 V5    V6 V7   V8
3 13 7.58 13 8.74 13 12.74  8 7.71

Similarly cc <- d[,4] sets cc (note it is best to avoid calling a variable c so that we do not confuse it with the function c that constructs vectors) equal to the fourth column, namely

9.14 8.14 8.74 8.77 9.26 8.10 6.13 3.10 9.13 7.26 4.74

Actually you can also refer to the fourth column by name, as d$V4 or d[,"V4"], if you so wish (d["V4"] gives the same column, but as a data frame with one column rather than as a vector).
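
The following self-contained sketch illustrates reading and indexing. So that it runs without the course files, it writes the first three rows of Anscombe's data to a temporary file; the temporary file and the names f and v4 are assumptions for illustration, standing in for fred.txt.

```r
# Write the first three rows of Anscombe's data to a temporary file
f <- tempfile(fileext = ".txt")
writeLines(c("10 8.04 10 9.14 10 7.46 8 6.58",
             " 8 6.95  8 8.14  8 6.77 8 5.76",
             "13 7.58 13 8.74 13 12.74 8 7.71"), f)

d <- read.table(f)   # columns are named V1, V2, ..., V8 by default
class(d)             # "data.frame"
e  <- d[3, 4]        # element in row 3, column 4
r  <- d[3, ]         # third row (a data frame with one row)
cc <- d[, 4]         # fourth column as a numeric vector
v4 <- d$V4           # the same column, referred to by name
```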

If we take x1 as the first column and y1 as the second (so x1 <- d[,1] and y1 <- d[,2]), we can get the plot of one against the other by plot(x1,y1) as described above. You will notice that the default is to place an o at each point; if, e.g., you prefer stars, you can get this by plot(x1,y1,pch="*").

Sometimes you want to have a line plot rather than a dot plot. Thus the population of England and Wales in census years can be entered by

pop <- c(8.89, 10.16, 12.00, 13.90, 15.91, 17.93, 20.07, 22.71, 25.97, 29.00, 32.53, 36.07, 37.89, 39.95)

You can plot this by plot(year,pop), where year is a variable of the same length holding the corresponding census years, but you may prefer a line plot, which you can obtain by plot(year,pop,type="l"), the "l" being for line.
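
A possible complete version of this example is sketched below. These notes do not say how year is defined; the sketch assumes the fourteen decennial census years from 1801 to 1931, which matches the number of values in pop.

```r
pop <- c(8.89, 10.16, 12.00, 13.90, 15.91, 17.93, 20.07,
         22.71, 25.97, 29.00, 32.53, 36.07, 37.89, 39.95)

# Assumption: decennial census years 1801, 1811, ..., 1931
year <- seq(1801, 1931, by = 10)
stopifnot(length(year) == length(pop))   # plot() needs equal lengths

plot(year, pop)                # dot plot
plot(year, pop, type = "l")    # line plot
```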

Some data sets come automatically with R. If you include the line

data("morley")

then morley or print(morley) will result in the printing of all Michelson and Morley's speed of light data, names(morley) will result in the printing of the names of the variables involved (in this case "Expt", "Run" and "Speed"), while morley$Expt or print(morley$Expt) will result in the printing of the variable Expt alone. (Actually, if you include the line attach(morley), then it will suffice to refer to Expt alone rather than to morley$Expt.)
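
For instance, the following sketch looks at the morley data. The use of with() is offered as one alternative to attach(morley) and is not taken from these notes.

```r
data("morley")
vars <- names(morley)      # "Expt" "Run" "Speed"
print(vars)
head(morley)               # first few rows of the data frame
print(morley$Expt)         # the variable Expt alone

# with() evaluates an expression using the columns of morley directly,
# without having to attach the data frame
with(morley, mean(Speed))
```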

As a minor point you may find it useful that if you highlight a command or commands and then go CTRL-C this command is stored and can be repeated by CTRL-V. You can also repeat commands by using the up arrow key.

You can select the entire contents of the R console window for pasting into a file for saving or printing by clicking Edit and then Select all (or equivalently by pressing ALT and E either successively or together and then S). This selection can then be copied by clicking Edit and then Copy (or equivalently by pressing ALT and E either in succession or together and then C). You can then use the appropriate command to paste the selection into your chosen file.

One way of dealing with the contents of the R Graphics window when it is active is to click File, then Copy to clipboard, then as a Bitmap (or equivalently to press ALT and F either in succession or together, then C, then B). The contents can then be pasted into a WordPerfect file, when the appropriate window is active, by CTRL-V. (The contents can also be written to the clipboard as a Metafile, but this is less likely to be useful.)

The contents of the R Graphics window if that is active can be saved as a Gif, Metafile or Postscript file by clicking File then Save as and then as appropriate (or equivalently by pressing ALT and F either in succession or together, then S then P or as desired). In the case of a Postscript file called, say, plot.ps, the result can be incorporated in a LaTeX file which contains the line

\usepackage{epsfig}

before \begin{document} by inserting

\epsfig{file=plot.ps,width=8cm,height=8cm}

at the place where you want the figure to appear (the width and height should, of course, be altered as desired, but if they are not specified the resulting size will probably not be what you want).

Running Programs from Files

If you copy a file of mine, e.g.

../st1/progs/marraigeage.txt [Broken link SPE 2017/06/19]

on to your rprogs subdirectory under the name marraigeage.r, together with the corresponding data file marraigeage.dat, you can run it simply by typing

source("marraigeage.r")

You can arrange to store the resulting output in a file called ex1.out by

sink("ex1.out")
source("marraigeage.r")
sink()

Note the file ex1.out will not be available for inspection until it is closed and output returned to the R console window by the command sink(). For some reason I am unable to explain, the version available on the network causes problems if you attempt to reuse the same file (and it is impossible to delete a file used as a sink by R by employing the functions file.remove or unlink), but this is not likely to cause a problem in practice.
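
The mechanism can be tried out without the course files by means of the sketch below. The two temporary files (prog and out) and the two-line program are hypothetical stand-ins for the program file and ex1.out.

```r
prog <- tempfile(fileext = ".r")     # hypothetical program file
out  <- tempfile(fileext = ".out")   # hypothetical output file

# A two-line program; print() is needed because it will be run by source()
writeLines(c("x <- c(9, 11, 1.1, 2, 3.3)",
             "print(mean(x))"), prog)

sink(out)       # divert output from the console to the file
source(prog)    # run the program
sink()          # close the file and return output to the console

result <- readLines(out)   # the file can now be inspected
print(result)
```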

Of course, you can make files of your own using an editor and run them in a similar manner.

Emergency Interruption

If the program seems to be doing nothing for a long time, it can be interrupted by pressing ESC or CTRL-[ (i.e. CONTROL and left square bracket simultaneously).

Help

If you want to know about any particular R function, you can get a description simply by typing its name preceded by a question mark in the R Console window. Thus

?lm

will result in a new window with a description of the lm command (used for fitting linear models).

There is a fuller online help system which is obtainable when the R Console window is active by clicking Help (or equivalently by pressing ALT and H either in succession or together). The most useful parts are a description of all the Functions and Packages together with Search Engines and Keywords which are obtained by clicking R language html (or equivalently by pressing H or h after ALT-H).

It is worth knowing about the function demo, which is a user-friendly interface for running some demonstration R scripts; demo() gives the list of available topics. It may also be worth knowing that example(topic) runs all the R code from the Examples part of R's online help for topic; for example, example(mean) runs the R code occurring under Help for mean.
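
The help facilities mentioned above can be gathered into a short sketch. How the output is displayed (a new window, a browser or the console) depends on how R is being run.

```r
help("lm")       # the same as typing ?lm
example(mean)    # run the code from the Examples section of the help for mean
demo()           # list the available demonstrations
```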

Some useful data sets can be accessed from

datasets.htm

You can also refer to books on S and S-plus which can be found in the University Library at SK 59 S. Probably the best one to start with is P Dalgaard, Introductory Statistics with R, New York, etc.: Springer-Verlag 2002 (SK 59 R/D); others worth knowing about include A Krause and M Olson, The Basics of S and S-PLUS (2nd edn), New York, etc.: Springer-Verlag 2000 (SK 59 S/K), B S Everitt, Statistical Analyses using S-plus, Boca Raton, FL, etc.: Chapman and Hall/CRC 1994 (SK 59 S/E), J Fox, An R and S-Plus Companion to Applied Regression, Thousand Oaks, CA: Sage (SF 2.5 FOX), and W N Venables and B D Ripley, Modern Applied Statistics with S (4th edn), New York, etc.: Springer 2002 (SK 59 S/V). There is also a supplement called 'R' Complements to Modern Applied Statistics with S-Plus which is on the web at

http://www.stats.ox.ac.uk/pub/MASS4/

Basically, R does a large proportion of the things which can be done by S (but not all) and has the immense advantage that it is completely free. You can install it on your own PC if you have one; if you are interested you should refer to the web page

install.htm

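Individual packages that are installed but not loaded automatically can be loaded easily from within R by clicking Packages then Load package...; packages that are not yet installed can be obtained by clicking Packages then Install package(s)....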
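
The same things can be done by typed commands, roughly as sketched here. The package splines is used because it comes with R but is not loaded by default; the package name in the commented-out line is a placeholder, and that line is left commented out because it needs an internet connection.

```r
# install.packages("somepackage")   # placeholder name; fetches a package from CRAN

library(splines)                    # load an installed package into the session
loaded <- "package:splines" %in% search()
print(loaded)                       # TRUE once the package is attached
```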

Almost always the examples in books about S and S-plus work without any alteration in R. Information about S-plus can be found on the web at

http://www.insightful.com/products/splus/ [Broken link SPE 2017/06/19]

The current version number of R is printed out when the session starts; otherwise it can be obtained by clicking Help and then About (or equivalently by pressing ALT and H either in succession or together and then b).

A first practical session using R can be found on the web at

firstpractical.htm

It may also be useful to know that a list of data sets available in R can be found by going

data()

There is a 105-page Introduction to R which is obtainable by clicking Help, then Manuals, then An Introduction to R (or equivalently by pressing ALT and H either successively or together, then M, then A and not I as you might expect). Hard copies of this can be found in the University Library at QUARTO SK 59 R/R. The sample session at the end of this document (slightly adapted) can be found as a pdf file on the web at

samplesession.pdf

while samplesession.htm contains the LaTeX source and samplesession.txt contains the unadorned R code; however, it is anticipated that the firstpract files will be more useful.