R Tutorial

Introduction

read.table {utils}

Description

Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.

Usage

read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
row.names, col.names, as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

read.csv(file, header = TRUE, sep = ",", quote = "\"",
dec = ".", fill = TRUE, comment.char = "", ...)

read.csv2(file, header = TRUE, sep = ";", quote = "\"",
dec = ",", fill = TRUE, comment.char = "", ...)

read.delim(file, header = TRUE, sep = "\t", quote = "\"",
dec = ".", fill = TRUE, comment.char = "", ...)

read.delim2(file, header = TRUE, sep = "\t", quote = "\"",
dec = ",", fill = TRUE, comment.char = "", ...)
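
The four wrappers above differ from read.table only in their defaults for header, sep, dec, fill and comment.char. As a minimal sketch, assuming two small illustrative strings (csv_text and csv2_text are made-up data, not part of the original notes), the same table can be read in either convention:

```r
# The same illustrative data written in the two conventions the wrappers expect.
csv_text  <- "item,price\napple,1.5\npear,2.25"
csv2_text <- "item;price\napple;1,5\npear;2,25"

a <- read.csv(text = csv_text)    # sep = ",", dec = "."
b <- read.csv2(text = csv2_text)  # sep = ";", dec = ","

# Both calls should produce the same numeric price column.
identical(a$price, b$price)
```

The text argument is passed through ... to read.table, so no file on disk is needed to try this.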

Arguments

file	
the name of the file which the data are to be read from. Each row of the table appears as one line of the file. If it does not contain an absolute path, the file name is relative to the current working directory, getwd(). Tilde-expansion is performed where supported. This can be a compressed file (see file).

Alternatively, file can be a readable text-mode connection (which will be opened for reading if necessary, and if so closed (and hence destroyed) at the end of the function call). (If stdin() is used, the prompts for lines may be somewhat confusing. Terminate input with a blank line or an EOF signal, Ctrl-D on Unix and Ctrl-Z on Windows. Any pushback on stdin() will be cleared before return.)

file can also be a complete URL. (For the supported URL schemes, see the ‘URLs’ section of the help for url.)

header
a logical value indicating whether the file contains the names of the variables as its first line. If missing, the value is determined from the file format: header is set to TRUE if and only if the first row contains one fewer field than the number of columns.

sep
the field separator character. Values on each line of the file are separated by this character. If sep = "" (the default for read.table) the separator is ‘white space’, that is one or more spaces, tabs, newlines or carriage returns.

quote
the set of quoting characters. To disable quoting altogether, use quote = "". See scan for the behaviour on quotes embedded in quotes. Quoting is only considered for columns read as character, which is all of them unless colClasses is specified.

dec
the character used in the file for decimal points.

numerals
string indicating how to convert numbers whose conversion to double precision would lose accuracy, see type.convert. Can be abbreviated. (Applies also to complex-number inputs.)

row.names
a vector of row names. This can be a vector giving the actual row names, or a single number giving the column of the table which contains the row names, or character string giving the name of the table column containing the row names.

If there is a header and the first row contains one fewer field than the number of columns, the first column in the input is used for the row names. Otherwise if row.names is missing, the rows are numbered.

Using row.names = NULL forces row numbering. Missing or NULL row.names generate row names that are considered to be ‘automatic’ (and not preserved by as.matrix).

col.names
a vector of optional names for the variables. The default is to use "V" followed by the column number.

as.is
when stringsAsFactors is TRUE (the default in R versions before 4.0.0), read.table converts character variables (which are not converted to logical, numeric or complex) to factors. The argument as.is controls the conversion of columns not otherwise specified by colClasses. Its value is either a vector of logicals (values are recycled if necessary), or a vector of numeric or character indices which specify which columns should not be converted to factors.

Note: to suppress all conversions including those of numeric columns, set colClasses = "character".

Note that as.is is specified per column (not per variable) and so includes the column of row names (if any) and any columns to be skipped.

na.strings
a character vector of strings which are to be interpreted as NA values. Blank fields are also considered to be missing values in logical, integer, numeric and complex fields. Note that the test happens after white space is stripped from the input, so na.strings values may need their own white space stripped in advance.

colClasses
character. A vector of classes to be assumed for the columns. If unnamed, recycled as necessary. If named, names are matched with unspecified values being taken to be NA.

Possible values are NA (the default, when type.convert is used), "NULL" (when the column is skipped), one of the atomic vector classes (logical, integer, numeric, complex, character, raw), or "factor", "Date" or "POSIXct". Otherwise there needs to be an as method (from package methods) for conversion from "character" to the specified formal class.

Note that colClasses is specified per column (not per variable) and so includes the column of row names (if any).

nrows
integer: the maximum number of rows to read in. Negative and other invalid values are ignored.

skip
integer: the number of lines of the data file to skip before beginning to read data.

check.names
logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names. If necessary they are adjusted (by make.names) so that they are, and also to ensure that there are no duplicates.

fill
logical. If TRUE then in case the rows have unequal length, blank fields are implicitly added. See ‘Details’.

strip.white
logical. Used only when sep has been specified, and allows the stripping of leading and trailing white space from unquoted character fields (numeric fields are always stripped). See scan for further details (including the exact meaning of ‘white space’), remembering that the columns may include the row names.

blank.lines.skip
logical: if TRUE blank lines in the input are ignored.

comment.char
character: a character vector of length one containing a single character or an empty string. Use "" to turn off the interpretation of comments altogether.

allowEscapes
logical. Should C-style escapes such as \n be processed or read verbatim (the default)? Note that if not within quotes these could be interpreted as a delimiter (but not as a comment character). For more details see scan.

flush
logical: if TRUE, scan will flush to the end of the line after reading the last of the fields requested. This allows putting comments after the last field.

stringsAsFactors
logical: should character vectors be converted to factors? Note that this is overridden by as.is and colClasses, both of which allow finer control.

fileEncoding
character string: if non-empty declares the encoding used on a file (not a connection) so the character data can be re-encoded. See the ‘Encoding’ section of the help for file, the ‘R Data Import/Export Manual’ and ‘Note’.

encoding
encoding to be assumed for input strings. It is used to mark character strings as known to be in Latin-1 or UTF-8 (see Encoding): it is not used to re-encode the input, but allows R to handle encoded strings in their native encoding (if one of those two). See ‘Value’ and ‘Note’.

text
character string: if file is not supplied and this is, then data are read from the value of text via a text connection. Notice that a literal string can be used to include (small) data sets within R code.

skipNul
logical: should nuls be skipped?

...
Further arguments to be passed to read.table.

Example
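
A minimal sketch of how several of the arguments above combine. The data in raw is illustrative only, and the choice of "-" as an extra missing-value marker is an assumption made for the example:

```r
# A small whitespace-separated table held in a string (illustrative data).
raw <- "id name score
1 ann 3.5
2 bob NA
3 cy -"

df <- read.table(
  text = raw,                    # read from the string instead of a file
  header = TRUE,                 # first line holds the variable names
  na.strings = c("NA", "-"),     # treat both markers as missing
  colClasses = c("integer", "character", "numeric")
)

str(df)
sum(is.na(df$score))             # number of missing scores
```

Setting colClasses explicitly keeps name as character regardless of the stringsAsFactors default in the running R version.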

aggregate {stats}

Description

Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.

Usage

aggregate(x, ...)

## Default S3 method:

aggregate(x, ...)

## S3 method for class 'data.frame'

aggregate(x, by, FUN, ..., simplify = TRUE, drop = TRUE)

## S3 method for class 'formula'

aggregate(formula, data, FUN, ...,
subset, na.action = na.omit)

## S3 method for class 'ts'

aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1,
ts.eps = getOption("ts.eps"), ...)

Arguments

x
an R object.

by
a list of grouping elements, each as long as the variables in the data frame x. The elements are coerced to factors before use.

FUN
a function to compute the summary statistics which can be applied to all data subsets.

simplify
a logical indicating whether results should be simplified to a vector or matrix if possible.

drop
a logical indicating whether to drop unused combinations of grouping values. The non-default case drop=FALSE has been amended for R 3.5.0 to drop unused combinations.

formula
a formula, such as y ~ x or cbind(y1, y2) ~ x1 + x2, where the y variables are numeric data to be split into groups according to the grouping x variables (usually factors).

data
a data frame (or list) from which the variables in formula should be taken.

subset
an optional vector specifying a subset of observations to be used.

na.action
a function which indicates what should happen when the data contain NA values. The default is to ignore missing values in the given variables.

nfrequency
new number of observations per unit of time; must be a divisor of the frequency of x.

ndeltat
new fraction of the sampling period between successive observations; must be a divisor of the sampling interval of x.

ts.eps
tolerance used to decide if nfrequency is a sub-multiple of the original frequency.

...
further arguments passed to or used by methods.
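
The nfrequency and FUN arguments of the time-series method can be sketched as follows, assuming an illustrative monthly series (monthly is made-up data, not from the order file used in these notes):

```r
# Twelve monthly observations starting January 2012 (illustrative values).
monthly <- ts(1:12, start = c(2012, 1), frequency = 12)

# nfrequency = 4 collapses the months into quarters; FUN = sum is the default.
quarterly <- aggregate(monthly, nfrequency = 4, FUN = sum)
quarterly

# nfrequency = 1 gives one value per year, here the yearly mean.
yearly_mean <- aggregate(monthly, nfrequency = 1, FUN = mean)
yearly_mean
```

Both 4 and 1 divide the original frequency of 12, which is the condition stated for nfrequency above.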

Example

> d <- read.csv("Global Superstore Orders 2016(1).csv")
> head(d$Sales)
[1] 221.98 3709.40 5175.17 2892.51 2832.96 2862.68
> head(d$Market)
[1] USCA Asia Pacific Asia Pacific Europe Africa Asia Pacific
Levels: Africa Asia Pacific Europe LATAM USCA

# aggregate() groups d$Sales by d$Market and sums each group, producing a summary table that was not in the original data.
> dsum <- aggregate(d$Sales, by = list(d$Market), FUN = sum)
> dsum
       Group.1         x
1       Africa  783773.4
2 Asia Pacific 4042660.7
3       Europe 3287338.6
4        LATAM 2164605.3
5         USCA 2364129.2

# names() gives the columns of dsum descriptive headers.
> names(dsum) <- c("Market", "Total Sales")
> dsum
        Market Total Sales
1       Africa    783773.4
2 Asia Pacific   4042660.7
3       Europe   3287338.6
4        LATAM   2164605.3
5         USCA   2364129.2

# order() sorts the rows of dsum by the Total Sales column in ascending order.
> dsum_ordered <- dsum[order(dsum$`Total Sales`, decreasing = FALSE), ]
> dsum_ordered
        Market Total Sales
1       Africa    783773.4
4        LATAM   2164605.3
5         USCA   2364129.2
3       Europe   3287338.6
2 Asia Pacific   4042660.7
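
The same grouping can also be written with the formula method, which names the result columns directly and so avoids the separate names() step. This is a sketch only: orders is a hypothetical stand-in for the order data, with made-up values.

```r
# Hypothetical stand-in for the order data; column names mirror the example.
orders <- data.frame(
  Market = c("Africa", "Europe", "Africa", "Europe", "LATAM"),
  Sales  = c(10, 20, 5, 40, 7)
)

# Formula method: the left side is summarised, the right side defines the groups.
totals <- aggregate(Sales ~ Market, data = orders, FUN = sum)

# Sort ascending by the summed column.
totals_ordered <- totals[order(totals$Sales), ]
totals_ordered
```

Note that the formula method uses na.action = na.omit by default, so rows with missing values are dropped before summing.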

plot {graphics}

Description

Generic function for plotting of R objects. For more details about the graphical parameter arguments, see par.

For simple scatter plots, plot.default will be used. However, there are plot methods for many R objects, including functions, data.frames, density objects, etc. Use methods(plot) and the documentation for these.

Usage

plot(x, y, ...)

Arguments

x	
the coordinates of points in the plot. Alternatively, a single plotting structure, function or any R object with a plot method can be provided.

y
the y coordinates of points in the plot, optional if x is an appropriate structure.

...
Arguments to be passed to methods, such as graphical parameters (see par). Many methods will accept the following arguments:

type
what type of plot should be drawn. Possible types are

"p" for points,

"l" for lines,

"b" for both,

"c" for the lines part alone of "b",

"o" for both ‘overplotted’,

"h" for ‘histogram’ like (or ‘high-density’) vertical lines,

"s" for stair steps,

"S" for other steps, see ‘Details’ below,

"n" for no plotting.

All other types give a warning or an error; using, e.g., type = "punkte" being equivalent to type = "p" for S compatibility. Note that some methods, e.g. plot.factor, do not accept this.

main
an overall title for the plot: see title.

sub
a sub title for the plot: see title.

xlab
a title for the x axis: see title.

ylab
a title for the y axis: see title.

asp
the y/x aspect ratio, see plot.window.
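
A minimal sketch of the arguments listed above, using illustrative data (x and y here are made up for the example):

```r
x <- 1:10
y <- x^2

# type = "b" draws both the points and the connecting lines.
plot(x, y,
     type = "b",
     main = "Squares of 1 to 10",
     sub  = "illustrative data",
     xlab = "x",
     ylab = "x squared")
```

Swapping type for "l", "h" or "s" is a quick way to compare the plot types described above on the same data.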

Example

# aggregate() groups d$Sales by d$Category and d$Order.Year and sums each group, producing a summary table that was not in the original data.
> dsum3 <- aggregate(d$Sales, by = list(d$Category, d$Order.Year), FUN = sum)
> names(dsum3) <- c("Category", "Order.Year", "Total.Sales")
> dsum3
          Category Order.Year Total.Sales
1        Furniture       2012    756084.3
2  Office Supplies       2012    675714.9
3       Technology       2012    827652.3
4        Furniture       2013    858822.1
5  Office Supplies       2013    795176.1
6       Technology       2013   1023441.7
7        Furniture       2014   1117629.5
8  Office Supplies       2014   1010812.9
9       Technology       2014   1277305.6
10       Furniture       2015   1377917.1
11 Office Supplies       2015   1305791.5
12      Technology       2015   1616159.1

# Subset dsum3 by matching the category name to get each category's yearly totals.
> dsum3.Furniture <- dsum3[dsum3$Category == "Furniture", ]
> dsum3.OfficeSupplies <- dsum3[dsum3$Category == "Office Supplies", ]
> dsum3.Technology <- dsum3[dsum3$Category == "Technology", ]
> dsum3.Furniture
    Category Order.Year Total.Sales
1  Furniture       2012    756084.3
4  Furniture       2013    858822.1
7  Furniture       2014   1117629.5
10 Furniture       2015   1377917.1
> dsum3.OfficeSupplies
          Category Order.Year Total.Sales
2  Office Supplies       2012    675714.9
5  Office Supplies       2013    795176.1
8  Office Supplies       2014   1010812.9
11 Office Supplies       2015   1305791.5
> dsum3.Technology
     Category Order.Year Total.Sales
3  Technology       2012    827652.3
6  Technology       2013   1023441.7
9  Technology       2014   1277305.6
12 Technology       2015   1616159.1

# range() returns the minimum and maximum of the data, used here as the axis limits.
> xrange <- range(dsum3$Order.Year)
> yrange <- range(dsum3$Total.Sales)
> xrange
[1] 2012 2015
> yrange
[1] 675714.9 1616159.1

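# plot() sets up the plot using the axis ranges and labels prepared above. type = "n" draws no points yet; xaxt = "n" suppresses the default x axis.
> plot(xrange,
+      yrange,
+      type = "n",
+      xlab = "Year",
+      ylab = "Total Sales",
+      xaxt = "n",
+      main = "Total Sales by Category by Year")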

# axis() draws an axis: 1 selects the bottom (x) axis, and at gives the tick positions. unique() keeps one tick per year.
> years <- unique(dsum3$Order.Year)
> axis(1, at = years, labels = as.character(years))

# points() adds points to the plot; the first argument gives the x values, the second the y values.
> points(dsum3.Furniture$Order.Year, dsum3.Furniture$Total.Sales)
> points(dsum3.OfficeSupplies$Order.Year, dsum3.OfficeSupplies$Total.Sales)
> points(dsum3.Technology$Order.Year, dsum3.Technology$Total.Sales)

# lines() draws connecting lines on the plot, one color per category.
> lines(dsum3.Furniture$Order.Year, dsum3.Furniture$Total.Sales, col = "red")
> lines(dsum3.OfficeSupplies$Order.Year, dsum3.OfficeSupplies$Total.Sales, col = "blue")
> lines(dsum3.Technology$Order.Year, dsum3.Technology$Total.Sales, col = "green")

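The per-category points() and lines() calls can be folded into one loop, with a legend so the colors are identifiable. This is a sketch rather than the original workflow: the totals are copied from the dsum3 output shown earlier so that the block runs without the source file, and the loop and legend() call are additions.

```r
# Totals copied from the dsum3 output shown earlier.
dsum3 <- data.frame(
  Category    = rep(c("Furniture", "Office Supplies", "Technology"), times = 4),
  Order.Year  = rep(2012:2015, each = 3),
  Total.Sales = c(756084.3, 675714.9, 827652.3,
                  858822.1, 795176.1, 1023441.7,
                  1117629.5, 1010812.9, 1277305.6,
                  1377917.1, 1305791.5, 1616159.1),
  stringsAsFactors = FALSE
)

categories <- unique(dsum3$Category)
colours <- c("red", "blue", "green")   # same colors as the calls above

# Empty frame with the right ranges; only axes and labels are drawn.
plot(range(dsum3$Order.Year), range(dsum3$Total.Sales),
     type = "n", xaxt = "n", xlab = "Year", ylab = "Total Sales")
axis(1, at = unique(dsum3$Order.Year))

# One pass per category: type = "b" in lines() draws points and lines together.
for (i in seq_along(categories)) {
  one <- dsum3[dsum3$Category == categories[i], ]
  lines(one$Order.Year, one$Total.Sales, type = "b", col = colours[i])
}

legend("topleft", legend = categories, col = colours, lty = 1, pch = 1)
```

Adding a fourth category would then only require one more entry in colours.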

Packages

G

ggplot

S

shiny