R Study Notes, Day 1

1. The RStudio interface

2. Data handling

2.1 Data types

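1. The RStudio interface

Setting the working directory:

Note: R indexes from 1, unlike Python, which indexes from 0.

In the bottom-right pane, click Files and navigate to the folder you need, then click More (the gear icon) - Set As Working Directory.

If you don't want to click through all those folders every time, set a default under Tools - Global Options - Default working directory.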
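The working directory can also be set from the console. A minimal sketch, where tempdir() is only a stand-in for your own project folder (an assumption for illustration):

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()

# tempdir() is only a placeholder here; substitute your own project folder
setwd(tempdir())

# Confirm where R will now read and write files by default
getwd()

# Restore the original working directory
setwd(old_wd)
```

Relative paths such as "pt_data.csv" are resolved against whatever getwd() returns.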

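2. Data handling

2.1 Data types

Vector
List
Matrix
Array
Factor
Data frame

Apart from factors, all of these have close counterparts in Python (some through libraries such as NumPy and pandas), so I won't go over them again here. The factor is the less familiar one; for a detailed explanation see: R 因子 | 菜鸟教程

In short, a factor is a vector-like variable that also carries a fixed set of categories (its levels), used to group the data. A few examples: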

gender <- factor(c("MALE", "FEMALE", "MALE")) 
gender

The result is:

[1] MALE FEMALE MALE

Levels: FEMALE MALE

So besides the data itself, a factor carries a set of levels that lists the categories the values fall into.

blood <- factor(c("O", "AB", "A"),
                levels = c("A", "B", "AB", "O"))
blood

[1] O AB A

Levels: A B AB O

In this example, no value in the blood factor is of type B, but B is still one of the declared levels.

Factors can also be ordered:

# add ordered factor
symptoms <- factor(c("SEVERE", "MILD", "MODERATE"),
                   levels = c("MILD", "MODERATE", "SEVERE"),
                   ordered = TRUE)
symptoms

# check for symptoms greater than moderate
symptoms > "MODERATE"

[1] SEVERE MILD MODERATE

Levels: MILD < MODERATE < SEVERE

[1] TRUE FALSE FALSE
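To see what a factor stores beyond the printed values, the following sketch reuses the blood and symptoms factors from above and inspects them with levels(), as.integer() and table():

```r
# Same blood-type factor as above: "B" is a declared level with no observations
blood <- factor(c("O", "AB", "A"),
                levels = c("A", "B", "AB", "O"))

# The full set of categories, in declared order
levels(blood)

# Internally each value is stored as the integer position of its level
as.integer(blood)   # 4 3 1

# Counting by level keeps the empty "B" category (count 0)
table(blood)

# For an ordered factor, comparisons follow the level order
symptoms <- factor(c("SEVERE", "MILD", "MODERATE"),
                   levels = c("MILD", "MODERATE", "SEVERE"),
                   ordered = TRUE)
symptoms >= "MODERATE"   # TRUE FALSE TRUE
```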

2.2 Basic data manipulation

1. Select a single column:

subject1$temperature  # the $ operator extracts one column (element) by name

2. Select several columns by name

# get several list items by specifying a vector of names

subject1[c("temperature", "flu_status")]

3. Select several columns by position

# access a list like a vector: get items 2 and 3

subject1[2:3]
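The notes never define subject1, so here is a self-contained sketch of the three selection styles. The list contents are hypothetical values invented for illustration:

```r
# Hypothetical patient record; the values are made up for illustration
subject1 <- list(fullname    = "John Doe",
                 temperature = 98.1,
                 flu_status  = FALSE)

# $ returns the element itself (here a single number)
subject1$temperature

# Single brackets return a smaller list, selected by name ...
subject1[c("temperature", "flu_status")]

# ... or by position; positions start at 1, so 2:3 means the 2nd and 3rd items
subject1[2:3]

# Double brackets, like $, return the element rather than a list
subject1[[2]]
```

The difference between [ ] and [[ ]] matters in practice: subject1[2] is still a list, so arithmetic on it fails, while subject1[[2]] is a number.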

4. Create a data frame

# create a data frame from medical patient data
pt_data <- data.frame(subject_name, temperature, flu_status, gender, blood, symptoms)
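The data.frame() call above assumes six vectors of equal length already exist. A minimal sketch with hypothetical values (all names and numbers below are invented for illustration):

```r
# Hypothetical patient data; every value here is invented for illustration
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
temperature  <- c(98.1, 98.6, 101.4)
flu_status   <- c(FALSE, FALSE, TRUE)
gender       <- factor(c("MALE", "FEMALE", "MALE"))
blood        <- factor(c("O", "AB", "A"), levels = c("A", "B", "AB", "O"))
symptoms     <- factor(c("SEVERE", "MILD", "MODERATE"),
                       levels = c("MILD", "MODERATE", "SEVERE"),
                       ordered = TRUE)

# Each vector becomes one column; all vectors must have the same length
pt_data <- data.frame(subject_name, temperature, flu_status,
                      gender, blood, symptoms)

dim(pt_data)     # 3 rows, 6 columns
names(pt_data)   # column names are taken from the vector names
```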

5. Extract rows or columns

# column 1, all rows
pt_data[, 1]
# row 1, all columns
pt_data[1, ]
# all rows and all columns 
pt_data[ , ]
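The [row, column] pattern also accepts vectors of positions, names, and logical conditions. A small sketch on a hypothetical data frame (pt_small and its values are invented for illustration):

```r
# Small hypothetical data frame, invented for illustration
pt_small <- data.frame(name        = c("A", "B", "C"),
                       temperature = c(98.1, 98.6, 101.4),
                       flu_status  = c(FALSE, FALSE, TRUE))

# Rows 1 and 3, columns chosen by name
pt_small[c(1, 3), c("temperature", "flu_status")]

# Rows selected by a logical condition, all columns
pt_small[pt_small$temperature > 100, ]

# A single column comes back as a plain vector ...
pt_small[, 2]

# ... unless drop = FALSE asks for a one-column data frame
pt_small[, 2, drop = FALSE]
```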

6. Create matrices of different dimensions

# create a 2x3 matrix
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
# create a 3x2 matrix
m <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)
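One detail worth checking is the fill order: matrix() fills column by column unless told otherwise. A short sketch (m_byrow is a name introduced here for illustration):

```r
# By default the values fill the matrix column by column
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
m
dim(m)      # 2 3
m[1, 2]     # row 1, column 2 -> 3

# byrow = TRUE fills row by row instead
m_byrow <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)
m_byrow[1, 2]   # row 1, column 2 -> 2
```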

2.3 Managing data

# show all data structures in memory
ls()
# remove the m and subject1 objects
rm(m, subject1)
ls()
# remove every object in the workspace
rm(list = ls())

3. Importing data

3.1 Importing CSV data

Data can be imported with code:

# reading a CSV file
pt_data <- read.csv("pt_data.csv")
# reading a CSV file and converting all character columns to factors
pt_data <- read.csv("pt_data.csv", stringsAsFactors = TRUE)
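To see what stringsAsFactors actually changes, the sketch below writes a tiny hypothetical CSV to a temporary file and reads it back both ways. The file contents and the names csv_path, pt_chr and pt_fct are invented for illustration:

```r
# Hypothetical data written to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(name = c("A", "B"), temperature = c(98.1, 101.4)),
          csv_path, row.names = FALSE)

# Default since R 4.0.0: character columns stay character
pt_chr <- read.csv(csv_path)
class(pt_chr$name)

# With stringsAsFactors = TRUE they are converted to factors
pt_fct <- read.csv(csv_path, stringsAsFactors = TRUE)
class(pt_fct$name)

unlink(csv_path)   # remove the temporary file
```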

You can also import data with the Import Dataset button in the top-right pane.

3.2 Inspecting the structure of the data

# get structure of used car data
str(usedcars)

3.3 Summarizing data

# summarize numeric variables
summary(usedcars$year)
summary(usedcars[c("price", "mileage")])

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2000    2008    2009    2009    2010    2012

# calculate the mean income
(36000 + 44000 + 56000) / 3
mean(c(36000, 44000, 56000))

# the median income
median(c(36000, 44000, 56000))

# the min/max of used car prices
range(usedcars$price)

# the difference of the range
diff(range(usedcars$price))

# IQR for used car prices 四分位距 (Interquartile range)
IQR(usedcars$price)

# use quantile to calculate five-number summary
quantile(usedcars$price)

# the 1st and 99th percentiles
quantile(usedcars$price, probs = c(0.01, 0.99))
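The usedcars data is not included in these notes, so the sketch below uses a hypothetical price vector (values invented for illustration) to show how range(), quantile() and IQR() relate to each other:

```r
# Hypothetical prices, invented for illustration
price <- c(3800, 6200, 9500, 11000, 12400, 13600, 14900, 17500, 21900)

# Five-number summary: min, Q1, median, Q3, max
q <- quantile(price)
q

# The range difference is simply max - min
diff(range(price)) == max(price) - min(price)

# The IQR is the distance between the third and first quartiles
IQR(price) == unname(q["75%"] - q["25%"])
```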

# quintiles
quantile(usedcars$price, seq(from = 0, to = 1, by = 0.20))

Plotting:

# boxplot of used car prices and mileage
boxplot(usedcars$price, main = "Boxplot of Used Car Prices",
      ylab = "Price ($)")

boxplot(usedcars$mileage, main="Boxplot of Used Car Mileage",
      ylab = "Odometer (mi.)")

# histograms of used car prices and mileage
hist(usedcars$price, main = "Histogram of Used Car Prices",
     xlab = "Price ($)")

hist(usedcars$mileage, main = "Histogram of Used Car Mileage",
     xlab = "Odometer (mi.)")

Exploring the variance and standard deviation of the data:

# variance and standard deviation of the used car data
var(usedcars$price)
sd(usedcars$price)
var(usedcars$mileage)
sd(usedcars$mileage)
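As a check on what var() and sd() compute, this sketch works out the sample variance by hand on a hypothetical price vector (values invented for illustration):

```r
# Hypothetical prices, invented for illustration
price <- c(3800, 6200, 9500, 11000, 12400)

# Sample variance by hand: squared deviations divided by n - 1
n <- length(price)
manual_var <- sum((price - mean(price))^2) / (n - 1)

# var() uses the same n - 1 denominator
all.equal(manual_var, var(price))

# sd() is the square root of var(), so it is in the same units as the data
all.equal(sqrt(var(price)), sd(price))
```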

## Exploring categorical variables -----

# one-way tables for the used car data
table(usedcars$year)
table(usedcars$model)
table(usedcars$color)

table() counts how often each value occurs. For example, table(usedcars$year) gives:

2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
   3    1    1    1    3    2    6   11   14   42   49   16    1

# compute table proportions
model_table <- table(usedcars$model)
prop.table(model_table)  # convert the counts to proportions

# round the data
color_table <- table(usedcars$color)
color_pct <- prop.table(color_table) * 100
round(color_pct, digits = 1)
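The same counts-to-percentages pipeline can be tried without the usedcars file. A minimal sketch on a hypothetical vector of colors (values invented for illustration):

```r
# Hypothetical colors, invented for illustration
color <- c("Black", "Silver", "Black", "Red", "Black", "Silver", "Blue", "Black")

# Counts per category (sorted alphabetically by name)
color_table <- table(color)
color_table

# Proportions always sum to 1
color_prop <- prop.table(color_table)
sum(color_prop)

# As percentages, rounded to one decimal place
round(color_prop * 100, digits = 1)
```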