GATK入门的最佳姿势
虽然GATK的功能超级多，但是主要可以归为以下几个方面：

诊断和质量控制工具(Diagnostics and Quality Control Tools)
序列数据处理工具(Sequence Data Processing Tools)
变异位点探索工具(Variant Discovery Tools)
变异位点评估工具(Variant Evaluation Tools)
变异位点操作工具(Variant Manipulation Tools)
注释模块
读段（reads）过滤
资源文件解码工具
参考序列实用工具

Add read groups (Picard tools)
AddOrReplaceReadGroups.jar I=sorted.bam_file O=s1.rg.bam RGLB=genome RGPL=ILLUMINA
RGPU=GATKv4 RGSM=sample_name VALIDATION_STRINGENCY=LENIENT
用了picard tools的AddOrReplaceReadGroups.jar加了read group，根据以前百度GATK的经验，了解GATK要求bam文件的header必须包含@RG，所以这一步应该是前面比对时候，没有在参数中增加相应部分。所以如果我在比对的时候增加了这个参数，这一步就可以免了。
Mark duplicates (Picard tools)
MarkDuplicates.jar INPUT=s1.rg.bam OUTPUT=s2.dedup.bam ASSUME_SORTED=TRUE
VALIDATION_STRINGENCY=LENIENT METRICS_FILE=s2.dedup.metrics

用picard tools的MarkDuplicates.jar去重。这是因为二代测序有一部桥式PCR扩增底物的过程，在100x以上出现reads重复，很大可能是PCR扩增重复了，所以可以直接去掉。

Index (samtools)
samtools index s2.dedup.bam
samtools的index，建立索引，提高检索效率。
Realign reads (create intervals first, then do IndelRealigner) (GATK)
GenomeAnalysisTK.jar -I s2.dedup.bam -R ref_file -T RealignerTargetCreator -o s3.intervals
GenomeAnalysisTK.jar -T IndelRealigner -I s2.dedup.bam -R ref_file -targetIntervals s3.intervals -o
s4.realn.bam
先建立indel区间文件，然后对该区域进行reads重比对。这一步我没有经验，然后我在GATK的论坛搜索了这个工具，发现原因是indel会导致附近的错配，所以需要借由这一步降低indel附近的假阳性。但是最新的HaplotypeCaller or MuTect2已经不需要了，这些工具的haplotype组装步奏效果类似。这里的UnifiedGenotyper or the original MuTect.还是要的。
Unified genotyper (GATK)
GenomeAnalysisTK.jar -T UnifiedGenotyper -R ref_file -I s4.realn.bam -glm BOTH -o s5.UG1.vcf -mbq 30 -nt
使用UnifiedGenotyper寻找variant。论坛搜索发现，现在推荐用HaplotypeCaller，效果更好。
4
Base score recalibrator (GATK)
GenomeAnalysisTK.jar -T BaseRecalibrator -I s4.realn.bam -R ref_file -knownSites s5.UG1.vcf -o s6.recal

根据上一步得到vcf文件对第四步得到BAM文件进行碱基重校准，得到校准所需的文件。

Print Reads (GATK)
GenomeAnalysisTK.jar -T PrintReads -R ref_file -I s4.realn.bam -BQSR s6.recal -o s7.recal.bam
Unified Genotype (GATK)
GenomeAnalysisTK.jar -T UnifiedGenotyper -R ref_file -I s7.recal.bam -glm BOTH -o s8.UG2.vcf -mbq 30
Base score recalibrator (GATK)
GenomeAnalysisTK.jar -T BaseRecalibrator -I s4.realn.bam -R ref_file -knownSites s8.UG1.vcf -o s9.recal
Print Reads (GATK)
GenomeAnalysisTK.jar -T PrintReads -R ref_file -I s4.realn.bam -BQSR s9.recal -o s10.recal.bam
Unified Genotyper (GATK)
GenomeAnalysisTK.jar -T UnifiedGenotyper -R ref_file -I s10.recal.bam -glm BOTH

研究意义
1.在不同背景下比较mRNA水平

同一物种，不同组织：研究基因在不同部分的表达情况
同一物种，同一组织：研究基因在不同处理下，不同条件下的表达变化
同一组织，不同物种：研究基因的进化关系
时间序列实验：基因在不同时期的表达情况与发育的关系

2.基因分类：找到细胞特异，疾病相关，处理相关的基因表达模式，用于诊断疾病和预测等
3.基因网络和通路：基因在细胞活动中的功能，基因间的相互作用。

差异表达(differential expression,DE)基因分析
通过研究基因的差异表达，我们可以发现

细胞特异性的基因；
发育阶段特异性的基因；
疾病状态相关的基因；
环境相关的基因；
…

基本方法就是以生物学意义的方式计算基因表达量，然后通过统计学分析表达量寻找具有统计学显著性差异的基因，从而

选择合适的基因
衡量结果的可靠性
分析方法
寻找差异表达基因有三种方式：
第一种是计算Fold change（倍数变化），十分简单粗暴的方法，计算方法如下：

E = mean(group1) B=mean(group2)
FC = (E-B) / min(E,B)

说人话就是，基因A和基因B的平均值之差与两者中较小的比值。选择2-3倍的基因作为结果（为什么是2-3倍，就是大家约定俗成）。

用到的数据来自于一篇文献“A pooled shRNA screen for regulators of primary mammary stem and progenitor cells identifies roles for Asap1 and Prox1”。

根据文章的介绍，数据是100 bp的单端read，用Rsubread比对到mouse reference genome(mm10), 然后使用featureCounts统计每个基因的count数。然后用TMM进行标准化，转换成log2 counts per million.最后用limma包对每个样本每个基因的平均表达值以观察水平权重的线性模型进行拟合，并用T检验找到不同群体的差异表达基因。以FDR + log2-fold-change对基因排序。
我们的任务是下载他们featureCounts的结果，不用limma,改用DESeq2分析。

install.packages(“R.utils”)

library(“R.utils”)
url <- “https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE63310&format=file”
utils::download.file(url, destfile=“GSE63310_RAW.tar”, mode=“wb”)
utils::untar(“GSE63310_RAW.tar”, exdir = “.”)
files <- c(“GSM1545535_10_6_5_11.txt”, “GSM1545536_9_6_5_11.txt”, “GSM1545538_purep53.txt”,
“GSM1545539_JMS8-2.txt”, “GSM1545540_JMS8-3.txt”, “GSM1545541_JMS8-4.txt”,
“GSM1545542_JMS8-5.txt”, “GSM1545544_JMS9-P7c.txt”, “GSM1545545_JMS9-P8c.txt”)

for(i in paste(files, “.gz”, sep=""))
R.utils::gunzip(i, overwrite=TRUE)

read.delim(files[1], nrow=5)

EntrezID GeneLength Count

#1 497097 3634 1
#2 100503874 3259 0
#3 100038431 1634 0
#4 19888 9747 0
#5 20671 3130 1

之后我们需要用DESeqDataSetFromMatrix为DESeq2提供数据,命令形式如下：
library(“DEseq2”)
DESeqDataSetFromMatrix(countData, colData, design, tidy = FALSE,ignoreRank = FALSE, …)
首先从各个count matrix文件中读取count，基因长度部分可以舍弃，因为DESeq2只需要为标准化的count数据，不需要提供基因长度信息。
逻辑就是分别读取每一个文件的count列，然后赋予文件名。

sample1 <- read.delim(files[1],header = T)[,c(1,3)]

sample1
EntrezID Count
1 497097 1
2 100503874 0
3 100038431 0
4 19888 0
5 20671 1
6 27395 431
7 18777 768

count.table <- data.frame(sample1)

length(files)
[1] 9

for ( f in files[2:length(files)]){
column <- read.delim(f,header=T,row.names = 1)[,2]
count.table <- cbind(count.table, column)
}
head(count.table)
rownames(count.table) <- count.table[,1]
count.table <- count.table[-1]
samplenames <- substring(files, 12, nchar(files)-4)
colnames(count.table) <- samplenames

然后要提供colData,其中colData存放列名要和countData的行名相同。

colData

condition <- as.factor(c(“LP”, “ML”, “Basal”, “Basal”, “ML”, “LP”,
“Basal”, “ML”, “LP”))
lane <- as.factor(rep(c(“L004”,“L006”,“L008”), c(3,4,2)))
colData <- data.frame(condition = condition, lane=lane,row.names = samplenames)

all(rownames(colData) %in% colnames(count.table))

最后导入数据:

dds <- DESeqDataSetFromMatrix(countData = count.table,
colData = colData,
design = ~ condition )

数据预处理
预过滤
虽然DESeq2会自动屏蔽那些低count的基因，但是剔除那些几乎不存在基因的部分能够提高运行速度。

dds <- dds [ rowSums(counts(dds)) > 1, ]
rlog和方差齐性转换
许多常见的多维数据探索性分析的统计分析方法，例如聚类和主成分分析要求，在那些同方差性的数据表现良好。所谓的同方差性就是虽然平均值不同，但是方差相同。
但是对于RNA-Seq count数据而言，当均值增加时，方差期望也会提高。也就说直接对count matrix或标准化count（根据测序深度调整）做PCA分析，由于高count在不同样本间的绝对差值大，也就会对结果有很大影响。简单粗暴的方法就是对count matrix取log后加1。这个1也是约定俗成，看经验了。
随便举个栗子看下效果：
lambda <- 10^seq(from = -1, to = 2, length = 1000)
cts <- matrix(rpois(1000*100, lambda), ncol = 100)
library(“vsn”)
meanSdPlot(cts, ranks = FALSE)
log.cts.one <- log2(cts + 1)
meanSdPlot(log.cts.one, ranks = FALSE)

到底用啥：数据集小于30 -> rlog，大数据集 -> VST。还有这个处理过程不是用于差异检验的，在DESeq分析中会自动选择最合适的所以你更不需要纠结了，记得用raw count。

library(“dplyr”)
library(“ggplot2”)

dds <- estimateSizeFactors(dds)

df <- bind_rows(
as_data_frame(log2(counts(dds, normalized=TRUE)[, 1:2]+1)) %>%
mutate(transformation = “log2(x + 1)”),
as_data_frame(assay(rld)[, 1:2]) %>% mutate(transformation = “rlog”),
as_data_frame(assay(vsd)[, 1:2]) %>% mutate(transformation = “vst”))

colnames(df)[1:2] <- c(“x”, “y”)

ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
coord_fixed() + facet_grid( . ~ transformation)

结果就是转换后更加集中了。

样本距离
RNA-Seq分析第一步通常是评估样本间的总体相似度。

那些样本最接近
那些样本差异较大
这与实验设计预期符合么
这里使用R内置的 dist 计算 rlog变换数据的Euclidean distance，然后用pheatmap根据结果画热图

sampleDists <- dist(t(assay(rld)))
library(“pheatmap”)
library(“RColorBrewer”)
sampleDistMatrix <- as.matrix( sampleDists )

colors <- colorRampPalette( rev(brewer.pal(9, “Blues”)) )(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,
col = colors)

同样的可以用Poisson Distance (Witten 2011)计算距离，计算方式如下：

library(“PoiClaClu”)
poisd <- PoissonDistance(t(counts(dds)))

PCA分析
还有一种可视化样本-样本距离的方法就是主成分分析。PCA分析我打算找点资料好好理解之后再写，这个说下有这个方法。
DESeq2提供了专门的方法用于作图，还很好看呢！

plotPCA(rld,intgroup=c(“condition”))

能够明显的发现不同处理的距离离得很远。

差异表达基因分析（DEA）
在我们定义好DESeqDataSe 就可以方便的调用DESeq分析.

dds <- DESeq(dds)
在几行输出后信息后，分析就完成了，更多具体参数可以用?DESeq查看手册

构建结果表格
如果直接调用results，只会提取出design matrix最后两个变量（这里是ML vs Basal）的 log2 fold changes and p values等结果。

results(dds)
log2 fold change (MAP): condition ML vs Basal
Wald test p-value: condition ML vs Basal
DataFrame with 27179 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj

497097 118.7522256 -7.0662875 0.5802473 -12.1780608 4.068401e-34 8.556128e-33
100503874 1.3308906 -2.6371037 1.3370584 -1.9723174 4.857338e-02 8.946685e-02
100038431 0.0843771 -0.1541107 1.2443948 -0.1238439 9.014388e-01 NA
19888 1.4833878 2.9503142 1.3351543 2.2097179 2.712475e-02 5.385925e-02
20671 26.4317752 -1.5256766 0.7754853 -1.9673829 4.913908e-02 9.034136e-02
… … … … … … …
100861837 368.84237 0.33334299 0.3226665 1.0330883 0.3015626 0.4139846
100861924 22.79272 -1.25169702 0.8307249 -1.5067528 0.1318740 0.2107990
170942 852.61538 -0.07612745 0.2399318 -0.3172879 0.7510252 0.8235139
100861691 0.00000 NA NA NA NA NA
100504472 0.00000 NA NA NA NA NA

实际上results有如下众多参数
results(object, contrast, name, lfcThreshold = 0,
altHypothesis = c(“greaterAbs”, “lessAbs”, “greater”, “less”),
listValues = c(1, -1), cooksCutoff, independentFiltering = TRUE,
alpha = 0.1, filter, theta, pAdjustMethod = “BH”, filterFun,
format = c(“DataFrame”, “GRanges”, “GRangesList”), test, addMLE = FALSE,
tidy = FALSE, parallel = FALSE, BPPARAM = bpparam(), …)

比如可以指定比较对象Basal和PL,可以用mcols查看结果存储的元数据，了解列名的含义。

res <- results(dds,contrast = c(“condition”,“Basal”,“LP”))
mcols(res, use.names = TRUE)

DataFrame with 6 rows and 2 columns
type description

baseMean intermediate mean of normalized counts for all samples
log2FoldChange results log2 fold change (MAP): condition Basal vs LP
lfcSE results standard error: condition Basal vs LP
stat results Wald statistic: condition Basal vs LP
pvalue results Wald test p-value: condition Basal vs LP
padj results BH adjusted p-valuess

其中lfcSE是Log2 fold change的标准误。其他项都比较好理解。
最后还可以看一些总结性的内容
summary(res)

out of 22026 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 5528, 25%
LFC < 0 (down) : 5274, 24%
outliers [1] : 43, 0.2%
low counts [2] : 2953, 13%
(mean count < 1)
[1] see ‘cooksCutoff’ argument of ?results
[2] see ‘independentFiltering’ argument of ?results

在limma分析结果分别是4127，4298，稍微多了那么2000个基因。其他结果也基本上都了2000个。看起来对于22000多个基因而言，差异不算太大。
当然我们还可以通过

降低假阳性率（FDR， false discovery rate)阈值
提高log2 fold change的阈值

res0.5 <- results(dds, contrast = c(“condition”,“Basal”,“LP”),alpha=0.05)
table(res0.5$padj < 0.05)

FALSE TRUE
8860 9750

resLFC1 <- results(dds, lcontrast = c(“condition”,“Basal”,“LP”),fcThreshold=1)
table(resLFC1$padj < 0.1)

FALSE TRUE
8916 10958

于是乎结果就和limma差不多了。

多重试验
在高通量数据分析中，我们通常不是用p值来拒绝原假设，更多是用来进行多重试验矫正。

为什么要做多重试验矫正？

如果对一个基因，我有99%的把握认为是差异表达的，也就是说1%的可能是错的。那么假设有10000个基因，按照数学期望，有100个是假的。因此为了保证多重试验结果的可靠性要对结果的p-value做矫正。
DESeq2用的Benjamini-Hochberg (BH) adjustment (Benjamini and Hochberg 1995)，计算矫正后的P-value,用于回答如下问题：

如果我们选择了所有小于或等于矫正p-value阈值的所有显著性基因，假阳性比例（ false discovery rate, FDR）是多少？

FDR在高通量试验中是比较有用的统计值，由于我们通常关注一类自己感兴趣的基因，所以我们需要设置一个假阳性上限。
我们从结果中以1%作为显著性，分别找出显著性上调和下调的基因。
regSig <- subset(res, padj < 0.01)

Up

head(resSig[ order(resSig$log2FoldChange), ])

Down

head(resSig[ order(resSig$log2FoldChange, decreasing = TRUE), ])

可以用MA-plot（也叫mean-difference plot,Bland-Altman plot）了解模型（如所有基因在不同处理比较的结果）的估计系数的分布
res <- lfcShrink(dds, contrast=c(“contion”,“ML”,“PL”), res=res)
plotMA(res, ylim = c(-5, 5))

更加喜闻乐见的是基因聚类所提供的热图展示。我们可以找前20个样本件差异比较大，然后看他们在不同样本间的表达情况。
library(“genefilter”)
topVarGenes <- head(order(rowVars(assay(rld)), decreasing = TRUE), 20)
mat <- assay(rld)[ topVarGenes, ]
mat <- mat - rowMeans(mat)
anno <- as.data.frame(colData(rld)[, c(“condition”,“lane”)])
pheatmap(mat,annotation_col = anno)

heatmap

大致可以发现同一组的基因颜色是相同的，也就是说表达量相近。

结语
找到差异表达的基因只是第一步，后续还需要对这些基因进行进一步的分析，如

基因富集分析
主成分分析
聚类分析
而这些内容就是我接下来的学习重点，也是下次更新文章的主题。

首先对基因表达分析（上）做一个简单的回顾

研究基因表达的有如下工具：RNA-Seq，microarray， qRT-PCR等（欢迎补充）
RNA-Seq，microarray一般用在探索性阶段，qRT-PCR用于验证
RNA-Seq和microarray由于他们的实验方式不同，导致寻找差异表达基因的统计学方法也不同。其中microarray使用寡核苷酸作为探针进行杂交，基因表达量与亮度正相关，而亮度是一个连续型变量，因此大多认为结果是服从正态分布。而RNA-Seq的测序结果是一条条read，是一种离散抽样过程，因此认为是服从泊松分布。
ANOVA和简单线性模型都是广义线性模型的特殊情况。ANOVA是研究名义型解释变量和连续型解释变量的关系，简单线性模式是研究连续型解释变量和连续型解释变量的关系。而广义线性模式没特殊要求。
在3,4的背景下，microarray一般用t检验（两个条件），ANOVA分析（多个条件），最常用limma（线性模型）进行检验。RNA-Seq有许多基于count的R包，如DESeq，DESeq2,（基于负二向分布广义线性模型）
以上要求你每个条件都要有3个重复（目前投稿要求），你要是老板穷，一个重复都不给，那你去Google解决方案吧。
用R作差异表达分析大致分为以下几步：1）根据软件包要求导入数据；2）数据预处理，把那些只有0或1计数结果的基因去掉，提高效率。这一步还可以进行探索性数据分析；3）跑程序，得到结果；4）对结果进行可视化，看看基因聚类等结果，这一步不是必须的，但却是展示数据最好的手段了。

如果这些内容都还有印象，那么我们继续上次没有写完的内容的其中一个部分–基因富集分析
E-MTAB-5130.sdrf.txt
/Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz -o athal.fa.gz

然后用Salmon建立索引：

salmon index -t athal.fa.gz -i athal_index

数据定量
由于样本一共有16个，不可能一条条输入命令，所以我们写一个脚本：
#! /bin/bash
for fn in ERR1698{194…209};
do
samp=basename ${fn}
echo “Processin sample ${sampe}”
salmon quant -i athal_index -l A
-1 ${samp}_1.fastq.gz
-2 $KaTeX parse error: Expected 'EOF', got '\ ' at position 19: …mp}_2.fastq.gz \̲ ̲ -p 8 -o…$ {samp}_quant
done

tximport可以纠正不同样本基因长度的潜在改变（比如说differential isoform usage）；能够用于导入 (Salmon, Sailfish, kallisto)程序的输出文件；能够避免丢弃那些比对到多个基因的同源序列，从而提高灵敏度

虽然tximport的参数看起来很多，但其实需要我们准备的就是两个files和tx2gene
tximport(files, type = c(“none”, “salmon”, “sailfish”, “kallisto”, “rsem”),
txIn = TRUE, txOut = FALSE, countsFromAbundance = c(“no”, “scaledTPM”,
“lengthScaledTPM”), tx2gene = NULL, varReduce = FALSE,
dropInfReps = FALSE, ignoreTxVersion = FALSE, geneIdCol, txIdCol,
abundanceCol, countsCol, lengthCol, importer = NULL)

files存放的是salmon的输出文件，所以我们需要根据文件存放位置，进行声明
dir <- “C:/Users/Xu/Desktop/”
list.files(dir)
sample <- paste0(“ERR1698”,seq(194,209),"_quant")
files <- file.path(dir,“quants”,sample,“quant.sf”)
names(files) <- paste0(“sample”,c(1:16))
all(file.exists(files))

然后我们还要准备一个基因名和转录本名字相关的数据框
library(AnnotationHub)
ah <- AnnotationHub()
ath <- query(ah,‘thaliana’)
ath_tx <- ath[[‘AH52247’]]
columns(ath_tx)
k <- keys(ath_tx,keytype = “GENEID”)
df <- select(ath_tx, keys=k, keytype = “GENEID”,columns = “TXNAME”)
tx2gene <- df[,2:1] # TXID在前， GENEID在后
如果你电脑跑的够快，基本上这个时候就可以导入数据了。

install.packages(“readr”)

install.packages(“rsjon”)

library(“tximport”)
library(“readr”)
txi <- tximport(files, type = “salmon”, tx2gene = tx2gene)

reading in files with read_tsv

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

summarizing abundance

summarizing counts

summarizing length

names(txi)
[1] “abundance” “counts” “length”
[4] "countsFromAbundance

head(txi $lenght) head(txi$ counts)

library(“DESeq2”)
sampleTable <- data.frame(condition = factor(
rep(c(“Day0”,“Day1”,“Day2”,“Day3”),each=4)))

rep(c(“Day0”,“Day1”,“Day2”,“Day3”),each=4)
[1] “Day0” “Day0” “Day0” “Day0” “Day1” “Day1” “Day1” “Day1” “Day2” “Day2” “Day2” “Day2” “Day3” “Day3” “Day3” “Day3”

rownames(sampleTable) <- colnames(txi$counts)
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)

如果，我现在有两个matrix，如下
u <- matrix(seq(1,16),ncol=2);u
v <- matrix(seq(17,28),ncol=2);v

我希望通过循环的方式，以每个matrix的两列作为输入做线性拟合, 如下

lm(u[,1]~u[,2])

Call:
lm(formula = u[, 1] ~ u[, 2])

Coefficients:
(Intercept) u[, 2]
-8 1

于是我想当然：
for (m in c(u,v)){
lm(m[,1]~m[,2])
}
Error in m[, 1] : incorrect number of dimensions

我忘记了c()会把所有matrix降成一维，这就很尴尬了。这是因为R不支持对非向量集合的循环操作。如果想实现非向量集合的循环，只能曲线救国，如lapply或get，这里介绍get。

u <- matrix(seq(1,16),ncol=2);u
[,1] [,2]
[1,] 1 9
[2,] 2 10
[3,] 3 11
[4,] 4 12
[5,] 5 13
[6,] 6 14
[7,] 7 15
[8,] 8 16
v <- matrix(seq(17,28),ncol=2);v
[,1] [,2]
[1,] 17 23
[2,] 18 24
[3,] 19 25
[4,] 20 26
[5,] 21 27
[6,] 22 28

for (m in c(“u”,“v”)){

```
z <- get(m)
```
```
print(z)
```
}
[,1] [,2]
[1,] 1 9
[2,] 2 10
[3,] 3 11
[4,] 4 12
[5,] 5 13
[6,] 6 14
[7,] 7 15
[8,] 8 16
[,1] [,2]
[1,] 17 23
[2,] 18 24
[3,] 19 25
[4,] 20 26
[5,] 21 27
[6,] 22 28

w<-c(u,v)
w
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28

于是，这解决了我昨天晚上遇到的问题。我打算对3个DESeq2的比较table进行过滤，

res1 <- results(dds, contrast = c(“condition”,“Day1”,“Day0”),pAdjustMethod=“fdr”,alpha=0.05)
res2 <- results(dds, contrast = c(“condition”,“Day2”,“Day0”),pAdjustMethod=“fdr”,alpha=0.05)
res3 <- results(dds, contrast = c(“condition”,“Day3”,“Day0”),pAdjustMethod=“fdr”,alpha=0.05)

本来代码长下面这个样子，各种他提示错误：

for (res in c(res1,res2,res3)){
paste0(res, “SigUp”) <- subset(res, padj < 0.01 & log2FoldChange >0 )
}

我需要解决两个问题：
一：如何在for循环中增加变量名，使用assign对非局部变量进行读写
二：还有如何对非向量集合进行循环, 使用get获取变量内容。

所以最后的代码长这个样子：

for (res in c(“res1”,“res2”,“res3”)){
m = get(res)
nam <- paste0(res,“SigUpt”)
assign(nam, subset(m, padj < 0.01 & log2FoldChange >0 ))
}

其实可以用list，然后用lapply, 然后分别提取的。

resSigUp <- lapply(list(res1,res2,res3), subset, padj < 0.01 & log2FoldChange >0 )

GATK入门