samtools用法详解

版权声明:本文为博主原创文章,未经博主允许不得转载。 https://blog.csdn.net/sunchengquan/article/details/85176940

下载安装


curl -OL https://sourceforge.net/projects/samtools/files/samtools/1.6/samtools-1.6.tar.bz2

tar jxvf samtools-1.6.tar.bz2
cd samtools-1.6
./configure 
make

mv samtools-1.6 samtools
ln -s ~/local/app/samtools/samtools ~/bin/samtools
or 
echo 'export PATH=/home/chengquan/local/app/samtools:$PATH' >> ~/.bashrc
source ~/.bashrc

检查一下是否可以用

[sunchengquan 14:35:50 ~]
$ samtools

Program: samtools (Tools for alignments in the SAM format)
Version: 1.6 (using htslib 1.6)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     rmdup          remove PCR duplicates
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags
     markdup        mark duplicates

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA

  -- Statistics
     bedcov         read depth per BED region
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)

  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM

以下的命令讲解就是按照这个顺序

测试数据

参考:生物信息数据格式:sam,bam格式

命令详解

官方说明文档:http://www.htslib.org/doc/samtools.html

dict

作用:建立参考基因组字典

[sunchengquan 15:12:43 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reference]
$ samtools dict lambda_virus.fa > lambda_virus.dict

faidx

功能:对fasta格式的文件建立索引,后缀名.fai。根据索引文件和序列文件,可以快速提取任意区域的序列文件。

fasta序列格式要求:每条序列,除了最后一行外,其他行的长度必须相同!

为了方便演示,我们使用bowtie2软件中的测试数据,在example/reference

查看建立fasta索引

[sunchengquan 14:44:59 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reference]
$ ll
总用量 56K
-rw-r--r-- 1 sunchengquan sunchengquan 49K 7月  27 00:43 lambda_virus.fa

[sunchengquan 14:47:21 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reference]
$ samtools faidx lambda_virus.fa 
[sunchengquan 14:48:32 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reference]
$ ll
总用量 60K
-rw-r--r-- 1 sunchengquan sunchengquan 49K 7月  27 00:43 lambda_virus.fa
-rw-r----- 1 sunchengquan sunchengquan  43 12月 20 14:48 lambda_virus.fa.fai

查看lambda_virus.fa.fai

[sunchengquan 14:49:27 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reference]
$ less -S lambda_virus.fa.fai 

gi|9626243|ref|NC_001416.1|     48502   74      70      71

一共5列,每列之间tab分割

  • 第一列:序列的名称

  • 第二列:序列长度

  • 第三列:第一个碱基的偏移量,从0开始计数

  • 第四列:除了最后一行外,序列中每行的碱基数

  • 第五列:除了最后一行外,序列中每行的长度(包括换行符)

提取fasta序列

因为文件中只有一个序列名称,所以我们提取这个序列100-200bp的区间:

[sunchengquan 15:01:01 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reference]
$ samtools faidx lambda_virus.fa 'gi|9626243|ref|NC_001416.1|':100-200 > NC_001416_100_200.fa 

index

主要功能:对bam文件建立索引,但在此之前必须进行排序(sort),生成后缀是.bai的文件。

[sunchengquan 15:06:57 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools index tmp.sorted.bam

reheader

作用:替换bam文件的信息

[sunchengquan 15:21:38 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -H tmp.sorted.bam |grep -v '^@HD' > in.header.bam
[sunchengquan 15:21:56 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ less in.header.bam 
[sunchengquan 15:31:10 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools reheader -P in.header.bam tmp.sorted.bam > tmp.sorted.h.bam

查看tmp.sorted.h.bam头部信息

[sunchengquan 15:31:20 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -H tmp.sorted.h.bam 
@SQ	SN:gi|9626243|ref|NC_001416.1|	LN:48502
@PG	ID:bowtie2	PN:bowtie2	VN:2.3.4.3	CL:"/home/sunchengquan/local/app/bowtie2-2.3.4.3-linux-x86_64/bowtie2-align-s --wrapper basic-0 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq"

不加-P 参数


[sunchengquan 15:33:06 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools reheader in.header.bam tmp.sorted.bam > tmp.sorted.h.bam
[sunchengquan 15:34:36 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -H tmp.sorted.h.bam 
@SQ	SN:gi|9626243|ref|NC_001416.1|	LN:48502
@PG	ID:bowtie2	PN:bowtie2	VN:2.3.4.3	CL:"/home/sunchengquan/local/app/bowtie2-2.3.4.3-linux-x86_64/bowtie2-align-s --wrapper basic-0 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq"
@PG	ID:samtools	PN:samtools	PP:bowtie2	VN:1.6	CL:samtools reheader in.header.bam tmp.sorted.bam

rmdup

作用:将由PCR duplicates 获得的reads去掉,并保留高比对质量的reads

picard,GATK中也有相似的功能

[sunchengquan 15:36:06 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools rmdup

Usage:  samtools rmdup [-sS] <input.srt.bam> <output.bam>

Option: -s    rmdup for SE reads
        -S    treat PE reads as SE in rmdup (force -s)
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

只需要加-s参数, 就可以对单端测序进行去除PCR重复。其实对单端测序去除PCR重复很简单的,因为比对flag情况只有0,4,16,只需要它们比对到染色体的起始终止坐标一致即可,flag很容易一致。但是对于双端测序就有点复杂了

在双端测序里面一大堆的flag情况,双端测序数据,用samtools rmdup效果就很差,所以很多人建议用picard工具的MarkDuplicates 功能~~~

cat

作用:将多个bam文件合并为一个bam文件(merge命令的区别是cat不需要将bam文件提前进行排序)

[sunchengquan 15:50:28 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools cat
Usage: samtools cat [options] <in1.bam>  [... <inN.bam>]
       samtools cat [options] <in1.cram> [... <inN.cram>]

Concatenate BAM or CRAM files, first those in <bamlist.fofn>, then those
on the command line.

Options: -b FILE  list of input BAM/CRAM file names, one per line
         -h FILE  copy the header from FILE [default is 1st input file]
         -o FILE  output BAM/CRAM

merge

功能:合并多个已经sort的bam文件

当有多个样本的bam文件时,可以使用samtools的merge命令将这些bam文件合并为一个排序的且保持所有输入记录并保持现有排序顺序的bam文件

格式:

samtools merge [-nur1f] [-h inh.sam] [-R reg] [-b <list>] <out.bam> <in1.bam> [<in2.bam><in3.bam>…<inN.bam]
  • -l 指定压缩等级;

  • -b FILE 输入文件列表,一个文件一行;

  • -f 强制覆盖同名输出文件;

  • -h FILE 指定FILE内的’@’头复制到输出bam文件中并替换输出文件的文件头。否则,输出文件的文件头将从第一个输入文件复制过来;

  • -n 设定输入比对文件是以read名进行排序的而不是以染色体坐标排序的;

  • -R STRING 合并输入文件的指定区域;

  • -r 使RG标签添加到每一个比对文件上,标签值来自文件名;

  • -u 输出的bam文件不压缩;

  • -c 当多个输入文件包含相同的@RG头ID时,只保留第一个到合并后输出的文件。当合并多个相同样本的不同文件时,非常有用。

  • -p 与-c参数类似,对于要合并的每一个文件中的@PG ID只保留第一个文件中的@PG。

#-f 参数
[sunchengquan 18:47:33 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools merge tmp.merge.bam tmp.sorted.bam1 tmp.sorted.bam2
[bam_merge] File 'tmp.merge.bam' exists. Please apply '-f' to overwrite. Abort.

[sunchengquan 18:48:53 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools merge -f tmp.merge.bam tmp.sorted.bam1 tmp.sorted.bam2
[sunchengquan 18:49:44 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -H tmp.merge.bam
@HD	VN:1.0	SO:coordinate
@SQ	SN:gi|9626243|ref|NC_001416.1|	LN:48502
@PG	ID:bowtie2	PN:bowtie2	VN:2.3.4.3	CL:"/home/sunchengquan/local/app/bowtie2-2.3.4.3-linux-x86_64/bowtie2-align-s --wrapper basic-0 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq"
@PG	ID:bowtie2-1CE8AF82	PN:bowtie2	VN:2.3.4.3	CL:"/home/sunchengquan/local/app/bowtie2-2.3.4.3-linux-x86_64/bowtie2-align-s --wrapper basic-0 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq"

#-p 参数
[sunchengquan 18:50:04 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools merge -f -p tmp.merge.bam tmp.sorted.bam1 tmp.sorted.bam2
[sunchengquan 18:51:45 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -H tmp.merge.bam
@HD	VN:1.0	SO:coordinate
@SQ	SN:gi|9626243|ref|NC_001416.1|	LN:48502
@PG	ID:bowtie2	PN:bowtie2	VN:2.3.4.3	CL:"/home/sunchengquan/local/app/bowtie2-2.3.4.3-linux-x86_64/bowtie2-align-s --wrapper basic-0 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq"

#-h 参数
#header中多了一行@CO SUNCHENGQUAN 
[sunchengquan 18:51:49 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ cat inh.bam 
@HD	VN:1.0	SO:coordinate
@SQ	SN:gi|9626243|ref|NC_001416.1|	LN:48502
@PG	ID:bowtie2	PN:bowtie2	CL:"/home/sunchengquan/local/app/bowtie2-2.3.4.3-linux-x86_64/bowtie2-align-s --wrapper basic-0 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq"	VN:2.3.4.3
@CO SUNCHENGQUAN 
[sunchengquan 18:52:54 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools merge -h inh.bam -f -p tmp.merge.bam tmp.sorted.bam1 tmp.sorted.bam2
[sunchengquan 18:53:22 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -H tmp.merge.bam
@HD	VN:1.0	SO:coordinate
@SQ	SN:gi|9626243|ref|NC_001416.1|	LN:48502
@PG	ID:bowtie2	PN:bowtie2	CL:"/home/sunchengquan/local/app/bowtie2-2.3.4.3-linux-x86_64/bowtie2-align-s --wrapper basic-0 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq"	VN:2.3.4.3
@CO SUNCHENGQUAN 

#-r 参数
[sunchengquan 18:58:44 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools merge -h inh.bam -f -p -r tmp.merge.bam tmp.sorted.bam1 tmp.sorted.bam2
[sunchengquan 18:58:56 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view tmp.merge.bam |head -1
r523	165	gi|9626243|ref|NC_001416.1|	1	0	*	=	1	0	TTTCGCACTCAATCCGCCGGGCGCGGGGGCGGCGACCTCGCGGGTTTTCGCTATTTAT	(?,8F5CH3+1H<H&.7;F+H?@>%$./?%>F.+/9(H3?4%.=<6A>&"5/.FG?-%	YT:Z:UP	RG:Z:tmp.sorted.bam1
[sunchengquan 18:59:03 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view tmp.merge.bam |tail -1
r9929	141	*	0	0	*	*	0	0	ACCGTAACAAGCAACAGGCAGNCGTGACAGCCAGCAAACCAAAACTCGACCTGACAAACACAGACTGGATTTACGGGGTGGATCTATGAAAATCATC	74)E*11'-G;2G'(!+A26+.E>F'5+.F=@A,/9105>;'E)#F#8C:;G6)G148+<E;)!-B8"#$:7%=?#/H'-F("G4'D/&'&(;2$$+	YT:Z:UP	RG:Z:tmp.sorted.bam2


#-R 参数
[sunchengquan 18:59:12 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools merge -h inh.bam -f -p -r -R 'gi|9626243|ref|NC_001416.1|':1-10 tmp.merge.bam tmp.sorted.bam1 tmp.sorted.bam2
[sunchengquan 19:02:45 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view tmp.merge.bam |wc -l
22
[sunchengquan 19:03:00 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view tmp.merge.bam |awk '{print $4}'
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
4
4
8
8

mpileup

作用:对参考基因组每个位点做碱基堆积,用于call SNP和INDEL。主要是生成BCF、VCF文件或者pileup一个或多个bam文件。比对记录以在@RG中的样本名作为区分标识符。如果样本标识符缺失,那么每一个输入文件则视为一个样本。

mpileup命令中参数比较多,这里只介绍一些常用的参数。

参数 解释
-A 在检测变异中,不忽略异常的reads对
-C 用于调节比对质量的系数,如果reads中含有过多的错配,不能设置为零
-D 输出每个样本的reads深度
-l BED文件或者包含区域位点的位置列表文件,注意:位置文件包含两列,染色体和位置,从1开始计数。BED文件至少包含3列,染色体、起始和终止位置,开始端从0开始计数。
-r 在指定区域产生pileup,需已建立索引的bam文件,通常和-l参数一起使用
-o/v/g 输出文件类型(标准格式文件或者VCF、BCF文件)
-t 设置FORMAT和INFO的列表内容,以逗号分割
-u 生成未压缩的VCF和BCF文件
-I 跳过INDEL检测
-m 候选INDEL的最小间隔的reads
-f 输入有索引文件的fasta参考序列
-F 含有间隔reads的最小片段
-Q 最低的碱基质量,默认设置是13
-q 最低的mapping质量,默认设置是0

查看参数

samtools mpileup

mpileup生成的结果

[sunchengquan 19:04:21 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools mpileup tmp.sorted.bam |head 
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
gi|9626243|ref|NC_001416.1|	1	N	0	*	*
gi|9626243|ref|NC_001416.1|	2	N	0	*	*
gi|9626243|ref|NC_001416.1|	3	N	0	*	*
gi|9626243|ref|NC_001416.1|	4	N	0	*	*
gi|9626243|ref|NC_001416.1|	5	N	1	G	/
gi|9626243|ref|NC_001416.1|	6	N	0	*	*
gi|9626243|ref|NC_001416.1|	7	N	1	C	=
gi|9626243|ref|NC_001416.1|	8	N	0	*	*
gi|9626243|ref|NC_001416.1|	9	N	0	*	*
gi|9626243|ref|NC_001416.1|	10	N	1	C	E

在pileup格式中(没有-u或者-g参数),每一行代表

mpileup生成的结果包含6列

  • 参考序列名

  • 位置

  • 参考碱基

  • 比对上的reads数

  • 比对情况
    其中第5列比较复杂,解释如下:
    1 ‘.’代表与参考序列正链匹配。
    2 ‘,’代表与参考序列负链匹配。
    3 ‘ATCGN’代表在正链上的不匹配。
    4 ‘atcgn’代表在负链上的不匹配。
    5 ‘*’代表模糊碱基
    6 ‘’代表匹配的碱基是一个read的开始;’'后面紧跟的ascii码减去33代表比对质量;这两个符号修饰的是后面的碱基,其后紧跟的碱基(.,ATCGatcgNn)代表该read的第一个碱基。
    7 ‘$’代表一个read的结束,该符号修饰的是其前面的碱基。
    8 正则式’+[0-9]+[ACGTNacgtn]+’代表在该位点后插入的碱基;
    9 正则式’-[0-9]+[ACGTNacgtn]+’代表在该位点后缺失的碱基;

  • 比对上的碱基的质量

有参考序列的pileup使用

[sunchengquan 19:30:08 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools mpileup -f ../reference/lambda_virus.fa tmp.sorted.bam |tail
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
gi|9626243|ref|NC_001416.1|	48493	A	3	,,,$	43D
gi|9626243|ref|NC_001416.1|	48494	C	3	,,,	6A<
gi|9626243|ref|NC_001416.1|	48495	A	2	,$,	/>
gi|9626243|ref|NC_001416.1|	48496	G	2	,,	?D
gi|9626243|ref|NC_001416.1|	48497	G	1	,	<
gi|9626243|ref|NC_001416.1|	48498	T	0	*	*
gi|9626243|ref|NC_001416.1|	48499	T	1	,$	B
gi|9626243|ref|NC_001416.1|	48500	A	0	*	*
gi|9626243|ref|NC_001416.1|	48501	C	0	*	*
gi|9626243|ref|NC_001416.1|	48502	G	0	*	*

生成一个简单的vcf文件

[sunchengquan 19:38:45 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools mpileup -uvf ../reference/lambda_virus.fa tmp.sorted.bam |head -25
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##samtoolsVersion=1.6+htslib-1.6
##samtoolsCommand=samtools mpileup -uvf ../reference/lambda_virus.fa tmp.sorted.bam
##reference=file://../reference/lambda_virus.fa
##contig=<ID=gi|9626243|ref|NC_001416.1|,length=48502>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	tmp.sorted.bam
gi|9626243|ref|NC_001416.1|	1	.	G	<*>	0	.	DP=1;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0,0;MQ0F=0	PL	0,0,0
gi|9626243|ref|NC_001416.1|	2	.	G	<*>	0	.	DP=2;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0,0;MQ0F=0	PL	0,0,0
gi|9626243|ref|NC_001416.1|	2	.	GGCG	GGCGCGGGGGCG	0	.	INDEL;IDV=1;IMF=0.5;DP=2;I16=0,0,1,0,0,0,41,1681,0,0,42,1764,0,0,0,0;QS=0,1;SGB=-0.379885;MQ0F=0	PL	41,3,0

sort

主要功能:对bam文件进行排序(不能对sam文件进行排序)

查看用法

[sunchengquan 19:45:09 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools sort
Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n         Sort by read name
  -t TAG     Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
  -o FILE    Write final output to FILE rather than standard output
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]

主要参数释义:

  • -l:设置文件压缩等级,0不压缩,9压缩最高

  • -m:每个线程运行内存大小(可使用K M G表示)

  • -n:按照read名称进行排序

  • -o:排序后的输出文件

  • -T:PREFIX临时文件前缀

  • -@:设置排序和压缩的线程数,默认单线程

[sunchengquan 22:24:41 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools sort  -l 9 -m 90M -n -o tmp.sorted.bam -T sorted -@ 2 tmp.bam
[bam_sort_core] merging from 0 files and 2 in-memory blocks...

查看tmp.sorted.bam

[sunchengquan 22:35:28 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view tmp.sorted.bam|less -S


r1      99      gi|9626243|ref|NC_001416.1|     18401   42      122M    =       18430
r1      147     gi|9626243|ref|NC_001416.1|     18430   42      198M    =       18401
r2      99      gi|9626243|ref|NC_001416.1|     8886    42      275M    =       8973
r2      147     gi|9626243|ref|NC_001416.1|     8973    42      188M    =       8886
r3      83      gi|9626243|ref|NC_001416.1|     11599   42      338M    =       11596
r3      163     gi|9626243|ref|NC_001416.1|     11596   42      67M     =       11599
r4      99      gi|9626243|ref|NC_001416.1|     40075   42      184M    =       40211
r4      147     gi|9626243|ref|NC_001416.1|     40211   42      48M     =       40075

不加-n 参数

[sunchengquan 22:36:58 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools sort  -l 9 -m 90M  -o tmp.sorted.bam -T sorted -@ 2 tmp.bam
[bam_sort_core] merging from 0 files and 2 in-memory blocks...
[sunchengquan 22:38:10 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view tmp.sorted.bam|less -S


r523    165     gi|9626243|ref|NC_001416.1|     1       0       *       =       1
r2543   165     gi|9626243|ref|NC_001416.1|     1       0       *       =       1
r4819   163     gi|9626243|ref|NC_001416.1|     1       24      4M16I183M       =
r523    89      gi|9626243|ref|NC_001416.1|     1       0       3M26I196M       =
r2543   89      gi|9626243|ref|NC_001416.1|     1       42      214M    =       1
r6080   177     gi|9626243|ref|NC_001416.1|     1       42      4M2I54M =       5850
r9207   177     gi|9626243|ref|NC_001416.1|     1       42      2M6I204M        =
r9555   113     gi|9626243|ref|NC_001416.1|     1       42      116M    =       5826
r3903   99      gi|9626243|ref|NC_001416.1|     2       42      1M8I238M        =
r874    99      gi|9626243|ref|NC_001416.1|     4       42      49M     =       232
r3796   113     gi|9626243|ref|NC_001416.1|     8       42      120M    =       5912

split

作用:根据read group 分割bam文件


[sunchengquan 17:04:02 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ find *|grep '[0-9]$' |awk '{print "@RG\tID:"$0}' >> inh.bam 
[sunchengquan 17:04:06 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools reheader -P inh.bam tmp.merge.bam > tmp.merge.h.bam
[sunchengquan 17:04:14 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -H tmp.merge.h.bam 
@HD	VN:1.0	SO:coordinate
@SQ	SN:gi|9626243|ref|NC_001416.1|	LN:48502
@PG	ID:bowtie2	PN:bowtie2	CL:"/home/sunchengquan/local/app/bowtie2-2.3.4.3-linux-x86_64/bowtie2-align-s --wrapper basic-0 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq"	VN:2.3.4.3
@RG	ID:tmp.sorted.bam1
@RG	ID:tmp.sorted.bam2
[sunchengquan 17:04:27 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools split tmp.merge.h.bam
[sunchengquan 17:04:43 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ ll * |grep '.*merge'
-rw-r----- 1 sunchengquan sunchengquan 2.7M 12月 21 16:29 tmp.merge.bam
-rw-r----- 1 sunchengquan sunchengquan 2.5M 12月 21 17:04 tmp.merge.h_0.bam
-rw-r----- 1 sunchengquan sunchengquan 2.5M 12月 21 17:04 tmp.merge.h_1.bam
-rw-r----- 1 sunchengquan sunchengquan 2.7M 12月 21 17:04 tmp.merge.h.bam

fastq

作用:bam文件转换为fastq


[sunchengquan 17:11:19 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools fastq tmp.sorted.bam -1 read1.fq.gz -2 read2.fq.gz 
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 20000 reads
[sunchengquan 17:13:44 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ ll *|grep 'gz$'
-rw-r----- 1 sunchengquan sunchengquan 1.1M 12月 21 17:13 read1.fq.gz
-rw-r----- 1 sunchengquan sunchengquan 1.1M 12月 21 17:13 read2.fq.gz
[sunchengquan 17:14:20 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ zcat read1.fq.gz |wc -l 
40000
[sunchengquan 17:14:58 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ cat read_1.fq |wc -l 
cat: read_1.fq: 没有那个文件或目录
0
[sunchengquan 17:15:19 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ cat reads_1.fq |wc -l 
40000

fasta

bam文件转换为fasta


[sunchengquan 17:15:31 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools fasta tmp.sorted.bam -1 read1.fa.gz -2 read2.fa.gz 
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 20000 reads
[sunchengquan 17:18:56 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ zcat read1.fa.gz |wc -l 
20000
[sunchengquan 17:19:09 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ zcat read1.fa.gz |head
>r523
GACTTCCATTGTTCATTCCACGGACACAAACAGAGAAAGGAAACGACAGAGNCCAAAAAGCCTNGCTTTCAGCACCTGTCGTTTCCTTTCTTTTCAGAGGGTATTTTAAATAAAAACATTAAGTTATGACGAAGAAGAACGGAAACGCCTTAAACCGGAAAATTTTCATAAATAGCGAAAACCCGCGAGGNCGCCGCCCCCGCGCCCGGCGGANTGAGTGCGAAA
>r2543
CAGCTGCTTTTTGTTGACTTCCATTGTTCATTCCACGGACAAAAACAGAGANAGGAAACGACAGAGGCCAAAAAGCCTCGCTTTCAGCACCTGTCGTTTCCTTTCTTTTCAGAGGGTATTTTAAATAAAAACATTAAGTTATGACGAAGAAGAACGGAAACGCCTTAAACCGGAAAAATTTCNTAAATAGCGAAAACCCGCGAGGTCGCCGCCC
>r9555
TCCCTTCTTTTAAGAGGGTATTTTAAATAAAAACATTAAGTTATGACGAAGAAGAACGGAAACGCCTTAAACCGGAAAATTTTCATAAATAGCGAAAACCCGCGAGGTCGCCGCCC
>r3903
NGCGCGGGGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCGTTCTTCTTCGTCATAACTTAATGTTTTTATTTAAAATACCCTCTGAAAAGAAAGGAAACGACAGGTGCTGTAAGCGAGGCTTTTTGGCCTCTGTCGTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAACAAAAAGCAGCTGGCTGACATTTTCGGTGCGAGTATCCG
>r874
CGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTTCGGTTTAAGG

bedcov

输出每个panel的总的碱基数;或者将一个panel中每个碱基的深度加到一起。panel指的是每个bed中的一个窗口,由chrID,start,end定义使用方法

depth

输出每个碱基的深度,即匹配到的次数。可以通过-b添加bed文件,来缩小范围。另外。默认最大深度为8000,如果是高深度测序应该通过-d-m来定义更大的深度。如果需要统计深度为0的位点,应该加上-a选项。参数较多,主要参数如下:

  • -a:输出所有的碱基深度(包括0)

  • -b/-r:控制深度的范围(后面跟染色体)

  • -f:bam文件名字

  • -l:设置read长度阈值

  • -d/-m:最大覆盖深度

  • -q:碱基质量阈值

  • -Q:比对质量阈值

查看100-200bp区域的深度


[sunchengquan 19:14:33 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools index tmp.sorted.bam
[sunchengquan 19:15:19 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools depth -r 'gi|9626243|ref|NC_001416.1|':100-200 tmp.sorted.bam > tmp.sorted.bam.depth
[sunchengquan 19:15:22 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ cat tmp.sorted.bam.depth|head
gi|9626243|ref|NC_001416.1|	100	30
gi|9626243|ref|NC_001416.1|	101	31
gi|9626243|ref|NC_001416.1|	102	31
gi|9626243|ref|NC_001416.1|	103	31
gi|9626243|ref|NC_001416.1|	104	30
gi|9626243|ref|NC_001416.1|	105	30
gi|9626243|ref|NC_001416.1|	106	30
gi|9626243|ref|NC_001416.1|	107	30
gi|9626243|ref|NC_001416.1|	108	30
gi|9626243|ref|NC_001416.1|	109	29

结果是tab分割的三列

第一列:染色体名称

第二列:碱基位置

第三列:测序深度

flagstat

对bam文件进行简单的统计,一共有13项,根据QC结果,每一项都分PASS和FAIL,一般都是PASS,毕竟一般QC都在比对前做过了,低质量的大都过滤了

[sunchengquan 19:15:49 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools flagstat tmp.sorted.bam
20000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
18995 + 0 mapped (94.98% : N/A)
20000 + 0 paired in sequencing
10000 + 0 read1
10000 + 0 read2
18332 + 0 properly paired (91.66% : N/A)
18416 + 0 with itself and mate mapped
579 + 0 singletons (2.90% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

总的reads

20000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates

总体reads比对率

18995 + 0 mapped (94.98% : N/A)

PEreads

20000 + 0 paired in sequencing

read1

10000 + 0 read1

read2

10000 + 0 read2

PE reads比对到同一条参考序列,并且两条reads之间的距离符合设置的阈值

18332 + 0 properly paired (91.66% : N/A)

PE中两条都比对到参考序列上的reads数

18416 + 0 with itself and mate mapped

单独一条匹配到参考序列上的reads数,和上一个相加,则是总的匹配上的reads数

579 + 0 singletons (2.90% : N/A)

PE中两条分别比对到两条不同的参考序列的reads数

0 + 0 with mate mapped to a different chr

PE中两条分别比对到两条不同的参考序列的reads数(比对质量>=5)

0 + 0 with mate mapped to a different chr (mapQ>=5)

idxstats

对bam/sam/cram文件的索引进行输出,一共四列:

  • chrID 序列名
  • chrLength 序列长度
  • mapped reads 比对上的reads数
  • unmapped reads 未必对数目

文件的索引文件必须存在于输入文件相同目录下,如果没有需要先建立索引

[sunchengquan 19:49:46 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools idxstats tmp.sorted.bam
gi|9626243|ref|NC_001416.1|	48502	18995	579
*	0	0	426

stats

对bam文件做详细统计,其统计结果可用mics/plot-bamstats作图

  • SN Summary Number
  • FFQ First Fragment Qualities
  • LFQ Last Fragment Quailites
  • GCF GC Content of first fragments
  • GCL GC Content of last fragments
  • GCC ACGT content per cycle
  • IS Insert sizes
  • RL Read lengths
  • ID Indel distribution
  • IC Indels per cycle
  • COV Coverage distribution
  • GCD GC-depth

提取SN

[sunchengquan 20:08:45 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools stats tmp.sorted.bam|grep '^SN' |cut -f 2-
raw total sequences:	20000
filtered sequences:	0
sequences:	20000
is sorted:	1
1st fragments:	10000
last fragments:	10000
reads mapped:	18995
reads mapped and paired:	18416	# paired-end technology bit set + both mates mapped
reads unmapped:	1005
reads properly paired:	18332	# proper-pair bit set
reads paired:	20000	# paired-end technology bit set
reads duplicated:	0	# PCR or optical duplicate bit set
reads MQ0:	11	# mapped and MQ=0
reads QC failed:	0
non-primary alignments:	0
total length:	2178385	# ignores clipping
bases mapped:	2079519	# ignores clipping
bases mapped (cigar):	2079519	# more accurate
bases trimmed:	0
bases duplicated:	0
mismatches:	51081	# from NM fields
error rate:	2.456385e-02	# mismatches / bases mapped (cigar)
average length:	108
maximum length:	366
average quality:	16.3
insert size average:	251.6
insert size standard deviation:	44.2
inward oriented pairs:	8485
outward oriented pairs:	0
pairs with other orientation:	40
pairs on different chromosomes:	0

提取IS

[sunchengquan 21:25:35 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools stats tmp.sorted.bam|grep '^IS' |cut -f 2-|tail
357	7	7	0	0
358	5	5	0	0
359	6	6	0	0
360	4	4	0	0
361	7	7	0	0
362	5	5	0	0
363	1	1	0	0
364	0	0	0	0
365	1	1	0	0
366	4	4	0	0

flags

将字符型和数字型flag相互转换

[sunchengquan 21:29:13 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools flags 83
0x53	83	PAIRED,PROPER_PAIR,REVERSE,READ1
[sunchengquan 21:29:20 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools flags read2
0x80	128	READ2
[sunchengquan 21:30:01 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools flags read1
0x40	64	READ1
[sunchengquan 21:30:24 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools flags read1,read2
0xc0	192	READ1,READ2

tview

直观显示reads比对到基因组的情况,和基因组浏览器类似

主要参数:

  • -d:输出类型

  • -p:直接定位给定位置

  • -s:reads显示

当给出参考基因组的时候,会在第一排给出参考基因组的序列,否则第一排全用N表示。

首先利用sort进行排序后,在利用index建立索引后,用下面命令:

[sunchengquan 21:35:17 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools tview -p 'gi|9626243|ref|NC_001416.1|':300 tmp.sorted.bam

示意图:
xxxx

利用颜色来标注比对质量、碱基质量、核苷酸等。

>=30 的碱基质量或比对质量使用白色表示

20-39的用黄色表示

10-19的用绿色表示

0-9的用蓝色表示

使用字母r切换显示read name等。

具体的操作说明可以?键查看。

view

sam和bam文件之间相互转换,针对bam文件进行相关操作。bam文件是sam文件的二进制格式,占据内存较小且运算速度快

重要参数释义:

  • -b:输出bam格式,用于后续分析

  • -C:输出CRAM文件

  • -1:快速压缩(需要-b)

  • -u:输出未压缩的bam文件,节约时间,占据较多磁盘空间(需要-b)

  • -h:默认输出sam文件不带表头,该参数设定后输出带表头信息

  • -H:仅仅输出表头信息

  • -c:仅打印匹配数

  • -o:输出文件(stdout标准输出)

  • -U:输出没有经过过滤选择的reads

  • -t:制表分隔符文件(需要提供额外的参考数据,比如参考基因组的索引)

  • -L:仅包括和bed文件有重叠的reads

  • -r:仅输出在STR读段组中的reads

  • -R:仅输出特定reads

  • -q:定位的质量大于INT[默认0]

  • -l:仅输出在STR 库中的reads

  • -F:过滤标识(flag)(去除匹配标识对应的reads,保留剩余的)[默认0]

  • -f:匹配标识(flag)(保留匹配的标识对应的reads)[默认0]

  • -T:使用fasta格式的参考序列

bam文件转换为sam文件

[sunchengquan 21:41:12 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -h tmp.sorted.bam >tmp.sam

sam文件转换为bam文件

[sunchengquan 21:41:12 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -bS tmp.sam > tmp.sorted.bam

比对到反向互补链的reads

$ samtools view -f 16 tmp.sorted.bam|head -1
r523	89	gi|9626243|ref|NC_001416.1|	1	0	3M26I196M	=	1	0	TTTCGCACTCANTCCGCCGGGCGCGGGGGCGGCGNCCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCGTTCTTCTTCGTCATAACTTAATGTTTTTATTTAAAATACCCTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCNAGGCTTTTTGGNCTCTGTCGTTTCCTTTCTCTGTTTGTGTCCGTGGAATGAACAATGGAAGTC	"1&;;,*'<-1G$427,3>D:+F+';13<:06=F2$:71D5G><@E@;0>");,G0=@7)!:H9&D.C?=1@00H=81E3C,8@<=B?;9;'.+2A!H<#!073&7=8&9G!CHDCG*>%'1C!;9!.+68,5?9304=12>4:18:C&%G7+H6-8CD0$:2"./(#D5E14?5=230>"?4?>G&@>'60&(?;44:3G5/B::4'9FC);(52!@<<6F5=5	AS:i:-97	XN:i:0	XM:i:7	XO:i:1	XG:i:26	NM:i:33	MD:Z:0G0G0G5A126G11C24T26	YT:Z:UP

比对到正向链的reads

[sunchengquan 21:55:03 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -F 16 tmp.sorted.bam|head -1
r523	165	gi|9626243|ref|NC_001416.1|	1	0	*	=	1	0	TTTCGCACTCAATCCGCCGGGCGCGGGGGCGGCGACCTCGCGGGTTTTCGCTATTTAT	(?,8F5CH3+1H<H&.7;F+H?@>%$./?%>F.+/9(H3?4%.=<6A>&"5/.FG?-%	YT:Z:UP

统计共有多少条reads(pair-end-reads这里算一条)参与了比对参考基因组

[sunchengquan 21:56:24 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -c tmp.sorted.bam
20000
[sunchengquan 21:58:02 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ echo `samtools view -c tmp.sorted.bam`/2 |bc
10000

筛选出比对失败的reads,看序列特征

[sunchengquan 21:59:33 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -cf 4 tmp.sorted.bam
1005

[sunchengquan 22:09:01 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -f 4 tmp.bam|cut -f10 |head -3
GCGTNGTACNNGNNGATGNTTACGACTAANACTTNNNNNTNCNGNNT
ANTGGNTNAGCGGNCGNAGNGNTCNGNAGCNANNAANCTGNTATGNTNNNTNNATCNNNCNNNNGTGNNAANNNCCAGNNNNGGCNNNNCATACANNNTNNNCNTNNNTACANNATANNCGNNATTNNACNGTNNCGNGNNCCGNGTNNNNN
NNTNNNNNAGNTNAANANANGTNTTGNGAANCTNNANACNNCTTTTNNNNTATGNTNNTNAGCCTAG

筛选出比对质量值大于30的情况

[sunchengquan 22:16:19 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -q 30 tmp.bam |awk '{print $1,$5}'|head -3
r1 42
r1 42
r2 42

筛选出比对成功,但是并不是完全匹配的序列

[sunchengquan 22:17:46 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -F 4 tmp.bam |awk '{print $6}'|grep '[IDNSHPX]'|head -4
72M2D119M
126M3D61M
70M1D71M
53M5D134M

猜你喜欢

转载自blog.csdn.net/sunchengquan/article/details/85176940