Bioinformatics Notes and Software Summary: BWA, wtdbg

Third-generation assemblers: Canu, SmartDenovo, wtdbg
Third-generation error-correction/polishing tools: Racon, Nanopolish
Falcon: an experimental diploid assembler, tested on multi-Gb genomes.
Canu: a fork of the Celera Assembler, designed specifically for high-noise single-molecule sequencing.
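
A typical Canu invocation, as a hedged sketch (the file names and genome size below are placeholders, not from this post):

# -p sets the output prefix, -d the output directory; genomeSize guides
# coverage estimation and read selection.
canu -p asm -d asm-out genomeSize=4.8m -pacbio-raw reads.fq.gz
# Note: newer Canu releases (2.x) spell the input flag -pacbio instead of -pacbio-raw.
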
blast: short for Basic Local Alignment Search Tool, an extremely common sequence alignment tool. It comprises several modules (loosely speaking): blastn, blastp, blastx, tblastx, and so on.

blastp is used for protein-to-protein sequence alignment. BLAST is both a search algorithm and the name of the tool.
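
As a minimal sketch (file names are placeholders), a nucleotide search with the BLAST+ command-line tools looks like:

# Build a nucleotide BLAST database from ref.fa, then query it in tabular format.
makeblastdb -in ref.fa -dbtype nucl -out refdb
blastn -query query.fa -db refdb -outfmt 6 -out hits.tsv
# For protein-vs-protein searches, build the database with -dbtype prot and use blastp.
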

Zhihu: what aligners are currently used for third-generation sequencing reads?

https://www.zhihu.com/question/28688884/answer/126305959

de novo assembly

The De Bruijn graph is currently the most widely used assembly algorithm for second-generation (short-read) sequencing data. The algorithm splits the already-short reads into many even shorter k-mers (with k less than the read length); adjacent k-mers are joined through their (k-1) shared bases (i.e., the window slides one position at a time), which reduces the computational complexity of finding overlap regions and lowers memory consumption. (A short sketch of the k-mer splitting follows the software list below.)
Commonly used short-read assemblers include SPAdes, Velvet, SOAPdenovo, ABySS, and MaSuRCA.
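
The k-mer splitting described above can be sketched in one line of Perl (the read and k below are made up for illustration):

# Slide a window of k=4 across a read; adjacent k-mers overlap by k-1=3 bases.
perl -e '$read="ATCTGTCCTA"; $k=4; print substr($read,$_,$k),"\n" for 0 .. length($read)-$k;'
# Prints ATCT, TCTG, CTGT, ... each k-mer shifted one base past the previous one.
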

1 Mb = 1,000 kb = 1,000,000 bp. A kb (kilobase) is one thousand bp, i.e., one thousand bases.

Third-generation sequencing falls into two main camps: PacBio's SMRT (Single Molecule, Real-Time) sequencing and Oxford Nanopore's nanopore sequencing.

fuzzy-Bruijn graph (FBG)

de Bruijn graph(DBG)

PE reads are paired-end reads.
A read is the sequence obtained from a single reaction in high-throughput sequencing.
During sequencing, both ends of a DNA fragment can be read: first one end is sequenced to obtain one read, then the other end is sequenced to obtain a second read. Those two reads are PE reads.
PE reads facilitate downstream sequence assembly.


High-throughput sequencing

High-throughput sequencing, also called "next-generation" sequencing (NGS), is characterized by the ability to sequence hundreds of thousands to millions of DNA molecules in parallel in a single run, generally with relatively short read lengths. It was a revolutionary change from traditional sequencing; the name "next-generation sequencing" used in the literature reflects how epoch-making the change was. Because it makes detailed, genome-wide analysis of a species's transcriptome and genome possible, it is also called deep sequencing.

Transcriptome

Broadly, the transcriptome refers to the complete set of transcripts in a cell under a given physiological condition, including messenger RNA, ribosomal RNA, transfer RNA, and non-coding RNA; narrowly, it refers to the set of all mRNAs. Proteins are the main executors of cellular function, and the proteome is the most direct description of a cell's functional state, so the transcriptome has become the principal means of studying gene expression: it is the necessary link between the genetic information of the genome and the proteome that carries out biological function. Transcriptional regulation is the most studied and most important mode of regulation in organisms.

De Novo Sequencing

De novo sequencing, also called sequencing from scratch, determines a species's genome without requiring any prior sequence information. Bioinformatic methods are used to assemble the reads into the species's genome map. Concretely, for a species whose genome is unknown or that lacks a closely related reference, genomic DNA fragments of different lengths and their libraries are sequenced, then assembled and annotated with bioinformatic methods to obtain a complete genome map. It is widely used to resolve the genome sequence, gene content, and evolutionary features of previously unsequenced species.

Whole-genome resequencing

Whole-genome resequencing sequences different individuals of a species whose reference genome is already known, and performs variant analysis at the individual or population level. SBC combines libraries with different insert sizes and short-read, paired-end sequencing to scan the whole genome for sequence differences and structural variants associated with important traits, enabling evolutionary analysis and candidate-gene prediction.

Genome resequencing (Genome Re-sequencing)

Sequencing the genomes of individuals whose reference genome is known, and analyzing differences at the individual or population level.

Sequencing depth: how many times, on average, each fragment is sequenced.

Zhihu: what software or strategies are available for assembling larger genomes from third-generation sequencing data?

I also used WTDBG (ruanjue/wtdbg). The software was unpublished at the time, yet it can assemble genomes larger than 1 Gb. At first it seemed questionable, but after using it I found the results were very good. Unlike other third-generation assemblers, it neither self-corrects the reads first nor uses an overlap-layout algorithm; instead it chops the long reads into k-mers, assembles from those, and then polishes the assembly with minimap. Both the genome size and the quality of the result were reasonable. Because it generates a huge number of k-mers, WTDBG demands a lot of memory when it runs, though this depends on parameters (one parameter selects what fraction of k-mers to use: all of them, 1/2, 1/4, and so on).
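
For reference, the wtdbg2 README documents a two-step run, assembly then consensus; a sketch (the read file and genome size are placeholders):

# Step 1: build the fuzzy Bruijn graph and contig layout (-x rs = PacBio RSII preset).
wtdbg2 -x rs -g 4.6m -t 16 -i reads.fa.gz -fo dbg
# Step 2: derive consensus contig sequences from the layout.
wtpoa-cns -t 16 -i dbg.ctg.lay.gz -fo dbg.ctg.fa
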

1. Why assemble a genome de novo?

The genome is the genetic basis of phenotypic differences; obtaining a reference genome is the first and a necessary step for in-depth, genome-wide study of an organism; de novo sequencing and assembly builds a reference genome for newly sequenced species.

2. Why study the whole genome?

To determine what is missing from a genome; to identify genes and pathways that are difficult to study biochemically; to study every gene in a pathway of interest; to study the regulatory mechanisms and structural features of non-coding regions (introns, promoters, telomeres, etc.); because genomes provide large databases that are amenable to statistical methods; to identify divergent sequences that may have subtle phenotypes; and to study the evolution of species and genomes.

reads

Reads are what sequencing produces, i.e., individual readouts. Reads that can be merged are overlapping reads; a read on its own cannot become a contig. In bioinformatic analysis reads are clustered, and unclustered reads are generally excluded from downstream analysis.

Reads are not components of the genome itself; they are short sequencing fragments produced by high-throughput sequencers. Sequencing a whole genome yields millions upon millions of reads, which are then stitched together to recover the full genome sequence. High-throughput sequencing is now also applied to transcriptome studies, where the read counts of different fragments reflect different expression levels.

A read is simply one sequencing readout, tied to the sequencer's read length.
For example: fragment the genome into 200 bp pieces and sequence with a 10 bp read length; the readout ATCTGTCCTA is one read, and many such reads are, collectively, the "reads".

Reads are the individual sequences produced by sequencing; contigs are the sequences obtained by assembling reads during bioinformatic analysis. Sequencing yields many small fragments, and assembly filters, refines, and joins those small pieces into longer stretches of sequence.

The sequences produced by a high-throughput sequencing platform are called reads.

kmer

https://www.jianshu.com/p/435baf1707e9

CCS (Circular Consensus Sequence): a consensus built from the multiple subreads of a single polymerase read, which mutually correct one another to yield one sequence reflecting the true library molecule.

kb = kilobase

nt = nucleotide

bp = base pair

Exome sequencing (whole-exome sequencing)

mRNA sequencing (RNA-seq)

small RNA sequencing

Small RNAs (microRNAs, siRNAs, and piRNAs)

miRNA (microRNA) sequencing

ChIP-seq

Chromatin immunoprecipitation (ChIP), also called binding-site analysis.

ChIRP-seq

ChIRP-seq (Chromatin Isolation by RNA Purification) is a high-throughput sequencing method for detecting the DNA and proteins bound by a given RNA.

metagenomics (metagenome sequencing)

Gene assembly algorithms

The two assembly algorithm families: OLC (overlap-layout-consensus) and DBG (de Bruijn graph)

The main steps of sequence assembly are grouping reads into contigs and grouping contigs into scaffolds. A contig is a multiple alignment of reads that yields a consensus sequence, while a scaffold (i.e., a supercontig or metacontig) specifies the order and orientation of contigs and the sizes of the gaps between them.

Genome assembly algorithms

Read

The sequence tags produced by a high-throughput sequencing platform are called reads.

Contig

Assembly software joins reads based on their overlap regions; the resulting sequences are called contigs.

Contig N50

After assembly, you obtain contigs of various lengths. Sum all contig lengths to get the total assembly length, then sort the contigs from longest to shortest: Contig 1, Contig 2, Contig 3, ..., Contig 25. Add the contig lengths in that order; when the running sum reaches half the total length, the length of the last contig added is the Contig N50. Example: if Contig 1 + Contig 2 + Contig 3 + Contig 4 = 1/2 of the total length, then the length of Contig 4 is the Contig N50. Contig N50 can serve as one criterion for judging the quality of a genome assembly.
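
The definition translates directly into code, and the same procedure gives Scaffold N50. A minimal sketch (assumes a fasta of contigs named contigs.fa; the awk step emits one length per record, handling multi-line records):

# Per-contig lengths from a (possibly multi-line) fasta, then N50:
awk '/^>/{if(l)print l; l=0; next}{l+=length($0)} END{if(l)print l}' contigs.fa |
perl -ne 'chomp; push @len,$_; END{$tot+=$_ for @len; @len=sort{$b<=>$a}@len; for(@len){$sum+=$_; if($sum>=$tot/2){print "N50: $_\n"; last}}}'
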

Scaffold

In de novo genome sequencing, after contigs are assembled from reads, one usually also constructs 454 paired-end or Illumina mate-pair libraries to obtain sequences from both ends of fragments of defined sizes (e.g., 3 kb, 6 kb, 10 kb, 20 kb). Based on these sequences, the relative order of some contigs can be determined; contigs whose order is thus known make up a scaffold.

Scaffold N50

Scaffold N50 is defined analogously to Contig N50. Assembling contigs yields scaffolds of various lengths. Sum all scaffold lengths to get a total scaffold length, then sort the scaffolds from longest to shortest: Scaffold 1, Scaffold 2, Scaffold 3, ..., Scaffold 25. Add their lengths in that order; when the running sum reaches half the total, the length of the last scaffold added is the Scaffold N50. Example: if Scaffold 1 + Scaffold 2 + Scaffold 3 + Scaffold 4 + Scaffold 5 = 1/2 of the total length, then the length of Scaffold 5 is the Scaffold N50. Scaffold N50 can likewise serve as a criterion for judging assembly quality.

// Relationships among reads, contigs, scaffolds, unigenes, and singletons

In high-throughput sequencing, each reaction on the chip reads out one relatively short sequence, called a read; these are the raw data.
Many reads whose fragments overlap can be assembled into a larger piece called a contig.
Multiple overlapping contigs compose a longer scaffold.
A contig that, once assembled, is found to encode a protein is called a singleton;
a scaffold assembled from several contigs that is found to encode a protein is called a unigene.
  
  When a chromosome is sequenced and the reads can be joined completely with no internal gaps, the resulting sequence is called a contig. If gaps remain but their lengths are known, the sequence is called a scaffold.
Sort all contigs and scaffolds from longest to shortest; when the cumulative length reaches half the chromosome length, the lengths of the contig and scaffold at that point are the Contig N50 and Scaffold N50. These two values are used mainly to assess assembly quality: the larger they are, the better the assembly and the more effective the sequencing.

A read is a read length: the number of bases one reaction can determine.
A contig is a fragment pieced together from many reads by sequence overlap; in theory all the reads from one chromosome would join into a single contig, but in practice this cannot be achieved. Every base within a contig is known.
Before high-throughput sequencing, a library of a defined insert size is constructed (say, an 8 kb library). Each DNA fragment in the library is sequenced from both ends, yielding paired sequences that are known to come from the same fragment. That pairing information is used to link contigs together into a scaffold; within a scaffold, some stretches of sequence remain unknown.

Sequencing depth and coverage

Sequencing depth is the ratio of the total number of bases sequenced to the size of the genome. If a genome is 2 Mb and the depth is 10X, the total data volume is 20 Mb. Coverage is the proportion of the genome covered by the obtained sequences. Because of complex structures in the genome such as high-GC and repetitive regions, the final assembly usually cannot cover every region; the regions not obtained are called gaps. For example, if a bacterial genome is sequenced at 98% coverage, then 2% of the sequence was never captured.
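
Given an alignment, both numbers can be estimated directly. A hedged sketch with samtools (aln.bam is a placeholder coordinate-sorted BAM; -a makes samtools report zero-coverage positions too):

# Mean per-base depth and breadth of coverage across all reference positions.
samtools depth -a aln.bam | awk '{sum+=$3; if($3>0) covered++} END{printf "mean depth: %.1fX, coverage: %.1f%%\n", sum/NR, 100*covered/NR}'
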

Transcript reconstruction

Assembling transcripts from sequencing data. There are two approaches: (1) de novo assembly; (2) reference-guided reconstruction. (A sketch of both follows.)
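
As a hedged illustration of the two routes (file names are placeholders; flags follow each tool's documented usage):

# Reference-guided: assemble transcripts from a coordinate-sorted BAM,
# guided by an existing annotation.
stringtie aln.sorted.bam -G ref.gtf -o transcripts.gtf
# De novo: assemble transcripts from paired-end reads with no reference.
Trinity --seqType fq --left reads_1.fq --right reads_2.fq --max_memory 50G --CPU 8
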

Other supporting software

1. BWA

Developed by Heng Li, BWA is the most widely used read aligner. Its newest algorithm is mem (maximal exact matches). aln handles reads shorter than 100 bp; mem handles reads longer than 70 bp.
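
A minimal BWA workflow (file names are placeholders):

# Build the FM-index once per reference.
bwa index ref.fa
# mem: the current default for reads longer than ~70 bp.
bwa mem -t 8 ref.fa reads_1.fq reads_2.fq > aln.sam
# aln/samse: the older path for short single-end reads (< ~100 bp).
bwa aln ref.fa short.fq > short.sai
bwa samse ref.fa short.sai short.fq > short.sam
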

Significance

The advent of high-throughput sequencing was a milestone in genomics. It drastically reduced the per-base cost of nucleic acid sequencing compared with first-generation technology. Taking human genome sequencing as an example, the Human Genome Project at the end of the last century spent 3 billion US dollars decoding the human genetic blueprint, whereas second-generation sequencing has brought human genome sequencing into the $10,000-genome era. Such a low per-base cost lets us launch genome projects for many more species and thereby decode the genomes of more organisms. At the same time, for species with a finished reference genome, large-scale whole-genome resequencing of their other varieties has also become feasible.

  • Next-generation sequencing (NGS) mainly comprises three specific techniques: whole-genome sequencing (whole-genome sequencing, WGS), whole-exome sequencing (whole-exome sequencing, WES), and targeted region sequencing (targeted region sequencing, TRS), all of which fall under the NGS umbrella.
    • Whole-genome resequencing (Whole Genome Sequencing, WGS)
      The latest sequencing techniques can simultaneously diagnose single-gene disorders and chromosomal aneuploidy, with an accuracy exceeding 99%. Genome sequencing has been regarded as one of the most important technological breakthroughs in disease prevention: it can substantially reduce the incidence of heritable disease and birth defects, and enables disease prediction, prevention, and early warning. Studies indicate the human body has roughly 30,000-plus genes; apart from trauma and certain common external causes, most disease is related to genes. Gene abnormalities or damage alter the function of the corresponding proteins or enzymes, causing disease. Genetic testing examines DNA or RNA from blood or other body-fluid cells, letting people learn their own genetic information and their risk of future disease.
    • Whole-exome sequencing (Whole Exome Sequencing, WES)
      The exome is the totality of protein-coding sequences in an individual's genomic DNA. The human exome accounts for about 1% of the whole genome but contains roughly 85% of disease-causing mutations. Whole-exome sequencing captures and enriches exonic DNA, then applies high-throughput sequencing.
    • Targeted region sequencing (Targeted Regions Sequencing, TRS)
      The most common approach currently uses capture arrays. The principle is DNA hybridization: custom probes for the target genomic regions hybridize with genomic DNA on an array or in solution, enriching DNA from the target regions, which is then sequenced with NGS. The targeted regions can be contiguous DNA, or fragments scattered across different regions of one chromosome or across several chromosomes. Targeted sequencing suits cases where, for instance, linkage analysis has already confined a mutation to a particular chromosomal region.
      NGS takes 7-14 days to produce a report, so choosing it presupposes frozen-embryo transfer (the cultured blastocysts are frozen, and transfer takes place once the NGS results are back). NGS gives patients a more accurate result, and IVF success rates improve accordingly.
      You may also have heard of PGD. How does PGD differ from PGS?
      PGD (Preimplantation Genetic Diagnosis) checks specific genes to determine whether an embryo carries mutations that could cause particular diseases. If a gene carries a certain abnormality, the embryo may develop a specific disease such as thalassemia, Down syndrome, or cri-du-chat syndrome. Current PGD commonly uses SNP and PCR approaches and can diagnose up to 125 recessive disorders; familial conditions such as diabetes and hypertension, however, cannot yet be ruled out by any medical technology worldwide.

Image source: https://www.sohu.com/a/227765109_464200

Miscellany

A quantile is the value at an equal-division point after all the data are sorted in ascending order. If the data are divided into two equal parts, the quantile is the median; if divided into four equal parts, the quantiles are the quartiles. The quartiles (also called quartile points) are the three values that split the sorted data into four equal parts.
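
A quick worked example in Perl (the eleven values are made up; note that several quartile conventions exist, and this one takes the medians of the lower and upper halves):

# With n=11 sorted values, Q1/median/Q3 sit at indices 2/5/8.
perl -e '@d = sort { $a <=> $b } (6,47,49,15,42,41,7,39,43,40,36); print "Q1=$d[2] median=$d[5] Q3=$d[8]\n"'
# Prints: Q1=15 median=40 Q3=43
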

Counting bases, GC content, read count, longest read, shortest read, and mean read length


# Count bases and GC content in a fasta file

grep -v '>' input.fa | perl -ne '{$count_A=$count_A+($_=~tr/A//);$count_T=$count_T+($_=~tr/T//);$count_G=$count_G+($_=~tr/G//);$count_C=$count_C+($_=~tr/C//);$count_N=$count_N+($_=~tr/N//)};END{print qq{total count is },$count_A+$count_T+$count_G+$count_C+$count_N, qq{\nGC%:},($count_G+$count_C)/($count_A+$count_T+$count_G+$count_C+$count_N),qq{\n} }'

# Read count, base count, longest read, shortest read, and mean read length for a fastq file (assumes standard 4-line records; lengths are taken from the quality line, which matches the sequence length)

perl -ne 'BEGIN{$min=1e10;$max=0;}next if ($.%4);chomp;$read_count++;$cur_length=length($_);$total_length+=$cur_length;$min=$min>$cur_length?$cur_length:$min;$max=$max<$cur_length?$cur_length:$max;END{print qq{Totally $read_count reads\nTotally $total_length bases\nMAX length is $max bp\nMIN length is $min bp \nMean length is },$total_length/$read_count,qq{ bp\n}}' input.fq
 
# Read count, base count, longest read, shortest read, and mean read length for a fasta file (assumes each record's sequence is on a single line)

perl -ne 'BEGIN{$min=1e10;$max=0;}next if ($.%2);chomp;$read_count++;$cur_length=length($_);$total_length+=$cur_length;$min=$min>$cur_length?$cur_length:$min;$max=$max<$cur_length?$cur_length:$max;END{print qq{Totally $read_count reads\nTotally $total_length bases\nMAX length is $max bp\nMIN length is $min bp \nMean length is },$total_length/$read_count,qq{ bp\n}}' input.fa

wtdbg2 -h

WTDBG: De novo assembler for long noisy sequences
Author: Jue Ruan <[email protected]>
Version: 2.4 (20190417)
Usage: wtdbg2 [options] -i <reads.fa> -o <prefix> [reads.fa ...]
Options:
 -i <string> Long reads sequences file (REQUIRED; can be multiple), []
 -o <string> Prefix of output files (REQUIRED), []
 -t <int>    Number of threads, 0 for all cores, [4]
 -f          Force to overwrite output files
 -x <string> Presets, comma delimited, []
            preset1/rsII/rs: -p 21 -S 4 -s 0.05 -L 5000
                    preset2: -p 0 -k 15 -AS 2 -s 0.05 -L 5000
                    preset3: -p 19 -AS 2 -s 0.05 -L 5000
                  sequel/sq
               nanopore/ont:
            (genome size < 1G: preset2) -p 0 -k 15 -AS 2 -s 0.05 -L 5000
            (genome size >= 1G: preset3) -p 19 -AS 2 -s 0.05 -L 5000
      preset4/corrected/ccs: -p 21 -k 0 -AS 4 -K 0.05 -s 0.5
 -g <number> Approximate genome size (k/m/g suffix allowed) [0]
 -X <float>  Choose the best <float> depth from input reads(effective with -g) [50.0]
 -L <int>    Choose the longest subread and drop reads shorter than <int> (5000 recommended for PacBio) [0]
             Negative integer indicate tidying read names too, e.g. -5000.
 -k <int>    Kmer fsize, 0 <= k <= 25, [0]
 -p <int>    Kmer psize, 0 <= p <= 25, [21]
             k + p <= 25, seed is <k-mer>+<p-homopolymer-compressed>
 -K <float>  Filter high frequency kmers, maybe repetitive, [1000.05]
             >= 1000 and indexing >= (1 - 0.05) * total_kmers_count
 -S <float>  Subsampling kmers, 1/(<-S>) kmers are indexed, [4.00]
             -S is very useful in saving memeory and speeding up
             please note that subsampling kmers will have less matched length
 -l <float>  Min length of alignment, [2048]
 -m <float>  Min matched length by kmer matching, [200]
 -R          Enable realignment mode
 -A          Keep contained reads during alignment
 -s <float>  Min similarity, calculated by kmer matched length / aligned length, [0.05]
 -e <int>    Min read depth of a valid edge, [3]
 -q          Quiet
 -v          Verbose (can be multiple)
 -V          Print version information and then exit
 --help      Show more options
 ** more options **
 --cpu <int>
   See -t 0, default: all cores
 --input <string> +
   See -i
 --force
   See -f
 --prefix <string>
   See -o
 --preset <string>
   See -x
 --kmer-fsize <int>
   See -k 0
 --kmer-psize <int>
   See -p 21
 --kmer-depth-max <float>
   See -K 1000.05
 -E, --kmer-depth-min <int>
   Min kmer frequency, [2]
 --kmer-subsampling <float>
   See -S 4.0
 --kbm-parts <int>
   Split total reads into multiple parts, index one part by one to save memory, [1]
 --aln-kmer-sampling <int>
   Select no more than n seeds in a query bin, default: 256
 --dp-max-gap <int>
   Max number of bin(256bp) in one gap, [4]
 --dp-max-var <int>
   Max number of bin(256bp) in one deviation, [4]
 --dp-penalty-gap <int>
   Penalty for BIN gap, [-7]
 --dp-penalty-var <int>
   Penalty for BIN deviation, [-21]
 --aln-min-length <int>
   See -l 2048
 --aln-min-match <int>
   See -m 200. Here the num of matches counting basepair of the matched kmer's regions
 --aln-min-similarity <float>
   See -s 0.05
 --aln-max-var <float>
   Max length variation of two aligned fragments, default: 0.25
 --aln-dovetail <int>
   Retain dovetail overlaps only, the max overhang size is <--aln-dovetail>, the value should be times of 256, -1 to disable filtering, default: 256
 --aln-strand <int>
   1: forward, 2: reverse, 3: both. Please don't change the deault vaule 3, unless you exactly know what you are doing
 --aln-maxhit <int>
   Max n hits for each read in build graph, default: 1000
 --aln-bestn <int>
   Use best n hits for each read in build graph, 0: keep all, default: 500
   <prefix>.alignments always store all alignments
 -R, --realign
   Enable re-alignment, see --realn-kmer-psize=15, --realn-kmer-subsampling=1, --realn-min-length=2048, --realn-min-match=200, --realn-min-similarity=0.1, --realn-max-var=0.25
 --realn-kmer-psize <int>
   Set kmer-psize in realignment, (kmer-ksize always eq 0), default:15
 --realn-kmer-subsampling <int>
   Set kmer-subsampling in realignment, default:1
 --realn-min-length <int>
   Set aln-min-length in realignment, default: 2048
 --realn-min-match <int>
   Set aln-min-match in realignment, default: 200
 --realn-min-similarity <float>
   Set aln-min-similarity in realignment, default: 0.1
 --realn-max-var <float>
   Set aln-max-var in realignment, default: 0.25
 -A, --aln-noskip
   Even a read was contained in previous alignment, still align it against other reads
 --corr-mode <float>
   Default: 0.0. If set > 0 and set --g <genome_size>, will turn on correct-align mode.
   Wtdbg will select <genome_size> * <corr-mode> bases from reads of middle length, and align them aginst all reads.
   Then, wtdbg will correct them using POACNS, and query corrected sequences against all reads again
   In correct-align mode, --aln-bestn = unlimited, --no-read-clip, --no-chaining-clip. Will support those features in future
 --corr-min <int>
 --corr-max <int>
   For each read to be corrected, uses at least <corr-min> alignments, and at most <corr-max> alignments
   Default: --corr_min = 5, --corr-max = 10
 --corr-cov <float>
   Default: 0.75. When aligning reads to be corrected, the alignments should cover at least <corr-cov> of read length
 --corr-block-size <int>
   Default: 2048. MUST be times of 256bp. Used in POACNS
 --corr-block-step <int>
   Default: 1536. MUST be times of 256bp. Used in POACNS
 --keep-multiple-alignment-parts
   By default, wtdbg will keep only the best alignment between two reads after chainning. This option will disable it, and keep multiple
 --verbose +
   See -v. -vvvv will display the most detailed information
 --quiet
   See -q
 --limit-input <int>
   Limit the input sequences to at most <int> M bp. Usually for test
 -L <int>, --tidy-reads <int>
   Default: 0. Pick longest subreads if possible. Filter reads less than <--tidy-reads>. Please add --tidy-name or set --tidy-reads to nagetive value
   if want to rename reads. Set to 0 bp to disable tidy. Suggested value is 5000 for pacbio RSII reads
 --tidy-name
   Rename reads into 'S%010d' format. The first read is named as S0000000001
 -g <number>, --genome-size <number>
   Provide genome size, e.g. 100.4m, 2.3g. In this version, it is used with -X/--rdcov-cutoff in selecting reads just after readed all.
 -X <float>, --rdcov-cutoff <float>
   Default: 50.0. Retaining 50.0 folds of genome coverage, combined with -g and --rdcov-filter.
 --rdcov-filter [0|1]
   Default 0. Strategy 0: retaining longest reads. Strategy 1: retaining medain length reads. 
 --err-free-nodes
   Select nodes from error-free-sequences only. E.g. you have contigs assembled from NGS-WGS reads, and long noisy reads.
   You can type '--err-free-seq your_ctg.fa --input your_long_reads.fa --err-free-nodes' to perform assembly somehow act as long-reads scaffolding
 --node-len <int>
   The default value is 1024, which is times of KBM_BIN_SIZE(always equals 256 bp). It specifies the length of intervals (or call nodes after selecting).
   kbm indexs sequences into BINs of 256 bp in size, so that many parameter should be times of 256 bp. There are: --node-len, --node-ovl, --aln-min-length, --aln-dovetail .   Other parameters are counted in BINs, --dp-max-gap, --dp-max-var .
 --node-matched-bins <int>
   Min matched bins in a node, default:1
 --node-ovl <int>
   Default: 256. Max overlap size between two adjacent intervals in any read. It is used in selecting best nodes representing reads in graph
 --node-drop <float>
   Default: 0.25. Will discard an node when has more this ratio intervals are conflicted with previous generated node
 -e <int>, --edge-min=<int>
   Default: 3. The minimal depth of a valid edge is set to 3. In another word, Valid edges must be supported by at least 3 reads
   When the sequence depth is low, have a try with --edge-min 2. Or very high, try --edge-min 4
 --drop-low-cov-edges
   Don't attempt to rescue low coverage edges
 --node-min <int>
   Min depth of an interval to be selected as valid node. Defaultly, this value is automaticly the same with --edge-min.
 --node-max <int>
   Nodes with too high depth will be regarded as repetitive, and be masked. Default: 200, more than 200 reads contain this node
 --ttr-cutoff-depth <int>, 0
 --ttr-cutoff-ratio <float>, 0.5
   Tiny Tandom Repeat. A node located inside ttr will bring noisy in graph, should be masked. The pattern of such nodes is:
   depth >= <--ttr-cutoff-depth>, and none of their edges have depth greater than depth * <--ttr-cutoff-ratio 0.5>
   set --ttr-cutoff-depth 0 to disable ttr masking
 --dump-kbm <string>
   Dump kbm index into file for loaded by `kbm` or `wtdbg`
 --dump-seqs <string>
   Dump kbm index (only sequences, no k-mer index) into file for loaded by `kbm` or `wtdbg`
   Please note: normally load it with --load-kbm, not with --load-seqs
 --load-kbm <string>
   Instead of reading sequences and building kbm index, which is time-consumed, loading kbm-index from already dumped file.
   Please note that, once kbm-index is mmaped by kbm -R <kbm-index> start, will just get the shared memory in minute time.
   See `kbm` -R <your_seqs.kbmidx> [start | stop]
 --load-seqs <string>
   Similar with --load-kbm, but only use the sequences in kbmidx, and rebuild index in process's RAM.
 --load-alignments <string> +
   `wtdbg` output reads' alignments into <--prefix>.alignments, program can load them to fastly build assembly graph. Or you can offer
   other source of alignments to `wtdbg`. When --load-alignment, will only reading long sequences but skip building kbm index
   You can type --load-alignments <file> more than once to load alignments from many files
 --load-clips <string>
   Combined with --load-nodes. Load reads clips. You can find it in `wtdbg`'s <--prefix>.clps
 --load-nodes <sting>
   Load dumped nodes from previous execution for fast construct the assembly graph, should be combined with --load-clips. You can find it in `wtdbg`'s <--prefix>.1.nodes
 --bubble-step <int>
   Max step to search a bubble, meaning the max step from the starting node to the ending node. Default: 40
 --tip-step <int>
   Max step to search a tip, 10
 --ctg-min-length <int>
   Min length of contigs to be output, 5000
 --ctg-min-nodes <int>
   Min num of nodes in a contig to be ouput, 3
 --minimal-output
   Will generate as less output files (<--prefix>.*) as it can
 --bin-complexity-cutoff <int>
   Used in filtering BINs. If a BIN has less indexed valid kmers than <--bin-complexity-cutoff 2>, masks it.
 --no-local-graph-analysis
   Before building edges, for each node, local-graph-analysis reads all related reads and according nodes, and builds a local graph to judge whether to mask it
   The analysis aims to find repetitive nodes
 --no-read-length-sort
   Defaultly, `wtdbg` sorts input sequences by length DSC. The order of reads affects the generating of nodes in selecting important intervals
 --keep-isolated-nodes
   In graph clean, `wtdbg` normally masks isolated (orphaned) nodes
 --no-read-clip
   Defaultly, `wtdbg` clips a input sequence by analyzing its overlaps to remove high error endings, rolling-circle repeats (see PacBio CCS), and chimera.
   When building edges, clipped region won't contribute. However, `wtdbg` will use them in the final linking of unitigs
 --no-chainning-clip
   Defaultly, performs alignments chainning in read clipping
   ** If '--aln-bestn 0 --no-read-clip', alignments will be parsed directly, and less RAM spent on recording alignments


Reposted from blog.csdn.net/weixin_42306122/article/details/102150039