Gene definitions fall broadly into two classes:
GAD: Gene associated with Mendelian disorder; GADs include genes that meet criteria for definitive, strong, or moderate evidence for association with disease as described by ClinGen
GUS: Gene of uncertain significance; GUSs include genes that meet the ClinGen categories of limited or disputed evidence
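The Clinical Genome Resource (ClinGen, www.clinicalgenome.org) covers a little over 600 genes (https://search.clinicalgenome.org/kb/gene-validity) [8]. The database classifies each gene separately for each disease. Genes classified as GAD must be included in the panel.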
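1: Beyond that, the main consideration is which variant types you detect: SNVs (required), indels (required), CNAs and SVs. Your panel must also cover the hotspot regions of the genes (for example, exons 9 and 20 of PIK3CA, exon 15 of BRAF, exons 18 to 21 of EGFR, or exons 12 and 14 of JAK2). You can also decide to cover the entire coding and non-coding regions of a few key genes (KRAS, NRAS, TP53).
2: If copy-number detection is planned, losses of common genes such as TP53, PTEN, CDKN2A and RB1, and gains of ERBB2 (HER2), MET, RICTOR and MDM2, are all clinically meaningful.
3: SV detection is mainly about gene fusions, for example RET/PTC, TMPRSS2/ERG and EML4/ALK. For both DNA and RNA (ctDNA), the breakpoints fall in intronic regions, so it is recommended to extend the design outward by at least 20 bp.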
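The 20 bp extension can be sketched as padding each target interval and merging any intervals that then overlap. This is a minimal illustration only: the coordinates are made up, and the 0-based half-open (BED-style) convention is an assumption, not something the notes specify.

```python
from typing import List, Tuple

# (chrom, start, end); 0-based, half-open coordinates are assumed here
Interval = Tuple[str, int, int]


def pad_and_merge(intervals: List[Interval], pad: int = 20) -> List[Interval]:
    """Extend every interval by `pad` bp on both sides, then merge overlaps."""
    padded = sorted((c, max(0, s - pad), e + pad) for c, s, e in intervals)
    merged: List[Interval] = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps (or touches) the previous interval on the same chromosome
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical target regions, for illustration only
targets = [("chr7", 1000, 1100), ("chr7", 1130, 1200), ("chr2", 10, 50)]
print(pad_and_merge(targets))
```

Merging after padding avoids designing duplicate probes where two padded exons run into each other.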
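4: Introns and exons can be treated differently in terms of probe tiling depth.
5: Gradient test: test a gradient of DNA input amounts. One paper reports four levels (75bp, 100bp, 150bp and 200bp), for 4 x 7 samples in total. Once the test is done, you need to state the minimum input amount and the recommended input amount for NGS. Higher input generally gives lower duplication, so the gradient test should produce three figures similar to the following: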
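6: Reproducibility
For CAP accreditation you generally have to run the sequencing yourself. The same sample can be sequenced three times, that is, in 3 runs, with variant frequencies chosen in the range 0 to 0.7. In the example below, 17 samples were examined, each with 3 independent replicate experiments, for a total of 17 x 3 x 3 = 153 experiments.
When the experiments are finished, you should get results like the figure below: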
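One way to summarize such a reproducibility study is the coefficient of variation (CV) of the variant allele frequency across replicates. The sketch below is only an illustration: the sample keys and VAF values are invented, and using the CV as the summary statistic is an assumption rather than something the notes prescribe.

```python
from statistics import mean, stdev
from typing import Dict, List


def experiment_count(n_samples: int, n_replicates: int, n_runs: int) -> int:
    """Total number of experiments in a samples x replicates x runs design."""
    return n_samples * n_replicates * n_runs


def vaf_cv(vafs: List[float]) -> float:
    """Coefficient of variation (sample SD / mean) of repeated VAF measurements."""
    if len(vafs) < 2:
        raise ValueError("need at least two replicate measurements")
    m = mean(vafs)
    return stdev(vafs) / m if m > 0 else 0.0


# Hypothetical replicate VAFs for two variants, for illustration only
replicate_vafs: Dict[str, List[float]] = {
    "sample01:variant1": [0.30, 0.31, 0.29],
    "sample02:variant1": [0.05, 0.07, 0.06],
}

print(experiment_count(17, 3, 3))  # 153, matching the design above
for key, vafs in replicate_vafs.items():
    print(key, round(mean(vafs), 3), round(vaf_cv(vafs), 3))
```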
7: Lower limit of detection
Twelve samples with tumor purity of 80%-100% are diluted to 100%, 50% and 20%, again in triplicate, giving the following results
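A dilution series like this can be tabulated as an expected VAF and a detection rate per level. The sketch assumes that the diluent DNA does not carry the variant and that VAF scales linearly with the dilution fraction; both are simplifications, and the starting VAF and the detection calls are invented.

```python
from typing import Dict, List


def expected_vaf(undiluted_vaf: float, dilution: float) -> float:
    """Expected VAF after dilution, assuming the diluent lacks the variant."""
    return undiluted_vaf * dilution


def detection_rate(detected: List[bool]) -> float:
    """Fraction of replicates in which the variant was called."""
    return sum(detected) / len(detected) if detected else 0.0


# Dilution levels from the notes: 100%, 50%, 20%
levels = [1.0, 0.5, 0.2]

# Hypothetical calls for one variant across three replicates per level
calls: Dict[float, List[bool]] = {
    1.0: [True, True, True],
    0.5: [True, True, True],
    0.2: [True, False, True],
}

for level in levels:
    print(level, round(expected_vaf(0.40, level), 3), detection_rate(calls[level]))
```

The lowest level at which the detection rate still meets your acceptance criterion is then a candidate for the reported lower limit of detection.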
8: Data traceability
FASTQ, BAM, VCF
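9: Accepted sample types (see the expert consensus)
10: Description of the target regions; follow FoundationOne's description and present it as tables (Tables 2 and 3)
11: Sequencing QC for each sample should include at least the following, but is not limited to it ::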
Metrics and QC parameters may include, but are not limited to:
1) NGS library fragment size distribution (insert size; the median is the recommended summary value)
2) NGS instrument cluster densities (my personal reading: raw data Q30 > 80%)
3) NGS instrument sequence output, base quality and error rates.
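Two of these checks can be sketched in a few lines: the fraction of bases at or above Q30, and the median insert size. The sketch assumes Phred+33 quality strings (as in current Illumina FASTQ files) and signed template lengths such as the TLEN field of a SAM/BAM record; the input values are invented, and the 80% cutoff is the personal rule of thumb given above, not a standard.

```python
from statistics import median
from typing import Iterable, List


def q30_fraction(quality_strings: Iterable[str], offset: int = 33) -> float:
    """Fraction of bases with Phred quality >= 30 (Phred+33 encoding assumed)."""
    total = 0
    q30 = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            if ord(ch) - offset >= 30:
                q30 += 1
    return q30 / total if total else 0.0


def median_insert_size(template_lengths: Iterable[int]) -> float:
    """Median of absolute template lengths, ignoring zeros (unpaired/unmapped)."""
    sizes: List[int] = [abs(t) for t in template_lengths if t != 0]
    if not sizes:
        raise ValueError("no usable template lengths")
    return median(sizes)


# Hypothetical inputs, for illustration only
quals = ["II?5", "##II"]
tlens = [180, -200, 220, 0, 1000]

fraction = q30_fraction(quals)
print(fraction, fraction > 0.80)   # 0.80 is the rule of thumb from the notes
print(median_insert_size(tlens))
```

The median is less sensitive than the mean to a few very long fragments, such as the 1000 bp value in the toy data.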
12: Reference data sources used during analysis (the following are for reference only)
dbSNP (hg19-based) ::
ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz
gnomAD ::
http://hgdownload.cse.ucsc.edu/gbdb/hg19/gnomAD/vcf/
clinvar ::
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/
ExAC ::
http://hgdownload.cse.ucsc.edu/gbdb/hg19/ExAC/
hg19_fasta ::
ftp://[email protected]/bundle/hg19/ucsc.hg19.fasta.gz
13: Some criteria for deciding whether sample data are acceptable ::
These may include, but are not limited to:
1) base and mapping quality scores
2) percentage of reads mapping to the target
3) duplicate read rate
4) read coverage of recognized target(s) regions
5) target regions with inadequate sequence due to mapping qualities and/or coverage below thresholds
6) numbers and types of variants from reference
7) transition to transversion ratio in exomes and genomes
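As an illustration of the last criterion, the transition/transversion (Ti/Tv) ratio can be computed directly from the REF and ALT alleles of the called SNVs. This is a minimal sketch: the input is a plain list of (ref, alt) pairs rather than a parsed VCF, and non-SNV records are simply skipped.

```python
from typing import Iterable, Tuple

# Purine<->purine (A<->G) and pyrimidine<->pyrimidine (C<->T) changes
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def ti_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Ti/Tv ratio over single-base substitutions; other records are ignored."""
    ti = 0
    tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skip indels, MNVs, symbolic alleles and no-change records
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")


# Hypothetical calls: two transitions, one transversion, one deletion (skipped)
calls = [("A", "G"), ("C", "T"), ("A", "C"), ("AT", "A")]
print(ti_tv_ratio(calls))
```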
Metrics and QC parameters may include, but are not limited to ::
Total reads generated for each sample compared to a reference average that is determined during assay validation
Percent of reads aligned to target compared to a reference average that is determined during assay validation
Percent of unique reads aligned to target (for targeted-capture assays) compared to a reference average that is determined during assay validation
Average coverage of targeted bases compared to a reference average that is determined during assay validation
Percent of bases covered at specific read depths (e.g. 30X, 100X, 2000X) for germline or somatic variant tests compared to a reference average that is determined during assay validation
Determination of test reproducibility (e.g. identification of the same variants in a specimen) over time (e.g. monthly, quarterly) by re-testing a subset of specimens
For somatic cancer assays, monitoring of limit of detection controls for determination of assay sensitivity over time
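Several of the metrics above boil down to computing a per-sample value and comparing it with the reference average established during validation. The sketch below shows this for the percentage of bases covered at given depths. The per-base depths are invented, and the 20% relative tolerance is a hypothetical placeholder; the real acceptance window has to come from your own validation data.

```python
from typing import Dict, Sequence


def percent_covered(depths: Sequence[int],
                    thresholds: Sequence[int] = (30, 100, 2000)) -> Dict[int, float]:
    """Percent of target bases with depth >= each threshold."""
    n = len(depths)
    return {
        t: (100.0 * sum(d >= t for d in depths) / n if n else 0.0)
        for t in thresholds
    }


def within_tolerance(value: float, reference_average: float,
                     rel_tol: float = 0.2) -> bool:
    """True if value is within rel_tol of the reference average.

    rel_tol = 0.2 is a hypothetical placeholder, not a recommended cutoff.
    """
    return abs(value - reference_average) <= rel_tol * abs(reference_average)


# Hypothetical per-base depths over a tiny target, for illustration only
depths = [10, 50, 150, 3000]
coverage = percent_covered(depths)
print(coverage)
print(within_tolerance(coverage[30], reference_average=80.0))
```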
Below is an example of such metrics [10]
.. image:: metrix.png
   :scale: 40 %
   :align: center
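14: For the interpretation of variants, see reference [6]
15: According to reference [7], current bioinformatics pipelines generally analyze indels with a length of <= 21 bp
16: When you have no real data, you can use BAMSurgeon (https://github.com/adamewing/bamsurgeon/) to spike simulated variants into existing data and use that to test your analysis pipeline first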
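A simple way to apply such a length limit is to derive the indel length from the REF and ALT alleles and label anything longer as outside the validated range. The sketch assumes VCF-style alleles, where the indel length is the difference between the allele lengths; the category names are illustrative, not taken from reference [7].

```python
def indel_length(ref: str, alt: str) -> int:
    """Length of an insertion or deletion as the allele length difference."""
    return abs(len(ref) - len(alt))


def classify_variant(ref: str, alt: str, max_indel: int = 21) -> str:
    """Label a variant as snv, mnv, indel (<= max_indel bp) or long_indel."""
    if len(ref) == len(alt):
        return "snv" if len(ref) == 1 else "mnv"
    return "indel" if indel_length(ref, alt) <= max_indel else "long_indel"


# Hypothetical alleles, for illustration only
examples = [("A", "G"), ("AT", "A"), ("A", "A" + "T" * 21), ("A", "A" + "T" * 22)]
for ref, alt in examples:
    print(ref, alt[:5], classify_variant(ref, alt))
```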
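Once variants have been spiked in, the pipeline can be scored by comparing its calls with the list of spiked variants. The sketch below does not call BAMSurgeon itself; it assumes you already have the truth set and the pipeline's calls as (chrom, pos, ref, alt) tuples, and all values shown are invented.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def evaluate_calls(truth: Set[Variant], called: Set[Variant]) -> Dict[str, float]:
    """Recall and precision of pipeline calls against the spiked-in truth set."""
    tp = len(truth & called)   # spiked variants that were called
    fn = len(truth - called)   # spiked variants that were missed
    fp = len(called - truth)   # calls that were never spiked in
    return {
        "tp": tp,
        "fn": fn,
        "fp": fp,
        "recall": tp / (tp + fn) if truth else 0.0,
        "precision": tp / (tp + fp) if called else 0.0,
    }


# Hypothetical truth set and calls, for illustration only
truth = {("chr1", 100, "A", "G"), ("chr2", 200, "C", "T"), ("chr3", 300, "G", "A")}
called = {("chr1", 100, "A", "G"), ("chr2", 200, "C", "T"), ("chr4", 400, "T", "C")}
print(evaluate_calls(truth, called))
```

Note that with real data an unmatched call is not necessarily a false positive, since the original BAM may already contain genuine variants that were not spiked in.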
References
1:CAP Accreditation Program-Molecular Pathology Checklist.pdf
2:De novo request for evaluation of automatic class III designation for the MSK-IMPACT