关于mosdepth的介绍:https://github.com/torfinnnome/mosdepth
1.首先是在服务器上安装mosdepth。
(1)下载安装。
wget ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-8.41.tar.gz
tar zxvf pcre-8.41.tar.gz
cd pcre-8.41/
./configure
make
(2)使用conda进行安装。
conda install -y mosdepth
(3)使用虚拟环境安装软件。
#创建虚拟环境。
conda create -y --name mosdepth
#进入虚拟环境
conda activate mosdepth
#软件安装
conda install -y mosdepth
#使用mosdepth
~/.conda/envs/mosdepth/bin/mosdepth -h
#退出虚拟环境
conda deactivate
2.mosdepth的使用。
(1)准备好bed文件和bam文件。
(2)通过运行mosdepth软件来计算在1x、2x、3x、4x、5x、10x、15x下bam文件在全基因的覆盖率
nohup ~/.conda/envs/mosdepth/bin/mosdepth -n -t 3 --by filter_gap.bed -T 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,20,25,30 prefix *.bam >log 2>&1 & ## *.bam为bam文件的绝对路径
(3)需要查看的文件是prefix.thresholds.bed.gz
(4)通过一个简单的python脚本去得到覆盖率。
#!/usr/bin/python
# -*- coding: utf-8 -*-
import pandas as pd
from pandas import DataFrame
import os
import os.path
import gzip
import sys
def read_gz_file(path):
if os.path.exists(path):
with gzip.open(path, 'r') as pf:
for line in pf:
yield line
else:
print('the path [{}] is not exist!'.format(path))
strZipFile = sys.argv[1]
strDstFile = sys.argv[2]
gzipfile = gzip.GzipFile(strZipFile, "r")
outFile = open(strDstFile, 'wb+')
outFile.write(gzipfile.read())
outFile.close()
data = pd.read_csv(sys.argv[2], sep="\t")
#print(data)
sum_end_start=sum(data['end']-data['start'])
#print(sum_end_start)
sum_1X=sum(data['1X'])
sum_2X=sum(data['2X'])
sum_3X=sum(data['3X'])
sum_4X=sum(data['4X'])
sum_5X=sum(data['5X'])
sum_6X=sum(data['6X'])
sum_7X=sum(data['7X'])
sum_8X=sum(data['8X'])
sum_9X=sum(data['9X'])
sum_10X=sum(data['10X'])
sum_11X=sum(data['11X'])
sum_12X=sum(data['12X'])
sum_13X=sum(data['13X'])
sum_14X=sum(data['14X'])
sum_15X=sum(data['15X'])
sum_20X=sum(data['20X'])
sum_25X=sum(data['25X'])
sum_30X=sum(data['30X'])
p1=sum_1X/sum_end_start
p2=sum_2X/sum_end_start
p3=sum_3X/sum_end_start
p4=sum_4X/sum_end_start
p5=sum_5X/sum_end_start
p6=sum_6X/sum_end_start
p7=sum_7X/sum_end_start
p8=sum_8X/sum_end_start
p9=sum_9X/sum_end_start
p10=sum_10X/sum_end_start
p11=sum_11X/sum_end_start
p12=sum_12X/sum_end_start
p13=sum_13X/sum_end_start
p14=sum_14X/sum_end_start
p15=sum_15X/sum_end_start
p20=sum_20X/sum_end_start
p25=sum_25X/sum_end_start
p30=sum_30X/sum_end_start
d = {'1x' : [p1], '2x' : [p2], '3x' : [p3], '4x' : [p4], '5x' : [p5], '6x' : [p6], '7x' : [p7], '8x' : [p8], '9x' : [p9], '10x' : [p10], '11x' : [p11], '12x' : [p12], '13x' : [p13], '14x' : [p14], '15x' : [p15], '20x' : [p20], '25x' : [p25], '30x' : [p30]}
df = pd.DataFrame(d, index=['Frequency'])
df.to_csv(sys.argv[3], sep='\t')
运行脚本是:
python3 frequency.py prefix.thresholds.bed.gz prefix.txt frequency.txt
3.最后得到覆盖率文件是:frequency.txt