Main idea:

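The orthogonal basis functions (the sin and cos terms) are parameters obtained by training the network.
1D convolution kernels are applied directly to the raw audio, and the output of that convolution is the spectrogram feature.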

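Unlike earlier approaches, the orthogonal basis functions here are built from convolution kernels.
Because the kernel parameters are trainable, the basis itself is learned, so in theory it can adapt to the current task more easily. A single hand-defined basis cannot adapt to the task at hand; each task presumably has its own best basis, and learning it through training should be a reasonable way to find it.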
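
The idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation and not the code of any repository listed below: the class name `ConvSpectrogram`, the `n_fft` and `hop_length` defaults, and the Hann window are all choices made for this example. The kernels are initialized to the windowed Fourier basis (cos and sin) and registered as parameters, so training can move them away from the fixed basis.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvSpectrogram(nn.Module):
    """Power spectrogram from raw audio via 1D convolutions (illustrative sketch)."""

    def __init__(self, n_fft=512, hop_length=128, trainable=True):
        super().__init__()
        self.hop_length = hop_length
        n_bins = n_fft // 2 + 1

        # Build the Fourier angles in float64 to limit rounding error, then cast.
        n = torch.arange(n_fft, dtype=torch.float64)
        k = torch.arange(n_bins, dtype=torch.float64).unsqueeze(1)
        angle = 2.0 * math.pi * k * n / n_fft                # (n_bins, n_fft)
        window = torch.hann_window(n_fft, dtype=torch.float64)

        # Kernels start as the windowed Fourier basis, shape (n_bins, 1, n_fft).
        cos_kernel = (torch.cos(angle) * window).unsqueeze(1).float()
        sin_kernel = (-torch.sin(angle) * window).unsqueeze(1).float()

        # Registering them as parameters is what makes the basis learnable.
        self.cos_kernel = nn.Parameter(cos_kernel, requires_grad=trainable)
        self.sin_kernel = nn.Parameter(sin_kernel, requires_grad=trainable)

    def forward(self, waveform):
        # waveform: (batch, samples) -> (batch, 1, samples)
        x = waveform.unsqueeze(1)
        real = F.conv1d(x, self.cos_kernel, stride=self.hop_length)
        imag = F.conv1d(x, self.sin_kernel, stride=self.hop_length)
        return real ** 2 + imag ** 2                         # (batch, n_bins, n_frames)


layer = ConvSpectrogram(n_fft=512, hop_length=128)
waveform = torch.randn(4, 16000)
spec = layer(waveform)
print(spec.shape)  # torch.Size([4, 257, 122])
```

Before any training step, this layer should match an unpadded STFT power spectrogram with the same window, which gives a convenient sanity check for the initialization.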

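However, in the author's implementation so far, this approach uses a lot of GPU memory.
It basically needs more than 24 GB of GPU memory; running on multiple GPUs in parallel makes experiments easier.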

1. Approaches to generating spectrograms with neural networks

Existing work:

1.1 nnAudio

nnAudio: An on-the-fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks
https://github.com/KinWaiCheuk/nnAudio

1.2 PANN

PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

Open-source implementation:
https://github.com/qiuqiangkong/audioset_tagging_cnn

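In addition, torchlibrosa reimplements librosa functions in torch, again as neural network layers.
The repository, which generates spectrograms with torch 1D convolution kernels, is published at:
https://github.com/qiuqiangkong/torchlibrosa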

1.3 Feature extraction

audioFlux provides time-domain and frequency-domain features:

https://github.com/libAudioFlux/audioFlux

1.4 Model deployment

For deploying the network model, the WeNet project is a useful reference:

https://github.com/wenet-e2e/wenet

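2. Selected functions implemented in torch

The functions below are also implemented in the open-source repositories above; reading their source code is recommended:

2.1 power_to_db in torch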

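import math
import torch

# Note: a torch implementation of librosa's power_to_db function.
# Converts a power spectrogram to a log-scaled (dB) spectrogram.
def power_to_db_torch(S, ref=1.0, amin=1e-10, top_db=80.0):
    # Validate the arguments (raises an exception rather than asserting).
    if amin <= 0:
        raise ValueError("amin must be strictly positive")

    # as_tensor avoids the extra copy (and warning) when S is already a tensor.
    S = torch.as_tensor(S)
    if not S.is_floating_point():
        S = S.float()

    # Floor the power at amin, then express it in dB relative to ref.
    log_spec = 10.0 * torch.log10(torch.clamp(S, min=amin))
    log_spec -= 10.0 * math.log10(max(amin, abs(ref)))

    if top_db is not None:
        if top_db < 0:
            raise ValueError("top_db must be non-negative")

        # Limit the dynamic range to top_db below the peak value.
        log_spec = torch.maximum(log_spec, log_spec.max() - top_db)

    return log_spec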
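
To make the two steps of the conversion concrete, here is a small worked example that applies the same operations by hand to three power values. The input values are chosen for illustration only; `amin`, `ref`, and `top_db` use the defaults from the function above.

```python
import math

import torch

power = torch.tensor([1.0, 0.1, 1e-12])
amin, ref, top_db = 1e-10, 1.0, 80.0

# Step 1: floor the power at amin and convert to dB relative to ref.
db = 10.0 * torch.log10(torch.clamp(power, min=amin))
db = db - 10.0 * math.log10(max(amin, abs(ref)))

# Step 2: keep only the top_db range below the peak.
db_clipped = torch.maximum(db, db.max() - top_db)

print(db)          # approximately [0, -10, -100]
print(db_clipped)  # approximately [0, -10, -80]
```

The third value shows both floors at work: `amin` stops the logarithm from diverging (1e-12 is treated as 1e-10, i.e. -100 dB), and `top_db` then raises it to 80 dB below the peak.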

2.2 cv2.resize() in torch

# Resize a single-channel image with torch.
import torch
import torch.nn.functional as F

def resize_torch_single_channel(img, resz, method="bilinear"):
    # Check that the input has the expected number of dimensions.
    assert len(img.shape) == 2, "Input image should have 2 dimensions: (height, width)"

    # Convert the input to a float tensor if it is not already a tensor.
    if not isinstance(img, torch.Tensor):
        img = torch.tensor(img).float()

    # Add batch and channel dimensions: (H, W) -> (1, 1, H, W).
    img = img.unsqueeze(0).unsqueeze(0)

    height, width = img.shape[2], img.shape[3]
    new_height, new_width = int(height * resz), int(width * resz)

    if method == "bilinear":
        mode = "bilinear"
    else:
        raise ValueError("Unsupported interpolation method")

    # Resize with torch's built-in bilinear interpolation.
    resized_img = F.interpolate(img, size=(new_height, new_width),
                                mode=mode, align_corners=False)

    # Remove the batch and channel dimensions.
    resized_img = resized_img.squeeze(0).squeeze(0)

    return resized_img


import torch
import torch.nn.functional as F


def resize_torch(img, resz, method='bilinear'):
    assert len(img.shape) == 3, "Input image should have 3 dimensions: (height, width, channels)"

    # Convert the input image to a PyTorch tensor if it's not already one
    if not isinstance(img, torch.Tensor):
        img = torch.tensor(img).float()

    # Convert the image from HWC to CHW format
    img = img.permute(2, 0, 1).unsqueeze(0)  # Add an extra dimension for the batch

    height, width = img.shape[2], img.shape[3]
    new_height, new_width = int(height * resz), int(width * resz)

    if method == 'bilinear':
        mode = 'bilinear'
    else:
        raise ValueError("Unsupported interpolation method")

    # Resize the image using torch.nn.functional.interpolate
    resized_img = F.interpolate(img, size=(new_height, new_width), mode=mode, align_corners=False)

    # Convert the image back to HWC format and remove the batch dimension
    resized_img = resized_img.squeeze(0).permute(1, 2, 0)

    return resized_img
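
As a quick check of what the bilinear resize does, the snippet below calls `F.interpolate` directly on a tiny image, using the same `size` computation and `align_corners=False` setting as the functions above. The 2x2 input and the scale factors are made up for illustration.

```python
import torch
import torch.nn.functional as F

img = torch.tensor([[0.0, 1.0],
                    [2.0, 3.0]])  # (height, width)
resz = 2.0
new_size = (int(img.shape[0] * resz), int(img.shape[1] * resz))

# Add batch and channel dimensions, resize, then drop them again.
up = F.interpolate(img[None, None], size=new_size,
                   mode="bilinear", align_corners=False)[0, 0]
print(up)
# tensor([[0.0000, 0.2500, 0.7500, 1.0000],
#         [0.5000, 0.7500, 1.2500, 1.5000],
#         [1.5000, 1.7500, 2.2500, 2.5000],
#         [2.0000, 2.2500, 2.7500, 3.0000]])

# The target size is truncated by int(): a 5x7 image at resz=0.5 becomes 2x3.
odd = torch.zeros(1, 1, 5, 7)
small = F.interpolate(odd, size=(int(5 * 0.5), int(7 * 0.5)),
                      mode="bilinear", align_corners=False)
print(small.shape)  # torch.Size([1, 1, 2, 3])
```

Because `int()` truncates, the output size can differ by one pixel from a resize that rounds the scaled dimensions, which is worth keeping in mind when comparing against another library's output.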
