first commit
BIN
Lab/Lab1/requirement/实验1 PRAAT.ppt
Normal file
BIN
Lab/Lab1/source/fan.png
Normal file
|
After Width: | Height: | Size: 359 KiB |
BIN
Lab/Lab1/source/jiao.png
Normal file
|
After Width: | Height: | Size: 342 KiB |
BIN
Lab/Lab1/source/jing.png
Normal file
|
After Width: | Height: | Size: 394 KiB |
BIN
Lab/Lab1/source/ke.png
Normal file
|
After Width: | Height: | Size: 411 KiB |
BIN
Lab/Lab1/source/max pitch.png
Normal file
|
After Width: | Height: | Size: 8.9 KiB |
BIN
Lab/Lab1/source/mean pitch.png
Normal file
|
After Width: | Height: | Size: 8.8 KiB |
BIN
Lab/Lab1/source/min pitch.png
Normal file
|
After Width: | Height: | Size: 9.4 KiB |
BIN
Lab/Lab1/source/p1.png
Normal file
|
After Width: | Height: | Size: 8.1 MiB |
BIN
Lab/Lab1/source/p2.png
Normal file
|
After Width: | Height: | Size: 322 KiB |
BIN
Lab/Lab1/source/pitch.png
Normal file
|
After Width: | Height: | Size: 152 KiB |
BIN
Lab/Lab1/source/voice.wav
Normal file
BIN
Lab/Lab1/source/wo.png
Normal file
|
After Width: | Height: | Size: 421 KiB |
145
Lab/Lab1/source/柯劲帆_21281280_实验1.md
Normal file
@@ -0,0 +1,145 @@
|
||||
<h1><center>北京交通大学实验报告</center></h1>
|
||||
|
||||
<div style="text-align: center;">
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">课程名称</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">计算机语音技术</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">实验题目</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">语音工具使用</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">学号</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">21281280</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">姓名</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">柯劲帆</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">班级</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">物联网2101班</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">指导老师</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">朱维彬</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">报告日期</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">2023年10月22日</span></div>
|
||||
</div>
|
||||
|
||||
---
|
||||
|
||||
## 目录
|
||||
|
||||
[TOC]
|
||||
|
||||
---
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
# 1. 语图1
|
||||
|
||||
如下图所示。上方是波形图,中间是窄带语图,下方是基频变化曲线。
|
||||
|
||||

|
||||
|
||||
|
||||
|
||||
# 2. 语图2
|
||||
|
||||
如下图所示。上方是波形图,中间是宽带语图和基频变化曲线,下方是标注结果。
|
||||
|
||||

|
||||
|
||||
|
||||
|
||||
# 3. 标注说明
|
||||
|
||||
## 3.1. “wo3”标注说明
|
||||
|
||||

|
||||
|
||||
`w`不构成一个单独的声母,而是与`o`结合为一个韵母。`wo3`发音过程中能量一直集中在低频成分。
|
||||
|
||||
## 3.2. “jiao4”标注说明
|
||||
|
||||

|
||||
|
||||
`j`是塞擦音。`jiao4`发音出现了3个阶段:
|
||||
|
||||
1. `j`的闭塞阶段,没有高频成分,能量在低频成分;
|
||||
2. `j`的擦音阶段,频率突变,高频成分增强,基频变化相对剧烈(以至于算法已经无法分析出基频);
|
||||
3. `iao4`的发音阶段,能量集中在低频成分,基频变化相对平稳。
|
||||
|
||||
## 3.3. “ke1”标注说明
|
||||
|
||||

|
||||
|
||||
`k`是送气塞音。`ke1`发音也经过3个阶段:
|
||||
|
||||
1. `k`的闭塞阶段,能量集中在低频区,没有高频成分,波形图几乎为一条直线;
|
||||
2. `k`的爆发阶段,高频能量突增,能量剧烈上升,基频变化相对剧烈;
|
||||
3. 送气阶段,也是`e1`的发音阶段,频谱突变,出现低频成分,之后基频逐渐衰落。
|
||||
|
||||
## 3.4. “jing4”标注说明
|
||||
|
||||

|
||||
|
||||
又出现了塞擦音`j`。`jing4`发音也是3个阶段:
|
||||
|
||||
1. `j`的闭塞阶段,但是由于“ke1”和“jing4”两个字连读,这个阶段被跳过了;
|
||||
2. `j`的擦音阶段,能量集中在高频成分,基频变化相对剧烈;
|
||||
3. `ing4`的发音阶段,频谱出现低频成分,基频逐渐减弱。
|
||||
|
||||
## 3.5. “fan1”标注说明
|
||||
|
||||

|
||||
|
||||
`f`是个清擦音。`fan1`发音主要有两个阶段:
|
||||
|
||||
1. `f`的清擦音阶段,频谱主要集中在高频成分,基频变化剧烈;
|
||||
2. `an1`的发音阶段,频谱体现为进入较平稳的低频区,基频平稳。
|
||||
|
||||
|
||||
|
||||
# 4. 基频分析
|
||||
|
||||

|
||||
|
||||
该图上的数字表示基频在该点的置信度。将散点连起来即是基频曲线。没有选中的点是算法计算出的置信度较小的基频点,可以人工挑选以修改基频曲线。
|
||||
|
||||
通过Praat自动计算的基频曲线,基频分析如下:
|
||||
|
||||
1. `wo3`的基频总体下降,表现第3声的音调总体降低的趋势。
|
||||
2. 在`wo3`和`jiao4`之间出现了高频噪声。
|
||||
3. `jiao4`也是基频总体下降,表现第4声的音调总体降低的趋势。
|
||||
4. `ke1`基频首先由高至低,这是因为塞音`k`存在一个爆发阶段,产生大量的高频成分;然后基频平稳,因为第1声的发音音调是平稳的。
|
||||
5. `jing4`与`jiao4`相似,也是基频总体下降,表现第4声的音调总体降低的趋势。
|
||||
6. `fan1`与`ke1`的基频相似,都是由于声母存在擦音阶段或爆破阶段导致一开始基频较高;然后发音声调为第1声导致后来基频趋于平稳。
|
||||
|
||||
总体来说,基频在100Hz到200Hz之间。估计最高基频约为210Hz,最低基频约为100Hz,平均约为170Hz。
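
除了在Praat界面中逐项读取,也可以用脚本批量计算这些统计量作为参考。下面是一段基于praat-parselmouth库(Praat的Python接口)的示意代码,假设已安装该库且音频文件为`voice.wav`:

```python
import parselmouth

snd = parselmouth.Sound("voice.wav")
pitch = snd.to_pitch()                           # 使用Praat默认参数提取基频
f0 = pitch.selected_array['frequency']
f0 = f0[f0 > 0]                                  # 去除清音段(基频为0的帧)
print("最高基频:", f0.max(), "最低基频:", f0.min(), "平均基频:", f0.mean())
```
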
|
||||
|
||||
使用Praat导出基频的最高、最低、平均值,如下:
|
||||
|
||||
最高基频:
|
||||
|
||||
<img src="max pitch.png" alt="max pitch" style="zoom:50%;" />
|
||||
|
||||
最低基频:
|
||||
|
||||
<img src="min pitch.png" alt="min pitch" style="zoom:50%;" />
|
||||
|
||||
平均基频:
|
||||
|
||||
<img src="mean pitch.png" alt="mean pitch" style="zoom:50%;" />
|
||||
|
||||
除了最高基频的预测与我的估计有一定误差之外,Praat给出的最低基频和平均基频都与我的估计差别不大。
|
||||
BIN
Lab/Lab1/source/柯劲帆_21281280_实验1.pdf
Normal file
BIN
Lab/Lab1/柯劲帆_21281280_实验1.pdf
Normal file
BIN
Lab/Lab2/code/tang1.wav
Normal file
623
Lab/Lab2/code/test.ipynb
Normal file
179
Lab/Lab2/code/test.py
Normal file
@@ -0,0 +1,179 @@
|
||||
from typing import Optional
|
||||
import scipy.io.wavfile as wav
|
||||
import numpy as np
|
||||
import matplotlib.pyplot as plt
|
||||
import ipdb
|
||||
|
||||
|
||||
def hamming(frame_length: int) -> np.ndarray:
|
||||
# frame_length - 窗长
|
||||
|
||||
n = np.arange(frame_length)
|
||||
h = 0.54 - 0.4 * np.cos(2 * np.pi * n / (frame_length - 1))
|
||||
return h
|
||||
|
||||
|
||||
def delta_sgn(x: np.ndarray) -> np.ndarray:
|
||||
# x - 语音信号
|
||||
|
||||
sound = x
|
||||
threshold = np.max(np.abs(sound)) / 20
|
||||
negative_sound = sound + threshold
|
||||
negative_sound -= np.abs(negative_sound)
|
||||
positive_sound = sound - threshold
|
||||
positive_sound += np.abs(positive_sound)
|
||||
sound = negative_sound + positive_sound
|
||||
return np.sign(sound)
|
||||
|
||||
|
||||
def ampf(
|
||||
x: np.ndarray, FrameLen: Optional[int] = 128, inc: Optional[int] = 90
|
||||
) -> np.ndarray:
|
||||
# x - 语音时域信号
|
||||
# FrameLen - 每一帧的长度
|
||||
# inc - 步长
|
||||
|
||||
frames = []
|
||||
for i in range(0, len(x) - FrameLen, inc):
|
||||
frame = x[i : i + FrameLen]
|
||||
frames.append(frame)
|
||||
frames = np.array(frames)
|
||||
|
||||
h = hamming(frame_length=FrameLen) # 海明窗
|
||||
amp = np.dot(frames**2, h.T**2).T / FrameLen
|
||||
|
||||
return amp
|
||||
|
||||
|
||||
def zcrf(
|
||||
x: np.ndarray, FrameLen: Optional[int] = 128, inc: Optional[int] = 90
|
||||
) -> np.ndarray:
|
||||
# x - 语音时域信号
|
||||
# FrameLen - 每一帧的长度
|
||||
# inc - 步长
|
||||
|
||||
sound = x
|
||||
sgn_sound = np.sign(sound)
|
||||
|
||||
dif_sound = np.abs(sgn_sound[1:] - sgn_sound[:-1])
|
||||
h = np.ones((FrameLen,)) / (2 * FrameLen)
|
||||
|
||||
frames = []
|
||||
for i in range(0, len(dif_sound) - FrameLen, inc):
|
||||
frame = dif_sound[i : i + FrameLen]
|
||||
frames.append(frame)
|
||||
|
||||
frames = np.array(frames)
|
||||
zcr = np.dot(frames, h.T).T
|
||||
return zcr
|
||||
|
||||
|
||||
def zcrf_delta(
|
||||
x: np.ndarray, FrameLen: Optional[int] = 128, inc: Optional[int] = 90
|
||||
) -> np.ndarray:
|
||||
# x - 语音时域信号
|
||||
# FrameLen - 每一帧的长度
|
||||
# inc - 步长
|
||||
|
||||
sound = x
|
||||
sgn_sound = delta_sgn(sound)
|
||||
|
||||
dif_sound = np.abs(sgn_sound[1:] - sgn_sound[:-1])
|
||||
h = np.ones((FrameLen,)) / (2 * FrameLen)
|
||||
|
||||
frames = []
|
||||
for i in range(0, len(dif_sound) - FrameLen, inc):
|
||||
frame = dif_sound[i : i + FrameLen]
|
||||
frames.append(frame)
|
||||
|
||||
frames = np.array(frames)
|
||||
zcr = np.dot(frames, h.T).T
|
||||
return zcr
|
||||
|
||||
|
||||
def analyze_sound(
|
||||
filename: str, FrameLen: Optional[int] = 128, inc: Optional[int] = 90
|
||||
) -> None:
|
||||
sr, sound_array = wav.read(filename)
|
||||
sound_array = sound_array.T[0, :] if sound_array.ndim != 1 else sound_array
|
||||
sound_array = sound_array / np.max(np.abs(sound_array)) # 归一化
|
||||
|
||||
amp = ampf(sound_array, FrameLen, inc)
|
||||
zcr = zcrf_delta(sound_array, FrameLen, inc)
|
||||
|
||||
rescale_rate = len(sound_array) / amp.shape[0]
|
||||
frameTime = np.arange(len(amp)) * rescale_rate
|
||||
|
||||
# 边界检测
|
||||
x1 = []
|
||||
x2 = []
|
||||
x3 = []
|
||||
amp2 = np.min(amp) + (np.max(amp) - np.min(amp)) / 20
|
||||
zcr2 = np.min(zcr) + (np.max(zcr) - np.min(zcr)) / 18
|
||||
|
||||
threshold_len = 6
|
||||
state = 1
|
||||
for i in range(threshold_len, len(amp) - threshold_len):
|
||||
if state == 1:
|
||||
if np.all(zcr[i : i + threshold_len] > zcr2):
|
||||
x1.append(i * rescale_rate)
|
||||
state = 2
|
||||
elif state == 2:
|
||||
if np.all(amp[i : i + threshold_len] > amp2):
|
||||
x3.append(i * rescale_rate)
|
||||
state = 3
|
||||
if (
|
||||
state != 1
|
||||
and np.all(amp[i : i + threshold_len] < amp2)
|
||||
and np.all(zcr[i : i + threshold_len] < zcr2)
|
||||
):
|
||||
x2.append(i * rescale_rate)
|
||||
state = 1
|
||||
|
||||
# 绘制语音波形、短时能量、短时过零率
|
||||
plt.figure(figsize=(12, 8))
|
||||
# 语音波形
|
||||
plt.subplot(3, 1, 1)
|
||||
plt.plot(sound_array)
|
||||
plt.title("Waveform")
|
||||
for boundary in x1:
|
||||
plt.axvline(x=boundary, color="r", linestyle="--", linewidth=0.5)
|
||||
for boundary in x2:
|
||||
plt.axvline(x=boundary, color="b", linestyle="--", linewidth=0.5)
|
||||
for boundary in x3:
|
||||
plt.axvline(x=boundary, color="g", linestyle="--", linewidth=0.5)
|
||||
|
||||
# 短时能量
|
||||
plt.subplot(3, 1, 2)
|
||||
plt.plot(frameTime, amp, label="Energy")
|
||||
plt.axhline(y=amp2, color="r", linestyle="--", label="Energy Threshold")
|
||||
plt.legend()
|
||||
plt.title("Short-time Energy")
|
||||
for boundary in x1:
|
||||
plt.axvline(x=boundary, color="r", linestyle="--", linewidth=0.5)
|
||||
for boundary in x2:
|
||||
plt.axvline(x=boundary, color="b", linestyle="--", linewidth=0.5)
|
||||
for boundary in x3:
|
||||
plt.axvline(x=boundary, color="g", linestyle="--", linewidth=0.5)
|
||||
|
||||
# 短时过零率
|
||||
plt.subplot(3, 1, 3)
|
||||
plt.plot(frameTime, zcr, label="Zero Crossing Rate")
|
||||
plt.axhline(y=zcr2, color="r", linestyle="--", label="ZCR Threshold")
|
||||
plt.legend()
|
||||
plt.title("Short-time Zero Crossing Rate")
|
||||
|
||||
# 显示语音端点和清/浊音边界
|
||||
for boundary in x1:
|
||||
plt.axvline(x=boundary, color="r", linestyle="--", linewidth=0.5)
|
||||
for boundary in x2:
|
||||
plt.axvline(x=boundary, color="b", linestyle="--", linewidth=0.5)
|
||||
for boundary in x3:
|
||||
plt.axvline(x=boundary, color="g", linestyle="--", linewidth=0.5)
|
||||
|
||||
plt.tight_layout()
|
||||
plt.show()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
analyze_sound("tang1.wav", FrameLen=128, inc=90)
|
||||
BIN
Lab/Lab2/code/voice.wav
Normal file
BIN
Lab/Lab2/requirement/voicebox.zip
Normal file
BIN
Lab/Lab2/requirement/实验2 音段边界检测器.pptx
Normal file
BIN
Lab/Lab2/source/compare.png
Normal file
|
After Width: | Height: | Size: 40 KiB |
BIN
Lab/Lab2/source/picture-amp.png
Normal file
|
After Width: | Height: | Size: 22 KiB |
BIN
Lab/Lab2/source/picture-tang1.png
Normal file
|
After Width: | Height: | Size: 110 KiB |
BIN
Lab/Lab2/source/picture-zcrf.png
Normal file
|
After Width: | Height: | Size: 26 KiB |
BIN
Lab/Lab2/source/self_record.png
Normal file
|
After Width: | Height: | Size: 149 KiB |
BIN
Lab/Lab2/source/self_record_optimed.png
Normal file
|
After Width: | Height: | Size: 109 KiB |
BIN
Lab/Lab2/source/tang1_optimed.png
Normal file
|
After Width: | Height: | Size: 117 KiB |
360
Lab/Lab2/source/柯劲帆_21281280_实验2.md
Normal file
@@ -0,0 +1,360 @@
|
||||
<h1><center>实验报告</center></h1>
|
||||
|
||||
<div style="text-align: center;">
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">课程名称</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">计算机语音技术</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">实验题目</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">短时分析应用</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">学号</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">21281280</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">姓名</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">柯劲帆</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">班级</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">物联网2101班</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">指导老师</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">朱维彬</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">报告日期</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">2023年10月29日</span></div>
|
||||
</div>
|
||||
|
||||
---
|
||||
|
||||
|
||||
|
||||
# 目录
|
||||
|
||||
[TOC]
|
||||
|
||||
---
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
# 1. 短时能量和短时过零率函数
|
||||
|
||||
**添加短时时域参数函数:**
|
||||
|
||||
- **短时能量**
|
||||
- **短时过零率**
|
||||
|
||||
## 1.1. 计算短时能量
|
||||
|
||||
短时能量指在语音信号的不同时间段内,信号的能量或振幅的平均值。
|
||||
|
||||
短时能量的计算公式如下:
|
||||
$$
|
||||
E_{n}=\sum_{m=-\infty}^{\infty}[x\left(m\right) h\left(n-m\right)]^{2}=\sum_{m=n-N+1}^{n}[x\left(m\right) h\left(n-m\right)]^{2}
|
||||
$$
|
||||
其中$h\left(n\right)$为窗函数,这里选择为海明窗:
|
||||
$$
|
||||
h\left(n\right)=\left\{\begin{array}{ll}
|
||||
0.54 - 0.4\cos\left[2\pi n / \left(N - 1\right)\right], & 0 \leq n \leq N-1 \\
|
||||
0, & \text { others }
|
||||
\end{array}\right. \\
|
||||
$$
|
||||
因此使用Python定义计算海明窗的函数如下。(numpy库也有内置的海明窗函数,这里手动实现,和numpy的接口一致)
|
||||
|
||||
```python
|
||||
def hamming(frame_length:int) -> np.ndarray:
|
||||
# frame_length - 窗长
|
||||
|
||||
n = np.arange(frame_length)
|
||||
h = 0.54 - 0.4 * np.cos(2 * np.pi * n / (frame_length - 1))
|
||||
|
||||
return h
|
||||
```
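
可以用下面的小例子(个人补充的验证代码)与numpy内置窗函数对比;注意`np.hamming`使用标准系数0.46,而这里使用0.4,两者数值上略有差异:

```python
import numpy as np

def hamming(frame_length: int) -> np.ndarray:    # 即上文定义的函数
    n = np.arange(frame_length)
    return 0.54 - 0.4 * np.cos(2 * np.pi * n / (frame_length - 1))

N = 128
print(np.allclose(hamming(N), np.hamming(N)))      # False:系数0.4与0.46不同
print(np.max(np.abs(hamming(N) - np.hamming(N))))  # 最大差约0.06
```
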
|
||||
|
||||
**计算短时能量的算法:将每一帧的语音信号提取出来,乘上窗函数并平方,然后求和取平均,即可得出该帧的短时能量。将窗口移动步长个单位,重复前面的流程,直至分析完整段语音。**
|
||||
|
||||
使用Python实现如下。
|
||||
|
||||
```python
|
||||
def ampf(x: np.ndarray, FrameLen: Optional[int] = 128, inc: Optional[int] = 90) -> np.ndarray:
|
||||
# x - 语音时域信号
|
||||
# FrameLen - 每一帧的长度
|
||||
# inc - 步长
|
||||
|
||||
frames = []
|
||||
for i in range(0, len(x) - FrameLen, inc):
|
||||
frame = x[i : i + FrameLen]
|
||||
frames.append(frame)
|
||||
frames = np.array(frames)
|
||||
|
||||
h = hamming(frame_length=FrameLen)[::-1] / FrameLen
|
||||
amp = np.dot(frames ** 2, h.T ** 2).T
|
||||
|
||||
return amp
|
||||
```
|
||||
|
||||
画出`tang1`的短时能量曲线如下:
|
||||

|
||||
|
||||
短时能量体现了该帧的振幅,可以表征韵母的发声和结束。
|
||||
|
||||
## 1.2. 计算短时过零率
|
||||
|
||||
短时过零率指在语音信号的短时段内,信号穿过水平线(即振幅为0)的次数。定义如下:
|
||||
|
||||
窗函数:
|
||||
$$
|
||||
w\left(n\right)=\left\{\begin{array}{ll}
|
||||
\frac{1}{2 N}, & 0 \leq n \leq N-1 \\
|
||||
0, & \text { others }
|
||||
\end{array}\right. \\
|
||||
$$
|
||||
短时过零率:
|
||||
$$
|
||||
Z_{n}=\sum_{m=-\infty}^{\infty}\left|\operatorname{sgn}\left[x\left(m\right)\right]-\operatorname{sgn}\left[x\left(m-1\right)\right]\right| w\left(n-m\right)
|
||||
$$
|
||||
其中$\operatorname{sgn}$是符号函数:
|
||||
$$
|
||||
\operatorname{sgn}\left(x\left(n\right)\right)=\left\{\begin{array}{ll}
|
||||
1, & x\left(n\right) \geq 0 \\
|
||||
-1, & x\left(n\right)<0
|
||||
\end{array}\right.
|
||||
$$
|
||||
**计算短时过零率的算法:先从语音信号中计算出过零序列(经过$\operatorname{sgn}$转化后,后一信号减前一信号并取绝对值)。然后将每一帧对应的过零序列提取出来,求和并除以两倍帧长,即为该帧的过零率。将窗口移动步长个单位,重复前面的流程,直至分析完整段语音。**
|
||||
|
||||
使用Python实现如下:
|
||||
|
||||
```python
|
||||
def zcrf(x: np.ndarray, FrameLen: Optional[int] = 128, inc: Optional[int] = 90) -> np.ndarray:
|
||||
# x - 语音时域信号
|
||||
# FrameLen - 每一帧的长度
|
||||
# inc - 步长
|
||||
|
||||
sound = x
|
||||
sgn_sound = np.sign(sound)
|
||||
dif_sound = np.abs(sgn_sound[1:] - sgn_sound[:-1])
|
||||
frames = []
|
||||
for i in range(0, len(dif_sound) - FrameLen, inc):
|
||||
frame = dif_sound[i : i + FrameLen]
|
||||
frames.append(frame)
|
||||
frames = np.array(frames)
|
||||
|
||||
h = np.ones((FrameLen,)) / (2 * FrameLen)
|
||||
zcr = np.dot(frames, h.T).T
|
||||
|
||||
return zcr
|
||||
```
|
||||
|
||||
画出`tang1`的短时过零率曲线如下:
|
||||

|
||||
|
||||
短时过零率体现了该帧的高频声音,可以表征声母的发声。
|
||||
|
||||
|
||||
|
||||
# 2. 边界检测
|
||||
|
||||
**添加边界检测器,基于短时能量、短时过零率,实现边界检测功能,包括**
|
||||
|
||||
- **语音端点检测——起始边界x1、终止边界x2**
|
||||
|
||||
- **清/浊边界检测x3**
|
||||
|
||||
我将每个发音分为3个阶段:
|
||||
|
||||
1. 未发声阶段:此时短时能量和短时过零率都很低
|
||||
2. 声母阶段:此时声母的塞音、擦音和塞擦音等会产生大量的高频声波,过零率较大;但是此时韵母还没发出,短时能量较低。这一阶段的开始为`x1`。
|
||||
3. 韵母阶段:此时韵母发出,频率趋于平稳和下降,因此此时过零率下降,但短时能量激增,并逐渐减少,直至发声完毕,回到1阶段。这一阶段的开始为`x3`,结束为`x2`。
|
||||
|
||||
**一开始将阶段初始化为`1`未发声阶段;接着当过零率高于阈值时,进入`2`声母阶段,添加`x1`;接着当短时能量高于阈值时,进入`3`韵母阶段,添加`x3`;在进入`2`或`3`阶段后,当短时能量和短时过零率同时低于阈值时,重置为`1`未发声阶段,添加`x2`。**
|
||||
|
||||
**另外还设置了一个阈值宽度,当语音信号在大于阈值宽度的信号段满足条件才算通过。**
|
||||
|
||||
使用Python实现如下:
|
||||
|
||||
```python
|
||||
sr, sound_array = wav.read(filename)
|
||||
sound_array = sound_array.T[0, :] if sound_array.ndim != 1 else sound_array # 双通道改单通道
|
||||
sound_array = sound_array / np.max(np.abs(sound_array)) # 归一化
|
||||
|
||||
amp = ampf(sound_array, FrameLen, inc)
|
||||
zcr = zcrf_delta(sound_array, FrameLen, inc)
|
||||
|
||||
rescale_rate = len(sound_array) / amp.shape[0]
|
||||
frameTime = np.arange(len(amp)) * rescale_rate
|
||||
# 将曲线图拉伸至和语音信号图一样长,方便分析
|
||||
|
||||
x1 = []
|
||||
x2 = []
|
||||
x3 = []
|
||||
amp2 = np.min(amp) + (np.max(amp) - np.min(amp)) * 0.05
|
||||
zcr2 = np.min(zcr) + (np.max(zcr) - np.min(zcr)) * 0.04
|
||||
|
||||
threshold_len = 6
|
||||
state = 1
|
||||
for i in range(threshold_len, len(amp) - threshold_len):
|
||||
if state == 1:
|
||||
if np.all(zcr[i : i + threshold_len] > zcr2):
|
||||
x1.append(i * rescale_rate)
|
||||
state = 2
|
||||
elif state == 2:
|
||||
if np.all(amp[i : i + threshold_len] > amp2):
|
||||
x3.append(i * rescale_rate)
|
||||
state = 3
|
||||
if state != 1 and np.all(amp[i : i + threshold_len] < amp2) and np.all(zcr[i : i + threshold_len] < zcr2):
|
||||
x2.append(i * rescale_rate)
|
||||
state = 1
|
||||
```
|
||||
|
||||
阈值参数的选取在下一节中分析。
|
||||
|
||||
|
||||
|
||||
# 3. 绘制图像与分析
|
||||
|
||||
**绘制语音边界检测图,包括**
|
||||
|
||||
- **语音波形、短时能量、短时过零率**
|
||||
- **自动检测结果:音段起始/终止边界、清音/浊音边界**
|
||||
|
||||
使用Python实现如下:
|
||||
|
||||
```python
|
||||
# 绘制语音波形、短时能量、短时过零率
|
||||
plt.figure(figsize=(12, 8))
|
||||
|
||||
# 语音波形
|
||||
plt.subplot(3, 1, 1)
|
||||
plt.plot(sound_array)
|
||||
plt.title("Waveform")
|
||||
for boundary in x1:
|
||||
plt.axvline(x=boundary, color="r", linestyle="--", linewidth=0.8)
|
||||
for boundary in x2:
|
||||
plt.axvline(x=boundary, color="b", linestyle="--", linewidth=0.8)
|
||||
for boundary in x3:
|
||||
plt.axvline(x=boundary, color="g", linestyle="--", linewidth=0.8)
|
||||
|
||||
# 短时能量
|
||||
plt.subplot(3, 1, 2)
|
||||
plt.plot(frameTime, amp, label="Energy")
|
||||
plt.axhline(y=amp2, color="r", linestyle="--", label="Energy Threshold", linewidth=0.8)
|
||||
plt.legend()
|
||||
plt.title("Short-time Energy")
|
||||
for boundary in x1:
|
||||
plt.axvline(x=boundary, color="r", linestyle="--", linewidth=0.8)
|
||||
for boundary in x2:
|
||||
plt.axvline(x=boundary, color="b", linestyle="--", linewidth=0.8)
|
||||
for boundary in x3:
|
||||
plt.axvline(x=boundary, color="g", linestyle="--", linewidth=0.8)
|
||||
|
||||
# 短时过零率
|
||||
plt.subplot(3, 1, 3)
|
||||
plt.plot(frameTime, zcr, label="Zero Crossing Rate")
|
||||
plt.axhline(y=zcr2, color="r", linestyle="--", label="ZCR Threshold", linewidth=0.8)
|
||||
plt.legend()
|
||||
plt.title("Short-time Zero Crossing Rate")
|
||||
for boundary in x1:
|
||||
plt.axvline(x=boundary, color="r", linestyle="--", linewidth=0.8)
|
||||
for boundary in x2:
|
||||
plt.axvline(x=boundary, color="b", linestyle="--", linewidth=0.8)
|
||||
for boundary in x3:
|
||||
plt.axvline(x=boundary, color="g", linestyle="--", linewidth=0.8)
|
||||
|
||||
plt.tight_layout()
|
||||
plt.show()
|
||||
```
|
||||
|
||||
将`x1`语音开始边界标记为红色,`x2`语音结束边界标记为蓝色,将`x3`声韵母边界标记为绿色。
|
||||
|
||||
画出`tang1`的语音边界检测图如下:
|
||||
|
||||

|
||||
|
||||
一共有3个参数:阈值宽度、短时能量阈值、短时过零率阈值
|
||||
|
||||
观察短时能量曲线和短时过零率曲线可见,声母开始时,短时能量曲线有一个小峰值,而短时过零率曲线出现大峰值,因此短时能量阈值必须高于该小峰值,才不会将声母开始判定为韵母开始。
|
||||
|
||||
韵母开始时,短时能量曲线出现大峰值,因此短时能量阈值应在大峰值和小峰值之间,且尽可能偏小,才能准确预测声韵母边界,经过多次实验,将短时能量阈值定为最大值的$5\%$;而短时过零率曲线回落,并在低值维持一段时间。因此短时过零率阈值要小于这个低值,经过多次实验,将短时过零率阈值定为最大值的$4\%$。
|
||||
|
||||
经过多次实验,将阈值宽度定为$6$帧。
|
||||
|
||||
从检测结果来看,上述参数的选择能够较为准确地区分三个边界。但是声韵母边界有少许滞后于真实边界。
|
||||
|
||||
|
||||
|
||||
# 4. 自录制语音检测、分析与算法优化
|
||||
|
||||
自录制一段语音:“计算机语音技术”,检测与绘图如下:
|
||||

|
||||
|
||||
很显然,噪声较大,严重干扰了分析。
|
||||
|
||||
分析可知,噪声主要影响的是短时过零率。因此,我对短时过零率算法进行了优化,采用了噪声背景下的修正$\operatorname{sgn}$函数:
|
||||
$$
|
||||
\operatorname{sgn}\left(x\left(n\right)\right)=\left\{\begin{array}{ll}
|
||||
1, & x\left(n\right) \geq \Delta \\
|
||||
-1, & x\left(n\right)< -\Delta\\
|
||||
0, & \text{others}
|
||||
\end{array}\right.
|
||||
$$
|
||||
在具体的实现中,我使用矩阵运算处理语音信号,逐采样点判断$x\left(n\right)$和$\pm \Delta$的大小既不经济,也不优雅。因此,我首先将$x\left(n\right)$进行了变换,即将修正$\operatorname{sgn}$函数改写为:
|
||||
$$
|
||||
\operatorname{sgn}\left(x\left(n\right)\right)=\left\{\begin{array}{ll}
|
||||
1, & x\left(n\right) \geq 0 \wedge x\left(n\right) - \Delta \geq 0\\
|
||||
-1, & x\left(n\right) < 0 \wedge x\left(n\right) + \Delta< 0\\
|
||||
0, & \text{others}
|
||||
\end{array}\right.
|
||||
$$
|
||||
相当于正负值信号都向横坐标轴缩减了$\Delta$,再进行普通的$\operatorname{sgn}$操作。
|
||||
|
||||
所以,首先将语音信号减去阈值$\Delta$,去掉负值信号,得到正值信号;将语音信号加上阈值$\Delta$,去掉正值信号,得到负值信号。再将两者相加合并,得到处理后的语音信号。最后,进行普通的$\operatorname{sgn}$函数操作。
|
||||
|
||||
Python实现如下:
|
||||
|
||||
```python
|
||||
def delta_sgn(x: np.ndarray) -> np.ndarray:
|
||||
# x - 语音信号
|
||||
|
||||
sound = x
|
||||
threshold = np.max(np.abs(sound)) * 0.05
|
||||
negative_sound = sound + threshold
|
||||
negative_sound -= np.abs(negative_sound)
|
||||
positive_sound = sound - threshold
|
||||
positive_sound += np.abs(positive_sound)
|
||||
sound = (negative_sound + positive_sound) / 2
|
||||
sound = np.sign(sound)
|
||||
|
||||
return sound
|
||||
```
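
下面用一段小的验证代码(个人补充,信号频率和噪声幅度均为假设值)演示修正$\operatorname{sgn}$函数对小幅噪声过零的抑制效果:

```python
import numpy as np

def delta_sgn(x: np.ndarray) -> np.ndarray:      # 即上文定义的修正sgn函数
    threshold = np.max(np.abs(x)) * 0.05
    negative_sound = x + threshold
    negative_sound -= np.abs(negative_sound)
    positive_sound = x - threshold
    positive_sound += np.abs(positive_sound)
    return np.sign((negative_sound + positive_sound) / 2)

t = np.linspace(0, 1, 8000)
clean = np.sin(2 * np.pi * 10 * t)                # 模拟浊音的低频成分
noisy = clean + 0.02 * np.random.randn(len(t))    # 叠加小幅噪声
zc_plain = np.sum(np.abs(np.diff(np.sign(noisy)))) / 2     # 普通sgn统计的过零次数
zc_delta = np.sum(np.abs(np.diff(delta_sgn(noisy)))) / 2   # 修正sgn统计的过零次数
print(zc_plain, zc_delta)    # 通常可以看到修正后由噪声引起的过零明显减少
```
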
|
||||
|
||||
画出向横坐标轴缩减了$\Delta$的语音信号(下)与原语音信号(上)的对比图:
|
||||
|
||||

|
||||
|
||||
很明显,噪声几乎被消除了。
|
||||
|
||||
接下来使用上面定义的`delta_sgn()`函数,重复之前的计算进行分析和画图:
|
||||
|
||||

|
||||
|
||||
可以看到,算法能够在噪声下辨认出`ji4`、`suan4`、`ji1`、`ji4`和`shu4`这5个发音的发声起止边界和声韵母边界,但是`yu3`和`yin1`两个发音没有声母,在仅用短时能量和短时过零率两个指标的条件下无法正常检测出边界。另外由于算法抑制了一部分噪声,韵母发音的最后一小部分被消除了,因此检测到的发音结束边界较正确的结束边界有所提前。
|
||||
|
||||
将改进后的算法应用到`tang1`的音频中,检测结果如下:
|
||||

|
||||
|
||||
发现其声母`t`阻塞阶段的高频声音被抑制了,但由于音量较大,没有被作为噪声消除,依然能被正常识别。但是发音末尾的少部分被当作噪声消除了,导致发音结束边界较正确的结束边界有所提前。检测结果总体上正确。
|
||||
BIN
Lab/Lab2/柯劲帆_21281280_实验2.pdf
Normal file
BIN
Lab/Lab3/code/tang1.wav
Normal file
137
Lab/Lab3/code/test.ipynb
Normal file
55
Lab/Lab3/code/test.py
Normal file
@@ -0,0 +1,55 @@
|
||||
import scipy.io.wavfile as wav
|
||||
from scipy import signal
|
||||
import numpy as np
|
||||
import matplotlib.pyplot as plt
|
||||
from mpl_toolkits.mplot3d import Axes3D
|
||||
import ipdb;
|
||||
|
||||
# 读取音频文件
|
||||
filename = "./tang1.wav"
|
||||
sample_rate, sound_array = wav.read(filename)
|
||||
sound_array = sound_array.T[0, :] if sound_array.ndim != 1 else sound_array
|
||||
sound_array = sound_array / np.max(np.abs(sound_array)) # 归一化
|
||||
|
||||
frame_length = int(sample_rate * 0.01)
|
||||
num_frames = len(sound_array) // frame_length
|
||||
autocorrelation = np.zeros((num_frames, frame_length))
|
||||
autocorrelation_of_candidates = np.zeros((num_frames, frame_length))
|
||||
min_peak_threshold = min(sample_rate // 400, frame_length)
|
||||
max_peak_threshold = min(sample_rate // 80, frame_length)
|
||||
for n in range(num_frames):
|
||||
frame = sound_array[n * frame_length: (n + 1) * frame_length]
|
||||
autocorrelation[n, :] = signal.correlate(frame, frame, mode='full')[frame_length - 1:]
|
||||
# 基频阈值为80-400Hz,则基音周期(即延迟)t最小为sample_rate/400,最大为sample_rate/80
|
||||
|
||||
# 本应该使用峰值的延迟作为基音周期的候选值,但是发现峰值(局部最大值)并不好判断,同时一帧内的点数不多,因此将阈值内的所有点都作为候选点
|
||||
# 那么将不在阈值内的自相关系数置为一个非常小的数,从而不让算法选择不在阈值内的基音周期
|
||||
autocorrelation_of_candidates[n, :] = np.pad(
|
||||
autocorrelation[n, min_peak_threshold : max_peak_threshold],
|
||||
(min_peak_threshold, max(frame_length - max_peak_threshold, 0)),
|
||||
mode='constant',
|
||||
constant_values=-30.0,
|
||||
)
|
||||
|
||||
dist = -autocorrelation
|
||||
cost = np.zeros((num_frames, frame_length))
|
||||
path = np.zeros((num_frames, frame_length))
|
||||
|
||||
for n in range(num_frames - 1):
|
||||
for j in range(min_peak_threshold, max_peak_threshold):
|
||||
# f0 = sample_rate / candidate
|
||||
cost[n + 1, j] = dist[n + 1, j] + np.min(
|
||||
cost[n, :] + np.abs(sample_rate / np.arange(frame_length) - sample_rate / j)
|
||||
)
|
||||
path[n + 1, j] = np.argmin(
|
||||
cost[n, :] + np.abs(sample_rate / np.arange(frame_length) - sample_rate / j)
|
||||
)
|
||||
|
||||
l_hat = np.zeros(num_frames, dtype=np.int32)
|
||||
l_hat[num_frames - 1] = np.argmin(cost[num_frames - 1, :])
|
||||
|
||||
for n in range(num_frames - 2, -1, -1):
|
||||
l_hat[n] = path[n + 1, l_hat[n + 1]]
|
||||
|
||||
f0 = sample_rate / l_hat
|
||||
|
||||
BIN
Lab/Lab3/code/voice.wav
Normal file
88
Lab/Lab3/requirements/pitch.m
Normal file
@@ -0,0 +1,88 @@
|
||||
clc;clear all;close all;
|
||||
warning('off')
|
||||
%[x,fs,nbits]=wavread('tang1.wav'); % 读入录制得到的语音信号文件
|
||||
global fs;
|
||||
[x,fs] = audioread('tang1.wav'); % 读入录制得到的语音信号文件
|
||||
%info = audioinfo('tang1.wav');
|
||||
%nbits = info.BitsPerSample;
|
||||
x = x / max(abs(x)); % 幅度归一化到[-1,1]
|
||||
|
||||
% 参数设置
|
||||
kl = round(1 / 500 * fs); %500Hz
|
||||
kr = round(1 / 80 * fs); %80Hz
|
||||
N = 3 * kr; % 帧长
|
||||
inc = round(fs / 100); % 帧移步长10ms
|
||||
|
||||
% 绘制语音波形图
|
||||
subplot(3, 1, 1);
|
||||
plot(x);
|
||||
axis([1 length(x) -1 1]) % 设置x轴和y轴的显示范围
|
||||
xlabel('帧数');
|
||||
ylabel('Speech');
|
||||
legend('FrameLen = 552');
|
||||
|
||||
% 计算自相关函数,构成基音周期候选点集
|
||||
subplot(3, 1, 2);
|
||||
A = enframe(x, N, inc);
|
||||
R = zeros(size(A));
|
||||
l = zeros(1, size(R, 1));
|
||||
f = zeros(1, size(R, 1));
|
||||
Can = zeros(1, 10);
|
||||
acCan = zeros(1, 10);
|
||||
CostF = zeros(1, 10);
|
||||
it = 0;
|
||||
for n = 1:size(R, 1)
|
||||
R(n, :) = autocorr(A(n, :), N - 1);
|
||||
Can_ = Can;
|
||||
CostF_ = CostF;
|
||||
[acCan,Can] = findpeaks(R(n, kl:kr), 'MinPeakHeight', R(n, 1) * 0.25, 'MinPeakProminence', 0.9);
|
||||
Can = Can + kl - 1;
|
||||
sz = size(Can, 2);
|
||||
if sz ~= 0
|
||||
it = it + 1;
|
||||
if it == 1
|
||||
CostF = dist(acCan);
|
||||
else
|
||||
CostF = zeros(1, sz);
|
||||
Path = zeros(1, sz);
|
||||
CostT = diff(Can, Can_);
|
||||
for j = 1:sz
|
||||
[CostF(j), Path(j)] = min(CostF_ + CostT(j, :));
|
||||
CostF = CostF + dist(acCan);
|
||||
end
|
||||
end
|
||||
[~, l(n)] = min(CostF);
|
||||
ff = f0(Can);
|
||||
plot(n, ff, '.');
|
||||
hold on;
|
||||
f(n) = ff(l(n));
|
||||
else
|
||||
it = 0;
|
||||
end
|
||||
end
|
||||
|
||||
subplot(3, 1, 3);
|
||||
f = medfilt1(f, 5);
|
||||
stem(f, 'MarkerSize',3);
|
||||
xlabel('帧数(n)');
|
||||
ylabel('频率(Hz)');
|
||||
|
||||
function f = f0(Can)
|
||||
global fs;
|
||||
f = fs ./ Can;
|
||||
end
|
||||
|
||||
function dis = dist(ac)
|
||||
dis = -log(ac);
|
||||
end
|
||||
|
||||
function df = diff(Can1, Can2)
|
||||
n = size(Can1, 2);
|
||||
m = size(Can2, 2);
|
||||
df = zeros(n, m);
|
||||
for i = 1:n
|
||||
for j = 1:m
|
||||
df(i, j) = abs(f0(Can1(i)) - f0(Can2(j)));
|
||||
end
|
||||
end
|
||||
end
|
||||
BIN
Lab/Lab3/requirements/pitch实验报告-参考.pdf
Normal file
BIN
Lab/Lab3/requirements/tang1.wav
Normal file
BIN
Lab/Lab3/source/p1.png
Normal file
|
After Width: | Height: | Size: 70 KiB |
BIN
Lab/Lab3/source/p2.png
Normal file
|
After Width: | Height: | Size: 69 KiB |
BIN
Lab/Lab3/source/p3.png
Normal file
|
After Width: | Height: | Size: 42 KiB |
BIN
Lab/Lab3/source/p4.png
Normal file
|
After Width: | Height: | Size: 80 KiB |
BIN
Lab/Lab3/source/p5.png
Normal file
|
After Width: | Height: | Size: 95 KiB |
BIN
Lab/Lab3/source/p6.png
Normal file
|
After Width: | Height: | Size: 81 KiB |
BIN
Lab/Lab3/source/p7.png
Normal file
|
After Width: | Height: | Size: 141 KiB |
352
Lab/Lab3/source/柯劲帆_21281280_实验3.md
Normal file
@@ -0,0 +1,352 @@
|
||||
<h1><center>实验报告</center></h1>
|
||||
|
||||
<div style="text-align: center;">
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">课程名称</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">计算机语音技术</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">实验题目</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">实验3</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">学号</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">21281280</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">姓名</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">柯劲帆</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">班级</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">物联网2101班</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">指导老师</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">朱维彬</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">报告日期</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">2023年10月13日</span></div>
|
||||
</div>
|
||||
|
||||
|
||||
|
||||
---
|
||||
|
||||
|
||||
|
||||
[TOC]
|
||||
|
||||
---
|
||||
|
||||
|
||||
|
||||
# 1. 自相关系数计算和基音周期候选点集
|
||||
|
||||
- **基于自相关系数AC的局部极大值构成基音周期候选点集,通过预设基频范围进行初步筛选**
|
||||
|
||||
## 1.1. 计算自相关系数
|
||||
|
||||
自相关算法能够体现语音信号的时域重复性。
|
||||
|
||||
基于某帧语音信号的自相关函数在某一时延$t$(单位为采样点)处的大小,可以推断该帧语音信号的基音周期是$t$的可能性:$t$点自相关系数越大,基音周期是$t$的概率越大;反之亦然。
|
||||
|
||||
因此,可以通过计算语音信号中各帧的自相关函数来推断各帧的基音周期。
|
||||
|
||||
自相关函数的计算方法:
|
||||
$$
|
||||
\hat{R}_n\left(k\right) = \sum_{m=0}^{N-1}x\left(n+m\right)x\left(n+m+k\right)
|
||||
$$
|
||||
其中$\hat{R}_n\left(k\right)$是第$n$帧在时延$k$处的自相关函数,$0\le k < N$,$N$为帧长(每帧的采样点数)。
|
||||
|
||||
代码实现如下:
|
||||
|
||||
```python
|
||||
# 读取音频文件
|
||||
filename = "./tang1.wav"
|
||||
sample_rate, sound_array = scipy.io.wavfile.read(filename)
|
||||
sound_array = sound_array.T[0, :] if sound_array.ndim != 1 else sound_array
|
||||
sound_array = sound_array / np.max(np.abs(sound_array)) # 归一化
|
||||
|
||||
frame_length = int(sample_rate * 0.01)
|
||||
num_frames = len(sound_array) // frame_length
|
||||
autocorrelation = np.zeros((num_frames, frame_length)) # 每帧的自相关函数
|
||||
autocorrelation_of_candidates = np.zeros((num_frames, frame_length))
|
||||
|
||||
for n in range(num_frames):
|
||||
frame = sound_array[n * frame_length: (n + 1) * frame_length] # 提取单帧
|
||||
autocorrelation[n, :] = scipy.signal.correlate(frame, frame, mode='full')[frame_length - 1:]
|
||||
```
|
||||
|
||||
代码的逻辑为:
|
||||
|
||||
1. 读取音频文件并进行预处理;
|
||||
2. 设置帧长和分帧数,以及初始化自相关函数的记录变量;
|
||||
3. 逐帧进行自相关计算。
|
||||
|
||||
## 1.2. 选取基音周期候选点集
|
||||
|
||||
得到自相关函数`autocorrelation`后,下一步就是取各帧的局部极大值构成基音周期候选点集。
|
||||
|
||||
但是在实际实验中,局部极大值的提取较为困难:要么就是自相关函数波动严重,局部极大值非常多;要么就是自相关函数单调变化或平滑变化,局部极大值几乎不存在。由于后续的DP算法需要每一帧都有基音周期候选点集,所以局部极大值数量少会严重影响算法的稳定性。
|
||||
|
||||
因此,考虑到一帧的长度不大,我**将符合人声基频范围的所有基音周期都视为候选点**。那么基音周期候选点的范围限制为:
|
||||
$$
|
||||
\frac{1}{min\times \frac{1}{sample \space rate}} \le 400\rm{Hz}\Longrightarrow min \ge \frac{sample \space rate}{400\rm{Hz}}
|
||||
$$
|
||||
同理,
|
||||
$$
|
||||
\frac{1}{max\times \frac{1}{sample \space rate}} \ge 80\rm{Hz}\Longrightarrow max \le \frac{sample \space rate}{80\rm{Hz}}
|
||||
$$
|
||||
综上,基音周期候选点的范围为
|
||||
$$
|
||||
\left [\frac{sample \space rate}{400\rm{Hz}}, \frac{sample \space rate}{80\rm{Hz}}\right ]
|
||||
$$
|
||||
为了让DP算法只选择在阈值内的基音周期(即不选择阈值外的基音周期),所以将阈值外的自相关系数置为一个较小的数,从而让DP算法选择阈值外基音周期的代价变高。
|
||||
|
||||
代码实现如下:
|
||||
|
||||
```python
|
||||
min_peak_threshold = min(sample_rate // 400, frame_length)
|
||||
max_peak_threshold = min(sample_rate // 80, frame_length)
|
||||
|
||||
autocorrelation_of_candidates = autocorrelation
|
||||
autocorrelation_of_candidates[:, 0:min_peak_threshold] = np.min(autocorrelation) - 5.
|
||||
autocorrelation_of_candidates[:, max_peak_threshold:] = np.min(autocorrelation) - 5.
|
||||
```
|
||||
|
||||
最终,`tang1.wav`的自相关函数计算结果如下所示,
|
||||
|
||||
<img src="p1.png" alt="p1" style="zoom:67%;" />
|
||||
|
||||
$x$轴为候选点,$y$轴为帧,$z$轴为自相关函数值。不在人声基频范围内的点的自相关系数被置为一个很低的数值(蓝色区域所示)。
|
||||
|
||||
|
||||
|
||||
# 2. 动态规划DP算法
|
||||
|
||||
- **采用动态规划DP算法,通过代价函数`CostFunction()`设计目标代价与转移代价,利用帧同步搜索代价最小路径,检测出基频**
|
||||
|
||||
DP算法的步骤如下:
|
||||
|
||||
1. **初始化第$1$帧的代价函数**
|
||||
$$
|
||||
CostF\left(1, i\right) = Dist\left(ac\left({Can}_{1}\left(i\right)\right)\right) \left ( i=1, 2, \dots, L \right )
|
||||
$$
|
||||
其中:
|
||||
|
||||
- $CostF\left(n, i\right)$表示第$n$帧的第$i$个基音周期候选点的累计代价;
|
||||
- $Dist\left(\right)$为距离测度;
|
||||
- $ac\left(\right)$表示自相关函数序列;
|
||||
- ${Can}_{n}\left(i\right)$表示第$n$帧的第$i$个基音周期候选点;
|
||||
- $L$表示候选点数。
|
||||
|
||||
我将$Dist\left(\right)$的计算设置为
|
||||
$$
|
||||
Dist\left(ac\left({Can}_{1}\left(i\right)\right)\right) = - \operatorname{normalize} \left(ac\left({Can}_{1}\left(i\right)\right)\right) \times 2 \times10^3
|
||||
$$
|
||||
其中$\operatorname{normalize} \left(\right)$表示归一化,这是为了控制不同的音频计算得到的不同自相关系数在固定的范围。
|
||||
|
||||
代码实现如下:
|
||||
|
||||
```python
|
||||
costF = np.zeros((num_frames, frame_length))
|
||||
path = np.zeros((num_frames, frame_length))
|
||||
dist = -(autocorrelation_of_candidates - np.min(autocorrelation_of_candidates)) / (np.max(autocorrelation_of_candidates) - np.min(autocorrelation_of_candidates)) * 2e3
|
||||
|
||||
costF[0, :] = dist[0, :]
|
||||
```
|
||||
|
||||
代码声明了代价记录变量和转移路径记录变量,然后初始化第$1$帧的代价函数。
|
||||
|
||||
2. **计算每帧的代价函数**
|
||||
$$
|
||||
\begin{array}{c}
|
||||
CostF\left(n+1, j\right) = \underset{i}{\operatorname{min}}\left(CostF\left(n, i\right)+CostT\left(n, i, j\right)\right)+CostG\left(n+1, j\right)\\
|
||||
Path = \underset{i}{\operatorname{argmin}}\left(CostF\left(n, i\right)+CostT\left(n, i, j\right)\right)\left ( i,j=1, 2, \dots, L \right )
|
||||
\end{array}
|
||||
$$
|
||||
其中:
|
||||
|
||||
- $CostT\left(n, i, j\right)$表示转移代价:第$n$帧中第$i$个候选点的基频与第$n+1$帧中第$j$个候选点的基频之差,并进行归一化;
|
||||
- $CostG\left(n+1, j\right)$为目标代价:等于$Dist\left(ac\left({Can}_{n+1}\left(j\right)\right)\right)$的值。
|
||||
|
||||
这一步是要计算得到的结果满足两个评价指标的权衡:
|
||||
|
||||
- 自相关系数尽可能大,即基音周期正确概率尽可能大;
|
||||
- 基频的变化尽可能小,即基音周期的变化尽可能小。
|
||||
|
||||
这两个指标都进行了归一化操作,即都在$\left[0, 1\right]$之间,只需要调整它们的权重即可人工控制权衡的决策。这里,我将自相关系数相关的权重设置为$2 \times10^3$,基音周期变化的权重设置为$1$。
|
||||
|
||||
代码实现如下:
|
||||
|
||||
```python
|
||||
for n in range(num_frames - 1):
|
||||
for j in range(min_peak_threshold, max_peak_threshold):
|
||||
costT = np.abs(sample_rate / np.arange(frame_length)[1:] - sample_rate / j)
|
||||
costT = (costT - np.min(costT)) / (np.max(costT) - np.min(costT))
|
||||
costG = dist[n + 1, j]
|
||||
costF[n + 1, j] = costG + np.min(costF[n, 1:] + costT)
|
||||
path[n + 1, j] = np.argmin(costF[n, 1:] + costT) + 1
|
||||
```
|
||||
|
||||
代码循环计算了每帧的代价函数。
|
||||
|
||||
3. **确定最优路径**
|
||||
$$
|
||||
\hat{l}_N=\underset{j}{\operatorname{argmin}}\left(CostF\left(N, j\right)\right)
|
||||
$$
|
||||
即在最后一帧中找到累计代价最小的候选点,作为路径回溯的起点。
|
||||
|
||||
代码实现如下:
|
||||
|
||||
```python
|
||||
l_hat = np.zeros(num_frames, dtype=np.int32)
|
||||
l_hat[num_frames - 1] = np.argmin(costF[num_frames - 1, 1:]) + 1
|
||||
```
|
||||
|
||||
4. **路径回溯**
|
||||
$$
|
||||
\begin{array}{c}
|
||||
\hat{l}_n=Path\left(n+1, \hat{l}_{n+1}\right)\\
|
||||
F0\left [ n \right ] = F0\left ( {Can}_{n}\left(\hat{l}_n\right) \right )
|
||||
\end{array}
|
||||
$$
|
||||
即从后一帧的最优候选点出发,往前回溯出使代价最小的前一帧候选点(代价最小的转移路径已经在`Path`变量中记录),然后由候选点计算出该帧的基频。
|
||||
|
||||
由候选点计算基频的公式为:
|
||||
$$
|
||||
F0\left ({Can}_{n}\left(i\right) \right ) = \frac{sample \space rate}{{Can}_{n}\left(i\right)}
|
||||
$$
|
||||
|
||||
代码实现如下:
|
||||
|
||||
```python
|
||||
for n in range(num_frames - 2, -1, -1):
|
||||
l_hat[n] = path[n + 1, l_hat[n + 1]]
|
||||
|
||||
f0 = sample_rate / l_hat
|
||||
```
|
||||
|
||||
运行以上代码,并画出语音信号及基频变化如下:
|
||||
|
||||

|
||||
|
||||
使用Praat进行基频检测,预测基频如下:
|
||||
|
||||

|
||||
|
||||
预测结果相近,表明算法基本正确。
|
||||
|
||||
|
||||
|
||||
# 3. 绘制图像与分析
|
||||
|
||||
- **自行录制一段连续语音发音,对检测结果进行分析,针对观察到的问题优化算法**
|
||||
|
||||
使用实验1中录制的“wo3 jiao4 ke1 jing4 fan1”作为测试音频,进行基频分析。
|
||||
|
||||
自相关函数如下:
|
||||
|
||||
<img src="p4.png" alt="p4" style="zoom:67%;" />
|
||||
|
||||
发现自相关函数比单音节的`tang1`的更加复杂,但是可以明显发现五个发音之间由四个空隙隔开——在这四个空隙中,自相关函数值相对较低,印证了发音间隙基频变化大的事实。
|
||||
|
||||
使用DP算法计算后,画出基频曲线:
|
||||
|
||||

|
||||
|
||||
虽然可以大致判断基频变化情况,但是不明显。分析发现,基频变化太大了,因此需要调整转移代价的权重,增加基频转移对DP算法决策的惩罚。
|
||||
|
||||
因此,将
|
||||
|
||||
```python
|
||||
dist = -autocorrelation_of_candidates * 2e3
|
||||
```
|
||||
|
||||
修改为
|
||||
|
||||
```python
|
||||
dist = -autocorrelation_of_candidates * 1e1
|
||||
```
|
||||
|
||||
其余不变。再次运行并绘图:
|
||||
|
||||

|
||||
|
||||
此时,基频更加集中,变化趋势较为明显。
|
||||
|
||||
与Praat的检测结果进行比对:
|
||||
|
||||

|
||||
|
||||
有较高的相似性,说明算法有效。
|
||||
|
||||
对于算法的优化,我目前只想到调整目标代价和转移代价之间的平衡权重这一种方法。这种方法只能依靠手动调参,经验性较强,效果有限。
|
||||
|
||||
|
||||
|
||||
# 4. 附录
|
||||
|
||||
完整算法及绘图代码:
|
||||
|
||||
```python
|
||||
import scipy
|
||||
import numpy as np
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# 读取音频文件
|
||||
filename = "./tang1.wav"
|
||||
sample_rate, sound_array = scipy.io.wavfile.read(filename)
|
||||
sound_array = sound_array.T[0, :] if sound_array.ndim != 1 else sound_array
|
||||
sound_array = sound_array / np.max(np.abs(sound_array)) # 归一化
|
||||
|
||||
frame_length = int(sample_rate * 0.01)
|
||||
num_frames = len(sound_array) // frame_length
|
||||
autocorrelation = np.zeros((num_frames, frame_length))
|
||||
autocorrelation_of_candidates = np.zeros((num_frames, frame_length))
|
||||
|
||||
# 基频阈值为80-400Hz,则基音周期(即延迟)t最小为sample_rate/400,最大为sample_rate/80
|
||||
min_peak_threshold = min(sample_rate // 400, frame_length)
|
||||
max_peak_threshold = min(sample_rate // 80, frame_length)
|
||||
for n in range(num_frames):
|
||||
frame = sound_array[n * frame_length: (n + 1) * frame_length]
|
||||
autocorrelation[n, :] = scipy.signal.correlate(frame, frame, mode='full')[frame_length - 1:]
|
||||
# 本应该使用峰值的延迟作为基音周期的候选值,但是发现峰值(局部极大值)并不好判断,同时一帧内的点数不多,因此将阈值内的所有点都作为候选点
|
||||
# 那么将不在阈值内的自相关系数置为一个非常小的数,从而不让算法选择不在阈值内的基音周期
|
||||
autocorrelation_of_candidates = autocorrelation
|
||||
autocorrelation_of_candidates[:, 0:min_peak_threshold] = np.min(autocorrelation) - 10.
|
||||
autocorrelation_of_candidates[:, max_peak_threshold:] = np.min(autocorrelation) - 10.
|
||||
|
||||
x, y = np.meshgrid(np.arange(frame_length), np.arange(num_frames))
|
||||
x, y, z = x.flatten(), y.flatten(), autocorrelation_of_candidates.flatten()
|
||||
fig = plt.figure()
|
||||
ac_3d = fig.add_subplot(111, projection='3d')
|
||||
sc = ac_3d.scatter(x, y, z, c=z, cmap='plasma')
|
||||
ac_3d.set_xlabel('Candidates')
|
||||
ac_3d.set_ylabel('Frame')
|
||||
ac_3d.set_zlabel('AC Value')
|
||||
plt.colorbar(sc)
|
||||
plt.show()
|
||||
|
||||
|
||||
costF = np.zeros((num_frames, frame_length))
|
||||
path = np.zeros((num_frames, frame_length))
|
||||
dist = -(autocorrelation_of_candidates - np.min(autocorrelation_of_candidates)) / (np.max(autocorrelation_of_candidates) - np.min(autocorrelation_of_candidates)) * 1e1
|
||||
|
||||
costF[0, :] = dist[0, :]
|
||||
|
||||
for n in range(num_frames - 1):
|
||||
for j in range(min_peak_threshold, max_peak_threshold):
|
||||
# f0 = sample_rate / candidate
|
||||
costT = np.abs(sample_rate / np.arange(frame_length)[1:] - sample_rate / j)
|
||||
costT = (costT - np.min(costT)) / (np.max(costT) - np.min(costT)) # 归一化
|
||||
costG = dist[n + 1, j]
|
||||
costF[n + 1, j] = costG + np.min(costF[n, 1:] + costT)
|
||||
path[n + 1, j] = np.argmin(costF[n, 1:] + costT) + 1
|
||||
|
||||
l_hat = np.zeros(num_frames, dtype=np.int32)
|
||||
l_hat[num_frames - 1] = np.argmin(costF[num_frames - 1, 1:]) + 1
|
||||
|
||||
for n in range(num_frames - 2, -1, -1):
|
||||
l_hat[n] = path[n + 1, l_hat[n + 1]]
|
||||
|
||||
f0 = sample_rate / l_hat
|
||||
|
||||
plt.figure(figsize=(15, 6))
|
||||
plt.subplot(2, 1, 1)
|
||||
plt.plot(sound_array)
|
||||
plt.ylabel("Signal")
|
||||
plt.subplot(2, 1, 2)
|
||||
plt.ylim(0, 500)
|
||||
x = frame_length * np.arange(num_frames)
|
||||
y = f0
|
||||
X_Smooth = np.linspace(x.min(), x.max(), 300)
|
||||
Y_Smooth = scipy.interpolate.make_interp_spline(x, y)(X_Smooth)
|
||||
plt.plot(X_Smooth, Y_Smooth, color="orange")
|
||||
plt.plot(x, y, "o")
|
||||
plt.ylabel("Pitch (Hz)")
|
||||
plt.show()
|
||||
```
|
||||
|
||||
BIN
Lab/Lab3/柯劲帆_21281280_实验3.pdf
Normal file
BIN
Lab/文献阅读报告/materials/Scale-dot-attn.png
Normal file
|
After Width: | Height: | Size: 25 KiB |
BIN
Lab/文献阅读报告/materials/The_transformer_encoder_decoder_stack.png
Normal file
|
After Width: | Height: | Size: 125 KiB |
BIN
Lab/文献阅读报告/materials/model-arch.png
Normal file
|
After Width: | Height: | Size: 75 KiB |
BIN
Lab/文献阅读报告/materials/multi-head.png
Normal file
|
After Width: | Height: | Size: 61 KiB |
BIN
Lab/文献阅读报告/materials/result.png
Normal file
|
After Width: | Height: | Size: 71 KiB |
BIN
Lab/文献阅读报告/materials/result2.png
Normal file
|
After Width: | Height: | Size: 95 KiB |
BIN
Lab/文献阅读报告/materials/self-attention-matrix-calculation.png
Normal file
|
After Width: | Height: | Size: 28 KiB |
BIN
Lab/文献阅读报告/materials/transformer_resideual_layer_norm_2.png
Normal file
|
After Width: | Height: | Size: 78 KiB |
BIN
Lab/文献阅读报告/materials/transformer_resideual_layer_norm_3.png
Normal file
|
After Width: | Height: | Size: 184 KiB |
254
Lab/文献阅读报告/materials/柯劲帆_21281280_文献阅读报告.md
Normal file
@@ -0,0 +1,254 @@
|
||||
<h1><center>课程作业</center></h1>
|
||||
|
||||
<div style="text-align: center;">
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">课程名称</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">计算机语音技术</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">作业名称</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">文献阅读报告</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">学号</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">21281280</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">姓名</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">柯劲帆</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">班级</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">物联网2101班</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">指导老师</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">朱维彬</span></div>
|
||||
<div><span style="display: inline-block; width: 65px; text-align: center;">修改日期</span><span style="display: inline-block; width: 25px;">:</span><span style="display: inline-block; width: 210px; font-weight: bold; text-align: left;">2023年12月24日</span></div>
|
||||
</div>
|
||||
|
||||
---
|
||||
|
||||
> 文献名称:**Attention Is All You Need**
|
||||
>
|
||||
> Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in neural information processing systems, 2017, 30.
|
||||
|
||||
|
||||
|
||||
[TOC]
|
||||
|
||||
# 1. 文献解读
|
||||
|
||||
## 1.1. 引言
|
||||
|
||||
引言首先在第一段叙述了当时(2017年)的深度学习主流架构:RNN架构,以LSTM和GRU为代表的RNN在多项序列任务中取得了SOTA的结果,主要的研究趋势是不断提升递归语言模型和“encoder-decoder”架构的能力上限。
|
||||
|
||||
接着第二段叙述了RNN的不足,主要是其必须使用串行计算,必须依照序列的顺序性,并行计算困难。
|
||||
|
||||
而注意力机制的应用可以无视序列的先后顺序,捕捉序列间的关系。
|
||||
|
||||
因此文章提出一种新架构Transformer,完全使用注意力机制来捕捉输入输出序列之间的依赖关系,规避RNN的使用。实验证明,这种新架构在GPU上训练大幅加快,且达到了新的SOTA水平。
|
||||
|
||||
总结:
|
||||
|
||||
- 论文主题:提出一种新架构
|
||||
- 解决问题:提出能够并行计算的架构替代RNN
|
||||
- 技术思路:仅使用注意力机制处理输入输出序列
|
||||
|
||||
|
||||
|
||||
## 1.2. 算法
|
||||
|
||||
### 1.2.1 整体架构
|
||||
|
||||
<img src="model-arch.png" alt="model-arch" style="zoom: 50%;" />
|
||||
|
||||
<p align="center">图1 子编码器架构</p>
|
||||
|
||||
上图是Transformer架构的模型网络结构。
|
||||
|
||||
整个Transformer模型由$N=6$层这样的子编码器和$N=6$层子解码器堆叠组成,如下图所示:
|
||||
|
||||
<img src="The_transformer_encoder_decoder_stack.png" alt="The_transformer_encoder_decoder_stack" style="zoom: 33%;" />
|
||||
|
||||
<p align="center">图2 编码-解码器架构</p>
|
||||
|
||||
每个子编码器的架构相同,如图1所示,但各个子编码器的模型权重不同;同理,每个子解码器的架构相同,权重不同。
|
||||
|
||||
因此,模型的数据处理过程如下:
|
||||
|
||||
1. 对输入数据进行文本嵌入;
|
||||
2. 对输入数据进行位置编码;
|
||||
3. 将输入数据依次输入$N=6$层子编码器;
|
||||
4. 将最后一层子编码器的输出分别传入每层子解码器;
|
||||
5. 对目标数据进行文本嵌入;
|
||||
6. 对目标数据进行位置编码;
|
||||
7. 将目标数据依次输入$N=6$层子解码器;
|
||||
8. 使用全连接层和Softmax层将解码器的输出转换为预测值并输出。
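
上述数据处理流程可以用PyTorch内置模块写成如下结构性示意(个人补充的示意代码;词表大小、序列长度等均为假设值,位置编码在此省略,实际需在嵌入后加上):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512
embed = nn.Embedding(vocab_size, d_model)                  # 1/5. 文本嵌入
transformer = nn.Transformer(d_model=d_model, nhead=8,
                             num_encoder_layers=6,         # 3. N=6 层子编码器
                             num_decoder_layers=6)         # 7. N=6 层子解码器
generator = nn.Linear(d_model, vocab_size)                 # 8. 全连接层(Softmax通常在损失函数中计算)

src = torch.randint(0, vocab_size, (7, 1))    # (源序列长度, batch)
tgt = torch.randint(0, vocab_size, (9, 1))    # (目标序列长度, batch)
out = transformer(embed(src), embed(tgt))     # 2/6. 位置编码此处省略
logits = generator(out)                       # 形状为 (9, 1, vocab_size)
```
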
|
||||
|
||||
### 1.2.2. 缩放点乘注意力
|
||||
|
||||
这一部分包含于多头注意力模块之中。
|
||||
|
||||
<img src="Scale-dot-attn.png" alt="Scale-dot-attn" style="zoom: 33%;" />
|
||||
|
||||
<p align="center">图3 缩放点乘注意力计算描述</p>
|
||||
|
||||
计算公式如下:
|
||||
$$
|
||||
\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V
|
||||
$$
|
||||
其中,输入的$Q$为查询(Query)向量,$K$为关键字(Key)向量,$V$为值(Value)向量,由输入token序列$X$计算得到:
|
||||
|
||||
<img src="self-attention-matrix-calculation.png" alt="self-attention-matrix-calculation" style="zoom: 33%;" />
|
||||
|
||||
<p align="center">图4 自注意力矩阵计算描述</p>
|
||||
|
||||
其中,$W^Q$、$W^K$和$W^V$都是待学习的权重。
|
||||
|
||||
接着,$QK^T$表示将查询向量$Q$与关键字向量$K$做内积,意在计算两者的相关性。查询与对应token的关键字的相关性越高,得到的结果矩阵中的对应值越高,受到注意力就会越高。将结果矩阵除以$\sqrt{d_{k}}$(避免较高值造成Softmax层的梯度消失),送进Softmax层进行概率计算。
|
||||
|
||||
得到概率矩阵后,将概率矩阵与值向量$V$相乘,作为对应token的值的权重。
|
||||
|
||||
最后,将加权的值向量输出。
|
||||
|
||||
注意力机制算法的核心思想是,对于每个token的键值对,查询与一个token的键越相似,就对该token的值越关注。
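
下面给出缩放点乘注意力的一个numpy示意实现(个人补充;token数与向量维度均为随意设定的小数值):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q、K形状为(序列长度, d_k),V形状为(序列长度, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                              # QK^T / sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)      # Softmax得到注意力权重
    return weights @ V                                           # 用权重对V加权求和

# 随机示例:4个token,d_model = 8(数值仅作演示)
X = np.random.randn(4, 8)
W_Q, W_K, W_V = (np.random.randn(8, 8) for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)   # (4, 8)
```
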
|
||||
|
||||
### 1.2.3. 多头注意力
|
||||
|
||||
这部分对应于图1中的`Multi-Head Attention`。
|
||||
|
||||
<img src="multi-head.png" alt="multi-head" style="zoom: 33%;" />
|
||||
|
||||
<p align="center">图5 多头注意力计算描述</p>
|
||||
|
||||
计算公式如下:
|
||||
$$
|
||||
\begin{align}
|
||||
\begin{aligned}
|
||||
\operatorname{MultiHead}(Q, K, V) & = \operatorname{Concat}\left(\operatorname{head}_{1}, \ldots, \operatorname{head}_{\mathrm{h}}\right) W^{O} \\
|
||||
\text {where } \operatorname{head}_{i} & = \operatorname{Attention}\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right)
|
||||
\end{aligned}
|
||||
\end{align}
|
||||
$$
|
||||
多头注意力实际上是多个自注意力的叠加,类似卷积神经网络中多个通道特征,这样可以有效学习到多个不同的特征。
|
||||
|
||||
将查询向量$Q$、关键字向量$K$和值向量$V$分别经过$h=8$组不同的线性变换($h$即为“多头”的头数),送入$h$个自注意力计算模块中计算特征。最后将得到的特征拼接(Concat),送入全连接层融合。
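
下面是多头注意力的一个numpy示意实现(个人补充;$h$、$d_{model}$取了较小的假设值,投影矩阵以随机初始化代替可学习参数):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, h=2, d_model=8):
    d_k = d_model // h
    heads = []
    for _ in range(h):
        # 每个头使用一组独立的投影矩阵(此处随机初始化,实际为可学习参数)
        W_Q, W_K, W_V = (np.random.randn(d_model, d_k) for _ in range(3))
        heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))
    W_O = np.random.randn(h * d_k, d_model)        # 输出融合矩阵
    return np.concatenate(heads, axis=-1) @ W_O    # Concat后经线性层融合

X = np.random.randn(4, 8)    # 4个token,d_model=8
print(multi_head_attention(X).shape)   # (4, 8)
```
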
|
||||
|
||||
### 1.2.4. 应用注意力机制
|
||||
|
||||
<img src="transformer_resideual_layer_norm_2.png" alt="transformer_resideual_layer_norm_2" style="zoom: 33%;" />
|
||||
|
||||
<p align="center">图6 子编码器计算流程</p>
|
||||
|
||||
在编码器中,每一个子编码器都由以下计算流程依次组成:
|
||||
|
||||
1. 输入token序列$X$进行多头注意力计算;
|
||||
2. 将$h$个自注意力模块计算结果使用全连接层整合;
|
||||
3. 与原输入序列$X$相加,构成残差网络,并进行层归一化;
|
||||
4. 将结果送入前馈网络;
|
||||
5. 进行残差计算,随后归一化,最后输出。
|
||||
|
||||
<img src="transformer_resideual_layer_norm_3.png" alt="transformer_resideual_layer_norm_3" style="zoom: 33%;" />
|
||||
|
||||
<p align="center">图7 编码解码器计算流程</p>
|
||||
|
||||
在解码器中,每一个子解码器都由以下计算流程依次组成:
|
||||
|
||||
1. 对目标token序列进行掩码处理;
|
||||
2. 送入多头注意力模块计算;
|
||||
3. 与输入的目标序列求和计算残差并归一化;
|
||||
4. 将上一步的输出作为查询向量$Q$,由编码器的输出计算出关键字向量$K$和值向量$V$,输入编码-解码注意力模块计算;
|
||||
5. 与输入序列求和计算残差并归一化;
|
||||
6. 将结果送入前馈网络;
|
||||
7. 与输入序列求和计算残差并归一化,最后输出。
|
||||
|
||||
### 1.2.5. 前馈网络
|
||||
|
||||
计算公式如下:
|
||||
$$
|
||||
FFN\left(x\right) = max\left(0, xW_1+b_1\right)W_2 + b_2
|
||||
$$
|
||||
两层全连接层之间使用ReLU作为激活函数。
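
对应的numpy示意实现如下(个人补充;权重以随机初始化代替可学习参数,$d_{model}=512$、$d_{ff}=2048$为论文中的默认配置):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # 两层全连接,中间使用ReLU激活:max(0, xW1 + b1)W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff) * 0.01, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.01, np.zeros(d_model)
x = np.random.randn(4, d_model)      # 4个token的表示
print(ffn(x, W1, b1, W2, b2).shape)  # (4, 512)
```
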
|
||||
|
||||
### 1.2.6. 掩码多头注意力
|
||||
|
||||
由于Transformer架构支持并行计算,当预测${Output}_t$时,不能向模型提供${Output}_{t+1}$及之后的子序列,因此要对目标序列输入进行掩码。
|
||||
|
||||
比如一组数据为
|
||||
|
||||
```txt
|
||||
{
|
||||
Input: "i love computer speech technology",
|
||||
Target: "我爱计算机语音技术"
|
||||
}
|
||||
```
|
||||
|
||||
希望预测Target[4]“机”时,只能向模型提供Target[0:4]“我爱计算”,而不能输入后面的序列,因此需要将其遮掩。
|
||||
|
||||
遮掩方法是将被遮掩位置在注意力公式中Softmax的输入置为$-\infty$,这样这些位置得到的注意力权重几乎为0,达到遮住后面token的效果。
|
||||
|
||||
即希望预测Target[4]“机”时,Softmax的输入相当于["我", "爱", "计", "算", "$-\infty$", "$-\infty$", "$-\infty$", "$-\infty$", "$-\infty$"],其中文字代表其对应的计算得到的注意力权重。
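
下面用numpy演示因果掩码的构造及其对Softmax权重的影响(个人补充的示意代码):

```python
import numpy as np

L = 9                                    # 目标序列长度(例:"我爱计算机语音技术")
scores = np.random.randn(L, L)           # 假设为 QK^T / sqrt(d_k) 的打分矩阵
mask = np.triu(np.ones((L, L)), k=1)     # 上三角(不含对角线)为1,表示"未来"位置
scores = np.where(mask == 1, -1e9, scores)   # 将未来位置置为近似负无穷

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights[4], 3))   # 预测第5个token时,第6个及之后位置的权重几乎为0
```
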
|
||||
|
||||
### 1.2.7. 位置编码
|
||||
|
||||
计算公式为:
|
||||
$$
|
||||
\begin{aligned}
|
||||
P E_{(\text {pos }, 2 i)} & =\sin \left(\operatorname{pos} / 10000^{2 i / d_{\text {model }}}\right) \\
|
||||
P E_{(\text {pos }, 2 i+1)} & =\cos \left(\operatorname{pos} / 10000^{2 i / d_{\text {model }}}\right)
|
||||
\end{aligned}
|
||||
$$
|
||||
其中$pos$为token在序列中的下标,$2i$或$2i+1$为词向量的维度序号。即词向量维度为偶数时使用正弦函数,为奇数时使用余弦函数。
|
||||
|
||||
该函数满足以下性质:
|
||||
|
||||
- 对于一个词嵌入向量的不同元素,编码各不相同;
|
||||
|
||||
- 对于向量的同一个维度处,不同$pos$的编码不同。且$pos$间满足相对关系:
|
||||
$$
|
||||
\left\{\begin{array}{l}
|
||||
P E(\text { pos }+k, 2 i)=P E(\text { pos }, 2 i) \times P E(k, 2 i+1)+P E(\text { pos, } 2 i+1) \times P E(k, 2 i) \\
|
||||
P E(\text { pos }+k, 2 i+1)=P E(\text { pos }, 2 i+1) \times P E(k, 2 i+1)-P E(\text { pos }, 2 i) \times P E(k, 2 i)
|
||||
\end{array}\right.
|
||||
$$
|
||||
从实际意义上看,例如Target[4]“机”的位置编码可以由Target[1]“爱”和Target[3]“算”的位置编码线性表示。
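
位置编码矩阵可以用如下numpy示意代码计算(个人补充):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angle = pos / np.power(10000, 2 * i / d_model)     # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                        # 偶数维度用正弦
    pe[:, 1::2] = np.cos(angle)                        # 奇数维度用余弦
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)   # (50, 512)
```
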
|
||||
|
||||
|
||||
|
||||
## 1.3. 结论
|
||||
|
||||
### 1.3.1. 实验结果
|
||||
|
||||
研究测试了“英语-德语”和“英语-法语”两项翻译任务。使用论文的默认模型配置,在8张P100上只需12小时就能把模型训练完。
|
||||
|
||||
研究使用了Adam优化器,并对学习率调度有一定的优化。模型有两种正则化方式:
|
||||
|
||||
1. 每个子层后面有Dropout,丢弃概率0.1;
|
||||
2. 标签平滑。
|
||||
|
||||
<img src="result.png" alt="result" style="zoom: 25%;" />
|
||||
|
||||
<p align="center">图8 翻译任务实验结果</p>
|
||||
|
||||
实验表明,Transformer在翻译任务上胜过了所有其他模型,且训练时间大幅缩短。
|
||||
|
||||
论文同样展示了不同配置下Transformer的消融实验结果。
|
||||
|
||||
<img src="result2.png" alt="result2" style="zoom: 25%;" />
|
||||
|
||||
<p align="center">图9 消融实验结果</p>
|
||||
|
||||
- 实验A表明,在计算量不变的前提下,需要谨慎地调节$h$与$d_k$、$d_v$的比例,头数太多或太少都不好;这些实验也说明多头注意力优于单头。
- 实验B表明,增大$d_k$可以提升模型性能。作者认为,这说明计算query与key的相关性较为困难,如果用更精巧的计算方式代替点乘,可能可以进一步提升性能。
- 实验C、D表明,更大的模型效果更好,且Dropout是必要的。
- 实验E探究了可学习的位置编码,其效果与三角函数位置编码几乎一致。
|
||||
|
||||
### 1.3.2. 研究结论
|
||||
|
||||
本文提出了Transformer这一仅由注意力机制构成的模型。Transformer的效果非常出色,不仅训练速度更快,还在两项翻译任务上胜过其他模型。
|
||||
|
||||
Transformer在未来还可能被应用到图像、音频或视频等的处理任务中。
|
||||
|
||||
|
||||
|
||||
## 1.4. 个人见解
|
||||
|
||||
在接触过的研究中,我使用过很多基于Transformer架构的模型。
|
||||
|
||||
在图片Caption任务中,我将Transformer与CNN和RNN结合的编码-解码器进行效果比较,后者的效果非常差,而Transformer可以达到较好的效果。
|
||||
|
||||
另外,基于Transformer架构的各类大模型可以达到很好的效果,其中我体验过商用的ChatGPT,也复现过开源的[Llama2](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)、[LLaVA](https://arxiv.org/abs/2304.08485)等VQA任务模型,使用过[RoBERTa](https://arxiv.org/abs/1907.11692)等BERT-base模型完成token分类任务,使用[ViT](https://arxiv.org/abs/2010.11929)为[CLIP](https://proceedings.mlr.press/v139/radford21a)等图像分类任务模型进行图像特征提取等。
|
||||
|
||||
由于学识不足,个人无法独立给出对Transformer架构本身的评价。因此,本人查阅了近年来对Transformer进行改进或替代的研究,搜集到如下几点问题或改进方法:
|
||||
|
||||
- Transformer模型中自注意力机制的计算量会随着上下文长度的增加呈平方级增长,计算效率非常低。最近一项研究[Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752)针对长文本有更好的效果;
|
||||
- [SIMPLIFYING TRANSFORMER BLOCKS](https://arxiv.org/abs/2311.01906)发现可以移除一些Transformer模块的部分,比如残差连接、归一化层和值参数以及MLP序列化子块(有利于并行布局),以简化类似 GPT 的解码器架构以及编码器式BERT模型。
|
||||
|
||||
|
||||
|
||||
# 2. 参考和引用资料
|
||||
|
||||
- https://zhuanlan.zhihu.com/p/569527564
|
||||
- https://jalammar.github.io/illustrated-transformer/
|
||||
|
||||