Prerequisites

  • Make sure you have a Google account with a billing account set up

  • Have an audio file ready (a quick format check is sketched below)
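
The demo in section III assumes LINEAR16 PCM at 16 kHz, so it helps to confirm your file matches before uploading it. A minimal sketch using Python's standard wave module (some-audio.wav is a placeholder name, the same one used in the demo):

import wave

# Inspect the WAV header; the values should match the RecognitionConfig used later.
with wave.open("some-audio.wav", "rb") as wf:
    print("channels:", wf.getnchannels())           # mono is simplest for diarization
    print("sample rate:", wf.getframerate(), "Hz")  # should be 16000
    print("bit depth:", wf.getsampwidth() * 8)      # 16-bit PCM corresponds to LINEAR16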

I. Configure the ADC environment

Set up ADC for your local development environment

  1. First, open the Google Cloud console and click the Activate Cloud Shell button in the top toolbar

  2. Initialize

    gcloud init
    1. Pick configuration to use: 1

    2. Select an account: pick the number next to your own email; I chose 1 here

    3. Pick cloud project to use: pick the ID of the project you want to apply this to; I chose 2 here

  3. Create local authentication credentials

    gcloud auth application-default login
    1. Do you want to continue (Y/n)? Enter Y

    2. It will print a link; open it and you will land on a confirmation page. Select your account and continue, and you will be taken to a page showing a verification code. Copy the code and paste it back into Cloud Shell.

  4. Take the credentials path printed above and print the file's contents

    # The random path is different on every run
    cat /tmp/tmp.A2jv5crnDz/application_default_credentials.json
  5. On your local machine, create the file ~/.google-cloud-credentials.json and paste in the contents of /tmp/tmp.A2jv5crnDz/application_default_credentials.json (a quick sanity check is sketched below)
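
Before moving on, it can be worth verifying that the pasted file actually resolves as ADC. A minimal sketch (the filename follows step 5 above; this check is optional and not part of the official flow):

import json
import os

from google.auth import default

# Point ADC at the file created in step 5.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.expanduser(
    "~/.google-cloud-credentials.json"
)

# The ADC file is plain JSON; listing its top-level keys confirms it pasted intact.
with open(os.environ["GOOGLE_APPLICATION_CREDENTIALS"]) as f:
    print("credential keys:", sorted(json.load(f)))

# default() raises DefaultCredentialsError if the credentials are unusable.
credentials, project = default()
print("credential type:", type(credentials).__name__)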

II. Add the application

STT application

Follow the link above to enable the application (the Speech-to-Text API) for your project; I have already enabled it here.
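
If you prefer the command line, the same API can also be enabled from Cloud Shell with gcloud services enable speech.googleapis.com.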

III. Demo

Detect the different speakers in a recording

Add the dependency

uv add google-cloud-speech
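(If you are not using uv, the package is google-cloud-speech on PyPI, so pip install google-cloud-speech works as well.)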

Sample code (main.py)

from google.cloud import speech_v1p1beta1 as speech
from google.auth import default
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/<username>/.google-cloud-credentials.json"
os.environ["GOOGLE_CLOUD_PROJECT"] = '<你的项目id-2>'
credentials, _ = default()

client = speech.SpeechClient()

speech_file = "some-audio.wav"

with open(speech_file, "rb") as audio_file:
    content = audio_file.read()

audio = speech.RecognitionAudio(content=content)

diarization_config = speech.SpeakerDiarizationConfig(
    enable_speaker_diarization=True,
    min_speaker_count=1,
    max_speaker_count=10,
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="zh-CN",
    diarization_config=diarization_config,
)

print("Waiting for operation to complete...")
response = client.recognize(config=config, audio=audio)

# The transcript within each result is separate and sequential per result.
# However, the words list within an alternative includes all the words
# from all the results thus far. Thus, to get all the words with speaker
# tags, you only have to take the words list from the last result:
result = response.results[-1]

words_info = result.alternatives[0].words

# Printing out the output:
for word_info in words_info:
    print(f"word: '{word_info.word}', speaker_tag: {word_info.speaker_tag}")

print(result)
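
If you want one line per speaker turn rather than a word-by-word dump, the words can be grouped by speaker_tag. A minimal sketch building on the words_info list above (consecutive words sharing a tag are merged into one turn):

from itertools import groupby

# Merge consecutive words that share a speaker_tag into a single "turn".
for tag, words in groupby(words_info, key=lambda w: w.speaker_tag):
    print(f"speaker {tag}: {''.join(w.word for w in words)}")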

Run

uv run main.py

Result

Waiting for operation to complete...
word: '但', speaker_tag: 1
word: '是', speaker_tag: 1
word: '在', speaker_tag: 1
word: '一', speaker_tag: 1
word: '米', speaker_tag: 1
word: '左', speaker_tag: 1
word: '右', speaker_tag: 1
word: '的', speaker_tag: 1
word: '距', speaker_tag: 1
word: '离', speaker_tag: 1
...

There is one problem here: if the file being transcribed is Chinese, only a single speaker is ever detected; English files do not have this issue.
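
The likely cause is language support: speaker diarization is only available for a subset of languages, and zh-CN does not appear to be among them, so every word falls back to the same tag. With an English recording, switching the language code should be enough; a hypothetical variant of the config above:

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",  # a language with diarization support
    diarization_config=diarization_config,
)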