Realtime AI integrations for Semantic Kernel

Article
03/08/2025

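The first realtime API integration for Semantic Kernel has been added. It is currently available only in Python and is considered experimental: the underlying services are still under development and subject to change, and we may need to make breaking changes to the Semantic Kernel API as we learn from customers how they use it and as we add other providers of these kinds of models and APIs.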

Realtime client abstraction

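To support different realtime APIs from different vendors, which use different protocols, a new client abstraction has been added to the kernel. This client connects to the realtime service and sends and receives messages. It is responsible for managing the connection to the service and for handling any errors that occur while connecting or while sending and receiving messages. Given the way these models work, they behave more like agents than like regular chat completions: they take instructions rather than a system message, they keep their own internal state, and they can be invoked to do work on our behalf.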

Realtime API

Any realtime client implements the following methods:

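Method	Description
`create_session`	Creates a new session
`update_session`	Updates an existing session
`delete_session`	Deletes an existing session
`receive`	An asynchronous generator method that listens for messages from the service and yields them as they arrive
`send`	Sends a message to the service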
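
To make the shape of this interface concrete, here is a minimal sketch. It is not the Semantic Kernel implementation: `RealtimeClientSketch` and `EchoClient` are hypothetical names invented for illustration, and events are plain dictionaries rather than the library's event classes. Only the five method names come from the table above.

```python
from __future__ import annotations

import asyncio
from collections.abc import AsyncGenerator
from typing import Any, Protocol


class RealtimeClientSketch(Protocol):
    """Hypothetical interface mirroring the five methods in the table."""

    async def create_session(self, settings: dict[str, Any] | None = None) -> None: ...
    async def update_session(self, settings: dict[str, Any]) -> None: ...
    async def delete_session(self) -> None: ...
    def receive(self) -> AsyncGenerator[dict[str, Any], None]: ...
    async def send(self, event: dict[str, Any]) -> None: ...


class EchoClient:
    """In-memory stand-in: every event that is sent comes back from receive()."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue[dict[str, Any] | None] = asyncio.Queue()
        self.settings: dict[str, Any] = {}

    async def create_session(self, settings: dict[str, Any] | None = None) -> None:
        self.settings = dict(settings or {})

    async def update_session(self, settings: dict[str, Any]) -> None:
        self.settings.update(settings)

    async def delete_session(self) -> None:
        # None is a sentinel that ends the receive() loop.
        await self._queue.put(None)

    async def receive(self) -> AsyncGenerator[dict[str, Any], None]:
        # Yield events one by one until the sentinel arrives.
        while (event := await self._queue.get()) is not None:
            yield event

    async def send(self, event: dict[str, Any]) -> None:
        await self._queue.put(event)


async def demo() -> list[dict[str, Any]]:
    client: RealtimeClientSketch = EchoClient()
    await client.create_session({"voice": "alloy"})
    await client.send({"type": "text", "text": "hello"})
    await client.delete_session()
    return [event async for event in client.receive()]
```

Calling `asyncio.run(demo())` drives the whole session lifecycle. A real connector would replace the queue with a websocket or WebRTC connection, but the calling code would keep the same shape.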

Python implementations

The Python version of Semantic Kernel currently supports the following realtime clients:

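Client	Protocol	Modalities	Function calling enabled	Description
OpenAI	Websocket	Text & Audio	Yes	The OpenAI Realtime API is a websocket-based API for sending and receiving messages in real time. This connector uses the OpenAI Python package to connect and to send and receive messages.
OpenAI	WebRTC	Text & Audio	Yes	The OpenAI Realtime API is a WebRTC-based API for sending and receiving messages in real time. It requires a WebRTC-compatible audio track at session creation time.
Azure	Websocket	Text & Audio	Yes	The Azure Realtime API is a websocket-based API for sending and receiving messages in real time. It uses the same package as the OpenAI websocket connector.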

Getting started

To get started with the Realtime API, install the semantic-kernel package with the realtime extra.

pip install semantic-kernel[realtime]

Depending on how you want to handle audio, you might need additional packages to interface with speakers and microphones, such as pyaudio or sounddevice.
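
Because these audio packages are optional, it can help to check which of them is importable before wiring up a player or recorder. The following is a small sketch using only the standard library. The helper name `available_audio_packages` is hypothetical, and the check does not import or configure the packages themselves.

```python
from importlib.util import find_spec

# Candidate audio packages named in the text; neither is required by this check.
AUDIO_PACKAGES = ("sounddevice", "pyaudio")


def available_audio_packages(candidates=AUDIO_PACKAGES) -> list[str]:
    """Return the candidate packages that can be imported in this environment."""
    return [name for name in candidates if find_spec(name) is not None]


if __name__ == "__main__":
    found = available_audio_packages()
    print("Audio packages found:", found or "none")
```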

Websocket clients

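Then you can create a kernel and add the realtime client to it. The following shows how to do that with an AzureRealtimeWebsocket connection; you can replace AzureRealtimeWebsocket with OpenAIRealtimeWebsocket without any further changes.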

from semantic_kernel.connectors.ai.open_ai import (
    AzureRealtimeWebsocket,
    AzureRealtimeExecutionSettings,
    ListenEvents,
)
from semantic_kernel.contents import RealtimeAudioEvent, RealtimeTextEvent

# this will use environment variables to get the api key, endpoint, api version and deployment name.
realtime_client = AzureRealtimeWebsocket()
settings = AzureRealtimeExecutionSettings(voice='alloy')
async with realtime_client(settings=settings, create_response=True):
    async for event in realtime_client.receive():
        match event:
            # receiving a piece of audio (and sending it to an undefined audio player)
            case RealtimeAudioEvent():
                await audio_player.add_audio(event.audio)
            # receiving a piece of audio transcript
            case RealtimeTextEvent():
                # Semantic Kernel parses the transcript to a TextContent object captured in a RealtimeTextEvent
                print(event.text.text, end="")
            case _:
                # OpenAI Specific events
                if event.service_type == ListenEvents.SESSION_UPDATED:
                    print("Session updated")
                if event.service_type == ListenEvents.RESPONSE_CREATED:
                    print("\nMosscap (transcript): ", end="")

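There are two important things to note. First, realtime_client is an async context manager, which means you can use it in an async function and use async with to create the session. Second, the receive method is an async generator, which means you can use it in an async for loop to receive messages as they arrive.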
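
Because `async with` and `async for` only work inside a coroutine, the snippet above has to live in an async function that is driven by an event loop. The sketch below shows that wrapping with a stand-in client. `FakeRealtimeClient` is a hypothetical class that only imitates the calling pattern and is not part of Semantic Kernel; the settings are a plain dictionary for the same reason.

```python
import asyncio


class FakeRealtimeClient:
    """Hypothetical stand-in that only imitates the calling pattern shown above."""

    def __init__(self, scripted_events):
        self._scripted_events = list(scripted_events)
        self.session_options = {}
        self.session_open = False

    def __call__(self, **session_options):
        # Calling the client stores the options and returns the context manager.
        self.session_options = session_options
        return self

    async def __aenter__(self):
        self.session_open = True  # a real client would connect here
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self.session_open = False  # runs even if the loop body raises
        return False

    async def receive(self):
        for event in self._scripted_events:
            await asyncio.sleep(0)  # hand control back to the loop, as network I/O would
            yield event


async def main(client):
    received = []
    async with client(settings={"voice": "alloy"}, create_response=True):
        async for event in client.receive():
            received.append(event)
    return received


if __name__ == "__main__":
    print(asyncio.run(main(FakeRealtimeClient(["Hello", ", ", "world"]))))
```

The context manager guarantees that the session is closed when the loop ends or raises, which is why the session is created with `async with` rather than by calling the session methods by hand.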

WebRTC client

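Setting up a WebRTC connection is a bit more complex, so an extra parameter is needed when creating the client. This parameter, audio_track, must be an object that implements the MediaStreamTrack protocol of the aiortc package; this is also demonstrated in the samples described below.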

To create a client that uses WebRTC, do the following:

from semantic_kernel.connectors.ai.open_ai import (
    ListenEvents,
    OpenAIRealtimeExecutionSettings,
    OpenAIRealtimeWebRTC,
)
from aiortc.mediastreams import MediaStreamTrack

class AudioRecorderWebRTC(MediaStreamTrack):
    # implement the MediaStreamTrack methods.
    ...

realtime_client = OpenAIRealtimeWebRTC(audio_track=AudioRecorderWebRTC())
# Create the settings for the session
settings = OpenAIRealtimeExecutionSettings(
    instructions="""
You are a chat bot. Your name is Mosscap and
you have one goal: figure out what people need.
Your full name, should you need to know it, is
Splendid Speckled Mosscap. You communicate
effectively, but you tend to answer with long
flowery prose.
""",
    voice="shimmer",
)
audio_player = AudioPlayer()  # AudioPlayer is a placeholder class that is not defined in this snippet
async with realtime_client(settings=settings, create_response=True):
    async for event in realtime_client.receive():
        match event.event_type:
            # receiving a piece of audio (and sending it to an undefined audio player)
            case "audio":
                await audio_player.add_audio(event.audio)
            case "text":
                # the model returns both audio and transcript of the audio, which we will print
                print(event.text.text, end="")
            case "service":
                # OpenAI Specific events
                if event.service_type == ListenEvents.SESSION_UPDATED:
                    print("Session updated")
                if event.service_type == ListenEvents.RESPONSE_CREATED:
                    print("\nMosscap (transcript): ", end="")

Both of these samples receive the audio as a RealtimeAudioEvent and then pass it to an unspecified audio_player object.

Audio output callback

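In addition, there is a parameter called audio_output_callback, available both on the receive method and when creating the class. This callback is called first, before any further handling of the audio, and receives a numpy array of the audio data, instead of the audio being parsed into AudioContent and returned as a RealtimeAudioEvent for you to handle, which is what happens in the samples above. This has been shown to give smoother audio output, because there is less overhead between the audio data coming in and it being handed to the player.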

This example shows how to define and use the audio_output_callback:

from semantic_kernel.connectors.ai.open_ai import (
    ListenEvents,
    OpenAIRealtimeExecutionSettings,
    OpenAIRealtimeWebRTC,
)
import numpy as np
from aiortc.mediastreams import MediaStreamTrack

class AudioRecorderWebRTC(MediaStreamTrack):
    # implement the MediaStreamTrack methods.
    ...

class AudioPlayer:
    async def play_audio(self, content: np.ndarray):
        # implement the audio player
        ...

realtime_client = OpenAIRealtimeWebRTC(audio_track=AudioRecorderWebRTC())
# Create the settings for the session
settings = OpenAIRealtimeExecutionSettings(
    instructions="""
You are a chat bot. Your name is Mosscap and
you have one goal: figure out what people need.
Your full name, should you need to know it, is
Splendid Speckled Mosscap. You communicate
effectively, but you tend to answer with long
flowery prose.
""",
    voice="shimmer",
)
audio_player = AudioPlayer()
async with realtime_client(settings=settings, create_response=True):
    async for event in realtime_client.receive(audio_output_callback=audio_player.play_audio):
        match event.event_type:
            # no need to handle case: "audio"
            case "text":
                # the model returns both audio and transcript of the audio, which we will print
                print(event.text.text, end="")
            case "service":
                # OpenAI Specific events
                if event.service_type == ListenEvents.SESSION_UPDATED:
                    print("Session updated")
                if event.service_type == ListenEvents.RESPONSE_CREATED:
                    print("\nMosscap (transcript): ", end="")
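
The routing that this section describes can be modelled in a few lines. The sketch below is a hypothetical model and not library code. It assumes, following the description above, that audio handled by the callback is not also yielded as an event; the actual client may behave differently. Messages are `(kind, payload)` tuples and audio payloads are plain lists standing in for numpy arrays.

```python
import asyncio


async def receive(incoming, audio_output_callback=None):
    """Hypothetical model of the routing described above, not library code.

    `incoming` is a list of (kind, payload) tuples standing in for service messages.
    """
    for kind, payload in incoming:
        if kind == "audio" and audio_output_callback is not None:
            # Assumption: audio goes straight to the callback and is not yielded.
            await audio_output_callback(payload)
            continue
        yield kind, payload


class BufferingPlayer:
    """Stand-in player that records the chunks it is asked to play."""

    def __init__(self):
        self.chunks = []

    async def play_audio(self, content):
        self.chunks.append(content)


async def run(incoming, player=None):
    callback = player.play_audio if player is not None else None
    return [event async for event in receive(incoming, audio_output_callback=callback)]
```

With a player supplied, the event loop only sees the non-audio events, while the player receives every audio chunk in arrival order. Without a player, audio arrives in the loop like any other event.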

Samples

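There are four samples in our repo. They cover the basics with both websockets and WebRTC, as well as a more complex setup that includes function calling. Finally, there is a more complex demo that uses Azure Communication Services to let you call your Semantic Kernel enhanced realtime API.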
