Guiding an LLM's Response to Create Structured Output

Original: towardsdatascience.com/guiding-an-llms-response-to-create-structured-output-5dde0d3e426b

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/2ff653d8945615f8a4e4352606965b30.png

Image by Ricardo Gomez Angel on Unsplash

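This article will teach you how to use validation libraries in Python to structure the responses of LLMs such as GPT-4 or Llama 3.

This is a highly relevant topic: extracting structured information (for example, as JSON) is a basic requirement of data-mining tasks, where precise information has to be pulled out of unstructured formats such as free text.

Moreover, even in the most widely used commercial systems such as GPT, the structure of the response is not reliable, because LLMs are stochastic when they generate output tokens.

We will use several libraries: Pydantic and Instructor for validation and schema modeling, and OpenAI and ollama for the LLM part. The material applies both to closed-source models, such as OpenAI's GPT or Anthropic's models, and to open-source models such as Llama 3.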

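By reading this article, you will learn:

What a data model is and how to define one
How to make sure your LLM respects the output format through validation rules
How to use the Instructor and Pydantic libraries

Enjoy the read!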

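Why do we need structured output?

LLMs such as GPT-4 deliver enormous value even when their responses follow no particular schema. For programmers and data practitioners, however, it is important that the model respects a response schema when the user asks for one.

Starting with a specific version of GPT-3.5, OpenAI added the response_format parameter to its chat completions API. It lets the user set values such as json_object to steer the model toward a response better suited to the input prompt.

Here is an example: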

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "You are a helpful assistant designed to output JSON."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
)
print(response.choices[0].message.content)

>>> "content": "{\"winner\": \"Los Angeles Dodgers\"}"

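However, this approach does not always work. In fact, OpenAI's documentation recommends writing the word "JSON" explicitly in the prompt to steer GPT toward generating it. The hint matters so much that when we use response_format={ "type": "json_object" }, we are required to include the word somewhere in the prompt.

Why is it so hard for LLMs to produce consistent JSON output?

This is because LLMs are, at bottom, machines that return the token most likely to come next given the input prompt. Unless the model was explicitly guided during training to see and understand such formats, these patterns are hard to come across "in the wild".

The JSON mode of newer LLMs does not guarantee that the output matches a specific schema, only that it is valid JSON and parses without errors.

It therefore remains very important to validate the content of these outputs and to raise exceptions and errors when they do not conform to our data model.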
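The gap between "valid JSON" and "matches our schema" can be seen with nothing but the standard library. The sketch below is illustrative only: REQUIRED_KEYS and check_schema are hypothetical names introduced here, not part of any library, and they stand in for the stricter validation that Pydantic performs later in the article.

```python
import json

# Hypothetical schema: the keys we expect in every report.
REQUIRED_KEYS = {"date_of_final", "hosting_country", "winner", "top_scorers"}


def check_schema(raw: str) -> dict:
    """Parse raw JSON and raise ValueError if any expected key is missing."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


# Valid JSON that JSON mode could plausibly return, yet it ignores our schema.
llm_output = '{"winner": "Los Angeles Dodgers"}'

try:
    check_schema(llm_output)
except ValueError as err:
    print(f"Schema check failed: {err}")
```

The string parses without error, so JSON mode has kept its promise, but the schema check still fails because three of the four expected keys are absent.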

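Use case

As mentioned earlier, we will look at an example of extracting information from a simple question put to an LLM such as GPT-4 or Llama3.

We could ask anything, but we will ask the model about the winners of the soccer World Cup over the years.

In particular, we want to extract

The date of the final
The host country of the tournament
The winning team
The top scorers

We will not worry about verifying the accuracy of the data; we only want to fit the LLM's text response into the schema we are about to see.

In this article we will explore this example, and perhaps a few others.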

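Required dependencies

Let's now look at the dependencies needed to run this tutorial.

Assuming we already have a working development environment, we will install Pydantic, Instructor, the OpenAI client, and ollama.

Pydantic: the best-known library in the community for defining and validating data models, thanks to its ease of use, efficiency, and relevance to data science
Instructor: essentially a wrapper around Pydantic tailored to working with LLMs; it is the library that lets you create the validation logic
OpenAI: the well-known client for querying GPT and other OpenAI models
ollama: a very convenient interface to open-source LLMs such as llama3

In our development environment, we run the following command: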

pip install pydantic instructor openai ollama

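Since we also want to test open-source models, the next step is to install ollama system-wide. You can learn how to install and use ollama by reading this dedicated article:

How to use LLMs locally with ollama and Python

Now we can focus on development.

Data model definition

A data model is a logical schema for structuring data. Data models are used in many settings, from defining tables in a database to validating input data.

In the post below, I cover some examples of data modeling with Pydantic for data science and machine learning 👇

Improve your data models with Pydantic

Let's start by creating the Pydantic data models: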

from pydantic import BaseModel, Field
from typing import List
import datetime


class SoccerData(BaseModel):
    date_of_final: datetime.date = Field(..., description="Date of the final event")
    hosting_country: str = Field(..., description="The nation hosting the tournament")
    winner: str = Field(..., description="The soccer team that won the final cup")
    top_scorers: list = Field(..., description="A list of the top 3 scorers of the tournament")


class SoccerDataset(BaseModel):
    reports: List[SoccerData] = []

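In this script we import BaseModel and Field from Pydantic and use them to create a data model. In effect, we are building the structure that the final result must have.

Pydantic requires us to declare the type of the data entering the model. We have datetime.date, for example, which forces the date_of_final field to be a date rather than a string. Likewise, the top_scorers field must be a list; otherwise Pydantic raises a validation error.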
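To make these two rules concrete, here is a small sketch that feeds the model hand-written values instead of an LLM response. It repeats the SoccerData definition so that it runs on its own; the sample values are made up for illustration.

```python
import datetime

from pydantic import BaseModel, Field, ValidationError


class SoccerData(BaseModel):
    date_of_final: datetime.date = Field(..., description="Date of the final event")
    hosting_country: str = Field(..., description="The nation hosting the tournament")
    winner: str = Field(..., description="The soccer team that won the final cup")
    top_scorers: list = Field(..., description="A list of the top 3 scorers of the tournament")


# An ISO-formatted string is converted into a datetime.date object.
report = SoccerData(
    date_of_final="2010-07-11",
    hosting_country="South Africa",
    winner="Spain",
    top_scorers=["Player A", "Player B", "Player C"],
)
print(type(report.date_of_final))  # <class 'datetime.date'>

# A value that is not a list is rejected for top_scorers.
try:
    SoccerData(
        date_of_final="2010-07-11",
        hosting_country="South Africa",
        winner="Spain",
        top_scorers=3,
    )
except ValidationError as err:
    print(err)
```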

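Finally, we create a data model that collects multiple instances of the SoccerData model. It is called SoccerDataset, and Instructor will use it to validate the presence of multiple reports rather than just one.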
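As a rough illustration of what validating several reports at once means, the sketch below builds a SoccerDataset from a plain dictionary, the shape a parsed LLM response would have. The models are repeated (without field descriptions) so the block is self-contained, and the payload is invented for the example.

```python
import datetime
from typing import List

from pydantic import BaseModel, ValidationError


class SoccerData(BaseModel):
    date_of_final: datetime.date
    hosting_country: str
    winner: str
    top_scorers: list


class SoccerDataset(BaseModel):
    reports: List[SoccerData] = []


payload = {
    "reports": [
        {"date_of_final": "2010-07-11", "hosting_country": "South Africa",
         "winner": "Spain", "top_scorers": ["Player A", "Player B", "Player C"]},
        {"date_of_final": "2014-07-13", "hosting_country": "Brazil",
         "winner": "Germany", "top_scorers": ["Player D", "Player E", "Player F"]},
    ]
}

# Every dictionary in "reports" is validated as a SoccerData instance.
dataset = SoccerDataset(**payload)
print(len(dataset.reports))       # 2
print(dataset.reports[1].winner)  # Germany

# One malformed report (missing "winner") invalidates the whole dataset.
broken = {"reports": [{"date_of_final": "2018-07-15", "hosting_country": "Russia",
                       "top_scorers": []}]}
try:
    SoccerDataset(**broken)
except ValidationError as err:
    print(err)

# With no reports at all, the default empty list is used.
print(SoccerDataset().reports)    # []
```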

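Creating the system prompt

Very simply, we write in English what the model must do, providing an example to emphasize the intent and the structure of the result.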

system_prompt = """You are an expert sports journalist. You'll be asked to create a small report on who won the soccer world cups in specific years.
You'll report the date of the tournament's final, the top 3 scorers of the entire tournament, the winning team, and the nation hosting the tournament.
Return a JSON object with the following fields: date_of_final, hosting_country, winner, top_scorers.

If multiple years are inputted, separate the reports with a comma.

Here's an example
[
    {
        "date_of_final": "1966",
        "hosting_country": "England",
        "winner": "England",
        "top_scorers": ["Player A", "Player B", "Player C"]
    },
    {
        "date_of_final": ...
        "hosting_country": ...
        "winner": ...
        "top_scorers": ...
    },
]

Here's the years you'll need to report on:
"""

This prompt will be used as the system prompt; it simply lets us pass the years we are interested in, separated by commas.
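Since the prompt ends with "Here's the years you'll need to report on:", the user message only has to carry the years. A minimal sketch of how the two messages could be assembled follows; build_messages is a hypothetical helper introduced for illustration (the article passes the string directly), and the system prompt is abbreviated here.

```python
from typing import Dict, Iterable, List

# Abbreviated stand-in for the full system prompt defined above.
system_prompt = "You are an expert sports journalist. Here's the years you'll need to report on: "


def build_messages(years: Iterable[int]) -> List[Dict[str, str]]:
    """Return chat messages with the years joined by commas (hypothetical helper)."""
    user_prompt = ", ".join(str(year) for year in years)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages([2010, 2014, 2018])
print(messages[1]["content"])  # 2010, 2014, 2018
```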

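Creating the Instructor code

Here we build the main logic for JSON validation and structuring with the help of Instructor. It exposes an interface similar to the one OpenAI provides for calling GPT through the API.

First, we use OpenAI through a function called query_gpt, which lets us parameterize our prompt: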

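from openai import OpenAI
import instructor


def query_gpt(prompt: str) -> str:
    # Wrap the OpenAI client so that responses are validated against a Pydantic model
    client = instructor.from_openai(OpenAI(api_key="..."))
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=SoccerDataset,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
    )
    # resp is a SoccerDataset instance; serialize it to a JSON string
    return resp.model_dump_json(indent=4)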

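Remember to pass your OpenAI API key to the newly created client. We will use GPT-3.5-Turbo and pass SoccerDataset as the response_model. Of course, we could use "gpt-4o" instead if we wanted the most powerful model available at the time of writing.

We use SoccerDataset rather than SoccerData.
With SoccerData, the LLM would return only a single result.

Let's put everything together and run the program, passing the years "2010, 2014 and 2018" in the user prompt as the input for which we want structured reports.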

from openai import OpenAI
import instructor

from typing import List
from pydantic import BaseModel, Field
import datetime


class SoccerData(BaseModel):
    date_of_final: datetime.date = Field(..., description="Date of the final event")
    hosting_country: str = Field(..., description="The nation hosting the tournament")
    winner: str = Field(..., description="The soccer team that won the final cup")
    top_scorers: list = Field(..., description="A list of the top 3 scorers of the tournament")


class SoccerDataset(BaseModel):
    reports: List[SoccerData] = []


system_prompt = """You are an expert sports journalist. You'll be asked to create a small report on who won the soccer world cups in specific years.
You'll report the date of the tournament's final, the top 3 scorers of the entire tournament, the winning team, and the nation hosting the tournament.
Return a JSON object with the following fields: date_of_final, hosting_country, winner, top_scorers.

If the query is invalid, return an empty report.

If multiple years are inputted, separate the reports with a comma.

Here's an example
[
    {
        "date_of_final": "1966",
        "hosting_country": "England",
        "winner": "England",
        "top_scorers": ["Player A", "Player B", "Player C"]
    },
    {
        "date_of_final": ...
        "hosting_country": ...
        "winner": ...
        "top_scorers": ...
    },
]

Here's the years you'll need to report on:
"""


def query_gpt(prompt: str) -> str:
    # OpenAI() with no arguments reads the API key from the OPENAI_API_KEY environment variable
    client = instructor.from_openai(OpenAI())
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=SoccerDataset,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.model_dump_json(indent=4)


if __name__ == "__main__":
    resp = query_gpt("2010, 2014, 2018")
    print(resp)

Here is the result:

{
    "reports": [
        {
            "date_of_final": "2010-07-11",
            "hosting_country": "South Africa",
            "winner": "Spain",
            "top_scorers": ["Thomas Müller", "David Villa", "Wesley Sneijder"]
        },
        {
            "date_of_final": "2014-07-13",
            "hosting_country": "Brazil",
            "winner": "Germany",
            "top_scorers": ["James Rodríguez", "Thomas Müller", "Neymar"]
        },
        {
            "date_of_final": "2018-07-15",
            "hosting_country": "Russia",
            "winner": "France",
            "top_scorers": ["Harry Kane", "Antoine Griezmann", "Romelu Lukaku"]
        }
    ]
}

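Great. GPT-3.5-Turbo followed our prompt perfectly, and Instructor validated the fields, producing a structure consistent with the data model. In fact, what Instructor hands back is not the raw string an LLM such as GPT would normally return, but a validated SoccerDataset object, which the function then serializes to JSON with model_dump_json.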
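Because query_gpt returns the result of model_dump_json, the caller receives a JSON string. A short sketch of how that string could be consumed with the standard library is shown below; the raw variable is a hand-copied, shortened stand-in for the output above, not a fresh API response.

```python
import json

# Shortened stand-in for the string returned by query_gpt.
raw = """
{
    "reports": [
        {"date_of_final": "2010-07-11", "hosting_country": "South Africa",
         "winner": "Spain", "top_scorers": ["Thomas Müller", "David Villa", "Wesley Sneijder"]},
        {"date_of_final": "2014-07-13", "hosting_country": "Brazil",
         "winner": "Germany", "top_scorers": ["James Rodríguez", "Thomas Müller", "Neymar"]}
    ]
}
"""

data = json.loads(raw)

# Map the year of each final to its winner.
winners = {report["date_of_final"][:4]: report["winner"] for report in data["reports"]}
print(winners)  # {'2010': 'Spain', '2014': 'Germany'}
```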

Now let's try feeding in an input that makes no sense.

if __name__ == "__main__":
    print(query_gpt("hi, how are you?"))

>>> {"reports": []}

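The LLM correctly returns an empty report, because that is how the system prompt asks it to handle invalid queries.

Using open-source models with Instructor

We have seen how to use GPT with Instructor to obtain structured JSON output. Now let's see how to use ollama to work with open-source models such as llama3.

Remember that you need to download llama3 through ollama before you can use it.
Download it with the ollama pull llama3 command!

Let's create a new function called query_llama.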

def query_llama(prompt: str) -> str:
    client = instructor.from_openai(
        OpenAI(
            base_url="http://localhost:11434/v1",
            api_key="ollama",  # required, but not used
        ),
        mode=instructor.Mode.JSON,
    )
    resp = client.chat.completions.create(
        model="llama3",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
        response_model=SoccerDataset,
    )
    return resp.model_dump_json(indent=4)

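There are a few differences from the GPT code. Let's look at them.

ollama is called through the same interface as GPT, but the base URL (base_url) and the API key must be changed; the key has to be supplied even though its value is not used (don't ask me why)
The JSON mode must be specified explicitly through the mode parameter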
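One way to keep the two backends side by side is to isolate what differs between them. The sketch below only builds the keyword arguments for the OpenAI client and does not contact any server; client_kwargs is a hypothetical helper, and the base URL and placeholder key are the ones used in query_llama above.

```python
from typing import Dict


def client_kwargs(backend: str) -> Dict[str, str]:
    """Return keyword arguments for OpenAI(...) for a given backend (hypothetical helper)."""
    if backend == "openai":
        # With no arguments, the OpenAI client reads OPENAI_API_KEY from the environment.
        return {}
    if backend == "ollama":
        return {
            "base_url": "http://localhost:11434/v1",  # local ollama endpoint
            "api_key": "ollama",  # must be present, but the value is not used
        }
    raise ValueError(f"unknown backend: {backend}")


print(client_kwargs("ollama")["base_url"])  # http://localhost:11434/v1
```

The result could then be unpacked as OpenAI(**client_kwargs("ollama")) before wrapping the client with instructor.from_openai.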

Let's run this function:

if __name__ == "__main__":
    print(query_llama("2010, 2014, 2018"))

Here is the result:

{
    "reports": [
        {
            "date_of_final": "2010-07-11",
            "hosting_country": "South Africa",
            "winner": "Spain",
            "top_scorers": ["Thomas Müller", "Wolfram Toloi", "Landon Donovan"]
        },
        {
            "date_of_final": "2014-07-13",
            "hosting_country": "Brazil",
            "winner": "Germany",
            "top_scorers": ["James Rodríguez", "Miroslav Klose", "Thomas Müller"]
        },
        {
            "date_of_final": "2018-07-15",
            "hosting_country": "Russia",
            "winner": "France",
            "top_scorers": ["Harry Kane", "Kylian Mbappé", "Antoine Griezmann"]
        }
    ]
}