news 2026/4/18 10:52:51

Elasticsearch: How to use an LLM to extract the information you need at ingest time


张小明

Front-end Development Engineer


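In many scenarios, we can use an LLM to help extract the structured data we need. That structured data can be a classification, a set of synonyms, and so on. In my earlier article "How to automate synonyms and upload them using our Synonyms API", we showed how to use an LLM to generate synonyms and upload them to Elasticsearch. In today's example, we move the LLM extraction step into an ingest pipeline, so the required information is extracted automatically as documents are ingested.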

Create an LLM chat completion endpoint

For background, see the earlier article "Elasticsearch: Inference endpoints and semantic search demo". We can create a chat completion endpoint as follows:

PUT _inference/completion/azure_openai_completion
{
  "service": "azureopenai",
  "service_settings": {
    "api_key": "${AZURE_API_KEY}",
    "resource_name": "${AZURE_RESOURCE_NAME}",
    "deployment_id": "${AZURE_DEPLOYMENT_ID}",
    "api_version": "${AZURE_API_VERSION}"
  }
}
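The `${...}` placeholders in the request are Kibana console variables. If the same request were sent from a script instead, the values would have to be filled in by hand. A minimal Python sketch of that step follows; the function name `render_endpoint_body` and the idea of passing the values in as a mapping are assumptions made for illustration, not part of the original setup.

```python
# Names of the variables used by the request in the article.
REQUIRED_VARIABLES = (
    "AZURE_API_KEY",
    "AZURE_RESOURCE_NAME",
    "AZURE_DEPLOYMENT_ID",
    "AZURE_API_VERSION",
)


def render_endpoint_body(variables):
    """Build the endpoint request body from a mapping of variable values.

    Raises KeyError when a value is missing or empty, so a literal
    "${AZURE_API_KEY}" is never sent to the cluster by accident.
    """
    missing = [name for name in REQUIRED_VARIABLES if not variables.get(name)]
    if missing:
        raise KeyError("missing variables: " + ", ".join(missing))
    return {
        "service": "azureopenai",
        "service_settings": {
            "api_key": variables["AZURE_API_KEY"],
            "resource_name": variables["AZURE_RESOURCE_NAME"],
            "deployment_id": variables["AZURE_DEPLOYMENT_ID"],
            "api_version": variables["AZURE_API_VERSION"],
        },
    }
```

Passing `os.environ` as `variables` would read the four values from environment variables of the same names.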

Create an ingest pipeline

Before creating the pipeline, we can test it with the _simulate API. The request relies on an EXTRACTION_PROMPT variable, which we define as:

Extract audio product information from this description. Return raw JSON only. Do NOT use markdown, backticks, or code blocks. Fields: category (string, one of: Headphones/Earbuds/Speakers/Microphones/Accessories), features (array of strings from: wireless/noise_cancellation/long_battery/waterproof/voice_assistant/fast_charging/portable/surround_sound), use_case (string, one of: Travel/Office/Home/Fitness/Gaming/Studio). Description:
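Because the prompt restricts each field to a fixed set of values, a reply can be checked mechanically before it is trusted. The following is a minimal sketch of such a check; the function `validate_extraction` is hypothetical, and only the allowed values are taken from the prompt.

```python
import json

# Allowed values, copied from the prompt.
ALLOWED_CATEGORIES = {"Headphones", "Earbuds", "Speakers", "Microphones", "Accessories"}
ALLOWED_FEATURES = {
    "wireless", "noise_cancellation", "long_battery", "waterproof",
    "voice_assistant", "fast_charging", "portable", "surround_sound",
}
ALLOWED_USE_CASES = {"Travel", "Office", "Home", "Fitness", "Gaming", "Studio"}


def validate_extraction(raw):
    """Return a list of problems found in the model's reply (empty if valid)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return ["not valid JSON: " + exc.msg]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    category = data.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        problems.append("unexpected category: %r" % (category,))

    features = data.get("features")
    if not isinstance(features, list):
        problems.append("features is not an array")
    else:
        unknown = [f for f in features
                   if not isinstance(f, str) or f not in ALLOWED_FEATURES]
        if unknown:
            problems.append("unexpected features: %r" % (unknown,))

    use_case = data.get("use_case")
    if not isinstance(use_case, str) or use_case not in ALLOWED_USE_CASES:
        problems.append("unexpected use_case: %r" % (use_case,))
    return problems
```

A reply wrapped in markdown or prefixed with prose fails at the first step, which is the situation the "Return raw JSON only" instruction in the prompt is meant to prevent.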

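If you are not yet familiar with defining a variable such as EXTRACTION_PROMPT in Kibana, see my earlier article "Kibana: How to set variables and apply them".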

POST _ingest/pipeline/_simulate
{
  "description": "Use LLM to interpret messages to come out categories",
  "pipeline": {
    "processors": [
      {
        "script": {
          "source": "ctx.prompt = params.EXTRACTION_PROMPT + ctx.description",
          "params": {
            "EXTRACTION_PROMPT": "${EXTRACTION_PROMPT}"
          }
        }
      },
      {
        "inference": {
          "model_id": "azure_openai_completion",
          "input_output": {
            "input_field": "prompt",
            "output_field": "ai_response"
          }
        }
      },
      {
        "json": {
          "field": "ai_response",
          "add_to_root": true
        }
      },
      {
        "remove": {
          "field": [ "prompt", "ai_response" ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "name": "Wireless Noise-Canceling Headphones",
        "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
        "price": 299.99
      }
    }
  ]
}

Tip: you can use any LLM you like to create the endpoint above.

Running the command above returns:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "use_case": "Travel",
          "features": [ "wireless", "noise_cancellation", "long_battery" ],
          "price": 299.99,
          "name": "Wireless Noise-Canceling Headphones",
          "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
          "model_id": "azure_openai_completion",
          "category": "Headphones"
        },
        "_ingest": {
          "timestamp": "2026-01-22T13:56:11.926494Z"
        }
      }
    }
  ]
}
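To make the data flow concrete, the processor chain (script, inference, json, remove) can be mimicked in plain Python. This only illustrates the logic and is not how Elasticsearch implements it: `run_pipeline` is a hypothetical helper, `complete` stands in for the inference endpoint, and the `model_id` field that the inference processor adds to the document is left out.

```python
import json


def run_pipeline(doc, extraction_prompt, complete):
    """Mimic the script, inference, json and remove processors on one document.

    `complete` is any callable that takes a prompt string and returns the
    model's reply as a string (an assumed stand-in for the endpoint).
    """
    ctx = dict(doc)  # work on a copy of the incoming document

    # script: prepend the fixed instructions to the product description
    ctx["prompt"] = extraction_prompt + ctx["description"]

    # inference: send the prompt and keep the raw text reply
    ctx["ai_response"] = complete(ctx["prompt"])

    # json with add_to_root: parse the reply and merge its keys into the root
    parsed = json.loads(ctx["ai_response"])
    if not isinstance(parsed, dict):
        raise ValueError("the reply must be a JSON object to merge into the root")
    ctx.update(parsed)

    # remove: drop the temporary fields so they are not indexed
    for field in ("prompt", "ai_response"):
        ctx.pop(field, None)
    return ctx
```

In this sketch, a reply that is not a JSON object raises an error instead of producing a partially enriched document, which is why the prompt insists on raw JSON.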

The simulation succeeded, so we can go on to create the actual pipeline:

PUT _ingest/pipeline/product-enrichment-pipeline
{
  "processors": [
    {
      "script": {
        "source": "ctx.prompt = params.EXTRACTION_PROMPT + ctx.description",
        "params": {
          "EXTRACTION_PROMPT": "${EXTRACTION_PROMPT}"
        }
      }
    },
    {
      "inference": {
        "model_id": "azure_openai_completion",
        "input_output": {
          "input_field": "prompt",
          "output_field": "ai_response"
        }
      }
    },
    {
      "json": {
        "field": "ai_response",
        "add_to_root": true
      }
    },
    {
      "remove": {
        "field": [ "prompt", "ai_response" ]
      }
    }
  ]
}

Create the index and write data

Next, we create an index called products:

PUT products
{
  "settings": {
    "default_pipeline": "product-enrichment-pipeline"
  }
}

As shown above, we set default_pipeline, the index's default pipeline, to product-enrichment-pipeline. The pipeline is then invoked automatically whenever we index documents in the usual way:

POST _bulk
{ "index": { "_index": "products", "_id": "1" } }
{ "name": "Wireless Noise-Canceling Headphones", "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.", "price": 299.99 }
{ "index": { "_index": "products", "_id": "2" } }
{ "name": "Portable Bluetooth Speaker", "description": "Compact waterproof speaker with 360-degree surround sound. 20-hour battery life, perfect for outdoor adventures and pool parties.", "price": 149.99 }
{ "index": { "_index": "products", "_id": "3" } }
{ "name": "Studio Condenser Microphone", "description": "Professional USB microphone with noise cancellation and voice assistant compatibility. Ideal for podcasting, streaming, and home studio recording.", "price": 199.99 }

Note: depending on the speed of the LLM, the call above may take a while to complete!
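The bulk body is newline-delimited JSON: each document takes one action line followed by one source line, and the body has to end with a newline. When the products come from code rather than from the console, the body can be assembled along these lines. This is a sketch; `build_bulk_body` is a hypothetical helper, not part of any client library.

```python
import json


def build_bulk_body(index, docs):
    """Return an NDJSON bulk body for a mapping of document id to source."""
    lines = []
    for doc_id, source in docs.items():
        # Action line: index this document into `index` under `doc_id`.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        # Source line: the document itself, serialized on a single line.
        lines.append(json.dumps(source))
    # The bulk API expects the final line to be newline-terminated as well.
    return "\n".join(lines) + "\n"
```

Since the index has a default pipeline, nothing about the pipeline needs to appear in the bulk body itself.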

With the data written, we can inspect it with the following command:

GET products/_search?filter_path=**.hits

{
  "hits": {
    "hits": [
      {
        "_index": "products",
        "_id": "1",
        "_score": 1,
        "_source": {
          "use_case": "Travel",
          "features": [ "wireless", "noise_cancellation", "long_battery" ],
          "price": 299.99,
          "name": "Wireless Noise-Canceling Headphones",
          "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
          "model_id": "azure_openai_completion",
          "category": "Headphones"
        }
      },
      {
        "_index": "products",
        "_id": "2",
        "_score": 1,
        "_source": {
          "use_case": "Travel",
          "features": [ "waterproof", "surround_sound" ],
          "price": 149.99,
          "name": "Portable Bluetooth Speaker",
          "description": "Compact waterproof speaker with 360-degree surround sound. 20-hour battery life, perfect for outdoor adventures and pool parties.",
          "model_id": "azure_openai_completion",
          "category": "Speakers"
        }
      },
      {
        "_index": "products",
        "_id": "3",
        "_score": 1,
        "_source": {
          "use_case": "Studio",
          "features": [ "noise_cancellation", "voice_assistant" ],
          "price": 199.99,
          "name": "Studio Condenser Microphone",
          "description": "Professional USB microphone with noise cancellation and voice assistant compatibility. Ideal for podcasting, streaming, and home studio recording.",
          "model_id": "azure_openai_completion",
          "category": "Microphones"
        }
      }
    ]
  }
}

With structured data like this in place, we can run searches and aggregations against it.
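As one illustration of such a query, the sketch below builds a search body that keeps only products with a given feature and counts them per category. It assumes the default dynamic mapping, under which string fields such as `category` and `features` get a `.keyword` sub-field, because the products index was created without explicit mappings. The helper `build_feature_report` and the aggregation name `by_category` are made up for this example.

```python
def build_feature_report(feature, size=10):
    """Build a search body: filter by one feature, count hits per category."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {
            "bool": {
                # Exact match on the keyword sub-field of the extracted field.
                "filter": [{"term": {"features.keyword": feature}}]
            }
        },
        "aggs": {
            "by_category": {
                "terms": {"field": "category.keyword", "size": size}
            }
        },
    }
```

Sent to `products/_search`, the body returned by `build_feature_report("noise_cancellation")` would group the matching products by their extracted category.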

Happy learning!
