Naive RAG

本指南将指导您如何在基于国产算力卡 曦云 C500与 燧原 S60 ，通过 vLLM 提供推理服务，结合 LlamaIndex 框架与 Milvus Lite 本地向量数据库，实现了一套完全私有化的 Naive RAG 系统。

一、框架概览

数据层：
- 内置示例数据集：平台在内置路径中预置了 “算力市场文档” 等专业数据集。用户无需繁琐的数据准备，即可通过调用内置库实现 “一键上手”，快速验证 RAG 流程的闭环性。
- 私有数据接入：系统具备高度的灵活性，支持用户通过数据通道自行上传 PDF、Markdown 等私有文档。通过简单的路径配置，即可实现从公共知识到行业私有知识的无缝切换。
推理层：
- 核心引擎：采用 vLLM 作为推理后端，利用其 PagedAttention 技术提升并发处理能力。
- 启动策略：执行显存割让策略（如 40% 显存用于 LLM），为后续的向量化模型预留必要的计算空间。
数据存储与检索层：
- 本地向量库：集成 Milvus Lite。不同于集群版，Lite 版作为嵌入式库运行，仅通过 Python 加载即可操作本地 .db 文件，无需部署额外的数据库服务或维护复杂的容器集群，即可在单机实现百万级检索。
- 嵌入模型（Embedding）：调用内置库加载 Qwen3-Embedding 系列模型，将 PDF/文本等非结构化数据转化为高维语义向量。
逻辑编排层：
- 工作流管理：基于 LlamaIndex 框架，定义文档读取（SimpleDirectoryReader）、切分（TextSplitter）、索引构建与查询路由。
- Naive RAG 范式：遵循标准的索引 — 检索 — 生成直线型逻辑，确保系统结构简洁且易于维护。

二、沐曦 (MetaX) 部署指南

本章节适用于 曦云 C500 等沐曦系列算力卡。

1. 硬件与基础环境

算力型号：曦云 C500 (64GB)
算力主机：
- jiajia-mxc：vLLM / vllm：0.11.0 / Python 3.10 / maca 3.3.0.11
- suanfeng-mxc：vLLM / 0.10.2 / Python 3.10 / maca 3.2.1.3

2. 基础步骤

进入算力容器，启动实例后，点击 JupyterLab 进入工作台。

3. 实现步骤

3.1 下载 LlamaIndex 与 Milvus Lite 框架

创建终端窗口(Terminal)

输入代码：

pip install --target /data/llama_libs --no-deps -i https://mirrors.aliyun.com/pypi/simple/ -U \
"pymilvus==2.6.6" milvus-lite orjson minio pathspec python-dateutil pytz six \
llama-index-core llama-index-readers-file llama-index-llms-openai llama-index-llms-openai-like \
llama-index-embeddings-huggingface llama-index-vector-stores-milvus llama-index-postprocessor-sbert-rerank  \
llama-index-instrumentation llama-index-workflows llama-index-utils-workflow  \
llama-index-retrievers-bm25 rank-bm25 bm25s  PyStemmer  \
sentence-transformers pypdf docx2txt nest-asyncio ujson grpcio google-api-core protobuf banks griffe sqlalchemy dataclasses-json marshmallow typing-inspect fsspec filetype deprecated wrapt dirtyjson tenacity jinja2 pyyaml \
pandas numpy nltk tiktoken requests charset-normalizer urllib3 certifi idna sniffio anyio h11 httpcore httpx mypy_extensions typing_extensions scikit-learn scipy joblib threadpoolctl tqdm pyarrow \
ragas langchain-core langchain-openai langsmith requests_toolbelt "numpy<2.0" uuid_utils tenacity regex appdirs instructor docstring_parser langchain_community llama-index-llms-huggingface jsonpatch
pip install griffe -t /data/llama_libs
pip install tinytag -t /data/llama_libs
pip install accelerate

完成下载后，新建一个新的终端:

3.2 启动 vLLM 推理

在新的终端内输入代码：

python -m vllm.entrypoints.openai.api_server \
    --model /mnt/moark-models/Qwen3-8B \
    --gpu-memory-utilization 0.4 \
    --port 8000

当终端提示INFO： Application startup compete，则完成vLLM启动步骤。

3.3 创建并运行 Python 脚本

点击 Python File：

输入代码：

import sys
import os
import asyncio
import nest_asyncio
import torch

PRIVATE_LIB = "/data/llama_libs"
if PRIVATE_LIB not in sys.path:
    sys.path.insert(0, PRIVATE_LIB)

nest_asyncio.apply()

try:
    loop = asyncio.get_event_loop()
except RuntimeError:
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai_like import OpenAILike
from llama_index.vector_stores.milvus import MilvusVectorStore

DATA_DIR = "/mnt/moark-models/Naive_RAG" 
EMBED_PATH = "/mnt/moark-models/Qwen3-Embedding-8B"

async def run_universal_rag():
    print(" 环境初始化成功，正在配置模型...")
    
    Settings.embed_model = HuggingFaceEmbedding(
    model_name=EMBED_PATH, 
    device="cuda",
    model_kwargs={"torch_dtype": torch.float16} 
    )
    Settings.llm = OpenAILike(
    model="/mnt/moark-models/Qwen3-8B",              
    api_base="http://localhost:8000/v1", 
    api_key="fake",                  
    is_chat_model=True,              
    timeout=60.0 
    )

    if not os.path.exists(DATA_DIR):
        os.makedirs(DATA_DIR)
        print(f"提示：请确保在 {DATA_DIR} 中放入了文档")

    print(f"正在读取目录: {DATA_DIR} ...")
    reader = SimpleDirectoryReader(input_dir=DATA_DIR, recursive=True)
    documents = reader.load_data()
    print(f"成功加载文档数量: {len(documents)}")

    print("正在连接 Milvus 向量数据库...")
    vector_store = MilvusVectorStore(
        uri="./universal_rag.db", 
        dim=4096, 
        overwrite=True,
        enable_sparse=False 
    )
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    print("正在构建向量索引，这会调用 GPU 加速...")
    index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, show_progress=True)

    query_engine = index.as_query_engine()
    
    print("\n" + "="*50)
    print("RAG 系统初始化完成！现在可以开始对话了。")
    print("输入 'exit'、'quit' 或 '退出' 即可结束对话。")
    print("="*50)

    while True:
        user_query = input("\n用户的输入 >> ").strip()

        if user_query.lower() in ['exit', 'quit', '退出']:
            print("已退出对话。")
            break

        if not user_query:
            continue
        
        try:
            response = query_engine.query(user_query)
            print(f"\nAI 回答: \n{response}")
            print("-" * 30)
            
        except Exception as e:
            print(f"\n 提示：未能获取到有效回答。")
            print(f"您可以尝试：1. 换一种提问方式  2. 检查 PDF 文档内容是否覆盖该问题")
            print(f" (系统反馈: {str(e)[:50]}...) ")

if __name__ == "__main__":
    try:
        loop.run_until_complete(run_universal_rag())
    except Exception as e:
        import traceback
        print(f" 运行过程中出错: {e}")
        traceback.print_exc()

按Ctrl + S保存文件，并完成文件命名test。新建一个终端，输入python test.py，即可进入 RAG 系统。

三、燧原 (Enflame) 部署指南

本章节适用于 燧原 S60 等燧原系列算力卡。

1. 硬件与基础环境

算力型号：燧原 S60 (48GB)
算力主机：enflame-node：Ubuntu / 22.04 / Python 3.13 / ef 1.5.0.604

2. 基础步骤

进入算力容器，启动实例后，点击 JupyterLab 进入工作台。

3. 实现步骤

3.1 下载 LlamaIndex 与 Milvus Lite 框架

创建终端窗口(Terminal)

输入代码：

pip install --target /data/llama_libs --no-deps -i https://mirrors.aliyun.com/pypi/simple/ \
llama-index-core==0.10.68 llama-index-embeddings-huggingface==0.2.0 llama-index-llms-openai==0.1.27 llama-index-llms-openai-like==0.1.3 llama-index-vector-stores-milvus==0.1.10 llama-index-readers-file==0.1.30 \
milvus-lite==2.4.9 pymilvus==2.4.9 sqlalchemy pypdf docx2txt beautifulsoup4 ujson msgpack dirtyjson PyYAML nltk tqdm \
nest-asyncio fsspec typing-inspect marshmallow dataclasses-json tenacity mypy-extensions \
requests urllib3 charset-normalizer idna certifi sniffio anyio h11 httpcore httpx 

完成下载后，新建一个新的终端:

3.2 启动 vLLM 推理

在新的终端内输入代码：

vllm serve /mnt/moark-models/Qwen3-0.6B  --gpu-memory-utilization 0.4 --port 8000

当终端提示INFO： Application startup compete，则完成vLLM启动步骤。

3.3 创建并运行 Python 脚本

点击 Python File：

输入代码：

import sys
import torch
import asyncio
import nest_asyncio

PRIVATE_LIB = "/data/llama_libs"
if PRIVATE_LIB not in sys.path:
    sys.path.insert(0, PRIVATE_LIB)

nest_asyncio.apply()

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai_like import OpenAILike
from llama_index.vector_stores.milvus import MilvusVectorStore

EMBED_PATH = "/mnt/moark-models/Qwen3-Embedding-8B"
DATA_DIR = "/mnt/moark-models/Naive_RAG" 

def run_gcu_rag():

    print(f" 正在搬运 8B 模型至 GCU 显存: {EMBED_PATH}")
    
    Settings.embed_model = HuggingFaceEmbedding(
        model_name=EMBED_PATH,
        device="gcu", 
        model_kwargs={"torch_dtype": torch.float16} 
    )
    print("Embedding 模型加载成功")

    
    Settings.llm = OpenAILike(
    model="/mnt/moark-models/Qwen3-0.6B",              
    api_base="http://localhost:8000/v1", 
    api_key="fake",                  
    is_chat_model=True,              
    timeout=60.0 
    )
    print("已连接本地 vLLM 服务")

    print(f"读取目录: {DATA_DIR}")
    reader = SimpleDirectoryReader(input_dir=DATA_DIR)
    documents = reader.load_data()

    print("启动 Milvus Lite...")
    vector_store = MilvusVectorStore(uri="/data/final_test.db", dim=4096, overwrite=True)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    print("正在进行向量编码 (此时 GPU 应进入高负载状态)...")
    index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, show_progress=True)

    query_engine = index.as_query_engine()

    print("\n" + "="*50)
    print("RAG 系统初始化完成！现在可以开始对话了。")
    print("输入 'exit'、'quit' 或 '退出' 即可结束对话。")
    print("="*50)

    while True:
        user_query = input("\n用户的输入 >> ").strip()

        if user_query.lower() in ['exit', 'quit', '退出']:
            print("已退出对话。")
            break

        if not user_query:
            continue
        
        try:
            response = query_engine.query(user_query)

            print(f"\nAI 回答: \n{response}")
            print("-" * 30)
            
        except Exception as e:
            
            print(f"\n 提示：未能获取到有效回答。")
            print(f"您可以尝试：1. 换一种提问方式  2. 检查 PDF 文档内容是否覆盖该问题")

            print(f" (系统反馈: {str(e)[:50]}...) ")

if __name__ == "__main__":
    run_gcu_rag()

按Ctrl + S保存文件，并完成文件命名test。新建一个终端，输入python test.py，即可进入 RAG 系统。

一、框架概览​

二、沐曦 (MetaX) 部署指南​

1. 硬件与基础环境​

2. 基础步骤​

3. 实现步骤​

3.1 下载 LlamaIndex 与 Milvus Lite 框架​

3.2 启动 vLLM 推理​

3.3 创建并运行 Python 脚本​

三、燧原 (Enflame) 部署指南​

1. 硬件与基础环境​

2. 基础步骤​

3. 实现步骤​

3.1 下载 LlamaIndex 与 Milvus Lite 框架​

3.2 启动 vLLM 推理​

3.3 创建并运行 Python 脚本​

一、框架概览

二、沐曦 (MetaX) 部署指南

1. 硬件与基础环境

2. 基础步骤

3. 实现步骤

3.1 下载 LlamaIndex 与 Milvus Lite 框架

3.2 启动 vLLM 推理

3.3 创建并运行 Python 脚本

三、燧原 (Enflame) 部署指南

1. 硬件与基础环境

2. 基础步骤

3. 实现步骤

3.1 下载 LlamaIndex 与 Milvus Lite 框架

3.2 启动 vLLM 推理

3.3 创建并运行 Python 脚本