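Contents
- Code
- Code explanation
- 1. Imports and initialization
- 2. The Label class
- 3. The RerankedResults class
- 4. The reranking function
- Example
- A similar example
- Jinja template syntax used in the examples
- 1. Variables
- 2. Control structures
- Conditionals
- Loops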
Code
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field, field_validator, ValidationInfo

# Initialize the OpenAI client with Instructor
client = instructor.from_openai(
    OpenAI(
        api_key="your api key",
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )
)


class Label(BaseModel):
    chunk_id: str = Field(description="The unique identifier of the text chunk")
    chain_of_thought: str = Field(
        description="The reasoning process used to evaluate the relevance"
    )
    relevancy: int = Field(
        description="Relevancy score from 0 to 10, where 10 is most relevant",
        ge=0,
        le=10,
    )

    @field_validator("chunk_id")
    @classmethod
    def validate_chunk_id(cls, v: str, info: ValidationInfo) -> str:
        # The chunks passed through `context=` are exposed on info.context
        context = info.context or {}
        chunks = context.get("chunks", [])
        if v not in [chunk["id"] for chunk in chunks]:
            raise ValueError(
                f"Chunk with id {v} not found, must be one of {[chunk['id'] for chunk in chunks]}"
            )
        return v


class RerankedResults(BaseModel):
    labels: list[Label] = Field(description="List of labeled and ranked chunks")

    @field_validator("labels")
    @classmethod
    def sort_by_relevancy(cls, v: list[Label]) -> list[Label]:
        # Named so that it does not shadow BaseModel.model_validate
        return sorted(v, key=lambda x: x.relevancy, reverse=True)


def rerank_results(query: str, chunks: list[dict]) -> RerankedResults:
    return client.chat.completions.create(
        model="qwen-turbo",
        response_model=RerankedResults,
        messages=[
            {
                "role": "system",
                "content": """
                You are an expert search result ranker. Your task is to evaluate the relevance of each text chunk to the given query and assign a relevancy score.

                For each chunk:
                1. Analyze its content in relation to the query.
                2. Provide a chain of thought explaining your reasoning.
                3. Assign a relevancy score from 0 to 10, where 10 is most relevant.

                Be objective and consistent in your evaluations.
                """,
            },
            {
                "role": "user",
                "content": """
                <query>{{ query }}</query>

                <chunks_to_rank>
                {% for chunk in chunks %}
                <chunk chunk_id="{{ chunk.id }}">
                    {{ chunk.text }}
                </chunk>
                {% endfor %}
                </chunks_to_rank>

                Please provide a RerankedResults object with a Label for each chunk.
                """,
            },
        ],
        # Used both to render the Jinja templates and as the validation context
        context={"query": query, "chunks": chunks},
    )
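Code explanation
1. Imports and initialization
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field, field_validator, ValidationInfo

client = instructor.from_openai(OpenAI(...))
- Instructor wraps the OpenAI client so that a completion can be returned as a structured object (the response_model)
- Pydantic handles data validation and serialization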
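2. The Label class
class Label(BaseModel):
    chunk_id: str = Field(...)
    chain_of_thought: str = Field(...)
    relevancy: int = Field(..., ge=0, le=10)
This defines the label model for a single text chunk:
- chunk_id: the unique identifier of the text chunk
- chain_of_thought: the reasoning behind the relevance assessment
- relevancy: a relevance score from 0 to 10
It also includes a validator:
@field_validator("chunk_id")
@classmethod
def validate_chunk_id(cls, v: str, info: ValidationInfo) -> str:
    ...
The validator ensures that chunk_id exists in the list of input chunks. That list is read from info.context, which is populated from the context argument passed to the completion call.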
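The context-based check can be tried without calling any model. The following is a minimal sketch, assuming Pydantic v2; the trimmed-down Label (without chain_of_thought) and the two-chunk context are made up for illustration.

```python
from pydantic import BaseModel, Field, ValidationError, ValidationInfo, field_validator


class Label(BaseModel):
    chunk_id: str
    relevancy: int = Field(ge=0, le=10)

    @field_validator("chunk_id")
    @classmethod
    def validate_chunk_id(cls, v: str, info: ValidationInfo) -> str:
        # info.context is whatever was passed as `context=` to model_validate
        chunks = (info.context or {}).get("chunks", [])
        known_ids = [chunk["id"] for chunk in chunks]
        if v not in known_ids:
            raise ValueError(f"Chunk with id {v} not found, must be one of {known_ids}")
        return v


# Hypothetical input chunks, standing in for the real search results
context = {"chunks": [{"id": "a", "text": "first"}, {"id": "b", "text": "second"}]}

# A known id passes validation
ok = Label.model_validate({"chunk_id": "a", "relevancy": 7}, context=context)
print(ok.chunk_id)

# An unknown id is rejected with a ValidationError
try:
    Label.model_validate({"chunk_id": "zzz", "relevancy": 7}, context=context)
except ValidationError as exc:
    print(exc.errors()[0]["msg"])
```

When the model invents an id, this error message is what allows a retry to correct it, since it lists the ids that are actually valid.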
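3. The RerankedResults class
class RerankedResults(BaseModel):
    labels: list[Label]
- A container class that holds all the labels
- Includes a validator that sorts the results by relevancy score in descending order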
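The sorting behaviour can be checked in isolation. This is a small sketch, assuming Pydantic v2, with a reduced Label model and made-up scores. The validator name sort_by_relevancy is an illustrative choice; any name that does not collide with a built-in Pydantic method such as model_validate would do.

```python
from pydantic import BaseModel, Field, field_validator


class Label(BaseModel):
    chunk_id: str
    relevancy: int = Field(ge=0, le=10)


class RerankedResults(BaseModel):
    labels: list[Label]

    @field_validator("labels")
    @classmethod
    def sort_by_relevancy(cls, v: list[Label]) -> list[Label]:
        # The value returned by a field validator replaces the field value
        return sorted(v, key=lambda label: label.relevancy, reverse=True)


# Labels arrive in arbitrary order (hypothetical scores)
results = RerankedResults(
    labels=[
        {"chunk_id": "a", "relevancy": 3},
        {"chunk_id": "b", "relevancy": 9},
        {"chunk_id": "c", "relevancy": 6},
    ]
)
print([label.chunk_id for label in results.labels])  # ['b', 'c', 'a']
```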
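4. The reranking function
def rerank_results(query: str, chunks: list[dict]) -> RerankedResults:
    ...
Core functionality:
- Takes a query and a list of text chunks
- Uses an AI model to assess the relevance of each chunk
- Returns the results sorted by relevancy
System prompt:
- Casts the AI as an expert search result ranker
- States the evaluation criteria and the scoring rules
User prompt template:
- Uses Jinja2 template syntax
- Inserts the query and the text chunks dynamically
- Formats them as a structured, XML-style block
Main uses of this system:
- Intelligent ranking of text by relevance
- A transparent reasoning trace for every score
- Consistent, verifiable results: scores are constrained to 0-10 and every chunk_id is checked against the input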
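Once the sorted labels come back, a common next step is to keep only the best chunks and map their ids back to text. The following is a minimal sketch using plain dataclasses in place of the Pydantic models; top_chunks, k, and min_relevancy are hypothetical names and are not part of the original code.

```python
from dataclasses import dataclass


@dataclass
class Label:
    chunk_id: str
    relevancy: int


def top_chunks(labels, chunks, k=3, min_relevancy=5):
    """Return the texts of up to k chunks scoring at least min_relevancy, best first."""
    text_by_id = {chunk["id"]: chunk["text"] for chunk in chunks}
    ranked = sorted(labels, key=lambda label: label.relevancy, reverse=True)
    kept = [label for label in ranked if label.relevancy >= min_relevancy]
    return [text_by_id[label.chunk_id] for label in kept[:k]]


# Made-up chunks and scores for illustration
chunks = [
    {"id": "a", "text": "Exercise improves heart health."},
    {"id": "b", "text": "Gym prices vary."},
    {"id": "c", "text": "Exercise boosts mood."},
]
labels = [Label("b", 0), Label("a", 10), Label("c", 8)]

print(top_chunks(labels, chunks, k=2))
# ['Exercise improves heart health.', 'Exercise boosts mood.']
```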
Example
def main():
    # Sample query and chunks
    query = "What are the health benefits of regular exercise?"
    chunks = [
        {
            "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
            "text": "Regular exercise can improve cardiovascular health and reduce the risk of heart disease.",
        },
        {
            "id": "b2c3d4e5-f6g7-8901-bcde-fg2345678901",
            "text": "The price of gym memberships varies widely depending on location and facilities.",
        },
        {
            "id": "c3d4e5f6-g7h8-9012-cdef-gh3456789012",
            "text": "Exercise has been shown to boost mood and reduce symptoms of depression and anxiety.",
        },
        {
            "id": "d4e5f6g7-h8i9-0123-defg-hi4567890123",
            "text": "Proper nutrition is essential for maintaining a healthy lifestyle.",
        },
        {
            "id": "e5f6g7h8-i9j0-1234-efgh-ij5678901234",
            "text": "Strength training can increase muscle mass and improve bone density, especially important as we age.",
        },
    ]

    # Rerank the results
    results = rerank_results(query, chunks)

    # Print the reranked results
    print("Reranked results:")
    for label in results.labels:
        print(f"Chunk {label.chunk_id} (Relevancy: {label.relevancy}):")
        print(f"Text: {next(chunk['text'] for chunk in chunks if chunk['id'] == label.chunk_id)}")
        print(f"Reasoning: {label.chain_of_thought}")
        print()


main()
Output:
Reranked results:
Chunk a1b2c3d4-e5f6-7890-abcd-ef1234567890 (Relevancy: 10):
Text: Regular exercise can improve cardiovascular health and reduce the risk of heart disease.
Reasoning: This chunk directly discusses the health benefits of exercise, specifically improving cardiovascular health and reducing heart disease risk.

Chunk c3d4e5f6-g7h8-9012-cdef-gh3456789012 (Relevancy: 8):
Text: Exercise has been shown to boost mood and reduce symptoms of depression and anxiety.
Reasoning: This chunk talks about how exercise can boost mood and reduce symptoms of depression and anxiety, which are health benefits.

Chunk e5f6g7h8-i9j0-1234-efgh-ij5678901234 (Relevancy: 7):
Text: Strength training can increase muscle mass and improve bone density, especially important as we age.
Reasoning: Strength training's effects on muscle mass and bone density are health benefits associated with exercise.

Chunk d4e5f6g7-h8i9-0123-defg-hi4567890123 (Relevancy: 2):
Text: Proper nutrition is essential for maintaining a healthy lifestyle.
Reasoning: While nutrition is important, this chunk does not discuss the health benefits of exercise itself.

Chunk b2c3d4e5-f6g7-8901-bcde-fg2345678901 (Relevancy: 0):
Text: The price of gym memberships varies widely depending on location and facilities.
Reasoning: This chunk is about gym membership prices, which is unrelated to the health benefits of exercise.
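A similar example
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field, field_validator, ValidationInfo

# Initialize the OpenAI client
client = instructor.from_openai(
    OpenAI(
        api_key="your api key",
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )
)


class ReviewLabel(BaseModel):
    review_id: str = Field(description="The unique identifier of the review")
    chain_of_thought: str = Field(
        description="The reasoning process used to evaluate the relevance"
    )
    relevancy: int = Field(
        description="Relevancy score from 0 to 10, where 10 is most relevant",
        ge=0,
        le=10,
    )

    @field_validator("review_id")
    @classmethod
    def validate_review_id(cls, v: str, info: ValidationInfo) -> str:
        # The reviews passed through `context=` are exposed on info.context
        context = info.context or {}
        reviews = context.get("reviews", [])
        if v not in [review["id"] for review in reviews]:
            raise ValueError(
                f"Review with id {v} not found, must be one of {[review['id'] for review in reviews]}"
            )
        return v


class RankedReviews(BaseModel):
    labels: list[ReviewLabel] = Field(description="List of labeled and ranked reviews")

    @field_validator("labels")
    @classmethod
    def sort_by_relevancy(cls, v: list[ReviewLabel]) -> list[ReviewLabel]:
        # Named so that it does not shadow BaseModel.model_validate
        return sorted(v, key=lambda x: x.relevancy, reverse=True)


def rank_reviews(movie_title: str, reviews: list[dict]) -> RankedReviews:
    return client.chat.completions.create(
        model="qwen-turbo",
        response_model=RankedReviews,
        messages=[
            {
                "role": "system",
                "content": """
                You are a professional movie review analyst. Your task is to evaluate how relevant each review is to the given movie and assign a relevancy score.

                For each review:
                1. Analyze how closely the review relates to the movie.
                2. Provide reasoning that explains your score.
                3. Assign a relevancy score from 0 to 10, where 10 is most relevant.

                Be objective and consistent.
                """,
            },
            {
                "role": "user",
                "content": """
                <movie>{{ movie_title }}</movie>

                <reviews_to_rank>
                {% for review in reviews %}
                <review review_id="{{ review.id }}">
                    {{ review.text }}
                </review>
                {% endfor %}
                </reviews_to_rank>

                Please provide a RankedReviews object with a label for each review.
                """,
            },
        ],
        context={"movie_title": movie_title, "reviews": reviews},
    )


def main():
    # Sample data
    movie_title = "Titanic"
    reviews = [
        {
            "id": "rev001",
            "text": "This film portrays the tragedy of the Titanic perfectly, and the acting is deeply moving.",
        },
        {
            "id": "rev002",
            "text": "Movie ticket prices have gone up a lot lately; going to the cinema keeps getting more expensive.",
        },
        {
            "id": "rev003",
            "text": "The love story of Jack and Rose is unforgettable, and the classic scenes are still touching.",
        },
        {
            "id": "rev004",
            "text": "The popcorn at this cinema is delicious, worth trying.",
        },
        {
            "id": "rev005",
            "text": "The special effects and set reconstruction are superb and capture the luxury of that era.",
        },
    ]

    # Rank the reviews
    results = rank_reviews(movie_title, reviews)

    # Print the ranked results
    print("Ranked reviews:")
    for label in results.labels:
        print(f"Review {label.review_id} (Relevancy: {label.relevancy}):")
        print(f"Text: {next(review['text'] for review in reviews if review['id'] == label.review_id)}")
        print(f"Reasoning: {label.chain_of_thought}")
        print()


main()
Output (translated):
Ranked reviews:
Review rev001 (Relevancy: 10):
Text: This film portrays the tragedy of the Titanic perfectly, and the acting is deeply moving.
Reasoning: The review mentions the film Titanic directly and praises its portrayal of the tragedy and the acting, so it is clearly highly relevant to the movie.

Review rev003 (Relevancy: 9):
Text: The love story of Jack and Rose is unforgettable, and the classic scenes are still touching.
Reasoning: The review focuses on the love story and classic scenes in the film, which are closely tied to the themes of Titanic.

Review rev005 (Relevancy: 8):
Text: The special effects and set reconstruction are superb and capture the luxury of that era.
Reasoning: The review praises the film's special effects and set reconstruction, which relate directly to the content of Titanic.

Review rev002 (Relevancy: 2):
Text: Movie ticket prices have gone up a lot lately; going to the cinema keeps getting more expensive.
Reasoning: The review discusses rising ticket prices and has nothing to do with the film Titanic specifically, so its relevance is low.

Review rev004 (Relevancy: 1):
Text: The popcorn at this cinema is delicious, worth trying.
Reasoning: The review is about the cinema's popcorn and has no direct connection to the film itself, so its relevance is very low.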
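Jinja template syntax used in the examples
The examples rely on a few core concepts of Jinja template syntax:
1. Variables
{{ variable_name }}
Inserts the value of a variable into the template, for example:
"Hello, {{ username }}"  # if username = "Xiao Ming", output: "Hello, Xiao Ming"
2. Control structures
Conditionals
{% if condition %}
    content 1
{% else %}
    content 2
{% endif %}
Loops
{% for item in items %}
    {{ item }}
{% endfor %}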
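The three constructs above can be tried directly with the jinja2 package. This is a short sketch with made-up variable names (username, score, items), rendering each template from an inline string.

```python
from jinja2 import Template

# Variable substitution
greeting = Template("Hello, {{ username }}").render(username="Xiao Ming")

# Conditional: picks one branch based on a value passed at render time
badge = Template("{% if score >= 5 %}relevant{% else %}irrelevant{% endif %}")

# Loop: repeats the body once per item
listing = Template("{% for item in items %}[{{ item }}]{% endfor %}")

print(greeting)                          # Hello, Xiao Ming
print(badge.render(score=8))             # relevant
print(badge.render(score=2))             # irrelevant
print(listing.render(items=["a", "b"]))  # [a][b]
```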
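Main advantages of Jinja templates:
- Code reuse
- Separation of logic and presentation
- Dynamic content generation
- Safety (autoescaping, when it is enabled)
- Flexible extensibility
These features make Jinja2 one of the most popular template engines in the Python ecosystem.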
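On the safety point, note that autoescaping in Jinja2 is opt-in: a plain Environment() does not escape values unless autoescape is turned on. Here is a minimal sketch of the difference, using a made-up comment string.

```python
from jinja2 import Environment

template_source = "<p>{{ comment }}</p>"
user_input = "<script>alert(1)</script>"  # hypothetical untrusted input

# Default environment: the value is inserted as-is
plain = Environment().from_string(template_source).render(comment=user_input)

# autoescape=True: HTML special characters in the value are escaped
escaped = Environment(autoescape=True).from_string(template_source).render(comment=user_input)

print(plain)    # <p><script>alert(1)</script></p>
print(escaped)  # <p>&lt;script&gt;alert(1)&lt;/script&gt;</p>
```

For LLM prompts, as in the examples here, HTML escaping is usually not what is wanted, because the chunk text should reach the model unchanged.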
Example 1:
from instructor.templating import handle_templating
from instructor.mode import Mode

# Sample input
kwargs = {
    "messages": [
        {"role": "system", "content": "You are a professional {{ domain }} assistant"},
        {"role": "user", "content": "Please analyze the question about {{ topic }}"},
    ]
}
mode = Mode.TOOLS  # use the OpenAI message format
context = {"domain": "medical", "topic": "heart disease prevention"}

# Call the function
result = handle_templating(kwargs, mode, context)

# Print the result
print(result)
Output:
{'messages': [{'role': 'system', 'content': 'You are a professional medical assistant'}, {'role': 'user', 'content': 'Please analyze the question about heart disease prevention'}]}
Example 2:
query = "What are the health benefits of regular exercise?"
chunks = [
    {
        "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
        "text": "Regular exercise can improve cardiovascular health and reduce the risk of heart disease.",
    },
    {
        "id": "b2c3d4e5-f6g7-8901-bcde-fg2345678901",
        "text": "The price of gym memberships varies widely depending on location and facilities.",
    },
    {
        "id": "c3d4e5f6-g7h8-9012-cdef-gh3456789012",
        "text": "Exercise has been shown to boost mood and reduce symptoms of depression and anxiety.",
    },
    {
        "id": "d4e5f6g7-h8i9-0123-defg-hi4567890123",
        "text": "Proper nutrition is essential for maintaining a healthy lifestyle.",
    },
    {
        "id": "e5f6g7h8-i9j0-1234-efgh-ij5678901234",
        "text": "Strength training can increase muscle mass and improve bone density, especially important as we age.",
    },
]
kwargs = {
    "messages": [
        {
            "role": "system",
            "content": """
            You are an expert search result ranker. Your task is to evaluate the relevance of each text chunk to the given query and assign a relevancy score.

            For each chunk:
            1. Analyze its content in relation to the query.
            2. Provide a chain of thought explaining your reasoning.
            3. Assign a relevancy score from 0 to 10, where 10 is most relevant.

            Be objective and consistent in your evaluations.
            """,
        },
        {
            "role": "user",
            "content": """
            <query>{{ query }}</query>

            <chunks_to_rank>
            {% for chunk in chunks %}
            <chunk chunk_id="{{ chunk.id }}">
                {{ chunk.text }}
            </chunk>
            {% endfor %}
            </chunks_to_rank>

            Please provide a RerankedResults object with a Label for each chunk.
            """,
        },
    ]
}
context = {"query": query, "chunks": chunks}
mode = Mode.TOOLS  # use the OpenAI message format

# Call the function
handle_templating(kwargs, mode, context)
Output:
{'messages': [{'role': 'system',
   'content': '\nYou are an expert search result ranker. Your task is to evaluate the relevance of each text chunk to the given query and assign a relevancy score.\n\nFor each chunk:\n1. Analyze its content in relation to the query.\n2. Provide a chain of thought explaining your reasoning.\n3. Assign a relevancy score from 0 to 10, where 10 is most relevant.\n\nBe objective and consistent in your evaluations.\n'},
  {'role': 'user',
   'content': '\n<query>What are the health benefits of regular exercise?</query>\n\n<chunks_to_rank>\n\n<chunk chunk_id="a1b2c3d4-e5f6-7890-abcd-ef1234567890">\n Regular exercise can improve cardiovascular health and reduce the risk of heart disease.\n</chunk>\n\n<chunk chunk_id="b2c3d4e5-f6g7-8901-bcde-fg2345678901">\n The price of gym memberships varies widely depending on location and facilities.\n</chunk>\n\n<chunk chunk_id="c3d4e5f6-g7h8-9012-cdef-gh3456789012">\n Exercise has been shown to boost mood and reduce symptoms of depression and anxiety.\n</chunk>\n\n<chunk chunk_id="d4e5f6g7-h8i9-0123-defg-hi4567890123">\n Proper nutrition is essential for maintaining a healthy lifestyle.\n</chunk>\n\n<chunk chunk_id="e5f6g7h8-i9j0-1234-efgh-ij5678901234">\n Strength training can increase muscle mass and improve bone density, especially important as we age.\n</chunk>\n\n</chunks_to_rank>\n\nPlease provide a RerankedResults object with a Label for each chunk.\n'}]}
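To see roughly what this rendering step amounts to, the same effect can be approximated with jinja2 alone. This is an illustrative sketch, not Instructor's actual implementation; render_messages is a hypothetical helper, and the messages and context are shortened stand-ins for the ones above.

```python
from jinja2 import Template


def render_messages(messages, context):
    """Render each message's content as a Jinja template (illustration only)."""
    return [
        {**message, "content": Template(message["content"]).render(**context)}
        for message in messages
    ]


messages = [
    {"role": "system", "content": "You are a professional {{ domain }} assistant"},
    {
        "role": "user",
        "content": '{% for chunk in chunks %}<chunk chunk_id="{{ chunk.id }}">{{ chunk.text }}</chunk>{% endfor %}',
    },
]
context = {
    "domain": "medical",
    "chunks": [{"id": "c1", "text": "first"}, {"id": "c2", "text": "second"}],
}

rendered = render_messages(messages, context)
for message in rendered:
    print(message["role"], "->", message["content"])
```

Note that {{ chunk.id }} works even though each chunk is a dict: when attribute lookup fails, Jinja falls back to item lookup.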
Reference: https://github.com/instructor-ai/instructor/tree/main