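👁️‍🗨️ [AI Empowerment] Letting Machines "See" Web Content: Automatically Adding AI-Generated Descriptions to Images and Videos on Public Web Pages to Make Unstructured Data AI-Readable 🤖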


Published 2025-12-30 18:17 · Updated 2025-12-30 18:46

Author: Shawn Luo

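🌐 Background and Value

Web pages are full of images and videos, which are unstructured data. People understand them at a glance, but to an AI system they are a "black box" 🔒 that cannot be directly understood or retrieved. When a company builds a knowledge base or runs content analysis, these visual elements often become a blind spot.

This post shares one approach: combine RPA with a multimodal vision-language model to automatically generate accurate, structured text descriptions for the images and videos in a web page, so that unstructured data becomes AI-readable and data processing and knowledge management become more efficient. ✨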

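⚙️ Feature Overview

The solution makes web content "visible" to machines through the following steps:

🔍 Parse the HTML source and identify all image and video elements
🧠 Call a multimodal AI model to analyze each media item and generate a structured description
➕ Insert each description into the HTML right after its media element, producing an enhanced version of the content
💰 Track the cost of the API calls, for enterprise use

The processed HTML keeps the original visual experience and adds a machine-readable semantic description to every media element, bridging unstructured data and AI systems. 🌉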
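
The three steps above (find media, describe, insert) can be sketched without any API call. This is a minimal illustration, not the author's full implementation: `describe_stub` is a hypothetical stand-in for the model call, and the `ai-description` class name is an assumption made for the example. It shows why insertions are applied from the last match to the first: offsets of earlier matches stay valid.

```python
import re

IMG_PATTERN = re.compile(r'<img[^>]+src\s*=\s*["\']([^"\']+)["\'][^>]*>', re.IGNORECASE)

def describe_stub(url):
    """Hypothetical stand-in for the vision-language model call."""
    return f"placeholder description for {url}"

def annotate_images(html_content, describe=describe_stub):
    """Insert a description <div> right after every <img> tag."""
    matches = list(IMG_PATTERN.finditer(html_content))
    # Work from the last match to the first so earlier offsets are not shifted.
    for match in reversed(matches):
        note = f'<div class="ai-description">{describe(match.group(1))}</div>'
        html_content = html_content[:match.end()] + note + html_content[match.end():]
    return html_content

sample = '<p><img src="https://example.com/a.png"><img src="https://example.com/b.png"></p>'
annotated = annotate_images(sample)
print(annotated)
```

Each description ends up directly after its own image, even though the second image was processed first.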

💻 Technical Implementation

🔄 Core Processing Flow

def main(
    api_key="",
    model_name="qwen3-vl-flash",
    video_fps=0.25,
    html_content="",
    input_price_tier1=0.00015,
    input_price_tier2=0.0003,
    input_price_tier3=0.0006,
    input_tier2_threshold=32,
    input_tier3_threshold=128,
    output_price=0.003
):
    # Validate parameters, set the API key, process the HTML content
    # ...

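The program is modular and consists of the following core modules:

Media element identification: uses regular expressions to match <img>, <video>, and <source> tags in the HTML and extract valid HTTP/HTTPS URLs
Multimodal AI call: builds a different prompt for each media type and calls the vision-language model to generate the description
Description insertion: inserts the generated description into the HTML as a styled <div> right after the media tag
Cost calculation: computes the API call cost, with support for tiered pricing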
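
The source in this post identifies media with regular expressions. As an alternative sketch (not the author's method), the standard-library `html.parser` can collect the same URLs while telling a `<source>` inside `<video>` apart from one inside `<picture>`. The `MediaCollector` class and the sample markup are assumptions made for illustration; the sketch only collects URLs and does not track insertion positions.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class MediaCollector(HTMLParser):
    """Collects (type, url) pairs for <img>, <video src> and <source> inside <video>."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._video_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "video":
            self._video_depth += 1
        src = attrs.get("src")
        if not src:
            return
        if tag == "img":
            self._add("image", src)
        elif tag == "video" or (tag == "source" and self._video_depth > 0):
            self._add("video", src)

    def handle_endtag(self, tag):
        if tag == "video" and self._video_depth > 0:
            self._video_depth -= 1

    def _add(self, media_type, url):
        # Same rule as is_valid_url in the source: absolute http/https URLs only.
        parsed = urlparse(url)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            self.items.append((media_type, url))

collector = MediaCollector()
collector.feed(
    '<img src="https://example.com/a.png">'
    '<img data-src="lazy.png" src="/relative.png">'
    '<video controls><source src="https://example.com/v.mp4" type="video/mp4"></video>'
    '<picture><source srcset="x.webp"><img src="https://example.com/b.png"></picture>'
)
print(collector.items)
```

Relative URLs are skipped here for the same reason as in the source: the model service needs a publicly reachable address.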

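🎨 AI Description Generation Strategy

The prompts differ by media type:

🖼️ Image descriptions: focus on key objects, scene type, visible text, and key details; the description must be concise (within 100 characters)
🎥 Video descriptions: emphasize the content sequence, how objects interact and change, key points on the timeline, and scene transitions; more detail is allowed (within 800 characters)

All descriptions are presented as objective, structured knowledge and avoid subjective speculation, so that AI systems can understand and retrieve them accurately.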

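def call_vl_model(api_key, model_name, media_url, media_type, video_fps=2.0):
    """
    Call the vision-language model API to get a description of the media item
    """
    # Build a different prompt for each media type
    if media_type == "image":
        prompt = ("Extract knowledge from this image objectively and precisely, and produce a concise structured description suitable for building a knowledge base...")
    else:  # video
        prompt = ("Extract knowledge from this video systematically, and produce a concise structured description suitable for building a knowledge base...")
    
    # Call the multimodal model API
    # ...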
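📥 Upstream Scenarios: Getting the Content to Process

The solution can be integrated into several kinds of RPA workflow:

🌐 Post-processing after web scraping: once the RPA flow has scraped a page, pass the source to this function to enhance its media elements automatically
🎯 Targeted element processing: use the element locating feature of Yingdao RPA (影刀) to get the HTML source of a specific region and process just that part
📚 Batch document processing: enhance internal HTML documents in bulk to build a searchable knowledge base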

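# Example: get the HTML content from a web page
web_page_html = xbot.web.get_page_source()
# main() returns a dict: "processed_html" plus "total_cost" on success or "error" on failure
result = main(api_key="YOUR_API_KEY", html_content=web_page_html)
processed_html = result["processed_html"]
cost = result.get("total_cost", 0.0)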

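📤 Downstream Scenarios: Using the Enhanced Content

The enhanced HTML can be used in several ways:

🧠 Knowledge base construction: convert the enhanced page to Markdown, keeping the AI-generated descriptions, to build a comprehensive, searchable enterprise knowledge base
♿ Accessibility: give visually impaired users a richer screen-reader experience, with AI descriptions filling in the missing visual information
🔍 Content moderation: automated review systems can "understand" image and video content, which can improve detection of violating content
📈 SEO: give search engines richer semantic information about the page, which can improve its visibility in search results
📊 Data mining: in public-opinion analysis, market research, and similar work, capture the visual information in web pages and avoid data blind spots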

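# Example: save the processed HTML as an enhanced web page
xbot.file.write_text("enhanced_page.html", processed_html)

# Further processing: convert to a Markdown knowledge base
# (convert_html_to_markdown is not defined in this post; supply your own converter)
markdown_content = convert_html_to_markdown(processed_html)
xbot.file.write_text("knowledge_base.md", markdown_content)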
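
The post does not include `convert_html_to_markdown`, so the following is only one possible sketch of such a converter, built on the standard-library `html.parser`. It handles headings, images, and block-level text and nothing more; the class name and the set of handled tags are assumptions, and a production workflow would more likely use a dedicated conversion library.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "section", "article"}
SKIP_TAGS = {"script", "style"}

class SimpleMarkdownConverter(HTMLParser):
    """Very small HTML-to-Markdown converter: headings, images, block text."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "img" and attrs.get("src"):
            self.parts.append(f"\n![{attrs.get('alt') or ''}]({attrs['src']})\n")
        elif tag in HEADINGS:
            self.parts.append("\n" + "#" * int(tag[1]) + " ")
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in HEADINGS or tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.parts.append(text + " ")

def convert_html_to_markdown(html_text):
    converter = SimpleMarkdownConverter()
    converter.feed(html_text)
    converter.close()
    # Normalize whitespace per line and separate blocks with blank lines.
    lines = (" ".join(line.split()) for line in "".join(converter.parts).splitlines())
    return "\n\n".join(line for line in lines if line)

sample_html = (
    '<h1>Product page</h1><p>Intro text.</p>'
    '<img src="https://example.com/a.png">'
    '<div style="font-size: 0.85em"><strong>image description:</strong> A red logo.</div>'
)
markdown_text = convert_html_to_markdown(sample_html)
print(markdown_text)
```

The description `<div>` inserted by the main script survives as a plain paragraph directly under its image, which is what keeps the Markdown searchable.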

⚙️ Environment Setup

📦 Required Imports

import xbot
from xbot import print, sleep
import re
import json
import dashscope
from dashscope import MultiModalConversation
from urllib.parse import urlparse

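🔧 External packages to install in Yingdao (影刀)

dashscope

urllib3

(The urlparse function used in the code comes from urllib.parse in the Python standard library.)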

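⚙️ Configuration Notes

🔑 API key: a valid API key for the multimodal model is required
🤖 Model name: vision-language models such as qwen3-vl-flash are supported
⏱️ Video parameter: video_fps controls how often frames are sampled from a video (one frame every 1/video_fps seconds; the code accepts 0.1 to 10), trading accuracy against cost
💵 Price parameters: set the tiered prices to match the API's pricing (prices are per thousand tokens) so that cost is calculated correctly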
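
To get a feel for the video_fps trade-off, a rough estimate of how many frames a setting samples is enough. This is a back-of-the-envelope sketch: `estimate_sampled_frames` is a hypothetical helper, and the actual number of frames (and tokens) the service uses may differ because of its own limits and sampling rules.

```python
import math

def estimate_sampled_frames(duration_seconds, video_fps):
    """Rough frame count when one frame is taken every 1/video_fps seconds."""
    if not 0.1 <= video_fps <= 10:
        raise ValueError("video_fps must be between 0.1 and 10")
    return max(1, math.floor(duration_seconds * video_fps))

low = estimate_sampled_frames(60, 0.25)   # one frame every 4 seconds
high = estimate_sampled_frames(60, 2.0)   # two frames per second
print(low, high)
```

For a one-minute clip the default of 0.25 samples about an eighth of the frames that 2.0 would, which is why it is the cheaper choice for slow-moving content.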

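💰 Cost Optimization Strategies

To keep API costs under control, the solution applies several measures:

♻️ Avoid reprocessing: elements that already have a description are detected and skipped
🎯 Tiered processing: use a high-accuracy model for key images and a lightweight model for decorative ones (the source below takes a single model_name per run, so this means running separate passes)
🧮 Cost accounting: the processing cost is tallied as it runs and returned, which helps with budgeting
⏱️ Request rate control: a built-in 1-second pause between requests helps avoid API rate limits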

# Tiered price calculation
def calculate_token_cost(input_tokens, output_tokens, 
                        input_price_tier1, input_price_tier2, input_price_tier3,
                        input_tier2_threshold, input_tier3_threshold,
                        output_price):
    # Compute the cost under tiered pricing
    # ...
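
A worked example makes the tier arithmetic concrete. The sketch below assumes the thresholds are expressed in thousands of tokens (32 meaning 32K) and that each slice of the input is billed at its own tier's price, which is how the source code accumulates cost. `tiered_input_cost` is a hypothetical helper; some providers instead bill the whole request at the rate of the tier its input length falls into, so check the pricing page before relying on the numbers.

```python
def tiered_input_cost(input_tokens, prices_per_1k, thresholds_k):
    """Marginal tiered pricing: each slice of tokens is billed at its own tier price."""
    tier2_start = thresholds_k[0] * 1000
    tier3_start = thresholds_k[1] * 1000
    slices = [
        min(input_tokens, tier2_start),
        min(max(input_tokens - tier2_start, 0), tier3_start - tier2_start),
        max(input_tokens - tier3_start, 0),
    ]
    return sum(tokens * price / 1000.0 for tokens, price in zip(slices, prices_per_1k))

prices = (0.00015, 0.0003, 0.0006)   # per thousand tokens, as in main()
thresholds = (32, 128)               # in thousands of tokens (assumption)

# 40,000 input tokens: 32,000 at tier 1 plus 8,000 at tier 2
cost = tiered_input_cost(40_000, prices, thresholds)
print(f"{cost:.6f}")
```

For 40,000 tokens the first 32,000 cost 0.0048 and the remaining 8,000 cost 0.0024, for 0.0072 in total.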

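🌈 Summary and Outlook

By combining RPA with a multimodal AI model, this solution makes the images and videos in web pages readable by AI systems. In practice it speeds up knowledge base construction and also supports scenarios such as content moderation, accessibility, and SEO. 🚀

Going forward, we plan to improve the following:

📁 Support for local media files
📊 Automatic evaluation and improvement of description quality
🌍 Description generation in multiple languages
🧩 Integration into the standard component library of Yingdao RPA (影刀), lowering the barrier to use

As AI and automation converge, having machines "understand" visual content is no longer out of reach, and work like this moves us toward a smarter, more accessible digital world. ✨

Note: this solution depends on the capabilities of the vision-language model, so results in practice track the model's performance closely. The code was written with stability and cost control for enterprise use in mind; adjust the parameters to suit your scenario.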

Source code:

import xbot
from xbot import print, sleep
import re
import json
import dashscope
from dashscope import MultiModalConversation
from urllib.parse import urlparse

# Default timeout in seconds
API_TIMEOUT = 300

def calculate_token_cost(input_tokens, output_tokens, 
                        input_price_tier1, input_price_tier2, input_price_tier3,
                        input_tier2_threshold, input_tier3_threshold,
                        output_price):
    """
    Compute the token cost using tiered prices (prices are per thousand tokens).
    Assumption: the thresholds are given in thousands of tokens (32 means 32K),
    matching the defaults in main(); they are converted to token counts here.
    """
    tier2_start = input_tier2_threshold * 1000
    tier3_start = input_tier3_threshold * 1000
    
    # Input token cost
    input_cost = 0.0
    
    # Tier 1: tokens up to the tier-2 threshold
    if tier2_start > 0:
        tier1_tokens = min(input_tokens, tier2_start)
        input_cost += tier1_tokens * input_price_tier1 / 1000.0
        input_tokens -= tier1_tokens
    
    # Tier 2: tokens between the tier-2 and tier-3 thresholds
    if tier3_start > 0 and input_tokens > 0:
        tier2_tokens = min(input_tokens, max(tier3_start - tier2_start, 0))
        input_cost += tier2_tokens * input_price_tier2 / 1000.0
        input_tokens -= tier2_tokens
    
    # Tier 3: everything above the tier-3 threshold
    if input_tokens > 0:
        input_cost += input_tokens * input_price_tier3 / 1000.0
    
    # Output token cost
    output_cost = output_tokens * output_price / 1000.0
    
    return input_cost + output_cost

def is_valid_url(url):
    """Check whether the value is a valid HTTP/HTTPS URL"""
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc]) and result.scheme in ['http', 'https']
    except Exception as e:
        print(f"URL validation failed: {url}, error: {str(e)}")
        return False

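def call_vl_model(api_key, model_name, media_url, media_type, video_fps=2.0):
    """
    Call the Qwen-VL (通义千问VL) model API to get a description of the media item
    Returns: (description, usage) or (None, None)
    """
    # Validate the video frame-sampling rate
    if media_type == "video":
        if not isinstance(video_fps, (int, float)) or video_fps < 0.1 or video_fps > 10:
            print(f"Warning: invalid video frame rate {video_fps}, using default 2.0")
            video_fps = 2.0
    
    # Build the prompt
    if media_type == "image":
        prompt = ("Extract knowledge from this image objectively and precisely, and produce a concise structured description suitable for building a knowledge base. Requirements: "
                  "1. Identify and describe all key objects and people and the relationships between them; "
                  "2. Record the scene or environment type accurately; "
                  "3. Extract any visible text; "
                  "4. Describe key details: colors, shapes, layout, and distinctive features; "
                  "5. Stay objective and accurate; do not speculate or add information that is not in the image; "
                  "6. Present the result as knowledge facts that an AI system can easily understand and retrieve; "
                  "7. Do not include anything beyond what is in the image, and keep the total length within 100 characters. "
                  "Respond in Chinese, and make sure the description is complete and has knowledge value.")
    else:  # video
        prompt = ("Extract knowledge from this video systematically, and produce a concise structured description suitable for building a knowledge base. Requirements: "
                  "1. Summarize the core content and the main sequence of activities; "
                  "2. Identify key objects and people and how they interact and change over the whole video; "
                  "3. Record important changes on the timeline and the content of key frames; "
                  "4. Extract any visible or audible text and speech; "
                  "5. Describe scene transitions, environment features, and distinctive visual elements; "
                  "6. Stay objective and accurate; describe only what the video clearly shows; "
                  "7. Organize the result as knowledge facts that an AI can easily understand and retrieve; "
                  "8. Do not include anything beyond what is in the video, and keep the total length within 800 characters. "
                  "Respond in Chinese, and make sure the description is structured and useful for retrieval.")
    
    try:
        # Build the multimodal request content
        content = []
        if media_type == "image":
            content.append({"image": media_url})
        elif media_type == "video":
            content.append({"video": media_url, "fps": video_fps})
        
        content.append({"text": prompt})
        
        messages = [{
            "role": "user",
            "content": content
        }]
        
        print(f"Calling model: {model_name}")
        print(f"Media type: {media_type}, URL: {media_url}")
        if media_type == "video":
            print(f"Video frame rate: {video_fps} (one frame every {1/video_fps:.1f} s)")
        
        # Call the DashScope API
        response = MultiModalConversation.call(
            model=model_name,
            messages=messages,
            api_key=api_key,
            timeout=API_TIMEOUT
        )
        
        # Handle the response
        if response.status_code == 200:
            description = response.output.choices[0].message.content[0]["text"].strip()
            usage = response.usage  # token usage for this call
            
            print(f"AI-generated {media_type} description: {description}")  
            return description, usage
        else:
            error_msg = f"API request failed, status code: {response.status_code}"
            if hasattr(response, 'message'):
                error_msg += f", details: {response.message}"
            elif hasattr(response, 'body') and 'message' in response.body:
                error_msg += f", details: {response.body['message']}"
            print(error_msg)
            return None, None
            
    except Exception as e:
        print(f"API call raised an exception: {str(e)}")
        import traceback
        traceback.print_exc()
        return None, None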

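def insert_description(html_content, end_pos, description, media_type):
    """
    Insert the description text right after the given position
    """
    from html import escape  # local import keeps this function self-contained
    
    # Build the description HTML; the model output is escaped so that any
    # "<", ">" or "&" it contains cannot break the surrounding markup
    description_html = (
        f'<div style="font-size: 0.85em; color: #666; margin: 8px 0 8px 12px; '
        f'padding-left: 10px; border-left: 2px solid #e0e0e0;">'
        f'<strong>{media_type} description:</strong> {escape(description)}'
        f'</div>'
    )
    
    return html_content[:end_pos] + description_html + html_content[end_pos:]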

def process_html_content(api_key, model_name, video_fps, html_content,
                        input_price_tier1, input_price_tier2, input_price_tier3,
                        input_tier2_threshold, input_tier3_threshold,
                        output_price):
    """
    Process the HTML content, adding AI-generated descriptions to images and videos
    Returns: (processed_html, total_cost)
    """
    # Collect all media elements
    media_items = []
    
    # Match images (single or double quotes)
    img_pattern = r'<img[^>]+src\s*=\s*["\']([^"\']+)["\'][^>]*>'
    for match in re.finditer(img_pattern, html_content, re.IGNORECASE):
        img_url = match.group(1)
        if is_valid_url(img_url):
            media_items.append({
                'type': 'image',
                'url': img_url,
                'end_pos': match.end()
            })
        else:
            print(f"Skipping invalid image URL: {img_url}")
    
    # Match videos (both <video> and <source> tags)
    video_patterns = [
        r'<video[^>]+src\s*=\s*["\']([^"\']+)["\'][^>]*>',
        r'<source[^>]+src\s*=\s*["\']([^"\']+)["\'][^>]*>'
    ]
    
    for pattern in video_patterns:
        for match in re.finditer(pattern, html_content, re.IGNORECASE):
            video_url = match.group(1)
            if is_valid_url(video_url):
                media_items.append({
                    'type': 'video',
                    'url': video_url,
                    'end_pos': match.end()
                })
            else:
                print(f"Skipping invalid video URL: {video_url}")
    
    if not media_items:
        print("No valid media elements found")
        return html_content, 0.0
    
    print(f"Found {len(media_items)} media elements to process")
    
    # Sort from last position to first (so insertions do not shift earlier offsets)
    media_items.sort(key=lambda x: x['end_pos'], reverse=True)
    
    processed_html = html_content
    processed_count = 0
    skipped_count = 0
    total_cost = 0.0  # running total of the cost
    
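    for item in media_items:
        # Skip items that already have a description (avoids reprocessing).
        # insert_description() places the <div> immediately after the tag, so only the
        # text starting at end_pos is checked; a wider window would also match the
        # description of an adjacent element and skip this one by mistake.
        if processed_html.startswith('<div style="font-size: 0.85em', item['end_pos']):
            print(f"Skipping already-processed {item['type']}: {item['url']}")
            skipped_count += 1
            continue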
            
        print(f"[{processed_count+1}/{len(media_items)}] Processing {item['type']}: {item['url']}")
        
        # Get the description and token usage
        description, usage = call_vl_model(
            api_key=api_key,
            model_name=model_name,
            media_url=item['url'],
            media_type=item['type'],
            video_fps=video_fps if item['type'] == 'video' else None
        )
        
        if description and usage:
            # Compute the cost of this call
            cost = calculate_token_cost(
                input_tokens=usage.input_tokens,
                output_tokens=usage.output_tokens,
                input_price_tier1=input_price_tier1,
                input_price_tier2=input_price_tier2,
                input_price_tier3=input_price_tier3,
                input_tier2_threshold=input_tier2_threshold,
                input_tier3_threshold=input_tier3_threshold,
                output_price=output_price
            )
            total_cost += cost
            print(f"Cost of this call: ¥{cost:.6f} (input: {usage.input_tokens} tokens, output: {usage.output_tokens} tokens)")
            
            # Insert the description after the tag
            processed_html = insert_description(
                processed_html, 
                item['end_pos'], 
                description, 
                item['type']
            )
            processed_count += 1
        else:
            print(f"Skipping {item['type']} {item['url']}: no valid description or token usage returned")
        
        # Pause between requests to avoid hitting the API too frequently
        sleep_time = 1
        print(f"Waiting {sleep_time:.1f} s before the next media item...")
        sleep(sleep_time)
    
    print(f"Done! Processed: {processed_count}, skipped: {skipped_count}, failed: {len(media_items)-processed_count-skipped_count}")
    print(f"Total cost: ¥{total_cost:.6f}")
    return processed_html, total_cost

def main(
    api_key="",  # fill in your API key; an empty value is rejected by the check below
    model_name="qwen3-vl-flash",
    video_fps=0.25,
    html_content="",
    input_price_tier1=0.00015,
    input_price_tier2=0.0003,
    input_price_tier3=0.0006,
    input_tier2_threshold=32,
    input_tier3_threshold=128,
    output_price=0.003
):
    
    # Validate parameters
    if not api_key:
        error_msg = "Error: no valid API key provided, cannot call the AI model"
        print(error_msg)
        return {"processed_html": html_content, "error": error_msg}
    
    if not html_content:
        error_msg = "Warning: no HTML content provided"
        print(error_msg)
        return {"processed_html": "", "error": error_msg}
    
    # Validate the video frame-sampling rate
    try:
        video_fps = float(video_fps)
        if video_fps < 0.1 or video_fps > 10:
            print(f"Warning: invalid video frame rate {video_fps}, using default 2.0")
            video_fps = 2.0
    except (TypeError, ValueError):
        print(f"Warning: invalid video frame rate format, using default 2.0")
        video_fps = 2.0
    
    print(f"Starting HTML processing with model: {model_name}")
    print(f"Video frame rate: {video_fps} (one frame every {1/video_fps:.1f} s)")
    print(f"Price settings: input (¥{input_price_tier1}/¥{input_price_tier2}/¥{input_price_tier3} per thousand tokens), "
          f"output (¥{output_price} per thousand tokens)")
    
    # Set the DashScope API key
    dashscope.api_key = api_key
    
    # Process the HTML content
    try:
        result_html, total_cost = process_html_content(
            api_key=api_key,
            model_name=model_name,
            video_fps=video_fps,
            html_content=html_content,
            input_price_tier1=input_price_tier1,
            input_price_tier2=input_price_tier2,
            input_price_tier3=input_price_tier3,
            input_tier2_threshold=input_tier2_threshold,
            input_tier3_threshold=input_tier3_threshold,
            output_price=output_price
        )
        
        # Keep 6 decimal places; round further when displaying
        return {
            "processed_html": result_html,
            "total_cost": round(total_cost, 6)
        }
    except Exception as e:
        error_msg = f"An error occurred during processing: {str(e)}"
        print(error_msg)
        import traceback
        traceback.print_exc()
        return {"processed_html": html_content, "error": error_msg}

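Author: Shawn Luo
Last updated: December 30, 2025
Version: v1.2 (adds cost calculation and video analysis)

💬 Discussion: share your use cases or any problems you run into in the comments, and let's explore more of what AI + RPA can do!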