OmniParser: An AI Tool for Parsing Screens

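OmniParser is a purely vision-based interface parsing tool from Microsoft. It converts UI screenshots into structured elements, which improves the accuracy of the actions GPT-4V generates. It supports local logging of trajectories and, through OmniTool, automated operation on Windows 11, and it can be combined with a variety of vision models for efficient interface control. The project is written in Python, is open source on GitHub under the CC-BY-4.0 license, has 22.1k stars, and is actively maintained by its community.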
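To make "structured elements" more concrete, here is a minimal sketch of how a parsed screenshot might be represented and rendered as text for a model prompt. The `UIElement` class, its fields, and the `to_prompt` helper are hypothetical names chosen for illustration. They are not OmniParser's actual output schema or API.

```python
from dataclasses import dataclass


@dataclass
class UIElement:
    """Hypothetical record for one parsed element (not OmniParser's real schema)."""
    element_id: int
    kind: str          # assumed categories, e.g. "icon" or "text"
    description: str   # caption or recognized text
    bbox: tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed normalized to 0..1
    interactable: bool


def to_prompt(elements: list[UIElement]) -> str:
    """Render elements as numbered lines that could be appended to a model prompt."""
    lines = []
    for e in elements:
        tag = "clickable" if e.interactable else "static"
        box = ", ".join(f"{v:.2f}" for v in e.bbox)
        lines.append(f"[{e.element_id}] {e.kind} ({tag}) '{e.description}' at ({box})")
    return "\n".join(lines)


# Made-up example data, only to show the shape of the result.
elements = [
    UIElement(0, "text", "Settings", (0.05, 0.02, 0.20, 0.06), False),
    UIElement(1, "icon", "search button", (0.90, 0.02, 0.95, 0.06), True),
]
print(to_prompt(elements))
```

Giving each element a numeric ID lets a model answer with something like "click element 1" instead of guessing pixel coordinates, which is the general idea behind parsing the screen before asking for an action.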

Source code: https://github.com/microsoft/OmniParser

Installation

First clone the repo, then set up the environment:

git clone https://github.com/microsoft/OmniParser.git
cd OmniParser
conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt

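Make sure the V2 weights have been downloaded into the weights folder (and that the caption weights folder is named icon_caption_florence). If not, download them with the following commands: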

# download the model checkpoints to local directory OmniParser/weights/
for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights; done
mv weights/icon_caption weights/icon_caption_florence
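After downloading, it can help to confirm that every file ended up where the rest of the setup expects it. The following is a minimal sketch that checks for the file names listed in the download command above, assuming it is run from the OmniParser directory. The helper name `missing_weight_files` is made up for this example and is not part of the repository.

```python
from pathlib import Path

# File names taken from the download command above, after the
# icon_caption -> icon_caption_florence rename.
EXPECTED_FILES = {
    "icon_detect": ["train_args.yaml", "model.pt", "model.yaml"],
    "icon_caption_florence": ["config.json", "generation_config.json", "model.safetensors"],
}


def missing_weight_files(weights_dir="weights"):
    """Return relative paths of expected weight files that are not on disk."""
    root = Path(weights_dir)
    missing = []
    for folder, names in EXPECTED_FILES.items():
        for name in names:
            if not (root / folder / name).is_file():
                missing.append(f"{folder}/{name}")
    return missing


if __name__ == "__main__":
    missing = missing_weight_files()
    if missing:
        print("Missing weight files:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected weight files are present.")
```

If the report lists everything under icon_caption_florence as missing, the most likely cause is that the rename step was skipped and the folder is still called icon_caption.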

Examples:

We have put together a few simple examples in demo.ipynb.

Gradio Demo

To run the Gradio demo, simply run:

python gradio_demo.py

Model Weights License

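For the model checkpoints on the Hugging Face model hub, note that the icon_detect model is under the AGPL license, because that license is inherited from the original YOLO model. icon_caption_blip2 and icon_caption_florence are under the MIT license. Please refer to the LICENSE file in each model's folder: https://huggingface.co/microsoft/OmniParser.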

📚 Citation

Our technical report is available at https://arxiv.org/abs/2408.00203. If you find our work useful, please consider citing it:

@misc{lu2024omniparserpurevisionbased,
      title={OmniParser for Pure Vision Based GUI Agent}, 
      author={Yadong Lu and Jianwei Yang and Yelong Shen and Ahmed Awadallah},
      year={2024},
      eprint={2408.00203},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.00203}, 
}

Original article by Libre Depot, published by Libre Depot. Please credit the source when reposting: https://www.libredepot.top/zh/5588.html