Octopus v3

Compact (Sub-Billion) Multimodal Action Model for On-Device AI Agents


TL;DR

  • Octopus v3 has fewer than 1B parameters, less than half the size of the 2B-parameter Octopus v2. It runs efficiently on edge devices, including ones as small as a Raspberry Pi.
  • Using functional tokens, Octopus v3 achieves function-calling accuracy on par with GPT-4V and GPT-4.
  • Octopus v3 supports both text and image inputs.
  • Octopus v3 understands both English and Chinese.

Introduction

Octopus v3 represents a significant advancement in on-device multimodal AI and AI agent applications. While its predecessor, Octopus v2, focused on text-based interactions and outperformed GPT-4 in function-calling speed and accuracy, v3 expands capabilities to include visual inputs and multilingual support.

This blog explores Octopus v3's technical aspects and assesses its performance in real-world scenarios, comparing this compact, on-device model to larger, cloud-based alternatives like GPT-4V.


Training Methods

We developed Octopus v3 with these key techniques to add multimodal capabilities while keeping the model compact:

  • We use CLIP (Contrastive Language-Image Pre-training) for image encoding. CLIP's alignment of visual and textual information makes it well suited to processing images alongside text in a multimodal model (see the encoder sketch after this list).
  • We retained the functional-token approach from Octopus v2, which represents each function as a dedicated token, letting the model select and execute a wide range of actions efficiently and responsively (a functional-token sketch also follows the list).
  • Our process begins with training the causal language model and the image encoder separately. We then merge the two components and perform alignment training to synchronize image and text processing. After integrating functional-token learning from Octopus v2, we apply reinforcement learning with another large language model as the reward model, which refines the model's ability to turn multimodal (vision + text) input into concrete actions.
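
To make the merge step concrete, here is a minimal sketch of how a CLIP vision encoder's patch embeddings might be projected into a sub-billion causal language model's embedding space. The checkpoints, the `ImageProjector` class, and the single linear projection are illustrative assumptions; this post does not specify Octopus v3's actual backbone or projection design.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPImageProcessor, CLIPVisionModel

# Illustrative checkpoints only -- the actual Octopus v3 components are not named here.
VISION_CKPT = "openai/clip-vit-base-patch32"
LM_CKPT = "Qwen/Qwen1.5-0.5B"  # stand-in for a sub-billion causal LM

class ImageProjector(nn.Module):
    """Maps CLIP patch embeddings into the language model's embedding space."""

    def __init__(self, clip_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Linear(clip_dim, lm_dim)

    def forward(self, clip_hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(clip_hidden)

vision = CLIPVisionModel.from_pretrained(VISION_CKPT)
processor = CLIPImageProcessor.from_pretrained(VISION_CKPT)
lm = AutoModelForCausalLM.from_pretrained(LM_CKPT)
projector = ImageProjector(vision.config.hidden_size, lm.config.hidden_size)

def encode_image(image) -> torch.Tensor:
    """Returns image tokens shaped like word embeddings: (1, num_patches + 1, lm_dim)."""
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        patches = vision(pixel_values).last_hidden_state  # (1, num_patches + 1, clip_dim)
    return projector(patches)

# During alignment training, these projected image tokens are concatenated with
# text token embeddings and fed to the causal LM as one sequence, so the LM
# learns to attend over both modalities.
```

A single linear layer is the simplest alignment choice; other vision-language models use a similar projector before joint fine-tuning.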
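And here is a minimal sketch of the functional-token idea carried over from Octopus v2. The `<nexa_i>` token shape follows the Octopus v2 paper, but the registry contents and the `dispatch` helper are hypothetical:

```python
from transformers import AutoTokenizer

# Hypothetical mapping from functional tokens to device actions.
FUNCTION_REGISTRY = {
    "<nexa_0>": "send_email",
    "<nexa_1>": "send_text",
    "<nexa_2>": "web_search",
}

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B")  # same stand-in LM

# Each function becomes one new token, so choosing an action costs the model a
# single prediction step instead of spelling out a function name.
tokenizer.add_special_tokens(
    {"additional_special_tokens": list(FUNCTION_REGISTRY) + ["<nexa_end>"]}
)
# The LM's embedding matrix must grow to match the enlarged vocabulary:
# lm.resize_token_embeddings(len(tokenizer))

def dispatch(generated: str):
    """Parse a generation like "<nexa_0>(to=..., subject=...)<nexa_end>"."""
    for token, fn_name in FUNCTION_REGISTRY.items():
        if generated.startswith(token):
            args = generated[len(token):].split("<nexa_end>")[0].strip()
            return fn_name, args
    return None, generated  # plain-text reply, no action requested
```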

Model Evaluation

To assess Octopus v3's multimodal and function-calling performance, we compared it with a combination of GPT-4V and GPT-4 across ten common smartphone use cases:

  • Given an image with the text "THE ERA OF ON-DEVICE AI AGENT IS COMING!" and instructions to email about AI progress, Octopus v3 composed a concise email capturing the main idea in both English and Chinese and called for a send email action (a sketch of such a call follows this list).
  • When shown an image of the Golden Gate Bridge, Octopus v3 accurately described the scene for a text message in both English and Chinese and called for a send text action.
  • Presented with an image of the Great Wall of China and asked about its history, Octopus v3 generated accurate search queries in both English and Chinese and called for a search action.
  • When shown an image of a compact dishwasher, Octopus v3 accurately described the product's color, size, category, and application, and called for a purchase action.
  • Given an image of plastic water bottles, Octopus v3 correctly identified the items and described their material composition for recycling.
  • When shown an image of a computer mouse, Octopus v3 provided a detailed description of the item in both English and Chinese.
  • Presented with an image of a modern living room, Octopus v3 offered style suggestions that aligned with the room's existing aesthetics.
  • When shown an image of a pineapple and asked to order two, Octopus v3 correctly identified the item and quantity for purchase in both English and Chinese and called for an Instacart purchase action.
  • Given an image of Mexican cuisine and instructions to deliver to a specific location, Octopus v3 accurately described the dish and included the delivery address in both English and Chinese and called for a DoorDash ordering action.
  • Presented with an image of a long-haired dog and asked for care instructions, Octopus v3 correctly identified the animal and the type of care needed (grooming) in both English and Chinese.
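
As a usage example, continuing the `dispatch` sketch above, a generation for the first (email) use case might look like the following. The token name, argument schema, and email text are illustrative assumptions, not Octopus v3's published output format:

```python
# A hypothetical generation for the email use case above.
generated = (
    "<nexa_0>(to='team@example.com', "
    "subject='The era of on-device AI agents', "
    "body='Quick update: sub-billion multimodal agents now run locally...')"
    "<nexa_end>"
)

fn_name, args = dispatch(generated)
print(fn_name)  # send_email
print(args)     # (to='team@example.com', subject=..., body=...)
# The host app would then route fn_name and args to the phone's email client.
```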

In all these cases, Octopus v3, with fewer than 1B parameters, demonstrated function-calling accuracy comparable to the cloud-based, large-scale GPT-4V and GPT-4 combination. It handled both English and Chinese inputs proficiently, processing visual and textual information to generate accurate responses and actions.

This versatility across complex multimodal tasks in varied domains underscores Octopus v3's potential as a powerful AI agent for everyday smartphone use. Importantly, Octopus v3 runs entirely on-device and requires no internet connection to operate.

What's Next