A universal framework for agentic multimodal data tailoring

DataClaw: Agentic Multimodal Data Tailoring

Actively refining and structuring data to align with diverse user and downstream intents.

DataClaw treats data processing as a learnable, high-order capability. Given a user intent or downstream objective, a 9B tailoring agent filters redundant signal from long videos, GUI traces, embodied trajectories, and editing sequences, then reorganizes what remains into dense, verifiable, application-specific supervision. The agent is trained with SFT followed by rule-driven GRPO and is deployed either as a single Omni model or as a panel of domain Experts.

v1: method paper drafted · code, dataset, and DataClaw-val to be released with v2.

Method

Overview.

DataClaw extracts bottom-up factual anchors from raw multimodal data and combines them with domain-specific intents for top-down semantic synthesis by a strong VLM, producing structured training data across five domains. The corpus then trains DataClaw under two paradigms — an Omni model and per-domain Experts — with SFT followed by rule-driven GRPO.

DataClaw overview: data construction via bottom-up factual anchor extraction and top-down semantic synthesis, followed by training under Omni and Expert paradigms, inference, and downstream utilization.
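
The overview names rule-driven GRPO but does not spell out the reward rules. As a minimal sketch, assuming a binary JSON-validity rule (a placeholder, not DataClaw's actual reward) and the standard group-relative normalization used by GRPO, advantages for one group of sampled outputs could be computed as follows. The names `rule_reward` and `group_relative_advantages` are hypothetical.

```python
import json
from statistics import mean, pstdev


def rule_reward(output_text: str) -> float:
    """Placeholder rule: 1.0 if the output parses as a JSON object, else 0.0."""
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(parsed, dict) else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of sampled outputs for the same input (toy strings).
group = [
    '{"intent": "open settings", "steps": ["click gear"]}',
    '{"intent": "open settings", "steps": []}',
    "not json at all",
]
rewards = [rule_reward(text) for text in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, valid outputs are pushed up only when some sibling sample failed the rule; a group in which every sample passes contributes no gradient signal.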

Qualitative cases

Five representative domains: daily life, education, embodied, GUI agents, and AIGC.

Experiments

DataClaw-val and Targeted Refinement: structured-output quality, then downstream training utility.

We evaluate two complementary aspects of agentic data tailoring. The first is DataClaw-val, a benchmark dedicated to data refinement that scores outputs on JSON validity, schema-field correctness, textual semantic alignment, and visual consistency. The second is Targeted Refinement, which uses post-training as the final test of data quality: downstream SFT on DataClaw-tailored data is compared against SFT on frontier-VLM annotations under volume-aligned budgets.

Setting A

Refinement Quality — DataClaw-val

200 diversity-sampled examples spanning five domains plus a fuzzy-intent stress subset. Outputs are scored by JSON validity, then by Field, Semantic, and Sequence metrics over the target schema.

  • 5 domains + fuzzy intent
  • schema-aware metrics
  • Field / Semantic / Sequence
  • vs. proprietary VLMs
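
The validity-then-metrics order described for Setting A can be sketched as a gated scorer. This is an illustration only: the reference record, the `steps` key, and the use of `difflib.SequenceMatcher` as a stand-in for the Sequence metric are assumptions, and the Semantic metric is omitted because the page does not say how it is computed.

```python
import json
from difflib import SequenceMatcher


def score_output(raw: str, reference: dict) -> dict:
    """Gate on JSON validity, then score Field and Sequence against a reference."""
    try:
        predicted = json.loads(raw)
    except json.JSONDecodeError:
        return {"valid": False, "field": 0.0, "sequence": 0.0}
    if not isinstance(predicted, dict):
        return {"valid": False, "field": 0.0, "sequence": 0.0}

    # Field: share of reference keys that the prediction also provides.
    field = sum(key in predicted for key in reference) / len(reference)

    # Sequence: order-sensitive similarity of the two step lists (stand-in metric).
    pred_steps = predicted.get("steps", [])
    if not isinstance(pred_steps, list):
        pred_steps = []
    ref_steps = reference.get("steps", [])
    sequence = SequenceMatcher(None, pred_steps, ref_steps).ratio()

    return {"valid": True, "field": field, "sequence": sequence}


# Hypothetical reference record and model output.
reference = {"intent": "submit form", "steps": ["open", "type", "submit"], "summary": "ok"}
result = score_output('{"intent": "submit form", "steps": ["open", "submit"]}', reference)
invalid = score_output("oops", reference)
```

Gating first means a malformed output scores zero on every downstream metric instead of being partially credited.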

Setting B

Targeted Refinement — downstream SFT

Identical raw streams are tailored by Qwen3.5-9B, Gemini-3.1-Pro, and DataClaw. Equal-size SFT subsets are sampled from each pool, used to fine-tune a base model, and evaluated on the task's official test split.

  • volume-aligned
  • data-quality isolation
  • 3 downstream tasks
  • end-to-end success
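
Volume alignment can be sketched as drawing the same number of examples from every annotator's pool, so that only data quality differs between the SFT runs. The helper name, the optional `budget` cap, and the toy pools are assumptions; the actual sampling procedure is described in the v1 paper, not here.

```python
import random
from typing import Optional


def volume_aligned_subsets(
    pools: dict, budget: Optional[int] = None, seed: int = 0
) -> dict:
    """Sample an equal number of examples from each annotator's pool."""
    # The smallest pool bounds the shared subset size.
    size = min(len(pool) for pool in pools.values())
    if budget is not None:
        size = min(size, budget)

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return {name: rng.sample(pool, size) for name, pool in pools.items()}


# Toy pools with different sizes, standing in for tailored examples.
pools = {
    "Qwen3.5-9B": [f"q{i}" for i in range(7)],
    "Gemini-3.1-Pro": [f"g{i}" for i in range(10)],
    "DataClaw": [f"d{i}" for i in range(5)],
}
subsets = volume_aligned_subsets(pools, budget=4)
```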

Headline numbers

DataClaw-E matches frontier VLMs on schema quality and leads on end-to-end downstream metrics.

Scores below are taken from the v1 paper. DataClaw-E is the routed expert configuration; DataClaw-O is the unified omni model. Best per row in bold.

Refinement

DataClaw-val (Field / Sem / Seq)

DataClaw-E: 97.53 / 74.94 / 48.86 · Gemini-3.1-Pro: 98.12 / 73.85 / 58.50 · GPT-4o: 97.27 / 75.15 / 49.43.

GUI navigation

AgentNet (SSR / TSR)

Base Qwen3.5-4B. SFT on Gemini data: 39.5 / 14.2. SFT on DataClaw: 38.2 / 15.6 — higher end-to-end task success.

Video generation

Ego4D action gen (FVD ↓ / Contact mAP ↑)

Base Wan2.2-I2V-5B. SFT on Gemini: 295.4 / 48.5. SFT on DataClaw: 288.6 / 51.2.

Spatio-temporal VQA

ReMoT (Partial / Overall)

Base Qwen3.5-4B. SFT on Gemini: 53.4 / 31.5. SFT on DataClaw: 52.1 / 33.2.

Main results

DataClaw-val — structured-output quality across five domains and a fuzzy-intent subset.

Field measures schema completeness, Semantic measures content correctness, Sequence measures ordering and structural consistency. Best per column in bold. Numbers from the v1 paper.

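| Model | Metric | GUI | Embodied | AIGC | Daily | Education | Fuzzy | Overall |
|---|---|---|---|---|---|---|---|---|
| Claude-Sonnet-4-6 | Field | 57.50 | 100.00 | 100.00 | 100.00 | 100.00 | 76.35 | 88.98 |
| | Semantic | 50.05 | 84.83 | 74.26 | 54.46 | 54.46 | 65.72 | 63.96 |
| | Sequence | 54.06 | 50.11 | 33.38 | 41.07 | 41.07 | 36.48 | 42.70 |
| GPT-4o (1120-global) | Field | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 83.61 | 97.27 |
| | Semantic | 84.81 | 87.55 | 69.38 | 54.58 | 80.21 | 74.39 | 75.15 |
| | Sequence | 80.69 | 46.33 | 46.95 | 29.33 | 50.71 | 42.57 | 49.43 |
| Gemini-3.1-Pro-Preview | Field | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 88.74 | 98.12 |
| | Semantic | 90.01 | 89.17 | 75.26 | 54.51 | 54.51 | 79.63 | 73.85 |
| | Sequence | 99.67 | 67.97 | 33.14 | 51.48 | 51.48 | 47.25 | 58.50 |
| MiniMax-M2.7 | Field | 92.50 | 100.00 | 100.00 | 97.50 | 95.00 | 73.29 | 93.05 |
| | Semantic | 78.93 | 79.28 | 69.07 | 44.38 | 43.52 | 61.85 | 62.84 |
| | Sequence | 77.86 | 51.84 | 6.83 | 32.89 | 17.54 | 32.16 | 36.52 |
| Qwen3.6-plus | Field | 70.00 | 100.00 | 95.00 | 95.00 | 82.50 | 77.58 | 86.68 |
| | Semantic | 62.64 | 87.40 | 67.82 | 51.20 | 62.14 | 64.37 | 65.93 |
| | Sequence | 66.96 | 60.33 | 39.71 | 42.44 | 30.85 | 35.92 | 46.04 |
| Qwen3.5-9B | Field | 94.87 | 100.00 | 95.00 | 87.50 | 90.00 | 70.46 | 89.64 |
| | Semantic | 72.72 | 77.48 | 65.27 | 45.66 | 43.41 | 58.24 | 60.46 |
| | Sequence | 72.71 | 59.35 | 3.29 | 27.70 | 24.75 | 29.63 | 36.24 |
| DataClaw-O (Ours) | Field | 100.00 | 100.00 | 85.00 | 92.50 | 70.00 | 78.42 | 87.65 |
| | Semantic | 80.01 | 63.37 | 55.70 | 62.61 | 45.71 | 67.35 | 62.46 |
| | Sequence | 85.70 | 67.01 | 23.90 | 35.05 | 17.41 | 39.84 | 44.82 |
| DataClaw-E (Ours) | Field | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 85.17 | 97.53 |
| | Semantic | 88.90 | 82.93 | 75.36 | 49.72 | 76.43 | 76.28 | 74.94 |
| | Sequence | 93.62 | 71.60 | 15.26 | 42.59 | 19.75 | 50.31 | 48.86 |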
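
The Overall column appears consistent with an unweighted mean over the six subset columns (five domains plus Fuzzy). That reading is inferred from the reported numbers, not stated on the page. The sketch below checks it for the DataClaw-E rows; `overall_gap` is a hypothetical helper.

```python
from statistics import mean

# DataClaw-E rows copied from the table: six subset scores, then the reported Overall.
DATACLAW_E = {
    "Field": ([100.00, 100.00, 100.00, 100.00, 100.00, 85.17], 97.53),
    "Semantic": ([88.90, 82.93, 75.36, 49.72, 76.43, 76.28], 74.94),
    "Sequence": ([93.62, 71.60, 15.26, 42.59, 19.75, 50.31], 48.86),
}


def overall_gap(scores: list, reported: float) -> float:
    """Absolute difference between the unweighted mean and the reported Overall."""
    return abs(mean(scores) - reported)


gaps = {
    metric: overall_gap(scores, reported)
    for metric, (scores, reported) in DATACLAW_E.items()
}
# Each gap stays within two-decimal rounding of the reported value.
consistent = all(gap < 0.01 for gap in gaps.values())
```

Under this reading the Fuzzy subset carries the same weight as each domain, so a model's fuzzy-intent robustness moves Overall as much as any single domain does.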

Downstream application

Targeted Refinement — same raw streams, same SFT budget, only the annotator changes.

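Three downstream tasks are fine-tuned on identical sample counts drawn from each annotator's pool. DataClaw stays within 1.3 points of Gemini-3.1-Pro on SSR, Consis., and Partial, and leads on TSR, FVD, Contact mAP, and Overall, which include the end-to-end task-success metrics.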

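Base models: Qwen3.5-4B (GUI Navigation), Wan2.2-I2V-5B (Action Video Generation), Qwen3.5-4B (Spatio-temporal VQA).

| Data source for SFT | GUI Navigation: SSR ↑ | GUI Navigation: TSR ↑ | Action Video Generation: FVD ↓ | Action Video Generation: Consis. ↑ | Action Video Generation: Contact mAP ↑ | Spatio-temporal VQA: Partial ↑ | Spatio-temporal VQA: Overall ↑ |
|---|---|---|---|---|---|---|---|
| Zero-shot base model | 12.4 | 1.2 | 385.2 | 68.4 | 18.5 | 28.3 | 9.8 |
| Processed by base model | 16.8 | 3.5 | 362.1 | 69.1 | 24.2 | 33.5 | 14.2 |
| Processed by Gemini-3.1-Pro | 39.5 | 14.2 | 295.4 | 76.2 | 48.5 | 53.4 | 31.5 |
| Processed by DataClaw (Ours) | 38.2 | 15.6 | 288.6 | 75.8 | 51.2 | 52.1 | 33.2 |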
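
Because FVD is lower-is-better while the other columns are higher-is-better, a per-metric comparison has to respect the arrow direction. A small sketch over the Gemini-3.1-Pro and DataClaw rows, with the `leader` helper being hypothetical:

```python
# (metric, higher_is_better, Gemini-3.1-Pro score, DataClaw score), copied from the table.
ROWS = [
    ("SSR", True, 39.5, 38.2),
    ("TSR", True, 14.2, 15.6),
    ("FVD", False, 295.4, 288.6),
    ("Consis.", True, 76.2, 75.8),
    ("Contact mAP", True, 48.5, 51.2),
    ("Partial", True, 53.4, 52.1),
    ("Overall", True, 31.5, 33.2),
]


def leader(higher_is_better: bool, gemini: float, dataclaw: float) -> str:
    """Name the better annotator for one metric, honoring its direction."""
    if gemini == dataclaw:
        return "tie"
    dataclaw_wins = dataclaw > gemini if higher_is_better else dataclaw < gemini
    return "DataClaw" if dataclaw_wins else "Gemini-3.1-Pro"


leaders = {metric: leader(hib, g, d) for metric, hib, g, d in ROWS}
dataclaw_leads = [metric for metric, who in leaders.items() if who == "DataClaw"]
```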

Release roadmap

v1 reports the numbers; v2 ships code, weights, and DataClaw-val.

The v1 method paper covers the full pipeline, ablations, scaling curves, and t-SNE diversity analysis. Code, model weights, dataset, and the DataClaw-val benchmark are forthcoming with v2.

  1. Setting A: DataClaw-val refinement quality

    Score the agent's structured outputs against schema-aware Field, Semantic, and Sequence metrics across five domains and a fuzzy-intent subset.

    • DataClaw-val (5 domains): v1 paper
    • DataClaw-Intent (fuzzy): v1 paper
    • Reward / routing ablations: v1 paper
    • Public benchmark release: forthcoming · v2

  2. Setting B: Targeted Refinement (downstream SFT)

    Hold raw streams fixed, vary only the annotator. Evaluate on three downstream tasks under volume-aligned SFT.

    • AgentNet (GUI navigation): v1 paper
    • Ego4D (action video gen): v1 paper
    • ReMoT (spatio-temporal VQA): v1 paper
    • Reproduction recipes: forthcoming · v2