A universal framework for agentic multimodal data tailoring

DataClaw: Agentic Multimodal Data Tailoring

Actively refining and structuring data to align with diverse user and downstream intents.

DataClaw elevates data processing to a learnable, high-order capability. Given a user intent or downstream objective, a 9B tailoring agent filters redundant signal from long videos, GUI traces, embodied trajectories, and editing sequences, then reorganizes the remainder into dense, verifiable, application-specific supervision. The agent is trained with SFT followed by rule-driven GRPO and deployed either as a single Omni model or as a panel of domain Experts.

v1: method paper drafted. Code, dataset, and DataClaw-val will be released with v2.

Method

Overview.

DataClaw extracts bottom-up factual anchors from raw multimodal data and combines them with domain-specific intents for top-down semantic synthesis by a strong VLM, producing structured training data across five domains. The corpus then trains DataClaw under two paradigms — an Omni model and per-domain Experts — with SFT followed by rule-driven GRPO.
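As a rough illustration of what "rule-driven GRPO" can look like, the sketch below scores a group of sampled outputs with a purely rule-based reward and turns the rewards into group-relative advantages (reward minus group mean, divided by group standard deviation), the normalization GRPO is usually described with. The reward rule, the `SECTIONS` markers, and the sample strings are hypothetical placeholders, not DataClaw's actual reward design.

```python
import statistics

# Hypothetical format rule: reward the fraction of expected section markers
# that appear in a sampled output. The real reward rules are not given here.
SECTIONS = ("THOUGHT", "TASK", "SOLUTION")


def rule_reward(output: str) -> float:
    """Return a reward in [0, 1] computed by a deterministic rule, no learned judge."""
    return sum(marker in output for marker in SECTIONS) / len(SECTIONS)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One group of sampled outputs for the same prompt (illustrative strings only).
samples = [
    "THOUGHT ... TASK ... SOLUTION ...",
    "TASK ... SOLUTION ...",
    "free-form text without structure",
]
rewards = [rule_reward(s) for s in samples]
advantages = group_relative_advantages(rewards)

for sample, reward, adv in zip(samples, rewards, advantages):
    print(f"reward={reward:.2f} advantage={adv:+.2f} :: {sample}")
```

Outputs that satisfy more rules than the group average receive a positive advantage, and those that satisfy fewer receive a negative one, so no separate value model is needed in this sketch.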

Qualitative cases

Five representative domains: daily life, education, embodied, GUI agents, and AIGC.

01 Daily Life

Cooking Tutorial from Raw Video

Turn a cooking-and-cleanup video into interleaved instructional data that feels like a real tutorial page.

You

This is a cooking video. Turn it into interleaved tutorial data with step-by-step images and text.

DataClaw

THOUGHT

TASK

This is my current situation: I am about to cook meat, and I am setting a timer to keep track of the cooking time. I want to know what I should do next.

SOLUTION

Experiments

DataClaw-val and Targeted Refinement: structured-output quality, then downstream training utility.

We evaluate two complementary aspects of agentic data tailoring. First, DataClaw-val is a benchmark dedicated to data refinement that scores outputs by JSON validity, schema-field correctness, textual semantic alignment, and visual consistency. Second, Targeted Refinement treats post-training as the ultimate validation: downstream SFT on DataClaw-tailored data is compared against SFT on frontier-VLM annotations under volume-aligned budgets.

Setting A

Refinement Quality — DataClaw-val

200 diversity-sampled examples spanning five domains plus a fuzzy-intent stress subset. Outputs are scored by JSON validity, then by Field, Semantic, and Sequence metrics over the target schema.

5 domains + fuzzy intent
schema-aware metrics
Field / Semantic / Sequence
vs. proprietary VLMs
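The two-stage scoring described for Setting A (a JSON validity gate, then schema-aware metrics) can be sketched for the Field metric as follows. This is a minimal sketch under stated assumptions: the schema fields, the rule that an invalid output scores zero, and the "present and non-empty" test are all hypothetical, since the benchmark's exact scoring rules are not given on this page.

```python
import json

# Hypothetical target schema for an interleaved tutorial record.
REQUIRED_FIELDS = ("title", "steps", "images")


def field_completeness(raw_output: str, required_fields=REQUIRED_FIELDS) -> float:
    """Score schema completeness in [0, 100], gated on JSON validity."""
    # Stage 1: validity gate. Assumption: unparsable output scores 0.
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0
    # Stage 2: count required fields that are present and non-empty.
    present = sum(
        1 for field in required_fields if obj.get(field) not in (None, "", [], {})
    )
    return 100.0 * present / len(required_fields)


valid_but_incomplete = '{"title": "Cooking tutorial", "steps": ["Set a timer"], "images": []}'
not_json = "Step 1: set a timer"

score_incomplete = field_completeness(valid_but_incomplete)
score_invalid = field_completeness(not_json)
print(f"incomplete: {score_incomplete:.2f}, invalid: {score_invalid:.2f}")
```

Semantic and Sequence scores would need reference annotations and are not sketched here.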

Setting B

Targeted Refinement — downstream SFT

Identical raw streams are tailored by Qwen3.5-9B, Gemini-3.1-Pro, and DataClaw. Equal-size SFT subsets are sampled from each pool, used to fine-tune a base model, and evaluated on the task's official test split.

volume-aligned
data-quality isolation
3 downstream tasks
end-to-end success
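One way to realize the volume-aligned setup is to cap every annotator's pool at the same sample count before SFT, so that only data quality differs between runs. The sketch below assumes each pool is a plain Python list; the pool sizes, the `budget` parameter, and the function name are illustrative and not taken from the paper.

```python
import random


def volume_aligned_subsets(pools, budget=None, seed=0):
    """Draw equal-size random subsets from each annotator's pool.

    pools:  mapping from annotator name to a list of tailored samples.
    budget: optional cap on the subset size; defaults to the smallest pool.
    """
    smallest = min(len(pool) for pool in pools.values())
    n = smallest if budget is None else min(budget, smallest)
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return {name: rng.sample(pool, n) for name, pool in pools.items()}


# Hypothetical pools: integers stand in for tailored training samples.
pools = {
    "Qwen3.5-9B": list(range(50)),
    "Gemini-3.1-Pro": list(range(80)),
    "DataClaw": list(range(65)),
}
subsets = volume_aligned_subsets(pools)
for name, subset in subsets.items():
    print(name, len(subset))
```

Each subset would then fine-tune the same base model with the same recipe, which is what isolates the annotator as the only variable.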

Headline numbers

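DataClaw-E is within one point of the best frontier VLM on Field and Semantic, trails Gemini-3.1-Pro on Sequence, and DataClaw-tailored data leads on the end-to-end downstream metrics.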

Scores below are taken from the v1 paper. DataClaw-E is the routed expert configuration; DataClaw-O is the unified omni model.

Refinement

DataClaw-val (Field / Sem / Seq)

DataClaw-E: 97.53 / 74.94 / 48.86 · Gemini-3.1-Pro: 98.12 / 73.85 / 58.50 · GPT-4o: 97.27 / 75.15 / 49.43.

GUI navigation

AgentNet (SSR / TSR)

Base Qwen3.5-4B. SFT on Gemini data: 39.5 / 14.2. SFT on DataClaw: 38.2 / 15.6, i.e. slightly lower SSR but higher TSR, the end-to-end task-success metric.

Video generation

Ego4D action gen (FVD ↓ / Contact mAP ↑)

Base Wan2.2-I2V-5B. SFT on Gemini: 295.4 / 48.5. SFT on DataClaw: 288.6 / 51.2.

Spatio-temporal VQA

ReMoT (Partial / Overall)

Base Qwen3.5-4B. SFT on Gemini: 53.4 / 31.5. SFT on DataClaw: 52.1 / 33.2.

Main results

DataClaw-val — structured-output quality across five domains and a fuzzy-intent subset.

Field measures schema completeness, Semantic measures content correctness, and Sequence measures ordering and structural consistency. Numbers are from the v1 paper.

Model	Metric	GUI	Embodied	AIGC	Daily	Education	Fuzzy	Overall
Claude-Sonnet-4-6	Field	57.50	100.00	100.00	100.00	100.00	76.35	88.98
	Semantic	50.05	84.83	74.26	54.46	54.46	65.72	63.96
	Sequence	54.06	50.11	33.38	41.07	41.07	36.48	42.70
GPT-4o (1120-global)	Field	100.00	100.00	100.00	100.00	100.00	83.61	97.27
	Semantic	84.81	87.55	69.38	54.58	80.21	74.39	75.15
	Sequence	80.69	46.33	46.95	29.33	50.71	42.57	49.43
Gemini-3.1-Pro-Preview	Field	100.00	100.00	100.00	100.00	100.00	88.74	98.12
	Semantic	90.01	89.17	75.26	54.51	54.51	79.63	73.85
	Sequence	99.67	67.97	33.14	51.48	51.48	47.25	58.50
MiniMax-M2.7	Field	92.50	100.00	100.00	97.50	95.00	73.29	93.05
	Semantic	78.93	79.28	69.07	44.38	43.52	61.85	62.84
	Sequence	77.86	51.84	6.83	32.89	17.54	32.16	36.52
Qwen3.6-plus	Field	70.00	100.00	95.00	95.00	82.50	77.58	86.68
	Semantic	62.64	87.40	67.82	51.20	62.14	64.37	65.93
	Sequence	66.96	60.33	39.71	42.44	30.85	35.92	46.04
Qwen3.5-9B	Field	94.87	100.00	95.00	87.50	90.00	70.46	89.64
	Semantic	72.72	77.48	65.27	45.66	43.41	58.24	60.46
	Sequence	72.71	59.35	3.29	27.70	24.75	29.63	36.24
DataClaw-O (Ours)	Field	100.00	100.00	85.00	92.50	70.00	78.42	87.65
	Semantic	80.01	63.37	55.70	62.61	45.71	67.35	62.46
	Sequence	85.70	67.01	23.90	35.05	17.41	39.84	44.82
DataClaw-E (Ours)	Field	100.00	100.00	100.00	100.00	100.00	85.17	97.53
	Semantic	88.90	82.93	75.36	49.72	76.43	76.28	74.94
	Sequence	93.62	71.60	15.26	42.59	19.75	50.31	48.86
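The Overall column appears to be the unweighted mean of the six per-subset scores (five domains plus Fuzzy). This is inferred from the numbers rather than stated on the page, so the check below is a sanity sketch, using the DataClaw-E rows copied from the table above.

```python
from statistics import fmean

COLUMNS = ("GUI", "Embodied", "AIGC", "Daily", "Education", "Fuzzy")

# DataClaw-E rows copied from the table; one score per entry in COLUMNS.
dataclaw_e = {
    "Field": (100.00, 100.00, 100.00, 100.00, 100.00, 85.17),
    "Semantic": (88.90, 82.93, 75.36, 49.72, 76.43, 76.28),
    "Sequence": (93.62, 71.60, 15.26, 42.59, 19.75, 50.31),
}
reported_overall = {"Field": 97.53, "Semantic": 74.94, "Sequence": 48.86}

# Assumption being tested: Overall is the plain average over the six columns.
computed_overall = {metric: fmean(scores) for metric, scores in dataclaw_e.items()}

for metric, value in computed_overall.items():
    gap = abs(value - reported_overall[metric])
    print(f"{metric}: computed={value:.3f} reported={reported_overall[metric]:.2f} gap={gap:.3f}")
```

Under this reading, the Fuzzy subset carries the same weight as each of the five domains in the Overall score.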

Downstream application

Targeted Refinement — same raw streams, same SFT budget, only the annotator changes.

Three downstream tasks fine-tuned on identical sample counts drawn from each annotator's pool. DataClaw matches Gemini overall and leads on the end-to-end task-success metrics.

Data source for SFT	GUI Navigation (Base: Qwen3.5-4B)		Action Video Generation (Base: Wan2.2-I2V-5B)			Spatio-temporal VQA (Base: Qwen3.5-4B)
	SSR ↑	TSR ↑	FVD ↓	Consis. ↑	Contact mAP ↑	Partial ↑	Overall ↑
Zero-shot base model	12.4	1.2	385.2	68.4	18.5	28.3	9.8
Processed by base model	16.8	3.5	362.1	69.1	24.2	33.5	14.2
Processed by Gemini-3.1-Pro	39.5	14.2	295.4	76.2	48.5	53.4	31.5
Processed by DataClaw (Ours)	38.2	15.6	288.6	75.8	51.2	52.1	33.2
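Because the table mixes higher-is-better and lower-is-better metrics, a direction-aware comparison makes the DataClaw-versus-Gemini picture easier to read. The sketch below copies the two rows from the table and flips the sign for FVD; the helper name `signed_gain` is illustrative.

```python
# (metric, higher_is_better, Gemini-3.1-Pro score, DataClaw score), from the table.
ROWS = [
    ("SSR", True, 39.5, 38.2),
    ("TSR", True, 14.2, 15.6),
    ("FVD", False, 295.4, 288.6),
    ("Consis.", True, 76.2, 75.8),
    ("Contact mAP", True, 48.5, 51.2),
    ("Partial", True, 53.4, 52.1),
    ("Overall", True, 31.5, 33.2),
]


def signed_gain(higher_is_better, baseline, candidate):
    """Positive when the candidate is better than the baseline on this metric."""
    diff = candidate - baseline
    return diff if higher_is_better else -diff


gains = {name: signed_gain(hib, gemini, ours) for name, hib, gemini, ours in ROWS}
wins = [name for name, gain in gains.items() if gain > 0]

for name, gain in gains.items():
    print(f"{name:12s} {gain:+.1f}")
print("DataClaw ahead on:", wins)
```

With these numbers DataClaw is ahead on TSR, FVD, Contact mAP, and Overall, and behind on SSR, Consis., and Partial.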

Release roadmap

v1 reports the numbers; v2 ships code, weights, and DataClaw-val.

The v1 method paper covers the full pipeline, ablations, scaling curves, and t-SNE diversity analysis. Code, model weights, dataset, and the DataClaw-val benchmark are forthcoming with v2.

Setting A: DataClaw-val refinement quality

Score the agent's structured outputs against schema-aware Field, Semantic, and Sequence metrics across five domains and a fuzzy-intent subset.
- DataClaw-val (5 domains): v1 paper
- DataClaw-Intent (fuzzy): v1 paper
- Reward / routing ablations: v1 paper
- Public benchmark release: forthcoming · v2

Setting B: Targeted Refinement (downstream SFT)

Hold raw streams fixed, vary only the annotator. Evaluate on three downstream tasks under volume-aligned SFT.
- AgentNet (GUI navigation): v1 paper
- Ego4D (action video gen): v1 paper
- ReMoT (spatio-temporal VQA): v1 paper
- Reproduction recipes: forthcoming · v2