User-Intent-Driven Data Construction
DataClaw: Make ANY Data You Want
Turn raw inputs into user-intent-aligned, trainable multimodal data.
DataClaw is built around a simple premise: users do not want a fixed dataset recipe. They want the exact data form that matches their downstream goal. From videos, images, GUI traces, robot trajectories, and more, DataClaw infers intent and composes the right trainable output.
Raw Inputs
Video, image, GUI trace, robot demo, drawing sequence
Ordinary materials enter the system without being pre-shaped into a single dataset template.
User Intent
What kind of data does the user actually want to build from this material?
Constructed Outputs
Interleaved tutorials, temporal QA, long-horizon GUI tasks, world-model samples, editing data.
Method
Raw video and user intent are merged once, then dispatched by task type.
DataClaw first resolves user intent into typed instructions, then combines that instruction with the raw video context in one shared route before dispatching parallel synthesis models.
Raw Video
Full source video with actions, frames, and temporal evidence.
User Intent
Natural-language request that may still be broad or ambiguous.
User Intent Agent
Infer candidate task types and rewrite each candidate into a clear, synthesis-ready instruction.
Tutorial
Interleaved tutorial-style supervision.
Temporal
Ordering, localization, and state-change reasoning.
More Types
Additional task classes can be inferred from the same intent.
Tutorial Agent
Use the task-specific synthesis agent selected for tutorial supervision.
Temporal Agent
Use the task-specific synthesis agent selected for temporal reasoning data.
More Agents
Attach more specialized agents for other task types when needed.
Tutorial Data
Task description + frame-text labels.
Temporal Data
Task description + timestamped image-text labels.
More Task Data
Additional task description + multimodal labels for more task classes.
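The pipeline above (intent agent → typed instructions → per-task synthesis agents → task-specific records) can be sketched in a few lines. This is a minimal illustration only: the agent names, task types, and record fields are assumptions for the sketch, not DataClaw's actual API, and the intent agent is a keyword stub standing in for a real model.

```python
from dataclasses import dataclass, field

@dataclass
class TypedInstruction:
    task_type: str    # e.g. "tutorial" or "temporal"
    instruction: str  # synthesis-ready rewrite of the raw intent

@dataclass
class Record:
    task_type: str
    description: str
    labels: list = field(default_factory=list)

def intent_agent(raw_intent: str) -> list[TypedInstruction]:
    """Stub: infer candidate task types from a broad request.
    A real system would use an LLM; here we key on simple phrases."""
    out = []
    if "tutorial" in raw_intent or "step" in raw_intent:
        out.append(TypedInstruction("tutorial", "Produce interleaved frame-text tutorial steps."))
    if "order" in raw_intent or "when" in raw_intent:
        out.append(TypedInstruction("temporal", "Produce timestamped ordering/localization QA."))
    return out

def tutorial_agent(video: dict, inst: TypedInstruction) -> Record:
    # Tutorial data: task description + frame-text labels.
    return Record("tutorial", inst.instruction,
                  [(f["id"], f["caption"]) for f in video["frames"]])

def temporal_agent(video: dict, inst: TypedInstruction) -> Record:
    # Temporal data: task description + timestamped image-text labels.
    return Record("temporal", inst.instruction,
                  [(f["t"], f["caption"]) for f in video["frames"]])

AGENTS = {"tutorial": tutorial_agent, "temporal": temporal_agent}

def construct(video: dict, raw_intent: str) -> list[Record]:
    """Merge video + intent once, then dispatch per inferred task type."""
    return [AGENTS[ti.task_type](video, ti) for ti in intent_agent(raw_intent)]

video = {"frames": [{"id": 0, "t": 0.0, "caption": "open lid"},
                    {"id": 1, "t": 2.5, "caption": "pour water"}]}
records = construct(video, "make tutorial steps and tell me when things happen")
print([r.task_type for r in records])  # → ['tutorial', 'temporal']
```

The key design point the sketch mirrors is that the video/intent merge happens once, so adding a new task class only means registering one more agent in the dispatch table.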
Case Gallery
One system, many kinds of data users may ask for.
Evaluation
We test both benchmark reconstruction and benchmark augmentation.
The goal is not just to generate plausible outputs, but to reconstruct benchmark-style supervision from raw sources and then show that the newly constructed data improves benchmark performance after training.
Benchmark Reconstruction
Give the system raw materials, a user intent prompt, and target schema constraints. Then compare the constructed output against the official benchmark ground truth.
- raw source
- user intent
- schema-aligned output
- GT comparison
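The final step of this protocol, comparing schema-aligned output against official ground truth, can be sketched as a simple scoring function. The field names (`question`, `answer`) and exact-match metric are assumptions for illustration; real benchmarks use their own schemas and usually fuzzier matching (choice letters, spans).

```python
def reconstruction_accuracy(constructed, ground_truth):
    """Fraction of GT questions whose constructed answer matches exactly.
    A toy metric standing in for benchmark-specific comparison logic."""
    gt = {g["question"]: g["answer"] for g in ground_truth}
    hits = sum(1 for c in constructed if gt.get(c["question"]) == c["answer"])
    return hits / max(len(ground_truth), 1)

gt = [{"question": "What happens first?", "answer": "open lid"}]
ours = [{"question": "What happens first?", "answer": "open lid"}]
print(reconstruction_accuracy(ours, gt))  # → 1.0
```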
Benchmark Augmentation
Give the system in-distribution but unlabeled raw materials, construct new data in the benchmark schema, train with the augmented set, and evaluate on the official test split.
- unlabeled raw source
- new benchmark-style data
- train with augmentation
- test on official split
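The augmentation protocol reduces to a train-twice comparison: evaluate a model trained on the base set, then one trained on base plus constructed data, both on the official test split. The sketch below uses placeholder train/evaluate functions; nothing here is DataClaw's real training code.

```python
def augment_and_evaluate(base_train, new_data, test_split, train_fn, eval_fn):
    """Train once on the base set and once on base + constructed data,
    then report both scores on the same held-out official split."""
    baseline = eval_fn(train_fn(base_train), test_split)
    augmented = eval_fn(train_fn(base_train + new_data), test_split)
    return {"baseline": baseline, "augmented": augmented}

# Toy stand-ins: "training" memorizes QA pairs, evaluation is accuracy.
train_fn = lambda data: {q: a for q, a in data}
eval_fn = lambda model, test: sum(model.get(q) == a for q, a in test) / len(test)

base = [("q1", "a1")]
new = [("q2", "a2")]      # benchmark-style data built from unlabeled sources
test = [("q1", "a1"), ("q2", "a2")]
print(augment_and_evaluate(base, new, test, train_fn, eval_fn))
# → {'baseline': 0.5, 'augmented': 1.0}
```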
Benchmarks
Source-linked benchmarks make reconstruction measurable.
We focus on benchmarks where the connection from benchmark sample back to raw source material is explicit enough to support reconstruction, alignment, or augmentation experiments.
LongVideoBench
Raw videos, subtitles, and annotations make it suitable for long-context reconstruction.
TempCompass
Temporal ordering and concurrency questions are well matched to structured reconstruction.
Video-MME
Video source linkage is clear enough to test benchmark-style QA recovery from raw videos.
MotionBench
Its explicit clip-to-source provenance mapping makes it a strong reconstruction-friendly reference.
Comparison Table
Results can be organized by benchmark and evaluation setting.
Setting A = Benchmark Reconstruction; Setting B = Benchmark Augmentation.

| Method | Family | A: LongVideoBench | A: TempCompass | A: Video-MME | A: MotionBench | B: LongVideoBench | B: TempCompass | B: Video-MME | B: MotionBench |
|---|---|---|---|---|---|---|---|---|---|
| DataClaw | Intent-driven data construction | — | — | — | — | — | — | — | — |
| Qwen3.5 | Direct multimodal generation | — | — | — | — | — | — | — | — |
| Gemini | Direct multimodal generation | — | — | — | — | — | — | — | — |
| GPT-4o | Direct multimodal generation | — | — | — | — | — | — | — | — |
| InternVL | General VLM baseline | — | — | — | — | — | — | — | — |
| Rule Pipeline | Hand-crafted data synthesis | — | — | — | — | — | — | — | — |