User-Intent-Driven Data Construction

DataClaw: Make ANY Data You Want

Turn raw inputs into user-intent-aligned, trainable multimodal data.

DataClaw is built around a simple premise: users do not want a fixed dataset recipe. They want the exact data form that matches their downstream goal. From videos, images, GUI traces, robot trajectories, and more, DataClaw infers intent and composes the right trainable output.

Multi-Domain: tutorials, education, GUI, robotics, generation
Intent-Driven: constructs data based on what the user actually wants
Trainable: outputs are structured for downstream training rather than demo-only generation

Raw Inputs

Video, image, GUI trace, robot demo, drawing sequence

Ordinary materials enter the system without being pre-shaped into a single dataset template.

User Intent

What kind of data does the user actually want to build from this material?

Constructed Outputs

Interleaved tutorials, temporal QA, long-horizon GUI tasks, world-model samples, editing data.

Intent First
Case Scalable

Method

Raw video and user intent are merged once, then dispatched by task type.

DataClaw first resolves user intent into typed instructions, then combines each instruction with the raw video context in one shared route before dispatching parallel task-specific synthesis agents.

Initial Inputs
Final Outputs

Raw Video

Full source video with actions, frames, and temporal evidence.

User Intent

Natural-language request that may still be broad or ambiguous.

User Intent Agent

Infer candidate task types and rewrite each candidate into a clear, synthesis-ready instruction.

task typing · instruction rewriting · ambiguity preserved
Task Type A

Tutorial

Interleaved tutorial-style supervision.

Task Type B

Temporal

Ordering, localization, and state-change reasoning.

...

More Types

Additional task classes can be inferred from the same intent.

Tutorial Agent

Use the task-specific synthesis agent selected for tutorial supervision.

Temporal Agent

Use the task-specific synthesis agent selected for temporal reasoning data.

More Agents

Attach more specialized agents for other task types when needed.

Tutorial Data

Task description + frame-text labels.

Temporal Data

Task description + timestamped image-text labels.

More Task Data

Additional task description + multimodal labels for more task classes.
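The routing described above can be sketched as a small intent-resolution and dispatch loop. This is an illustrative sketch only: the agent functions, keyword heuristics, and record fields are hypothetical placeholders, not DataClaw's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical typed instruction produced by the User Intent Agent.
@dataclass
class TypedInstruction:
    task_type: str    # e.g. "tutorial" or "temporal"
    instruction: str  # synthesis-ready rewrite of the raw user request

def intent_agent(raw_intent: str) -> list[TypedInstruction]:
    """Toy stand-in: map keywords in the request to candidate task types."""
    candidates = []
    if "tutorial" in raw_intent or "step" in raw_intent:
        candidates.append(TypedInstruction(
            "tutorial", f"Build interleaved tutorial supervision for: {raw_intent}"))
    if "order" in raw_intent or "when" in raw_intent:
        candidates.append(TypedInstruction(
            "temporal", f"Build timestamped ordering QA for: {raw_intent}"))
    # Ambiguity preserved: a broad request can yield several candidates.
    return candidates or [TypedInstruction("tutorial", raw_intent)]

# Task-specific synthesis agents, keyed by task type (placeholder bodies).
def tutorial_agent(video: str, inst: TypedInstruction) -> dict:
    return {"task": inst.instruction,
            "labels": [("frame_0001.jpg", "Step 1: ...")]}

def temporal_agent(video: str, inst: TypedInstruction) -> dict:
    return {"task": inst.instruction,
            "labels": [("00:12", "frame_0012.jpg", "The lid is removed.")]}

AGENTS = {"tutorial": tutorial_agent, "temporal": temporal_agent}

def dataclaw(video: str, raw_intent: str) -> list[dict]:
    # Resolve intent once, then dispatch every typed instruction
    # (shown serially here; the real system runs agents in parallel).
    return [AGENTS[inst.task_type](video, inst)
            for inst in intent_agent(raw_intent)]
```

Each dispatched agent returns a record in its own label schema (frame-text pairs for tutorials, timestamped image-text pairs for temporal data), matching the output cards above.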

Case Gallery

One system, many kinds of data users may ask for.

Evaluation

We test both benchmark reconstruction and benchmark augmentation.

The goal is not just to generate plausible outputs. The goal is to reconstruct benchmark-style supervision from raw sources, and then show that newly constructed data can improve benchmark performance after training.

Setting A

Benchmark Reconstruction

Give the system raw materials, a user intent prompt, and target schema constraints. Then compare the constructed output against the official benchmark ground truth.

  • raw source
  • user intent
  • schema-aligned output
  • GT comparison
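Setting A reduces to aligning constructed items with official ground truth and scoring agreement. A minimal sketch of such a scorer, assuming hypothetical `qid` and `answer` fields; the real benchmarks' item schemas will differ:

```python
# Toy Setting-A scorer: fraction of ground-truth items whose constructed
# counterpart (matched by question id) agrees on the answer.
def reconstruction_score(constructed: list[dict], ground_truth: list[dict]) -> float:
    by_qid = {item["qid"]: item for item in constructed}
    hits = sum(
        1
        for gt in ground_truth
        if by_qid.get(gt["qid"], {}).get("answer") == gt["answer"]
    )
    return hits / len(ground_truth) if ground_truth else 0.0
```

Exact-answer matching is the simplest choice; schema-aware comparison (option sets, timestamps, spans) would slot in at the equality check.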

Setting B

Benchmark Augmentation

Give the system in-distribution but unlabeled raw materials, construct new data in the benchmark schema, train with the augmented set, and evaluate on the official test split.

  • unlabeled raw source
  • new benchmark-style data
  • train with augmentation
  • test on official split
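The Setting B protocol can be summarized as one function. The `construct`, `train`, and `evaluate` callables are placeholders for the actual pipeline stages; only the data flow is meant literally:

```python
# Toy Setting-B protocol: build benchmark-schema data from unlabeled raw
# sources, merge it with the original training set, and score on the
# untouched official test split.
def setting_b(raw_sources, train_set, test_set, construct, train, evaluate):
    augmented = train_set + [s for src in raw_sources for s in construct(src)]
    model = train(augmented)          # train on original + constructed data
    return evaluate(model, test_set)  # evaluate on the official split only
```

The key constraint is that the official test split is never touched during construction or training, so any gain is attributable to the augmented data.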

Benchmarks

Source-linked benchmarks make reconstruction measurable.

We focus on benchmarks where the connection from benchmark sample back to raw source material is explicit enough to support reconstruction, alignment, or augmentation experiments.

Video QA

LongVideoBench

Raw videos, subtitles, and annotations make it suitable for long-context reconstruction.

Temporal

TempCompass

Temporal ordering and concurrency questions are well matched to structured reconstruction.

Video QA

Video-MME

Video source linkage is clear enough to test benchmark-style QA recovery from raw videos.

Motion

MotionBench

Its mapping and clip provenance design makes it a strong reconstruction-friendly reference.

Comparison Table

Results can be organized by benchmark and evaluation setting.

Numbers can be filled in later.
Each setting column reports four benchmarks: LongVideoBench, TempCompass, Video-MME, and MotionBench.

Method         | Family                          | Setting A: Benchmark Reconstruction | Setting B: Benchmark Augmentation
DataClaw       | Intent-driven data construction |                                     |
Qwen3.5        | Direct multimodal generation    |                                     |
Gemini         | Direct multimodal generation    |                                     |
GPT-4o         | Direct multimodal generation    |                                     |
InternVL       | General VLM baseline            |                                     |
Rule Pipeline  | Hand-crafted data synthesis     |                                     |