User-Intent-Driven Data Construction
DataClaw: Make ANY Data You Want
Turn raw inputs into user-intent-aligned, trainable multimodal data.
DataClaw is built around a simple premise: users do not want a fixed dataset recipe. They want the exact data form that matches their downstream goal. From videos, images, GUI traces, robot trajectories, and more, DataClaw infers intent and composes the right trainable output.
Raw Inputs
Video, image, GUI trace, robot demo, drawing sequence
Ordinary materials enter the system without being pre-shaped into a single dataset template.
User Intent
What kind of data does the user actually want to build from this material?
Constructed Outputs
Interleaved tutorials, temporal QA, long-horizon GUI tasks, world-model samples, editing data.
Method
Raw video and user intent are merged once, then dispatched by task type.
DataClaw first resolves user intent into typed instructions, then combines that instruction with the raw video context in one shared route before dispatching parallel synthesis models.
Raw Video
Full source video with actions, frames, and temporal evidence.
User Intent
Natural-language request that may still be broad or ambiguous.
User Intent Agent
Infer candidate task types and rewrite each candidate into a clear, synthesis-ready instruction.
Tutorial
Interleaved tutorial-style supervision.
Temporal
Ordering, localization, and state-change reasoning.
More Types
Additional task classes can be inferred from the same intent.
Tutorial Agent
Use the task-specific synthesis agent selected for tutorial supervision.
Temporal Agent
Use the task-specific synthesis agent selected for temporal reasoning data.
More Agents
Attach more specialized agents for other task types when needed.
Tutorial Data
Task description + frame-text labels.
Temporal Data
Task description + timestamped image-text labels.
More Task Data
Additional task description + multimodal labels for more task classes.
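The pipeline above (intent agent → typed instructions → per-task synthesis agents → task-specific records) can be sketched in a few lines. This is a minimal illustration only: the agent names, task types, and record fields are assumptions for the sketch, not DataClaw's actual API, and the intent agent is a keyword stub standing in for a real model.

```python
from dataclasses import dataclass, field

@dataclass
class TypedInstruction:
    task_type: str    # e.g. "tutorial" or "temporal"
    instruction: str  # synthesis-ready rewrite of the raw intent

@dataclass
class Record:
    task_type: str
    description: str
    labels: list = field(default_factory=list)

def intent_agent(raw_intent: str) -> list[TypedInstruction]:
    """Stub: infer candidate task types from a broad request.
    A real system would use an LLM; here we key on simple phrases."""
    out = []
    if "tutorial" in raw_intent or "step" in raw_intent:
        out.append(TypedInstruction("tutorial", "Produce interleaved frame-text tutorial steps."))
    if "order" in raw_intent or "when" in raw_intent:
        out.append(TypedInstruction("temporal", "Produce timestamped ordering/localization QA."))
    return out

def tutorial_agent(video: dict, inst: TypedInstruction) -> Record:
    # Tutorial data: task description + frame-text labels.
    return Record("tutorial", inst.instruction,
                  [(f["id"], f["caption"]) for f in video["frames"]])

def temporal_agent(video: dict, inst: TypedInstruction) -> Record:
    # Temporal data: task description + timestamped image-text labels.
    return Record("temporal", inst.instruction,
                  [(f["t"], f["caption"]) for f in video["frames"]])

AGENTS = {"tutorial": tutorial_agent, "temporal": temporal_agent}

def construct(video: dict, raw_intent: str) -> list[Record]:
    """Merge video + intent once, then dispatch per inferred task type."""
    return [AGENTS[ti.task_type](video, ti) for ti in intent_agent(raw_intent)]

video = {"frames": [{"id": 0, "t": 0.0, "caption": "open lid"},
                    {"id": 1, "t": 2.5, "caption": "pour water"}]}
records = construct(video, "make tutorial steps and tell me when things happen")
print([r.task_type for r in records])  # → ['tutorial', 'temporal']
```

The key design point the sketch mirrors is that the video/intent merge happens once, so adding a new task class only means registering one more agent in the dispatch table.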
Case Gallery
One system, many kinds of data users may ask for.
Evaluation
We test both benchmark reconstruction and benchmark augmentation.
The goal is not just to generate plausible outputs, but to reconstruct benchmark-style supervision from raw sources and then show that the newly constructed data improves benchmark performance after training.
Benchmark Reconstruction
Give the system raw materials, a user intent prompt, and target schema constraints. Then compare the constructed output against the official benchmark ground truth.
- raw source
- user intent
- schema-aligned output
- GT comparison
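The final step of this protocol, comparing schema-aligned output against official ground truth, can be sketched as a simple scoring function. The field names (`question`, `answer`) and exact-match metric are assumptions for illustration; real benchmarks use their own schemas and usually fuzzier matching (choice letters, spans).

```python
def reconstruction_accuracy(constructed, ground_truth):
    """Fraction of GT questions whose constructed answer matches exactly.
    A toy metric standing in for benchmark-specific comparison logic."""
    gt = {g["question"]: g["answer"] for g in ground_truth}
    hits = sum(1 for c in constructed if gt.get(c["question"]) == c["answer"])
    return hits / max(len(ground_truth), 1)

gt = [{"question": "What happens first?", "answer": "open lid"}]
ours = [{"question": "What happens first?", "answer": "open lid"}]
print(reconstruction_accuracy(ours, gt))  # → 1.0
```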
Benchmark Augmentation
Give the system in-distribution but unlabeled raw materials, construct new data in the benchmark schema, train with the augmented set, and evaluate on the official test split.
- unlabeled raw source
- new benchmark-style data
- train with augmentation
- test on official split
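The augmentation protocol reduces to a train-twice comparison: evaluate a model trained on the base set, then one trained on base plus constructed data, both on the official test split. The sketch below uses placeholder train/evaluate functions; nothing here is DataClaw's real training code.

```python
def augment_and_evaluate(base_train, new_data, test_split, train_fn, eval_fn):
    """Train once on the base set and once on base + constructed data,
    then report both scores on the same held-out official split."""
    baseline = eval_fn(train_fn(base_train), test_split)
    augmented = eval_fn(train_fn(base_train + new_data), test_split)
    return {"baseline": baseline, "augmented": augmented}

# Toy stand-ins: "training" memorizes QA pairs, evaluation is accuracy.
train_fn = lambda data: {q: a for q, a in data}
eval_fn = lambda model, test: sum(model.get(q) == a for q, a in test) / len(test)

base = [("q1", "a1")]
new = [("q2", "a2")]      # benchmark-style data built from unlabeled sources
test = [("q1", "a1"), ("q2", "a2")]
print(augment_and_evaluate(base, new, test, train_fn, eval_fn))
# → {'baseline': 0.5, 'augmented': 1.0}
```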
Benchmarks
Source-linked benchmarks make reconstruction measurable.
We focus on benchmarks where the connection from benchmark sample back to raw source material is explicit enough to support reconstruction, alignment, or augmentation experiments.
LongVideoBench
Raw videos, subtitles, and annotations make it suitable for long-context reconstruction.
TempCompass
Temporal ordering and concurrency questions are well matched to structured reconstruction.
Video-MME
Video source linkage is clear enough to test benchmark-style QA recovery from raw videos.
MotionBench
Its explicit clip-to-source provenance mapping makes it a strong reconstruction-friendly reference.
Comparison Table
Results can be organized by benchmark and evaluation setting.
Setting A = Benchmark Reconstruction; Setting B = Benchmark Augmentation.

| Method | Family | A: LongVideoBench | A: TempCompass | A: Video-MME | A: MotionBench | B: LongVideoBench | B: TempCompass | B: Video-MME | B: MotionBench |
|---|---|---|---|---|---|---|---|---|---|
| DataClaw | Intent-driven data construction | — | — | — | — | — | — | — | — |
| Qwen3.5 | Direct multimodal generation | — | — | — | — | — | — | — | — |
| Gemini | Direct multimodal generation | — | — | — | — | — | — | — | — |
| GPT-4o | Direct multimodal generation | — | — | — | — | — | — | — | — |
| InternVL | General VLM baseline | — | — | — | — | — | — | — | — |
| Rule Pipeline | Hand-crafted data synthesis | — | — | — | — | — | — | — | — |