Data-Centric Pipeline Efficiency and Theoretical Optimization of Model Internal Mechanisms
Advancement of Autonomous Agents and Multimodal Technologies for Specialized Domains

DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

https://arxiv.org/abs/2512.16676

DataFlow proposes a unified, scalable data preparation framework for LLMs, intended to replace ad hoc, script-based workflows. It supports modular data transformations through a PyTorch-style API and 200+ reusable operators, and introduces DataFlow-Agent, which automatically converts natural-language specifications into executable pipelines. Evaluated across text, mathematics, and code domains, data prepared with DataFlow is reported to outperform synthetic-data baselines and human-curated datasets on Text-to-SQL and code benchmarks, laying a foundation for reliable data-centric AI development.
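
To make the "PyTorch-style API with reusable operators" idea concrete, here is a minimal sketch of how composable data operators might look. This is not DataFlow's actual API: the class names (Operator, StripWhitespace, MinLengthFilter, Deduplicate, Pipeline) and the record format are hypothetical, chosen only to illustrate the pattern of small operators with a forward method chained into a pipeline, much as layers are chained in nn.Sequential.

```python
from typing import Dict, List

Record = Dict[str, str]


class Operator:
    """Hypothetical base class: one reusable transformation over a batch of records."""

    def forward(self, records: List[Record]) -> List[Record]:
        raise NotImplementedError

    def __call__(self, records: List[Record]) -> List[Record]:
        # Mirrors the PyTorch convention of calling a module to run forward().
        return self.forward(records)


class StripWhitespace(Operator):
    """Trim leading and trailing whitespace from one field."""

    def __init__(self, key: str = "text") -> None:
        self.key = key

    def forward(self, records: List[Record]) -> List[Record]:
        # Build new dicts so the input records are left unmodified.
        return [{**r, self.key: r[self.key].strip()} for r in records]


class MinLengthFilter(Operator):
    """Drop records whose field is shorter than min_chars characters."""

    def __init__(self, key: str = "text", min_chars: int = 10) -> None:
        self.key = key
        self.min_chars = min_chars

    def forward(self, records: List[Record]) -> List[Record]:
        return [r for r in records if len(r[self.key]) >= self.min_chars]


class Deduplicate(Operator):
    """Keep only the first record for each distinct field value."""

    def __init__(self, key: str = "text") -> None:
        self.key = key

    def forward(self, records: List[Record]) -> List[Record]:
        seen = set()
        unique = []
        for r in records:
            if r[self.key] not in seen:
                seen.add(r[self.key])
                unique.append(r)
        return unique


class Pipeline(Operator):
    """Apply operators in order; a pipeline is itself an operator, so it can be nested."""

    def __init__(self, *ops: Operator) -> None:
        self.ops = list(ops)

    def forward(self, records: List[Record]) -> List[Record]:
        for op in self.ops:
            records = op(records)
        return records


# Illustrative input: two near-duplicate SQL strings and one too-short record.
raw = [
    {"text": "  SELECT name FROM users;  "},
    {"text": "SELECT name FROM users;"},
    {"text": "ok"},
]

pipeline = Pipeline(StripWhitespace(), MinLengthFilter(min_chars=5), Deduplicate())
cleaned = pipeline(raw)
print(cleaned)
```

Because Pipeline subclasses Operator, a finished pipeline can be reused as a single step inside a larger one, which is the kind of modularity the summary attributes to the framework. An agent that generates pipelines from natural language would, under this sketch, only need to emit an ordered list of operator names and parameters.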