Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection

1HKU, 2SCUT, 3Kuaishou Technology, Kling team

Abstract

Recent advances in Text-to-Image (T2I) generative models, such as Imagen, Stable Diffusion, and FLUX, have led to remarkable improvements in visual quality. However, their performance is fundamentally limited by the quality of training data: web-crawled and synthetic image datasets often contain low-quality or redundant samples, which degrade visual fidelity, destabilize training, and waste computation. Effective data selection is therefore crucial for data efficiency. Existing approaches to text-image data filtering rely on costly manual curation or heuristic scoring based on single-dimensional features. Although meta-learning-based methods have been explored for LLMs, they have not been adapted to image modalities. To this end, we propose Alchemist, a meta-gradient-based framework that selects a suitable subset from large-scale text-image pairs. Our approach automatically learns to assess the influence of each sample by iteratively optimizing the model from a data-centric perspective. Alchemist consists of two key stages: data rating and data pruning. We first train a lightweight rater to estimate each sample's influence from gradient information, enhanced with multi-granularity perception. We then apply the Shift-Gsample strategy to select informative subsets for efficient model training. Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for Text-to-Image model training. Experiments on both synthetic and web-crawled datasets show that Alchemist consistently improves visual quality and downstream performance: training on an Alchemist-selected 50% of the data can outperform training on the full dataset.
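The core meta-gradient idea can be illustrated with a small toy sketch (our own example in the spirit of learning-to-reweight methods, not the Alchemist implementation): a proxy model takes one weighted gradient step, and each sample's weight is then updated by differentiating a clean validation loss through that step, so samples whose gradients hurt validation are downweighted. All names and hyperparameters below are illustrative.

```python
# Toy sketch of meta-gradient sample reweighting (NOT the authors' code).
# A linear model is trained with per-sample weights; the weights are updated
# by backpropagating a clean validation loss through one inner training step.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
true_theta = rng.normal(size=d)
y = X @ true_theta
corrupt = np.arange(n) < 50
y[corrupt] = -(X[corrupt] @ true_theta)   # systematically wrong targets

Xv = rng.normal(size=(40, d))             # small clean validation set
yv = Xv @ true_theta

theta = np.zeros(d)
w = np.zeros(n)                           # logits; sigmoid -> sample weight
lr, meta_lr = 0.1, 1.0

for _ in range(300):
    sw = 1.0 / (1.0 + np.exp(-w))
    per_grads = X * ((X @ theta - y)[:, None])            # per-sample grad of l_i
    theta_new = theta - lr * (sw[:, None] * per_grads).mean(axis=0)

    # Meta-gradient: dL_val(theta_new)/dw_i via the chain rule through the
    # inner update (theta_new depends linearly on each sample weight sw_i).
    val_grad = Xv.T @ (Xv @ theta_new - yv) / len(yv)
    dL_dsw = -lr * (per_grads @ val_grad) / n
    w -= meta_lr * dL_dsw * sw * (1.0 - sw)
    theta = theta_new

# Corrupted samples should end up with lower influence scores on average.
print(w[~corrupt].mean() > w[corrupt].mean())
```

Samples with wrong targets produce gradients that oppose the validation gradient, so the meta-update consistently lowers their weights; this is the same influence-estimation principle that Alchemist's rater learns at scale from a T2I proxy model.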



Overall pipeline of Alchemist. In the initial data rating stage (a), the rater predicts a classification score for each image based on gradient information extracted from a T2I proxy model; the rater and the proxy model are jointly optimized through a weighted loss and a total loss. In the data pruning stage (b), the Shift-Gsample strategy efficiently retains informative samples while filtering out redundant data and outliers. The resulting Alchemist-selected dataset enables highly efficient training of downstream text-to-image models.



LAION-30M

| Method | #Params | #Images | Training Time (h) | MJHQ-30K FID ↓ | MJHQ-30K CLIP-Score ↑ | GenEval ↑ |
|---|---|---|---|---|---|---|
| Full | 0.3B | 30M | 65.34 | 17.48 | 0.2336 | 0.2752 |
| Random | 0.3B | 15M | 34.60 | 19.70 | 0.2220 | 0.2632 |
| Aesthetic | 0.3B | 15M | 34.60 | 17.36 | 0.2299 | 0.2604 |
| Clarity | 0.3B | 15M | 34.60 | 17.85 | 0.2261 | 0.2251 |
| Frequency | 0.3B | 15M | 34.60 | 18.77 | 0.2276 | 0.2519 |
| Edge-density | 0.3B | 15M | 34.60 | 20.13 | 0.2240 | 0.2429 |
| Ours-small | 0.3B | 6M | 13.08 | 18.22 | 0.2277 | 0.2367 |
| Ours | 0.3B | 15M | 34.60 | 16.20 | 0.2325 | 0.2645 |

Image distribution of the Alchemist-selected subset.



Performance curves over training epochs.



FLUX LoRA fine-tuning.



We develop a meta-gradient-based framework that automatically selects the most informative text–image samples to improve T2I model training quality and efficiency.

This example is generated by Kling AI 2.5 Turbo.


BibTeX

@article{ding2025alchemist,
      title={Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection},
      author={Ding, Kaixin and Zhou, Yang and Chen, Xi and Yang, Miao and Ou, Jiarong and Chen, Rui and Tao, Xin and Zhao, Hengshuang},
      journal={arXiv preprint arXiv:2512.16905},
      year={2025},
}