Code Concepts: A Large-Scale Synthetic Dataset Generated from Programming Concept Seeds

https://huggingface.co/blog/nvidia/synthetic-code-concepts (huggingface.co)

A concept-driven workflow for synthetic data generation was designed to improve specific LLM capabilities, addressing the lack of targeted content in large pretraining datasets. This approach was used to create "Code Concepts," a dataset of 15 million synthetic Python programming problems derived from a curated taxonomy of programming knowledge. Including a portion of this dataset in the final pretraining phase of the Nemotron-Nano-v3 model resulted in a six-point performance gain on the HumanEval benchmark. The dataset and its underlying taxonomy are being released to enable the community to apply this method to other domains for more targeted LLM training.
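
As a rough illustration of what "concept seeds" could look like in practice, the sketch below samples concept combinations from a tiny taxonomy and turns each into a generation prompt. It is not the pipeline from the linked post: the taxonomy entries, the `sample_concepts` and `build_prompt` helpers, and the prompt wording are all hypothetical, and the step of sending prompts to a generator model is left out.

```python
import random

# Hypothetical mini-taxonomy for illustration only; the released taxonomy
# is far larger and its actual categories are not reproduced here.
TAXONOMY = {
    "data structures": ["lists", "dictionaries", "sets"],
    "algorithms": ["sorting", "recursion", "binary search"],
    "language features": ["generators", "decorators", "context managers"],
}


def sample_concepts(taxonomy, k, rng):
    """Pick k distinct (category, concept) pairs to seed one problem."""
    pairs = [(cat, c) for cat, concepts in taxonomy.items() for c in concepts]
    return rng.sample(pairs, k)


def build_prompt(seeds):
    """Turn seed concepts into an instruction for a generator LLM (assumed wording)."""
    listed = ", ".join(f"{concept} ({category})" for category, concept in seeds)
    return (
        "Write a self-contained Python programming problem with a reference "
        f"solution that exercises these concepts: {listed}."
    )


# A fixed seed keeps the sampled combinations reproducible across runs.
rng = random.Random(0)
seed_sets = [sample_concepts(TAXONOMY, 2, rng) for _ in range(3)]
prompts = [build_prompt(seeds) for seeds in seed_sets]

for prompt in prompts:
    print(prompt)
```

Sampling combinations rather than single concepts is one simple way to get many distinct problems out of a small taxonomy; whether the original workflow combines concepts this way is an assumption.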

0 points • by chrisf • 4 months ago
