Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset
Dan Su | Kezhi Kong | Ying Lin | Joseph Jennings | Brandon Norick | Markus Kliegl | Mostofa Patwary | Mohammad Shoeybi | Bryan Catanzaro
Paper Details:
Month: July
Year: 2025
Location: Vienna, Austria
Venue: ACL
Citations: No Citations Yet
URL:
https://data.commoncrawl.org/contrib/
https://commoncrawl.org/
https://data.commoncrawl.org/contrib/Nemotron/
https://github.com/NVIDIA/NeMo-Curator
https://huggingface.co/nvidia/nemocurator-
https://pypi.org/project/pycld2/
https://fasttext.cc/docs/en/language-
https://github.com/google-research/
https://mistral.ai/news/mixtral-8x22b/
https://mistral.ai/news/mistral-nemo
https://github.com/NVIDIA/TensorRT-LLM
https://github.com/NVIDIA/NeMo-Skills
https://github.com/NVIDIA/Megatron-LM
https://github.com/EleutherAI/lm-evaluation-
https://huggingface.co/datasets/
https://huggingface.co/spaces/
https://huggingface.co/datasets/
Field Of Study