Key Highlights
* Red Pajama 2 is an open-source dataset containing 30 trillion tokens, making it the largest public dataset for language model pre-training.
* The dataset includes over 100 billion text documents from 84 CommonCrawl snapshots, covering English, German, French, Italian, and Spanish.
* Red Pajama 2 provides high-quality