Pretraining Data

Common Crawl, deduplication, quality filtering, domain mixing ratios, and data scaling laws.