Data Curation at Scale
Perplexity filtering, MinHash deduplication, domain classifiers, and toxic-content removal pipelines.