Common Crawl web data

January 29, 2024

Chatted with an AI vendor last week and they mentioned this public dataset. I hadn't heard of it before, but apparently it underpins a lot of the public training data that goes into foundation models: https://commoncrawl.org/ and https://commoncrawl.org/get-started