Common Crawl
Making web crawl data accessible and analyzable for everyone.
A non-profit initiative that builds and maintains a free, open repository of web crawl data, accessible to anyone and widely used by researchers. The corpus holds over 240 billion pages spanning 15 years. It is also a primary training corpus for many LLMs and has been cited in over 8,000 research papers.
Category