Don’t these crawlers save some kind of metadata before fully committing the data to their databases? Surely they’d be able to see that a specific domain served nothing but garbage (and/or that it’s so “basic”), and then blacklist/purge it? Or are the AI crawlers even dumber than I’d imagine?
I’d be surprised if anything crawled from a site using iocaine actually made it into an LLM training set. GPT-3’s initial crawl of roughly 45 terabytes was filtered down to about 570 GB of text that was actually trained on. So yeah, there’s a lot of filtering/processing that takes place between crawl and train. Then again, they seem to have failed entirely to clean the reddit data they fed into Gemini, so /shrug
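For what it’s worth, a lot of that crawl-to-train cleanup is boring heuristics before any model-based scoring. Here’s a toy sketch of the kind of heuristic document filter I mean — the thresholds and stop-word list are made up for illustration, not what any lab actually uses:

```python
# Toy sketch of a heuristic quality filter that might sit between crawl and
# train. Thresholds are illustrative assumptions, not any real pipeline's.
import re

STOP_WORDS = {"the", "be", "to", "of", "and", "a", "in", "that", "have", "it"}

def looks_like_real_text(doc: str) -> bool:
    words = re.findall(r"[a-zA-Z']+", doc.lower())
    if len(words) < 50:                      # too short to bother keeping
        return False
    mean_len = sum(map(len, words)) / len(words)
    if not 3 <= mean_len <= 10:              # gibberish tends to skew word length
        return False
    stop_ratio = sum(w in STOP_WORDS for w in words) / len(words)
    if stop_ratio < 0.02:                    # natural English has plenty of stop words
        return False
    # Markov-chain garbage is locally fluent, so real pipelines add dedup,
    # repetition checks, and classifier/perplexity scoring on top of this.
    return True
```

Something as locally fluent as iocaine’s output would likely sail past the cheap checks and only get caught (if at all) by the later, more expensive stages.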