Open data sets are fantastic, but some problems require very diverse, broad-spectrum content that hasn't been gathered and vetted, for various reasons. Sometimes it's not enough to rely on search engines alone to find data, especially if the content is more prevalent on the deep/dark web or if you can't think of every possible thing to look for. You can write scripts to gather data all day, but jobs this big seem to need big tools and services for data acquisition. What does your team use?
You'd be surprised how much manual tagging of data goes on in the industry. Hire a bunch of workers paid $5/hour in another country, have each item tagged 3x to validate the labels, and you've got a very reliable, large, and relatively affordable supervised learning training set.
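The 3x redundancy described above boils down to majority voting over each item's labels; items with no clear majority get kicked back for review. A minimal sketch (the `annotations` data is hypothetical):

```python
from collections import Counter

def majority_label(labels):
    """Return the label chosen by a strict majority of annotators,
    or None if no label wins more than half the votes."""
    (top, top_count), = Counter(labels).most_common(1)
    return top if top_count > len(labels) / 2 else None

# Three annotators per item; unresolved items go back for re-annotation.
annotations = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "cat", "bird"],
}
resolved = {item: majority_label(tags) for item, tags in annotations.items()}
# "img_001" resolves to "cat"; "img_002" has no majority and returns None
```

In practice you'd also track per-annotator agreement rates so you can weight or drop unreliable workers, but simple majority vote is the core of the validation step.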
☝️