In 2019, the Ninth Circuit ruled on the dispute between LinkedIn and hiQ over hiQ scraping LinkedIn data to build its own analytics tools. The case ultimately yielded the decision that scraping publicly available information is legal. That decision withstood several appeals until late 2022, when LinkedIn secured a permanent injunction against hiQ (1), though without meaningful changes to the precedent the Ninth Circuit set on the legality of web scraping.
More recently, scraping has been in the crosshairs as a vector for privacy violations. In late 2022 (coinciding with a significant settlement (6) against two companies that were scraping data from behind Meta's login wall on Facebook/IG), Meta was fined about $275M in the EU for GDPR violations. The violations stemmed from scraped Facebook data being collated and sold online to identify and target Facebook users with ads, unsolicited sales of goods, etc. (7). Interestingly, this fine had to be pursued as a GDPR violation because web scraping (data mining) of copyrighted materials is technically legal in the EU if the express purpose is generating information (which would include training AI models) (5). However, it seems that even the 2019 EU decision that allows this does not completely reverse or nullify the Belgian Database Act of 1998 (10), which effectively protects the effort made to collate and normalize data into a database (arguably a generic descriptor for the foundation of every web page on the internet). In fact, the 2019 decision focuses on copyright law specifically, which still leaves legal recourse for companies that can prove damages from scraping.
Now there's a class action against Microsoft/GitHub/OpenAI over Copilot (3), several classes being formed against OpenAI (2, 8), Meta filing suit claiming damages against Bright Data for scraping public profile information from IG (9), Getty suing Stability AI for using Getty images to train its Stable Diffusion model (4)… the list goes on. The commonality is that several of these cases were not summarily dismissed by the judge, which seems to challenge the precedent set by the Ninth Circuit in hiQ v. LinkedIn. The burning question is: will decisions be made that reverse or meaningfully change EU/US law around web scraping? Can these companies prove damages on a scale where major precedents are set (and reset)? And how will this impact the recent glut of newly minted AI companies, let alone companies that are slowly developing a strong reliance on AI models that may have been trained on scraped data? It's too early to tell for sure, but I suspect there are substantive changes to data mining/web scraping legislation on the horizon.
References:
1) https://www.fbm.com/publications/what-recent-rulings-in-hiq-v-linkedin-and-other-cases-say-about-the-legality-of-data-scraping/
2) https://cyberscoop.com/openai-lawsuit-privacy-data-scraping/
3) https://www.theregister.com/2023/05/12/github_microsoft_openai_copilot/
4) https://www.theverge.com/2023/2/6/23587393/ai-art-copyright-lawsuit-getty-images-stable-diffusion
5) https://blog.apify.com/is-web-scraping-legal/
6) https://techcrunch.com/2022/10/03/meta-settles-lawsuit-for-significant-sum-against-businesses-scraping-facebook-and-instagram-data/amp/
7) https://techcrunch.com/2022/11/28/facebook-gdpr-penalty/amp/
8) https://www.searchenginejournal.com/chatgpt-creator-faces-multiple-lawsuits-over-copyright-privacy-violations/490686/
9) https://www.theregister.com/AMP/2023/02/02/meta_web_scraping/
10) https://www.lexology.com/library/detail.aspx?g=aa5eb784-32b4-4bf3-8406-823f32c6844f
I believe it should remain legal, because all these companies are already profiting from the data generated by their users. They are not paying the users for sharing their personal behavioural data. And the value is in the insights the companies derive by analysing this data, which web scraping can't touch. So let's hope scraping remains legal.
If it’s in the best interest of these companies to keep that data private, I doubt it will remain legal for long.
Even if it remains legal, I would count on companies increasingly putting content behind a login/pay gate. Especially as AI training continues to ramp up, they'll see access to data as another revenue stream.
It's already too late to stop the most important aspect of scraping: the LLMs. The other aspects can be engineered around.
With companies limiting or paywalling APIs, scrapechads will have more work than ever before 😎
The real question is: does LinkedIn get to own the data generated by its users? If LinkedIn can use its users' data, why not other companies? We don't need intermediaries for user data. If you argue that scraping shouldn't be allowed, then the companies need to be held back significantly as well. In that case, let user data be portable and make everyone access it with limited per-use provision. If not, let all the companies access that data. I personally prefer the former.
If companies aren't made to compensate those whose data they've scraped, prepare for everything to be paywalled (it's already started happening). Bye bye, free and open internet.
Great post!
Tldr?
Scraping publicly available information is legal, but a bunch of criminal and civil lawsuits that challenge that precedent have moved to trial, which means web scraping laws have a strong possibility of becoming more restrictive.