Tech Industry Jul 9, 2023
mBgY42

The future of web scraping

In 2019, the Ninth Circuit ruled in hiQ v. LinkedIn, a dispute over hiQ scraping public LinkedIn profile data to build its workforce analytics products. The ruling was widely read as establishing that scraping publicly available information is legal, and it withstood several appeals until late 2022, when LinkedIn secured a permanent injunction against hiQ (1), though without meaningful changes to the Ninth Circuit precedent on the legality of web scraping.

More recently, scraping has been in the crosshairs as a vector for privacy violations. In late 2022, coinciding with a significant settlement (6) against two companies that were scraping data from behind Meta's login wall on Facebook/Instagram, Meta was fined roughly $275M under the EU's GDPR for violations that stemmed from scraped Facebook data being collated and sold online to identify and target Facebook users with ads, unsolicited sales of goods, etc. (7).

Interestingly, this fine had to be pursued as a GDPR violation because web scraping (data mining) of copyrighted materials is technically legal in the EU if it is done for the express purpose of generating information, which would cover training AI models (5). However, it seems that even the 2019 EU decision behind this exception does not completely reverse or nullify the Belgian Database Act of 1998 (10), which effectively protects the effort made to collate and normalize data into a database (arguably a generic descriptor for the foundation of every web page on the internet). In fact, the 2019 decision focuses on copyright law specifically, which still leaves legal recourse for companies that can prove damages from scraping.
Now, there's a class action against Microsoft/GitHub/OpenAI over Copilot (3), several classes being formed against OpenAI (2, 8), Meta filing suit claiming damages against Bright Data for scraping public profile information from Instagram (9), Getty suing Stability AI for using Getty images to train its Stable Diffusion generative model (4)... the list goes on. The commonality is that several of these cases were not summarily dismissed by the judge, which seems to challenge the precedent set by the Ninth Circuit in hiQ v. LinkedIn.

The burning question is: will decisions be made to reverse or meaningfully change EU/US law around web scraping? Can these companies prove damages on a scale where major precedents are set (and reset), and how will this impact the recent glut of newly minted AI companies, let alone companies that are slowly developing a strong reliance on AI models that may have been trained on scraped data? It's too early to tell for sure, but I suspect substantive changes in data mining/web scraping legislation are on the horizon.

References:
1) What Recent Rulings in 'hiQ v. LinkedIn' and Other Cases Say About the Legality of Data Scraping (Farella Braun + Martel LLP): https://www.fbm.com/publications/what-recent-rulings-in-hiq-v-linkedin-and-other-cases-say-about-the-legality-of-data-scraping/
2) OpenAI lawsuit reignites privacy debate over data scraping (CyberScoop): https://cyberscoop.com/openai-lawsuit-privacy-data-scraping/
3) GitHub and OpenAI fail to wriggle out of Copilot lawsuit (The Register): https://www.theregister.com/2023/05/12/github_microsoft_openai_copilot/
4) Getty Images sues AI art generator Stable Diffusion in the US for copyright infringement (The Verge): https://www.theverge.com/2023/2/6/23587393/ai-art-copyright-lawsuit-getty-images-stable-diffusion
5) Is web scraping legal? Yes, if you know the rules. (Apify Blog): https://blog.apify.com/is-web-scraping-legal/
6) Meta settles lawsuit for 'significant' sum against businesses scraping Facebook and Instagram data (TechCrunch): https://techcrunch.com/2022/10/03/meta-settles-lawsuit-for-significant-sum-against-businesses-scraping-facebook-and-instagram-data/amp/
7) Meta hit with ~$275M GDPR penalty for Facebook data-scraping breach (TechCrunch): https://techcrunch.com/2022/11/28/facebook-gdpr-penalty/amp/
8) ChatGPT Creator Faces Multiple Lawsuits Over Copyright & Privacy Violations (Search Engine Journal): https://www.searchenginejournal.com/chatgpt-creator-faces-multiple-lawsuits-over-copyright-privacy-violations/490686/
9) Meta, which pays for web scraping, sues to stop web scraping (The Register): https://www.theregister.com/AMP/2023/02/02/meta_web_scraping/
10) Recent ECJ decision confirms: be careful with online scrapers and crawlers, they are often illegal (Sirius Legal): https://www.lexology.com/library/detail.aspx?g=aa5eb784-32b4-4bf3-8406-823f32c6844f

ADP iGot5s Jul 9, 2023

TL;DR?

mBgY42 OP Jul 9, 2023

Scraping publicly available information is legal, but a bunch of criminal and civil lawsuits that challenge that precedent have been moved to trial, which means that web scraping laws have a strong possibility of becoming more restrictive.

Microsoft starl1ght Jul 9, 2023

I believe it should remain legal, because all the companies are gaining profits from the data generated by their users anyway. They are not paying the users for sharing their personal behavioural data. And the value is in the insights that the companies derive by analysing this data, and web scraping can't touch that info. So let's hope scraping remains legal.

JPMorgan Chase vNDX33 Jul 9, 2023

If it’s in the best interest of these companies to keep that data private, I doubt it will remain legal for long.

Confluent NP≠P Jul 9, 2023

Even if it remains legal, I would count on companies increasingly putting content behind a login/pay gate. Especially as AI training continues to ramp up, they'll see access to data as another revenue stream.

True Fit gptuser Jul 9, 2023

It's already too late to stop the most important aspect of scraping: the LLM. The other aspects can be engineered around.

Amazon pipulus Jul 9, 2023

With companies limiting or paywalling APIs, scrapechads will have more work than ever before 😎

Meta (hope) Jul 9, 2023

The real question is: does LinkedIn get to own the data generated by its users? If LinkedIn can use its users' data, why not some other companies? We don't need intermediaries to user data. If you argue that scraping shouldn't be allowed, then the companies need to be held back significantly as well. In that case, let user data be portable and make everyone access it with a limited per-use provision. If not, let all the companies access that data. I personally prefer the former.

Activision Blizzard RenfieldD Jul 10, 2023

If companies aren't made to compensate those whose data they've scraped, prepare for everything to be paywalled (it's already started happening). Bye bye, free and open internet.

Target failing_up Jul 11, 2023

Great post!