Misc. Jul 12, 2019
New TechLeed

Scraping bots DDoSing website

I'm at a smallish company that is running into issues with someone scraping all of the products on our store. At this point it's a huge part of our bandwidth, and we've tried a lot of different tactics to block them, but they always manage to bypass whatever we put in place. We wouldn't care if they were decent about it and didn't double or triple our regular traffic. reCAPTCHA doesn't work since it's easily defeated by Google's own voice-to-text. We've checked everything their headers send and it's almost impossible to identify them that way. We've had some luck blocking them by IP, but some of our SaaS products run on AWS... I wanted to ask, as a plea for help, whether there is something we could do to stop them. The CTO isn't keen on adding Cloudflare to any part of our stack for security reasons (idk why?). She's also throwing around that we avoided being down during their latest outage as proof she's right. I've been trying to find something for months and I'm out of ideas. Any cyber security people got any tips?

Apple MyYO05 Jul 12, 2019

If you can’t figure it out, use cloudflare.

New TechLeed OP Jul 12, 2019

I literally just said that they will not consider it. They saw the outage as a complete win for them and will not be convinced. Trust me, I'm frustrated enough having to work on this shit that Cloudflare is the cheaper option.

Apple MyYO05 Jul 12, 2019

Guess you're SOL. CTO will fire you and hire a replacement who will convince them to use Cloudflare.

Google gthanks Jul 12, 2019

Perhaps you could just publish a feed with all your data so they wouldn't have to scrape

New TechLeed OP Jul 12, 2019

I've also proposed this and the product team has already expressed that they don't want to release the information to protect our competitive edge.

Amazon senat Jul 12, 2019

They would probably scrape anyway

F5 Networks <- Jul 12, 2019

Guessing this might be more expensive than Cloudflare, but it's one of the best WAFs on the market.

New TechLeed OP Jul 12, 2019

Thanks I'll take a look but the company wants to keep the data out of the hands of cloudflare (they don't like the MITM approach to SSL). This looks promising, hoping it's not prohibitive in cost but the bandwidth costs are already ludicrous

Verisk Analytics pinkfloyd🎸 Jul 12, 2019

Check out AWS Shield (https://aws.amazon.com/answers/networking/aws-ddos-attack-mitigation/). They had a keynote session on this yesterday at the AWS Summit in NYC.

New TechLeed OP Jul 12, 2019

We're not currently hosted on AWS but I'll definitely take a look. It's not necessarily a malicious DDoS (I don't think it is; we're not down, it's just expensive...), so I don't know if Shield would be right.

Apple Tim cock Jul 12, 2019

This is the main reason people use cloudflare

New TechLeed OP Jul 12, 2019

Alright if you want to take the challenge of convincing a brick wall that she's wrong when she just can't be wrong in any capacity go ahead. I've tried for literal fucking months. Cloudflare won't be a solution with the current management

Apple Tim cock Jul 12, 2019

Other options are Akamai, F5, etc. Get off blind and do your job lol.

T-Mobile nth Jul 12, 2019

Just add rate limits to your APIs, and issue client IDs or require authentication for your APIs if possible. Most firewalls/gateways detect bot activity.
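For anyone wondering what the rate limit part could look like, here is a minimal token-bucket sketch. The `RateLimiter` and `TokenBucket` names, the rate, and the choice of key are all assumptions for illustration, not anything from OP's stack. It keys on whatever identifier is available (client ID, session token, IP); keying on IP alone is weaker against a client that rotates addresses.

```python
import time


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so a new client can burst
        self.updated = now

    def allow(self, now):
        # Refill in proportion to the time elapsed since the last request.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """One bucket per client key (hypothetical: client ID, session token, or IP)."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, client_key, now=None):
        now = time.monotonic() if now is None else now
        bucket = self.buckets.get(client_key)
        if bucket is None:
            bucket = self.buckets[client_key] = TokenBucket(self.rate, self.capacity, now)
        return bucket.allow(now)


if __name__ == "__main__":
    limiter = RateLimiter(rate=1.0, capacity=3.0)
    for i in range(5):
        print(i, limiter.allow("client-123"))
```

In a real deployment the buckets would probably live in a shared store rather than in process memory, and old keys would need to be evicted.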

New TechLeed OP Jul 12, 2019

It's not an API, it's a storefront. I wish I could rate limit, but they just change their IP and scan from elsewhere. It's never the same IPs.

T-Mobile nth Jul 12, 2019

Insert hidden gibberish in your DOM which would be visible if they don’t have your styles. Dynamic classes too. Their parsing logic would need to be non-trivial then, just to increase the challenge.
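A rough sketch of the decoy idea, assuming the page is rendered server-side and that a price is the value being protected. `render_price` and the class naming scheme are made up for illustration. Each response gets fresh random class names, and only the CSS says which span is real.

```python
import html
import secrets


def render_price(price, decoys):
    """Return (css, markup): the real price mixed with decoy values hidden by CSS."""
    real_cls = "c" + secrets.token_hex(4)
    fake_cls = "c" + secrets.token_hex(4)
    while fake_cls == real_cls:  # vanishingly unlikely, but keep the classes distinct
        fake_cls = "c" + secrets.token_hex(4)

    spans = [(fake_cls, d) for d in decoys] + [(real_cls, price)]
    secrets.SystemRandom().shuffle(spans)  # position in the DOM gives nothing away

    # Only the decoy class is hidden; a parser that ignores styles sees every value.
    css = f".{fake_cls}{{display:none}}"
    markup = "".join(
        f'<span class="{cls}">{html.escape(text)}</span>' for cls, text in spans
    )
    return css, markup


if __name__ == "__main__":
    css, markup = render_price("19.99", ["24.99", "17.49"])
    print(css)
    print(markup)
```

A scraper that applies the stylesheet can still recover the real value, so this only raises the cost of parsing.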

Walmart.com whoooooooo Jul 12, 2019

There are different kinds of captcha (some require you to identify parts of an image, others use math problems). Have none of them worked?

New TechLeed OP Jul 12, 2019

reCAPTCHA allows solving with audio for accessibility, and we could be held liable if we blocked the sight impaired... The v3 reCAPTCHA works for forms but not for general traffic.

Walmart.com whoooooooo Jul 12, 2019

This is an interesting problem!! 🤔 Pls do let us know how you solved it eventually. :)

Google emc2too Jul 12, 2019

Did you try traps that only a robot would follow and humans would not? If they get there, you block them soon after, but don't make it obvious what the trap was.
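One way to read the trap suggestion, as a sketch: put a link on the page that is hidden from humans, and flag any client that requests it. The trap path, the `HoneypotTracker` name, and the ten-minute delay are assumptions for illustration. The block is delayed so the client cannot easily tell which request gave it away.

```python
import time

# Hypothetical trap URL: linked from the page but hidden from human visitors.
TRAP_PATHS = {"/catalog/all-items-print"}


class HoneypotTracker:
    def __init__(self, block_delay=600.0):
        self.block_delay = block_delay  # seconds between the trap hit and the block
        self.flagged = {}               # client key -> time of first trap hit

    def observe(self, client_key, path, now=None):
        """Return True if the request should be served, False if it should be blocked."""
        now = time.monotonic() if now is None else now
        if path in TRAP_PATHS:
            # Remember only the first hit so the delay is not reset by later hits.
            self.flagged.setdefault(client_key, now)
        first_hit = self.flagged.get(client_key)
        if first_hit is None:
            return True
        # Keep serving for a while after the hit, then start blocking.
        return now - first_hit < self.block_delay


if __name__ == "__main__":
    tracker = HoneypotTracker(block_delay=60)
    print(tracker.observe("client-a", "/catalog/all-items-print", now=0))  # still served
    print(tracker.observe("client-a", "/products/1", now=120))             # now blocked
```

Legitimate crawlers would need the trap path disallowed in robots.txt so they are not caught by it.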

New TechLeed OP Jul 12, 2019

What do you mean, like lead them into believing something is working then cut them off? They don't seem to be executing JS because our API doesn't see these requests it's just the scraped pages

Google gthanks Jul 12, 2019

Just set some hidden field or cookie with JS or something, and drop requests that don't have it set. Actually, why is that not happening already, since you said you have reCAPTCHA, which requires JS? Are you sure you integrated reCAPTCHA correctly?
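A minimal sketch of the cookie check, assuming the server hands out a signed timestamp that a small script writes into a cookie, and that later requests without a valid cookie are dropped. `SECRET_KEY`, `issue_token`, and `is_valid_token` are hypothetical names, and the one-hour lifetime is arbitrary.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"replace-with-a-random-secret"  # hypothetical; keep it out of source control


def issue_token(now=None):
    """Build the value that the page's script would store in a cookie."""
    ts = str(int(time.time() if now is None else now))
    sig = hmac.new(SECRET_KEY, ts.encode(), hashlib.sha256).hexdigest()
    return f"{ts}.{sig}"


def is_valid_token(token, max_age=3600, now=None):
    """Check the cookie value: well-formed, signed by us, and not expired."""
    now = time.time() if now is None else now
    ts, sep, sig = (token or "").partition(".")
    if not sep or not (ts.isascii() and ts.isdigit()):
        return False
    expected = hmac.new(SECRET_KEY, ts.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes avoids timing leaks and non-ASCII errors.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    return 0 <= now - int(ts) <= max_age


if __name__ == "__main__":
    token = issue_token()
    print(is_valid_token(token))       # a client that ran the script
    print(is_valid_token(None))        # a client that never set the cookie
```

A scraper can still copy a valid cookie from a real browser session, so the token lifetime should be short, and search crawlers that don't run JS would need an allowlist.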

Google gthanks Jul 12, 2019

reCAPTCHA gives you a score; you could tighten up the threshold at the expense of annoying your real users. Instead of blocking IPs you could block whole ASNs and geos. It's quite a bit harder to find a new data center than a new IP, unless they are scraping from a botnet of consumer machines. For the latter, maybe score IPs against email spam blacklists.

New TechLeed OP Jul 12, 2019

They never get served the captcha since I can't figure out how to identify them. Their IP range is sometimes on AWS, sometimes on GCP, DigitalOcean, Linode, and some random other IPs. How do you figure out their ASN? I really hope it's not some IoT botnet combined with servers.

Google gthanks Jul 12, 2019

Look it up based on IP. You can easily download an ASN<->IP mapping online; it's a small and relatively static database. Few legitimate users surf from cloud IPs, just VPNs and bots, so it's probably OK to ban them all.
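A sketch of the lookup, assuming the downloaded mapping has been loaded into a prefix-to-ASN table. The prefixes below are documentation ranges and the ASNs are private-use numbers, so every value here is a placeholder, not real data.

```python
import ipaddress

# Placeholder data: documentation prefixes mapped to private-use ASNs.
PREFIX_TO_ASN = {
    "192.0.2.0/24": 64512,
    "198.51.100.0/24": 64513,
    "203.0.113.0/25": 64514,
}
BLOCKED_ASNS = {64512, 64514}  # hypothetical hosting providers to ban

# Most specific prefixes first, so the longest match wins.
NETWORKS = sorted(
    ((ipaddress.ip_network(prefix), asn) for prefix, asn in PREFIX_TO_ASN.items()),
    key=lambda item: item[0].prefixlen,
    reverse=True,
)


def lookup_asn(ip):
    """Return the ASN announcing `ip`, or None if no prefix matches."""
    addr = ipaddress.ip_address(ip)
    for network, asn in NETWORKS:
        if addr.version == network.version and addr in network:
            return asn
    return None


def is_blocked(ip):
    return lookup_asn(ip) in BLOCKED_ASNS


if __name__ == "__main__":
    for ip in ("192.0.2.10", "198.51.100.7", "203.0.113.200"):
        print(ip, lookup_asn(ip), is_blocked(ip))
```

The linear scan is fine for a handful of prefixes; a full table would want a radix tree or a sorted-interval lookup.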

Charter Mæstro Jul 12, 2019

Is the products page gated with a login page?

New TechLeed OP Jul 12, 2019

It is not. For SEO and business reasons, I don't think I could convince product to add one.

Charter Mæstro Jul 12, 2019

Sure you could. It will yield MUCH better marketing data, generate leads, all that good stuff. Idk if you have commercial or carrier customers, but imo that would be the way to go. Allow them to opt out of communication post-validation and then you're good. At worst you will have another lead and make it more of a pain in the ass for them to mine. The best deterrent for hackers is effort, until your company does something to personally piss someone off.