(Replying to PARENT post)
Also poisoning only works for a while. As soon as they detect the poisoning, they can easily figure out what tripped the scraping detection, and now you need to poison in an even more subtle way because the scrapers know what to look for.
You can't win this game. Especially the "obvious" type of browser checks, where you can tell the JS is inspecting your browser, are so easy to circumvent, because you can see exactly what they're checking.
Really though, I've reversed a few fingerprinting libs in the wild and also looked at some countermeasures that are being sold in the blackhat world. Both sides completely and utterly suck.
The fingerprinting stuff is easy to circumvent if you're willing to reverse a bit of minified JS, and the browser automation 'market leader?' is comically bad: it took me 15 minutes to reliably detect it.
That browser/profile/fingerprint automation thingy, with its default options, sometimes uses a Firefox fingerprint while actually driving Chrome. Protip: Chrome and Firefox send HTTP headers in a different order. Detected, passive, without JS.
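To sketch the kind of check I mean, a passive comparison of the claimed browser against the observed header order (the 'typical' orderings below are my own illustrative assumptions, not authoritative lists):

    # Passive consistency check: does the header order match the browser
    # claimed in the User-Agent? The orderings here are illustrative only.
    TYPICAL_ORDER = {
        "chrome":  ["host", "connection", "user-agent", "accept", "accept-encoding", "accept-language"],
        "firefox": ["host", "user-agent", "accept", "accept-language", "accept-encoding", "connection"],
    }

    def claimed_browser(user_agent):
        ua = user_agent.lower()
        if "firefox" in ua:
            return "firefox"
        if "chrome" in ua:
            return "chrome"
        return "unknown"

    def header_order_consistent(header_names, user_agent):
        expected = TYPICAL_ORDER.get(claimed_browser(user_agent))
        if expected is None:
            return True  # nothing to compare against
        seen = [h.lower() for h in header_names if h.lower() in expected]
        return seen == sorted(seen, key=expected.index)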
Claiming you're on Windows while actually running on a Linux VM? Your TCP fingerprint gives it away... Detected, passive, without JS.
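Reduced to the crudest possible heuristic, the TCP side looks something like this (only the IP TTL is checked here; a real passive fingerprint would also look at window size and the TCP options and their order, p0f-style):

    # Passive check: compare the OS claimed in the User-Agent against a guess
    # from the observed IP TTL. Initial TTL ~64 is typical for Linux/macOS,
    # ~128 for Windows. A rough sketch, not a real fingerprinter.
    def guess_os_from_ttl(observed_ttl):
        for initial, os_family in ((64, "linux/macos"), (128, "windows"), (255, "other")):
            if observed_ttl <= initial:
                return os_family
        return "unknown"

    def os_claim_consistent(user_agent, observed_ttl):
        ua = user_agent.lower()
        guess = guess_os_from_ttl(observed_ttl)
        if "windows" in ua:
            return guess == "windows"
        if "linux" in ua or "mac os" in ua or "android" in ua:
            return guess == "linux/macos"
        return True  # can't tell, don't flag

    # A "Windows" User-Agent arriving with TTL 52 (initial 64) looks like Linux:
    print(os_claim_consistent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...", 52))  # False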
Really, both sides, step up your game, this is still boring! Or, just stop fighting scrapers, you can't win.
(Replying to PARENT post)
Having written bespoke scraping systems professionally, I think you overestimate the applicability of this technique.
For one thing, detecting that a scraper is a scraper is the problem, not the prelude to the problem. You might as well block them at that point, if you feel you can reliably detect them. If nothing else it's more resource efficient than sending them fake data.
Second, and more importantly, "poisoning the well" is not going to work against a sophisticated scraper. I used to use massive amounts of crawled web data to accurately forecast earnings announcements months in advance. I've also consulted with various companies building distributed crawling systems or looking for ways to develop integrations without public APIs.
My colleagues would know very quickly if something was wrong with the data because our model would suddenly be extremely out of whack in ways that could be traced back to the website's behavior. We used to specifically look for this sort of thing and basically eyeball the data on a daily basis. We even had tools in place that measured the volume, response time and type of data being received, and would alert us if the ingested data went more than one standard deviation outside of the expectation in any of these metrics.
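The alerting part of that is not fancy. A minimal sketch of the idea (the metric names, history and one-standard-deviation threshold are placeholders for what's described above):

    import statistics

    # Flag any ingestion metric that lands more than n_sigma standard
    # deviations away from its historical mean.
    def check_metrics(history, today, n_sigma=1.0):
        alerts = []
        for metric, values in history.items():
            mean = statistics.mean(values)
            stdev = statistics.stdev(values)
            if stdev and abs(today[metric] - mean) > n_sigma * stdev:
                alerts.append("%s: %.1f vs mean %.1f (stdev %.1f)" % (metric, today[metric], mean, stdev))
        return alerts

    history = {"rows_ingested": [980, 1010, 995, 1005, 990],
               "median_response_ms": [120, 130, 125, 118, 127]}
    print(check_metrics(history, {"rows_ingested": 610, "median_response_ms": 124}))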
You might succeed in screwing up whatever the data is being used to inform for a little while, but you will, with near certainty, show your hand by doing this, and scrapers will react in the usual cat-and-mouse game. Modern scraping systems are extremely sophisticated, and success stories are prone to confirmation bias, because you're mostly unaware when it's happening successfully on your website. For a while I was experimenting with borrowing (non-malicious) methods from remote timing attacks to identify when servers were treating automated requests differently from manual requests instead of simply dropping them. The rabbit hole of complexity is sufficiently deep that you could saturate a full research team with work to do in this area.
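For the curious, the core of that timing experiment fits in a few lines (the URL and header sets are placeholders, and real measurements need far more care about jitter, caching and sample size):

    import statistics, time, urllib.request

    # Send the same request with a "bare" client profile and a browser-like
    # header set, then compare response-time distributions. A large, consistent
    # gap hints the server serves the two differently rather than just blocking.
    def sample_times(url, headers, n=20):
        times = []
        for _ in range(n):
            req = urllib.request.Request(url, headers=headers)
            start = time.perf_counter()
            with urllib.request.urlopen(req) as resp:
                resp.read()
            times.append(time.perf_counter() - start)
        return times

    bare = sample_times("https://example.com/", {"User-Agent": "python-urllib"})
    browserish = sample_times("https://example.com/", {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    })
    print(statistics.median(bare), statistics.median(browserish))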
If you want to productively block scrapers, you should consider using a captcha-based system at the application layer, preferably a captcha that hasn't been broken yet and which can't be outsourced to a Mechanical Turk-style API. If nothing else, doing that will introduce at least 10-20 seconds of latency per request, which might be intolerable for many scrapers even if they're quite sophisticated.
(Replying to PARENT post)
Honest question, would this open the site up to legal liability?
You can never identify a bot with 100% certainty, so if you intentionally provide false information to an otherwise legitimate user, and that user is harmed by your false information, isn't that a breach of contract that would make you liable for damages based on the harm your bad data caused?
What if the user claimed that the site was acting maliciously, providing lower prices to cause a market reaction à la what happened to CoinMarketCap? How would you prove that you were only targeting bots, or provide statistics showing your methods were effective rather than arbitrary? That stuff matters when money is lost.
You'd have to have a clause in your terms of service that says "if we think you are a bot, the information we give you will be bad". But then users could (and probably should) run far away, since they have no way of knowing whether you think they're a bot, and thus can't trust anything on your site.
(Replying to PARENT post)
If you detect wrongly though, you risk alienating your users -- if I saw obviously incorrect prices on an ecommerce site, I wouldn't shop there, assuming they're incompetent and that any payment information I enter is going to be whisked away by hackers.
Depending on your industry etc., it may be viable to take the legal route. If you suspect who the scraper is, you can deliberately plant a 'trap street', then look for it at your suspect. If it shows up, let loose the lawyers.
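A crude sketch of the trap-street idea: derive a plausible-looking fake record that only the suspected scraper ever gets served, then search their published data for it (the secret, record shape and suspect identifier here are all placeholders):

    import hashlib

    # One unique, innocuous-looking fake record per suspect. If the record
    # surfaces in someone's dataset, you know which suspect it was served to.
    SECRET = b"rotate-me"

    def trap_record(suspect_id):
        tag = hashlib.sha256(SECRET + suspect_id.encode()).hexdigest()[:8]
        return {"name": "Widget %s" % tag.upper(), "sku": "WX-%s" % tag, "price": 19.99}

    print(trap_record("suspect-asn-64496"))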
That may be harder than you suspect, given that it's not illegal to violate a website's terms of use:
https://www.eff.org/document/oracle-v-rimini-ninth-circuit-o...
Even if you have an enforceable contract, you can still face an expensive legal battle if they choose to fight (and it's likely not their first fight), so hopefully you have some deep pockets.
(Replying to PARENT post)
I agree with you there.
Also relevant: Courts: Violating a Website's Terms of Service Is Not a Crime (https://news.ycombinator.com/item?id=16119686)
(Replying to PARENT post)
Depending on your industry etc., it may be viable to take the legal route. If you suspect who the scraper is, you can deliberately plant a 'trap street', then look for it at your suspect. If it shows up, let loose the lawyers.
Of course, the very best solution is to not care about being scraped. If your problem is the load it puts on your site, provide an API instead and make it easily discoverable.