(Replying to PARENT post)
I really don't think scraping should fall onto that list.
There isn't even a consensus in the IT world on whether scraping can legally be restricted.
(Replying to PARENT post)
Just let the web be the web, and stop trying to control it.
(Replying to PARENT post)
(Replying to PARENT post)
Headless Chrome is awesome and such a step up from previous automation tools.
The Chromeless project provides a nice abstraction and received 8k stars in its first two weeks on GitHub: https://github.com/graphcool/chromeless
(Replying to PARENT post)
I guess I disagree with the premise of this article.
How is web scraping fundamentally malicious?
What rights/expectations can you have that a publicly accessible website you create must be used by humans only?
(Replying to PARENT post)
(Replying to PARENT post)
(Replying to PARENT post)
And by releasing headless chrome, they killed off some of the competition. (https://groups.google.com/forum/#!topic/phantomjs/9aI5d-LDuN...)
(Replying to PARENT post)
(Replying to PARENT post)
Regardless, as others are saying, using full Chrome or Firefox with WebDriver solves all of these, right? Is there a way to detect the webdriver extension? That's the only difference from a normal browser, I think.
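One signal worth noting: newer browsers driven through WebDriver are supposed to set a `navigator.webdriver` flag (per the W3C WebDriver spec), which page scripts can read. A minimal sketch of such a check; `looksAutomated` is a hypothetical helper name, and plain objects stand in for the browser's `navigator` so it can run outside a page:

```javascript
// Sketch of a WebDriver check. The spec requires automated sessions
// to expose navigator.webdriver === true; a page can test that flag.
function looksAutomated(nav) {
  return nav.webdriver === true;
}

// In a real page you would call looksAutomated(navigator);
// here plain objects simulate the two cases.
console.log(looksAutomated({ webdriver: true })); // true  -> driven by WebDriver
console.log(looksAutomated({}));                  // false -> looks like a normal browser
```

It's a weak signal on its own, since the automation side can often override or delete the property before page scripts run.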
(Replying to PARENT post)
You can just use document.body.
I also suggest using a data URL instead. E.g. "data:," is an empty plain text file, which, as you can imagine, won't be interpreted as a valid image.

let image = new Image();
image.onerror = () => {
  // onerror fires because 'data:,' is not a valid image
  console.log(image.width); // 0 -> headless
};
document.body.appendChild(image);
image.src = 'data:,';
> In case of a vanilla Chrome, the image has a width and height that depends on the zoom of the browser

The zoom doesn't affect this. It's always in CSS "pixels".
(Replying to PARENT post)
(Replying to PARENT post)
(Replying to PARENT post)
At what point does it make more sense for companies to just start offering open APIs or data exports? Obviously it would never make sense for a company whose value IS its data, but for retail platforms, auction sites, forum platforms, etc. that have a scraper problem, it seems like providing their useful data through a more controlled, and optimized, avenue could be worth it.
The answer is probably "never", it's just something that comes to mind sometimes.
(Replying to PARENT post)
(Replying to PARENT post)
(Replying to PARENT post)
(Replying to PARENT post)
(Replying to PARENT post)
Cheating an advertiser I'll grant you, but the other two are 100% legitimate.
(Replying to PARENT post)
Since when is web scraping considered malicious? Companies like Google make billions because they use web scraping.
(Replying to PARENT post)
But someone who really wants to do web scraping or anything similar will use a real browser like Firefox or Chrome, run it through xvfb, control it using WebDriver, and maybe expose it through an API. I find these setups to be almost undetectable. The only way you can mitigate this is to use more interesting mitigation techniques, like IP detection, CAPTCHAs, etc.
edit: when I say real browser, I mean running the full browser process including extensions etc.