👤avastel🕑8y🔼375🗨️157

(Replying to PARENT post)

Your solutions for detecting headless Chrome are good.

But someone who really wants to do web scraping, or anything similar, will use a real browser like Firefox or Chrome, run it under Xvfb, control it through WebDriver, and maybe expose it through an API. I find these setups to be almost undetectable. The only way to mitigate them is with other techniques, like IP-based detection, CAPTCHAs, etc.

edit: when I say real browser, I mean running the full browser process, including extensions etc.
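A minimal sketch of the launch side of that setup, assuming a Linux box with Xvfb and the full Chrome binary installed. `buildXvfbChromeCommand` is a hypothetical helper that only composes the command line; the display number and screen geometry are arbitrary choices:

```javascript
// Hypothetical helper: compose the command line for running the *full*
// Chrome binary under Xvfb (a virtual X display). Note there is no
// --headless flag anywhere, so no headless-only quirks exist for a page
// to detect.
function buildXvfbChromeCommand(url, display = 99) {
  return [
    'xvfb-run',
    `--server-num=${display}`,
    "--server-args='-screen 0 1920x1080x24'",
    'google-chrome',                        // real browser process, extensions and all
    `--user-data-dir=/tmp/profile-${display}`,
    url,
  ].join(' ');
}

console.log(buildXvfbChromeCommand('https://example.com'));
```

In practice you'd point chromedriver at a browser started this way rather than passing a URL directly, and wrap the driver in whatever API you want to expose.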

👤westoque🕑8y🔼0🗨️0

(Replying to PARENT post)

> ... to automate malicious tasks. The most common cases are web scraping...

I really don't think scraping belongs on that list.

There isn't even a consensus in the IT world on whether scraping can legally be restricted.

👤shakna🕑8y🔼0🗨️0

(Replying to PARENT post)

So once again someone wants to punish all the legitimate people using a website to get some marginal benefit from detecting the remaining <1%. The inevitable false positives don't affect the "malicious" users, only the legitimate ones. And how much will this bloat the page load? Adding more code to an already overly large page isn't helping anyone.

Just let the web be the web, and stop trying to control it.

👤stevefeinstein🕑8y🔼0🗨️0

(Replying to PARENT post)

This looks like a list of bugs that need fixing; ideally, headless Chrome should be completely indistinguishable from ordinary Chrome, so that it gets an identical view of the web.
👤JoshTriplett🕑8y🔼0🗨️0

(Replying to PARENT post)

Leaving aside for a moment that many "malicious" use cases are actually fairly common and totally legitimate.

Headless Chrome is awesome and such a step up from previous automation tools.

The Chromeless project provides a nice abstraction and received 8k stars in its first two weeks on GitHub: https://github.com/graphcool/chromeless

👤sorenbs🕑8y🔼0🗨️0

(Replying to PARENT post)

> Beyond the two harmless use cases given previously, a headless browser can also be used to automate malicious tasks. The most common cases are web scraping

I guess I disagree with the premise of this article.

How is web scraping fundamentally malicious?

What rights/expectations can you have that a publicly accessible website you create must be used by humans only?

👤josteink🕑8y🔼0🗨️0

(Replying to PARENT post)

Since when is web scraping a "malicious task"?
👤fforflo🕑8y🔼0🗨️0

(Replying to PARENT post)

If someone wants to scrape your site, they will do it; they'll just find workarounds for your "protection". It is impossible to tell the difference between a real user and an automated scrape request; you can only make their job a bit harder.
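The workaround side of that arms race can be as blunt as overriding the properties a detection script reads. A sketch against a plain mock object so it runs anywhere; in a real scraper you'd run something like this in the page context before any site script executes. The property names are the commonly probed ones; the replacement values are arbitrary "plausible" choices:

```javascript
// Override the properties detection scripts tend to read on `navigator`.
function patchNavigator(nav) {
  Object.defineProperty(nav, 'webdriver', { get: () => undefined });
  Object.defineProperty(nav, 'languages', { get: () => ['en-US', 'en'] });
  Object.defineProperty(nav, 'plugins', { get: () => ({ length: 3 }) }); // non-empty
  return nav;
}

const mock = patchNavigator({});
console.log(mock.webdriver, mock.languages, mock.plugins.length);
```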
👤XCSme🕑8y🔼0🗨️0

(Replying to PARENT post)

I wonder how many of these were deliberate, and how many were missed. Google has a vested interest in bot detection.

And by releasing headless Chrome, they killed off some of the competition. (https://groups.google.com/forum/#!topic/phantomjs/9aI5d-LDuN...)

👤tyingq🕑8y🔼0🗨️0

(Replying to PARENT post)

I don't want to start an argument here, but can someone explain why web scraping is considered malicious?
👤PascLeRasc🕑8y🔼0🗨️0

(Replying to PARENT post)

How many of these can be faked with some additional code in headless Chrome?

Regardless, as others are saying, using complete Chrome or Firefox with WebDriver solves all of these, right? Is there a way to detect the WebDriver extension? That's the only difference I can think of from a normal browser.
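There is one standardized trace: the WebDriver spec defines `navigator.webdriver`, which is supposed to be true while the browser is under automation, and modern drivers do set it. Whether a given setup actually exposes it (or the scraper has deleted it, as above) varies. Written as a pure function over a navigator-like object so it can be exercised outside a browser:

```javascript
// True if the navigator object carries the WebDriver automation flag.
function looksAutomated(nav) {
  return nav.webdriver === true;
}

console.log(looksAutomated({ webdriver: true })); // true: driven by WebDriver
console.log(looksAutomated({}));                  // false: ordinary browsing
```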

👤skinnymuch🕑8y🔼0🗨️0

(Replying to PARENT post)

> var body = document.getElementsByTagName("body")[0];

You can just use document.body.

I also suggest using a data URL instead. E.g. "data:," is an empty plain-text file, which, as you can imagine, won't be interpreted as a valid image.

  let image = new Image();
  image.onerror = () => {
    console.log(image.width); // 0 -> headless
  };
  document.body.appendChild(image);
  image.src = 'data:,';

> In case of a vanilla Chrome, the image has a width and height that depends on the zoom of the browser

The zoom doesn't affect this. It's always in CSS "pixels".

👤tomatsu🕑8y🔼0🗨️0

(Replying to PARENT post)

Shouldn't the first block of code have "HeadlessChrome" instead of just "Chrome" as the search term?
👤netsharc🕑8y🔼0🗨️0

(Replying to PARENT post)

I do hope these methods get patched; I tend to archive my bookmark collection with headless Chrome to avoid losing content when a site goes offline. I hate it when a website requires me to play special snowflake just to scrape it for this purpose.
👤tscs37🕑8y🔼0🗨️0

(Replying to PARENT post)

dumb question from someone who's written a ton of scrapers and scraping-based "products" for fun:

At what point does it make more sense for companies to just start offering open APIs or data exports? Obviously it would never make sense for a company whose value IS its data, but for retail platforms, auction sites, forum platforms, etc. that have a scraper problem, it seems like providing their useful data through a more controlled, optimized avenue could be worth it.

The answer is probably "never", it's just something that comes to mind sometimes.

👤jdc0589🕑8y🔼0🗨️0

(Replying to PARENT post)

The irony of using JavaScript to detect scrapers or bots: the majority of them (the ones not used to trick ads) never execute any of it, because they're just a better curl.
👤revelation🕑8y🔼0🗨️0

(Replying to PARENT post)

All of these could quite easily be overcome by compiling your own headless Chrome. It wouldn't surprise me if a fork to that effect appears soon.
👤askvictor🕑8y🔼0🗨️0

(Replying to PARENT post)

Those who want a more "authentic" experience would do better to use a real normal browser, and control it from outside.
👤userbinator🕑8y🔼0🗨️0

(Replying to PARENT post)

I'd be willing to bet that missing image size variance is more of a bug or oversight, and is something that will be fixed.
👤DannyDaemonic🕑8y🔼0🗨️0

(Replying to PARENT post)

"Beyond the two harmless use cases given previously, a headless browser can also be used to automate malicious tasks. The most common cases are web scraping, increase advertisement impressions or look for vulnerabilities on a website."

Cheating an advertiser I'll grant you, but the other two are 100% legitimate.

👤hossbeast🕑8y🔼0🗨️0

(Replying to PARENT post)

"... a headless browser can also be used to automate malicious tasks. The most common cases are web scraping... "

Since when is web scraping considered malicious? Companies like Google make billions because they use web scraping.

👤assafmo🕑8y🔼0🗨️0

(Replying to PARENT post)

What about mining cryptocurrency on a page load as a solution against scrapers?
👤codedokode🕑8y🔼0🗨️0

(Replying to PARENT post)

Isn't it possible to detect a bot by tracking some events like random mouse moving, scrolling, clicking etc.? Why weren't these kinds of detection tried in place of captchas, for example?
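This is roughly what some "invisible" bot-detection products do. A toy version of the idea: score how much a pointer trace deviates from a straight line, since naive bots jump or move perfectly linearly while humans jitter. The 0.5-pixel threshold is an arbitrary illustration, not a tuned value:

```javascript
// Average perpendicular distance of each point from the chord through
// its neighbours; ~0 for straight-line (scripted) movement.
function looksScripted(points) {
  if (points.length < 3) return true; // little or no movement at all
  let deviation = 0;
  for (let i = 1; i < points.length - 1; i++) {
    const [x0, y0] = points[i - 1];
    const [x1, y1] = points[i + 1];
    const [x, y] = points[i];
    deviation += Math.abs((x1 - x0) * (y0 - y) - (x0 - x) * (y1 - y0)) /
                 Math.hypot(x1 - x0, y1 - y0);
  }
  return deviation / (points.length - 2) < 0.5;
}

console.log(looksScripted([[0, 0], [1, 1], [2, 2], [3, 3]]));         // true: perfectly linear
console.log(looksScripted([[0, 0], [1, 2], [2, 0], [3, 2], [4, 0]])); // false: jittery
```

The catch is that a determined bot can replay recorded human traces, so this, too, only raises the bar.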
👤fiatjaf🕑8y🔼0🗨️0

(Replying to PARENT post)

Can you guys shut up already?
👤megamindbrian🕑8y🔼0🗨️0