Avoiding bot detection: How to scrape the web without getting blocked?

  • > I need to make a general remark to people who are evaluating and/or planning to introduce anti-bot software on their websites. Anti-bot software is nonsense. It's snake oil sold to people without technical knowledge for heavy bucks.

    If this guy got to experience how systemically bad the credential stuffing problem is, he'd probably take down the whole repository.

    None of these anti-bot providers give a shit about invading your privacy, tracking your every movement, or whatever other power fantasy can be imagined. Nobody pays those vendors $10m/year to frustrate web crawler enthusiasts, they do it to stop credential stuffing.

  • I am always amazed when otherwise intelligent people assert without data that the marginal cost of serving web traffic to scrapers/bots is zero. It is kind of like people who say "Why don't they put more fuel in the rocket so it can get all the way into orbit with just one stage?"

    It sounds great but it is a completely ignorant thing to say.

  • What I really enjoy about this thread is all of the completely different perspectives. Lots of people doing anti-abuse research bemoaning that this stuff exists, and lots of people working against what are, from their perspective, ham-handed anti-abuse tech blocking legitimate useful automation, trading tips on how to do it better. I guess we don't see much of the other sides of those. People doing actual black-hat work probably don't post about it on public forums, and most of the over-broad anti-abuse is probably a side effect of taking some anti-abuse tech and blindly applying it to the whole site just because that's simpler; often no technical people may be involved at all.

  • If someone is signalling to you that they do not want your bot on their site, then maybe respect that? Trying to circumvent it is, besides being legally questionable, a serious pain in the ass for the site owner, and it makes websites more prone to block bots in general.

    Also, in my experience, most websites that block your bot do so because your bot is too aggressive, or because you are fetching some expensive resource that bots in general refuse to lay off. Bots that wait a few seconds between requests rarely get blocked, even by CDNs.
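
    A minimal sketch of that kind of polite crawl, assuming hypothetical target URLs and an illustrative contact address; the delay values are arbitrary:

        import random
        import time

        import requests

        # Hypothetical targets and contact details; adjust to your own crawl.
        URLS = ["https://example.com/page/1", "https://example.com/page/2"]

        session = requests.Session()
        session.headers["User-Agent"] = "example-research-bot/0.1 (+mailto:ops@example.com)"

        for url in URLS:
            resp = session.get(url, timeout=30)
            resp.raise_for_status()
            # ... parse resp.text here ...
            # A few seconds of jittered sleep keeps the crawl from looking like a burst.
            time.sleep(3 + random.uniform(0, 2))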

  •     You use this software at your own risk. Some of them contain malware, just FYI
    
    LOL why post LINKS to them then? Flat-out irresponsible...

        you build a tool to automate social media accounts to manage ads more efficiently
    
    If by "manage" you mean "commit click fraud"

  • I'm a lead engineer on the search team of a publicly traded company whose bread and butter is this domain. I was curious about this list, and candidly it misses the mark: the tech mentioned in this blog is what you might get if you hired a competent consultant to build out a service without having domain knowledge. In my experience, what's being used on the bleeding edge is two steps ahead of this.

  • There’s one technique that can be very useful in some circumstances that isn’t mentioned. Put simply, some sites try to block all bots except for those from the major search engines. They don’t want their content scraped, but they want the traffic that comes from search. In those cases, it’s often possible to scrape the search engines instead using specialized queries designed to get the content you want into the blurb for each search result.

    This kind of indirect scraping can be useful for getting almost all the information you want from sites like LinkedIn that do aggressive scraping detection.
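
    A rough sketch of the indirect approach, using DuckDuckGo's plain-HTML endpoint as a stand-in search engine; the query and the CSS class names are assumptions and may well change:

        import requests
        from bs4 import BeautifulSoup

        # Ask the search engine for pages on the target site and read the blurbs,
        # never touching the target site itself. Search engines rate-limit too,
        # so keep the volume low.
        query = 'site:example.com "annual report"'
        resp = requests.get(
            "https://html.duckduckgo.com/html/",
            params={"q": query},
            headers={"User-Agent": "Mozilla/5.0"},
            timeout=30,
        )
        soup = BeautifulSoup(resp.text, "html.parser")
        for result in soup.select(".result"):
            title = result.select_one(".result__title")
            snippet = result.select_one(".result__snippet")
            if title and snippet:
                print(title.get_text(strip=True), "->", snippet.get_text(strip=True))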

  • It's very easy to install Chrome on a Linux box and launch it with a whitelisted extension. You can run Xorg using the dummy driver and get a full Chrome instance (i.e. not headless). You can even enable the DevTools API programmatically. I don't see how this would be detectable, and it's probably a lot safer than downloading a random browser package from an unknown developer.
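
    A rough sketch of that setup, assuming an X server (dummy driver or similar) is already running on display :99, Chrome is installed as google-chrome, and the extension path is a placeholder:

        import json
        import os
        import subprocess
        import time

        import requests

        env = dict(os.environ, DISPLAY=":99")  # X server with the dummy driver assumed running

        chrome = subprocess.Popen(
            [
                "google-chrome",
                "--remote-debugging-port=9222",          # expose the DevTools protocol
                "--user-data-dir=/tmp/scrape-profile",   # throwaway profile
                "--load-extension=/path/to/extension",   # placeholder path
                "https://example.com",
            ],
            env=env,
        )
        time.sleep(5)  # crude wait for Chrome to start

        # Each DevTools target exposes a webSocketDebuggerUrl that any
        # Chrome DevTools Protocol client can drive.
        targets = requests.get("http://127.0.0.1:9222/json", timeout=10).json()
        print(json.dumps(targets, indent=2))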

  • Google "residential proxies for sale" if you want to see the weird shady grey market for proxies when you need your traffic to come from things like cablemodem operator ASNs' DHCP pools

  • Another great resource is incolumitas.com. A list of detection methods is here: https://bot.incolumitas.com/

    I run a no-code web scraper (https://simplescraper.io) and we test against these.

    Having scraped millions of webpages, I find dynamic CSS selectors a bigger time sink than most anti-scraping tech encountered so far (if your goal is to extract structured data).
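
    One way to sidestep some of that selector churn is to anchor on semantic attributes rather than generated class names; a small illustrative sketch with made-up markup:

        from bs4 import BeautifulSoup

        # Hypothetical markup with auto-generated (and therefore unstable) class names.
        html = '<div class="css-1x8d7qp"><span class="css-9ziaj2" itemprop="price">$19.99</span></div>'
        soup = BeautifulSoup(html, "html.parser")

        # Fragile: generated class names like css-9ziaj2 change on every deploy.
        fragile = soup.select_one(".css-9ziaj2")

        # Sturdier: semantic hooks (itemprop, data-*, aria-*) tend to survive redesigns.
        sturdy = soup.select_one('[itemprop="price"]')
        print(sturdy.get_text(strip=True) if sturdy else None)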

  • 2 of my social media accounts have fallen victim to bot detection, despite not using scripts. There are other websites for which I have used scripts, and sometimes ran into CAPTCHA restrictions, but was able to adjust the rate to stay within limits.

    CouchSurfing blocked me after I manually searched for the number of active hosts in each country (191 searches), and posted the results on Facebook. Basically I questioned their claim that they have 15 million users - although that may be their total number of registered accounts, the real number of users is about 350k. They didn't like that I said that (on Facebook) so they banned my CouchSurfing account. They refused to give a reason, but it was a month after gathering the data, so I know that it was retaliation for publication.

    LinkedIn blocked me 10 days ago, and I'm still trying to appeal to get my account back.

    A colleague was leaving, and his manager asked me to ask people around the company to sign his leaving card. Rather than go to 197 people directly, I intentionally wanted to target those who could also help with the software language translation project (my actual work). So I read the list of names, cut it down to 70 "international" people, and started searching for their names on Google. Then I clicked on the first result, usually LinkedIn or Facebook.

    The data was useful, and I was able to find willing volunteers for Malay, Russian, and Brazilian Portuguese!

    After finding the languages from 55 colleagues over 2 hours, LinkedIn asked for an identity verification: upload a photo of my passport. No problem, I uploaded it. I also sent them a full explanation of what I was doing, why, how it was useful, and proof of my Google search history.

    But rather than reactivate my account, LinkedIn have permanently banned me, and will not explain why.

    "We appreciate the time and effort behind your response to us. However, LinkedIn has reviewed your request to appeal the restriction placed on your account and will be maintaining our original decision. This means that access to the account will remain restricted.

    We are not at liberty to share any details around investigations, or interpret the terms of service for you."

    So when the CAPTCHA says "Are you a robot?", I'm really not sure. Like Pinocchio, "I'm a real boy!"

  • I knew there was a reason why I used client certificates and alternate ports.

    Why is it so difficult to just respect robots.txt? Maybe there's an idea for a browser plugin that determines whether you can easily scrape the data or not. If not, then the website is blocked and its traffic will drop. I know this is a naive idea...
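
    For what it's worth, honoring robots.txt from a script is a few lines with the standard library; the URLs and user-agent name below are placeholders:

        from urllib import robotparser

        rp = robotparser.RobotFileParser()
        rp.set_url("https://example.com/robots.txt")
        rp.read()

        # Only fetch the page if the site's robots.txt allows our user agent to.
        if rp.can_fetch("example-research-bot", "https://example.com/listings?page=2"):
            print("allowed, go ahead")
        else:
            print("disallowed, skip it")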

  • Never underestimate the scraping technique of last resort: paying people on Mechanical Turk or equivalent to browse to the site and get the data you want

  • Are there any court cases that provide precedent regarding the legality of web scraping?

    I'm currently looking for ways to get real estate listings in a particular area, and apparently the only real solution is to scrape the few big online listing sites.

  • Half of the short links to cutt.ly aren't working. Why use short links in markdown?

  • It always amazes me how people believe they have a right to retrieve data from a website. The HTTP protocol calls it a request for a reason: you are asking for data. The server is allowed to say no, for any reason it likes, even a reason you don't agree with.

    This whole field of scraping and anti-bot technology is an arms race: one side gets better at something, the other side gets better at countering it. An arms race benefits no one but the arms dealers.

    If we translate this behavior into the real world, it ends up looking like https://xkcd.com/1499

  • For the row "Long-lived sessions after sign-in" the author mentions that this solution is for social media automation i.e. you build a tool to automate social media accounts to manage ads more efficiently.

    I am curious what the author means by automating social media accounts to manage ads more efficiently.

  • Trying to stop credential stuffing by blocking bots will not work, and can often severely impact people depending on assistive technologies.

    I think a better solution is to implement 2FA/MFA (even bad 2FA/MFA like SMS or email will block the mass attacks; for people worried about targeted attacks, let them use a token or software token app) or SSO (e.g. sign in with Google/Microsoft/Facebook/Linkedin/Twitter, who can generally do a better job securing accounts than some random website). SSO is also a lot less hassle in the long term than 2FA/MFA for most users (major note: public use computers, but that's a tough problem to solve security-wise, no matter what).

    Better account security is, well, better, regardless of the bot/credential stuffing/etc problem.

  • A lot of web scraping is annoying, often because there's *an explicit API built for the scraper's needs*. Instead of looking for an API, many reach for web scraping first. This in turn puts load and complexity on the user-facing web app, which must now tell scrapers from real users.
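
    In that spirit, it's often worth checking the browser's network tab for the JSON endpoint the page itself calls and hitting that instead; the endpoint and parameters below are purely hypothetical:

        import requests

        resp = requests.get(
            "https://example.com/api/v1/listings",   # hypothetical JSON endpoint behind the page
            params={"page": 1, "per_page": 50},
            headers={"Accept": "application/json"},
            timeout=30,
        )
        resp.raise_for_status()
        for item in resp.json().get("results", []):
            print(item)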

  • Here's a good resource about web scraping: https://bot.incolumitas.com/#:~:text=more%20sources%2Finform...

  • I am running a no-code web automation and data extraction tool called https://automatio.co. And from my experience, most of the time, when using quality residential proxies, you will be fine. But that comes at a cost, since they are far more expensive than data center proxies.

    But for some websites, even residential IPs don't let you pass.

    I noticed there is something like a premium reCAPTCHA service, which just works differently than the standard one and doesn't let you pass. It's mostly shown with a Cloudflare anti-bot page.
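
    Plumbing a residential proxy into a plain HTTP client is usually just a proxies setting; the gateway hostname and credentials below are placeholders for whatever your provider issues:

        import requests

        proxy = "http://USERNAME:PASSWORD@gateway.example-proxy.com:8000"  # placeholder credentials
        proxies = {"http": proxy, "https": proxy}

        resp = requests.get("https://example.com", proxies=proxies, timeout=60)
        print(resp.status_code)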

  • By the way - is it possible to stop Googlebot from scraping without maintaining a list of IP addresses? Google doesn't publish these, and it's not good to run reverse DNS as it slows down legitimate clients. I know you can put a meta tag, but the bot still has to make a request to read it. I would like to completely cut off Google from scraping.
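
    If reverse DNS is the sticking point, one mitigation is to run the reverse-plus-forward DNS check (the verification method Google itself documents) only for requests whose User-Agent already claims to be Googlebot, and cache the verdict per IP so legitimate clients never pay for it; a rough sketch:

        import socket
        from functools import lru_cache

        @lru_cache(maxsize=4096)
        def is_verified_googlebot(ip: str) -> bool:
            """Reverse-DNS the IP, require a googlebot.com/google.com hostname,
            then forward-resolve that hostname and confirm it maps back to the IP."""
            try:
                host = socket.gethostbyaddr(ip)[0]
            except OSError:
                return False
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            try:
                return ip in socket.gethostbyname_ex(host)[2]
            except OSError:
                return False

        # A site that wants Googlebot gone can return 403 whenever this is True.
        print(is_verified_googlebot("66.249.66.1"))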

  • Datadome, PerimeterX - has anyone tried one of them?

  • I've had a lot of success just with Selenium and this custom version of Chromedriver: https://github.com/ultrafunkamsterdam/undetected-chromedrive...
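
    Minimal usage of that package looks roughly like this (a sketch; exact options vary between releases):

        import undetected_chromedriver as uc

        driver = uc.Chrome()   # patched chromedriver that hides common automation fingerprints
        try:
            driver.get("https://example.com")
            print(driver.title)
        finally:
            driver.quit()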

  • In a previous venture my team successfully circumvented bot detection for a price comparison project simply by using apify.com. Wasn't that expensive, either. We were drilling sites with 500k+ hits per day for months.

  • Somewhat related: https://news.ycombinator.com/item?id=29062027

  • A couple of things for unblockable scraping (a rough sketch of 3 and 5 follows after the list):

    1. plenty of VPS with many IP addresses (this is easier with IPv6 subnet)

    2. HTTP header rearranging

    3. Fuzzing user-agent

    4. Pseudo-PKBOE algorithm

    5. office hours, break-time, lunch-time activity emulation

    6. ????

    7. profit

    I am looking at you, SSH port bashers.
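
    A rough sketch of items 3 and 5, with example User-Agent strings and arbitrary working hours:

        import random
        import time
        from datetime import datetime

        import requests

        USER_AGENTS = [  # example strings only; keep them current and consistent per session
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15",
        ]

        def within_office_hours(now: datetime) -> bool:
            # Mon-Fri, 09:00-17:00, with a quiet hour over lunch.
            return now.weekday() < 5 and 9 <= now.hour < 17 and now.hour != 12

        session = requests.Session()
        session.headers["User-Agent"] = random.choice(USER_AGENTS)

        for url in ["https://example.com/a", "https://example.com/b"]:  # hypothetical targets
            while not within_office_hours(datetime.now()):
                time.sleep(300)  # idle outside working hours
            session.get(url, timeout=30)
            time.sleep(random.uniform(5, 40))  # irregular, human-ish gaps between pages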

  • * Scrape open proxy websites for open proxies, then use those proxies, cycling which proxies you use frequently (rough sketch below).

    * Change your user-agent to a real user-agent, cycle it frequently.

    * Done.
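
    As a sketch of that recipe (placeholder addresses; open proxies are flaky and untrustworthy, so expect failures):

        import itertools
        import random

        import requests

        PROXIES = ["http://203.0.113.10:8080", "http://198.51.100.7:3128"]  # placeholders
        USER_AGENTS = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36",
            "Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0",
        ]
        proxy_pool = itertools.cycle(PROXIES)

        for url in ["https://example.com/1", "https://example.com/2"]:  # hypothetical targets
            proxy = next(proxy_pool)
            try:
                resp = requests.get(
                    url,
                    proxies={"http": proxy, "https": proxy},
                    headers={"User-Agent": random.choice(USER_AGENTS)},
                    timeout=20,
                )
                print(proxy, resp.status_code)
            except requests.RequestException as exc:
                print(proxy, "failed:", exc)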

  • You could ask first. The site's robots.txt file might have some information.

    Put your email address in your User-Agent string so they can get in touch if needed.

  • The proxy service recommendations are pretty expensive. Does anyone have alternatives they suggest to keep costs down?

  • Not to forget the most important rule: don't be an asshole to the site hosting the content.

  • Pretty useful crash course on what is out there in the web scraping universe

  • What if we solved it by replacing passwords with client HSMs?

  • plivo.com is good at anti-bot; I tried many methods and some residential proxies, and they still blocked me out.

  • Will I scrape faster with RTX 3080 Ti?
