In this article we will talk about ways to get past anti-bot firewalls using techniques such as stealth plugins, IP rotation, and residential proxy provisioning. We will not be talking about the scraping itself so much as the architecture that allows us to stay under the radar; if you want to see how to extract data from a website using Puppeteer, go read This Article


BE ADVISED !! I do not encourage using these techniques to scrape websites illegally. This article exists for educational purposes only; be mindful of how you use this knowledge, as you are responsible for your own actions.


Resources

https://bot.sannysoft.com/ is a website that will tell us whether or not our browser is flagged as a bot. If you visit it right now you will pass the test, being a real user and all… but Puppeteer will not, because we haven’t told it to hide its origin.

Basic Stealth plugin

So let’s get down to the basics. If you do not know how to start a basic Puppeteer project, the following article explains in a beginner-friendly way how to set one up and scrape basic websites using Puppeteer.

Here is an example of a basic app.js:

import puppeteer from "puppeteer";
import { setTimeout } from "node:timers/promises"; // waitForTimeOut is deprecated so we use native node

const browser = await puppeteer.launch({ headless: true });

const page = await browser.newPage();
await page.goto("https://bot.sannysoft.com/");

await setTimeout(3000);
await page.screenshot({ path: "screenshot.jpeg", fullPage: true });

await browser.close();

This code does the simplest thing ever: it goes to https://bot.sannysoft.com/ and takes a screenshot, where we will see that we do not, in fact, pass the tests.

To make it seem like we are a legitimate user, we will install some ready-to-use plugins:

npm install puppeteer-extra puppeteer-extra-plugin-stealth

And declare in our code that we’ll be using them rather than plain Puppeteer:

import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";
import { setTimeout } from "node:timers/promises";

puppeteer.use(StealthPlugin());

const browser = await puppeteer.launch({ headless: true });

const page = await browser.newPage();
await page.goto("https://bot.sannysoft.com/");
await setTimeout(3000);
await page.screenshot({ path: "screenshot.jpeg", fullPage: false });

await browser.close();

Sure enough, this basic test will now pass. It’s the first step on a long road to getting through security like DataDome’s, which notoriously provides anti-bot protection for a lot of websites.
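The plugin patches browser fingerprints, but timing patterns matter too: real users don’t act at perfectly fixed intervals. A small jitter helper (my own sketch, not part of any library) can replace fixed waits like `setTimeout(3000)`:

```javascript
// Returns a random integer delay in [min, max] milliseconds.
// Jittering waits makes request timing look less mechanical.
export function randomDelay(min = 500, max = 2000) {
  return Math.floor(Math.random() * (max - min + 1)) + min;
}

// Usage in the scraper, instead of a fixed wait:
// import { setTimeout } from "node:timers/promises";
// await setTimeout(randomDelay(1000, 4000));
```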

Ip Rotation

Now, one of the goals when scraping any amount of data is to not get your IP flagged. For this purpose, services such as residential proxy servers can be rented, and cheaply too. Here we’ll go with the free tier of Webshare.io, which gives basically everyone the same 10 proxies, so yes, you will be raising alarms if you try to use it in production, but for this example it’ll do.

Working behind a proxy does not mean anonymity

For this we will create a helpers/proxy-manager.js that handles connectivity to our proxy provider. The provider handles proxy rotation, bans, and refreshes on their end, which makes things easier for us; at a bigger scale we might want to build our own proxy pool manager based on a large list of proxies.

// helpers/proxy-manager.js
import axios from "axios";

export function getProxy() {
  const username = process.env.PROXY_USERNAME;
  const password = process.env.PROXY_PASSWORD;
  const host = process.env.PROXY_DOMAIN_NAME;
  const port = Number(process.env.PROXY_PORT); // env vars are strings; axios expects a numeric port

  return {
    host,
    port,
    username,
    password,
  };
}

/**
 * Logs the current public IP used through the proxy.
 * This helps confirm when rotation happens.
 */
export async function logCurrentIp(proxy) {
  try {
    const response = await axios.get("http://api.ipify.org?format=json", {
      proxy: {
        host: proxy.host,
        port: proxy.port,
        auth: {
          username: proxy.username,
          password: proxy.password,
        },
      },
      timeout: 5000,
    });

    console.log(`🌍 Current Proxy IP: ${response.data.ip}`);
  } catch (err) {
    console.warn("Failed to get current IP:", err.message);
  }
}
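The helper above reads four environment variables. A sample `.env` might look like this (all values are placeholders; the Webshare-style endpoint shown is illustrative, check your provider’s dashboard for the real one):

```
PROXY_USERNAME=your-proxy-username
PROXY_PASSWORD=your-proxy-password
PROXY_DOMAIN_NAME=p.webshare.io
PROXY_PORT=80
```

Since Node 20.6 you can load it with `node --env-file=.env app.js`, or use the dotenv package on older versions.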

Machine → VPN → Proxy → Target Website (HTTPS)

// core/browser.js
import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";
import { getProxy, logCurrentIp } from "../helpers/proxy-manager.js";

puppeteer.use(StealthPlugin());

export default async function launchBrowser() {
  const proxy = getProxy();
  console.log(`Launching browser with proxy: ${proxy.host}:${proxy.port}`);

  const browser = await puppeteer.launch({
    headless: true,
    args: [
      "--no-sandbox",
      "--disable-setuid-sandbox",
      `--proxy-server=${proxy.host}:${proxy.port}`, // no scheme prefix, so it applies to both http and https traffic
    ],
  });

  // Create a page just to authenticate, then close it
  const page = await browser.newPage();
  await page.authenticate({
    username: proxy.username,
    password: proxy.password,
  });
  await logCurrentIp(proxy);
  await page.close();

  return browser;
}
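Earlier I mentioned that at a bigger scale you might want your own proxy pool manager instead of relying on the provider’s rotation. Here is a naive round-robin sketch (the shape of the proxy objects is assumed to match what `getProxy()` returns; a real pool would also track failures and evict dead proxies):

```javascript
// helpers/proxy-pool.js — a naive round-robin pool over a static list.
export function createProxyPool(proxies) {
  let index = 0;
  return {
    // Returns the next proxy, wrapping around at the end of the list
    next() {
      const proxy = proxies[index];
      index = (index + 1) % proxies.length;
      return proxy;
    },
    size: () => proxies.length,
  };
}

// Usage sketch:
// const pool = createProxyPool([{ host: "1.2.3.4", port: 8080 }, ...]);
// const proxy = pool.next(); // then pass it to puppeteer.launch args
```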

Captcha Solvers

Okay, now we stumble upon a real problem: captchas. Some websites show a captcha when they suspect you might be a bot. The real solution is to never trigger that suspicion in the first place, by having a good list of residential proxies for instance, but whatever happens happens.

You have a grand total of three options when coming face to face with a captcha:

  • Give up (sometimes the data is not worth the hassle)
  • Use a Captcha solver provider that employs real people to solve them for you (highly effective)
  • Use AI to try and solve them (better than nothing)

In this article I will not demonstrate how to create a captcha solver (that would be the point of THIS ARTICLE), nor will I show you how to integrate an API captcha solver, because first and foremost the README of https://www.npmjs.com/package/@2captcha/captcha-solver is enough to guide you, but also because depending on your country, trying to get around captchas might be a criminal offense.
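What I can show is detection: knowing that a captcha appeared lets you pick option one (give up) or rotate to a fresh proxy before retrying. Here is a naive heuristic on the page HTML (the marker strings are common examples from well-known captcha widgets, not an exhaustive list):

```javascript
// Naive heuristic: look for markers that common captcha widgets inject
// into the page. Tune this list for the sites you actually target.
const CAPTCHA_MARKERS = ["g-recaptcha", "h-captcha", "cf-turnstile"];

export function looksLikeCaptcha(html) {
  const lower = html.toLowerCase();
  return CAPTCHA_MARKERS.some((marker) => lower.includes(marker));
}

// In a scraper:
// const html = await page.content();
// if (looksLikeCaptcha(html)) { /* give up, or rotate proxy and retry */ }
```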

BONUS: Queuing tabs

What’s better than scraping using a headless browser?

Using multiple tabs in said browser. Right now, if you just followed along, all your scraper knows how to do is open a single Chrome tab. But let’s say you want the details of each product in a list: it would be a hassle to run all those concurrent tasks through a single tab.

But be careful: opening too many pages at once can easily crash your browser or get you rate-limited. The trick? Use a task queue to limit how many tabs open at the same time.

Here’s a quick setup to do just that using p-queue:

npm install p-queue

Then in your scraper:

import PQueue from "p-queue";
import puppeteer from "puppeteer-extra";

const queue = new PQueue({ concurrency: 3 }); // 3 tabs max
const browser = await puppeteer.launch({ headless: true });
const urls = [
  "https://example.com/1",
  "https://example.com/2",
  "https://example.com/3",
];

for (const url of urls) {
  queue.add(async () => {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded" });
    console.log("✅ Scraped:", url);
    await page.close();
  });
}

await queue.onIdle();
await browser.close();
  • Each task opens a tab, does its scraping, and closes it.
  • PQueue ensures only 3 run in parallel: smooth, stable, and less detectable.
  • And you can scale safely by adjusting the concurrency value depending on your machine and the target website.
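One more safeguard worth sketching: with several tabs in flight, a single failed `page.goto` shouldn’t take down the whole run. A small retry wrapper around each queued task (my own helper, with deliberately simplified error handling) keeps one flaky URL from crashing everything:

```javascript
// Runs an async task, retrying up to `attempts` times before rethrowing
// the last error. Useful around flaky navigations behind shared proxies.
export async function withRetry(task, attempts = 3) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// In the queue loop above:
// queue.add(() => withRetry(async () => {
//   const page = await browser.newPage();
//   try {
//     await page.goto(url, { waitUntil: "domcontentloaded" });
//   } finally {
//     await page.close(); // always release the tab, even on failure
//   }
// }));
```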