Web Automation and Data Collection with Playwright (Node.js Version)

Playwright is a library for web page testing and automation that supports the Chromium, Firefox, and WebKit browser engines. Developed by Microsoft, it is efficient, reliable, and fast, which makes it well suited to cross-browser web automation.
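As a small illustration of the cross-browser support, the sketch below opens the same URL in each of the three engines and records the page title. It assumes the browsers have already been installed (for example with npx playwright install). The function name titleInEachBrowser and its optional second parameter (which exists only so a stand-in object can be injected for testing) are hypothetical and not part of Playwright.

```javascript
// Sketch: load one URL in Chromium, Firefox, and WebKit and collect page titles.
// `titleInEachBrowser` is a hypothetical helper name, not a Playwright API.
async function titleInEachBrowser(url, playwright = require('playwright')) {
  const titles = {};
  for (const name of ['chromium', 'firefox', 'webkit']) {
    // The playwright module exposes one launcher object per engine.
    const browser = await playwright[name].launch({ headless: true });
    try {
      const page = await browser.newPage();
      await page.goto(url);
      titles[name] = await page.title();
    } finally {
      // Always release the browser, even if navigation fails.
      await browser.close();
    }
  }
  return titles;
}

// Usage (requires network access and installed browsers):
// titleInEachBrowser('https://www.amazon.com/').then(console.log);
```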

Collecting Amazon Product Information with Playwright

With Playwright you can simulate user actions such as visiting Amazon (www.amazon.com) and crawling product information and reviews. CSS selectors or XPath let you locate page elements precisely and extract their text or attributes.

Example: Crawling the Amazon Best Sellers List

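Here we use Playwright to collect Amazon's international best sellers list. The steps are as follows:

Visit the target page, for example: https://www.amazon.com/b/?ie=UTF8&node=16857165011&ref_=sv_b_3
Select all book elements (using the class names a-section and a-spacing-base)
Iterate over the book elements and extract information such as title, price, rating, and review count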
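The three steps above might be sketched as follows. Only the card selector (.a-section.a-spacing-base) comes from the text. The inner selectors in SELECTORS and the helper names (parsePrice, parseRating, parseCount, normalizeBook, scrapeBestSellers) are assumptions made for illustration, and Amazon's markup changes often, so treat the selectors as placeholders to verify against the live page.

```javascript
// ASSUMPTION: these inner selectors are illustrative guesses, not verified markup.
const SELECTORS = {
  title: '.a-size-base',
  price: '.a-price .a-offscreen',
  rating: '.a-icon-alt',
  reviewCount: '.a-size-small',
};

// "$1,299.99" -> 1299.99; returns null when no number is present.
function parsePrice(text) {
  const match = /\d[\d,]*(?:\.\d+)?/.exec(text || '');
  return match ? Number(match[0].replace(/,/g, '')) : null;
}

// "4.5 out of 5 stars" -> 4.5
function parseRating(text) {
  const match = /(\d+(?:\.\d+)?)\s+out of/.exec(text || '');
  return match ? Number(match[1]) : null;
}

// "1,234 ratings" -> 1234
function parseCount(text) {
  const digits = (text || '').replace(/[^\d]/g, '');
  return digits ? Number(digits) : null;
}

// Convert the raw strings read from one book card into typed fields.
function normalizeBook(raw) {
  return {
    title: (raw.title || '').trim() || null,
    price: parsePrice(raw.price),
    rating: parseRating(raw.rating),
    reviewCount: parseCount(raw.reviewCount),
  };
}

async function scrapeBestSellers(url) {
  const { chromium } = require('playwright'); // loaded only when scraping
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' }); // step 1
    // Step 2: select every book card. Step 3: read the raw text of each field.
    const rawBooks = await page.$$eval(
      '.a-section.a-spacing-base',
      (cards, sel) =>
        cards.map((card) => {
          const text = (s) => {
            const el = card.querySelector(s);
            return el ? el.textContent : null;
          };
          return {
            title: text(sel.title),
            price: text(sel.price),
            rating: text(sel.rating),
            reviewCount: text(sel.reviewCount),
          };
        }),
      SELECTORS
    );
    return rawBooks.map(normalizeBook);
  } finally {
    await browser.close();
  }
}
```

Keeping the parsing in plain functions, separate from the browser code, means the text handling can be checked without opening a page.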

Deploying a Playwright Example on Leapcell

Playwright Deployment Example on Leapcell

This guide offers a streamlined way to deploy Playwright tests on Leapcell. For a step-by-step tutorial, see the link above.

Node.js Implementation Code

Below is a data collection implementation using Node.js and Playwright.

const { chromium } = require('playwright');

(async () => {
    // Launch the browser
    const browser = await chromium.launch({ headless: true });
    const context = await browser.newContext();
    const page = await context.newPage();
    
    // Visit the Amazon home page
    await page.goto('https://www.amazon.com/');
    
    // Search for the keyword "laptop"
    await page.fill('#twotabsearchtextbox', 'laptop');
    await page.click('#nav-search-submit-button');
    
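    // Wait until at least one search result link is present
    // ('networkidle' is discouraged by the Playwright docs and can time out on busy pages)
    await page.waitForSelector('.s-result-item h2 a');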
    
    // Get the list of product links
    const links = await page.evaluate(() => {
        return Array.from(document.querySelectorAll('.s-result-item h2 a'))
            .map(a => a.href);
    });
    
    // Collect product details data
    const results = [];
    for (const link of links) {
        const productPage = await context.newPage();
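        await productPage.goto(link, { waitUntil: 'domcontentloaded' });
        
        // textContent() waits for the selector and rejects on timeout, so use a
        // short timeout and fall back to 'N/A' when an element is missing or empty
        const getText = (selector) =>
            productPage.textContent(selector, { timeout: 5000 })
                .then(text => (text || '').trim() || 'N/A')
                .catch(() => 'N/A');
        
        const title = await getText('#productTitle');
        const rating = await getText('#averageCustomerReviews .a-icon-alt');
        const reviewCount = await getText('#acrCustomerReviewText');
        
        results.push({ title, rating, reviewCount });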
        
        await productPage.close();
    }
    
    // Output the collected data
    console.log(results);
    
    // Close the browser
    await browser.close();
})();
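The loop above visits product pages one at a time. One possible variation, sketched below, processes a few pages in parallel within the same context. The helper names mapWithConcurrency and scrapeProducts and the default limit of 3 are assumptions for illustration. More parallel requests may also make blocking more likely, so a small limit is probably wise.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results keep the order of `items`. Hypothetical helper, not a Playwright API.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function run() {
    // Each runner repeatedly claims the next unprocessed index.
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => run());
  await Promise.all(runners);
  return results;
}

// Open each link in its own page of an existing Playwright browser context.
async function scrapeProducts(context, links, limit = 3) {
  return mapWithConcurrency(links, limit, async (link) => {
    const productPage = await context.newPage();
    try {
      await productPage.goto(link, { waitUntil: 'domcontentloaded' });
      return { link, pageTitle: await productPage.title() };
    } finally {
      await productPage.close();
    }
  });
}
```

In the script above, this would replace the for loop with a call such as const results = await scrapeProducts(context, links).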

Code Analysis

Initializing Playwright: Use chromium.launch({ headless: true }) to launch the browser.
Navigating to the Amazon Search Page: Use page.goto() to visit the website, fill in the search box, and submit the search.
Extracting Product Links: Use document.querySelectorAll() inside page.evaluate() to collect the URLs of all products on the results page.
Collecting Product Details:
- Open each product's page.
- Get the product title (#productTitle).
- Get the rating (#averageCustomerReviews .a-icon-alt).
- Get the number of reviews (#acrCustomerReviewText).
Outputting Data and Closing the Browser
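The detail-collection step could also be factored into a small helper that checks whether an element exists before reading it, rather than waiting for a timeout. This is a sketch under the assumption that the product page has already finished loading. The names textOrFallback and collectDetails are hypothetical; locator(), count(), first(), and textContent() are Playwright locator methods.

```javascript
// Return the trimmed text of the first match, or `fallback` if nothing matches.
async function textOrFallback(page, selector, fallback = 'N/A') {
  const locator = page.locator(selector);
  // count() reports the current number of matches without waiting for one to appear.
  if ((await locator.count()) === 0) return fallback;
  const text = await locator.first().textContent();
  return text && text.trim() ? text.trim() : fallback;
}

// Gather the three fields described above from an already loaded product page.
async function collectDetails(page) {
  return {
    title: await textOrFallback(page, '#productTitle'),
    rating: await textOrFallback(page, '#averageCustomerReviews .a-icon-alt'),
    reviewCount: await textOrFallback(page, '#acrCustomerReviewText'),
  };
}
```

Because the helper does not wait, a field that is rendered late would be reported as 'N/A'. Whether that trade-off is acceptable depends on how the page loads.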

Code Optimization

Error Handling: Some products may not have ratings or review counts. Use .catch(() => 'N/A') to prevent the code from crashing.
Automation Efficiency: Open each product page with await context.newPage() so that all pages share one browser context (and its cookies and cache) instead of launching a new browser per product.
Avoiding Being Blocked:
- You can use proxy access (such as Playwright's proxy option).
- You can adjust the userAgent to make it more like a real user.
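The two anti-blocking options above map onto Playwright's proxy launch option and the userAgent context option. A minimal sketch follows; the helper names buildBrowserOptions and openPage, the settings field names, and the example proxy address and user agent string in the usage comment are all placeholders rather than values from this article.

```javascript
// Build option objects for chromium.launch() and browser.newContext().
// Hypothetical helper: only `proxy` and `userAgent` are real Playwright options.
function buildBrowserOptions({ proxyServer, proxyUsername, proxyPassword, userAgent } = {}) {
  const launchOptions = { headless: true };
  if (proxyServer) {
    launchOptions.proxy = { server: proxyServer };
    if (proxyUsername) {
      // Credentials are only needed for authenticated proxies.
      launchOptions.proxy.username = proxyUsername;
      launchOptions.proxy.password = proxyPassword || '';
    }
  }
  const contextOptions = {};
  if (userAgent) contextOptions.userAgent = userAgent;
  return { launchOptions, contextOptions };
}

// Launch Chromium and open a page using the options built above.
async function openPage(settings) {
  const { chromium } = require('playwright'); // loaded only when a browser is needed
  const { launchOptions, contextOptions } = buildBrowserOptions(settings);
  const browser = await chromium.launch(launchOptions);
  const context = await browser.newContext(contextOptions);
  const page = await context.newPage();
  return { browser, context, page };
}

// Usage (placeholder values):
// const { browser, page } = await openPage({
//   proxyServer: 'http://proxy.example.com:8080',
//   userAgent: 'Mozilla/5.0 (placeholder user agent string)',
// });
```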

Using Playwright and Node.js, we can efficiently automate Amazon web page data collection, which is suitable for scenarios such as e-commerce data analysis and competitor research.

Leapcell: The Next-Gen Serverless Platform for Web Hosting, Async Tasks, and Redis

Finally, I would like to recommend the best platform for deploying Playwright: Leapcell

1. Multi-Language Support

Develop with JavaScript, Python, Go, or Rust.

2. Deploy unlimited projects for free

Pay only for usage: no requests, no charges.

3. Unbeatable Cost Efficiency

Pay-as-you-go with no idle charges.
Example: $25 supports 6.94M requests at a 60ms average response time.

4. Streamlined Developer Experience

Intuitive UI for effortless setup.
Fully automated CI/CD pipelines and GitOps integration.
Real-time metrics and logging for actionable insights.

5. Effortless Scalability and High Performance

Auto-scaling to handle high concurrency with ease.
Zero operational overhead, so you can focus on building.

Explore more in the documentation!

Leapcell Twitter: https://x.com/LeapcellHQ
