
In this blog on using Playwright for web scraping, you will learn how to set up Playwright with Python and use it to scrape data from web pages.

Jaydeep Karale
December 22, 2025
In today’s data-driven world, the ability to access and analyze large amounts of data can give researchers, businesses & organizations a competitive edge. One of the most important & free sources of this data is the Internet, which can be accessed and mined through web scraping.
Web scraping, also known as web data extraction or web harvesting, involves using code to make HTTP requests to a website’s server, download the content of a webpage, and parse that content to extract the desired data from websites and store it in a structured format for further analysis.
When it comes to data extraction & processing, Python has become the de-facto language in today’s world. In this Playwright Python tutorial on using Playwright for web scraping, we will combine Playwright, one of the newest entrants into the world of web testing & browser automation, with Python to learn techniques for Playwright Python scraping.
The reasons for choosing Playwright over some popular alternatives are its developer-friendly APIs, automatic waiting feature, which avoids timeouts in case of slow-loading websites, superb documentation with examples covering various use cases, and a very active community. If you’re looking to improve your Playwright interview skills, check out our curated list of questions and solutions at Playwright interview questions.
So, let’s get started.
Playwright for Web Scraping is a modern approach to automate data extraction from dynamic websites using Python. It enables faster, reliable, and cross-browser scraping with minimal setup.
Why Does Playwright Matter for Web Scraping?
Traditional scraping tools often fail when dealing with JavaScript-heavy or dynamic pages. Playwright ensures consistency, efficiency, and adaptability across web environments.
What Are the Core Pillars of Playwright Web Scraping?
These pillars define the foundation of efficient and reliable data extraction using Playwright and Python.
Now, let’s explore some of the most popular web scraping use cases.
While web scraping is a powerful tool, there are a few ethical & potentially legal considerations to keep in mind when scraping the web.
By following these guidelines and using web scraping responsibly, you can ensure that your web scraping projects are legal and ethical and that you are not causing harm to the websites you are scraping.
Having seen so many use cases, it’s evident that the market for web scraping is huge. And as the market grows for anything, so do the available tools. In this Playwright for web scraping tutorial, we will explore in-depth web scraping with Playwright in Python and how it can extract data from the web.
Playwright is the latest entrant into the array of frameworks (e.g., Selenium, Cypress, etc.) available for web automation testing. It enables fast and reliable end-to-end testing for modern web apps.
At the time of writing this Playwright for web scraping tutorial, the latest stable version of Playwright is 1.28.0, and Playwright is now consistently hitting the 20K+ downloads-per-day mark, as seen from PyPi Stats.

Below are the download trends of Playwright in comparison to a popular alternative, Selenium, taken from Pip Trends.

A key consideration when choosing any language, tool, or framework is its ease of use. Playwright is a perfect choice for web scraping because of its rich & easy-to-use APIs, which allow simpler-than-ever access to elements on websites built using modern web frameworks. You can learn more about it through this blog on testing modern web applications with Playwright.
Some unique and key features of Playwright are:
Now that we know what Playwright is, let’s go ahead and explore how we can leverage its Python API for web scraping, starting with installation & setup.
Enhance your testing strategy with our detailed guide on Playwright Headless Testing. Explore further insights into Playwright’s capabilities in this guide.
As mentioned above, it’s possible to use Playwright for web scraping with different languages such as JavaScript, TypeScript, Java, .NET, and Python. So, it is worth explaining why we chose Python.
I have been programming for ten years using languages such as C++, Java, JavaScript & Python, but in my experience, Python is the most developer-friendly and productivity-oriented language.
It abstracts unnecessary complications of certain other programming languages and lets developers focus on writing quality code and shorten the delivery time whilst also enjoying writing the code.
A quick case for why I love Python automation testing & why we choose Playwright for web scraping, specifically using its Python API.
With an understanding of why we chose to work with Playwright for web scraping in Python, let’s now look at Playwright’s Locators. Playwright supports a variety of locator strategies, including CSS Selectors, XPath expressions, and text content matching.
Locators are a critical component of Playwright, making web browser-related tasks possible, easy, reliable, and fun.
Locators are the centerpiece of Playwright’s ability to automate actions & locate elements in the browser.
Simply put, locators are a way of identifying a specific element on a webpage so that we can interact with it in our scripts.


In the below example snippet, the button element will be located twice: once for the hover() action and once for the click() action to ensure we always have the latest data. This feature is available out of the box without needing additional code.
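A minimal sketch of such a snippet (we reuse the ‘Shop by Category’ button from the demo store scraped later in this tutorial, wrapped in a helper of our own naming):

```python
def open_category_menu(page):
    # The locator is defined once...
    button = page.get_by_role("button", name="Shop by Category")
    # ...but Playwright re-resolves it for every action, so hover() and
    # click() each operate on the freshest state of the element.
    button.hover()
    button.click()
```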

Note: In Playwright, explicit waits are usually unnecessary because it automatically waits for elements to be actionable before interacting with them. This means you do not have to manually add delays or sleeps in your test code to wait for elements to load.
Additionally, Playwright includes many built-in retry mechanisms that make it more resilient to flaky tests. For example, if an element is not found on the page, Playwright will automatically retry for a certain amount of time before giving up and throwing an error. This can help reduce the need for explicit waits in your tests.
You can learn more about it through this blog on types of waits.
The below table summarizes some built-in locators available as part of Playwright:
| Locator Name | Use Case |
|---|---|
| page.get_by_role() | locate by explicit and implicit accessibility attributes |
| page.get_by_text() | locate by text content |
| page.get_by_label() | locate a form control by associated label’s text |
| page.get_by_placeholder() | locate an input by placeholder |
| page.get_by_alt_text() | locate an element, usually image, by its text alternative |
| page.get_by_title() | locate an element by its title attribute |
| page.get_by_test_id() | to locate an element based on its data-testid attribute (other attributes can be configured) |
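As a quick illustration of several of these locators together, here is a sketch of a helper acting on a hypothetical sign-up form (the label, placeholder, and button name are made up for illustration):

```python
def fill_signup_form(page):
    """`page` is a playwright.sync_api.Page; every element referenced
    here belongs to an imaginary sign-up form."""
    page.get_by_label("Email").fill("user@example.com")          # form control via its label
    page.get_by_placeholder("Choose a password").fill("s3cret")  # input via placeholder text
    page.get_by_role("button", name="Sign up").click()           # accessible role + name
```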
When programming in Python, it’s the de-facto approach to have a separate virtual environment for each project. This helps us manage dependencies better without disturbing our base Python installation.
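A typical setup looks like the following sketch (the environment name playwrightplayground matches the one used in the VS Code screenshots below; the activation command differs per OS):

```shell
# create an isolated virtual environment for the project
python -m venv playwrightplayground

# activate it (Windows)
playwrightplayground\Scripts\activate
# activate it (macOS/Linux)
# source playwrightplayground/bin/activate

# install the Playwright Python package inside the environment
pip install playwright
```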


After completing the installation of the Playwright Python package, we need to download & install browser binaries for Playwright to work with.
By default, Playwright will download binaries for Chromium, WebKit & Firefox from the Microsoft CDN, but this behavior is configurable. The available configurations are as below:
PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 python -m playwright install — skip downloading the browser binaries entirely (useful when they are already installed)
PLAYWRIGHT_DOWNLOAD_HOST= — download the binaries from a custom host instead of the Microsoft CDN
HTTPS_PROXY= — route the binary downloads through a proxy
To keep things simple, we install all browsers by using the command playwright install. This step can be skipped entirely if you run your code on a cloud Playwright Grid, but we will look at both scenarios, i.e., using Playwright for web scraping locally and on a cloud grid provided by TestMu AI.
If you have Visual Studio Code (VS Code) already installed, the easiest way to start it up is to navigate to the newly created project folder and type code . (note the trailing dot).

This is how the VS Code will look when it opens. We now need to select the interpreter, i.e., the Python installation, from our newly created virtual environment.

Press Ctrl + Shift + P to open up the command palette in VS Code and type Python: Select Interpreter and click it.

VS Code will automatically detect the virtual environment we created earlier and recommend it at the top; select it. If VS Code does not auto-detect it, click on ‘Enter interpreter path...’ and navigate to playwrightplayground > Scripts > Python and select it.

You can verify the virtual environment by pressing Ctrl + ` (backtick), which will open up the VS Code terminal & activate the virtual environment. An active environment is indicated by its name in round brackets before the path.

With the setup and installation out of the way, we are now ready to use Playwright for web scraping with Python.
For demonstration, I will be scraping information from two different websites. I will be using the XPath locator in the second scenario. This will help you in choosing the best-suited locator for automating the tests.
So, let’s get started…
In the first demo, we are going to scrape a demo E-Commerce website provided by TestMu AI for the following data:

Here is the Playwright for web scraping scenario that will be executed on Chrome on Windows 10 using Playwright Version 1.28.0. We will run the Playwright test on cloud testing platforms like TestMu AI.
By utilizing TestMu AI, you can significantly reduce the time required to run your Playwright Python tests by leveraging an online browser farm that includes more than 50 different browser versions, including Chrome, Chromium, Microsoft Edge, Mozilla Firefox, and WebKit.
You can also subscribe to the TestMu AI YouTube Channel and stay updated with the latest tutorial around Playwright browser testing, Cypress E2E testing, Mobile App Testing, and more.
However, the core logic remains unchanged even if the web scraping has to be done using a local machine/grid.
Test Scenario:
Version Check:
When writing this blog on using Playwright for web scraping, the version of Playwright is 1.28.0, and the version of Python is 3.9.12. The code is fully tested and working on these versions.
Implementation:
You can clone the repo by clicking on the button below.

import json
import logging
import os
import subprocess
import sys
import time
import urllib.parse
from logging import getLogger

from dotenv import load_dotenv
from playwright.sync_api import sync_playwright

# set up basic logging for our project, which will display the log messages
logger = getLogger("webscapper.py")
logging.basicConfig(
    stream=sys.stdout,  # redirect log output to the console
    format="%(message)s",
    level=logging.DEBUG,
)

# LambdaTest username & access key are stored in an env file & we fetch them from there using the python-dotenv module
load_dotenv("sample.env")

capabilities = {
    "browserName": "Chrome",  # Browsers allowed: `Chrome`, `MicrosoftEdge`, `pw-chromium`, `pw-firefox` and `pw-webkit`
    "browserVersion": "latest",
    "LT:Options": {
        "platform": "Windows 10",
        "build": "E Commerce Scrape Build",
        "name": "Scrape Lambda Software Product",
        "user": os.getenv("LT_USERNAME"),
        "accessKey": os.getenv("LT_ACCESS_KEY"),
        "network": False,
        "video": True,
        "console": True,
        "tunnel": False,  # Add tunnel configuration if testing a locally hosted webpage
        "tunnelName": "",  # Optional
        "geoLocation": "",  # country code can be fetched from https://www.lambdatest.com/capabilities-generator/
    },
}


def main():
    with sync_playwright() as playwright:
        playwright_version = (
            str(subprocess.getoutput("playwright --version")).strip().split(" ")[1]
        )
        capabilities["LT:Options"]["playwrightClientVersion"] = playwright_version
        lt_cdp_url = (
            "wss://cdp.lambdatest.com/playwright?capabilities="
            + urllib.parse.quote(json.dumps(capabilities))
        )
        logger.info("Initiating connection to cloud playwright grid")
        browser = playwright.chromium.connect(lt_cdp_url)
        # comment above line & uncomment below line to test on local grid
        # browser = playwright.chromium.launch(headless=False)
        page = browser.new_page()
        try:
            # section to navigate to software category
            page.goto("https://ecommerce-playground.lambdatest.io/")
            page.get_by_role("button", name="Shop by Category").click()
            page.get_by_role("link", name="Software").click()
            page_to_be_scrapped = page.get_by_role(
                "combobox", name="Show:"
            ).select_option(
                "https://ecommerce-playground.lambdatest.io/index.php?route=product/category&path=17&limit=75"
            )
            page.goto(page_to_be_scrapped[0])
            # Since images are lazy-loaded, scroll to the bottom of the page.
            # The range is decided dynamically based on the number of items, i.e. we take it from the limit:
            # https://ecommerce-playground.lambdatest.io/index.php?route=product/category&path=17&limit=75
            for _ in range(int(page_to_be_scrapped[0].split("=")[-1])):
                page.mouse.wheel(0, 300)
                time.sleep(0.1)
            # Construct locators to identify name, price & image
            base_product_row_locator = (
                page.locator("#entry_212408").locator(".row").locator(".product-grid")
            )
            product_name = base_product_row_locator.get_by_role("heading")
            product_price = base_product_row_locator.locator(".price-new")
            product_image = (
                base_product_row_locator.locator(".carousel-inner")
                .locator(".active")
                .get_by_role("img")
            )
            total_products = base_product_row_locator.count()
            for product in range(total_products):
                logger.info(
                    f"\n**** PRODUCT {product + 1} ****\n"
                    f"Product Name = {product_name.nth(product).all_inner_texts()[0]}\n"
                    f"Price = {product_price.nth(product).all_inner_texts()[0]}\n"
                    f"Image = {product_image.nth(product).get_attribute('src')}\n"
                )
            status = "passed"
            remark = "Scraping Completed"
            page.evaluate(
                "_ => {}",
                'lambdatest_action: {"action": "setTestStatus", "arguments": '
                '{"status": "' + status + '", "remark": "' + remark + '"}}',
            )
        except Exception as ex:
            logger.error(str(ex))


if __name__ == "__main__":
    main()
Code Walkthrough:
Let’s now do a step-by-step walkthrough to understand the code.
Step 1 – Setting up imports
The most noteworthy imports are
from dotenv import load_dotenv
The reason for using the load_dotenv library is that it reads key-value pairs from a .env file (in our case, sample.env) and can set them as environment variables automatically. In our case, we use it to read the access key & username from sample.env, which are required to access the cloud-based Playwright Grid.
It saves the trouble of setting environment variables manually & hence the same code can seamlessly be tested on different environments without any manual intervention.
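For reference, sample.env is a plain key=value file; a minimal sketch with placeholder values (get the real ones from your TestMu AI Profile Page) would be:

```
LT_USERNAME=<your-username>
LT_ACCESS_KEY=<your-access-key>
```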
from playwright.sync_api import sync_playwright
Playwright provides both sync & async APIs to interact with web apps, but for this blog on using Playwright for web scraping, we are going to use the sync_api, which is simply a wrapper around the async API that abstracts away the need to implement async functionality.
For more complicated scenarios where there is a need for fine-grained control when dealing with specific scenarios on websites built using modern web frameworks, we can choose to use the async_api.
For most use cases, the sync_api should suffice, but it’s a bonus that the async_api does exist and can be leveraged when needed.
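For completeness, here is what the same kind of flow looks like with the async_api. This is a minimal sketch (assuming Playwright and its browser binaries are installed locally) that prints a page title:

```python
import asyncio
from playwright.async_api import async_playwright

async def main():
    # Every Playwright call is awaited; otherwise the API mirrors sync_api
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://ecommerce-playground.lambdatest.io/")
        print(await page.title())
        await browser.close()

if __name__ == "__main__":
    asyncio.run(main())
```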

Step 2 – Setting up logging & reading username & access key
In the next step, we set up logging to see the execution of our code & also print out the product name, price & link to the image. Logging is the recommended practice and should almost always be preferred over print() statements. The load_dotenv("sample.env") call reads the username & access key required to access our cloud-based Playwright Grid. The username and access key are available on the TestMu AI Profile Page.


Step 3 – Setting up desired browser capabilities
We set up the browser capabilities required by the cloud-based Playwright automated testing grid in a Python dictionary. Let us understand what each line of this configuration means.

Step 4 – Initializing browser context & connecting to cloud Playwright Grid
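A self-contained sketch of how that connection string is constructed (with a trimmed-down capabilities dict for brevity): the capabilities are JSON-encoded, percent-encoded, and appended to the WebSocket endpoint that playwright.chromium.connect() dials.

```python
import json
from urllib.parse import quote

# trimmed-down capabilities dict (the full dict is shown in Step 3)
capabilities = {"browserName": "Chrome", "LT:Options": {"platform": "Windows 10"}}

# JSON-encode and percent-encode the capabilities into the CDP endpoint URL
lt_cdp_url = (
    "wss://cdp.lambdatest.com/playwright?capabilities="
    + quote(json.dumps(capabilities))
)
print(lt_cdp_url)
```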

Step 5 – Open the website to scrape.
Open the website to scrape using the page context.

Step 6 – Click on ‘Shop by Category’.
‘Shop By Category’ is a link with the ‘button’ role assigned to it. Hence, we use Playwright‘s get_by_role() locator to navigate to it & perform the click() action.


Step 7 – Click on the ‘Software’.
Inspection of the ‘Software’ element shows that it’s a ‘link’ with name in a span. So, we use the built-in locator get_by_role() again and perform click() action.


Step 8 – Adjust the product drop down to get more products.
By default, the ‘Software’ page displays 15 items. We can easily change it to 75, which is nothing but a link to the same page with a different limit. Get that link and call the page.goto() method.


Step 9 – Loading Images.
The images on the website are lazy-loaded, i.e., only loaded as we bring them into focus or simply scroll down to them. To learn more about it, you can go through this blog on a complete guide to lazy load images. We use Playwright’s mouse wheel function to simulate a mouse scroll.
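The number of scroll iterations is simply the limit query parameter of the category URL. The demo code grabs it with split("=")[-1]; a slightly more robust sketch (the helper name scroll_steps is ours) using urllib.parse would be:

```python
from urllib.parse import urlparse, parse_qs

def scroll_steps(url: str) -> int:
    """Read the `limit` query parameter (default 15) to decide how many
    mouse-wheel scrolls are needed to bring every lazy image into view."""
    limit = parse_qs(urlparse(url).query).get("limit", ["15"])[0]
    return int(limit)

url = ("https://ecommerce-playground.lambdatest.io/index.php"
       "?route=product/category&path=17&limit=75")
print(scroll_steps(url))  # 75
```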
Step 10 – Preparing base locator.
Notice that all the products are contained within two divs with id=entry_212408 and class=row. Each product then has a class=product-grid. We use this knowledge to form our base locator.
The base locator will then be used to find elements, such as name, price & image.


Step 11 – Locating product name, price & image.
With respect to the base locator, the location of other elements becomes easy to capture.


Step 12 – Scraping the data.
Now that we have all the elements located, we simply iterate over the total products & scrape them one by one using the nth() method.
The total number of products is obtained by calling the count() method on base_product_row_locator. For the product name & price, we fetch text using all_inner_texts(), and the image URL is retrieved using the get_attribute('src') method.

Execution:
Et voilà! Here is the truncated execution snapshot from the VS Code, which shows data of the first 5 products.

Here is a snapshot from the TestMu AI dashboard, which shows all the detailed execution, video capture & logs of the entire process.

Let’s take another example where I will be scraping information from the TestMu AI Selenium Playground. In this demonstration, we scrape the following data from the TestMu AI Selenium Playground:

One important change we are going to make in this demo, to show the versatility of Playwright for web scraping using Python, is the use of XPath locators instead of Playwright’s built-in locators.
Here is the scenario for using Playwright for web scraping, which will be executed on Chrome on Windows 10 using Playwright Version 1.28.0.
Test Scenario:
Version Check:
At the time of writing this blog on using Playwright for web scraping, the version of Playwright is 1.28.0, and the version of Python is 3.9.12.
The code is fully tested and working on these versions.
Implementation:
Clone the Playwright Python WebScraping Demo GitHub repository to follow the steps mentioned further in the blog on using Playwright for web scraping.

import json
import logging
import os
import subprocess
import sys
import urllib.parse
from logging import getLogger

from dotenv import load_dotenv
from playwright.sync_api import sync_playwright

logger = getLogger("seleniumplaygroundscrapper.py")
logging.basicConfig(
    stream=sys.stdout,
    format="%(message)s",
    level=logging.DEBUG,
)

# Read LambdaTest username & access key from env file
load_dotenv("sample.env")

capabilities = {
    "browserName": "Chrome",
    "browserVersion": "latest",
    "LT:Options": {
        "platform": "Windows 10",
        "build": "Selenium Playground Scraping",
        "name": "Scrape LambdaTest Selenium Playground",
        "user": os.getenv("LT_USERNAME"),
        "accessKey": os.getenv("LT_ACCESS_KEY"),
        "network": False,
        "video": True,
        "console": True,
        "tunnel": False,
        "tunnelName": "",
        "geoLocation": "",
    },
}


def main():
    with sync_playwright() as playwright:
        playwright_version = (
            str(subprocess.getoutput("playwright --version")).strip().split(" ")[1]
        )
        capabilities["LT:Options"]["playwrightClientVersion"] = playwright_version
        lt_cdp_url = (
            "wss://cdp.lambdatest.com/playwright?capabilities="
            + urllib.parse.quote(json.dumps(capabilities))
        )
        logger.info("Initiating connection to cloud playwright grid")
        browser = playwright.chromium.connect(lt_cdp_url)
        # comment above line & uncomment below line to test on local grid
        # browser = playwright.chromium.launch()
        page = browser.new_page()
        try:
            page.goto("https://www.lambdatest.com/selenium-playground/")
            # Construct base locator section
            base_container_locator = page.locator(
                "//*[@id='__next']/div/section[2]/div/div/div"
            )
            for item in range(1, base_container_locator.count() + 1):
                # Find section, demo name & demo link with respect to the base locator & print them
                locator_row = base_container_locator.locator(f"//div[{item}]")
                for inner_item in range(locator_row.count()):
                    logger.info("*-*-" * 28)
                    logger.info(
                        f'Section: {locator_row.nth(inner_item).locator("//h2").all_inner_texts()[0]}\n'
                    )
                    for list_item in range(
                        locator_row.nth(inner_item).locator("//ul/li").count()
                    ):
                        logger.info(
                            f'Demo Name: {locator_row.nth(inner_item).locator("//ul/li").nth(list_item).all_inner_texts()[0]}'
                        )
                        logger.info(
                            f'Demo Link: {locator_row.nth(inner_item).locator("//ul/li/a").nth(list_item).get_attribute("href")}\n'
                        )
            status = "passed"
            remark = "Scraping Completed"
            page.evaluate(
                "_ => {}",
                'lambdatest_action: {"action": "setTestStatus", "arguments": '
                '{"status": "' + status + '", "remark": "' + remark + '"}}',
            )
        except Exception as ex:
            logger.error(str(ex))


if __name__ == "__main__":
    main()
Code Walkthrough:
Let’s now do a step-by-step walkthrough to understand the code.
Step 1 – Step 4
The imports, logging, cloud Playwright testing grid & browser context setup remain the same as in the previous demonstration; refer to them above.
Step 5 – Open the TestMu AI Selenium Playground Website
Use the page created in the browser context to open the TestMu AI Selenium Playground website, which we will scrape.

Step 6 – Construct the base locator
Inspecting the website in Chrome Developer Tools, we see that there is a master container that holds all the sub-blocks. We will use this to construct our base locator by copying the XPath.


Step 7 – Locate the section, demo name, and demo link
We repeat the inspection using Chrome Developer Tools, and it’s easy to spot that data is presented in 3 divs. Within each div, there are then separate sections for each item.


Based on the inspection, we iterate over the base locator and, for each div, extract the section title and the demo list items. The outermost for loop iterates over the outer container, the inner for loop gets the section title & the innermost for loop iterates over the list of demos.

Execution:
Hey presto! Here is the truncated execution snapshot from VS Code. We now have a collection of tutorials to visit whenever we wish.

Here is a snapshot from the TestMu AI dashboard, which shows all the detailed execution along with video capture & logs of the entire process.



Playwright for Python is a new but powerful tool for web scraping that allows developers to easily automate and control the behavior of web browsers. Its wide range of capabilities & its support for different browsers, operating systems, and languages make it a compelling choice for any browser-related task.
In this blog on using Playwright for web scraping, we learned in detail how to set up Python and use it with Playwright for web scraping using its powerful built-in locators using both XPath & CSS Selectors.
It is well worth considering Python with Playwright for web scraping in your next scraping project, as it is a valuable addition to the web scraping toolkit.