Headless Browsing
We will extend our Python toolset to script web browser automation.
Headless browsers provide automated control of a web page in an environment similar to popular web browsers.
Puppeteer automates a headless Chrome browser. It is a popular JavaScript headless browsing framework. For more about it, see its GitHub repository: github.com/puppeteer/puppeteer. Pyppeteer is a port of Puppeteer to Python.
Headless Browsing Use Cases
The following are the use cases for headless browsing:
- Test automation in modern web applications.
- Taking screenshots of web pages.
- Running automated tests for JavaScript libraries.
- Scraping web sites for data.
- Automating interaction with web pages.
Malicious Headless Browsing Use Cases
Headless browsing can also be used for malicious purposes, such as:
- Performing DDoS attacks on web sites.
- Inflating advertisement impressions.
- Automating web sites in unintended ways (for example, credential stuffing).
Installing Pyppeteer
We can install Pyppeteer with pip by running the following command:
pip install pyppeteer
Note that headless browsing can be used to:
- Test web applications for performance.
- Test web applications for errors.
- Take screenshots of web pages.
Writing Scripts to Visit a Web Page
Coroutines and Asyncio Module
Coroutines are computer program components that generalize subroutines for non-preemptive multitasking, by allowing multiple entry points for suspending and resuming execution at certain locations.
The asyncio module provides the event loop and related tools for running asynchronous coroutines written with the async/await syntax.
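As a minimal sketch of the suspend-and-resume behaviour described above, the following standard-library example runs two coroutines on one event loop. The names worker and run_workers are illustrative only and are not part of asyncio or Pyppeteer.

```python
import asyncio


async def worker(name: str, steps: int, log: list) -> None:
    """Record each step, yielding to the event loop between steps."""
    for step in range(steps):
        log.append(f"{name}:{step}")
        # await suspends this coroutine here so the other one can run.
        await asyncio.sleep(0)


async def run_workers() -> list:
    log: list = []
    # gather() schedules both coroutines on the same event loop.
    await asyncio.gather(worker("a", 2, log), worker("b", 2, log))
    return log


result = asyncio.run(run_workers())
print(result)  # ['a:0', 'b:0', 'a:1', 'b:1']
```

The interleaved log shows that each coroutine gives up control at its await and later resumes from the same point, which is the non-preemptive multitasking the definition refers to.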
Async & Await
The syntax async def introduces either a native coroutine or an asynchronous generator.
The keyword await passes function control back to the event loop and suspends the execution of the surrounding coroutine. For example, when a coroutine executes await main(), it is suspended until main() finishes, and the event loop is free to run other tasks in the meantime.
Let's take a look at an example of a headless call in Python:
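import asyncio
from pyppeteer import launch

async def main():
    # Start a headless Chrome instance and open a new tab.
    browser = await launch()
    page = await browser.newPage()
    # Load the page, then save a screenshot to example.png.
    await page.goto('https://example.com')
    await page.screenshot({'path': 'example.png'})
    await browser.close()

# Run the main() coroutine on the event loop until it completes.
asyncio.get_event_loop().run_until_complete(main())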
The script above shows that async def introduces a native coroutine, and that await passes function control back to the event loop. To get the result of a coroutine function, we must await it (or, at the top level, hand it to the event loop, as the last line of the script does).
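The difference between calling a coroutine function and awaiting it can be sketched with the standard library alone. Here fetch_title is a hypothetical stand-in for a real page operation and returns a fixed string.

```python
import asyncio
import inspect


async def fetch_title() -> str:
    """Hypothetical stand-in for a page operation; returns a fixed string."""
    await asyncio.sleep(0)
    return "Example Domain"


async def main() -> str:
    # Calling the coroutine function only creates a coroutine object.
    pending = fetch_title()
    assert inspect.iscoroutine(pending)
    # Awaiting it runs the body and produces the return value.
    return await pending


title = asyncio.run(main())
print(title)  # Example Domain
```

Forgetting the await would leave pending as an unexecuted coroutine object, which is a common source of bugs in Pyppeteer scripts.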
Extracting HTML Elements with Python Script
We can develop scripts that use headless browsing to extract elements from a web page.
Page Class
This class provides methods to interact with a single tab of Chrome. The following methods are commonly used:
- goto() navigates the page to a URL.
- screenshot() saves the page as an image.
- content() returns the page's HTML content as a string.
- metrics() returns a dictionary of metrics as key-value pairs.
- emulate() sets the user agent and other browser characteristics, such as the viewport.
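The methods listed above might be combined as in the following sketch. The helper summarize_page, the user agent string, the viewport size, and the file name summary.png are all assumptions made for illustration. The helper takes an already opened Pyppeteer page so it can be exercised without launching Chrome.

```python
import asyncio
from typing import Any, Dict


async def summarize_page(page: Any, url: str) -> Dict[str, Any]:
    """Visit url with a Pyppeteer page and collect a few facts about it."""
    # emulate() sets the viewport and user agent before navigation.
    await page.emulate({
        "viewport": {"width": 1280, "height": 800},
        "userAgent": "example-agent/1.0",  # hypothetical user agent string
    })
    await page.goto(url)
    html = await page.content()     # full HTML of the page as a string
    metrics = await page.metrics()  # dictionary of metric name -> value
    await page.screenshot({"path": "summary.png"})
    return {"url": url, "html_length": len(html), "metrics": metrics}


async def main() -> None:
    # Imported here so the helper above can be reused without starting Chrome.
    from pyppeteer import launch

    browser = await launch()
    try:
        page = await browser.newPage()
        print(await summarize_page(page, "https://example.com"))
    finally:
        # Close the browser even if one of the page calls fails.
        await browser.close()
```

To try it against a real browser, run main() on the event loop, for example with asyncio.get_event_loop().run_until_complete(main()).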
Selecting DOM Elements
To select elements, for example to read their inner text, we use the following Page methods:
- page.querySelector()
- page.querySelectorAll()
- page.xpath()
For example, querySelector() can be used as follows:
element = await page.querySelector("h1")
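A sketch of how all three selectors might be used to read text follows. The helper names (element_text, first_heading, all_link_texts, paragraphs_by_xpath) and the selectors "a" and "//p" are illustrative choices. Each helper expects an open Pyppeteer page, and reading textContent through page.evaluate() is one of several possible ways to get an element's text.

```python
import asyncio
from typing import Any, List, Optional


async def element_text(page: Any, element: Any) -> str:
    """Return the text content of an element handle."""
    # evaluate() passes the handle to the JavaScript function as `el`.
    return await page.evaluate("(el) => el.textContent", element)


async def first_heading(page: Any) -> Optional[str]:
    # querySelector() returns the first match, or None if nothing matches.
    element = await page.querySelector("h1")
    if element is None:
        return None
    return await element_text(page, element)


async def all_link_texts(page: Any) -> List[str]:
    # querySelectorAll() returns a list of every matching element.
    links = await page.querySelectorAll("a")
    return [await element_text(page, link) for link in links]


async def paragraphs_by_xpath(page: Any) -> List[str]:
    # xpath() also returns a list, selected with an XPath expression.
    nodes = await page.xpath("//p")
    return [await element_text(page, node) for node in nodes]
```

Checking for None before using the result of querySelector() avoids an error when the page has no matching element.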
Running JavaScript
A JavaScript string passed to Pyppeteer can be either a function or an expression. Pyppeteer tries to detect automatically whether the string is a function or an expression. The force_expr=True option forces Pyppeteer to treat the string as an expression. See the example code below:
dimensions = await page.evaluate('''() => {
    return {
        width: document.documentElement.clientWidth,
        height: document.documentElement.clientHeight,
        deviceScaleFactor: window.devicePixelRatio,
    }
}''')
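The function and expression forms can be compared side by side in a short sketch. The helper evaluate_examples and the JavaScript snippets are illustrative assumptions, and the helper expects an open Pyppeteer page.

```python
import asyncio
from typing import Any, Dict


async def evaluate_examples(page: Any) -> Dict[str, Any]:
    """Run a few JavaScript snippets in the page and collect the results."""
    # A string holding an arrow function is called in the page context.
    title = await page.evaluate("() => document.title")
    # A plain expression is evaluated and its value is returned.
    total = await page.evaluate("1 + 2")
    # force_expr=True skips detection and treats the string as an expression.
    url = await page.evaluate("window.location.href", force_expr=True)
    # Extra positional arguments are passed to the JavaScript function.
    product = await page.evaluate("(a, b) => a * b", 6, 7)
    return {"title": title, "total": total, "url": url, "product": product}
```

Values returned from the page come back as ordinary Python objects, so the object returned by the dimensions example above arrives as a dictionary.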
In summary, XPath can be used to navigate through elements in a page, and evaluate() executes JavaScript in the context of the page. Arrow functions are a convenient way to write the JavaScript passed to evaluate().