One of the biggest attractions of Python is how easy it makes working with data and generating different views of it. That data, however, is not always easily available: often, the information we need exists only on a web page, which makes it much harder to capture and use inside a program. That is where web scraping comes in, along with the tool we are going to use for it: Selenium.
When we do web scraping, we interpret the content of a web page in order to extract the raw data we are interested in. This already makes the problem more complicated than if the data were readily available through an API.
For example, if we were interested in temperature data from several cities, we would have two ways of obtaining it. There are several APIs built exactly for this purpose, such as OpenWeatherMap, which return data in a structured format such as JSON or XML. With a simple HTTP request, we get the data of interest.
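As a rough sketch of what that looks like (assuming you have registered for an OpenWeatherMap API key; the endpoint and response fields below follow its current-weather API and may differ in your case):

import requests

# Hypothetical value: you would need your own OpenWeatherMap API key
API_KEY = "your_api_key_here"

resposta = requests.get(
    "https://api.openweathermap.org/data/2.5/weather",
    params={"q": "Sao Paulo", "appid": API_KEY, "units": "metric"},
)
dados = resposta.json()  # the JSON body parsed into a Python dict
print(dados["main"]["temp"])  # the current temperature, already structured

Notice that no HTML parsing is needed here: the response is already data, ready to use.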
We could also get this data from a web page, such as Climatempo. In that case, however, the data comes mixed with HTML tags, which exist only for the visual organization of the page and add nothing to the data itself. Using the data this way is more complicated and also leaves us vulnerable to changes in the page's visual layout: a redesign can completely break older programs that were scraping the page.
As a rule, whenever possible it is preferable to opt for an API or some other way of obtaining the data in a structured form. That is not always possible, though, which is why web scraping tools are used so much. An example would be a service company that wants to track a competitor's prices on a daily basis. The data is openly available on the competitor's website, but not exposed through any API.
The company then has two choices: assign an employee to visit the competitor's site every day and write the values down manually (time-consuming and error-prone), or write a web scraping script for the competitor's site. The script requires a bigger initial effort, but after that it collects the data every day, quickly and without human intervention.
Selenium is a webdriver, that is, a tool that lets you simulate a real user operating a browser. Since its inception it has been used as an automated testing tool: a script can be triggered to run, simulating a user navigating the site and verifying that it works correctly. Today, however, it has become a general-purpose website automation tool and is also widely used for web scraping.
For a large number of sites, plain HTTP requests plus a web scraping library such as BeautifulSoup are enough. However, the number of sites built as Single Page Applications has grown a lot (more information in this video). These applications are built with JavaScript frameworks such as Angular, Vue and React, and they need a browser to render the result we actually see on screen when the site is opened.
For example, if we use the Python requests library to fetch the code of two sites, the Python page on Wikipedia and the site of the React JavaScript framework, the results are quite different. The first returns the page already assembled, exactly as we see it when opening it in the browser. In the second case, the page is generated dynamically, so the HTML we receive does not yet contain the data we want. This is one of the reasons to use Selenium: we need to wait for the browser to do all that processing, and only then fetch the data on the page:
import requests

resposta_python = requests.get("https://pt.wikipedia.org/wiki/Python")
texto_python = resposta_python.text
print(texto_python)

resposta_react = requests.get("https://pt-br.reactjs.org")
texto_react = resposta_react.text
print(texto_react)

Selenium is independent of programming language, and it even has its own IDE with which we can record commands, among several other features. In this article, we are going to use Selenium from Python. For that, we need to install the Selenium Python library:

pip install selenium

We also need to download the webdriver itself. Go to this page and choose the driver for your preferred browser. The driver must be placed on the computer's PATH or in the same folder as your Python code.
The first step is to open the Selenium webdriver on the page we are interested in. In the code below, we first import the Selenium library in Python. We then create a browser instance using the Firefox webdriver (this changes depending on your preferred browser) and use the get() function to open the Let's Code website, for example:
from selenium import webdriver

navegador = webdriver.Firefox()
navegador.get('https://letscode.com.br')
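If your preferred browser is Chrome, for example, the equivalent would be the sketch below (assuming chromedriver is available on the PATH):

from selenium import webdriver

navegador = webdriver.Chrome()            # Chrome instead of Firefox
navegador.get('https://letscode.com.br')
# ... interact with the page ...
navegador.quit()                          # closes the browser when we are done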
To find a specific element on the page, we need to analyze its HTML. The easiest way to do this is with the browser's developer tools, usually opened with the F12 key or by right-clicking somewhere on the page and choosing "inspect element". With them, we can see the whole structure of the page, as well as the names, classes and ids of each element.
As an example, let's try to get the names of all the coders listed on the "Our Team" page, starting from the page we are already on. The first step is to navigate to the target page, which we can do directly with its link:
navegador.get('https://letscode.com.br/nosso-time')

On this page, we have to select all the coders' cards. Using "inspect element", we can see the structure that every card follows (some style properties are omitted here to keep things cleaner, but by inspecting the site directly you can see everything).
We can see that all these cards share a class called "coderCard", and we can use that to select them all at once (to learn more about the available selection strategies, see this page). This operation returns a list, over which we can iterate with a for loop to select the name inside each card, as we do below:
cards = navegador.find_elements_by_class_name('coderCard')
for card in cards:
    # within each card, select the piece that contains the text
    nameHolder = card.find_element_by_class_name('nameHolder')
    # the first 'p' tag holds the coder's name
    firstP = nameHolder.find_element_by_tag_name('p')
    # go down to the 'b' tag and read its HTML, regardless of styling
    tagB = firstP.find_element_by_tag_name('b')
    nome = tagB.get_attribute('innerHTML')
    print(nome)
First, we use the 'nameHolder' class to select, within each card, the piece that contains the text. Then we use the 'p' tag to select only the name text. Note that 'nameHolder' contains more than one 'p' tag, but since we use a find_element function (singular), we only get the first one back. Normally we could stop there and use firstP.text to get the name (since it is the only textual element inside the p tag). However, the text property only returns visible text, and on this site the cards are hidden while they are off screen, so text returns an empty string in most cases. We therefore need to go one level deeper, to the b tag, and use get_attribute() to read the HTML inside it regardless of styling.
With that, we have the data we were interested in. It can be saved to a file, a database or anywhere else, and the script can be scheduled to run daily to pick up changes to the page.
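As an illustration (a minimal sketch; the file name coders.csv and the nomes list are assumptions for this example, collected in the loop above instead of printed):

import csv
from datetime import date

# 'nomes' is assumed to be a list with the names gathered in the loop above
with open('coders.csv', 'a', newline='', encoding='utf-8') as arquivo:
    escritor = csv.writer(arquivo)
    for nome in nomes:
        # one row per coder, tagged with today's date
        escritor.writerow([date.today().isoformat(), nome])

Opening the file in append mode ('a') lets a daily run keep adding new rows to the same file.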
And that's it: you now know how to do web scraping with Selenium! It is worth mentioning that this is a very simple example and does not come close to using Selenium's full functionality. After finding an element, we can click on it, fill in forms, drag elements around the screen... in other words, automate any user flow (Documentation); a small sketch of such interactions follows at the end of this article. Here we only covered the essential part: identifying an element on the screen. If you are interested in this content, check out our Python course at Let's Code!
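For reference, the promised sketch of those other interactions (hedged: the URL, field name and button id here are hypothetical, not taken from a real page):

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

navegador = webdriver.Firefox()
navegador.get('https://example.com')           # hypothetical page with a search box

campo = navegador.find_element_by_name('q')    # hypothetical field name
campo.send_keys('web scraping')                # type into the field
campo.send_keys(Keys.ENTER)                    # submit by pressing Enter

botao = navegador.find_element_by_id('next')   # hypothetical button id
botao.click()                                  # click an element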