This article is a detailed explanation of web scraping in Python using BeautifulSoup. The prerequisites are a working knowledge of Python and Pandas.
Web scraping also requires a little knowledge of HTML. If you already know it, good; otherwise don’t worry, I’ll cover the required HTML topics.
First, we’ll talk about web scraping, then we’ll look into BeautifulSoup, and finally we’ll work through an example.
Why do we need Web Scraping?
Suppose someone asks you for a list of the Top 100 Movies with all the details, such as year, rating, directors, and actors. What would you do?
First, you’d search Google for Top 100 Movies, open the first link (maybe IMDb), and start copy-pasting the list and the details. That is a bad idea. What if you had a script or program that takes the URL of the website and extracts all the required information from it?
Similarly, there might be hundreds of websites with information relevant to you. Some hold static information, while others, such as sports and news sites, change constantly. In today’s world, digital information is very important and highly valuable.
The incredible amount of data on the Internet is a rich resource for any field of research or personal interest. To harvest that data effectively, you’ll need to become skilled at web scraping, and the Python libraries requests and Beautiful Soup are powerful tools for the job.
Web Scraping provides a way to automate the information extraction from the given website(s).
What is it?
Web scraping is the automated extraction of data from websites. After extraction, the data is processed and converted into useful information.
A website is a collection of web pages, and each web page is built with a text-based markup language such as HTML or XHTML.
Web pages contain useful information in text form, but they are designed for human end-users, so special tools are needed to automate the extraction.
A web scraper uses the structural HTML elements (div, span, p, a, etc.) and the attributes (id, class) of a web page to extract the text information.
Now before moving towards BeautifulSoup, first let’s take a brief look into some HTML basics.
HTML Basics.
To extract information, we first need insight into the structure of the web page, which tells us which section of the HTML holds a particular piece of information.
To better understand a web page’s structure, we need to know some HTML basics.
Elements.
An HTML element has 3 parts: a start tag, some content, and an end tag.
HTML has several types of elements, each with a different purpose and usage. Each type of element is uniquely identified by its tag name. Elements can also be nested inside one another.
- h1: Used for headings; it displays the heading in the biggest size. h2, h3, h4, h5, and h6 are the other heading elements.
- p: Used for paragraphs.
- a: Used for hyperlinks.
- div: Defines a division or section.
- span: Used to group inline elements.
Attributes.
An HTML element may or may not have attributes. Attributes provide additional information about the element. We’ll only talk about two attributes, class and id.
- Class: The HTML class attribute is used to apply the same styles to all elements that share a class name.
- Id: The id attribute specifies a unique id for an HTML element (the value must be unique within the HTML document).
This was only a brief look at some components of HTML. If you still have doubts or want to explore more, check out w3schools.
Ok, so let’s dive into BeautifulSoup, a beautiful tool that makes web scraping super easy.
BeautifulSoup.
BeautifulSoup is a Python package for parsing HTML and XML documents. It provides Pythonic idioms for iterating over, searching, and modifying the parse tree.
It can work with different parsers, such as html.parser, lxml, and html5lib. BeautifulSoup provides an API to search based on the structural elements and attributes of HTML and XHTML.
Installation.
Since it is a Python package, installation is easy: install it with pip (pip install beautifulsoup4). I am assuming that you are working in a Python 3 environment.
Usage.
We’ll only talk about the most common usages.
Soup.
Before we parse an HTML document, we have to create the soup of that document; it is the base of all the parsing. The soup is created by passing the HTML content to the BeautifulSoup constructor.
The HTML content can be passed in multiple ways, such as passing the file pointer of an HTML document (web page) or passing the HTML content as a string.
Passing the file pointer.
Passing HTML content as string.
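The two ways of creating the soup can be sketched as below. This is a minimal illustration rather than the original listing: the HTML string and the temporary file name movie.html are made up for the example, and the built-in html.parser is assumed as the parser.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical HTML content used only for this illustration.
html = "<html><body><h1>Godfather</h1></body></html>"

# 1. Passing HTML content as a string.
soup_from_string = BeautifulSoup(html, "html.parser")

# 2. Passing the file pointer. A temporary file stands in for a saved web page.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "movie.html")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(html)
    with open(path, encoding="utf-8") as fp:
        soup_from_file = BeautifulSoup(fp, "html.parser")

print(soup_from_string.h1.text)  # Godfather
print(soup_from_file.h1.text)    # Godfather
```

Both calls produce an equivalent soup; which one you use only depends on where the HTML comes from.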
Prettify.
Once the soup is ready we can display it in a very nice and clean way using Prettify. Prettify maintains all the structural hierarchy of HTML while displaying it.
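A small sketch of prettify on a made-up one-line HTML string (the content is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML written on a single line.
html = "<div><h1>Godfather</h1><p>Crime, Drama</p></div>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a string with each element on its own indented line.
pretty = soup.prettify()
print(pretty)
```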
Running prettify on a soup gives nicely indented HTML output, which helps us get a better understanding of the structure and hierarchy of the HTML.
Find.
Now we are ready to parse the document and extract data from it. find helps us find HTML elements (div, span, p, a) based on their tag, id, and attributes.
find always returns only the first search result.
Let’s take a small piece of HTML content as an example.
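The sample below is a stand-in I have assumed for this walkthrough: one movie with a heading, a year, and a genre. The id movie-name and the class names year and genre are hypothetical, chosen only so that the find examples have something to match.

```python
from bs4 import BeautifulSoup

# Hypothetical sample content: one movie with an id on the heading
# and a class on each detail.
html = """
<div class="movie">
  <h1 id="movie-name">Godfather</h1>
  <span class="year">1972</span>
  <span class="genre">Crime, Drama</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())
```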
Now let’s extract different pieces of information from this content using different properties.
Find using tag name.
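A minimal sketch of a search by tag name, using an assumed one-movie sample (the id and class names are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical sample content.
html = """
<div class="movie">
  <h1 id="movie-name">Godfather</h1>
  <span class="genre">Crime, Drama</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first element with the given tag name.
movie_name = soup.find("h1").text
print(movie_name)  # Godfather
```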
Calling find with the tag name h1 returns the first HTML element with that tag name, and text returns the text part of that element. For the sample content, the output is Godfather.
BeautifulSoup provides multiple ways to extract the same information. For example, the first h1 element is also available directly as an attribute of the soup.
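One such alternative, sketched on the same assumed sample, is attribute-style access:

```python
from bs4 import BeautifulSoup

# Hypothetical sample content.
html = '<div class="movie"><h1 id="movie-name">Godfather</h1></div>'
soup = BeautifulSoup(html, "html.parser")

# Tag names are also available as attributes of the soup;
# soup.h1 is shorthand for soup.find("h1").
print(soup.h1.text)                # Godfather
print(soup.find("h1").get_text())  # Godfather
```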
Find using id.
BeautifulSoup can also search for an element by its id. You can extract the movie name from the above example using its id.
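A sketch of a search by id. The id value movie-name is an assumption of this example, not something taken from a real page:

```python
from bs4 import BeautifulSoup

# Hypothetical sample content; "movie-name" is an assumed id.
html = '<div class="movie"><h1 id="movie-name">Godfather</h1></div>'
soup = BeautifulSoup(html, "html.parser")

# Search by the id attribute instead of the tag name.
movie_name = soup.find(id="movie-name").text
print(movie_name)  # Godfather
```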
Find using attributes.
You can search for an element based on its attributes, such as class. We’ll extract the genre from the above example by its class.
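A sketch of a search by class, again on an assumed sample. Because class is a reserved word in Python, BeautifulSoup takes it as the class_ keyword or through the attrs dictionary:

```python
from bs4 import BeautifulSoup

# Hypothetical sample content; "genre" is an assumed class name.
html = """
<div class="movie">
  <h1 id="movie-name">Godfather</h1>
  <span class="genre">Crime, Drama</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Two equivalent ways of searching by class.
genre = soup.find("span", class_="genre").text
same_genre = soup.find("span", attrs={"class": "genre"}).text
print(genre)  # Crime, Drama
```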
Find all.
In the above section, we saw that find returns only the first result, but if we want all the elements with the specific property then we can use find_all.
Let’s take another example with multiple movies.
Since all the movies are in h1 heading elements, let’s find all the elements with the h1 tag name. Calling find_all with h1 returns all 3 results.
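A sketch of find_all on an assumed sample with three movies (the titles and markup are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample content with three movies.
html = """
<div class="movie"><h1>Godfather</h1></div>
<div class="movie"><h1>The Dark Knight</h1></div>
<div class="movie"><h1>Pulp Fiction</h1></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching element, not just the first one.
movies = soup.find_all("h1")
print(len(movies))  # 3

# text has to be read from each individual result.
names = [movie.text for movie in movies]
print(names)  # ['Godfather', 'The Dark Knight', 'Pulp Fiction']
```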
Notice that text is not applied to the result as a whole, as it was with find. With find_all you have to extract the text from each individual result.
Apart from returning multiple results, find_all works the same way as find.
We have learned some of the basic APIs of BeautifulSoup, but it provides many other APIs for more complex scenarios, so if you are willing to explore then check out the documentation.
Done with the theory? OK, let’s take an example and extract data from a real website.
For web scraping, we need to know the structure of the web page we are dealing with. Understanding the HTML structure is the most important part of web scraping, so pay extra attention to the next section.
Analysis of the web page.
From the beginning of the article we are talking about the Top 100 Movies, so let’s take the same example and extract the information about movies from IMDb.
Inspect the web page.
First, open the Top 100 Movies page and analyze its HTML structure. To analyze any web page we can use the browser’s inspect feature: right-click on the web page and choose the inspect option from the menu.
This gives us the general HTML layout of the web page. To see the structure of a particular element, move the cursor over that element on the web page and then inspect it.
If you move the cursor around in the HTML section, you’ll see different blocks highlighted on the web page, which shows the mapping between each HTML element and its corresponding block.
When a movie’s block is highlighted on the web page, the label above it reads div.lister-item-content, which is the HTML element of that block. Here div is the actual element and lister-item-content is its class attribute.
The highlighted block contains all the information about a particular movie (name, year, genre, etc.). Similarly, there are 100 such blocks in the web page, one per movie. If you inspect some of them, you’ll find that they have similar HTML elements and attributes.
Finding the relevant HTML elements.
Now let’s explore the above block further. Click on div.lister-item-content in the HTML section and you’ll see multiple nested elements. These nested elements hold our information, so to find the HTML element for a given piece of information, move the cursor over it on the web page and inspect it.
Finding the relevant HTML elements is a fairly easy task, so let’s see what you have found. Below is the list of movie details and their corresponding elements.
- Movie Name: Element a under the h3.lister-item-header.
- Year: Element span.lister-item-year text-muted unbold under h3.lister-item-header.
- Runtime: Element span.runtime under p.text-muted text-small.
- Genre: Element span.genre under p.text-muted text-small.
- Rating: Element span.ipl-rating-star__rating.
- Director: Element a under second p.text-muted text-small.
- Stars: Element a under second p.text-muted text-small.
Finding the above information about the HTML elements is the most important step of web scraping, so before we move further, make sure you understand each part of it.
If you have understood the above sections clearly, the implementation is going to be a piece of cake. So let’s implement web scraping in Python using BeautifulSoup.
Implementation in Python
Note: Not all websites allow web scraping, so please be cautious before you scrape a given website; it might get your IP address blocked from accessing that website.
As discussed, we are going to use web scraping to extract the information about the Top 100 Movies, so let’s implement it in Python step by step.
Getting the HTML content.
Here we’ll use the Python requests module to fetch the HTML content.
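The fetching step might look like the sketch below. It is not the original listing: the function name fetch_soup, the timeout value, and the placeholder URL in the usage note are my assumptions, and the optional get argument exists only so the function can be tried out without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, get=requests.get):
    """Download `url` and return the page parsed as a BeautifulSoup object.

    `get` defaults to requests.get; it is a parameter only so that a
    stand-in can be supplied when no network is available.
    """
    r = get(url, timeout=10)  # send the GET request
    r.raise_for_status()      # stop early on HTTP errors such as 404
    # r.content holds the raw HTML of the response.
    return BeautifulSoup(r.content, "html.parser")
```

You would call it with the address of the list page, for example fetch_soup("https://example.com/top-100-movies"), where the URL here is only a placeholder.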
First, we need the actual URL of the page whose content we want to fetch.
We then send a GET request to that URL, which returns the actual HTML content along with other response-related information such as the headers. We can extract the HTML content from the response using r.content.
Once we have the actual HTML content, we create the soup by passing r.content to BeautifulSoup.
Finding the main blocks.
In the Analysis of the web page section we talked about the main block that holds all the movie-related information; the entire web page has 100 such blocks, one per movie.
So the first task is to find all the main blocks. As we know from that section, a main block is a div element with the lister-item-content class attribute, so we can use this information to find all the blocks.
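A sketch of this step on a simplified stand-in for the page. Only the class names come from the analysis above; the two-block HTML is made up so the example runs without fetching anything.

```python
from bs4 import BeautifulSoup

# Two simplified blocks standing in for the 100 blocks of the real page.
html = """
<div class="lister-item-content">
  <h3 class="lister-item-header"><a href="#">The Godfather</a></h3>
</div>
<div class="lister-item-content">
  <h3 class="lister-item-header"><a href="#">The Dark Knight</a></h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Every movie lives in a div with the lister-item-content class.
movie_blocks = soup.find_all("div", class_="lister-item-content")
print(len(movie_blocks))  # 2
```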
We use find_all because we want to fetch all the blocks. Next, we iterate through the blocks and extract the information from each block individually.
Extracting information from each block.
This section involves the actual information extraction, so pay extra attention and if you don’t understand any part of it then please refer to the Analysis of the web page section.
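The extraction for a single block could be sketched as follows. The HTML is a simplified stand-in built from the class names listed in the analysis (the real markup may differ), and the code is my reconstruction of the steps explained next, not the original listing.

```python
import re

from bs4 import BeautifulSoup

# Simplified stand-in for one movie block; the real markup may differ.
html = """
<div class="lister-item-content">
  <h3 class="lister-item-header">
    <a href="#">The Godfather</a>
    <span class="lister-item-year text-muted unbold">(1972)</span>
  </h3>
  <p class="text-muted text-small">
    <span class="runtime">175 min</span>
    <span class="genre">
      Crime, Drama
    </span>
  </p>
  <span class="ipl-rating-star__rating">9.2</span>
  <p class="text-muted text-small">
    Director:
    <a href="#">Francis Ford Coppola</a>
    <span class="ghost">|</span>
    Stars:
    <a href="#">Marlon Brando</a>,
    <a href="#">Al Pacino</a>
  </p>
</div>
"""
block = BeautifulSoup(html, "html.parser").find("div", class_="lister-item-content")

movie = {}
movie["name"] = block.find("a").text
# Year: remove the enclosing brackets, "(1972)" becomes "1972".
movie["year"] = block.find("span", class_="lister-item-year").text.strip("()")
movie["runtime"] = block.find("span", class_="runtime").text
movie["genre"] = block.find("span", class_="genre").text.strip()
movie["rating"] = block.find("span", class_="ipl-rating-star__rating").text

# Directors and stars sit in the second matching <p> (index 1).
# Matching on the single class text-muted selects both <p> elements.
people_text = block.find_all("p", class_="text-muted")[1].text.replace("\n", "")
people = people_text.split("|")
directors = re.split("Director:|Directors:", people[0])[1]
stars = re.split("Star:|Stars:", people[1])[1]
movie["directors"] = [name.strip() for name in directors.split(",")]
movie["stars"] = [name.strip() for name in stars.split(",")]

print(movie)
```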
Let’s go through each part of the extraction step by step.
Movie Name: Since the movie name is in the first ‘a’ element, we can extract it using find. text is used to extract the text part of it (discussed earlier).
Year: It is in the span element with the lister-item-year text-muted unbold class. After extracting it, we remove the enclosing brackets.
Runtime: It can be found in the span element with the runtime class.
Genre: We can extract it from the span element with the genre class. Once extracted, we apply the string strip to remove the unwanted spaces.
Rating: It is in the span element with the ipl-rating-star__rating class.
Directors and Stars: All the above parts were easy, but this part is a little tricky. There is more than one ‘p’ element with the text-muted text-small class, and the information about Directors and Stars is in the second result, which is why we take the result at index 1 (the count starts at 0).
Next we remove all the unwanted ‘\n’ characters. As you might have noticed, the lists of Director(s) and Star(s) are separated by ‘|’, so we split the text on it, which gives the people list. After the split, the first part of people contains the list of Directors, and the second part contains the list of Stars.
The names of all the directors come after the Director or Directors keyword, so we apply re.split (it allows splitting on multiple delimiters). This split gives a list of two elements as follows,
Output of the above code for the first movie.
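The original output is not reproduced here, but the behaviour of the split can be illustrated on a simplified string (the real text also carries extra whitespace):

```python
import re

# Simplified version of the first part of the people list.
text = "Director:Francis Ford Coppola"

# Split on either keyword; the text before the keyword is empty.
parts = re.split("Director:|Directors:", text)
print(parts)  # ['', 'Francis Ford Coppola']
```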
As you can see, only the second element (index 1) contains the Director(s) list, which is why we pick that element only. We then split it on ‘,’ to get all the director names as a list, and apply the string strip to each name to clean the text.
We apply exactly the same logic to extract the list of Stars.
Now we have extracted all the required information, which you can save in any format you want.
The next section contains the complete code, and in that code we save the information into a Pandas DataFrame.
Complete Code.
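Putting the steps together, a complete scraper could be sketched as below. This is a reconstruction under assumptions, not the original listing: the function names are mine, the class names come from the analysis section and may no longer match the live page, and the URL has to be supplied by you.

```python
import re

import requests
from bs4 import BeautifulSoup


def split_people(text, label):
    """Return the names that follow 'Director(s):' or 'Star(s):' in `text`."""
    names = re.split(f"{label}:|{label}s:", text)[1]
    return [name.strip() for name in names.split(",")]


def parse_block(block):
    """Extract the details of one movie from its lister-item-content block."""
    people_text = block.find_all("p", class_="text-muted")[1].text
    people = people_text.replace("\n", "").split("|")
    return {
        "name": block.find("a").text,
        "year": block.find("span", class_="lister-item-year").text.strip("()"),
        "runtime": block.find("span", class_="runtime").text,
        "genre": block.find("span", class_="genre").text.strip(),
        "rating": block.find("span", class_="ipl-rating-star__rating").text,
        "directors": split_people(people[0], "Director"),
        "stars": split_people(people[1], "Star"),
    }


def scrape_movies(html):
    """Parse the page and return one dictionary per movie block."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = soup.find_all("div", class_="lister-item-content")
    return [parse_block(block) for block in blocks]


def movies_to_dataframe(url):
    """Fetch `url`, scrape every movie, and return a Pandas DataFrame."""
    import pandas as pd  # imported here because only this step needs pandas

    r = requests.get(url, timeout=10)
    r.raise_for_status()
    return pd.DataFrame(scrape_movies(r.content))
```

Calling movies_to_dataframe with the address of the list page returns one row per movie, which can then be saved with DataFrame.to_csv. A block that lacks one of the expected elements would raise an error in this sketch, so real pages may need extra checks.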
So this is all about web scraping in Python using BeautifulSoup. If you have any issues or doubts, please let me know in the comments.
Thanks for reading!
Web scraping python beautifulsoup tutorial with example
The data present on the web is unstructured, and web scraping helps to collect it and store it. There are several ways to get data from websites and online services. One is to use the website’s API; for example, Facebook has the Facebook Graph API, which allows retrieval of data posted on Facebook. Another is to access the HTML of the webpage and extract useful data from it. This technique is called web scraping, web harvesting, or web data extraction.
Steps involved in web scraping python beautifulsoup :-
- Send a request to the URL of the webpage which you want to access.
- The server responds to the request by returning the HTML content of the webpage.
- Once we have the HTML content, the remaining task is to parse the data.
- Finally, navigate and search the parse tree that we created.
Installing the required third-party libraries:-
The easiest way to install a library in Python is pip, which is used to install and manage Python packages.
pip install requests
pip install html5lib
pip install bs4
Then access the HTML content from the webpage:-
import requests
URL = "http://www.geeksforgeeks.org/data-structures/"
r = requests.get(URL)
print(r.content)
- The first step is to import the requests library and specify the URL of the webpage which you want to scrape.
- Then send an HTTP request to the URL and save the response from the server in a response object called r.
- Finally, print r.content to get the raw HTML content of the webpage.
Parse HTML content:-
import requests
from bs4 import BeautifulSoup
URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())
BeautifulSoup is built on top of HTML parsing libraries such as html.parser, lxml, and html5lib, so you specify the parser library as,
soup = BeautifulSoup(r.content, 'html5lib')
In the above example, soup = BeautifulSoup(r.content, 'html5lib') creates a BeautifulSoup object by passing two arguments.
html5lib:- the parser which we want to use.
r.content:- the raw HTML content.
Libraries used for web scraping python beautifulsoup :-
We will use the following libraries:
- Selenium: - A web testing library, used to automate browser activities.
- BeautifulSoup: - A Python package for parsing HTML and XML documents. It creates parse trees which are helpful to extract the data easily.
- Pandas: - A library used for data manipulation and analysis. It is used to extract the data and store it in the desired format.
Automated web scraping can be used to speed up the data collection process.
You write your code once and it gets the information you want many times and from many pages.
If you try to get the information manually, you have to spend a lot of time clicking, scrolling, and searching.
This is especially true if you need large amounts of data from websites that are regularly updated with new content.
Manual web scraping can take a lot of time and repetition.
There is a lot of information on the Web, and new information is constantly being added.
The Python libraries requests and Beautiful Soup are both powerful tools for the job.
If you like to learn with hands-on examples and have a basic understanding of Python and HTML, this tutorial is for you.
Web scraping extracts the data and presents it in a format you can easily make sense of.
It is the process of gathering information from the Internet.
HTML tags:-
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h1>First Scraping</h1>
<p>Hello World</p>
</body>
</html>
1. <!DOCTYPE html>: An HTML document must start with a type declaration.
2. The HTML document is contained between <html> and </html>.
3. The meta and script declarations of the HTML document are between <head> and </head>.
4. The visible part of the HTML document is between the <body> and </body> tags.
5. Title headings are defined with the <h1> through <h6> tags.
6. Paragraphs are defined with the <p> tag.
Other useful tags include <a> for hyperlinks, <table> for tables, <tr> for table rows, and <td> for table columns.
HTML tags sometimes come with id or class attributes.
The id attribute specifies a unique id for an HTML tag, and the value must be unique within the HTML document.
The class attribute is used to define equal styles for HTML tags with the same class.
We make use of these ids and classes to help us locate the data we want.
The rules for scraping:-
You should check a website’s Terms and Conditions before you scrape it. Be careful to read the statements about the legal use of data; usually, the data you scrape should not be used for commercial purposes.
Do not request data from the website too aggressively with your program, as this may break the website. The layout of a website may also change from time to time, so make sure to revisit the site and rewrite your code as needed.
Scraping Flipkart Website:-
Find the URL that you want to scrape
We are going to scrape the Flipkart website to extract the Price, Name, and Rating of Laptops.
The URL for this page is https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2.
Inspecting the Page
The data is usually nested in tags, so we inspect the page to see under which tag the data we want to scrape is nested.
To inspect the page, just right-click on the element and click on “Inspect”.
You will then see a “Browser Inspector Box” open.
Find the data you want to extract
We will extract the Price, Name, and Rating, each of which is nested in a “div” tag.
Web scraping python beautifulsoup Example:-
Importing libraries as,
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import pandas as pd
For configuration:-
# Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service("/usr/lib/chromium-browser/chromedriver"))
products = []  # product names
prices = []    # product prices
ratings = []   # product ratings
driver.get("https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2")
Code is as follows:-
content = driver.page_source
soup = BeautifulSoup(content, "html.parser")
# Each product card is an <a> element; the class names are specific to the page layout
for a in soup.find_all('a', href=True, attrs={'class': '_31qSD5'}):
    name = a.find('div', attrs={'class': '_3wU53n'})
    price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
    rating = a.find('div', attrs={'class': 'hGSR34 _2beYZw'})
    products.append(name.text)
    prices.append(price.text)
    ratings.append(rating.text)
Run the code and extract the data
To run the code, use the below command:
python web-s.py
Store the data in a required format:-
df = pd.DataFrame({'Product Name': products, 'Price': prices, 'Rating': ratings})
df.to_csv('products.csv', index=False, encoding='utf-8')
APIs: An Alternative to Web Scraping:-
The Web has grown out of many sources and combines a ton of different technologies, styles, and personalities.
APIs (application programming interfaces) allow you to access data in a predefined manner.
With an API you can avoid parsing HTML and instead access the data directly using formats like JSON and XML.
HTML is primarily a way to visually present content to users.
When you use an API, the process is generally more stable than gathering the data through web scraping.
APIs are made to be consumed by programs rather than by human eyes.
Scraping the Monster Job Site:-
You will build a web scraper that fetches Software Developer job listings from the Monster job aggregator site.
The web scraper will parse the HTML to pick out the relevant pieces of information and filter that content for specific words.
Inspect Your Data Source:-
Click through the site and interact with it just like any normal user would.
In this example you could search for Software Developer jobs in Australia using the site’s native search interface.
Query parameters generally consist of three things:-
- Start: - The beginning of the query parameters is denoted by a question mark (?).
- Information: - The pieces of information that make up one query parameter are encoded in key-value pairs, where related keys and values are joined together by an equals sign (key=value).
- Separator: - Every URL can have multiple query parameters, which are separated from each other by an ampersand (&).
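The three parts can be seen by taking a URL apart with the standard library. This is a small sketch using the Monster search URL from this tutorial; no request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlparse

url = "https://www.monster.com/jobs/search/?q=software-developer&where=Australia"

# Everything after the question mark is the query string.
query = urlparse(url).query
print(query)  # q=software-developer&where=Australia

# parse_qs splits on "&" and "=" and returns a dict of lists.
params = parse_qs(query)
print(params)  # {'q': ['software-developer'], 'where': ['Australia']}

# urlencode goes the other way and builds a query string from a dict.
rebuilt = urlencode({"q": "software-developer", "where": "Australia"})
print(rebuilt)  # q=software-developer&where=Australia
```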
Hidden Websites:-
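Some pages contain information that is hidden behind a login, so you need an account to see anything from the page.
Making an HTTP request from your Python script is different from accessing the page from your browser, so being logged in through the browser does not mean the script can log in too.
Some advanced techniques can be used with requests to access content behind a login.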
Dynamic Websites:-
Static websites are easy to work with because the server sends you an HTML page that already contains all the information as a response.
You can then parse the HTML response with Beautiful Soup and begin to pick out the relevant data.
With a dynamic website, the server might not send back any HTML at all; instead, you may receive JavaScript code as a response.
Parse HTML Code with Beautiful Soup:-
pip3 install beautifulsoup4
After that, import the library and create a Beautiful Soup object,
import requests
from bs4 import BeautifulSoup
URL = 'https://www.monster.com/jobs/search/?q=software-developer&where=Australia'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
Find the URL you want to scrape:-
For example, you might scrape the web to find speeches by famous politicians, then scrape the text of each speech and analyze it for how often they approach certain topics or use certain phrases.
Before you start scraping a site, check the rules of the website first.
The rules can be found in the robots.txt file, which you can reach by adding the /robots.txt path to the main domain of the site.
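Python’s standard library can read these rules for you. The sketch below parses a made-up robots.txt held in a string, so the rules and the example.com URLs are assumptions for illustration; for a real site you would point the parser at the site’s own robots.txt with set_url and read.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this illustration.
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch tells whether the given user agent may request the URL.
blocked = parser.can_fetch("*", "https://example.com/private/data")
allowed = parser.can_fetch("*", "https://example.com/jobs/search")
print(blocked)  # False
print(allowed)  # True
```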
Identify the structure of the site’s HTML:-
After finding a site to scrape, use Chrome’s developer tools to inspect the site’s HTML structure.
This is important because, more often than not, you’ll want to scrape data from certain HTML elements, or from elements with specific classes or IDs.
Using the inspect tool you can identify which elements you need to target.
Install Beautiful Soup and Requests:-
There are other packages and frameworks, like Scrapy, but Beautiful Soup is enough to parse the HTML.
Along with Beautiful Soup we need to install the Requests library, which will fetch the URL content.
The Beautiful Soup documentation has a lot of examples to help get you started as well.
$ pip install requests
$ pip install beautifulsoup4
Web Scraping Code:-
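The original listing is not available, so here is a minimal sketch of the idea. The HTML string is a made-up stand-in for a page that would normally be fetched with requests.

```python
from bs4 import BeautifulSoup

# Hypothetical page content; in practice this would be the body of a
# response fetched with requests.
html = """
<html><body>
  <h1>A Speech</h1>
  <p>First paragraph of the speech.</p>
  <p>Second paragraph of the speech.</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

results = soup.find_all("p")             # every <p> element, tags included
texts = [p.get_text() for p in results]  # only the text inside each <p>
print(results)
print(texts)
```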
Results:-
This finds all of the <p> elements in the HTML.
Using text selects only the text from inside each <p> element.
The raw result is messy, so filtering it through Beautiful Soup’s text gives a cleaner return.
There are other ways to search, filter, and isolate the results you want from the HTML.
You can also be more specific, finding elements with a specific class as,
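A sketch of such a class-based search on made-up HTML; only the class name cool_paragraph comes from this tutorial.

```python
from bs4 import BeautifulSoup

# Hypothetical page content with one class we care about.
html = """
<div class="cool_paragraph">Keep me</div>
<div class="other">Skip me</div>
<div class="cool_paragraph">Keep me too</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) filters on the HTML class attribute.
cool = soup.find_all("div", class_="cool_paragraph")
cool_texts = [div.get_text() for div in cool]
print(cool_texts)  # ['Keep me', 'Keep me too']
```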
This would find all the <div> elements with the class “cool_paragraph”.