Web Scraping - BeautifulSoup in Python
Published:
Data collection from public sources is often beneficial to a business or an individual. As such the term “web scraping” isn’t something new. These data are often wrangled within html tags and attributes. Python is often used for data collection from these sources. The intentions of this post is to host example code snippets so people can take ideas from it to build scrapers as per their needs using BeautifulSoup and urllib module in Python. I will be using github’s trending page https://github.com/trending throughout this post for the examples, especially because it best suits for applying various BeautifulSoup methods.
Installation:
pip install BeautifulSoup4
Get html of a page:
>>> import urllib
>>> resp = urllib.request.urlopen("https://github.com/trending")
>>> resp.getcode()
200
>>> resp.read() # the html
Using BeautifulSoup to get title from a page
>>> import urllib
>>> import bs4
>>> github_trending = urllib.request.urlopen("https://github.com/trending")
>>> trending_soup = bs4.BeautifulSoup(github_trending.read(), "lxml")
>>> trending_soup.title
<title>Trending repositories on GitHub today · GitHub</title>
>>> trending_soup.title.string
'Trending repositories on GitHub today · GitHub'
>>>
Find single element by tag name, find multiple elements by tag name
>>> ordered_list = trending_soup.find('ol') #single element
>>>
>>> type(ordered_list)
<class 'bs4.element.Tag'>
>>>
>>> all_li = ordered_list.find_all('li') # multiple elements
>>>
>>> type(all_li)
<class 'bs4.element.ResultSet'>
>>>
>>> trending_repositories = [each_list.find('h3').text for each_list in all_li]
>>> for each_repository in trending_repositories:
... print(each_repository.strip())
...
klauscfhq / taskbook
robinhood / faust
Avik-Jain / 100-Days-Of-ML-Code
jxnblk / mdx-deck
faressoft / terminalizer
trekhleb / javascript-algorithms
apexcharts / apexcharts.js
grain-lang / grain
thedaviddias / Front-End-Performance-Checklist
istio / istio
CyC2018 / Interview-Notebook
fivethirtyeight / russian-troll-tweets
boyerjohn / rapidstring
donnemartin / system-design-primer
awslabs / aws-cdk
QUANTAXIS / QUANTAXIS
crossoverJie / Java-Interview
GoogleChromeLabs / ndb
dylanbeattie / rockstar
vuejs / vue
sbussard / canvas-sketch
Microsoft / vscode
flutter / flutter
tensorflow / tensorflow
Snailclimb / Java-Guide
>>>
Getting Attributes of an element
>>> for each_list in all_li:
... anchor_element = each_list.find('a')
... print("https://github.com" + anchor_element['href'])
...
https://github.com/klauscfhq/taskbook
https://github.com/robinhood/faust
https://github.com/Avik-Jain/100-Days-Of-ML-Code
https://github.com/jxnblk/mdx-deck
https://github.com/faressoft/terminalizer
https://github.com/trekhleb/javascript-algorithms
https://github.com/apexcharts/apexcharts.js
https://github.com/grain-lang/grain
https://github.com/thedaviddias/Front-End-Performance-Checklist
https://github.com/istio/istio
https://github.com/CyC2018/Interview-Notebook
https://github.com/fivethirtyeight/russian-troll-tweets
https://github.com/boyerjohn/rapidstring
https://github.com/donnemartin/system-design-primer
https://github.com/awslabs/aws-cdk
https://github.com/QUANTAXIS/QUANTAXIS
https://github.com/crossoverJie/Java-Interview
https://github.com/GoogleChromeLabs/ndb
https://github.com/dylanbeattie/rockstar
https://github.com/vuejs/vue
https://github.com/sbussard/canvas-sketch
https://github.com/Microsoft/vscode
https://github.com/flutter/flutter
https://github.com/tensorflow/tensorflow
https://github.com/Snailclimb/Java-Guide
>>>
Using class name or other attributes to get element
>>> for each_list in all_li:
... total_stars_today = each_list.find(attrs={'class':'float-sm-right'}).text
... print(total_stars_today.strip())
...
1,063 stars today
846 stars today
596 stars today
484 stars today
459 stars today
429 stars today
443 stars today
366 stars today
330 stars today
282 stars today
182 stars today
190 stars today
200 stars today
190 stars today
166 stars today
164 stars today
144 stars today
158 stars today
157 stars today
144 stars today
144 stars today
142 stars today
132 stars today
101 stars today
108 stars today
>>>
Navigate childrens from an element
>>> for each_children in ordered_list.children:
... print(each_children.find('h3').text.strip())
...
klauscfhq / taskbook
robinhood / faust
Avik-Jain / 100-Days-Of-ML-Code
jxnblk / mdx-deck
faressoft / terminalizer
trekhleb / javascript-algorithms
apexcharts / apexcharts.js
grain-lang / grain
thedaviddias / Front-End-Performance-Checklist
istio / istio
CyC2018 / Interview-Notebook
fivethirtyeight / russian-troll-tweets
boyerjohn / rapidstring
donnemartin / system-design-primer
awslabs / aws-cdk
QUANTAXIS / QUANTAXIS
crossoverJie / Java-Interview
GoogleChromeLabs / ndb
dylanbeattie / rockstar
vuejs / vue
sbussard / canvas-sketch
Microsoft / vscode
flutter / flutter
tensorflow / tensorflow
Snailclimb / Java-Guide
>>>
The .children will only return the immediate childrens of the parent element. If you’d like to get all of the elements under certain element, you should use .descendent
Navigate descendents from an element
>>> for each_children in ordered_list.descendent:
... # perform operations
Navigating previous and next siblings of elements
>>> all_li = ordered_list.find_all('li')
>>> fifth_li = all_li[4]
>>> # each li element is separated by '\n' and hence to navigate to the fourth li, we should navigate previous sibling twice
...
>>>
>>> fourth_li = fifth_li.previous_sibling.previous_sibling
>>> fourth_li.find('h3').text.strip()
'jxnblk / mdx-deck'
>>>
>>> # similarly for navigating to the sixth li from fifth li, we would use next_sibling
...
>>> sixth_li = fifth_li.next_sibling.next_sibling
>>> sixth_li.find('h3').text.strip()
'trekhleb / javascript-algorithms'
>>>
Navigate to parent of an element
>>> all_li = ordered_list.find_all('li')
>>> first_li = all_li[0]
>>> li_parent = first_li.parent
>>> # the li_parent is the ordered list <ol>
...
>>>
Putting it all together(Github Trending Scraper)
>>> import urllib
>>> import bs4
>>>
>>> github_trending = urllib.request.urlopen("https://github.com/trending")
>>> trending_soup = bs4.BeautifulSoup(github_trending.read(), "lxml")
>>> ordered_list = trending_soup.find('ol')
>>> for each_list in ordered_list.find_all('li'):
... repository_name = each_list.find('h3').text.strip()
... repository_url = "https://github.com" + each_list.find('a')['href']
... total_stars_today = each_list.find(attrs={'class':'float-sm-right'}).text
… print(repository_name, repository_url, total_stars_today)
klauscfhq / taskbook https://github.com/klauscfhq/taskbook 1,404 stars today
robinhood / faust https://github.com/robinhood/faust 960 stars today
Avik-Jain / 100-Days-Of-ML-Code https://github.com/Avik-Jain/100-Days-Of-ML-Code 566 stars today
trekhleb / javascript-algorithms https://github.com/trekhleb/javascript-algorithms 431 stars today
jxnblk / mdx-deck https://github.com/jxnblk/mdx-deck 416 stars today
apexcharts / apexcharts.js https://github.com/apexcharts/apexcharts.js 411 stars today
faressoft / terminalizer https://github.com/faressoft/terminalizer 406 stars today
istio / istio https://github.com/istio/istio 309 stars today
thedaviddias / Front-End-Performance-Checklist https://github.com/thedaviddias/Front-End-Performance-Checklist 315 stars today
grain-lang / grain https://github.com/grain-lang/grain 301 stars today
boyerjohn / rapidstring https://github.com/boyerjohn/rapidstring 232 stars today
CyC2018 / Interview-Notebook https://github.com/CyC2018/Interview-Notebook 186 stars today
donnemartin / system-design-primer https://github.com/donnemartin/system-design-primer 189 stars today
awslabs / aws-cdk https://github.com/awslabs/aws-cdk 186 stars today
fivethirtyeight / russian-troll-tweets https://github.com/fivethirtyeight/russian-troll-tweets 159 stars today
GoogleChromeLabs / ndb https://github.com/GoogleChromeLabs/ndb 172 stars today
crossoverJie / Java-Interview https://github.com/crossoverJie/Java-Interview 148 stars today
vuejs / vue https://github.com/vuejs/vue 137 stars today
Microsoft / vscode https://github.com/Microsoft/vscode 137 stars today
flutter / flutter https://github.com/flutter/flutter 132 stars today
QUANTAXIS / QUANTAXIS https://github.com/QUANTAXIS/QUANTAXIS 132 stars today
dylanbeattie / rockstar https://github.com/dylanbeattie/rockstar 130 stars today
tensorflow / tensorflow https://github.com/tensorflow/tensorflow 106 stars today
Snailclimb / Java-Guide https://github.com/Snailclimb/Java-Guide 111 stars today
WeTransfer / WeScan https://github.com/WeTransfer/WeScan 118 stars today
