
Thursday, April 4, 2019

Webcrawling


Sometimes you just need to grab a bunch of text from a website quickly. This is where it's nice to have a few web-crawling routines handy for some extreme copy and paste. I once had to gather all the headlines from a website, and there was simply no time to copy and paste them all. So I imported BeautifulSoup and skimmed for the headers of each page.


  • Step 1) Import BeautifulSoup and the requests package. 
  • Step 2) Make a request to your target website and save the raw HTML document. 
  • Step 3) Have your BeautifulSoup object skim through the document for whatever it is you need to grab.


 Key Code to remember:  

from bs4 import BeautifulSoup 
import requests  # the requests package, for making HTTP requests
import re        # regular expressions, used to match tag names

html_doc = """ """ 

def extractHeaders(string): 
    soup = BeautifulSoup(string, 'html.parser') 
    for tag in soup.find_all(re.compile("h4")): 
        print(tag)

# A more general version: pass in whichever tag name you want
def extractTags(string, tag_name): 
    soup = BeautifulSoup(string, 'html.parser')
    for tag in soup.find_all(re.compile(tag_name)): 
        print(tag)


Key takeaway: Search by the html tag you want targeted, like 'h4' or 'title' or something.
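To make that takeaway concrete, here is a small self-contained sketch. The sample HTML snippet is made up for illustration; only BeautifulSoup's standard `find_all` and `get_text` calls are used.

```python
from bs4 import BeautifulSoup

# A tiny sample document (hypothetical content, just for demonstration)
html_doc = """
<html>
  <head><title>My Page</title></head>
  <body>
    <h4>First headline</h4>
    <h4>Second headline</h4>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Target the tag you care about ('h4' here, but 'title' works the same way)
headlines = [tag.get_text() for tag in soup.find_all('h4')]
print(headlines)  # ['First headline', 'Second headline']
print(soup.title.get_text())  # My Page
```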


I noticed that in this case, however, it's important to make your request <i>seem</i> like it comes from an actual web browser. The headers parameter of the request can handle this. To be certain, set your headers to something like the following.

headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'}

Then try running your request with those headers.

r = requests.get(target_url, headers=headers)  # target_url is your target site's URL
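Putting all the pieces together, a minimal end-to-end sketch might look like this. The function names and the placeholder URL are my own; the browser-style User-Agent string is the one shown above.

```python
import re
import requests
from bs4 import BeautifulSoup

# Browser-like header so the request doesn't get rejected as a bot
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; '
                         'rv:65.0) Gecko/20100101 Firefox/65.0'}

def extract_tags(html, tag_name):
    """Return the text of every tag whose name matches tag_name."""
    soup = BeautifulSoup(html, 'html.parser')
    return [t.get_text(strip=True) for t in soup.find_all(re.compile(tag_name))]

def crawl(url):
    """Fetch a page with browser-like headers and pull out its h4 headlines."""
    r = requests.get(url, headers=HEADERS)
    r.raise_for_status()  # fail loudly on a 403/404 instead of parsing an error page
    return extract_tags(r.text, 'h4')

# Usage (placeholder URL -- swap in your actual target):
# print(crawl('https://example.com'))
```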
