Sometimes you just need to grab a bunch of text from a website really quickly. This is where it's nice to have a few web-crawling snippets on hand for some extreme copy and paste. I once had to gather all the headlines from a website, and there was just no time to copy and paste them all by hand. So I imported BeautifulSoup and just skimmed for the header tags on each page.
- Step 1) Import BeautifulSoup and the requests package.
- Step 2) Make a request to your target website and save the raw HTML document.
- Step 3) Have your BeautifulSoup object skim through whatever it is you need to grab.
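The three steps above can be sketched roughly like this. To keep the parsing part runnable on its own, the sample HTML string below stands in for what you would save from a live request, and the headline text is made up:

```python
from bs4 import BeautifulSoup
# import requests  # steps 1-2 would use: html_doc = requests.get(url).text

# Stand-in for the raw HTML document you would save from the request
html_doc = """
<html><body>
  <h4>First headline</h4>
  <p>Some article text.</p>
  <h4>Second headline</h4>
</body></html>
"""

# Step 3: have the BeautifulSoup object skim for what you need
soup = BeautifulSoup(html_doc, 'html.parser')
headlines = [tag.get_text() for tag in soup.find_all('h4')]
print(headlines)
```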
Key Code to remember:
from bs4 import BeautifulSoup
import requests  # the requests package, for making HTTP requests
import re

html_doc = """ """

def extractHeaders(string, tag):
    # Parse the raw HTML and print every element whose name matches the tag
    soup = BeautifulSoup(string, 'html.parser')
    for match in soup.find_all(re.compile(tag)):
        print(match)
Key takeaway: search by the HTML tag you want targeted, like 'h4' or 'title'.
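For instance, the same find_all pattern works for any tag name. A small self-contained sketch (the sample HTML here is made up):

```python
from bs4 import BeautifulSoup

html_doc = "<html><head><title>My Site</title></head><body><h4>Top story</h4></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Search by whichever tag you want targeted
title_text = soup.find_all('title')[0].get_text()
h4_text = soup.find_all('h4')[0].get_text()
print(title_text, h4_text)
```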
I noticed that in this case, however, it's important to make your request <i>seem</i> like it comes from an actual web browser. The headers parameter of the request can handle this. To be safe, set your headers to something like the following.
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'}
Then try running your request with that dictionary as your headers:

r = requests.get('target_url', headers=headers)
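One way to check what will actually be sent, without hitting the network, is to build and prepare the request and inspect its headers (Request and prepare are part of the requests API; the URL below is just a placeholder):

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'}

# Build and prepare the request without sending it, to inspect the outgoing headers
req = requests.Request('GET', 'https://example.com/', headers=headers).prepare()
print(req.headers['User-Agent'])

# Actually sending it would be:
#   r = requests.get('https://example.com/', headers=headers)
```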