
Thursday, April 4, 2019

Webcrawling


Sometimes you just need to grab a bunch of text from a website quickly. This is where it's nice to have a few web-crawling routines handy for some extreme copy and paste. I once had to gather all the headlines from a website, and there was simply no time to copy and paste them all. So I imported BeautifulSoup and skimmed for the headers of each page.


  • Step 1) Import BeautifulSoup and the requests package. 
  • Step 2) Make a request to your target website and save the raw HTML document. 
  • Step 3) Have your BeautifulSoup object skim through the document for whatever it is you need to grab.


 Key Code to remember:  

from bs4 import BeautifulSoup 
import requests  # the requests package, for making HTTP requests
import re        # regular expressions, used to match tag names

html_doc = """ """ 

def extractHeaders(string): 
    soup = BeautifulSoup(string, 'html.parser') 
    for tag in soup.find_all(re.compile("h4")): 
        print(tag)

# A more general version: pass in whichever tag name you want
def extractTags(string, tag_name): 
    soup = BeautifulSoup(string, 'html.parser')
    for tag in soup.find_all(re.compile(tag_name)): 
        print(tag)


Key takeaway: Search by the html tag you want targeted, like 'h4' or 'title' or something.
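To make that takeaway concrete, here is a small self-contained sketch. The sample HTML snippet is made up for illustration; only BeautifulSoup's standard `find_all` and `get_text` calls are used.

```python
from bs4 import BeautifulSoup

# A tiny sample document (hypothetical content, just for demonstration)
html_doc = """
<html>
  <head><title>My Page</title></head>
  <body>
    <h4>First headline</h4>
    <h4>Second headline</h4>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Target the tag you care about ('h4' here, but 'title' works the same way)
headlines = [tag.get_text() for tag in soup.find_all('h4')]
print(headlines)  # ['First headline', 'Second headline']
print(soup.title.get_text())  # My Page
```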


I noticed that in this case, however, it's important to make your request <i>seem</i> like it comes from an actual web browser. The headers parameter of the request can handle this. To be certain, set your headers to something like the following.

headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'}

Then try running your request with those headers.

r = requests.get(target_url, headers=headers)  # target_url is your target site's URL
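Putting all the pieces together, a minimal end-to-end sketch might look like this. The function names and the placeholder URL are my own; the browser-style User-Agent string is the one shown above.

```python
import re
import requests
from bs4 import BeautifulSoup

# Browser-like header so the request doesn't get rejected as a bot
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; '
                         'rv:65.0) Gecko/20100101 Firefox/65.0'}

def extract_tags(html, tag_name):
    """Return the text of every tag whose name matches tag_name."""
    soup = BeautifulSoup(html, 'html.parser')
    return [t.get_text(strip=True) for t in soup.find_all(re.compile(tag_name))]

def crawl(url):
    """Fetch a page with browser-like headers and pull out its h4 headlines."""
    r = requests.get(url, headers=HEADERS)
    r.raise_for_status()  # fail loudly on a 403/404 instead of parsing an error page
    return extract_tags(r.text, 'h4')

# Usage (placeholder URL -- swap in your actual target):
# print(crawl('https://example.com'))
```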
