Web scraper tutorial

By Alex de Vries January 20, 2023 #beautifulsoup, #python, #requests, #web-scraper

This page describes how to set up a basic web scraper in Python. The core function can be used as is or adjusted for a variety of web scraping tasks. The feature described here is extracting the text from one or more web pages: all HTML tags, CSS and scripts are removed, and only the text itself is returned.
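To make the idea of "removing the tags and keeping the text" concrete before touching the network, here is a minimal offline sketch. The sample HTML and the helper name `html_to_text` are made up for this illustration; the BeautifulSoup calls are the same ones the scraper in this post relies on.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, invented for this illustration
SAMPLE_HTML = """
<html>
  <head>
    <title>Demo page</title>
    <style>p { color: red; }</style>
    <script>console.log("hidden");</script>
  </head>
  <body>
    <h1>Hello</h1>
    <p>Some <b>bold</b> text.</p>
  </body>
</html>
"""


def html_to_text(html):
    """Return the readable text of an HTML string (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # remove <style> and <script> elements together with their contents
    for tag in soup(["style", "script"]):
        tag.decompose()
    # stripped_strings yields each text fragment with surrounding whitespace removed
    return " ".join(soup.stripped_strings)


print(html_to_text(SAMPLE_HTML))
```

Running this prints `Demo page Hello Some bold text.`: the CSS rule and the JavaScript are gone, and the remaining text fragments are joined with single spaces.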

GitHub for all the code: LINK

# we only need two libraries, requests and beautifulsoup
import requests
from bs4 import BeautifulSoup as bs

def web_scraper(url, sub_urls=True):
    """Takes a url as input and returns the text found on that page as a single string.
    If sub_urls is True, the text of every linked sub-url is added as well.
    For instance, https://www.greatwebsite.com/about as input will get the text for that url, but also for
        - https://www.greatwebsite.com/about/subcategory1
        - https://www.greatwebsite.com/about/subcategory2
        - etc.; a sub-url has to start with the original input url.
    Run the function as follows: web_scraper('https://www.greatwebsite.com/about/').
    The url needs quotation marks and has to include the scheme (https://)."""

    def get_text(soup):
        # remove <style> and <script> elements, then join the remaining text fragments
        for data in soup(["style", "script"]):
            data.decompose()
        return " ".join(soup.stripped_strings)

    page = requests.get(url, timeout=10)
    soup = bs(page.content, "html.parser")
    # collect the links first, then extract the text of the input page itself
    links = soup.find_all("a")
    text_set = [get_text(soup)]
    url_filtered = [url]

    if sub_urls:
        # loop over all links, keep only unseen urls that start with the input url
        for link in links:
            sub_url = link.get("href")
            if sub_url is not None and sub_url.startswith(url) and sub_url not in url_filtered:
                url_filtered.append(sub_url)
                sub_page = requests.get(sub_url, timeout=10)
                text_set.append(get_text(bs(sub_page.content, "html.parser")))

    return " ".join(text_set)

web_scraper('https://en.wikipedia.org/wiki/Python_(programming_language)', False)

'Python (programming language) - Wikipedia Python (programming language) From Wikipedia, the free encyclopedia Jump to navigation Jump to search General-purpose programming language Python Paradigm Multi-paradigm : object-oriented , [1] procedural ( imperative ), functional , structured , reflective Designed\xa0by Guido van Rossum Developer Python Software Foundation First\xa0appeared 20\xa0February 1991 ; 31 years ago ( 1991-02-20 ) [2] Stable release 3.11.1 [3] / 6 December 2022 ; 23 days ago ( 6 December 2022 ) Preview release 3.12.0a3 [3] / 6 December 2022 ; 23 days ago ( 6 December 2022 ) Typing discipline Duck , dynamic , strong typing ; [4] gradual (since 3.5, but ignored in CPython ) [5] OS Windows , macOS , Linux/UNIX , Android [6] [7] and more [8] License Python Software Foundation License Filename extensions .py, .pyi, .pyc, .pyd, .pyw, .pyz (since 3.5), [9] .pyo (prior to 3.5) [10] Website python.org Major implementations CPython , PyPy , Stackless Python , MicroPython , CircuitPython , IronPython , Jython Dialects Cython , RPython'
... much more text as output
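One limitation worth knowing: the function keeps a link only when its `href` literally starts with the input url. Many sites use relative links (`subcategory1`, `/about/subcategory2`) or in-page anchors (`#top`), which that check either misses or would treat as separate pages. The sketch below shows one possible way to normalize links with the standard library before filtering. The helper name `filter_sub_urls` and the example hrefs are hypothetical and not part of the original code.

```python
from urllib.parse import urldefrag, urljoin


def filter_sub_urls(base_url, hrefs):
    """Return the unique absolute urls below base_url, in order of appearance.

    Illustrative helper: resolves relative links and drops #fragments.
    """
    seen = []
    for href in hrefs:
        if not href:
            # skip <a> tags without an href (None) or with an empty one
            continue
        # make the link absolute, then strip any "#fragment" part
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if (
            absolute.startswith(base_url)
            and absolute != base_url
            and absolute not in seen
        ):
            seen.append(absolute)
    return seen


# Hypothetical hrefs as they might come out of link.get("href")
example_hrefs = [
    "subcategory1",
    "/about/subcategory2",
    "https://www.greatwebsite.com/about/subcategory1",
    "#top",
    None,
    "/contact",
]
print(filter_sub_urls("https://www.greatwebsite.com/about/", example_hrefs))
```

This prints the two sub-urls `https://www.greatwebsite.com/about/subcategory1` and `https://www.greatwebsite.com/about/subcategory2`. The duplicate, the anchor, the missing href and the link outside `/about/` are all dropped. Note that `urljoin` treats a base url without a trailing slash differently (the last path segment is replaced), so this sketch assumes the input url ends with `/`.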
