Scraping a website and scheduling emails with Beautiful Soup, Twilio SendGrid and Heroku

I recently published a blog post on the Twilio blog about building a microlearning application with Python and Twilio SendGrid.

The application scrapes data from the Python Language Reference and sends you a chapter a day via email.

I noticed that this approach wasn’t working well for me because some chapters are really long, which makes them difficult to read in a few minutes. I decided to break those chapters into subsections and send one subsection per day.

This version of the application scrapes the website, divides each section into subsections, saves each subsection in a PostgreSQL database and sends one email a day.

This post will not cover creating a GitHub account or a Twilio account, nor go into detail about deploying to Heroku; those steps are clearly explained in the blog post on Twilio.

This version uses mostly the same files and processes as the earlier version of the app, except for the changes described in this blog post.

Tutorial Requirements

To follow along with this tutorial, you will need Python 3, a Twilio SendGrid account, PostgreSQL installed on your machine, and GitHub and Heroku accounts.

Setting up the scraper

To build the application, we’ll need to create four files: `scraper.py`, `mailer.py`, `db.py` and `.env`. Add the following code to your `scraper.py`.

```scraper.py
import re

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://docs.python.org/3/reference/'

def main(idx):
    """
    Scrape the base url and extract links

    :param idx: index of the link to be parsed
    :return: parsed content
    """
    content = _scraper(BASE_URL)
    links = content.findAll(href=_has_no_hashtag)

    # We want to get only the 10 chapters of the reference
    start = 17
    end = -14
    links = links[start:end]

    result = _handle_link(links[idx])
    return result

def _has_no_hashtag(href):
    """
    Remove sub chapters

    :param href: tags with the `href` attribute
    :return: True if `href` exists and does not contain a hashtag,
             else False
    """
    return href and not re.compile("#").search(href)

def _handle_link(link):
    """
    Scrape and parse a link

    :param link: link to be parsed
    :return: tuple of (header, url, split_body)
    """
    url = f"{BASE_URL}{link.get('href')}"
    content = _scraper(url)

    permalink = content.findChildren("a", {"class": "headerlink"})

    _decompose(permalink)

    header = content.find('h1').text.lstrip('0123456789. ')
    body = content.find("div", {"class": "section"})

    # Find sub-sections in each page
    sections = body.findChildren("div", {"class": "section"})

    # Split each page into its major sub-headings.
    split_body = [item for item in sections if len(item.findAll("h2")) != 0]

    return header, url, split_body

def _scraper(url):
    """
    Scraper function

    :param url: url to be scraped
    :return: BeautifulSoup object
    """
    response = requests.get(url)
    assert response.status_code == 200, "url could not be reached"

    soup = BeautifulSoup(response.content, "html.parser")

    return soup

def _decompose(args):
    """
    Remove tags and their content

    :param args: list of items to be decomposed
    """
    for item in args:
        item.decompose()
```

This file does the following things:

  • Scrapes the Python Language Reference by searching for tags with the `href` attribute that don’t contain the character `#` (ignoring internal links that jump to a section of a page).
  • Gets the URLs for the 10 chapters of the reference.
  • Scrapes each chapter, removes unnecessary data and splits the body of each chapter into subsections.
  • Returns a tuple of the header, URL and split body.
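
As a quick sanity check, the `main` function can be called directly once scraper.py is in place; index 0 corresponds to the first chapter:

```python
from scraper import main

# Fetch and split the first chapter of the reference
header, url, split_body = main(0)
print(header)           # chapter title, with leading numbering stripped
print(url)              # full chapter URL
print(len(split_body))  # number of <h2> subsections found
```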

The `_handle_link` function works differently than it did in the earlier version. The body of the HTML can be extracted either from the first div with class name “section” or from the div with class name “document”. I went with the div with class name “section” because it has less unwanted data embedded in it. The body of the page is further broken up into subsections based on `<h2>` tags; this way, pages are split into [1.1, 1.2, 1.3…].
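
To make the splitting concrete, here is a minimal sketch on toy HTML (not the actual page markup) of how the `<h2>`-based split works:

```python
from bs4 import BeautifulSoup

# Toy markup mimicking the reference's nested "section" divs
html = """
<div class="section"><h1>1. Introduction</h1>
  <div class="section"><h2>1.1. Alternate Implementations</h2><p>...</p></div>
  <div class="section"><h2>1.2. Notation</h2><p>...</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
body = soup.find("div", {"class": "section"})

# Keep only the child sections that carry an <h2> sub-heading
sections = body.findChildren("div", {"class": "section"})
split_body = [s for s in sections if len(s.findAll("h2")) != 0]

print([s.find("h2").text for s in split_body])
# ['1.1. Alternate Implementations', '1.2. Notation']
```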

Installing and Setting up our Database

You will need to have Postgres installed on your system to move forward with the tutorial. Go to the Postgres website and install the version that works with your system.

After installing Postgres, we’ll need to create our database. In your terminal, run the following commands:
```bash
(py-microlearning-app) $ psql
postgres=# CREATE DATABASE py_microlearning_app;
```
We’ll need to install psycopg2 to connect with our Postgres database.
```bash
(py-microlearning-app) $ pip install psycopg2
```
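
Before wiring psycopg2 into the app, you can optionally verify that the connection works. A minimal sketch, assuming the database created above and hypothetical local credentials:

```python
import psycopg2

# Hypothetical local credentials; replace with your own
con = psycopg2.connect(
    dbname="py_microlearning_app",
    user="user",
    host="localhost",
    password="password",
)
print(con.status)  # 1 (STATUS_READY) means the connection succeeded
con.close()
```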

Add the following code to your db.py file.
```db.py
import hashlib
import os

import psycopg2

from dotenv import load_dotenv
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT

load_dotenv()

USER = os.getenv('DB_USER')
NAME = os.getenv('DB_NAME')
HOST = os.getenv('DB_HOST')
PASSWORD = os.getenv('DB_PASSWORD')

CREDENTIALS = f"dbname={NAME} user={USER} host={HOST} password={PASSWORD}"

def save_to_db(data):
    """
    Save items to the database

    :param data: tuple of (header, url, body)
    """
    with psycopg2.connect(CREDENTIALS) as con:
        con.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
        cursor = con.cursor()

        header, url, body = data
        for item in body:
            cursor.execute(
                "INSERT INTO email (header, url, body, hashed_body) \
                    VALUES (%s, %s, %s, %s);",
                (header, url, str(item), _hash_text(str(item)))
            )

def query_db(pk):
    """
    Query the database

    :param pk: primary key of the item to be queried for
    :return: tuple of (header, url, body)
    """
    with psycopg2.connect(CREDENTIALS) as con:
        cursor = con.cursor()
        cursor.execute("SELECT header, url, body FROM email WHERE id = %s;", (pk,))

        return cursor.fetchone()

def _create_table():
    """ Create table """
    with psycopg2.connect(CREDENTIALS) as con:
        con.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
        cursor = con.cursor()
        cursor.execute(
            "CREATE TABLE email (id serial PRIMARY KEY, header varchar, \
                url varchar, body text, hashed_body varchar unique);"
        )

def _hash_text(text):
    return hashlib.sha256(text.encode()).hexdigest()
```
This file sets up the database connection with Postgres using psycopg2:

  • The `_hash_text` function will be used to create a SHA-256 digest of our HTML data.
  • The `_create_table` function will be run once and used to create our email table. The table will have five columns: id, header, url, body and hashed_body. The unique constraint on the hashed_body column ensures each HTML body is saved only once. We can also create the table directly in our terminal using psql, as shown below.
  • The `save_to_db` function establishes a connection with the database and saves data to the table.
  • The `query_db` function establishes a connection with the database, gets the data at a particular primary key and returns it.
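
Here is what creating the table directly in psql could look like, assuming the `py_microlearning_app` database created earlier:

```bash
(py-microlearning-app) $ psql py_microlearning_app
py_microlearning_app=# CREATE TABLE email (id serial PRIMARY KEY, header varchar, url varchar, body text, hashed_body varchar unique);
```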

Setting up the Mailing Service

Add the following code to your mailer.py file:
```mailer.py
import datetime
import logging
import os

from dotenv import load_dotenv
from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import Mail

from db import query_db
from db import save_to_db

from scraper import main as _scrape_link

load_dotenv()
logger = logging.getLogger(__name__)

SENDER_EMAIL = os.getenv('SENDER_EMAIL')
RECEIVER_EMAIL = os.getenv('RECEIVER_EMAIL')
SENDGRID_API_KEY = os.getenv('SENDGRID_API_KEY')

START_DATE = datetime.date(2020, 5, 4)

def send_email(content):
    """
    Set up SendGrid and send an email to the receiver

    :param content: tuple of (header, url, body)
    """
    header, url, body = content

    message_body = f"""\
     Hi there!

     <p>
         Here's your Python Language Reference chapter for the day! \
             You can also check this out in \
             <a href="{url}">The Python Library Reference documentation</a>
     </p>

     {body}
     """

    message = Mail(
        from_email=SENDER_EMAIL,
        to_emails=RECEIVER_EMAIL,
        subject=header,
        html_content=message_body
    )

    try:
        sendgrid = SendGridAPIClient(SENDGRID_API_KEY)
        response = sendgrid.send(message)
        assert response.status_code in (200, 202), \
            "Message was not sent successfully"
    except Exception as e:
        raise e

if __name__ == "__main__":
    today = datetime.datetime.today().date()
    idx = (today - START_DATE).days
    try:
        content = _scrape_link(idx)
    except IndexError:
        logger.warning('No link at index: %d', idx)
    else:
        save_to_db(content)

    # List indices start at 0, but incremental database primary keys start at 1
    to_send = query_db(idx + 1)
    if to_send:
        send_email(to_send)
```
This file does the following:

  1. Imports the packages we need, including SendGrid’s `Mail` helper and `SendGridAPIClient`.
  2. Imports the `main` function from scraper.py as `_scrape_link`.
  3. Loads the environment variables from the `.env` file.
  4. In `send_email`, instantiates SendGrid’s mail helper with the subject and body of the email.
  5. Asserts that the email was sent successfully.

When this file is run, it calls the `main` function from scraper.py, saves the scraped data in the database, queries the database for the day’s data (the primary key is incremental, so each day you get the next row in the database) and then sends an email with the data retrieved. Since the language reference has ten chapters, the scraper will be called for ten days; each day, the split body is looped over and each subsection is saved as an individual row in the database. After the ten chapters have been scraped and saved, any further call to the scraper will log a warning and not run. The mailer, however, will keep sending emails until we run through all the rows in the database.
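
To make the date arithmetic concrete, here is a small worked example of how the daily index and the database primary key line up, using the `START_DATE` defined above:

```python
import datetime

START_DATE = datetime.date(2020, 5, 4)

# Two days after the start date
today = datetime.date(2020, 5, 6)
idx = (today - START_DATE).days

print(idx)      # 2 -> index of the chapter link scraped that day
print(idx + 1)  # 3 -> primary key of the day's row (subsection) to email
```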

Create a `.env` file and add the following to it:

```.env
SENDER_EMAIL=sender@gmail.com # Input your actual sender email address
RECEIVER_EMAIL=receiver@gmail.com # Input your actual receiver email address
SENDGRID_API_KEY='*******' # Input your actual API key
DB_PASSWORD='password' # Input your actual DB password
DB_USER='user' # Input your actual DB user
DB_NAME='db_name' # Input your actual DB name
DB_HOST='localhost' # Input your actual DB host
```

At this point, you can test your app to see that it works correctly. Run the following command in your terminal:
```bash
(py-microlearning-app) $ python mailer.py
```

You should see your database being populated with data from the first chapter of the language reference, and you should also get an email with the first subsection of that chapter.
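
To double-check the database side, a quick query in psql (assuming the database name used earlier) should show the new rows:

```bash
(py-microlearning-app) $ psql py_microlearning_app
py_microlearning_app=# SELECT id, header FROM email LIMIT 5;
```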

Setting up Deployment

To get our app running on Heroku, we need to add two add-ons: one for scheduling our emails and the other for Postgres. This blog post has a detailed explanation of how to sign up on Heroku, set up Heroku Scheduler and deploy your application. The only thing we’ll add to that is the Heroku Postgres add-on.
On your dashboard, click on “Resources”, then search for “Heroku Postgres” in the “Add-ons” search bar and click on it. After it is added to your application, your DATABASE_URL will automatically be added to your config variables. You can connect to the database using the DATABASE_URL and then create your table manually using the `_create_table` function in your db.py file.

Heroku Postgres connects to the database using the DATABASE_URL, which is a combination of the database credentials (postgres://`user`:`password`@`host`:`port`/`name`), so we will need to update our db.py file to use the DATABASE_URL. Your db.py should now look like this:

```db.py
import hashlib
import os

import psycopg2

from dotenv import load_dotenv
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT

load_dotenv()

DATABASE_URL = os.getenv('DATABASE_URL')

def save_to_db(data):
    """
    Save items to the database

    :param data: tuple of (header, url, body)
    """
    with psycopg2.connect(DATABASE_URL) as con:
        con.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
        cursor = con.cursor()

        header, url, body = data

        for item in body:
            cursor.execute(
                "INSERT INTO email (header, url, body, hashed_body) \
                    VALUES (%s, %s, %s, %s);",
                (header, url, str(item), _hash_text(str(item)))
            )

def query_db(pk):
    """
    Query the database

    :param pk: primary key of the item to be queried for
    :return: tuple of (header, url, body)
    """
    with psycopg2.connect(DATABASE_URL) as con:
        cursor = con.cursor()
        cursor.execute("SELECT header, url, body FROM email WHERE id = %s;", (pk,))
        return cursor.fetchone()

def _create_table():
    """ Create table """
    with psycopg2.connect(DATABASE_URL) as con:
        con.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
        cursor = con.cursor()
        cursor.execute(
            "CREATE TABLE email (id serial PRIMARY KEY, header varchar, \
                url varchar, body text, hashed_body varchar unique);"
        )

def _hash_text(text):
    return hashlib.sha256(text.encode()).hexdigest()
```
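
With the updated db.py deployed, one way to run the one-off table creation mentioned above is via a one-off dyno. This is a sketch, assuming the Heroku CLI is installed and you are logged in to your app:

```bash
$ heroku run python -c "from db import _create_table; _create_table()"
```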
The .env file we created during development was only for local testing. We’ll need to add those environment variables to Heroku for our deployed application to work.
On your Heroku dashboard, click on “Settings” and add your environment variables under “Config Vars”. Since DATABASE_URL was automatically added by Heroku, the environment variables you need to add are your SENDER_EMAIL, RECEIVER_EMAIL and SENDGRID_API_KEY.

You can now add a job to Heroku Scheduler to send you emails at any time of the day.

That’s it, we’re done! Thanks for reading. You will find this version of the application on GitHub.
