Blocking unsafe resources in HTML email using BeautifulSoup

The BeautifulSoup documentation states that it saves programmers hours or days of work. This is an understatement.

30 August 2021 Updated 30 August 2021

https://unsplash.com/@francesphotos

I have created an IMAP E-Mail Reader using IMAPClient and Flask. The IMAP E-Mail Reader decodes the email into valid HTML. Then it needs to display this HTML through the browser. Works fine, so far so good.

In this post I describe how I implemented an option in my IMAP E-Mail Reader to block unsafe resources in the HTML. To do this, I use BeautifulSoup and Python's Regular expression operations.

Why block unsafe resources

External resources in HTML typically are files that are included in a web page. Examples are images, stylesheets, JavaScript libraries. The problem is that they connect you to remote systems. If you are privacy conscious you want to avoid this.

In my Firefox browser I use uBlock Origin, which is more than just an ad blocker. From the website:

'The uBlock Origin is a free and open-source, cross-platform browser extension for content filtering—primarily aimed at neutralizing privacy invasion in an efficient, user-friendly method.'

Most of the HTML emails we receive contain links to external resources, often images. By connecting to such an image you can be tracked. Many email programs try to block these external resources by default and offer an option to allow them. The result can be a strange-looking email; privacy has a price.

There can also be HTML emails with intentionally added code, such as JavaScript, that tries to hack your computer. We must remove this code.

Why BeautifulSoup

Here are some ways we can use to filter HTML email:

re: Python's regular expression operations
BeautifulSoup: a library to scrape information from web pages and modify content
Scrapy: a web scraping framework

Python's re is very low level. I use it often but here it does not seem the best choice. Scrapy is a framework and probably overkill. This leaves BeautifulSoup.

Performance is not really that important for my IMAP E-Mail Reader. I do not need to filter thousands of pages. Filtering only needs to be done when showing an email. In a high performance environment we can opt to store the filtered emails.

The BeautifulSoup parser

In the beginning I experienced some problems with the (default) 'html.parser'. It worked, but in a few image tags the image URL was not replaced. My mistake of course: too long, didn't read. BeautifulSoup recommends you use either the lxml parser or the html5lib parser.

As I wanted a pure Python solution I opted for the html5lib parser, which processes HTML the way a web browser does. This is extremely important. Without BeautifulSoup it would take ages to write code that can deal with (intentionally) bad HTML.
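
The difference between parsers is easy to see on a broken fragment. Below is a minimal sketch, not code from the reader itself: the helper name make_soup, the example URL and the fallback to 'html.parser' when the optional html5lib package is missing are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred='html5lib'):
    """Parse html with the preferred parser, else fall back to the built-in one."""
    try:
        return BeautifulSoup(html, preferred)
    except FeatureNotFound:
        # html5lib is an optional install; html.parser ships with Python
        return BeautifulSoup(html, 'html.parser')


broken = '<p>Unclosed paragraph<img src=https://example.invalid/a.png>'
for parser in ('html.parser', 'html5lib'):
    soup = make_soup(broken, preferred=parser)
    # html5lib builds a full document (<html><head></head><body>...),
    # html.parser only repairs the fragment it was given
    print(parser, '->', soup)
```

Printing both results side by side is a quick way to decide which parser behaves closest to what the browser will do with the same email.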

The output of BeautifulSoup

BeautifulSoup is a library for pulling data out of HTML. That is nice but in our use case we first remove elements and modify elements and then show the result in a browser.

BeautifulSoup has several output options but it always modifies at least a few things. Sometimes this can be good, like adding missing tags, but in other cases this may not be what we want. For now it is a matter of wait and see.
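
To get a feeling for what changes, compare the input with the serialized output. This is a small sketch that uses the built-in 'html.parser' so it runs without extra packages; the sample HTML and URL are made up.

```python
from bs4 import BeautifulSoup

original = '<p>Amount: &euro; 1,50</p><a href=https://example.invalid target=_top>link</a>'
soup = BeautifulSoup(original, 'html.parser')
serialized = str(soup)

# The entity is decoded to a Unicode character and attribute values get quoted,
# so the output is equivalent HTML but not the same string as the input
print(serialized)
print(serialized == original)
```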

How to block unsafe resources

The most important things we must do:

Completely remove unsafe resources
Replace unsafe resources
Fix bad resources

Completely remove unsafe resources

We must always remove all JavaScript from HTML email. We also want to remove other elements like links to stylesheets.
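
JavaScript can also hide in attributes, for example onclick handlers or javascript: links. The following is a hedged sketch of how these could be stripped as well. It is not part of the HTMLMailFixer class in this post, and the function name strip_inline_javascript is hypothetical.

```python
from bs4 import BeautifulSoup


def strip_inline_javascript(soup):
    """Remove on* event-handler attributes and javascript: links in place."""
    for elem in soup.find_all(True):
        # copy the attribute names first, because we delete while looping
        for attr_name in list(elem.attrs):
            if attr_name.lower().startswith('on'):
                del elem[attr_name]
    for a_elem in soup.find_all('a', href=True):
        if a_elem['href'].strip().lower().startswith('javascript:'):
            del a_elem['href']
    return soup


html = '<p onclick="track()">Hello</p><a href="javascript:alert(1)">click</a>'
print(strip_inline_javascript(BeautifulSoup(html, 'html.parser')))
```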

Replace unsafe resources

If we removed the images, the layout of the email could get messed up. To prevent this we replace images in the email by a local image, a transparent pixel.
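
If you do not want to serve a pixel file at all, a data URI is one possible alternative. This is only a sketch under that assumption: the post itself uses a local image URL, the constant names are made up, and the environment that displays the email must allow data: URIs for images.

```python
import base64
import struct

# Widely used 1x1 transparent GIF, base64 encoded
TRANSPARENT_GIF_B64 = 'R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7'
BLOCK_IMG_URL = 'data:image/gif;base64,' + TRANSPARENT_GIF_B64

# Sanity check: decode the header and read width and height (little-endian)
raw = base64.b64decode(TRANSPARENT_GIF_B64)
width, height = struct.unpack('<HH', raw[6:10])
print(raw[:6], width, height)
```

A value like BLOCK_IMG_URL can be passed wherever a pixel URL is expected, so no request leaves the browser for blocked images.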

Fix bad resources

Some links may not contain the attribute:

target="_blank"

We also want to add the attribute:

rel="noopener noreferrer"

The noreferrer value prevents passing referrer information to the target website, and noopener prevents the opened page from accessing the window that opened it.

But wait, there is also CSS

In the HTML email there can be CSS styles referring to external images and fonts. First I wanted to use the package cssutils but this is not very forgiving. For example:

background-image: url ('my_url')

generates an exception because there is a space between 'url' and '('. I also could not find another suitable package so I decided to use Python's regular expression operations.
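
A regular expression can be made tolerant of this kind of whitespace. The sketch below only demonstrates the matching; the sample strings are made up.

```python
import re

# ':' then optional whitespace, 'url', optional whitespace, '(' (case-insensitive)
URL_VALUE_START = re.compile(r':\s*url\s*\(', re.I)

samples = [
    "background-image: url ('my_url')",
    "background-image:URL(my_url)",
    "color: #ff0000",
]
for css in samples:
    # True for the first two samples, False for the plain color
    print(bool(URL_VALUE_START.search(css)), css)
```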

What I want is to replace CSS-code that contains 'url()' in the value part. In an HTML page we can have:

Inline CSS
CSS elements

To keep the code short, for inline CSS we remove the property completely, and for CSS elements we replace the URL inside the value by an empty one, url('').
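
For CSS elements, blanking the URLs could also be done with a single substitution. This is an alternative sketch, not the split-and-join approach used in the class in this post. The name blank_css_urls is hypothetical, and it assumes the URL itself contains no ')' character. It does not catch an @import with a plain quoted file name.

```python
import re

# 'url', optional whitespace, then everything up to the first ')'
URL_FUNCTION = re.compile(r'url\s*\([^)]*\)', re.I)


def blank_css_urls(css):
    """Replace every url(...) in a stylesheet text by an empty url('')."""
    return URL_FUNCTION.sub("url('')", css)


css = ".section-block{background-image: url ( ' https://whatever_image_1') !important}"
print(blank_css_urls(css))
```

Because it looks for url( anywhere in the text, this variant also covers shorthand properties such as background: #fff url(...) no-repeat.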

The HTMLMailFixer Class

To filter HTML emails I created the HTMLMailFixer class. The code is easy to understand.

# html_mail_fixer.py

from bs4 import BeautifulSoup
import re

class HTMLMailFixer:
    
    def __init__(
        self,
        parser='html5lib',
    ):
        self.__parser = parser
        self.forbidden = [
            'script', 'object', 'iframe',
        ]
        self.__allow_remote_resources = None
        self.__block_img_url = None
        self.__soup = None

    def __remove_forbidden_elems(self):
        # find_all() accepts a list of tag names
        for elem in self.__soup.find_all(self.forbidden):
            # remove the element, including its children, from the tree
            elem.extract()

    def __fix_a_elems(self):
        for a_elem in self.__soup.find_all('a'):
            # add/replace
            a_elem['target'] = '_blank'
            a_elem['rel'] = 'noopener noreferrer'

    def __remove_link_elems(self):
        for link_elem in self.__soup.find_all('link'):
            link_href = link_elem.get('href')
            if link_href is None:
                continue
            # remove
            link_elem.extract()

    def __fix_img_elems(self):
        for img_elem in self.__soup.find_all('img'):
            img_src = img_elem.get('src')
            if img_src is None:
                continue
            # replace
            img_elem['src'] = self.__block_img_url

    def __fix_style_elems(self):
        """
        objective: remove any property with a value starting with 'url('
        actual: replace value 'url(url)' by "url('')"
        """
        match_url_start_pattern = re.compile(r':\s*url\s*\(', re.I)
        for style_elem in self.__soup.find_all('style'):
            # 'contents': [
            #    '\n.section-block {\n    padding: 1em;\n    background-image: url(https://whatever_image_1);\n}\n#logo {\n    margin-top: 10em;\n    background-image: url(https://whatever_image_1);\n}\n'
            # ]
            new_contents = []
            for content in style_elem.contents:
                chunks = re.split(match_url_start_pattern, content)
                for i, chunk in enumerate(chunks):
                    if i == 0:
                        continue
                    # every chunk after the first starts inside 'url(':
                    # drop everything up to the first ')' only (count=1),
                    # otherwise values like rgb(...) further on are damaged
                    chunks[i] = re.sub(r'.*?\)', ')', chunk, count=1)
                # reconstruct
                new_content = ': url(\'\''.join(chunks)
                new_contents.append(new_content)

            # replace; assigning to .string also works for an empty <style>
            style_elem.string = '\n'.join(new_contents)

    def __fix_inline_style(self):
        """
        search elems with style attribute.
        remove any property name:value having a value starting with 'url('
        """
        match_url_start_pattern = re.compile(r'^url\s*\(', re.I)
        for elem in self.__soup.find_all(attrs={'style': True}):
            style_attr = elem['style']
            new_props = []
            # always split: a style without ';' yields a one-item list
            # (iterating over the bare string would loop over characters)
            props = style_attr.split(';')
            for prop in props:
                if ':' not in prop:
                    # malformed or empty, skip
                    continue
                name, value = prop.split(':', 1)
                value = value.strip()
                if re.match(match_url_start_pattern, value):
                    # found value starting with 'url(' so skip
                    continue
                new_props.append(name + ': ' + value)

            # replace
            elem['style'] = '; '.join(new_props)

    def fix_all(
        self,
        html=None,
        allow_remote_resources=False,
        block_img_url=None,
    ):
        self.__allow_remote_resources = allow_remote_resources
        self.__block_img_url = block_img_url

        # start soup
        self.__soup = BeautifulSoup(html, self.__parser)

        # remove, modify html elements
        self.__remove_forbidden_elems()
        self.__fix_a_elems()
        if not self.__allow_remote_resources:
            self.__remove_link_elems()
            if self.__block_img_url is not None:
                self.__fix_img_elems()
            self.__fix_inline_style()
            self.__fix_style_elems()
        # output
        output = str(self.__soup)
        return re.sub(r'\n\n+', '\n\n', output)

Usage:

from .html_mail_fixer import HTMLMailFixer

html = 'YOUR_HTML'
allow_remote_resources = False
block_img_url = 'YOUR_PIXEL_URL'

html_mail_fixer = HTMLMailFixer()

html_fixed = html_mail_fixer.fix_all(
    html=html,
    allow_remote_resources=allow_remote_resources,
    block_img_url=block_img_url,
)

Example

This is a simple example with some ugly HTML and CSS:

html = """<!DOCTYPE html PUBLIC "- / /w3c / /dtd html 4.01 transitional / /en">
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.0/dist/css/bootstrap.min.css" rel="stylesheet"><style>
.section-block{background-image: url ( ' https://whatever_image_1') !important;
font-size: 1.1em}
#logo {
   margin-top: 10em;background-image: url ( https://whatever_image_1 )}
.wino { color: #ff0000 }
</style>
</head><body style="font: inherit; font-size: 100%; margin:0; padding:0; background-image: url( ' https://whatever_image_2' )">

<a href="https://whatever_a_href_1"><img src="https://whatever_img_src_1" width="60"></a>
<p>Amount: &euro; 1,50</p>
<a href=https://whatever_a_href_2 target=_top><img src=https://whatever_img_src_2 width="60"></a>
<script> a = 'b' </script>
"""

After passing it to the HTMLMailFixer the result is:

html_fixed = <!DOCTYPE html PUBLIC "- / /w3c / /dtd html 4.01 transitional / /en">
<html><head><style>
.section-block{background-image: url('') !important;
font-size: 1.1em}
#logo {
   margin-top: 10em;background-image: url('')}
.wino { color: #ff0000 }
</style>
</head><body style="font: inherit;  font-size: 100%;  margin: 0;  padding: 0">

<a href="https://whatever_a_href_1" rel="noopener noreferrer" target="_blank"><img src="YOUR_PIXEL_URL" width="60"/></a>
<p>Amount: € 1,50</p>
<a href="https://whatever_a_href_2" rel="noopener noreferrer" target="_blank"><img src="YOUR_PIXEL_URL" width="60"/></a>

</body></html>

Note that BeautifulSoup made some changes, like adding missing tags, quoting attribute values and converting the &euro; entity to the € character.

How do we know all replacements were made?

Regarding HTML elements, we don't. If BeautifulSoup fails, we fail. Fortunately, the html5lib parser parses like a browser. Regarding the CSS properties, I did a quick and dirty replacement. I still need to look in more detail at how CSS can include external resources.
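
A second pass over the filtered HTML can at least flag what slipped through. The following is a rough sketch under a few assumptions: the function name find_remote_references is made up, it uses the built-in 'html.parser', and it only looks for the things filtered in this post, so an empty result is not a proof of safety.

```python
import re
from bs4 import BeautifulSoup

# url( followed by an absolute or protocol-relative address
REMOTE_URL = re.compile(r'url\s*\(\s*[\'"]?\s*(?:https?:)?//', re.I)


def find_remote_references(html, block_img_url):
    """Return a list of human-readable problems found in filtered HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    problems = []
    for elem in soup.find_all(['script', 'object', 'iframe', 'link']):
        problems.append('forbidden element: ' + elem.name)
    for img in soup.find_all('img', src=True):
        if img['src'] != block_img_url:
            problems.append('image not replaced: ' + img['src'])
    for style in soup.find_all('style'):
        css = ''.join(str(child) for child in style.contents)
        if REMOTE_URL.search(css):
            problems.append('remote url() in <style>')
    for elem in soup.find_all(style=True):
        if REMOTE_URL.search(elem['style']):
            problems.append('remote url() in style attribute of <' + elem.name + '>')
    return problems


clean = '<p style="margin: 0"><img src="/static/pixel.gif"/></p>'
print(find_remote_references(clean, '/static/pixel.gif'))
```

Running a check like this in a unit test against a collection of real emails is one way to find cases the filter misses.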

Summary

It was not difficult to use BeautifulSoup. It is very powerful and saved me a lot of time. It would be nice if there was something similar for parsing and modifying (bad) CSS. Anyway, the end result is that HTML emails are filtered and displayed in my browser with unsafe resources removed or blocked. With a button I can allow external resources.

Blocking external resources should be more flexible. For example, we always want to block Facebook, Google, but allow other resources.
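
A per-domain decision could look like the sketch below. It is only an illustration of the idea: the function name is_blocked and the two entries in the domain set are assumptions, and a real list would be longer.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist; subdomains of these names are blocked as well
BLOCKED_DOMAINS = {'facebook.com', 'google.com'}


def is_blocked(url, blocked_domains=BLOCKED_DOMAINS):
    """Return True when the host of url is a blocked domain or a subdomain of one."""
    host = (urlsplit(url).hostname or '').lower()
    return any(
        host == domain or host.endswith('.' + domain)
        for domain in blocked_domains
    )


for url in ('https://www.facebook.com/tr?id=1', 'https://example.invalid/logo.png'):
    print(is_blocked(url), url)
```

Matching on the full host name with a leading dot avoids blocking unrelated domains that merely end with the same letters.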

I do not pretend the solution presented here is perfect. It is just a start.

Links / credits

Beautiful Soup Documentation
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
