Blocking unsafe resources in HTML email using BeautifulSoup
The BeautifulSoup documentation states that it saves programmers hours or days of work. This is an understatement.
I have created an IMAP E-Mail Reader using IMAPClient and Flask. The IMAP E-Mail Reader decodes the email into valid HTML. Then it needs to display this HTML through the browser. Works fine, so far so good.
In this post I describe how I implemented an option in my IMAP E-Mail Reader to block unsafe resources in the HTML. To do this, I use BeautifulSoup and Python's Regular expression operations.
Why block unsafe resources
External resources in HTML typically are files that are included in a web page. Examples are images, stylesheets, JavaScript libraries. The problem is that they connect you to remote systems. If you are privacy conscious you want to avoid this.
In my Firefox browser I use uBlock Origin, which is not just an ad blocker. From the website:
'The uBlock Origin is a free and open-source, cross-platform browser extension for content filtering—primarily aimed at neutralizing privacy invasion in an efficient, user-friendly method.'
Most of the HTML emails we receive contain links to external resources, often images. By connecting to such an image you can be tracked. Many email programs try to block these external resources by default and offer an option to allow them. The result can be a strange looking email; privacy has a price.
There can also be HTML emails with code, JavaScript, added intentionally to hack your computer. We must remove this code.
Why BeautifulSoup
Here are some ways we can use to filter HTML email:
- re: Python's regular expression operations
- BeautifulSoup: a library to scrape information from web pages and modify content
- Scrapy: a web scraping framework
Python's re is very low level. I use it often but here it does not seem the best choice. Scrapy is a framework and probably overkill. This leaves BeautifulSoup.
Performance is not really that important for my IMAP E-Mail Reader. I do not need to filter thousands of pages. Filtering only needs to be done when showing an email. In a high performance environment we can opt to store the filtered emails.
The BeautifulSoup parser
In the beginning I had some problems with the (default) 'html.parser'. It worked, but in a few image tags the image URL was not replaced. Of course this was my mistake: I had not read the documentation carefully. BeautifulSoup recommends you use either the lxml parser or the html5lib parser.
As I wanted a pure Python solution I opted for the html5lib parser, which processes HTML the way a web browser does. This is extremely important. Without BeautifulSoup it would take ages to write code that can deal with (intentionally) bad HTML.
The output of BeautifulSoup
BeautifulSoup is a library for pulling data out of HTML. That is nice but in our use case we first remove elements and modify elements and then show the result in a browser.
BeautifulSoup has several output options but it always modifies at least a few things. Sometimes this is good, like adding missing tags, but in other cases it may not be what we want. We will have to wait and see how this works out.
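A small sketch of what "modifies a few things" means in practice. It assumes the built-in 'html.parser' (not the html5lib parser used later in this post) so that only bs4 itself is needed, and the input string is made up for illustration:

```python
from bs4 import BeautifulSoup

# Illustrative input: unquoted attribute values and missing closing tags.
raw = '<p>Hello <a href=https://example.com target=_top>link'

soup = BeautifulSoup(raw, 'html.parser')

# str() serializes the parsed tree, it does not return the original text.
normalized = str(soup)
print(normalized)
```

The serialized result has quoted attribute values and closed tags, so it is never byte-for-byte the email we received.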
How to block unsafe resources
The most important things we must do:
- Completely remove unsafe resources
- Replace unsafe resources
- Fix bad resources
Completely remove unsafe resources
We must always remove all JavaScript from HTML email. We also want to remove other elements like links to stylesheets.
Replace unsafe resources
If we removed the images, the layout of the email could get messed up. To prevent this we replace the images in the email by a local image, a transparent pixel.
Fix bad resources
Some links may not contain the attribute:
target="_blank"
We also want to add the attribute:
rel="noopener noreferrer"
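Here 'noopener' prevents the opened page from reaching back to our page through window.opener, and 'noreferrer' prevents passing referrer information to the target website.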
But wait, there is also CSS
In the HTML email there can be CSS styles referring to external images and fonts. First I wanted to use the package cssutils but this is not very forgiving. For example:
background-image: url ('my_url')
generates an exception because there is a space between 'url' and '('. I also could not find another suitable package so I decided to use Python's regular expression operations.
What I want is to replace CSS-code that contains 'url()' in the value part. In an HTML page we can have:
- Inline CSS
- CSS in style elements
To reduce the code, for inline CSS we remove the property completely, and for style elements we replace the value part by an empty URL: url('').
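To illustrate the forgiving matching, here is a minimal sketch using only Python's re module. It is a more compact variant than the split-and-join code in the class below, and the names URL_FUNC and blank_css_urls are made up for this example. It assumes the URL itself contains no ')' character:

```python
import re

# 'url', optional whitespace, then everything up to the first ')'.
# The optional whitespace is what a strict CSS parser rejects.
URL_FUNC = re.compile(r"url\s*\([^)]*\)", re.I)


def blank_css_urls(css):
    """Replace every url(...) in a piece of CSS by an empty url('')."""
    return URL_FUNC.sub("url('')", css)


css = "a{background-image: url ( ' https://whatever_image_1') !important}"
print(blank_css_urls(css))
```

Because the pattern does not care where url(...) appears, it also covers shorthand properties and @import rules, at the price of being rather blunt.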
The HTMLMailFixer Class
To filter HTML emails I created the HTMLMailFixer class. The code is easy to understand.
# html_mail_fixer.py
import re

from bs4 import BeautifulSoup


class HTMLMailFixer:

    def __init__(
        self,
        parser='html5lib',
    ):
        self.__parser = parser
        self.forbidden = [
            'script', 'object', 'iframe',
        ]
        self.__allow_remote_resources = None
        self.__block_img_url = None
        self.__soup = None

    def __remove_forbidden_elems(self):
        # find_all() returns a list, so removing while looping is safe
        for elem in self.__soup.find_all(self.forbidden):
            # remove
            elem.extract()

    def __fix_a_elems(self):
        for a_elem in self.__soup.find_all('a'):
            # add/replace
            a_elem['target'] = '_blank'
            a_elem['rel'] = 'noopener noreferrer'

    def __remove_link_elems(self):
        for link_elem in self.__soup.find_all('link'):
            link_href = link_elem.get('href')
            if link_href is None:
                continue
            # remove
            link_elem.extract()

    def __fix_img_elems(self):
        for img_elem in self.__soup.find_all('img'):
            img_src = img_elem.get('src')
            if img_src is None:
                continue
            # replace
            img_elem['src'] = self.__block_img_url

    def __fix_style_elems(self):
        """
        objective: remove any property with a value starting with 'url('
        actual: replace value 'url(url)' by "url('')"
        """
        match_url_start_pattern = re.compile(r':\s*url\s*\(', re.I)
        # everything up to and including the first ')', also across newlines
        match_url_rest_pattern = re.compile(r'.*?\)', re.S)
        for style_elem in self.__soup.find_all('style'):
            # example of style_elem.string:
            # '\n.section-block {\n padding: 1em;\n background-image: url(https://whatever_image_1);\n}\n'
            if style_elem.string is None:
                # empty <style> element, nothing to replace
                continue
            chunks = match_url_start_pattern.split(style_elem.string)
            for i in range(1, len(chunks)):
                # every chunk after the first starts with the url argument;
                # count=1 leaves later parentheses such as rgb(...) alone
                chunks[i] = match_url_rest_pattern.sub(')', chunks[i], count=1)
            # reconstruct and replace
            style_elem.string.replace_with(": url(''".join(chunks))

    def __fix_inline_style(self):
        """
        search elems with style attribute.
        remove any property name:value having a value starting with 'url('
        """
        match_url_start_pattern = re.compile(r'^url\s*\(', re.I)
        for elem in self.__soup.find_all(attrs={'style': True}):
            new_props = []
            # always split: looping over the unsplit string would yield single characters
            for prop in elem['style'].split(';'):
                if ':' not in prop:
                    # empty or malformed, skip
                    continue
                name, value = prop.split(':', 1)
                value = value.strip()
                if match_url_start_pattern.match(value):
                    # found value starting with 'url(' so skip
                    continue
                new_props.append(name.strip() + ': ' + value)
            # replace
            elem['style'] = '; '.join(new_props)

    def fix_all(
        self,
        html=None,
        allow_remote_resources=False,
        block_img_url=None,
    ):
        self.__allow_remote_resources = allow_remote_resources
        self.__block_img_url = block_img_url
        # start soup
        self.__soup = BeautifulSoup(html, self.__parser)
        # remove, modify html elements
        self.__remove_forbidden_elems()
        self.__fix_a_elems()
        if not self.__allow_remote_resources:
            self.__remove_link_elems()
            if self.__block_img_url is not None:
                self.__fix_img_elems()
            self.__fix_inline_style()
            self.__fix_style_elems()
        # output, collapsing runs of empty lines
        output = str(self.__soup)
        return re.sub(r'\n\n+', '\n\n', output)
Usage:
from .html_mail_fixer import HTMLMailFixer

html = 'YOUR_HTML'
allow_remote_resources = False
block_img_url = 'YOUR_PIXEL_URL'

html_mail_fixer = HTMLMailFixer()
html_fixed = html_mail_fixer.fix_all(
    html=html,
    allow_remote_resources=allow_remote_resources,
    block_img_url=block_img_url,
)
Example
This is a simple example with some ugly HTML and CSS:
html = """<!DOCTYPE html PUBLIC "- / /w3c / /dtd html 4.01 transitional / /en">
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.0/dist/css/bootstrap.min.css" rel="stylesheet"><style>
.section-block{background-image: url ( ' https://whatever_image_1') !important;
font-size: 1.1em}
#logo {
margin-top: 10em;background-image: url ( https://whatever_image_1 )}
.wino { color: #ff0000 }
</style>
</head><body style="font: inherit; font-size: 100%; margin:0; padding:0; background-image: url( ' https://whatever_image_2' )">
<a href="https://whatever_a_href_1"><img src="https://whatever_img_src_1" width="60"></a>
<p>Amount: € 1,50</p>
<a href=https://whatever_a_href_2 target=_top><img src=https://whatever_img_src_2 width="60"></a>
<script> a = 'b' </script>
"""
After passing it to the HTMLMailFixer the result is:
html_fixed = <!DOCTYPE html PUBLIC "- / /w3c / /dtd html 4.01 transitional / /en">
<html><head><style>
.section-block{background-image: url('') !important;
font-size: 1.1em}
#logo {
margin-top: 10em;background-image: url('')}
.wino { color: #ff0000 }
</style>
</head><body style="font: inherit; font-size: 100%; margin: 0; padding: 0">
<a href="https://whatever_a_href_1" rel="noopener noreferrer" target="_blank"><img src="YOUR_PIXEL_URL" width="60"/></a>
<p>Amount: € 1,50</p>
<a href="https://whatever_a_href_2" rel="noopener noreferrer" target="_blank"><img src="YOUR_PIXEL_URL" width="60"/></a>
</body></html>
Note that BeautifulSoup made some changes like adding missing tags and converting to UTF-8.
How do we know all replacements were made?
Regarding the HTML elements, we don't. If BeautifulSoup fails, we fail. Fortunately, the html5lib parser parses like a browser. Regarding the CSS properties, I did a quick and dirty replacement. I still need to look in more detail at how CSS can include external resources.
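One way to gain some confidence is to scan the filtered output for references that a browser would still fetch on its own. The following is a rough sketch using only the standard library. RemoteRefFinder and find_remote_refs are hypothetical names, not part of HTMLMailFixer, and the check is a sanity test, not a guarantee:

```python
import re
from html.parser import HTMLParser

# First argument of url(...), with optional whitespace and quotes.
CSS_URL = re.compile(r"url\s*\(\s*['\"]?\s*([^'\")\s]+)", re.I)
# http://, https:// or protocol-relative //
REMOTE = re.compile(r'^(https?:)?//', re.I)


class RemoteRefFinder(HTMLParser):
    """Collect remote URLs that would be loaded without user interaction."""

    def __init__(self):
        super().__init__()
        self.found = []
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'style':
            self._in_style = True
        # elements that load their resource automatically
        if tag in ('img', 'script', 'iframe', 'link'):
            for name in ('src', 'href'):
                self._check(attrs.get(name))
        # inline CSS on any element
        self._check_css(attrs.get('style'))

    def handle_endtag(self, tag):
        if tag == 'style':
            self._in_style = False

    def handle_data(self, data):
        if self._in_style:
            self._check_css(data)

    def _check(self, url):
        if url and REMOTE.match(url.strip()):
            self.found.append(url.strip())

    def _check_css(self, css):
        for url in CSS_URL.findall(css or ''):
            self._check(url)


def find_remote_refs(html):
    finder = RemoteRefFinder()
    finder.feed(html)
    finder.close()
    return finder.found


dirty = '<img src="https://t.example/p.gif"><a href="https://example.com">x</a>'
print(find_remote_refs(dirty))
```

Ordinary links in a elements are ignored on purpose, because they are only followed when the user clicks them. An empty list for html_fixed does not prove that the email is safe, but a non-empty list shows that something slipped through.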
Summary
It was not difficult to use BeautifulSoup. It is very powerful and saved me a lot of time. It would be nice if there was something similar for parsing and modifying (bad) CSS. Anyway, the end result is that HTML emails are filtered and displayed in my browser with unsafe resources removed or blocked. With a button I can allow external resources.
Blocking external resources should be more flexible. For example, we always want to block Facebook, Google, but allow other resources.
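A possible starting point for this, sketched with the standard library only. The blocklist entries and the function name is_blocked are assumptions for illustration, and nothing like this exists in HTMLMailFixer yet:

```python
from urllib.parse import urlsplit

# Hypothetical blocklist, the entries are examples only.
BLOCKED_DOMAINS = {'facebook.com', 'google.com'}


def is_blocked(url, blocked_domains=BLOCKED_DOMAINS):
    """Return True if the host of the URL is a blocked domain or a subdomain of one."""
    host = (urlsplit(url).hostname or '').lower()
    return any(
        host == domain or host.endswith('.' + domain)
        for domain in blocked_domains
    )


print(is_blocked('https://www.facebook.com/tr?id=1'))
print(is_blocked('https://whatever_img_src_1/logo.png'))
```

A method like __fix_img_elems could then replace only the images for which is_blocked returns True, instead of replacing all of them.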
I do not claim that the solution presented here is perfect; it is just a start.
Links / credits
Beautiful Soup Documentation
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
HTML <link> Tag
https://www.w3schools.com/Tags/tag_link.asp