
Blacklisting IP addresses on your Flask website running on Linux

Sometimes you want to block IP addresses immediately. This post describes one way to do this.

16 April 2020 Updated 16 April 2020
In Flask

You have a website and it works fine. But you notice that certain visitors are trying to mess with your forms. They come from specific IP addresses. Then there are also bots that are scanning your site. Some are necessary but others should stay away. Don't you hate this? I do. In the past I once wrote a module that returned a not so nice response very slowly, byte-by-byte, slowing down their systems. Or returned a never ending amount of data. But that's another story.

For now I want to focus on another method: blocking these requests. Simply return an HTTP 403 Forbidden. I want to be able to do this on-the-fly from my website admin section. There we specify the IP addresses or ranges of IP addresses we want to block. There are also other ways to do this, like using .htaccess files and web server settings. I will mention them at the bottom of this post.

Several reasons to block

I already mentioned that one of the reasons to block access to your site is to block malicious visitors. They want to see how they can break your site, or stuff your comment section with advertising or otherwise crazy messages. There are many reasons why this is done, but I believe one of them is that they want to force you to grab a third-party anti-spam plugin. These can be very effective as they connect to huge databases with spam information. But if we want to respect the privacy of our visitors we cannot use such a plugin. We must use other ways, and a final resort often is IP address blocking.

It can also be necessary to block certain bots that scan your site. Some bots generate crazy amounts of traffic. I checked all traffic to this site for a certain period and it turned out that only 10%, probably even less, of the requests were from real visitors! Of course not all bots are bad, but some really do not respect the rules. Most bots can be identified by the User Agent string. I found the following two that I really want to block:

  • SemrushBot
  • AhrefsBot

Be careful what you block: many bots are used to get your site into search engine results. SemrushBot is about SEO, which I am not using at the moment. Blocking User Agents is not covered in this post. The list of User Agents will not change that often and you can set these blocks in other ways.

The good side of unwanted requests

If you implement proper logging then you can also take advantage of unwanted requests. The list below shows some requests that caused an HTTP 404 error for this site:

http://peterspython.com/css/album.css
http://www.peterspython.com/wordpress
http://peterspython.com/blog/wp-includes/wlwmanifest.xml
http://peterspython.com/wordpress/wp-includes/wlwmanifest.xml
http://peterspython.com/website/wp-includes/wlwmanifest.xml
http://peterspython.com/public/ui/v1/js/sea.js
http://www.peterspython.com/public/ui/v1/js/sea.js
http://peterspython.com/vendor/phpunit/phpunit/phpunit.xsd
http://peterspython.com/vendor/phpunit/phpunit/LICENSE
http://www.peterspython.com/apple-touch-icon.png
http://peterspython.com/humans.txt
http://peterspython.com/license.txt

We see that bots are searching for the file wlwmanifest.xml. This appears to be a file associated with 'Windows Live Writer', a blog publishing application developed by Microsoft that was discontinued in 2017 and may be vulnerable. Another attack is looking for PHPUnit, a PHP unit testing framework. This contained a vulnerability that may not have been patched everywhere yet. Other attack bots may generate URLs that cause an HTTP 500 error. This may be intended but can also be caused by weaknesses of your site.

The good news is that you can use this information to improve your site. Always make sure to implement proper logging, errors give you very valuable information!

Limit to IPv4 IP addresses only

Blocking visitors by IP address has its limitations. Many people on the internet do not have a fixed IP address. They get a dynamic IP address, assigned using DHCP, when they connect, and it can be handed to someone else later. This is mostly true for mobile phones. So be careful what you block.

Then there is also IPv6, which was designed to overcome the limited availability of IPv4 addresses. Although some reports state that 30% of internet traffic is on IPv6, the number of servers that actually have IPv6 enabled is far lower. This fortunately means that there is no reason to migrate your server to IPv6 at this moment. Blocking spam with IPv6 is possible with this method but there is a gotcha.

Admin operations and Blacklisted IP Address Rules

In the admin I want to specify the IP addresses that I want to blacklist. There is a table with blacklisted IP address records. I want to be able to specify IP addresses in the following ways:

  1. A single IP address, example: 1.2.3.4
  2. An IP network, example: 1.2.3.0/24
  3. An IP address range, example: 1.2.3.6-1.2.4.2

I specify one of these in a single record and I call this a 'Blacklisted IP Address Rule'.
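
To make the three rule formats concrete, here is a minimal sketch of how such a rule could be converted to a from/to pair of unsigned integers with Python's ipaddress module. The function names rule_to_uint_range and rule_matches are hypothetical, and unlike the table in this post the sketch stores a single IP address as a range of one. It is an illustration under these assumptions, not the code running on this site.

```python
import ipaddress


def rule_to_uint_range(rule):
    """Return (from_uint, to_uint) for a single address, a network, or a range.

    Hypothetical helper: IPv4 only, raises ValueError on malformed input.
    """
    rule = rule.strip()
    if '-' in rule:
        # IP address range, example: 1.2.3.6-1.2.4.2
        first, last = (ipaddress.IPv4Address(part.strip()) for part in rule.split('-', 1))
        if int(first) > int(last):
            raise ValueError('range start is above range end: {}'.format(rule))
        return int(first), int(last)
    if '/' in rule:
        # IP network, example: 1.2.3.0/24
        network = ipaddress.IPv4Network(rule, strict=True)
        return int(network.network_address), int(network.broadcast_address)
    # single IP address, example: 1.2.3.4
    address = int(ipaddress.IPv4Address(rule))
    return address, address


def rule_matches(rule, ip_address):
    """True if ip_address falls inside the rule."""
    from_uint, to_uint = rule_to_uint_range(rule)
    return from_uint <= int(ipaddress.IPv4Address(ip_address)) <= to_uint
```

With the integers stored in the record, checking a visitor against a network or a range becomes two simple comparisons in the database query.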

Caching to avoid database access

We certainly do not want to access the database on every request to see if the request is allowed. That would slow down requests. This is why we use caching. Instead of querying the database we first check the cache to see if the IP address accessed the site before. For every IP address we have a flag called 'allowed'. If True then access is allowed, if False then access is blocked.

If the IP address is in the cache, we are done: continue or block. If the IP address is not in the cache we check if it matches one of the Blacklisted IP Address Rules. The result is added to the cache, and the next time a request with this IP address hits our site, the data is in the cache and the database does not need to be queried.

Adding and removing Blacklisted IP Address Rules

Assume we have hundreds or thousands of items in our cache. Now we want to make changes using the admin, either by adding or by removing a Blacklisted IP Address Rule.

Adding or removing a rule is not trivial because the rule can include IP addresses that are in the cache already. What should be done with our cached values? The simplest way is to flush the cache and let it rebuild. This will slow down the next requests for a short time. The only other way is to scan the IP addresses in the cache and check if they match the added or removed Blacklisted IP Address Rule. If they match, we remove them from the cache. I have some ideas about how to implement this but have not done so yet.
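
One possible way to do the scan, which I have not implemented, is sketched below. It assumes the file-based cache layout described later in this post (one directory per part of the IP address and a file named 'allowed' at the end) and that the rule is available as a from/to pair of unsigned integers. The function name invalidate_cached_range is hypothetical.

```python
import ipaddress
import os


def invalidate_cached_range(cache_dir, from_uint, to_uint):
    """Remove cached 'allowed' files whose IPv4 address lies in [from_uint, to_uint].

    Returns the list of IP addresses that were removed from the cache.
    """
    removed = []
    for dirpath, dirnames, filenames in os.walk(cache_dir):
        if 'allowed' not in filenames:
            continue
        # the path below cache_dir spells the address: 1/2/3/4 -> 1.2.3.4
        parts = os.path.relpath(dirpath, cache_dir).split(os.sep)
        try:
            ip_address = ipaddress.IPv4Address('.'.join(parts))
        except ValueError:
            continue  # not an IPv4 cache entry, leave it alone
        if from_uint <= int(ip_address) <= to_uint:
            try:
                os.remove(os.path.join(dirpath, 'allowed'))
            except FileNotFoundError:
                pass  # another process removed it first
            else:
                removed.append(str(ip_address))
    return removed
```

This walks the whole cache directory, so with a very large cache a full flush may still be the cheaper option.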

Adding timestamps

For maximum performance the IP address information in the cache is read-only and does not expire. This means it can grow huge over time if you have many visitors. Because most visitors access your site for just a few minutes, we can add a timestamp to the cached IP addresses that is updated on every access. The timestamp makes it easy to remove old entries.

Requests at the same moment

Assume two requests, request A and request B, arrive at the same time, both using the same IP address. If the IP address is not in the cache, both will check if it is blocked by searching the Blacklisted IP Address Rules table. Then they both update the cached_access item. Request A first creates the allowed item. But then request B creates the allowed item, overwriting the allowed item of request A. The same is true if we want to update the timestamp of the cached item. This may seem bad but is not really that bad, because both requests write the same value. We just have to make sure that the create operation is atomic.

Using the Linux file system as cache

For the moment I chose to implement the cache with files. The Linux file system is fast enough to handle this for my application. I do not want to add something like Redis, I want to keep dependencies minimal.

If we have an 'allowed' file per IP address, then the file can be small: its content is 0 (blocked) or 1 (allowed). To prevent a huge number of files in one directory and slow lookups, we create subdirectories based on the IP address. We split the IP address by the dot ('.') and use the parts to create directories. The timestamp of the 'allowed' file automatically changes when the file is read. In Linux we have the following timestamps:

  • mtime (ls -l)
    The last time the file content was modified
  • ctime
    The last time the file status, e.g. permissions, changed
  • atime (ls -lu)
    The last time the file was read

For our purpose we can use atime as a timestamp. We do not have to update the time of the file. There is a problem if you want to be able to show the contents of the allowed files in the admin. This would read the files and change the timestamps. We can overcome this by creating a copy of the 'allowed' file. Reading the copy does not change the atime of the original 'allowed' file.
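
Removing old entries based on atime could look like the sketch below. This is an assumption about how a cleanup job might be written, not code from this site. The function name prune_cache and its parameters are hypothetical.

```python
import os
import time


def prune_cache(cache_dir, max_age_seconds, now=None):
    """Delete 'allowed' files whose access time is older than max_age_seconds.

    Returns the number of files removed. Empty directories are left in place.
    """
    now = time.time() if now is None else now
    removed = 0
    for dirpath, dirnames, filenames in os.walk(cache_dir):
        if 'allowed' not in filenames:
            continue
        path = os.path.join(dirpath, 'allowed')
        try:
            # os.stat() only reads metadata, so it does not change atime itself
            if now - os.stat(path).st_atime > max_age_seconds:
                os.remove(path)
                removed += 1
        except FileNotFoundError:
            pass  # removed by another process in the meantime
    return removed
```

A maximum age of several days is probably a safe choice, for example prune_cache(cache_dir, 7 * 86400) from a daily cron job.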

A warning when using the Linux access time atime

There is much information on the internet about Linux timestamps but very few sources mention that atime may not work as expected. I invite you to look at the links below about this. For example, you can check if relatime is a mount option with this command:

cat /proc/mounts | grep relatime
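
The same check can be done from Python. The sketch below parses text in the format of /proc/mounts and reports the atime-related options per mount point. The function name atime_mount_options is hypothetical and the set of options it looks for is my own selection.

```python
def atime_mount_options(mounts_text):
    """Map each mount point to its atime-related mount options.

    mounts_text is expected in /proc/mounts format:
    device, mount point, file system type, options, dump, pass.
    """
    wanted = {'relatime', 'noatime', 'nodiratime', 'lazytime'}
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip empty or malformed lines
        mount_point, options = fields[1], fields[3].split(',')
        result[mount_point] = [option for option in options if option in wanted]
    return result
```

On a Linux system you would call it with the contents of /proc/mounts, for example atime_mount_options(open('/proc/mounts').read()).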

The summary is:

  • Updating atime on every read is disabled by default for performance reasons
  • Since kernel 2.6.30, relatime is the default option
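  • Since kernel 2.6.30, the last access time of a file is always updated if it is more than 1 day old
  • With relatime, the access time is also updated if it is earlier than or equal to the modify or change time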

This means we still can use atime but must respect a resolution of one day. No problem for me but wait, let's test if this is really working.

ls -l

The result is:

total 8
-rw-r--r-- 1 flaskuser flaskgroup 1 apr 16 15:47 allowed
-rw-r--r-- 1 flaskuser flaskgroup 1 apr 16 15:47 allowed_copy

Next we want to see the atime, or access time, of the allowed file:

stat allowed

The result is:

  File: allowed
  Size: 1         	Blocks: 8          IO Block: 4096   regular file
Device: 806h/2054d	Inode: 38805116    Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1002/flaskuser)   Gid: ( 1002/flaskgroup)
Access: 2020-04-16 15:47:06.024559817 +0200
Modify: 2020-04-16 15:47:06.024559817 +0200
Change: 2020-04-16 15:47:06.024559817 +0200
 Birth: -

Now we change the access time to the previous day:

sudo touch -a -t 202004151530.02 allowed

The result of the stat command shows that the access time is one day earlier:

  File: allowed
  Size: 1         	Blocks: 8          IO Block: 4096   regular file
Device: 806h/2054d	Inode: 38805116    Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1002/flaskuser)   Gid: ( 1002/flaskgroup)
Access: 2020-04-15 15:30:02.000000000 +0200
Modify: 2020-04-16 15:47:06.024559817 +0200
Change: 2020-04-16 15:52:12.472562630 +0200
 Birth: -

Now we generate a request on the website and after the request we run the stat command again:

  File: allowed
  Size: 1         	Blocks: 8          IO Block: 4096   regular file
Device: 806h/2054d	Inode: 38805116    Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1002/flaskuser)   Gid: ( 1002/flaskgroup)
Access: 2020-04-16 15:56:24.200564941 +0200
Modify: 2020-04-16 15:47:06.024559817 +0200
Change: 2020-04-16 15:52:12.472562630 +0200
 Birth: -

The access time was updated to today. Subsequent requests do not update the access time anymore. Working as expected, I learned something today.
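
The shell steps above can also be done from Python with os.utime and os.stat. The two helpers below are a small sketch with hypothetical names. Whether a later read actually updates the access time still depends on the mount options of your file system.

```python
import os
import time


def set_access_time(path, atime):
    """Set only the access time of path and keep mtime, like 'touch -a'."""
    os.utime(path, (atime, os.stat(path).st_mtime))


def access_age_seconds(path, now=None):
    """Return the number of seconds since the last recorded access of path."""
    now = time.time() if now is None else now
    return now - os.stat(path).st_atime
```

For example, set_access_time('allowed', time.time() - 86400) moves the access time back one day, and access_age_seconds('allowed') shows after the next request whether the access time was refreshed.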

Implementation details

I called the class CachedAccess. In Flask's before_request I instantiate it as follows:

    @app.before_request
    def before_request():
        ...
        g.ip_address = get_ip_address()
        ...
        cached_access = CachedAccess()
        if not cached_access.is_allowed():
            # bye bye
            abort(403)

And here are the (important) parts of the class:

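import ipaddress
import os
import pathlib
import tempfile

from flask import current_app, g

# db_select and AccessBlockIPAddress are application-specific (not shown)


class CachedAccess: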

    def __init__(self):
        ...


    def log_block(self, reason):
        ...


    def is_allowed_ip_address(self, ip_address_uint):

        # check: single ip addresses
        access_block_ip_address = db_select(
            model_class_list=[AccessBlockIPAddress],
            filter_by_list=[
                (AccessBlockIPAddress, 'is_active', 'eq', True),
                (AccessBlockIPAddress, 'ip_address_type', 'eq', 3),
                (AccessBlockIPAddress, 'ip_address_uint', 'eq', ip_address_uint),
            ],
        ).first()

        if access_block_ip_address is not None:
            # found
            return False

        # check: network and range
        access_block_ip_address = db_select(
            model_class_list=[AccessBlockIPAddress],
            filter_by_list=[
                (AccessBlockIPAddress, 'is_active', 'eq', True),
                (AccessBlockIPAddress, 'ip_address_type', 'in', [1, 2]),
                (AccessBlockIPAddress, 'ip_address_from_uint', 'le', ip_address_uint),
                (AccessBlockIPAddress, 'ip_address_to_uint', 'ge', ip_address_uint),
            ],
        ).first()

        if access_block_ip_address is not None:
            # found
            return False

        return True


    def is_allowed(self):
        # name used as prefix in the log messages below
        fname = 'CachedAccess.is_allowed'

        # check if valid ip_address
        try:
            ip_address_uint = int( ipaddress.ip_address(g.ip_address) )
        except Exception as e:
            current_app.logger.error(fname + ': not a valid ip address = {}, {}'.format(g.ip_address, str(e)))
            return True

        # create ip_address_file
        app_cached_access_dir = current_app.config['APP_CACHED_ACCESS_DIR']
        ip_address_parts = g.ip_address.split('.')
        ip_address_file = os.path.join(app_cached_access_dir, *ip_address_parts, 'allowed')

        # check if file exists and read its contents
        found = True
        try:
            with open(ip_address_file, 'r') as f:
                allowed = f.read()
        except OSError:
            # not in the cache yet (or not readable)
            found = False

        if found:
            # done
            if allowed == '1':
                return True
            self.log_block(1)
            return False

        # check if g.ip_address matches a rule in blacklisted IP addresses table
        allowed = self.is_allowed_ip_address(ip_address_uint)

        # create directories for g.ip_address
        ip_address_dir = os.path.dirname(ip_address_file)
        try:
            pathlib.Path(ip_address_dir).mkdir(parents=True, exist_ok=True)
        except Exception as e:
            current_app.logger.error(fname + ': error creating directories ip_address_dir = {}, {}'.format(ip_address_dir, str(e)))
            return True

        # create allowed temp file
        temp_name = next(tempfile._get_candidate_names())
        ip_address_temp_file = os.path.join(app_cached_access_dir, *ip_address_parts, temp_name)
        try:
            with open(ip_address_temp_file, 'w') as f:
                f.write( '1' if allowed else '0' )
        except Exception as e:
            current_app.logger.error(fname + ': error writing ip_address_temp_file = {}, {}'.format(ip_address_temp_file, str(e)))
            return True

        # atomic move ip_address_temp_file to ip_address_file
        try:
            os.rename(ip_address_temp_file, ip_address_file)
        except Exception as e:
            current_app.logger.error(fname + ': error renaming ip_address_temp_file = {} to ip_address_file = {}, {}'.format(ip_address_temp_file, ip_address_file, str(e)))
            return True

        if allowed:
            return True

        self.log_block(2)
        return False

This is not really very difficult. I convert the IP address to an unsigned integer so that we can check if it is in an IP network or IP address range with two simple comparisons. If an unexpected error occurs I log the error and allow the IP address. This means we do not block requests because of an unexpected error.

Development and production

On development you will probably see many requests blocked while testing. The reason is that images, JavaScript files, etc. are also served by the Flask development server. You can filter these requests in your code:

    if request_path.startswith('/static/'):
        return

On production I assume you are serving all your static content with the web server (Nginx, Apache), meaning that no time gets wasted. We are only blocking requests to the code. Images etc. are not blocked.

Blocking with Nginx

To keep things simple I did not want to involve my Nginx web server. But it is not that difficult to tell it to block requests. If you use Nginx, you can add a few lines to block multiple user agents as follows:

    if ($http_user_agent ~* (wget|curl|libwww-perl) ) {
        return 403;
    }

And to block multiple IP addresses you can use:

    location / {
        deny 127.0.0.1; # Individual IP Address
        deny 1.2.3.0/24; # IP network
    }

But this is not what we want. We want dynamic blocking. There are several ways to do this but of course you will have to get far more involved in Nginx specifics. There are enough examples on the internet of how to do this.

Summary

I really wanted to implement on-the-fly IP address blacklisting and it turned out not to be that difficult. I have not implemented everything yet. This means no smart updates after adding or removing Blacklisted IP Address Rules. Instead, I have a 'flush cache' button that I can hit after I make changes to the Blacklist table. It is like an 'rm -R' in Python.
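
A flush like this can be written with shutil.rmtree. The sketch below is an assumption about how such a function could look, not the code behind my button. It empties the cache directory but keeps the directory itself, and the name flush_cache is hypothetical.

```python
import os
import shutil


def flush_cache(cache_dir):
    """Remove everything below cache_dir, but keep cache_dir itself."""
    with os.scandir(cache_dir) as entries:
        entry_list = list(entries)
    for entry in entry_list:
        if entry.is_dir(follow_symlinks=False):
            # requests may recreate files while we delete, so ignore errors
            shutil.rmtree(entry.path, ignore_errors=True)
        else:
            try:
                os.remove(entry.path)
            except FileNotFoundError:
                pass  # already gone
```

Requests that arrive during the flush simply miss the cache and query the database again.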

The Linux access time delayed me in writing this post. I had never used the access time before but now I know its peculiarities. With relatime, Linux updates the access time at most once a day for a file that does not change, and that is fine with me.

I doubt if you can get better performance but you may want to look at other options like caching the item in memory. You could use TTLCache from Python cachetools.
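
To illustrate the in-memory idea without adding a dependency, here is a tiny stand-in written with the standard library only. It is not cachetools and not what this site uses. The class name TTLAccessCache is hypothetical. TTLCache from cachetools offers the same idea and adds a maximum size.

```python
import time


class TTLAccessCache:
    """Tiny in-memory cache: ip_address -> allowed flag, expiring after ttl seconds."""

    def __init__(self, ttl, timer=time.monotonic):
        self.ttl = ttl
        self.timer = timer
        self._items = {}  # ip_address -> (allowed, expires_at)

    def get(self, ip_address):
        """Return True or False if cached and still fresh, otherwise None."""
        item = self._items.get(ip_address)
        if item is None:
            return None
        allowed, expires_at = item
        if self.timer() >= expires_at:
            # expired: drop the entry so the caller checks the rules again
            del self._items[ip_address]
            return None
        return allowed

    def set(self, ip_address, allowed):
        """Store the flag with a fresh expiry time."""
        self._items[ip_address] = (allowed, self.timer() + self.ttl)

    def clear(self):
        """Flush the cache, for example after changing the rules."""
        self._items.clear()
```

Keep in mind that an in-memory cache lives inside one process. If your site runs with multiple worker processes, each worker has its own copy and a flush from the admin must reach all of them.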

Links / credits

cachetools
https://pypi.org/project/cachetools/

Dynamic Blacklisting of IP Addresses
https://docs.nginx.com/nginx/admin-guide/security-controls/blacklisting-ip-addresses/

flask-ipban
https://github.com/Martlark/flask-ipban

flask-ipblock
https://github.com/closeio/flask-ipblock

flask-limiter
https://github.com/alisaifee/flask-limiter

how to know if noatime or relatime is default mount option in kernel?
https://superuser.com/questions/318293/how-to-know-if-noatime-or-relatime-is-default-mount-option-in-kernel

Why is cat not changing the access time?
https://superuser.com/questions/464290/why-is-cat-not-changing-the-access-time/464737#464737
