These are some rough notes on the issues encountered with providing remote access for staff when the UK went into lockdown.
Part 1: The Problem
In March 2020 all staff had instructions to work from home with immediate effect (UK Lockdown 1.0).
As with most other organisations, user network traffic flipped from inside-to-outside to outside-to-inside. More users immediately needed remote access.
Over the following days I checked how our FortiGate firewall was coping with the increase in the number of SSL VPN connections. Before lockdown, SSL VPN had only ever been used by around six concurrent users.
After a day or so it emerged that it wasn’t coping well at all.
When the number of SSL VPN connections reached 90-100, all traffic in and out of the university stopped. As our monitoring system (PRTG) was alerting that the DNS servers were unresponsive, I thought I had found the cause.
Some investigation proved this wasn’t the case. The PRTG sensor was testing the DNS servers by initiating a query. As the DNS servers couldn’t reach the internet (due to the firewall issue), PRTG reported them as unresponsive.
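The distinction matters when monitoring: a resolver that is down and a resolver that simply cannot reach the internet look identical to a naive "did I get an answer" check. As a minimal sketch (not the PRTG sensor itself; the function names and test names here are illustrative), you can send a raw DNS query and tell the two apart by querying an internal name alongside an external one:

```python
import socket
import struct

def build_dns_query(name: str, txid: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT/NSCOUNT/ARCOUNT=0
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte
    qname = b"".join(bytes([len(label)]) + label.encode() for label in name.split("."))
    question = qname + b"\x00" + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

def dns_responds(server: str, name: str, timeout: float = 2.0) -> bool:
    """Return True if the server sends back any UDP DNS response at all."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(build_dns_query(name), (server, 53))
            s.recv(512)
            return True
        except OSError:  # timeout or network error
            return False
```

If the server answers for an internal zone it hosts but times out for external names, the resolver itself is alive and the problem is upstream (as it was here, at the firewall).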
Closer inspection of the firewall dashboard revealed high memory usage. This seemed to correlate with the number of SSL VPN connections. The CLI revealed a separate process for each SSL VPN connection.
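The same picture is visible from the FortiOS CLI. The commands below are the sort of thing used to inspect it (the prompt and output are illustrative, not a capture from our firewall):

```
# get system performance status     <- overall CPU and memory usage
# diagnose sys top 5 20             <- top processes, refreshed every 5s;
#                                      one guacd process per web-mode SSL VPN user
# diagnose hardware sysinfo memory  <- detailed memory breakdown
```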
I was initially confused, as the spec sheet for our firewall model (1500D) stated a maximum of 10,000 VPN connections.
Reading it again, it actually said 10,000 *IPSec* VPN connections, not 10,000 SSL VPN connections.
Digging around in the Fortinet support documents revealed that our level of SSL VPN use is not a recommended usage scenario. The firewall enters “conserve mode” when it detects high memory usage, which had the effect of halting all traffic forwarding (!).
Technical Tip: SSL VPN in web mode uses a lot of CPU and memory resources:
https://kb.fortinet.com/kb/documentLink.do?externalID=FD48014
How conserve mode is triggered:
https://kb.fortinet.com/kb/documentLink.do?externalID=FD33103
As SSL VPN connections use far more memory than IPSec VPN connections this was the cause of the problems.
To try to prevent traffic from stopping completely, I started killing idle connections from the CLI to keep the connection count below 90 and conserve memory.
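For reference, FortiOS exposes this via the `execute vpn sslvpn` commands (session indexes vary; the fragment below is illustrative):

```
# execute vpn sslvpn list                 <- list active SSL VPN sessions and their indexes
# execute vpn sslvpn del-web <index>      <- kill a web-mode session
# execute vpn sslvpn del-tunnel <index>   <- kill a tunnel-mode session
```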
I noticed the processes using all the memory were named “guacd”. I filed this away; it became significant later …
Meanwhile, a solution was needed (and fast).
Part 2: The Solution
The search for a solution.
Having determined that our existing solution for users working from home wasn’t up to the task, I started to look for alternatives.
The first thought was plain RDP. As exposing RDP directly to the internet is not a good idea, it was quickly discounted.
Next was utilising the FortiGate IPSec VPN client (FortiClient). Licensing was free if we did not use central management of the VPN clients. This was fine for devices built, secured and controlled by us, i.e. university-owned devices, but not advisable for user-owned devices, as the VPN connection would put them onto the corporate network.
This has been rolled out for corporate devices but only solved a small part of the issue.
Microsoft RD Gateway was considered as an RDP solution. Licensing was expensive and it was not a quick solution to build, though it remains possible in the long term.
While browsing in the evenings to see what other sysadmins were using to solve the WFH problem, I came across a mention of the guacd process that was causing the issues on the FortiGate firewall.
It transpired that this was part of an open-source remote desktop gateway product by the Apache Software Foundation: Apache Guacamole.
Apache Guacamole: https://guacamole.apache.org/
Fortinet had customised it for use in their SSL VPN offering on the FortiGate firewalls. Guacamole itself is a standalone product that can be run on a Linux VM.
I thought it was worth a try to take the load away from the firewall and still provide clientless RDP access for our users.
My thought process was that if the VM was getting busy it would be simple to add more memory or CPU from within vCenter.
Crash courses in Docker, Guacamole, MySQL and NGINX (reverse proxy/SSL) produced a test server.
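The core of that setup can be sketched as a compose file along the lines below, following the Apache Guacamole Docker guide: guacd, a MySQL database, and the Guacamole web application, with NGINX in front terminating TLS. Hostnames, passwords and the database name are placeholders, not our production values:

```yaml
# Sketch of a Guacamole stack (placeholder credentials).
version: "3"
services:
  guacd:
    image: guacamole/guacd
    restart: always
  mysql:
    image: mysql:8
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: change-me
      MYSQL_DATABASE: guacamole_db
      MYSQL_USER: guacamole_user
      MYSQL_PASSWORD: change-me-too
    volumes:
      - guac-db:/var/lib/mysql
  guacamole:
    image: guacamole/guacamole
    restart: always
    depends_on: [guacd, mysql]
    environment:
      GUACD_HOSTNAME: guacd
      MYSQL_HOSTNAME: mysql
      MYSQL_DATABASE: guacamole_db
      MYSQL_USER: guacamole_user
      MYSQL_PASSWORD: change-me-too
    ports:
      - "8080:8080"   # NGINX proxies HTTPS traffic to this port
volumes:
  guac-db:
```

Per the Guacamole Docker documentation, the database schema has to be initialised once before first start, e.g. `docker run --rm guacamole/guacamole /opt/guacamole/bin/initdb.sh --mysql > initdb.sql`, then loaded into MySQL.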
After some testing it was decided to build a production server and get the users on it. What was there to lose?
So far (Dec 2020), it has performed fine with an average of 100 concurrent daily users while we look for a more long term solution.
TLDR: built Guacamole server, solved immediate problem.
Resources used:
Apache Guacamole Manual – Installing Guacamole with Docker: https://guacamole.apache.org/doc/gug/guacamole-docker.html
Alternative combined guide: https://www.linode.com/docs/applications/remote-desktop/remote-desktop-using-apache-guacamole-on-docker/
Installing Docker: https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-18-04