WordPress on AWS

As I lamented previously, our paymasters do not let us use our real-world examples of large-scale architecture.  So in this article, I will describe architecting this WordPress blog on AWS.  These are 800-pound-gorilla products: WordPress hosts 43% of all websites, and Amazon is the largest cloud vendor.  This should be interesting, as we all use some COTS products in our systems.  Surprisingly, it’s been a long journey.  I started with fair knowledge of AWS but no knowledge of hosting PHP.  I have built mega-systems, so could a blog site be that hard?  Well, “yes” is the answer, as the planned couple of days stretched to about 15, and I’m not sure it’s complete yet.

I designed this to the same requirements as I would for a medium-size commercial deployment, with a focus on the Big Five non-functionals:

  • Reliable – (typical commercial) – zero planned downtime, 99.99% availability, RPO 15 minutes, RTO 15 minutes.
  • Performant – (start-up business) – day one, I want to be able to service a few dozen views per second and scale out to a few hundred if successful.
  • Secure – (moderate) – protected against damage, DOS, and leaking PII.
  • Maintainable – (small business) – one hour per week admin – patching, backups, monitoring, troubleshooting.
  • Green – the average website visit generates 0.5g of CO2.  Let’s do better.

I want this design to be not ‘just a blog’ but to be well-architected, including conforming with AWS’s Well-Architected guidelines.  I have a follow-up article, WordPress on AWS – review, describing how this all went.

The system view

I’ll start with the end design and explain some rabbit holes after.

It’s a bit of AWS alphabet soup, which I’ll explain as I go along.  First, some facts that guided this design:

  • WordPress (WP) is a framework for building blogs, it’s not a complete product.  I need five plugins to get acceptable functionals and five more plugins, eight AWS services, and four OS services for the non-functionals.
  • WordPress stores data in files (mostly plugin stuff) and a MySQL database (the blog entries, comments and such), and the two must be kept in sync.
  • WordPress is PHP and puts a heavy load on compute, filesystem, DB and IO.  Typically rendering one page requires:
    • 60 DB and 200 file reads
    • 200 assets downloaded to the browser (including four different UI frameworks)
    • 0.3 CPU seconds.

How this design works:

I host the service in one Amazon Virtual Private Cloud (VPC).  This isolates my WordPress hosting from my other uses of AWS.


Route53 is AWS’s high-availability DNS service.  It has good tooling to set up DNS and will work with the domain name registrars on your behalf.  I used Route53 to register the shouldjustwork.org domain name and to host the DNS records that point to the public IP addresses used by the load balancer.
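As a sketch, pointing the apex record at the load balancer looks roughly like this in the AWS CLI (the zone IDs and the ALB DNS name are placeholders, not my real values):

    # Alias the apex A record to the ALB
    aws route53 change-resource-record-sets --hosted-zone-id Z0ABC12345 \
        --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{
            "Name":"shouldjustwork.org","Type":"A",
            "AliasTarget":{"HostedZoneId":"Z0ALB67890",
                "DNSName":"wp-alb-123456.us-east-1.elb.amazonaws.com",
                "EvaluateTargetHealth":true}}}]}'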


The incoming traffic is filtered by a Network Access Control List (NACL), which provides some coarse firewall protection (it only lets in ICMP, HTTP and HTTPS).  I use a Web Application Firewall (WAF) to rate-limit incoming requests for protection against password attacks, unsophisticated DOS attacks and non-malicious errors.  AWS provides free basic DDOS protection.  Sadly, NACLs are stateless, so I need to open all the ephemeral TCP ports to incoming traffic for the site’s outgoing traffic (such as loading plugins or yum downloads) to work.  In contrast, fine-grained filtering is done using Security Groups (see below).
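To make the stateless-NACL point concrete, here is a minimal sketch of these rules via the AWS CLI (the ACL ID is a placeholder, and I’ve omitted the ICMP and outbound rules):

    # Allow HTTP and HTTPS in from anywhere
    aws ec2 create-network-acl-entry --network-acl-id acl-0abc12345 --ingress \
        --rule-number 100 --protocol tcp --port-range From=80,To=80 \
        --cidr-block 0.0.0.0/0 --rule-action allow
    aws ec2 create-network-acl-entry --network-acl-id acl-0abc12345 --ingress \
        --rule-number 110 --protocol tcp --port-range From=443,To=443 \
        --cidr-block 0.0.0.0/0 --rule-action allow
    # NACLs are stateless, so the ephemeral ports must also be open
    # for the return half of the site's own outgoing connections
    aws ec2 create-network-acl-entry --network-acl-id acl-0abc12345 --ingress \
        --rule-number 120 --protocol tcp --port-range From=1024,To=65535 \
        --cidr-block 0.0.0.0/0 --rule-action allow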


I use an Application Load Balancer (ALB) to distribute requests across the WP instances (on each EC2).  The ALB:

    • Terminates the TLS (using a service certificate managed by AWS Certificate Manager).
    • Manages the HTTP connections to the WP instances (I am not running TLS into the WP instances).
    • Monitors the health of each WP instance and blocks traffic to faulty instances.
    • Redirects HTTP browser requests to use HTTPS (it’s 2023, after all).
    • Blocks sensitive paths that WP exposes (such as its CRON end-point; see the sketch after this list).
    • Provides a management point to control traffic during upgrades.
    • Handles outbound connections for software and plugin downloads.
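For illustration, the HTTPS redirect and the CRON block look roughly like this as ALB listener rules in the AWS CLI (the listener ARNs are placeholders):

    # Redirect all HTTP browser requests to HTTPS
    aws elbv2 modify-listener --listener-arn arn:aws:elasticloadbalancing:...:listener/http \
        --default-actions '[{"Type":"redirect","RedirectConfig":{"Protocol":"HTTPS","Port":"443","StatusCode":"HTTP_301"}}]'
    # Return 403 for WordPress's CRON end-point
    aws elbv2 create-rule --listener-arn arn:aws:elasticloadbalancing:...:listener/https \
        --priority 10 \
        --conditions '[{"Field":"path-pattern","Values":["/wp-cron.php"]}]' \
        --actions '[{"Type":"fixed-response","FixedResponseConfig":{"StatusCode":"403","ContentType":"text/plain","MessageBody":"Forbidden"}}]'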

Each WP instance is hosted in an Availability Zone (AZ), and I deploy in two AZs.  Each AZ has identical deployment and configuration, and each is fully functional.  “Use two AZs” is fundamental to AWS’s recommendations for reliability.  AWS doesn’t publish SLAs for one AZ or one-of-two AZs, so I trust them on this.  It’s not blind trust: AWS is a business with a $20B annual profit, so they are extremely motivated to get this right.  I chose to use North American AZs, as that gives good latency to North America (50ms) and Europe (100ms), and OK latency to Australia, where I live (200ms).  (Acknowledgement to cloudping.)  If this ever gets to be a problem, I’ll deploy AWS CloudFront to give 50ms latency everywhere.


In each AZ, I have two subnets: one is public (has a publicly routable IP address) and runs PHP, and the other is private and runs the DB.  Each subnet is a /24 CIDR in the 192.168.0.0 range.  For example, the public subnet in AZ2 is 192.168.20.0/24.  AWS manages all the IP address allocations (both public and private) within these CIDRs and the AWS public address pool.  As an aside, I do use one static IP address to access my staging EC2 host (to save buying another domain name).
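As a sketch, each subnet pair is created along these lines (the VPC ID and AZ name are placeholders, and the private CIDR is my assumption for AZ2):

    # Public subnet for PHP, private subnet for the DB (one pair per AZ)
    aws ec2 create-subnet --vpc-id vpc-0abc12345 --availability-zone us-east-1b \
        --cidr-block 192.168.20.0/24    # public: PHP / Apache
    aws ec2 create-subnet --vpc-id vpc-0abc12345 --availability-zone us-east-1b \
        --cidr-block 192.168.21.0/24    # private: RDS-MySQL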


All the WP files (its PHP code, configuration and plugins) are stored on an AWS Elastic File System (EFS).  This is serverless storage managed by AWS and shared between the two AZs.  This is how I keep the two WP instances consistent – if I load a plugin on one, it’s automatically available on the other.  EFS also manages backups.  Traditionally, EFS was targeted at applications needing few but large files.  This made it quite impractical to store the 10,000-odd 1kB files required by PHP – after a short period of usage, EFS would throttle the throughput.  Recently AWS added support for Elastic Throughput, which lets you have high throughput for short periods without throttling or incurring horrendous costs.  Therefore EFS should be configured for ‘Elastic’ throughput and used with both cachefilesd and opcode caches (see below) to substantially drop the EFS average load.
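Switching an existing file system to Elastic throughput is a one-liner (the file-system ID is a placeholder):

    # Elastic throughput avoids the old burst-credit throttling
    aws efs update-file-system --file-system-id fs-0abc12345 --throughput-mode elastic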


All the WP operating data is stored in MySQL instances managed by AWS (called RDS-MySQL).  This is an active/standby model with close-to-real-time replication.  RDS will auto-failover, and supports manual failover.  RDS automatically does backups and applies minor version software patches.  Hosting your own database on your EC2 instance will save a little AWS cost, increase your admin costs, and is generally regarded as a cloud anti-pattern.  I use AWS Secrets Manager to store and roll the DB passwords (by default, WP stores the password in a file).
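As a sketch, the EC2 instances can fetch the password at boot rather than keeping it in wp-config.php (the secret name here is hypothetical):

    # Pull the current DB password from Secrets Manager
    DB_PASSWORD=$(aws secretsmanager get-secret-value \
        --secret-id wordpress/db-password \
        --query SecretString --output text)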


Each WP deployment runs on a single compute (EC2) instance in the public subnet, using a PHP environment hosted in Apache (more on this later).  To support Infrastructure-as-Code, I deploy each EC2 instance from a Launch Template, which specifies the following:

        • Compute – a t2.small (1 CPU, 2GB RAM, with burst performance).
        • Storage – boot volume with 8GB SSD.
        • OS – Amazon Linux 2 (Fedora-based); this is specified in a machine image (AMI).
        • Security groups and roles (these allow the instance to work with AWS’s security services).
        • User data – the script AWS autoruns after installation; mine mounts the shared EFS file system and installs the OS extensions and the PHP stack (a sketch follows this list).
        • Network – this is not specified in the template but instead provided by the auto-scaler or operator when the instance is created.
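A heavily trimmed sketch of that User data script, under my setup assumptions (the EFS ID is a placeholder, and the real PHP install pulls 8.2 from the remi repo, as described later):

    #!/bin/bash
    # Install Apache and the EFS mount helper
    yum install -y httpd amazon-efs-utils
    # Mount the shared WordPress files over the docroot
    mount -t efs -o tls fs-0abc12345:/ /var/www/html
    # Install the PHP stack (trimmed; the full list is ~25 packages)
    yum install -y php php-mysqlnd php-opcache
    systemctl enable --now httpd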

Each of the EC2 instances is started by the AWS auto-scaler.  I set this so that when CPU usage exceeds 75%, it creates and deploys new instances.  Similarly, when the traffic drops, it terminates the instances.  In addition, it monitors and replaces dead instances.  I set the maximum size to six EC2s (for scale-out) and the minimum size to two (for reliability – one in each AZ).
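A sketch of that policy in the AWS CLI (the group name is a placeholder):

    # Scale on average CPU, targeting 75%
    aws autoscaling put-scaling-policy \
        --auto-scaling-group-name wp-asg --policy-name cpu-75 \
        --policy-type TargetTrackingScaling \
        --target-tracking-configuration '{"PredefinedMetricSpecification":{"PredefinedMetricType":"ASGAverageCPUUtilization"},"TargetValue":75.0}'
    # Min 2 (one per AZ, for reliability), max 6 (for scale-out)
    aws autoscaling update-auto-scaling-group \
        --auto-scaling-group-name wp-asg --min-size 2 --max-size 6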


I use CloudWatch to monitor all the services and present a console showing various performance metrics.  In addition, CloudWatch periodically executes synthetic canaries that send GET requests (via the Internet) to one of my blog pages.  I set this to send an alarm using AWS SNS (via email and SMS) if it fails twice in five minutes.  CloudWatch also sends alerts for auto-scale actions.
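The alarm looks roughly like this (the canary name, topic ARN and account number are placeholders):

    # Alarm when the canary fails twice within five minutes
    aws cloudwatch put-metric-alarm --alarm-name blog-canary-failed \
        --namespace CloudWatchSynthetics --metric-name Failed \
        --dimensions Name=CanaryName,Value=blog-home-canary \
        --statistic Sum --period 300 --evaluation-periods 1 --threshold 2 \
        --comparison-operator GreaterThanOrEqualToThreshold \
        --alarm-actions arn:aws:sns:us-east-1:123456789012:blog-alerts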

The AWS documentation for each service is excellent, so building this out is pretty straightforward.  Except for the EC2 bit, so …


Hosting WordPress on EC2

This was my first deployment of PHP, and it presented a few surprises.   The WP architecture enables (and requires) many third-party plugins to be deployed, as WP does not come with many of the functions you need.  For example, TLS, monitoring, logging, a reasonable content editor, page styling and such do not come in the base product.  But let’s start with caching, which is both essential and absent.

Out of the box, WP took 30 seconds to render a page.  PHP code is stateless between invocations, so you need five caches to get that time down to a reasonable level:

  • Page cache – your home and blog pages need to be 100% cached, else you have maybe 4 seconds TTFB.
  • Opcode cache – stores a compiled version of the PHP code in memory.
  • Database and object caches – at least 50 DB calls are required for each page render.
  • EFS cache – EFS is a network file system holding the PHP code, plugin configuration and web assets.  (About 120MB and 11k files)

I used Memcached (non-distributed) for the page, DB and object caches and set them up using the W3 Total Cache plugin.  The opcode PHP cache naturally uses memory.  I used cachefilesd (which also uses memory) for the EFS cache.  That’s the reason for the 2GB of instance RAM.  It took a while to realize that all these caches are necessary, but together the blog pages now have 300ms TTFB (time to first byte) across the Pacific Ocean (I’m in Australia, and my blog is hosted in the USA).  Page editing (which does not use the page cache due to unsolvable cache consistency issues) is still quite slow, with about 1.5s TTFB.  That only impacts me, and only a little bit, so I did not bother to try to make it better.  When rendering blog pages (the thing I want to work at 100 TPS), the DB and the EFS store have no reads – it’s all in CPU, backed by the page and opcode caches.  The stats are (with a Kotlin/JVM implementation of the same data for comparison, both normalized to 70% CPU):

Metric | Served from WordPress caches | Served from JVM caches
CPU | 70% on each of the two CPUs | 70% on each of the two CPUs
Memory (user space) | 1.5GB | 0.2GB
DB | negligible (write IOPS = 3/s, read IOPS = 0.5/s) | zero
EFS load | 7% IO bandwidth (this is actually meta-data reads, which are hard to suppress) | zero
Pages per second | 100 | 600

So while the WordPress performance numbers are OK, they are six times worse than a JVM solution.   As always, you also need good browser caching so the subsequent calls do not have to download the 200-odd assets required to render a page.  Similarly, lazy image loading is valuable for a good UX.
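For completeness, a sketch of the server-side pieces of this cache setup (W3 Total Cache configures the page, DB and object caches through its own UI; the paths assume Amazon Linux 2 and the opcache values are illustrative, not my tuned settings):

    # PHP opcode cache settings
    cat >/etc/php.d/10-opcache.ini <<'EOF'
    opcache.enable=1
    opcache.memory_consumption=192
    opcache.max_accelerated_files=20000
    EOF
    # Memcached backs the page/DB/object caches; cachefilesd caches EFS reads
    # (the EFS mount needs the NFS "fsc" option for FS-Cache to engage)
    yum install -y memcached cachefilesd
    systemctl enable --now memcached cachefilesd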

You need a lot of PHP and OS assets to host WordPress (25 yum downloads).  You could provide these as part of a custom machine image (AMI) that is installed along with the OS.  This would save download time.  Instead, I chose to run the downloads from a User data script after the build.  This adds about two minutes to the basic build time of around five minutes, but updating library versions is a bit simpler.  Either way, a really annoying part of the AWS hosting is that the standard AWS yum repo does not include a full set of PHP 8.2 libraries.  This means you need to use the remi repo, and that requires reconfiguring yum to make that repo the highest priority for all PHP assets.  I discovered the hard way that the AWS and remi versions of the 8.2 assets were incompatible.
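My remi setup was roughly as follows (the repo file name and priority mechanics are reconstructed from my notes, so treat this as a sketch and check your own repo layout):

    # Add the remi repo (Amazon Linux 2 is EL7-compatible) plus yum tooling
    yum install -y https://rpms.remirepo.net/enterprise/remi-release-7.rpm \
        yum-utils yum-plugin-priorities
    yum-config-manager --enable remi-php82
    # Give remi top priority so AWS's partial PHP 8.2 packages never mix in
    sed -i '/^\[remi-php82\]/a priority=1' /etc/yum.repos.d/remi-php82.repo
    yum install -y php php-mysqlnd php-opcache php-gd php-mbstring php-xml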

I want to touch on auto-scaling.   I found that one EC2 in each AZ will render about 60 pages per second, and that’s beyond what I think I need.  However, auto-scaling has three other plusses, so I would always recommend auto-scale, even if your design calls for only one or two instances:

  • When you are implementing the hosting, you will build and destroy many EC2 instances (my Launch Template is on version 15, and I tore down over 50 EC2s to get there).  With auto-scaling set with a maximum size of one, all you have to do is terminate your current instance, and autoscaling will automatically bring up a new one and add it to your load balancer.
  • In production, you would set auto-scaling to have a minimum size of two (one in each AZ for reliability), and AWS will then ping and restart any instance that dies.
  • Remember that in 2023 we don’t patch software.  To do a library or OS uplift, you change the values in the Launch Template and then get the auto-scaler to terminate the current instances and deploy new ones (a sketch of this follows the list).
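A sketch of that uplift flow (the template and group names and the AMI ID are placeholders):

    # Publish a new template version, then roll the ASG onto it
    aws ec2 create-launch-template-version --launch-template-name wp-template \
        --source-version '$Latest' --launch-template-data '{"ImageId":"ami-0new12345"}'
    aws autoscaling start-instance-refresh --auto-scaling-group-name wp-asg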

I used AWS Session Manager to provide shell access to the EC2 instances.  Traditionally, to access your EC2 hosts, you would use an SSH terminal.  That meant opening the SSH protocol into the public subnets, having a bastion host to access the private ones, and securing the keys in some manner.  That’s a bit of effort, and getting the security right is hard.  Instead, Session Manager gives direct access from an operator’s AWS login to a shell session on the EC2 instance.  This is authenticated with AWS 2FA and authorized using AWS IAM – it’s simpler to set up and easier to secure.  It does not provide file upload, but that’s really a yesterday approach compared to downloading from your Git or yum repo.
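Opening a session is a single call from an authenticated AWS CLI (the instance ID is a placeholder; the instance needs the SSM agent and an instance role):

    # Shell onto an instance with no SSH keys, bastions or open ports
    aws ssm start-session --target i-0abc1234567890def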

This is a good segue to Security Groups.  AWS does not really support inter-subnet firewalling.  Instead, a security group is a more powerful concept that applies to traffic into EC2 instances or AWS services (like EFS).  A security group is a stateful firewall that lets you control the network protocol in (or out) based on where the traffic originated (or is destined).  The origin could be an IP address or CIDR, but most usefully, it can be another security group.  For example, I set up the WP EC2 instances to accept port 80 traffic only from the load balancer’s security group, and the load balancer’s security group to accept only ports 80 and 443 from the Internet.  This tends to remove some of the need for traditional multi-tier deployments and works well with dynamic IP addresses and auto-scaling.
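A sketch of those two rules (the security-group IDs are placeholders):

    # WP instances accept port 80 only from the ALB's security group
    aws ec2 authorize-security-group-ingress --group-id sg-0wp12345 \
        --protocol tcp --port 80 --source-group sg-0alb12345
    # The ALB accepts 80 and 443 from the Internet
    aws ec2 authorize-security-group-ingress --group-id sg-0alb12345 \
        --protocol tcp --port 80 --cidr 0.0.0.0/0
    aws ec2 authorize-security-group-ingress --group-id sg-0alb12345 \
        --protocol tcp --port 443 --cidr 0.0.0.0/0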

How much does it cost?

This system costs about USD 90 per month in AWS fees.  Dedicated WordPress hosting starts at $3 per month and goes to around $100 per month for a business-grade service.  The AWS costs drop (to about half) if you pre-commit for a year.  The dedicated WP hosting sites also provide some professional WP support, whereas AWS only does that for the hosting component.

So this is very viable for a small-medium business but probably not a great choice for a personal blog (unless you are a software architect).

What is the CO2 footprint?

According to websitecarbon.com, this site uses 0.1g CO2 per page, which puts it in the best 10% of sites (the mean is 0.5g).  Few images, small images and highly static content help reduce emissions on the browser and network sides.  On the server side, we saw that the resource usage is six times that of a JVM implementation, which is not great.  I also note that WP really does not run on serverless (such as AWS Lambda).  (Update – it probably does now – see WordPress on AWS – review.)  That really impacts both money and CO2 for a micro-site.

Security

There is far too little security designed into WordPress.  Obviously, in the deployment of any software, security comes down to the intrinsic security of the software plus the strength of how you choose to deploy it.  I’m suggesting that in the case of WordPress, that balance puts the heavy lifting on you.  There are some good articles, such as WPBeginner, HubSpot, or Kinsta, that suggest around 20 changes to base WordPress that you need to make.  Then there are over 1000 plugins offering security features.  The fact that there is so much semi-overlapping advice, and so many band-aid plugins, is kind of worrying and indicative that after 19 years, it’s still an issue.

Now it’s clear that there are a dozen or so things that you just must do.  However, if you are looking at a formal threat model (e.g. STRIDE and DREAD), then there seems to be a lot you cannot say, as you don’t know what the software does – neither the core product nor the myriad plugins that are essential for it to work.  Is a plugin leaking PII under the guise of an innocuous outbound call, is an insecure plugin open to attack, or did the developer put in a support portal?  Another observation from the blogosphere is that security controls often break stuff.  That aligns with my experience.

So from a DREAD perspective, what are the risks and affected users?  As a first cut:

Risk | Affected | Impact | Additional mitigations
Site goes down | You and your customers | Reputation | Alarms and restore from snapshot.
Site vandalized | You and unknown others | Reputation and legal action | Daily scan the site manually?
Your site attacks other sites | You, competitors, customers, etc. | Reputation and legal action | Firewall off outbound traffic.
PII stolen | People | Legal and regulator action, with damages | ?

I cannot see a way to bolt on strong PII protection, given the database holds PII with at best TDE protection.  Based on that alone, it seems excessively risky to use WordPress for a site holding a lot of PII, such as a non-trivial e-commerce site.


Does this meet the architectural objectives?

I will summarize the extent to which we have met the six objectives I started out with.  I have written another article on this (WordPress on AWS – review), but the summary is:

Objective | Strength | Weakness | Rating (1 to 10)
Reliable | Highly immune to hosting failures, probably better than 99.99%. | Vulnerable to upgrade failures, which can be somewhat offset by higher maintenance costs.  Vulnerable to WordPress defects, where one instance hogs DB connections and thus causes an outage (happens about once per month). | 7
Performant | Good blog and home page rendering with auto scalability. | Pretty average page editing. | 8
Secure | Good platform security. | WordPress’s ecosystem is SOUP, with poor user authentication, hard to analyze and vulnerable PII. | 4
Maintainable | No regular tasks, good failure notifications. | The upgrade process is tedious, and logging is basic. | 8
Green | 0.1g CO2 per page. | Six times the server carbon footprint of a JVM implementation. | 9
AWS Well-Architected | Generally conforms. | Shortfalls in the security and reliability pillars. | 7

In summary, this deployment is probably the best you can get for deploying WordPress as it is intended to run.  To improve, you would have to start ring-fencing the WordPress upgrade and security functions.  That’s going to incur dev-ops costs and limit the system’s function.  My experience with COTS is to use it as it is intended: when you start pushing it in ways it does not want to go, your costs go up, the utility of the product goes down, and the fragility of the system rises.  So I would not go there.

I would recommend WP on AWS as a solution for a small-medium business for its content and blog delivery.  They would need the skill to spend maybe one day per month on basic system admin.  The proposition would be even better if they had other AWS hosting.  However, the potential for an outage, the lack of PII protection, and the unknown quality of the plugins would, in my view, make it the wrong choice for a major business or a modest e-commerce website.

Is the site done?  I just don’t know.  I want to drop password authentication in favour of OAuth to Google or Facebook.  The days of build-it-yourself password systems are over, even for a blog.  I’m sure there is a plugin to do this, and that is the joy of WordPress.  The blog works, the failure-mode tests pass, and it seems performant.  However, so much of this has been blind discovery, aided by dozens of blogs that vary from brilliant to banal.  If it broke tomorrow, I would not be surprised.  It’s really not a great way to architect software.


Appendix

WordPress extensions

These are my WordPress extensions.  They are enough to have a usable blog with adequate maintainability.

Astra – Theme to give overall styling.  This is the most common and best-supported theme.  The themes from WordPress itself (e.g. Twenty Twenty-Three) are considered undeveloped, and that was my experience.
Advanced Editor Tools – Extensions to the text editor to provide more basic functions.
Astra Pro – Plugin to make Astra work better (paid).  Gives lots more control without CSS.
Broken Link Checker – Obviously, you need something for this.
Classic Editor – WordPress is moving towards the Gutenberg editor, which seems not ready.  This editor works as expected.
Change wp-admin – Changes the login URL to confuse password attackers.
Health Check & Troubleshooting – Helps debug conflicts between plugins.  It’s essential.
Query Monitor – Makes it possible to debug DB issues.
Site Kit by Google – Shows Google Analytics in the WordPress UI.
W3 Total Cache – Manages all the caches necessary to get OK performance.
WP Activity Log – Logs the activity on the site.

After it was meant to be done

This is a log of all the issues that happened after I thought it was done.

24th Jan 2023 – Updated the Site Kit version and the Google Analytics dashboard stopped working.  (Severity – low)

  • Support says this happens sometimes, usually due to plugin interactions.
  • Disabled all other plugins and it started working.  Added them back one by one and it still worked.  A couple of hours later it was not working again.
  • 1st Feb 2023 – Found the problem was an interaction with the Query Monitor plugin.  Disabled it and Site Kit worked.

Status: Fixed (with workaround).

25th Jan 2023 – Site health error “Your site could not complete a loopback request”.  (Severity – low?)

  • Could not find any doc on what the loopback call actually is.  Looking at ALB logs, it seems to be a call to wp-cron.php, which I had blocked in the ALB for security.
  • Temporarily opened wp-cron.php and it’s solved.  Reblocked, and I need a better solution.

Status: Open.

25th Jan 2023 – Uploaded a large file, which caused 90% CPU load, which triggered an ALB health check error, which triggered an autoscale terminate and restart.  (Severity – low)

  • I need the ALB check to be fast for good UX, and dropping the loaded EC2 from the LB pool was good.  I backed off the autoscale health checks, so that terminate/launch is a slower response.

Status: Fixed, awaiting confirmation.

1st Feb 2023 – Scan shows vulnerabilities.

  • WP 6.1.1 has a blind SSRF vulnerability using pingback.
  • Astra CSS allows a directory scan.
  • Exposes the Apache and PHP versions.
  • Already disabled xmlrpc.php in the ALB, so not vulnerable, though it still shows in a PEN test.
  • Added Options -Indexes to .htaccess.
  • Added ServerTokens Prod to httpd.conf.

Status: Closed.

7th Feb 2023 – Synthetic monitoring SMS message showed an outage.  Monitoring showed one CPU at 99% load with 50 DB connections; the other CPU cannot get connections – hence the outage.  No excess IO, DB, or EFS load.  (Severity – medium)

  • No obvious cause.  Logs show the Google Site Kit token was refreshed around this time.
  • Manually terminated the overloaded instance, and service was immediately restored.  5 min outage.
  • Autoscaler replaced the instance.
  • Todo – prevent one instance using all the DB connections.  Terminate an instance under high CPU load, and make sure the “DB not available” page shows as an error to the LB, autoscaler and synthetics.
  • 13/2/23 – Set the per-process connection limit in php.ini to 10.  This should prevent one aberrant process DOSing the DB.  (History shows this did not work.)

Status: Maybe fixed, waiting confirmation; later marked Open.

10th Feb 2023 – Same as above.  Status: Open.

13th Feb 2023 – Found that Total Cache added an advertising footer to the HTML that leaked sensitive configuration information.

  • Found a fix by editing the theme’s functions.php file.  Sadly, this fix looks like it will go when I update the theme.

Status: Closed.

3rd March 2023 – Same as above.  Status: Open.

