Performance and Load Testing a New Service

What I did and how I would approach it in future

Last year I worked on a project migrating and replacing a service with a newer one. One phase of this involved performance testing the new service to validate that it:

  1. Had better performance and stability than the old service, and
  2. Could scale to handle increasing traffic in the future

What is performance testing?

Performance testing is "determining how a system performs in terms of responsiveness and stability under a [reasonable] workload". A load test (or stress test) is a variant of this where the simulated load is pushed towards the upper bound of what the system can handle without breaking it.

It usually involves using a performance testing library to write tests that execute requests against the application. The load/traffic level can be configured through various parameters.
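
As a rough illustration, such a test can be a small script defining a simulated user and the requests it makes. Below is a minimal sketch using Locust (the Python library I ended up using later on); the endpoint is a purely hypothetical placeholder:

    from locust import HttpUser, task, between

    class ApiUser(HttpUser):
        # Each simulated user waits 1-3 seconds between requests
        wait_time = between(1, 3)

        @task
        def read_account(self):
            self.client.get("/accounts/123")  # placeholder endpoint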

Performance testing can be used in other ways too. For example, it could be integrated as part of a CI/CD pipeline as an automated sanity check to ensure the system is still stable with any new changes being pushed.
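
As a sketch of how that could work with Locust specifically, the process exit code can be set from thresholds in the test file, so a CI job fails when a run breaches them. The threshold values below are arbitrary examples:

    from locust import events

    @events.quitting.add_listener
    def _(environment, **kwargs):
        # Fail the CI job if the run had too many errors or was too slow
        stats = environment.stats.total
        if stats.fail_ratio > 0.01:                           # more than 1% errors
            environment.process_exit_code = 1
        elif stats.get_response_time_percentile(0.99) > 500:  # P99 above 500 ms
            environment.process_exit_code = 1
        else:
            environment.process_exit_code = 0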

This was a new experience for me, and I learned a lot (with guidance from my mentors), so I wanted to document the process I took.

1. Establishing the requirements

We had established why we wanted to run these tests, so the next step was to define how they would work.

The must-have requirements were that the tests could:

  • Be used to test against different types of API protocols for wider coverage
  • Be run in an environment as similar to production as possible - including checking that the specs and resources allocated to the application and database are similar

The stretch requirements of the performance tests were:

  • To be able to spin up its own separate test environment so that it didn't impact existing environments
  • To be simple to configure and run, e.g. via the GitLab job UI instead of needing to execute it via the command line

2. Establishing metrics to capture

The primary metrics I wanted to measure, aimed at gauging performance, were:

  1. Throughput (Requests Per Second or RPS)
  2. Latency (response time in percentiles e.g. P90, P99, P99.9)
  3. Error rate
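
To make these concrete, here is a rough sketch (with made-up numbers) of how they could be derived from raw per-request results - in practice a tool like Locust reports them for you:

    # Hypothetical raw results: (response_time_ms, succeeded) per request
    results = [(42, True), (55, True), (130, False), (61, True)]
    duration_seconds = 60  # made-up test duration

    def percentile(sorted_values, p):
        # Nearest-rank percentile over a sorted list of response times
        index = max(0, int(round(p / 100 * len(sorted_values))) - 1)
        return sorted_values[index]

    times = sorted(ms for ms, _ in results)
    throughput_rps = len(results) / duration_seconds              # 1. Throughput
    p90, p99 = percentile(times, 90), percentile(times, 99)       # 2. Latency
    error_rate = sum(not ok for _, ok in results) / len(results)  # 3. Error rate

    print(f"RPS={throughput_rps:.2f} P90={p90}ms P99={p99}ms errors={error_rate:.1%}")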

The secondary metrics I measured, aimed at gauging stability, were:

  1. Any incidental event throughput (e.g. Kafka or other message queues) and downstream lag
  2. Application memory usage
  3. Database stability e.g. CPU utilisation, I/O, slow queries

3. Establishing variables to test against

There were certain dimensions I tested against to address possible risks. This meant running each test scenario multiple times with these variables tuned differently whilst keeping all other variables constant.

  1. Number of application pods running. At what point does the database become the bottleneck as the application scales up? Testing against different configurations of this lets us know how the database would perform if the application is ever scaled down or up.
  2. Variables affecting concurrency. If there is any contention (e.g. database locking) how is performance affected by the number of resources (such as accounts) being operated on concurrently?
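
As an example of tuning the second variable, the pool of resources the test operates on can be read from an environment variable, so the same scenario can be re-run with more or less contention. The variable name and endpoint below are hypothetical:

    # Sketch: a smaller account pool means more concurrent operations hit the
    # same account, so contention (e.g. database row locking) increases.
    import os
    import random

    from locust import HttpUser, task, between

    ACCOUNT_POOL_SIZE = int(os.environ.get("ACCOUNT_POOL_SIZE", "100"))  # hypothetical knob

    class AccountUser(HttpUser):
        wait_time = between(1, 2)

        @task
        def update_account(self):
            account_id = random.randint(1, ACCOUNT_POOL_SIZE)
            self.client.post(f"/accounts/{account_id}", json={"note": "load test"})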

4. Establishing test scenarios

This took a bit of experimentation to work out the best approach, but in retrospect here are some different scenarios I would consider covering in the future (a sketch combining a few of them follows this list):

  • Testing different APIs with different responsibilities (e.g. are there differences in the performance across the read and write APIs?)
  • Testing different types of API protocols (e.g. does a JSON-RPC API perform differently compared to a REST or gRPC API?)
  • Testing different requests which model real requests a client might make with varying complexities such as:
    • A request that would result in a simple code flow, and
    • A request that would result in a very complex code flow (especially if it has a lot of interactions with the database or other services, locking etc.)
  • Having test scenarios that deliberately invoke an unhappy path occasionally to see if the application can handle and recover from failure cases efficiently
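
The sketch below shows how some of these could be combined in a single Locust scenario: mostly simple reads, some complex writes, and the occasional deliberate unhappy path. The endpoints and weights are illustrative assumptions, not the real service's API:

    from locust import HttpUser, task, between

    class MixedScenarioUser(HttpUser):
        wait_time = between(1, 3)

        @task(10)
        def simple_read(self):
            # Simple code flow: a straightforward lookup
            self.client.get("/accounts/123")

        @task(3)
        def complex_write(self):
            # Complex code flow: touches the database and downstream services
            self.client.post("/transfers", json={"from": "123", "to": "456", "amount": 10})

        @task(1)
        def unhappy_path(self):
            # Deliberately request something that doesn't exist; treat the
            # expected 404 as a pass so it doesn't inflate the error rate
            with self.client.get("/accounts/unknown", catch_response=True) as response:
                if response.status_code == 404:
                    response.success()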

5. Writing and executing the tests

In my case I used the Python library Locust. When executing the tests, the main parameters I had to configure were:

  1. Users: The maximum number of concurrent Locust users making requests. Each user represents a simulated client making requests against the API
  2. Spawn rate: The rate at which users are spawned per second - this is the 'ramp-up' rate; spawning stops once the maximum number of users specified above is reached
  3. Run time: How long the tests run for

Other performance testing libraries may have similar parameters.
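
For reference, these three parameters map directly onto flags when running Locust headless from the command line (the host and values here are illustrative):

    # users = max concurrent users, spawn-rate = users started per second,
    # run-time = total test duration
    locust -f locustfile.py --headless --users 100 --spawn-rate 10 --run-time 15m --host https://my-service.test.internal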

6. Evaluating the results

The questions to answer were:

  1. Do the results answer our original questions and test our hypotheses?
  2. Do the results help us assess the risks we sought to address?
  3. Do the results make sense?
    • For example, one of my earlier test runs showed unusually high throughput. It turned out that the way the database was configured in that particular environment inflated the application's apparent performance.
    • We can also look at the mathematical relationship between throughput and latency to sense-check the results (see the sketch after this list)
  4. Can we explain any differences in results across test scenarios?
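
On the throughput/latency relationship mentioned above: with a fixed number of concurrent users, Little's Law puts a rough upper bound on throughput, which makes for a quick sanity check. A sketch with made-up numbers:

    # Little's Law: concurrency = throughput x latency, so with N users each
    # issuing one request at a time, throughput <= N / average latency.
    concurrent_users = 100        # made-up example
    avg_latency_seconds = 0.050   # 50 ms average response time (made-up)

    max_expected_rps = concurrent_users / avg_latency_seconds
    print(f"Throughput upper bound: ~{max_expected_rps:.0f} RPS")
    # A measured RPS well above this bound suggests something is off with the
    # test setup or the measurement.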

7. Presenting the results

I found it beneficial to summarise the results so that they're tailored to be digestible for the given stakeholders, rather than presenting the raw datasets. Important information included:

  1. Summarising the key insights or trends clearly - it's important to address the implications across changed variables and different test scenarios
  2. Explaining the differences in results across scenarios
  3. Highlighting any areas of concern
