Performance and Load Testing a New Service
What I did and how I would approach it in future
Last year I worked on a project to replace an existing service with a newer one. One phase of this involved performance testing the new service to validate that it:
- Had better performance and stability than the old service; and
- Could scale to handle increasing traffic in the future
What is performance testing?
Performance testing is "determining how a system performs in terms of responsiveness and stability under a [reasonable] workload". A load test (or stress test) is a variant of this where the simulated load is pushed towards the upper bound of what the system can handle without breaking it.
It usually involves using a performance testing library to write tests that execute requests against the application. The load/traffic level can be configured through various parameters.
Performance testing can be used in other ways too. For example, it could be integrated into a CI/CD pipeline as an automated sanity check to ensure the system remains stable as new changes are pushed.
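As a rough illustration of that idea (using Locust, the Python library I cover later in this post), a load test can fail the pipeline when the error rate or latency crosses a threshold. The endpoint, thresholds and values below are illustrative assumptions rather than anything from the project:

```python
# locustfile.py - illustrative CI sanity check (endpoint and thresholds are made up)
from locust import HttpUser, task, events


class SmokeUser(HttpUser):
    @task
    def health_check(self):
        # Hypothetical endpoint; replace with a representative request
        self.client.get("/health")


@events.quitting.add_listener
def enforce_thresholds(environment, **kwargs):
    """Give the CI job a non-zero exit code if the run breached basic thresholds."""
    stats = environment.stats.total
    if stats.fail_ratio > 0.01:
        environment.process_exit_code = 1  # more than 1% of requests failed
    elif stats.get_response_time_percentile(0.95) > 500:
        environment.process_exit_code = 1  # p95 latency above 500 ms
```

Run headless in the pipeline (e.g. `locust -f locustfile.py --headless -u 10 -r 2 -t 1m --host https://staging.example`) and a breached threshold fails the job.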
This was a new experience for me and I learnt a lot (with guidance from my mentors), so I wanted to document the process I took.
1. Establishing the requirements
We knew why we wanted to run these tests, so the next step was to establish how they would work.
The must-have requirements were that it could:
- Be used to test against different types of API protocols for wider coverage
- Be run in an environment as similar to production as possible - including checking that the specs and resources allocated to the application and database are similar
The stretch requirements of the performance tests were:
- To be able to spin up a separate test environment so that the tests didn't impact existing environments
- To make it simple to configure and run e.g. via the GitLab job UI instead of needing to execute it via command line
2. Establishing metrics to capture
The primary metrics, aimed at measuring performance, were (see the sketch at the end of this section for how they can be derived):
- Throughput (Requests Per Second or RPS)
- Latency (response time in percentiles e.g. P90, P99, P99.9)
- Error rate
The secondary metrics, aimed at gauging stability, were:
- Throughput of any incidental events (e.g. Kafka or other message queues) and downstream lag
- Application memory usage
- Database stability e.g. CPU utilisation, I/O, slow queries
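Most load testing tools (including Locust) report the primary metrics out of the box, but as a concrete illustration of what they mean, they can all be derived from a log of individual requests. The numbers below are made up:

```python
# Illustrative only: deriving throughput, latency percentiles and error rate
# from (latency_ms, succeeded) samples collected over a test window.
from statistics import quantiles

samples = [(120, True), (95, True), (300, False), (110, True)] * 50  # made-up data
duration_s = 60  # length of the measurement window in seconds

throughput_rps = len(samples) / duration_s
error_rate = sum(1 for _, ok in samples if not ok) / len(samples)

latencies = sorted(ms for ms, _ in samples)
cuts = quantiles(latencies, n=100)  # 99 cut points; index 89 ~ P90, index 98 ~ P99
p90, p99 = cuts[89], cuts[98]

print(f"{throughput_rps:.1f} req/s, P90={p90:.0f}ms, P99={p99:.0f}ms, errors={error_rate:.1%}")
```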
3. Establishing variables to test against
There were certain dimensions I tested against to address possible risks. This meant running each test scenario multiple times with these variables tuned differently whilst keeping all other variables constant.
- Number of application pods running. At what point does the database become the bottleneck as the application scales up? Testing against different configurations of this lets us know how the database would perform if the application is ever scaled down or up.
- Variables affecting concurrency. If there is any contention (e.g. database locking), how is performance affected by the number of resources (such as accounts) being operated on concurrently? (See the sketch after this list.)
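In effect this meant running a small matrix of test runs. The sketch below is a simplified, hypothetical version of how that matrix could be laid out - the pod counts and account counts are illustrative, not the values we actually used:

```python
# Hypothetical test matrix: each run varies one dimension while the
# test scenario itself stays the same.
from itertools import product

pod_counts = [2, 4, 8]              # application replicas (changed on the deployment, not in the test)
concurrent_accounts = [1, 10, 100]  # how many distinct accounts requests are spread across

for pods, accounts in product(pod_counts, concurrent_accounts):
    # In practice: scale the deployment to `pods`, point the load test at a pool
    # of `accounts` test accounts, run it, and record the metrics for this run.
    print(f"Run: pods={pods}, accounts={accounts}")
```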
4. Establishing test scenarios
This took a bit of experimentation to find the best approach, but in retrospect here are some scenarios I would aim to cover in future:
- Testing different APIs with different responsibilities (e.g. are there differences in the performance across the read and write APIs?)
- Testing different types of API protocols (e.g. does a JSON-RPC API perform differently compared to a REST or gRPC API?)
- Testing different requests which model real requests a client might make with varying complexities such as:
- A request that would result in a simple code flow; and
- A request that would result in a very complex code flow (especially if it has a lot of interactions with the database or other services, locking etc.)
- Having test scenarios that deliberately invoke an unhappy path occasionally, to see if the application can handle and recover from failure cases efficiently (see the sketch after this list)
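As a rough sketch of how the last two points could be modelled in Locust (the endpoints, payloads and task weights here are hypothetical, not from the real test suite):

```python
from locust import HttpUser, task, between


class ClientUser(HttpUser):
    wait_time = between(0.5, 2)  # think time between requests

    @task(8)
    def simple_read(self):
        # Simple code flow: a cheap read that touches little of the system
        self.client.get("/accounts/123")

    @task(2)
    def complex_write(self):
        # Complex code flow: a write that fans out to the database and other services
        self.client.post("/transfers", json={"from": "123", "to": "456", "amount": 10})

    @task(1)
    def unhappy_path(self):
        # Deliberately invalid request to check the service degrades gracefully
        with self.client.post("/transfers", json={}, catch_response=True) as resp:
            if resp.status_code == 400:
                resp.success()  # a 400 here is the expected, handled failure
```

The task weights control the mix: mostly simple reads, some complex writes, and the occasional deliberate failure.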
5. Writing and executing the tests
In my case I used the Python library Locust. When executing the tests, the main parameters I had to configure were:
- Users: The maximum number of concurrent Locust users making requests. Each user simulates an individual client making requests against the API
- Spawn rate: The number of users spawned per second - the 'ramp-up' rate; spawning stops once the maximum number of users specified above is reached
- Run time: How long the tests run for
Other performance testing libraries may have similar parameters.
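To make the relationship between those parameters concrete, here's a tiny illustration with made-up values (the equivalent headless invocation would be roughly `locust -f locustfile.py --headless -u 200 -r 10 -t 10m`):

```python
# Illustrative ramp-up arithmetic for the parameters above (values are made up)
users = 200        # maximum number of concurrent simulated users
spawn_rate = 10    # users started per second
run_time_s = 600   # total run time in seconds

ramp_up_s = users / spawn_rate           # 20s until full load is reached
steady_state_s = run_time_s - ramp_up_s  # time spent at the full 200 users
print(f"Full load after {ramp_up_s:.0f}s; roughly {steady_state_s:.0f}s at steady state")
```

It's worth making sure the run time leaves plenty of time at full load, otherwise the results are dominated by the ramp-up period.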
6. Evaluating the results
The questions to answer were:
- Do the results answer our original questions or hypotheses?
- Do the results help us assess the risks we sought to address?
- Do the results make sense?
- For example, one of my earlier test runs showed unusually high throughput. It turned out that the way the database was configured in that particular environment artificially inflated the application's performance.
- We can also look at the mathematical relationship between throughput and latency to sense-check the results (see the sketch after this list)
- Can we explain any differences in results across test scenarios?
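One such relationship is Little's Law: the average number of requests in flight equals throughput multiplied by average latency. If the concurrency implied by the results is wildly different from what the test was configured with, something is usually off. The figures below are made up purely for illustration:

```python
# Little's Law sense-check: in-flight requests ≈ throughput (req/s) × average latency (s)
throughput_rps = 500      # reported requests per second (illustrative)
avg_latency_s = 0.080     # reported average latency of 80 ms (illustrative)
configured_users = 50     # concurrent users the test was run with

implied_concurrency = throughput_rps * avg_latency_s  # ≈ 40 requests in flight
print(f"Implied in-flight requests: {implied_concurrency:.0f} (test ran with {configured_users} users)")
# A large gap can point at think time between requests, queuing outside the
# service, or an environment quirk inflating or deflating the numbers.
```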
7. Presenting the results
I found it beneficial to summarise the results so that they're tailored to be digestible for the given stakeholders, rather than presenting the raw datasets. Important information included:
- Summarising the key insights or trends clearly - it's important to address the implications across changed variables and different test scenarios
- Explaining the differences in results across scenarios
- Highlighting any areas of concern