Bahmni Performance Testing Journey (High-Level Summary)
Goals
To publish baseline reports and quantify the advantage of upgrading to new software components.
To publish capacity-planning reports and be able to predict per-facility cloud hardware running costs under different hardware configurations.
To publish a roadmap for the next set of experiments or features that promise to considerably improve performance (or reduce per-facility costs).
To integrate performance test runs with Bahmni deployments and make them available for anyone in the community to modify, run, and benchmark against their own Bahmni deployments.
More details here
Performance Test Plan Strategy
Test Strategy
The strategy mainly focuses on realistic stress testing of the Bahmni LITE environment while maintaining the following criteria:
To have pauses between user interactions, maintaining breathing time for each persona.
To share the overall load across personas based on their breathing time.
To have a ramp-up and ramp-down of users at the beginning and end of each test.
To start each test with a pre-created set of patients so that doctors can begin consultations from the start of the test.
To maintain a seamless handover between the patient registration and consultation scenarios.
To have a hard stop time at the end of the test to control the overall test duration (see the sketch after this list).
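As an illustration only, these criteria map naturally onto Gatling concepts: pauses for breathing time, an open injection profile for ramp-up, and maxDuration for the hard stop. The endpoints, user counts, and durations below are placeholders, not the values used in the actual suite, which lives in the test repository.

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

// Minimal sketch of the strategy, assuming Gatling 3: breathing-time pauses,
// ramp-up at the start, and a hard stop bounding the run.
class StrategySketch extends Simulation {

  val httpProtocol = http.baseUrl("https://bahmni.example.org") // placeholder URL

  val registration = scenario("New Patient - Registration - Start OPD Visit")
    .exec(http("register patient").post("/openmrs/ws/rest/v1/patient")) // placeholder endpoint
    .pause(5.seconds, 15.seconds) // breathing time between interactions
    .exec(http("start OPD visit").post("/openmrs/ws/rest/v1/visit"))

  setUp(
    registration.inject(
      rampUsers(30).during(5.minutes),            // ramp-up at the beginning
      constantUsersPerSec(0.5).during(50.minutes) // steady load; users ramp down as they finish
    )
  ).protocols(httpProtocol)
   .maxDuration(60.minutes) // hard stop to control the overall test duration
}
```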
More details here
Test Scenarios
The performance test suite has the following test scenarios (see the sketch after this list):
New Patient - Registration - Start OPD Visit
Existing Patient - Patient Search - Start OPD Visit
Upload Patient Document
Doctor Consultation and Observations Flow
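As a sketch only, these four flows could be modelled as separate Gatling populations injected in parallel, each carrying its own share of the load; the endpoints and user counts below are placeholders, and the real request chains live in the test repository.

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

// Sketch: the four flows as parallel populations, each with its own load share.
class ScenarioMixSketch extends Simulation {

  val httpProtocol = http.baseUrl("https://bahmni.example.org") // placeholder URL

  val newPatient   = scenario("New Patient - Registration - Start OPD Visit")
    .exec(http("register").post("/register"))   // placeholder endpoints throughout
  val existing     = scenario("Existing Patient - Patient Search - Start OPD Visit")
    .exec(http("search").get("/search"))
  val uploadDoc    = scenario("Upload Patient Document")
    .exec(http("upload").post("/upload"))
  val consultation = scenario("Doctor Consultation and Observations Flow")
    .exec(http("consult").post("/consult"))

  setUp(
    newPatient.inject(rampUsers(20).during(2.minutes)),
    existing.inject(rampUsers(30).during(2.minutes)),
    uploadDoc.inject(rampUsers(10).during(2.minutes)),
    consultation.inject(rampUsers(30).during(2.minutes))
  ).protocols(httpProtocol)
}
```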
More details here
Infra Setup
The Performance Test environment runs on Kubernetes on AWS.
A separate namespace is created with a Bahmni Kubernetes installation.
The existing RDS instance is shared with the performance namespace.
For monitoring, Grafana and a JVM dashboard are added.
More details here
Required Software
Code repository
Archive Report Path - GH Pages
Test Execution Steps
Clone all the repositories.
Use Wondershaper to throttle network speeds, only if needed.
Run the test data generator to create and upload new patients.
Copy the registrations.csv file from /output to /src/gatling/resources.
Start the test by providing the simulation type, number of users, and duration of the test.
To run the test against different environments, update the respective env properties in
src/gatling/scala/configurations/protocols.scala and src/gatling/scala/api/constants.scala (a sketch follows these steps).
To run the test in the cloud, use the trigger in GitHub Actions.
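A minimal sketch of what those properties might look like, assuming they are passed as JVM system properties; the actual property names, defaults, and file structure are defined in the repository and may differ.

```scala
// src/gatling/scala/configurations/protocols.scala (sketch; actual contents may differ)
package configurations

import io.gatling.core.Predef._
import io.gatling.http.Predef._

object Protocols {
  // Override per environment, e.g. -DbaseUrl=https://perf.example.org
  val baseUrl: String = System.getProperty("baseUrl", "https://localhost")

  val httpProtocol = http
    .baseUrl(baseUrl)
    .acceptHeader("application/json")
}

// Run parameters supplied on the command line (sketch), e.g.
//   -DsimulationType=stress -Dusers=90 -Dduration=60
object RunParameters {
  val simulationType: String = System.getProperty("simulationType", "stress")
  val users: Int             = Integer.getInteger("users", 10)
  val durationMinutes: Int   = Integer.getInteger("duration", 30)
}
```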
Java Profiling
We used the YourKit profiling tool to profile the JVM during performance test runs.
It helped in analysing CPU and memory utilisation, troubleshooting code that slows down API responses, locating possible deadlocks, and so on.
Instructions for setting up YourKit on a remote machine can be found here.
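As a rough pointer only: YourKit profiles a remote JVM by loading its native agent at startup, typically via a JVM option of the form -agentpath:<yourkit-dir>/bin/linux-x86-64/libyjpagent.so=port=10001,listen=all. The exact library path and agent options depend on the installation, so follow the linked guide.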
Findings & Remediation
📗 Baseline Test Observations
By default, OpenMRS runs with the JVM's default memory management, which is not optimal for applications with large memory footprints. So we moved to CMS (Concurrent Mark Sweep), which gave us lower GC pause times and higher throughput for minimal patient data - BAH-2660.
We also configured the minimum and maximum heap sizes and the number of parallel GC threads (standard HotSpot flags, noted after this list).
This change reduced the maximum time taken by the POST API call that saves encounters, in a 90-user test run, from 4149 ms to 1551 ms.
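For reference, these are standard HotSpot options rather than anything Bahmni-specific: CMS is enabled with -XX:+UseConcMarkSweepGC, the heap bounds are set with -Xms and -Xmx, and the collector thread count with -XX:ParallelGCThreads=<n>; the exact values used in these runs are not reproduced here.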
More details about the baseline test reports can be found here
📗 Long Duration Test Observations (24-hour test runs)
Saving the consultation page took more time due to a Groovy parse-class function; disabling this function reduced the response time of a single API call from 2.5 s to 1 s - BAH-2870.
The HIP health-check module was pinging the OpenMRS patient and visit APIs every 5 seconds, constantly bringing the environment down with Out-of-Memory exceptions whenever the patient count reached 125k - BAH-2441, BAH-2783 (this was fixed). The fix reduced the maximum time taken by the POST API call that saves encounters, in a 70-user test run, from 60 s to 4 s.
The HIP and Crater atomfeed clients were also pinging OpenMRS to query the event feeds, causing high GC pauses that in turn spiked CPU utilization - BAH-2801, BAH-2912.
Updating the GC strategy from CMS to G1GC helped control the CPU spikes (flag noted after this list).
With HIP and Crater atomfeed disabled and the updated G1GC settings in place, the 99th-percentile response time dropped to 1.5 s.
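For reference, G1GC is switched on with the standard -XX:+UseG1GC flag, often combined with a pause-time target such as -XX:MaxGCPauseMillis=<ms>; the specific settings used in these runs are covered in the baselining page linked below.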
More details about the JVM configurations, infra setup, and long-duration test runs can be read here: Bahmni Lite Performance Long Duration Simulation Baselining
Bahmni LITE Cost Estimates (Projected)
Based on the long-duration test results and the corresponding AWS utilization bills, we have come up with a cost calculator. Link: Bahmni LITE Infra-cost estimates (based on Performance testing)
Anyone can create a cloud cost estimate for setting up Bahmni LITE by providing the number of users, users per clinic, and operational hours.
The assumed load pattern follows our test suite. If the operations performed at your facility differ from the scenarios in the test suite, the results won't match as-is. Please review the test scenarios to get a better understanding of the performance work done by the team.
Future Recommendations for Performance Testing
Troubleshoot / Improvement stories
Test the environment with multitenancy.
Update the test suite with the latest changes in the application - BAH-2903.
Reduce the impact of HIP and Crater atomfeed on OpenMRS - BAH-2948.
Optimize the API response time - BAH-2871, BAH-2890, BAH-2891, BAH-2892, BAH-2893.
Optimize the application memory management - BAH-2949.
Optimize the duplicate SQL queries - BAH-2716.
Backlog stories list.
The Bahmni documentation is licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)