Creating a robust deployment

Scenarios

Database IDs may overflow over time, or the ID sequence may develop unnecessary gaps
The system leaks memory slowly and has to be restarted
The system becomes slower over time and has to be restarted
The system becomes slower as the database grows
Individual services run out of threads
Individual services run out of the database connections allocated to them
The system does not report errors
The system uses more disk space than it should
After a machine restart, the services do not come back up
After a service restarts, it does not become available to serve requests
Active-to-passive data replication does not happen
Automatic switch-over from active to passive does not work
Automatic switch back from passive to active does not work
Database connection becomes stale if not used for some time
Failed events do not resolve on their own (including after services are restarted)
A very large file is uploaded, causing the server to become less responsive
A redirect loop causes denial of service (low priority)
An inefficient report slows down or disrupts production operations
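To put numbers on the first scenario, the sketch below estimates how many days remain before an auto-increment ID column overflows, and what fraction of the ID range has been consumed without a matching row. It assumes the current maximum ID, the row count and the daily growth rate have already been measured (for example with SELECT MAX(id), COUNT(*) on the table). The function names are hypothetical; the limits are those of MySQL's signed INT and BIGINT column types.

```python
# Upper bounds of common auto-increment column types (MySQL integer types).
INT_SIGNED_MAX = 2**31 - 1      # 2,147,483,647
INT_UNSIGNED_MAX = 2**32 - 1    # 4,294,967,295
BIGINT_SIGNED_MAX = 2**63 - 1


def days_until_overflow(current_max_id: int, ids_per_day: float,
                        column_max: int = INT_SIGNED_MAX) -> float:
    """Estimate the days left before new IDs exceed the column's maximum."""
    if ids_per_day <= 0:
        return float("inf")  # no growth, so no projected overflow
    return (column_max - current_max_id) / ids_per_day


def gap_ratio(current_max_id: int, row_count: int) -> float:
    """Fraction of the used ID range with no row behind it.

    Assumes IDs start at 1. Deleted rows also count as gaps here, so a high
    ratio means "investigate", not necessarily "misconfigured sequence".
    """
    if current_max_id <= 0:
        return 0.0
    return (current_max_id - row_count) / current_max_id
```

For example, a signed INT column already at ID 2,000,000,000 and growing by one million IDs a day has roughly 147 days left, which is the kind of figure that should be reported well before it becomes an outage.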

Manual Testing
Automated Testing

Automation

The above scenarios need to be tested manually or in an automated fashion.
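Several scenarios (machine restart, service restart) reduce to one automatable question: does the service become available again within an acceptable time? The sketch below polls a health check until it succeeds or a timeout expires. The function names are hypothetical, and what counts as "available" (here, an HTTP 2xx response) is an assumption to adapt per service.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Treat any 2xx response as 'available'; urlopen raises on 4xx/5xx."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return 200 <= resp.status < 300


def wait_until_available(check: Callable[[], bool], timeout_s: float,
                         interval_s: float = 1.0) -> bool:
    """Poll `check` until it returns True or `timeout_s` seconds pass."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if check():
                return True
        except Exception:
            # Refused connections, timeouts, HTTP errors: not up yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

In a restart test, one would restart the service and then assert something like `wait_until_available(lambda: http_ok(url), timeout_s=300)`, where `url` and the 300-second budget are placeholders for the deployment's own values.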

Environment

Functional Tests

Set up an environment with enough disk space, CPU and memory. Ensure that the functional tests cover all the basic scenarios. Run these tests continuously for days.
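A multi-day run is only useful if someone notices gradual degradation. One simple approach, sketched below under the assumption that each test pass records a duration (or a memory sample), is to fit a least-squares line through the samples and flag the run if the trend grows by more than a chosen fraction of the mean. The 10% default is an arbitrary placeholder, and the function names are hypothetical.

```python
from typing import Sequence


def slope_per_run(samples: Sequence[float]) -> float:
    """Least-squares slope of the samples against their run index."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_degrading(samples: Sequence[float], max_growth: float = 0.10) -> bool:
    """True if the fitted trend rises over the whole run by more than
    `max_growth` times the mean sample value."""
    if len(samples) < 2:
        return False
    mean_y = sum(samples) / len(samples)
    growth = slope_per_run(samples) * (len(samples) - 1)
    return growth > max_growth * mean_y
```

Fitting a trend rather than comparing the first and last samples makes the check less sensitive to a single slow run; feeding it periodic memory readings instead of durations gives a rough detector for slow leaks.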

Environment Configuration

The system should be configured so that, after it raises a warning, it continues to work for some time. For example, if OpenMRS is running out of threads, the warning should arrive while it can still serve requests.

Connection Pool Size (min, max, increment)
- Each service should ideally use only one connection pool
- min=5, increment=1
- max = depends on the size of the deployment
Thread Pool Size
- The maximum size of the thread pool should not be very high (e.g. at JSS, a thread pool size of 100 for OpenMRS and 20 for OpenELIS should be enough)
Failed events size
- This size should be 10, so that problems get reported immediately
- While fixing an issue in production, one may temporarily increase this size
Database
- Maximum number of connections: this also covers connections used for ad hoc access, so keep it slightly higher than the connection pool size given to the application
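The pool and database rules above can be checked mechanically. The sketch below flags pools that deviate from the recommended min=5 and increment=1, and computes the smallest database maximum-connections value that fits every pool targeting that database plus an ad hoc reserve. The PoolConfig structure, the function names and the reserve size are assumptions for illustration; reading the actual values from each service's configuration is left out.

```python
from dataclasses import dataclass


@dataclass
class PoolConfig:
    service: str    # e.g. "OpenMRS Application"
    database: str   # database this pool connects to
    min_size: int
    max_size: int
    increment: int


def check_pool(pool: PoolConfig) -> list[str]:
    """Return deviations from the recommended pool settings."""
    problems = []
    if pool.min_size != 5:
        problems.append(f"{pool.service}: min={pool.min_size}, recommended 5")
    if pool.increment != 1:
        problems.append(f"{pool.service}: increment={pool.increment}, recommended 1")
    if pool.max_size < pool.min_size:
        problems.append(f"{pool.service}: max={pool.max_size} is below min={pool.min_size}")
    return problems


def required_max_connections(pools: list[PoolConfig], database: str,
                             adhoc_reserve: int) -> int:
    """Smallest DB connection limit that fits every pool plus ad hoc sessions."""
    return sum(p.max_size for p in pools if p.database == database) + adhoc_reserve
```

With hypothetical pools of max 40 (OpenMRS application) and max 10 (Dynamic Reports) on the same database and a reserve of 10 ad hoc sessions, the database limit should be at least 60.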

Connection Pool Size

Sub-System	Service	Pool Name
OpenMRS	Application	default
OpenMRS	Dynamic Reports	default
OpenELIS
OpenERP
OpenERP	Atom Feed Service
Jasper Reports

Thread Pool Size

Sub-System	Service	Pool Name
OpenMRS	Application	default
OpenELIS	Application
OpenERP	Application
OpenERP	Atom Feed Service
Jasper Reports

Database

Server	Database	Max Connections
MySQL	OpenMRS
PostgreSQL	OpenELIS (clinlims)
PostgreSQL	OpenERP
MySQL	Jasper

Tomcat

Monitoring

Icinga
How to notify people when something goes wrong in the production environment
Test whether monitoring is working
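Icinga runs checks that follow the Nagios plugin conventions: the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) sets the service state, and an optional "label=value;warn;crit" after a "|" carries performance data. The sketch below is a generic threshold check in that shape. How the measured value is obtained (for example, a count of failed events) and the script's name are assumptions; a real check script would end with sys.exit(main(sys.argv[1:])).

```python
# Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value: float, warn: float, crit: float, label: str) -> tuple[int, str]:
    """Map a measured value to a plugin state and a status line with perfdata."""
    if value >= crit:
        state = CRITICAL
    elif value >= warn:
        state = WARNING
    else:
        state = OK
    # Perfdata after "|": label=value;warn;crit
    line = f"{STATE_NAMES[state]} - {label} is {value:g} | {label}={value:g};{warn:g};{crit:g}"
    return state, line


def main(argv: list[str]) -> int:
    """Arguments: LABEL VALUE WARN CRIT. Prints one status line, returns the exit code."""
    try:
        label, value, warn, crit = argv[0], float(argv[1]), float(argv[2]), float(argv[3])
    except (IndexError, ValueError):
        print("UNKNOWN - usage: LABEL VALUE WARN CRIT")
        return UNKNOWN
    state, line = evaluate(value, warn, crit, label)
    print(line)
    return state
```

To test that monitoring itself works, run the check once with a value above the critical threshold and confirm that the resulting alert actually reaches the people who are supposed to receive it.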

Troubleshooting

It should be straightforward to read the runtime system parameters without bringing the system down.
The hospital's system administrator should be able to run a command that extracts these parameters from the running system.
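On Linux, some runtime parameters of a live process can be read from procfs without stopping it. The sketch below pulls the thread count and memory figures from /proc/PID/status and counts open file descriptors. It is Linux-only and usually needs to run as the service's user or as root; the function names and the chosen fields are illustrative. For JVM-internal details such as thread dumps, the JDK's own tools (jstack, jcmd) are the usual route.

```python
from pathlib import Path

# Fields of interest from /proc/PID/status (see proc(5)).
FIELDS = ("Threads", "VmRSS", "VmSize")


def parse_proc_status(text: str) -> dict[str, str]:
    """Extract selected 'Key:\tvalue' lines from a /proc/PID/status dump."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in FIELDS:
            result[key] = value.strip()
    return result


def runtime_params(pid: int) -> dict[str, str]:
    """Read live parameters of a running process; does not pause it."""
    params = parse_proc_status(Path(f"/proc/{pid}/status").read_text())
    # Each entry in /proc/PID/fd is one open file descriptor.
    params["OpenFDs"] = str(sum(1 for _ in Path(f"/proc/{pid}/fd").iterdir()))
    return params
```

Wrapping runtime_params in a small command-line script gives the administrator a single command to run against the service's PID.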