Creating a robust deployment

Scenarios

  1. Database Ids may overflow over time, and unnecessary gaps may have developed in the Ids.
  2. The system leaks memory slowly and has to be restarted
  3. The system becomes slower over time and has to be restarted
  4. The system becomes slower with increase in database size
  5. Individual services run out of threads
  6. Individual services run out of the database connections allocated to them
  7. The system is not reporting the errors
  8. The system is using more disk space than it should
  9. On machine restart the services do not come back up
  10. On service restart, the service does not become available again
  11. Active to Passive data replication is not happening
  12. Automatic switch from active to passive doesn't work
  13. Automatic switch back from passive to active doesn't work
  14. Database connection becomes stale if not used for some time
  15. Failed events do not resolve on their own (including scenarios of restart of services)
  16. A very large file is uploaded, causing the server to become less responsive
  17. Redirect loop causes denial of service (low priority)
  18. An inefficient report slows down or hampers production operations
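
Scenario 1 can be checked ahead of time by comparing the highest Id in a table with the limit of its column type. The following is a minimal sketch, not part of the deployment: the helper name `id_headroom`, the 80% warning threshold and the sample numbers are assumptions, and the gap estimate assumes Ids start at 1. The limits are the standard maxima of signed 32-bit and 64-bit integer columns.

```python
# Upper bounds of common signed integer Id columns.
ID_LIMITS = {
    "int": 2**31 - 1,     # 32-bit signed (MySQL INT, PostgreSQL integer)
    "bigint": 2**63 - 1,  # 64-bit signed (MySQL BIGINT, PostgreSQL bigint)
}


def id_headroom(max_id, row_count, column_type="int", warn_ratio=0.8):
    """Summarise how close a table's Ids are to overflowing.

    max_id and row_count are assumed to come from queries such as
    SELECT MAX(id) and SELECT COUNT(*) on the table being checked.
    """
    limit = ID_LIMITS[column_type]
    used_ratio = max_id / limit
    return {
        "remaining": limit - max_id,   # Ids left before overflow
        "used_ratio": used_ratio,      # fraction of the Id space consumed
        "gaps": max_id - row_count,    # Ids skipped or deleted (Ids start at 1)
        "warn": used_ratio >= warn_ratio,
    }


# Hypothetical numbers, for illustration only.
print(id_headroom(max_id=1_900_000_000, row_count=40_000_000))
```

A large `gaps` value with a small `row_count` suggests the Id space is being consumed by something other than real data, which is worth investigating before the limit is reached.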

Manual Testing
Automated Testing

Automation

The above scenarios need to be tested manually or in an automated fashion.

Environment

Functional Tests

Set up an environment with enough disk space, CPU and memory. Ensure that all the basic scenarios are covered by the functional tests. Run these tests continuously for days.
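
Running the functional tests for days is most useful when each run is also timed, so that gradual slowdowns (scenarios 3 and 4) show up as a trend rather than as a surprise. The sketch below is one possible way to do that, under assumptions: `soak` is a hypothetical helper, `action` stands in for one pass of a functional test, and the window size and the 2x slowdown factor are arbitrary choices.

```python
import statistics
import time


def soak(action, iterations, window=50, slowdown_factor=2.0,
         clock=time.perf_counter):
    """Call action repeatedly and report whether it became slower.

    The median latency of the first `window` calls is compared with
    the median latency of the last `window` calls.
    """
    latencies = []
    for _ in range(iterations):
        start = clock()
        action()
        latencies.append(clock() - start)
    baseline = statistics.median(latencies[:window])
    recent = statistics.median(latencies[-window:])
    return {
        "baseline": baseline,
        "recent": recent,
        "degraded": recent > baseline * slowdown_factor,
    }


# Stand-in workload; a real run would call a functional test instead.
result = soak(lambda: sum(range(1000)), iterations=200)
print(result["degraded"])
```

Medians are used instead of means so that a single slow call, for example during garbage collection, does not mark the whole run as degraded.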

Environment Configuration

The system should be configured so that it keeps working for some time after it gives a warning. For example, if OpenMRS runs out of number of threads 

  • Connection Pool Size (min, max, increment)
    • Each service should ideally use only one connection pool
    • min=5, increment=1
    • max=depending on the size of the deployment
  • Thread Pool Size
    • The maximum size of the thread pool should not be very high (e.g. at JSS, a thread pool size of 100 for OpenMRS and 20 for OpenELIS should be enough)
  • Failed events size
    • This size should be 10, so that the problem gets reported immediately
    • While fixing the issue in production, one may temporarily increase this size
  • Database
    • Maximum number of connections (this includes the connections used for ad hoc usage too, so keep this number slightly higher than the total connection pool size given to the applications)
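
The relationship between the pool sizes and the database limit in the list above can be checked mechanically before deployment. The following is a rough sketch under assumptions: `PoolConfig`, `check_connection_budget`, the ad hoc reserve of 5 connections and all sample values are hypothetical, and only the defaults min=5 and increment=1 come from the list above.

```python
from dataclasses import dataclass


@dataclass
class PoolConfig:
    service: str
    max_size: int        # depends on the size of the deployment
    min_size: int = 5    # min=5, as suggested in the list above
    increment: int = 1   # increment=1, as suggested in the list above


def check_connection_budget(pools, db_max_connections, adhoc_reserve=5):
    """Return a list of problems; an empty list means the sizes are consistent.

    `pools` are the connection pools that share one database server.
    """
    problems = []
    for pool in pools:
        if not 0 < pool.min_size <= pool.max_size:
            problems.append(f"{pool.service}: min must be between 1 and max")
        if pool.increment < 1:
            problems.append(f"{pool.service}: increment must be at least 1")
    # All pools at full size plus ad hoc sessions must fit in the database limit.
    needed = sum(pool.max_size for pool in pools) + adhoc_reserve
    if needed > db_max_connections:
        problems.append(
            f"database allows {db_max_connections} connections, "
            f"but pools plus ad hoc reserve may need {needed}"
        )
    return problems


# Hypothetical sizes for two pools that share one database.
pools = [PoolConfig("Application", max_size=40),
         PoolConfig("Dynamic Reports", max_size=10)]
print(check_connection_budget(pools, db_max_connections=50))
```

Running a check like this per database server, once the tables below are filled in, would catch a database limit that is lower than what the applications can request.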

Connection Pool Size

Sub-System      Service            Pool Name  Min  Max  Increment
OpenMRS         Application        default
OpenMRS         Dynamic Reports    default
OpenELIS
OpenERP
OpenERP         Atom Feed Service
Jasper Reports

Thread Pool Size

Sub-System      Service            Pool Name  Min  Max  Increment
OpenMRS         Application        default
OpenELIS        Application
OpenERP         Application
OpenERP         Atom Feed Service
Jasper Reports

Database

Server      Database             Max Connections
MySQL       OpenMRS
PostgreSQL  OpenELIS (clinlims)
PostgreSQL  OpenERP
MySQL       Jasper

Tomcat

   
   
   
   

Monitoring

  • Icinga
  • How to notify when something goes wrong in the production environment
  • Test whether monitoring is working or not
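
Icinga can run custom check scripts that follow the Nagios plugin convention: the script prints one status line and signals the result through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The following is a minimal sketch of such a check for disk usage (scenario 8); the function names and the 80%/90% thresholds are assumptions, not settings taken from an existing installation.

```python
import shutil

# Exit codes of the Nagios/Icinga plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a plugin status code."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (status_code, message) for the filesystem containing path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    used_percent = 100.0 * usage.used / usage.total
    status = classify(used_percent, warn, crit)
    return status, f"DISK {LABELS[status]} - {path} is {used_percent:.1f}% full"


def main(argv):
    """Print the status line and return the exit code for the given path."""
    status, message = check_disk(argv[1] if len(argv) > 1 else "/")
    print(message)
    return status
```

When installed as a check command, the script would end with `sys.exit(main(sys.argv))` so that Icinga receives the status as the process exit code. Temporarily lowering `warn` below the current usage is a simple way to test that a notification is actually delivered.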

Troubleshooting

  • It should be straightforward to get the runtime system parameters without bringing down the system.
  • The hospital's system administrator should be able to issue a command to extract these parameters from the running system.
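
On a Linux server, some runtime parameters of a running service can be read from `/proc/<pid>/status` without stopping or attaching to the process. The sketch below is one possible approach, assuming Linux and a known process id; the helper names and the choice of fields (`Threads`, `VmRSS`, `VmSize`) are illustrative.

```python
def parse_proc_status(text, fields=("Threads", "VmRSS", "VmSize")):
    """Extract selected fields from the text of /proc/<pid>/status."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in fields:
            values[key] = value.strip()
    return values


def runtime_parameters(pid):
    """Read thread count and memory figures of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_proc_status(handle.read())


# Example with captured text, so the sketch also runs on non-Linux machines.
sample = "Name:\tjava\nVmSize:\t 8123456 kB\nVmRSS:\t  512000 kB\nThreads:\t87\n"
print(parse_proc_status(sample))
```

A thread count that keeps approaching the configured thread pool maximum, or a resident memory figure that only grows between restarts, points to scenarios 5 and 2 respectively.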