Troubleshooting and Fixing OutOfMemoryError in Spring Boot

Photo by Clément Hélardot on Unsplash

I work at a leading multinational company in an engineering team that helps build great internal products for the employees. I primarily work on the Java, Spring Boot and RDBMS (Postgres/MySQL) stack. A couple of years back, our team embarked on a transformational journey from monolith to microservices, and below is the architecture we generally follow for our applications.

Monolith to Microservices

Performance Issue

Grafana dashboard showing huge CPU utilization

After a recent announcement in our company, we started seeing a huge, unexpected surge in traffic to our application. End users started complaining that the application had become slow, and we also started receiving a lot of error alerts from LogInsight, the log monitoring tool used by our team. From the Grafana dashboard (see the two CPU utilization spikes), we found that the microservice pods were under huge stress, which was causing the Spring Cloud Gateway pods to throw Connection Timeout exceptions, hence the alerts.

Application Technical Specifications:

The application runs on a microservice architecture following the gateway design pattern. The backend pods are Java 8 based Spring Boot Docker images. Deployment specs are 2 CPU and 4 GB RAM, with 3 pods in total for each microservice. Java runs with default heap parameters. We use PostgreSQL as our RDBMS.

Initially, I thought the pods were simply unable to sustain the unusually high traffic, so I got the resources in our TKG namespace bumped up so that I could increase the number of backend pods. It did the magic. Awesome, I took a sigh of relief. Alas, after a couple of weeks, we started seeing the same alerts. It was time to buckle up and find the root cause of the issue.

Impact:

End users were not happy with the user experience because of the slowness in the application. Application admins were facing issues completing the approval process, which increased their backlog.

How the issue was triaged

Screenshot from LogInsight : Out of Memory error

LogInsight to the rescue. I started scouring the logs of our microservice pods to find the cause of the slowness and found that it was the dreaded OOM error: Java heap space. I exported the logs from LogInsight, checked the stack trace of the OOM error and found that there was one API causing the issue.

Screenshot from LogInsight : Stack trace of OOM error

How I fixed it

OK, so now I had narrowed down the API that was causing the bottleneck. It turned out the API was issuing a database query that returned more than 100k rows, choking up the entire heap. Generally that is not a huge row count, but it becomes an issue when the entity mapping class contains a lot of attributes and has multiple associations. Instead of fetching all rows in a single shot and mapping them to Java objects just to get a computed value, I replaced the logic with a more efficient query that computes the result in the database. This change greatly improved application performance and drastically reduced heap usage.
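A minimal sketch of the idea with Spring Data JPA (the entity, repository and column names here are hypothetical, not the actual ones from our codebase): push the computation into the database so only the result crosses the wire, instead of loading every row as a managed entity.

import java.time.LocalDateTime;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

// Hypothetical repository: before, something like findByStatus("PENDING") pulled
// 100k+ entities into the heap just so the service could compute one number in Java.
public interface ApprovalRequestRepository extends JpaRepository<ApprovalRequest, Long> {

    // After: a single aggregate query; the database does the counting and
    // only the computed value is mapped, not 100k+ entity objects.
    @Query("select count(r) from ApprovalRequest r "
         + "where r.status = :status and r.createdOn > :cutoff")
    long countPendingSince(@Param("status") String status,
                           @Param("cutoff") LocalDateTime cutoff);
}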

Apart from fixing this issue, I also tweaked a few other things so we could sustain the growing surge in traffic.

1. Introduced liveness and readiness probes using Spring Boot actuator endpoints, so that if an application stops responding, Kubernetes can stop and restart the container (see the probe snippet after this list).

2. Java 8 (before update 191) is not Docker-aware, so the JVM heap is not sized based on the requests and limits provided in the Deployment YAML; if you don't specify -Xmx, the default max heap is 1/4 (25%) of the host's RAM, not the container's. So I added initial and max heap size parameters to the startup command in the Dockerfile:

java -Xms1024m -Xmx2048m -jar /opt/java/microservices/$APP_NAME/jar/$APP_NAME*.jar

3. Introduced distributed caching for the most-hit GET APIs using Ehcache (see the sketch after this list).
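For the probes (point 1 above), the Deployment YAML points Kubernetes at the actuator health endpoints. This is a minimal sketch, assuming Spring Boot 2.3+ where the dedicated liveness and readiness health groups are available (on older versions, /actuator/health can be used for both); the port and timings are illustrative:

# Container section of the Deployment YAML (values are illustrative)
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10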
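For the caching (point 3 above), a minimal sketch using Spring's caching abstraction with a hypothetical service, query and cache name; it assumes @EnableCaching on a configuration class and an Ehcache-backed CacheManager, with the cache declared and sized in the Ehcache configuration so it does not become another source of heap pressure:

import org.springframework.cache.annotation.Cacheable;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;

@Service
public class DepartmentService {

    private final JdbcTemplate jdbcTemplate;

    public DepartmentService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Hypothetical read-heavy lookup behind one of the most-hit GET APIs:
    // the first call per id hits Postgres, subsequent calls are served from the cache.
    @Cacheable(value = "departments", key = "#departmentId")
    public String getDepartmentName(Long departmentId) {
        return jdbcTemplate.queryForObject(
                "select name from department where id = ?", String.class, departmentId);
    }
}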

Other things that can be looked into:

Hibernate’s Query Plan Cache speeds up the preparation of your queries, which reduces their overall execution time and improves the performance of your application. By default, the maximum number of entries in the plan cache is 2048. If each HQLQueryPlan object occupies 2–3 MB, a full cache can take up 4–6 GB of heap memory.

Decrease the query plan cache size by setting the following properties:

  • spring.jpa.properties.hibernate.query.plan_cache_max_size: controls the maximum number of entries in the plan cache (defaults to 2048)
  • spring.jpa.properties.hibernate.query.plan_parameter_metadata_max_size: manages the number of ParameterMetadata instances in the cache (defaults to 128)
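For example, in application.properties (the reduced values below are only illustrative; the right numbers depend on how many distinct queries the application actually runs):

# Illustrative values, not a recommendation
spring.jpa.properties.hibernate.query.plan_cache_max_size=1024
spring.jpa.properties.hibernate.query.plan_parameter_metadata_max_size=64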

More details in Thorben’s blog.

Results:

No alerts from LogInsight, and end users are happy.

Conclusion

This experience taught me how to debug and fix performance issues in a Spring Boot application. Fine-tuning an application by fixing SQL queries, using indexes and implementing a caching layer should be the primary step to improve performance; if this doesn’t solve the problem, then we should consider increasing resources.