Root-causing the random failures in the integration tests with ElasticSearch

We were recently building an integration test framework, along with some tests that manipulate data in an ElasticSearch cluster. Strangely, the tests would pass or fail at random, even though we had not changed any of the business logic at the time.

What did we have in the test cases?

  • @BeforeClass: load the test data into the ElasticSearch cluster through the ElasticSearch TransportClient.
  • @Test: retrieve the test data and check some of its fields for equality.
  • @AfterClass: clean up the test data through the ElasticSearch TransportClient.

Really, it was just as simple as that.

What did the error message say when the tests failed? Well, it complained that it could not find the test data.

Strange. The @BeforeClass method should always load the data into the cluster before the test cases execute, and there were no errors about failing to load data. Feeling a bit stuck, I commented out the cleanup code in the @AfterClass method. Now the tests passed on every consecutive run. Once I added the cleanup code back, they started failing occasionally again, especially when I started a run right after the previous one finished.

This got me thinking: “Could it be that the test data was cleaned up at the end of the previous run but had not yet been loaded into the cluster in the next run, even though the @BeforeClass method was executed?”

My suspicion was confirmed after some reading on how ElasticSearch loads data. Why did this happen? Because loading data into an ElasticSearch cluster takes time, and so does deleting it. The test cases were executed right after the load request was issued in the @BeforeClass method, but not necessarily after the cluster had finished processing it. In other words, the load is asynchronous. We had falsely assumed that the load request was processed and the data was present in the cluster immediately. That mindset may be fine in unit tests, but in integration tests it can cause problems.
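To make the race concrete, here is a toy model in plain Java. It is not ElasticSearch and does not use the TransportClient; LaggingStore and its methods are hypothetical names made up for illustration. It only mimics the behaviour described above: a load call returns before the data can be read back.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy stand-in (hypothetical) for a cluster that accepts writes before applying them. */
class LaggingStore {
    private final Map<String, String> visible = new HashMap<>();
    private final Queue<String[]> pending = new ArrayDeque<>();

    /** Accepts the write and returns at once; the document is not readable yet. */
    void load(String id, String doc) {
        pending.add(new String[] {id, doc});
    }

    /** Stands in for the cluster catching up on the queued writes. */
    void process() {
        String[] entry;
        while ((entry = pending.poll()) != null) {
            visible.put(entry[0], entry[1]);
        }
    }

    /** Returns the document, or null if it is not visible (yet). */
    String get(String id) {
        return visible.get(id);
    }
}

class LaggingStoreDemo {
    public static void main(String[] args) {
        LaggingStore store = new LaggingStore();
        store.load("1", "test document");     // roughly what @BeforeClass does
        System.out.println(store.get("1"));   // prints "null": the test ran too early
        store.process();                      // the cluster catches up
        System.out.println(store.get("1"));   // prints "test document"
    }
}
```

In the real tests nothing calls process() for us at a known moment, which is why the same test could pass on one run and fail on the next.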

Stupid solution: add a buffer before executing the tests, for example Thread.sleep(30000) in the @BeforeClass method. However, this does not guarantee that the data has been loaded, especially when the data set is large.

Better solution: send a request to verify that the load request has actually been processed, given its request id, and wait in the @BeforeClass method until it has finished.
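One way to sketch the waiting part is a small polling helper with a timeout, using only the Java standard library. This is an assumption about how the wait might look, not the code we actually used: ClusterWait and waitUntil are hypothetical names, and the helper polls a caller-supplied condition rather than looking up a request id, so the actual check against the cluster (for example, counting the test documents) is left to the caller.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper: polls a condition until it holds or a deadline passes. */
class ClusterWait {
    /**
     * Returns true as soon as the condition holds, or false if the timeout
     * elapsed (or the thread was interrupted) before it did.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

In the @BeforeClass method this could be called as ClusterWait.waitUntil(() -> testDataIsPresent(), 30000, 200), where testDataIsPresent() is a hypothetical method that queries the cluster, failing the setup if it returns false. Unlike a fixed sleep, it returns as soon as the data is there and reports a timeout explicitly instead of letting the tests fail later.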

Whatever you do, make sure that the test data are actually in the cluster before moving on.

 

 
