This article was originally written by Victor Sosa, Director of Enterprise Integrations. Note: This blog post demonstrates that, when a Mule application is properly fine-tuned, very large volumes of data can be processed in a matter of minutes. A follow-up blog post, written by Irving Casillas, shows you exactly how to do this while adding resiliency. See Mule Batch - Adding Resiliency to the Manual Pagination Approach.
Bulk processing is one of the most overlooked use cases in enterprise systems, even though it is very useful when handling large loads of data. In this document, I showcase different scenarios of bulk upserts to a database using Mule batch components and evaluate their performance. The objective is to establish a best practice for configuring Mule batch components and jobs that process large volumes of data. Our goal is to minimize execution time without compromising resources such as memory and network connections.
The computer used for this proof of concept has a commodity configuration, as follows:
The software installed is:
The evaluation consists of processing a comma-separated values (CSV) file that contains 821,000+ records with 40 columns each. First we consume the file content, then we transform the data, and finally we store it in a database table. The file is 741 MB uncompressed. To ensure that each record holds the latest information, the database queries must implement the upsert statement pattern.
Three approaches are shown here:
1. A batch job where Mule handles the pagination and the bulk construction (the batch flow in Figure 1).
2. A flow with custom, manual pagination (Figure 2).
3. A batch job that manually builds the SQL bulk statement (Figure 3).
The first two approaches use the bulk mode of the Insert operation in the Database connector. The third approach uses the Bulk Execute operation in the Database connector.
One important distinction: the term “upsert execution” refers to inserting a new record or, if a record with that key already exists, updating it with the new values. In MySQL this is the “INSERT ... ON DUPLICATE KEY UPDATE” statement, documented at https://dev.mysql.com/doc/refman/5.7/en/insert-on-duplicate.html. The term “Insert operation” refers to the Insert configuration element in Mule, documented at https://docs.mulesoft.com/mule-user-guide/v/3.8/database-connector-reference#insert. In other words, we are executing MySQL upsert statements with the Insert operation of the Database connector in Mule.
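Putting the two terms together, the core of the first two approaches looks roughly like the snippet below. This is only a sketch: the connector configuration name, the table, and the columns are made up for illustration (the real statement in the PoC covers 40 columns), but the shape of an Insert operation in bulk mode wrapping a MySQL upsert is the same.

<!-- Sketch: Insert operation in bulk mode running a MySQL upsert statement.
     MySQL_Configuration, the table, and the columns are placeholders. -->
<db:insert config-ref="MySQL_Configuration" bulkMode="true" doc:name="Upsert records">
    <db:parameterized-query><![CDATA[
        INSERT INTO contacts (id, first_name, last_name, email)
        VALUES (#[payload.id], #[payload.firstName], #[payload.lastName], #[payload.email])
        ON DUPLICATE KEY UPDATE
            first_name = VALUES(first_name),
            last_name  = VALUES(last_name),
            email      = VALUES(email)
    ]]></db:parameterized-query>
</db:insert>

With bulkMode set to true, the Insert operation expects a collection as the payload and sends it to the database as a single batch instead of one round trip per record.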
You can find the code in our GitHub repository: https://github.com/ioconnectservices/upsertpoc
Some background on the Mule batch components: you can configure the block size and the maximum number of threads of the batch job, and in the Batch Commit component you can also configure the commit size. This gives a lot of flexibility in terms of performance and memory tuning.
This flexibility comes at a price: you must calculate how much memory the computer has to dedicate to this process alone for the blocks held in memory at any one time. This can be estimated with the following formula:
Maximum memory = record size * block size * maximum number of threads
For instance, in our test the record (a SQL upsert statement) is 3.1 KB in size. The batch component is configured with a block size of 200 records and 25 running threads, so the blocks in flight need roughly 3.1 KB * 200 * 25, or about 15.13 MB, of memory at a time. The job will process approximately 4,105 blocks of that size (remember the 821,000 records?). You must also verify that the host has enough CPU and memory left over for garbage collection.
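As a reference, the sketch below shows how those settings map onto the batch job. I am quoting the Mule 3.8 batch syntax from memory (block-size on the job, a threading profile, and a commit size), so verify the element and attribute names against your runtime version; the input, transformation, and database details are omitted.

<!-- Sketch only: the tuning knobs discussed above (block size, threads, commit size). -->
<batch:job name="upsertBatchJob" block-size="200">
    <batch:threading-profile poolExhaustedAction="WAIT" maxThreadsActive="25"/>
    <batch:input>
        <!-- read the CSV file and turn each line into a record -->
    </batch:input>
    <batch:process-records>
        <batch:step name="upsertStep">
            <batch:commit size="200" doc:name="Batch Commit">
                <!-- the Insert operation in bulk mode with the upsert statement goes here -->
            </batch:commit>
        </batch:step>
    </batch:process-records>
    <batch:on-complete>
        <!-- log the number of processed and failed records -->
    </batch:on-complete>
</batch:job>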
Figure 1. The batch flow.
This is a simple flow: all the pagination and bulk construction is done by Mule, so we only need to worry about the SQL statement and the performance settings.
As explained before, batch jobs are designed to run as fast as possible by processing records in multiple threads.
The overall idea here is to read the file, transform the content into a list, and iterate through the list creating pages of records from which we construct the SQL queries. The queries for each page are then sent to the Database connector with the bulk mode flag enabled.
Figure 2. Custom pagination flow.
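One way to express this idea in Mule 3 XML is sketched below. The actual flow in the repository may be wired differently; the flow name, the page size of 200, and the column names are placeholders. The batchSize attribute of the foreach scope splits the list into pages, and each page is handed to the Insert operation in bulk mode.

<!-- Sketch of the custom pagination idea: split the record list into pages of 200
     and upsert each page as one bulk call. Names are illustrative. -->
<flow name="customPaginationFlow">
    <!-- By this point the CSV file has been read and transformed into a list of
         records, for example with a DataWeave transformation. -->
    <foreach collection="#[payload]" batchSize="200" doc:name="Paginate">
        <!-- Each iteration receives a sub-list of up to 200 records. -->
        <db:insert config-ref="MySQL_Configuration" bulkMode="true" doc:name="Bulk upsert page">
            <db:parameterized-query><![CDATA[
                INSERT INTO contacts (id, email)
                VALUES (#[payload.id], #[payload.email])
                ON DUPLICATE KEY UPDATE email = VALUES(email)
            ]]></db:parameterized-query>
        </db:insert>
    </foreach>
</flow>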
This third approach was also tested; its execution time was not as good as the two approaches above, but it showed the lowest memory consumption. The results of the testing are presented below.
Here the SQL bulk statement is constructed manually, while the pagination is handled by the batch job.
Figure 3. Batch manually building the SQL bulk statement.
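In rough terms, the commit phase of this variant turns each record in the commit block into its own upsert statement, joins the statements into a single script, and runs the script with the Bulk Execute operation. The fragment below shows the shape of it with two literal statements; in the actual project the script is built dynamically (for example with a DataWeave transformation), and the table and column names are again illustrative.

<!-- Sketch: inside the batch step, the commit block gathers the records and the
     Bulk Execute operation runs one manually built upsert per record. -->
<batch:commit size="200" doc:name="Batch Commit">
    <db:bulk-execute config-ref="MySQL_Configuration" doc:name="Bulk execute upserts"><![CDATA[
        INSERT INTO contacts (id, email) VALUES (1, 'first@example.com')
            ON DUPLICATE KEY UPDATE email = VALUES(email);
        INSERT INTO contacts (id, email) VALUES (2, 'second@example.com')
            ON DUPLICATE KEY UPDATE email = VALUES(email);
    ]]></db:bulk-execute>
</batch:commit>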
Many times, a thoughtful design is more helpful than the out-of-the-box features a platform offers. In this scenario, the custom pagination approach upserts the records into the database faster than the batch approach. However, there are a couple of things to consider as the outcome of this proof of concept:
As a side experiment, I also observed that the Bulk Execute operation in the Database connector is slower than the Insert operation in bulk mode. Moreover, the parameterized mode allows you to take data from any source, trusted or untrusted, and still have the queries sanitized.
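To make that last point concrete, the contrast is roughly the following (the column names are again made up). With a dynamic query the expression results are concatenated straight into the SQL text, so a malformed or malicious value can break, or inject into, the statement; with a parameterized query the same values are bound as JDBC parameters, so the driver escapes them and the statement can be prepared once and reused.

<!-- Dynamic query: values are concatenated into the SQL text as-is. -->
<db:insert config-ref="MySQL_Configuration">
    <db:dynamic-query><![CDATA[
        INSERT INTO contacts (id, email) VALUES (#[payload.id], '#[payload.email]')
    ]]></db:dynamic-query>
</db:insert>

<!-- Parameterized query: the same values are bound as JDBC parameters,
     which keeps the query sanitized no matter where the data comes from. -->
<db:insert config-ref="MySQL_Configuration" bulkMode="true">
    <db:parameterized-query><![CDATA[
        INSERT INTO contacts (id, email) VALUES (#[payload.id], #[payload.email])
    ]]></db:parameterized-query>
</db:insert>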