In large organisations such as Essent, monolithic systems such as an Enterprise Resource Planning (ERP) system play a critical role. Combining these systems with cloud-native systems poses challenges that need addressing.
In this post we will describe the challenges we faced at Essent while combining these two worlds, and how we are addressing them.
CHALLENGES WHEN COMBINING THE OLD AND THE NEW
At Essent, we are building a cloud-native architecture using AWS's managed and serverless components. This makes handling more API calls, queries, or data relatively easy. However, the fact that our applications and databases can scale significantly doesn't mean that all other applications can scale just as easily. The main software at Essent, as at many other companies, is the ERP system. In the past, this was the primary software for inputting and managing data.
Moving from on-premises systems to a cloud-first architecture didn’t cause the ERP system to go away. In fact, integrating the cloud-native systems with the ERP is a necessary step because the ERP holds important company data and processes. At first glance, integrating them seems simple: create some APIs in the ERP system to communicate with the other systems and adjust the data format as needed. But the challenge is that ERP systems often struggle to work smoothly with cloud technologies.
OPINIONATED LANGUAGE
ERP systems have a very opinionated language model. Usually, the database underneath these ERP systems contains normalized data with very technical field names that end up in the API, which leads to APIs you cannot understand without talking to an ERP expert. Since you cannot change the language model of the ERP, every other application now needs to conform to it, instead of the ERP conforming to the domain language used by business and IT people.
BATCH ORIENTED SYSTEMS
ERP systems can process and store massive amounts of data. However, there is a limit to how many of these processes they can run simultaneously.
They can usually only scale vertically, and quite often this requires a contract change with the vendor. Scaling up or down can take weeks or months, instead of seconds. Even when lifting these systems to the cloud, they still often have these same limitations.
To work around this, ERP systems are very batch oriented. Special care is taken to spread out all heavy jobs, so they don’t overlap too much. This combines very poorly with a serverless landscape that can scale up significantly in seconds to respond to spikes in traffic, and that can very quickly overload the ERP in what is effectively a self-inflicted DDoS.
LESS AVAILABLE
Frequent activities such as patching, upgrades and deployments can take well over an hour to be done in an ERP, with significant upfront planning and preparation. Often these take place outside of office hours, forcing the engineers to work overnight.
Because ERPs are placed as a central system in the landscape, they become part of the critical path of many processes and a single point of failure with a long mean time to repair (MTTR). Downtime causes a ripple effect across the landscape with significant disruption.
HOW CAN WE DEAL WITH THESE LIMITATIONS?
Given the above-mentioned ERP system weaknesses, what did we at Essent IT do to mitigate them, especially when thinking about their impact on the cloud native landscape?
LANGUAGE POLLUTION
In order to solve the language problem, we built an anti-corruption layer (ACL) around the ERP system. This layer protects the ERP system from changes made to the systems it interacts with, and it protects those systems from changes made to the ERP.
We also introduced business events, which represent functional actions using the domain language of Essent, instead of the language of the ERP. All data coming from the ERP is mapped to the domain language, which prevents ERP-specific language from leaking into the rest of the landscape. Likewise, all events sent towards the ERP use the domain language, which is then translated into the ERP’s data model.
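To make the two-way translation concrete, here is a minimal Python sketch of what such a mapping could look like. The event type `ContractStarted`, the ERP field names (`CUSTNR`, `CTR_NO`, `BEG_DT`) and the zero-padding rule are all invented for illustration and are not taken from our actual ERP or event catalogue.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ContractStarted:
    """Hypothetical business event, expressed in domain language."""
    customer_id: str
    contract_id: str
    start_date: date


def erp_to_business_event(record: dict) -> ContractStarted:
    """Translate a raw ERP record (hypothetical field names) into a business event."""
    return ContractStarted(
        customer_id=record["CUSTNR"].lstrip("0"),  # assumed: ERP pads ids with zeros
        contract_id=record["CTR_NO"],
        start_date=date.fromisoformat(record["BEG_DT"]),
    )


def business_event_to_erp(event: ContractStarted) -> dict:
    """Translate a business event back into the ERP's (hypothetical) data model."""
    return {
        "CUSTNR": event.customer_id.zfill(10),
        "CTR_NO": event.contract_id,
        "BEG_DT": event.start_date.isoformat(),
    }
```

Consumers only ever see `ContractStarted`; the technical field names stay inside these two functions.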
LOTS OF QUEUES, AND AN EVENT BUS
We stopped exposing the ERP to synchronous calls from other systems. All systems that need to send data to the ERP post their messages on an event bus, from which the messages are routed to a queue. An adapter dequeues the messages and sends them to the ERP, retrying with exponential backoff when a call fails. This solves two problems:
- It protects the ERP from spikes in calls made by other systems. The queue batches and throttles all incoming traffic, preventing the ERP from being DDoS’ed by our own systems.
- It protects the other systems from the ERP’s downtime. Since all messages are now sent to the event bus, it does not matter to the other systems when the ERP is unavailable. The messages will remain in the queues until the ERP is again available.
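The adapter idea can be sketched in a few lines of Python. This is a simplified, in-memory illustration rather than our production code: `ErpUnavailable`, `send_to_erp`, and the delay values are assumptions, and a real adapter would read from a managed queue instead of a `deque`.

```python
from collections import deque
from typing import Callable


class ErpUnavailable(Exception):
    """Raised by the (hypothetical) ERP client when the ERP cannot be reached."""


def backoff_delays(base_seconds: float, factor: float, attempts: int) -> list:
    """Delays that grow exponentially: base, base*factor, base*factor^2, ..."""
    return [base_seconds * factor ** i for i in range(attempts)]


def drain(queue: deque,
          send_to_erp: Callable[[dict], None],
          sleep: Callable[[float], None],
          delays: list) -> bool:
    """Send queued messages in order. Returns False if a message kept failing."""
    while queue:
        message = queue[0]  # peek, so the message stays queued until it is delivered
        for delay in [0.0] + delays:  # first attempt is immediate
            if delay:
                sleep(delay)
            try:
                send_to_erp(message)
                break
            except ErpUnavailable:
                continue
        else:
            return False  # all attempts failed; the message remains in the queue
        queue.popleft()
    return True
```

Because a message is only removed after a successful call, ERP downtime leaves the queue intact and order is preserved.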
To support all the events going in and out of the ERP system at scale, we use Kafka as an event bus.
All outgoing data is transformed into a business event and put into Kafka, where it can be consumed by a near infinite number of consumers, without any extra load to the ERP.
As for incoming data, it’s mapped from business events to the data model of the ERP, before it is persisted in its data store. All messages are stored indefinitely on Kafka, which also allows future consumers to read all the ERP data without any need to call it again.
LANGUAGE IS HARD
At first, it seemed fairly straightforward to map the ERP’s data model to a business event in response to an action performed in the ERP. In practice, it was often difficult. One action in the ERP can translate to multiple business events, and several business events can map onto a single ERP action. Our advice: avoid creating events that merely represent the action performed in the ERP, and model the actual business event instead.
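As an illustration of one ERP action producing several business events, consider the following Python sketch. The shape of `erp_change` and the event names `CustomerMoved` and `TariffChanged` are hypothetical examples, not our real event definitions.

```python
def business_events_for(erp_change: dict) -> list:
    """Derive business events from what changed, not from the ERP action name."""
    before, after = erp_change["before"], erp_change["after"]
    events = []
    if before["address"] != after["address"]:
        events.append({"type": "CustomerMoved", "new_address": after["address"]})
    if before["tariff"] != after["tariff"]:
        events.append({"type": "TariffChanged", "new_tariff": after["tariff"]})
    return events
```

A single "update contract" action in the ERP can therefore yield zero, one, or two events, depending on what actually happened from a business point of view.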
It’s not normally possible to change the ERP’s data model, so ERP specialists are used to the rest of the world conforming to it. It takes effort and time to raise awareness that, even though the data originates from the ERP, the ERP should not dictate how the events and the language are modelled. This requires people specialized in the ERP to adopt a new mindset, which can be challenging.
ATTEMPTING TO RETRY
At first, we started with a very basic exponential backoff that kept retrying for approximately an hour. After that, the message would go to a dead letter queue and an alert would be raised, requiring manual intervention.
In our case it turned out that quite regularly messages would fail for hours, or in the case of the acceptance environment even for days. This caused us to write a more complex queuing mechanism that tries to send messages to the ERP five times in an hour. When that fails, it then puts the messages in a dead letter queue. We then have a second process running every three hours that picks up all the dead letter messages and puts them again in the queue. We take care that the message order is not lost by making sure the older messages end up at the front of the queue.
As the process is now fully automated, the alerts are raised when messages are in the dead letter queue for more than 12 hours, which removes a lot of the false positive alerts.
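The redrive and alerting steps can be sketched as follows in Python. This covers only the dead letter handling, under a few assumptions: messages carry a hypothetical `sequence` number that defines their order, the 12-hour clock starts at the first time a message is dead-lettered, and plain in-memory collections stand in for the real queues.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

ALERT_AFTER_SECONDS = 12 * 3600  # alert when dead-lettered for more than 12 hours


@dataclass
class Message:
    sequence: int  # hypothetical ordering key: lower means older
    payload: dict
    first_dead_lettered_at: Optional[float] = None


def dead_letter(message: Message, dlq: list, now: float) -> None:
    """Park a message after its retries failed, remembering when it first failed."""
    if message.first_dead_lettered_at is None:
        message.first_dead_lettered_at = now
    dlq.append(message)


def redrive(dlq: list, queue: deque) -> None:
    """Move dead-lettered messages to the front of the queue, oldest first."""
    for message in sorted(dlq, key=lambda m: m.sequence, reverse=True):
        queue.appendleft(message)
    dlq.clear()


def messages_to_alert_on(dlq: list, now: float) -> list:
    """Messages that have been failing for longer than the alert threshold."""
    return [m for m in dlq if now - m.first_dead_lettered_at > ALERT_AFTER_SECONDS]
```

Running `redrive` on a three-hour schedule and `messages_to_alert_on` alongside it mirrors the behaviour described above: short outages heal themselves, and only long-lived failures page someone.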
WHERE DO WE STAND NOW?
At Essent IT we have just started this way of integrating the ERP system with our cloud native landscape, with several teams redesigning their systems to follow the patterns described above.
We will probably run into many more challenges, but we hope this design pattern will provide a good foundation for everything we do in the future, since our large ERP system won't be going away anytime soon.
If you have any questions and would like to better understand the design of our solution, trade-offs, or share information, feel free to leave a comment below and we’ll get in touch! And if you're interested in being a part of the creation and implementation of such initiatives, please take a look at our current job openings.