In September 2022, the war in Ukraine and the announcement of a significant rise in energy prices led Essent's customers to take a greater interest in their energy consumption, which caused a surge in usage of Essent's mobile apps and website. The surge produced an unexpected increase in the number of requests our IT systems had to handle. Because of scalability issues in certain areas of our infrastructure, this resulted in a critical Priority 1 (P1) incident that persisted for one week.
It became evident that we needed to reassess our architecture. In the aftermath of this incident, the Cloud Community, a team of cloud architects and engineers, initiated a project to overhaul our platform and practices. This project was dubbed Essent's Cloud Reference Architecture 2.0 (CRA2).
CRA2’S FOUNDATIONAL PRINCIPLES
CRA2, driven by our Cloud Platform team, focuses on building a new platform centered on the happiness and productivity of software developers. It is based on the AWS Well-Architected Framework (WAF) and conforms to its six pillars, as depicted in the image below.
AWS Well-Architected Framework (WAF) pillars
Adapted to Essent's context, CRA2 rearranges the principles of WAF's pillars into four core CRA2 pillars:
- Stress-Free Environment: This includes simplicity, dependable environments, and team ownership
- Developer-Friendly: Encouraging quick failure and even quicker recovery, along with observability and autonomy
- Reliable and Resilient: Emphasizing asynchronous operations, comprehensive monitoring, and scalability
- Secure: Focused on isolated environments, leveraging cloud-native solutions, and implementing automated security scans
Note: the principles listed above are not exhaustive but rather a simplified overview for the purpose of this blog post.
BUT WHAT EXACTLY IS CRA2?
The cloud platform for developers was designed with the aim of streamlining the development process while providing robust support for collaboration, security, and innovation. At the heart of this platform is the principle of isolated environments for each team, ensuring that every group operates within its own set of accounts. This autonomy eliminates conflicts and resource competition, allowing teams to focus on their projects without interference.
One of the key features of this platform is network abstraction. The complexity of the underlying network infrastructure is hidden, offering teams a basic, yet fully functional networking layer that "just works". This setup not only simplifies initial use but also provides the flexibility for teams to expand their network configurations as needed, without requiring deep network engineering knowledge.
To maintain consistency and reliability, the platform offers mirrored environments for production and non-production use. This ensures that both environments contain the exact same components, eliminating discrepancies and enabling seamless transitions from development to deployment.
Centralized telemetry is another cornerstone of the platform, providing a unified repository for all metrics and logs. This feature fosters a collaborative environment where teams can share insights and learn from each other, with easy access to all crucial information in one place.
Simplified permissions streamline administrative tasks, granting users broad administrative capabilities with a few restrictions related to billing and organizational policies. This approach balances flexibility with control, ensuring teams can work efficiently while adhering to governance standards.
The platform also centralizes the management of inbound and outbound traffic, enhancing security and simplifying connectivity with partners. Traffic monitoring and protection are handled centrally, offering a comprehensive view of network activity and reducing the administrative burden on individual teams.
To ensure compliance and foster best practices, the platform provides easy-to-follow guidelines and inspection tools. This framework helps teams maintain high standards for their accounts without restraining innovation.
Moreover, the platform encourages experimentation and exploration. Teams can freely experiment and validate new concepts in non-production environments, while Infrastructure as Code (IaC) is enforced in production, ensuring that proofs of concept are thoroughly vetted before launch.
Finally, the development process is further streamlined by enabling engineers to develop software as though their machines are part of the non-production network. This facilitates easier inspection and debugging, enabling developers to work more efficiently and with greater insight into their applications.
Overall, this cloud platform is designed to empower developers with the tools, autonomy, and support needed to innovate rapidly while maintaining high standards of reliability, security, and collaboration.
Platform-level services, segregated into different accounts, that form the technical basis of CRA2
Fasten your seat belts, we’re going nerdy now! 🤓
The networking uses private Class B IP addresses (172.16.0.0/12). A central VPC, hosted in a separate account, holds connectivity to all other accounts. We call this account the Central Networking Account. Not very creative, we know 😁
There, we have one Transit Gateway which is responsible for wiring all networking accounts (VPCs) together.
For the VPC itself we use a /24 range, which our analysis showed to be sufficient for hosting all the components that support our networking, including private and partner connectivity: Direct Connect connections; backup Site-to-Site VPNs to our datacenters; proxies; and a NAT gateway per region for redundancy.
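To get a feel for the address arithmetic, here is a minimal sketch using Python's standard `ipaddress` module. Only the 172.16.0.0/12 range and the /24 block size come from this post; the helper `carve_blocks` and the order in which blocks are handed out are assumptions made for illustration, not our actual allocation scheme.

```python
import ipaddress
from itertools import islice

# The private range named in this post.
PLATFORM_RANGE = ipaddress.ip_network("172.16.0.0/12")


def carve_blocks(parent, prefix, count):
    """Return the first `count` non-overlapping /prefix blocks inside `parent`."""
    return list(islice(parent.subnets(new_prefix=prefix), count))


# Number of /24 blocks that fit in the /12 range: 2 ** (24 - 12).
total_blocks = 2 ** (24 - PLATFORM_RANGE.prefixlen)

# Hypothetical assignment: the first block for the central VPC,
# the next two for other VPCs.
central, vpc_a, vpc_b = carve_blocks(PLATFORM_RANGE, 24, 3)

print(total_blocks)           # 4096
print(central)                # 172.16.0.0/24
print(central.num_addresses)  # 256
print(vpc_a.overlaps(vpc_b))  # False
```

Keep in mind that AWS reserves five addresses in every subnet, so the usable count per /24 is somewhat lower than 256 once the VPC is split into subnets.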
The NAT gateway is shared via the Transit Gateway and acts as a private gateway for all the workload accounts (the accounts owned by the teams, where they run their own workloads), providing one-way public internet connectivity. It can inspect outgoing traffic and act as a firewall.
The workload accounts have /24 networking ranges, which we concluded is sufficient for most cases. In the exceptional cases where it is not, teams have the autonomy to create their own networking expansion and wire it to the original bootstrapped VPC. This allows teams to grow their private networking without worrying about IP conflicts. When cross-account communication is needed, applications can also talk via Amazon VPC Lattice, controlling access natively via IAM policies.
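As an illustration of IAM-based access control in VPC Lattice, the sketch below builds an auth policy document that only lets named IAM roles invoke a service. The function name, the account ID, and the role name are hypothetical; the real policies our teams use may look different.

```python
import json


def lattice_auth_policy(allowed_role_arns):
    """Build a VPC Lattice auth policy that lets only the given IAM roles invoke a service."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": sorted(allowed_role_arns)},
                "Action": "vpc-lattice-svcs:Invoke",
                "Resource": "*",
            }
        ],
    }


# Hypothetical caller role living in another workload account.
policy = lattice_auth_policy(
    ["arn:aws:iam::111122223333:role/example-api-client"]
)
print(json.dumps(policy, indent=2))
```

A policy like this is only evaluated when the service or service network uses the AWS_IAM auth type, and callers then need to sign their requests with SigV4.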
Each workload account also owns one delegated hosted zone, which allows APIs to be exposed independently under a specific subdomain.
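Delegation of this kind usually comes down to an NS record in the parent zone that points at the name servers of the child zone. The sketch below, assuming Amazon Route 53, builds the change batch for such a record in the shape that `change_resource_record_sets` expects. The subdomain, the name servers, and the function name are all placeholders, not our real values.

```python
def delegation_change_batch(subdomain, name_servers, ttl=300):
    """Build a Route 53 change batch that delegates `subdomain` to `name_servers`."""
    return {
        "Comment": f"Delegate {subdomain} to a workload account",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": subdomain,
                    "Type": "NS",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": ns} for ns in name_servers],
                },
            }
        ],
    }


# Placeholder subdomain and name servers for a hypothetical team.
batch = delegation_change_batch(
    "team-a.example.com",
    ["ns-1.example.net", "ns-2.example.org"],
)
print(batch["Changes"][0]["ResourceRecordSet"]["Name"])  # team-a.example.com
```

Once the delegation exists, the team can manage every record under its own subdomain from within its workload account.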
When it comes to monitoring, we created a Telemetry account. All accounts forward their logs and metrics to this central account, to which every developer and stakeholder has access. This gives us the ability to inspect and react to events taking place in the entire landscape from a central location, improving incident response times and allowing teams to easily help each other.
For CI/CD, teams can also make use of a central account called Shared-Services. Each team has its own ECR repository with cross-account access from their workload accounts, allowing them to take advantage of modern techniques such as build once, deploy many.
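To make the cross-account idea concrete, here is a rough sketch of an ECR repository policy that grants pull access to a set of workload accounts, plus a helper that pins an image by digest so the exact same artifact is deployed everywhere. The account IDs, region, repository name, digest, and function names are made up for the example.

```python
# Actions needed to pull an image from an ECR repository.
PULL_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
]


def ecr_pull_policy(workload_account_ids):
    """Build a repository policy that lets the given accounts pull images."""
    principals = [f"arn:aws:iam::{acct}:root" for acct in sorted(workload_account_ids)]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowWorkloadAccountsToPull",
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": PULL_ACTIONS,
            }
        ],
    }


def image_ref(registry_account, region, repository, digest):
    """Return a digest-pinned image reference: the same bytes in every environment."""
    return f"{registry_account}.dkr.ecr.{region}.amazonaws.com/{repository}@{digest}"


# Hypothetical non-production and production workload accounts.
policy = ecr_pull_policy(["222233334444", "111122223333"])

# Hypothetical image built once in the Shared-Services account.
ref = image_ref("999999999999", "eu-west-1", "team-a/api", "sha256:" + "ab" * 32)
print(ref)
```

Because both environments pull the same digest, what was tested in non-production is byte for byte what runs in production.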
For security reasons, we won’t discuss details about incoming traffic in this post.
WHERE DOES ESSENT STAND WITH CRA2’S IMPLEMENTATION?
The Cloud Platform team is currently facilitating our engineering teams as they transition their services to our new platform. This transition includes both existing services and new ones under development, all integral to Essent's backend re-architecture initiative. Our goal with this re-architecture is to move away from monolithic, non-scalable systems towards a more flexible, serverless architecture.
If you have any questions, would like to better understand the design of our solution or its trade-offs, or want to share information, feel free to leave a comment below and we’ll get in touch!
And if you're interested in being a part of the creation and implementation of such initiatives, please take a look at our current job openings.