CircuitWall Consultancy

Expertise. Innovation. Success


About Me



I’m Andrew Wu, an Engineering Manager based in Stockholm, Sweden, leading high-performing AI/ML teams in developing cutting-edge solutions. With over 15 years of experience in technology and a proven track record in both technical leadership and team management, I specialize in building cohesive, efficient teams that deliver exceptional results in the AI/ML space.

Leadership & Management

  • Building and leading cross-functional AI/ML engineering teams
  • Fostering a culture of innovation and continuous learning
  • Driving technical excellence and best practices
  • Mentoring and developing technical talent
  • Improving team efficiency and delivery processes

Technical Expertise

  • AI/ML Engineering

    • Machine Learning Operations (MLOps)
    • Production ML systems architecture
    • AI solution design and implementation
    • Data pipeline optimization
    • Model development and deployment
  • Technology Leadership

    • Cloud architecture and migration strategies
    • Enterprise application development
    • Big data infrastructure and analytics
    • System scalability and performance
    • Security and compliance

Professional Background

  • 15+ years in software engineering and technology
  • Extensive experience in Java, Python, and cloud technologies
  • Deep expertise in big data processing and ML systems
  • Strong focus on cloud-native architectures
  • Proven track record in team leadership and project delivery

I’m passionate about helping organizations build and scale their AI/ML capabilities while maintaining high team engagement and technical excellence. Whether you’re looking to enhance your ML operations, build a high-performing team, or need strategic technical guidance, I’m here to help.

Feel free to contact me or connect on LinkedIn to discuss how we can work together.

Recent Updates


Big Data & Machine Learning

Transform your data into actionable insights with cutting-edge Big Data and Machine Learning solutions. From data pipeline architecture to production-ready ML models, I help organizations harness their data for competitive advantage.

  • Data Engineering

    • Scalable data pipelines
    • Real-time processing
    • Data warehouse architecture
    • ETL optimization
  • Machine Learning

    • Custom model development
    • Real-time recommendations
    • Fraud detection
    • NLP & Computer Vision
    • Time series forecasting
  • Technology Stack

    • Processing: Apache Storm, Flink, Spark
    • Data Platform: Kafka, Cassandra, HDFS
    • Analytics: Athena, BigQuery
    • ML/AI: TensorFlow, PyTorch, Vertex AI
    • MLOps: Kubeflow, MLflow, DVC

Benefits

  • Data-driven decisions
  • Automated processes
  • Predictive capabilities
  • Real-time insights
  • Scalable solutions

Cloud Technology

Transform your business with enterprise-grade cloud solutions that drive innovation, reduce costs, and accelerate time to market. With extensive experience in major cloud platforms, I help organizations optimize their cloud infrastructure for maximum efficiency and scalability.

  • Cloud Migration & Modernization

    • Assessment and migration planning
    • Cloud-native transformation
    • Infrastructure as Code (IaC)
    • Cost optimization
  • Multi-Cloud Solutions

    • Google Cloud Platform: BigQuery, Cloud Storage, Vertex AI, Cloud Functions, Cloud Run
    • Amazon Web Services: EC2, VPC, S3, Athena, Lambda
    • Platform Engineering: Cloud Foundry, Kubernetes, CI/CD, Security & Compliance

Benefits

  • Reduced operational costs
  • Enhanced scalability and reliability
  • Improved security and compliance
  • Faster time to market
  • Data-driven decision making

Enterprise Application Development

Leverage 15+ years of enterprise software development expertise to build robust, scalable applications that drive business value. Specializing in modern architecture patterns and cloud-native development.

  • Enterprise Solutions

    • Microservices architecture
    • Cloud-native development
    • Legacy system modernization
    • API design and integration
  • Technology Stack

    • Backend: Java/Spring (10+ years), Python (5+ years), Golang (2+ years)
    • Frontend & API: Modern JavaScript, REST/GraphQL
    • Database: MySQL, PostgreSQL, MongoDB, Redis
    • MLOps: Vertex AI, ML pipelines, Model monitoring
  • Development Practices

    • Agile/CI/CD
    • Test-driven development
    • Infrastructure as Code
    • Security-first design

Benefits

  • Faster time-to-market
  • Improved performance
  • Scalable architecture
  • Lower maintenance costs

How Cursor AI Transformed This Website

Transforming a Website with Cursor AI: A Success Story

As a software engineer and tech enthusiast, I’m always excited to try new tools that promise to improve development workflows. Recently, I had the opportunity to use Cursor AI to upgrade this website, and the results were nothing short of impressive. In this post, I’ll share how Cursor AI helped transform this site from a basic portfolio into a modern, feature-rich platform.

Key Improvements

1. Content Enhancement

  • Grammar and Technical Writing: Cursor AI helped polish four technical articles (Kafka Basics, Kafka Multi-DC, Intro to Ratchet, and DIS2022), improving clarity and professionalism while maintaining technical accuracy.
  • Service Pages: Condensed and optimized three service pages (Going Cloud, Application Development, Big Data & ML) for better readability and impact.
  • About Page: Updated to better reflect current role and expertise in AI/ML leadership while maintaining technical credibility.

2. Modern UI Implementation

Services Section

  • Implemented a responsive card-based layout
  • Added smooth hover effects and transitions
  • Improved image handling with dynamic scaling
  • Enhanced visual hierarchy with subtle shadows
  • Optimized spacing and typography
<div class="service-card">
    <div class="card-img-container">
        <img class="card-img-top" src="img/service.jpg" alt="Service">
    </div>
    <div class="card-body">
        <h4 class="card-title">Service Title</h4>
        <div class="card-text">Content</div>
    </div>
</div>

Blog Section

  • Created a dedicated blog page with modern card layout
  • Implemented featured images and summaries
  • Added reading time estimates
  • Integrated keyword highlighting and linking
  • Improved mobile responsiveness

3. Technical Enhancements

Smart Keyword System

One of the most interesting features implemented was the automatic keyword highlighting system:

const keywords = {
    'Kafka': '/tags/kafka',
    'Machine Learning': '/tags/machine-learning',
    'Cloud': '/tags/cloud',
    // ... more keywords
};

function highlightKeywords(node) {
    // Intelligent keyword detection
    // Handles nested content
    // Preserves HTML structure
    // Links to relevant tag pages
}

Layout Optimization

  • Implemented wider containers for better content display
  • Added responsive breakpoints for various screen sizes
  • Optimized image loading and display
  • Enhanced typography and spacing

The Power of AI Pair Programming

What made this experience unique was how Cursor AI functioned as an intelligent pair programming partner:

  1. Context Awareness: Cursor understood the existing codebase and made suggestions that fit the established patterns and styles.

  2. Intelligent Refactoring: Instead of just making simple changes, Cursor proposed comprehensive improvements:

    • Restructuring layouts for better maintainability
    • Adding modern CSS features while maintaining compatibility
    • Implementing best practices for performance
  3. Problem Solving: When faced with challenges like the long homepage, Cursor suggested and implemented the solution of moving posts to a dedicated page with proper navigation.

Technical Implementation Details

Modern CSS Features

.container-wide {
    width: 100%;
    max-width: 1400px;
    margin-right: auto;
    margin-left: auto;
    padding: 0 30px;
}

.post-card {
    transition: transform 0.3s ease;
    &:hover {
        transform: translateY(-5px);
    }
}

Responsive Design

@media (max-width: 768px) {
    .container-wide {
        padding: 0 15px;
    }
    .post-card {
        margin-bottom: 20px;
    }
}

Introducing Ratchet at NDSML Summit & Google FSI 2023

At the NDSML Summit 2023, the premier annual event for the Nordic Data Science and Machine Learning community, I had the privilege of presenting “Elevating Your Existing Models to Vertex AI” alongside my colleague Lef Filippakis. Together, we shared our expertise in AI and cloud computing with an engaged audience.

NDSML Summit 2023

Vertex AI is Google Cloud’s unified platform for building and managing machine learning solutions throughout the entire ML lifecycle. It seamlessly integrates Google Cloud’s AI tools and services, providing comprehensive support for training, deployment, monitoring, and optimization of machine learning models.

Our presentation focused on demonstrating how organizations can leverage Vertex AI’s capabilities with their existing models. We showcased straightforward migration paths from popular frameworks like TensorFlow, PyTorch, scikit-learn, and XGBoost to Vertex AI. The presentation highlighted key benefits such as scalability, multi-framework support, and seamless integration with other Google Cloud services.

A centerpiece of our presentation was Ratchet, a tool we developed to streamline the process of converting existing models into Vertex AI-compatible formats. Ratchet supports major frameworks including TensorFlow, PyTorch, scikit-learn, and XGBoost, while automating model uploads to Vertex AI and endpoint creation for serving. Its simplicity and power make it an invaluable tool for organizations looking to modernize their ML infrastructure.
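
For readers who want a feel for what that automation replaces, here is a rough sketch of the manual path using the google-cloud-aiplatform Python SDK. This is not Ratchet's code; the project, bucket, and container image values are illustrative:

from google.cloud import aiplatform

# Point the SDK at a project and region (illustrative values)
aiplatform.init(project="my-project", location="europe-west4")

# Upload an existing model artifact (e.g. an exported scikit-learn model)
# into the Vertex AI Model Registry, using a prebuilt serving container.
model = aiplatform.Model.upload(
    display_name="existing-sklearn-model",
    artifact_uri="gs://my-bucket/models/sklearn/",
    serving_container_image_uri="europe-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest",
)

# Create an endpoint and deploy the model for online serving.
endpoint = model.deploy(machine_type="n1-standard-2")

# Request a prediction from the deployed endpoint.
print(endpoint.predict(instances=[[0.1, 0.2, 0.3, 0.4]]).predictions)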

Key takeaways from our presentation:

  • Vertex AI provides a unified platform for end-to-end ML lifecycle management
  • Seamless integration of Google Cloud’s AI tools and services
  • Simple migration paths from popular ML frameworks to Vertex AI
  • Robust scalability to meet growing demands
  • Comprehensive support for multiple frameworks and languages
  • Deep integration with Google Cloud’s ecosystem
  • Ratchet enables easy conversion of existing models to Vertex AI formats
  • Automated model deployment and endpoint creation through Ratchet

The presentation was met with enthusiasm and generated valuable discussions about modernizing ML workflows with Vertex AI. For those who couldn’t attend, additional information and resources are available on the NDSML Summit 2023 website.

Reflections on Data Innovation Summit 2022

I recently had the honor of presenting at the Data Innovation Summit 2022, the premier annual Data and AI event in the Nordic region. My presentation, titled “Scaling ML in One of Europe’s Hottest Fintech Companies,” resonated strongly with the audience and generated engaging discussions about the challenges and opportunities in scaling machine learning operations.

Data Innovation Summit 2022

Recognition in Nordic 100

I am deeply honored to be named to the Nordic 100 in Data, Analytics & AI list. This prestigious recognition, curated by the Hyperight editorial team, highlights 100 practitioners and leaders who are actively advancing Data and AI innovation capabilities in the Nordic region. The complete list of Nordic 100 in Data, Analytics & AI can be found here.

Nordic 100 Recognition

Presentation Highlights

My presentation focused on Tink’s journey in scaling its machine learning capabilities. Key topics included:

  • The evolution from a single handcrafted data model to a comprehensive ML product offering
  • Strategic approaches to scaling teams and infrastructure
  • Critical factors in technology stack selection, including trade-offs and lessons learned
  • Future roadmap, including:
    • Legacy model modernization
    • Integration with data labeling platforms
    • Accelerating new model development
    • Leveraging Kubeflow pipelines and AutoML on Vertex AI

The Data Innovation Summit 2022 provided an invaluable platform for knowledge exchange and networking with industry leaders. The discussions and insights shared will continue to influence the direction of data innovation in the Nordic region and beyond.


Kafka 1: Basics


As a consultant, admitting “I don’t know” can be challenging. A few months ago, I began working as a DevSecOps engineer on a large Kafka (Confluent) installation for a bank, despite having only limited knowledge of Kafka at the time.

I’m writing this article to share key insights gained from working on and tuning this multi-datacenter setup. While I may not cover every important topic, I welcome feedback to help improve this guide.



What is Kafka?

Apache Kafka is an open-source stream-processing software platform originally developed by LinkedIn. It provides three core capabilities:

  • Publishing and subscribing to streams of records, similar to a message queue or enterprise messaging system
  • Storing streams of records in a fault-tolerant, durable way
  • Processing streams of records in real-time

Who is Using Kafka?

For an extensive list of companies and organizations using Kafka, visit: https://cwiki.apache.org/confluence/display/KAFKA/Powered+By


Kafka Components


Kafka Broker

One or more brokers form a Kafka cluster; a single broker is sometimes called a Kafka server.

This also shows that Kafka is a brokered message queue system (ZeroMQ, for example, is a non-brokered one).

See this post here: https://stackoverflow.com/questions/39529747/advantages-disadvantages-of-brokered-vs-non-brokered-messaging-systems


Kafka Topics

A topic is a queue or category of messages. Each topic is made up of a number of partitions.

Each topic is controlled mainly by a few attributes: the number of replicas, the number of partitions, and the retention time.

topic anatomy

Since Kafka is pub-sub, each consumer group maintains its own offsets, so clients can proceed at their own pace.

There are two types of retention policy (a topic-creation sketch follows this list):

  • Delete: Discard messages that are too old or exceed the size limit. This is useful for normal event logs.
  • Compact: The log is compacted per key, keeping only the latest message for each key. This is useful when treating a topic as a key-value store.
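
As a concrete illustration, here is a minimal sketch of creating topics with delete and compact retention using the confluent-kafka Python AdminClient (broker address and topic names are illustrative):

from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topics = [
    # Normal event log: old segments are deleted after the retention period
    NewTopic("click-events", num_partitions=6, replication_factor=3,
             config={"cleanup.policy": "delete", "retention.ms": "604800000"}),
    # Key-value style topic: only the latest message per key is kept
    NewTopic("user-profiles", num_partitions=3, replication_factor=3,
             config={"cleanup.policy": "compact"}),
]

# create_topics() returns a future per topic; wait for each one to complete
for topic, future in admin.create_topics(topics).items():
    try:
        future.result()
        print(f"Created topic {topic}")
    except Exception as exc:
        print(f"Failed to create {topic}: {exc}")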

Kafka Partitions

The partition is THE atomic unit in terms of storage, reads, writes, and replication.

  • Number of partitions is the MAX parallelism of a topic.
  • Messages in a partition have strong ordering.

Messages with the same key are sent to the same partition, and a partition handles messages for multiple keys.

When a partition is replicated, each replica is either the leader, an in-sync replica (ISR), or a follower.

A message/record can be written only to the leader replica, though messages can be read from any replica.

topic anatomy

Kafka partitions are assigned to brokers:

  • when a topic is created for the first time, and
  • when partitions are manually re-assigned.

So assignments are static most of the time, even when a node failure strikes.


Zookeeper

Zookeeper is an inseparable part of a Kafka cluster, although it is not used all the time. That said, Zookeeper is needed for starting Kafka and for failure handling, but not for the normal running of the cluster.

  • Controller election

The controller is one of the most important broker roles in a Kafka ecosystem; it maintains the leader-follower relationship across all partitions. If a node shuts down for some reason, it is the controller's responsibility to tell replicas on other nodes to take over as partition leaders for the partitions that were led by the failing node. Whenever a node shuts down, a new controller can be elected, and it is guaranteed that at any given time there is only one controller, which all the follower nodes have agreed on.

  • Configuration Of Topics

Zookeeper stores the configuration of all topics, including the list of existing topics, the number of partitions for each topic, the location of all replicas, per-topic configuration overrides, and which node is the preferred leader.

  • Access control lists

Access control lists or ACLs for all the topics are also maintained within Zookeeper.

  • Membership of the cluster

Zookeeper also maintains a list of all the brokers that are functioning at any given moment and are a part of the cluster.


Kafka Producer

Any component in this landscape that sends data to Kafka is, by definition, a Kafka producer and uses the producer API at some point.

The main configuration settings of the Kafka producer API are listed below; a minimal producer sketch follows the list:

  • [client.id]: It identifies the producer application.

  • [producer.type]: Either sync or async.

  • [acks]: Controls the criteria under which producer requests are considered complete.

    • -1/all: Wait until ALL in-sync replicas have written the record. (Slowest/most reliable)
    • 0: Do not wait for any acknowledgment from the broker. (Fastest/least reliable)
    • 1: Wait until the leader has written the record. (Fast; reliability also depends on the min-ISR setting)
  • [retries]: If a produce request fails, the producer automatically retries it up to the specified number of times.

  • [bootstrap.servers]: The list of brokers used to bootstrap the connection to the cluster.

  • [linger.ms]: The producer waits up to linger.ms to batch records before sending them to the broker. Micro-batching can significantly improve throughput, but it also adds latency to each request.

  • [key.serializer]: The serializer class used for record keys.

  • [value.serializer]: The serializer class used for record values.

  • [batch.size]: The maximum size of a batch of records, in bytes.

  • [buffer.memory]: The total amount of memory available to the producer for buffering.
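
To make these settings concrete, here is a minimal producer sketch using the confluent-kafka Python client (my choice of client; any library works similarly). Broker addresses, topic, and record contents are illustrative, and key/value serialization is handled by the application in this client:

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9092,broker2:9092",
    "client.id": "demo-producer",
    "acks": "all",       # wait for all in-sync replicas
    "retries": 5,        # retry failed produce requests automatically
    "linger.ms": 20,     # micro-batch for up to 20 ms before sending
})

def on_delivery(err, msg):
    # Called once per record with the final delivery result
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()}[{msg.partition()}] at offset {msg.offset()}")

# Records with the same key end up in the same partition
producer.produce("events", key="user-42", value="signed_up", on_delivery=on_delivery)
producer.flush()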


Kafka Consumer

Like a Kafka producer, an application that reads from Kafka uses the consumer API at some point. And here comes a connection to the number of partitions and a concept called the consumer group.

A consumer group is a way to consume records from Kafka in parallel. Each partition is consumed by exactly one consumer in the group, so the maximum consumer parallelism for a topic is its number of partitions.

Consumer group

The main configuration settings of the consumer client API are listed below; a minimal consumer sketch follows the list:

  • [bootstrap.servers]: The list of brokers used to bootstrap the connection to the cluster.

  • [group.id]: Assigns an individual consumer to a consumer group.

  • [enable.auto.commit]: Enables automatic offset commits when set to true; otherwise offsets must be committed manually.

  • [auto.commit.interval.ms]: How often the consumer's consumed offsets are committed.

  • [session.timeout.ms]: The timeout used to detect consumer failures: if the group coordinator receives no heartbeat from the consumer within this many milliseconds, the consumer is removed from the group and its partitions are rebalanced.
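
And the consumer side, again as a minimal sketch with the confluent-kafka Python client (broker addresses, group, and topic are illustrative):

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker1:9092,broker2:9092",
    "group.id": "demo-consumer-group",     # consumers sharing this id split the partitions
    "enable.auto.commit": True,
    "auto.commit.interval.ms": 5000,
    "session.timeout.ms": 10000,
    "auto.offset.reset": "earliest",       # where to start when the group has no committed offset
})

consumer.subscribe(["events"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)   # None if nothing arrived within the timeout
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        print(f"{msg.topic()}[{msg.partition()}]@{msg.offset()}: {msg.value()}")
finally:
    consumer.close()                       # leave the group cleanly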


Kafka Connect

Kafka Connect is a common framework for moving records into and out of a Kafka cluster.

Why use Kafka Connect?

  • Auto-recovery After Failure

A "source" connector can attach arbitrary "source location" information to each record, which it passes to Kafka Connect. At the time of a failure, Kafka Connect automatically provides this information back to the connector so it can resume where it failed. Auto-recovery for "sink" connectors is even easier.

  • Auto-failover

Auto-failover is possible because the Kafka Connect workers form a cluster. If one node fails, the work it was doing is redistributed to the other nodes.

  • Simple Parallelism

A connector can break its data import or export work into tasks, which execute in parallel.

  • Community and existing connectors (Incomplete list of existing connectors)
    • Kafka Connect ActiveMQ Connector
    • Kafka FileStream Connectors
    • Kafka Connect HDFS
    • Kafka Connect JDBC Connector
    • Kafka Connect S3
    • Kafka Connect Elasticsearch Connector
    • Kafka Connect IBM MQ Connector
    • Kafka Connect JMS Connector
    • Kafka Connect Cassandra Connector
    • Kafka Connect GCS
    • Kafka Connect Microsoft SQL Server Connector
    • Kafka Connect InfluxDB Connector
    • Kafka Connect Kinesis Source Connector
    • Kafka Connect MapR DB Connector
    • Kafka Connect MQTT Connector
    • Kafka Connect RabbitMQ Source Connector
    • Kafka Connect Salesforce Connector
    • Kafka Connect Syslog Connector

In case you need to develop a new connector, Kafka Connect provides the following (a minimal example of the REST interface follows the list):

  • A common framework for Kafka connectors: It standardizes the integration of other data systems with Kafka and simplifies connector development, deployment, and management.

  • Distributed and standalone modes: Scale up to a large, centrally managed service supporting an entire organization, or scale down to development, testing, and small production deployments.

  • REST interface: Connectors can be submitted to and managed on a Kafka Connect cluster through an easy-to-use REST API.

  • Automatic offset management: Kafka Connect can manage the offset commit process automatically with just a little information from connectors, so connector developers do not need to worry about this error-prone part of connector development.

  • Distributed and scalable by default: It builds on the existing group management protocol, and a Kafka Connect cluster can be scaled up by adding more workers.

  • Streaming/batch integration: Kafka Connect is an ideal solution for bridging streaming and batch data systems.
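
As a small illustration of the REST interface, here is a sketch that registers a FileStream source connector on a Connect cluster (host, file path, and topic are illustrative; 8083 is the default Connect REST port):

import requests

CONNECT_URL = "http://localhost:8083"

# Connector definition: tail a file and publish each line to a topic
connector = {
    "name": "demo-file-source",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/var/log/app.log",
        "topic": "app-logs",
    },
}

# Submit the connector, then list connectors and check its status
resp = requests.post(f"{CONNECT_URL}/connectors", json=connector)
resp.raise_for_status()
print(requests.get(f"{CONNECT_URL}/connectors").json())
print(requests.get(f"{CONNECT_URL}/connectors/demo-file-source/status").json())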


Schema Registry

Schema Registry stores a versioned history of all schemas and allows schemas to evolve according to the configured compatibility settings. It also provides plugins for Kafka clients that handle schema storage and retrieval for messages sent in Avro format.

Why do we need schema in the first place?

Kafka sees every record as raw bytes, so schemas live at the application level. The producer and consumer are very likely not the same application or even in the same code base, so they need a way to collaborate on the record format.

A schema registry is here to:

  • Reduce payload size: Instead of sending data with a header or full JSON structure, only the actual payload needs to be passed to Kafka.

  • Data validation and evolution: Invalid messages never even reach Kafka, and a schema can be evolved to the next version without breaking existing parts.

  • Schema access: Instead of distributing class definitions, the object structure can be distributed via a RESTful API, along with all previous versions.

Schema registry

In addition to schema management, using a schema also reduces the size of the records sent to Kafka.

If you send a message as JSON, 50% or more of the payload can be taken up by the message structure itself. With a schema registry, you only need to transfer a schema identifier along with the payload.

A typical workflow using a schema registry (a minimal Avro producer sketch follows these steps):

  • The serializer places a call to the schema registry to see if it has a format for the data the application wants to publish. If it does, the schema registry passes that format to the application's serializer, which uses it to filter out incorrectly formatted messages.

  • Once the schema check passes, the record is serialized automatically with no extra effort on your part. The message is then, as expected, delivered to the Kafka topic.

  • Your consumers handle deserialization, making sure your data pipeline can evolve quickly and continue to carry clean data. You simply need to have all applications call the schema registry when publishing.
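
Here is a minimal Avro producer sketch using the confluent-kafka Python client with its Schema Registry support (my choice of client; requires the Avro extras). The registry URL, topic, and schema are illustrative:

from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Avro schema describing the record value
schema_str = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
  ]
}
"""

# The serializer registers/looks up the schema and prepends only a small
# schema id to each message instead of the full structure.
registry = SchemaRegistryClient({"url": "http://localhost:8081"})
avro_serializer = AvroSerializer(registry, schema_str)

producer = Producer({"bootstrap.servers": "localhost:9092"})
value = avro_serializer({"name": "Alice", "age": 30},
                        SerializationContext("users", MessageField.VALUE))
producer.produce("users", value=value)
producer.flush()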


Kafka 2: Multi Datacenter Setup


Organizations typically consider multi-datacenter setups for two main reasons:

  • Business expansion into different geographical regions
  • Enhanced reliability requirements that take precedence over pure performance

When implementing a multi-datacenter setup, there are three critical aspects to consider during both normal operations and disaster scenarios:

  • Consistency: Will data be lost if a datacenter fails and operations switch to another datacenter?
  • Availability: Will a datacenter failure cause partitions to go offline? This can happen due to:
    • No in-sync replicas being available
    • Zookeeper cluster failing to form a quorum and elect new leaders
    • Broker unavailability
  • Performance: What are the expected latency and throughput characteristics?

These considerations are closely tied to how producers receive confirmation of message writes, which is controlled by the acks parameter:

  • -1/all: All In-Sync Replicas (ISR) must report back and acknowledge the commit - slowest but most consistent
  • 0: Messages are considered written as soon as the broker receives them (before commit) - fastest but least consistent
  • 1: Messages are confirmed when the leader commits them - most common setting balancing speed and data security

Before discussing different setups, here are some key abbreviations used throughout this article:

nr-dc:      Number of data centers
nr-rep:     (default.replication.factor) Number of replicas
ISR:        In-sync replicas
min-ISR:    (min.insync.replicas) Minimum in-sync replicas

All scenarios discussed below assume producer setting acks=1 and topics created with default settings.


Two data centers and 2.5 data centers

Note: 2.5 DC is basically 2 DCs plus Zookeeper on a third DC.


To set the tone here: a two-data-center setup for Kafka is NOT ideal!


There are trade-offs with two-DC setups. Before continuing, you need to understand the CAP theorem. According to LinkedIn engineers, Kafka is a CA system: if a network partition happens, you end up with either a split-brain situation or a failed cluster. Shutting down part of the cluster to prevent split brain is a deliberate choice.

When it comes to two-DC setups, you cannot even achieve CA across the whole cluster. Having said that, you can still tune individual topics for either consistency OR availability.


Stretched cluster

With rack awareness, two data centers are joined as one cluster. The rack awareness feature spreads replicas of the same partition across different racks/data centers. This extends the guarantees Kafka provides for broker failure to cover rack failure, limiting the risk of data loss should all the brokers in a rack/data center fail at once.
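
Rack awareness is enabled by tagging each broker with its rack, which in this setup is its data center; a minimal sketch of the broker configuration (values are illustrative):

# server.properties on brokers in data center 1
broker.rack=dc1

# server.properties on brokers in data center 2
broker.rack=dc2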


Consistency first configuration

min-ISR > round(nr-rep/nr-dc)

The minimum number of in-sync replicas is larger than the number of replicas assigned per data center.
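
For example, with two data centers and illustrative numbers:

nr-rep  = 4     rack awareness places 2 replicas in each data center
min-ISR = 3     3 > round(4/2) = 2, so at least one in-sync replica must live in the other DC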

Consistency first fail


Consistency

With this setup, at least ONE in-sync replica is guaranteed to be in the other data center. When one data center goes offline, the data is already secured in the other data center.

Availability

If one data center goes down, partitions go offline because there are no longer enough in-sync replicas (even though at least one replica is still in sync).

The way to re-enable the cluster is to either:

  • Change the min-ISR setting to less than the number of replicas per data center, a manual intervention.
  • Re-assign the partition allocation, also a manual intervention.

In a single-tenant environment, producers need to be able to buffer messages locally until the cluster is functioning again.

In a multi-tenant environment, there will probably be message loss.

Performance

Whichever partition leader you write to needs to replicate to the other data center, so your write latency is the sum of the normal latency plus the latency between data centers.


Availability first configuration

nr-rep – min-ISR >= nr-rep/nr-dc

More nodes can be down at the same time (nr-rep - min-ISR) than there are replicas per data center.
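
For example, with two data centers and illustrative numbers:

nr-rep  = 4     rack awareness places 2 replicas in each data center
min-ISR = 2     4 - 2 = 2 >= 4/2, so all the replicas in one data center can be down at once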

Availability first

There are actually two cases here:

  • The ISRs all reside in the same data center. Surprise: rack awareness does not take ISR status into account.
  • The ISRs are spread out across different data centers.

Availability first fail

With so many partitions in the cluster, you won't avoid the first case anyway.

Why is this important? Because, by default, a non-ISR replica cannot be elected as the leader; if there is no ISR available, the partition goes offline.

To make this work, there is another piece to the puzzle: unclean.leader.election.enable=true.

Here is what happens when one data center fails.

Availability first dc fail


Consistency

This setting allows out-of-sync followers to become leaders, so you will have missing data for records that did not sync in time, and there will definitely be some. When the old leaders come back online, they may bring back the missing records and perhaps create duplicates as well.

Availability

This setup allows one data center to go down entirely.

Performance

A lower min-ISR means better performance, especially since data does not necessarily have to be synced to the other data center before responding to the producer.


2.5 data centers

As I mentioned before, to perform any failover recovery you need a Zookeeper quorum; the previous setups only discussed partitions.

In the previous setups, you need to form a Zookeeper cluster (an odd number of nodes) across two data centers, which means, for example, one node in DC1 and two in DC2, or two in DC1 and three in DC2.

If the data center with more Zookeeper nodes goes down, it takes the Zookeeper cluster with it, because Zookeeper needs a majority to perform an election.

Placing 1/3/5 Zookeeper node(s) in a third data center greatly reduces this risk to the Zookeeper cluster.


Replicated cluster

From a producer's or consumer's perspective, there are two clusters, and data is synced between them with an almost guaranteed delay.

There is a set of popular tools:

  • MirrorMaker - Apache
  • Proprietary replicators - Confluent / Uber / IBM

Why? Because MirrorMaker from Kafka is far from perfect:

  • It replicates data but not topic configuration.
  • It takes a long time and sometimes gives up on rebalancing when mirroring clusters with a lot of partitions. This also affects you every time you need to add a new topic.

Active-Active and Active-Passive are the two approaches to replicated clusters:

AA: Producers and consumers may use both clusters to produce and consume data. Because of the replication delay, ordering is not guaranteed outside of a single data center.

AP: The other cluster is a hot standby, and switching to it requires reconfiguration on the client side.


Consistency

There will be data that cannot be synced in time, and that creates inconsistency.

Availability

With the active-active design, you can in theory (though it is not recommended) put both clusters into the bootstrap server list, but that sacrifices ordering and a bit of performance as well, since you may be connecting to the server that is geographically further away.

Other than the case described above, if a data center goes offline you need to perform a manual recovery and switch to the other cluster.

Performance

I think this may be one of the biggest advantages: clients can choose to talk to the cluster that is geographically closer.


Three data centers or more

min-ISR > round(nr-rep/nr-dc), the same formula as the consistency-first setup.

A stretched cluster across three data centers is, in my opinion, the best setup. Consistency and availability can be achieved with little effort.

nr-rep  = nr-dc * n
min-ISR = n+1

Where n=1/2/3… and n+1 <= number of nodes per data center.

Here is a simulation for 3 replicas and min-ISR=2 setup:

Three dc setup


Consistency

When a record is acknowledged as delivered, it is guaranteed that a replica lives in another data center.

Availability

The cluster can suffer the total loss of ONE data center and still be available for reads and writes.

Performance

The same performance characteristics as the two-data-center stretched cluster: all requests incur the added latency of syncing a replica to another data center.

Summary

With one data center, Kafka can achieve consistency and availability at the local level. When it comes to a multi-datacenter setup, you are preventing incidents that happen only once every few years.

Although data center failure is rare, it is important to feel safe and prepared for such situations.

Choose and implement a solution that fits your needs, and remember that you can customize topics individually, since most of the configurations are per topic. If possible, go for the three-data-center setup; it can achieve both consistency and availability at the cluster level with little overhead.

I am interested in site reliability topics in general, so let me know your thoughts!
