My name is Andrew Wu, and I live and work in Stockholm, Sweden.
I am a Big Data and Machine Learning engineer with over 15 years of experience. I have a strong background in Java and Python development, as well as experience in IaC and machine learning.
My skills include:
I am happy to help out if you need any advice or assistance.
Feel free to contact me, and you are welcome to connect with me on LinkedIn.
Big data and machine learning are two powerful technologies that are changing the way businesses operate. By harnessing the power of big data, businesses can collect and analyze vast amounts of information to gain insights that can help them improve their operations. Machine learning algorithms can then be used to automate tasks, make predictions, and even generate new ideas.
Big data and machine learning are already being used by businesses in a variety of ways, including:
Big data and machine learning are still relatively new technologies, but they have the potential to revolutionize the way businesses operate. By embracing these technologies, businesses can gain a competitive advantage and achieve their goals faster.
Key Competences and Technologies
Cloud computing is a type of computing that relies on shared resources, such as hardware, software, and data, that are delivered and managed over the Internet (“the cloud”). Companies of all sizes use the cloud to lower costs, increase agility, and innovate faster.
Key Competences and Technologies
I have over 15 years of experience in application development, and I am proficient in a wide range of languages and technologies, including Java SE/EE, Spring, Spring Boot, Python 3, Golang, JavaScript, CSS, REST APIs, MySQL, MongoDB, and Riak.
I have a strong foundation in software engineering principles, and I am passionate about building maintainable, high-performance applications. I am also a strong advocate for continuous integration and continuous delivery (CI/CD).
I am confident that I have the skills and experience that you are looking for in an application developer. I am eager to learn new things and I am always looking for ways to improve my skills. I am confident that I can make a significant contribution to your team.
Key Competences and Technologies
I had the pleasure of presenting “Elevating Your Existing Models to Vertex AI” at the NDSML Summit 2023, the annual event that brings together the Data Science and Machine Learning community in the Nordics. My co-presenter Lef Filippakis and I are experts in the field of AI and cloud computing. Vertex AI is a unified platform for building and managing machine learning solutions across the entire ML lifecycle. It allows you to use the best of Google Cloud’s AI tools and services in a seamless and integrated way. Whether you want to train, deploy, monitor, or optimize your models, Vertex AI has you covered.
In our presentation, we showed how to take advantage of Vertex AI’s features and benefits using existing models that you have already built and trained. We demonstrated how to migrate models from TensorFlow, PyTorch, scikit-learn, and XGBoost to Vertex AI with ease. We also highlighted the benefits of using Vertex AI, such as its ability to scale to meet your needs, its support for multiple frameworks and languages, and its integration with other Google Cloud services.
One of the tools that we created and used in our presentation is called Ratchet. Ratchet is a tool that allows you to easily convert your existing models into Vertex AI compatible formats. Ratchet supports various frameworks and languages, such as TensorFlow, PyTorch, scikit-learn, and XGBoost. Ratchet also automates the process of uploading your models to Vertex AI and creating endpoints for serving them. Ratchet is a simple and powerful tool that can help you elevate your existing models to Vertex AI.
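Ratchet itself is an internal tool, so purely as an illustration, here is a minimal sketch of the underlying steps using the plain Vertex AI Python SDK. The project ID, region, bucket path and serving container image below are placeholder assumptions, not our actual setup.

```python
# Minimal sketch of "elevating" an existing model to Vertex AI with the
# google-cloud-aiplatform SDK (Ratchet wraps steps like these).
# Project, bucket path and serving image are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="europe-west4")

# Upload a pre-trained model artifact (e.g. an exported scikit-learn model)
# together with a prebuilt serving container image.
model = aiplatform.Model.upload(
    display_name="existing-sklearn-model",
    artifact_uri="gs://my-bucket/models/sklearn/",
    serving_container_image_uri=(
        "europe-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
)

# Deploy the uploaded model to a managed endpoint for online prediction.
endpoint = model.deploy(machine_type="n1-standard-2")
print(endpoint.resource_name)
```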
Here are some key takeaways from our presentation:
I recently had the opportunity to deliver a presentation at the Data Innovation Summit 2022, the largest and most influential annual Data and AI event in the Nordics and beyond. My presentation, "Scaling ML in one of Europe's hottest fintech companies", was well received by the audience. I received positive feedback from several attendees, who found the insights I shared valuable and thought-provoking.
I am thrilled to share that I have been named to the Nordic 100 in Data, Analytics & AI list, an independent list curated by the Hyperight editorial team that features 100 Data, Advanced Analytics & AI practitioners or individuals strongly dedicated to supporting the data community and accelerating the Data and AI innovation capabilities in the Nordic region. Being included in this list is a great honor, and I am grateful for the recognition. Please find the complete list of Nordic 100 in Data, Analytics & AI here.
In my presentation, I discussed the data challenges faced by fintech companies and how Tink has taken an iterative approach to scaling its tools and team, resulting in impressive growth from one handcrafted data model and zero data scientists to today's ML product offering. I also shared how we scaled our teams to meet these challenges and how we chose our stack, with pros and cons. Finally, I discussed Tink's accomplishments and future plans, such as upgrading our oldest model, integrating our models with our data labeling platform, accelerating the creation of new models, and leveraging Kubeflow pipelines and AutoML on Vertex AI.
Overall, my experience at the Data Innovation Summit 2022 was incredibly rewarding. I had the opportunity to connect with other professionals in the field, learn about the latest trends and innovations, and share my own insights and experiences. I look forward to attending future events and continuing to contribute to the data community.
As a consultant, it is hard to say "I don't know". With only very limited knowledge of Kafka, I started working a few months ago as a DevSecOps engineer on a large Kafka (Confluent) installation for a bank.
I am writing this from my own perspective on the key takeaways after working on and tuning this multi-DC setup. There will be topics you feel are important that are not covered here; please let me know so I can improve this.
Apache Kafka is an open-source stream-processing software platform developed by LinkedIn.
https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
One or more brokers form a Kafka cluster, sometimes also called a Kafka server.
This also shows that Kafka is a brokered message queue system (ZeroMQ, for example, is a non-brokered one).
See this post here: https://stackoverflow.com/questions/39529747/advantages-disadvantages-of-brokered-vs-non-brokered-messaging-systems
A topic is a queue or category of messages. A topic is made up of a number of partitions.
Each topic is controlled mainly by several attributes: the number of replicas, the number of partitions and the retention time.
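To make these attributes concrete, here is a minimal sketch of creating a topic with the confluent-kafka Python admin client; the broker address, topic name and numbers are placeholder assumptions.

```python
# Minimal sketch: creating a topic and setting its main attributes
# (partitions, replication factor, retention) with the confluent-kafka client.
# Broker address and topic name are placeholders.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker1:9092"})

topic = NewTopic(
    "payments",
    num_partitions=6,          # parallelism / maximum consumer fan-out
    replication_factor=3,      # number of replicas per partition
    config={"retention.ms": "604800000"},  # keep records for 7 days
)

# create_topics() returns a dict of topic name -> future.
for name, future in admin.create_topics([topic]).items():
    future.result()  # raises if creation failed
    print(f"created topic {name}")
```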
Since Kafka is pub-sub, each consumer group uses its own offsets, so clients can proceed at their own pace.
There are two types of retention policy:
A partition is THE atomic unit in terms of storage, reads, writes and replication.
Messages with the same key are sent to the same partition, and a partition handles messages from multiple keys.
When a partition is replicated, each replica is either the leader, an in-sync replica or a follower.
A message/record can be written to the leader replica only, though messages can be read from any replica.
Kafka partitions are assigned to brokers at:
So they are static most of the time, even when a node failure strikes.
ZooKeeper is an inseparable part of the Kafka cluster, although it is not used all the time. That said, ZooKeeper is needed when starting Kafka and for failure handling, but not for the normal produce/consume flow.
The controller is one of the most important broker roles in a Kafka ecosystem, and it has the responsibility of maintaining the leader-follower relationship across all partitions. If a node shuts down for some reason, it is the controller's responsibility to tell replicas on the remaining nodes to take over as partition leaders for the partitions led by the failing node. And whenever the node acting as controller shuts down, a new controller is elected; it is guaranteed that at any given time there is only one controller and that all the other nodes agree on it.
ZooKeeper stores the configuration of all topics, including the list of existing topics, the number of partitions for each topic, the location of all replicas, the list of configuration overrides for all topics, which node is the preferred leader, and so on.
Access control lists or ACLs for all the topics are also maintained within Zookeeper.
Zookeeper also maintains a list of all the brokers that are functioning at any given moment and are a part of the cluster.
Any component in this landscape that sends data to Kafka is by definition a Kafka producer and uses the producer API at some point.
Here, we are listing the Kafka Producer API’s main configuration settings:
[client.id]: Identifies the producer application.
[producer.type]: Either sync or async.
[acks]: Controls the criteria under which producer requests are considered complete.
[retries]: If a producer request fails, the producer automatically retries up to the specified number of times.
[bootstrap.servers]: The initial list of brokers used to discover the cluster.
[linger.ms]: The producer waits and batches records for up to linger.ms before sending to the broker. This significantly improves throughput through micro-batching, but also adds latency to each request.
[key.serializer]: The serializer class used for message keys.
[value.serializer]: The serializer class used for message values.
[batch.size]: The maximum size of a record batch, in bytes.
[buffer.memory]: Controls the total amount of memory available to the producer for buffering.
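Putting a few of these settings together, here is a minimal producer sketch with the confluent-kafka Python client (in this client, key/value serialization is done in application code rather than through key.serializer/value.serializer classes); the broker addresses and topic name are placeholder assumptions.

```python
# Minimal producer sketch using some of the settings above.
# Broker addresses and topic name are placeholders.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9092,broker2:9092",
    "client.id": "payments-producer",
    "acks": "all",        # wait for all in-sync replicas
    "retries": 5,         # retry failed requests automatically
    "linger.ms": 20,      # micro-batching: trade latency for throughput
    "batch.size": 65536,  # maximum batch size in bytes
})

def on_delivery(err, msg):
    # Called once the broker has acknowledged (or rejected) the record.
    if err is not None:
        print(f"delivery failed: {err}")

# Keys and values are plain str/bytes here; serialization happens in the app.
producer.produce("payments", key="account-42", value='{"amount": 10}',
                 callback=on_delivery)
producer.flush()  # block until all outstanding records are delivered
```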
Just like a Kafka producer, an application that reads from Kafka uses the consumer API at some point. And here comes a connection to the number of partitions and a concept called the consumer group.
A consumer group is a way to consume records from Kafka in parallel. Each partition is consumed by exactly one consumer in the group, and the maximum consumer parallelism for a topic is its number of partitions.
Here, we are listing the configuration settings for the consumer client API:
[bootstrap.servers]: The initial list of brokers used to discover the cluster.
[group.id]: Assigns an individual consumer to a group.
[enable.auto.commit]: Enables automatic offset commits if the value is true; otherwise offsets are not committed automatically.
[auto.commit.interval.ms]: How often consumed offsets are committed (to Kafka's internal offsets topic with current clients; the old consumer wrote them to ZooKeeper).
[session.timeout.ms]: How many milliseconds the group coordinator waits without receiving a heartbeat from a consumer before considering it dead and rebalancing its partitions.
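And a matching minimal consumer sketch with the same client, using manual offset commits; the broker addresses, group id and topic are placeholder assumptions.

```python
# Minimal consumer sketch using some of the settings above, with manual
# offset commits. Broker addresses, group id and topic are placeholders.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker1:9092,broker2:9092",
    "group.id": "payments-audit",    # consumers sharing this id form a group
    "enable.auto.commit": False,     # commit offsets explicitly below
    "auto.offset.reset": "earliest", # where to start with no committed offset
    "session.timeout.ms": 10000,     # missed heartbeats trigger a rebalance
})

consumer.subscribe(["payments"])
try:
    while True:
        msg = consumer.poll(1.0)     # block for up to 1 second
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        print(f"partition {msg.partition()} offset {msg.offset()}: {msg.value()}")
        consumer.commit(msg)         # commit after processing
finally:
    consumer.close()
```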
Kafka Connect is a common framework for transferring records in and out of a Kafka cluster.
Why use Kafka Connect?
To each record, a "source" connector can attach arbitrary "source location" information, which it passes to Kafka Connect. At the time of a failure, Kafka Connect automatically provides this information back to the connector, so it can resume where it failed. Automatic recovery for "sink" connectors is even easier.
Auto-failover is possible because the Kafka Connect nodes form a cluster of their own. If one node fails, the work it was doing is redistributed to the other nodes.
A connector defines data import or export tasks, which can execute in parallel.
In case you need to develop a new connector, Kafka Connect provides:
A common framework for Kafka connectors: it standardizes the integration of other data systems with Kafka and simplifies connector development, deployment and management.
Distributed and standalone modes: scale up to a large, centrally managed service supporting an entire organization, or scale down to development, testing and small production deployments.
REST interface: connectors are submitted to and managed in a Kafka Connect cluster through an easy-to-use REST API.
Automatic offset management: Kafka Connect can manage the offset commit process automatically with only a little information from connectors, so connector developers do not need to worry about this error-prone part of connector development.
Distributed and scalable by default: it builds on the existing group management protocol, and a Kafka Connect cluster is scaled up by adding more workers.
Streaming/batch integration: Kafka Connect is an ideal solution for bridging streaming and batch data systems.
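To show the REST interface in practice, here is a minimal sketch that submits the FileStreamSource connector (which ships with Kafka) to a Connect cluster over HTTP; the Connect URL, file path and topic name are placeholder assumptions.

```python
# Minimal sketch: submitting and inspecting a connector through the
# Kafka Connect REST API. Connect URL, file path and topic are placeholders.
import requests

CONNECT_URL = "http://connect-worker:8083"

connector = {
    "name": "file-source-demo",
    "config": {
        # FileStreamSource ships with Kafka and tails a file into a topic.
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/var/log/app/events.log",
        "topic": "app-events",
    },
}

resp = requests.post(f"{CONNECT_URL}/connectors", json=connector)
resp.raise_for_status()

# The same API lists connectors and reports their status.
print(requests.get(f"{CONNECT_URL}/connectors").json())
print(requests.get(f"{CONNECT_URL}/connectors/file-source-demo/status").json())
```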
Schema Registry stores a versioned history of all schemas and allows the evolution of schemas according to the configured compatibility settings. It also provides a plugin to clients that handle schema storage and retrieval for messages that are sent in Avro format.
Why do we need schema in the first place?
Kafka sees every record as bytes, so schemas work and live at the application level. The producer and consumer are very likely not the same application, or even in the same code base, so there needs to be collaboration between them.
A schema registry is here to:
Reduce payload: instead of sending data with a header or a full JSON structure, only the actual payload needs to be passed to Kafka.
Data validation and evolution: invalid messages never reach Kafka, and a schema can be evolved to the next version without breaking existing parts.
Schema access: instead of distributing class definitions, the object structure can be distributed via a RESTful API, along with all previous versions.
In addition to schema management, using schemas alone also reduces the record size sent to Kafka.
If you send a message as JSON, 50% or more of the payload can be taken up by the message structure. With a schema registry, you only need to transfer the schema identifier along with the payload.
A workflow using a schema registry:
The serializer places a call to the schema registry to see if it has a format for the data the application wants to publish. If it does, the schema registry passes that format to the application's serializer, which uses it to filter out incorrectly formatted messages.
After the schema check passes, the message is serialized automatically with no extra effort on your part and delivered, as expected, to the Kafka topic.
Your consumers will handle deserialization, making sure your data pipeline can quickly evolve and continue to have clean data. You simply need to have all applications call the schema registry when publishing.
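Here is a minimal sketch of that workflow with the confluent-kafka Python client and its Schema Registry integration; the registry URL, schema and topic are placeholder assumptions.

```python
# Minimal sketch of the workflow above: the serializer registers/fetches the
# schema from Schema Registry, and only the schema id travels with the payload.
# Registry URL, schema and topic are placeholders.
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

schema_str = """
{
  "type": "record",
  "name": "Payment",
  "fields": [
    {"name": "account", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})
serializer = AvroSerializer(registry, schema_str)

producer = Producer({"bootstrap.servers": "broker1:9092"})
value = serializer(
    {"account": "account-42", "amount": 10.0},
    SerializationContext("payments", MessageField.VALUE),
)
producer.produce("payments", value=value)
producer.flush()
```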
The multi-datacenter topic usually comes up for two reasons:
When it comes to a multi-datacenter setup there are, in my opinion, three major aspects to consider, both during normal operations and when disaster strikes:
This is about when the producer receives confirmation that a message has been written; it is highly related to consistency and performance and is controlled by the producer acks= setting.
Before discussing the different setups, here are some heavily used abbreviations:
nr-dc: Number of data centers
nr-rep: (default.replication.factor) Number of replicas
ISR: In-sync replicas
min-ISR: (min.insync.replicas) Minimum in-sync replicas
All the scenarios are based on producer setting acks=1 and topics created using default settings.
Note: 2.5 DC is basically 2 DCs plus 1 ZooKeeper node in a third DC.
To set the tone here: a two data center setup for Kafka is NOT ideal!
There will be trade-offs for two-DC setups. Before continuing, you need to understand the CAP theorem. According to LinkedIn engineers, Kafka is a CA system: if a network partition happens, you will have either a split-brain situation or a failed cluster. Shutting down part of the cluster to prevent split brain is a deliberate choice.
When it comes to two-DC setups, you cannot even achieve CA across the cluster. Having said that, you can still tune individual topics to favor either consistency OR availability.
With rack awareness, two data centers are joined as one cluster. The rack awareness feature spreads replicas of the same partition across different racks/data centers. This extends the guarantees Kafka provides for broker failure to cover rack failure, limiting the risk of data loss should all the brokers on a rack/data center fail at once.
min-ISR > round(nr-rep/nr-dc)
The minimum number of in-sync replicas is larger than the number of replicas assigned per data center.
With this setup, it is guaranteed that at least ONE in-sync replica lives in the other data center. When one data center goes offline, the data is already secured in the other data center.
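To put illustrative numbers on the formula (my own example, not from the actual installation): with nr-dc = 2 and nr-rep = 4, round(nr-rep/nr-dc) = 2, so min-ISR must be at least 3. Two replicas live in each data center, so with min-ISR = 3 every acknowledged write has necessarily reached at least one replica in the remote data center.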
If one data center goes down, partitions will go offline because they simply won't have enough in-sync replicas (even though at least one replica is still in sync).
The way to re-enable the cluster is to either
In a single-tenant environment, producers need to be able to buffer messages locally until the cluster is back up and functioning.
In a multi-tenant environment, there will probably be message loss.
Whichever partition leader you are writing to will need to replicate to the other data center. Your write latency will be the sum of the normal latency plus the latency between the data centers.
nr-rep - min-ISR >= nr-rep/nr-dc
More nodes can be down at the same time (nr-rep - min-ISR) than the number of replicas hosted per data center.
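As an illustrative example (again my own numbers): with nr-dc = 2, nr-rep = 4 and min-ISR = 1, nr-rep - min-ISR = 3, which is at least nr-rep/nr-dc = 2, so even after losing the two replicas hosted in one data center, the two replicas remaining in the other data center still satisfy min-ISR.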
There are actually two cases here:
Since there will be so many partitions in the cluster, you won’t avoid the first case anyway.
Why is this important? Because by default, a replica that is not in the ISR cannot be elected as the leader replica, and if no ISR is available, the partition goes offline.
To make this work, there is another piece to the puzzle: `unclean.leader.election.enable=true`.
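For illustration, such a per-topic override can also be applied through the admin API; below is a rough sketch with the confluent-kafka Python client (the topic name and broker address are placeholders, and the exact admin-API surface varies somewhat between client versions).

```python
# Rough sketch: enabling unclean leader election for a single topic via the
# confluent-kafka admin client. Topic name and broker address are placeholders.
# Note: alter_configs replaces the topic's full set of config overrides, so in
# practice you would include any existing overrides as well.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "broker1:9092"})

resource = ConfigResource(
    ConfigResource.Type.TOPIC,
    "payments",
    set_config={"unclean.leader.election.enable": "true"},
)

for res, future in admin.alter_configs([resource]).items():
    future.result()  # raises if the update failed
    print(f"updated config for {res}")
```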
Here is what happens when one data center fails.
This setting allows out-of-sync followers to become leaders, so you will lose the records that did not sync in time, and there definitely will be some. When the old leaders come back online, they may bring back the missing records and possibly create duplicates as well.
This setup allows one data center to go down entirely.
A lower min-ISR means better performance, especially since data does not necessarily have to be synced across to the other data center before responding to the producer.
As I mentioned before, to perform any failover recovery you need a ZooKeeper quorum, and the previous setups only dealt with partition placement.
In the previous setups, you need to form a ZooKeeper cluster (an odd number of nodes) across two data centers; that means one node in DC1 and two in DC2, or two in DC1 and three in DC2.
If the data center with more ZooKeeper nodes goes down, it takes the ZooKeeper cluster down with it, because ZooKeeper needs a majority to perform an election.
Having 1/3/5 ZooKeeper node(s) reside in a third data center greatly reduces the risk to the ZooKeeper cluster.
From the producer's or consumer's perspective, there are two clusters, and data is synced between them with an almost guaranteed delay.
There is a set of popular tools:
Why? Because MirrorMaker from Kafka is far from perfect.
Active-Active and Active-Passive are the two approaches to replicated clusters.
AA: Producers and consumers may use both clusters to produce and consume data. With the delays in between, ordering is not guaranteed outside of a single data center.
AP: The other cluster is a hot standby, and switching to it requires reconfiguration on the client side.
Well, there will be data that cannot be synced in time, and that creates inconsistency.
With the active-active design, you can in theory put both clusters into the bootstrap server list (not recommended), but that sacrifices ordering and a bit of performance as well, since you may be connecting to a server that is geographically further away.
Other than the case described above, if a data center goes offline, you need to do a manual recovery and switch to the other cluster.
I think this may be one of the biggest advantages: clients can choose to talk to the cluster that is geographically closer.
min-ISR > round(nr-rep/nr-dc)
The same formula as the consistency-first setup.
A stretched three data centers cluster is, in my opinion, the best setup. Consistency and availability can be achieved with little effort.
nr-rep = nr-dc * n
min-ISR = n+1
Where n=1/2/3… and n+1 <= number of nodes per data center.
Here is a simulation for 3 replicas and min-ISR=2 setup:
When a record is delivered, it is guaranteed that a replica lives in another data center.
The cluster can suffer the total loss of ONE data center and still be available for reads and writes.
Same performance characteristics as the two data center stretched cluster: all requests incur extra latency to sync a replica to another data center.
With one data center, Kafka can achieve consistency and availability at the local level. When it comes to a multi-datacenter setup, you can prevent incidents that happen once every few years.
Although data center failure is rare, it is important to feel safe and prepared for such situations.
Choose and implement a solution that fits your needs, and remember that you can customize topics individually, as most of the configurations are per topic. If possible, go for the three data center setup; it can achieve both consistency and availability at the cluster level with little overhead.
I am interested in site reliability topics in general, let me know your thoughts!