Greenfield as starting point
Building a data platform on a greenfield, as we found here, offers many possibilities but also challenges. In terms of tools there were no restrictions, except that we were asked to consider using Google Cloud components. However, there was already a process in place for deploying products, and that had to be used.
This initial situation led to some consideration of tools and processes. Fixed in this process is Jenkins CI for deploying code to staging and production environments, along with some other constraints, such as using an Infrastructure as Code approach for deployments. Tool-wise, we first took a look at Google-provided tools like BigQuery, Dataflow and Composer, on which we then built the backbone of our data platform.
BigQuery
BigQuery is a data warehouse system provided by Google. Its serverless and scalable architecture, in combination with a SQL dialect compliant with the SQL:2011 standard, makes it an easy choice for providing the API to our data. It also integrates seamlessly with Google Cloud Storage, where a large part of our data was already stored.
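For illustration, here is a minimal sketch of querying such a table from Python with the official BigQuery client; the project, dataset and table names are placeholders, not our actual setup:

```python
from google.cloud import bigquery

# The client picks up credentials and project from the environment
# (GOOGLE_APPLICATION_CREDENTIALS / GOOGLE_CLOUD_PROJECT).
client = bigquery.Client()

# Hypothetical dataset and table names, used only for illustration.
query = """
    SELECT user_id, COUNT(*) AS events
    FROM `my-project.analytics.events`
    WHERE event_date = @day
    GROUP BY user_id
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("day", "DATE", "2019-01-01")]
)

for row in client.query(query, job_config=job_config).result():
    print(row.user_id, row.events)
```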
Dataflow
Dataflow, as a data processing engine, can handle both streaming and batch data. This gives us the comfort of using one framework for both types of data processing. Since it is fully managed, it also reduces the overhead of reserving resources and servers.
Apache Beam
Beam is a unified programming model for batch and real-time processing that is portable across different engines such as Dataflow, Spark and Flink. Using this framework keeps us independent of the underlying system, so we can migrate from one engine to another with little overhead.
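To illustrate this portability, here is a minimal Beam pipeline sketch; project, bucket and region settings are placeholders, and only the runner option decides whether the same code runs locally or on Dataflow:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/bucket names; switch the runner to "DirectRunner"
# to execute the same pipeline locally instead of on Dataflow.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="europe-west1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "Filter" >> beam.Filter(lambda fields: fields[0] != "")
        | "Format" >> beam.Map(",".join)
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/result")
    )
```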
Composer
Composer is a hosted version of Apache Airflow. It lets us schedule and manage our workflows programmatically as DAGs. Airflow is also used and developed outside of Google, so choosing it keeps us more independent.
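A minimal sketch of such a DAG, with made-up task and schedule names and Airflow 1.x import paths, could look like this:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical daily workflow; names and schedule are placeholders.
default_args = {
    "owner": "data-platform",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_ingest",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")

    extract >> load  # declare the execution order
```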
Apache Geode
Geode is an in-memory data grid that can be used complementary to a data warehouse like BigQuery to provide real-time data access. It offers an out-of-the-box REST API for programmatically fetching results into data services.
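As a minimal sketch, fetching data through Geode's developer REST API from Python could look roughly like this; host, port, region and key names are placeholders:

```python
import requests

# Hypothetical Geode host and region; /gemfire-api/v1 is the base path
# of Geode's developer REST API.
BASE_URL = "http://geode-server:7070/gemfire-api/v1"

# Fetch a single entry by key from a region.
resp = requests.get(f"{BASE_URL}/predictions/user-42")
resp.raise_for_status()
print(resp.json())

# Run an ad-hoc OQL query against the same region.
resp = requests.get(
    f"{BASE_URL}/queries/adhoc",
    params={"q": "SELECT * FROM /predictions p WHERE p.score > 0.9"},
)
resp.raise_for_status()
print(resp.json())
```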
Google PubSub
Another way to deliver data on time is to push it to a message queue; the one provided by Google is PubSub. By sending transformation results, e.g. predictions, to a queue, other services can consume these results in real time and use them for decision making.
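A minimal sketch of publishing a prediction to a PubSub topic with the official Python client; project and topic names, as well as the message attributes, are made up for illustration:

```python
import json

from google.cloud import pubsub_v1

# Hypothetical project and topic names, used only for illustration.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "predictions")

prediction = {"user_id": "user-42", "score": 0.93}

# Messages are raw bytes; attributes can carry routing metadata.
future = publisher.publish(
    topic_path,
    data=json.dumps(prediction).encode("utf-8"),
    model="churn-v1",
)
print("Published message", future.result())
```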
Technical setup
Making all of this come together as our data platform was the main challenge in designing and implementing the system. As programming language we decided on Python; all the tools except Apache Geode can be programmed with it. As infrastructure we make use of the Kubernetes cluster started by the Composer environment. Composer, or rather Airflow, can start both Docker containers and Dataflow jobs from the DAG.
So all our processes either run in Dataflow / Apache Beam or are defined in a Docker container. This has the advantage that each task has its own environment of dependencies, which keeps the tasks independent. For example, even though Google deploys Composer with Python 2.7, we can still deploy tasks that use Python 3 this way. Composer starts each containerized task as a pod in the Composer cluster, while Dataflow runs completely independently of this cluster, which lets us run more jobs in parallel.
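A hedged sketch of how this looks in a DAG, assuming the Airflow 1.10 contrib operators bundled with Composer at the time; image names, GCS paths and project IDs are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG(
    dag_id="platform_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Runs in its own Docker image as a pod in the Composer cluster, so it
    # can use Python 3 and its own dependencies independently of Composer.
    ingest = KubernetesPodOperator(
        task_id="ingest",
        name="ingest",
        namespace="default",
        image="eu.gcr.io/my-project/ingest:latest",  # placeholder image
        cmds=["python3", "ingest.py"],
        get_logs=True,
    )

    # Runs outside the Composer cluster as a managed Dataflow job.
    transform = DataFlowPythonOperator(
        task_id="transform",
        py_file="gs://my-bucket/pipelines/transform.py",  # placeholder path
        options={"output": "gs://my-bucket/output/"},
        dataflow_default_options={
            "project": "my-project",
            "temp_location": "gs://my-bucket/tmp",
        },
    )

    ingest >> transform
```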
Data management setup
Having set up the process structure like this, we needed to provide a SQL-compatible API for exposing the data to our clients. To handle data structures efficiently across several systems, we decided on Apache Avro as data format. Avro provides a schema for each file, and the schema is stored with the data. In addition, the data is splittable and compressible, which suits MapReduce-style processing. We use these properties by creating external tables in BigQuery on top of the Avro files: table definitions are read directly from the files, and schema evolution is provided by Avro itself and supported by BigQuery. Data definitions are handled in our schema registry and only there, to keep them in one place and programmatically changeable everywhere in the system just by deploying a new schema. The analytical setup based on BigQuery is complemented by Apache Geode and PubSub for providing data in real time.
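A minimal sketch of defining such an external table with the BigQuery Python client; bucket, dataset and table names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical bucket, dataset and table names.
external_config = bigquery.ExternalConfig("AVRO")
external_config.source_uris = ["gs://my-bucket/events/*.avro"]

table_ref = bigquery.TableReference.from_string(
    "my-project.analytics.events_external"
)
table = bigquery.Table(table_ref)
# For Avro, the column definitions come from the schema embedded in the
# files, so no explicit schema has to be maintained on the table itself.
table.external_data_configuration = external_config

client.create_table(table)
```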
Deploying this architecture
As mentioned above, the complete architecture has to be deployed using Jenkins CI, which means creating a Jenkinsfile for each task. To make this deployment stable, we rely on tests: unit tests, integration tests and data integrity tests. With this setup we can deploy our platform to environments where access is granted only to service accounts, so no manual interaction is possible.
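As an illustration, here is a minimal sketch of a unit test for a Beam transform using Beam's testing utilities; the transform itself is a made-up example:

```python
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to


def parse_line(line):
    """Hypothetical transform under test: split a CSV line into fields."""
    return line.split(",")


def test_parse_line():
    with TestPipeline() as pipeline:
        output = (
            pipeline
            | beam.Create(["a,1", "b,2"])
            | beam.Map(parse_line)
        )
        # The assertion is checked when the pipeline runs at the end of
        # the with-block.
        assert_that(output, equal_to([["a", "1"], ["b", "2"]]))
```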
Prototyping approach
To ensure that all components work well together, we started by building a prototype. Step by step we combined the components and added all the needed functionality. This approach helped us find out quite quickly where our initial strategy for handling all the different data was feasible and where we had to find another way to achieve our goals. While working on it, we learned a lot about how to handle some of the cloud services, even when the documentation told us a different story. By the end we were confident that our strategy would work without major complications and that the resulting architecture would meet our expectations.
Our final architecture integrates all the tools and services mentioned above into the foundation of the data platform we aim to build. The different pillars complete our vision of a platform that caters to all data needs of our organization: we want to provide the right data to the right people at the right time, in the right format.