Tutorial: Building an S3-Parquet Data Lake Without a Single Line of Code!

Tomer Shaiman
Feb 24, 2020 · 7 min read

This post will show you how to solve a very painful issue in the streaming world: how to stream a “raw” data format such as JSON into a data lake (such as S3) that needs the data in Parquet format.

By the end of this tutorial you will:

  • Set up the entire Confluent Community Platform using Docker Compose
  • Send simple raw JSON data
  • Convert the JSON data to Avro using KSQL and Schema Registry
  • Build a connector job that streams the Avro data to S3 as Parquet
  • Do all of the above without a single line of code!

TL;DR

Just open the tutorial.adoc file in this GitHub repo and follow the commands.

Business Use Case

In my current position at Novarize, an AI-driven audience discovery tool for marketers, we collect billions of data points from across the web to build precise audience segment profiles.
Usually the data format is JSON, but our data analysts need to read it in Parquet, a column-based format more suitable for data manipulation and queries, so we had to find a way to transform the data in near real time and without a dedicated microservice.

Our first iteration was to use EMR on AWS, which did the trick of converting JSON to Parquet and writing the files to S3, but we did face some challenges:

  • It was costly
  • It had too many moving parts
  • It was difficult to configure AWS Glue for our scale, and it wasn’t real-time enough for our needs
  • There was no support for using Schema Registry
  • It did not support complex schema models (nested, recursive)
  • It was slow, since the Glue Crawler needs to rescan a lot of data to find new partitions

The alternative is to use the power of Confluent’s S3 Sink Connector, which was recently upgraded to support saving to S3 in Parquet format!

I also want to show here the power of KSQL for translating the raw JSON format into Avro, a task I demonstrated in my previous blog post with Kafka Streams.

Here is a short recap of the architecture we are about to build:

  • We have a Kafka Connect cluster with a DataGen job that sends JSON-based data to the ratings topic
  • A KSQL server translates the JSON topic into an Avro topic using stream-processing queries
  • The KSQL server also creates a corresponding schema in the Schema Registry server; the schema name is ratings_avro-value
  • Another connector, the S3 Sink, reads from the Avro topic (ratings_avro) and writes to an S3 bucket

Note: there is only one Kafka Connect worker; it is drawn here twice for clarity.

I : Setup

For this tutorial you will need the following:

  • Your AWS credentials, to grant Connect access to write to S3
  • Docker
  • A text editor

That’s all? Yes!

First, grab the code from here. Open the project and examine the docker-compose.yaml file.

You will notice the file contains many of the Confluent Platform components, such as Kafka, ZooKeeper, Schema Registry, KSQL and Kafka Connect.

In the docker-compose.yaml file, add your AWS credentials:

AWS_ACCESS_KEY_ID: "<ADD-YOUR-KEY-HERE>"
AWS_SECRET_ACCESS_KEY: "<ADD-YOUR-KEY-HERE>"
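
These two variables typically sit under the environment block of the Kafka Connect service, since the S3 connector picks up its AWS credentials from the worker’s environment. A minimal sketch of that part of the compose file (the service name and image tag here are illustrative; your repo’s values may differ):

connect:
  image: confluentinc/cp-kafka-connect:5.4.0
  environment:
    AWS_ACCESS_KEY_ID: "<ADD-YOUR-KEY-HERE>"
    AWS_SECRET_ACCESS_KEY: "<ADD-YOUR-KEY-HERE>"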

Another interesting part in the Kafka Connect section is the way I’ve added a connector called “DataGen” to assist me in producing some dummy rating data in JSON format.
This connector is injected into the Connect worker during startup, eliminating the need to write producers for this demo. Look at the
docker-compose.yaml file, at around line 100:

As you can see, we are using a built-in connector that generates JSON data for us into the ratings topic.
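
If you don’t want to dig through the compose file, here is a rough sketch of what such a DataGen connector configuration looks like, using the kafka-connect-datagen “ratings” quickstart (this is an illustration, not the exact block from the repo; the connector name, interval and converter settings are my assumptions):

{
  "name": "datagen-ratings",
  "config": {
    "connector.class": "io.confluent.kafka.connect.datagen.DatagenConnector",
    "kafka.topic": "ratings",
    "quickstart": "ratings",
    "max.interval": "1000",
    "tasks.max": "1",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false"
  }
}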

Now, it’s time to start everything up:

> docker-compose up

You will see logs from all the containers pour into your terminal. It is quite an exciting sight, by the way.

After a few seconds, we can make sure our data generator works correctly:

> kafkacat -C -b 0 -t ratings

The -b 0 flag points kafkacat at the Kafka broker on localhost:9092.

which should produce some JSON data like this:

Our raw JSON data from the ratings topic

If you don’t have kafkacat yet, now would be a good time to grab it here.

II : Writing Streams using KSQL

It is time to create the “JSON-to-Avro” stream using KSQL.
First, we will log into the KSQL shell using a dedicated Docker container:

> docker exec -it ksql-cli ksql http://ksql-server:8088

You should see a nice welcome screen with a prompt:

Next, thanks to our friend at Confluent, rmoff, who wrote a very simple blog post on this topic here, we can run the following commands:

The first will create a JSON stream and define the source topic schema:

ksql> CREATE STREAM source_json (rating_id VARCHAR, user_id VARCHAR, stars INT, route_id VARCHAR, rating_time INT, channel VARCHAR, message VARCHAR)
      WITH (KAFKA_TOPIC='ratings', VALUE_FORMAT='JSON');

The second will create the Avro stream, based on the source stream, into the “ratings_avro” topic:

ksql> CREATE STREAM target_avro
      WITH (KAFKA_TOPIC='ratings_avro', REPLICAS=1, PARTITIONS=1, VALUE_FORMAT='AVRO')
      AS SELECT * FROM source_json;

If we want to make sure the streams are configured correctly, we can list them:

ksql> show streams;
listing the streams

And we can also view the data of the Avro stream:

ksql> print ratings_avro;

which prints the output of the Avro stream:

Printing the Avro stream

Another alternative is to use a kafkacat command from our terminal (note that you are not running this in the KSQL prompt, and that $KAFKA should point at your broker, e.g. localhost:9092):

> kafkacat -b $KAFKA \
    -r http://localhost:8081 \
    -s avro \
    -t ratings_avro \
    -C -u -o beginning -q | jq '.'

which will output the Avro topic in a very “pretty” format:

A nicely printed Avro stream using kafkacat

Last, but not least, if you are curious to look at the schema that was created:

curl http://localhost:8081/subjects/ratings_avro-value/versions/latest

will output the schema definition in JSON format.
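
If you just want to confirm the subject was registered at all, Schema Registry can also list every subject it knows about:

> curl http://localhost:8081/subjects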

We now have a working JSON-to-Avro stream without a single line of code!

III : Configuring Kafka-Connect Sink

In this last part of the tutorial, we add the S3 Sink connector that writes the Avro data into an S3 bucket.

In the repo you cloned here, there is a JSON file that describes the connector (a sketch of such a configuration follows the list below):

  1. We use Confluent’s S3SinkConnector connector.
  2. We specify our topic as “ratings_avro”, our target bucket as “data” and our directory as “parquet-demo”. Make sure to create the bucket in advance.
  3. We specify the value converter as Avro, since this is our format on the designated topic, and the key converter as string, since we don’t have Avro format on the keys.
  4. We use a field partitioner based on USER_ID (capitalized because of KSQL).
  5. We specify the Schema Registry URL for the converters.
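
Putting those points together, the connector file looks roughly like this (a sketch only; the connector name, region, flush size and Schema Registry hostname are my assumptions, and the file in the repo is the source of truth):

{
  "name": "s3-parquet-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "ratings_avro",
    "s3.bucket.name": "data",
    "s3.region": "us-east-1",
    "topics.dir": "parquet-demo",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "flush.size": "100",
    "partitioner.class": "io.confluent.connect.storage.partitioner.FieldPartitioner",
    "partition.field.name": "USER_ID",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}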

To apply this configuration we simply run:

curl -X POST -H 'Content-Type:application/json' --data @"./s3-parquet-connector.json" http://localhost:8083/connectors/

which should return a 201 Created status and the definition that was created.
It is also recommended to view the Kafka Connect logs before/after submitting new connectors.
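
You can also ask the Connect REST API directly whether the connector and its task are running (replace the connector name with whatever you used in your JSON file):

> curl -s http://localhost:8083/connectors
> curl -s http://localhost:8083/connectors/<your-connector-name>/status | jq '.'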

The last thing we need to do is view our S3 bucket and confirm that the Parquet files exist and are partitioned according to the USER_ID field:

Data is partitioned by USER_ID
The Parquet files
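
If you prefer the command line over the S3 console, the AWS CLI shows the same layout (assuming the bucket and directory names configured above; the exact object paths depend on your partitioner settings):

> aws s3 ls s3://data/parquet-demo/ --recursive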

IV : Conclusion

Kafka Connect’s new ability to write to S3 in Parquet format is quite impressive! Not only can we now achieve it easily, using mainly configuration, it is dead simple and, most importantly, requires no extra microservices, no code to maintain, no complex deployments, and no time spent building libraries whose only job is I/O to S3 buckets.

The alternative of using managed Spark services (EMR/Databricks) is also costly, so the advantage here is huge.

We still need a robust deployment using Kubernetes/Helm, and I promise to get back to this topic in upcoming posts.


I would like to thank Dor Sever from BigPanda, and Nadav Roiter and Guy Shemer from Novarize, who helped me with this post. You guys are awesome!
