Files

Alejandro Leal a85f3aa907 Update to multiple READMEs (#740 )

* Update CONTRIBUTING.md

* Update README.md

* fix gcs to bq example readme

* tfdoc

Co-authored-by: Ludovico Magnocavallo <ludomagno@google.com>

2022-08-11 09:40:55 +02:00

data

Replace existing data platform

2022-02-05 08:51:11 +01:00

datapipeline_dc_tags.py

Update naming convention

2022-04-21 23:53:16 +02:00

datapipeline.py

Update naming convention

2022-04-21 23:53:16 +02:00

delete_table.py

Update naming convention

2022-04-21 23:53:16 +02:00

README.md

Update to multiple READMEs (#740 )

2022-08-11 09:40:55 +02:00

README.md

Data ingestion Demo

This folder contains an example that ingests data into the data platform instantiated here.

The example is not intended to be production-ready code.

Demo use case

The demo imports purchase data generated by a store.

Input files

Data are uploaded to the drop off GCS bucket. File structure:

customers.csv: Comma-separated values file with customer information in the following format: Customer ID, Name, Surname, Registration Timestamp
purchases.csv: Comma-separated values file with purchase information in the following format: Item ID, Customer ID, Item, Item price, Purchase Timestamp
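
As a minimal sketch of the two file layouts, the snippet below parses them with Python's standard `csv` module. The column order follows the format described above; the field names, the absence of a header row, and the sample rows are assumptions made for illustration and are not taken from the demo data.

```python
import csv
import io

# Column order follows the README; the field names themselves are assumptions.
CUSTOMER_FIELDS = ["id", "name", "surname", "timestamp"]
PURCHASE_FIELDS = ["id", "customer_id", "item", "price", "timestamp"]


def read_rows(text, fields):
    """Parse headerless CSV text into a list of dicts keyed by `fields`."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=fields)
    return [dict(row) for row in reader]


# Made-up sample rows, for illustration only.
customers_csv = "1,Ada,Rossi,2022-01-01 10:00:00\n"
purchases_csv = "100,1,Keyboard,49.90,2022-01-02 11:30:00\n"

customers = read_rows(customers_csv, CUSTOMER_FIELDS)
purchases = read_rows(purchases_csv, PURCHASE_FIELDS)
```

All values are kept as strings here, since the README does not state the timestamp or price encoding.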

Data processing pipelines

Different data pipelines are provided to highlight different features and patterns. For the purpose of the example, a single pipeline handles all data lifecycles. When adapting them to your real use case, you may want to evaluate handling each functional step in a separate pipeline or with a dedicated tool. For example, you may want to use Dataform to handle the data schema lifecycle.

Below you can find a description of each example:

Simple import data: datapipeline.py is a simple pipeline that imports the provided data from the drop off Google Cloud Storage bucket to the Data Hub Confidential layer, joining the customers and purchases tables into the customerpurchase table.
Import data with Policy Tags: datapipeline_dc_tags.py imports the provided data from the drop off bucket to the Data Hub Confidential layer, protecting sensitive data using Data Catalog policy tags.
Delete tables: delete_table.py deletes BigQuery tables created by import pipelines.
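
To illustrate the join that produces the customerpurchase table, here is a small self-contained sketch using an in-memory SQLite database. It is not the code of datapipeline.py, which runs against BigQuery from Cloud Composer; the column names and sample rows are assumptions based on the input file format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed column names, mirroring the CSV layouts described in this README.
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER, name TEXT, surname TEXT, timestamp TEXT);
    CREATE TABLE purchases (id INTEGER, customer_id INTEGER, item TEXT,
                            price REAL, timestamp TEXT);
    """
)

# Made-up sample rows, for illustration only.
conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'Rossi', '2022-01-01 10:00:00')")
conn.execute("INSERT INTO purchases VALUES (100, 1, 'Keyboard', 49.9, '2022-01-02 11:30:00')")

# Join each purchase to the customer who made it.
conn.execute(
    """
    CREATE TABLE customerpurchase AS
    SELECT c.id AS customer_id, p.id AS purchase_id,
           c.name, c.surname, p.item, p.price, p.timestamp
    FROM customers AS c
    JOIN purchases AS p ON c.id = p.customer_id
    """
)

rows = conn.execute(
    "SELECT customer_id, purchase_id, name, item, price FROM customerpurchase"
).fetchall()
```

An inner join is used in this sketch, so customers without purchases do not appear in the result; whether the actual pipeline behaves the same way is not stated here.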

Running the demo

To run the demo examples, follow these steps:

01: Copy sample data to the drop off Cloud Storage bucket impersonating the load service account.
02: Copy the sample data structure definition to the orchestration Cloud Storage bucket impersonating the orchestration service account.
03: Copy the Cloud Composer DAG to the Cloud Composer Storage bucket impersonating the orchestration service account.
04: Open the Cloud Composer Airflow UI and run the imported DAG.
05: Run the BigQuery query to see results.

You can find pre-computed commands in the demo_commands output variable of the deployed Terraform data pipeline.
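
The copy steps 01 to 03 can be sketched as `gsutil` invocations that use the top-level `-i` option to impersonate a service account. The snippet below only builds and prints the command lines; it does not run them. Every service account, bucket name, and the `SCHEMA_FILE` path is a hypothetical placeholder, so prefer the real values from the demo_commands output.

```python
import shlex


def gsutil_copy(service_account, source, bucket):
    """Build a gsutil command that copies `source` to `bucket` while
    impersonating `service_account` through gsutil's -i option."""
    return ["gsutil", "-i", service_account, "cp", source, f"gs://{bucket}"]


# Hypothetical placeholders: replace with values from your own deployment.
load_sa = "LOAD_SA@PROJECT.iam.gserviceaccount.com"
orchestration_sa = "ORCHESTRATION_SA@PROJECT.iam.gserviceaccount.com"

commands = [
    # 01: sample data to the drop off bucket
    gsutil_copy(load_sa, "data/customers.csv", "DROP_OFF_BUCKET"),
    gsutil_copy(load_sa, "data/purchases.csv", "DROP_OFF_BUCKET"),
    # 02: data structure definition to the orchestration bucket
    gsutil_copy(orchestration_sa, "SCHEMA_FILE", "ORCHESTRATION_BUCKET"),
    # 03: DAG file to the Cloud Composer bucket
    gsutil_copy(orchestration_sa, "datapipeline.py", "COMPOSER_BUCKET"),
]

for cmd in commands:
    print(shlex.join(cmd))
```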