Data ingestion Demo
This folder contains an example that ingests data into the data platform instantiated here.
The example is not intended to be production-ready code.
Demo use case
The demo imports purchase data generated by a store.
Input files
Data are uploaded to the drop off GCS bucket. File structure:
- customers.csv: comma-separated values file with customer information in the following format: Customer ID, Name, Surname, Registration Timestamp
- purchases.csv: comma-separated values file with purchase information in the following format: Item ID, Customer ID, Item, Item price, Purchase Timestamp
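As a minimal sketch of the two layouts above, the following Python parses each file into typed records. It assumes the files have no header row, keeps IDs as strings, and leaves timestamps as raw text, none of which the description above confirms. The `Customer` and `Purchase` class names, the parser function names, and the sample rows are hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: str
    name: str
    surname: str
    registration_timestamp: str  # kept as text: the timestamp format is not specified


@dataclass
class Purchase:
    item_id: str
    customer_id: str
    item: str
    item_price: float
    purchase_timestamp: str  # kept as text for the same reason


def parse_customers(text):
    """Parse customers.csv content (assumed to have no header row)."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [Customer(*row) for row in rows if row]


def parse_purchases(text):
    """Parse purchases.csv content (assumed to have no header row)."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        Purchase(item_id, customer_id, item, float(price), ts)
        for item_id, customer_id, item, price, ts in (row for row in rows if row)
    ]


# Hypothetical sample rows, for illustration only.
customers = parse_customers("1, Jane, Doe, 2022-01-01 10:00:00\n")
purchases = parse_purchases("10, 1, Book, 12.50, 2022-01-02 11:30:00\n")
```

If the real files carry a header row, it would need to be skipped before building the records.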
Data processing pipelines
Different data pipelines are provided to highlight different features and patterns. For the purpose of the example, a single pipeline handles all data lifecycles. When adapting them to your real use case, you may want to handle each functional step in a separate pipeline or with a dedicated tool. For example, you may want to use Dataform to handle the data schema lifecycle.
Below you can find a description of each example:
- Simple import data: datapipeline.py is a simple pipeline to import provided data from the drop off Google Cloud Storage bucket to the Data Hub Confidential layer, joining the customers and purchases tables into the customerpurchase table.
- Import data with Policy Tags: datapipeline_dc_tags.py imports provided data from the drop off bucket to the Data Hub Confidential layer, protecting sensitive data using Data Catalog policy tags.
- Delete tables: delete_table.py deletes BigQuery tables created by import pipelines.
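The join performed by the simple import pipeline can be pictured with the sketch below. It is plain in-memory Python, not the code in datapipeline.py, and it assumes the two tables are joined on Customer ID (the only field the two input files share) with inner-join semantics. The function name, dictionary keys, and sample rows are hypothetical.

```python
def join_customer_purchases(customers, purchases):
    """Inner-join purchases to customers on customer_id (assumed join key)."""
    customers_by_id = {c["customer_id"]: c for c in customers}
    joined = []
    for purchase in purchases:
        customer = customers_by_id.get(purchase["customer_id"])
        if customer is None:
            continue  # inner join: skip purchases with no matching customer
        # Customer fields first, then purchase fields; customer_id appears once.
        joined.append({**customer, **purchase})
    return joined


# Hypothetical sample rows, for illustration only.
customers = [{"customer_id": "1", "name": "Jane", "surname": "Doe"}]
purchases = [
    {"item_id": "10", "customer_id": "1", "item": "Book", "item_price": 12.5},
    {"item_id": "11", "customer_id": "2", "item": "Pen", "item_price": 1.0},
]
customer_purchase = join_customer_purchases(customers, purchases)
```

Here the second purchase is dropped because no customer with ID 2 exists; whether the actual pipeline keeps or drops such rows is not stated in this document.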
Running the demo
To run the demo examples, follow these steps:
- 01: Copy sample data to the drop off Cloud Storage bucket, impersonating the load service account.
- 02: Copy the sample data structure definition to the orchestration Cloud Storage bucket, impersonating the orchestration service account.
- 03: Copy the Cloud Composer DAG to the Cloud Composer Storage bucket, impersonating the orchestration service account.
- 04: Open the Cloud Composer Airflow UI and run the imported DAG.
- 05: Run the BigQuery query to see results.
You can find pre-computed commands for these steps in the demo_commands output variable of the deployed Terraform data pipeline.
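One way to read that output from a script is sketched below. It calls `terraform output -json demo_commands` and assumes, without confirmation from this document, that the output is a map from step label to command string. The helper names `load_demo_commands` and `format_demo_commands` and the example value are hypothetical.

```python
import json
import subprocess


def load_demo_commands(terraform_dir="."):
    """Read the demo_commands output by calling the Terraform CLI."""
    result = subprocess.run(
        ["terraform", "output", "-json", "demo_commands"],
        cwd=terraform_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def format_demo_commands(commands):
    """Render a mapping of step label to command, sorted by label."""
    return "\n".join(f"{step}: {cmd}" for step, cmd in sorted(commands.items()))


# Hypothetical value, for illustration only; real commands come from Terraform.
example = {"02": "<command for step 02>", "01": "<command for step 01>"}
listing = format_demo_commands(example)
```

With the pipeline deployed, running `print(format_demo_commands(load_demo_commands()))` from the Terraform directory would list the commands, provided the output has the assumed shape.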