In the second week of our Data Engineering Zoomcamp, we dove into the essentials of workflow orchestration, utilizing Mage as our primary tool.
Configuring Mage and PostgreSQL: Set up our environment for workflow orchestration, including configuring Mage and establishing a PostgreSQL database connection.
Building an ETL Pipeline with Mage: Utilized Mage to construct an ETL pipeline that processes and exports New York Taxi data, showcasing the power of Mage blocks (Data Loader, Transformer, Data Exporter) in streamlining data processing tasks.
The week began with configuring Mage, following the setup instructions provided by @mattpalmer. The initial configuration started up without trouble, but we made a few adjustments to better fit our own setup.
We streamlined our database setup by transitioning Mage's metadata storage from MySQL to PostgreSQL, the primary database used in the course. This change aimed to simplify the environment by using a single database system.
To initiate Mage, we begin by launching the database and Adminer containers, which are categorized under the db profile within the Docker Compose configuration:
docker compose --profile db up
This command activates all components reliant on PostgreSQL, effectively setting up the necessary database environment.
We start Mage in a similar way, with one catch: if there is no .env file in the root directory, the variables referenced in the Docker Compose file resolve to blank strings. To prevent this, we pass the location of the .env file explicitly:
docker compose --profile mage --env-file ./docker-envs/mage.env up
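To see why the missing .env file matters, here is a minimal Python sketch that mimics how `${NAME}` placeholders are filled in. It is only an illustration of the behavior described above, not Docker Compose's actual implementation; the helper names `parse_env` and `substitute` and the variable `POSTGRES_HOST` are hypothetical.

```python
import re


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def substitute(template, env):
    """Replace each ${NAME} with its value, or an empty string if unset."""
    return re.sub(r"\$\{(\w+)\}", lambda m: env.get(m.group(1), ""), template)


# Hypothetical line from a compose file
compose_line = "POSTGRES_HOST: ${POSTGRES_HOST}"

# No env file found: the placeholder collapses to a blank string
without_env = substitute(compose_line, {})

# Env file supplied: the placeholder gets its real value
with_env = substitute(compose_line, parse_env("# db settings\nPOSTGRES_HOST=postgres\n"))

print(repr(without_env))
print(repr(with_env))
```

Passing `--env-file` corresponds to the second case: the values are available at substitution time, so nothing silently becomes blank.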
We reuse the code from the previous week, with a few modifications, to complete this code block.
After loading the data, the transformation block is utilized for any necessary modifications. For instance, with a dataset like taxi trips, this block allows for data manipulation, the creation of new columns based on conditions, and the elimination of irrelevant data.
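As a rough sketch of what such a transformation might look like, the function below filters out trips without passengers and derives two new columns using pandas. The column names (`tpep_pickup_datetime`, `passenger_count`, `trip_distance`), the 10-mile threshold, and the sample rows are assumptions for illustration, not the exact code from the course. Inside Mage, a function like this would typically be decorated with `@transformer` from `mage_ai.data_preparation.decorators`; it is shown undecorated here so it runs on its own.

```python
import pandas as pd


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Eliminate irrelevant rows: trips that report no passengers
    df = df[df["passenger_count"] > 0].copy()
    # New column derived from an existing one: the pickup date
    df["pickup_date"] = pd.to_datetime(df["tpep_pickup_datetime"]).dt.date
    # New column based on a condition (the threshold is an arbitrary example)
    df["is_long_trip"] = df["trip_distance"] > 10
    return df


# Tiny made-up sample standing in for the taxi dataset
sample = pd.DataFrame(
    {
        "tpep_pickup_datetime": [
            "2021-01-01 00:30:10",
            "2021-01-01 00:51:20",
            "2021-01-02 08:15:00",
        ],
        "passenger_count": [1, 0, 2],
        "trip_distance": [2.1, 0.5, 12.4],
    }
)

result = transform(sample)
print(result)
```

The function works on a copy of the filtered rows, so the input DataFrame is left unchanged.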
Once you're confident in the quality and accuracy of your data, you have the option to export it to various destinations.
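The export step can be sketched in the same spirit. The example below writes a DataFrame to a SQL table with pandas' `to_sql`, using an in-memory SQLite database purely as a stand-in so the snippet runs without a server; in the course setup the destination would be PostgreSQL, reached through Mage's exporter block. The table name `yellow_taxi_trips`, the helper `export_to_table`, and the sample data are hypothetical.

```python
import sqlite3

import pandas as pd


def export_to_table(df: pd.DataFrame, conn, table_name: str) -> None:
    # Replace the table if it already exists; do not write the index as a column
    df.to_sql(table_name, conn, if_exists="replace", index=False)


# In-memory SQLite database used only as a stand-in for PostgreSQL
conn = sqlite3.connect(":memory:")

trips = pd.DataFrame(
    {
        "passenger_count": [1, 2],
        "trip_distance": [2.1, 12.4],
    }
)

export_to_table(trips, conn, "yellow_taxi_trips")

# Read the row count back to confirm the export landed
row_count = conn.execute("SELECT COUNT(*) FROM yellow_taxi_trips").fetchone()[0]
print(row_count)
```

Using `if_exists="replace"` keeps reruns idempotent, which is convenient while iterating on a pipeline; appending would be the alternative for incremental loads.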
In Mage, workflows are scheduled using triggers, which can be set based on events, schedules, or API webhooks. This flexibility allows workflows to run automatically under specific conditions, enhancing efficiency and automation.