Medallion Architecture on BigQuery
A Journey From Raw Data to Actionable Insights
In the fast-evolving world of data, building a resilient, scalable data pipeline is essential for turning raw data into actionable insights. Recently, I had the opportunity to design a pipeline to ingest data from multiple APIs, transform it, and store it in BigQuery, following a strategic approach known as the Medallion Architecture.
This project was as much a journey of discovery and problem-solving as it was a technical endeavor, offering practical lessons for building scalable, flexible data architectures.
Here’s a behind-the-scenes look at each stage of this project — my goals, the obstacles I encountered, and the solutions that brought it all together. For those looking to deepen their data engineering expertise, I hope this story provides a blend of technical guidance and real-world insights into building modern data pipelines.
Setting the Stage: Medallion Architecture
Before diving into the technical work, I needed a framework to keep the project organized and ensure high data quality, maintainability, and traceability. That’s when Medallion Architecture came into focus. Picture it as a layered system, where data transformations occur in progressive stages, each with a purpose:
• Raw Layer: Holds untouched data from the APIs. This layer is an addition to the standard Medallion architecture, introduced to handle the unstructured JSON payloads.
• Bronze Layer: A basic, structured layer that organizes the data, retaining every field as a string so that malformed values are preserved for debugging.
• Silver Layer: A refined, deduplicated, and standardized layer, ready for querying.
• Gold Layer: Contains feature-rich data optimized for analytics and machine learning.
Breaking the pipeline into these layers created a clear roadmap. But as I soon discovered, each layer had its own set of unique challenges and required tailored solutions.
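To make the layers concrete, here’s a minimal sketch of how the four layers might map onto BigQuery datasets using the Python client library. The project ID and dataset names are placeholders, not the ones from my actual pipeline.

```python
from google.cloud import bigquery

# Placeholder project ID; one dataset per Medallion layer keeps the stages separated.
PROJECT_ID = "my-gcp-project"
LAYERS = ["raw", "bronze", "silver", "gold"]

client = bigquery.Client(project=PROJECT_ID)

for layer in LAYERS:
    dataset = bigquery.Dataset(f"{PROJECT_ID}.{layer}")
    dataset.location = "US"
    client.create_dataset(dataset, exists_ok=True)  # no-op if the dataset already exists
```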
Step 1: Ingesting and Storing Raw Data — Facing API Complexities Head-On
Goal: Capture data directly from multiple APIs, each with different formats, latency, and reliability levels.
This first step was deceptively simple: call the APIs, retrieve the data, and store it. But with a mix of API formats and latencies, it soon became clear that I needed more than a straightforward fetch-and-store approach. I turned to Cloud Run Jobs (for low-frequency batch pulls) and Cloud Compute (for high-frequency pulls); together, these services let me handle the variance without bottlenecks.
• Challenge #1: Data variability. Each API returned data in different formats, making standardization a headache.
– Solution: Custom parsers in Cloud Run allowed me to transform each payload into a more consistent structure.
• Challenge #2: Error handling. Since API calls aren’t always reliable, handling retries was crucial.
– Solution: I integrated retry logic and set up detailed logging, so that if an API failed temporarily, the pipeline could try again later without breaking the entire process (a sketch of this retry approach follows below).
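The retry logic can stay very simple: a loop with exponential backoff. The function name, endpoint, and limits below are illustrative rather than the exact code running in my Cloud Run Jobs.

```python
import logging
import time

import requests


def fetch_with_retries(url: str, max_attempts: int = 3, base_delay: float = 2.0) -> dict:
    """Fetch JSON from an API, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()  # raise on 4xx/5xx responses
            return response.json()
        except (requests.ConnectionError, requests.Timeout, requests.HTTPError) as exc:
            logging.warning("Attempt %d/%d for %s failed: %s", attempt, max_attempts, url, exc)
            if attempt == max_attempts:
                raise  # give up; the job can be rerun later without breaking the pipeline
            time.sleep(base_delay * 2 ** (attempt - 1))
```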
By the end of this stage, I had a solid foundation of raw data, stored as-is in BigQuery’s raw dataset. Although raw, this data was accessible, retrievable, and ready for transformation.
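Storing a payload untouched boils down to writing the raw JSON string, plus some ingestion metadata, into a table in the raw dataset. A simplified sketch, with hypothetical table and column names:

```python
import json
from datetime import datetime, timezone

from google.cloud import bigquery

client = bigquery.Client()


def store_raw_payload(payload: dict, source: str) -> None:
    """Write one API payload, exactly as received, into the raw layer."""
    row = {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "raw_payload": json.dumps(payload),  # keep the JSON as-is for later reprocessing
    }
    errors = client.insert_rows_json("my-gcp-project.raw.api_payloads", [row])
    if errors:
        raise RuntimeError(f"Failed to insert raw payload: {errors}")
```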
Step 2: Raw to Bronze — Bringing Order to the Chaos
Goal: Structure and organize the raw data, creating the Bronze Layer as a cleaner, more consistent dataset.
Moving into the Bronze Layer, I could finally start bringing order to the chaotic mix of raw data. My task here was to extract the necessary fields from the JSON payloads, discarding irrelevant information while ensuring every important element had a place in BigQuery.
• Challenge #1: Schema definition. With such variability in the APIs, I needed a schema flexible enough to accommodate differences but strict enough to be reliable.
– Solution: A dynamic schema let me handle optional fields while enforcing required ones. This way, all essential data fields from each source fit seamlessly into the database.
• Challenge #2: Data quality checks. I wanted to ensure that even in the Bronze Layer, my data had a basic level of accuracy and consistency.
– Solution: Basic checks verified that key fields weren’t missing or malformed, preventing major data quality issues down the line (see the sketch after this list).
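To give a flavor of what this looks like, here’s a simplified sketch of flattening a raw payload into a Bronze row: every value is kept as a string, required fields are checked, and optional fields are allowed to be absent. The field names are hypothetical.

```python
import json
from typing import Optional

# Hypothetical field lists; the real schema depends on each source API.
REQUIRED_FIELDS = ["id", "event_timestamp"]
OPTIONAL_FIELDS = ["category", "value", "unit"]


def raw_to_bronze(raw_payload: str) -> Optional[dict]:
    """Flatten a raw JSON payload into a Bronze row, keeping every value as a string."""
    record = json.loads(raw_payload)

    # Basic quality check: reject records with missing or empty required fields.
    if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
        return None  # in the real pipeline this would be logged or routed to a rejects table

    row = {field: str(record[field]) for field in REQUIRED_FIELDS}
    # Optional fields may legitimately be absent; store NULLs when they are.
    for field in OPTIONAL_FIELDS:
        row[field] = str(record[field]) if record.get(field) is not None else None
    return row
```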
With this structured dataset in the Bronze Layer, I could confidently move forward, knowing that the core information was intact and accessible.
Step 3: Bronze to Silver — Refining and Standardizing
Goal: Ensure high data quality by deduplicating and standardizing the Bronze Layer into a refined Silver Layer.
As I ventured into the Silver Layer, my goal was to refine the data further. This involved deduplication, standardization, and formatting — making sure that all values across sources were consistent and easy to analyze.
• Challenge #1: Deduplication. Given multiple data sources, duplicates and minor discrepancies were inevitable.
– Solution: I built deduplication logic based on unique identifiers and timestamps, filtering out duplicates to create a more accurate dataset.
• Challenge #2: Data standardization and type enforcement. Each API used different formats for dates, categories, and units.
– Solution: Using lookup tables and conversion functions, I unified formats across sources, and by explicitly casting each column I ensured its data type was correct and consistent. Dates, units, and categories were standardized, so the data was reliable to analyze (a sketch of the deduplication and casting step follows below).
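In BigQuery, deduplication and explicit casting can be expressed in a single query: number the records per identifier with a window function, keep only the latest one, and cast each column on the way out. The sketch below uses hypothetical table and column names.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Keep the most recent record per id and cast columns to their proper types.
silver_query = """
CREATE OR REPLACE TABLE `my-gcp-project.silver.events` AS
SELECT * EXCEPT (row_num)
FROM (
  SELECT
    id,
    SAFE_CAST(event_timestamp AS TIMESTAMP) AS event_timestamp,
    SAFE_CAST(value AS FLOAT64)             AS value,
    UPPER(TRIM(category))                   AS category,
    ROW_NUMBER() OVER (
      PARTITION BY id
      ORDER BY SAFE_CAST(event_timestamp AS TIMESTAMP) DESC
    ) AS row_num
  FROM `my-gcp-project.bronze.events`
)
WHERE row_num = 1
"""

client.query(silver_query).result()  # block until the job finishes
```

Running the transformation as SQL inside BigQuery keeps the heavy lifting close to the data, so the orchestration code stays thin.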
The Silver Layer now held lean, correctly typed data: a significant milestone, as it was finally ready to support analytics and machine learning.
Step 4: Silver to Gold — Feature Engineering for Insights
Goal: Transform data in the Silver Layer into feature-rich tables ready for analytics and machine learning.
At last, I reached the Gold Layer, where data becomes more than just information. Here, it was all about creating features that analysts and machine learning models could use directly. I spent time engineering new metrics like rolling averages and cumulative counts, valuable tools for trend analysis and predictions.
• Challenge #1: Feature engineering. Turning raw data into meaningful metrics requires careful thought and planning.
– Solution: I built calculated fields and aggregates that added value to the dataset, ensuring it was ready for any analysis or ML task.
• Challenge #2: ML readiness. To make the data usable for ML, I needed to address missing values, scale features, and lay the data out for efficient access.
– Solution: I filled gaps in the data, scaled continuous features, and arranged the tables for easy access by ML models and analytics dashboards (see the sketch after this list).
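As an illustration of the kind of feature engineering involved, the query below (again with hypothetical table and column names) fills missing values, computes a seven-row rolling average, and adds a cumulative event count, all with window functions.

```python
from google.cloud import bigquery

client = bigquery.Client()

gold_query = """
CREATE OR REPLACE TABLE `my-gcp-project.gold.event_features` AS
SELECT
  id,
  event_timestamp,
  IFNULL(value, 0) AS value,  -- fill gaps so downstream ML code never sees NULLs
  AVG(value) OVER (           -- rolling average over the last 7 rows per id
    PARTITION BY id
    ORDER BY event_timestamp
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_avg_7,
  COUNT(*) OVER (             -- cumulative number of events per id
    PARTITION BY id
    ORDER BY event_timestamp
  ) AS cumulative_events
FROM `my-gcp-project.silver.events`
"""

client.query(gold_query).result()
```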
The Gold Layer was now ready — a resource-rich dataset poised to fuel dashboards, reports, and machine learning algorithms.
Scaling Up: Introducing PubSub as the Pipeline’s Backbone
As the pipeline grew, I realized that a scalable solution was essential to handle increasing data volume and complexity. That’s when PubSub came into play as a robust intermediary buffer, adding resilience and scalability to the architecture.
• Decoupling Stages: PubSub created a buffer between pipeline layers, letting each stage process data independently (see the publishing sketch after this list). This ensured smoother transitions between layers and minimized downtime.
• Traffic Management: During peak loads, PubSub’s buffering capabilities prevented data loss, allowing the system to handle spikes gracefully.
• Data Replay: PubSub’s retention feature meant that if any downstream process encountered an issue, data could be replayed for reprocessing — a lifesaver for long-term reliability.
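Hooking a stage into PubSub takes only a few lines with the client library. The sketch below publishes a small message announcing that a new batch is ready for the downstream layer; the topic name and message fields are illustrative.

```python
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical topic connecting the Bronze and Silver stages.
topic_path = publisher.topic_path("my-gcp-project", "bronze-to-silver")


def announce_batch(table: str, batch_id: str) -> None:
    """Tell the downstream stage that a new batch has landed."""
    message = json.dumps({"table": table, "batch_id": batch_id}).encode("utf-8")
    future = publisher.publish(topic_path, data=message)
    future.result(timeout=30)  # raise if the publish fails
```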
With PubSub in place, the pipeline could scale seamlessly, ready to support increased data flow while maintaining quality and performance.
Reflections and Next Steps
This project taught me the power of Medallion Architecture and the importance of each stage in building a scalable, resilient data pipeline. With Cloud Run Jobs, Cloud Compute, BigQuery, and PubSub working together, I was able to create a system that handles the variability of real-world data, ensuring quality and scalability at every stage.
Key Takeaways:
• Technical Growth: I deepened my skills in orchestrating data pipelines using Cloud Run, BigQuery, and PubSub.
• Best Practices in Data Engineering: Practical experience in schema evolution, deduplication, and feature engineering was invaluable.
• Building for Scale: With a decoupled architecture, the pipeline can adapt to data growth without compromising performance.
What’s Next?
I’m now exploring options to add a monitoring and alerting layer, enhancing the pipeline’s reliability even further. Stay tuned as this data engineering journey continues and I refine and expand the system in pursuit of even more resilient and insightful data solutions!