Stratpoint Engineering

Data Engineering Internship 2026

Sign in with your Stratpoint Google account to continue.

Data Engineering
Data Engineering Internship 2026
Chapter 5

Code Patterns & References

Use these patterns as starting points. Read the official references alongside each week's activity.

Project 1 — Requirements & Deliverables

What to Deliver

#Deliverable / FeatureRequirementRequired?
1Entity-Relationship Diagram (ERD)Identifies all entities, attributes, and relationships for the given scenario. Shows cardinality (1:1, 1:N, N:M). Uses correct ERD notation.MVP
2Star Schema DesignFact table clearly identified with measurable metrics. At least 3 dimension tables. Foreign keys correctly defined.MVP
3Design Justification DocumentWritten explanation (1-2 pages): why you chose this schema, trade-offs considered, what queries it is optimised for.MVP
4Presentation (10 min)Walk through your ERD and star schema. Explain the design decisions. Handle 3-5 questions from instructors.MVP
5Snowflake schema variantExtend the star schema to snowflake by normalising at least one dimension table. Explain the trade-off.Stretch

Acceptance Criteria

CheckHow It's VerifiedBlocking?
ERD covers all entities described in the scenarioInstructor checks against the scenario specYes
Cardinality is marked on every relationshipVisual review during presentationYes
Star schema has a clear fact table with numeric measuresInstructor checks the schema diagramYes
Presentation covers design decisions and trade-offsPresentation reviewYes
You can answer questions about normalisation vs denormalisationLive Q&A during presentationYes

What to Submit

Deliverable Checklist

  • Send to your instructor before the presentation session:
  • [ ] ERD diagram (image or draw.io export)
  • [ ] Star schema diagram (image or draw.io export)
  • [ ] Design justification document (Google Doc or PDF)
  • [ ] Presentation slides
  • [ ] GitHub repo link with all materials committed

5.1 ERD Design Checklist

Before finalising your ERD, verify all of these:

CheckWhat to verify
Every entity is a nounEntities are things (Movie, Director, User) — not actions (Watching, Rating)
Every attribute is atomicNo multi-valued fields. Split "genres" into a separate table, not a comma-separated string.
Primary keys on every tableEvery entity has a unique identifier. Prefer surrogate keys (id SERIAL).
Foreign keys are explicitEvery relationship has a FK on the child table pointing to the PK of the parent.
Cardinality is marked1:1, 1:N, N:M clearly shown. N:M resolved into a junction table.
No redundant dataIf the same data appears in two places, it should be in one place with a FK.

5.2 Star Schema vs Snowflake Schema

DimensionStar SchemaSnowflake Schema
StructureDenormalised — all dimension attributes in one flat tableNormalised — dimension attributes split into related tables
Query speedFaster — fewer joinsSlower — more joins
StorageMore storage — some data repeatedLess storage — data normalised
MaintenanceEasier — fewer tables to manageHarder — more tables and relationships
When to useAnalytics and reporting (OLAP)When storage matters or data changes frequently

5.3 SQL Patterns to Know (Week 1)

-- Create a fact table with foreign keys
CREATE TABLE fact_movie_ratings (
  rating_id   SERIAL PRIMARY KEY,
  movie_id    INT REFERENCES dim_movies(movie_id),
  user_id     INT REFERENCES dim_users(user_id),
  date_id     INT REFERENCES dim_date(date_id),
  rating      DECIMAL(3,1),
  review_len  INT
);

-- Basic star schema query
SELECT
  m.title,
  m.genre,
  AVG(f.rating)  AS avg_rating,
  COUNT(*)       AS total_reviews
FROM fact_movie_ratings f
JOIN dim_movies m ON f.movie_id = m.movie_id
JOIN dim_date  d ON f.date_id  = d.date_id
WHERE d.year = 2024
GROUP BY m.title, m.genre
ORDER BY avg_rating DESC;

Week 1 Official References

CategoryResourceWhat You'll LearnURL (click or type)
DatabasePostgreSQL DocsFull reference for CREATE TABLE, constraints, data typespostgresql.org/docs/current/
Databasedraw.ioFree ERD and diagram tool (browser-based)app.diagrams.net
ConceptsKimball Data Warehouse ToolkitStar schema design principles (authoritative reference)kimballgroup.com/data-warehouse-business-intelligence-resources/
SQLSQL Tutorial — W3SchoolsCore SQL syntax reference with examplesw3schools.com/sql/
SQLSQLBoltInteractive SQL exercises for practicesqlbolt.com

Project 2 — Requirements & Deliverables

What to Build

#Deliverable / FeatureRequirementRequired?
1Shell script: data downloadBash script that downloads a dataset (CSV or JSON) using curl or wget. Handles errors. Logs download completion.MVP
2PySpark loaderReads the downloaded file using PySpark. Applies basic schema inference or explicit schema. Writes to a staging table in PostgreSQL.MVP
3Pandas cleanerReads the staging table. Applies at least 3 cleaning operations (nulls, type casting, deduplication). Writes clean data to a new table.MVP
4SQL queriesAt least 5 analytical SQL queries against the clean table. Must include GROUP BY, JOIN, and at least one window function.MVP
5READMEDocuments how to run each step. Explains the dataset. Shows sample query outputs.MVP
6Unit tests for cleaning functionspytest tests for at least 2 Pandas cleaning functions. Tests cover happy path and null/edge cases.Stretch

Acceptance Criteria

CheckHow It's VerifiedBlocking?
Shell script runs without errors on a clean machineInstructor runs it from scratchYes
PySpark job loads data into PostgreSQL staging tableInstructor checks the table exists and has dataYes
Pandas cleaning is documented — what was cleaned and whyCode comments + READMEYes
All 5 SQL queries return results and use correct syntaxInstructor runs each queryYes
Window function is present in at least one queryCode reviewYes
README explains how to reproduce the pipeline end to endInstructor follows README on clean environmentYes

What to Submit

Deliverable Checklist

  • Send to your instructor before the presentation session:
  • [ ] GitHub repo link with all code committed
  • [ ] Short video or live demo showing each pipeline step running
  • [ ] README with setup instructions and sample outputs
  • [ ] SQL queries file with comments explaining each query

5.4 Shell Script Template

#!/bin/bash
set -e   # exit immediately on any error

DATA_URL="https://example.com/movies.csv"
OUTPUT_DIR="./data/raw"
OUTPUT_FILE="$OUTPUT_DIR/movies.csv"

mkdir -p "$OUTPUT_DIR"

echo "[$(date)] Starting download from $DATA_URL"
curl -L -o "$OUTPUT_FILE" "$DATA_URL"
echo "[$(date)] Downloaded $(wc -l < $OUTPUT_FILE) lines to $OUTPUT_FILE"

5.5 PySpark: Load CSV to PostgreSQL

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

spark = SparkSession.builder \
    .appName("MoviesLoader") \
    .config("spark.jars", "/path/to/postgresql-42.7.3.jar") \
    .getOrCreate()

schema = StructType([
    StructField("movie_id",    IntegerType(), True),
    StructField("title",       StringType(),  True),
    StructField("genre",       StringType(),  True),
    StructField("release_year",IntegerType(), True),
    StructField("rating",      FloatType(),   True),
])

df = spark.read.csv("./data/raw/movies.csv", header=True, schema=schema)

DB_URL = "jdbc:postgresql://localhost:5432/bootcamp"
PROPS = {"user":"postgres","password":"postgres","driver":"org.postgresql.Driver"}

df.write.jdbc(url=DB_URL, table="staging_movies", mode="overwrite", properties=PROPS)
print(f"Loaded {df.count()} rows into staging_movies")

5.6 Pandas: Clean and Write

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://postgres:postgres@localhost:5432/bootcamp")

# Read from staging
df = pd.read_sql("SELECT * FROM staging_movies", engine)

# 1. Drop duplicate rows
df = df.drop_duplicates(subset=["movie_id"])

# 2. Fill missing genres
df["genre"] = df["genre"].fillna("Unknown")

# 3. Remove rows with no title
df = df.dropna(subset=["title"])

# 4. Normalise text columns
df["title"] = df["title"].str.strip().str.title()
df["genre"] = df["genre"].str.strip().str.lower()

# 5. Clamp ratings to valid range
df = df[df["rating"].between(0, 10)]

# Write clean data
df.to_sql("clean_movies", engine, if_exists="replace", index=False)
print(f"Wrote {len(df)} clean rows to clean_movies")

5.7 SQL Patterns for Analytics

-- Window function: rank movies by rating within each genre
SELECT
  title,
  genre,
  rating,
  RANK() OVER (PARTITION BY genre ORDER BY rating DESC) AS genre_rank
FROM clean_movies;

-- LAG / LEAD: year-over-year change
SELECT
  release_year,
  COUNT(*)                                            AS movie_count,
  LAG(COUNT(*)) OVER (ORDER BY release_year)         AS prev_year_count,
  COUNT(*) - LAG(COUNT(*)) OVER (ORDER BY release_year) AS yoy_change
FROM clean_movies
GROUP BY release_year
ORDER BY release_year;

Week 2 Official References

CategoryResourceWhat You'll LearnURL (click or type)
PythonPython 3 Official DocsLanguage reference, standard librarydocs.python.org/3/
PandasPandas User GuideDataFrame operations, I/O, cleaningpandas.pydata.org/docs/user_guide/
PySparkPySpark DocumentationSparkSession, DataFrames, SQLspark.apache.org/docs/latest/api/python/
PostgreSQLpsycopg2 DocsPython PostgreSQL adapterpsycopg.org/docs/
SQLAlchemySQLAlchemy DocsPython ORM and database toolkitdocs.sqlalchemy.org
SQLPostgreSQL Window FunctionsRANK, LAG, LEAD, ROW_NUMBER explainedpostgresql.org/docs/current/tutorial-window.html
Testingpytest DocsPython testing frameworkdocs.pytest.org

Project 3 — Requirements & Deliverables

What to Build

#Deliverable / FeatureRequirementRequired?
1dbt project structureSeparate folders for staging, intermediate, and marts models. Naming conventions followed: stg_, int_, fct_, dim_.MVP
2Staging modelsOne stg_ model per source table. Renames columns to snake_case. Casts data types. No business logic.MVP
3Intermediate modelsAt least one int_ model that joins or enriches staging data. Uses CTEs, no SELECT *.MVP
4Mart modelsAt least one fct_ fact table and one dim_ dimension table ready for reporting.MVP
5dbt testsunique and not_null tests on all primary keys. At least one accepted_values test. schema.yml with column descriptions.MVP
6Airflow DAGDAG that runs the full pipeline: extract → load → dbt run → dbt test. Scheduled to run daily.MVP
7dbt docsdbt docs generate + serve. Screenshot of lineage graph submitted.Stretch
8Custom singular testA .sql test file that validates a specific business rule (e.g. no rating above 10).Stretch

Acceptance Criteria

CheckHow It's VerifiedBlocking?
dbt run completes with zero errorsInstructor runs: dbt runYes
dbt test passes with zero failuresInstructor runs: dbt testYes
All models follow staging/intermediate/mart structureInstructor checks project folder layoutYes
No SELECT * in any modelCode reviewYes
Airflow DAG runs successfully end to endInstructor triggers the DAG and checks all tasks greenYes
schema.yml has descriptions on all models and key columnsCode reviewYes

What to Submit

Deliverable Checklist

  • Send to your instructor:
  • [ ] GitHub repo link with full dbt project and Airflow DAG committed
  • [ ] Screenshot of dbt test results (all green)
  • [ ] Screenshot of Airflow DAG run (all tasks green)
  • [ ] README explaining how to run the pipeline locally

5.8 dbt Project Structure

my_project/
  dbt_project.yml
  profiles.yml
  models/
    staging/
      stg_movies.sql
      stg_ratings.sql
      _stg_models.yml      # sources + tests
    intermediate/
      int_movies_enriched.sql
    marts/
      fct_ratings.sql
      dim_movies.sql
      _mart_models.yml
  tests/
    assert_rating_range.sql
  macros/

5.9 dbt Model Templates

Staging model — stg_movies.sql

-- models/staging/stg_movies.sql
WITH source AS (
    SELECT * FROM {{ source("raw", "movies") }}
),

renamed AS (
    SELECT
        movie_id::INT           AS movie_id,
        TRIM(title)             AS title,
        LOWER(TRIM(genre))      AS genre,
        release_year::INT       AS release_year,
        rating::FLOAT           AS rating
    FROM source
    WHERE movie_id IS NOT NULL
)

SELECT * FROM renamed

Mart model — fct_ratings.sql

-- models/marts/fct_ratings.sql
WITH movies AS (
    SELECT * FROM {{ ref("stg_movies") }}
),

enriched AS (
    SELECT * FROM {{ ref("int_movies_enriched") }}
),

final AS (
    SELECT
        m.movie_id,
        m.title,
        m.genre,
        m.release_year,
        e.avg_rating,
        e.total_reviews
    FROM movies m
    LEFT JOIN enriched e USING (movie_id)
)

SELECT * FROM final

Schema tests — _mart_models.yml

# models/marts/_mart_models.yml
version: 2

models:
  - name: fct_ratings
    description: "One row per movie with aggregated rating stats."
    columns:
      - name: movie_id
        description: "Unique movie identifier"
        tests:
          - unique
          - not_null
      - name: genre
        description: "Normalised genre string"
        tests:
          - not_null
      - name: avg_rating
        description: "Mean rating across all reviews"

5.10 Airflow DAG Template

# dags/movies_pipeline.py
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "intern",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="movies_pipeline",
    default_args=default_args,
    schedule="@daily",
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:

    extract = BashOperator(
        task_id="extract",
        bash_command="bash /opt/airflow/scripts/download.sh",
    )

    load = BashOperator(
        task_id="load",
        bash_command="python /opt/airflow/scripts/load_pyspark.py",
    )

    transform = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/airflow/dbt",
    )

    test = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /opt/airflow/dbt",
    )

    extract >> load >> transform >> test

Weeks 3–5 Official References

CategoryResourceWhat You'll LearnURL (click or type)
dbtdbt Core DocsProject setup, models, tests, macros, sourcesdocs.getdbt.com
dbtdbt Best PracticesStaging/intermediate/mart naming and structuredocs.getdbt.com/guides/best-practices
dbtref() and source()How to reference models and declare raw sourcesdocs.getdbt.com/reference/dbt-jinja-functions/ref
dbtdbt TestsGeneric tests: unique, not_null, accepted_valuesdocs.getdbt.com/docs/build/data-tests
AirflowAirflow ConceptsDAGs, Operators, Tasks, Scheduling explainedairflow.apache.org/docs/apache-airflow/stable/core-concepts/
AirflowBashOperatorRun shell commands from a DAGairflow.apache.org/docs/apache-airflow/stable/howto/operator/bash.html
AirflowDocker Compose SetupOfficial local development environmentairflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html
SQLPostgreSQL CTEsWITH clause for readable, modular queriespostgresql.org/docs/current/queries-with.html

Project 4 — Requirements & Deliverables

What to Build

#Deliverable / FeatureRequirementRequired?
1Power BI data modelConnect to PostgreSQL. Import fact and dimension tables. Define relationships. Use star schema layout in Model view.MVP
2Base DAX measuresAt least 5 measures: Total Count, Average Rating, % of Total, period comparison, and one conditional measure.MVP
3Dashboard page 1: OverviewHigh-level KPIs visible at a glance. Cards, a bar/column chart, a trend line. Page-level filters.MVP
4Dashboard page 2: Deep-diveInteractive drill-through or cross-filter behaviour. Slicers for at least 2 dimensions. One advanced visual.MVP
5Data story narrativeA written paragraph (in the report or as a document) explaining: what the data shows, what action it suggests, who the audience is.MVP
6Time intelligence measuresSAMEPERIODLASTYEAR, TOTALYTD, or DATESYTD using a proper date dimension.Stretch

Acceptance Criteria

CheckHow It's VerifiedBlocking?
Data model uses star schema layout in Power BI Model viewVisual review of the .pbix fileYes
All relationships defined correctly (no ambiguous relationships)Instructor inspects Model viewYes
At least 5 DAX measures defined in a dedicated Measures tableCode reviewYes
Dashboard answers at least 3 specific business questionsPresentation walkthroughYes
Slicers and cross-filter interactions work correctlyLive demo during presentationYes
Data story explains insight and recommended actionWritten narrative reviewedYes

What to Submit

Deliverable Checklist

  • Send to your instructor:
  • [ ] .pbix file uploaded to the shared folder or GitHub repo
  • [ ] Screenshot of data model (Model view)
  • [ ] PDF export of the dashboard
  • [ ] Data story narrative (in the report or as a separate document)

5.11 Power BI Data Model Checklist

CheckWhat to verify
Star schema layoutFact table in the centre. Dimension tables surrounding it. No circular relationships.
Relationship cardinalityFact-to-dimension: Many-to-One (*:1). Cross-filter direction: Single.
Date tableA proper date dimension table. Mark as Date Table in Power BI. Continuous date range, no gaps.
Measures tableCreate an empty table called "Measures" and place all DAX measures there. Never put measures in dimension tables.
No calculated columns for aggregationsUse measures for SUM, AVERAGE, COUNT. Calculated columns for row-level values only.
Hide FK columnsHide foreign key columns in the fact table from report view — they are only needed for relationships.

5.12 DAX Measure Patterns

-- Basic measures
Total Movies = COUNTROWS(fct_ratings)
Avg Rating = AVERAGE(fct_ratings[avg_rating])
Total Reviews = SUM(fct_ratings[total_reviews])

-- % of total
Rating Share % =
DIVIDE(
    [Total Movies],
    CALCULATE([Total Movies], ALL(dim_movies[genre]))
)

-- Conditional measure
High Rated Movies =
CALCULATE(
    [Total Movies],
    fct_ratings[avg_rating] >= 7.5
)

-- Year-over-year comparison (requires date table)
Movies Last Year =
CALCULATE([Total Movies], SAMEPERIODLASTYEAR(dim_date[date]))

5.13 Data Storytelling Principles

PrincipleWhat it means
Lead with the insightStart with the conclusion, not the data. "Ratings declined 18% in 2024" not "Here is a chart of ratings over time."
Every visual earns its placeRemove any chart that does not directly support a finding. Less is more.
Use consistent colourOne colour for the primary metric. A second for comparison. Never rainbow charts.
Label the key pointsAnnotate the highest and lowest values. Label the most important bar or line directly.
State the recommended actionA dashboard without a recommendation is just decoration. End with what to do next.

Weeks 6–7 Official References

CategoryResourceWhat You'll LearnURL (click or type)
Power BIPower BI Desktop DocsFull product documentation and tutorialslearn.microsoft.com/en-us/power-bi/
DAXDAX GuideEvery DAX function with examplesdax.guide
DAXSQLBI DAX PatternsProduction-grade DAX patterns (free)daxpatterns.com
Power BIConnect to PostgreSQLStep-by-step connector setuplearn.microsoft.com/en-us/power-bi/connect-data/desktop-data-sources
VisualisationStorytelling with Data (Cole Knaflic)Data visualisation design principlesstorytellingwithdata.com/chart-guide