Add program_enrollment_with_user_report#1914
Conversation
- Create reporting model that extends marts__combined_program_enrollment_detail - Join marts__combined__users for user demographic data (education, industry, company, etc.) - Join marts__combined_coursesinprogram and marts__combined_course_enrollment_detail for courses_taken_in_program - Add capstone_ind indicator based on course title - Add program_complete_days calculated from first courserun start to certificate date - Add full documentation and tests to _reporting__models.yml - Replaces Superset virtual dataset with physical table
There was a problem hiding this comment.
Pull request overview
This pull request converts the Superset virtual dataset Program_Enrollment_with_user into a physical dbt reporting model to improve query performance. The new model extends the existing marts__combined_program_enrollment_detail mart by joining user demographic data from marts__combined__users and recalculating program course statistics filtered to verified enrollments only.
Changes:
- Creates new
program_enrollment_with_user_reportmodel that joins program enrollment data with user demographics (education, industry, company, job title, birth year, gender, address fields) - Adds calculated fields:
courses_taken_in_program(count of verified courses),capstone_ind(capstone completion indicator), andprogram_complete_days(days from first course start to program completion) - Adds YAML documentation with column descriptions and data quality tests for the new model
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/ol_dbt/models/reporting/program_enrollment_with_user_report.sql | New SQL model that extends program enrollment detail with user demographics and verified course statistics |
| src/ol_dbt/models/reporting/_reporting__models.yml | Adds model and column documentation with basic not_null tests for key fields |
Comments suppressed due to low confidence (1)
src/ol_dbt/models/reporting/_reporting__models.yml:826
- The YAML documentation only documents a subset of the columns that will be in the final model. Since line 40 uses
enrollment_detail.*, all columns frommarts__combined_program_enrollment_detailwill be included in the output (program_id, program_title, program_track, program_type, programenrollment_is_active, programcertificate_created_on, unique_courses_taken_in_program, capstone_indicator, program_completion_days, etc.).
The YAML should either:
- Document ALL columns that will be in the final model (including those from enrollment_detail), or
- Explicitly select only the needed columns from enrollment_detail in the SQL (recommended approach, see combined_enrollments_with_gender_and_date.sql for an example)
This is especially important because some columns will be duplicated with different semantics (e.g., unique_courses_taken_in_program vs courses_taken_in_program, capstone_indicator vs capstone_ind, program_completion_days vs program_complete_days).
- name: program_enrollment_with_user_report
description: Program enrollment report enriched with user demographic data and course
completion details. Extends marts__combined_program_enrollment_detail with user
profile information, capstone completion indicator, and program completion duration.
columns:
- name: user_email
description: string, email address of the enrolled user. May be null for some
platforms.
- name: program_name
description: string, name of the program the user is enrolled in
tests:
- not_null
- name: platform_name
description: string, name of the platform (e.g. edX.org, MITx Online, xPRO)
tests:
- not_null
- name: user_highest_education
description: string, highest level of education attained by the user. Coalesced
from edX.org and hashed ID user matches.
- name: user_industry
description: string, industry the user works in. Coalesced from edX.org and hashed
ID user matches.
- name: user_company
description: string, company the user works for. Coalesced from edX.org and hashed
ID user matches.
- name: user_job_title
description: string, job title of the user. Coalesced from edX.org and hashed
ID user matches.
- name: user_birth_year
description: int, birth year of the user. Coalesced from edX.org and hashed ID
user matches.
- name: user_gender
description: string, gender of the user. Coalesced from edX.org and hashed ID
user matches.
- name: user_address_country
description: string, country of the user. Coalesced from edX.org and hashed ID
user matches.
- name: user_street_address
description: string, street address of the user from edX.org match
- name: user_address_city
description: string, city of the user from edX.org match
- name: user_address_state_or_territory
description: string, state or territory of the user from edX.org match
- name: user_address_postal_code
description: string, postal code of the user from edX.org match
- name: courses_taken_in_program
description: int, count of distinct verified courses the user has taken within
the program
- name: capstone_ind
description: str, indicator of whether the user has completed a capstone course.
'Y' if capstone completed, 'N' otherwise.
- name: program_complete_days
description: int, number of days between the first course run start date and the
program certificate creation date
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ) | ||
|
|
||
| select | ||
| enrollment_detail.* |
There was a problem hiding this comment.
Using enrollment_detail.* will include ALL columns from marts__combined_program_enrollment_detail, which already contains unique_courses_taken_in_program, capstone_indicator, and program_completion_days. This creates duplicate fields with different values/semantics:
unique_courses_taken_in_program(all enrollments) vscourses_taken_in_program(verified only)capstone_indicator(based on all enrollments) vscapstone_ind(based on verified enrollments)program_completion_days(may have reversed date order) vsprogram_complete_days
This duplication will be confusing for end users. Consider explicitly selecting only the needed columns from enrollment_detail rather than using .*, similar to how other reporting models like combined_enrollments_with_gender_and_date handle this pattern.
| from courses_in_program | ||
| inner join course_enrollment_detail | ||
| on courses_in_program.course_readable_id = course_enrollment_detail.course_readable_id | ||
| where course_enrollment_detail.courserunenrollment_enrollment_mode = 'verified' |
There was a problem hiding this comment.
The courses_detail CTE calculates course statistics by joining courses_in_program with course_enrollment_detail on course_readable_id only, filtering to verified enrollments only. However, marts__combined_program_enrollment_detail already includes similar fields (unique_courses_taken_in_program, capstone_indicator, first_courserun_start_on_date) but calculated from ALL enrollments, not just verified ones.
This creates a semantic difference: the new courses_taken_in_program counts only verified course enrollments within the program, while the existing unique_courses_taken_in_program from the mart counts all enrollments. This difference should be clearly documented or the fields should be renamed to avoid confusion (e.g., verified_courses_taken_in_program).
| , coalesce( | ||
| combined_users.user_address_country, combined_users2.user_address_country | ||
| ) as user_address_country | ||
| , courses_detail.courses_taken_in_program |
There was a problem hiding this comment.
When a user has enrolled in a program but has no verified course enrollments within that program, the LEFT JOIN to courses_detail at lines 69-71 will not find a match. This causes courses_taken_in_program to be NULL instead of 0, which could be misleading. Consider wrapping this field with COALESCE(courses_detail.courses_taken_in_program, 0) to make it clearer that the user has taken zero verified courses rather than having missing data.
| , courses_detail.courses_taken_in_program | |
| , coalesce(courses_detail.courses_taken_in_program, 0) as courses_taken_in_program |
|
@copilot open a new pull request to apply changes based on the comments in this thread |
…d NULL handling (#1920) * Initial plan * Apply PR review feedback: explicit column selection and COALESCE for courses_taken_in_program Co-authored-by: quazi-h <59845076+quazi-h@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: quazi-h <59845076+quazi-h@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| , combined_users.user_street_address | ||
| , combined_users.user_address_city | ||
| , combined_users.user_address_state_or_territory | ||
| , combined_users.user_address_postal_code |
There was a problem hiding this comment.
These address fields reference combined_users without a coalesce fallback to combined_users2. Since the first join on line 84-86 only matches edX.org platform records (due to the platform_name = 'edX.org' condition), these fields will always be NULL for non-edX.org platforms (MITx Online, xPro, etc.), even when user data exists in combined_users2. Consider applying the same coalesce pattern used for the demographic fields above (lines 61-71).
| , combined_users.user_street_address | |
| , combined_users.user_address_city | |
| , combined_users.user_address_state_or_territory | |
| , combined_users.user_address_postal_code | |
| , coalesce( | |
| combined_users.user_street_address, combined_users2.user_street_address | |
| ) as user_street_address | |
| , coalesce( | |
| combined_users.user_address_city, combined_users2.user_address_city | |
| ) as user_address_city | |
| , coalesce( | |
| combined_users.user_address_state_or_territory | |
| , combined_users2.user_address_state_or_territory | |
| ) as user_address_state_or_territory | |
| , coalesce( | |
| combined_users.user_address_postal_code, combined_users2.user_address_postal_code | |
| ) as user_address_postal_code |
| - name: program_enrollment_with_user_report | ||
| description: Program enrollment report enriched with user demographic data and course | ||
| completion details. Extends marts__combined_program_enrollment_detail with user | ||
| profile information, capstone completion indicator, and program completion duration. | ||
| columns: | ||
| - name: user_email | ||
| description: string, email address of the enrolled user. May be null for some | ||
| platforms. | ||
| - name: program_name | ||
| description: string, name of the program the user is enrolled in | ||
| tests: | ||
| - not_null | ||
| - name: platform_name | ||
| description: string, name of the platform (e.g. edX.org, MITx Online, xPRO) | ||
| tests: | ||
| - not_null | ||
| - name: user_highest_education | ||
| description: string, highest level of education attained by the user. Coalesced | ||
| from edX.org and hashed ID user matches. | ||
| - name: user_industry | ||
| description: string, industry the user works in. Coalesced from edX.org and hashed | ||
| ID user matches. | ||
| - name: user_company | ||
| description: string, company the user works for. Coalesced from edX.org and hashed | ||
| ID user matches. | ||
| - name: user_job_title | ||
| description: string, job title of the user. Coalesced from edX.org and hashed | ||
| ID user matches. | ||
| - name: user_birth_year | ||
| description: int, birth year of the user. Coalesced from edX.org and hashed ID | ||
| user matches. | ||
| - name: user_gender | ||
| description: string, gender of the user. Coalesced from edX.org and hashed ID | ||
| user matches. | ||
| - name: user_address_country | ||
| description: string, country of the user. Coalesced from edX.org and hashed ID | ||
| user matches. | ||
| - name: user_street_address | ||
| description: string, street address of the user from edX.org match | ||
| - name: user_address_city | ||
| description: string, city of the user from edX.org match | ||
| - name: user_address_state_or_territory | ||
| description: string, state or territory of the user from edX.org match | ||
| - name: user_address_postal_code | ||
| description: string, postal code of the user from edX.org match | ||
| - name: courses_taken_in_program | ||
| description: int, count of distinct verified courses the user has taken within | ||
| the program | ||
| - name: capstone_ind | ||
| description: str, indicator of whether the user has completed a capstone course. | ||
| 'Y' if capstone completed, 'N' otherwise. | ||
| - name: program_complete_days | ||
| description: int, number of days between the first course run start date and the | ||
| program certificate creation date |
There was a problem hiding this comment.
Missing grain validation test. According to the codebase convention for reporting models, this model should include a dbt_expectations.expect_compound_columns_to_be_unique test to validate the grain of the data and ensure data integrity. Since this model extends marts__combined_program_enrollment_detail (which has grain tests on lines 534-539 of _marts__combined__models.yml), this report should maintain the same grain. Add a tests section with expect_compound_columns_to_be_unique using the appropriate column combinations that match the upstream mart's grain.
|
@copilot open a new pull request to apply changes based on the comments in this thread |
KatelynGit
left a comment
There was a problem hiding this comment.
It does look like some definitions are missing and will need to be added however the build ran clean
…n tests, full column docs (#1926) * Initial plan * Address PR review: fix address field coalesce, add grain tests, expand column docs Co-authored-by: quazi-h <59845076+quazi-h@users.noreply.github.com> * Fix yamlfmt: rewrap program_complete_days description to match formatter output Co-authored-by: quazi-h <59845076+quazi-h@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: quazi-h <59845076+quazi-h@users.noreply.github.com>
- Add user_hashed_id to courses_detail CTE for reliable joining - Change join condition from user_email to user_hashed_id - Prevents incorrect capstone_ind and NULL program_complete_days for users with NULL emails - Addresses Sentry feedback on NULL handling in joins
|
@copilot code review[agent] can I get another review on this with the most updated commits |
…user_report (#1942) * Initial plan * Fix courses_detail CTE: remove user_email from GROUP BY to prevent row multiplication Co-authored-by: quazi-h <59845076+quazi-h@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: quazi-h <59845076+quazi-h@users.noreply.github.com>
- Move verified enrollment filter from WHERE clause to CASE statement - Only counts verified courses in courses_taken_in_program - Calculates program_complete_days using ANY enrollment mode - Prevents NULL completion time for users with audit/honor enrollments - Addresses Sentry feedback on NULL handling for program completion
Resolved conflict in _reporting__models.yml by keeping all three model definitions: - program_enrollment_with_user_report (from this branch) - mitxonline_course_engagements_daily_report (from main) - combined_video_engagements_counts_report (from main)
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ) as program_complete_days | ||
| from enrollment_detail | ||
| left join combined_users | ||
| on enrollment_detail.platform_name = 'edX.org' |
There was a problem hiding this comment.
This uses a hard-coded platform display name ('edX.org'). In this repo, platform names are typically sourced via dbt vars (e.g., {{ var('edxorg') }}) to avoid drift if display names ever change. Using the var here would keep the join consistent with upstream marts/intermediate models.
| on enrollment_detail.platform_name = 'edX.org' | |
| on enrollment_detail.platform_name = {{ var('edxorg') }} |
| - dbt_expectations.expect_compound_columns_to_be_unique: | ||
| column_list: ["program_title", "user_username", "platform_name", "programenrollment_created_on"] | ||
| row_condition: "program_id !=2 and platform_name !='edX.org'" | ||
| - dbt_expectations.expect_compound_columns_to_be_unique: | ||
| column_list: ["user_username", "user_email"] | ||
| row_condition: "program_id =2 and platform_name ='edX.org'" |
There was a problem hiding this comment.
The uniqueness tests are heavily scoped with row_condition, including a hard-coded exception for program_id = 2. This significantly weakens the grain validation and may mask real duplication introduced by the model joins. Prefer a single compound uniqueness test that matches the model’s intended grain (e.g., keys based on platform_name, program_id/program_readable_id, user_hashed_id, and an enrollment/certificate identifier), or add a brief rationale and adjust the SQL to eliminate duplicates rather than excluding them in tests.
| - dbt_expectations.expect_compound_columns_to_be_unique: | |
| column_list: ["program_title", "user_username", "platform_name", "programenrollment_created_on"] | |
| row_condition: "program_id !=2 and platform_name !='edX.org'" | |
| - dbt_expectations.expect_compound_columns_to_be_unique: | |
| column_list: ["user_username", "user_email"] | |
| row_condition: "program_id =2 and platform_name ='edX.org'" | |
| # The following compound uniqueness test is intended to match the grain of | |
| # program_enrollment_with_user_report: one row per user/program enrollment | |
| # instance across all programs and platforms. Previously, separate tests with | |
| # row_condition filters (including a special case for program_id = 2) weakened | |
| # grain validation by excluding subsets of data from uniqueness checks. | |
| - dbt_expectations.expect_compound_columns_to_be_unique: | |
| column_list: | |
| - program_id | |
| - platform_name | |
| - user_username | |
| - user_email | |
| - programenrollment_created_on |
| select | ||
| courses_in_program.program_name | ||
| , course_enrollment_detail.user_hashed_id | ||
| , count( |
There was a problem hiding this comment.
courses_detail is aggregated only by program_name + user_hashed_id, but marts__combined_coursesinprogram also has platform and program_id and program_name may not uniquely identify a program/track. This can merge course counts across distinct programs that share a name. Include courses_in_program.platform and a stable program identifier (program_id and/or program_readable_id) in the aggregation, and propagate it through to the final join so metrics stay scoped to the correct program instance.
| left join courses_detail | ||
| on enrollment_detail.program_name = courses_detail.program_name | ||
| and enrollment_detail.user_hashed_id = courses_detail.user_hashed_id |
There was a problem hiding this comment.
The join to courses_detail only matches on program_name and user_hashed_id. If program_name is reused across multiple program IDs/tracks on the same platform, this can attach the wrong course aggregates. Join using the same program disambiguators used elsewhere (e.g., platform_name/platform and program_id, handling null IDs with is not distinct from or a coalesce sentinel) to keep the grain correct.
|
@copilot open a new pull request to apply changes based on the comments in this thread and also resolve merge conflicts with against upstream changes in main |
…1951) * Fix courses_detail join grain, use var('edxorg'), and update uniqueness tests Co-authored-by: quazi-h <59845076+quazi-h@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…_report - Change courses_detail join from user_hashed_id to user_email - user_hashed_id is generated differently in enrollment vs course enrollment: - enrollment_detail: hashes user_id + platform_name - course_enrollment_detail: fallback (user_id → user_email → user_full_name) - Using user_email matches original Superset virtual dataset logic - Prevents NULL/0 values for courses_taken_in_program and program_complete_days - Addresses Sentry feedback on join failure
Resolved conflict in _reporting__models.yml by keeping both model definitions: - program_enrollment_with_user_report (from this branch) - problem_engagement_detail_report (from main)
…port Issue 1: Missing platform in course join - Add platform to join condition between courses_in_program and course_enrollment_detail - Prevents cross-product when same course_readable_id exists on multiple platforms - Ensures accurate aggregations in courses_detail CTE Issue 2: Inconsistent capstone_sum logic - Add verified enrollment mode filter to capstone_sum calculation - Matches courses_taken_in_program logic (both now filter for verified only) - Prevents capstone_ind='Y' for audit enrollments not counted in courses_taken - Aligns with original Superset virtual dataset behavior Issue 3: Aggregation already correct - courses_detail already groups by platform + program_id + user_email - Final join already uses these keys correctly - No changes needed for this issue Addresses all three Sentry feedback items
…only - Wrap date calculation in CASE to only include verified enrollment mode - Ensures program_complete_days is calculated from first VERIFIED enrollment - Prevents inflated completion times for users with audit/honor before verified - All three metrics now consistently filter for verified enrollments: - courses_taken_in_program - capstone_sum - first_courserun_start_on_date - Matches original Superset virtual dataset behavior Addresses Sentry feedback on program_complete_days inflation
Issue: - Test was failing with 1.8M duplicate rows - Used single test: [platform_name, program_id, user_username, user_email, programenrollment_created_on] - Upstream marts__combined_program_enrollment_detail uses TWO conditional tests Fix: - Match upstream conditional uniqueness tests exactly: 1. Non-edX.org (program_id !=2): [program_title, user_username, platform_name, programenrollment_created_on] 2. edX.org (program_id =2): [user_username, user_email] - This accounts for edX.org allowing multiple enrollments per user (re-enrollment scenarios) Why different tests: - edX.org programs (program_id=2) have different grain due to re-enrollment behavior - Non-edX programs use programenrollment_created_on as part of uniqueness - edX uses simpler user_username + user_email combination Addresses build failure reported by coworker
- Add comment explaining tests match upstream marts model design - Clarify this is a passthrough/enrichment model that inherits grain - Explain platform-specific behavior (edX.org re-enrollment vs others) - Add TODO referencing future investigation of upstream grain design
What are the relevant tickets?
https://github.com/mitodl/hq/issues/7644
Description (What does it do?)
This PR converts the Superset virtual dataset
Program_Enrollment_with_userinto a physical dbt reporting model to improve query performance in Superset.The new
program_enrollment_with_user_reportmodel extends the existingmarts__combined_program_enrollment_detailmart by joining additional data that was previously calculated in the virtual dataset:user_highest_education,user_industry,user_company,user_job_title,user_birth_year,user_gender,user_address_country, etc.) joined frommarts__combined__userswith a coalesce fallback for edX.org userscourses_taken_in_program- Count of distinct verified courses taken within the programcapstone_ind- 'Y'/'N' indicator of whether the user has taken a capstone courseprogram_complete_days- Number of days from first courserun start to program certificate dateHow can this be tested?
Build and test the model:
dbt build --select program_enrollment_with_user_report --vars 'schema_suffix: <your_username>' --target dev_productionOnce this PR is merged and deployed, the corresponding Superset virtual dataset can be replaced with this physical table.