Ingest real Fall 2026 BU course offerings (sections, instructors, meeting times, locations) #280

Description

@AndresL230

Summary

Staging currently has a catalog-only academics dataset: ~8,195 real BU courses (code / name / department / description) each paired with a single hollow course_offerings row. The operational layer — sections, instructors, meeting times, locations, syllabi, and correct term — exists only for the 4 hand-written seed_staging demo classes. This issue tracks ingesting real Fall 2026 (fall-2026) BU offering data so courses behave like a real registrar.

The scrape itself will be run separately; this issue covers defining the scrape contract and building the import + supporting model/API changes.

Current state (measured against staging)

| Table | Rows | Notes |
| --- | --- | --- |
| courses | 8,195 | full BU catalog; school_id null on ~8,192, credits null on all but 3 |
| course_offerings | 8,196 | every course has exactly 1 offering except CS101, which has 2 (the +1) |
| terms | 4 | fall-2025, spring-2026, summer-2026, fall-2026 all seeded (0019) |
| schools | 1 | only "Sapling Demo University" (seed-school-demo) |
| enrollments | 4 | demo only |

Offering field population across all 8,196 rows:

| Field | Populated |
| --- | --- |
| section | 0 (null on every row) |
| instructor_name | 4 (demo only) |
| meeting_times | 4 (demo only) |
| location | 4 (demo only) |
| syllabus_url | 0 |

Term spread of existing offerings: spring-2026 → 8,194, fall-2025 → 2. fall-2026 has 0 offerings today.

Relevant code:

  • backend/db/seed_staging.py — demo seed only; never touches the real catalog.
  • backend/services/academics.py: resolve_offering / current_term / term_for_offering.
  • backend/db/migrations/0020_academics_split.sql — the catalog/offering split + UNIQUE (course_id, term_id, section).
  • backend/routes/academics.py — currently only exposes GET /semesters.
  • All DB access goes through backend/db/connection.py::table() (PostgREST, no DDL).

Tasks

1. Scraper / data spec (contract only — scrape run separately)

  • Define the structured output schema the Fall 2026 scrape must produce (the contract the importer consumes). Proposed per-section record:
    • course_code (must match catalog format, e.g. CAS CS 111 — space-separated, uppercase)
    • section (e.g. A1, B2)
    • instructor_name
    • meeting_times (e.g. MWF 09:00)
    • location
    • course_name, credits, description, syllabus_url (optional; used to enrich/create the catalog course if missing)
  • Decide format (JSON lines / CSV) and where the scraped file lives (not committed; gitignored ops input).
  • Document course_code normalization rules so scrape output joins cleanly to courses.course_code.
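A minimal sketch of what the per-section contract could look like, assuming JSON lines is the chosen format (that decision is still open above). `SectionRecord`, `normalize_course_code`, and `parse_line` are hypothetical names, and the normalization shown (trim, uppercase, collapse whitespace) is only one candidate rule set, not an agreed one.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional


def normalize_course_code(raw: str) -> str:
    """Assumed rule: trim, collapse whitespace runs to one space, uppercase."""
    return re.sub(r"\s+", " ", raw.strip()).upper()


@dataclass(frozen=True)
class SectionRecord:
    # Per-section fields. Which of these beyond course_code/section are
    # mandatory is not specified, so they default to None here (assumption).
    course_code: str
    section: str
    instructor_name: Optional[str] = None
    meeting_times: Optional[str] = None
    location: Optional[str] = None
    # Optional catalog-enrichment fields.
    course_name: Optional[str] = None
    credits: Optional[float] = None  # numeric type is an assumption
    description: Optional[str] = None
    syllabus_url: Optional[str] = None


def parse_line(line: str) -> SectionRecord:
    """Parse one JSON-lines row, normalizing course_code and section."""
    data = json.loads(line)
    data["course_code"] = normalize_course_code(data["course_code"])
    data["section"] = data["section"].strip().upper()
    # Unknown keys raise TypeError, which keeps the contract strict.
    return SectionRecord(**data)


record = parse_line(
    '{"course_code": " cas  cs 111 ", "section": "a1", "meeting_times": "MWF 09:00"}'
)
```

Rejecting unknown keys is a deliberate choice in this sketch: a scrape that drifts from the contract fails loudly at parse time instead of silently dropping fields.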

2. Importer + section model (core)

  • New idempotent ops script (e.g. backend/db/import_offerings.py, run like seed_staging under the target env) that:
    • resolves each record's course_code → existing courses.id (create the catalog course if absent, enriching name/credits/description),
    • upserts one course_offerings row per section for term_id = 'fall-2026',
    • writes section, instructor_name, meeting_times, location, syllabus_url,
    • is idempotent (deterministic id or upsert-on-UNIQUE) so re-runs add nothing.
  • All access via db/connection.py::table(); env-agnostic (run against staging first).
  • Fix the section dedup semantics. UNIQUE (course_id, term_id, section) is NULL-distinct today, so real multi-section data won't dedup. Add an append-only migration (next number, currently at 0028 → 0029) to either set a non-null default section (e.g. '') or switch to UNIQUE ... NULLS NOT DISTINCT (PG 15+). Pick one and make the importer consistent with it.
  • Tests in backend/tests/ (mirror test_seed_staging.py): idempotency, code→course resolution, per-section row creation, dedup on re-run.
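The dedup issue and the upsert-on-UNIQUE option can be illustrated with a small, hedged sketch. It uses an in-memory SQLite table as a stand-in (the real importer would go through db/connection.py::table() against Postgres); SQLite, like Postgres by default, treats NULLs as distinct in a UNIQUE constraint. The table is trimmed to a few columns, and `import_rows` and the instructor values are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE course_offerings (
        course_id TEXT NOT NULL,
        term_id TEXT NOT NULL,
        section TEXT,
        instructor_name TEXT,
        UNIQUE (course_id, term_id, section)
    )
    """
)

UPSERT = """
    INSERT INTO course_offerings (course_id, term_id, section, instructor_name)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (course_id, term_id, section)
    DO UPDATE SET instructor_name = excluded.instructor_name
"""


def import_rows(rows):
    """Upsert each (course_id, term_id, section, instructor_name) tuple
    and return the resulting total row count."""
    conn.executemany(UPSERT, rows)
    return conn.execute("SELECT COUNT(*) FROM course_offerings").fetchone()[0]


# NULL sections never conflict with each other, so a re-run adds a row.
null_rows = [("c1", "fall-2026", None, "TBA")]
count_after_first_null = import_rows(null_rows)
count_after_second_null = import_rows(null_rows)

conn.execute("DELETE FROM course_offerings")

# Non-null sections (including a '' default) hit the constraint and dedup.
real_rows = [
    ("c1", "fall-2026", "A1", "Instructor X"),
    ("c1", "fall-2026", "B2", "Instructor Y"),
    ("c2", "fall-2026", "", "Instructor Z"),
]
count_after_first_real = import_rows(real_rows)
count_after_second_real = import_rows(real_rows)
```

In this sketch the NULL-section import grows from 1 to 2 rows on re-run, while the non-null import stays at 3. That is the behavior the idempotency and dedup-on-re-run tests would need to pin down for whichever option the migration picks.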

3. BU school + catalog linkage

  • Add a real schools row for Boston University (proper name/slug).
  • Link catalog courses to it (school_id is null on ~8,192 rows today).
  • Pre-check for duplicate course_codes before linking — UNIQUE (school_id, course_code) is currently NULL-distinct, so dup codes may exist that would collide once school_id is set. Resolve/merge duplicates as part of this task.
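The duplicate pre-check could be as simple as grouping catalog rows by code before any school_id is written. This is a sketch only: `find_duplicate_codes` is a hypothetical helper, the rows are assumed to be plain dicts with `id` and `course_code` keys, and the sample ids and codes are illustrative, not real staging data.

```python
from collections import defaultdict


def find_duplicate_codes(courses):
    """Group course ids by course_code and return only codes with more than one id."""
    ids_by_code = defaultdict(list)
    for course in courses:
        ids_by_code[course["course_code"]].append(course["id"])
    return {code: ids for code, ids in ids_by_code.items() if len(ids) > 1}


# Illustrative rows; in practice these would be read from the courses table.
sample_courses = [
    {"id": "a", "course_code": "CAS CS 111"},
    {"id": "b", "course_code": "CAS CS 111"},
    {"id": "c", "course_code": "CAS CS 112"},
]
duplicates = find_duplicate_codes(sample_courses)
```

Any code that appears in the result would collide on UNIQUE (school_id, course_code) once every row shares the BU school_id, so it needs to be merged or resolved first.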

4. API + frontend display

  • Audit the offering read path and surface section / instructor_name / meeting_times / location / syllabus_url through the course/offering API (today routes/academics.py only returns /semesters).
  • Update the relevant frontend components so a course with a Fall 2026 offering shows its section, instructor, meeting times, and location.
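One way the read path might project an offering row into an API payload is sketched below. `OFFERING_FIELDS` and `offering_payload` are hypothetical names; the actual route and response shape in routes/academics.py are not defined in this issue, so treat this as a shape suggestion rather than the implementation.

```python
# The five operational fields this issue asks the API to surface.
OFFERING_FIELDS = (
    "section",
    "instructor_name",
    "meeting_times",
    "location",
    "syllabus_url",
)


def offering_payload(row):
    """Return only the display fields from an offering row (a dict).

    Missing fields come back as None so the frontend can render a
    consistent shape for hollow, catalog-only offerings.
    """
    return {field: row.get(field) for field in OFFERING_FIELDS}


payload = offering_payload({"id": "o1", "section": "A1", "location": None})
```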

Notes / gotchas

  • fall-2026 runs 2026-08-24 → 2027-01-03; the term row already exists, so the importer just references it by id.
  • current_term() is date-derived — today (2026-06-26) resolves to Summer 2026 (0 offerings). The importer should target fall-2026 explicitly, not "current term".
  • The knowledge graph keys on the abstract course_id (cumulative across terms); gradebook on enrollment_id; study/analytics on offering_id. Adding Fall 2026 offerings must not fork a course's graph identity — resolve to the existing catalog course_id.
  • Staging and prod both currently hold only catalog data (no real user data in prod per the redesign). Run/verify on staging first.
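To make the "target fall-2026 explicitly" point concrete, here is a hedged sketch of a date-derived lookup. `term_for_date` is a hypothetical stand-in, not the real current_term(), and only the fall-2026 window given above is included because the other terms' dates are not stated in this issue.

```python
from datetime import date

# Only the fall-2026 window is known from the notes above.
TERM_WINDOWS = {"fall-2026": (date(2026, 8, 24), date(2027, 1, 3))}

# Pinned explicitly rather than derived from today's date.
IMPORT_TERM_ID = "fall-2026"


def term_for_date(day, windows=TERM_WINDOWS):
    """Return the id of the term whose inclusive window contains `day`, else None."""
    for term_id, (start, end) in windows.items():
        if start <= day <= end:
            return term_id
    return None


derived_today = term_for_date(date(2026, 6, 26))
derived_in_term = term_for_date(date(2026, 9, 1))
```

A date-derived lookup run on 2026-06-26 does not land on fall-2026, which is why the importer should use a pinned constant like `IMPORT_TERM_ID` instead of whatever the current date resolves to.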

Metadata

    Labels

    P2 (Medium priority), backend (Backend / API), data-integrity (Correctness of stored/served data), enhancement (New feature or request)
