PRD: Hong Kong-first v2 upgrade for three-statement financial report standardization #12

@cengbiao-code

Description

Problem Statement

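fin_data already has the foundations in place: local PDF extraction, a SQLite audit DB, a DuckDB analytics DB, market rule packs, a validation engine, a review workbook, trusted publish, and Hong Kong PDF encoding diagnostics. The project still carries the product assumptions of its early "local text-based PDF MVP" stage, however. What the user actually needs now is a cross-market financial report standardization system built for personal research first. Its first phase focuses on high-accuracy extraction of the three primary statements from Hong Kong annual and interim reports, reviewable Excel output, and accumulation of the data in a local database.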

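The user wants neither a rewrite nor unconditional compatibility with every old implementation. The original project's problems lie mainly in implementation quality and edge cases, not in a flawed root architecture. The approach is therefore selective reuse: keep the existing architectural assets, build new source adapter, normalized financials, and QA/approved data contracts around the Hong Kong v2 main line, and gradually replace old implementations whose quality falls short.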

Solution

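Implement the v2 upgrade inside the existing project. Hong Kong is the first market, the Meituan 2025 annual report is the first golden sample, and Tencent is the second sample. Delivery proceeds in three steps:

  1. Milestone 1: design package. Establish the v2 documentation entry point, the source adapter contract, the normalized financials schema, the Hong Kong MVP plan, and a minimal code skeleton, without breaking the existing pipeline.
  2. Milestone 2: runnable MVP. Run the Meituan 2025 annual report end to end, supporting local PDF / HKEX URL / company IR direct PDF URL input, conditional Chinese-English pairing, three-statement normalization, the double-layer Excel workbook, raw/normalized/approved database storage, and a minimal Streamlit UI.
  3. Milestone 3: high-quality MVP. Add the Tencent annual report and Meituan/Tencent interim report samples, and wire in the review workflow, manual statement approval, the mapping candidate flow, screenshot saving for low-confidence cases, and legacy HK command forwarding.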

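The first phase puts personal research first: the Excel and Streamlit reading experience must be good, and the database must retain raw + normalized + statement-level approved data, but batch throughput, market-wide auto-discovery, and a complex Web App are not goals.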

User Stories

  1. As an investor, I want to process a Hong Kong annual report from a local PDF path, so that I can extract the three primary statements without manually copying tables.
  2. As an investor, I want to process a Hong Kong report from an HKEX PDF URL, so that I can avoid manually downloading every filing before extraction.
  3. As an investor, I want the downloader to only accept trusted/whitelisted domains, so that untrusted URLs do not enter my research database.
  4. As an investor, I want company IR direct PDF URLs to be accepted after domain confirmation, so that I can use company-hosted source files without broad web crawling.
  5. As an investor, I want the system to detect when a Chinese Hong Kong PDF text layer is unusable, so that it does not silently produce bad extracted data.
  6. As an investor, I want the system to conditionally find or request the corresponding English PDF when the Chinese PDF has encoding or classification issues, so that extraction can continue with higher confidence.
  7. As an investor, I want Chinese and English source documents grouped into a filing package, so that the system understands they belong to the same company, period, report type, and disclosure event.
  8. As an investor, I want the system to preserve each source document separately, so that every value can trace back to a concrete file, URL, language, page, and retrieval time.
  9. As an investor, I want annual reports and interim reports supported first, so that the system handles complete three-statement packages before tackling quarterly partial packages.
  10. As an investor, I want quarterly and KPI-only packages explicitly reserved for later partial-package handling, so that incomplete disclosures do not pollute the full three-statement database.
  11. As an investor, I want report periods stored exactly as disclosed, so that YTD, interim, annual, and single-period values are not silently mixed.
  12. As an investor, I want future derived single-quarter calculations to be stored separately from source-reported facts, so that calculated values do not masquerade as disclosed values.
  13. As an investor, I want the income statement, balance sheet, and cash flow statement to be normalized into a long-form financials table first, so that all outputs derive from one auditable fact layer.
  14. As an investor, I want normalized wide statement tabs generated from the long table, so that I can read the statements in a familiar format.
  15. As an investor, I want English and Chinese line item columns in the Excel workbook, so that I can review both the extraction source and the Chinese financial-report language.
  16. As an investor, I want translated or mapping-derived labels marked distinctly, so that the workbook does not imply a translation came directly from the source when it did not.
  17. As an investor, I want core line items to be complete for annual and interim packages, so that the resulting statements are useful for personal research.
  18. As an investor, I want non-core unmapped line items preserved in normalized long form with QA flags, so that I can review them later without blocking the whole package.
  19. As an investor, I want statement-level approved status, so that a clean balance sheet can be approved even if the cash flow statement needs review.
  20. As an investor, I want a separate full three-statement status, so that I know whether the entire package is complete and approved.
  21. As an investor, I want raw extraction, normalized financials, and approved views all stored locally, so that I can audit, improve, and reuse data over time.
  22. As an investor, I want source-reported facts, derived calculations, analyst adjustments, assumptions, and missing sources labeled differently, so that I can judge reliability.
  23. As an investor, I want values blocked from approved status when unit, period, currency, or source location is unclear, so that the database remains trustworthy.
  24. As an investor, I want QA flags for text encoding issues, source conflicts, low-confidence mappings, unit ambiguity, period ambiguity, and failed financial checks, so that review work is focused.
  25. As an investor, I want low-confidence or failed cases to save page/table screenshots, so that I can visually inspect difficult PDF regions.
  26. As an investor, I want successful high-confidence packages not to save unnecessary screenshots, so that storage stays manageable.
  27. As an investor, I want a double-layer Excel workbook, so that I can read key statements up front and inspect audit/source details in later tabs.
  28. As an investor, I want Cover, Normalized_IS, Normalized_BS, Normalized_CF, Checks, and QA_Flags early in the workbook, so that the first read is useful.
  29. As an investor, I want Source_Index, Normalized_Financials_Long, Mapping_Dictionary, Conflict_Log, and Raw_Extract preserved later in the workbook, so that the workbook remains auditable.
  30. As an investor, I want automatic extraction to pass an accuracy gate before approval, so that only reliable statement data enters the approved layer.
  31. As an investor, I want review workbook flow preserved for exceptions, so that I can manually confirm or correct failed statements without changing raw extraction records.
  32. As an investor, I want manual statement approval to require a reason and source location, so that manually confirmed data remains auditable.
  33. As an investor, I want review-based mappings to become mapping candidates first, so that one report-specific correction does not automatically pollute global rules.
  34. As an investor, I want to promote mapping candidates to global rules only after confirmation, so that rule quality improves safely over time.
  35. As an investor, I want a one-shot CLI command for daily use, so that I can normalize a report quickly.
  36. As an investor, I want step-by-step CLI commands for debugging, so that I can inspect ingest, pairing, extraction, normalization, validation, export, and publishing independently.
  37. As an investor, I want a small Streamlit UI, so that I can run the Hong Kong workflow locally without memorizing commands.
  38. As an investor, I want the Streamlit UI to accept a local path or direct PDF URL and show QA/Checks/Source_Index summaries, so that the UI is useful for personal research without becoming a full Web App.
  39. As a developer, I want the old export-pdf-statements command preserved initially, so that existing workflows and tests do not break.
  40. As a developer, I want the old HK shortcut eventually to route to the v2 pipeline, so that users get the improved behavior without learning every new command.
  41. As a developer, I want a unified SourceAdapter interface, so that HK PDF, local PDF, future SEC XBRL, and future A-share sources can share a stable contract.
  42. As a developer, I want SEC XBRL only documented and adapter-ready in the first phase, so that Hong Kong delivery stays focused while future US support is not blocked.
  43. As a developer, I want a small Meituan fixture instead of the full PDF in git, so that regression tests remain stable without bloating the repo or adding filing artifacts.
  44. As a developer, I want Tencent added as a second golden sample in the high-quality MVP, so that the system does not overfit Meituan.
  45. As a developer, I want lightweight new dependencies only where needed, so that the project remains simple while supporting URL download, fuzzy matching candidates, and Streamlit UI.
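
Stories 11 to 14 describe a long-form fact layer from which the wide statement tabs are projected. A minimal sketch of that projection follows, assuming a hypothetical `LongFact` record and `to_wide` helper (neither name comes from the existing codebase) and using placeholder numbers rather than real filing values. Derived values are kept out of the wide view so they cannot pass as disclosed figures.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LongFact:
    """One row of the long-form table (hypothetical shape, for illustration only)."""
    statement: str   # "IS", "BS", or "CF"
    line_item: str   # canonical key, e.g. "revenue"
    period: str      # period label stored exactly as disclosed
    value: float
    evidence: str    # e.g. "source_reported" or "derived"


def to_wide(facts, statement):
    """Pivot source-reported facts of one statement into {line_item: {period: value}}."""
    wide = defaultdict(dict)
    for fact in facts:
        # Skip other statements and anything that is not a disclosed value.
        if fact.statement != statement or fact.evidence != "source_reported":
            continue
        wide[fact.line_item][fact.period] = fact.value
    return dict(wide)


# Placeholder values, not taken from any real report.
facts = [
    LongFact("IS", "revenue", "FY2025", 100.0, "source_reported"),
    LongFact("IS", "revenue", "FY2024", 80.0, "source_reported"),
    LongFact("IS", "net_income", "FY2025", 10.0, "source_reported"),
    LongFact("IS", "revenue", "Q4 2025", 30.0, "derived"),
    LongFact("BS", "total_assets", "FY2025", 500.0, "source_reported"),
]
wide_is = to_wide(facts, "IS")
```

In this sketch the derived "Q4 2025" row stays in the long table but never reaches `wide_is`, which mirrors the separation asked for in story 12.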

Implementation Decisions

  • Use selective reuse rather than rewrite. Keep the existing repository, tests, CLI structure, audit database, DuckDB analytics concept, review workbook concept, validation/trusted publish concepts, HK encoding diagnostics, and PDF extraction experience.
  • Do not blindly preserve all old implementation details. Rework or replace HK statement location, bilingual pairing, normalized financials, approved statement gates, and Excel output where old quality is insufficient.
  • Prioritize Hong Kong for v2. US SEC XBRL remains part of the longer-term architecture, but Milestone 1 only defines the generic SourceAdapter interface and documentation for future US support.
  • Use Meituan 2025 annual report as the first golden sample. Use Tencent as the second golden sample. Add one Meituan or Tencent interim report for Milestone 3.
  • Support annual and interim reports in the first Hong Kong phase. Quarterly and KPI-only packages are reserved for later partial-package handling.
  • Use conditional bilingual pairing. First inspect the provided Chinese/user PDF; if text encoding, title detection, table classification, or validation fails, attempt semi-automatic English PDF pairing.
  • Implement semi-automatic English pairing for HKEX-style inputs. The system may infer a paired English URL from an HKEX PDF identifier/date/language pattern, but must not invent a source if inference fails.
  • Accept local PDF paths, HKEX PDF URLs, and company IR direct PDF URLs. Do not implement stock-code-plus-period auto-discovery in the first phase.
  • Use trusted-domain download rules. Default allow HKEX domains; company IR direct PDF domains require user confirmation/configuration.
  • Introduce filing_package / source_group as an explicit concept. A package groups Chinese PDF, English PDF, HKEX URL, IR URL, and future revised sources for the same company/period/disclosure event.
  • Store source documents separately under a filing package. Source_Index is a reader-facing/export view derived from package and source-document metadata.
  • Add a v2 normalized financials layer compatible with financials-normalizer discipline: Source_Index, Normalized_Financials_Long, Normalized_IS, Normalized_BS, Normalized_CF, QA_Flags, Conflict_Log, Assumptions_Register, Mapping_Dictionary, and Checks.
  • Keep extracted_facts/trusted_facts as compatibility assets. Add a bridge from old extracted facts into v2 normalized records where useful, but do not force every new v2 source through the old fact shape.
  • Use statement-level approval for Milestone 2/3. IS, BS, and CF can be approved independently; full three-statement status is derived from all three statement statuses.
  • Use a 2.5 accuracy gate for Hong Kong approved status: three statement pages located when required; units/currency/period clear; core line items complete; core financial checks pass; source IDs and locations exist; no unresolved blocker/high QA; if English extraction assists, results must link back to the same filing package and source pages.
  • For annual/interim reports, require core line item completeness. Core IS: revenue, gross_profit, operating_income, pretax_income, tax_expense, net_income. Core BS: cash_equivalents, total_assets, total_liabilities, total_equity, liabilities_equity. Core CF: cfo, cfi, cff, net_change_cash, beginning_cash, ending_cash.
  • Preserve non-core rows in normalized long form even when unmapped or low confidence. They should not block Excel generation unless they affect core checks or source integrity.
  • Store raw, normalized, and approved layers locally. Raw extraction is immutable; normalized records carry evidence/confidence; approved statement views expose only data passing gates or manual confirmation.
  • Allow manual statement approval with required approval reason, source location, and created_at metadata. Reviewer may be optional for a personal local app.
  • Store manual mapping confirmations as mapping candidates first. Promote them to global mapping rules only after explicit user approval.
  • Implement a double-layer Excel workbook. Reader-facing tabs come first; audit/source tabs follow.
  • Keep review workbook as the exception flow. Automatic success writes approved statement data; failures or ambiguous cases export review artifacts and QA flags for human correction.
  • Implement manual LLM fallback behind explicit --llm-assist only. Default behavior must not call an LLM. LLM input must be limited to small unresolved snippets such as unmapped labels, candidate page text, failed-check facts, or conflict summaries. Output must be structured and locally validated.
  • Use Streamlit as the first local Web UI. It lives inside the repo for now, likely under an app-level directory, and can be split later if it grows.
  • Do not implement static HTML output in the first phase. Excel plus Streamlit is sufficient for the first personal-research UI.
  • Add minimal dependencies only: httpx, rapidfuzz, and streamlit. Continue using current openpyxl for Excel in the first phase.
  • Use httpx for URL downloads because it provides a modern sync/async-capable Python HTTP client. Official docs describe it as a fully featured HTTP client with sync and async APIs and HTTP/1.1/HTTP/2 support.
  • Use Streamlit because official docs support the simple local app execution model via streamlit run, which fits a personal local research UI.
  • Keep SEC XBRL in architecture docs because SEC provides official EDGAR APIs for structured filing/company facts data, but do not implement the SEC adapter in this PRD.
  • Retain HKEX as the authoritative Hong Kong filing source for first-phase URL handling and paired source resolution.
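
The statement-level approval and core-completeness decisions above can be sketched roughly as follows. The core item keys and the status names come from this PRD; the function names, the severity strings, and the rule that a package is "partial" when only some statements are cleared are assumptions made for illustration. The sketch covers only part of the gate (completeness, checks, and blocker/high QA flags) and leaves out unit, currency, period, and source-location checks.

```python
# Core line items per statement, as listed in this PRD.
CORE_ITEMS = {
    "IS": {"revenue", "gross_profit", "operating_income",
           "pretax_income", "tax_expense", "net_income"},
    "BS": {"cash_equivalents", "total_assets", "total_liabilities",
           "total_equity", "liabilities_equity"},
    "CF": {"cfo", "cfi", "cff", "net_change_cash",
           "beginning_cash", "ending_cash"},
}
STATEMENTS = ("IS", "BS", "CF")
BLOCKING_SEVERITIES = frozenset({"blocker", "high"})  # assumed severity labels


def statement_status(statement, present_items, qa_severities, checks_passed):
    """Return "approved" only when every gate in this partial sketch passes."""
    missing = CORE_ITEMS[statement] - set(present_items)
    blocked = any(sev in BLOCKING_SEVERITIES for sev in qa_severities)
    if missing or blocked or not checks_passed:
        return "needs_review"
    return "approved"


def package_status(statuses):
    """Derive the full three-statement status from per-statement statuses."""
    cleared = {"approved", "manually_confirmed"}
    cleared_count = sum(1 for s in STATEMENTS if statuses.get(s) in cleared)
    if cleared_count == len(STATEMENTS):
        return "approved"
    # Assumption: some but not all statements cleared counts as "partial".
    return "partial" if cleared_count else "needs_review"
```

With this shape, a clean balance sheet can be approved while the cash flow statement is still under review, and the package as a whole reports "partial" until all three statements are cleared.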

Deep Modules To Build Or Modify

  • Source adapter layer: stable source ingestion contract for local files, HKEX/IR URLs, filing packages, source documents, and future source types.
  • Hong Kong PDF source adapter: handles local/HKEX/IR input, trusted-domain download, text health checks, conditional English pairing, and source package metadata.
  • Normalized financials model: source-index rows, long-form normalized records, wide statement projections, evidence/confidence labels, QA flags, conflicts, assumptions, and checks.
  • Normalization bridge: maps existing extracted facts/raw tables into v2 normalized records without breaking old pipeline behavior.
  • Approval engine: calculates statement-level approved/partial/needs-review status and full three-statement status from validation, completeness, source metadata, and QA flags.
  • Excel normalized workbook exporter: creates the double-layer workbook for reader-facing statements and audit/source review.
  • Download policy module: enforces trusted domains, hashes downloaded files, caches them, and records retrieval metadata.
  • Mapping candidate manager: captures report-specific manual mappings and promotes them only after explicit approval.
  • Streamlit local UI: wraps the one-shot Hong Kong v2 workflow and displays run status, Excel download, QA summary, Checks, and Source_Index.
  • Optional LLM exception resolver: explicit assist mode only; accepts bounded snippets and returns validated structured suggestions.
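
As one possible shape for the download policy module, the sketch below checks a URL against a trusted-domain list and builds a retrieval record with a content hash. It performs no network access, so the actual download (for example through httpx) is left out. The `hkexnews.hk` default, the HTTPS-only rule, and all function and field names are assumptions, not confirmed project decisions.

```python
import hashlib
from datetime import datetime, timezone
from urllib.parse import urlparse

# Assumption: HKEX filings are served from hosts under this domain.
DEFAULT_TRUSTED_DOMAINS = frozenset({"hkexnews.hk"})


def is_trusted(url, trusted_domains=DEFAULT_TRUSTED_DOMAINS):
    """Accept only HTTPS URLs whose host is a trusted domain or a subdomain of one."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.hostname:
        return False
    host = parsed.hostname.lower()
    # Matching on "." + domain rejects look-alikes such as "evilhkexnews.hk".
    return any(host == d or host.endswith("." + d) for d in trusted_domains)


def retrieval_record(url, content):
    """Build the metadata stored alongside a downloaded file (bytes in, dict out)."""
    return {
        "url": url,
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }
```

A company IR domain would be added to the trusted set only after the user confirms it, for instance `is_trusted(url, DEFAULT_TRUSTED_DOMAINS | {"ir.example.com"})`, where `ir.example.com` is a placeholder.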

Testing Decisions

  • Tests should validate external behavior and data contracts, not internal implementation details.
  • Keep the existing test suite passing throughout Milestone 1 so that old CLI and pipeline behavior is not accidentally broken.
  • Add contract tests for normalized financials models: required fields, evidence labels, confidence labels, bilingual labels, source locations, and statement type values.
  • Add source adapter tests for accepted and rejected URL domains, local-file input, source document metadata, and filing package grouping.
  • Add Hong Kong pairing tests using deterministic HKEX-style URL/filename fixtures. Do not require network in unit tests.
  • Add Meituan golden sample fixture tests using small JSON/text/table snippets, not the full PDF.
  • Add approval gate tests for statement-level approved, partial, needs_review, and manually_confirmed statuses.
  • Add tests that blocker/high QA flags prevent approved status.
  • Add tests that annual/interim core line item completeness is required, while non-core low-confidence rows are preserved but do not necessarily block output.
  • Add tests that quarterly/partial packages are represented as partial and do not enter the full three-statement approved view in this phase.
  • Add tests that raw extraction records remain immutable and review corrections are append-only.
  • Add tests that manual mapping corrections create candidates, not global mapping rules.
  • Add tests that --llm-assist is the only path that can call the LLM exception resolver.
  • Add tests that the Streamlit app imports without triggering a pipeline run, so UI code remains testable.
  • Add smoke tests for Milestone 2 with the local Meituan PDF outside git, using committed expected core normalized JSON as the assertion target.
  • Prior art to reuse: existing tests for audit DB, extractors, fact extractor, HK content decoder, PDF font inspector, review workbook, table classifier, trusted publish, validation runner, and statement workbook.
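
To illustrate the behavior-level style of these tests, here is a sketch of the mapping-candidate tests written against a small in-memory stand-in. `MappingStore`, its methods, and the sample label are hypothetical; the real tests would target whatever interface the mapping candidate manager ends up exposing. The test functions use plain `assert` so that pytest can collect them unchanged.

```python
class MappingStore:
    """In-memory stand-in for the mapping candidate manager (hypothetical API)."""

    def __init__(self):
        self.candidates = []
        self.global_rules = {}

    def record_manual_mapping(self, source_label, canonical_key, report_id):
        """A manual correction is stored as a report-specific candidate only."""
        candidate = {
            "source_label": source_label,
            "canonical_key": canonical_key,
            "report_id": report_id,
            "status": "candidate",
        }
        self.candidates.append(candidate)
        return candidate

    def promote(self, candidate, *, user_confirmed):
        """Promotion to a global rule requires explicit user confirmation."""
        if not user_confirmed:
            raise PermissionError("promotion requires explicit user approval")
        self.global_rules[candidate["source_label"]] = candidate["canonical_key"]
        candidate["status"] = "promoted"


def test_manual_mapping_creates_candidate_not_global_rule():
    store = MappingStore()
    store.record_manual_mapping("Revenues", "revenue", report_id="sample-report")
    assert len(store.candidates) == 1
    assert store.global_rules == {}


def test_promotion_requires_confirmation():
    store = MappingStore()
    cand = store.record_manual_mapping("Revenues", "revenue", report_id="sample-report")
    try:
        store.promote(cand, user_confirmed=False)
    except PermissionError:
        pass
    else:
        raise AssertionError("unconfirmed promotion should be rejected")
    assert store.global_rules == {}
    store.promote(cand, user_confirmed=True)
    assert store.global_rules == {"Revenues": "revenue"}
```

Both tests assert on observable state (candidates and global rules) rather than on internal calls, in line with the first testing decision above.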

Out of Scope

  • Rewriting the project in a new repository.
  • Implementing US SEC XBRL ingestion in Milestone 1.
  • Implementing A-share v2 source adapters in Milestone 1.
  • Supporting quarterly/KPI-only partial packages as a first-phase golden path.
  • Building a full multi-user Web App or authentication system.
  • Building static HTML report output in the first phase.
  • Adding PaddleOCR, Arelle, xlsxwriter, or full HTML rendering in the first phase.
  • Auto-searching HKEX by ticker and fiscal period.
  • Crawling company IR homepages to discover PDFs.
  • Automatically deriving single-quarter values from YTD disclosures in the first phase.
  • Letting LLM output directly enter trusted or approved data.
  • Treating English-derived labels or analyst translations as source-reported Chinese labels.
  • Saving screenshots for every high-confidence successful table.

Further Notes

  • This PRD intentionally changes the project stance from a PDF-first MVP to a Hong Kong-first v2 financials normalizer while preserving valuable existing assets.
  • Official Streamlit docs support the local streamlit run execution model, which fits the selected personal research UI direction.
  • Official HTTPX docs describe sync and async support and strict timeout behavior, supporting the choice for URL downloads.
  • SEC EDGAR APIs remain strategically relevant for future US support, but Hong Kong is explicitly first.
  • The ready-for-agent label should be applied to this issue. The first implementation issue after this PRD should be Milestone 1 only, not the full three-milestone build.
