Problem Statement
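`fin_data` already provides local PDF extraction, a SQLite audit DB, a DuckDB analytics DB, market rule packs, a validation engine, a review workbook, trusted publish, and Hong Kong PDF encoding diagnostics, but the project still carries the product assumptions of an early "local text-based PDF MVP". The user's actual need is now a personal-research-first, cross-market financial statement normalization system. Its first phase prioritizes high-accuracy extraction of the three primary statements from Hong Kong annual and interim reports, reviewable Excel output, and accumulation of data in a local database.
The user does not want to rewrite the project, nor to stay unconditionally compatible with every legacy implementation. The original project's problems lie mainly in implementation quality and edge cases, not in a flawed root architecture. The approach is therefore selective reuse: keep the existing architectural assets, build a new source adapter, normalized financials, and QA/approved data contracts around the Hong Kong v2 main line, and gradually replace legacy implementations whose quality is insufficient.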
Solution
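Implement the v2 upgrade inside the existing project. Hong Kong is the first market, the Meituan 2025 annual report is the first golden sample, and Tencent is the second sample. Delivery proceeds in three steps:
- Milestone 1: Design package. Establish the v2 documentation entry point, source adapter contract, normalized financials schema, Hong Kong MVP plan, and a minimal code skeleton, without breaking the existing pipeline.
- Milestone 2: Runnable MVP. Run the Meituan 2025 annual report end to end, supporting local PDF / HKEX URL / company IR direct PDF URL input, conditional Chinese-English pairing, three-statement normalization, a double-layer Excel workbook, raw/normalized/approved database storage, and a minimal Streamlit UI.
- Milestone 3: High-quality MVP. Add the Tencent annual report and Meituan/Tencent interim report samples, and integrate the review workflow, manual statement approval, the mapping candidate flow, low-confidence screenshot saving, and legacy HK command forwarding.
The first phase is personal-research-first: the Excel and Streamlit reading experience must be good, and the database must retain raw + normalized + statement-level approved data, but batch throughput, market-wide auto-discovery, and a complex Web App are not goals.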
User Stories
- As an investor, I want to process a Hong Kong annual report from a local PDF path, so that I can extract the three primary statements without manually copying tables.
- As an investor, I want to process a Hong Kong report from an HKEX PDF URL, so that I can avoid manually downloading every filing before extraction.
- As an investor, I want the downloader to only accept trusted/whitelisted domains, so that untrusted URLs do not enter my research database.
- As an investor, I want company IR direct PDF URLs to be accepted after domain confirmation, so that I can use company-hosted source files without broad web crawling.
- As an investor, I want the system to detect when a Chinese Hong Kong PDF text layer is unusable, so that it does not silently produce bad extracted data.
- As an investor, I want the system to conditionally find or request the corresponding English PDF when the Chinese PDF has encoding or classification issues, so that extraction can continue with higher confidence.
- As an investor, I want Chinese and English source documents grouped into a filing package, so that the system understands they belong to the same company, period, report type, and disclosure event.
- As an investor, I want the system to preserve each source document separately, so that every value can trace back to a concrete file, URL, language, page, and retrieval time.
- As an investor, I want annual reports and interim reports supported first, so that the system handles complete three-statement packages before tackling quarterly partial packages.
- As an investor, I want quarterly and KPI-only packages explicitly reserved for later partial-package handling, so that incomplete disclosures do not pollute the full three-statement database.
- As an investor, I want report periods stored exactly as disclosed, so that YTD, interim, annual, and single-period values are not silently mixed.
- As an investor, I want future derived single-quarter calculations to be stored separately from source-reported facts, so that calculated values do not masquerade as disclosed values.
- As an investor, I want the income statement, balance sheet, and cash flow statement to be normalized into a long-form financials table first, so that all outputs derive from one auditable fact layer.
- As an investor, I want normalized wide statement tabs generated from the long table, so that I can read the statements in a familiar format.
- As an investor, I want English and Chinese line item columns in the Excel workbook, so that I can review both the extraction source and the Chinese financial-report language.
- As an investor, I want translated or mapping-derived labels marked distinctly, so that the workbook does not imply a translation came directly from the source when it did not.
- As an investor, I want core line items to be complete for annual and interim packages, so that the resulting statements are useful for personal research.
- As an investor, I want non-core unmapped line items preserved in normalized long form with QA flags, so that I can review them later without blocking the whole package.
- As an investor, I want statement-level approved status, so that a clean balance sheet can be approved even if the cash flow statement needs review.
- As an investor, I want a separate full three-statement status, so that I know whether the entire package is complete and approved.
- As an investor, I want raw extraction, normalized financials, and approved views all stored locally, so that I can audit, improve, and reuse data over time.
- As an investor, I want source-reported facts, derived calculations, analyst adjustments, assumptions, and missing sources labeled differently, so that I can judge reliability.
- As an investor, I want values blocked from approved status when unit, period, currency, or source location is unclear, so that the database remains trustworthy.
- As an investor, I want QA flags for text encoding issues, source conflicts, low-confidence mappings, unit ambiguity, period ambiguity, and failed financial checks, so that review work is focused.
- As an investor, I want low-confidence or failed cases to save page/table screenshots, so that I can visually inspect difficult PDF regions.
- As an investor, I want successful high-confidence packages not to save unnecessary screenshots, so that storage stays manageable.
- As an investor, I want a double-layer Excel workbook, so that I can read key statements up front and inspect audit/source details in later tabs.
- As an investor, I want Cover, Normalized_IS, Normalized_BS, Normalized_CF, Checks, and QA_Flags early in the workbook, so that the first read is useful.
- As an investor, I want Source_Index, Normalized_Financials_Long, Mapping_Dictionary, Conflict_Log, and Raw_Extract preserved later in the workbook, so that the workbook remains auditable.
- As an investor, I want automatic extraction to pass an accuracy gate before approval, so that only reliable statement data enters the approved layer.
- As an investor, I want review workbook flow preserved for exceptions, so that I can manually confirm or correct failed statements without changing raw extraction records.
- As an investor, I want manual statement approval to require a reason and source location, so that manually confirmed data remains auditable.
- As an investor, I want review-based mappings to become mapping candidates first, so that one report-specific correction does not automatically pollute global rules.
- As an investor, I want to promote mapping candidates to global rules only after confirmation, so that rule quality improves safely over time.
- As an investor, I want a one-shot CLI command for daily use, so that I can normalize a report quickly.
- As an investor, I want step-by-step CLI commands for debugging, so that I can inspect ingest, pairing, extraction, normalization, validation, export, and publishing independently.
- As an investor, I want a small Streamlit UI, so that I can run the Hong Kong workflow locally without memorizing commands.
- As an investor, I want the Streamlit UI to accept a local path or direct PDF URL and show QA/Checks/Source_Index summaries, so that the UI is useful for personal research without becoming a full Web App.
- As a developer, I want the old `export-pdf-statements` command preserved initially, so that existing workflows and tests do not break.
- As a developer, I want the old HK shortcut eventually to route to the v2 pipeline, so that users get the improved behavior without learning every new command.
- As a developer, I want a unified SourceAdapter interface, so that HK PDF, local PDF, future SEC XBRL, and future A-share sources can share a stable contract.
- As a developer, I want SEC XBRL only documented and adapter-ready in the first phase, so that Hong Kong delivery stays focused while future US support is not blocked.
- As a developer, I want a small Meituan fixture instead of the full PDF in git, so that regression tests remain stable without bloating the repo or adding filing artifacts.
- As a developer, I want Tencent added as a second golden sample in the high-quality MVP, so that the system does not overfit Meituan.
- As a developer, I want lightweight new dependencies only where needed, so that the project remains simple while supporting URL download, fuzzy matching candidates, and Streamlit UI.
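The filing-package and evidence-label stories above can be pictured with a small data model. This is only a sketch under stated assumptions: the class names, field names, and enum values (`SourceDocument`, `FilingPackage`, `EvidenceLabel`, `source_id`, and so on) are hypothetical and are not taken from the existing `fin_data` code.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class EvidenceLabel(str, Enum):
    """Hypothetical labels mirroring the reliability categories in the stories."""
    SOURCE_REPORTED = "source_reported"
    DERIVED = "derived"
    ANALYST_ADJUSTMENT = "analyst_adjustment"
    ASSUMPTION = "assumption"
    MISSING_SOURCE = "missing_source"


@dataclass(frozen=True)
class SourceDocument:
    """One concrete file; kept separate so every value can trace back to it."""
    source_id: str
    language: str          # assumed codes: "zh" or "en"
    origin: str            # local path or URL
    sha256: str
    retrieved_at: datetime


@dataclass
class FilingPackage:
    """Groups the documents of one company / period / disclosure event."""
    company: str
    period_end: str        # stored as disclosed, e.g. "2025-12-31"
    report_type: str       # assumed values: "annual" or "interim"
    documents: list[SourceDocument] = field(default_factory=list)

    def add(self, doc: SourceDocument) -> None:
        # Refuse duplicates so each source_id identifies exactly one file.
        if any(d.source_id == doc.source_id for d in self.documents):
            raise ValueError(f"duplicate source_id: {doc.source_id}")
        self.documents.append(doc)

    def by_language(self, language: str) -> list[SourceDocument]:
        return [d for d in self.documents if d.language == language]
```

Keeping `SourceDocument` frozen is one way to reflect the requirement that source records are preserved rather than edited; the package itself stays mutable so a paired English PDF can be attached later.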
Implementation Decisions
- Use selective reuse rather than rewrite. Keep the existing repository, tests, CLI structure, audit database, DuckDB analytics concept, review workbook concept, validation/trusted publish concepts, HK encoding diagnostics, and PDF extraction experience.
- Do not blindly preserve all old implementation details. Rework or replace HK statement location, bilingual pairing, normalized financials, approved statement gates, and Excel output where old quality is insufficient.
- Prioritize Hong Kong for v2. US SEC XBRL remains part of the longer-term architecture, but Milestone 1 only defines the generic SourceAdapter interface and documentation for future US support.
- Use Meituan 2025 annual report as the first golden sample. Use Tencent as the second golden sample. Add one Meituan or Tencent interim report for Milestone 3.
- Support annual and interim reports in the first Hong Kong phase. Quarterly and KPI-only packages are reserved for later partial-package handling.
- Use conditional bilingual pairing. First inspect the provided Chinese/user PDF; if text encoding, title detection, table classification, or validation fails, attempt semi-automatic English PDF pairing.
- Implement semi-automatic English pairing for HKEX-style inputs. The system may infer a paired English URL from an HKEX PDF identifier/date/language pattern, but must not invent a source if inference fails.
- Accept local PDF paths, HKEX PDF URLs, and company IR direct PDF URLs. Do not implement stock-code-plus-period auto-discovery in the first phase.
- Use trusted-domain download rules. Default allow HKEX domains; company IR direct PDF domains require user confirmation/configuration.
- Introduce `filing_package` / `source_group` as an explicit concept. A package groups Chinese PDF, English PDF, HKEX URL, IR URL, and future revised sources for the same company/period/disclosure event.
- Store source documents separately under a filing package. `Source_Index` is a reader-facing/export view derived from package and source-document metadata.
- Add a v2 normalized financials layer compatible with financials-normalizer discipline: Source_Index, Normalized_Financials_Long, Normalized_IS, Normalized_BS, Normalized_CF, QA_Flags, Conflict_Log, Assumptions_Register, Mapping_Dictionary, and Checks.
- Keep `extracted_facts` / `trusted_facts` as compatibility assets. Add a bridge from old extracted facts into v2 normalized records where useful, but do not force every new v2 source through the old fact shape.
- Use statement-level approval for Milestone 2/3. IS, BS, and CF can be approved independently; full three-statement status is derived from all three statement statuses.
- Use a 2.5 accuracy gate for Hong Kong approved status: three statement pages located when required; units/currency/period clear; core line items complete; core financial checks pass; source IDs and locations exist; no unresolved blocker/high QA; if English extraction assists, results must link back to the same filing package and source pages.
- For annual/interim reports, require core line item completeness. Core IS: revenue, gross_profit, operating_income, pretax_income, tax_expense, net_income. Core BS: cash_equivalents, total_assets, total_liabilities, total_equity, liabilities_equity. Core CF: cfo, cfi, cff, net_change_cash, beginning_cash, ending_cash.
- Preserve non-core rows in normalized long form even when unmapped or low confidence. They should not block Excel generation unless they affect core checks or source integrity.
- Store raw, normalized, and approved layers locally. Raw extraction is immutable; normalized records carry evidence/confidence; approved statement views expose only data passing gates or manual confirmation.
- Allow manual statement approval with required approval reason, source location, and created_at metadata. Reviewer may be optional for a personal local app.
- Store manual mapping confirmations as mapping candidates first. Promote them to global mapping rules only after explicit user approval.
- Implement a double-layer Excel workbook. Reader-facing tabs come first; audit/source tabs follow.
- Keep review workbook as the exception flow. Automatic success writes approved statement data; failures or ambiguous cases export review artifacts and QA flags for human correction.
- Implement manual LLM fallback behind explicit `--llm-assist` only. Default behavior must not call an LLM. LLM input must be limited to small unresolved snippets such as unmapped labels, candidate page text, failed-check facts, or conflict summaries. Output must be structured and locally validated.
- Use Streamlit as the first local Web UI. It lives inside the repo for now, likely under an app-level directory, and can be split later if it grows.
- Do not implement static HTML output in the first phase. Excel plus Streamlit is sufficient for the first personal-research UI.
- Add minimal dependencies only: `httpx`, `rapidfuzz`, and `streamlit`. Continue using current `openpyxl` for Excel in the first phase.
- Use `httpx` for URL downloads because it provides a modern sync/async-capable Python HTTP client. Official docs describe it as a fully featured HTTP client with sync and async APIs and HTTP/1.1/HTTP/2 support.
- Use Streamlit because official docs support the simple local app execution model via `streamlit run`, which fits a personal local research UI.
- Keep SEC XBRL in architecture docs because SEC provides official EDGAR APIs for structured filing/company facts data, but do not implement the SEC adapter in this PRD.
- Retain HKEX as the authoritative Hong Kong filing source for first-phase URL handling and paired source resolution.
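The statement-level approval and core-completeness decisions above can be sketched as a small gate. The core line item names come from this PRD; everything else (`StatementResult`, `statement_status`, `package_status`, the boolean inputs, and the simplification that `partial` is used only at package level) is an assumption for illustration, not the existing approval logic.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Core line items as listed in the implementation decisions.
CORE_LINE_ITEMS = {
    "IS": ("revenue", "gross_profit", "operating_income",
           "pretax_income", "tax_expense", "net_income"),
    "BS": ("cash_equivalents", "total_assets", "total_liabilities",
           "total_equity", "liabilities_equity"),
    "CF": ("cfo", "cfi", "cff", "net_change_cash",
           "beginning_cash", "ending_cash"),
}
BLOCKING_SEVERITIES = {"blocker", "high"}


@dataclass
class StatementResult:
    """Hypothetical summary of one extracted statement."""
    statement: str                       # "IS", "BS", or "CF"
    line_items: set[str]
    checks_passed: bool                  # core financial checks
    metadata_clear: bool                 # unit, currency, period, source location
    open_qa_severities: list[str] = field(default_factory=list)


def statement_status(result: StatementResult) -> str:
    """Return "approved" only when every gate condition holds."""
    missing = set(CORE_LINE_ITEMS[result.statement]) - result.line_items
    blocked = any(s in BLOCKING_SEVERITIES for s in result.open_qa_severities)
    if missing or blocked or not result.checks_passed or not result.metadata_clear:
        return "needs_review"
    return "approved"


def package_status(results: list[StatementResult]) -> str:
    """Derive the full three-statement status from the statement statuses."""
    statuses = {r.statement: statement_status(r) for r in results}
    all_present = set(statuses) == set(CORE_LINE_ITEMS)
    if all_present and all(s == "approved" for s in statuses.values()):
        return "approved"
    if any(s == "approved" for s in statuses.values()):
        return "partial"
    return "needs_review"
```

Because the package status is computed from the three statement statuses rather than stored independently, a clean balance sheet can be approved while the cash flow statement still waits for review.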
Deep Modules To Build Or Modify
- Source adapter layer: stable source ingestion contract for local files, HKEX/IR URLs, filing packages, source documents, and future source types.
- Hong Kong PDF source adapter: handles local/HKEX/IR input, trusted-domain download, text health checks, conditional English pairing, and source package metadata.
- Normalized financials model: source-index rows, long-form normalized records, wide statement projections, evidence/confidence labels, QA flags, conflicts, assumptions, and checks.
- Normalization bridge: maps existing extracted facts/raw tables into v2 normalized records without breaking old pipeline behavior.
- Approval engine: calculates statement-level approved/partial/needs-review status and full three-statement status from validation, completeness, source metadata, and QA flags.
- Excel normalized workbook exporter: creates the double-layer workbook for reader-facing statements and audit/source review.
- Download policy module: enforces trusted domains, hashes downloaded files, caches them, and records retrieval metadata.
- Mapping candidate manager: captures report-specific manual mappings and promotes them only after explicit approval.
- Streamlit local UI: wraps the one-shot Hong Kong v2 workflow and displays run status, Excel download, QA summary, Checks, and Source_Index.
- Optional LLM exception resolver: explicit assist mode only; accepts bounded snippets and returns validated structured suggestions.
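As a rough illustration of the download policy module, the sketch below checks a URL against a trusted-domain list and hashes downloaded bytes. It uses only the standard library and performs no network access. The function names and the specific default domains are assumptions; the PRD only says that HKEX domains are allowed by default and that company IR domains need user confirmation.

```python
import hashlib
from urllib.parse import urlsplit

# Assumed default allow-list; the real list would live in configuration.
DEFAULT_TRUSTED_DOMAINS = frozenset({"hkexnews.hk", "hkex.com.hk"})


def is_trusted_url(url, confirmed_ir_domains=()):
    """Accept only https URLs whose host is a trusted domain or a subdomain of one."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme != "https" or not host:
        return False
    allowed = DEFAULT_TRUSTED_DOMAINS | {d.lower() for d in confirmed_ir_domains}
    # Match on a dot boundary so "evil-hkexnews.hk" does not pass as "hkexnews.hk".
    return any(host == d or host.endswith("." + d) for d in allowed)


def sha256_hex(payload: bytes) -> str:
    """Content hash recorded with the retrieval metadata and used as a cache key."""
    return hashlib.sha256(payload).hexdigest()
```

A caller would run `is_trusted_url` before any HTTP request and store `sha256_hex` of the response body next to the URL and retrieval time, so the same file is never recorded twice under different names.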
Testing Decisions
- Tests should validate external behavior and data contracts, not internal implementation details.
- Keep the existing test suite passing throughout Milestone 1 so old CLI and pipeline behavior are not accidentally broken.
- Add contract tests for normalized financials models: required fields, evidence labels, confidence labels, bilingual labels, source locations, and statement type values.
- Add source adapter tests for accepted and rejected URL domains, local-file input, source document metadata, and filing package grouping.
- Add Hong Kong pairing tests using deterministic HKEX-style URL/filename fixtures. Do not require network in unit tests.
- Add Meituan golden sample fixture tests using small JSON/text/table snippets, not the full PDF.
- Add approval gate tests for statement-level approved, partial, needs_review, and manually_confirmed statuses.
- Add tests that blocker/high QA flags prevent approved status.
- Add tests that annual/interim core line item completeness is required, while non-core low-confidence rows are preserved but do not necessarily block output.
- Add tests that quarterly/partial packages are represented as partial and do not enter the full three-statement approved view in this phase.
- Add tests that raw extraction records remain immutable and review corrections are append-only.
- Add tests that manual mapping corrections create candidates, not global mapping rules.
- Add tests that `--llm-assist` is the only path that can call the LLM exception resolver.
- Add tests that Streamlit app imports without triggering a pipeline run, so UI code remains testable.
- Add smoke tests for Milestone 2 with the local Meituan PDF outside git, using committed expected core normalized JSON as the assertion target.
- Prior art to reuse: existing tests for audit DB, extractors, fact extractor, HK content decoder, PDF font inspector, review workbook, table classifier, trusted publish, validation runner, and statement workbook.
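The candidate-first mapping rule could be pinned down with a behavior-level test along these lines. `MappingStore` and its methods are hypothetical stand-ins written for this example, not the project's mapping candidate manager; the test is written in a pytest-compatible style but also runs as a plain function call.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MappingStore:
    """Hypothetical store that keeps candidates apart from global rules."""
    global_rules: dict[str, str] = field(default_factory=dict)
    candidates: dict[str, str] = field(default_factory=dict)

    def record_manual_mapping(self, raw_label: str, canonical: str) -> None:
        # A review correction only ever lands in the candidate set.
        self.candidates[raw_label] = canonical

    def promote(self, raw_label: str, *, user_approved: bool) -> None:
        if not user_approved:
            raise PermissionError("promotion requires explicit user approval")
        self.global_rules[raw_label] = self.candidates.pop(raw_label)


def test_manual_mapping_creates_candidate_not_global_rule() -> None:
    store = MappingStore()
    store.record_manual_mapping("Revenues", "revenue")

    # The correction is visible as a candidate but has not touched global rules.
    assert store.candidates == {"Revenues": "revenue"}
    assert store.global_rules == {}

    # Promotion without approval is rejected and changes nothing.
    try:
        store.promote("Revenues", user_approved=False)
    except PermissionError:
        pass
    else:
        raise AssertionError("expected PermissionError")
    assert store.global_rules == {}

    # Only explicit approval moves the mapping into the global rules.
    store.promote("Revenues", user_approved=True)
    assert store.global_rules == {"Revenues": "revenue"}
    assert store.candidates == {}
```

The test asserts only on observable state (candidates versus global rules), which matches the decision to validate external behavior and data contracts rather than internal implementation details.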
Out of Scope
- Rewriting the project in a new repository.
- Implementing US SEC XBRL ingestion in Milestone 1.
- Implementing A-share v2 source adapters in Milestone 1.
- Supporting quarterly/KPI-only partial packages as a first-phase golden path.
- Building a full multi-user Web App or authentication system.
- Building static HTML report output in the first phase.
- Adding PaddleOCR, Arelle, xlsxwriter, or full HTML rendering in the first phase.
- Auto-searching HKEX by ticker and fiscal period.
- Crawling company IR homepages to discover PDFs.
- Automatically deriving single-quarter values from YTD disclosures in the first phase.
- Letting LLM output directly enter trusted or approved data.
- Treating English-derived labels or analyst translations as source-reported Chinese labels.
- Saving screenshots for every high-confidence successful table.
Further Notes
- This PRD intentionally changes the project stance from a PDF-first MVP to a Hong Kong-first v2 financials normalizer while preserving valuable existing assets.
- Official Streamlit docs support the local `streamlit run` execution model, which fits the selected personal research UI direction.
- Official HTTPX docs describe sync and async support and strict timeout behavior, supporting the choice for URL downloads.
- SEC EDGAR APIs remain strategically relevant for future US support, but Hong Kong is explicitly first.
- The `ready-for-agent` label should be applied to this issue. The first implementation issue after this PRD should be Milestone 1 only, not the full three-milestone build.