Massive dataset rebuild: CPU + brand + GPU + smartphone + SoC (1989-2026) #1

@Seungpyo1007

Purpose

Long-running tracker for the TechAPI dataset rebuild across brand, cpu, gpu, smartphone, soc, tablet, watch, and pda.

This issue intentionally stays open while bulk imports, validation hardening, public dump refreshes, and manual verification work continue. PRs may use Closes #1 so that GitHub's Development panel links the PR back to this tracker. Repository auto-close is disabled for this workflow, so merging a linked PR should not close this issue.

Current Status

Latest Data Snapshot

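| Category | Total | Verified | Unverified | Missing verified | Verified % |
| --- | ---: | ---: | ---: | ---: | ---: |
| brand | 189 | 0 | 60 | 129 | 0.0% |
| soc | 2,079 | 58 | 2,021 | 0 | 2.8% |
| smartphone | 42,051 | 184 | 41,867 | 0 | 0.4% |
| tablet | 1,171 | 0 | 1,171 | 0 | 0.0% |
| watch | 176 | 0 | 176 | 0 | 0.0% |
| pda | 110 | 0 | 110 | 0 | 0.0% |
| gpu | 2,030 | 0 | 2,030 | 0 | 0.0% |
| cpu | 3,977 | 976 | 3,001 | 0 | 24.5% |
| all | 51,783 | 1,218 | 50,436 | 129 | 2.4% |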
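
The totals and percentages in the snapshot can be re-derived from the three count columns. The sketch below assumes that Total is the sum of Verified, Unverified, and Missing verified, which holds for every row above. The names `SNAPSHOT`, `coverage_row`, and `summarize` are illustrative and are not part of the TechAPI codebase.

```python
# Counts copied from the snapshot table:
# category -> (verified, unverified, missing_verified)
SNAPSHOT = {
    "brand": (0, 60, 129),
    "soc": (58, 2021, 0),
    "smartphone": (184, 41867, 0),
    "tablet": (0, 1171, 0),
    "watch": (0, 176, 0),
    "pda": (0, 110, 0),
    "gpu": (0, 2030, 0),
    "cpu": (976, 3001, 0),
}


def coverage_row(verified, unverified, missing):
    """Return (total, verified percentage rounded to one decimal)."""
    total = verified + unverified + missing
    pct = 100.0 * verified / total if total else 0.0
    return total, round(pct, 1)


def summarize(snapshot):
    """Compute per-category rows plus an aggregate 'all' row."""
    rows = {name: coverage_row(*counts) for name, counts in snapshot.items()}
    # Sum each column across categories to build the aggregate row.
    sums = [sum(column) for column in zip(*snapshot.values())]
    rows["all"] = coverage_row(*sums)
    return rows


if __name__ == "__main__":
    for name, (total, pct) in summarize(SNAPSHOT).items():
        print(f"{name:<11}{total:>7,}{pct:>6.1f}%")
```

Re-running a check like this after each public dump refresh is one way to catch a snapshot table that has drifted from the data.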

Recent PR History

| PR | Status | Main change |
| --- | --- | --- |
| #36 | Open | Import Phones 2024 smartphones, tablets, watches, and refresh public dump |
| #35 | Merged | Import Global Smartphone Database variants, add SoC stubs, normalize duplicate brand slugs, and refresh public dump |
| #34 | Merged | Import GSMArena Kaggle smartphones, tablets, watches, and refresh public dump |
| #33 | Merged | Import PhoneDB/Kaggle smartphone variants, tablets, SoC stubs, and public dump refresh |
| #32 | Merged | Add tablet/watch/PDA/API/site category support and prior mobile dump refresh |
| #25 | Merged | Add 5,000 PhoneDB raw smartphone variants plus 45 Mobiles 2025 records |
| #24 | Merged | Add smartphone and SoC records, improve PR metadata and project automation |
| #23 | Merged | Import a larger smartphone batch |
| #22 | Merged | Add smartphone and SoC records from Kaggle-derived sources |
| #17 | Merged | Expand GPU imports and public data refresh |
| #16 | Merged | Expand CPU imports and public data refresh |

Sources Currently Used

Validation Policy

Every data PR should include TechEngineBot comments for:

  • Changed data summary: added, modified, deleted, verified/unverified source counts, and examples
  • Validation stats: category totals, verified coverage, warning callouts, and key validation output
  • Checks: python -m app.validate, python integrity_check.py TechAPI/data --strict, and site build when site files change
  • Heuristic review: naming, typo-like patterns, duplicate-looking fields, and data-quality warnings

Low verified coverage is allowed for bulk import PRs, but should be called out as a follow-up warning instead of failing validation.
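
The warn-but-do-not-fail rule for low verified coverage could be sketched as follows. This is not the actual behavior of `app.validate`, whose internals are not shown in this issue. The `Report` class, the `check_coverage` function, and the 5% threshold are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    """Hypothetical validation report: only errors fail a run."""
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    @property
    def ok(self):
        # Warnings never affect the pass/fail outcome.
        return not self.errors


def check_coverage(report, category, verified, total, threshold_pct=5.0):
    """Record low verified coverage as a follow-up warning, never as an error.

    threshold_pct is an assumed value, not a documented project setting.
    """
    if total == 0:
        return
    pct = 100.0 * verified / total
    if pct < threshold_pct:
        report.warnings.append(
            f"{category}: verified coverage {pct:.1f}% is below "
            f"{threshold_pct:.1f}% (follow-up needed)"
        )


if __name__ == "__main__":
    report = Report()
    check_coverage(report, "gpu", verified=0, total=2030)
    check_coverage(report, "cpu", verified=976, total=3977)
    print("passed:", report.ok)
    for warning in report.warnings:
        print("warning:", warning)
```

Keeping coverage in the warnings list lets a bulk import PR pass its checks while the shortfall still appears in the validation stats comment.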

Remaining Work

  • Continue large unverified imports where source coverage is useful
  • Rebase data/import-staging before each push and keep commits split by source, brand, era, or category
  • Backfill manual verification for imported smartphone, tablet, watch, PDA, GPU, CPU, and SoC records
  • Add or repair brand verified flags so the brand category no longer has missing verification metadata
  • Dedupe or collapse raw mobile variants where a source creates excessive regional/storage duplicates
  • Improve source attribution and audit notes for records imported from broad datasets
  • Keep public v1/index.json and category dumps refreshed after each data batch
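
As one possible approach to the dedupe item above, raw mobile variants that differ only by region or storage could be collapsed into a single record with a list of variants. The field names (`brand`, `model`, `region`, `storage_gb`) and the sample records are assumptions made for this sketch; the real TechAPI schema may differ.

```python
from collections import defaultdict


def collapse_variants(
    records,
    key_fields=("brand", "model"),
    variant_fields=("region", "storage_gb"),
):
    """Group records by key_fields and fold variant_fields into a list.

    All field names are hypothetical. Non-variant fields are taken from the
    first record seen in each group.
    """
    groups = defaultdict(list)
    for rec in records:
        # Normalize case and whitespace so "ExampleCo" and " exampleco " match.
        key = tuple(str(rec.get(f, "")).strip().lower() for f in key_fields)
        groups[key].append(rec)

    collapsed = []
    for recs in groups.values():
        base = {k: v for k, v in recs[0].items() if k not in variant_fields}
        base["variants"] = [{f: r.get(f) for f in variant_fields} for r in recs]
        collapsed.append(base)
    return collapsed


if __name__ == "__main__":
    # Made-up sample data, not taken from any imported source.
    sample = [
        {"brand": "ExampleCo", "model": "X1", "region": "EU", "storage_gb": 128},
        {"brand": "exampleco", "model": "X1 ", "region": "US", "storage_gb": 256},
        {"brand": "ExampleCo", "model": "X2", "region": "EU", "storage_gb": 128},
    ]
    for record in collapse_variants(sample):
        print(record["brand"], record["model"], len(record["variants"]))
```

Folding variants into a list rather than dropping them keeps the regional and storage information available for later manual verification.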

Operational Notes

  • Assignees: @Seungpyo1007 and @TechEngineBot
  • Labels: data, enhancement
  • Milestone: Massive dataset rebuild (1989-2026)
  • Projects: TechEngine work and TechAPI-Project
  • Priority: High for bulk data PRs
  • Start date: 2026-06-20
  • Target date: 2026-09-30

TechEngineBot should add or update a tracking comment on this issue whenever a linked data PR is opened or synchronized.
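
The "add or update" behavior amounts to an upsert keyed on a marker in the comment body. The sketch below works on an in-memory list of comments and makes no GitHub API calls. The marker string, the comment dictionary shape, and the function name are hypothetical and do not describe how TechEngineBot is actually implemented.

```python
# Hypothetical hidden marker used to recognize the bot's own tracking comment.
MARKER = "<!-- techenginebot:tracking -->"


def upsert_tracking_comment(comments, pr_number, summary):
    """Update the existing tracking comment in place, or append a new one.

    comments is a list of dicts with a "body" key (an assumed shape).
    Returns "updated" or "created" so the caller can log what happened.
    """
    body = f"{MARKER}\nLatest linked PR: #{pr_number}\n{summary}"
    for comment in comments:
        if comment["body"].startswith(MARKER):
            comment["body"] = body
            return "updated"
    comments.append({"body": body})
    return "created"


if __name__ == "__main__":
    thread = [{"body": "Unrelated human comment"}]
    print(upsert_tracking_comment(thread, 35, "PR opened"))
    print(upsert_tracking_comment(thread, 36, "PR synchronized"))
    print(len(thread), "comments in thread")
```

Updating a single marked comment keeps the issue thread short even when a linked PR is synchronized many times.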

Latest linked PR: #36

Metadata

Labels

data (Dataset changes), enhancement (New feature or request)

Projects

Status
In Progress
