Generated SRT contains only one subtitle cue with invalid timestamp 00:00:00,000 --> 00:00:00,000. #3059

Description

@zack-the-worker

🐛 Bug

Generated SRT contains only one subtitle cue with invalid timestamp 00:00:00,000 --> 00:00:00,000.

I used the subtitle generation script on an MP4 video, but the output .srt file contains only a single cue. The cue includes the full transcribed text, but both start and end timestamps are 00:00:00,000.

This makes the SRT unusable for subtitle display or downstream TTS/dubbing workflows, because there is no timing information for individual spoken segments.

To Reproduce

Steps to reproduce the behavior:

  1. Run the subtitle generation command:
python generate_subtitle.py 7646260767334944046.mp4 -o ./sub.srt --lang zh

(Video link at: https://www.douyin.com/video/7646260767334944046)

  2. Open the generated sub.srt.

  3. The output contains only one cue, similar to:

1
00:00:00,000 --> 00:00:00,000
<full transcript text here>
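
To make the symptom easy to confirm on other files, here is a minimal sketch of a check for this exact pattern (a single cue whose start and end are both zero). It only parses standard SRT timing lines; the helper names `cue_timings` and `looks_degenerate` are made up for illustration and are not part of the subtitle generation script.

```python
import re

# Matches an SRT timing line such as "00:00:03,500 --> 00:00:07,200".
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def to_ms(h, m, s, ms):
    """Convert hour/minute/second/millisecond strings to total milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def cue_timings(srt_text):
    """Return (start_ms, end_ms) for every timing line found in the SRT text."""
    timings = []
    for line in srt_text.splitlines():
        match = TIMING.match(line.strip())
        if match:
            groups = match.groups()
            timings.append((to_ms(*groups[:4]), to_ms(*groups[4:])))
    return timings


def looks_degenerate(srt_text):
    """True for the symptom in this report: exactly one cue, timed 0 --> 0."""
    timings = cue_timings(srt_text)
    return len(timings) == 1 and timings[0] == (0, 0)
```

Reading sub.srt with `open("sub.srt", encoding="utf-8").read()` and passing the text to `looks_degenerate` should return `True` for the output described above.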

Code sample

No custom code was used. I reproduced the issue using the provided subtitle generation script directly.

Expected behavior

The generated SRT should contain multiple subtitle cues with valid start and end timestamps, for example:

1
00:00:00,000 --> 00:00:03,500
First spoken sentence...

2
00:00:03,500 --> 00:00:07,200
Next spoken sentence...

The subtitle segments should be split according to the actual speech timing in the video/audio.
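
As a rough illustration of the expected output, the sketch below turns per-segment timings into SRT cues. It assumes the recognizer can supply a list of `(start_ms, end_ms, text)` tuples; that input shape and the function names are assumptions for this example, not the script's actual internals. It raises an error on a zero-length cue rather than writing it silently.

```python
def format_srt_time(ms):
    """Format a millisecond offset as an SRT timestamp, e.g. 3500 -> 00:00:03,500."""
    ms = max(0, int(ms))
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def segments_to_srt(segments):
    """Build SRT text from (start_ms, end_ms, text) tuples (assumed input shape)."""
    blocks = []
    for index, (start_ms, end_ms, text) in enumerate(segments, start=1):
        if end_ms <= start_ms:
            # A cue with no duration is the failure described in this issue.
            raise ValueError(f"cue {index} has no duration: {start_ms} -> {end_ms}")
        timing = f"{format_srt_time(start_ms)} --> {format_srt_time(end_ms)}"
        blocks.append(f"{index}\n{timing}\n{text.strip()}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    demo = [
        (0, 3500, "First spoken sentence..."),
        (3500, 7200, "Next spoken sentence..."),
    ]
    print(segments_to_srt(demo), end="")
```

With the two demo segments, this prints the same two cues shown in the expected output above.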

Error logs

No Python traceback or runtime error was shown.

The problem is visible only in the generated SRT output:

  • Only one subtitle cue is generated.
  • Its timestamp is always 00:00:00,000 --> 00:00:00,000.

Environment

  • OS: macOS
  • Device: MacBook M5
  • Python version: Not confirmed
  • FunASR version: Not confirmed
  • ModelScope version: Not confirmed
  • PyTorch / torchaudio version: Not confirmed
  • Install method (pip, source, Docker): Not confirmed
  • Device (cuda, cpu, mps): Not confirmed
  • GPU model: Apple Silicon / integrated GPU
  • CUDA/cuDNN version: N/A
  • Docker image tag, if used: N/A
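
Several of the versions above are not confirmed yet. A small sketch like the following could collect them from the environment where the script was run. It assumes the packages were installed under the pip distribution names `funasr`, `modelscope`, `torch`, and `torchaudio`; anything missing is reported as "not installed".

```python
import platform
from importlib import metadata

# Distribution names assumed for this report's environment.
PACKAGES = ("funasr", "modelscope", "torch", "torchaudio")


def collect_versions(packages=PACKAGES):
    """Return a dict of Python, platform, and installed package versions."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```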

Audio details

  • Duration: Not confirmed
  • Sample rate: Not confirmed
  • Format: MP4 video
  • Language/dialect: Chinese (--lang zh)
  • Speaker count: Not confirmed
  • Background noise/music: Not confirmed

Additional context

This issue is important for subtitle-based TTS/dubbing workflows. When the full transcript is placed into a single cue with timestamp 00:00:00,000 --> 00:00:00,000, downstream voice generation tools cannot preserve pauses, speech timing, or sentence-level alignment.

Metadata

    Labels

    bug, needs triage