PlotweaverModel committed on
Commit
ef8b3fe
·
verified ·
1 Parent(s): e9715fa

Upload 3 files

Files changed (3)
  1. README.md +71 -20
  2. app.py +672 -1573
  3. requirements.txt +3 -9
README.md CHANGED
@@ -1,25 +1,76 @@
1
  ---
2
- title: PlotWeaver - Live Commentary Translation
3
- emoji: "\U0001F3DF\uFE0F"
4
  colorFrom: green
5
- colorTo: yellow
6
  sdk: gradio
7
- sdk_version: "5.50.0"
 
8
  app_file: app.py
9
- pinned: true
10
- license: mit
11
- hardware: t4-small
12
- models:
13
- - PlotweaverAI/whisper-small-de-en
14
- - PlotweaverAI/nllb-200-distilled-600M-african-6lang
15
- - PlotweaverAI/yoruba-mms-tts-new
16
- tags:
17
- - speech-to-speech
18
- - translation
19
- - dubbing
20
- - multi-language
21
- - football
22
- - commentary
23
- - streaming
24
- short_description: Translate live English commentary to 40+ languages with AI
25
  ---
1
  ---
2
+ title: Live Football Commentary Translator
3
+ emoji:
4
  colorFrom: green
5
+ colorTo: blue
6
  sdk: gradio
7
+ sdk_version: 5.42.0
8
+ python_version: "3.11"
9
  app_file: app.py
10
+ pinned: false
11
+ license: apache-2.0
12
  ---
13
+
14
+ # Live Football Commentary Translator
15
+
16
+ Speak (or upload) commentary in one language, hear it spoken in another.
17
+
18
+ ## What this is
19
+
20
+ A Hugging Face Space that translates football commentary between languages.
21
+ Two modes:
22
+
23
+ - **Single clip** — record or upload one clip, get one translation.
24
+ - **Continuous live** — start a session and speak naturally; speech is split into
25
+   chunks at natural pauses (~0.8 s of silence), and the translations play back in order.
26
+
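The pause rule for continuous mode can be illustrated with a minimal sketch. It assumes the audio has already been reduced to one loudness value per 20 ms frame; the function name `split_on_pauses`, the frame size, and the silence threshold are illustrative assumptions and are not taken from `app.py`. Only the ~0.8 s pause length comes from the description above.

```python
from typing import List, Tuple


def split_on_pauses(
    frame_levels: List[float],
    frame_ms: int = 20,               # assumed frame size
    silence_threshold: float = 0.01,  # assumed loudness cutoff
    min_pause_ms: int = 800,          # ~0.8 s of silence, as described above
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) spans of speech, closed at each long pause."""
    pause_frames = min_pause_ms // frame_ms
    chunks: List[Tuple[int, int]] = []
    start = None        # first voiced frame of the chunk being built
    last_voiced = None  # most recent voiced frame
    quiet_run = 0       # consecutive quiet frames seen so far

    for i, level in enumerate(frame_levels):
        if level >= silence_threshold:
            if start is None:
                start = i
            last_voiced = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= pause_frames:
                # The pause is long enough: close the chunk at the last voiced frame.
                chunks.append((start, last_voiced + 1))
                start = None
                quiet_run = 0

    if start is not None:
        # Flush a chunk that was still open when the audio ended.
        chunks.append((start, last_voiced + 1))
    return chunks
```

Short gaps inside a phrase do not split it; only a run of quiet frames covering the full pause length closes a chunk, which is why speaking in short bursts produces one translation per burst.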
27
+ Sources: English, Scottish English, German, Spanish, Arabic
28
+ Targets: all of the above + Swahili, Amharic, Afrikaans
29
+
30
+ ## How it works
31
+
32
+ Two pipelines, routed by target language:
33
+
34
+ | Target language | Pipeline | Cost |
35
+ |---|---|---|
36
+ | English, Scottish English, German, Spanish, Arabic | Single Qwen-Omni call: audio in → translated speech out | 1 API call |
37
+ | Swahili, Amharic, Afrikaans | Qwen-Omni (audio → translated text), then YourVoic (text → speech) | 2 API calls |
38
+
39
+ Qwen-Omni is `qwen3.5-omni-plus` on DashScope International. YourVoic is the
40
+ fallback for languages Qwen-Omni doesn't cover natively. This split exists
41
+ because Qwen-Omni does not produce intelligible speech in Swahili, Amharic,
42
+ or Afrikaans on its own.
43
+
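The routing in the table above could be sketched roughly as follows. The language sets mirror the table; the function name `plan_pipeline` and the step labels are hypothetical and are not the identifiers used in `app.py`.

```python
from typing import List

# Targets Qwen-Omni can speak directly (one API call), per the table above.
QWEN_SPEECH_TARGETS = {"English", "Scottish English", "German", "Spanish", "Arabic"}

# Targets that need YourVoic for the speech step (two API calls).
YOURVOIC_TARGETS = {"Swahili", "Amharic", "Afrikaans"}


def plan_pipeline(target_language: str) -> List[str]:
    """Return the ordered API steps for a target language (step names are illustrative)."""
    if target_language in QWEN_SPEECH_TARGETS:
        return ["qwen_omni_speech_to_speech"]
    if target_language in YOURVOIC_TARGETS:
        return ["qwen_omni_speech_to_text", "yourvoic_text_to_speech"]
    raise ValueError(f"Unsupported target language: {target_language}")
```

The length of the returned list matches the "Cost" column: one step for natively supported targets, two for the fallback languages.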
44
+ ## Deploy
45
+
46
+ 1. Create a new Hugging Face Space with the Gradio SDK
47
+ 2. Upload `app.py`, `requirements.txt`, and this `README.md`
48
+ 3. Add secrets in **Settings → Variables and secrets**:
49
+    - `DASHSCOPE_API_KEY` (required) — get one from DashScope International
50
+    - `YOURVOIC_API_KEY` (required for Swahili/Amharic/Afrikaans only)
51
+ 4. (Recommended) Set hardware to **ZeroGPU** if you have access. CPU also works
52
+    but will be slower on the audio-decode steps.
53
+
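A quick way to confirm the secrets from step 3 are visible to the app is a check along these lines. The secret names come from the list above; the helper `missing_secrets` is a hypothetical illustration, not a function in `app.py`.

```python
import os
from typing import List, Mapping, Optional

# Targets that go through YourVoic and therefore need the second key.
YOURVOIC_TARGETS = {"Swahili", "Amharic", "Afrikaans"}


def missing_secrets(
    target_language: str,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """List the secrets that are unset or empty for the given target language."""
    env = os.environ if env is None else env
    required = ["DASHSCOPE_API_KEY"]  # always needed
    if target_language in YOURVOIC_TARGETS:
        required.append("YOURVOIC_API_KEY")
    return [name for name in required if not env.get(name)]
```

Calling `missing_secrets("Swahili")` at startup and showing the result in the UI gives a clearer failure than an API error on the first request.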
54
+ ## Expected latency
55
+
56
+ On free ZeroGPU, expect 3-8 seconds from end-of-speech to start-of-output. The
57
+ demo is designed to feel "live-ish" but not simultaneous-interpretation grade.
58
+ Speak in short bursts — one play, one tackle, one moment — rather than long
59
+ monologues.
60
+
61
+ ## Known limitations
62
+
63
+ - "Scottish English" is treated as accented English in the system prompt rather
64
+   than a separate language. Qwen-Omni's Scottish accent is decent but not
65
+   authentic.
66
+ - YourVoic voice support per language is sparsely documented. The code falls
67
+   back to a universal voice ("Peter") if the primary choice fails.
68
+ - Arabic voice cloning is intentionally not exposed — the underlying
69
+   `qwen3-tts-vc` model doesn't support Arabic.
70
+ - Free-tier ZeroGPU has cold-start delays. First call after idle is slower.
71
+
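The voice fallback mentioned in the limitations could look roughly like the sketch below. It assumes the synthesis function raises `RuntimeError` when a voice does not support a language; `synthesize_with_fallback` and its `synthesize` callback are hypothetical names, and the real logic in `app.py` may differ.

```python
from typing import Callable, Sequence, Tuple


def synthesize_with_fallback(
    text: str,
    voices: Sequence[str],
    synthesize: Callable[[str, str], bytes],
    fallback_voice: str = "Peter",
) -> Tuple[str, bytes]:
    """Try each preferred voice in order, then the universal fallback voice."""
    candidates = list(voices)
    if fallback_voice not in candidates:
        candidates.append(fallback_voice)

    last_error = None
    for voice in candidates:
        try:
            # Return the voice actually used along with its audio.
            return voice, synthesize(text, voice)
        except RuntimeError as err:
            last_error = err  # remember the failure and try the next voice
    raise RuntimeError(f"All voices failed, last error: {last_error}")
```

Returning the voice name alongside the audio makes it possible to log when the fallback was used, which helps given how sparsely the per-language voices are documented.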
72
+ ## Files
73
+
74
+ - `app.py` — Gradio UI and pipeline
75
+ - `requirements.txt` — Python dependencies
76
+ - `README.md` — this file (also the Space metadata header)
app.py CHANGED
@@ -1,1680 +1,779 @@
1
  """
2
- PlotWeaver — Live Commentary Translation Platform (Single File)
3
- ================================================================
4
- Two engines: Qwen Omni | YourVoic API (with NLLB MT)
5
  """
6
 
7
- import os, io, re, time, base64, struct, shutil, subprocess, tempfile, logging
8
- import torch, numpy as np, requests, soundfile as sf, gradio as gr
9
-
10
- logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
11
- logger = logging.getLogger(__name__)
12
-
13
-
14
- # =============================================================================
15
  # LANGUAGES
16
- # =============================================================================
17
-
18
- # Qwen Omni voices (work across all Qwen-supported languages)
19
- QWEN_VOICES = [
20
- "Cherry", "Serena", "Ethan", "Chelsie", "Momo", "Vivian", "Moon", "Maia",
21
- "Kai", "Nofish", "Bella", "Jennifer", "Ryan", "Katerina", "Aiden",
22
- "Eldric Sage", "Mia", "Mochi", "Bellona", "Vincent", "Bunny", "Neil",
23
- "Elias", "Arthur", "Seren", "Bodega", "Sonrisa", "Alek", "Dolce",
24
- "Sohee", "Ono Anna", "Lenn", "Emilien", "Andre",
25
- ]
26
-
27
- # Each language entry:
28
- # "Display Name": {
29
- # "nllb": NLLB-200 language code (for local/yourvoic pipeline translation),
30
- # "yourvoic_lang": YourVoic language code (or None),
31
- # "yourvoic_voices": list of YourVoic voice names,
32
- # "tts_engine": "qwen" | "yourvoic" | "local",
33
- # "qwen_code": short language code for Qwen prompts (or None),
34
- # "qwen_name": full language name for Qwen system prompt (or None),
35
- # }
36
-
37
- LANGUAGES = {
38
- # ---- Qwen Omni Languages (end-to-end speech-to-speech, 11 languages) ----
39
- "English": {
40
- "nllb": "eng_Latn", "yourvoic_lang": "en-US",
41
- "yourvoic_voices": ["Peter", "Kylie"], "tts_engine": "qwen",
42
- "qwen_code": "en", "qwen_name": "English",
43
- },
44
- "Chinese (Mandarin)": {
45
- "nllb": "zho_Hans", "yourvoic_lang": "zh-CN",
46
- "yourvoic_voices": ["Peter", "Kylie"], "tts_engine": "qwen",
47
- "qwen_code": "zh", "qwen_name": "Mandarin Chinese",
48
- },
49
- "Japanese": {
50
- "nllb": "jpn_Jpan", "yourvoic_lang": "ja-JP",
51
- "yourvoic_voices": ["Peter", "Kylie"], "tts_engine": "qwen",
52
- "qwen_code": "ja", "qwen_name": "Japanese",
53
- },
54
- "Korean": {
55
- "nllb": "kor_Hang", "yourvoic_lang": "ko-KR",
56
- "yourvoic_voices": ["Peter", "Kylie"], "tts_engine": "qwen",
57
- "qwen_code": "ko", "qwen_name": "Korean",
58
- },
59
- "German": {
60
- "nllb": "deu_Latn", "yourvoic_lang": "de-DE",
61
- "yourvoic_voices": ["Peter", "Kylie"], "tts_engine": "qwen",
62
- "qwen_code": "de", "qwen_name": "German",
63
- },
64
- "French": {
65
- "nllb": "fra_Latn", "yourvoic_lang": "fr-FR",
66
- "yourvoic_voices": ["Peter", "Kylie"], "tts_engine": "qwen",
67
- "qwen_code": "fr", "qwen_name": "French",
68
- },
69
- "Russian": {
70
- "nllb": "rus_Cyrl", "yourvoic_lang": "ru-RU",
71
- "yourvoic_voices": ["Peter", "Kylie"], "tts_engine": "qwen",
72
- "qwen_code": "ru", "qwen_name": "Russian",
73
- },
74
- "Portuguese": {
75
- "nllb": "por_Latn", "yourvoic_lang": "pt-BR",
76
- "yourvoic_voices": ["Peter", "Kylie"], "tts_engine": "qwen",
77
- "qwen_code": "pt", "qwen_name": "Portuguese",
78
- },
79
- "Spanish": {
80
- "nllb": "spa_Latn", "yourvoic_lang": "es-ES",
81
- "yourvoic_voices": ["Peter", "Kylie"], "tts_engine": "qwen",
82
- "qwen_code": "es", "qwen_name": "Spanish",
83
- },
84
- "Italian": {
85
- "nllb": "ita_Latn", "yourvoic_lang": "it-IT",
86
- "yourvoic_voices": ["Peter", "Kylie"], "tts_engine": "qwen",
87
- "qwen_code": "it", "qwen_name": "Italian",
88
- },
89
- "Arabic": {
90
- "nllb": "arb_Arab", "yourvoic_lang": "ar-SA",
91
- "yourvoic_voices": ["Peter"], "tts_engine": "qwen",
92
- "qwen_code": "ar", "qwen_name": "Modern Standard Arabic",
93
- },
94
-
95
- # ---- African Languages (YourVoic API) ----
96
- "Swahili": {
97
- "nllb": "swh_Latn", "yourvoic_lang": "sw-KE",
98
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
99
- "qwen_code": None, "qwen_name": None,
100
- },
101
- "Amharic": {
102
- "nllb": "amh_Ethi", "yourvoic_lang": "am-ET",
103
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
104
- "qwen_code": None, "qwen_name": None,
105
- },
106
- "Afrikaans": {
107
- "nllb": "afr_Latn", "yourvoic_lang": "af-ZA",
108
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
109
- "qwen_code": None, "qwen_name": None,
110
- },
111
-
112
- # ---- South Asian (YourVoic TTS + NLLB MT) ----
113
- "Hindi": {
114
- "nllb": "hin_Deva", "yourvoic_lang": "hi-IN",
115
- "yourvoic_voices": ["Rahul", "Deepika", "Aditya"], "tts_engine": "yourvoic",
116
- "qwen_code": None, "qwen_name": None,
117
- },
118
- "Bengali": {
119
- "nllb": "ben_Beng", "yourvoic_lang": "bn-IN",
120
- "yourvoic_voices": ["Sneha", "Aryan"], "tts_engine": "yourvoic",
121
- "qwen_code": None, "qwen_name": None,
122
- },
123
- "Tamil": {
124
- "nllb": "tam_Taml", "yourvoic_lang": "ta-IN",
125
- "yourvoic_voices": ["Priya", "Kumar"], "tts_engine": "yourvoic",
126
- "qwen_code": None, "qwen_name": None,
127
- },
128
- "Telugu": {
129
- "nllb": "tel_Telu", "yourvoic_lang": "te-IN",
130
- "yourvoic_voices": ["Arjun", "Lakshmi"], "tts_engine": "yourvoic",
131
- "qwen_code": None, "qwen_name": None,
132
- },
133
- "Marathi": {
134
- "nllb": "mar_Deva", "yourvoic_lang": "mr-IN",
135
- "yourvoic_voices": ["Anjali", "Rohan"], "tts_engine": "yourvoic",
136
- "qwen_code": None, "qwen_name": None,
137
- },
138
- "Urdu": {
139
- "nllb": "urd_Arab", "yourvoic_lang": "ur-PK",
140
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
141
- "qwen_code": None, "qwen_name": None,
142
- },
143
- "Nepali": {
144
- "nllb": "npi_Deva", "yourvoic_lang": "ne-NP",
145
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
146
- "qwen_code": None, "qwen_name": None,
147
- },
148
-
149
- # ---- Southeast Asian (YourVoic) ----
150
- "Indonesian": {
151
- "nllb": "ind_Latn", "yourvoic_lang": "id-ID",
152
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
153
- "qwen_code": None, "qwen_name": None,
154
- },
155
- "Vietnamese": {
156
- "nllb": "vie_Latn", "yourvoic_lang": "vi-VN",
157
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
158
- "qwen_code": None, "qwen_name": None,
159
- },
160
- "Thai": {
161
- "nllb": "tha_Thai", "yourvoic_lang": "th-TH",
162
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
163
- "qwen_code": None, "qwen_name": None,
164
- },
165
- "Malay": {
166
- "nllb": "zsm_Latn", "yourvoic_lang": "ms-MY",
167
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
168
- "qwen_code": None, "qwen_name": None,
169
- },
170
- "Filipino": {
171
- "nllb": "tgl_Latn", "yourvoic_lang": "fil-PH",
172
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
173
- "qwen_code": None, "qwen_name": None,
174
- },
175
-
176
- # ---- European (YourVoic) ----
177
- "Dutch": {
178
- "nllb": "nld_Latn", "yourvoic_lang": "nl-NL",
179
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
180
- "qwen_code": None, "qwen_name": None,
181
- },
182
- "Polish": {
183
- "nllb": "pol_Latn", "yourvoic_lang": "pl-PL",
184
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
185
- "qwen_code": None, "qwen_name": None,
186
- },
187
- "Turkish": {
188
- "nllb": "tur_Latn", "yourvoic_lang": "tr-TR",
189
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
190
- "qwen_code": None, "qwen_name": None,
191
- },
192
- "Swedish": {
193
- "nllb": "swe_Latn", "yourvoic_lang": "sv-SE",
194
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
195
- "qwen_code": None, "qwen_name": None,
196
- },
197
- "Romanian": {
198
- "nllb": "ron_Latn", "yourvoic_lang": "ro-RO",
199
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
200
- "qwen_code": None, "qwen_name": None,
201
- },
202
- "Greek": {
203
- "nllb": "ell_Grek", "yourvoic_lang": "el-GR",
204
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
205
- "qwen_code": None, "qwen_name": None,
206
- },
207
- "Ukrainian": {
208
- "nllb": "ukr_Cyrl", "yourvoic_lang": "uk-UA",
209
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
210
- "qwen_code": None, "qwen_name": None,
211
- },
212
- "Finnish": {
213
- "nllb": "fin_Latn", "yourvoic_lang": "fi-FI",
214
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
215
- "qwen_code": None, "qwen_name": None,
216
- },
217
- "Danish": {
218
- "nllb": "dan_Latn", "yourvoic_lang": "da-DK",
219
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
220
- "qwen_code": None, "qwen_name": None,
221
- },
222
- "Norwegian": {
223
- "nllb": "nob_Latn", "yourvoic_lang": "nb-NO",
224
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
225
- "qwen_code": None, "qwen_name": None,
226
- },
227
-
228
- # ---- Middle Eastern (YourVoic) ----
229
- "Persian": {
230
- "nllb": "pes_Arab", "yourvoic_lang": "fa-IR",
231
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
232
- "qwen_code": None, "qwen_name": None,
233
- },
234
- "Hebrew": {
235
- "nllb": "heb_Hebr", "yourvoic_lang": "he-IL",
236
- "yourvoic_voices": ["Peter"], "tts_engine": "yourvoic",
237
- "qwen_code": None, "qwen_name": None,
238
- },
239
  }
240
 
241
-
242
- # Group languages by category for the UI
243
- LANGUAGE_GROUPS = {
244
- "Global Languages": [
245
- "Spanish", "French", "German", "Chinese (Mandarin)", "Italian",
246
- "Japanese", "Portuguese", "Hindi", "Arabic", "Korean", "Russian",
247
- ],
248
- "African Languages": [
249
- "Swahili", "Amharic", "Afrikaans",
250
- ],
251
- "South Asian": [
252
- "Bengali", "Tamil", "Telugu", "Marathi", "Urdu", "Nepali",
253
- ],
254
- "Southeast Asian": [
255
- "Indonesian", "Vietnamese", "Thai", "Malay", "Filipino",
256
- ],
257
- "European": [
258
- "Dutch", "Polish", "Turkish", "Swedish", "Romanian",
259
- "Greek", "Ukrainian", "Finnish", "Danish", "Norwegian",
260
- ],
261
- "Middle Eastern": [
262
- "Persian", "Hebrew",
263
- ],
264
  }
265
 
266
- # All language display names (for dropdowns)
267
- ALL_LANGUAGE_NAMES = sorted(LANGUAGES.keys())
268
-
269
- # Languages that use YourVoic API
270
- YOURVOIC_LANGUAGES = [k for k, v in LANGUAGES.items() if v["tts_engine"] == "yourvoic"]
271
-
272
- # Languages that use YourVoic API
273
- YOURVOIC_LANGUAGES = [k for k, v in LANGUAGES.items() if v["tts_engine"] == "yourvoic"]
274
-
275
-
276
- # =============================================================================
277
- # PIPELINE: ASR + MT + Video helpers
278
- # =============================================================================
279
-
280
- DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
281
- TORCH_DTYPE = torch.float16 if torch.cuda.is_available() else torch.float32
282
-
283
- # Models (loaded once at startup)
284
- asr_pipe = None
285
- mt_tokenizer = None
286
- mt_model = None
287
 
288
 
289
 
290
- def load_models():
291
- """Load all models at startup."""
292
- global asr_pipe, mt_tokenizer, mt_model
293
- from transformers import (
294
- pipeline as hf_pipeline,
295
- AutoTokenizer,
296
- AutoModelForSeq2SeqLM,
297
- )
298
 
299
- print(f"Device: {DEVICE} | Dtype: {TORCH_DTYPE}")
300
- print("Loading models...")
301
-
302
- # ASR
303
- ASR_MODEL_ID = "PlotweaverAI/whisper-small-de-en"
304
- print(f" Loading ASR: {ASR_MODEL_ID}")
305
- asr_pipe = hf_pipeline(
306
- "automatic-speech-recognition",
307
- model=ASR_MODEL_ID,
308
- device=DEVICE,
309
- torch_dtype=TORCH_DTYPE,
310
  )
311
- print(" ASR loaded")
312
-
313
- # MT
314
- MT_MODEL_ID = "PlotweaverAI/nllb-200-distilled-600M-african-6lang"
315
- print(f" Loading MT: {MT_MODEL_ID}")
316
- mt_tokenizer = AutoTokenizer.from_pretrained(MT_MODEL_ID)
317
- mt_model = AutoModelForSeq2SeqLM.from_pretrained(
318
- MT_MODEL_ID, torch_dtype=TORCH_DTYPE
319
- ).to(DEVICE)
320
- mt_tokenizer.src_lang = "eng_Latn"
321
- print(" MT loaded")
322
-
323
- # Diagnostics
324
- print(f"\n=== Device diagnostics ===")
325
- print(f"CUDA available: {torch.cuda.is_available()}")
326
- if torch.cuda.is_available():
327
- print(f"CUDA device: {torch.cuda.get_device_name(0)}")
328
- print(f"ASR on: {next(asr_pipe.model.parameters()).device}")
329
- print(f"MT on: {next(mt_model.parameters()).device}")
330
- print(f"YourVoic API key: {'set' if os.environ.get('YOURVOIC_API_KEY') else 'NOT SET'}")
331
- print(f"Dashscope key: {'set' if os.environ.get('DASHSCOPE_API_KEY') else 'NOT SET'}")
332
- print(f"==========================\n")
333
- print("All models loaded!")
334
-
335
-
336
- # ---- Text Processing ----
337
-
338
- def split_into_sentences(text):
339
- """Split raw ASR text into individual sentences."""
340
- text = text.strip()
341
- if not text:
342
- return []
343
- text = '. '.join(s.strip().capitalize() for s in text.split('. ') if s.strip())
344
- if re.search(r'[.!?]', text):
345
- sentences = re.split(r'(?<=[.!?])\s+', text)
346
- return [s.strip() for s in sentences if s.strip()]
347
- words = text.split()
348
- MAX_WORDS = 12
349
- sentences = []
350
- for i in range(0, len(words), MAX_WORDS):
351
- chunk = ' '.join(words[i:i + MAX_WORDS])
352
- if not chunk.endswith(('.', '!', '?')):
353
- chunk += '.'
354
- chunk = chunk[0].upper() + chunk[1:] if len(chunk) > 1 else chunk.upper()
355
- sentences.append(chunk)
356
- return sentences
357
-
358
-
359
- # ---- ASR ----
360
-
361
- def transcribe(audio_array, sample_rate=16000):
362
- """ASR: English audio to text. Handles both short and long audio."""
363
- if len(audio_array) < 1600:
364
- return ""
365
-
366
- duration_s = len(audio_array) / sample_rate
367
-
368
- if sample_rate != 16000:
369
- import torchaudio.functional as F_audio
370
- audio_tensor = torch.from_numpy(audio_array).float()
371
- audio_tensor = F_audio.resample(audio_tensor, sample_rate, 16000)
372
- audio_array = audio_tensor.numpy()
373
- sample_rate = 16000
374
-
375
- if duration_s <= 28:
376
- result = asr_pipe(
377
- {"raw": audio_array, "sampling_rate": sample_rate},
378
- return_timestamps=False,
379
- )
380
- return result["text"].strip()
381
-
382
- # Long-form: native Whisper generate
383
- model = asr_pipe.model
384
- processor = asr_pipe.feature_extractor
385
- tokenizer = asr_pipe.tokenizer
386
-
387
- inputs = processor(
388
- audio_array, sampling_rate=16000, return_tensors="pt",
389
- truncation=False, padding="longest", return_attention_mask=True,
390
  )
391
- input_features = inputs.input_features.to(DEVICE, dtype=TORCH_DTYPE)
392
- attention_mask = inputs.attention_mask.to(DEVICE) if "attention_mask" in inputs else None
393
-
394
- generate_kwargs = {"return_timestamps": True, "language": "en", "task": "transcribe"}
395
- if attention_mask is not None:
396
- generate_kwargs["attention_mask"] = attention_mask
397
-
398
- with torch.no_grad():
399
- predicted_ids = model.generate(input_features, **generate_kwargs)
400
-
401
- transcription = tokenizer.batch_decode(predicted_ids, skip_special_tokens=True)[0]
402
- return transcription.strip()
403
-
404
-
405
- # ---- MT ----
406
-
407
- def translate_sentence(text, target_nllb_code, fast=True, max_length=256):
408
- """Translate a single sentence from English to target language."""
409
- inputs = mt_tokenizer(text, return_tensors="pt", truncation=True).to(DEVICE)
410
- tgt_lang_id = mt_tokenizer.convert_tokens_to_ids(target_nllb_code)
411
-
412
- generate_kwargs = {
413
- "forced_bos_token_id": tgt_lang_id,
414
- "repetition_penalty": 1.5,
415
- "no_repeat_ngram_size": 3,
416
- }
417
- if fast:
418
- generate_kwargs.update({"max_length": 128, "num_beams": 1, "do_sample": False})
419
- else:
420
- generate_kwargs.update({"max_length": max_length, "num_beams": 4, "early_stopping": True})
421
-
422
- with torch.no_grad():
423
- output_ids = mt_model.generate(**inputs, **generate_kwargs)
424
-
425
- return mt_tokenizer.decode(output_ids[0], skip_special_tokens=True)
426
-
427
-
428
- def translate_text(text, target_nllb_code, fast=True):
429
-     """Split and translate full text sentence-by-sentence."""
430
-     sentences = split_into_sentences(text)
431
-     if not sentences:
432
-         return "", [], []
433
-     translations = []
434
-     for s in sentences:
435
-         yo = translate_sentence(s, target_nllb_code, fast=fast)
436
-         translations.append(yo)
437
-     return ' '.join(translations), sentences, translations
438
-
439
-
440
- # ---- Video Processing ----
441
-
442
- def extract_audio_from_video(video_path, output_path, target_sr=16000):
443
- """Extract audio track from video as 16kHz mono WAV."""
444
- cmd = [
445
- "ffmpeg", "-y", "-i", video_path,
446
- "-vn", "-acodec", "pcm_s16le", "-ar", str(target_sr), "-ac", "1",
447
- output_path,
448
- ]
449
- result = subprocess.run(cmd, capture_output=True, text=True)
450
- if result.returncode != 0:
451
- raise RuntimeError(f"ffmpeg extraction failed: {result.stderr[:200]}")
452
- return output_path
453
-
454
-
455
- def get_media_duration(path):
456
- """Get duration in seconds."""
457
- cmd = [
458
- "ffprobe", "-v", "error",
459
- "-show_entries", "format=duration",
460
- "-of", "default=noprint_wrappers=1:nokey=1", path,
461
- ]
462
- result = subprocess.run(cmd, capture_output=True, text=True)
463
- if result.returncode != 0:
464
- raise RuntimeError(f"ffprobe failed: {result.stderr[:200]}")
465
- return float(result.stdout.strip())
466
-
467
-
468
- def stretch_audio_to_duration(input_path, output_path, target_duration_s):
469
-     """Stretch/compress audio to match target duration."""
470
-     current_duration = get_media_duration(input_path)
471
-     if current_duration <= 0:
472
-         raise RuntimeError("Invalid audio duration")
473
-
474
-     ratio = current_duration / target_duration_s
475
-     filters = []
476
-     remaining = ratio
477
-     while remaining > 2.0:
478
-         filters.append("atempo=2.0")
479
-         remaining /= 2.0
480
-     while remaining < 0.5:
481
-         filters.append("atempo=0.5")
482
-         remaining /= 0.5
483
-     filters.append(f"atempo={remaining:.4f}")
484
-
485
-     cmd = ["ffmpeg", "-y", "-i", input_path, "-filter:a", ",".join(filters), output_path]
486
-     result = subprocess.run(cmd, capture_output=True, text=True)
487
-     if result.returncode != 0:
488
-         raise RuntimeError(f"ffmpeg tempo failed: {result.stderr[:200]}")
489
-     return output_path
490
-
491
-
492
- def mux_video_audio(video_path, audio_path, output_path, extend_video=False, target_duration=None):
493
- """Combine video with new audio. Optionally extend video by freezing last frame."""
494
- if extend_video and target_duration:
495
- cmd = [
496
- "ffmpeg", "-y", "-i", video_path, "-i", audio_path,
497
- "-filter_complex", f"[0:v]tpad=stop_mode=clone:stop_duration={target_duration}[v]",
498
- "-map", "[v]", "-map", "1:a:0",
499
- "-c:v", "libx264", "-preset", "fast", "-c:a", "aac",
500
- "-t", str(target_duration), output_path,
501
- ]
502
- else:
503
- cmd = [
504
- "ffmpeg", "-y", "-i", video_path, "-i", audio_path,
505
- "-c:v", "copy", "-c:a", "aac",
506
- "-map", "0:v:0", "-map", "1:a:0", "-shortest", output_path,
507
- ]
508
- result = subprocess.run(cmd, capture_output=True, text=True)
509
- if result.returncode != 0:
510
- raise RuntimeError(f"ffmpeg mux failed: {result.stderr[:200]}")
511
- return output_path
512
-
513
-
514
- # =============================================================================
515
- # TTS ENGINE: YourVoic API
516
- # =============================================================================
517
-
518
- YOURVOIC_API_KEY = os.environ.get("YOURVOIC_API_KEY", "")
519
- YOURVOIC_STREAM_URL = "https://yourvoic.com/api/v1/tts/stream"
520
-
521
-
522
- def synthesize_yourvoic(text, language_code, voice="Peter", speed=1.0):
523
- """Synthesize text using YourVoic API."""
524
- if not YOURVOIC_API_KEY:
525
- raise RuntimeError("YOURVOIC_API_KEY not set.")
526
-
527
- headers = {"X-API-Key": YOURVOIC_API_KEY, "Content-Type": "application/json"}
528
- payload = {"text": text, "voice": voice, "language": language_code, "model": "aura-prime", "speed": speed}
529
-
530
- logger.info(f"YourVoic request: lang={language_code}, voice={voice}, text={text[:50]}")
531
-
532
- t0 = time.time()
533
- response = requests.post(YOURVOIC_STREAM_URL, headers=headers, json=payload, stream=True, timeout=60)
534
-
535
- if response.status_code != 200:
536
- error_text = response.text[:300] if hasattr(response, 'text') else str(response.status_code)
537
- raise RuntimeError(f"YourVoic error {response.status_code}: {error_text}")
538
-
539
- # Detect format from content-type header
540
- ct = response.headers.get("content-type", "").lower()
541
- logger.info(f"YourVoic content-type: {ct}")
542
-
543
- # Collect audio bytes
544
- audio_data = b""
545
- for chunk in response.iter_content(chunk_size=8192):
546
- audio_data += chunk
547
-
548
- elapsed = time.time() - t0
549
- logger.info(f"YourVoic TTS: {len(text)} chars, {elapsed:.2f}s, {len(audio_data)} bytes")
550
-
551
- # Check for empty response
552
- if len(audio_data) == 0:
553
- raise RuntimeError(
554
- f"YourVoic returned empty audio. This usually means the voice '{voice}' "
555
- f"doesn't support language '{language_code}'. Try a different voice."
556
- )
557
-
558
- # Log first bytes for format detection
559
- magic = audio_data[:16] if len(audio_data) > 16 else audio_data
560
- logger.info(f"YourVoic first bytes: {magic[:8]}")
561
-
562
- # Handle raw PCM directly (audio/pcm) — wrap in WAV header
563
- if "pcm" in ct or "linear16" in ct or "l16" in ct:
564
- sr = 24000 # YourVoic default sample rate for PCM
565
- wav_path = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
566
- with open(wav_path, "wb") as f:
567
- f.write(b"RIFF")
568
- f.write(struct.pack("<I", 36 + len(audio_data)))
569
- f.write(b"WAVE")
570
- f.write(b"fmt ")
571
- f.write(struct.pack("<IHHIIHH", 16, 1, 1, sr, sr * 2, 2, 16))
572
- f.write(b"data")
573
- f.write(struct.pack("<I", len(audio_data)))
574
- f.write(audio_data)
575
- try:
576
- audio_array, sample_rate = sf.read(wav_path, dtype="float32")
577
- os.unlink(wav_path)
578
- logger.info(f"YourVoic PCM decoded: {len(audio_array)/sample_rate:.2f}s at {sample_rate}Hz")
579
- return audio_array, sample_rate
580
- except Exception as e:
581
- os.unlink(wav_path)
582
- raise RuntimeError(f"PCM decode failed: {e}")
583
-
584
- # Determine file extension from magic bytes or content-type
585
- if b"RIFF" in audio_data[:4]:
586
- ext = ".wav"
587
- elif b"\xff\xfb" in audio_data[:3] or b"\xff\xf3" in audio_data[:3] or b"ID3" in audio_data[:3]:
588
- ext = ".mp3"
589
- elif b"OggS" in audio_data[:4]:
590
- ext = ".ogg"
591
- elif b"fLaC" in audio_data[:4]:
592
- ext = ".flac"
593
- elif "mp3" in ct or "mpeg" in ct:
594
- ext = ".mp3"
595
- elif "ogg" in ct:
596
- ext = ".ogg"
597
- elif "wav" in ct:
598
- ext = ".wav"
599
- elif "flac" in ct:
600
- ext = ".flac"
601
- else:
602
- ext = ".mp3"
603
- logger.warning(f"Unknown YourVoic format (ct={ct}), guessing mp3")
604
-
605
- # Save with correct extension
606
- tmp_path = tempfile.NamedTemporaryFile(suffix=ext, delete=False).name
607
- with open(tmp_path, "wb") as f:
608
- f.write(audio_data)
609
 
610
- # Try reading directly with soundfile
611
  try:
612
- audio_array, sample_rate = sf.read(tmp_path, dtype="float32")
613
- os.unlink(tmp_path)
614
- return audio_array, sample_rate
615
- except Exception as e:
616
- logger.warning(f"soundfile can't read {ext}: {e}")
617
-
618
- # Fallback: convert with ffmpeg
619
- try:
620
- wav_path = tmp_path + ".wav"
621
- result = subprocess.run(
622
- ["ffmpeg", "-y", "-i", tmp_path, "-acodec", "pcm_s16le", "-ar", "24000", "-ac", "1", wav_path],
623
- capture_output=True, text=True,
624
  )
625
- os.unlink(tmp_path)
626
- if result.returncode != 0:
627
- raise RuntimeError(f"ffmpeg failed: {result.stderr[-300:]}")
628
- audio_array, sample_rate = sf.read(wav_path, dtype="float32")
629
- os.unlink(wav_path)
630
- return audio_array, sample_rate
631
- except Exception as e2:
632
- for f in [tmp_path, tmp_path + ".wav"]:
633
- if os.path.exists(f): os.unlink(f)
634
- raise RuntimeError(f"YourVoic decode failed: {e2}")
635
-
636
-
637
- def synthesize_yourvoic_to_file(text, language_code, output_path, voice="Peter", speed=1.0):
638
- """Synthesize via YourVoic and save to file."""
639
- audio, sr = synthesize_yourvoic(text, language_code, voice, speed)
640
- sf.write(output_path, audio, sr)
641
- return output_path, sr
642
 
643
 
 
 
644
 
645
 
646
- def synthesize_chunked(text, language_config, sentences_per_chunk=2):
647
- """
648
- Synthesize long text by chunking into sentence groups via YourVoic API.
649
-
650
- Args:
651
- text: Full text to synthesize
652
- language_config: Dict from LANGUAGES (has yourvoic_lang, yourvoic_voices, etc.)
653
- sentences_per_chunk: How many sentences to synthesize per API call
654
-
655
- Returns:
656
- (audio_array, sample_rate)
657
- """
658
- sentences = re.split(r'(?<=[.!?])\s+', text)
659
- sentences = [s.strip() for s in sentences if s.strip()]
660
-
661
- if not sentences:
662
- return np.zeros(int(0.5 * 16000), dtype=np.float32), 16000
663
-
664
- audio_segments = []
665
- output_sr = None
666
 
667
- for i in range(0, len(sentences), sentences_per_chunk):
668
- chunk_text = ' '.join(sentences[i:i + sentences_per_chunk])
669
- if not chunk_text:
670
- continue
 
671
 
672
  try:
673
- voice = language_config["yourvoic_voices"][0] if language_config.get("yourvoic_voices") else "Peter"
674
- lang_code = language_config["yourvoic_lang"]
675
- audio_seg, seg_sr = synthesize_yourvoic(chunk_text, lang_code, voice)
676
-
677
- if output_sr is None:
678
- output_sr = seg_sr
679
- if len(audio_seg) > 0:
680
- audio_segments.append(audio_seg)
681
- silence = np.zeros(int(0.15 * seg_sr), dtype=np.float32)
682
- audio_segments.append(silence)
683
 
684
  except Exception as e:
685
- logger.error(f"TTS chunk failed: {e}")
686
  continue
687
 
688
- if not audio_segments:
689
- fallback_sr = output_sr or 16000
690
- logger.warning("All TTS chunks failed — returning silence")
691
- return np.zeros(int(0.5 * fallback_sr), dtype=np.float32), fallback_sr
692
 
693
- return np.concatenate(audio_segments), output_sr
694
 
695
 
696
- # =============================================================================
697
- # QWEN OMNI ENGINE
698
- # =============================================================================
699
 
700
- QWEN_MODEL = "qwen3.5-omni-plus"
701
- QWEN_BASE_URL = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
702
 
 
703
 
704
- def _get_client():
705
- """Create OpenAI-compatible client for Qwen Dashscope API."""
706
- from openai import OpenAI
707
- api_key = os.environ.get("DASHSCOPE_API_KEY", "")
708
- if not api_key:
709
- raise RuntimeError(
710
- "DASHSCOPE_API_KEY not set. Add it as a Space secret."
711
  )
712
- return OpenAI(api_key=api_key, base_url=QWEN_BASE_URL)
-
-
- def _wav_to_base64(wav_path):
-     """Read WAV file and return base64 string."""
-     with open(wav_path, "rb") as f:
-         return base64.b64encode(f.read()).decode("utf-8")
-
-
- def _base64_to_wav(b64_data, output_path):
-     """Convert raw PCM base64 audio to WAV file (24kHz, mono, 16-bit)."""
-     audio_bytes = base64.b64decode(b64_data)
-     sample_rate = 24000
-     num_channels = 1
-     bits_per_sample = 16
-     byte_rate = sample_rate * num_channels * bits_per_sample // 8
-     block_align = num_channels * bits_per_sample // 8
-     data_size = len(audio_bytes)
-     with open(output_path, "wb") as f:
-         f.write(b"RIFF")
-         f.write(struct.pack("<I", 36 + data_size))
-         f.write(b"WAVE")
-         f.write(b"fmt ")
-         f.write(struct.pack("<I", 16))
-         f.write(struct.pack("<H", 1))
-         f.write(struct.pack("<H", num_channels))
-         f.write(struct.pack("<I", sample_rate))
-         f.write(struct.pack("<I", byte_rate))
-         f.write(struct.pack("<H", block_align))
-         f.write(struct.pack("<H", bits_per_sample))
-         f.write(b"data")
-         f.write(struct.pack("<I", data_size))
-         f.write(audio_bytes)
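
The removed `_base64_to_wav` helper builds the 44-byte RIFF header by hand. As a point of comparison, here is a minimal sketch of the same wrapping done with the standard-library `wave` module. The function name `pcm_b64_to_wav_bytes` and its parameters are hypothetical and not part of the app; the defaults (24 kHz, mono, 16-bit) are taken from the docstring above.

```python
import base64
import io
import wave


def pcm_b64_to_wav_bytes(b64_data: str, sample_rate: int = 24000,
                         channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap base64-encoded raw PCM in a WAV container (hypothetical helper)."""
    pcm = base64.b64decode(b64_data)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample: 2 means 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)            # header sizes are filled in by the module
    return buf.getvalue()
```

Letting `wave` compute the chunk sizes avoids keeping `byte_rate`, `block_align`, and `data_size` in sync manually, at the cost of supporting only plain PCM.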
-
-
- def _extract_audio_chunk(video_path, output_wav, start_sec, duration_sec):
-     """Extract a chunk of audio from video as 16kHz mono WAV."""
-     subprocess.run(
-         ["ffmpeg", "-y", "-ss", str(start_sec), "-t", str(duration_sec),
-          "-i", video_path, "-vn", "-acodec", "pcm_s16le",
-          "-ar", "16000", "-ac", "1", output_wav],
-         capture_output=True, check=True,
-     )
755
 
756
 
- def _get_duration(filepath):
-     """Get media file duration in seconds."""
-     result = subprocess.run(
-         ["ffprobe", "-v", "quiet", "-show_entries", "format=duration",
-          "-of", "default=noprint_wrappers=1:nokey=1", filepath],
-         capture_output=True, text=True,
-     )
-     return float(result.stdout.strip())
765
 
766
 
- def _concatenate_wavs(wav_files, output_path):
-     """Concatenate WAV files using ffmpeg."""
-     if len(wav_files) == 1:
-         shutil.copy2(wav_files[0], output_path)
          return
-     list_file = output_path + ".txt"
-     with open(list_file, "w") as f:
-         for wav in wav_files:
-             f.write(f"file '{wav}'\n")
-     subprocess.run(
-         ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
-          "-i", list_file, "-c", "copy", output_path],
-         capture_output=True, check=True,
-     )
-     os.remove(list_file)
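
The list file written by `_concatenate_wavs` places each path directly between single quotes, so a path that itself contains `'` would yield a malformed entry. The sketch below shows one way to escape such paths, assuming the usual ffmpeg quoting rule (close the quote, emit an escaped quote, reopen the quote). `concat_list_entry` is a hypothetical helper, not something the app defines.

```python
def concat_list_entry(path: str) -> str:
    """Format one line of an ffmpeg concat list file (hypothetical helper)."""
    # Inside single quotes, a literal quote is written as '\'' .
    escaped = path.replace("'", "'\\''")
    return f"file '{escaped}'\n"
```

Paths produced by `tempfile.mkdtemp` normally contain no quotes, so this matters mainly if user-supplied file names ever reach the list.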
782
-
783
-
784
- def _build_system_prompt(language_name):
785
- """Build Qwen system prompt for a target language."""
786
- return (
787
- f"You are a professional video dubbing translator. You will receive audio in English.\n"
788
- f"Your task:\n"
789
- f"1. Listen carefully to the English speech.\n"
790
- f"2. Translate it into natural, fluent {language_name}.\n"
791
- f"3. Respond ONLY with the {language_name} translation spoken aloud — no English, no commentary,\n"
792
- f" no meta-text, no transliteration. Speak entirely in {language_name}.\n"
793
- f"4. Match the tone, emotion, and pacing of the original speaker as closely as possible.\n"
794
- f"5. If there are pauses or silence in the original audio, maintain similar pacing.\n"
795
- f"6. Translate idioms and cultural references into their {language_name} equivalents.\n"
796
- f"7. Use clear, professional pronunciation suitable for a broad audience."
797
- )
798
-
799
-
800
- def translate_chunk_qwen(wav_path, voice, language_name, chunk_index=0):
801
- """
802
- Translate a single audio chunk using Qwen Omni.
803
-
804
- Args:
805
- wav_path: Path to input WAV file (English audio)
806
- voice: Qwen voice name (e.g. "Ethan", "Cherry")
807
- language_name: Full language name for the system prompt
808
- chunk_index: For logging
809
-
810
- Returns:
811
- (output_wav_path, transcript) or (None, transcript) if no audio
812
- """
813
- client = _get_client()
814
- audio_b64 = _wav_to_base64(wav_path)
815
- output_wav = wav_path.replace(".wav", f"_qwen_{chunk_index}.wav")
816
-
817
- system_prompt = _build_system_prompt(language_name)
818
- user_prompt = f"Translate this English speech into {language_name}. Respond only with the spoken {language_name} translation."
819
 
820
  t0 = time.time()
821
- completion = client.chat.completions.create(
822
- model=QWEN_MODEL,
823
- messages=[
824
- {"role": "system", "content": system_prompt},
825
- {
826
- "role": "user",
827
- "content": [
828
- {
829
- "type": "input_audio",
830
- "input_audio": {
831
- "data": f"data:audio/wav;base64,{audio_b64}",
832
- "format": "wav",
833
- },
834
- },
835
- {"type": "text", "text": user_prompt},
836
- ],
837
- },
838
- ],
839
- modalities=["text", "audio"],
840
- audio={"voice": voice, "format": "wav"},
841
- stream=True,
842
- stream_options={"include_usage": True},
843
  )
844
-
845
- audio_chunks = []
846
- transcript_parts = []
847
-
848
- for event in completion:
849
- if not event.choices:
850
- continue
851
- delta = event.choices[0].delta
852
- if hasattr(delta, "content") and delta.content:
853
- transcript_parts.append(delta.content)
854
- if hasattr(delta, "audio") and delta.audio:
855
- if isinstance(delta.audio, dict):
856
- if "data" in delta.audio:
857
- audio_chunks.append(delta.audio["data"])
858
- elif hasattr(delta.audio, "data") and delta.audio.data:
859
- audio_chunks.append(delta.audio.data)
860
-
861
- transcript = "".join(transcript_parts)
862
  elapsed = time.time() - t0
863
- logger.info(f"Qwen chunk {chunk_index}: {elapsed:.1f}s, transcript={transcript[:60]}")
864
-
865
- if audio_chunks:
866
- full_audio_b64 = "".join(audio_chunks)
867
- _base64_to_wav(full_audio_b64, output_wav)
868
- return output_wav, transcript
869
-
870
- return None, transcript
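
The streaming loop in `translate_chunk_qwen` accepts audio deltas that arrive either as a dict or as an object with a `data` attribute. A minimal sketch of that accumulation, isolated from the API client so it can be exercised with stand-in events, follows. `collect_stream` is a hypothetical name, and the event shape is assumed from the loop above rather than from any API reference.

```python
from types import SimpleNamespace


def collect_stream(events):
    """Gather base64 audio fragments and transcript text from stream events."""
    audio_parts, text_parts = [], []
    for event in events:
        choices = getattr(event, "choices", None)
        if not choices:
            continue  # e.g. a trailing usage-only event
        delta = choices[0].delta
        content = getattr(delta, "content", None)
        if content:
            text_parts.append(content)
        audio = getattr(delta, "audio", None)
        if not audio:
            continue
        # The payload may be a plain dict or an attribute-style object.
        if isinstance(audio, dict):
            data = audio.get("data")
        else:
            data = getattr(audio, "data", None)
        if data:
            audio_parts.append(data)
    return "".join(audio_parts), "".join(text_parts)
```

Keeping the parsing separate from the network call makes the dict-versus-object branch easy to test without an API key.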
871
-
872
-
873
- def dub_video_qwen(video_path, language_name, voice="Ethan", chunk_seconds=120, progress_fn=None):
874
- """
875
- Full video dubbing pipeline using Qwen Omni.
876
- Splits video into chunks, translates each chunk via Qwen API,
877
- concatenates results, and muxes back onto video.
878
-
879
- Args:
880
- video_path: Path to input video
881
- language_name: Full language name (e.g. "French", "Arabic")
882
- voice: Qwen voice name
883
- chunk_seconds: Audio chunk duration for API calls
884
- progress_fn: Optional gradio progress callback
885
-
886
- Returns:
887
- (output_video_path, log_text)
888
- """
889
- tmp_dir = tempfile.mkdtemp(prefix=f"qwen_dub_")
890
- log = []
 
 
 
 
 
 
891
 
 
 
892
  try:
893
- # Duration
894
- if progress_fn:
895
- progress_fn(0.05, desc="Analyzing video...")
896
- total_duration = _get_duration(video_path)
897
- log.append(f"**Video:** {total_duration:.1f}s")
898
- log.append(f"**Engine:** Qwen 3.5 Omni")
899
- log.append(f"**Voice:** {voice}")
900
- log.append(f"**Language:** {language_name}")
901
-
902
- if total_duration > 3600:
903
- return None, "Video longer than 1 hour — please use a shorter clip."
904
-
905
- # Split into chunks
906
- if progress_fn:
907
- progress_fn(0.1, desc="Extracting audio chunks...")
908
- num_chunks = max(1, int(total_duration // chunk_seconds) + (1 if total_duration % chunk_seconds > 0 else 0))
909
- log.append(f"**Chunks:** {num_chunks} ({chunk_seconds}s each)")
910
-
911
- input_chunks = []
912
- for i in range(num_chunks):
913
- start = i * chunk_seconds
914
- duration = min(chunk_seconds, total_duration - start)
915
- chunk_path = os.path.join(tmp_dir, f"chunk_{i:03d}.wav")
916
- _extract_audio_chunk(video_path, chunk_path, start, duration)
917
- input_chunks.append(chunk_path)
918
-
919
- # Translate each chunk
920
- output_chunks = []
921
- all_transcripts = []
922
-
923
- for i, chunk_path in enumerate(input_chunks):
924
- if progress_fn:
925
- frac = 0.15 + 0.7 * (i / num_chunks)
926
- progress_fn(frac, desc=f"Translating chunk {i+1}/{num_chunks}...")
927
-
928
- result_path, transcript = translate_chunk_qwen(
929
- chunk_path, voice, language_name, i
930
- )
931
- if transcript:
932
- all_transcripts.append(f"**[{i+1}]** {transcript}")
933
 
934
- if result_path:
935
- output_chunks.append(result_path)
936
- else:
937
- # Silence fallback
938
- duration = _get_duration(chunk_path)
939
- silence_path = os.path.join(tmp_dir, f"silence_{i:03d}.wav")
940
- subprocess.run(
941
- ["ffmpeg", "-y", "-f", "lavfi",
942
- "-i", "anullsrc=r=24000:cl=mono",
943
- "-t", str(duration), "-acodec", "pcm_s16le", silence_path],
944
- capture_output=True, check=True,
945
- )
946
- output_chunks.append(silence_path)
947
-
948
- # Concatenate
949
- if progress_fn:
950
- progress_fn(0.88, desc="Assembling audio...")
951
- full_audio = os.path.join(tmp_dir, "full_dubbed.wav")
952
- _concatenate_wavs(output_chunks, full_audio)
953
-
954
- # Mux onto video
955
- if progress_fn:
956
- progress_fn(0.93, desc="Combining audio and video...")
957
- output_video = os.path.join(tmp_dir, "dubbed_output.mp4")
958
- subprocess.run(
959
- ["ffmpeg", "-y", "-i", video_path, "-i", full_audio,
960
- "-c:v", "copy", "-map", "0:v:0", "-map", "1:a:0",
961
- "-shortest", output_video],
962
- capture_output=True, check=True,
963
  )
964
-
965
- if progress_fn:
966
- progress_fn(1.0, desc="Done!")
967
-
968
- log.append(f"\n**Transcript:**")
969
- log.extend(all_transcripts)
970
-
971
- return output_video, "\n".join(log)
972
-
973
  except Exception as e:
974
- logger.exception("Qwen dubbing failed")
975
- shutil.rmtree(tmp_dir, ignore_errors=True)
976
- return None, f"Error: {str(e)}"
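
`dub_video_qwen` derives the chunk count from a floor division plus a remainder check, then computes each chunk's start and length inside the loop. The same plan can be sketched in one place with `math.ceil`. `plan_chunks` is a hypothetical helper, and unlike the removed code it returns an empty plan for a zero-length input instead of a single empty chunk.

```python
import math


def plan_chunks(total_duration: float, chunk_seconds: float) -> list[tuple[float, float]]:
    """Return (start, duration) pairs covering total_duration (hypothetical helper)."""
    if total_duration <= 0 or chunk_seconds <= 0:
        return []
    num_chunks = math.ceil(total_duration / chunk_seconds)
    return [
        (i * chunk_seconds, min(chunk_seconds, total_duration - i * chunk_seconds))
        for i in range(num_chunks)
    ]
```

Each pair maps directly onto the `-ss` and `-t` arguments passed to ffmpeg by `_extract_audio_chunk`.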
977
-
978
-
979
- # =============================================================================
980
- # GRADIO APP
981
- # =============================================================================
982
 
983
- # Load models at startup
984
- load_models()
985
 
986
- # =============================================================================
987
- # Helper functions
988
- # =============================================================================
989
-
990
- def get_voices_for_language(lang_name):
991
- """Get available voices for a language based on its engine."""
992
- config = LANGUAGES.get(lang_name, {})
993
- engine = config.get("tts_engine", "local")
994
- if engine == "qwen":
995
- return QWEN_VOICES
996
- elif engine == "yourvoic" and config.get("yourvoic_voices"):
997
- return config["yourvoic_voices"]
998
- elif engine == "local":
999
- return ["Peter"]
1000
- return ["Peter"]
1001
-
1002
-
1003
- def full_pipeline_audio(audio_input, target_language):
1004
- """Full pipeline: English audio → target language audio."""
1005
- if audio_input is None:
1006
- return None, "Please upload or record audio."
1007
-
1008
- lang_config = LANGUAGES.get(target_language)
1009
- if not lang_config:
1010
- return None, f"Language '{target_language}' not configured."
1011
 
1012
- sample_rate, audio_array = audio_input
1013
- audio_array = audio_array.astype(np.float32)
1014
- if audio_array.ndim > 1:
1015
- audio_array = audio_array.mean(axis=1)
1016
- if audio_array.max() > 1.0 or audio_array.min() < -1.0:
1017
- max_val = max(abs(audio_array.max()), abs(audio_array.min()))
1018
- if max_val > 0:
1019
- audio_array = audio_array / max_val
1020
 
1021
- log = []
1022
- total_start = time.time()
1023
 
1024
- # ASR
1025
- t0 = time.time()
1026
- english = transcribe(audio_array, sample_rate)
1027
- log.append(f"**ASR** ({time.time()-t0:.2f}s)\n{english}")
1028
- if not english:
1029
- return None, "ASR returned empty text."
1030
 
1031
- # MT
1032
- t0 = time.time()
1033
- nllb_code = lang_config["nllb"]
1034
- translated, en_sents, tgt_sents = translate_text(english, nllb_code, fast=False)
1035
- log.append(f"\n**Translation** ({time.time()-t0:.2f}s)")
1036
- for e, t in zip(en_sents, tgt_sents):
1037
- log.append(f" EN: {e}\n {target_language.upper()}: {t}")
1038
- if not translated:
1039
- return None, "Translation returned empty."
1040
-
1041
- # TTS
1042
- t0 = time.time()
1043
- audio_out, sr_out = synthesize_chunked(
1044
- translated, lang_config
1045
  )
1046
- log.append(f"\n**TTS** ({time.time()-t0:.2f}s) = {len(audio_out)/sr_out:.1f}s audio")
1047
-
1048
- total = time.time() - total_start
1049
- log.append(f"\n**Total: {total:.2f}s**")
1050
-
1051
- return (sr_out, audio_out), "\n".join(log)
1052
 
1053
 
1054
- def full_pipeline_text(english_text, target_language, voice_name):
1055
- """Text-only pipeline: English text → target language audio."""
1056
- if not english_text or not english_text.strip():
1057
- return None, "Please enter English text."
1058
-
1059
- lang_config = LANGUAGES.get(target_language)
1060
- if not lang_config:
1061
- return None, f"Language '{target_language}' not configured."
1062
-
1063
- log = []
1064
- total_start = time.time()
1065
-
1066
- # MT
1067
- t0 = time.time()
1068
- nllb_code = lang_config["nllb"]
1069
- translated, en_sents, tgt_sents = translate_text(english_text.strip(), nllb_code, fast=False)
1070
- log.append(f"**Translation** ({time.time()-t0:.2f}s)")
1071
- for e, t in zip(en_sents, tgt_sents):
1072
- log.append(f" EN: {e}\n {target_language.upper()}: {t}")
1073
- if not translated:
1074
- return None, "Translation returned empty."
1075
-
1076
- # TTS
1077
- t0 = time.time()
1078
- audio_out, sr_out = synthesize_chunked(
1079
- translated, lang_config
1080
  )
1081
- log.append(f"\n**TTS** ({time.time()-t0:.2f}s) = {len(audio_out)/sr_out:.1f}s audio")
1082
-
1083
- total = time.time() - total_start
1084
- log.append(f"\n**Total: {total:.2f}s**")
1085
-
1086
- return (sr_out, audio_out), "\n".join(log)
1087
-
1088
-
1089
- def dub_video(video_path, target_languages, dub_voice, chunk_seconds, progress=gr.Progress()):
1090
- """
1091
- Dub a video into one or more target languages.
1092
- Routes to Qwen Omni for global languages, YourVoic for others.
1093
- """
1094
- if video_path is None:
1095
- return None, "Please upload a video."
1096
-
1097
- if not target_languages:
1098
- return None, "Please select at least one target language."
1099
-
1100
- results_log = []
1101
- output_videos = []
1102
-
1103
- for lang_name in target_languages:
1104
- lang_config = LANGUAGES.get(lang_name)
1105
- if not lang_config:
1106
- results_log.append(f"**{lang_name}**: not configured, skipped")
1107
- continue
1108
-
1109
- engine = lang_config.get("tts_engine", "local")
1110
- results_log.append(f"\n{'='*50}")
1111
- results_log.append(f"**Dubbing: {lang_name}** (engine: {engine})")
1112
- results_log.append(f"{'='*50}")
1113
-
1114
- try:
1115
- if engine == "qwen":
1116
- # Qwen Omni: end-to-end speech-to-speech (best for global languages)
1117
- qwen_lang_name = lang_config.get("qwen_name", lang_name)
1118
- voice = dub_voice if dub_voice in QWEN_VOICES else "Ethan"
1119
- out_video, log_text = dub_video_qwen(
1120
- video_path, qwen_lang_name, voice=voice,
1121
- chunk_seconds=chunk_seconds, progress_fn=progress,
1122
- )
1123
- results_log.append(log_text)
1124
- if out_video:
1125
- output_videos.append(out_video)
1126
-
1127
- else:
1128
- # Local/YourVoic pipeline: ASR → NLLB → TTS
1129
- work_dir = tempfile.mkdtemp(prefix=f"dub_{lang_name}_")
1130
- extracted_audio = os.path.join(work_dir, "audio.wav")
1131
- tgt_audio_raw = os.path.join(work_dir, "tgt_raw.wav")
1132
- tgt_audio_aligned = os.path.join(work_dir, "tgt_aligned.wav")
1133
- output_video = os.path.join(work_dir, f"dubbed_{lang_name}.mp4")
1134
-
1135
- progress(0.05, desc=f"{lang_name}: extracting audio...")
1136
- extract_audio_from_video(video_path, extracted_audio)
1137
- video_duration = get_media_duration(video_path)
1138
- results_log.append(f"Video: {video_duration:.1f}s")
1139
-
1140
- audio_array, sr = sf.read(extracted_audio, dtype="float32")
1141
- if audio_array.ndim > 1:
1142
- audio_array = audio_array.mean(axis=1)
1143
-
1144
- progress(0.15, desc=f"{lang_name}: transcribing...")
1145
- t0 = time.time()
1146
- english = transcribe(audio_array, sr)
1147
- results_log.append(f"ASR: {time.time()-t0:.1f}s")
1148
- if not english:
1149
- results_log.append("ASR empty — skipped")
1150
- continue
1151
-
1152
- progress(0.4, desc=f"{lang_name}: translating...")
1153
- t0 = time.time()
1154
- nllb_code = lang_config["nllb"]
1155
- translated, _, _ = translate_text(english, nllb_code, fast=True)
1156
- results_log.append(f"MT: {time.time()-t0:.1f}s")
1157
- if not translated:
1158
- results_log.append("Translation empty — skipped")
1159
- continue
1160
-
1161
- progress(0.65, desc=f"{lang_name}: synthesizing...")
1162
- t0 = time.time()
1163
- tgt_audio, tgt_sr = synthesize_chunked(
1164
- translated, lang_config
1165
- )
1166
- sf.write(tgt_audio_raw, tgt_audio, tgt_sr)
1167
- tgt_duration = len(tgt_audio) / tgt_sr
1168
- results_log.append(f"TTS: {time.time()-t0:.1f}s ({tgt_duration:.1f}s audio)")
1169
-
1170
- progress(0.85, desc=f"{lang_name}: aligning...")
1171
- MAX_STRETCH = 1.2
1172
- stretch_ratio = tgt_duration / video_duration
1173
-
1174
- if stretch_ratio <= MAX_STRETCH:
1175
- if abs(stretch_ratio - 1.0) > 0.02:
1176
- stretch_audio_to_duration(tgt_audio_raw, tgt_audio_aligned, video_duration)
1177
- else:
1178
- import shutil
1179
- shutil.copy(tgt_audio_raw, tgt_audio_aligned)
1180
- extend_video = False
1181
- final_duration = video_duration
1182
- else:
1183
- shutil.copy(tgt_audio_raw, tgt_audio_aligned)
1184
- extend_video = True
1185
- final_duration = tgt_duration
1186
- results_log.append(f"Audio longer ({stretch_ratio:.1f}x) — extending video")
1187
-
1188
- progress(0.95, desc=f"{lang_name}: combining...")
1189
- mux_video_audio(
1190
- video_path, tgt_audio_aligned, output_video,
1191
- extend_video=extend_video, target_duration=final_duration
1192
- )
1193
- output_videos.append(output_video)
1194
-
1195
- except Exception as e:
1196
- logger.exception(f"Dubbing {lang_name} failed")
1197
- results_log.append(f"Error: {str(e)}")
1198
-
1199
- progress(1.0, desc="Done!")
1200
- final_video = output_videos[0] if output_videos else None
1201
- return final_video, "\n".join(results_log)
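
The alignment step in `dub_video` has three outcomes: copy the audio as-is when it is within 2% of the video length, time-stretch it when the ratio is at most `MAX_STRETCH`, and extend the video otherwise. A minimal sketch of that decision as a pure function is shown here. `plan_alignment` and its string labels are hypothetical; the thresholds are the ones used in the removed code.

```python
def plan_alignment(tgt_duration: float, video_duration: float,
                   max_stretch: float = 1.2, tolerance: float = 0.02):
    """Return (action, final_duration); action is 'copy', 'stretch' or 'extend'."""
    if video_duration <= 0:
        raise ValueError("video_duration must be positive")
    ratio = tgt_duration / video_duration
    if ratio > max_stretch:
        # Audio is too long to compress acceptably: lengthen the video instead.
        return "extend", tgt_duration
    if abs(ratio - 1.0) > tolerance:
        return "stretch", video_duration
    return "copy", video_duration
```

Note that audio shorter than the video also lands in the `stretch` case, which matches the removed code's behaviour of slowing short audio down to fill the clip.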
1202
-
1203
-
1204
- def update_voices(language):
1205
- """Update voice dropdown when language changes."""
1206
- voices = get_voices_for_language(language)
1207
- return gr.update(choices=voices, value=voices[0])
1208
-
1209
-
1210
- # =============================================================================
1211
- # Live Streaming Translation
1212
- # =============================================================================
1213
-
1214
- class LiveStreamState:
1215
- """Manages audio buffer and transcript for live streaming mode."""
1216
- def __init__(self, chunk_duration_s=5):
1217
- self.chunk_duration_s = chunk_duration_s
1218
- self.audio_buffer = np.array([], dtype=np.float32)
1219
- self.buffer_sr = 16000
1220
- self.transcript_pairs = []
1221
- self.chunk_count = 0
1222
- self.total_time = 0.0
1223
- self.target_language = "Spanish"
1224
- self.target_voice = "Ethan"
1225
 
1226
- def reset(self):
1227
- self.audio_buffer = np.array([], dtype=np.float32)
1228
- self.transcript_pairs = []
1229
- self.chunk_count = 0
1230
- self.total_time = 0.0
1231
 
1232
 
1233
- def live_stream_process(audio_chunk, target_language, target_voice, state):
1234
- """
1235
- Process streaming audio chunks for live translation.
1236
- Called repeatedly by Gradio's .stream() as mic audio arrives.
1237
- """
1238
- if state is None:
1239
- state = LiveStreamState()
1240
 
1241
- state.target_language = target_language
1242
- state.target_voice = target_voice
1243
 
1244
- if audio_chunk is None:
1245
- return None, _format_live_log(state), state
 
 
 
 
 
 
1246
 
1247
- sample_rate, audio_data = audio_chunk
1248
- audio_data = audio_data.astype(np.float32)
1249
- if audio_data.ndim > 1:
1250
- audio_data = audio_data.mean(axis=1)
1251
- if len(audio_data) > 0:
1252
- mx = max(abs(audio_data.max()), abs(audio_data.min()))
1253
- if mx > 1.0:
1254
- audio_data = audio_data / mx
1255
 
1256
- state.buffer_sr = sample_rate
1257
- state.audio_buffer = np.concatenate([state.audio_buffer, audio_data])
1258
 
1259
- required_samples = int(state.chunk_duration_s * sample_rate)
 
 
 
 
1260
 
1261
- if len(state.audio_buffer) < required_samples:
1262
- buffered_s = len(state.audio_buffer) / sample_rate
1263
- return None, _format_live_log(state, buffered_s), state
1264
 
1265
- chunk = state.audio_buffer[:required_samples]
1266
- state.audio_buffer = state.audio_buffer[required_samples:]
1267
 
1268
- lang_config = LANGUAGES.get(target_language)
1269
- if not lang_config:
1270
- return None, f"Language '{target_language}' not configured.", state
1271
 
1272
- t_start = time.time()
1273
- engine = lang_config.get("tts_engine", "yourvoic")
1274
 
1275
- try:
-         if engine == "qwen":
-             tmp_wav = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
-             qwen_lang = lang_config.get("qwen_name", target_language)
-             voice = target_voice if target_voice in QWEN_VOICES else "Ethan"
-             try:
-                 sf.write(tmp_wav, chunk, sample_rate)
-                 result_wav, transcript = translate_chunk_qwen(tmp_wav, voice, qwen_lang, state.chunk_count)
-             finally:
-                 # Remove the temporary input even if writing or the API call raises.
-                 os.unlink(tmp_wav)
-             english_text = "(Qwen end-to-end)"
-             translated_text = transcript or "(no transcript)"
-             if result_wav:
-                 audio_out, sr_out = sf.read(result_wav, dtype="float32")
-             else:
-                 audio_out, sr_out = None, None
1289
- else:
1290
- english_text = transcribe(chunk, sample_rate)
1291
- if not english_text:
1292
- return None, _format_live_log(state), state
1293
- nllb_code = lang_config["nllb"]
1294
- translated_text, _, _ = translate_text(english_text, nllb_code, fast=True)
1295
- if not translated_text:
1296
- return None, _format_live_log(state), state
1297
- audio_out, sr_out = synthesize_chunked(translated_text, lang_config)
1298
 
1299
- except Exception as e:
1300
- logger.error(f"Live chunk failed: {e}")
1301
- return None, _format_live_log(state) + f"\n\n**Error:** {e}", state
1302
-
1303
- elapsed = time.time() - t_start
1304
- state.chunk_count += 1
1305
- state.total_time += elapsed
1306
- state.transcript_pairs.append((english_text, translated_text))
1307
-
1308
- if audio_out is not None and len(audio_out) > 0:
1309
- return (sr_out, audio_out), _format_live_log(state), state
1310
- return None, _format_live_log(state), state
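
`live_stream_process` accumulates microphone samples until a full `chunk_duration_s` window is available, hands that window to the pipeline, and keeps the remainder for the next call. The buffering alone can be sketched without NumPy or Gradio as follows. `ChunkBuffer` is a hypothetical class that uses plain lists in place of the app's `np.ndarray` buffer.

```python
class ChunkBuffer:
    """Fixed-window sample buffer (hypothetical stand-in for LiveStreamState)."""

    def __init__(self, chunk_duration_s: float, sample_rate: int):
        self.required = int(chunk_duration_s * sample_rate)
        self.samples: list[float] = []

    def push(self, new_samples):
        """Add samples; return one full window if available, else None."""
        self.samples.extend(new_samples)
        if len(self.samples) < self.required:
            return None
        chunk = self.samples[:self.required]
        # Keep the overflow so no audio is dropped between windows.
        self.samples = self.samples[self.required:]
        return chunk
```

As in the removed code, at most one window is emitted per call, so a burst larger than two windows drains over several calls.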
1311
-
1312
-
1313
- def _format_live_log(state, buffered_s=None):
1314
- lines = [f"**Chunks processed:** {state.chunk_count}"]
1315
- if state.chunk_count > 0:
1316
- lines.append(f"**Avg time:** {state.total_time / state.chunk_count:.1f}s/chunk")
1317
- if buffered_s is not None:
1318
- lines.append(f"**Buffering:** {buffered_s:.1f}s / {state.chunk_duration_s}s")
1319
- lines.append("\n---\n**Live transcript:**\n")
1320
- start = max(0, len(state.transcript_pairs) - 8)
1321
- for i in range(start, len(state.transcript_pairs)):
1322
- en, tgt = state.transcript_pairs[i]
1323
- lines.append(f"**[{i+1}]** EN: {en[:100]}")
1324
- lines.append(f" {state.target_language}: {tgt[:100]}\n")
1325
- return "\n".join(lines)
1326
-
1327
-
1328
- def clear_live_stream():
1329
- return None, "Stream cleared. Start recording to begin.", LiveStreamState()
1330
-
1331
-
1332
- # =============================================================================
1333
- # Gradio UI
1334
- # =============================================================================
1335
-
1336
- EXAMPLES = [
1337
- "And it's a brilliant goal from the striker!",
1338
- "The referee has shown a yellow card. Corner kick for the home team.",
1339
- "What a save by the goalkeeper! The match is heading into injury time.",
1340
- "He dribbles past two defenders and shoots! The ball hits the back of the net!",
1341
- ]
1342
 
1343
- CSS = """
1344
- .main-header { text-align: center; margin-bottom: 0.5rem; }
1345
- .main-header h1 { font-size: 1.8rem; font-weight: 700; margin: 0; }
1346
- .main-header p { color: #666; font-size: 0.95rem; }
1347
- .lang-group-label { font-weight: 600; font-size: 0.85rem; color: #888; text-transform: uppercase; letter-spacing: 0.05em; margin-top: 0.5rem; }
1348
- """
1349
 
1350
- with gr.Blocks(
1351
- title="PlotWeaver — Live Commentary Translation",
1352
- theme=gr.themes.Soft(),
1353
- css=CSS,
1354
- ) as demo:
1355
 
1356
- gr.HTML("""
1357
- <div class="main-header">
1358
- <h1>PlotWeaver</h1>
1359
- <p>Live commentary translation platform &mdash; English to 40+ languages</p>
1360
- <p style="font-size:0.8rem; color:#999">Qwen Omni (11 languages) + YourVoic API + NLLB-200 (27 languages)</p>
1361
- </div>
1362
- """)
 
 
 
 
 
 
 
 
 
 
 
1363
 
 
1364
  with gr.Tabs():
1365
 
1366
- # ====== TAB 1: EVENT MANAGEMENT ======
1367
- with gr.TabItem("Event Management"):
1368
- gr.Markdown("### Create new event")
1369
- gr.Markdown("Configure your live broadcast event with target languages and input source.")
1370
-
1371
  with gr.Row():
1372
- with gr.Column(scale=2):
1373
- event_name = gr.Textbox(
1374
- label="Event name",
1375
- placeholder="e.g. Premier League: Arsenal vs. Chelsea",
1376
- )
1377
- with gr.Row():
1378
- start_time = gr.Textbox(label="Start time", placeholder="08:30 PM")
1379
- end_time = gr.Textbox(label="End time", placeholder="10:30 PM")
1380
- event_date = gr.Textbox(label="Date", placeholder="2026-06-06")
1381
-
1382
- gr.Markdown("#### Input source")
1383
- input_method = gr.Radio(
1384
- choices=["RTMP Stream", "WebRTC (Browser)", "Direct Audio Feed"],
1385
- value="RTMP Stream",
1386
- label="Input method",
1387
- )
1388
-
1389
- gr.Markdown("#### Target languages")
1390
- gr.Markdown("Select languages for simultaneous broadcast. Additional languages consume more stream minutes.")
1391
-
1392
- # Language checkboxes grouped by category
1393
- target_langs = gr.CheckboxGroup(
1394
- choices=ALL_LANGUAGE_NAMES,
1395
- label="Languages",
1396
- value=["Spanish"],
1397
- )
1398
-
1399
- with gr.Column(scale=1):
1400
- gr.Markdown("#### Estimate summary")
1401
- estimate_display = gr.Markdown(
1402
- value="**Event:** Not configured\n\n**Languages:** 1 selected\n\n**Estimated duration:** --\n\n**Total estimate:** --"
1403
- )
1404
- create_event_btn = gr.Button("Create Event", variant="primary", size="lg")
1405
- event_status = gr.Markdown("")
1406
-
1407
- def update_estimate(name, langs, start, end):
1408
- n_langs = len(langs) if langs else 0
1409
- lang_list = ", ".join(langs) if langs else "None"
1410
- return (
1411
- f"**Event:** {name or 'Not set'}\n\n"
1412
- f"**Languages:** {n_langs} selected\n\n"
1413
- f"{lang_list}\n\n"
1414
- f"**Input:** Configured\n\n"
1415
- f"**Rate:** 1x (Standard)"
1416
- )
1417
-
1418
- for inp in [event_name, target_langs, start_time, end_time]:
1419
- inp.change(
1420
- fn=update_estimate,
1421
- inputs=[event_name, target_langs, start_time, end_time],
1422
- outputs=[estimate_display],
1423
- )
1424
-
1425
- def create_event(name, langs):
1426
- if not name:
1427
- return "Please enter an event name."
1428
- if not langs:
1429
- return "Please select at least one language."
1430
- return f"Event **{name}** created with {len(langs)} languages: {', '.join(langs)}"
1431
-
1432
- create_event_btn.click(
1433
- fn=create_event,
1434
- inputs=[event_name, target_langs],
1435
- outputs=[event_status],
1436
- )
1437
-
1438
- # ====== TAB 2: LIVE STUDIO ======
1439
- with gr.TabItem("Live Studio"):
1440
- gr.Markdown("### Live Studio")
1441
-
1442
- with gr.Tabs():
1443
- # ── Live Streaming Mode ──
1444
- with gr.TabItem("🔴 Live Translation"):
1445
- gr.Markdown(
1446
- "**Speak English into the mic** — the pipeline translates and plays back in your target language "
1447
- "every 5 seconds. The transcript updates live below."
1448
- )
1449
-
1450
- live_state = gr.State(LiveStreamState())
1451
-
1452
- with gr.Row():
1453
- live_lang = gr.Dropdown(
1454
- choices=ALL_LANGUAGE_NAMES, value="Spanish",
1455
- label="Target language",
1456
- )
1457
- live_voice = gr.Dropdown(
1458
- choices=get_voices_for_language("Spanish"),
1459
- value=get_voices_for_language("Spanish")[0],
1460
- label="Voice",
1461
- )
1462
-
1463
- live_lang.change(fn=update_voices, inputs=[live_lang], outputs=[live_voice])
1464
-
1465
- with gr.Row():
1466
- with gr.Column():
1467
- live_mic = gr.Audio(
1468
- label="Microphone (streaming)",
1469
- type="numpy",
1470
- sources=["microphone"],
1471
- streaming=True,
1472
  )
1473
- live_clear_btn = gr.Button("Clear & Reset", variant="secondary")
1474
-
1475
- with gr.Column():
1476
- live_audio_out = gr.Audio(
1477
- label="Translated output",
1478
- type="numpy",
1479
- autoplay=True,
1480
  )
1481
- live_log = gr.Markdown(value="Click the mic to start recording.")
1482
-
1483
- live_mic.stream(
1484
- fn=live_stream_process,
1485
- inputs=[live_mic, live_lang, live_voice, live_state],
1486
- outputs=[live_audio_out, live_log, live_state],
1487
- time_limit=600,
1488
- stream_every=1.0,
1489
- )
1490
 
1491
- live_clear_btn.click(
1492
- fn=clear_live_stream,
1493
- outputs=[live_audio_out, live_log, live_state],
 
 
1494
  )
1495
 
1496
- # ── Batch Upload Mode ──
1497
- with gr.TabItem("Upload / Record"):
1498
- gr.Markdown("Upload or record a full audio clip, then translate it all at once.")
1499
-
1500
- with gr.Row():
1501
- st_lang = gr.Dropdown(
1502
- choices=ALL_LANGUAGE_NAMES, value="Spanish",
1503
- label="Target language",
1504
- )
1505
- st_voice = gr.Dropdown(
1506
- choices=get_voices_for_language("Spanish"),
1507
- value=get_voices_for_language("Spanish")[0],
1508
- label="Voice",
1509
- )
1510
-
1511
- st_lang.change(fn=update_voices, inputs=[st_lang], outputs=[st_voice])
1512
-
1513
- with gr.Row():
1514
- with gr.Column():
1515
- st_in = gr.Audio(
1516
- label="English audio",
1517
- type="numpy",
1518
- sources=["upload", "microphone"],
1519
- )
1520
- st_btn = gr.Button("Translate", variant="primary", size="lg")
1521
- with gr.Column():
1522
- st_out = gr.Audio(label="Translated audio", type="numpy", autoplay=True)
1523
- st_log = gr.Markdown()
1524
-
1525
- st_btn.click(
1526
- fn=full_pipeline_audio,
1527
- inputs=[st_in, st_lang],
1528
- outputs=[st_out, st_log],
1529
- )
1530
 
1531
- # ====== TAB 3: VIDEO DUBBING ======
1532
- with gr.TabItem("Video Dubbing"):
1533
- gr.Markdown("### Video dubbing (English → multi-language)")
1534
  gr.Markdown(
1535
- "Upload a video with English commentary and get back a dubbed version. "
1536
- "**Global languages** (Arabic, French, Spanish, etc.) use Qwen Omni for best quality. "
1537
- "**African/regional languages** use YourVoic API with NLLB translation."
 
 
 
1538
  )
1539
 
1540
  with gr.Row():
1541
  with gr.Column():
1542
- dub_video_in = gr.Video(label="Upload English video", sources=["upload"])
1543
- dub_languages = gr.CheckboxGroup(
1544
- choices=ALL_LANGUAGE_NAMES,
1545
- label="Target languages",
1546
- value=["Spanish"],
 
 
 
 
1547
  )
1548
- with gr.Row():
1549
- dub_voice = gr.Dropdown(
1550
- choices=QWEN_VOICES,
1551
- value="Ethan",
1552
- label="Voice (for Qwen languages)",
1553
- info="Applies to Arabic, French, Spanish, etc. Local languages use default voice.",
1554
- )
1555
- dub_chunk_slider = gr.Slider(
1556
- minimum=30, maximum=300, value=120, step=10,
1557
- label="Chunk duration (seconds)",
1558
- info="Shorter = more API calls but less timeout risk.",
1559
- )
1560
- dub_btn = gr.Button("Dub Video", variant="primary", size="lg")
1561
 
1562
  with gr.Column():
1563
- dub_video_out = gr.Video(label="Dubbed video (download from player)")
1564
- dub_log = gr.Markdown(
1565
- label="Processing log",
1566
- value="Upload a video and select languages to start."
 
1567
  )
1568
-
1569
- dub_btn.click(
1570
- fn=dub_video,
1571
- inputs=[dub_video_in, dub_languages, dub_voice, dub_chunk_slider],
1572
- outputs=[dub_video_out, dub_log],
 
 
 
 
 
 
 
 
1573
  )
1574
 
1575
- # ====== TAB 4: TEXT TRANSLATION ======
1576
- with gr.TabItem("Text \u2192 Audio"):
1577
- gr.Markdown("### Text to translated speech")
1578
- gr.Markdown("Type English text, choose a language, and hear the translated audio.")
1579
-
1580
- with gr.Row():
1581
- text_language = gr.Dropdown(
1582
- choices=ALL_LANGUAGE_NAMES,
1583
- value="Spanish",
1584
- label="Target language",
1585
- )
1586
- text_voice = gr.Dropdown(
1587
- choices=get_voices_for_language("Spanish"),
1588
- value=get_voices_for_language("Spanish")[0],
1589
- label="Voice",
1590
- )
1591
-
1592
- text_language.change(
1593
- fn=update_voices,
1594
- inputs=[text_language],
1595
- outputs=[text_voice],
1596
  )
1597
 
1598
- with gr.Row():
1599
- with gr.Column():
1600
- text_input = gr.Textbox(
1601
- label="English text",
1602
- placeholder="Type English football commentary here...",
1603
- lines=4,
1604
- )
1605
- text_btn = gr.Button("Translate to speech", variant="primary", size="lg")
1606
- gr.Examples(
1607
- examples=[[e] for e in EXAMPLES],
1608
- inputs=[text_input],
1609
- label="Example commentary",
1610
- )
1611
-
1612
- with gr.Column():
1613
- text_audio_out = gr.Audio(label="Translated audio", type="numpy", autoplay=True)
1614
- text_log = gr.Markdown(label="Pipeline log")
1615
-
1616
- text_btn.click(
1617
- fn=full_pipeline_text,
1618
- inputs=[text_input, text_language, text_voice],
1619
- outputs=[text_audio_out, text_log],
1620
  )
1621
 
1622
- # ====== TAB 5: RECORDINGS ======
1623
- with gr.TabItem("Recordings & Clips"):
1624
- gr.Markdown("### Recordings management")
1625
- gr.Markdown(
1626
- "Past dubbed recordings will appear here. "
1627
- "This feature is coming soon — for now, use Video Dubbing to create new recordings "
1628
- "and download them from the player."
1629
  )
1630
 
1631
- # ====== TAB 6: VOICE MODELS ======
1632
- with gr.TabItem("Voice Models"):
1633
- gr.Markdown("### Voice model library")
1634
- gr.Markdown("Browse available voices for each language.")
1635
 
1636
- voice_lang_select = gr.Dropdown(
1637
- choices=ALL_LANGUAGE_NAMES,
1638
- value="Spanish",
1639
- label="Select language",
1640
- )
1641
- voice_info = gr.Markdown()
1642
-
1643
- def show_voice_info(lang):
1644
- config = LANGUAGES.get(lang, {})
1645
- engine = config.get("tts_engine", "unknown")
1646
- voices = config.get("yourvoic_voices", [])
1647
-
1648
- info = f"### {lang}\n\n"
1649
- if engine == "qwen":
1650
- info += f"**Engine:** Qwen 3.5 Omni (end-to-end speech-to-speech)\n\n"
1651
- info += f"This is the highest quality option. Qwen handles ASR + translation + TTS in a single API call, "
1652
- info += f"preserving tone, emotion, and pacing from the original speaker.\n\n"
1653
- info += f"**Available voices ({len(QWEN_VOICES)}):** {', '.join(QWEN_VOICES[:10])}... and {len(QWEN_VOICES)-10} more\n\n"
1654
- info += f"All voices support all Qwen languages."
1655
- elif engine == "yourvoic":
1656
- info += f"**Engine:** YourVoic API (TTS) + NLLB-200 (translation)\n\n"
1657
- info += f"**YourVoic language:** `{config.get('yourvoic_lang', 'N/A')}`\n\n"
1658
- info += f"**Available voices:** {', '.join(voices) if voices else 'Peter (default)'}"
1659
- else:
1660
- info += f"**Engine:** Not available\n\n"
1661
- info += f"**NLLB code:** `{config.get('nllb', 'N/A')}`\n\n"
1662
- info += "Uses locally fine-tuned models on GPU. Voice selection not available."
1663
-
1664
- return info
1665
-
1666
- voice_lang_select.change(fn=show_voice_info, inputs=[voice_lang_select], outputs=[voice_info])
1667
- demo.load(fn=show_voice_info, inputs=[voice_lang_select], outputs=[voice_info])
1668
-
1669
- gr.Markdown("""
1670
- ---
1671
- **PlotWeaver** by PlotweaverAI | Models:
1672
- [ASR](https://huggingface.co/PlotweaverAI/whisper-small-de-en) |
1673
- [MT](https://huggingface.co/PlotweaverAI/nllb-200-distilled-600M-african-6lang) |
1674
- [TTS](https://yourvoic.com) |
1675
- [Qwen Omni](https://www.alibabacloud.com/help/en/model-studio/qwen-omni)
1676
- """)
1677
 
1678
 
1679
  if __name__ == "__main__":
1680
- demo.launch()
 
1
  """
2
+ Live Football Commentary Translator
3
+ ====================================
4
+ Audio in (live commentator) -> Translate -> Audio out (target language).
5
+
6
+ Two modes:
7
+ 1. Single clip: record/upload, click translate, hear result.
8
+ 2. Continuous live: start recording, speak naturally, translations queue up
9
+ and play sequentially. Energy-based VAD chunks speech at ~0.8s pauses.
10
+
11
+ Engines:
12
+ - Qwen-Omni (qwen3.5-omni-plus) handles audio-in -> translated-speech-out
13
+ in ONE call for languages it covers (English, German, Spanish, Arabic,
14
+ Scottish-accented English).
15
+ - For African target languages (Swahili, Amharic, Afrikaans), Qwen-Omni
16
+ does audio -> translated text, then YourVoic does text -> speech.
17
+
18
+ Deploy as a Hugging Face Space (SDK: Gradio). Add these secrets:
19
+ - DASHSCOPE_API_KEY (required, for Qwen-Omni)
20
+ - YOURVOIC_API_KEY (required for Swahili/Amharic/Afrikaans targets)
21
  """
22
 
23
+ import os
24
+ import base64
25
+ import json
26
+ import struct
27
+ import subprocess
28
+ import tempfile
29
+ import threading
30
+ import time
31
+ import uuid
32
+ import queue
33
+ from dataclasses import dataclass, field
34
+ from typing import Optional
35
+
36
+ import numpy as np
37
+ import gradio as gr
38
+ import requests as http_requests
39
+ from openai import OpenAI
40
+
41
+ # ==========================================
42
+ # CONFIGURATION
43
+ # ==========================================
44
+ OMNI_MODEL = "qwen3.5-omni-plus"
45
+ DASHSCOPE_BASE_URL = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
46
+
47
+ YOURVOIC_TTS_URL = "https://yourvoic.com/api/v1/tts/generate"
48
+
49
+ # Continuous-mode VAD tuning
50
+ VAD_SILENCE_SEC = 0.8 # pause length that ends an utterance
51
+ VAD_MIN_UTTERANCE_SEC = 1.2 # don't send anything shorter than this
52
+ VAD_MAX_UTTERANCE_SEC = 12.0 # force-flush if user keeps talking
53
+ VAD_RMS_THRESHOLD = 0.015 # RMS above this = voice. Lower = more sensitive.
54
+
55
+ # Poll interval for the output drain loop
56
+ OUTPUT_POLL_SEC = 0.3
57
+
58
+ # ==========================================
59
  # LANGUAGES
60
+ # ==========================================
61
+ SOURCE_LANGUAGES = {
62
+ "English": {"code": "en", "omni_hint": "English"},
63
+ "Scottish English": {"code": "en-scot", "omni_hint": "Scottish-accented English"},
64
+ "German": {"code": "de", "omni_hint": "German"},
65
+ "Spanish": {"code": "es", "omni_hint": "Spanish"},
66
+ "Arabic": {"code": "ar", "omni_hint": "Arabic"},
67
  }
68
 
69
+ TARGET_LANGUAGES = {
70
+ "English": {"engine": "qwen", "omni_hint": "English"},
71
+ "Scottish English": {"engine": "qwen", "omni_hint": "Scottish-accented English"},
72
+ "German": {"engine": "qwen", "omni_hint": "German"},
73
+ "Spanish": {"engine": "qwen", "omni_hint": "Spanish"},
74
+ "Arabic": {"engine": "qwen", "omni_hint": "Arabic"},
75
+ "Swahili": {"engine": "yourvoic", "omni_hint": "Swahili", "yourvoic_lang": "sw-KE"},
76
+ "Amharic": {"engine": "yourvoic", "omni_hint": "Amharic", "yourvoic_lang": "am-ET"},
77
+ "Afrikaans": {"engine": "yourvoic", "omni_hint": "Afrikaans", "yourvoic_lang": "af-ZA"},
78
  }
79
 
80
+ QWEN_VOICES = [
81
+ "Ethan -- Warm, energetic (good default)",
82
+ "Ryan -- Dramatic, rhythmic (good for live action)",
83
+ "Cherry -- Sunny, friendly",
84
+ "Jennifer -- Cinematic narrator",
85
+ "Vincent -- Rich, theatrical",
86
+ "Bellona -- Strong, commanding",
87
+ ]
88
 
89
+ YOURVOIC_VOICE_MAP = {
90
+ "Swahili": ["Peter"],
91
+ "Amharic": ["Peter"],
92
+ "Afrikaans": ["Peter"],
93
+ }
94
 
95
+ YOURVOIC_MODEL = "aura-prime"
96
+
97
+ # ==========================================
98
+ # HELPERS
99
+ # ==========================================
100
+ def voice_name(label: str) -> str:
101
+ return label.split("--")[0].strip()
102
+
103
+
104
+ def write_wav(samples: np.ndarray, sample_rate: int, output_path: str) -> None:
105
+ """Write a numpy int16/float audio array to a WAV file."""
106
+ if samples.dtype == np.float32 or samples.dtype == np.float64:
107
+ samples = np.clip(samples, -1.0, 1.0)
108
+ samples = (samples * 32767).astype(np.int16)
109
+ elif samples.dtype != np.int16:
110
+ samples = samples.astype(np.int16)
111
+ if samples.ndim > 1:
112
+ samples = samples.mean(axis=1).astype(np.int16)
113
+
114
+ nc, bps = 1, 16
115
+ sr = sample_rate
116
+ br = sr * nc * bps // 8
117
+ ba = nc * bps // 8
118
+ raw = samples.tobytes()
119
+ ds = len(raw)
120
+ with open(output_path, "wb") as f:
121
+ f.write(b"RIFF"); f.write(struct.pack("<I", 36 + ds))
122
+ f.write(b"WAVE"); f.write(b"fmt ")
123
+ f.write(struct.pack("<I", 16)); f.write(struct.pack("<H", 1))
124
+ f.write(struct.pack("<H", nc)); f.write(struct.pack("<I", sr))
125
+ f.write(struct.pack("<I", br)); f.write(struct.pack("<H", ba))
126
+ f.write(struct.pack("<H", bps)); f.write(b"data")
127
+ f.write(struct.pack("<I", ds)); f.write(raw)
128
 
129
 
130
+ def base64_to_wav(b64_data: str, output_path: str) -> None:
131
+ """Qwen-Omni returns base64 PCM @ 24kHz. Wrap in WAV container."""
132
+ audio_bytes = base64.b64decode(b64_data)
133
+ sr, nc, bps = 24000, 1, 16
134
+ br = sr * nc * bps // 8
135
+ ba = nc * bps // 8
136
+ ds = len(audio_bytes)
137
+ with open(output_path, "wb") as f:
138
+ f.write(b"RIFF"); f.write(struct.pack("<I", 36 + ds))
139
+ f.write(b"WAVE"); f.write(b"fmt ")
140
+ f.write(struct.pack("<I", 16)); f.write(struct.pack("<H", 1))
141
+ f.write(struct.pack("<H", nc)); f.write(struct.pack("<I", sr))
142
+ f.write(struct.pack("<I", br)); f.write(struct.pack("<H", ba))
143
+ f.write(struct.pack("<H", bps)); f.write(b"data")
144
+ f.write(struct.pack("<I", ds)); f.write(audio_bytes)
145
+
146
+
147
+ def normalize_audio_file(input_path: str, out_dir: str) -> str:
148
+ """Convert any audio file to 16kHz mono WAV (what Omni expects)."""
149
+ out_path = os.path.join(out_dir, f"in_{uuid.uuid4().hex[:8]}.wav")
150
+ subprocess.run(
151
+ ["ffmpeg", "-y", "-i", input_path,
152
+ "-ar", "16000", "-ac", "1", "-acodec", "pcm_s16le", out_path],
153
+ capture_output=True, check=True,
154
  )
155
+ return out_path
156
+
157
+
158
+ def audio_file_to_data_uri(path: str) -> str:
+     # Read inside a context manager so the file handle is closed promptly.
+     with open(path, "rb") as f:
+         b64 = base64.b64encode(f.read()).decode()
+     return f"data:audio/wav;base64,{b64}"
161
+
162
+
163
+ # ==========================================
164
+ # CORE: Qwen-Omni audio -> translated speech (one call)
165
+ # ==========================================
166
+ def omni_audio_to_speech(client: OpenAI,
167
+ audio_path: str,
168
+ source_hint: str,
169
+ target_hint: str,
170
+ voice: str,
171
+ out_dir: str) -> tuple:
172
+ audio_uri = audio_file_to_data_uri(audio_path)
173
+
174
+ sys_prompt = (
175
+ f"You are a live football commentary translator. "
176
+ f"The user will speak in {source_hint}. "
177
+ f"Listen carefully and respond by speaking the equivalent commentary in {target_hint}. "
178
+ f"Match the energy and excitement of live football commentary. "
179
+ f"Keep the same meaning. Do NOT add commentary of your own. "
180
+ f"Respond ONLY with the spoken {target_hint} translation."
181
  )
182
 
 
183
  try:
184
+ completion = client.chat.completions.create(
185
+ model=OMNI_MODEL,
186
+ messages=[
187
+ {"role": "system", "content": sys_prompt},
188
+ {"role": "user", "content": [
189
+ {"type": "input_audio",
190
+ "input_audio": {"data": audio_uri, "format": "wav"}},
191
+ {"type": "text",
192
+ "text": f"Translate this commentary into {target_hint} and speak it."},
193
+ ]},
194
+ ],
195
+ modalities=["text", "audio"],
196
+ audio={"voice": voice, "format": "wav"},
197
+ stream=True,
198
+ stream_options={"include_usage": True},
199
  )
200
 
201
+ audio_parts, text_parts = [], []
202
+ for event in completion:
203
+ if not event.choices:
204
+ continue
205
+ delta = event.choices[0].delta
206
+ if hasattr(delta, "content") and delta.content:
207
+ text_parts.append(delta.content)
208
+ if hasattr(delta, "audio") and delta.audio:
209
+ if isinstance(delta.audio, dict) and "data" in delta.audio:
210
+ audio_parts.append(delta.audio["data"])
211
+ elif hasattr(delta.audio, "data") and delta.audio.data:
212
+ audio_parts.append(delta.audio.data)
213
+
214
+ transcript = "".join(text_parts).strip()
215
+ if not audio_parts:
216
+ return None, transcript, "No audio received from Qwen-Omni"
217
+
218
+ out_wav = os.path.join(out_dir, f"out_{uuid.uuid4().hex[:8]}.wav")
219
+ base64_to_wav("".join(audio_parts), out_wav)
220
+ return out_wav, transcript, None
221
 
222
+ except Exception as e:
223
+ return None, "", f"Qwen-Omni error: {e}"
224
 
225
 
226
+ def omni_audio_to_text(client: OpenAI,
227
+ audio_path: str,
228
+ source_hint: str,
229
+ target_hint: str) -> tuple:
230
+ audio_uri = audio_file_to_data_uri(audio_path)
231
 
232
+ sys_prompt = (
233
+ f"You are a translator. The user will speak in {source_hint}. "
234
+ f"Translate what they say into {target_hint}. "
235
+ f"Output ONLY the {target_hint} translation as plain text. No commentary, no quotes."
236
+ )
237
 
238
+ try:
239
+ completion = client.chat.completions.create(
240
+ model=OMNI_MODEL,
241
+ messages=[
242
+ {"role": "system", "content": sys_prompt},
243
+ {"role": "user", "content": [
244
+ {"type": "input_audio",
245
+ "input_audio": {"data": audio_uri, "format": "wav"}},
246
+ {"type": "text",
247
+ "text": f"Translate into {target_hint}."},
248
+ ]},
249
+ ],
250
+ modalities=["text"],
251
+ )
252
+ text = completion.choices[0].message.content.strip()
253
+ return text, None
254
+ except Exception as e:
255
+ return "", f"Qwen-Omni translation error: {e}"
256
+
257
+
258
+ def yourvoic_speak(text: str,
259
+ target_language: str,
260
+ target_config: dict,
261
+ api_key: str,
262
+ out_dir: str) -> tuple:
263
+ yourvoic_lang = target_config["yourvoic_lang"]
264
+ voices_to_try = list(YOURVOIC_VOICE_MAP.get(target_language, ["Peter"]))
265
+ if "Peter" not in voices_to_try:
266
+ voices_to_try.append("Peter")
267
+
268
+ last_error = None
269
+ for voice in voices_to_try:
270
+ payload = {
271
+ "text": text,
272
+ "voice": voice,
273
+ "language": yourvoic_lang,
274
+ "model": YOURVOIC_MODEL,
275
+ "speed": 1.0,
276
+ }
277
  try:
278
+ resp = http_requests.post(
279
+ YOURVOIC_TTS_URL,
280
+ json=payload,
281
+ headers={"X-API-Key": api_key, "Content-Type": "application/json"},
282
+ timeout=60,
283
+ )
284
+ if resp.status_code != 200:
285
+ last_error = f"YourVoic {resp.status_code}: {resp.text[:200]}"
286
+ if "voice" in resp.text.lower() or resp.status_code == 400:
287
+ continue
288
+ return None, last_error
289
+
290
+ ctype = resp.headers.get("Content-Type", "")
291
+ ext = "mp3" if "mp3" in ctype.lower() else "wav"
292
+ raw_path = os.path.join(out_dir, f"yv_{uuid.uuid4().hex[:8]}.{ext}")
293
+
294
+ if "application/json" in ctype:
295
+ data = resp.json()
296
+ audio_url = data.get("audio_url") or data.get("url")
297
+ if not audio_url:
298
+ return None, "No audio URL in YourVoic response"
299
+ audio_resp = http_requests.get(audio_url, timeout=60)
+ audio_resp.raise_for_status()  # fail here rather than saving an HTTP error body as audio
300
+ with open(raw_path, "wb") as f:
301
+ f.write(audio_resp.content)
302
+ else:
303
+ with open(raw_path, "wb") as f:
304
+ f.write(resp.content)
305
+
306
+ wav_path = os.path.join(out_dir, f"yv_{uuid.uuid4().hex[:8]}.wav")
307
+ subprocess.run(
308
+ ["ffmpeg", "-y", "-i", raw_path,
309
+ "-ar", "24000", "-ac", "1", "-acodec", "pcm_s16le", wav_path],
310
+ capture_output=True, check=True,
311
+ )
312
+ return wav_path, None
313
 
314
  except Exception as e:
315
+ last_error = f"YourVoic exception: {e}"
316
  continue
317
 
318
+ return None, last_error or "YourVoic failed for all candidate voices"
 
 
 
319
 
 
320
 
321
+ # ==========================================
322
+ # SHARED TRANSLATION (used by both modes)
323
+ # ==========================================
324
+ def translate_audio_file(audio_file: str,
325
+ source_language: str,
326
+ target_language: str,
327
+ qwen_voice_label: str,
328
+ work_dir: str) -> tuple:
329
+ """Run audio_file through the pipeline. Returns (wav_path, transcript, error)."""
330
+ ds_key = os.environ.get("DASHSCOPE_API_KEY", "")
331
+ if not ds_key:
332
+ return None, "", "DASHSCOPE_API_KEY not set"
333
 
334
+ src_config = SOURCE_LANGUAGES[source_language]
335
+ tgt_config = TARGET_LANGUAGES[target_language]
336
+ client = OpenAI(api_key=ds_key, base_url=DASHSCOPE_BASE_URL)
337
 
338
+ try:
339
+ norm_path = normalize_audio_file(audio_file, work_dir)
340
+ except subprocess.CalledProcessError as e:
341
+ return None, "", f"ffmpeg normalize failed: {(e.stderr or b'').decode()[:200]}"
342
 
343
+ engine = tgt_config["engine"]
344
 
345
+ if engine == "qwen":
346
+ voice = voice_name(qwen_voice_label)
347
+ return omni_audio_to_speech(
348
+ client, norm_path,
349
+ src_config["omni_hint"], tgt_config["omni_hint"],
350
+ voice, work_dir,
 
351
  )
352
 
353
+ elif engine == "yourvoic":
354
+ yv_key = os.environ.get("YOURVOIC_API_KEY", "")
355
+ if not yv_key:
356
+ return None, "", "YOURVOIC_API_KEY not set"
357
+ translated_text, err = omni_audio_to_text(
358
+ client, norm_path,
359
+ src_config["omni_hint"], tgt_config["omni_hint"],
360
+ )
361
+ if err or not translated_text:
362
+ return None, translated_text, err or "empty translation"
363
+ wav, yv_err = yourvoic_speak(
364
+ translated_text, target_language, tgt_config, yv_key, work_dir,
365
+ )
366
+ return wav, translated_text, yv_err
367
 
368
+ return None, "", f"Unknown engine '{engine}'"
 
 
 
 
 
 
 
369
 
370
 
371
+ # ==========================================
372
+ # SINGLE-CLIP MODE
373
+ # ==========================================
374
+ def single_clip_translate(audio_input,
375
+ source_language: str,
376
+ target_language: str,
377
+ qwen_voice_label: str):
378
+ """Yield (audio_path, status_markdown, transcript) as work progresses."""
379
+ if audio_input is None:
380
+ yield None, "**Status:** no audio provided.", ""
381
  return
382
 
383
  t0 = time.time()
384
+ work_dir = tempfile.mkdtemp(prefix="commentary_single_")
385
+ yield None, f"**Status:** translating {source_language} -> {target_language}...", ""
386
+ wav, transcript, err = translate_audio_file(
387
+ audio_input, source_language, target_language, qwen_voice_label, work_dir,
388
  )
389
+ if err:
390
+ yield None, f"**Error:** {err}", transcript or ""
391
+ return
392
  elapsed = time.time() - t0
393
+ yield wav, f"**Done in {elapsed:.1f}s** - {source_language} -> {target_language}", transcript or ""
394
+
395
+
396
+ # ==========================================
397
+ # CONTINUOUS MODE -- per-session state
398
+ # ==========================================
399
+ @dataclass
400
+ class LiveSession:
401
+ """Holds per-session state for continuous-mode streaming."""
402
+ work_dir: str
403
+ source_language: str
404
+ target_language: str
405
+ qwen_voice_label: str
406
+ buffer: list = field(default_factory=list) # list of float32 numpy chunks
407
+ sample_rate: int = 16000
408
+ last_voice_ts: float = 0.0
409
+ in_utterance: bool = False
410
+ utterance_start_ts: float = 0.0
411
+ output_queue: "queue.Queue" = field(default_factory=queue.Queue)
412
+ transcripts: list = field(default_factory=list)
413
+ error_msg: str = ""
414
+ closed: bool = False
415
+
416
+
417
+ def make_session(source_language: str,
418
+ target_language: str,
419
+ qwen_voice_label: str) -> LiveSession:
420
+ return LiveSession(
421
+ work_dir=tempfile.mkdtemp(prefix="commentary_live_"),
422
+ source_language=source_language,
423
+ target_language=target_language,
424
+ qwen_voice_label=qwen_voice_label,
425
+ )
426
+
427
 
428
+ def session_translate_utterance(session: LiveSession, utterance_samples: np.ndarray) -> None:
429
+ """Background thread: translates one utterance, enqueues result."""
430
  try:
431
+ utt_path = os.path.join(session.work_dir, f"utt_{uuid.uuid4().hex[:8]}.wav")
432
+ write_wav(utterance_samples, session.sample_rate, utt_path)
433
 
434
+ wav, transcript, err = translate_audio_file(
435
+ utt_path, session.source_language, session.target_language,
436
+ session.qwen_voice_label, session.work_dir,
437
  )
438
+ if err:
439
+ session.error_msg = err
440
+ return
441
+ if wav:
442
+ session.output_queue.put({"wav": wav, "transcript": transcript or ""})
 
 
 
 
443
  except Exception as e:
444
+ session.error_msg = f"Background translate error: {e}"
 
 
 
 
 
 
 
445
 
 
 
446
 
447
+ def session_process_chunk(session: LiveSession,
448
+ sample_rate: int,
449
+ chunk: np.ndarray) -> None:
450
+ """Called per streaming audio chunk. Updates session state,
451
+ fires utterance to background translation when silence detected."""
452
+ if session.closed:
453
+ return
454
 
455
+ # Normalize to float32 mono
456
+ if chunk.ndim > 1:
457
+ chunk = chunk.mean(axis=1)
458
+ if chunk.dtype == np.int16:
459
+ chunk = chunk.astype(np.float32) / 32768.0
460
+ elif chunk.dtype != np.float32:
461
+ chunk = chunk.astype(np.float32)
462
+
463
+ # Resample if mic sample rate != 16kHz (cheap linear interp)
464
+ if sample_rate != session.sample_rate:
465
+ ratio = session.sample_rate / sample_rate
466
+ n_out = int(len(chunk) * ratio)
467
+ if n_out > 0:
468
+ chunk = np.interp(
469
+ np.linspace(0, len(chunk) - 1, n_out),
470
+ np.arange(len(chunk)),
471
+ chunk,
472
+ ).astype(np.float32)
473
+
474
+ now = time.time()
475
+
476
+ rms = float(np.sqrt(np.mean(chunk ** 2))) if len(chunk) > 0 else 0.0
477
+ is_voice = rms > VAD_RMS_THRESHOLD
478
+
479
+ if is_voice:
480
+ if not session.in_utterance:
481
+ session.in_utterance = True
482
+ session.utterance_start_ts = now
483
+ session.buffer = []
484
+ session.last_voice_ts = now
485
+ session.buffer.append(chunk)
486
+ else:
487
+ if session.in_utterance:
488
+ # Keep recording trailing silence so we don't cut mid-word
489
+ session.buffer.append(chunk)
490
 
491
+ if not session.in_utterance:
492
+ return
493
 
494
+ utt_dur = now - session.utterance_start_ts
495
+ silence_dur = now - session.last_voice_ts
 
 
 
 
496
 
497
+ should_flush = (
498
+ utt_dur >= VAD_MAX_UTTERANCE_SEC or
499
+ (silence_dur >= VAD_SILENCE_SEC and utt_dur >= VAD_MIN_UTTERANCE_SEC)
500
  )
 
 
 
 
 
 
501
 
502
+ if should_flush and session.buffer:
503
+ all_samples = np.concatenate(session.buffer)
504
+ session.buffer = []
505
+ session.in_utterance = False
506
+ threading.Thread(
507
+ target=session_translate_utterance,
508
+ args=(session, all_samples),
509
+ daemon=True,
510
+ ).start()
511
+
512
+
513
+ # ==========================================
514
+ # CONTINUOUS MODE -- Gradio handlers
515
+ # ==========================================
516
+ def live_start(source_language, target_language, qwen_voice_label):
517
+ """Click Start: validate keys, create session, reveal mic + drain timer."""
518
+ ds_key = os.environ.get("DASHSCOPE_API_KEY", "")
519
+ if not ds_key:
520
+ return (
521
+ None,
522
+ "**Error:** DASHSCOPE_API_KEY not set in Space secrets.",
523
+ gr.update(visible=False), gr.update(visible=False), gr.update(visible=True),
524
+ gr.update(active=False), "", None,
525
+ )
526
+ tgt_engine = TARGET_LANGUAGES.get(target_language, {}).get("engine")
527
+ if tgt_engine == "yourvoic" and not os.environ.get("YOURVOIC_API_KEY", ""):
528
+ return (
529
+ None,
530
+ f"**Error:** YOURVOIC_API_KEY required for {target_language}.",
531
+ gr.update(visible=False), gr.update(visible=False), gr.update(visible=True),
532
+ gr.update(active=False), "", None,
533
+ )
534
 
535
+ session = make_session(source_language, target_language, qwen_voice_label)
536
+ return (
537
+ session,
538
+ f"**Live session active** ({source_language} -> {target_language}). "
539
+ "Press the record button on the microphone below to begin speaking.",
540
+ gr.update(visible=True),
541
+ gr.update(visible=True),
542
+ gr.update(visible=False),
543
+ gr.update(active=True),
544
+ "",
545
+ None,
546
  )
547
 
 
 
 
 
 
548
 
549
+ def live_stop(session: Optional[LiveSession]):
550
+ """Click Stop: close session, hide mic, stop drain timer."""
551
+ if session is not None:
552
+ session.closed = True
553
+ return (
554
+ None,
555
+ "**Status:** session stopped. Click Start to begin a new one.",
556
+ gr.update(visible=False),
557
+ gr.update(visible=False),
558
+ gr.update(visible=True),
559
+ gr.update(active=False),
560
+ )
561
 
 
 
 
 
 
 
 
562
 
563
+ def live_on_stream(audio_chunk, session: Optional[LiveSession]):
564
+ """Called by streaming mic for every chunk. Must return State to keep it alive."""
565
+ if session is None or audio_chunk is None:
566
+ return session
567
+ try:
568
+ sample_rate, samples = audio_chunk
569
+ except (TypeError, ValueError):
570
+ return session
571
+ if samples is None:
572
+ return session
573
+ samples = np.asarray(samples)
574
+ if samples.size == 0:
575
+ return session
576
+ try:
577
+ session_process_chunk(session, sample_rate, samples)
578
+ except Exception as e:
579
+ session.error_msg = f"Stream chunk error: {e}"
580
+ return session
581
+
582
+
583
+ def live_drain(session: Optional[LiveSession]):
+     """gr.Timer tick. Pops at most one finished translation per tick (sequential play)."""
+     if session is None or session.closed:
+         return None, gr.update(), gr.update()
+
+     # On idle ticks, return gr.update() for the audio output so the player is
+     # left unchanged. Returning None would clear the clip that is still playing.
+     if session.error_msg:
+         msg = session.error_msg
+         session.error_msg = ""
+         return (
+             gr.update(),
+             gr.update(),
+             f"**Background error:** {msg}",
+         )
+
+     try:
+         item = session.output_queue.get_nowait()
+     except queue.Empty:
+         if session.in_utterance:
+             status = "**Status:** listening (in utterance)..."
+         else:
+             status = "**Status:** waiting for speech..."
+         return gr.update(), gr.update(), gr.update(value=status)
+
+     session.transcripts.append(item["transcript"])
+     transcript_md = "\n\n---\n\n".join(t for t in session.transcripts if t)
+     return item["wav"], transcript_md, "**Status:** playing translation"
 
 
 
 
 
609
 
 
 
610
 
611
+ # ==========================================
612
+ # UI
613
+ # ==========================================
614
+ DESCRIPTION = """
615
+ # Live Football Commentary Translator
616
 
617
+ Translate live commentary between languages.
 
 
618
 
619
+ **Sources:** English, Scottish English, German, Spanish, Arabic
620
+ **Targets:** all of the above + Swahili, Amharic, Afrikaans
621
 
622
+ Two modes -- pick a tab below:
623
+ - **Single clip:** record or upload one clip, get one translation.
624
+ - **Continuous live:** start a session, speak naturally, hear translations queued and played in order.
625
 
626
+ Latency on free ZeroGPU: roughly 3-8 seconds per utterance.
627
+ """
628
 
629
 
630
+ def on_target_change(target_lang_choice):
631
+ cfg = TARGET_LANGUAGES.get(target_lang_choice, {})
632
+ if cfg.get("engine") == "qwen":
633
+ return gr.update(visible=True)
634
+ return gr.update(visible=False)
635
 
 
 
 
 
 
 
636
 
637
+ with gr.Blocks(title="Live Football Commentary Translator") as demo:
638
+ gr.Markdown(DESCRIPTION)
 
 
 
639
 
640
+ # ===== Shared language controls =====
641
+ with gr.Row():
642
+ source_lang = gr.Dropdown(
643
+ choices=list(SOURCE_LANGUAGES.keys()),
644
+ value="English",
645
+ label="Source (what the commentator speaks)",
646
+ )
647
+ target_lang = gr.Dropdown(
648
+ choices=list(TARGET_LANGUAGES.keys()),
649
+ value="Swahili",
650
+ label="Target (what you want to hear)",
651
+ )
652
+ qwen_voice = gr.Dropdown(
653
+ choices=QWEN_VOICES,
654
+ value=QWEN_VOICES[0],
655
+ label="Voice (Qwen targets only)",
656
+ visible=False,
657
+ )
658
 
659
+ # ===== Tabs =====
660
  with gr.Tabs():
661
 
662
+ # ---- Tab 1: Single clip ----
663
+ with gr.Tab("Single clip"):
 
 
 
664
  with gr.Row():
665
+ with gr.Column():
666
+ with gr.Tabs():
667
+ with gr.Tab("Live microphone"):
668
+ mic_input = gr.Audio(
669
+ sources=["microphone"], type="filepath",
670
+ label="Speak your commentary (short bursts, 5-15s each)",
671
  )
672
+ mic_btn = gr.Button("Translate microphone clip", variant="primary")
673
+ with gr.Tab("Upload file"):
674
+ file_input = gr.Audio(
675
+ sources=["upload"], type="filepath",
676
+ label="Upload an audio clip (.wav, .mp3, .m4a, etc.)",
 
 
677
  )
678
+ file_btn = gr.Button("Translate uploaded clip", variant="primary")
 
 
 
 
 
 
 
 
679
 
680
+ with gr.Column():
681
+ single_status = gr.Markdown(value="*Waiting for input...*")
682
+ single_audio = gr.Audio(label="Translated audio", type="filepath", autoplay=True)
683
+ single_transcript = gr.Textbox(
684
+ label="Translated text", lines=4, interactive=False,
685
  )
686
 
687
+ mic_btn.click(
688
+ fn=single_clip_translate,
689
+ inputs=[mic_input, source_lang, target_lang, qwen_voice],
690
+ outputs=[single_audio, single_status, single_transcript],
691
+ )
692
+ file_btn.click(
693
+ fn=single_clip_translate,
694
+ inputs=[file_input, source_lang, target_lang, qwen_voice],
695
+ outputs=[single_audio, single_status, single_transcript],
696
+ )
697
 
698
+ # ---- Tab 2: Continuous live ----
699
+ with gr.Tab("Continuous live"):
 
700
  gr.Markdown(
701
+ "**How it works:**\n"
702
+ "1. Pick source and target languages above.\n"
703
+ "2. Click **Start Live Translation**.\n"
704
+ "3. Press the record button on the microphone that appears.\n"
705
+ "4. Speak naturally -- translations chunk at pauses and play in order.\n"
706
+ "5. Click **Stop** to end the session.\n"
707
  )
708
 
709
  with gr.Row():
710
  with gr.Column():
711
+ start_btn = gr.Button("Start Live Translation", variant="primary", size="lg")
712
+ stop_btn = gr.Button("Stop", variant="stop", visible=False)
713
+
714
+ live_mic = gr.Audio(
715
+ sources=["microphone"],
716
+ streaming=True,
717
+ type="numpy",
718
+ label="Live microphone (press record to begin streaming)",
719
+ visible=False,
720
  )
721
 
722
  with gr.Column():
723
+ live_status = gr.Markdown(value="*Click Start to begin.*")
724
+ live_audio = gr.Audio(
725
+ label="Translated audio (auto-plays each chunk in order)",
726
+ type="filepath",
727
+ autoplay=True,
728
  )
729
+ live_transcripts = gr.Markdown(value="", label="Translation log")
730
+
731
+ # Hidden state + drain timer
732
+ live_state = gr.State(value=None)
733
+ drain_timer = gr.Timer(value=OUTPUT_POLL_SEC, active=False)
734
+
735
+ start_btn.click(
736
+ fn=live_start,
737
+ inputs=[source_lang, target_lang, qwen_voice],
738
+ outputs=[
739
+ live_state, live_status, live_mic, stop_btn, start_btn,
740
+ drain_timer, live_transcripts, live_audio,
741
+ ],
742
  )
743
 
744
+ stop_btn.click(
745
+ fn=live_stop,
746
+ inputs=[live_state],
747
+ outputs=[
748
+ live_state, live_status, live_mic, stop_btn, start_btn, drain_timer,
749
+ ],
750
  )
751
 
752
+ live_mic.stream(
753
+ fn=live_on_stream,
754
+ inputs=[live_mic, live_state],
755
+ outputs=[live_state],
756
+ show_progress="hidden",
757
  )
758
 
759
+ drain_timer.tick(
760
+ fn=live_drain,
761
+ inputs=[live_state],
762
+ outputs=[live_audio, live_transcripts, live_status],
763
+ show_progress="hidden",
 
 
764
  )
765
 
766
+ # ===== Show/hide Qwen voice based on target =====
767
+ target_lang.change(fn=on_target_change, inputs=target_lang, outputs=qwen_voice)
768
+ demo.load(fn=on_target_change, inputs=target_lang, outputs=qwen_voice)
 
769
 
770
+ gr.Markdown(
771
+ "---\n"
772
+ "**Architecture:** Qwen-Omni (`qwen3.5-omni-plus`) handles audio to speech for "
773
+ "English / Scottish-EN / German / Spanish / Arabic. For Swahili / Amharic / Afrikaans: "
774
+ "Omni translates to text, then YourVoic speaks it."
775
+ )
776
 
777
 
778
  if __name__ == "__main__":
779
+ demo.launch(ssr_mode=False, show_api=False)
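The continuous-mode chunking in `session_process_chunk` comes down to an RMS energy check plus a flush rule over two durations. The following is a minimal offline sketch of that rule, not the code from the commit. The constants are copied from `app.py`, but `rms`, `should_flush`, and `segment` are hypothetical helper names. The sketch also assumes time is measured by counting chunk durations, whereas `app.py` uses `time.time()`.

```python
import math

# Values copied from the CONFIGURATION block in app.py.
VAD_SILENCE_SEC = 0.8
VAD_MIN_UTTERANCE_SEC = 1.2
VAD_MAX_UTTERANCE_SEC = 12.0
VAD_RMS_THRESHOLD = 0.015


def rms(samples):
    """Root-mean-square energy of a sequence of floats in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def should_flush(utt_dur, silence_dur):
    """Flush on the hard cap, or on a long enough pause after a long enough utterance."""
    return (
        utt_dur >= VAD_MAX_UTTERANCE_SEC
        or (silence_dur >= VAD_SILENCE_SEC and utt_dur >= VAD_MIN_UTTERANCE_SEC)
    )


def segment(chunks, chunk_sec):
    """Group fixed-length chunks into utterances (hypothetical helper).

    Assumption: every chunk lasts chunk_sec seconds, so elapsed time can be
    derived from chunk counts instead of the wall clock.
    """
    utterances, buffer = [], []
    in_utt = False
    utt_dur = silence_dur = 0.0
    for chunk in chunks:
        is_voice = rms(chunk) > VAD_RMS_THRESHOLD
        if is_voice and not in_utt:
            # First voiced chunk opens a new utterance.
            in_utt, buffer = True, []
            utt_dur = silence_dur = 0.0
        if not in_utt:
            continue  # silence outside an utterance is dropped
        buffer.append(chunk)  # trailing silence is kept, as in app.py
        utt_dur += chunk_sec
        silence_dur = 0.0 if is_voice else silence_dur + chunk_sec
        if should_flush(utt_dur, silence_dur):
            utterances.append(buffer)
            buffer, in_utt = [], False
    return utterances
```

Under this rule, a short burst is not discarded. It is flushed once the elapsed time, including trailing silence, reaches `VAD_MIN_UTTERANCE_SEC`, which appears to match how `utt_dur` is computed in `app.py`.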
requirements.txt CHANGED
@@ -1,10 +1,4 @@
1
- torch>=2.0.0
2
- torchaudio>=2.0.0
3
- transformers>=4.36.0
4
- accelerate>=0.25.0
5
- soundfile>=0.12.0
6
  numpy>=1.24.0
7
- gradio>=5.0.0
8
- audioop-lts
9
- requests>=2.28.0
10
- openai>=1.0.0
 
1
+ gradio==5.42.0
2
+ openai>=1.40.0
3
+ requests>=2.31.0
 
 
4
  numpy>=1.24.0
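With `soundfile` dropped from `requirements.txt`, `write_wav` and `base64_to_wav` in `app.py` build the 44-byte RIFF header by hand using `struct`. The sketch below shows the same layout in one function, assuming 16-bit little-endian PCM input. `pcm16_to_wav_bytes` is a hypothetical name that does not appear in the commit, and the 24 kHz default follows the `base64_to_wav` docstring.

```python
import io
import struct
import wave


def pcm16_to_wav_bytes(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Prefix raw 16-bit little-endian PCM with a 44-byte RIFF/WAVE header."""
    bits = 16
    block_align = channels * bits // 8      # bytes per sample frame
    byte_rate = sample_rate * block_align   # bytes per second
    header = (
        b"RIFF" + struct.pack("<I", 36 + len(pcm)) + b"WAVE"
        # fmt chunk: size 16, format 1 (PCM), channels, rate, byte rate, align, bits
        + b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels, sample_rate,
                                byte_rate, block_align, bits)
        + b"data" + struct.pack("<I", len(pcm))
    )
    return header + pcm


def read_back(wav_bytes: bytes):
    """Parse the result with the standard-library wave module as a sanity check."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
        return (reader.getnchannels(), reader.getframerate(),
                reader.getsampwidth(), reader.getnframes())
```

The standard-library `wave` module can also write such files directly, which might be a simpler option if the hand-rolled header ever needs to change.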