codingstairs
© 2026 codingstairs

Tauri 2 — desktop · mobile in one codebase

Step 6

OCR / STT / TTS

This step adds three practical features: image → text (OCR), speech → text (STT), and text → speech (TTS).

1. OCR — Tesseract wasm

pnpm add tesseract.js
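import Tesseract from "tesseract.js";

// imageFile: the image to read, e.g. a File from <input type="file">
const { data: { text } } = await Tesseract.recognize(imageFile, "kor+eng", {
  // Progress callback: a status string plus a 0..1 progress value
  logger: (m) => console.log(m.status, m.progress),
});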
  • "kor+eng" recognizes Korean and English in a single pass
  • The wasm core is ~10MB, so load it lazily
  • The traineddata for each language is downloaded on first use
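The "load it lazily" bullet can be sketched as a small memoizing wrapper around a dynamic `import()`. This is a minimal sketch under assumptions: `lazy` is a hypothetical helper, not part of tesseract.js, and the commented usage assumes your bundler splits `import("tesseract.js")` into its own chunk.

```typescript
// Hypothetical helper: memoize an async factory so the expensive load
// (e.g. the ~10MB Tesseract wasm) starts only on first use, and only once.
function lazy<T>(factory: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (!pending) {
      // Concurrent callers share this single in-flight promise.
      pending = factory().catch((err) => {
        pending = undefined; // forget the failure so a later call can retry
        throw err;
      });
    }
    return pending;
  };
}

// Usage in the app (not executed here):
// const loadTesseract = lazy(() => import("tesseract.js"));
// const { data } = await (await loadTesseract()).default.recognize(imageFile, "kor+eng");
```

The first OCR button press pays the download cost; every later call reuses the same module.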

2. Preprocessing

// Binarize a canvas in place: grayscale, then a fixed threshold at 128
function preprocessImage(canvas: HTMLCanvasElement) {
  const ctx = canvas.getContext("2d")!;
  const img = ctx.getImageData(0, 0, canvas.width, canvas.height);
  const d = img.data; // RGBA, 4 bytes per pixel
  for (let i = 0; i < d.length; i += 4) {
    // ITU-R BT.601 luma weights
    const gray = 0.299 * d[i] + 0.587 * d[i+1] + 0.114 * d[i+2];
    const bw = gray > 128 ? 255 : 0;
    d[i] = d[i+1] = d[i+2] = bw; // alpha (d[i+3]) is left unchanged
  }
  ctx.putImageData(img, 0, 0);
}

Grayscale conversion plus a fixed threshold (128) raises recognition accuracy by roughly 10–20 percentage points.
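A fixed cutoff of 128 tends to struggle with unevenly lit photos. One option, not used in the original, is Otsu's method: it picks the threshold that best separates the grayscale histogram into a dark class and a light class. The sketch below works on the same RGBA buffer that `getImageData` returns; `otsuThreshold` is a hypothetical helper name.

```typescript
// Hypothetical helper: Otsu's method on an RGBA buffer (as returned by getImageData).
// Returns the gray level t that maximizes between-class variance; pixels with
// gray > t become white, matching `gray > 128` in preprocessImage.
function otsuThreshold(rgba: Uint8ClampedArray): number {
  // 1. Histogram of grayscale values, using the same luma weights as above
  const hist = new Array<number>(256).fill(0);
  for (let i = 0; i < rgba.length; i += 4) {
    const gray = Math.round(0.299 * rgba[i] + 0.587 * rgba[i + 1] + 0.114 * rgba[i + 2]);
    hist[gray]++;
  }
  const total = rgba.length / 4;
  let sumAll = 0;
  for (let t = 0; t < 256; t++) sumAll += t * hist[t];

  // 2. Sweep every candidate threshold, tracking the dark class (gray <= t)
  let sumDark = 0;
  let weightDark = 0;
  let best = 0;
  let bestVariance = -1;
  for (let t = 0; t < 256; t++) {
    weightDark += hist[t];
    sumDark += t * hist[t];
    const weightLight = total - weightDark;
    if (weightDark === 0) continue;
    if (weightLight === 0) break;
    const meanDark = sumDark / weightDark;
    const meanLight = (sumAll - sumDark) / weightLight;
    // Between-class variance, up to a constant factor of 1 / total^2
    const variance = weightDark * weightLight * (meanDark - meanLight) ** 2;
    if (variance > bestVariance) {
      bestVariance = variance;
      best = t;
    }
  }
  return best;
}
```

To try it, compute `const t = otsuThreshold(d)` before the loop in `preprocessImage` and compare `gray > t` instead of `gray > 128`.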

3. STT — Web Speech API

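// Try the standard name, then the webkit-prefixed one Chromium uses.
// The constructor may not be declared in TypeScript's DOM typings, hence the casts.
const Recognition = (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
const rec = new Recognition();
rec.lang = "ko-KR";
rec.continuous = false;    // stop after a single utterance
rec.interimResults = true; // also deliver partial results while the user speaks
rec.onresult = (e: any) => {
  // Join the top alternative of every result received so far
  const text = Array.from(e.results as ArrayLike<any>).map((r) => r[0].transcript).join("");
  console.log(text);
};
rec.start();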

It is free, but it needs a network connection and relies on Chromium-based engines. For offline recognition, swap in Vosk or Whisper.cpp.
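With `interimResults = true`, the joined string mixes committed words with a tail that may still change. The sketch below separates the two using each result's `isFinal` flag. `splitTranscript` and the `ResultLike` types are hypothetical; they are written structurally so the logic compiles and runs without DOM typings.

```typescript
// Structural stand-ins for SpeechRecognitionResult / SpeechRecognitionAlternative
interface AltLike { transcript: string }
interface ResultLike { isFinal: boolean; 0: AltLike }

// Hypothetical helper: split the result list into committed and still-changing text
function splitTranscript(results: ArrayLike<ResultLike>): { finalText: string; interimText: string } {
  let finalText = "";
  let interimText = "";
  for (let i = 0; i < results.length; i++) {
    const r = results[i];
    // r[0] is the top-ranked alternative for this segment
    if (r.isFinal) finalText += r[0].transcript;
    else interimText += r[0].transcript;
  }
  return { finalText, interimText };
}
```

In the handler, `const { finalText, interimText } = splitTranscript(e.results);` lets the UI render `interimText` in a lighter style until it becomes final.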

4. TTS — Web Speech API

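function speak(text: string, lang = "ko-KR") {
  const u = new SpeechSynthesisUtterance(text);
  u.lang = lang;  // BCP 47 tag; the browser uses it to choose a voice
  u.rate = 1.0;   // 0.1 to 10, 1 is normal speed
  u.pitch = 1.0;  // 0 to 2, 1 is normal pitch
  window.speechSynthesis.speak(u); // queued behind any utterance already playing
}
// Stop the current utterance and clear the queue
function cancel() { window.speechSynthesis.cancel(); }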

For an intuitive UX, wire a single button as a toggle: speechSynthesis.speaking ? cancel() : speak(text).
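The toggle can be kept separate from the browser object so it is easy to test. A minimal sketch: `toggleSpeech` and `SynthLike` are hypothetical names, and `SynthLike` only mirrors the `speaking`, `speak()` and `cancel()` members of `window.speechSynthesis`.

```typescript
// The subset of window.speechSynthesis this helper relies on
interface SynthLike<U> {
  speaking: boolean;
  speak(utterance: U): void;
  cancel(): void;
}

// Hypothetical helper: stop speech if it is playing, otherwise start it
function toggleSpeech<U>(synth: SynthLike<U>, makeUtterance: () => U): "started" | "cancelled" {
  if (synth.speaking) {
    synth.cancel();
    return "cancelled";
  }
  synth.speak(makeUtterance());
  return "started";
}
```

In the app: `button.onclick = () => toggleSpeech(window.speechSynthesis, () => new SpeechSynthesisUtterance(text));`.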

5. Permissions (mobile)

On Android, the microphone needs the RECORD_AUDIO permission, declared in the manifest and granted by the user at runtime. OCR needs no permission, since it only reads images the user selects.
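One way to trigger the runtime prompt before starting recognition is to request an audio stream once and release it right away. This sketch rests on an assumption the original does not confirm: that the Tauri Android WebView forwards `getUserMedia({ audio: true })` to the system RECORD_AUDIO prompt. `ensureMicAccess` is a hypothetical helper, and the stream getter is injected so the logic can be tested outside a browser.

```typescript
// Structural subset of a MediaStream that this helper touches
interface TrackLike { stop(): void }
interface StreamLike { getTracks(): TrackLike[] }

// Hypothetical helper. Assumption: the WebView maps this request to the
// Android RECORD_AUDIO runtime prompt; the manifest entry is still required.
async function ensureMicAccess(
  getUserMedia: (constraints: { audio: boolean }) => Promise<StreamLike>,
): Promise<boolean> {
  try {
    const stream = await getUserMedia({ audio: true });
    // Only the grant was needed, so release the microphone immediately
    stream.getTracks().forEach((t) => t.stop());
    return true;
  } catch {
    // Permission denied, or no microphone available
    return false;
  }
}
```

In the app: `if (await ensureMicAccess((c) => navigator.mediaDevices.getUserMedia(c))) rec.start();`.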

6. Language packs

Each language's Tesseract traineddata is about 10MB. You can bundle it with the app or download it at runtime:

  • Bundled — offline ready, bigger app
  • Runtime download — smaller app, first use delay

On mobile, bundle the packs so users don't spend their data plan on first use.
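The bundle-or-download choice can be made per platform when the OCR worker is created. This is a sketch assuming tesseract.js's `createWorker` accepts a `langPath` option pointing at where traineddata is served (check the version you installed); `ocrWorkerOptions`, the platform strings, and the `/tessdata` asset path are hypothetical.

```typescript
// Subset of tesseract.js worker options used here (assumed: langPath)
interface OcrWorkerOptions {
  langPath?: string;
}

type Platform = "android" | "ios" | "desktop"; // hypothetical platform labels

// Hypothetical policy: mobile reads bundled traineddata from the app's own
// assets; desktop keeps the library default (download on first use).
// "/tessdata" is an assumed asset path, not a Tauri or tesseract.js default.
function ocrWorkerOptions(platform: Platform): OcrWorkerOptions {
  if (platform === "android" || platform === "ios") {
    return { langPath: "/tessdata" };
  }
  return {};
}

// In the app (not executed here):
// import { createWorker } from "tesseract.js";
// const worker = await createWorker("kor+eng", 1, ocrWorkerOptions("android"));
```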

7. Gotchas

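  • Long delay on the first OCR run → show a progress UI driven by the logger callback
  • No Korean voice available → the user must install the OS voice pack
  • STT drains the battery → make sure recognition is stopped, and clean up in onend
  • Noisy whitespace in OCR output → clean it up with text.replace(/\s+/g, " ")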
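Two of these gotchas reduce to small helpers. A sketch: `cleanOcrText` applies the regex cleanup from the list, and `hasVoiceFor` (a hypothetical name) checks the installed voices by language prefix so the app can ask the user to install a voice pack.

```typescript
// Collapse every whitespace run (spaces, tabs, newlines) into one space and trim the ends
function cleanOcrText(raw: string): string {
  return raw.replace(/\s+/g, " ").trim();
}

// Minimal shape of a SpeechSynthesisVoice entry
interface VoiceLike { lang: string; name: string }

// Hypothetical helper: true if any voice matches the language prefix ("ko" for "ko-KR").
// Splitting on "-" or "_" tolerates tags written either way.
function hasVoiceFor(voices: VoiceLike[], lang: string): boolean {
  const prefix = lang.toLowerCase().split(/[-_]/)[0];
  return voices.some((v) => v.lang.toLowerCase().split(/[-_]/)[0] === prefix);
}
```

In Chromium, `speechSynthesis.getVoices()` can return an empty list until the `voiceschanged` event fires, so run the check again in that handler before prompting.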

Closing

Real products such as language-learning apps and accessibility tools often combine OCR, STT, and TTS. Under Tauri, plain Web APIs cover much of that ground.

Next

  • Step 7: AdMob + shipping