
videobrev: YouTube Transcripts & AI Summaries
A Go + SolidJS app for extracting YouTube transcripts and generating fast AI summaries. Decide to watch a vid by giving it a quick skim first
This project started for a very practical reason: a friend kept sending me long YouTube videos, and I wanted to know whether they were worth watching before donating half my evening to them.
So VideoBrev became a kind of YouTube filter. Grab the transcript, run it through an LLM, and give me the quick version first. The goal is not "replace the video". Itās "help me decide whether this deserves my eyeballs".
Tech Stack Talk
The interesting bits:
- Learning how to extract YouTube transcripts
- Stream AI summaries in real-time using Server-Sent Events
- AI model selection and handling videos from seconds to 8+ hours (~76k tokens)
- First time using this tech stack and shipped a working project: Golang, SolidJS, UnoCSS
- Multi-provider LLM support (Gemini, Claude, OpenAI, Groq, OpenRouter)
- Circuit breaker pattern for LLMs, because AI APIs are still a bit like genius interns with unreliable calendars
Real stats (30 days, no SEO, 2 Reddit posts):
This was enough to prove that real people were using it, which is always a nice surprise:
- 122 unique devices
- 46 requests completed
- Avg transcript extraction: 1.3 seconds
- Avg AI summary: 6.27 seconds end-to-end
Getting YouTube Transcripts
Getting YouTube transcripts is one of those tasks that feels oddly secretive, like everyone knows thereās a method but nobody wants to say it too loudly in case the wall starts listening.
I wonāt spell out every detail here, mostly because these integrations tend to be fragile and YouTube changes things often. The short version is that thereās a rough two-step API flow, itās not documented particularly well, and the open source community is doing a lot of the heavy lifting whenever it shifts.
Since shipping VideoBrev, YouTube changed things at least three times in ways that broke transcript extraction. The fixes were small, but the lesson was clear: if your product depends on someone elseās data pipe, maintenance is not optional. Itās rent.
AI Model Choice
The part people miss is that model choice is not just about price and speed. Models have personalities. Not in a mystical sense. More in the sense that, after a while, you learn what each one tends to do well, what it fumbles, and how often it falls over at inconvenient moments.
At first I assumed any cheap, fast model would be fine because the task sounded simple: summarise a transcript. Then two problems showed up:
- If the model was not "smart" enough, the summaries were actually missing lots of key info
- LLM inference endpoints have notoriously bad uptimes (compared to non-LLM software)
The first issue mattered a lot. If the summary skips the useful bits, the whole product becomes a very confident waste of time. Longer transcripts made this worse. An eight-hour Python tutorial at roughly 76k transcript tokens was enough to expose which models were just not up to the job.
Here are the models I tested and disabled:
DISABLED_MODELS:
- qwen-qwq-32b
- llama-3.1-8b-instant
- gpt-4o-mini
- meta-llama/llama-4-scout-17b-16e-instruct
- meta-llama/llama-4-maverick-17b-128e-instruct
- mistralai/mistral-nemo
- qwen/qwen-turbo
The optimisation target here was simple: fast, cheap, and smart enough. A lot of models managed two out of three. That sounds fine until you realise the missing third one is the entire product.
So I was looking for the cheap end of the Pareto frontier, not just the absolute cheapest thing with a pulse.

Big 3 and Downtime
In ordinary software, weāve become used to very high uptime. Four nines means about 53 minutes of downtime a year. That is both impressive and, for modern infrastructure, weirdly normal.

| Provider | SLA Uptime | Downtime/yr | Notes |
|---|---|---|---|
| Gemini | 99.5% | 43.8 hrs | Provides financial credits for missed SLA |
| OpenAI | 99.9% | 8.76 hrs | Enterprise customers only |
| Claude | 99.5% | 43.8 hrs | Best effort, Priority Tier |
For a small app builder, Gemini via Vertex looked like the best deal on paper because you could get an SLA without needing an enterprise sales ritual.
Then I looked at the last 90 days of public status data and compared that with the promises.
Downtime in the past 90 days (10 Dec 2025)
| Provider | Downtime | Yearly Projected | SLA Target | Met SLA? |
|---|---|---|---|---|
| Gemini | 97.4 hrs | 95.5% (~1.3 nines) |
99.5% | No |
| OpenAI | 72.7 hrs | 96.6% (~1.5 nines) |
99.9% | No |
| Claude | 11.8 hrs | 99.5% (~2.3 nines) |
99.5% | Yes |
Note: SLA targets are product-tier specific (Vertex AI, OpenAI Enterprise, Claude Priority Tier) and may not apply to the same endpoints tracked in the status pages above.
The punchline is not subtle: AI endpoints are still much less reliable than the rest of the software stack weāre used to. Even for a small app, "hope the provider behaves today" is not a strategy. Itās a coin toss wearing a lanyard.
That is what pushed me to build circuit breakers.
AI Models Chosen and Circuit Breakers
Here are the models I ended up using:
DEFAULT_LLM_PROVIDER_SHORT_CONTENT=openrouter
DEFAULT_LLM_PROVIDER_LONG_CONTENT=google
DEFAULT_GOOGLE_MODEL=gemini-2.0-flash
DEFAULT_OPENROUTER_MODEL="meta-llama/llama-3.1-8b-instruct" # Cerebras only
DEFAULT_ANTHROPIC_MODEL=claude-3-5-haiku-latest
DEFAULT_OPENAI_MODEL=gpt-4.1-mini
Why so many? Because short and long videos behave differently. For short videos, the "smart enough" problem usually doesnāt show up, so I could prioritise cost and speed much more aggressively.
Why not Groq for everything? In my testing, I sometimes saw long startup waits before inference even began. If the queue burns ten seconds up front, the cheap fast model is no longer looking especially fast.
Hereās the cost table I was using to compare, from cheapest to most expensive:
| Provider | Model | Input Cost | Output Cost |
|---|---|---|---|
| OpenRouter | Meta Llama 3.1 8B (Groq) | $0.05 | $0.08 |
| OpenRouter | Meta Llama 3.1 8B (Cerebras) | $0.10 | $0.10 |
| Gemini 2.0 Flash | $0.10 | $0.40 | |
| Gemini 2.5 Flash Preview | $0.15 | $0.60 | |
| OpenAI | GPT-4.1-mini | $0.40 | $1.60 |
| Anthropic | Claude 3.5 Haiku | $0.80 | $4.00 |
Gemini Flash models were close to ideal for this workload: good summaries, decent speed, and low enough cost. They handled very long transcripts well. But I still wanted something even cheaper for short videos, and Gemini also threw enough overload errors in testing to make fallback logic necessary anyway.
Thatās where the circuit breaker pattern came in.

Circuit breakers are a clean fit for flaky third-party services. When one provider starts failing, you stop sending it traffic for a while and fall back to the next option. Then you test it again later to see if it has recovered. Simple idea. Very useful.
Each provider in the fallback chain has its own circuit breaker with three states:
- Closed - Provider is healthy, requests go through normally
- Open - Provider is failing, skip it entirely and move to the next fallback
- Half-Open - Cautiously testing if the provider has recovered
So I layered backup providers into two separate fallback chains. Short videos have this priority: Llama 3.1, Gemini Flash 2, GPT-4.1-mini Long videos have this priority: Gemini Flash 2, GPT-4.1-mini, Claude 3.5 Haiku
If an API fails three times, that circuit opens for 60 seconds and traffic moves to the next provider. GPT-4.1-mini and Claude 3.5 Haiku are not my favourite options on cost, but reliability likes diversity.
Example: You submit a 10-minute video, so we start with Gemini 2.0 Flash. If that provider starts returning errors, the circuit breaker tracks the failure rate. Once we've seen at least 3 requests and 60% or more have failed, the breaker trips to Open. For the next 60 seconds, we don't even attempt Gemini 2.0 Flash. Requests immediately fall through to the next provider (OpenAI GPT-4.1-mini in this case).
After that 60-second timeout, the breaker enters Half-Open and lets exactly one test request through. If it succeeds, we're back in business with that provider. If it fails, back to Open for another minute.
This keeps the app responsive when providers are having a bad day. Instead of politely queueing for more failure, the system routes around the problem.
Prompting
Prompting is both simpler and harder than people say.
One practical approach is to dump your goals, constraints, and examples into a rough brief, then ask the model to help shape the prompt. Thatās more or less how I started. Trial and error first, then refinement.
Different models also respond differently to the same prompt. Some need more steering. Some need less. If youāre trying to support multiple models with one prompt, more detail usually helps.
Prompt evolution:
- V1: "Summarize this transcript" ā hallucinated names, wrong terms
- V2: Added ASR (Automatic Speech Recognition) error handling instructions, still not a completely solved problem
- V3: Structured output (TLDR + detailed sections)
Backend and Frontend Architecture

The app uses a Go backend with a SolidJS SPA frontend. At the time I wanted to try SolidJS and UnoCSS, and both were enjoyable to work with. Fast, lightweight, good developer experience.
That said, with hindsight I probably wouldnāt choose the same stack again for this project. Iād likely move it toward a Go MPA with Datastar and simplify the whole shape of the app.
Frontend
SolidJS did make me appreciate JSX more than I expected. But frontend speed is not just about the framework. Itās also about the ecosystem around it, and thatās where larger ecosystems still have a real edge.
I came into web development during the Svelte hype years, so React never became my default instinct. Part of that is taste. I like lighter tools when theyāre good enough. Iām always trading off raw ecosystem size against performance and simplicity.
UnoCSS had a similar tradeoff. Fast and flexible, yes. But fewer ready-made UI patterns meant more tinkering than I really wanted. That tradeoff matters when the goal is to ship, not just admire your tooling choices in the mirror.
Why SSE over WebSockets: Simpler protocol, unidirectional is sufficient, automatic reconnection, HTTP/2 compatible.
Key Takeaways
Technical:
- AI model providers are still far off from the reliability we expect from regular non-LLM apps
- Circuit breakers should be mandatory for AI services, but careful prompting is needed for harder tasks than just summarisation
- Per-model prompts are likely better than one prompt for all models
- If you are going to use one prompt for all models, give it more detail and instruction
- Prompting is important because it is one of the few levers you directly control
Product:
- Building a business on someone else's data has to be done with care and maintenance in mind
- Edge cases break everything, so test the 8-hour videos early
Future Work
Iāll probably leave this app as an artifact rather than fully rebuild it, but the ideas are still useful.
Architecture:
- Golang MPA with templ and Datastar, which removes the frontend framework and the OpenAPI-spec coupling
Features:
- Caching layer for popular videos
- Timestamp references, because model citations are still weak without vector search or something similar
- Jargon web lookup. For example, the fast cheap models still donāt always know what MCP (Model Context Protocol) stands for
Product:
- Chrome extension for in-page summaries
- Saved transcripts and summaries (user accounts)
- Export to Notion, Obsidian, markdown