Project: Israeli Tech Content Corpus
Card Summary (Layer 1)
Automated pipeline for collecting, transcribing, and translating Israeli tech podcasts and articles. Built to understand the Israeli ecosystem through primary sources - 31 podcast episodes + 75+ articles processed.
Quick Facts (Layer 1)
- Sources: Reversim podcast, Geektime articles, NoCamels innovation coverage
- Processing: Hebrew→English translation, AI summarization, entity extraction
- Volume: 31 podcast episodes (full transcripts), 75+ articles
- Tech: Whisper transcription, Gemini translation, RSS parsing
- Status: Active data collection
What It Does (Layer 2)
The Problem
Understanding the Israeli tech ecosystem requires reading Hebrew content (podcasts, articles, industry reports). English sources are limited and often surface-level.
The Solution
Built automated content collection and processing:
1. Podcast Transcription: Reversim episodes → Whisper → timestamped transcripts
2. Translation Pipeline: Batch Hebrew→English translation (Gemini 2.5 Flash Lite)
3. Article Scraping: RSS feeds from Geektime, NoCamels → structured data
4. AI Analysis: Extract companies mentioned, people, technologies, key topics
5. Structured Storage: JSON with metadata (topics, confidence, summaries)
Why It Matters
- Primary sources: Not filtered through US tech press
- Hebrew fluency: Access to content unavailable in English
- Pattern recognition: Identify trends before they go global
- Network mapping: Track who's building what, where
Technical Details (Layer 3)
Architecture
RSS Feeds (Reversim, Geektime, NoCamels)
↓
Feed Parser (Python feedparser)
↓
Audio Transcription (Whisper for Hebrew)
↓
Batch Translation (Gemini 2.5 Flash Lite)
- 20 segments per batch
- ~5 sec per 512 segments
↓
AI Analysis (entity extraction, summarization)
↓
Structured JSON Storage
- Hebrew + English text
- Metadata (topics, companies, people)
- Confidence scores
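The feed-parsing stage at the top of the diagram could look roughly like the sketch below. The pipeline itself uses feedparser; to stay dependency-free, this sketch swaps in the standard library's xml.etree.ElementTree instead. The sample feed, the FeedItem fields, and the parse_rss_items name are assumptions made for illustration, not the project's actual code.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FeedItem:
    title: str
    link: str
    published: str
    audio_url: Optional[str]  # set only for items carrying an <enclosure>


def parse_rss_items(rss_xml: str) -> List[FeedItem]:
    """Extract title, link, date, and audio enclosure from an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = []
    for node in root.iter("item"):
        enclosure = node.find("enclosure")
        items.append(
            FeedItem(
                title=(node.findtext("title") or "").strip(),
                link=(node.findtext("link") or "").strip(),
                published=(node.findtext("pubDate") or "").strip(),
                audio_url=enclosure.get("url") if enclosure is not None else None,
            )
        )
    return items


# Made-up sample feed, not real data from any of the sources above.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item>
    <title>Episode A</title>
    <link>https://example.com/a</link>
    <pubDate>Tue, 15 Oct 2024 08:00:00 GMT</pubDate>
    <enclosure url="https://example.com/a.mp3" type="audio/mpeg"/>
  </item>
  <item>
    <title>Article B</title>
    <link>https://example.com/b</link>
  </item>
</channel></rss>"""

items = parse_rss_items(SAMPLE_FEED)
```

Items with an audio_url would go on to transcription; items without one would be treated as articles.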
Translation Optimization
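Batch Processing:
def chunks(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def translate_in_batches(segments, batch_size=20):
    """Combine segments into batches, send each batch to Gemini, parse results."""
    translated = []
    for batch in chunks(segments, batch_size):
        # Format: [0] speaker: text\n[1] speaker: text...
        combined = "\n".join(
            f"[{i}] {seg['speaker']}: {seg['text']}"
            for i, seg in enumerate(batch)
        )
        # Single API call per batch (gemini is the project's client wrapper, defined elsewhere)
        translation = gemini.translate(combined)
        # Parse numbered results back to individual segments and keep every batch
        translated.extend(parse_translation_results(translation))
    return translated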
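The parse_translation_results helper is called above but not shown in this card. A minimal sketch of what it might do, assuming the model echoes back the same "[index] speaker: text" format with one segment per line; the regex, the returned dict keys, and the choice to skip unparseable lines are all assumptions.

```python
import re
from typing import Dict, List

# Matches lines such as "[3] Ran: translated text"
LINE_PATTERN = re.compile(r"^\[(\d+)\]\s*([^:]*):\s*(.*)$")


def parse_translation_results(translation: str) -> List[Dict[str, object]]:
    """Split a numbered batch response into per-segment dicts, ordered by index."""
    parsed = {}
    for line in translation.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if not match:
            continue  # skip blank lines or any model commentary
        index, speaker, text = match.groups()
        parsed[int(index)] = {
            "index": int(index),
            "speaker": speaker.strip(),
            "text_english": text.strip(),
        }
    return [parsed[i] for i in sorted(parsed)]


# Made-up response text for illustration.
response = "[0] Ran: Welcome to the show.\n\n[1] Ori: Today we talk about RAG."
segments = parse_translation_results(response)
```

Comparing len(segments) against the batch size is a cheap way to notice when the model dropped or merged a line.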
Why Batch?
- One API call instead of 20 per batch, so up to ~20x faster when request latency dominates
- Lower cost (fewer API requests)
- Better context for translation (sees surrounding segments)
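The request-count saving can be worked out directly. The sketch below uses the 512-segment figure from the architecture notes and the batch size of 20; api_calls_needed is a hypothetical helper, and actual wall-clock speedup depends on per-request latency, which is not measured here.

```python
import math


def api_calls_needed(num_segments: int, batch_size: int = 20) -> int:
    """Number of requests when segments are sent batch_size at a time."""
    return math.ceil(num_segments / batch_size)


unbatched = api_calls_needed(512, batch_size=1)  # one request per segment: 512
batched = api_calls_needed(512, batch_size=20)   # ceil(512 / 20) = 26
```

The last batch is only partly full (12 segments), which is why the count is 26 rather than 25.6.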
Content Analysis
Entity Extraction:
def analyze_article(content):
    """Extract structured metadata from Hebrew content"""
    return {
        "main_topic": "...",
        "key_points": [...],
        "companies_mentioned": [...],
        "people_mentioned": [...],
        "technologies": [...],
        "category": "...",  # funding, product launch, hiring, etc.
        "sentiment": "...",
        "importance_score": 0.85
    }
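Since these fields presumably come back from a model response, it may be worth validating each record before storing it. A minimal sketch, assuming the keys shown above and assuming importance_score is meant to lie in [0, 1] (inferred from the 0.85 example); validate_analysis and the sample values are hypothetical.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = {
    "main_topic": str,
    "key_points": list,
    "companies_mentioned": list,
    "people_mentioned": list,
    "technologies": list,
    "category": str,
    "sentiment": str,
    "importance_score": (int, float),
}


def validate_analysis(result: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in result:
            problems.append(f"missing field: {field}")
        elif not isinstance(result[field], expected):
            problems.append(f"wrong type for {field}")
    score = result.get("importance_score")
    if isinstance(score, (int, float)) and not 0.0 <= score <= 1.0:
        problems.append("importance_score outside [0, 1]")
    return problems


# Made-up records for illustration.
good = {
    "main_topic": "funding round",
    "key_points": ["example point"],
    "companies_mentioned": ["ExampleCo"],
    "people_mentioned": [],
    "technologies": ["RAG"],
    "category": "funding",
    "sentiment": "positive",
    "importance_score": 0.85,
}
bad = {k: v for k, v in good.items() if k != "key_points"}
bad["importance_score"] = 1.7

good_problems = validate_analysis(good)
bad_problems = validate_analysis(bad)
```

Records with a non-empty problem list could be retried or flagged for manual review rather than written to storage.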
Data Structure
Podcast Episode JSON:
{
  "episode_number": 502,
  "title": "...",
  "date": "2024-10-15",
  "summary_hebrew": "...",
  "summary_english": "...",
  "transcript": [
    {
      "timestamp": "00:05:23",
      "speaker": "Ran Tavory",
      "text_hebrew": "...",
      "text_english": "..."
    }
  ],
  "topics": ["AI", "Startups", "Investment"],
  "companies_mentioned": ["Wiz", "monday.com"],
  "translation_confidence": 0.89
}
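One way to use this structure is to search the translated transcript and jump to the matching point in the audio. The sketch below assumes the field names shown above; timestamp_to_seconds, find_mentions, and the sample episode text are hypothetical.

```python
import json
from typing import Dict, List


def timestamp_to_seconds(timestamp: str) -> int:
    """Convert an "HH:MM:SS" string into a second offset."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def find_mentions(episode: Dict, term: str) -> List[Dict]:
    """Return segments whose English text contains term (case-insensitive)."""
    needle = term.lower()
    return [
        {
            "offset_seconds": timestamp_to_seconds(seg["timestamp"]),
            "speaker": seg["speaker"],
        }
        for seg in episode["transcript"]
        if needle in seg["text_english"].lower()
    ]


# Made-up episode record for illustration; the text is not a real transcript.
episode = json.loads("""
{
  "episode_number": 1,
  "transcript": [
    {"timestamp": "00:00:10", "speaker": "Host", "text_english": "Welcome back."},
    {"timestamp": "00:05:23", "speaker": "Host", "text_english": "A company like Wiz comes to mind."}
  ]
}
""")

hits = find_mentions(episode, "wiz")
```

Here the single hit maps to offset 323 seconds, which could be used to seek directly in a player.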
Implementation Challenges
Challenge 1: Hebrew transcription accuracy
- Solution: Whisper model tuned for Hebrew, manual verification
- Result: ~85% accuracy, good enough for gist extraction
Challenge 2: Translating technical terminology
- Solution: Batch translation maintains context, technical terms preserved
- Result: "RAG" stays "RAG", not translated to Hebrew equivalent
Challenge 3: Source access
- Solution: Pivoted to accessible feeds (Geektime, NoCamels)
- Result: 75+ articles from open sources
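The term-preservation result ("RAG" stays "RAG") could be spot-checked automatically. A minimal sketch, assuming technical terms appear in Latin script inside the Hebrew source text; find_dropped_terms is hypothetical and the sample sentences are made up.

```python
import re
from typing import List

# Runs of two or more Latin letters/digits starting with a letter, e.g. "RAG", "K8s"
LATIN_TERM = re.compile(r"[A-Za-z][A-Za-z0-9]+")


def find_dropped_terms(text_hebrew: str, text_english: str) -> List[str]:
    """List Latin-script terms in the Hebrew source that are absent from the translation."""
    english_lower = text_english.lower()
    dropped = []
    for term in LATIN_TERM.findall(text_hebrew):
        if term.lower() not in english_lower and term not in dropped:
            dropped.append(term)
    return dropped


# Made-up example segment.
source = "אנחנו משתמשים ב-RAG מעל Postgres"
kept = find_dropped_terms(source, "We use RAG on top of Postgres")
lost = find_dropped_terms(source, "We use retrieval on top of Postgres")
```

The check is a plain substring match, so short terms can be hidden by longer English words that happen to contain them; it is meant as a cheap flag for manual review, not a quality metric.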
Links
Code
- GitHub: Part of larger content processing suite
- Data: JSON files with full transcripts + translations
Outputs
- 31 Reversim episodes fully translated
- 75+ articles (30 Geektime Hebrew + 45 NoCamels English)
- Structured metadata for all content
Tech Stack Tags
python whisper openrouter gemini feedparser rss hebrew translation transcription nlp
---
What This Proves
- Language Flexibility: Hebrew→English translation pipeline
- Data Processing: Automated content collection at scale
- Systematic Approach: Not one-off scraping - reusable pipeline
- Domain Knowledge: Understanding Israeli tech ecosystem through primary sources
- Tool Selection: Chose right tools (Whisper, Gemini, feedparser) for each step
Built to solve my own need: understand the ecosystem I'm operating in.
---
Meta Notes
Depth: ●○○○