Fully Local AI Meeting Summaries
Remote work has transformed how we conduct meetings, but it's also created a new challenge: keeping track of everything discussed across countless video calls. While most platforms offer cloud recording, searching through hours of audio for that one key decision made three meetings ago remains a painful experience. This led me to build Essence, a fully local command-line tool that transcribes meeting audio and generates structured summaries without sending any data to external services.
This project builds on my previous exploration of local AI with Rust - RAG Pipeline to Chat with My Obsidian Vault, where I first dove into running AI models locally for privacy and performance.
The Problem
Meeting recordings pile up quickly, but finding specific information requires listening through entire recordings. What I needed was something that could:
- Convert audio recordings to searchable text
- Generate concise summaries highlighting key decisions and action items
- Work entirely offline for privacy and speed
- Integrate seamlessly into my existing CLI-based workflow
I did not want to use/pay for any cloud services that join the meetings for me (**cough** [Some Stupid Name] The Note Taker **cough**), and I wanted to use my own hardware.
Technical Architecture
Essence is built in Rust and follows a modular design with two primary components: transcription and summarization. I'm using OpenAI's Whisper model running locally via the whisper-rs crate for speech recognition, and Ollama for local language model inference. Everything runs entirely on my machine - no data ever leaves my system, which you know I love.
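Before digging into each piece, here's the overall shape of the pipeline, sketched with illustrative names rather than the actual Essence internals:

```rust
use std::path::Path;

// Illustrative shape of the pipeline; the real module and function names in
// Essence may differ, but the flow is the same: audio in, summary out.
fn run(audio: &Path) -> anyhow::Result<()> {
    let transcript = transcribe(audio)?; // whisper-rs, fully local
    let summary = summarize(&transcript)?; // Ollama, fully local
    println!("{summary}");
    Ok(())
}

// Stubs standing in for the two modules described in the sections below.
fn transcribe(audio: &Path) -> anyhow::Result<String> {
    anyhow::bail!("sketch: would run Whisper on {}", audio.display())
}

fn summarize(transcript: &str) -> anyhow::Result<String> {
    anyhow::bail!("sketch: would send {} chars to Ollama", transcript.len())
}
```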
I achieved good results using Apple's Metal acceleration for the Whisper model (ggml-large-v3.bin) and the gemma3:27b model from Ollama for summarization. On my M4 Pro, transcribing and summarizing a 45-minute meeting took less than 8 minutes (your mileage may vary).
Transcription
The transcription module wraps Whisper's C++ implementation, handling the complex audio preprocessing required for speech recognition.
We first have to read the wav audio file (using hound) into a vector of 16-bit integers.
// Open the WAV with hound and collect the raw 16-bit samples.
let samples: Vec<i16> = hound::WavReader::open(&audio_path)
    .map_err(|e| anyhow!("failed to open wav file: {e}"))?
    .into_samples::<i16>()
    .map(|s| s.map_err(|e| anyhow!("failed to read sample: {e}")))
    .collect::<Result<Vec<i16>, _>>()?;
The Whisper model expects 16 kHz mono f32 samples, meaning we have to do just a bit more work:
// Convert the integer samples to f32 and mix stereo down to mono, as Whisper expects.
let mut inter_samples = vec![0.0f32; samples.len()];
whisper_rs::convert_integer_to_float_audio(&samples, &mut inter_samples)
    .map_err(|e| anyhow!("failed to convert samples to f32: {e}"))?;
let samples = whisper_rs::convert_stereo_to_mono_audio(&inter_samples)
    .map_err(|e| anyhow!("failed to convert stereo audio to mono: {e}"))?;
After that's done, we can run the model and get back a vector of text segments. Et voilà!
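Condensed, that step looks roughly like this with whisper-rs (a sketch; exact method names and signatures can shift a bit between crate versions):

```rust
use whisper_rs::{FullParams, SamplingStrategy, WhisperContext, WhisperContextParameters};

// Condensed inference step: load the model, run it over the prepared samples,
// and collect the decoded text segments.
fn run_whisper(model_path: &str, samples: &[f32]) -> anyhow::Result<Vec<String>> {
    let ctx = WhisperContext::new_with_params(model_path, WhisperContextParameters::default())?;
    let mut state = ctx.create_state()?;

    let mut params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
    params.set_language(Some("en"));
    params.set_print_progress(false);

    // Run the model over the 16 kHz mono f32 samples.
    state.full(params, samples)?;

    // Collect the decoded text segments.
    let n_segments = state.full_n_segments()?;
    let mut segments = Vec::with_capacity(n_segments as usize);
    for i in 0..n_segments {
        segments.push(state.full_get_segment_text(i)?);
    }
    Ok(segments)
}
```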
Summarization
The summarization component interfaces with Ollama to run large language models locally. The model parameters are tuned for consistent, focused output - low temperature reduces randomness while constrained sampling ensures the model stays on topic.
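Under the hood this is just a request to the local Ollama HTTP API. Here's a minimal sketch using reqwest and serde_json (the option values are illustrative, and the real code may use an Ollama client crate instead):

```rust
use serde_json::json;

// Minimal sketch of calling Ollama's /api/generate endpoint.
// Requires reqwest with the "blocking" and "json" features, plus serde_json.
fn summarize(transcript: &str, prompt: &str) -> anyhow::Result<String> {
    let body = json!({
        "model": "gemma3:27b",
        "prompt": format!("{prompt}\n\nTranscript:\n{transcript}"),
        "stream": false,
        "options": {
            "temperature": 0.2, // low temperature -> consistent, focused output
            "top_k": 40,        // constrained sampling keeps the model on topic
            "top_p": 0.9
        }
    });

    let response: serde_json::Value = reqwest::blocking::Client::new()
        .post("http://localhost:11434/api/generate")
        .json(&body)
        .send()?
        .error_for_status()?
        .json()?;

    Ok(response["response"].as_str().unwrap_or_default().to_string())
}
```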
The prompt engineering focuses on extracting actionable information: the key decisions made and the action items that came out of the meeting.
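Something along these lines works well; the wording here is illustrative rather than the literal prompt, and it's what gets passed as the `prompt` argument in the sketch above:

```rust
// An illustrative summarization prompt, not the exact text Essence ships with.
const SUMMARY_PROMPT: &str = "\
You are summarizing a meeting transcript. Produce:
1. A short overview of what the meeting was about.
2. The key decisions that were made.
3. Action items, with owners and deadlines where mentioned.
4. Open questions.
Only use information from the transcript and keep it concise.";
```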
Unix Philosophy and CLI Design
I designed Essence to follow Unix principles: do one thing well and play nicely with other tools. The CLI exposes two primary commands, one for transcription and one for summarization, that can be chained together or used independently.
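Sketched with clap's derive API, the surface looks roughly like this (subcommand and argument names are illustrative rather than Essence's exact interface):

```rust
use clap::{Parser, Subcommand};
use std::path::PathBuf;

#[derive(Parser)]
#[command(name = "essence")]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Transcribe a WAV recording and print the transcript to stdout
    Transcribe { audio: PathBuf },
    /// Summarize a transcript file and print the summary to stdout
    Summarize { transcript: PathBuf },
}

fn main() {
    match Cli::parse().command {
        Command::Transcribe { audio } => println!("would transcribe {}", audio.display()),
        Command::Summarize { transcript } => println!("would summarize {}", transcript.display()),
    }
}
```

With that shape, `essence transcribe meeting.wav > transcript.txt` and `essence summarize transcript.txt` compose naturally with redirects and pipes.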
Following Unix conventions, the tool outputs results to stdout and logs to stderr. This makes it perfect for scripting and automation. Here's a script that takes a locally recorded meeting video and produces a summary:
#!/bin/bash
# extract_and_summarize.sh - Convert video to summary
# (subcommand names follow the CLI sketch above; adjust if yours differ)
VIDEO_FILE="$1"
AUDIO_FILE="meeting_audio.wav"
TRANSCRIPT_FILE="transcript.txt"

# Extract audio from video as 16 kHz 16-bit PCM WAV, which Whisper expects
ffmpeg -i "$VIDEO_FILE" -vn -ar 16000 -c:a pcm_s16le "$AUDIO_FILE"

# Transcribe the audio
essence transcribe "$AUDIO_FILE" > "$TRANSCRIPT_FILE"

# Generate summary
essence summarize "$TRANSCRIPT_FILE"

# Cleanup
rm -f "$AUDIO_FILE" "$TRANSCRIPT_FILE"
Now imagine running this script after each meeting, saving the result in your Obsidian vault, and then chatting with it using the fully-local RAG pipeline I previously built.