How We Built a Discord Bot to Record, Transcribe, and Summarize Team Calls

We built a Discord bot that records voice channels, transcribes them with Fireflies.ai, and posts structured summaries back into Discord threads. Action items get pushed to Jira automatically.

26 June 2026 · 8 min read

Our team runs on Discord for voice — standups, client calls, planning sessions. The problem is notes. Sometimes someone takes them, often they don't, and action items disappear after the call ends. We could have switched to Zoom or Google Meet for built-in transcription, but Discord works for everything else we do. So instead we built a bot.

When someone joins a voice channel, the bot asks if they want to record. If they say yes, it records the call, sends the audio to Fireflies.ai for transcription, and posts a structured summary back into a Discord thread. Action items go straight into Jira.

This is a walkthrough of how it actually works.

The User Flow

The bot monitors voice channel joins. When the first person joins a channel, it sends a message into the channel's built-in text tab:

👋 Alex just joined. Want to record this meeting? React ✅ to start, or use /record.

React ✅ and recording starts. Once it's going, the bot adds a 🛑 reaction to the same message. React 🛑 and it stops. There's also /record and /stop as slash commands. A 5-minute cooldown per channel keeps the bot quiet when people shuffle in and out during a session.
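
The per-channel cooldown can be sketched as a small map from channel ID to the time of the last prompt. This is a minimal illustration, not the bot's actual code: the class name `PromptCooldown` and its method are hypothetical, and only the 5-minute window comes from the description above.

```typescript
// Hypothetical helper: remembers when each voice channel was last prompted.
class PromptCooldown {
  private readonly cooldownMs: number;
  private readonly lastPrompt = new Map<string, number>();

  constructor(cooldownMs: number = 5 * 60 * 1000) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true (and records the time) if the channel may be prompted now.
  tryPrompt(channelId: string, now: number = Date.now()): boolean {
    const last = this.lastPrompt.get(channelId);
    if (last !== undefined && now - last < this.cooldownMs) {
      return false; // still cooling down, stay quiet
    }
    this.lastPrompt.set(channelId, now);
    return true;
  }
}

const cooldown = new PromptCooldown();
const t0 = 1_000_000;
console.log(cooldown.tryPrompt("voice-1", t0));              // true
console.log(cooldown.tryPrompt("voice-1", t0 + 60_000));     // false
console.log(cooldown.tryPrompt("voice-2", t0 + 60_000));     // true
console.log(cooldown.tryPrompt("voice-1", t0 + 5 * 60_000)); // true
```

Passing `now` explicitly keeps the logic easy to test without waiting five minutes.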

We kept the interaction as minimal as possible — two reactions on one message.

Recording the Audio

This is the part that took the most work to get right. Discord doesn't give you one mixed audio stream — each speaker comes in as a separate Opus-encoded stream. So the bot captures them separately and mixes everything at the end.

When a user starts speaking, the bot opens an Opus stream for them, decodes each audio packet to raw PCM, and appends it to a file named after their user ID ({userId}.pcm). If they stop and start again, the bot re-subscribes and keeps appending to the same file. One file per speaker, for the full duration of the call.
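
A rough sketch of the one-file-per-speaker idea, using only Node's standard library. The Opus decoder is passed in as a function so the example stays self-contained; in the real bot that role is played by @discordjs/opus, and the packets come from the voice receiver. The class name `SpeakerRecorder` and its methods are illustrative assumptions.

```typescript
import { appendFileSync, mkdtempSync, statSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Stand-in for the real Opus decoder: takes one packet, returns raw PCM.
type DecodeOpus = (packet: Buffer) => Buffer;

class SpeakerRecorder {
  private readonly dir: string;
  private readonly decode: DecodeOpus;

  constructor(dir: string, decode: DecodeOpus) {
    this.dir = dir;
    this.decode = decode;
  }

  // One file per speaker, keyed by Discord user ID.
  pcmPath(userId: string): string {
    return join(this.dir, `${userId}.pcm`);
  }

  // Called for every Opus packet from a speaker's stream. Because we append,
  // a later speaking segment simply continues the same file.
  onPacket(userId: string, packet: Buffer): void {
    appendFileSync(this.pcmPath(userId), this.decode(packet));
  }
}

// Usage with a fake decoder that emits one 20 ms stereo frame (3840 bytes).
const sessionDir = mkdtempSync(join(tmpdir(), "recording-"));
const fakeDecode: DecodeOpus = () => Buffer.alloc(3840);
const recorder = new SpeakerRecorder(sessionDir, fakeDecode);
recorder.onPacket("1111", Buffer.from([0]));
recorder.onPacket("1111", Buffer.from([0])); // second segment, same file
console.log(statSync(recorder.pcmPath("1111")).size); // 7680
```

A synchronous append keeps the sketch short; a long-lived write stream opened in append mode would be the more natural choice for a busy call.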


Discord audio is 48kHz, 2-channel, 16-bit signed little-endian PCM after decoding. We write that directly to disk.
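
That format makes the disk math easy to work out. A short worked example follows; the constant and function names are ours, and the numbers follow directly from 48 kHz, 2 channels, and 2 bytes per sample.

```typescript
const SAMPLE_RATE = 48_000;  // samples per second, per channel
const CHANNELS = 2;
const BYTES_PER_SAMPLE = 2;  // 16-bit signed

// 48,000 * 2 * 2 = 192,000 bytes for every second of decoded audio.
const BYTES_PER_SECOND = SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE;

// Raw PCM has no header, so duration falls straight out of the file size.
function pcmDurationSeconds(byteLength: number): number {
  return byteLength / BYTES_PER_SECOND;
}

console.log(BYTES_PER_SECOND);                // 192000
console.log(BYTES_PER_SECOND * 3600);         // 691200000 bytes per hour of speech
console.log(pcmDurationSeconds(11_520_000));  // 60
```

So a speaker who talks for a full hour produces roughly 690 MB of raw PCM, which is one reason to mix down to MP3 before uploading.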

Mixing and Upload

When /stop is called:

  • Collect all .pcm files from the temp directory
  • Feed them to ffmpeg with the amix filter, which blends multiple audio streams into one track. We use normalize=0 so the streams are summed as-is instead of having their volume rescaled by the number of active inputs
  • Output a 128kbps MP3
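
The steps above map onto an ffmpeg command line fairly directly. The sketch below builds the argument list for the plain ffmpeg CLI rather than going through fluent-ffmpeg, so it is an approximation of the approach, not the bot's code. `buildMixArgs` is a hypothetical helper; the flags themselves (`-f s16le`, `-ar`, `-ac`, `amix`, `-b:a`) are standard ffmpeg options.

```typescript
// Builds ffmpeg CLI arguments that mix raw per-speaker PCM files into one MP3.
function buildMixArgs(pcmFiles: string[], outputPath: string): string[] {
  if (pcmFiles.length === 0) {
    throw new Error("no PCM files to mix");
  }
  const args: string[] = [];
  for (const file of pcmFiles) {
    // Raw PCM has no header, so the format is stated before every input.
    args.push("-f", "s16le", "-ar", "48000", "-ac", "2", "-i", file);
  }
  args.push(
    "-filter_complex",
    `amix=inputs=${pcmFiles.length}:duration=longest:normalize=0`,
    "-b:a", "128k", // 128 kbps output bitrate
    outputPath,     // the .mp3 extension selects the MP3 muxer
  );
  return args;
}

console.log(buildMixArgs(["111.pcm", "222.pcm"], "meeting.mp3").join(" "));
```

The resulting array can be handed to `child_process.spawn("ffmpeg", args)`.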

Then we upload the MP3 to Cloudflare R2. The bucket is private. Instead of making audio files publicly accessible, we generate a presigned URL with a 6-hour expiry and pass that to Fireflies. Fireflies downloads the file using the URL, processes it, and we delete the file from R2 once that's done.

Submitting to Fireflies

Fireflies has a GraphQL API for uploading audio. We call their uploadAudio mutation with the presigned URL, meeting title, attendee names, and a field called clientReferenceId.

That clientReferenceId is the key piece. It's a small JSON blob we control — we encode the Discord channel ID, message ID, guild ID, and a timestamp into it. When Fireflies fires a webhook back at us (5–15 minutes later), it includes that same blob in the payload. We decode it to find exactly which Discord message to update and where to post the summary thread.
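
A minimal sketch of encoding and decoding such a blob. The field names (`guildId`, `channelId`, `messageId`, `startedAt`) are assumptions for illustration; the original only says which pieces of information are carried, not what they are called.

```typescript
// Hypothetical shape of the state we round-trip through Fireflies.
interface RecordingRef {
  guildId: string;
  channelId: string;
  messageId: string;
  startedAt: number; // Unix time in milliseconds
}

function encodeRef(ref: RecordingRef): string {
  return JSON.stringify(ref);
}

// Returns null instead of throwing, since the value arrives from the network.
function decodeRef(raw: string): RecordingRef | null {
  try {
    const value: unknown = JSON.parse(raw);
    if (typeof value !== "object" || value === null) return null;
    const v = value as Record<string, unknown>;
    const guildId = v["guildId"];
    const channelId = v["channelId"];
    const messageId = v["messageId"];
    const startedAt = v["startedAt"];
    if (
      typeof guildId === "string" &&
      typeof channelId === "string" &&
      typeof messageId === "string" &&
      typeof startedAt === "number"
    ) {
      return { guildId, channelId, messageId, startedAt };
    }
    return null;
  } catch {
    return null; // not valid JSON
  }
}

const encoded = encodeRef({ guildId: "1", channelId: "2", messageId: "3", startedAt: 1750000000000 });
console.log(decodeRef(encoded)?.messageId); // 3
```

Discord IDs are kept as strings because they exceed the range JavaScript numbers can represent exactly.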


Fireflies just passes whatever you gave it straight back in the webhook. It's a simple way to carry state across an async process.

The Webhook

When processing is done, Fireflies POSTs to /webhook on our server. We run a small Express server alongside the bot for this.

First we verify the HMAC signature. We capture the raw request body before running JSON.parse on it, because the HMAC has to be computed over the exact bytes that Fireflies signed, not over a re-serialized JSON object. We use timingSafeEqual for the comparison to avoid timing attacks.
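
Here is a sketch of that check using Node's built-in crypto module. It assumes an HMAC-SHA256 hex digest and tolerates an optional `sha256=` prefix on the header value; confirm the exact header name and format against the Fireflies webhook docs. `verifySignature` is our own helper name.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// rawBody must be the bytes exactly as received, before any JSON parsing.
function verifySignature(
  rawBody: Buffer,
  signatureHeader: string | undefined,
  secret: string,
): boolean {
  if (!signatureHeader) return false;
  // Assumption: the header carries a hex digest, optionally prefixed "sha256=".
  const received = signatureHeader.replace(/^sha256=/, "");
  const expected = createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(received, "utf8");
  const b = Buffer.from(expected, "utf8");
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}

const body = Buffer.from('{"meetingId":"abc"}');
const signature = "sha256=" + createHmac("sha256", "s3cret").update(body).digest("hex");
console.log(verifySignature(body, signature, "s3cret"));                               // true
console.log(verifySignature(Buffer.from('{ "meetingId": "abc" }'), signature, "s3cret")); // false
```

The second call shows why the raw bytes matter: the same JSON with different whitespace no longer matches. With Express, mounting `express.raw({ type: "application/json" })` on the webhook route is one way to get the body as a Buffer.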


After verifying, we respond with 200 immediately so Fireflies doesn't retry. Then we do the actual work asynchronously.

Posting Back to Discord

Once we have the transcript, we:

  • Edit the '⏳ Processing...' message to show the meeting title
  • Create a thread from that message
  • Post three embeds into the thread: Overview (title, date, duration, attendees, AI summary), Key Points (bullet gist, topics, keywords), and Action Items
  • Post a link to the full recording in Fireflies

The thread keeps the recordings channel clean. The parent message is just the title and date. Everything else is inside the thread.

Jira Integration

Each action item from the Fireflies summary becomes a Jira Task. Fireflies formats them like **Name** — task description, so we extract the first name from that heading and map it to a Jira account email. Labels get extracted from the meeting keywords and matched against a list of known client and project names.
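
A sketch of that parsing step, under stated assumptions. It treats a `**Name**` marker as the owner, accepts a task on the same line after a dash, and also accepts follow-on lines under the same name. The exact text Fireflies returns may differ, and `parseActionItems`, `emailFor`, and the sample directory are hypothetical.

```typescript
interface ActionItem {
  assigneeFirstName: string;
  task: string;
}

// "**Name**", optionally followed by a dash or colon and an inline task.
const OWNER_LINE = /^\*\*(.+?)\*\*\s*(?:[\u2014\u2013:-]\s*)?(.*)$/;

function parseActionItems(text: string): ActionItem[] {
  const items: ActionItem[] = [];
  let current: string | null = null;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") continue;
    const match = OWNER_LINE.exec(line);
    if (match) {
      // Keep only the first name, which is what we map to a Jira account.
      current = (match[1] ?? "").trim().split(/\s+/)[0] ?? "";
      const inlineTask = (match[2] ?? "").trim();
      if (inlineTask !== "") items.push({ assigneeFirstName: current, task: inlineTask });
    } else if (current !== null) {
      items.push({ assigneeFirstName: current, task: line });
    }
  }
  return items;
}

// Case-insensitive lookup from first name to Jira account email.
function emailFor(firstName: string, directory: Record<string, string>): string | undefined {
  return directory[firstName.toLowerCase()];
}

const parsed = parseActionItems("**Alex Smith** \u2014 Send the proposal\n**Priya**\nReview the budget");
console.log(parsed.map((i) => `${i.assigneeFirstName}: ${i.task}`));
console.log(emailFor("Alex", { alex: "alex@example.com" })); // alex@example.com
```

Items whose owner has no entry in the directory can still be created unassigned rather than dropped.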

The Jira step is entirely optional. If any of the four Jira env vars (JIRA_HOST, JIRA_EMAIL, JIRA_API_TOKEN, JIRA_PROJECT_KEY) are missing, action items are skipped silently and everything else continues normally.
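
One way to express that all-or-nothing check is a loader that returns `null` unless every variable is set. The function and interface names here are illustrative; only the four variable names come from the text above.

```typescript
interface JiraConfig {
  host: string;
  email: string;
  apiToken: string;
  projectKey: string;
}

// Returns null when any of the four variables is missing or empty,
// which the caller treats as "Jira integration disabled".
function loadJiraConfig(env: Record<string, string | undefined>): JiraConfig | null {
  const host = env["JIRA_HOST"];
  const email = env["JIRA_EMAIL"];
  const apiToken = env["JIRA_API_TOKEN"];
  const projectKey = env["JIRA_PROJECT_KEY"];
  if (!host || !email || !apiToken || !projectKey) return null;
  return { host, email, apiToken, projectKey };
}

// In the bot this would be called once at startup as loadJiraConfig(process.env).
console.log(loadJiraConfig({ JIRA_HOST: "example.atlassian.net" })); // null
```

Downstream code can then branch on a single `null` check instead of re-reading the environment.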

Architecture Overview

The full flow end to end: user joins voice channel → bot sends prompt → user reacts → bot joins channel and starts recording → user stops → bot mixes audio → uploads to R2 → submits to Fireflies → Fireflies webhook fires → bot fetches transcript → posts Discord thread → creates Jira tasks → deletes audio from R2.

A Few Notes on the Implementation

Multiple simultaneous recordings work fine. Each voice channel gets its own session tracked independently. /stop has logic to figure out which channel to stop if several are running.
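
The original doesn't spell out how /stop picks a session, so the following is one plausible rule rather than the bot's actual logic: prefer the voice channel the caller is sitting in, fall back to the only active session in the server, and otherwise report ambiguity. All names are hypothetical.

```typescript
interface Session {
  guildId: string;
  channelId: string;
}

type StopTarget =
  | { kind: "found"; session: Session }
  | { kind: "none" }
  | { kind: "ambiguous"; candidates: Session[] };

// sessions is keyed by voice channel ID, one entry per active recording.
function resolveStopTarget(
  sessions: Map<string, Session>,
  guildId: string,
  callerVoiceChannelId: string | null,
): StopTarget {
  // 1. Prefer the channel the caller is currently in.
  if (callerVoiceChannelId !== null) {
    const own = sessions.get(callerVoiceChannelId);
    if (own !== undefined && own.guildId === guildId) {
      return { kind: "found", session: own };
    }
  }
  // 2. Otherwise look at every active session in this server.
  const inGuild = Array.from(sessions.values()).filter((s) => s.guildId === guildId);
  const [first, ...rest] = inGuild;
  if (first === undefined) return { kind: "none" };
  if (rest.length === 0) return { kind: "found", session: first };
  // 3. Several candidates and no hint: let the caller choose.
  return { kind: "ambiguous", candidates: inGuild };
}

const active = new Map<string, Session>([
  ["vc-1", { guildId: "g1", channelId: "vc-1" }],
  ["vc-2", { guildId: "g1", channelId: "vc-2" }],
]);
console.log(resolveStopTarget(active, "g1", "vc-2").kind); // found
console.log(resolveStopTarget(active, "g1", null).kind);   // ambiguous
```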

The bot never auto-records. It won't join a voice channel on its own. The join prompt is just a text message in the channel tab. Nothing happens until someone reacts ✅ or runs /record.

Speaker labeling: Fireflies labels speakers as 'Speaker 1', 'Speaker 2', etc. from its own voice diarization. Discord display names are passed as attendee metadata and show up in the meeting overview, but they're not used to label individual transcript lines. This is a Fireflies limitation.

Audio is deleted automatically after Fireflies processes it. We don't keep recordings anywhere long-term.

Tech Stack

  • discord.js + @discordjs/voice for bot infrastructure and voice capture
  • @discordjs/opus for decoding Opus audio packets
  • ffmpeg-static + fluent-ffmpeg for audio mixing
  • @aws-sdk/client-s3 for Cloudflare R2 (R2 exposes an S3-compatible API)
  • graphql-request for the Fireflies API
  • jira.js for Jira
  • express for the webhook server
  • Deployed on Railway

Setting It Up

You need a Fireflies Pro account (the upload API isn't on their free plan), a Cloudflare account with an R2 bucket, and a public URL for the webhook endpoint. We deploy on Railway; ngrok works fine for local development.

The short version:

  • Create a Discord application and bot, invite it to your server with the right permissions
  • Set up a Cloudflare R2 bucket (keep it private)
  • Deploy to Railway and copy the generated public domain
  • In Fireflies → Integrations → Webhooks, add your URL (/webhook) and subscribe to Meeting Summarized events
  • Run npm run register-commands once locally to register the slash commands with Discord
  • Fill in all the env vars (there's a .env.example in the repo)

The full setup is documented in the README, including exact Discord permissions and OAuth scopes.

The whole thing is about 600 lines of TypeScript across 15 or so files. The trickiest part was the audio recording — understanding that Discord sends per-speaker Opus streams rather than a pre-mixed feed, and making sure PCM appends worked correctly across multiple speaking segments within a single recording session.
