
    Building a Scalable and Accurate Audio Interview Transcription Pipeline with Google Gemini



This article is co-authored by Ugo Pradère and David Haüet

    can or not it’s to transcribe an interview? You feed the audio to an AI mannequin, wait a couple of minutes, and growth: excellent transcript, proper? Properly… not fairly.

When it comes to accurately transcribing long audio interviews, even more so when the spoken language isn't English, things get a lot more complicated. You need high-quality transcription with reliable speaker identification, precise timestamps, and all of that at an affordable price. Not so simple after all.

In this article, we take you behind the scenes of our journey to build a scalable and production-ready transcription pipeline using Google's Vertex AI and Gemini models. From unexpected model limitations to budget evaluation and timestamp drift disasters, we'll walk you through the real challenges, and how we solved them.

Whether you're building your own audio processing tool or just curious about what happens "under the hood" of a robust transcription system using a multimodal model, you will find practical insights, clever workarounds, and lessons learned that should be worth your time.

Context of the project and constraints

At the beginning of 2025, we started an interview transcription project with a clear goal: build a system capable of transcribing interviews in French, typically involving a journalist and a guest, but not restricted to this case, and lasting from a few minutes to over an hour. The expected final output was not just a raw transcript: it had to reflect the natural spoken dialogue, written as a "book-like" dialogue, ensuring both a faithful transcription of the original audio content and good readability.

Before diving into development, we conducted a short market review of existing solutions, but the results were never satisfactory: the quality was often disappointing, the pricing definitely too high for intensive usage, and sometimes both at once. At that point, we realized a custom pipeline would be necessary.

Because our team works within the Google ecosystem, we were required to use Google Vertex AI services. Google Vertex AI offers a variety of Speech-to-Text (S2T) models for audio transcription, including specialized ones such as "Chirp," "Latestlong," or "Phone call," whose names already hint at their intended use cases. However, producing a complete transcription of an interview that combines high accuracy, speaker diarization, and precise timestamping, especially for long recordings, remains a real technical and operational challenge.

First attempts and limitations

We initiated our project by evaluating all these models on our use case. However, after extensive testing, we quickly came to the following conclusion: no Vertex AI service fully meets the whole set of requirements and would allow us to reach our goal in a simple and effective way. There was always at least one missing specification, usually on timestamping or diarization.

The terrible Google documentation, this must be said, cost us a significant amount of time during this preliminary research. This prompted us to ask Google for a meeting with a Google Cloud Machine Learning Specialist to try to find a solution to our problem. After a quick video call, our discussion with the Google rep quickly confirmed our conclusions: what we aimed to achieve was not as simple as it seemed at first. The whole set of requirements could not be fulfilled by a single Google service, and a custom implementation of a Vertex AI S2T service had to be developed.

We presented our preliminary work and decided to continue exploring two strategies:

• Use Chirp2 to generate the transcription and timestamping of long audio files, then use Gemini for diarization.
• Use Gemini 2.0 Flash for transcription and diarization, although the timestamping is approximate and the output token limit requires looping.

In parallel with these investigations, we also had to consider the financial aspect. The tool would be used for hundreds of hours of transcription per month. Unlike text, which is generally cheap enough not to have to think about it, audio can be quite costly. We therefore included this parameter from the beginning of our exploration to avoid ending up with a solution that worked but was too expensive to operate in production.

    Deep dive into transcription with Chirp2

    We started with a deeper investigation of the Chirp2 mannequin since it’s thought-about because the “finest in school” Google S2T service. A simple utility of the documentation supplied the anticipated end result. The mannequin turned out to be fairly efficient, providing good transcription with word-by-word timestamping based on the next output in json format:

    "transcript":"Oui, en effet",
    "confidence":0.7891818284988403
    "phrases":[
      {
        "word":"Oui",
        "start-offset":{
          "seconds":3.68
        },
        "end-offset":{
          "seconds":3.84
        },
        "confidence":0.5692862272262573
      }
      {
        "word":"en",
        "start-offset":{
          "seconds":3.84
        },
        "end-offset":{
          "seconds":4.0
        },
        "confidence":0.758037805557251
      },
      {
        "word":"effet",
        "start-offset":{
          "seconds":4.0
        },
        "end-offset":{
          "seconds":4.64
        },
        "confidence":0.8176857233047485
      },
    ]

However, a new requirement came along during the project, added by the operational team: the transcription had to be as faithful as possible to the original audio content and include the small filler words, interjections, onomatopoeia and even mumbling that can add meaning to a conversation, and that typically come from the non-speaking participant, either at the same time or toward the end of a sentence of the speaking one. We're talking about words like "oui oui," "en effet," but also simple expressions like (hmm, ah, etc.), so typical of the French language! It's actually not uncommon to validate or, more rarely, oppose someone's point with a simple "Hmm Hmm". Upon analyzing the Chirp2 transcription, we noticed that while some of these small words were present, a lot of these expressions were missing. First drawback for Chirp2.

The main difficulty in this approach lies in the reconstruction of the speakers' sentences while performing diarization. We quickly abandoned the idea of giving Gemini the context of the interview and the transcription text, and asking it to determine who said what. This method could easily result in incorrect diarization. We instead explored sending the interview context, the audio file, and the transcription content in a compact format, instructing Gemini to only perform diarization and sentence reconstruction without re-transcribing the audio file. We asked for a TSV format, an ideal structured format for transcription: "human readable" for fast quality checking, easy to process algorithmically, and lightweight. Its structure is as follows:

First line with the speaker presentation:

Diarization	Speaker_1:speaker_name	Speaker_2:speaker_name	Speaker_3:speaker_name	Speaker_4:speaker_name, etc.

Then the transcription in the following tab-separated format:

speaker_id	time_start	time_stop	text

with:

• speaker_id: numeric speaker ID (e.g., 1, 2, etc.)
• time_start: segment start time in the format 00:00:00
• time_stop: segment end time in the format 00:00:00
• text: transcribed text of the dialogue segment

An example output (a small parsing sketch follows):

Diarization	Speaker_1:Lea Finch	Speaker_2:David Albec

1	00:00:00	00:03:00	Hi Andrew, how are you?

2	00:03:00	00:03:00	Fine thanks.

1	00:04:00	00:07:00	So, let's start the interview

2	00:07:00	00:08:00	All right.
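Such a format is also straightforward to handle in code. As an illustration, a minimal parsing sketch (assuming tab-separated fields as described above; the Segment structure is ours for the example, not part of the pipeline spec):

from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # numeric speaker ID, or the resolved speaker name
    start: str     # "00:00:00"
    stop: str      # "00:00:00"
    text: str

def parse_tsv_transcript(tsv: str) -> list[Segment]:
    segments = []
    for line in tsv.strip().splitlines():
        # Skip empty lines and the "Diarization ..." speaker-mapping header.
        if not line.strip() or line.startswith("Diarization"):
            continue
        parts = line.split("\t", 3)
        if len(parts) != 4:
            continue  # ignore malformed lines at this stage
        segments.append(Segment(*parts))
    return segments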

A simple version of the context provided to the LLM:

Here is the interview of David Albec, professional football player, by journalist Lea Finch

The result was fairly qualitative, with what looked like accurate diarization and sentence reconstruction. However, instead of keeping the exact same text, it appeared slightly modified in several places. Our conclusion was that, despite our clear instructions, Gemini probably carries out more than just diarization and actually performed a partial re-transcription.

We also evaluated at this point the cost of transcription with this strategy. Below is the approximate calculation based only on audio processing:

Chirp2 price per minute: $0.016

Gemini 2.0 Flash price per minute: $0.001875

Cost per hour: 60 × (0.016 + 0.001875) = $1.0725

Chirp2 is indeed quite "expensive", about ten times more than Gemini 2.0 Flash at the time of writing, and still requires the audio to be processed by Gemini for diarization. We therefore decided to put this strategy aside for now and explore an approach using the brand-new multimodal Gemini 2.0 Flash alone, which had just left experimental mode.

Next: exploring audio transcription with Gemini 2.0 Flash

We provided Gemini with both the interview context and the audio file, requesting a structured output in a consistent format. By carefully crafting our prompt with standard LLM guidelines, we were able to specify our transcription requirements with a high degree of precision. In addition to the typical elements any prompt engineer might include, we emphasized several key instructions essential for ensuring a quality transcription (comments follow the arrows; a minimal call sketch follows the list):

• Transcribe interjections and onomatopoeia even when mid-sentence.
• Preserve the full expression of words, including slang, insults, or inappropriate language. => the model tends to alter words it considers inappropriate. For this specific point, we had to ask Google to deactivate the safety rules on our Google Cloud Project.
• Build complete sentences, paying particular attention to changes of speaker mid-sentence, for example when one speaker finishes another's sentence or interrupts. => Such errors affect diarization and accumulate throughout the transcript until the context is strong enough for the LLM to correct them.
• Normalize prolonged words or interjections like "euuuuuh" to "euh," and not "euh euh euh euh euh …" => this was a classic bug we kept encountering, called the "repetition bug," discussed in more detail below.
• Identify speakers by voice tone while using context to determine who is the journalist and who is the interviewee. => in addition, we can pass the identity of the first speaker in the prompt.
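For illustration, here is a minimal sketch of what such a call can look like with the google-genai SDK on Vertex AI. The project, location, model name, and the heavily abridged prompt are placeholder assumptions for the example, not our production configuration:

from google import genai
from google.genai import types

# Assumes the google-genai SDK is installed and Vertex AI credentials are configured.
client = genai.Client(vertexai=True, project="my-gcp-project", location="europe-west1")

# Heavily abridged example prompt; the real one is much more detailed.
TRANSCRIPTION_PROMPT = """Here is the interview of David Albec, professional football player, by journalist Lea Finch.
Transcribe the audio as tab-separated lines: speaker_id, time_start, time_stop, text.
First line: the diarization mapping (Diarization, then Speaker_1:name, Speaker_2:name).
Transcribe interjections and onomatopoeia, keep slang as spoken, normalize "euuuuuh" to "euh"."""

def call_gemini(prompt: str, audio_path: str) -> str:
    with open(audio_path, "rb") as f:
        audio_part = types.Part.from_bytes(data=f.read(), mime_type="audio/mpeg")
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[prompt, audio_part],
        config=types.GenerateContentConfig(temperature=0.0, max_output_tokens=8192),
    )
    return response.text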

Initial results were actually quite satisfying in terms of transcription, diarization, and sentence construction. Transcribing short test files made us feel like the project was nearly complete… until we tried longer files.

Dealing with Long Audio and LLM Token Limitations

Our early tests on short audio clips were encouraging, but scaling the process to longer audios quickly revealed new challenges: what initially seemed like a simple extension of our pipeline turned out to be a technical hurdle in itself. Processing files longer than a few minutes indeed revealed a series of challenges related to model constraints, token limits, and output reliability:

1. One of the first problems we encountered with long audio was the token limit: the number of output tokens exceeded the maximum allowed (max output tokens = 8192), forcing us to implement a looping mechanism by repeatedly calling Gemini while resending the previously generated transcript, the initial prompt, a continuation prompt, and the same audio file.

Here is an example of the continuation prompt we used:

Continue transcribing the audio interview from the previous result. Start processing the audio file from the previously generated text. Do not start from the beginning of the audio. Be careful to continue the previously generated content, which is available between the following tags.

2. Using this transcription loop with large data inputs seems to significantly degrade the LLM output quality, especially for timestamping. In this configuration, timestamps can drift by over 10 minutes on an hour-long interview. While a drift of a few seconds was considered compatible with our intended use, a drift of a few minutes made timestamping useless.

Our initial tests on short audios of a few minutes resulted in a drift of at most 5 to 10 seconds, and significant drift was generally observed after the first loop, once the output token limit had been reached. We conclude from these experimental observations that while this looping approach ensures continuity of the transcription fairly well, it not only leads to cumulative timestamp errors but also to a drastic loss of LLM timestamping accuracy.

3. We also encountered a recurring and particularly frustrating bug: the model would sometimes fall into a loop, repeating the same word or sentence over dozens of lines. This behavior made entire portions of the transcript unusable and typically looked something like this:

1	00:00:00	00:03:00	Hi Andrew, how are you?

2	00:03:00	00:03:00	Fine thanks.

2	00:03:00	00:03:00	Fine thanks

2	00:03:00	00:03:00	Fine thanks

2	00:03:00	00:03:00	Fine thanks.

2	00:03:00	00:03:00	Fine thanks

2	00:03:00	00:03:00	Fine thanks.

etc.

This bug seems erratic but appears more frequently with medium-quality audio, with strong background noise or a distant speaker, for example. And "in the field," that is often the case. Likewise, speaker hesitations or word repetitions seem to trigger it. We still don't know exactly what causes this "repetition bug". The Google Vertex team is aware of it but hasn't provided a clear explanation.

The consequences of this bug were especially limiting: once it occurred, the only viable solution was to restart the transcription from scratch. Unsurprisingly, the longer the audio file, the higher the probability of encountering the issue. In our tests, it affected roughly one out of every three runs on recordings longer than an hour, making it extremely difficult to deliver a reliable, production-quality service under such conditions.

4. To make things worse, resuming transcription after a max-token "cutoff" required resending the entire audio file each time. Although we only needed the next segment, the LLM would still process the full file again (without re-emitting the text already transcribed), meaning we were billed for the full audio length on every resend.

In practice, we found that the token limit was usually reached between the 15th and 20th minute of the audio. As a result, transcribing a one-hour interview typically required 4 to 5 separate LLM calls, leading to a total billing equivalent to 4 to 5 hours of audio for a single file.

With this process, the cost of audio transcription does not scale linearly. While a 15-minute audio is billed as 15 minutes in a single LLM call, a 1-hour file can effectively cost 4 hours, and a 2-hour file can climb to 16 hours, following a near-quadratic pattern (≈ 4x² hours billed, where x = audio duration in hours).
This made long audio processing not just unreliable, but also expensive for long audio files.
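For illustration, here is roughly what the continuation loop we eventually abandoned looks like. It assumes a hypothetical call_gemini_with_status() variant of the earlier helper that also reports whether generation stopped on the output-token limit (for example by checking the candidate's finish reason); the <previous_result> tags are ours for the example:

CONTINUATION_PROMPT = (
    "Continue transcribing the audio interview from the previous result. "
    "Do not start from the beginning of the audio."
)

def transcribe_full_audio_naive(prompt: str, audio_path: str) -> str:
    transcript = ""
    current_prompt = prompt
    while True:
        # The whole audio file is re-sent, and therefore re-billed, on every iteration.
        text, hit_token_limit = call_gemini_with_status(current_prompt, audio_path)
        transcript += text
        if not hit_token_limit:
            return transcript
        # Resend the initial prompt, the continuation prompt, and everything
        # generated so far so the model can pick up where it stopped.
        current_prompt = (
            f"{prompt}\n{CONTINUATION_PROMPT}\n"
            f"<previous_result>{transcript}</previous_result>"
        )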

    Pivoting to Chunked Audio Transcription

Given these major limitations, and being much more confident in the ability of the LLM to handle text-based tasks than audio, we decided to shift our approach and isolate the audio transcription step in order to maintain high transcription quality. A quality transcription is indeed the core of the requirement, and it makes sense to put this part of the process at the heart of the strategy.

At this point, splitting the audio into chunks became the best solution. Not only did it seem likely to greatly improve timestamp accuracy, by avoiding both the timestamping degradation after looping and the cumulative drift, but it would also reduce the price since each chunk would ideally be processed only once. While it introduced new uncertainties around merging partial transcriptions, the tradeoff seemed to be in our favor.

We thus focused on breaking long audio into shorter chunks that would fit in a single LLM transcription request. During our tests, we observed that issues like repetition loops or timestamp drift typically began around the 18-minute mark in most interviews. It became clear that we should use 15-minute (or shorter) chunks to be safe. Why not use 5-minute chunks? The quality improvement looked minimal to us while tripling the number of segments. In addition, shorter chunks reduce the overall context, which can hurt diarization.

Although this setup drastically minimized the repetition bug, we noticed that it still occasionally occurred. In a desire to provide the best possible service, we definitely wanted an efficient countermeasure to this problem, and we identified an opportunity in our previously annoying output token limit: with 10-minute chunks, we could be confident that the token limit would not be exceeded in nearly all cases. Thus, if the token limit was hit, we knew for sure the repetition bug had occurred and could restart that chunk's transcription (sketched below). This pragmatic strategy turned out to be very effective at detecting and avoiding the bug. Great news.

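A sketch of this safeguard, reusing the hypothetical call_gemini_with_status() helper from above (the retry count and error handling are ours, for illustration only):

MAX_RETRIES = 3

def transcribe_chunk_safely(chunk_path: str) -> str:
    # A 10-minute chunk should never need 8192 output tokens, so hitting the
    # limit is treated as a symptom of the repetition bug and the chunk is retried.
    for _ in range(MAX_RETRIES):
        text, hit_token_limit = call_gemini_with_status(TRANSCRIPTION_PROMPT, chunk_path)
        if not hit_token_limit:
            return text
    raise RuntimeError(f"Suspected repetition bug {MAX_RETRIES} times on {chunk_path}")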
Correcting the audio chunk transcriptions

With good transcripts of the 10-minute audio chunks in hand, we implemented at this stage an algorithmic post-processing of each transcript to handle minor issues (a simplified correction sketch follows the list):

• Removal of header tags like tsv or json added at the beginning and the end of the transcription content:

Despite optimizing the prompt, we couldn't fully eliminate this side effect without hurting the transcription quality. Since it is easily handled algorithmically, we chose to do so.

• Replacing speaker IDs with names:

Speaker identification by name only starts once the LLM has enough context to determine who is the journalist and who is being interviewed. This results in incomplete diarization at the beginning of the transcript, with early segments using numeric IDs (first speaker in the chunk = 1, etc.). Moreover, since each chunk may have a different ID order (the first person to talk being speaker 1), this would create confusion during merging. We instructed the LLM to only use IDs during the transcription process and to provide a diarization mapping in the first line. The speaker IDs are therefore replaced during the algorithmic correction and the diarization header line removed.

• Rarely, malformed or empty transcript lines are encountered. These lines are deleted, but we flag them with a note to the user, "formatting issue on this line", so users are at least aware of a potential content loss and can eventually correct it by hand. In our final optimized version, such lines were extremely rare.
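A simplified sketch of this correction step (the tag patterns and the header layout follow the format described earlier; the exact rules in our production code differ):

import re

def correct_chunk_transcript(raw: str) -> list[str]:
    lines = [l.strip() for l in raw.strip().splitlines() if l.strip()]
    # 1. Drop wrapper tags such as ```tsv, ```json or ``` added around the output.
    lines = [l for l in lines if not l.startswith("```")]
    # 2. Read the diarization header ("Diarization<TAB>Speaker_1:Lea Finch<TAB>...")
    #    to build the numeric-ID -> name mapping, then drop the header line.
    speakers = {}
    if lines and lines[0].startswith("Diarization"):
        for field in lines.pop(0).split("\t")[1:]:
            spk_id, name = field.split(":", 1)          # "Speaker_1", "Lea Finch"
            speakers[spk_id.split("_")[1]] = name.strip()
    cleaned = []
    for line in lines:
        parts = line.split("\t")
        # 3. Flag malformed lines instead of silently losing content.
        if len(parts) != 4 or not re.match(r"^\d{2}:\d{2}:\d{2}$", parts[1]):
            cleaned.append("[formatting issue on this line]")
            continue
        spk, start, stop, text = parts
        cleaned.append("\t".join([speakers.get(spk, spk), start, stop, text]))
    return cleaned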

Merging chunks and maintaining content continuity

At the earlier audio chunking stage, we initially tried to make chunks with clean cuts. Unsurprisingly, this led to words or even full sentences being lost at the cut points. So we naturally switched to overlapping chunk cuts to avoid such content loss, leaving the optimization of the overlap size to the chunk merging process.

Without a clean cut between chunks, the possibility of merging the chunks algorithmically disappeared. For the same audio input, the transcript lines can come out quite differently, with breaks at different points of the sentences and even filler words or hesitations being rendered differently. In such a situation, it is complex, not to say impossible, to build an effective algorithm for a clean merge.

This left us with the LLM option, of course. Quickly, a few tests showed that the LLM could better merge segments together when the overlaps included full sentences. A 30-second overlap proved sufficient. With a 10-minute audio chunk structure, this implies the following chunk cuts (a chunking sketch follows the figure):

• 1st transcript: 0 to 10 minutes
• 2nd transcript: 9m30s to 19m30s
• 3rd transcript: 19m to 29m …and so on.
Image by the authors
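A minimal chunking sketch with pydub, assuming the whole recording loads as a single file and using the 10-minute window with a 30-second overlap discussed above:

from pydub import AudioSegment  # assumes pydub (and ffmpeg) is available

CHUNK_MIN = 10        # chunk length in minutes
OVERLAP_SEC = 30      # overlap between consecutive chunks

def split_audio(path: str, out_prefix: str) -> list[str]:
    audio = AudioSegment.from_file(path)
    chunk_ms = CHUNK_MIN * 60 * 1000                    # 10-minute window
    step_ms = chunk_ms - OVERLAP_SEC * 1000             # next chunk starts 9m30s later
    paths = []
    start = 0
    while start < len(audio):
        chunk = audio[start:start + chunk_ms]
        chunk_path = f"{out_prefix}_{len(paths):03d}.mp3"
        chunk.export(chunk_path, format="mp3")
        paths.append(chunk_path)
        start += step_ms
    return paths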

These overlapping chunk transcripts were corrected by the previously described algorithm and sent to the LLM for merging to reconstruct the full audio transcript. The idea was to send the full set of chunk transcripts with a prompt instructing the LLM to merge them and return the complete merged transcript in TSV format, as in the previous LLM transcription step. In this configuration, the merging process has essentially three quality criteria:

1. Ensure transcription continuity without content loss or duplication.
2. Adjust timestamps to resume from where the previous chunk ended.
3. Preserve diarization.

As expected, the output token limit was exceeded, forcing us into an LLM call loop. However, since we were now using text input, we were more confident in the reliability of the LLM… probably too much. The result of the merge was satisfactory in general but prone to several issues: tag insertions, multi-line entries merged into one line, incomplete lines, and even hallucinated continuations of the interview. Despite many prompt optimizations, we couldn't obtain sufficiently reliable results for production use.

As with audio transcription, we identified the amount of input information as the main issue. We were sending several hundred, even thousands of text lines containing the prompt, the set of partial transcripts to fuse, a roughly comparable amount for the previous transcript, and a few more with the prompt and its example. Definitely too much for a precise application of our set of instructions.

On the plus side, timestamp accuracy did indeed improve significantly with this chunking strategy: we maintained a drift of just 5 to 10 seconds at most on transcriptions over an hour long. As the start of a transcript should have minimal timestamp drift, we instructed the LLM to use the timestamps of the "ending chunk" as the reference for the fusion and to correct any drift by one second per sentence. This made the cut points seamless and preserved overall timestamp accuracy.

    Splitting the chunk transcripts for full transcript reconstruction

In a modular approach similar to the workaround we used for transcription, we decided to carry out the merge of the transcripts separately, in order to avoid the previously described issues. To do so, each 10-minute transcript is split into three parts based on the start_time of the segments:

• Overlap segment to merge at the beginning: 0 to 1 minute
• Main segment to keep as-is: 1 to 9 minutes
• Overlap segment to merge at the end: 9 to 10 minutes

NB: Since every chunk, including the first and last ones, is processed the same way, the overlap at the beginning of the first chunk is directly merged with its main segment, and the overlap at the end of the last chunk (if there is one) is merged accordingly.

The end and start segments are then sent in pairs to be merged. As expected, the quality of the output drastically increased, resulting in an efficient and reliable merge between the chunk transcripts. With this procedure, the response of the LLM proved to be highly reliable and showed none of the previously mentioned errors encountered during the looping process.
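A compact sketch of this split-and-merge step, reusing the Segment structure from earlier. Chunk-relative timestamps are compared as zero-padded "HH:MM:SS" strings, and merge_with_llm() is a hypothetical wrapper standing in for the Gemini fusion call with its dedicated prompt:

def split_chunk_segments(segments: list[Segment]) -> tuple[list, list, list]:
    start_overlap, main, end_overlap = [], [], []
    for seg in segments:
        if seg.start < "00:01:00":       # first minute: overlap with the previous chunk
            start_overlap.append(seg)
        elif seg.start < "00:09:00":     # minutes 1 to 9: main content kept as-is
            main.append(seg)
        else:                            # last minute: overlap with the next chunk
            end_overlap.append(seg)
    return start_overlap, main, end_overlap

def merge_overlaps(chunks: list[list[Segment]]) -> list[list[Segment]]:
    splits = [split_chunk_segments(c) for c in chunks]
    merged = []
    for i in range(len(splits) - 1):
        end_of_current = splits[i][2]
        start_of_next = splits[i + 1][0]
        # merge_with_llm() sends the two overlapping fragments to Gemini with the
        # fusion prompt and parses the TSV answer back into Segment objects.
        merged.append(merge_with_llm(end_of_current, start_of_next))
    return merged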

The process of transcript assembly for an audio of 28 minutes 42 seconds:

Image by the authors

    Full transcript reconstruction

At this final stage, the only remaining task was to reconstruct the complete transcript from the processed splits. To achieve this, we algorithmically combined the main content segments with their corresponding merged overlaps, alternately.
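Algorithmically, this last step is a simple interleaving of the main segments with the merged overlaps, continuing the sketches above (the first chunk's start overlap and the last chunk's end overlap are kept as-is, as described in the NB earlier):

def reconstruct_transcript(chunks: list[list[Segment]]) -> list[Segment]:
    splits = [split_chunk_segments(c) for c in chunks]
    merged = merge_overlaps(chunks)
    # First chunk: its start overlap has no predecessor, so keep it directly.
    full = splits[0][0] + splits[0][1]
    for i in range(1, len(splits)):
        full += merged[i - 1]   # fused overlap between chunk i-1 and chunk i
        full += splits[i][1]    # main content of chunk i
    # Last chunk: append its trailing overlap if the audio did not end before it.
    full += splits[-1][2]
    return full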

Overall process overview

The overall process involves 6 steps, 2 of which are carried out by Gemini (a sketch tying the pieces together follows the figure):

1. Chunking the audio into overlapping audio chunks
2. Transcribing each chunk into a partial text transcript (LLM step)
3. Correcting the partial transcripts
4. Splitting the audio chunk transcripts into start, main, and end text splits
5. Fusing the end and start splits of each pair of chunk splits (LLM step)
6. Reconstructing the full transcript
Image by the authors
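Putting the previous sketches together, the whole pipeline fits in a few lines. This is an outline built on the hypothetical helpers defined throughout the article, not drop-in production code:

def transcribe_interview(audio_path: str) -> list[Segment]:
    # 1. Chunk the audio into overlapping 10-minute segments.
    chunk_paths = split_audio(audio_path, out_prefix="chunk")
    # 2-3. Transcribe each chunk (LLM step), then apply the algorithmic corrections.
    chunk_segments = []
    for path in chunk_paths:
        raw = transcribe_chunk_safely(path)
        cleaned_lines = correct_chunk_transcript(raw)
        chunk_segments.append(parse_tsv_transcript("\n".join(cleaned_lines)))
    # 4-6. Split each chunk transcript, fuse the overlaps (LLM step) and reassemble.
    return reconstruct_transcript(chunk_segments)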

The overall process takes about 5 minutes per hour of transcription, served to the user by an asynchronous tool. Quite reasonable considering the amount of work done behind the scenes, and all this for a fraction of the price of other tools or pre-built Google models like Chirp2.

One additional improvement that we considered but ultimately decided not to implement was timestamp correction. We observed that timestamps at the end of each chunk typically ran about 5 seconds ahead of the actual audio. A simple solution could have been to incrementally adjust the timestamps algorithmically, by roughly one second every two minutes, to correct most of this drift. However, we chose not to implement this adjustment, as the minor discrepancy was acceptable for our business needs.

    Conclusion

Building a high-quality, scalable transcription pipeline for long interviews turned out to be much more complex than simply picking the "right" Speech-to-Text model. Our journey with Google's Vertex AI and Gemini models highlighted key challenges around diarization, timestamping, cost-efficiency, and long audio handling, especially when aiming to capture the full information of an audio recording.

Using careful prompt engineering, smart audio chunking strategies, and iterative refinements, we were able to build a robust system that balances accuracy, performance, and operational cost, turning an initially fragmented process into a smooth, production-ready pipeline.

There is still room for improvement, but this workflow now forms a solid foundation for scalable, high-fidelity audio transcription. As LLMs continue to evolve and APIs become more flexible, we're optimistic about even more streamlined solutions in the near future.

    Key takeaways

• No Vertex AI S2T model met all our needs: Google Vertex AI provides specialized models, but each has limitations in terms of transcription accuracy, diarization, or timestamping for long audios.
• Token limits and long prompts drastically affect transcription quality: Gemini's output token limitation significantly degrades transcription quality for long audios, requiring heavily prompted looping strategies and ultimately forcing us to shift to shorter audio chunks.
• Chunked audio transcription and transcript reconstruction significantly improve quality and cost-efficiency:
  Splitting audio into 10-minute overlapping segments minimized critical bugs like repeated sentences and timestamp drift, enabling higher-quality results and drastically reduced costs.
• Careful prompt engineering remains essential: precision in prompts, especially regarding diarization and interjections for transcription, as well as transcript fusion, proved to be crucial for reliable LLM performance.
• Short transcript fusion merges maximize reliability:
  Splitting each chunk transcript into smaller segments, with end-to-start merging of the overlaps, provided high accuracy and prevented frequent LLM issues like hallucinations or incorrect formatting.


