How To Redact Sensitive Data Before Passing To LLM-OpenAI

Shweta Lodha
5 min read · Nov 12, 2023

In my previous article, “Passing An Audio File To LLM”, I explained how to pass an audio file to an LLM. This article builds on that one by addressing a related use case: audio that contains sensitive information.

Let’s say an audio file contains your bank account number, secure ID, PIN, passcode, date of birth, or any other information that must be kept secure. You will come across this kind of information when dealing with customer-facing audio calls, especially in the finance sector. These details, known as PII (Personally Identifiable Information), are very sensitive, and it is not at all safe to keep them unprotected on any server. Hence, one should be very careful while dealing with this kind of data.

Now, when it comes to using data that contains PII in a generative AI application, we need a way to remove such information from the data before passing it to the LLM, and that’s what this article is all about.

In this article, I’ll show you a very quick way to redact such sensitive information from an audio file and save it back, so that the updated audio file can be transcribed and sent to the LLM.
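To make the idea of "redact first, then send" concrete, here is a minimal sketch that redacts plain transcript text. It is only a regex-based stand-in and not the approach used in this article, which relies on AssemblyAI to redact the audio itself. The patterns and the `redact_text` helper are hypothetical illustrations and would miss many real-world PII formats.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only;
# real PII detection needs far more than a few regexes.
PII_PATTERNS = {
    "ACCOUNT_NUMBER": re.compile(r"\b\d{9,18}\b"),
    "DATE_OF_BIRTH": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "PIN": re.compile(r"(?i)(?<=\bpin is )\d{4,6}\b"),
}


def redact_text(text: str) -> str:
    """Replace every match of each pattern with a [LABEL] placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


transcript = (
    "My account number is 123456789012, my pin is 4321 "
    "and I was born on 04/15/1990."
)

# Only the redacted version should ever be passed on to an LLM.
safe_transcript = redact_text(transcript)
print(safe_transcript)
```

The point of the sketch is the ordering: the sensitive values are replaced with placeholders before any text leaves your environment, so the LLM only ever sees the redacted version.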

High-level steps

To execute the solution end to end, we need to work with the following components/libraries:

Redaction And Transcription

  • For redaction and transcript generation, we will be using AssemblyAI

Embedding Generator

  • For generating the embeddings, we will be using OpenAIEmbeddings

Vector Database

  • Chroma will be used as an in-memory database for storing the vectors

Large Language Model

  • OpenAI as LLM

All of these are wrapped under a library called LangChain, so we will be relying on it heavily too.
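As a rough sketch of what the redaction step asks of AssemblyAI, the snippet below builds the JSON body for a transcription request with PII redaction turned on. The field names (`redact_pii`, `redact_pii_audio`, `redact_pii_policies`, `redact_pii_sub`) and the policy names follow AssemblyAI's REST API as I understand it, so treat them as assumptions and confirm them against the current documentation. The `build_redaction_request` helper and the audio URL are hypothetical, and nothing is sent over the network here.

```python
import json


def build_redaction_request(audio_url: str) -> dict:
    """Build a request body asking for a redacted transcript and redacted audio."""
    return {
        "audio_url": audio_url,
        "redact_pii": True,          # redact PII in the transcript text
        "redact_pii_audio": True,    # also request a copy of the audio with PII removed
        "redact_pii_policies": [     # categories of PII to redact (assumed names)
            "banking_information",
            "date_of_birth",
            "person_name",
        ],
        "redact_pii_sub": "hash",    # substitute redacted text with '#' characters
    }


# Placeholder URL, used only to show the shape of the payload.
payload = build_redaction_request("https://example.com/customer-call.mp3")
print(json.dumps(payload, indent=2))
```

Keeping the redaction settings in one small function makes it easy to review exactly which categories of PII are being removed before the transcript reaches the embedding and LLM steps.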

First of all, we need to grab the keys as shown below:

Get An OpenAI API Key

To get the OpenAI key, you need to go to https://openai.com/, log in, and then grab the keys using…
