Get Answers From Your PDF — Azure OpenAI and Langchain

Shweta Lodha
2 min read · Jun 6, 2023

In this article, I’ll walk you through all the steps required to query your PDFs and get answers out of them using Azure OpenAI.

Image by Denys Vitali from Pixabay

Let’s get started by importing the required packages.

Import Required Packages

Here are the packages we need to import to get started:

from dotenv import load_dotenv
from langchain.document_loaders import UnstructuredFileLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import AzureOpenAI

Set Environment Variables

First of all, we need to set a few variables with information from the Azure portal and Azure OpenAI Studio:

OPENAI_API_TYPE = "Azure"
OPENAI_API_VERSION = "2022-12-01"
OPENAI_API_BASE = "ENDPOINT"
OPENAI_API_KEY = "API_KEY"
DEPLOYMENT_NAME = "DEPLOYMENT_NAME_FROM_AI_STUDIO"

If you are not sure how to grab the above values, I would recommend watching my video below on this.

Next, we will go ahead and use the above variables to set the environment variables:

import os

os.environ["OPENAI_API_TYPE"] = OPENAI_API_TYPE
os.environ["OPENAI_API_VERSION"] = OPENAI_API_VERSION
os.environ["OPENAI_API_BASE"] = OPENAI_API_BASE
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
load_dotenv()
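
Since load_dotenv() is called here, you can alternatively keep these values out of your code in a .env file placed next to the script (a sketch; the endpoint, key, and deployment values are placeholders you replace with your own):

OPENAI_API_TYPE=Azure
OPENAI_API_VERSION=2022-12-01
OPENAI_API_BASE=<your-endpoint>
OPENAI_API_KEY=<your-api-key>
DEPLOYMENT_NAME=<your-deployment-name>

load_dotenv() reads this file and populates the same environment variables, so the explicit os.environ assignments above are then no longer needed.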

Load PDF

Next, we will load our PDF using the UnstructuredFileLoader class, which comes with Langchain.

loader = UnstructuredFileLoader('Sample.pdf')
documents = loader.load()

Split Documents Into Chunks

Once the PDF is loaded, we need to divide the large text into chunks. You can define the chunk size based on your needs; here I’m taking a chunk size of 800 and a chunk overlap of 0.
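
To make the idea concrete, here is a minimal plain-Python sketch of fixed-size character chunking (my own illustration of the concept, not Langchain’s actual implementation — the function name and sample text are hypothetical):

```python
def chunk_text(text, chunk_size=800, chunk_overlap=0):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters, so the
    window advances by chunk_size - chunk_overlap each step.
    """
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 2,000 characters with chunk_size=800 and no overlap
# yield chunks of 800, 800, and 400 characters.
chunks = chunk_text("A" * 2000)
```

In Langchain itself, the same step is `CharacterTextSplitter(chunk_size=800, chunk_overlap=0)` followed by `split_documents(documents)`, which returns the chunked documents ready for embedding.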
