Speak to your slide deck utilizing multimodal basis fashions hosted on Amazon Bedrock

In Part 1 of this collection, we offered an answer that used the Amazon Titan Multimodal Embeddings mannequin to transform particular person slides from a slide deck into embeddings. We saved the embeddings in a vector database after which used the Large Language-and-Vision Assistant (LLaVA 1.5-7b) mannequin to generate textual content responses to person questions primarily based on essentially the most related slide retrieved from the vector database. We used AWS companies together with Amazon Bedrock, Amazon SageMaker, and Amazon OpenSearch Serverless on this answer.

On this put up, we exhibit a distinct strategy. We use the Anthropic Claude 3 Sonnet mannequin to generate textual content descriptions for every slide within the slide deck. These descriptions are then transformed into textual content embeddings utilizing the Amazon Titan Text Embeddings mannequin and saved in a vector database. Then we use the Claude 3 Sonnet mannequin to generate solutions to person questions primarily based on essentially the most related textual content description retrieved from the vector database.

You’ll be able to check each approaches in your dataset and consider the outcomes to see which strategy offers you the most effective outcomes. In Half 3 of this collection, we consider the outcomes of each strategies.

Answer overview

The answer gives an implementation for answering questions utilizing info contained in textual content and visible parts of a slide deck. The design depends on the idea of Retrieval Augmented Era (RAG). Historically, RAG has been related to textual information that may be processed by massive language fashions (LLMs). On this collection, we lengthen RAG to incorporate photographs as nicely. This gives a strong search functionality to extract contextually related content material from visible parts like tables and graphs together with textual content.

This answer consists of the next parts:

Amazon Titan Textual content Embeddings is a textual content embeddings mannequin that converts pure language textual content, together with single phrases, phrases, and even massive paperwork, into numerical representations that can be utilized to energy use instances similar to search, personalization, and clustering primarily based on semantic similarity.
Claude 3 Sonnet is the subsequent technology of state-of-the-art fashions from Anthropic. Sonnet is a flexible instrument that may deal with a variety of duties, from advanced reasoning and evaluation to speedy outputs, in addition to environment friendly search and retrieval throughout huge quantities of data.
OpenSearch Serverless is an on-demand serverless configuration for Amazon OpenSearch Service. We use OpenSearch Serverless as a vector database for storing embeddings generated by the Amazon Titan Textual content Embeddings mannequin. An index created within the OpenSearch Serverless assortment serves because the vector retailer for our RAG answer.
Amazon OpenSearch Ingestion (OSI) is a totally managed, serverless information collector that delivers information to OpenSearch Service domains and OpenSearch Serverless collections. On this put up, we use an OSI pipeline API to ship information to the OpenSearch Serverless vector retailer.

The answer design consists of two components: ingestion and person interplay. Throughout ingestion, we course of the enter slide deck by changing every slide into a picture, producing descriptions and textual content embeddings for every picture. We then populate the vector information retailer with the embeddings and textual content description for every slide. These steps are accomplished previous to the person interplay steps.

Within the person interplay section, a query from the person is transformed into textual content embeddings. A similarity search is run on the vector database to discover a textual content description comparable to a slide that might doubtlessly comprise solutions to the person query. We then present the slide description and the person query to the Claude 3 Sonnet mannequin to generate a solution to the question. All of the code for this put up is offered within the GitHub repo.

The next diagram illustrates the ingestion structure.

The workflow consists of the next steps:

Slides are transformed to picture recordsdata (one per slide) in JPG format and handed to the Claude 3 Sonnet mannequin to generate textual content description.
The information is shipped to the Amazon Titan Textual content Embeddings mannequin to generate embeddings. On this collection, we use the slide deck Train and deploy Stable Diffusion using AWS Trainium & AWS Inferentia from the AWS Summit in Toronto, June 2023 to exhibit the answer. The pattern deck has 31 slides, due to this fact we generate 31 units of vector embeddings, every with 1536 dimensions. We add further metadata fields to carry out wealthy search queries utilizing OpenSearch’s highly effective search capabilities.
The embeddings are ingested into an OSI pipeline utilizing an API name.
The OSI pipeline ingests the information as paperwork into an OpenSearch Serverless index. The index is configured because the sink for this pipeline and is created as a part of the OpenSearch Serverless assortment.

The next diagram illustrates the person interplay structure.

The workflow consists of the next steps:

A person submits a query associated to the slide deck that has been ingested.
The person enter is transformed into embeddings utilizing the Amazon Titan Textual content Embeddings mannequin accessed utilizing Amazon Bedrock. An OpenSearch Service vector search is carried out utilizing these embeddings. We carry out a k-nearest neighbor (k-NN) search to retrieve essentially the most related embeddings matching the person question.
The metadata of the response from OpenSearch Serverless incorporates a path to the picture and outline comparable to essentially the most related slide.
A immediate is created by combining the person query and the picture description. The immediate is supplied to Claude 3 Sonnet hosted on Amazon Bedrock.
The results of this inference is returned to the person.

We focus on the steps for each levels within the following sections, and embrace particulars in regards to the output.

Stipulations

To implement the answer supplied on this put up, you need to have an AWS account and familiarity with FMs, Amazon Bedrock, SageMaker, and OpenSearch Service.

This answer makes use of the Claude 3 Sonnet and Amazon Titan Textual content Embeddings fashions hosted on Amazon Bedrock. Guarantee that these fashions are enabled to be used by navigating to the Mannequin entry web page on the Amazon Bedrock console.

If fashions are enabled, the Entry standing will state Entry granted.

If the fashions will not be accessible, allow entry by selecting Handle mannequin entry, choosing the fashions, and selecting Request mannequin entry. The fashions are enabled to be used instantly.

Use AWS CloudFormation to create the answer stack

You need to use AWS CloudFormation to create the answer stack. In case you have created the answer for Half 1 in the identical AWS account, make sure to delete that earlier than creating this stack.

AWS Area	Hyperlink
`us-east-1`
`us-west-2`

After the stack is created efficiently, navigate to the stack’s Outputs tab on the AWS CloudFormation console and word the values for MultimodalCollectionEndpoint and OpenSearchPipelineEndpoint. You utilize these within the subsequent steps.

The CloudFormation template creates the next sources:

IAM roles – The next AWS Identity and Access Management (IAM) roles are created. Replace these roles to use least-privilege permissions, as mentioned in Security best practices.
- SMExecutionRole with Amazon Simple Storage Service (Amazon S3), SageMaker, OpenSearch Service, and Amazon Bedrock full entry.
- OSPipelineExecutionRole with entry to the S3 bucket and OSI actions.
SageMaker pocket book – All code for this put up is run utilizing this pocket book.
OpenSearch Serverless assortment – That is the vector database for storing and retrieving embeddings.
OSI pipeline – That is the pipeline for ingesting information into OpenSearch Serverless.
S3 bucket – All information for this put up is saved on this bucket.

The CloudFormation template units up the pipeline configuration required to configure the OSI pipeline with HTTP as supply and the OpenSearch Serverless index as sink. The SageMaker pocket book 2_data_ingestion.ipynb shows the right way to ingest information into the pipeline utilizing the Requests HTTP library.

The CloudFormation template additionally creates network, encryption and data access insurance policies required in your OpenSearch Serverless assortment. Replace these insurance policies to use least-privilege permissions.

The CloudFormation template title and OpenSearch Service index title are referenced within the SageMaker pocket book 3_rag_inference.ipynb. For those who change the default names, be sure to replace them within the pocket book.

Check the answer

After you might have created the CloudFormation stack, you’ll be able to check the answer. Full the next steps:

On the SageMaker console, select Notebooks within the navigation pane.
Choose MultimodalNotebookInstance and select Open JupyterLab.
In File Browser, traverse to the notebooks folder to see notebooks and supporting recordsdata.

The notebooks are numbered within the sequence through which they run. Directions and feedback in every pocket book describe the actions carried out by that pocket book. We run these notebooks one after the other.

Select 1_data_prep.ipynb to open it in JupyterLab.
On the Run menu, select Run All Cells to run the code on this pocket book.

This pocket book will obtain a publicly accessible slide deck, convert every slide into the JPG file format, and add these to the S3 bucket.

Select 2_data_ingestion.ipynb to open it in JupyterLab.
On the Run menu, select Run All Cells to run the code on this pocket book.

On this pocket book, you create an index within the OpenSearch Serverless assortment. This index shops the embeddings information for the slide deck. See the next code:

session = boto3.Session()
credentials = session.get_credentials()
auth = AWSV4SignerAuth(credentials, g.AWS_REGION, g.OS_SERVICE)

os_client = OpenSearch(
  hosts = [{'host': host, 'port': 443}],
  http_auth = auth,
  use_ssl = True,
  verify_certs = True,
  connection_class = RequestsHttpConnection,
  pool_maxsize = 20
)

index_body = """
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "vector_embedding": {
        "sort": "knn_vector",
        "dimension": 1536,
        "technique": {
          "title": "hnsw",
          "engine": "nmslib",
          "parameters": {}
        }
      },
      "image_path": {
        "sort": "textual content"
      },
      "slide_text": {
        "sort": "textual content"
      },
      "slide_number": {
        "sort": "textual content"
      },
      "metadata": { 
        "properties" :
          {
            "filename" : {
              "sort" : "textual content"
            },
            "desc":{
              "sort": "textual content"
            }
          }
      }
    }
  }
}
"""
index_body = json.masses(index_body)
attempt:
  response = os_client.indices.create(index_name, physique=index_body)
  logger.information(f"response acquired for the create index -> {response}")
besides Exception as e:
  logger.error(f"error in creating index={index_name}, exception={e}")

You utilize the Claude 3 Sonnet and Amazon Titan Textual content Embeddings fashions to transform the JPG photographs created within the earlier pocket book into vector embeddings. These embeddings and extra metadata (such because the S3 path and outline of the picture file) are saved within the index together with the embeddings. The next code snippet exhibits how Claude 3 Sonnet generates picture descriptions:

def get_img_desc(image_file_path: str, immediate: str):
    # learn the file, MAX picture measurement supported is 2048 * 2048 pixels
    with open(image_file_path, "rb") as image_file:
        input_image_b64 = image_file.learn().decode('utf-8')
  
    physique = json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1000,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": "image/jpeg",
                                "data": input_image_b64
                            },
                        },
                        {"type": "text", "text": prompt},
                    ],
                }
            ],
        }
    )
    
    response = bedrock.invoke_model(
        modelId=g.CLAUDE_MODEL_ID,
        physique=physique
    )

    resp_body = json.masses(response['body'].learn().decode("utf-8"))
    resp_text = resp_body['content'][0]['text'].exchange('"', "'")

    return resp_text

The picture descriptions are handed to the Amazon Titan Textual content Embeddings mannequin to generate vector embeddings. These embeddings and extra metadata (such because the S3 path and outline of the picture file) are saved within the index together with the embeddings. The next code snippet exhibits the decision to the Amazon Titan Textual content Embeddings mannequin:

def get_text_embedding(bedrock: botocore.shopper, prompt_data: str) -> np.ndarray:
    physique = json.dumps({
        "inputText": prompt_data,
    })    
    attempt:
        response = bedrock.invoke_model(
            physique=physique, modelId=g.TITAN_MODEL_ID, settle for=g.ACCEPT_ENCODING, contentType=g.CONTENT_ENCODING
        )
        response_body = json.masses(response['body'].learn())
        embedding = response_body.get('embedding')
    besides Exception as e:
        logger.error(f"exception={e}")
        embedding = None

    return embedding

The information is ingested into the OpenSearch Serverless index by making an API name to the OSI pipeline. The next code snippet exhibits the decision made utilizing the Requests HTTP library:

information = json.dumps([{
    "image_path": input_image_s3, 
    "slide_text": resp_text, 
    "slide_number": slide_number, 
    "metadata": {
        "filename": obj_name, 
        "desc": "" 
    }, 
    "vector_embedding": embedding
}])

r = requests.request(
    technique='POST', 
    url=osi_endpoint, 
    information=information,
    auth=AWSSigV4('osis'))

Select 3_rag_inference.ipynb to open it in JupyterLab.
On the Run menu, select Run All Cells to run the code on this pocket book.

This pocket book implements the RAG answer: you exchange the person query into embeddings, discover a related picture description from the vector database, and supply the retrieved description to Claude 3 Sonnet to generate a solution to the person query. You utilize the next immediate template:

  llm_prompt: str = """

  Human: Use the abstract to offer a concise reply to the query to the most effective of your talents. For those who can not reply the query from the context then say I have no idea, don't make up a solution.
  <query>
  {query}
  </query>

  <abstract>
  {abstract}
  </abstract>

  Assistant:"""

The next code snippet gives the RAG workflow:

def get_llm_response(bedrock: botocore.shopper, query: str, abstract: str) -> str:
    immediate = llm_prompt.format(query=query, abstract=abstract)
    
    physique = json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    })
        
    attempt:
        response = bedrock.invoke_model(
        modelId=g.CLAUDE_MODEL_ID,
        physique=physique)

        response_body = json.masses(response['body'].learn().decode("utf-8"))
        llm_response = response_body['content'][0]['text'].exchange('"', "'")
        
    besides Exception as e:
        logger.error(f"exception whereas slide_text={abstract[:10]}, exception={e}")
        llm_response = None

    return llm_response


# create immediate and convert to embeddings query: str = "How does Inf2 evaluate in efficiency to comparable EC2 cases? I want numbers." text_embedding = get_text_embedding(bedrock, query) # vector db search vector_db_response: Dict = find_similar_data(text_embeddings) # obtain picture for native pocket book show s3_img_path = vector_db_response.get('hits', {}).get('hits')[0].get('_source').get('image_path') logger.information(f"going to reply the query="{query}" utilizing the picture "{s3_img_path}"") !aws s3 cp {s3_img_path} . local_img_path = os.path.basename(s3_img_path) show(filename=local_img_path) # Ask Claude 3 Sonnet slide_text = vector_db_response.get('hits', {}).get('hits')[0].get('_source').get('slide_text') llm_response = get_llm_response(bedrock, query, slide_text) print(llm_response)

Outcomes

The next desk incorporates some person questions and responses generated by our implementation. The Query column captures the person query, and the Reply column is the textual response generated by Claude 3 Sonnet. The Picture column exhibits the k-NN slide match returned by the OpenSearch Serverless vector search.

Multimodal RAG outcomes

Query	Reply	Picture
How does Inf2 evaluate in efficiency to comparable EC2 cases? I want numbers.	Based mostly on the data supplied within the abstract, the important thing efficiency comparability numbers given for the Amazon EC2 Inf2 cases in comparison with different cases are: 1. As much as 4x larger throughput. 2. As much as 10x decrease latency. The abstract doesn’t present particular numbers evaluating the Inf2 cases to different EC2 occasion sorts. Nevertheless, it highlights the throughput and latency enhancements as the primary efficiency benefits of the Inf2 cases over comparable cases for working generative AI fashions.
As per the AI/ML flywheel, what do the AWS AI/ML companies present?	Based mostly on the abstract, the AWS AI/ML companies present a variety of capabilities that gas an AI/ML flywheel. This flywheel consists of innovation in creating new and higher algorithms, enabling extra use-cases and functions, driving wider adoption and demand, which in flip results in additional funding in analysis and optimization to drive extra innovation. The AWS AI/ML companies appear to supply the instruments, sources, and infrastructure to assist this steady cycle of innovation, software improvement, adoption, and reinvestment within the area of synthetic intelligence and machine studying.
In comparison with GPT-2, what number of extra parameters does GPT-3 have? What’s the numerical distinction between the parameter measurement of GPT-2 and GPT-3?	In keeping with the data supplied within the abstract, GPT-3 from 2020 had 175B (175 billion) parameters, whereas GPT-2 from 2019 had 1.5B (1.5 billion) parameters. Subsequently, the numerical distinction between the parameter measurement of GPT-2 and GPT-3 is 173.5 billion parameters. GPT-3 has roughly 115 instances extra parameters than GPT-2.
What are quarks in particle physics?	Sadly, the given abstract doesn’t comprise any details about quarks in particle physics. The abstract describes a picture associated to the development of pure language processing and generative AI applied sciences, nevertheless it doesn’t point out something about particle physics or the idea of quarks.

Question your index

You need to use OpenSearch Dashboards to work together with the OpenSearch API to run fast checks in your index and ingested information.

Cleanup

To keep away from incurring future fees, delete the sources. You are able to do this by deleting the stack utilizing the AWS CloudFormation console.

Conclusion

Enterprises generate new content material on a regular basis, and slide decks are a typical technique to share and disseminate info internally inside the group and externally with clients or at conferences. Over time, wealthy info can stay buried and hidden in non-text modalities like graphs and tables in these slide decks.

You need to use this answer and the ability of multimodal FMs such because the Amazon Titan Textual content Embeddings and Claude 3 Sonnet to find new info or uncover new views on content material in slide decks. You’ll be able to attempt totally different Claude fashions accessible on Amazon Bedrock by updating the CLAUDE_MODEL_ID within the globals.py file.

That is Half 2 of a three-part collection. We used the Amazon Titan Multimodal Embeddings and the LLaVA mannequin in Half 1. In Half 3, we are going to evaluate the approaches from Half 1 and Half 2.

Parts of this code are launched beneath the Apache 2.0 License.

Concerning the authors

Amit Arora is an AI and ML Specialist Architect at Amazon Internet Companies, serving to enterprise clients use cloud-based machine studying companies to quickly scale their improvements. He’s additionally an adjunct lecturer within the MS information science and analytics program at Georgetown College in Washington D.C.

Manju Prasad is a Senior Options Architect at Amazon Internet Companies. She focuses on offering technical steering in quite a lot of technical domains, together with AI/ML. Previous to becoming a member of AWS, she designed and constructed options for firms within the monetary companies sector and in addition for a startup. She is keen about sharing information and fostering curiosity in rising expertise.

Archana Inapudi is a Senior Options Architect at AWS, supporting a strategic buyer. She has over a decade of cross-industry experience main strategic technical initiatives. Archana is an aspiring member of the AI/ML technical area neighborhood at AWS. Previous to becoming a member of AWS, Archana led a migration from conventional siloed information sources to Hadoop at a healthcare firm. She is keen about utilizing expertise to speed up development, present worth to clients, and obtain enterprise outcomes.

Antara Raisa is an AI and ML Options Architect at Amazon Internet Companies, supporting strategic clients primarily based out of Dallas, Texas. She additionally has earlier expertise working with massive enterprise companions at AWS, the place she labored as a Associate Success Options Architect for digital-centered clients.