{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "_A2SZPmbD4Pk"
|
|
},
|
|
"source": [
|
|
"<h1>Chapter 8 - Semantic Search and Retrieval-Augmented Generation</h1>\n",
|
|
"<i>Exploring a vital part of LLMs, search.</i>\n",
|
|
"\n",
|
|
"<a href=\"https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961\"><img src=\"https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon\"></a>\n",
|
|
"<a href=\"https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/\"><img src=\"https://img.shields.io/badge/O'Reilly-white.svg?logo=\"></a>\n",
|
|
"<a href=\"https://github.com/HandsOnLLM/Hands-On-Large-Language-Models\"><img src=\"https://img.shields.io/badge/GitHub%20Repository-black?logo=github\"></a>\n",
|
|
"[](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter08/Chapter%208%20-%20Semantic%20Search.ipynb)\n",
|
|
"\n",
|
|
"---\n",
|
|
"\n",
|
|
"This notebook is for Chapter 8 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).\n",
|
|
"\n",
|
|
"---\n",
|
|
"\n",
|
|
"<a href=\"https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961\">\n",
|
|
"<img src=\"https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png\" width=\"350\"/></a>\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### [OPTIONAL] - Installing Packages on <img src=\"https://colab.google/static/images/icons/colab.png\" width=100>\n",
|
|
"\n",
|
|
"If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **uncomment and run** the following codeblock to install the dependencies for this chapter:\n",
|
|
"\n",
|
|
"---\n",
|
|
"\n",
|
|
"💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to\n",
|
|
"**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.\n",
|
|
"\n",
|
|
"---\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %%capture\n",
|
|
"# !pip install langchain==0.2.5 faiss-cpu==1.8.0 cohere==5.5.8 langchain-community==0.2.5 rank_bm25==0.2.2 sentence-transformers==3.0.1\n",
|
|
"# !pip install llama-cpp-python==0.2.78 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "Ye0HbBr3EV0P"
|
|
},
|
|
"source": [
|
|
"# Dense Retrieval Example\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "Svgdo3y3F741"
|
|
},
|
|
"source": [
|
|
"## 1. Getting the text archive and chunking it\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"id": "uOFFg7YWFoaf"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import cohere\n",
|
|
"\n",
|
|
"# Paste your API key here. Remember to not share publicly\n",
|
|
"api_key = ''\n",
|
|
"\n",
|
|
"# Create and retrieve a Cohere API key from os.cohere.ai\n",
|
|
"co = cohere.Client(api_key)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"id": "_Dcq1j_xFxIr"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"text = \"\"\"\n",
|
|
"Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan.\n",
|
|
"It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine.\n",
|
|
"Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind.\n",
|
|
"\n",
|
|
"Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007.\n",
|
|
"Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar.\n",
|
|
"Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm.\n",
|
|
"Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles.\n",
|
|
"Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects.\n",
|
|
"\n",
|
|
"Interstellar premiered on October 26, 2014, in Los Angeles.\n",
|
|
"In the United States, it was first released on film stock, expanding to venues using digital projectors.\n",
|
|
"The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014.\n",
|
|
"It received acclaim for its performances, direction, screenplay, musical score, visual effects, ambition, themes, and emotional weight.\n",
|
|
"It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics. Since its premiere, Interstellar gained a cult following,[5] and now is regarded by many sci-fi experts as one of the best science-fiction films of all time.\n",
|
|
"Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received numerous other accolades\"\"\"\n",
|
|
"\n",
|
|
"# Split into a list of sentences\n",
|
|
"texts = text.split('.')\n",
|
|
"\n",
|
|
"# Clean up to remove empty spaces and new lines\n",
|
|
"texts = [t.strip(' \\n') for t in texts]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "krDDOpcZF5qo"
|
|
},
|
|
"source": [
|
|
"## 2. Embedding the Text Chunks\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"executionInfo": {
|
|
"elapsed": 1117,
|
|
"status": "ok",
|
|
"timestamp": 1718963354789,
|
|
"user": {
|
|
"displayName": "Maarten Grootendorst",
|
|
"userId": "11015108362723620659"
|
|
},
|
|
"user_tz": -120
|
|
},
|
|
"id": "xooetZg0Fz4K",
|
|
"outputId": "1f105f8c-e6a9-4cc1-b620-1f9a03bec288"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"(15, 4096)\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"import numpy as np\n",
|
|
"\n",
|
|
"# Get the embeddings\n",
|
|
"response = co.embed(\n",
|
|
" texts=texts,\n",
|
|
" input_type=\"search_document\",\n",
|
|
").embeddings\n",
|
|
"\n",
|
|
"embeds = np.array(response)\n",
|
|
"print(embeds.shape)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "fLVdFg1PF4GG"
|
|
},
|
|
"source": [
|
|
"## 3. Building The Search Index\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"id": "JyqzN2-JF24N"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import faiss\n",
|
|
"\n",
|
|
"dim = embeds.shape[1]\n",
|
|
"index = faiss.IndexFlatL2(dim)\n",
|
|
"index.add(np.float32(embeds))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "T6qRFo8dGGrJ"
|
|
},
|
|
"source": [
|
|
"## 4. Search the index\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"id": "o83pxM5sGHxp"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import pandas as pd\n",
|
|
"\n",
|
|
"def search(query, number_of_results=3):\n",
|
|
"\n",
|
|
" # 1. Get the query's embedding\n",
|
|
" query_embed = co.embed(texts=[query],\n",
|
|
" input_type=\"search_query\",).embeddings[0]\n",
|
|
"\n",
|
|
" # 2. Retrieve the nearest neighbors\n",
|
|
" distances , similar_item_ids = index.search(np.float32([query_embed]), number_of_results)\n",
|
|
"\n",
|
|
" # 3. Format the results\n",
|
|
" texts_np = np.array(texts) # Convert texts list to numpy for easier indexing\n",
|
|
" results = pd.DataFrame(data={'texts': texts_np[similar_item_ids[0]],\n",
|
|
" 'distance': distances[0]})\n",
|
|
"\n",
|
|
" # 4. Print and return the results\n",
|
|
" print(f\"Query:'{query}'\\nNearest neighbors:\")\n",
|
|
" return results"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 180
|
|
},
|
|
"executionInfo": {
|
|
"elapsed": 403,
|
|
"status": "ok",
|
|
"timestamp": 1718963357199,
|
|
"user": {
|
|
"displayName": "Maarten Grootendorst",
|
|
"userId": "11015108362723620659"
|
|
},
|
|
"user_tz": -120
|
|
},
|
|
"id": "Rq_2knm_GLgR",
|
|
"outputId": "eaee1a34-a690-4f6c-ecb1-53974ae6f319"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Query:'how precise was the science'\n",
|
|
"Nearest neighbors:\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"application/vnd.google.colaboratory.intrinsic+json": {
|
|
"summary": "{\n \"name\": \"results\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"texts\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics\",\n \"Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar\",\n \"Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"distance\",\n \"properties\": {\n \"dtype\": \"float32\",\n \"num_unique_values\": 3,\n \"samples\": [\n 10757.3798828125,\n 11566.1318359375,\n 11922.8330078125\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}",
|
|
"type": "dataframe",
|
|
"variable_name": "results"
|
|
},
|
|
"text/html": [
|
|
"\n",
|
|
" <div id=\"df-61ac7a5e-82b0-4fd0-9859-07cced4c6061\" class=\"colab-df-container\">\n",
|
|
" <div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>texts</th>\n",
|
|
" <th>distance</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>It has also received praise from many astronom...</td>\n",
|
|
" <td>10757.379883</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>Caltech theoretical physicist and 2017 Nobel l...</td>\n",
|
|
" <td>11566.131836</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>Interstellar uses extensive practical and mini...</td>\n",
|
|
" <td>11922.833008</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>\n",
|
|
" </div>\n"
|
|
],
|
|
"text/plain": [
|
|
" texts distance\n",
|
|
"0 It has also received praise from many astronom... 10757.379883\n",
|
|
"1 Caltech theoretical physicist and 2017 Nobel l... 11566.131836\n",
|
|
"2 Interstellar uses extensive practical and mini... 11922.833008"
|
|
]
|
|
},
|
|
"execution_count": 9,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"query = \"how precise was the science\"\n",
|
|
"results = search(query)\n",
|
|
"results"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"id": "EkkDh12ZGRhY"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from rank_bm25 import BM25Okapi\n",
|
|
"from sklearn.feature_extraction import _stop_words\n",
|
|
"import string\n",
|
|
"\n",
|
|
"def bm25_tokenizer(text):\n",
|
|
" tokenized_doc = []\n",
|
|
" for token in text.lower().split():\n",
|
|
" token = token.strip(string.punctuation)\n",
|
|
"\n",
|
|
" if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:\n",
|
|
" tokenized_doc.append(token)\n",
|
|
" return tokenized_doc"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"executionInfo": {
|
|
"elapsed": 4,
|
|
"status": "ok",
|
|
"timestamp": 1718963358455,
|
|
"user": {
|
|
"displayName": "Maarten Grootendorst",
|
|
"userId": "11015108362723620659"
|
|
},
|
|
"user_tz": -120
|
|
},
|
|
"id": "cHl8HnvgGXHG",
|
|
"outputId": "0defa2a0-8e6b-436b-f6d2-6c5ce94f1636"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"100%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûê| 15/15 [00:00<00:00, 38908.20it/s]\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from tqdm import tqdm\n",
|
|
"\n",
|
|
"tokenized_corpus = []\n",
|
|
"for passage in tqdm(texts):\n",
|
|
" tokenized_corpus.append(bm25_tokenizer(passage))\n",
|
|
"\n",
|
|
"bm25 = BM25Okapi(tokenized_corpus)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"id": "ZlyGXye4GRj0"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"def keyword_search(query, top_k=3, num_candidates=15):\n",
|
|
" print(\"Input question:\", query)\n",
|
|
"\n",
|
|
" ##### BM25 search (lexical search) #####\n",
|
|
" bm25_scores = bm25.get_scores(bm25_tokenizer(query))\n",
|
|
" top_n = np.argpartition(bm25_scores, -num_candidates)[-num_candidates:]\n",
|
|
" bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]\n",
|
|
" bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)\n",
|
|
"\n",
|
|
" print(f\"Top-3 lexical search (BM25) hits\")\n",
|
|
" for hit in bm25_hits[0:top_k]:\n",
|
|
" print(\"\\t{:.3f}\\t{}\".format(hit['score'], texts[hit['corpus_id']].replace(\"\\n\", \" \")))\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"executionInfo": {
|
|
"elapsed": 3,
|
|
"status": "ok",
|
|
"timestamp": 1718963358455,
|
|
"user": {
|
|
"displayName": "Maarten Grootendorst",
|
|
"userId": "11015108362723620659"
|
|
},
|
|
"user_tz": -120
|
|
},
|
|
"id": "jV-V_mhRGRmS",
|
|
"outputId": "0957579e-0de7-4646-949e-cd9750178dcf"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Input question: how precise was the science\n",
|
|
"Top-3 lexical search (BM25) hits\n",
|
|
"\t1.789\tInterstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan\n",
|
|
"\t1.373\tCaltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar\n",
|
|
"\t0.000\tIt stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"keyword_search(query = \"how precise was the science\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "ehyhfd7NG5kw"
|
|
},
|
|
"source": [
|
|
"## Caveats of Dense Retrieval\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 180
|
|
},
|
|
"executionInfo": {
|
|
"elapsed": 280,
|
|
"status": "ok",
|
|
"timestamp": 1718963358733,
|
|
"user": {
|
|
"displayName": "Maarten Grootendorst",
|
|
"userId": "11015108362723620659"
|
|
},
|
|
"user_tz": -120
|
|
},
|
|
"id": "NxYwEfYRGpNe",
|
|
"outputId": "dfaf50a0-f500-4160-8126-1b3a825fe750"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Query:'What is the mass of the moon?'\n",
|
|
"Nearest neighbors:\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"application/vnd.google.colaboratory.intrinsic+json": {
|
|
"summary": "{\n \"name\": \"results\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"texts\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm\",\n \"The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014\",\n \"It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"distance\",\n \"properties\": {\n \"dtype\": \"float32\",\n \"num_unique_values\": 3,\n \"samples\": [\n 12854.458984375,\n 13301.0302734375,\n 13332.01171875\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}",
|
|
"type": "dataframe",
|
|
"variable_name": "results"
|
|
},
|
|
"text/html": [
|
|
"\n",
|
|
" <div id=\"df-6669e738-8ee5-4143-abc8-f1cff04e0803\" class=\"colab-df-container\">\n",
|
|
" <div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>texts</th>\n",
|
|
" <th>distance</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>Cinematographer Hoyte van Hoytema shot it on 3...</td>\n",
|
|
" <td>12854.458984</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>The film had a worldwide gross over $677 milli...</td>\n",
|
|
" <td>13301.030273</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>It has also received praise from many astronom...</td>\n",
|
|
" <td>13332.011719</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>\n",
|
|
" <div class=\"colab-df-buttons\">\n",
|
|
"\n",
|
|
" <div class=\"colab-df-container\">\n",
|
|
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-6669e738-8ee5-4143-abc8-f1cff04e0803')\"\n",
|
|
" title=\"Convert this dataframe to an interactive table.\"\n",
|
|
" style=\"display:none;\">\n",
|
|
"\n",
|
|
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
|
|
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
|
|
" </svg>\n",
|
|
" </button>\n",
|
|
"\n",
|
|
" <style>\n",
|
|
" .colab-df-container {\n",
|
|
" display:flex;\n",
|
|
" gap: 12px;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .colab-df-convert {\n",
|
|
" background-color: #E8F0FE;\n",
|
|
" border: none;\n",
|
|
" border-radius: 50%;\n",
|
|
" cursor: pointer;\n",
|
|
" display: none;\n",
|
|
" fill: #1967D2;\n",
|
|
" height: 32px;\n",
|
|
" padding: 0 0 0 0;\n",
|
|
" width: 32px;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .colab-df-convert:hover {\n",
|
|
" background-color: #E2EBFA;\n",
|
|
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
|
|
" fill: #174EA6;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .colab-df-buttons div {\n",
|
|
" margin-bottom: 4px;\n",
|
|
" }\n",
|
|
"\n",
|
|
" [theme=dark] .colab-df-convert {\n",
|
|
" background-color: #3B4455;\n",
|
|
" fill: #D2E3FC;\n",
|
|
" }\n",
|
|
"\n",
|
|
" [theme=dark] .colab-df-convert:hover {\n",
|
|
" background-color: #434B5C;\n",
|
|
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
|
|
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
|
|
" fill: #FFFFFF;\n",
|
|
" }\n",
|
|
" </style>\n",
|
|
"\n",
|
|
" <script>\n",
|
|
" const buttonEl =\n",
|
|
" document.querySelector('#df-6669e738-8ee5-4143-abc8-f1cff04e0803 button.colab-df-convert');\n",
|
|
" buttonEl.style.display =\n",
|
|
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
|
|
"\n",
|
|
" async function convertToInteractive(key) {\n",
|
|
" const element = document.querySelector('#df-6669e738-8ee5-4143-abc8-f1cff04e0803');\n",
|
|
" const dataTable =\n",
|
|
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
|
|
" [key], {});\n",
|
|
" if (!dataTable) return;\n",
|
|
"\n",
|
|
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
|
|
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
|
|
" + ' to learn more about interactive tables.';\n",
|
|
" element.innerHTML = '';\n",
|
|
" dataTable['output_type'] = 'display_data';\n",
|
|
" await google.colab.output.renderOutput(dataTable, element);\n",
|
|
" const docLink = document.createElement('div');\n",
|
|
" docLink.innerHTML = docLinkHtml;\n",
|
|
" element.appendChild(docLink);\n",
|
|
" }\n",
|
|
" </script>\n",
|
|
" </div>\n",
|
|
"\n",
|
|
"\n",
|
|
"<div id=\"df-68b953f1-57a3-4f4a-9cc3-4cc06914eb72\">\n",
|
|
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-68b953f1-57a3-4f4a-9cc3-4cc06914eb72')\"\n",
|
|
" title=\"Suggest charts\"\n",
|
|
" style=\"display:none;\">\n",
|
|
"\n",
|
|
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
|
|
" width=\"24px\">\n",
|
|
" <g>\n",
|
|
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
|
|
" </g>\n",
|
|
"</svg>\n",
|
|
" </button>\n",
|
|
"\n",
|
|
"<style>\n",
|
|
" .colab-df-quickchart {\n",
|
|
" --bg-color: #E8F0FE;\n",
|
|
" --fill-color: #1967D2;\n",
|
|
" --hover-bg-color: #E2EBFA;\n",
|
|
" --hover-fill-color: #174EA6;\n",
|
|
" --disabled-fill-color: #AAA;\n",
|
|
" --disabled-bg-color: #DDD;\n",
|
|
" }\n",
|
|
"\n",
|
|
" [theme=dark] .colab-df-quickchart {\n",
|
|
" --bg-color: #3B4455;\n",
|
|
" --fill-color: #D2E3FC;\n",
|
|
" --hover-bg-color: #434B5C;\n",
|
|
" --hover-fill-color: #FFFFFF;\n",
|
|
" --disabled-bg-color: #3B4455;\n",
|
|
" --disabled-fill-color: #666;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .colab-df-quickchart {\n",
|
|
" background-color: var(--bg-color);\n",
|
|
" border: none;\n",
|
|
" border-radius: 50%;\n",
|
|
" cursor: pointer;\n",
|
|
" display: none;\n",
|
|
" fill: var(--fill-color);\n",
|
|
" height: 32px;\n",
|
|
" padding: 0;\n",
|
|
" width: 32px;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .colab-df-quickchart:hover {\n",
|
|
" background-color: var(--hover-bg-color);\n",
|
|
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
|
|
" fill: var(--button-hover-fill-color);\n",
|
|
" }\n",
|
|
"\n",
|
|
" .colab-df-quickchart-complete:disabled,\n",
|
|
" .colab-df-quickchart-complete:disabled:hover {\n",
|
|
" background-color: var(--disabled-bg-color);\n",
|
|
" fill: var(--disabled-fill-color);\n",
|
|
" box-shadow: none;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .colab-df-spinner {\n",
|
|
" border: 2px solid var(--fill-color);\n",
|
|
" border-color: transparent;\n",
|
|
" border-bottom-color: var(--fill-color);\n",
|
|
" animation:\n",
|
|
" spin 1s steps(1) infinite;\n",
|
|
" }\n",
|
|
"\n",
|
|
" @keyframes spin {\n",
|
|
" 0% {\n",
|
|
" border-color: transparent;\n",
|
|
" border-bottom-color: var(--fill-color);\n",
|
|
" border-left-color: var(--fill-color);\n",
|
|
" }\n",
|
|
" 20% {\n",
|
|
" border-color: transparent;\n",
|
|
" border-left-color: var(--fill-color);\n",
|
|
" border-top-color: var(--fill-color);\n",
|
|
" }\n",
|
|
" 30% {\n",
|
|
" border-color: transparent;\n",
|
|
" border-left-color: var(--fill-color);\n",
|
|
" border-top-color: var(--fill-color);\n",
|
|
" border-right-color: var(--fill-color);\n",
|
|
" }\n",
|
|
" 40% {\n",
|
|
" border-color: transparent;\n",
|
|
" border-right-color: var(--fill-color);\n",
|
|
" border-top-color: var(--fill-color);\n",
|
|
" }\n",
|
|
" 60% {\n",
|
|
" border-color: transparent;\n",
|
|
" border-right-color: var(--fill-color);\n",
|
|
" }\n",
|
|
" 80% {\n",
|
|
" border-color: transparent;\n",
|
|
" border-right-color: var(--fill-color);\n",
|
|
" border-bottom-color: var(--fill-color);\n",
|
|
" }\n",
|
|
" 90% {\n",
|
|
" border-color: transparent;\n",
|
|
" border-bottom-color: var(--fill-color);\n",
|
|
" }\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"\n",
|
|
" <script>\n",
|
|
" async function quickchart(key) {\n",
|
|
" const quickchartButtonEl =\n",
|
|
" document.querySelector('#' + key + ' button');\n",
|
|
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
|
|
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
|
|
" try {\n",
|
|
" const charts = await google.colab.kernel.invokeFunction(\n",
|
|
" 'suggestCharts', [key], {});\n",
|
|
" } catch (error) {\n",
|
|
" console.error('Error during call to suggestCharts:', error);\n",
|
|
" }\n",
|
|
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
|
|
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
|
|
" }\n",
|
|
" (() => {\n",
|
|
" let quickchartButtonEl =\n",
|
|
" document.querySelector('#df-68b953f1-57a3-4f4a-9cc3-4cc06914eb72 button');\n",
|
|
" quickchartButtonEl.style.display =\n",
|
|
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
|
|
" })();\n",
|
|
" </script>\n",
|
|
"</div>\n",
|
|
"\n",
|
|
" <div id=\"id_6fce7d53-53cc-42ea-9bf5-3844e31f3081\">\n",
|
|
" <style>\n",
|
|
" .colab-df-generate {\n",
|
|
" background-color: #E8F0FE;\n",
|
|
" border: none;\n",
|
|
" border-radius: 50%;\n",
|
|
" cursor: pointer;\n",
|
|
" display: none;\n",
|
|
" fill: #1967D2;\n",
|
|
" height: 32px;\n",
|
|
" padding: 0 0 0 0;\n",
|
|
" width: 32px;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .colab-df-generate:hover {\n",
|
|
" background-color: #E2EBFA;\n",
|
|
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
|
|
" fill: #174EA6;\n",
|
|
" }\n",
|
|
"\n",
|
|
" [theme=dark] .colab-df-generate {\n",
|
|
" background-color: #3B4455;\n",
|
|
" fill: #D2E3FC;\n",
|
|
" }\n",
|
|
"\n",
|
|
" [theme=dark] .colab-df-generate:hover {\n",
|
|
" background-color: #434B5C;\n",
|
|
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
|
|
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
|
|
" fill: #FFFFFF;\n",
|
|
" }\n",
|
|
" </style>\n",
|
|
" <button class=\"colab-df-generate\" onclick=\"generateWithVariable('results')\"\n",
|
|
" title=\"Generate code using this dataframe.\"\n",
|
|
" style=\"display:none;\">\n",
|
|
"\n",
|
|
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
|
|
" width=\"24px\">\n",
|
|
" <path d=\"M7,19H8.4L18.45,9,17,7.55,7,17.6ZM5,21V16.75L18.45,3.32a2,2,0,0,1,2.83,0l1.4,1.43a1.91,1.91,0,0,1,.58,1.4,1.91,1.91,0,0,1-.58,1.4L9.25,21ZM18.45,9,17,7.55Zm-12,3A5.31,5.31,0,0,0,4.9,8.1,5.31,5.31,0,0,0,1,6.5,5.31,5.31,0,0,0,4.9,4.9,5.31,5.31,0,0,0,6.5,1,5.31,5.31,0,0,0,8.1,4.9,5.31,5.31,0,0,0,12,6.5,5.46,5.46,0,0,0,6.5,12Z\"/>\n",
|
|
" </svg>\n",
|
|
" </button>\n",
|
|
" <script>\n",
|
|
" (() => {\n",
|
|
" const buttonEl =\n",
|
|
" document.querySelector('#id_6fce7d53-53cc-42ea-9bf5-3844e31f3081 button.colab-df-generate');\n",
|
|
" buttonEl.style.display =\n",
|
|
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
|
|
"\n",
|
|
" buttonEl.onclick = () => {\n",
|
|
" google.colab.notebook.generateWithVariable('results');\n",
|
|
" }\n",
|
|
" })();\n",
|
|
" </script>\n",
|
|
" </div>\n",
|
|
"\n",
|
|
" </div>\n",
|
|
" </div>\n"
|
|
],
|
|
"text/plain": [
|
|
" texts distance\n",
|
|
"0 Cinematographer Hoyte van Hoytema shot it on 3... 12854.458984\n",
|
|
"1 The film had a worldwide gross over $677 milli... 13301.030273\n",
|
|
"2 It has also received praise from many astronom... 13332.011719"
|
|
]
|
|
},
|
|
"execution_count": 14,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"query = \"What is the mass of the moon?\"\n",
|
|
"results = search(query)\n",
|
|
"results"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "V_RalLmuG0jw"
|
|
},
|
|
"source": [
|
|
"# Reranking Example\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"executionInfo": {
|
|
"elapsed": 3,
|
|
"status": "ok",
|
|
"timestamp": 1718963358733,
|
|
"user": {
|
|
"displayName": "Maarten Grootendorst",
|
|
"userId": "11015108362723620659"
|
|
},
|
|
"user_tz": -120
|
|
},
|
|
"id": "HulOxkW_Focv",
|
|
"outputId": "06e05541-4383-452e-d41a-04c3dc5521fb"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[RerankResponseResultsItem(document=RerankResponseResultsItemDocument(text='It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics'), index=12, relevance_score=0.1698185),\n",
|
|
" RerankResponseResultsItem(document=RerankResponseResultsItemDocument(text='The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014'), index=10, relevance_score=0.07004896),\n",
|
|
" RerankResponseResultsItem(document=RerankResponseResultsItemDocument(text='Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar'), index=4, relevance_score=0.0043994132)]"
|
|
]
|
|
},
|
|
"execution_count": 15,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"query = \"how precise was the science\"\n",
|
|
"results = co.rerank(query=query, documents=texts, top_n=3, return_documents=True)\n",
|
|
"results.results"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"executionInfo": {
|
|
"elapsed": 2,
|
|
"status": "ok",
|
|
"timestamp": 1718963358733,
|
|
"user": {
|
|
"displayName": "Maarten Grootendorst",
|
|
"userId": "11015108362723620659"
|
|
},
|
|
"user_tz": -120
|
|
},
|
|
"id": "SUrmMW8LFofP",
|
|
"outputId": "9724bf94-cf9d-45ff-eb80-50ededa34275"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"0 0.1698185 It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics\n",
|
|
"1 0.07004896 The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014\n",
|
|
"2 0.0043994132 Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"for idx, result in enumerate(results.results):\n",
|
|
"    print(idx, result.relevance_score, result.document.text)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"id": "rqYJaq2CFohv"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"def keyword_and_reranking_search(query, top_k=3, num_candidates=10):\n",
|
|
"    print(\"Input question:\", query)\n",
|
|
"\n",
|
|
"    ##### BM25 search (lexical search) #####\n",
|
|
"    bm25_scores = bm25.get_scores(bm25_tokenizer(query))\n",
|
|
"    top_n = np.argpartition(bm25_scores, -num_candidates)[-num_candidates:]\n",
|
|
"    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]\n",
|
|
"    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)\n",
|
|
"\n",
|
|
"    print(f\"Top-{top_k} lexical search (BM25) hits\")\n",
|
|
"    for hit in bm25_hits[0:top_k]:\n",
|
|
"        print(\"\\t{:.3f}\\t{}\".format(hit['score'], texts[hit['corpus_id']].replace(\"\\n\", \" \")))\n",
|
|
"\n",
|
|
"    # Re-rank the BM25 candidates with the rerank API\n",
|
|
"    docs = [texts[hit['corpus_id']] for hit in bm25_hits]\n",
|
|
"\n",
|
|
"    print(f\"\\nTop-{top_k} hits by rank-API ({len(bm25_hits)} BM25 hits re-ranked)\")\n",
|
|
"    results = co.rerank(query=query, documents=docs, top_n=top_k, return_documents=True)\n",
|
|
"    for hit in results.results:\n",
|
|
"        print(\"\\t{:.3f}\\t{}\".format(hit.relevance_score, hit.document.text.replace(\"\\n\", \" \")))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"executionInfo": {
|
|
"elapsed": 1,
|
|
"status": "ok",
|
|
"timestamp": 1718963359073,
|
|
"user": {
|
|
"displayName": "Maarten Grootendorst",
|
|
"userId": "11015108362723620659"
|
|
},
|
|
"user_tz": -120
|
|
},
|
|
"id": "9FITOXqkHONy",
|
|
"outputId": "c4e81e12-b19c-4fa0-8ce6-7907002df7ca"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Input question: how precise was the science\n",
|
|
"Top-3 lexical search (BM25) hits\n",
|
|
"\t1.789\tInterstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan\n",
|
|
"\t1.373\tCaltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar\n",
|
|
"\t0.000\tInterstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects\n",
|
|
"\n",
|
|
"Top-3 hits by rank-API (10 BM25 hits re-ranked)\n",
|
|
"\t0.004\tCaltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar\n",
|
|
"\t0.004\tSet in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind\n",
|
|
"\t0.003\tBrothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"keyword_and_reranking_search(query = \"how precise was the science\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "ugdnTs_VHV25"
|
|
},
|
|
"source": [
|
|
"# Retrieval-Augmented Generation"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "-iqKQ7F0HZh-"
|
|
},
|
|
"source": [
|
|
"## Example: Grounded Generation with an LLM API\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"executionInfo": {
|
|
"elapsed": 2275,
|
|
"status": "ok",
|
|
"timestamp": 1718963362077,
|
|
"user": {
|
|
"displayName": "Maarten Grootendorst",
|
|
"userId": "11015108362723620659"
|
|
},
|
|
"user_tz": -120
|
|
},
|
|
"id": "VeHX0D8DHaim",
|
|
"outputId": "364b75ea-0b36-4a01-b16f-aa407768dde8"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Query:'income generated'\n",
|
|
"Nearest neighbors:\n",
|
|
"The film generated a worldwide gross of over $677 million, or $773 million with subsequent re-releases.\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"query = \"income generated\"\n",
|
|
"\n",
|
|
"# 1- Retrieval\n",
|
|
"# We'll use embedding search here, though a hybrid (keyword + embedding) search would be ideal\n",
|
|
"results = search(query)\n",
|
|
"\n",
|
|
"# 2- Grounded Generation\n",
|
|
"docs_dict = [{'text': text} for text in results['texts']]\n",
|
|
"response = co.chat(\n",
|
|
"    message=query,\n",
|
|
"    documents=docs_dict\n",
|
|
")\n",
|
|
"\n",
|
|
"print(response.text)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"executionInfo": {
|
|
"elapsed": 362,
|
|
"status": "ok",
|
|
"timestamp": 1718963362438,
|
|
"user": {
|
|
"displayName": "Maarten Grootendorst",
|
|
"userId": "11015108362723620659"
|
|
},
|
|
"user_tz": -120
|
|
},
|
|
"id": "E9YmHEOHHpUW",
|
|
"outputId": "f8670a77-8f69-4870-8887-060fdf195978"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"NonStreamedChatResponse(text='The film generated a worldwide gross of over $677 million, or $773 million with subsequent re-releases.', generation_id='bebc32bd-d620-42cf-bd13-f8d1f96d4aa6', citations=[ChatCitation(start=21, end=57, text='worldwide gross of over $677 million', document_ids=['doc_0']), ChatCitation(start=62, end=103, text='$773 million with subsequent re-releases.', document_ids=['doc_0'])], documents=[{'id': 'doc_0', 'text': 'The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014'}], is_search_required=None, search_queries=None, search_results=None, finish_reason='COMPLETE', tool_calls=None, chat_history=[Message_User(message='income generated', tool_calls=None, role='USER'), Message_Chatbot(message='The film generated a worldwide gross of over $677 million, or $773 million with subsequent re-releases.', tool_calls=None, role='CHATBOT')], prompt=None, meta=ApiMeta(api_version=ApiMetaApiVersion(version='1', is_deprecated=None, is_experimental=None), billed_units=ApiMetaBilledUnits(input_tokens=106, output_tokens=26, search_units=None, classifications=None), tokens=ApiMetaTokens(input_tokens=797, output_tokens=95), warnings=None), response_id='319002db-505d-4d3c-903f-41eefbf0f856')"
|
|
]
|
|
},
|
|
"execution_count": 20,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"response"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"executionInfo": {
|
|
"elapsed": 2,
|
|
"status": "ok",
|
|
"timestamp": 1718963362438,
|
|
"user": {
|
|
"displayName": "Maarten Grootendorst",
|
|
"userId": "11015108362723620659"
|
|
},
|
|
"user_tz": -120
|
|
},
|
|
"id": "YLwaGXM2Hg7b",
|
|
"outputId": "03f052d6-0723-45d9-8b24-fbc1195664c4"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[ChatCitation(start=21, end=57, text='worldwide gross of over $677 million', document_ids=['doc_0']),\n",
|
|
" ChatCitation(start=62, end=103, text='$773 million with subsequent re-releases.', document_ids=['doc_0'])]"
|
|
]
|
|
},
|
|
"execution_count": 21,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"response.citations"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "D_25ztzEHuWX"
|
|
},
|
|
"source": [
|
|
"## Example: RAG with Local Models\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "jNZ5gUoWIYhp"
|
|
},
|
|
"source": [
|
|
"### Loading the Generation Model\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"executionInfo": {
|
|
"elapsed": 33056,
|
|
"status": "ok",
|
|
"timestamp": 1718963395761,
|
|
"user": {
|
|
"displayName": "Maarten Grootendorst",
|
|
"userId": "11015108362723620659"
|
|
},
|
|
"user_tz": -120
|
|
},
|
|
"id": "E4LNwOWTHvOv",
|
|
"outputId": "945a6fa3-511d-48b7-d305-fe6d1dd23be9"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"--2024-06-21 09:49:22-- https://huggingface.co/lmstudio-community/Phi-3-mini-4k-instruct-GGUF/resolve/main/Phi-3-mini-4k-instruct-Q8_0.gguf\n",
|
|
"Resolving huggingface.co (huggingface.co)... 18.164.174.118, 18.164.174.23, 18.164.174.17, ...\n",
|
|
"Connecting to huggingface.co (huggingface.co)|18.164.174.118|:443... connected.\n",
|
|
"HTTP request sent, awaiting response... 302 Found\n",
|
|
"Location: https://cdn-lfs-us-1.huggingface.co/repos/8e/3f/8e3fafa0351929e621a3db9a53b131a9d7f4b222332208032555bb92f11ab100/8d2f3732e31c354e169cd81dcde9807a1c73b85b9a0f9b16c19013e7a4bb151c?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27Phi-3-mini-4k-instruct-Q8_0.gguf%3B+filename%3D%22Phi-3-mini-4k-instruct-Q8_0.gguf%22%3B&Expires=1719222562&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxOTIyMjU2Mn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzhlLzNmLzhlM2ZhZmEwMzUxOTI5ZTYyMWEzZGI5YTUzYjEzMWE5ZDdmNGIyMjIzMzIyMDgwMzI1NTViYjkyZjExYWIxMDAvOGQyZjM3MzJlMzFjMzU0ZTE2OWNkODFkY2RlOTgwN2ExYzczYjg1YjlhMGY5YjE2YzE5MDEzZTdhNGJiMTUxYz9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSoifV19&Signature=DFqGVAGZBHhpPlWt2pSH5nbTgskxr6FrSK19FhopbgIYnqujHL3LDgnleMTDHrtRvflyX-6QE88ZAYZc1wIZA7ZXgBVoFdYMNPbHixfv4hqGdKiddWcF4QYi5JCYChS3Z8oZPUkcbyX6KNTqMR1nls2KTZ3K0Xl3E7nGmlTXo85mRdRKQojZLkuLOa28pG2z9jrs1wJ1B2W3Ed%7E%7EK1E-BXhjKsK4zUR1Ch-3KfqAqe0q0XmnCcF1Ml2xosujvlA%7EZuVxflmRf8wRVZBZsbZNXaUFAb3oiyNTuyr9g1fvSxJPmAJocs9jIMUZwU88Sa4s%7EVdfytD0s6YruNH1GOAfoA__&Key-Pair-Id=K2FPYV99P2N66Q [following]\n",
|
|
"--2024-06-21 09:49:22-- https://cdn-lfs-us-1.huggingface.co/repos/8e/3f/8e3fafa0351929e621a3db9a53b131a9d7f4b222332208032555bb92f11ab100/8d2f3732e31c354e169cd81dcde9807a1c73b85b9a0f9b16c19013e7a4bb151c?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27Phi-3-mini-4k-instruct-Q8_0.gguf%3B+filename%3D%22Phi-3-mini-4k-instruct-Q8_0.gguf%22%3B&Expires=1719222562&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxOTIyMjU2Mn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzhlLzNmLzhlM2ZhZmEwMzUxOTI5ZTYyMWEzZGI5YTUzYjEzMWE5ZDdmNGIyMjIzMzIyMDgwMzI1NTViYjkyZjExYWIxMDAvOGQyZjM3MzJlMzFjMzU0ZTE2OWNkODFkY2RlOTgwN2ExYzczYjg1YjlhMGY5YjE2YzE5MDEzZTdhNGJiMTUxYz9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSoifV19&Signature=DFqGVAGZBHhpPlWt2pSH5nbTgskxr6FrSK19FhopbgIYnqujHL3LDgnleMTDHrtRvflyX-6QE88ZAYZc1wIZA7ZXgBVoFdYMNPbHixfv4hqGdKiddWcF4QYi5JCYChS3Z8oZPUkcbyX6KNTqMR1nls2KTZ3K0Xl3E7nGmlTXo85mRdRKQojZLkuLOa28pG2z9jrs1wJ1B2W3Ed%7E%7EK1E-BXhjKsK4zUR1Ch-3KfqAqe0q0XmnCcF1Ml2xosujvlA%7EZuVxflmRf8wRVZBZsbZNXaUFAb3oiyNTuyr9g1fvSxJPmAJocs9jIMUZwU88Sa4s%7EVdfytD0s6YruNH1GOAfoA__&Key-Pair-Id=K2FPYV99P2N66Q\n",
|
|
"Resolving cdn-lfs-us-1.huggingface.co (cdn-lfs-us-1.huggingface.co)... 18.65.25.71, 18.65.25.64, 18.65.25.113, ...\n",
|
|
"Connecting to cdn-lfs-us-1.huggingface.co (cdn-lfs-us-1.huggingface.co)|18.65.25.71|:443... connected.\n",
|
|
"HTTP request sent, awaiting response... 200 OK\n",
|
|
"Length: 4061221024 (3.8G) [binary/octet-stream]\n",
|
|
"Saving to: ‘Phi-3-mini-4k-instruct-Q8_0.gguf’\n",
|
|
"\n",
|
|
"Phi-3-mini-4k-instr 100%[===================>] 3.78G 66.7MB/s in 33s \n",
|
|
"\n",
|
|
"2024-06-21 09:49:55 (119 MB/s) - ‘Phi-3-mini-4k-instruct-Q8_0.gguf’ saved [4061221024/4061221024]\n",
|
|
"\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"!wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"id": "a2Qgnc5OHvRQ"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from langchain import LlamaCpp\n",
|
|
"\n",
|
|
"# Make sure the model path is correct for your system!\n",
|
|
"llm = LlamaCpp(\n",
|
|
"    model_path=\"Phi-3-mini-4k-instruct-q4.gguf\",\n",
|
|
"    n_gpu_layers=-1,\n",
|
|
"    max_tokens=500,\n",
|
|
"    n_ctx=2048,\n",
|
|
"    seed=42,\n",
|
|
"    verbose=False\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "H7ahQtlvIZjS"
|
|
},
|
|
"source": [
|
|
"### Loading the Embedding Model"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 539,
|
|
"referenced_widgets": [
|
|
"138cc674d550408d92f2793bb8d8feef",
|
|
"83f7665ab87042798e1b8e94ef3275f7",
|
|
"3c73236c809f45e49293deb47b55de0a",
|
|
"5d2bcf6e2b874522ae07d65d59bac90b",
|
|
"6058adc08c9e43adb214dc157980ccd5",
|
|
"94f720fd05bd42d2a3e80456516511ad",
|
|
"05e80b09ac484cc5a580d8d23bab4727",
|
|
"acd7c3193cd849d2927811bcc7c3ab86",
|
|
"d0db0682e7bb48b3b195dec91dfceea5",
|
|
"d189e49a470a4fe5ad2b8374cd1c3422",
|
|
"bdacebab45304f47b732c33ac4639e9e",
|
|
"5cff7f41f95d47dbb1c296647c967bfc",
|
|
"ae492efb01234d4fbfb928087d05f094",
|
|
"a453e6e3da5548d3a0d3336ba397e505",
|
|
"4dc38e826b7f4982ba36b77a4bf2db2e",
|
|
"255c1e532ee048e286e1d5c89fe3015d",
|
|
"19b25464d66945a2a0e40284b0f278ec",
|
|
"d7b688b70a2d48e3b20b7fa1d90e8e97",
|
|
"7f75961fa15b443fb4d528116b925922",
|
|
"b6a72b7b8e864691996256146e91ef48",
|
|
"1adb2c1d93004a4aaec5842eb07dd5df",
|
|
"89b93e5618f64fbca249f7f3a0eaae49",
|
|
"a8e14699f76a49fc95bf572c33e89b9e",
|
|
"ba3157d275b845edb92df2ebc4ea9bc1",
|
|
"6a4f72cbd2b94be5af4d3ffcad6b1e87",
|
|
"71815245763045838934063bd772d3ab",
|
|
"0b8f820e5efe4e10b33e9693c3eef9bf",
|
|
"e702233daf474196a5eb7a6b9500e75a",
|
|
"db1c9245b86946ac9dde32c29604f215",
|
|
"e61d6b43f4144e6eaa43716261a85516",
|
|
"ef4669fd011241c2ab048399a5ad89c4",
|
|
"5d9802b14082481e93aa3650e7a14cec",
|
|
"c4de82afef0f40af8147f0d93f978174",
|
|
"25d46b46b26a48d7a4810dd691dbfa86",
|
|
"d3aee1810b1c40e19a184d2d6af8eefa",
|
|
"715ed5ce1cd243d09d19ade841be1a7c",
|
|
"1a8e6ef4564e43c3b4631223491d1138",
|
|
"277df89d75b0440e99c7c38304de709d",
|
|
"a95a3d6cff4f4b1fa9d716099789cb9f",
|
|
"f366405bc2af4d5e97a179300fe267b9",
|
|
"00fc8b78e0c2440d9b514378c6f95099",
|
|
"68e451cf20454e86aec8160ae82f3670",
|
|
"c74ad52fe8bb44da8d3b2a7b1b10c4f1",
|
|
"a7a538d9ec814c9e8fd50eb15712de8f",
|
|
"3b990fafeff24fe8adb29a3d24a89565",
|
|
"e39c103ec551423aa126540b6afa7d10",
|
|
"29e90bb4d1fa4c11a3a8d563bdbeb160",
|
|
"82e4cb20b3724261b1c59d8841325bb9",
|
|
"8cc39fab52804fa1867d6395a51900f9",
|
|
"fe5b5da0ccad4bd2b5dedbbb25d0817d",
|
|
"f94579b834ee41d9b12a0759e8206763",
|
|
"525c9e8a0d834bbd83e23bc381cf3ecd",
|
|
"7372effad89243fb8d7dbbd87fa3b413",
|
|
"b935647ac06f4e20bfa62de8a1ddf05e",
|
|
"681ea5b1d73b4c3580f4dcca25b9d9d0",
|
|
"89c99f9a11164857a367d22af469cf85",
|
|
"d4604e6f553941fab0d2c583fcbd593a",
|
|
"eda119b60c004a1bb00b7ca0e583188d",
|
|
"2b8114428e8d484a9c0c0638477a31d1",
|
|
"ad192a0f6d9d42f49e3e24caaee42e26",
|
|
"f9af1e04b2144d9f9b104cfa87455109",
|
|
"368731d085ed4eb5bc1c78d73725df39",
|
|
"430ff548ae1d4edeb14b91c12fbeb9dc",
|
|
"3038cc2ac17f4a0397840b7104d2fd7c",
|
|
"15d6c1ab1e0f4908b2a7584159bb0105",
|
|
"65a247e214364094a8d6fcefdbce1c3a",
|
|
"326669dae98648b2b55e6fcb45ed4f43",
|
|
"30cee3a3d2684bd59ec661e67090b7ea",
|
|
"196331e7f282454fb4ea32de0410a495",
|
|
"6e5aa6453a5741aa9696ce0444b7fca4",
|
|
"14f8aa41b5574d6f855ddcde71c1e59b",
|
|
"4930e2e16041457580e15399bb983c34",
|
|
"c3180ac3bc61416abc7fe10db1171bd9",
|
|
"a1f95710663b4bdeb7b1c7dc6f4f4eb7",
|
|
"6328742a36d54f619a1a7803b6be790e",
|
|
"cd87abc413db4d65be32fd47389b46ed",
|
|
"a78c6111af5541e0893f3775f3992a6a",
|
|
"2d6c6545fc9e446f882e1b0eadd03ffc",
|
|
"b574a5cb83164992a2dff90c17587acb",
|
|
"05f0d4f09ba84f7287178b1dd43fdd25",
|
|
"0b5c0e31175b4b2f9f1c1c3cdc15d34e",
|
|
"31f1a648bf5d4c7baf88de71dcc6d3c1",
|
|
"fc5cc6536106431d8515731c0e682acc",
|
|
"653c8fe00d5d4996b155df26d0a0598d",
|
|
"1eb0802651cb4560bacfb9be6e23f335",
|
|
"8518599e1b1144fda9f208956f8f1eff",
|
|
"ed0677d112e5469bb42068f0e3b14adf",
|
|
"cb6d2cb161c347aab0b3a06727d776df",
|
|
"e7fdfec1d8b341d7a1c228c069632400",
|
|
"32d56ae437da4d9cbcdecc6942a54ce8",
|
|
"23ed3163e67f4a10bc626b3440de7553",
|
|
"53f46138cc2c4e74a3cbe1f77ac2f92e",
|
|
"dac714e2ccba4635a1aceb46ebdc33ac",
|
|
"b3001c1847394d08ae53cf68ec697e0f",
|
|
"bb434d424fa144b0b0211603d30b5c72",
|
|
"e4cb2ade41d4466d99149cee70c1e7e9",
|
|
"dddb4584c34f40c6bc66df30c1c3c4c0",
|
|
"23169e2b0cb34fa388a7c7f7a140335c",
|
|
"c3aacf3d8c7b4b4b9b98ca123e087593",
|
|
"44d38a0ba97444c6bfd80c8c9553c070",
|
|
"dd7f4f080941498397dc51d459bbc519",
|
|
"5c5c32897b19438dbba56f72c577fe64",
|
|
"e049e3a92fbd422198c16a94e46a5455",
|
|
"18588efdadbc457a8940facc0aa96511",
|
|
"ae97a2366ac944d1a089b4f1d8269822",
|
|
"23dc7b6255cc479c89799dc10653008c",
|
|
"f19fe22a899e4fa19e607c16c94cf16c",
|
|
"28f6bcec0d6448e18fb894e2dae4c2ae",
|
|
"3fd02e49e5f64baf95596b82e1c5d9fb",
|
|
"9e0b0f814ad6487fb5ee842e723a0b52"
|
|
]
|
|
},
|
|
"executionInfo": {
|
|
"elapsed": 23624,
|
|
"status": "ok",
|
|
"timestamp": 1718963519389,
|
|
"user": {
|
|
"displayName": "Maarten Grootendorst",
|
|
"userId": "11015108362723620659"
|
|
},
|
|
"user_tz": -120
|
|
},
|
|
"id": "ODkBMgsIIddp",
|
|
"outputId": "2ca552cb-565a-459b-e9eb-0aed4481d492"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"/usr/local/lib/python3.10/dist-packages/sentence_transformers/cross_encoder/CrossEncoder.py:11: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)\n",
|
|
" from tqdm.autonotebook import tqdm, trange\n",
|
|
"/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n",
|
|
"The secret `HF_TOKEN` does not exist in your Colab secrets.\n",
|
|
"To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n",
|
|
"You will be able to reuse this secret in all of your notebooks.\n",
|
|
"Please note that authentication is recommended but still optional to access public models or datasets.\n",
|
|
" warnings.warn(\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"application/vnd.jupyter.widget-view+json": {
|
|
"model_id": "138cc674d550408d92f2793bb8d8feef",
|
|
"version_major": 2,
|
|
"version_minor": 0
|
|
},
|
|
"text/plain": [
|
|
"modules.json: 0%| | 0.00/385 [00:00<?, ?B/s]"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"application/vnd.jupyter.widget-view+json": {
|
|
"model_id": "5cff7f41f95d47dbb1c296647c967bfc",
|
|
"version_major": 2,
|
|
"version_minor": 0
|
|
},
|
|
"text/plain": [
|
|
"README.md: 0%| | 0.00/68.1k [00:00<?, ?B/s]"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"application/vnd.jupyter.widget-view+json": {
|
|
"model_id": "a8e14699f76a49fc95bf572c33e89b9e",
|
|
"version_major": 2,
|
|
"version_minor": 0
|
|
},
|
|
"text/plain": [
|
|
"sentence_bert_config.json: 0%| | 0.00/57.0 [00:00<?, ?B/s]"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
|
|
" warnings.warn(\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"application/vnd.jupyter.widget-view+json": {
|
|
"model_id": "25d46b46b26a48d7a4810dd691dbfa86",
|
|
"version_major": 2,
|
|
"version_minor": 0
|
|
},
|
|
"text/plain": [
|
|
"config.json: 0%| | 0.00/583 [00:00<?, ?B/s]"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"application/vnd.jupyter.widget-view+json": {
|
|
"model_id": "3b990fafeff24fe8adb29a3d24a89565",
|
|
"version_major": 2,
|
|
"version_minor": 0
|
|
},
|
|
"text/plain": [
|
|
"model.safetensors: 0%| | 0.00/66.7M [00:00<?, ?B/s]"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"application/vnd.jupyter.widget-view+json": {
|
|
"model_id": "89c99f9a11164857a367d22af469cf85",
|
|
"version_major": 2,
|
|
"version_minor": 0
|
|
},
|
|
"text/plain": [
|
|
"tokenizer_config.json: 0%| | 0.00/394 [00:00<?, ?B/s]"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"application/vnd.jupyter.widget-view+json": {
|
|
"model_id": "326669dae98648b2b55e6fcb45ed4f43",
|
|
"version_major": 2,
|
|
"version_minor": 0
|
|
},
|
|
"text/plain": [
|
|
"vocab.txt: 0%| | 0.00/232k [00:00<?, ?B/s]"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"application/vnd.jupyter.widget-view+json": {
|
|
"model_id": "2d6c6545fc9e446f882e1b0eadd03ffc",
|
|
"version_major": 2,
|
|
"version_minor": 0
|
|
},
|
|
"text/plain": [
|
|
"tokenizer.json: 0%| | 0.00/712k [00:00<?, ?B/s]"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"application/vnd.jupyter.widget-view+json": {
|
|
"model_id": "e7fdfec1d8b341d7a1c228c069632400",
|
|
"version_major": 2,
|
|
"version_minor": 0
|
|
},
|
|
"text/plain": [
|
|
"special_tokens_map.json: 0%| | 0.00/125 [00:00<?, ?B/s]"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"application/vnd.jupyter.widget-view+json": {
|
|
"model_id": "44d38a0ba97444c6bfd80c8c9553c070",
|
|
"version_major": 2,
|
|
"version_minor": 0
|
|
},
|
|
"text/plain": [
|
|
"1_Pooling/config.json: 0%| | 0.00/190 [00:00<?, ?B/s]"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"from langchain.embeddings.huggingface import HuggingFaceEmbeddings\n",
|
|
"\n",
|
|
"# Embedding Model for converting text to numerical representations\n",
|
|
"embedding_model = HuggingFaceEmbeddings(\n",
|
|
"    model_name='BAAI/bge-small-en-v1.5'\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "LgPua3jsIgmW"
|
|
},
|
|
"source": [
|
|
"### Preparing the Vector Database"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"id": "NV57LOf8IjM-"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from langchain.vectorstores import FAISS\n",
|
|
"\n",
|
|
"# Create a local vector database\n",
|
|
"db = FAISS.from_texts(texts, embedding_model)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"id": "P06UYeIVIk1e"
|
|
},
|
|
"source": [
|
|
"### The RAG Prompt\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"id": "F_3nTc69InwO"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from langchain.prompts import PromptTemplate\n",
|
|
"from langchain.chains import RetrievalQA\n",
|
|
"\n",
|
|
"\n",
|
|
"# Create a prompt template\n",
|
|
"template = \"\"\"<|user|>\n",
|
|
"Relevant information:\n",
|
|
"{context}\n",
|
|
"\n",
|
|
"Provide a concise answer to the following question using the relevant information provided above:\n",
|
|
"{question}<|end|>\n",
|
|
"<|assistant|>\"\"\"\n",
|
|
"prompt = PromptTemplate(\n",
|
|
" template=template,\n",
|
|
" input_variables=[\"context\", \"question\"]\n",
|
|
")\n",
|
|
"\n",
|
|
"# RAG pipeline: retrieve relevant texts from the vector database and insert them into the prompt as {context}\n",
|
|
"rag = RetrievalQA.from_chain_type(\n",
|
|
" llm=llm,\n",
|
|
" chain_type='stuff',\n",
|
|
" retriever=db.as_retriever(),\n",
|
|
" chain_type_kwargs={\n",
|
|
" \"prompt\": prompt\n",
|
|
" },\n",
|
|
" verbose=True\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/"
|
|
},
|
|
"executionInfo": {
|
|
"elapsed": 90250,
|
|
"status": "ok",
|
|
"timestamp": 1718963614201,
|
|
"user": {
|
|
"displayName": "Maarten Grootendorst",
|
|
"userId": "11015108362723620659"
|
|
},
|
|
"user_tz": -120
|
|
},
|
|
"id": "x2p2pJPfIp16",
|
|
"outputId": "3d284ce5-d35d-429d-fcec-a6a4a427dc05"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\n",
|
|
"\n",
|
|
"\u001b[1m> Entering new RetrievalQA chain...\u001b[0m\n",
|
|
"\n",
|
|
"\u001b[1m> Finished chain.\u001b[0m\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"{'query': 'Income generated',\n",
|
|
" 'result': \" Interstellar grossed over $677 million worldwide in 2014 and had additional earnings from subsequent re-releases, totaling approximately $773 million. The film's release utilized both traditional film stock and digital projectors across various venues to maximize its income generation potential.\"}"
|
|
]
|
|
},
|
|
"execution_count": 29,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"rag.invoke('Income generated')"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"accelerator": "GPU",
|
|
"colab": {
|
|
"gpuType": "T4",
|
|
"provenance": []
|
|
},
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.10.14"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 4
|
|
}
|