{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "g_a9QvUFVCUR" }, "source": [ "

Chapter 2 - Tokens and Token Embeddings

\n", "Exploring tokens and embeddings as an integral part of building LLMs\n", "\n", "\n", "\n", "\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter02/Chapter%202%20-%20Tokens%20and%20Token%20Embeddings.ipynb)\n", "\n", "---\n", "\n", "This notebook is for Chapter 2 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).\n", "\n", "---\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [OPTIONAL] - Installing Packages on \n", "\n", "If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **uncomment and run** the following codeblock to install the dependencies for this chapter:\n", "\n", "---\n", "\n", "💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to\n", "**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %%capture\n", "# !pip install transformers>=4.41.2 sentence-transformers>=3.0.1 gensim>=4.3.2 scikit-learn>=1.5.0 accelerate>=0.31.0" ] }, { "cell_type": "markdown", "metadata": { "id": "oQHfpqT_t9-K" }, "source": [ "# Downloading and Running An LLM\n", "\n", "The first step is to load our model onto the GPU for faster inference. Note that we load the model and tokenizer separately and keep them as such so that we can explore them separately." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 753, "referenced_widgets": [ "851b6e59cc2e4eb8961cb5fa4906c47c", "fd5b6ec0a82c493a92a2235635ea5ac0", "2bc5005713ba4b61a392935e1d83994c", "607bb5b3f27d463dbdb5ad04583a1c4f", "0d0c6c47fbb34090a73ed8bf20597ee5", "083149fa68934cce90cd03b6c8f90dd4", "73263751695843ada347c6c694afd0cd", "79c1152a61dd4a87a6a4e700096d92b4", "27fb45533f134c308cc0c27797c127f3", "78eb9f6c02e841ad9ecf8ed809c724bb", "1ade569c5cec4764af86e7eec6bc64ed", "dbf8268e613e4e1498133a42a48a58f8", "66c3d79c9411447eabae228b5d125c09", "46ba964cc4c84bbb87eeccc817a3354c", "150625fb48b740f6bce66fd4919357a9", "974be84551a54deb82051b947ff013af", "22e2cbe577e6475aa80dd94f4b704f1b", "83daaa4a1a5d437e8345d4740c3625cd", "6e1c0e41b31b4b0b83aa3736e386aa49", "bf1d39ed8ee84a6b95e68a11962032b4", "7476eb73c9544435b07256f028168a11", "f33bc31f0cb842388057b33dd107f2e7", "9b9f8a8f0eb14dd9a2c82461f22c4636", "214eb5ac9a5048c586b3418452017991", "dd1b7306ad4641799d4947a94aa0f088", "d58e84330cff499c87e03827d9727d19", "2809ac7769af482a86d5e763f5f69211", "746054d4ac0b442e85b75959b66ceb34", "9b77017c3e764c97b44c56fb5ad3bc77", "3dc073c6257143748849311e1c08bdbe", "3117fa7446394cb6b6b56427be0a3290", "5c3c4612f65f40808f8ef1d0e31f9836", "f66983763d454728b86d881621948b8d", "a855727c923648ea8d6c7b9ec117c94f", "6c9350120fec4f4e9b80e063279b459c", "28283e52317e437c8e118d052d93419c", "5556b8b2f22548109153146ef66702df", "dc1457b4d1d344c9863fbb6b0d38e2cf", "029a23c145604f209b12044a6b367802", "252ccf08bd52433a888d2bac9ed1b64c", "893af8dae5874b55a582f180652e36e3", "27e14bb1e4aa467c97ac0c6a6eb44108", "87812873ff2d4514aca02c64a3509bd9", "13ecb25d0d6541b1a4f2b7dea92dee61", "2107a7e7d40f416ca4bc46c90725e0ca", "1cfc631c1d7d4fb087753e9e34fd1aa2", "238f094ef6744b31ba4444d47c17558c", "b11bf972b98645e79f98cdec2c1440f3", "baa7c53a02cd42688d768a4595611f64", "6a63559431934a4aae4ce79e8b2e76c0", "81652b39152145748580c9667b2554de", 
"4dc8d9b1a2d248f9a30bc8e985db4568", "adfcbd50473c4ea9bb414e33926cc33b", "52a9a5aed89a4294882c8c55b2078b84", "83fb238d4eba47a390233ecb5e870ed4", "8da9bab6ee214504a187e5b3c9bf1b80", "77922a918c1b4bb08f5304a34044e9ba", "ac36ee2d4df04399b11f57bb56930a52", "df2f73dc04d64bf6adb4df0eae207c27", "3cbc759c16b24d63971ba5ef53ff9ba3", "ea5d84bf35754e71a628755c0cff49be", "3ed30ef053324b3f9a5d3258d2a88a86", "f8bae10ac5b54f1b99f7b595f123be4d", "dfb5b5a357ca41f88efa53288ebeb5b3", "b917827c07344e018543fcbb9c7a6fe8", "bad199037f6548749f4dfe12724e3e62", "0e30dc616d9c4c50adb9cd292fc3d89e", "570a9d836e48456d8c301184d670c50e", "471cdd69adc24003b14680e8dbe7b183", "c11552c4f423434aa2c41dfa22b8f227", "687c61c78b044690a91e34affe570613", "82dcc51143ac46349c13d2f5bee380da", "a9c8a9b099aa416d9015ee48006db324", "1fcc7e4cd65449c49956981ec1b46843", "5fc8310895a64815b15af53e3583f60d", "ffa0950ea21a4689b5804b0285049ccd", "da9ec5fd7acd43b7a11ac860771732df", "961d4e8f3a1845e393e25d21de3cd6d3", "e7506426cc424dfdb28fa6c35cb74c24", "d9b26784fc7c4e6b8abe9490caf53afc", "e96d58589fcc4bd7a469c4281baa01ee", "dd8fa47d73774a7684cda73e6675c0bd", "41843fb3c92c425b9f39cb124b74e3ad", "2ab54f26cc504dc09db592039c02c210", "cab531e67dd3452f9a4fcd17e324d76a", "58de29a54c91401d92cc73a96721a3e9", "0eabbb5ac0784a1c89ad58cca37a38fe", "362724ce1c5944bd9335b66a14f89844", "bd2749d8ba7240408be56e26c796e00a", "b54e7ab6e22345bdae3484cab6dab985", "77d3a6cc940a4dd7a38c47dfc2b7041b", "13071c70f17144b891a1f0fde6d9b21e", "205628f7d62d4e54add8ee90b418ebdf", "037c74818af74266b2bce5a454b99ab8", "bba33c0c581d415bba86aebfc8642196", "294ee2f27b4c4ee1ad6ca2a121a9e66d", "5bdf1b7758624e379c71da5296929973", "6cae5555dce346dc9867edd130c840c9", "0e37d9ed7d3142a3b80f188ddee54a4d", "d077b2c0e6a142cb905e38a7c7855997", "2a47328eace84efbb8fcf6644d4e469d", "c1634b24795243f5b8384957ce11b9c0", "b8285e7e6db348c385694d2fd63b514e", "da9d77e8fa0641dc97f7610aa59374a2", "7209a02303ad46f3880799f0b92171f0", "770f06c5ff714c87af1d0d7536da12c9", 
"f59c537df5684195b1b5c44754ed2d07", "6dd8dd0a99bd4cbbb7401a3fa3893128", "f8c4e2a7f11f47fcbd574ceaf828aa66", "193e7cde826d43bd894362498a217888", "febedf9dd6d248baa984ac215697968d", "0bf43b40190141d5bf8e0be3fdb1e529", "4070e2510e1549c3879e2458e42bb269", "9ba44d309f684c98991889b992767ab3", "b47b04d69b33402f80ea75ed472d3c29", "19a091b24a5d44b9a98569a7e124c6f8", "0d1cea03253c426b84814936c93e5279", "7c1b3811b42549569cc83423bbeb2163", "55bc2055b3c5426f8ba9afc9f80ccd58", "61098d22dd294891afdd34b6208906ba", "0a0db951df42424084b531ebbd6cfa98", "3f1dc764fb2c48bd9fd8797a86b6c59f", "95146eb05e964e32bfa6f5078a66fab7", "8c7e09509c524cf29782df69988eef6d", "5237587cfb9c4e86a82aaf72cc9db39d", "22fd14ac3ccf4e1d8e4782d35e7ec91c", "138a3cb6ba5b494c8a237af0c6306a4f", "e88a1d50d7f940d793f4dcd76a5f5929", "4182011b06304005a6e8922d338c65fb", "16ae1f6d9d844d60b55da9e36a5792cd", "56ea52f3a1ea485ca52eebaf4f97c3a3", "c81c6fe3a8ed4466bf95c53dfe415fba", "157b7540b5a3404e82d7e86b19906c7a", "48d180eca04341d692d5fb83a8e0f76d", "c84b966b98144e22884183cd2a03e2a8", "9a6496f9037149289e64e66a0f46cbd3", "ecd15644266b43678dbe7b067b553833", "43a6de1c960341eda9e154d47e0142c9", "6e8fa3e12a8b4262bbca73c57c7f4eb9", "86445e2d62b44847a4be7419a6000305", "3837338134c240caa0c51501fae402fb", "5e2d4a77710e428ebef6bae0854c8527", "63d81ace73634bcdaef6f5a103c35dcf", "dd1af9490bc24d5ea2804adefd442fdb", "cf18ae8dd5bb4d378f017f9eaff967b4", "461fde63ac0a41ed8ad9aab2d60769a7", "3c5bb3c7c77f4ca393dc05b12fb69dd0", "bd707eacc9914dc69a1fe4e35320a9a4", "d7cb5720fbbb4ae0858dd3070c712615", "bf390e3a7172415089b2cce2223dd8b6", "72ae6f3dcf3d4567a4a60f627a51f1f4", "026141ba5e404e4e86c1d1c083ab186d", "fcd5f49ec07f401fb8a2b403516bbd17", "22d378959eb941c5bbfe77702fb426e6" ] }, "executionInfo": { "elapsed": 95520, "status": "ok", "timestamp": 1723034396041, "user": { "displayName": "Jay Alammar جهاد العمار", "userId": "14617748739431919458" }, "user_tz": -60 }, "id": "jjU8NBHnwA4j", "outputId": "286bdccb-f25d-4b0e-bda3-44d2a4be45cd" }, 
"outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.\n", "Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "dd45dd8837f94b38ae6f4ffd205d9ea6", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading checkpoint shards: 0%| | 0/2 [00:00 Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|> Subject: My Sincere Apologies for the Gardening Mishap\n", "\n", "Dear\n" ] } ], "source": [ "prompt = \"Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>\"\n", "\n", "# Tokenize the input prompt\n", "input_ids = tokenizer(prompt, return_tensors=\"pt\").input_ids.to(\"cuda\")\n", "\n", "# Generate the text\n", "generation_output = model.generate(\n", " input_ids=input_ids,\n", " max_new_tokens=20\n", ")\n", "\n", "# Print the output\n", "print(tokenizer.decode(generation_output[0]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1719641447389, "user": { "displayName": "Maarten Grootendorst", "userId": "11015108362723620659" }, "user_tz": -120 }, "id": "JmzgbbdKuvHt", "outputId": "82511d5b-7949-49a0-e3a6-c128564575c8" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor([[ 1, 14350, 385, 4876, 27746, 5281, 304, 19235, 363, 278,\n", " 25305, 293, 16423, 292, 286, 728, 481, 29889, 12027, 7420,\n", " 920, 372, 9559, 29889, 32001]], device='cuda:0')\n" ] } ], "source": [ "print(input_ids)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": 
"ok", "timestamp": 1719641447389, "user": { "displayName": "Maarten Grootendorst", "userId": "11015108362723620659" }, "user_tz": -120 }, "id": "W4vsjbxwu1K1", "outputId": "506f32d1-f058-4cfd-a9cd-13c4dabe80e6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Write\n", "an\n", "email\n", "apolog\n", "izing\n", "to\n", "Sarah\n", "for\n", "the\n", "trag\n", "ic\n", "garden\n", "ing\n", "m\n", "ish\n", "ap\n", ".\n", "Exp\n", "lain\n", "how\n", "it\n", "happened\n", ".\n", "<|assistant|>\n" ] } ], "source": [ "for id in input_ids[0]:\n", " print(tokenizer.decode(id))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1719641447389, "user": { "displayName": "Maarten Grootendorst", "userId": "11015108362723620659" }, "user_tz": -120 }, "id": "A9wRZ3J3u4z1", "outputId": "7efaa49c-7a5a-41d7-f000-7aace16007e5" }, "outputs": [ { "data": { "text/plain": [ "tensor([[ 1, 14350, 385, 4876, 27746, 5281, 304, 19235, 363, 278,\n", " 25305, 293, 16423, 292, 286, 728, 481, 29889, 12027, 7420,\n", " 920, 372, 9559, 29889, 32001, 3323, 622, 29901, 1619, 317,\n", " 3742, 406, 6225, 11763, 363, 278, 19906, 292, 341, 728,\n", " 481, 13, 13, 29928, 799]], device='cuda:0')" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "generation_output" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 275, "status": "ok", "timestamp": 1723034447362, "user": { "displayName": "Jay Alammar جهاد العمار", "userId": "14617748739431919458" }, "user_tz": -60 }, "id": "7QlHLof3u8A3", "outputId": "c2315e1b-91b4-4a1b-9bcc-084f16ac8db1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sub\n", "ject\n", "Subject\n", ":\n" ] } ], "source": [ "print(tokenizer.decode(3323))\n", 
"print(tokenizer.decode(622))\n", "print(tokenizer.decode([3323, 622]))\n", "print(tokenizer.decode(29901))" ] }, { "cell_type": "markdown", "metadata": { "id": "T9nRducW48bd" }, "source": [ "# Comparing Trained LLM Tokenizers\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7W0xFIVo5A0S" }, "outputs": [], "source": [ "from transformers import AutoModelForCausalLM, AutoTokenizer\n", "\n", "colors_list = [\n", " '102;194;165', '252;141;98', '141;160;203',\n", " '231;138;195', '166;216;84', '255;217;47'\n", "]\n", "\n", "def show_tokens(sentence, tokenizer_name):\n", " tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)\n", " token_ids = tokenizer(sentence).input_ids\n", " for idx, t in enumerate(token_ids):\n", " print(\n", " f'\\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m' +\n", " tokenizer.decode(t) +\n", " '\\x1b[0m',\n", " end=' '\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Gcc3JjwX5DK-" }, "outputs": [], "source": [ "text = \"\"\"\n", "English and CAPITALIZATION\n", "🎵 鸟\n", "show_tokens False None elif == >= else: two tabs:\" \" Three tabs: \" \"\n", "12.0*50=600\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 354, "status": "ok", "timestamp": 1725544666773, "user": { "displayName": "Jay Alammar جهاد العمار", "userId": "14617748739431919458" }, "user_tz": 240 }, "id": "fCDGSXP75Hv-", "outputId": "f2c26835-a857-41db-ff2d-d930d06e512e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[0;30;48;2;102;194;165m[CLS]\u001b[0m \u001b[0;30;48;2;252;141;98menglish\u001b[0m \u001b[0;30;48;2;141;160;203mand\u001b[0m \u001b[0;30;48;2;231;138;195mcapital\u001b[0m \u001b[0;30;48;2;166;216;84m##ization\u001b[0m \u001b[0;30;48;2;255;217;47m[UNK]\u001b[0m \u001b[0;30;48;2;102;194;165m[UNK]\u001b[0m \u001b[0;30;48;2;252;141;98mshow\u001b[0m 
\u001b[0;30;48;2;141;160;203m_\u001b[0m \u001b[0;30;48;2;231;138;195mtoken\u001b[0m \u001b[0;30;48;2;166;216;84m##s\u001b[0m \u001b[0;30;48;2;255;217;47mfalse\u001b[0m \u001b[0;30;48;2;102;194;165mnone\u001b[0m \u001b[0;30;48;2;252;141;98meli\u001b[0m \u001b[0;30;48;2;141;160;203m##f\u001b[0m \u001b[0;30;48;2;231;138;195m=\u001b[0m \u001b[0;30;48;2;166;216;84m=\u001b[0m \u001b[0;30;48;2;255;217;47m>\u001b[0m \u001b[0;30;48;2;102;194;165m=\u001b[0m \u001b[0;30;48;2;252;141;98melse\u001b[0m \u001b[0;30;48;2;141;160;203m:\u001b[0m \u001b[0;30;48;2;231;138;195mtwo\u001b[0m \u001b[0;30;48;2;166;216;84mtab\u001b[0m \u001b[0;30;48;2;255;217;47m##s\u001b[0m \u001b[0;30;48;2;102;194;165m:\u001b[0m \u001b[0;30;48;2;252;141;98m\"\u001b[0m \u001b[0;30;48;2;141;160;203m/\u001b[0m \u001b[0;30;48;2;231;138;195mt\u001b[0m \u001b[0;30;48;2;166;216;84m/\u001b[0m \u001b[0;30;48;2;255;217;47mt\u001b[0m \u001b[0;30;48;2;102;194;165m\"\u001b[0m \u001b[0;30;48;2;252;141;98mthree\u001b[0m \u001b[0;30;48;2;141;160;203mtab\u001b[0m \u001b[0;30;48;2;231;138;195m##s\u001b[0m \u001b[0;30;48;2;166;216;84m:\u001b[0m \u001b[0;30;48;2;255;217;47m\"\u001b[0m \u001b[0;30;48;2;102;194;165m\"\u001b[0m \u001b[0;30;48;2;252;141;98m12\u001b[0m \u001b[0;30;48;2;141;160;203m.\u001b[0m \u001b[0;30;48;2;231;138;195m0\u001b[0m \u001b[0;30;48;2;166;216;84m*\u001b[0m \u001b[0;30;48;2;255;217;47m50\u001b[0m \u001b[0;30;48;2;102;194;165m=\u001b[0m \u001b[0;30;48;2;252;141;98m600\u001b[0m \u001b[0;30;48;2;141;160;203m[SEP]\u001b[0m " ] } ], "source": [ "show_tokens(text, \"bert-base-uncased\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 219, "referenced_widgets": [ "76ff072348d0471abaa566d7de6b8e93", "5323bca1afe64a2a9bc9c14ac39ad230", "bf9fbc21d76a424dab0c8687bd0874c9", "f67d099d1e0343d9a822523223d21a75", "1dc884bbe68e4bf4ad1e14ca44b5c2fd", "b57a5357a7334d47b39f14e07e2d5708", "e608cc17958347c19615577b73926f45", 
"1695867ecb3f402ea8e10920adf651ee", "9fb081ecc7374f73b32cb203c7ed042d", "2fdb8c6afd2647b695f249b4f8122e52", "6e12c295923644ab930dcb51de932fd9", "d3316f8df2804c2ba34504b196aed6be", "893c6036041346ab9486cb7bde06ad0d", "42f7a33f1b554c289a434300bcc19f70", "d75a253a2ef648e88e35d765be5d4c34", "3b5a28c840ee4ba78676b4c3dbcd7af6", "e1175aef0e2e4b17afdb6a62f68887a7", "8774144d2cbc44f4bb39305e20cd5093", "139f7c4f547f4d08a6126897617989d5", "61bafb93125042f5bb7cc1195b459d45", "8053c516a8df40498085843ff07a2884", "81aa45723ec44d18a0e37844b9c70c4e", "33541a7c0d664fa2bd104fc9bc91f1bd", "580f25a7f04943e6a5100bdf584f8c97", "1848a0b868254c848115fc59b2cdd639", "47e387c2bcdc43329354304e5358c224", "883b562fa1c7409fbf43d2ac90c29955", "9798a6e28f56466f9582a40064ad4c4f", "4f31fd12d7c04233856206304b2a1bc7", "1460bc4aca764def9120218a963ae183", "3933809fb01c43bca62dc220cc94f217", "e4caf964309345bf82ad65411c1a7f3f", "ea6b6957c38d469abbe07bb20c811d2d", "eabd5498553646f18eada3542254cb0b", "206f377162614c72a74e73557fece973", "deabe5ef1f1f473f81403df9d8923846", "4096283e990a47a7bd4c00416fa71788", "3598675602e34600ae5c719d67778d24", "9ca5334888b249b2bd318be5a97fae7e", "7cce748a6f264cb69921fc97d4b8c946", "449aaf8bc9a64f6483eff88cc4678f6c", "e983503abea646a68dfe45652f7e78d1", "06b9fe5f3e644ddab47a17a3276e1a67", "4128b1818d30497f9a3d6869e02addaa" ] }, "executionInfo": { "elapsed": 1520, "status": "ok", "timestamp": 1719589575187, "user": { "displayName": "Jay Alammar جهاد العمار", "userId": "14617748739431919458" }, "user_tz": 240 }, "id": "0Ay_NX3K5HyP", "outputId": "4a32ab93-75f2-4b70-a55b-b643283c8270" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "76ff072348d0471abaa566d7de6b8e93", "version_major": 2, "version_minor": 0 }, "text/plain": [ "tokenizer_config.json: 0%| | 0.00/49.0 [00:00\u001b[0m \u001b[0;30;48;2;141;160;203m=\u001b[0m \u001b[0;30;48;2;231;138;195melse\u001b[0m \u001b[0;30;48;2;166;216;84m:\u001b[0m \u001b[0;30;48;2;255;217;47mtwo\u001b[0m 
\u001b[0;30;48;2;102;194;165mta\u001b[0m \u001b[0;30;48;2;252;141;98m##bs\u001b[0m \u001b[0;30;48;2;141;160;203m:\u001b[0m \u001b[0;30;48;2;231;138;195m\"\u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47mThree\u001b[0m \u001b[0;30;48;2;102;194;165mta\u001b[0m \u001b[0;30;48;2;252;141;98m##bs\u001b[0m \u001b[0;30;48;2;141;160;203m:\u001b[0m \u001b[0;30;48;2;231;138;195m\"\u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47m12\u001b[0m \u001b[0;30;48;2;102;194;165m.\u001b[0m \u001b[0;30;48;2;252;141;98m0\u001b[0m \u001b[0;30;48;2;141;160;203m*\u001b[0m \u001b[0;30;48;2;231;138;195m50\u001b[0m \u001b[0;30;48;2;166;216;84m=\u001b[0m \u001b[0;30;48;2;255;217;47m600\u001b[0m \u001b[0;30;48;2;102;194;165m[SEP]\u001b[0m " ] } ], "source": [ "show_tokens(text, \"bert-base-cased\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 284, "referenced_widgets": [ "77d1608ca87f4bc1b7a731c896d86db9", "6e2c83d85af0419c81df10e85e31d29d", "5d3e86f8d3f949aeacabace1f7640d81", "6fd2baf1fc1244d38fc41cde100d7b6e", "41965c378ba243339352ad3926d48862", "873384e3c478450ea4a5a9e061c87133", "dcce05965068457fb764ae1c04066d88", "41f2087fd27a4c018087063e8e7629d3", "905ac86fe3294497b1722e955d63ed4c", "131c47105d324eddaea5c241829e878a", "22f955573c2845cfba5c314b59d26739", "41301a0547754ecfbd6044f6eacb0b8d", "eff43edf235a4b92b0e26d5bc21fc909", "bbdc2b4c70a0426aa3d2043d0b91b839", "405476ddfe634ad793e28474dfe30ecc", "8575a84785714069921bbfdc13fb957e", "bad64205077a496f96e6d03d927140ba", "ca152f8ec99e48b39ee5267269eeaca0", "461e3c04697641359924ae0902b13db0", "f15e240b2d01488698caa3275e0bacc1", "4e2491afa9fc4d65b95e9471af782e4d", "ff1f8b630ceb449e910ca34d969fdafd", "fa60038dc7c547b8b1d9c54f88fd6b39", "e69fe73aa3d44de0bddbe1711269bd8e", "5ff700af61664f0eaebe580c8a49a910", "e901ed75738f4e41b79651dc012003a6", "b115c5c5193f489f87209bb5c6d788f9", 
"3bce3be7198c4917a6cb2183e1344e2c", "dedac5ecd5844ee29e346f465074d3fd", "4521cc909b2942de889a37dbec1f0277", "645a2e3dba2f49fb9ce9c0e2b2a8e73f", "198833f7fa2f4ff8ab064b4671461830", "66acf835e274473f84ccd486a99e71d2", "db2934af14274fe78ffc85f7d03fd1c8", "9be5d1e096934134a00974cf8e3fa63c", "fc2c07f1eeee43e3aad438206929f5df", "08d6a11cebf840748261e0ba6970092b", "f8912b0da8aa4f7499ad3f4e5ccca84b", "c87ad6d49a054fc8850bebf87c444ee3", "17b6d632e333476b99f4315fe737d359", "2661e810b7084f93a4dcd454ea7665a0", "042557f8b84b422882c651f910a9fce0", "8f95639d6dc946f18f14d8a16e73b4a4", "89224986b13645d9a9dbedb038e795bb", "2fcd1b9c380f422291096e57a6c7f85e", "142133a41f664fdf82a1b16d87a68ae5", "d1c2b3aac5cc4f3fad1413f8cfdc04e3", "bce267a75b9946ae8b0db42dd7f925d1", "3b46d5e1b7fe4427909d1c82debd7ba7", "3dc1a66c56fb428aad53f5221ed1ae18", "23cda173696645b6955515990b6834ec", "2141c20003154d8dab2855deb44d3aad", "2213d4aa55eb4ef384eaf879552dcd7d", "56009a7bf6fc4d7c8300fc9dc4d6ad14", "551cdd6ea1d94ba8a6cce7b00798c63b" ] }, "executionInfo": { "elapsed": 2010, "status": "ok", "timestamp": 1719589579935, "user": { "displayName": "Jay Alammar جهاد العمار", "userId": "14617748739431919458" }, "user_tz": 240 }, "id": "K_k5QduY5H0u", "outputId": "2e844f23-3dee-4078-8d51-4c250d2c2f3e" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "77d1608ca87f4bc1b7a731c896d86db9", "version_major": 2, "version_minor": 0 }, "text/plain": [ "tokenizer_config.json: 0%| | 0.00/26.0 [00:00=\u001b[0m \u001b[0;30;48;2;141;160;203m else\u001b[0m \u001b[0;30;48;2;231;138;195m:\u001b[0m \u001b[0;30;48;2;166;216;84m two\u001b[0m \u001b[0;30;48;2;255;217;47m tabs\u001b[0m \u001b[0;30;48;2;102;194;165m:\"\u001b[0m \u001b[0;30;48;2;252;141;98m \u001b[0m \u001b[0;30;48;2;141;160;203m \u001b[0m \u001b[0;30;48;2;231;138;195m \u001b[0m \u001b[0;30;48;2;166;216;84m \"\u001b[0m \u001b[0;30;48;2;255;217;47m Three\u001b[0m \u001b[0;30;48;2;102;194;165m tabs\u001b[0m 
\u001b[0;30;48;2;252;141;98m:\u001b[0m \u001b[0;30;48;2;141;160;203m \"\u001b[0m \u001b[0;30;48;2;231;138;195m \u001b[0m \u001b[0;30;48;2;166;216;84m \u001b[0m \u001b[0;30;48;2;255;217;47m \u001b[0m \u001b[0;30;48;2;102;194;165m \u001b[0m \u001b[0;30;48;2;252;141;98m \u001b[0m \u001b[0;30;48;2;141;160;203m \u001b[0m \u001b[0;30;48;2;231;138;195m \"\u001b[0m \u001b[0;30;48;2;166;216;84m\n", "\u001b[0m \u001b[0;30;48;2;255;217;47m12\u001b[0m \u001b[0;30;48;2;102;194;165m.\u001b[0m \u001b[0;30;48;2;252;141;98m0\u001b[0m \u001b[0;30;48;2;141;160;203m*\u001b[0m \u001b[0;30;48;2;231;138;195m50\u001b[0m \u001b[0;30;48;2;166;216;84m=\u001b[0m \u001b[0;30;48;2;255;217;47m600\u001b[0m \u001b[0;30;48;2;102;194;165m\n", "\u001b[0m " ] } ], "source": [ "show_tokens(text, \"gpt2\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 183, "referenced_widgets": [ "63772475e9234672994f2a8edf89b192", "fa45fdb364444208b2693760809e3c60", "5a92b1072a834d69ad33060ae1b0fdf0", "b8a3d22de4964f369ecabc31b6cdca57", "274983c981654d1eb25d630d8d5e47e3", "eebaba8a83ed48f8afe2ca129f3a73c9", "2d7a3c68a2e246059a93fea280f4c2c8", "22983afeb04d4a06b8b01527d869585e", "f7492077d2ba4ccf8df039473057321d", "fb255528c7754047aeb29863dd642b19", "cc7f7f3ef40042458ac7565b070af032", "bc67bb86f76e482ab703f3f403f4cc76", "a7605351b83941698348ee84cd99f955", "88af753262344cfe9a88b133540bcaf7", "8f92316b17a24623b8646747f5ecc7d6", "8153c7e3f21f44f09a3222da6312137d", "22bee52a95d243868b40b3f8ce5ca7d9", "f09f8925fc34454dbb970c31d5d82707", "354c2db5dbd34284baf62a7529537b8b", "028db37ce29c45939adeed5ae311583c", "dd534b3f89c64de9b9aa4f7a95a05f34", "30ed305df62c45329f24a2f64499d490", "36e3e6b45fca44c8b01a729189b1bdab", "4727da03ed724b8dafff24856652fe95", "b580d17fa1134425a837a48aba06dbf4", "e71172ffd8d74f189cea18e4898c4c2d", "ce0031117cf347f48d027cd70e87193b", "9ee79c669c6641d887fa284286435f57", "9885aaa3e0be4052af4848b92c642cdd", 
"ce6eee6121334479a72e023b252124d2", "3adda22d2bca47a886da662621ec9a9d", "6a1a34996ae24a14972fd425be48dd7c", "2c323975d4454113a863e3ec0b56f4fb", "4b9fab6416924d509bbb6361f63797e1", "a721b2fba975474c8a3d9384cf998228", "1f76260c0f6c46598bee13eb4a0f8b65", "fd95f752cc684613b0f6c6db12af874f", "3c0a4cbec1bf4c7886a1a9271c1b0832", "2489776fa74e4dccb4154368f3861623", "c8506a9393604d119ff71264e27b6734", "93187864ee6d4d41b49df1c84f35e6e2", "d7cbd635afec493ebcc4973f8d98c58b", "d09ad050869c47fb9fbc558f8b6d47d7", "b4e35a7edeea4b089182f4e9b15dc12e" ] }, "executionInfo": { "elapsed": 1618, "status": "ok", "timestamp": 1719589589160, "user": { "displayName": "Jay Alammar جهاد العمار", "userId": "14617748739431919458" }, "user_tz": 240 }, "id": "EJn5nf3c5H2_", "outputId": "607c38ff-9425-4371-f5e0-1f8ee9449eee" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "63772475e9234672994f2a8edf89b192", "version_major": 2, "version_minor": 0 }, "text/plain": [ "tokenizer_config.json: 0%| | 0.00/2.54k [00:00\u001b[0m \u001b[0;30;48;2;231;138;195m\u001b[0m \u001b[0;30;48;2;166;216;84m\u001b[0m \u001b[0;30;48;2;255;217;47mshow\u001b[0m \u001b[0;30;48;2;102;194;165m_\u001b[0m \u001b[0;30;48;2;252;141;98mto\u001b[0m \u001b[0;30;48;2;141;160;203mken\u001b[0m \u001b[0;30;48;2;231;138;195ms\u001b[0m \u001b[0;30;48;2;166;216;84mFal\u001b[0m \u001b[0;30;48;2;255;217;47ms\u001b[0m \u001b[0;30;48;2;102;194;165me\u001b[0m \u001b[0;30;48;2;252;141;98mNone\u001b[0m \u001b[0;30;48;2;141;160;203m\u001b[0m \u001b[0;30;48;2;231;138;195me\u001b[0m \u001b[0;30;48;2;166;216;84ml\u001b[0m \u001b[0;30;48;2;255;217;47mif\u001b[0m \u001b[0;30;48;2;102;194;165m=\u001b[0m \u001b[0;30;48;2;252;141;98m=\u001b[0m \u001b[0;30;48;2;141;160;203m>\u001b[0m \u001b[0;30;48;2;231;138;195m=\u001b[0m \u001b[0;30;48;2;166;216;84melse\u001b[0m \u001b[0;30;48;2;255;217;47m:\u001b[0m \u001b[0;30;48;2;102;194;165mtwo\u001b[0m \u001b[0;30;48;2;252;141;98mtab\u001b[0m 
\u001b[0;30;48;2;141;160;203ms\u001b[0m \u001b[0;30;48;2;231;138;195m:\u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47m\"\u001b[0m \u001b[0;30;48;2;102;194;165mThree\u001b[0m \u001b[0;30;48;2;252;141;98mtab\u001b[0m \u001b[0;30;48;2;141;160;203ms\u001b[0m \u001b[0;30;48;2;231;138;195m:\u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47m\"\u001b[0m \u001b[0;30;48;2;102;194;165m12.\u001b[0m \u001b[0;30;48;2;252;141;98m0\u001b[0m \u001b[0;30;48;2;141;160;203m*\u001b[0m \u001b[0;30;48;2;231;138;195m50\u001b[0m \u001b[0;30;48;2;166;216;84m=\u001b[0m \u001b[0;30;48;2;255;217;47m600\u001b[0m \u001b[0;30;48;2;102;194;165m\u001b[0m \u001b[0;30;48;2;252;141;98m\u001b[0m " ] } ], "source": [ "show_tokens(text, \"google/flan-t5-small\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 714, "status": "ok", "timestamp": 1723035784494, "user": { "displayName": "Jay Alammar جهاد العمار", "userId": "14617748739431919458" }, "user_tz": -60 }, "id": "1ymhAsTg5H5e", "outputId": "7827a535-4f33-4620-f4e7-4a2b622a78c2" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[0;30;48;2;102;194;165m\n", "\u001b[0m \u001b[0;30;48;2;252;141;98mEnglish\u001b[0m \u001b[0;30;48;2;141;160;203m and\u001b[0m \u001b[0;30;48;2;231;138;195m CAPITAL\u001b[0m \u001b[0;30;48;2;166;216;84mIZATION\u001b[0m \u001b[0;30;48;2;255;217;47m\n", "\u001b[0m \u001b[0;30;48;2;102;194;165m�\u001b[0m \u001b[0;30;48;2;252;141;98m�\u001b[0m \u001b[0;30;48;2;141;160;203m�\u001b[0m \u001b[0;30;48;2;231;138;195m �\u001b[0m \u001b[0;30;48;2;166;216;84m�\u001b[0m \u001b[0;30;48;2;255;217;47m�\u001b[0m \u001b[0;30;48;2;102;194;165m\n", "\u001b[0m 
\u001b[0;30;48;2;252;141;98mshow\u001b[0m \u001b[0;30;48;2;141;160;203m_tokens\u001b[0m \u001b[0;30;48;2;231;138;195m False\u001b[0m \u001b[0;30;48;2;166;216;84m None\u001b[0m \u001b[0;30;48;2;255;217;47m elif\u001b[0m \u001b[0;30;48;2;102;194;165m ==\u001b[0m \u001b[0;30;48;2;252;141;98m >=\u001b[0m \u001b[0;30;48;2;141;160;203m else\u001b[0m \u001b[0;30;48;2;231;138;195m:\u001b[0m \u001b[0;30;48;2;166;216;84m two\u001b[0m \u001b[0;30;48;2;255;217;47m tabs\u001b[0m \u001b[0;30;48;2;102;194;165m:\"\u001b[0m \u001b[0;30;48;2;252;141;98m \u001b[0m \u001b[0;30;48;2;141;160;203m \"\u001b[0m \u001b[0;30;48;2;231;138;195m Three\u001b[0m \u001b[0;30;48;2;166;216;84m tabs\u001b[0m \u001b[0;30;48;2;255;217;47m:\u001b[0m \u001b[0;30;48;2;102;194;165m \"\u001b[0m \u001b[0;30;48;2;252;141;98m \u001b[0m \u001b[0;30;48;2;141;160;203m \"\n", "\u001b[0m \u001b[0;30;48;2;231;138;195m12\u001b[0m \u001b[0;30;48;2;166;216;84m.\u001b[0m \u001b[0;30;48;2;255;217;47m0\u001b[0m \u001b[0;30;48;2;102;194;165m*\u001b[0m \u001b[0;30;48;2;252;141;98m50\u001b[0m \u001b[0;30;48;2;141;160;203m=\u001b[0m \u001b[0;30;48;2;231;138;195m600\u001b[0m \u001b[0;30;48;2;166;216;84m\n", "\u001b[0m " ] } ], "source": [ "# The official tokenizer is available through `tiktoken`, but this is the same tokenizer hosted on the Hugging Face Hub\n", "show_tokens(text, \"Xenova/gpt-4\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 284, "referenced_widgets": [ "770da5f1b8b24972b2018e1cadd3ec8a", "e2f255083a1b4d8f9992f27b4d21c676", "4fac840948a048d6b1ca7c0dc4f4c5d5", "c2ac158eb3f0469ca90f7247c546c70f", "4a8cfc4637124995866810ef1b750fe4", "409169ca000a484ca4472750cfe63f30", "8663795c551e457dafc93d02cf0026c3", "5639fe0c03e7451db316579356290e3d", "46e73023abe1465382771e9af87f36fc", "5feaeba2e88a4a7fb83a85f3200e2639", "76a8030af7794356b5c2daa891d789e2", "37c46ab78fc64eb98923b24d6a0de37e", "3bb4b5235ef74c7d89489f8fa8cded17", "a720bb387fca45968c75352398935382", 
"683b85afadb744e4bd7164c51f01d3f9", "00dd050102674a1ab3fd8d8f9caec4b0", "cf6a7c6ada024f8e9f106428d506b078", "33744d7c827e4784a91955159a47e337", "f6dce141c94d4c8494f75a7387b65331", "48738fb1cf8e4f0fb70b74a5896669cd", "1b10141545cb489fa3a58d4939cc4d9b", "99b9c874e58c4db9b596b6ca1699e666", "07bf43728198472997c8b59b9343adfe", "74d33e70d8af43148fd3a618b5d3c5dd", "8008a03780b24639abce64498b1d832e", "82ad72412e1343b983679e625c85f47d", "0d3aa270949048a5886de118b1a3b1f1", "cc568e7a8ca84810ab878e601fae557a", "cb57c7f3455f4a34b59ae39d0b599b8b", "577a22cb6c7549ff96f367bd6f4f8b12", "d0d4c92c9a0f4bd29255d8ff47d18c11", "e6d9b96a5cb9487d90136d097e716a5a", "e878edbee8ea48178b424e56417b7fa5", "e227f5f6bb3b4580b0ea4304d34ad556", "36863dd97aa04c48831d1fb455557adc", "ece59919873646f9bbf41c7547e802a3", "7ff1f54520324b3e9462062ecd87ce69", "2e8d55afb3fc4e2fa6b0b887a09b7ca9", "e527a040f0be4d43830ae6d4335771b0", "32f18ab0328146b5aac38b4c7ef8029d", "5d5d4e02a6724861aa36d9af5ea70ea7", "c8637134a894493093654456f2a9763b", "8a1023f076f34f34ab0b091d7f62c172", "3537e0361ab5475282b7f34adbfc70dc", "12d4df348cd34dc2b3d7ffc41f0561e8", "65326149d62e4404ba49a8d2d505adac", "b84c1bff2fcc48ed8fee636e1bdb16f9", "a8ae6ec72f2744b4991de0a961e1b142", "98182c58a56343e482ed44935be4fd31", "15719135708047a49a19887989dac12d", "a43ccd6d114644ae86c3129786f8105a", "90cc27a829e54b16850552e04f718bc3", "2be39bb4eacf41e381432b37febfd788", "df4dd594480d446da1523bd2d016c1cb", "11ea292f268d481396192b158814e6b1" ] }, "executionInfo": { "elapsed": 9948, "status": "ok", "timestamp": 1719590292199, "user": { "displayName": "Jay Alammar جهاد العمار", "userId": "14617748739431919458" }, "user_tz": 240 }, "id": "3_vAyeTy5H7_", "outputId": "ad3f759f-19b7-4880-cbf8-9ed7cb25d627" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "770da5f1b8b24972b2018e1cadd3ec8a", "version_major": 2, "version_minor": 0 }, "text/plain": [ "tokenizer_config.json: 0%| | 0.00/7.88k [00:00=\u001b[0m 
\u001b[0;30;48;2;231;138;195m else\u001b[0m \u001b[0;30;48;2;166;216;84m:\u001b[0m \u001b[0;30;48;2;255;217;47m two\u001b[0m \u001b[0;30;48;2;102;194;165m tabs\u001b[0m \u001b[0;30;48;2;252;141;98m:\"\u001b[0m \u001b[0;30;48;2;141;160;203m \u001b[0m \u001b[0;30;48;2;231;138;195m \"\u001b[0m \u001b[0;30;48;2;166;216;84m Three\u001b[0m \u001b[0;30;48;2;255;217;47m tabs\u001b[0m \u001b[0;30;48;2;102;194;165m:\u001b[0m \u001b[0;30;48;2;252;141;98m \"\u001b[0m \u001b[0;30;48;2;141;160;203m \u001b[0m \u001b[0;30;48;2;231;138;195m \"\u001b[0m \u001b[0;30;48;2;166;216;84m\n", "\u001b[0m \u001b[0;30;48;2;255;217;47m1\u001b[0m \u001b[0;30;48;2;102;194;165m2\u001b[0m \u001b[0;30;48;2;252;141;98m.\u001b[0m \u001b[0;30;48;2;141;160;203m0\u001b[0m \u001b[0;30;48;2;231;138;195m*\u001b[0m \u001b[0;30;48;2;166;216;84m5\u001b[0m \u001b[0;30;48;2;255;217;47m0\u001b[0m \u001b[0;30;48;2;102;194;165m=\u001b[0m \u001b[0;30;48;2;252;141;98m6\u001b[0m \u001b[0;30;48;2;141;160;203m0\u001b[0m \u001b[0;30;48;2;231;138;195m0\u001b[0m \u001b[0;30;48;2;166;216;84m\n", "\u001b[0m " ] } ], "source": [ "# You need to request access before being able to use this tokenizer\n", "show_tokens(text, \"bigcode/starcoder2-15b\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 220, "referenced_widgets": [ "6a6efb1d66ea423a9b5ff4b2f1f1194c", "1283ee793a40405aa9763e1b88d6d7a3", "11df3b517fd94524be18cff070e273a8", "86b01695df4b42d3a6602d704243b6ee", "b0b1039168a74b8fa79271592f29f0b7", "877ff6d25f524779a10df09be5fc6093", "276ec5fb636b49bcb933a4ce96cb900e", "ac88f9025a0e44b38fc14720d810b5ab", "4ceb8ce8b67a44b4b7505cd7e589dec1", "b0d45aec56fd4219b9224dbe31fad3a3", "bf3a8980a70547f5b853390235a37592", "f11120105af24fe1b40b6490897e2e2e", "8109bb974a3e4b12bd7534b62be20940", "f1d6a31870da4e27bc482bb84953b165", "1128f56169ac4376ab5fbb46d44b01dc", "3da38a2294a145bf86124d0fda8b3255", "85113d3b53fb47bb8593c3a21e37142c", 
"1e4a17f723d14694b5aeb673db7394cc", "87a4011d2d0a4076b027f7068b244dda", "48ca9047fc7d424f99b40c54e6d732f4", "3b169b44c1814ed5a7feba7bab0f3ce6", "dff4ee8d0bd74822a0adf204e21521b8", "5d3f3b08ec5044e3acf4414703e579d9", "900ddbabea1846a3a0dfd8380668bec1", "7da6e29f0349438494ff83975d48f02a", "01f1433d221f437eb0692d25948ce080", "9f189b7e32c94a3a84ffd40beed7d1fd", "3e5f406442df4b848d324ff584eea75c", "575b63bdd98047a4934d556423dd9ce6", "c9662398b8ad4d4fa1a59d44f6205769", "37379e486478437f9fa2f8eac7f9fd60", "2b83154cf934484da54fe0d0b08fe3d3", "759faaea712c4a8abb0ece5217e1c470" ] }, "executionInfo": { "elapsed": 1388, "status": "ok", "timestamp": 1719589605088, "user": { "displayName": "Jay Alammar جهاد العمار", "userId": "14617748739431919458" }, "user_tz": 240 }, "id": "KeWcUdxY6I3u", "outputId": "f39c8f56-1e71-44bb-bade-75bfb33b581c" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6a6efb1d66ea423a9b5ff4b2f1f1194c", "version_major": 2, "version_minor": 0 }, "text/plain": [ "tokenizer_config.json: 0%| | 0.00/166 [00:00\u001b[0m \u001b[0;30;48;2;252;141;98m=\u001b[0m \u001b[0;30;48;2;141;160;203m else\u001b[0m \u001b[0;30;48;2;231;138;195m:\u001b[0m \u001b[0;30;48;2;166;216;84m two\u001b[0m \u001b[0;30;48;2;255;217;47m t\u001b[0m \u001b[0;30;48;2;102;194;165mabs\u001b[0m \u001b[0;30;48;2;252;141;98m:\u001b[0m \u001b[0;30;48;2;141;160;203m\"\u001b[0m \u001b[0;30;48;2;231;138;195m \u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47m Three\u001b[0m \u001b[0;30;48;2;102;194;165m t\u001b[0m \u001b[0;30;48;2;252;141;98mabs\u001b[0m \u001b[0;30;48;2;141;160;203m:\u001b[0m \u001b[0;30;48;2;231;138;195m \u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47m \u001b[0m \u001b[0;30;48;2;102;194;165m\"\u001b[0m \u001b[0;30;48;2;252;141;98m\n", "\u001b[0m \u001b[0;30;48;2;141;160;203m1\u001b[0m \u001b[0;30;48;2;231;138;195m2\u001b[0m \u001b[0;30;48;2;166;216;84m.\u001b[0m 
\u001b[0;30;48;2;255;217;47m0\u001b[0m \u001b[0;30;48;2;102;194;165m*\u001b[0m \u001b[0;30;48;2;252;141;98m5\u001b[0m \u001b[0;30;48;2;141;160;203m0\u001b[0m \u001b[0;30;48;2;231;138;195m=\u001b[0m \u001b[0;30;48;2;166;216;84m6\u001b[0m \u001b[0;30;48;2;255;217;47m0\u001b[0m \u001b[0;30;48;2;102;194;165m0\u001b[0m \u001b[0;30;48;2;252;141;98m\n", "\u001b[0m " ] } ], "source": [ "show_tokens(text, \"facebook/galactica-1.3b\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 374, "status": "ok", "timestamp": 1719589632350, "user": { "displayName": "Jay Alammar جهاد العمار", "userId": "14617748739431919458" }, "user_tz": 240 }, "id": "__QNj2Cohzz2", "outputId": "17ffab73-b07c-44a9-c482-64ab9f4c45a4" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[0;30;48;2;102;194;165m\u001b[0m \u001b[0;30;48;2;252;141;98m\u001b[0m \u001b[0;30;48;2;141;160;203m\n", "\u001b[0m \u001b[0;30;48;2;231;138;195mEnglish\u001b[0m \u001b[0;30;48;2;166;216;84mand\u001b[0m \u001b[0;30;48;2;255;217;47mC\u001b[0m \u001b[0;30;48;2;102;194;165mAP\u001b[0m \u001b[0;30;48;2;252;141;98mIT\u001b[0m \u001b[0;30;48;2;141;160;203mAL\u001b[0m \u001b[0;30;48;2;231;138;195mIZ\u001b[0m \u001b[0;30;48;2;166;216;84mATION\u001b[0m \u001b[0;30;48;2;255;217;47m\n", "\u001b[0m \u001b[0;30;48;2;102;194;165m�\u001b[0m \u001b[0;30;48;2;252;141;98m�\u001b[0m \u001b[0;30;48;2;141;160;203m�\u001b[0m \u001b[0;30;48;2;231;138;195m�\u001b[0m \u001b[0;30;48;2;166;216;84m\u001b[0m \u001b[0;30;48;2;255;217;47m�\u001b[0m \u001b[0;30;48;2;102;194;165m�\u001b[0m \u001b[0;30;48;2;252;141;98m�\u001b[0m \u001b[0;30;48;2;141;160;203m\n", "\u001b[0m \u001b[0;30;48;2;231;138;195mshow\u001b[0m 
\u001b[0;30;48;2;166;216;84m_\u001b[0m \u001b[0;30;48;2;255;217;47mto\u001b[0m \u001b[0;30;48;2;102;194;165mkens\u001b[0m \u001b[0;30;48;2;252;141;98mFalse\u001b[0m \u001b[0;30;48;2;141;160;203mNone\u001b[0m \u001b[0;30;48;2;231;138;195melif\u001b[0m \u001b[0;30;48;2;166;216;84m==\u001b[0m \u001b[0;30;48;2;255;217;47m>=\u001b[0m \u001b[0;30;48;2;102;194;165melse\u001b[0m \u001b[0;30;48;2;252;141;98m:\u001b[0m \u001b[0;30;48;2;141;160;203mtwo\u001b[0m \u001b[0;30;48;2;231;138;195mtabs\u001b[0m \u001b[0;30;48;2;166;216;84m:\"\u001b[0m \u001b[0;30;48;2;255;217;47m \u001b[0m \u001b[0;30;48;2;102;194;165m\"\u001b[0m \u001b[0;30;48;2;252;141;98mThree\u001b[0m \u001b[0;30;48;2;141;160;203mtabs\u001b[0m \u001b[0;30;48;2;231;138;195m:\u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47m \u001b[0m \u001b[0;30;48;2;102;194;165m\"\u001b[0m \u001b[0;30;48;2;252;141;98m\n", "\u001b[0m \u001b[0;30;48;2;141;160;203m1\u001b[0m \u001b[0;30;48;2;231;138;195m2\u001b[0m \u001b[0;30;48;2;166;216;84m.\u001b[0m \u001b[0;30;48;2;255;217;47m0\u001b[0m \u001b[0;30;48;2;102;194;165m*\u001b[0m \u001b[0;30;48;2;252;141;98m5\u001b[0m \u001b[0;30;48;2;141;160;203m0\u001b[0m \u001b[0;30;48;2;231;138;195m=\u001b[0m \u001b[0;30;48;2;166;216;84m6\u001b[0m \u001b[0;30;48;2;255;217;47m0\u001b[0m \u001b[0;30;48;2;102;194;165m0\u001b[0m \u001b[0;30;48;2;252;141;98m\n", "\u001b[0m " ] } ], "source": [ "show_tokens(text, \"microsoft/Phi-3-mini-4k-instruct\")" ] }, { "cell_type": "markdown", "metadata": { "id": "9Tu7OY4HvBEm" }, "source": [ "# Contextualized Word Embeddings From a Language Model (Like BERT)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 265, "referenced_widgets": [ "761c3e6c7f26453bba2f463f39f3ae73", "a5c45eeafcf4456bbb0fe1bb43ef2497", "1f35082ee7ec425ea801321112d48db1", "618c1c5ae3ea4650a412a3e009d9fb49", "a158991ec46e4587aac2111e02153f4c", "51af2c28e34245ed83f02b424a6a640a", 
"56a95848a2814195b334c31f0a961cd8", "c1e09c1869f7410ba328e6efe56a0460", "17777d8eae4742aaa78d7854a52e102f", "5916af8bb7ed4efcb05c8f1cdf826149", "b1778299eac04302ac136872dfb0a359", "8ee98d9d017542609881a1e16a8f393f", "c4a8f08da3f64fb287f7a940c4a9408f", "d66ee44b7f5b4c6eac1536235a627441", "c44feccdd01949c1af40fa69f6e03dd0", "84682eb9cfff444da7633f4ca9360f77", "332810f458df4e23bc034852706bcc6f", "4999e7d8e2384bf4adfbf1777587a65f", "a8d4f19ff4554165a5e78d3783928fe1", "29d9e7a799ea402fb5a28b2817156838", "222d489bf1664763babf0e377e45f4d8", "2ae57535c39541fe98d6a8ae22bcd7d4", "fd134a05028c447a994166eccc557806", "69101d935ae841e59aa7f30e40789496", "3ef63f93e424409192edd1b1364aba48", "1d204aedfeb14df5ae7e27eb88a87018", "e74b57abfaa2487aa8e369103be5d00d", "94609e349c8b43b5b74d2c059623f9f0", "febce6a7c96e4a42a9c6faa0bf1763c0", "18a4603fa6a040c0acd792243510562a", "c04ce575bf624bd1a80113d1eff1ae94", "6a9de4aaed054608b800820831aec87f", "46cd179c1c09474a80dc4cea39b759d5", "20331d1d457143719fe732325e79877e", "2a5055c8fc03457eb29390312c555ec5", "6de0874e33c146bba06131aa452c403f", "0f1d2e4c312d4ab38e359f96c9a760b7", "f946c53a81f34d64b25e331fc4b4c7a1", "2dc553ef192c4002b818de6736367fe0", "665b4085199a4ec5890dba773f07d4b7", "de74a5af1ef9462e82e5622232346a79", "34df4fe808174f80a89a765f6ce2f28f", "56c082332310434fa5ec791b728fe82e", "9ef54c7a15d1400f91598b367ac6552e", "ed19532a8eda4925a4014a3f11517ff9", "3859f0311ade4d278190924f0107ba0e", "6ea963b64fd642b29021110d77827021", "e3efc5b43bf2417faaea2e44a306fa75", "4975826de21547fca434e1fb492a216a", "19b547738c8e45bda7879dc527229019", "6469ef3a7ba6465c911a22060d54ff95", "657b6ce5f0804bc78d64f9b7b6a27777", "301363b755a24bc9bf413fc3b0ffd8b2", "50fc50f975294a23a1d5bccf64efc872", "3c722f92f6c2479a91cf957d1d18fbee", "7570614d79184ab2b44700df2342b294", "543ae93b2c5f4339979b8c64eff33f62", "72b54fe7c88540488adee31c15595c89", "3fbdaadec9f545b99057f7c55c6a6df1", "7656d1b978554f4383160b93a80ee7c6", "4d55d194e62942df8f1c32b9bb244e9b", 
"403f03c3a2434fd2ac6f143c1972c62e", "cc68d46e46e7487484d366f77ea863ca", "3dee319ce0aa4ce589ce84033bba8d9d", "a280161408504bb894ab78526f67750b", "72d18d21ca2a45ac868d390ead3ac086" ] }, "executionInfo": { "elapsed": 5049, "status": "ok", "timestamp": 1719641476949, "user": { "displayName": "Maarten Grootendorst", "userId": "11015108362723620659" }, "user_tz": -120 }, "id": "nsjz-VsYu9bB", "outputId": "03ea124b-c6de-449d-ea6f-f5e5b84c2c97" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "761c3e6c7f26453bba2f463f39f3ae73", "version_major": 2, "version_minor": 0 }, "text/plain": [ "tokenizer_config.json: 0%| | 0.00/52.0 [00:00)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "output" ] }, { "cell_type": "markdown", "metadata": { "id": "DdEDuLWa0r4L" }, "source": [ "# Text Embeddings (For Sentences and Whole Documents)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 425, "referenced_widgets": [ "a50156f59d8548b683982af06d5bda09", "7cbfb80418f14068bb4327a3823140fb", "ef1e2b9a1e694eaa8d5b371caa66277e", "ebf8ef569e374c17aa15472ee6ea98d8", "75f5174b66c14f5480960c79a44283ad", "05504faf760d43ea9082ff5a91ff82f7", "50c6cf2be63b4be29524188d79e8cd53", "9088f66d46f44030af46aee55005a939", "7af424dbfd864b8ba6f7661f2204302c", "44aae8dd70da4be98fd86a970373753b", "49dd6490203640d1821196fe28a08732", "c40e76c5e9bd42bea7903e811bce53a4", "8f7d3eea82614e4ca93482ad7cb637a6", "05eb94a4253544648654f24268e8b6da", "01b9535a72c64c95b482640b2bd3c5fa", "b954ab8edd66487cb9a5502879f0e1c5", "e6e103b2c53a4b5ca71f988b7804aeaf", "77f25a7ee7f843b0a24a1d360a1df211", "a0e30058096941dcaab8f24235ee1c49", "b26ac1747c884e3a913d90b8c05a991b", "e5db8326887a48ae97341cf216d35893", "c76f0b056c1f40f3ac37df24c449e9ec", "c4c7962b94674cd6898980cea6483595", "7fc0e3e3c75e47f88410a2774695081e", "6c96c43dd36a4b08ad94a9cad642b1ab", 
"73b403ec39fa4a16883ad08a50c7204b", "6feee2b015ef4fa6b94c905ec91f5146", "ab68f3b41d954ec992b9339a91def6a8", "fb9054b0d27c4a8a9b8241f9d5910e51", "78b6d1098f754eb491a2a729e00d2335", "e84f055fc0c04c4983b162d1d8c67147", "13455551484542ea93d4dbbd937288d5", "99db2aef29ba4717866072f20f1acf61", "10f9441d843c44e5b11cfb2e21b5d89e", "5c987bfb44d14b1ea822c99cb7dde071", "52ac1a973f7140e8b49dbf58dc0c8b21", "37117814cae9440c9c54f63def546c4b", "e019010a35ea4f3a9b9236a626b34760", "ba03fe6450b742c99e8b8836f585232c", "47e9e2ed70c84023ae3a1dbf0cb27328", "99e53690e1c940cea12f581b625c3b3d", "f53653a9050042649df9d91114ba39bb", "737c25a604ee4bcd982487f79450d3ea", "cea92b3b96f24087b494487bf3f4c0f9", "784748de51254ba18128af8df30b8a93", "11890f81eceb41f7be6a2d52c9a9e55e", "f7adb025a07a4aca8b7a3a174304666f", "7f7a5bfc6073495da65dcfd4b2d49309", "fa6483d07ae54fb2904fc117aa9a3d5b", "2cbcf7d0b1384b8cb320e1b30c124d71", "94ea21e19acf4eb7a9010510b226db86", "f399e1a91e7d4b7eaddfb910bd81d750", "f26297a84f224c4da8afd4316c8e7477", "e4bef8778ddc46e5a8d0756eb27c3e7c", "66943967c327428a9796fcd38c36b24d", "f0793746dff34da8858758fb55284b97", "1bc7b31eddc54588979f3fff14a0e12e", "75f54f7a8e5b4965b6f0ad28e5f3bf26", "3cb9178f0568448fa839d5cccc7973d7", "40bde45ac20f48ab93c4fd9e8284eac8", "09b701e83e3844fab97ff237b06a1238", "c1bbe572b8324ea48d42c40a5128bb8e", "a4bd81ed4d9d498a983c13ea79265819", "d8d65b5ac8914792b82460cf0bae980d", "136cd465bac246f2ac2454eea2f0484d", "59030392bbde468aae6c62aecddd499e", "ca7eb54b296c4a1fa91678c0e3d65f5d", "35521a33c1324a928fd2c9f7fce2ce69", "80c82a58f0924a578bcae9d3c6537c11", "589d24a92b3d4e5c99460cd609c8a230", "0edd8bfe5bba47d59c5b195841cc4228", "2aed08160bef4a7189821c560c01d6bb", "01ca9f66804048a9a75475ab9c49a24e", "880c7bdcf3174a78892aa7d0cd11dca7", "1d4f85ce80d841eab27b411b3e61e9be", "9960dd3bff70458abc148f1a153175ec", "0bdc012004ad445fa527fafee0ab55c2", "2895fad80e754b4e8158c6dd8db69058", "4a2c467901414bf0afc5b310aa959dae", "a465317d0f3b4106bbf8fd6c7a3caf6a", 
"7c93b5df16a64e1981e350459b05852d", "66d5c0087bd141b3baa502e3aa8bd408", "d26ec2f29c154da1ac7cb49cb9729113", "7385eea9ed2a438c8dae350fe2328162", "c861634df7524e7cbd7fdca030a0b663", "9904b80c37c44ed2a6b3c21786016e26", "df424fcea3e84c8084dbf8e146d1231f", "a1e116cf62d74d4e8e33c99379e924ed", "652ddbc085994a36b553ea04359943a1", "b340ba4de77043dcbedbcbdf6033d0c7", "10c89678b42b4cf0b8e95b74ad346fba", "fb22220311094fe2b3d245ff080ca4d7", "fc1358383bba4e5ebe2edfca57473002", "342adaf9c12548a4af25f5361ca869ca", "374287b3d21a427fabd82dc1e0710d62", "427e31cb6fcc4687937d807116e5e581", "8195fa9ad4e3487c90cbc0860361a336", "82bf559e6997425aba4245c44531f762", "6bc8887963d744a7a6c15930844f5513", "4688efe8ab954510b30df18f8daa74a5", "27c697f872b64fb5a43deb255e27fb50", "08f2dd1dd0a742eb8c9e11a41dbe69a3", "94b1fee85a034b49aba0c50fcdfc0fdb", "c21e510111f5425bad28afd9b723f9d1", "fea69d561fb94e8c98af4526f8d4b33e", "0cec3fa3672b4bbaa20e3d4aae6fd575", "2b09f8d112ea43a5b60c010f2bf0bbdb", "60ea33df89a449ad9e0f90a0bca672ad", "48230299285241a28e2c8e03db6fce4d", "3229d1aa3fe74299a1d4e3f917cc2ca6", "1f5849878621437397efe2de7b7a43fe", "2a554904b31c4ed8a4269d2c26ea4e91", "a29c289ad83d4c718d5854e5d3eff48a", "07584d1f220f4fb6b82a42090f3818f7", "04e248de88ac4a50ad20272d31549304", "57bb0b0d061d4e778493adb47482f234", "2a24003cc0d54e568706cc6fc77d2831", "849583682f034d3d8b8887cbafb3daaf", "8ab0464a810e4b208d4c0fc481c58b54", "cab50f86e61d43df90643acaf98670ab", "d4180478ac134757bcfe6c4f0ff4990e" ] }, "executionInfo": { "elapsed": 7006, "status": "ok", "timestamp": 1719641491724, "user": { "displayName": "Maarten Grootendorst", "userId": "11015108362723620659" }, "user_tz": -120 }, "id": "TQHWioIc0pQ8", "outputId": "87112ec7-bee0-4894-d850-8dd5e0f4e38c" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a50156f59d8548b683982af06d5bda09", "version_major": 2, "version_minor": 0 }, "text/plain": [ "modules.json: 0%| | 0.00/349 [00:00 1]\n", "\n", "# Load song metadata\n", 
"songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')\n", "songs_file = songs_file.read().decode(\"utf-8\").split('\\n')\n", "songs = [s.rstrip().split('\\t') for s in songs_file]\n", "songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])\n", "songs_df = songs_df.set_index('id')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1724598630488, "user": { "displayName": "Jay Alammar جهاد العمار", "userId": "14617748739431919458" }, "user_tz": 240 }, "id": "Q3zirG-lo3H8", "outputId": "e3b4269e-dd42-428e-8b28-46c27d0231af" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Playlist #1:\n", " ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43'] \n", "\n", "Playlist #2:\n", " ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117', '118', '119', '120', '121', '122', '123', '50', '70', '71', '124', '17', '85', '14', '82', '48', '125', '47', '46', '72', '53', '25', '73', '4', '126', '59', '74', '20', '43', 
'127', '128', '129', '13', '82', '48', '130', '131', '132', '133', '134', '135', '136', '137', '59', '46', '138', '43', '20', '139', '140', '73', '57', '70', '141', '3', '1', '74', '142', '143', '144', '145', '48', '13', '25', '146', '50', '147', '126', '59', '20', '148', '149', '150', '151', '152', '56', '153', '154', '155', '156', '157', '158', '159', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169', '170', '171', '172', '173', '174', '175', '60', '176', '51', '177', '178', '179', '180', '181', '182', '183', '184', '185', '57', '186', '187', '188', '189', '190', '191', '46', '192', '193', '194', '195', '196', '197', '198', '25', '199', '200', '49', '201', '100', '202', '203', '204', '205', '206', '207', '32', '208', '209', '210']\n" ] } ], "source": [ "print( 'Playlist #1:\\n ', playlists[0], '\\n')\n", "print( 'Playlist #2:\\n ', playlists[1])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EaUz3E0P7sJs" }, "outputs": [], "source": [ "from gensim.models import Word2Vec\n", "\n", "# Train our Word2Vec model\n", "model = Word2Vec(\n", " playlists, vector_size=32, window=20, negative=50, min_count=1, workers=4\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 314, "status": "ok", "timestamp": 1719642095066, "user": { "displayName": "Maarten Grootendorst", "userId": "11015108362723620659" }, "user_tz": -120 }, "id": "9EFGWesO8rOJ", "outputId": "1e46ce56-7b14-4268-a38a-c328e0f52943" }, "outputs": [ { "data": { "text/plain": [ "[('2849', 0.9979680776596069),\n", " ('2640', 0.9964019060134888),\n", " ('3167', 0.9963980317115784),\n", " ('5549', 0.9959008693695068),\n", " ('2715', 0.9958351850509644),\n", " ('3117', 0.9954560995101929),\n", " ('2987', 0.9953479766845703),\n", " ('2881', 0.9951083660125732),\n", " ('2886', 0.9950577616691589),\n", " ('3094', 0.994985044002533)]" ] }, "execution_count": 23, "metadata": 
{}, "output_type": "execute_result" } ], "source": [ "song_id = 2172\n", "\n", "# Ask the model for songs similar to song #2172\n", "model.wv.most_similar(positive=str(song_id))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 321, "status": "ok", "timestamp": 1719642762615, "user": { "displayName": "Maarten Grootendorst", "userId": "11015108362723620659" }, "user_tz": -120 }, "id": "AMiY6isXqKk4", "outputId": "0f465f20-ada8-4fa8-92d6-f72966d03aa4" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "title Fade To Black\n", "artist Metallica\n", "Name: 2172 , dtype: object\n" ] } ], "source": [ "print(songs_df.iloc[2172])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 237 }, "executionInfo": { "elapsed": 556, "status": "ok", "timestamp": 1719642918281, "user": { "displayName": "Maarten Grootendorst", "userId": "11015108362723620659" }, "user_tz": -120 }, "id": "aOzWENxr2Fl3", "outputId": "0b1ac29a-14f7-4e30-e153-e8f35ca97d7e" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "summary": "{\n \"name\": \"print_recommendations(2172)\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"id\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"2640 \",\n \"2715 \",\n \"3167 \"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Red Barchetta\",\n \"Rainbow In The Dark\",\n \"Unchained\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"artist\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Rush\",\n \"Dio\",\n \"Van Halen\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n 
}\n }\n ]\n}", "type": "dataframe" }, "text/html": [ "\n", "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>title</th>\n",
"      <th>artist</th>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>id</th>\n",
"      <th></th>\n",
"      <th></th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>2849</th>\n",
"      <td>Run To The Hills</td>\n",
"      <td>Iron Maiden</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2640</th>\n",
"      <td>Red Barchetta</td>\n",
"      <td>Rush</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3167</th>\n",
"      <td>Unchained</td>\n",
"      <td>Van Halen</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>5549</th>\n",
"      <td>November Rain</td>\n",
"      <td>Guns N' Roses</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2715</th>\n",
"      <td>Rainbow In The Dark</td>\n",
"      <td>Dio</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
\n" ], "text/plain": [ " title artist\n", "id \n", "2849 Run To The Hills Iron Maiden\n", "2640 Red Barchetta Rush\n", "3167 Unchained Van Halen\n", "5549 November Rain Guns N' Roses\n", "2715 Rainbow In The Dark Dio" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "def print_recommendations(song_id):\n", "    # most_similar returns (song_id, similarity) pairs; keep only the IDs\n", "    similar_songs = np.array(\n", "        model.wv.most_similar(positive=str(song_id), topn=5)\n", "    )[:, 0].astype(int)  # IDs come back as strings; .iloc needs integer positions\n", "    return songs_df.iloc[similar_songs]\n", "\n", "# Extract recommendations\n", "print_recommendations(2172)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 310 }, "executionInfo": { "elapsed": 681, "status": "ok", "timestamp": 1719642181255, "user": { "displayName": "Maarten Grootendorst", "userId": "11015108362723620659" }, "user_tz": -120 }, "id": "xqrzQQ-m1EJ5", "outputId": "3cf4967d-f510-4772-cb11-4166d16c6956" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "title Fade To Black\n", "artist Metallica\n", "Name: 2172 , dtype: object\n", "['2849' '2640' '3167' '5549' '2715']\n" ] }, { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "summary": "{\n \"name\": \"print_recommendations(2172)\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"id\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"2640 \",\n \"2715 \",\n \"3167 \"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Red Barchetta\",\n \"Rainbow In The Dark\",\n \"Unchained\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"artist\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Rush\",\n \"Dio\",\n \"Van Halen\"\n ],\n \"semantic_type\": \"\",\n 
\"description\": \"\"\n }\n }\n ]\n}", "type": "dataframe" }, "text/html": [ "\n", "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>title</th>\n",
"      <th>artist</th>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>id</th>\n",
"      <th></th>\n",
"      <th></th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>2849</th>\n",
"      <td>Run To The Hills</td>\n",
"      <td>Iron Maiden</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2640</th>\n",
"      <td>Red Barchetta</td>\n",
"      <td>Rush</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3167</th>\n",
"      <td>Unchained</td>\n",
"      <td>Van Halen</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>5549</th>\n",
"      <td>November Rain</td>\n",
"      <td>Guns N' Roses</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2715</th>\n",
"      <td>Rainbow In The Dark</td>\n",
"      <td>Dio</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
\n" ], "text/plain": [ " title artist\n", "id \n", "2849 Run To The Hills Iron Maiden\n", "2640 Red Barchetta Rush\n", "3167 Unchained Van Halen\n", "5549 November Rain Guns N' Roses\n", "2715 Rainbow In The Dark Dio" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print_recommendations(2172)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 310 }, "executionInfo": { "elapsed": 316, "status": "ok", "timestamp": 1719642205517, "user": { "displayName": "Maarten Grootendorst", "userId": "11015108362723620659" }, "user_tz": -120 }, "id": "TIHiN62g1NMi", "outputId": "c548f528-6e2e-4a46-89e0-6599395d6419" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "title California Love (w\\/ Dr. Dre & Roger Troutman)\n", "artist 2Pac\n", "Name: 842 , dtype: object\n", "['5668' '413' '5661' '330' '886']\n" ] }, { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "summary": "{\n \"name\": \"print_recommendations(842)\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"id\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"413 \",\n \"886 \",\n \"5661 \"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"If I Ruled The World (Imagine That) (w\\\\/ Lauryn Hill)\",\n \"Heartless\",\n \"Sweet Dreams\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"artist\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"Nas\",\n \"Kanye West\",\n \"The Game\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", "type": "dataframe" }, "text/html": [ "\n", "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>title</th>\n",
"      <th>artist</th>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>id</th>\n",
"      <th></th>\n",
"      <th></th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>5668</th>\n",
"      <td>How We Do (w\\/ 50 Cent)</td>\n",
"      <td>The Game</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>413</th>\n",
"      <td>If I Ruled The World (Imagine That) (w\\/ Laury...</td>\n",
"      <td>Nas</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>5661</th>\n",
"      <td>Sweet Dreams</td>\n",
"      <td>Beyonce</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>330</th>\n",
"      <td>Hate It Or Love It (w\\/ 50 Cent)</td>\n",
"      <td>The Game</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>886</th>\n",
"      <td>Heartless</td>\n",
"      <td>Kanye West</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
\n" ], "text/plain": [ " title artist\n", "id \n", "5668 How We Do (w\\/ 50 Cent) The Game\n", "413 If I Ruled The World (Imagine That) (w\\/ Laury... Nas\n", "5661 Sweet Dreams Beyonce\n", "330 Hate It Or Love It (w\\/ 50 Cent) The Game\n", "886 Heartless Kanye West" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print_recommendations(842)" ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 4 }