{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "g_a9QvUFVCUR"
|
||
},
|
||
"source": [
|
||
"<h1>Chapter 2 - Tokens and Token Embeddings</h1>\n",
|
||
"<i>Exploring tokens and embeddings as an integral part of building LLMs</i>\n",
|
||
"\n",
|
||
"\n",
|
||
"<a href=\"https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961\"><img src=\"https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon\"></a>\n",
|
||
"<a href=\"https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/\"><img src=\"https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K\"></a>\n",
|
||
"<a href=\"https://github.com/HandsOnLLM/Hands-On-Large-Language-Models\"><img src=\"https://img.shields.io/badge/GitHub%20Repository-black?logo=github\"></a>\n",
"[Open in Colab](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter02/Chapter%202%20-%20Tokens%20and%20Token%20Embeddings.ipynb)\n",
|
||
"\n",
|
||
"---\n",
|
||
"\n",
|
||
"This notebook is for Chapter 2 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).\n",
|
||
"\n",
|
||
"---\n",
|
||
"\n",
|
||
"<a href=\"https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961\">\n",
|
||
"<img src=\"https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png\" width=\"350\"/></a>\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### [OPTIONAL] - Installing Packages on <img src=\"https://colab.google/static/images/icons/colab.png\" width=100>\n",
|
||
"\n",
"If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **uncomment and run** the following code block to install the dependencies for this chapter:\n",
|
||
"\n",
|
||
"---\n",
|
||
"\n",
|
||
"💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to\n",
|
||
"**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.\n",
|
||
"\n",
|
||
"---"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# %%capture\n",
"# !pip install \"transformers>=4.41.2\" \"sentence-transformers>=3.0.1\" \"gensim>=4.3.2\" \"scikit-learn>=1.5.0\" \"accelerate>=0.31.0\""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "oQHfpqT_t9-K"
|
||
},
|
||
"source": [
|
||
"# Downloading and Running An LLM\n",
|
||
"\n",
"The first step is to load our model onto the GPU for faster inference. We load the model and the tokenizer separately, and keep them as separate objects, so that we can explore each one on its own."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 753,
|
||
"referenced_widgets": [
|
||
"851b6e59cc2e4eb8961cb5fa4906c47c",
|
||
"fd5b6ec0a82c493a92a2235635ea5ac0",
|
||
"2bc5005713ba4b61a392935e1d83994c",
|
||
"607bb5b3f27d463dbdb5ad04583a1c4f",
|
||
"0d0c6c47fbb34090a73ed8bf20597ee5",
|
||
"083149fa68934cce90cd03b6c8f90dd4",
|
||
"73263751695843ada347c6c694afd0cd",
|
||
"79c1152a61dd4a87a6a4e700096d92b4",
|
||
"27fb45533f134c308cc0c27797c127f3",
|
||
"78eb9f6c02e841ad9ecf8ed809c724bb",
|
||
"1ade569c5cec4764af86e7eec6bc64ed",
|
||
"dbf8268e613e4e1498133a42a48a58f8",
|
||
"66c3d79c9411447eabae228b5d125c09",
|
||
"46ba964cc4c84bbb87eeccc817a3354c",
|
||
"150625fb48b740f6bce66fd4919357a9",
|
||
"974be84551a54deb82051b947ff013af",
|
||
"22e2cbe577e6475aa80dd94f4b704f1b",
|
||
"83daaa4a1a5d437e8345d4740c3625cd",
|
||
"6e1c0e41b31b4b0b83aa3736e386aa49",
|
||
"bf1d39ed8ee84a6b95e68a11962032b4",
|
||
"7476eb73c9544435b07256f028168a11",
|
||
"f33bc31f0cb842388057b33dd107f2e7",
|
||
"9b9f8a8f0eb14dd9a2c82461f22c4636",
|
||
"214eb5ac9a5048c586b3418452017991",
|
||
"dd1b7306ad4641799d4947a94aa0f088",
|
||
"d58e84330cff499c87e03827d9727d19",
|
||
"2809ac7769af482a86d5e763f5f69211",
|
||
"746054d4ac0b442e85b75959b66ceb34",
|
||
"9b77017c3e764c97b44c56fb5ad3bc77",
|
||
"3dc073c6257143748849311e1c08bdbe",
|
||
"3117fa7446394cb6b6b56427be0a3290",
|
||
"5c3c4612f65f40808f8ef1d0e31f9836",
|
||
"f66983763d454728b86d881621948b8d",
|
||
"a855727c923648ea8d6c7b9ec117c94f",
|
||
"6c9350120fec4f4e9b80e063279b459c",
|
||
"28283e52317e437c8e118d052d93419c",
|
||
"5556b8b2f22548109153146ef66702df",
|
||
"dc1457b4d1d344c9863fbb6b0d38e2cf",
|
||
"029a23c145604f209b12044a6b367802",
|
||
"252ccf08bd52433a888d2bac9ed1b64c",
|
||
"893af8dae5874b55a582f180652e36e3",
|
||
"27e14bb1e4aa467c97ac0c6a6eb44108",
|
||
"87812873ff2d4514aca02c64a3509bd9",
|
||
"13ecb25d0d6541b1a4f2b7dea92dee61",
|
||
"2107a7e7d40f416ca4bc46c90725e0ca",
|
||
"1cfc631c1d7d4fb087753e9e34fd1aa2",
|
||
"238f094ef6744b31ba4444d47c17558c",
|
||
"b11bf972b98645e79f98cdec2c1440f3",
|
||
"baa7c53a02cd42688d768a4595611f64",
|
||
"6a63559431934a4aae4ce79e8b2e76c0",
|
||
"81652b39152145748580c9667b2554de",
|
||
"4dc8d9b1a2d248f9a30bc8e985db4568",
|
||
"adfcbd50473c4ea9bb414e33926cc33b",
|
||
"52a9a5aed89a4294882c8c55b2078b84",
|
||
"83fb238d4eba47a390233ecb5e870ed4",
|
||
"8da9bab6ee214504a187e5b3c9bf1b80",
|
||
"77922a918c1b4bb08f5304a34044e9ba",
|
||
"ac36ee2d4df04399b11f57bb56930a52",
|
||
"df2f73dc04d64bf6adb4df0eae207c27",
|
||
"3cbc759c16b24d63971ba5ef53ff9ba3",
|
||
"ea5d84bf35754e71a628755c0cff49be",
|
||
"3ed30ef053324b3f9a5d3258d2a88a86",
|
||
"f8bae10ac5b54f1b99f7b595f123be4d",
|
||
"dfb5b5a357ca41f88efa53288ebeb5b3",
|
||
"b917827c07344e018543fcbb9c7a6fe8",
|
||
"bad199037f6548749f4dfe12724e3e62",
|
||
"0e30dc616d9c4c50adb9cd292fc3d89e",
|
||
"570a9d836e48456d8c301184d670c50e",
|
||
"471cdd69adc24003b14680e8dbe7b183",
|
||
"c11552c4f423434aa2c41dfa22b8f227",
|
||
"687c61c78b044690a91e34affe570613",
|
||
"82dcc51143ac46349c13d2f5bee380da",
|
||
"a9c8a9b099aa416d9015ee48006db324",
|
||
"1fcc7e4cd65449c49956981ec1b46843",
|
||
"5fc8310895a64815b15af53e3583f60d",
|
||
"ffa0950ea21a4689b5804b0285049ccd",
|
||
"da9ec5fd7acd43b7a11ac860771732df",
|
||
"961d4e8f3a1845e393e25d21de3cd6d3",
|
||
"e7506426cc424dfdb28fa6c35cb74c24",
|
||
"d9b26784fc7c4e6b8abe9490caf53afc",
|
||
"e96d58589fcc4bd7a469c4281baa01ee",
|
||
"dd8fa47d73774a7684cda73e6675c0bd",
|
||
"41843fb3c92c425b9f39cb124b74e3ad",
|
||
"2ab54f26cc504dc09db592039c02c210",
|
||
"cab531e67dd3452f9a4fcd17e324d76a",
|
||
"58de29a54c91401d92cc73a96721a3e9",
|
||
"0eabbb5ac0784a1c89ad58cca37a38fe",
|
||
"362724ce1c5944bd9335b66a14f89844",
|
||
"bd2749d8ba7240408be56e26c796e00a",
|
||
"b54e7ab6e22345bdae3484cab6dab985",
|
||
"77d3a6cc940a4dd7a38c47dfc2b7041b",
|
||
"13071c70f17144b891a1f0fde6d9b21e",
|
||
"205628f7d62d4e54add8ee90b418ebdf",
|
||
"037c74818af74266b2bce5a454b99ab8",
|
||
"bba33c0c581d415bba86aebfc8642196",
|
||
"294ee2f27b4c4ee1ad6ca2a121a9e66d",
|
||
"5bdf1b7758624e379c71da5296929973",
|
||
"6cae5555dce346dc9867edd130c840c9",
|
||
"0e37d9ed7d3142a3b80f188ddee54a4d",
|
||
"d077b2c0e6a142cb905e38a7c7855997",
|
||
"2a47328eace84efbb8fcf6644d4e469d",
|
||
"c1634b24795243f5b8384957ce11b9c0",
|
||
"b8285e7e6db348c385694d2fd63b514e",
|
||
"da9d77e8fa0641dc97f7610aa59374a2",
|
||
"7209a02303ad46f3880799f0b92171f0",
|
||
"770f06c5ff714c87af1d0d7536da12c9",
|
||
"f59c537df5684195b1b5c44754ed2d07",
|
||
"6dd8dd0a99bd4cbbb7401a3fa3893128",
|
||
"f8c4e2a7f11f47fcbd574ceaf828aa66",
|
||
"193e7cde826d43bd894362498a217888",
|
||
"febedf9dd6d248baa984ac215697968d",
|
||
"0bf43b40190141d5bf8e0be3fdb1e529",
|
||
"4070e2510e1549c3879e2458e42bb269",
|
||
"9ba44d309f684c98991889b992767ab3",
|
||
"b47b04d69b33402f80ea75ed472d3c29",
|
||
"19a091b24a5d44b9a98569a7e124c6f8",
|
||
"0d1cea03253c426b84814936c93e5279",
|
||
"7c1b3811b42549569cc83423bbeb2163",
|
||
"55bc2055b3c5426f8ba9afc9f80ccd58",
|
||
"61098d22dd294891afdd34b6208906ba",
|
||
"0a0db951df42424084b531ebbd6cfa98",
|
||
"3f1dc764fb2c48bd9fd8797a86b6c59f",
|
||
"95146eb05e964e32bfa6f5078a66fab7",
|
||
"8c7e09509c524cf29782df69988eef6d",
|
||
"5237587cfb9c4e86a82aaf72cc9db39d",
|
||
"22fd14ac3ccf4e1d8e4782d35e7ec91c",
|
||
"138a3cb6ba5b494c8a237af0c6306a4f",
|
||
"e88a1d50d7f940d793f4dcd76a5f5929",
|
||
"4182011b06304005a6e8922d338c65fb",
|
||
"16ae1f6d9d844d60b55da9e36a5792cd",
|
||
"56ea52f3a1ea485ca52eebaf4f97c3a3",
|
||
"c81c6fe3a8ed4466bf95c53dfe415fba",
|
||
"157b7540b5a3404e82d7e86b19906c7a",
|
||
"48d180eca04341d692d5fb83a8e0f76d",
|
||
"c84b966b98144e22884183cd2a03e2a8",
|
||
"9a6496f9037149289e64e66a0f46cbd3",
|
||
"ecd15644266b43678dbe7b067b553833",
|
||
"43a6de1c960341eda9e154d47e0142c9",
|
||
"6e8fa3e12a8b4262bbca73c57c7f4eb9",
|
||
"86445e2d62b44847a4be7419a6000305",
|
||
"3837338134c240caa0c51501fae402fb",
|
||
"5e2d4a77710e428ebef6bae0854c8527",
|
||
"63d81ace73634bcdaef6f5a103c35dcf",
|
||
"dd1af9490bc24d5ea2804adefd442fdb",
|
||
"cf18ae8dd5bb4d378f017f9eaff967b4",
|
||
"461fde63ac0a41ed8ad9aab2d60769a7",
|
||
"3c5bb3c7c77f4ca393dc05b12fb69dd0",
|
||
"bd707eacc9914dc69a1fe4e35320a9a4",
|
||
"d7cb5720fbbb4ae0858dd3070c712615",
|
||
"bf390e3a7172415089b2cce2223dd8b6",
|
||
"72ae6f3dcf3d4567a4a60f627a51f1f4",
|
||
"026141ba5e404e4e86c1d1c083ab186d",
|
||
"fcd5f49ec07f401fb8a2b403516bbd17",
|
||
"22d378959eb941c5bbfe77702fb426e6"
|
||
]
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 95520,
|
||
"status": "ok",
|
||
"timestamp": 1723034396041,
|
||
"user": {
|
||
"displayName": "Jay Alammar جهاد العمار",
|
||
"userId": "14617748739431919458"
|
||
},
|
||
"user_tz": -60
|
||
},
|
||
"id": "jjU8NBHnwA4j",
|
||
"outputId": "286bdccb-f25d-4b0e-bda3-44d2a4be45cd"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.\n",
|
||
"Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "dd45dd8837f94b38ae6f4ffd205d9ea6",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"from transformers import AutoModelForCausalLM, AutoTokenizer\n",
|
||
"\n",
|
||
"# Load model and tokenizer\n",
"model = AutoModelForCausalLM.from_pretrained(\n",
"    \"microsoft/Phi-3-mini-4k-instruct\",\n",
"    device_map=\"cuda\",\n",
"    torch_dtype=\"auto\",\n",
"    trust_remote_code=False,\n",
")\n",
|
||
"tokenizer = AutoTokenizer.from_pretrained(\"microsoft/Phi-3-mini-4k-instruct\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/"
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 5750,
|
||
"status": "ok",
|
||
"timestamp": 1719641447389,
|
||
"user": {
|
||
"displayName": "Maarten Grootendorst",
|
||
"userId": "11015108362723620659"
|
||
},
|
||
"user_tz": -120
|
||
},
|
||
"id": "_iVl5yePuq3B",
|
||
"outputId": "4ce629bf-3897-4ab0-8cf1-8f55e2040155"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"WARNING:transformers_modules.microsoft.Phi-3-mini-4k-instruct.ff07dc01615f8113924aed013115ab2abd32115b.modeling_phi3:You are not running the flash-attention implementation, expect numerical differences.\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"<s> Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|> Subject: My Sincere Apologies for the Gardening Mishap\n",
|
||
"\n",
|
||
"Dear\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"prompt = \"Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>\"\n",
|
||
"\n",
|
||
"# Tokenize the input prompt\n",
|
||
"input_ids = tokenizer(prompt, return_tensors=\"pt\").input_ids.to(\"cuda\")\n",
|
||
"\n",
|
||
"# Generate the text\n",
"generation_output = model.generate(\n",
"    input_ids=input_ids,\n",
"    max_new_tokens=20\n",
")\n",
|
||
"\n",
|
||
"# Print the output\n",
|
||
"print(tokenizer.decode(generation_output[0]))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/"
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 4,
|
||
"status": "ok",
|
||
"timestamp": 1719641447389,
|
||
"user": {
|
||
"displayName": "Maarten Grootendorst",
|
||
"userId": "11015108362723620659"
|
||
},
|
||
"user_tz": -120
|
||
},
|
||
"id": "JmzgbbdKuvHt",
|
||
"outputId": "82511d5b-7949-49a0-e3a6-c128564575c8"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"tensor([[ 1, 14350, 385, 4876, 27746, 5281, 304, 19235, 363, 278,\n",
|
||
" 25305, 293, 16423, 292, 286, 728, 481, 29889, 12027, 7420,\n",
|
||
" 920, 372, 9559, 29889, 32001]], device='cuda:0')\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(input_ids)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/"
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 3,
|
||
"status": "ok",
|
||
"timestamp": 1719641447389,
|
||
"user": {
|
||
"displayName": "Maarten Grootendorst",
|
||
"userId": "11015108362723620659"
|
||
},
|
||
"user_tz": -120
|
||
},
|
||
"id": "W4vsjbxwu1K1",
|
||
"outputId": "506f32d1-f058-4cfd-a9cd-13c4dabe80e6"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"<s>\n",
|
||
"Write\n",
|
||
"an\n",
|
||
"email\n",
|
||
"apolog\n",
|
||
"izing\n",
|
||
"to\n",
|
||
"Sarah\n",
|
||
"for\n",
|
||
"the\n",
|
||
"trag\n",
|
||
"ic\n",
|
||
"garden\n",
|
||
"ing\n",
|
||
"m\n",
|
||
"ish\n",
|
||
"ap\n",
|
||
".\n",
|
||
"Exp\n",
|
||
"lain\n",
|
||
"how\n",
|
||
"it\n",
|
||
"happened\n",
|
||
".\n",
|
||
"<|assistant|>\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"for token_id in input_ids[0]:\n",
"    print(tokenizer.decode(token_id))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/"
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 3,
|
||
"status": "ok",
|
||
"timestamp": 1719641447389,
|
||
"user": {
|
||
"displayName": "Maarten Grootendorst",
|
||
"userId": "11015108362723620659"
|
||
},
|
||
"user_tz": -120
|
||
},
|
||
"id": "A9wRZ3J3u4z1",
|
||
"outputId": "7efaa49c-7a5a-41d7-f000-7aace16007e5"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"tensor([[ 1, 14350, 385, 4876, 27746, 5281, 304, 19235, 363, 278,\n",
|
||
" 25305, 293, 16423, 292, 286, 728, 481, 29889, 12027, 7420,\n",
|
||
" 920, 372, 9559, 29889, 32001, 3323, 622, 29901, 1619, 317,\n",
|
||
" 3742, 406, 6225, 11763, 363, 278, 19906, 292, 341, 728,\n",
|
||
" 481, 13, 13, 29928, 799]], device='cuda:0')"
|
||
]
|
||
},
|
||
"execution_count": 6,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"generation_output"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/"
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 275,
|
||
"status": "ok",
|
||
"timestamp": 1723034447362,
|
||
"user": {
|
||
"displayName": "Jay Alammar جهاد العمار",
|
||
"userId": "14617748739431919458"
|
||
},
|
||
"user_tz": -60
|
||
},
|
||
"id": "7QlHLof3u8A3",
|
||
"outputId": "c2315e1b-91b4-4a1b-9bcc-084f16ac8db1"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Sub\n",
|
||
"ject\n",
|
||
"Subject\n",
|
||
":\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(tokenizer.decode(3323))\n",
|
||
"print(tokenizer.decode(622))\n",
|
||
"print(tokenizer.decode([3323, 622]))\n",
|
||
"print(tokenizer.decode(29901))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "T9nRducW48bd"
|
||
},
|
||
"source": [
|
||
"# Comparing Trained LLM Tokenizers\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "7W0xFIVo5A0S"
|
||
},
|
||
"outputs": [],
|
||
"source": [
"from transformers import AutoTokenizer\n",
|
||
"\n",
|
||
"colors_list = [\n",
"    '102;194;165', '252;141;98', '141;160;203',\n",
"    '231;138;195', '166;216;84', '255;217;47'\n",
|
||
"]\n",
|
||
"\n",
"def show_tokens(sentence, tokenizer_name):\n",
"    # Load the named tokenizer and convert the sentence to token IDs\n",
"    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)\n",
"    token_ids = tokenizer(sentence).input_ids\n",
"    # Print each decoded token on a colored background, cycling through colors_list\n",
"    for idx, t in enumerate(token_ids):\n",
"        print(\n",
"            f'\\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m' +\n",
"            tokenizer.decode(t) +\n",
"            '\\x1b[0m',\n",
"            end=' '\n",
"        )"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "Gcc3JjwX5DK-"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"text = \"\"\"\n",
|
||
"English and CAPITALIZATION\n",
|
||
"🎵 鸟\n",
|
||
"show_tokens False None elif == >= else: two tabs:\" \" Three tabs: \" \"\n",
|
||
"12.0*50=600\n",
|
||
"\"\"\""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/"
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 354,
|
||
"status": "ok",
|
||
"timestamp": 1725544666773,
|
||
"user": {
|
||
"displayName": "Jay Alammar جهاد العمار",
|
||
"userId": "14617748739431919458"
|
||
},
|
||
"user_tz": 240
|
||
},
|
||
"id": "fCDGSXP75Hv-",
|
||
"outputId": "f2c26835-a857-41db-ff2d-d930d06e512e"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"\u001b[0;30;48;2;102;194;165m[CLS]\u001b[0m \u001b[0;30;48;2;252;141;98menglish\u001b[0m \u001b[0;30;48;2;141;160;203mand\u001b[0m \u001b[0;30;48;2;231;138;195mcapital\u001b[0m \u001b[0;30;48;2;166;216;84m##ization\u001b[0m \u001b[0;30;48;2;255;217;47m[UNK]\u001b[0m \u001b[0;30;48;2;102;194;165m[UNK]\u001b[0m \u001b[0;30;48;2;252;141;98mshow\u001b[0m \u001b[0;30;48;2;141;160;203m_\u001b[0m \u001b[0;30;48;2;231;138;195mtoken\u001b[0m \u001b[0;30;48;2;166;216;84m##s\u001b[0m \u001b[0;30;48;2;255;217;47mfalse\u001b[0m \u001b[0;30;48;2;102;194;165mnone\u001b[0m \u001b[0;30;48;2;252;141;98meli\u001b[0m \u001b[0;30;48;2;141;160;203m##f\u001b[0m \u001b[0;30;48;2;231;138;195m=\u001b[0m \u001b[0;30;48;2;166;216;84m=\u001b[0m \u001b[0;30;48;2;255;217;47m>\u001b[0m \u001b[0;30;48;2;102;194;165m=\u001b[0m \u001b[0;30;48;2;252;141;98melse\u001b[0m \u001b[0;30;48;2;141;160;203m:\u001b[0m \u001b[0;30;48;2;231;138;195mtwo\u001b[0m \u001b[0;30;48;2;166;216;84mtab\u001b[0m \u001b[0;30;48;2;255;217;47m##s\u001b[0m \u001b[0;30;48;2;102;194;165m:\u001b[0m \u001b[0;30;48;2;252;141;98m\"\u001b[0m \u001b[0;30;48;2;141;160;203m/\u001b[0m \u001b[0;30;48;2;231;138;195mt\u001b[0m \u001b[0;30;48;2;166;216;84m/\u001b[0m \u001b[0;30;48;2;255;217;47mt\u001b[0m \u001b[0;30;48;2;102;194;165m\"\u001b[0m \u001b[0;30;48;2;252;141;98mthree\u001b[0m \u001b[0;30;48;2;141;160;203mtab\u001b[0m \u001b[0;30;48;2;231;138;195m##s\u001b[0m \u001b[0;30;48;2;166;216;84m:\u001b[0m \u001b[0;30;48;2;255;217;47m\"\u001b[0m \u001b[0;30;48;2;102;194;165m\"\u001b[0m \u001b[0;30;48;2;252;141;98m12\u001b[0m \u001b[0;30;48;2;141;160;203m.\u001b[0m \u001b[0;30;48;2;231;138;195m0\u001b[0m \u001b[0;30;48;2;166;216;84m*\u001b[0m \u001b[0;30;48;2;255;217;47m50\u001b[0m \u001b[0;30;48;2;102;194;165m=\u001b[0m \u001b[0;30;48;2;252;141;98m600\u001b[0m \u001b[0;30;48;2;141;160;203m[SEP]\u001b[0m "
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"show_tokens(text, \"bert-base-uncased\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 219,
|
||
"referenced_widgets": [
|
||
"76ff072348d0471abaa566d7de6b8e93",
|
||
"5323bca1afe64a2a9bc9c14ac39ad230",
|
||
"bf9fbc21d76a424dab0c8687bd0874c9",
|
||
"f67d099d1e0343d9a822523223d21a75",
|
||
"1dc884bbe68e4bf4ad1e14ca44b5c2fd",
|
||
"b57a5357a7334d47b39f14e07e2d5708",
|
||
"e608cc17958347c19615577b73926f45",
|
||
"1695867ecb3f402ea8e10920adf651ee",
|
||
"9fb081ecc7374f73b32cb203c7ed042d",
|
||
"2fdb8c6afd2647b695f249b4f8122e52",
|
||
"6e12c295923644ab930dcb51de932fd9",
|
||
"d3316f8df2804c2ba34504b196aed6be",
|
||
"893c6036041346ab9486cb7bde06ad0d",
|
||
"42f7a33f1b554c289a434300bcc19f70",
|
||
"d75a253a2ef648e88e35d765be5d4c34",
|
||
"3b5a28c840ee4ba78676b4c3dbcd7af6",
|
||
"e1175aef0e2e4b17afdb6a62f68887a7",
|
||
"8774144d2cbc44f4bb39305e20cd5093",
|
||
"139f7c4f547f4d08a6126897617989d5",
|
||
"61bafb93125042f5bb7cc1195b459d45",
|
||
"8053c516a8df40498085843ff07a2884",
|
||
"81aa45723ec44d18a0e37844b9c70c4e",
|
||
"33541a7c0d664fa2bd104fc9bc91f1bd",
|
||
"580f25a7f04943e6a5100bdf584f8c97",
|
||
"1848a0b868254c848115fc59b2cdd639",
|
||
"47e387c2bcdc43329354304e5358c224",
|
||
"883b562fa1c7409fbf43d2ac90c29955",
|
||
"9798a6e28f56466f9582a40064ad4c4f",
|
||
"4f31fd12d7c04233856206304b2a1bc7",
|
||
"1460bc4aca764def9120218a963ae183",
|
||
"3933809fb01c43bca62dc220cc94f217",
|
||
"e4caf964309345bf82ad65411c1a7f3f",
|
||
"ea6b6957c38d469abbe07bb20c811d2d",
|
||
"eabd5498553646f18eada3542254cb0b",
|
||
"206f377162614c72a74e73557fece973",
|
||
"deabe5ef1f1f473f81403df9d8923846",
|
||
"4096283e990a47a7bd4c00416fa71788",
|
||
"3598675602e34600ae5c719d67778d24",
|
||
"9ca5334888b249b2bd318be5a97fae7e",
|
||
"7cce748a6f264cb69921fc97d4b8c946",
|
||
"449aaf8bc9a64f6483eff88cc4678f6c",
|
||
"e983503abea646a68dfe45652f7e78d1",
|
||
"06b9fe5f3e644ddab47a17a3276e1a67",
|
||
"4128b1818d30497f9a3d6869e02addaa"
|
||
]
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 1520,
|
||
"status": "ok",
|
||
"timestamp": 1719589575187,
|
||
"user": {
|
||
"displayName": "Jay Alammar جهاد العمار",
|
||
"userId": "14617748739431919458"
|
||
},
|
||
"user_tz": 240
|
||
},
|
||
"id": "0Ay_NX3K5HyP",
|
||
"outputId": "4a32ab93-75f2-4b70-a55b-b643283c8270"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "76ff072348d0471abaa566d7de6b8e93",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"tokenizer_config.json: 0%| | 0.00/49.0 [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
|
||
" warnings.warn(\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "d3316f8df2804c2ba34504b196aed6be",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"config.json: 0%| | 0.00/570 [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "33541a7c0d664fa2bd104fc9bc91f1bd",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"vocab.txt: 0%| | 0.00/213k [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "eabd5498553646f18eada3542254cb0b",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"tokenizer.json: 0%| | 0.00/436k [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
"\u001b[0;30;48;2;102;194;165m[CLS]\u001b[0m \u001b[0;30;48;2;252;141;98mEnglish\u001b[0m \u001b[0;30;48;2;141;160;203mand\u001b[0m \u001b[0;30;48;2;231;138;195mCA\u001b[0m \u001b[0;30;48;2;166;216;84m##PI\u001b[0m \u001b[0;30;48;2;255;217;47m##TA\u001b[0m \u001b[0;30;48;2;102;194;165m##L\u001b[0m \u001b[0;30;48;2;252;141;98m##I\u001b[0m \u001b[0;30;48;2;141;160;203m##Z\u001b[0m \u001b[0;30;48;2;231;138;195m##AT\u001b[0m \u001b[0;30;48;2;166;216;84m##ION\u001b[0m \u001b[0;30;48;2;255;217;47m[UNK]\u001b[0m \u001b[0;30;48;2;102;194;165m[UNK]\u001b[0m \u001b[0;30;48;2;252;141;98mshow\u001b[0m \u001b[0;30;48;2;141;160;203m_\u001b[0m \u001b[0;30;48;2;231;138;195mtoken\u001b[0m \u001b[0;30;48;2;166;216;84m##s\u001b[0m \u001b[0;30;48;2;255;217;47mF\u001b[0m \u001b[0;30;48;2;102;194;165m##als\u001b[0m \u001b[0;30;48;2;252;141;98m##e\u001b[0m \u001b[0;30;48;2;141;160;203mNone\u001b[0m \u001b[0;30;48;2;231;138;195mel\u001b[0m \u001b[0;30;48;2;166;216;84m##if\u001b[0m \u001b[0;30;48;2;255;217;47m=\u001b[0m \u001b[0;30;48;2;102;194;165m=\u001b[0m \u001b[0;30;48;2;252;141;98m>\u001b[0m \u001b[0;30;48;2;141;160;203m=\u001b[0m \u001b[0;30;48;2;231;138;195melse\u001b[0m \u001b[0;30;48;2;166;216;84m:\u001b[0m \u001b[0;30;48;2;255;217;47mtwo\u001b[0m \u001b[0;30;48;2;102;194;165mta\u001b[0m \u001b[0;30;48;2;252;141;98m##bs\u001b[0m \u001b[0;30;48;2;141;160;203m:\u001b[0m \u001b[0;30;48;2;231;138;195m\"\u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47mThree\u001b[0m \u001b[0;30;48;2;102;194;165mta\u001b[0m \u001b[0;30;48;2;252;141;98m##bs\u001b[0m \u001b[0;30;48;2;141;160;203m:\u001b[0m \u001b[0;30;48;2;231;138;195m\"\u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47m12\u001b[0m \u001b[0;30;48;2;102;194;165m.\u001b[0m \u001b[0;30;48;2;252;141;98m0\u001b[0m \u001b[0;30;48;2;141;160;203m*\u001b[0m \u001b[0;30;48;2;231;138;195m50\u001b[0m \u001b[0;30;48;2;166;216;84m=\u001b[0m \u001b[0;30;48;2;255;217;47m600\u001b[0m \u001b[0;30;48;2;102;194;165m[SEP]\u001b[0m "
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"show_tokens(text, \"bert-base-cased\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 284,
|
||
"referenced_widgets": [
|
||
"77d1608ca87f4bc1b7a731c896d86db9",
|
||
"6e2c83d85af0419c81df10e85e31d29d",
|
||
"5d3e86f8d3f949aeacabace1f7640d81",
|
||
"6fd2baf1fc1244d38fc41cde100d7b6e",
|
||
"41965c378ba243339352ad3926d48862",
|
||
"873384e3c478450ea4a5a9e061c87133",
|
||
"dcce05965068457fb764ae1c04066d88",
|
||
"41f2087fd27a4c018087063e8e7629d3",
|
||
"905ac86fe3294497b1722e955d63ed4c",
|
||
"131c47105d324eddaea5c241829e878a",
|
||
"22f955573c2845cfba5c314b59d26739",
|
||
"41301a0547754ecfbd6044f6eacb0b8d",
|
||
"eff43edf235a4b92b0e26d5bc21fc909",
|
||
"bbdc2b4c70a0426aa3d2043d0b91b839",
|
||
"405476ddfe634ad793e28474dfe30ecc",
|
||
"8575a84785714069921bbfdc13fb957e",
|
||
"bad64205077a496f96e6d03d927140ba",
|
||
"ca152f8ec99e48b39ee5267269eeaca0",
|
||
"461e3c04697641359924ae0902b13db0",
|
||
"f15e240b2d01488698caa3275e0bacc1",
|
||
"4e2491afa9fc4d65b95e9471af782e4d",
|
||
"ff1f8b630ceb449e910ca34d969fdafd",
|
||
"fa60038dc7c547b8b1d9c54f88fd6b39",
|
||
"e69fe73aa3d44de0bddbe1711269bd8e",
|
||
"5ff700af61664f0eaebe580c8a49a910",
|
||
"e901ed75738f4e41b79651dc012003a6",
|
||
"b115c5c5193f489f87209bb5c6d788f9",
|
||
"3bce3be7198c4917a6cb2183e1344e2c",
|
||
"dedac5ecd5844ee29e346f465074d3fd",
|
||
"4521cc909b2942de889a37dbec1f0277",
|
||
"645a2e3dba2f49fb9ce9c0e2b2a8e73f",
|
||
"198833f7fa2f4ff8ab064b4671461830",
|
||
"66acf835e274473f84ccd486a99e71d2",
|
||
"db2934af14274fe78ffc85f7d03fd1c8",
|
||
"9be5d1e096934134a00974cf8e3fa63c",
|
||
"fc2c07f1eeee43e3aad438206929f5df",
|
||
"08d6a11cebf840748261e0ba6970092b",
|
||
"f8912b0da8aa4f7499ad3f4e5ccca84b",
|
||
"c87ad6d49a054fc8850bebf87c444ee3",
|
||
"17b6d632e333476b99f4315fe737d359",
|
||
"2661e810b7084f93a4dcd454ea7665a0",
|
||
"042557f8b84b422882c651f910a9fce0",
|
||
"8f95639d6dc946f18f14d8a16e73b4a4",
|
||
"89224986b13645d9a9dbedb038e795bb",
|
||
"2fcd1b9c380f422291096e57a6c7f85e",
|
||
"142133a41f664fdf82a1b16d87a68ae5",
|
||
"d1c2b3aac5cc4f3fad1413f8cfdc04e3",
|
||
"bce267a75b9946ae8b0db42dd7f925d1",
|
||
"3b46d5e1b7fe4427909d1c82debd7ba7",
|
||
"3dc1a66c56fb428aad53f5221ed1ae18",
|
||
"23cda173696645b6955515990b6834ec",
|
||
"2141c20003154d8dab2855deb44d3aad",
|
||
"2213d4aa55eb4ef384eaf879552dcd7d",
|
||
"56009a7bf6fc4d7c8300fc9dc4d6ad14",
|
||
"551cdd6ea1d94ba8a6cce7b00798c63b"
|
||
]
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 2010,
|
||
"status": "ok",
|
||
"timestamp": 1719589579935,
|
||
"user": {
|
||
"displayName": "Jay Alammar جهاد العمار",
|
||
"userId": "14617748739431919458"
|
||
},
|
||
"user_tz": 240
|
||
},
|
||
"id": "K_k5QduY5H0u",
|
||
"outputId": "2e844f23-3dee-4078-8d51-4c250d2c2f3e"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "77d1608ca87f4bc1b7a731c896d86db9",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"tokenizer_config.json: 0%| | 0.00/26.0 [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "41301a0547754ecfbd6044f6eacb0b8d",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"config.json: 0%| | 0.00/665 [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "fa60038dc7c547b8b1d9c54f88fd6b39",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"vocab.json: 0%| | 0.00/1.04M [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "db2934af14274fe78ffc85f7d03fd1c8",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"merges.txt: 0%| | 0.00/456k [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "2fcd1b9c380f422291096e57a6c7f85e",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"tokenizer.json: 0%| | 0.00/1.36M [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"\u001b[0;30;48;2;102;194;165m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;252;141;98mEnglish\u001b[0m \u001b[0;30;48;2;141;160;203m and\u001b[0m \u001b[0;30;48;2;231;138;195m CAP\u001b[0m \u001b[0;30;48;2;166;216;84mITAL\u001b[0m \u001b[0;30;48;2;255;217;47mIZ\u001b[0m \u001b[0;30;48;2;102;194;165mATION\u001b[0m \u001b[0;30;48;2;252;141;98m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;141;160;203m<33>\u001b[0m \u001b[0;30;48;2;231;138;195m<35>\u001b[0m \u001b[0;30;48;2;166;216;84m<34>\u001b[0m \u001b[0;30;48;2;255;217;47m <20>\u001b[0m \u001b[0;30;48;2;102;194;165m<35>\u001b[0m \u001b[0;30;48;2;252;141;98m<38>\u001b[0m \u001b[0;30;48;2;141;160;203m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;231;138;195mshow\u001b[0m \u001b[0;30;48;2;166;216;84m_\u001b[0m \u001b[0;30;48;2;255;217;47mt\u001b[0m \u001b[0;30;48;2;102;194;165mok\u001b[0m \u001b[0;30;48;2;252;141;98mens\u001b[0m \u001b[0;30;48;2;141;160;203m False\u001b[0m \u001b[0;30;48;2;231;138;195m None\u001b[0m \u001b[0;30;48;2;166;216;84m el\u001b[0m \u001b[0;30;48;2;255;217;47mif\u001b[0m \u001b[0;30;48;2;102;194;165m ==\u001b[0m \u001b[0;30;48;2;252;141;98m >=\u001b[0m \u001b[0;30;48;2;141;160;203m else\u001b[0m \u001b[0;30;48;2;231;138;195m:\u001b[0m \u001b[0;30;48;2;166;216;84m two\u001b[0m \u001b[0;30;48;2;255;217;47m tabs\u001b[0m \u001b[0;30;48;2;102;194;165m:\"\u001b[0m \u001b[0;30;48;2;252;141;98m \u001b[0m \u001b[0;30;48;2;141;160;203m \u001b[0m \u001b[0;30;48;2;231;138;195m \u001b[0m \u001b[0;30;48;2;166;216;84m \"\u001b[0m \u001b[0;30;48;2;255;217;47m Three\u001b[0m \u001b[0;30;48;2;102;194;165m tabs\u001b[0m \u001b[0;30;48;2;252;141;98m:\u001b[0m \u001b[0;30;48;2;141;160;203m \"\u001b[0m \u001b[0;30;48;2;231;138;195m \u001b[0m \u001b[0;30;48;2;166;216;84m \u001b[0m \u001b[0;30;48;2;255;217;47m \u001b[0m \u001b[0;30;48;2;102;194;165m \u001b[0m \u001b[0;30;48;2;252;141;98m \u001b[0m \u001b[0;30;48;2;141;160;203m \u001b[0m \u001b[0;30;48;2;231;138;195m \"\u001b[0m \u001b[0;30;48;2;166;216;84m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;255;217;47m12\u001b[0m \u001b[0;30;48;2;102;194;165m.\u001b[0m \u001b[0;30;48;2;252;141;98m0\u001b[0m \u001b[0;30;48;2;141;160;203m*\u001b[0m \u001b[0;30;48;2;231;138;195m50\u001b[0m \u001b[0;30;48;2;166;216;84m=\u001b[0m \u001b[0;30;48;2;255;217;47m600\u001b[0m \u001b[0;30;48;2;102;194;165m\n",
|
||
"\u001b[0m "
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"show_tokens(text, \"gpt2\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 183,
|
||
"referenced_widgets": [
|
||
"63772475e9234672994f2a8edf89b192",
|
||
"fa45fdb364444208b2693760809e3c60",
|
||
"5a92b1072a834d69ad33060ae1b0fdf0",
|
||
"b8a3d22de4964f369ecabc31b6cdca57",
|
||
"274983c981654d1eb25d630d8d5e47e3",
|
||
"eebaba8a83ed48f8afe2ca129f3a73c9",
|
||
"2d7a3c68a2e246059a93fea280f4c2c8",
|
||
"22983afeb04d4a06b8b01527d869585e",
|
||
"f7492077d2ba4ccf8df039473057321d",
|
||
"fb255528c7754047aeb29863dd642b19",
|
||
"cc7f7f3ef40042458ac7565b070af032",
|
||
"bc67bb86f76e482ab703f3f403f4cc76",
|
||
"a7605351b83941698348ee84cd99f955",
|
||
"88af753262344cfe9a88b133540bcaf7",
|
||
"8f92316b17a24623b8646747f5ecc7d6",
|
||
"8153c7e3f21f44f09a3222da6312137d",
|
||
"22bee52a95d243868b40b3f8ce5ca7d9",
|
||
"f09f8925fc34454dbb970c31d5d82707",
|
||
"354c2db5dbd34284baf62a7529537b8b",
|
||
"028db37ce29c45939adeed5ae311583c",
|
||
"dd534b3f89c64de9b9aa4f7a95a05f34",
|
||
"30ed305df62c45329f24a2f64499d490",
|
||
"36e3e6b45fca44c8b01a729189b1bdab",
|
||
"4727da03ed724b8dafff24856652fe95",
|
||
"b580d17fa1134425a837a48aba06dbf4",
|
||
"e71172ffd8d74f189cea18e4898c4c2d",
|
||
"ce0031117cf347f48d027cd70e87193b",
|
||
"9ee79c669c6641d887fa284286435f57",
|
||
"9885aaa3e0be4052af4848b92c642cdd",
|
||
"ce6eee6121334479a72e023b252124d2",
|
||
"3adda22d2bca47a886da662621ec9a9d",
|
||
"6a1a34996ae24a14972fd425be48dd7c",
|
||
"2c323975d4454113a863e3ec0b56f4fb",
|
||
"4b9fab6416924d509bbb6361f63797e1",
|
||
"a721b2fba975474c8a3d9384cf998228",
|
||
"1f76260c0f6c46598bee13eb4a0f8b65",
|
||
"fd95f752cc684613b0f6c6db12af874f",
|
||
"3c0a4cbec1bf4c7886a1a9271c1b0832",
|
||
"2489776fa74e4dccb4154368f3861623",
|
||
"c8506a9393604d119ff71264e27b6734",
|
||
"93187864ee6d4d41b49df1c84f35e6e2",
|
||
"d7cbd635afec493ebcc4973f8d98c58b",
|
||
"d09ad050869c47fb9fbc558f8b6d47d7",
|
||
"b4e35a7edeea4b089182f4e9b15dc12e"
|
||
]
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 1618,
|
||
"status": "ok",
|
||
"timestamp": 1719589589160,
|
||
"user": {
|
||
"displayName": "Jay Alammar جهاد العمار",
|
||
"userId": "14617748739431919458"
|
||
},
|
||
"user_tz": 240
|
||
},
|
||
"id": "EJn5nf3c5H2_",
|
||
"outputId": "607c38ff-9425-4371-f5e0-1f8ee9449eee"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "63772475e9234672994f2a8edf89b192",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"tokenizer_config.json: 0%| | 0.00/2.54k [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "bc67bb86f76e482ab703f3f403f4cc76",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"spiece.model: 0%| | 0.00/792k [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "36e3e6b45fca44c8b01a729189b1bdab",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"tokenizer.json: 0%| | 0.00/2.42M [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "4b9fab6416924d509bbb6361f63797e1",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"special_tokens_map.json: 0%| | 0.00/2.20k [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"\u001b[0;30;48;2;102;194;165mEnglish\u001b[0m \u001b[0;30;48;2;252;141;98mand\u001b[0m \u001b[0;30;48;2;141;160;203mCA\u001b[0m \u001b[0;30;48;2;231;138;195mPI\u001b[0m \u001b[0;30;48;2;166;216;84mTAL\u001b[0m \u001b[0;30;48;2;255;217;47mIZ\u001b[0m \u001b[0;30;48;2;102;194;165mATION\u001b[0m \u001b[0;30;48;2;252;141;98m\u001b[0m \u001b[0;30;48;2;141;160;203m<unk>\u001b[0m \u001b[0;30;48;2;231;138;195m\u001b[0m \u001b[0;30;48;2;166;216;84m<unk>\u001b[0m \u001b[0;30;48;2;255;217;47mshow\u001b[0m \u001b[0;30;48;2;102;194;165m_\u001b[0m \u001b[0;30;48;2;252;141;98mto\u001b[0m \u001b[0;30;48;2;141;160;203mken\u001b[0m \u001b[0;30;48;2;231;138;195ms\u001b[0m \u001b[0;30;48;2;166;216;84mFal\u001b[0m \u001b[0;30;48;2;255;217;47ms\u001b[0m \u001b[0;30;48;2;102;194;165me\u001b[0m \u001b[0;30;48;2;252;141;98mNone\u001b[0m \u001b[0;30;48;2;141;160;203m\u001b[0m \u001b[0;30;48;2;231;138;195me\u001b[0m \u001b[0;30;48;2;166;216;84ml\u001b[0m \u001b[0;30;48;2;255;217;47mif\u001b[0m \u001b[0;30;48;2;102;194;165m=\u001b[0m \u001b[0;30;48;2;252;141;98m=\u001b[0m \u001b[0;30;48;2;141;160;203m>\u001b[0m \u001b[0;30;48;2;231;138;195m=\u001b[0m \u001b[0;30;48;2;166;216;84melse\u001b[0m \u001b[0;30;48;2;255;217;47m:\u001b[0m \u001b[0;30;48;2;102;194;165mtwo\u001b[0m \u001b[0;30;48;2;252;141;98mtab\u001b[0m \u001b[0;30;48;2;141;160;203ms\u001b[0m \u001b[0;30;48;2;231;138;195m:\u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47m\"\u001b[0m \u001b[0;30;48;2;102;194;165mThree\u001b[0m \u001b[0;30;48;2;252;141;98mtab\u001b[0m \u001b[0;30;48;2;141;160;203ms\u001b[0m \u001b[0;30;48;2;231;138;195m:\u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47m\"\u001b[0m \u001b[0;30;48;2;102;194;165m12.\u001b[0m \u001b[0;30;48;2;252;141;98m0\u001b[0m \u001b[0;30;48;2;141;160;203m*\u001b[0m \u001b[0;30;48;2;231;138;195m50\u001b[0m \u001b[0;30;48;2;166;216;84m=\u001b[0m \u001b[0;30;48;2;255;217;47m600\u001b[0m \u001b[0;30;48;2;102;194;165m\u001b[0m 
\u001b[0;30;48;2;252;141;98m</s>\u001b[0m "
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"show_tokens(text, \"google/flan-t5-small\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/"
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 714,
|
||
"status": "ok",
|
||
"timestamp": 1723035784494,
|
||
"user": {
|
||
"displayName": "Jay Alammar جهاد العمار",
|
||
"userId": "14617748739431919458"
|
||
},
|
||
"user_tz": -60
|
||
},
|
||
"id": "1ymhAsTg5H5e",
|
||
"outputId": "7827a535-4f33-4620-f4e7-4a2b622a78c2"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"\u001b[0;30;48;2;102;194;165m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;252;141;98mEnglish\u001b[0m \u001b[0;30;48;2;141;160;203m and\u001b[0m \u001b[0;30;48;2;231;138;195m CAPITAL\u001b[0m \u001b[0;30;48;2;166;216;84mIZATION\u001b[0m \u001b[0;30;48;2;255;217;47m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;102;194;165m<35>\u001b[0m \u001b[0;30;48;2;252;141;98m<38>\u001b[0m \u001b[0;30;48;2;141;160;203m<33>\u001b[0m \u001b[0;30;48;2;231;138;195m <20>\u001b[0m \u001b[0;30;48;2;166;216;84m<34>\u001b[0m \u001b[0;30;48;2;255;217;47m<37>\u001b[0m \u001b[0;30;48;2;102;194;165m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;252;141;98mshow\u001b[0m \u001b[0;30;48;2;141;160;203m_tokens\u001b[0m \u001b[0;30;48;2;231;138;195m False\u001b[0m \u001b[0;30;48;2;166;216;84m None\u001b[0m \u001b[0;30;48;2;255;217;47m elif\u001b[0m \u001b[0;30;48;2;102;194;165m ==\u001b[0m \u001b[0;30;48;2;252;141;98m >=\u001b[0m \u001b[0;30;48;2;141;160;203m else\u001b[0m \u001b[0;30;48;2;231;138;195m:\u001b[0m \u001b[0;30;48;2;166;216;84m two\u001b[0m \u001b[0;30;48;2;255;217;47m tabs\u001b[0m \u001b[0;30;48;2;102;194;165m:\"\u001b[0m \u001b[0;30;48;2;252;141;98m \u001b[0m \u001b[0;30;48;2;141;160;203m \"\u001b[0m \u001b[0;30;48;2;231;138;195m Three\u001b[0m \u001b[0;30;48;2;166;216;84m tabs\u001b[0m \u001b[0;30;48;2;255;217;47m:\u001b[0m \u001b[0;30;48;2;102;194;165m \"\u001b[0m \u001b[0;30;48;2;252;141;98m \u001b[0m \u001b[0;30;48;2;141;160;203m \"\n",
|
||
"\u001b[0m \u001b[0;30;48;2;231;138;195m12\u001b[0m \u001b[0;30;48;2;166;216;84m.\u001b[0m \u001b[0;30;48;2;255;217;47m0\u001b[0m \u001b[0;30;48;2;102;194;165m*\u001b[0m \u001b[0;30;48;2;252;141;98m50\u001b[0m \u001b[0;30;48;2;141;160;203m=\u001b[0m \u001b[0;30;48;2;231;138;195m600\u001b[0m \u001b[0;30;48;2;166;216;84m\n",
|
||
"\u001b[0m "
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# The official tokenizer is `tiktoken`, but this is the same tokenizer hosted on the HF platform\n",
|
||
"show_tokens(text, \"Xenova/gpt-4\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 284,
|
||
"referenced_widgets": [
|
||
"770da5f1b8b24972b2018e1cadd3ec8a",
|
||
"e2f255083a1b4d8f9992f27b4d21c676",
|
||
"4fac840948a048d6b1ca7c0dc4f4c5d5",
|
||
"c2ac158eb3f0469ca90f7247c546c70f",
|
||
"4a8cfc4637124995866810ef1b750fe4",
|
||
"409169ca000a484ca4472750cfe63f30",
|
||
"8663795c551e457dafc93d02cf0026c3",
|
||
"5639fe0c03e7451db316579356290e3d",
|
||
"46e73023abe1465382771e9af87f36fc",
|
||
"5feaeba2e88a4a7fb83a85f3200e2639",
|
||
"76a8030af7794356b5c2daa891d789e2",
|
||
"37c46ab78fc64eb98923b24d6a0de37e",
|
||
"3bb4b5235ef74c7d89489f8fa8cded17",
|
||
"a720bb387fca45968c75352398935382",
|
||
"683b85afadb744e4bd7164c51f01d3f9",
|
||
"00dd050102674a1ab3fd8d8f9caec4b0",
|
||
"cf6a7c6ada024f8e9f106428d506b078",
|
||
"33744d7c827e4784a91955159a47e337",
|
||
"f6dce141c94d4c8494f75a7387b65331",
|
||
"48738fb1cf8e4f0fb70b74a5896669cd",
|
||
"1b10141545cb489fa3a58d4939cc4d9b",
|
||
"99b9c874e58c4db9b596b6ca1699e666",
|
||
"07bf43728198472997c8b59b9343adfe",
|
||
"74d33e70d8af43148fd3a618b5d3c5dd",
|
||
"8008a03780b24639abce64498b1d832e",
|
||
"82ad72412e1343b983679e625c85f47d",
|
||
"0d3aa270949048a5886de118b1a3b1f1",
|
||
"cc568e7a8ca84810ab878e601fae557a",
|
||
"cb57c7f3455f4a34b59ae39d0b599b8b",
|
||
"577a22cb6c7549ff96f367bd6f4f8b12",
|
||
"d0d4c92c9a0f4bd29255d8ff47d18c11",
|
||
"e6d9b96a5cb9487d90136d097e716a5a",
|
||
"e878edbee8ea48178b424e56417b7fa5",
|
||
"e227f5f6bb3b4580b0ea4304d34ad556",
|
||
"36863dd97aa04c48831d1fb455557adc",
|
||
"ece59919873646f9bbf41c7547e802a3",
|
||
"7ff1f54520324b3e9462062ecd87ce69",
|
||
"2e8d55afb3fc4e2fa6b0b887a09b7ca9",
|
||
"e527a040f0be4d43830ae6d4335771b0",
|
||
"32f18ab0328146b5aac38b4c7ef8029d",
|
||
"5d5d4e02a6724861aa36d9af5ea70ea7",
|
||
"c8637134a894493093654456f2a9763b",
|
||
"8a1023f076f34f34ab0b091d7f62c172",
|
||
"3537e0361ab5475282b7f34adbfc70dc",
|
||
"12d4df348cd34dc2b3d7ffc41f0561e8",
|
||
"65326149d62e4404ba49a8d2d505adac",
|
||
"b84c1bff2fcc48ed8fee636e1bdb16f9",
|
||
"a8ae6ec72f2744b4991de0a961e1b142",
|
||
"98182c58a56343e482ed44935be4fd31",
|
||
"15719135708047a49a19887989dac12d",
|
||
"a43ccd6d114644ae86c3129786f8105a",
|
||
"90cc27a829e54b16850552e04f718bc3",
|
||
"2be39bb4eacf41e381432b37febfd788",
|
||
"df4dd594480d446da1523bd2d016c1cb",
|
||
"11ea292f268d481396192b158814e6b1"
|
||
]
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 9948,
|
||
"status": "ok",
|
||
"timestamp": 1719590292199,
|
||
"user": {
|
||
"displayName": "Jay Alammar جهاد العمار",
|
||
"userId": "14617748739431919458"
|
||
},
|
||
"user_tz": 240
|
||
},
|
||
"id": "3_vAyeTy5H7_",
|
||
"outputId": "ad3f759f-19b7-4880-cbf8-9ed7cb25d627"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "770da5f1b8b24972b2018e1cadd3ec8a",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"tokenizer_config.json: 0%| | 0.00/7.88k [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "37c46ab78fc64eb98923b24d6a0de37e",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"vocab.json: 0%| | 0.00/777k [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "07bf43728198472997c8b59b9343adfe",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"merges.txt: 0%| | 0.00/442k [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "e227f5f6bb3b4580b0ea4304d34ad556",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"tokenizer.json: 0%| | 0.00/2.06M [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "12d4df348cd34dc2b3d7ffc41f0561e8",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"special_tokens_map.json: 0%| | 0.00/958 [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"\u001b[0;30;48;2;102;194;165m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;252;141;98mEnglish\u001b[0m \u001b[0;30;48;2;141;160;203m and\u001b[0m \u001b[0;30;48;2;231;138;195m CAPITAL\u001b[0m \u001b[0;30;48;2;166;216;84mIZATION\u001b[0m \u001b[0;30;48;2;255;217;47m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;102;194;165m<35>\u001b[0m \u001b[0;30;48;2;252;141;98m<38>\u001b[0m \u001b[0;30;48;2;141;160;203m<33>\u001b[0m \u001b[0;30;48;2;231;138;195m \u001b[0m \u001b[0;30;48;2;166;216;84m<34>\u001b[0m \u001b[0;30;48;2;255;217;47m<37>\u001b[0m \u001b[0;30;48;2;102;194;165m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;252;141;98mshow\u001b[0m \u001b[0;30;48;2;141;160;203m_\u001b[0m \u001b[0;30;48;2;231;138;195mtokens\u001b[0m \u001b[0;30;48;2;166;216;84m False\u001b[0m \u001b[0;30;48;2;255;217;47m None\u001b[0m \u001b[0;30;48;2;102;194;165m elif\u001b[0m \u001b[0;30;48;2;252;141;98m ==\u001b[0m \u001b[0;30;48;2;141;160;203m >=\u001b[0m \u001b[0;30;48;2;231;138;195m else\u001b[0m \u001b[0;30;48;2;166;216;84m:\u001b[0m \u001b[0;30;48;2;255;217;47m two\u001b[0m \u001b[0;30;48;2;102;194;165m tabs\u001b[0m \u001b[0;30;48;2;252;141;98m:\"\u001b[0m \u001b[0;30;48;2;141;160;203m \u001b[0m \u001b[0;30;48;2;231;138;195m \"\u001b[0m \u001b[0;30;48;2;166;216;84m Three\u001b[0m \u001b[0;30;48;2;255;217;47m tabs\u001b[0m \u001b[0;30;48;2;102;194;165m:\u001b[0m \u001b[0;30;48;2;252;141;98m \"\u001b[0m \u001b[0;30;48;2;141;160;203m \u001b[0m \u001b[0;30;48;2;231;138;195m \"\u001b[0m \u001b[0;30;48;2;166;216;84m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;255;217;47m1\u001b[0m \u001b[0;30;48;2;102;194;165m2\u001b[0m \u001b[0;30;48;2;252;141;98m.\u001b[0m \u001b[0;30;48;2;141;160;203m0\u001b[0m \u001b[0;30;48;2;231;138;195m*\u001b[0m \u001b[0;30;48;2;166;216;84m5\u001b[0m \u001b[0;30;48;2;255;217;47m0\u001b[0m \u001b[0;30;48;2;102;194;165m=\u001b[0m \u001b[0;30;48;2;252;141;98m6\u001b[0m \u001b[0;30;48;2;141;160;203m0\u001b[0m \u001b[0;30;48;2;231;138;195m0\u001b[0m \u001b[0;30;48;2;166;216;84m\n",
|
||
"\u001b[0m "
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# You need to request access before being able to use this tokenizer\n",
|
||
"show_tokens(text, \"bigcode/starcoder2-15b\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 220,
|
||
"referenced_widgets": [
|
||
"6a6efb1d66ea423a9b5ff4b2f1f1194c",
|
||
"1283ee793a40405aa9763e1b88d6d7a3",
|
||
"11df3b517fd94524be18cff070e273a8",
|
||
"86b01695df4b42d3a6602d704243b6ee",
|
||
"b0b1039168a74b8fa79271592f29f0b7",
|
||
"877ff6d25f524779a10df09be5fc6093",
|
||
"276ec5fb636b49bcb933a4ce96cb900e",
|
||
"ac88f9025a0e44b38fc14720d810b5ab",
|
||
"4ceb8ce8b67a44b4b7505cd7e589dec1",
|
||
"b0d45aec56fd4219b9224dbe31fad3a3",
|
||
"bf3a8980a70547f5b853390235a37592",
|
||
"f11120105af24fe1b40b6490897e2e2e",
|
||
"8109bb974a3e4b12bd7534b62be20940",
|
||
"f1d6a31870da4e27bc482bb84953b165",
|
||
"1128f56169ac4376ab5fbb46d44b01dc",
|
||
"3da38a2294a145bf86124d0fda8b3255",
|
||
"85113d3b53fb47bb8593c3a21e37142c",
|
||
"1e4a17f723d14694b5aeb673db7394cc",
|
||
"87a4011d2d0a4076b027f7068b244dda",
|
||
"48ca9047fc7d424f99b40c54e6d732f4",
|
||
"3b169b44c1814ed5a7feba7bab0f3ce6",
|
||
"dff4ee8d0bd74822a0adf204e21521b8",
|
||
"5d3f3b08ec5044e3acf4414703e579d9",
|
||
"900ddbabea1846a3a0dfd8380668bec1",
|
||
"7da6e29f0349438494ff83975d48f02a",
|
||
"01f1433d221f437eb0692d25948ce080",
|
||
"9f189b7e32c94a3a84ffd40beed7d1fd",
|
||
"3e5f406442df4b848d324ff584eea75c",
|
||
"575b63bdd98047a4934d556423dd9ce6",
|
||
"c9662398b8ad4d4fa1a59d44f6205769",
|
||
"37379e486478437f9fa2f8eac7f9fd60",
|
||
"2b83154cf934484da54fe0d0b08fe3d3",
|
||
"759faaea712c4a8abb0ece5217e1c470"
|
||
]
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 1388,
|
||
"status": "ok",
|
||
"timestamp": 1719589605088,
|
||
"user": {
|
||
"displayName": "Jay Alammar جهاد العمار",
|
||
"userId": "14617748739431919458"
|
||
},
|
||
"user_tz": 240
|
||
},
|
||
"id": "KeWcUdxY6I3u",
|
||
"outputId": "f39c8f56-1e71-44bb-bade-75bfb33b581c"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "6a6efb1d66ea423a9b5ff4b2f1f1194c",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"tokenizer_config.json: 0%| | 0.00/166 [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "f11120105af24fe1b40b6490897e2e2e",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"tokenizer.json: 0%| | 0.00/2.14M [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "5d3f3b08ec5044e3acf4414703e579d9",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"special_tokens_map.json: 0%| | 0.00/3.00 [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"\u001b[0;30;48;2;102;194;165m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;252;141;98mEnglish\u001b[0m \u001b[0;30;48;2;141;160;203m and\u001b[0m \u001b[0;30;48;2;231;138;195m CAP\u001b[0m \u001b[0;30;48;2;166;216;84mITAL\u001b[0m \u001b[0;30;48;2;255;217;47mIZATION\u001b[0m \u001b[0;30;48;2;102;194;165m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;252;141;98m<38>\u001b[0m \u001b[0;30;48;2;141;160;203m<33>\u001b[0m \u001b[0;30;48;2;231;138;195m<35>\u001b[0m \u001b[0;30;48;2;166;216;84m<34>\u001b[0m \u001b[0;30;48;2;255;217;47m <20>\u001b[0m \u001b[0;30;48;2;102;194;165m<35>\u001b[0m \u001b[0;30;48;2;252;141;98m<38>\u001b[0m \u001b[0;30;48;2;141;160;203m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;231;138;195mshow\u001b[0m \u001b[0;30;48;2;166;216;84m_\u001b[0m \u001b[0;30;48;2;255;217;47mtokens\u001b[0m \u001b[0;30;48;2;102;194;165m False\u001b[0m \u001b[0;30;48;2;252;141;98m None\u001b[0m \u001b[0;30;48;2;141;160;203m elif\u001b[0m \u001b[0;30;48;2;231;138;195m \u001b[0m \u001b[0;30;48;2;166;216;84m==\u001b[0m \u001b[0;30;48;2;255;217;47m \u001b[0m \u001b[0;30;48;2;102;194;165m>\u001b[0m \u001b[0;30;48;2;252;141;98m=\u001b[0m \u001b[0;30;48;2;141;160;203m else\u001b[0m \u001b[0;30;48;2;231;138;195m:\u001b[0m \u001b[0;30;48;2;166;216;84m two\u001b[0m \u001b[0;30;48;2;255;217;47m t\u001b[0m \u001b[0;30;48;2;102;194;165mabs\u001b[0m \u001b[0;30;48;2;252;141;98m:\u001b[0m \u001b[0;30;48;2;141;160;203m\"\u001b[0m \u001b[0;30;48;2;231;138;195m \u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47m Three\u001b[0m \u001b[0;30;48;2;102;194;165m t\u001b[0m \u001b[0;30;48;2;252;141;98mabs\u001b[0m \u001b[0;30;48;2;141;160;203m:\u001b[0m \u001b[0;30;48;2;231;138;195m \u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47m \u001b[0m \u001b[0;30;48;2;102;194;165m\"\u001b[0m \u001b[0;30;48;2;252;141;98m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;141;160;203m1\u001b[0m \u001b[0;30;48;2;231;138;195m2\u001b[0m \u001b[0;30;48;2;166;216;84m.\u001b[0m \u001b[0;30;48;2;255;217;47m0\u001b[0m \u001b[0;30;48;2;102;194;165m*\u001b[0m \u001b[0;30;48;2;252;141;98m5\u001b[0m \u001b[0;30;48;2;141;160;203m0\u001b[0m \u001b[0;30;48;2;231;138;195m=\u001b[0m \u001b[0;30;48;2;166;216;84m6\u001b[0m \u001b[0;30;48;2;255;217;47m0\u001b[0m \u001b[0;30;48;2;102;194;165m0\u001b[0m \u001b[0;30;48;2;252;141;98m\n",
|
||
"\u001b[0m "
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"show_tokens(text, \"facebook/galactica-1.3b\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/"
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 374,
|
||
"status": "ok",
|
||
"timestamp": 1719589632350,
|
||
"user": {
|
||
"displayName": "Jay Alammar جهاد العمار",
|
||
"userId": "14617748739431919458"
|
||
},
|
||
"user_tz": 240
|
||
},
|
||
"id": "__QNj2Cohzz2",
|
||
"outputId": "17ffab73-b07c-44a9-c482-64ab9f4c45a4"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"\u001b[0;30;48;2;102;194;165m<s>\u001b[0m \u001b[0;30;48;2;252;141;98m\u001b[0m \u001b[0;30;48;2;141;160;203m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;231;138;195mEnglish\u001b[0m \u001b[0;30;48;2;166;216;84mand\u001b[0m \u001b[0;30;48;2;255;217;47mC\u001b[0m \u001b[0;30;48;2;102;194;165mAP\u001b[0m \u001b[0;30;48;2;252;141;98mIT\u001b[0m \u001b[0;30;48;2;141;160;203mAL\u001b[0m \u001b[0;30;48;2;231;138;195mIZ\u001b[0m \u001b[0;30;48;2;166;216;84mATION\u001b[0m \u001b[0;30;48;2;255;217;47m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;102;194;165m<35>\u001b[0m \u001b[0;30;48;2;252;141;98m<38>\u001b[0m \u001b[0;30;48;2;141;160;203m<33>\u001b[0m \u001b[0;30;48;2;231;138;195m<35>\u001b[0m \u001b[0;30;48;2;166;216;84m\u001b[0m \u001b[0;30;48;2;255;217;47m<37>\u001b[0m \u001b[0;30;48;2;102;194;165m<35>\u001b[0m \u001b[0;30;48;2;252;141;98m<38>\u001b[0m \u001b[0;30;48;2;141;160;203m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;231;138;195mshow\u001b[0m \u001b[0;30;48;2;166;216;84m_\u001b[0m \u001b[0;30;48;2;255;217;47mto\u001b[0m \u001b[0;30;48;2;102;194;165mkens\u001b[0m \u001b[0;30;48;2;252;141;98mFalse\u001b[0m \u001b[0;30;48;2;141;160;203mNone\u001b[0m \u001b[0;30;48;2;231;138;195melif\u001b[0m \u001b[0;30;48;2;166;216;84m==\u001b[0m \u001b[0;30;48;2;255;217;47m>=\u001b[0m \u001b[0;30;48;2;102;194;165melse\u001b[0m \u001b[0;30;48;2;252;141;98m:\u001b[0m \u001b[0;30;48;2;141;160;203mtwo\u001b[0m \u001b[0;30;48;2;231;138;195mtabs\u001b[0m \u001b[0;30;48;2;166;216;84m:\"\u001b[0m \u001b[0;30;48;2;255;217;47m \u001b[0m \u001b[0;30;48;2;102;194;165m\"\u001b[0m \u001b[0;30;48;2;252;141;98mThree\u001b[0m \u001b[0;30;48;2;141;160;203mtabs\u001b[0m \u001b[0;30;48;2;231;138;195m:\u001b[0m \u001b[0;30;48;2;166;216;84m\"\u001b[0m \u001b[0;30;48;2;255;217;47m \u001b[0m \u001b[0;30;48;2;102;194;165m\"\u001b[0m \u001b[0;30;48;2;252;141;98m\n",
|
||
"\u001b[0m \u001b[0;30;48;2;141;160;203m1\u001b[0m \u001b[0;30;48;2;231;138;195m2\u001b[0m \u001b[0;30;48;2;166;216;84m.\u001b[0m \u001b[0;30;48;2;255;217;47m0\u001b[0m \u001b[0;30;48;2;102;194;165m*\u001b[0m \u001b[0;30;48;2;252;141;98m5\u001b[0m \u001b[0;30;48;2;141;160;203m0\u001b[0m \u001b[0;30;48;2;231;138;195m=\u001b[0m \u001b[0;30;48;2;166;216;84m6\u001b[0m \u001b[0;30;48;2;255;217;47m0\u001b[0m \u001b[0;30;48;2;102;194;165m0\u001b[0m \u001b[0;30;48;2;252;141;98m\n",
|
||
"\u001b[0m "
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"show_tokens(text, \"microsoft/Phi-3-mini-4k-instruct\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "9Tu7OY4HvBEm"
|
||
},
|
||
"source": [
|
||
"# Contextualized Word Embeddings From a Language Model (Like BERT)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 265,
|
||
"referenced_widgets": [
|
||
"761c3e6c7f26453bba2f463f39f3ae73",
|
||
"a5c45eeafcf4456bbb0fe1bb43ef2497",
|
||
"1f35082ee7ec425ea801321112d48db1",
|
||
"618c1c5ae3ea4650a412a3e009d9fb49",
|
||
"a158991ec46e4587aac2111e02153f4c",
|
||
"51af2c28e34245ed83f02b424a6a640a",
|
||
"56a95848a2814195b334c31f0a961cd8",
|
||
"c1e09c1869f7410ba328e6efe56a0460",
|
||
"17777d8eae4742aaa78d7854a52e102f",
|
||
"5916af8bb7ed4efcb05c8f1cdf826149",
|
||
"b1778299eac04302ac136872dfb0a359",
|
||
"8ee98d9d017542609881a1e16a8f393f",
|
||
"c4a8f08da3f64fb287f7a940c4a9408f",
|
||
"d66ee44b7f5b4c6eac1536235a627441",
|
||
"c44feccdd01949c1af40fa69f6e03dd0",
|
||
"84682eb9cfff444da7633f4ca9360f77",
|
||
"332810f458df4e23bc034852706bcc6f",
|
||
"4999e7d8e2384bf4adfbf1777587a65f",
|
||
"a8d4f19ff4554165a5e78d3783928fe1",
|
||
"29d9e7a799ea402fb5a28b2817156838",
|
||
"222d489bf1664763babf0e377e45f4d8",
|
||
"2ae57535c39541fe98d6a8ae22bcd7d4",
|
||
"fd134a05028c447a994166eccc557806",
|
||
"69101d935ae841e59aa7f30e40789496",
|
||
"3ef63f93e424409192edd1b1364aba48",
|
||
"1d204aedfeb14df5ae7e27eb88a87018",
|
||
"e74b57abfaa2487aa8e369103be5d00d",
|
||
"94609e349c8b43b5b74d2c059623f9f0",
|
||
"febce6a7c96e4a42a9c6faa0bf1763c0",
|
||
"18a4603fa6a040c0acd792243510562a",
|
||
"c04ce575bf624bd1a80113d1eff1ae94",
|
||
"6a9de4aaed054608b800820831aec87f",
|
||
"46cd179c1c09474a80dc4cea39b759d5",
|
||
"20331d1d457143719fe732325e79877e",
|
||
"2a5055c8fc03457eb29390312c555ec5",
|
||
"6de0874e33c146bba06131aa452c403f",
|
||
"0f1d2e4c312d4ab38e359f96c9a760b7",
|
||
"f946c53a81f34d64b25e331fc4b4c7a1",
|
||
"2dc553ef192c4002b818de6736367fe0",
|
||
"665b4085199a4ec5890dba773f07d4b7",
|
||
"de74a5af1ef9462e82e5622232346a79",
|
||
"34df4fe808174f80a89a765f6ce2f28f",
|
||
"56c082332310434fa5ec791b728fe82e",
|
||
"9ef54c7a15d1400f91598b367ac6552e",
|
||
"ed19532a8eda4925a4014a3f11517ff9",
|
||
"3859f0311ade4d278190924f0107ba0e",
|
||
"6ea963b64fd642b29021110d77827021",
|
||
"e3efc5b43bf2417faaea2e44a306fa75",
|
||
"4975826de21547fca434e1fb492a216a",
|
||
"19b547738c8e45bda7879dc527229019",
|
||
"6469ef3a7ba6465c911a22060d54ff95",
|
||
"657b6ce5f0804bc78d64f9b7b6a27777",
|
||
"301363b755a24bc9bf413fc3b0ffd8b2",
|
||
"50fc50f975294a23a1d5bccf64efc872",
|
||
"3c722f92f6c2479a91cf957d1d18fbee",
|
||
"7570614d79184ab2b44700df2342b294",
|
||
"543ae93b2c5f4339979b8c64eff33f62",
|
||
"72b54fe7c88540488adee31c15595c89",
|
||
"3fbdaadec9f545b99057f7c55c6a6df1",
|
||
"7656d1b978554f4383160b93a80ee7c6",
|
||
"4d55d194e62942df8f1c32b9bb244e9b",
|
||
"403f03c3a2434fd2ac6f143c1972c62e",
|
||
"cc68d46e46e7487484d366f77ea863ca",
|
||
"3dee319ce0aa4ce589ce84033bba8d9d",
|
||
"a280161408504bb894ab78526f67750b",
|
||
"72d18d21ca2a45ac868d390ead3ac086"
|
||
]
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 5049,
|
||
"status": "ok",
|
||
"timestamp": 1719641476949,
|
||
"user": {
|
||
"displayName": "Maarten Grootendorst",
|
||
"userId": "11015108362723620659"
|
||
},
|
||
"user_tz": -120
|
||
},
|
||
"id": "nsjz-VsYu9bB",
|
||
"outputId": "03ea124b-c6de-449d-ea6f-f5e5b84c2c97"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "761c3e6c7f26453bba2f463f39f3ae73",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"tokenizer_config.json: 0%| | 0.00/52.0 [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
|
||
" warnings.warn(\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "8ee98d9d017542609881a1e16a8f393f",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"config.json: 0%| | 0.00/474 [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "fd134a05028c447a994166eccc557806",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"vocab.json: 0%| | 0.00/899k [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "20331d1d457143719fe732325e79877e",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"merges.txt: 0%| | 0.00/456k [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "ed19532a8eda4925a4014a3f11517ff9",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"config.json: 0%| | 0.00/578 [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "7570614d79184ab2b44700df2342b294",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"pytorch_model.bin: 0%| | 0.00/241M [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"from transformers import AutoModel, AutoTokenizer\n",
|
||
"\n",
|
||
"# Load a tokenizer\n",
|
||
"tokenizer = AutoTokenizer.from_pretrained(\"microsoft/deberta-base\")\n",
|
||
"\n",
|
||
"# Load a language model\n",
|
||
"model = AutoModel.from_pretrained(\"microsoft/deberta-v3-xsmall\")\n",
|
||
"\n",
|
||
"# Tokenize the sentence\n",
|
||
"tokens = tokenizer('Hello world', return_tensors='pt')\n",
|
||
"\n",
|
||
"# Process the tokens\n",
|
||
"output = model(**tokens)[0]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/"
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 567,
|
||
"status": "ok",
|
||
"timestamp": 1719641482036,
|
||
"user": {
|
||
"displayName": "Maarten Grootendorst",
|
||
"userId": "11015108362723620659"
|
||
},
|
||
"user_tz": -120
|
||
},
|
||
"id": "lQly_KcbvDce",
|
||
"outputId": "fe2cc467-2a5a-4111-8d23-4da9aa799b79"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"torch.Size([1, 4, 384])"
|
||
]
|
||
},
|
||
"execution_count": 9,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"output.shape"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/"
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 2,
|
||
"status": "ok",
|
||
"timestamp": 1719641482353,
|
||
"user": {
|
||
"displayName": "Maarten Grootendorst",
|
||
"userId": "11015108362723620659"
|
||
},
|
||
"user_tz": -120
|
||
},
|
||
"id": "8GcRrpPV0kVj",
|
||
"outputId": "93766ff1-1ae5-4e90-dba0-286d9e721c3d"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[CLS]\n",
|
||
"Hello\n",
|
||
" world\n",
|
||
"[SEP]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"for token in tokens['input_ids'][0]:\n",
|
||
" print(tokenizer.decode(token))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/"
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 1,
|
||
"status": "ok",
|
||
"timestamp": 1719641482353,
|
||
"user": {
|
||
"displayName": "Maarten Grootendorst",
|
||
"userId": "11015108362723620659"
|
||
},
|
||
"user_tz": -120
|
||
},
|
||
"id": "e8oHVC7B0lkk",
|
||
"outputId": "f7dd1e0c-a2db-4ae4-8ccb-c97fa150071a"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"tensor([[[-3.4816, 0.0861, -0.1819, ..., -0.0612, -0.3911, 0.3017],\n",
|
||
" [ 0.1898, 0.3208, -0.2315, ..., 0.3714, 0.2478, 0.8048],\n",
|
||
" [ 0.2071, 0.5036, -0.0485, ..., 1.2175, -0.2292, 0.8582],\n",
|
||
" [-3.4278, 0.0645, -0.1427, ..., 0.0658, -0.4367, 0.3834]]],\n",
|
||
" grad_fn=<NativeLayerNormBackward0>)"
|
||
]
|
||
},
|
||
"execution_count": 11,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"output"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "DdEDuLWa0r4L"
|
||
},
|
||
"source": [
|
||
"# Text Embeddings (For Sentences and Whole Documents)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 425,
|
||
"referenced_widgets": [
|
||
"a50156f59d8548b683982af06d5bda09",
|
||
"7cbfb80418f14068bb4327a3823140fb",
|
||
"ef1e2b9a1e694eaa8d5b371caa66277e",
|
||
"ebf8ef569e374c17aa15472ee6ea98d8",
|
||
"75f5174b66c14f5480960c79a44283ad",
|
||
"05504faf760d43ea9082ff5a91ff82f7",
|
||
"50c6cf2be63b4be29524188d79e8cd53",
|
||
"9088f66d46f44030af46aee55005a939",
|
||
"7af424dbfd864b8ba6f7661f2204302c",
|
||
"44aae8dd70da4be98fd86a970373753b",
|
||
"49dd6490203640d1821196fe28a08732",
|
||
"c40e76c5e9bd42bea7903e811bce53a4",
|
||
"8f7d3eea82614e4ca93482ad7cb637a6",
|
||
"05eb94a4253544648654f24268e8b6da",
|
||
"01b9535a72c64c95b482640b2bd3c5fa",
|
||
"b954ab8edd66487cb9a5502879f0e1c5",
|
||
"e6e103b2c53a4b5ca71f988b7804aeaf",
|
||
"77f25a7ee7f843b0a24a1d360a1df211",
|
||
"a0e30058096941dcaab8f24235ee1c49",
|
||
"b26ac1747c884e3a913d90b8c05a991b",
|
||
"e5db8326887a48ae97341cf216d35893",
|
||
"c76f0b056c1f40f3ac37df24c449e9ec",
|
||
"c4c7962b94674cd6898980cea6483595",
|
||
"7fc0e3e3c75e47f88410a2774695081e",
|
||
"6c96c43dd36a4b08ad94a9cad642b1ab",
|
||
"73b403ec39fa4a16883ad08a50c7204b",
|
||
"6feee2b015ef4fa6b94c905ec91f5146",
|
||
"ab68f3b41d954ec992b9339a91def6a8",
|
||
"fb9054b0d27c4a8a9b8241f9d5910e51",
|
||
"78b6d1098f754eb491a2a729e00d2335",
|
||
"e84f055fc0c04c4983b162d1d8c67147",
|
||
"13455551484542ea93d4dbbd937288d5",
|
||
"99db2aef29ba4717866072f20f1acf61",
|
||
"10f9441d843c44e5b11cfb2e21b5d89e",
|
||
"5c987bfb44d14b1ea822c99cb7dde071",
|
||
"52ac1a973f7140e8b49dbf58dc0c8b21",
|
||
"37117814cae9440c9c54f63def546c4b",
|
||
"e019010a35ea4f3a9b9236a626b34760",
|
||
"ba03fe6450b742c99e8b8836f585232c",
|
||
"47e9e2ed70c84023ae3a1dbf0cb27328",
|
||
"99e53690e1c940cea12f581b625c3b3d",
|
||
"f53653a9050042649df9d91114ba39bb",
|
||
"737c25a604ee4bcd982487f79450d3ea",
|
||
"cea92b3b96f24087b494487bf3f4c0f9",
|
||
"784748de51254ba18128af8df30b8a93",
|
||
"11890f81eceb41f7be6a2d52c9a9e55e",
|
||
"f7adb025a07a4aca8b7a3a174304666f",
|
||
"7f7a5bfc6073495da65dcfd4b2d49309",
|
||
"fa6483d07ae54fb2904fc117aa9a3d5b",
|
||
"2cbcf7d0b1384b8cb320e1b30c124d71",
|
||
"94ea21e19acf4eb7a9010510b226db86",
|
||
"f399e1a91e7d4b7eaddfb910bd81d750",
|
||
"f26297a84f224c4da8afd4316c8e7477",
|
||
"e4bef8778ddc46e5a8d0756eb27c3e7c",
|
||
"66943967c327428a9796fcd38c36b24d",
|
||
"f0793746dff34da8858758fb55284b97",
|
||
"1bc7b31eddc54588979f3fff14a0e12e",
|
||
"75f54f7a8e5b4965b6f0ad28e5f3bf26",
|
||
"3cb9178f0568448fa839d5cccc7973d7",
|
||
"40bde45ac20f48ab93c4fd9e8284eac8",
|
||
"09b701e83e3844fab97ff237b06a1238",
|
||
"c1bbe572b8324ea48d42c40a5128bb8e",
|
||
"a4bd81ed4d9d498a983c13ea79265819",
|
||
"d8d65b5ac8914792b82460cf0bae980d",
|
||
"136cd465bac246f2ac2454eea2f0484d",
|
||
"59030392bbde468aae6c62aecddd499e",
|
||
"ca7eb54b296c4a1fa91678c0e3d65f5d",
|
||
"35521a33c1324a928fd2c9f7fce2ce69",
|
||
"80c82a58f0924a578bcae9d3c6537c11",
|
||
"589d24a92b3d4e5c99460cd609c8a230",
|
||
"0edd8bfe5bba47d59c5b195841cc4228",
|
||
"2aed08160bef4a7189821c560c01d6bb",
|
||
"01ca9f66804048a9a75475ab9c49a24e",
|
||
"880c7bdcf3174a78892aa7d0cd11dca7",
|
||
"1d4f85ce80d841eab27b411b3e61e9be",
|
||
"9960dd3bff70458abc148f1a153175ec",
|
||
"0bdc012004ad445fa527fafee0ab55c2",
|
||
"2895fad80e754b4e8158c6dd8db69058",
|
||
"4a2c467901414bf0afc5b310aa959dae",
|
||
"a465317d0f3b4106bbf8fd6c7a3caf6a",
|
||
"7c93b5df16a64e1981e350459b05852d",
|
||
"66d5c0087bd141b3baa502e3aa8bd408",
|
||
"d26ec2f29c154da1ac7cb49cb9729113",
|
||
"7385eea9ed2a438c8dae350fe2328162",
|
||
"c861634df7524e7cbd7fdca030a0b663",
|
||
"9904b80c37c44ed2a6b3c21786016e26",
|
||
"df424fcea3e84c8084dbf8e146d1231f",
|
||
"a1e116cf62d74d4e8e33c99379e924ed",
|
||
"652ddbc085994a36b553ea04359943a1",
|
||
"b340ba4de77043dcbedbcbdf6033d0c7",
|
||
"10c89678b42b4cf0b8e95b74ad346fba",
|
||
"fb22220311094fe2b3d245ff080ca4d7",
|
||
"fc1358383bba4e5ebe2edfca57473002",
|
||
"342adaf9c12548a4af25f5361ca869ca",
|
||
"374287b3d21a427fabd82dc1e0710d62",
|
||
"427e31cb6fcc4687937d807116e5e581",
|
||
"8195fa9ad4e3487c90cbc0860361a336",
|
||
"82bf559e6997425aba4245c44531f762",
|
||
"6bc8887963d744a7a6c15930844f5513",
|
||
"4688efe8ab954510b30df18f8daa74a5",
|
||
"27c697f872b64fb5a43deb255e27fb50",
|
||
"08f2dd1dd0a742eb8c9e11a41dbe69a3",
|
||
"94b1fee85a034b49aba0c50fcdfc0fdb",
|
||
"c21e510111f5425bad28afd9b723f9d1",
|
||
"fea69d561fb94e8c98af4526f8d4b33e",
|
||
"0cec3fa3672b4bbaa20e3d4aae6fd575",
|
||
"2b09f8d112ea43a5b60c010f2bf0bbdb",
|
||
"60ea33df89a449ad9e0f90a0bca672ad",
|
||
"48230299285241a28e2c8e03db6fce4d",
|
||
"3229d1aa3fe74299a1d4e3f917cc2ca6",
|
||
"1f5849878621437397efe2de7b7a43fe",
|
||
"2a554904b31c4ed8a4269d2c26ea4e91",
|
||
"a29c289ad83d4c718d5854e5d3eff48a",
|
||
"07584d1f220f4fb6b82a42090f3818f7",
|
||
"04e248de88ac4a50ad20272d31549304",
|
||
"57bb0b0d061d4e778493adb47482f234",
|
||
"2a24003cc0d54e568706cc6fc77d2831",
|
||
"849583682f034d3d8b8887cbafb3daaf",
|
||
"8ab0464a810e4b208d4c0fc481c58b54",
|
||
"cab50f86e61d43df90643acaf98670ab",
|
||
"d4180478ac134757bcfe6c4f0ff4990e"
|
||
]
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 7006,
|
||
"status": "ok",
|
||
"timestamp": 1719641491724,
|
||
"user": {
|
||
"displayName": "Maarten Grootendorst",
|
||
"userId": "11015108362723620659"
|
||
},
|
||
"user_tz": -120
|
||
},
|
||
"id": "TQHWioIc0pQ8",
|
||
"outputId": "87112ec7-bee0-4894-d850-8dd5e0f4e38c"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "a50156f59d8548b683982af06d5bda09",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"modules.json: 0%| | 0.00/349 [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "c40e76c5e9bd42bea7903e811bce53a4",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"config_sentence_transformers.json: 0%| | 0.00/116 [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "c4c7962b94674cd6898980cea6483595",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"README.md: 0%| | 0.00/10.6k [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "10f9441d843c44e5b11cfb2e21b5d89e",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"sentence_bert_config.json: 0%| | 0.00/53.0 [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
|
||
" warnings.warn(\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "784748de51254ba18128af8df30b8a93",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"config.json: 0%| | 0.00/571 [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "f0793746dff34da8858758fb55284b97",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"model.safetensors: 0%| | 0.00/438M [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "ca7eb54b296c4a1fa91678c0e3d65f5d",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"tokenizer_config.json: 0%| | 0.00/363 [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "2895fad80e754b4e8158c6dd8db69058",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"vocab.txt: 0%| | 0.00/232k [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "652ddbc085994a36b553ea04359943a1",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"tokenizer.json: 0%| | 0.00/466k [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "4688efe8ab954510b30df18f8daa74a5",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"special_tokens_map.json: 0%| | 0.00/239 [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.jupyter.widget-view+json": {
|
||
"model_id": "1f5849878621437397efe2de7b7a43fe",
|
||
"version_major": 2,
|
||
"version_minor": 0
|
||
},
|
||
"text/plain": [
|
||
"1_Pooling/config.json: 0%| | 0.00/190 [00:00<?, ?B/s]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"from sentence_transformers import SentenceTransformer\n",
|
||
"\n",
|
||
"# Load model\n",
|
||
"model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')\n",
|
||
"\n",
|
||
"# Convert text to text embeddings\n",
|
||
"vector = model.encode(\"Best movie ever!\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/"
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 2,
|
||
"status": "ok",
|
||
"timestamp": 1719641491724,
|
||
"user": {
|
||
"displayName": "Maarten Grootendorst",
|
||
"userId": "11015108362723620659"
|
||
},
|
||
"user_tz": -120
|
||
},
|
||
"id": "PDwfmBiC0uER",
|
||
"outputId": "db6755ce-92b2-45d1-85aa-9b53baee446e"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(768,)"
|
||
]
|
||
},
|
||
"execution_count": 13,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"vector.shape"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "xnuGRjo80yKj"
|
||
},
|
||
"source": [
|
||
"# Word Embeddings Beyond LLMs\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/"
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 44634,
|
||
"status": "ok",
|
||
"timestamp": 1719641543423,
|
||
"user": {
|
||
"displayName": "Maarten Grootendorst",
|
||
"userId": "11015108362723620659"
|
||
},
|
||
"user_tz": -120
|
||
},
|
||
"id": "sKgNdnwe0vfK",
|
||
"outputId": "180bbb09-b030-4fa0-9198-085b0eb54c7b"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[==================================================] 100.0% 66.0/66.0MB downloaded\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"import gensim.downloader as api\n",
|
||
"\n",
|
||
"# Download embeddings (66MB, glove, trained on wikipedia, vector size: 50)\n",
|
||
"# Other options include \"word2vec-google-news-300\"\n",
|
||
"# More options at https://github.com/RaRe-Technologies/gensim-data\n",
|
||
"model = api.load(\"glove-wiki-gigaword-50\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/"
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 2,
|
||
"status": "ok",
|
||
"timestamp": 1719641543423,
|
||
"user": {
|
||
"displayName": "Maarten Grootendorst",
|
||
"userId": "11015108362723620659"
|
||
},
|
||
"user_tz": -120
|
||
},
|
||
"id": "u_vj5NVn01aD",
|
||
"outputId": "73c3edd8-0185-494d-a842-d78cbe100642"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"[('king', 1.0000001192092896),\n",
|
||
" ('prince', 0.8236179351806641),\n",
|
||
" ('queen', 0.7839043140411377),\n",
|
||
" ('ii', 0.7746230363845825),\n",
|
||
" ('emperor', 0.7736247777938843),\n",
|
||
" ('son', 0.766719400882721),\n",
|
||
" ('uncle', 0.7627150416374207),\n",
|
||
" ('kingdom', 0.7542161345481873),\n",
|
||
" ('throne', 0.7539914846420288),\n",
|
||
" ('brother', 0.7492411136627197),\n",
|
||
" ('ruler', 0.7434253692626953)]"
|
||
]
|
||
},
|
||
"execution_count": 15,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"model.most_similar([model['king']], topn=11)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "QMSgyKKS4xUx"
|
||
},
|
||
"source": [
|
||
"# Recommending songs by embeddings"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "3dJdWzT67nDL"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"import pandas as pd\n",
|
||
"from urllib import request\n",
|
||
"\n",
|
||
"# Get the playlist dataset file\n",
|
||
"data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')\n",
|
||
"\n",
|
||
"# Parse the playlist dataset file. Skip the first two lines as\n",
|
||
"# they only contain metadata\n",
|
||
"lines = data.read().decode(\"utf-8\").split('\\n')[2:]\n",
|
||
"\n",
|
||
"# Remove playlists with only one song\n",
|
||
"playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]\n",
|
||
"\n",
|
||
"# Load song metadata\n",
|
||
"songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')\n",
|
||
"songs_file = songs_file.read().decode(\"utf-8\").split('\\n')\n",
|
||
"songs = [s.rstrip().split('\\t') for s in songs_file]\n",
|
||
"songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])\n",
|
||
"songs_df = songs_df.set_index('id')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/"
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 3,
|
||
"status": "ok",
|
||
"timestamp": 1724598630488,
|
||
"user": {
|
||
"displayName": "Jay Alammar جهاد العمار",
|
||
"userId": "14617748739431919458"
|
||
},
|
||
"user_tz": 240
|
||
},
|
||
"id": "Q3zirG-lo3H8",
|
||
"outputId": "e3b4269e-dd42-428e-8b28-46c27d0231af"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Playlist #1:\n",
|
||
" ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43'] \n",
|
||
"\n",
|
||
"Playlist #2:\n",
|
||
" ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117', '118', '119', '120', '121', '122', '123', '50', '70', '71', '124', '17', '85', '14', '82', '48', '125', '47', '46', '72', '53', '25', '73', '4', '126', '59', '74', '20', '43', '127', '128', '129', '13', '82', '48', '130', '131', '132', '133', '134', '135', '136', '137', '59', '46', '138', '43', '20', '139', '140', '73', '57', '70', '141', '3', '1', '74', '142', '143', '144', '145', '48', '13', '25', '146', '50', '147', '126', '59', '20', '148', '149', '150', '151', '152', '56', '153', '154', '155', '156', '157', '158', '159', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169', '170', '171', '172', '173', '174', '175', '60', '176', '51', '177', '178', '179', '180', '181', '182', '183', '184', '185', '57', '186', '187', '188', '189', '190', '191', '46', '192', '193', '194', '195', '196', '197', '198', '25', '199', '200', '49', '201', '100', '202', '203', '204', '205', '206', '207', '32', '208', '209', '210']\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print( 'Playlist #1:\\n ', playlists[0], '\\n')\n",
|
||
"print( 'Playlist #2:\\n ', playlists[1])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "EaUz3E0P7sJs"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"from gensim.models import Word2Vec\n",
|
||
"\n",
|
||
"# Train our Word2Vec model\n",
|
||
"model = Word2Vec(\n",
|
||
" playlists, vector_size=32, window=20, negative=50, min_count=1, workers=4\n",
|
||
")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/"
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 314,
|
||
"status": "ok",
|
||
"timestamp": 1719642095066,
|
||
"user": {
|
||
"displayName": "Maarten Grootendorst",
|
||
"userId": "11015108362723620659"
|
||
},
|
||
"user_tz": -120
|
||
},
|
||
"id": "9EFGWesO8rOJ",
|
||
"outputId": "1e46ce56-7b14-4268-a38a-c328e0f52943"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"[('2849', 0.9979680776596069),\n",
|
||
" ('2640', 0.9964019060134888),\n",
|
||
" ('3167', 0.9963980317115784),\n",
|
||
" ('5549', 0.9959008693695068),\n",
|
||
" ('2715', 0.9958351850509644),\n",
|
||
" ('3117', 0.9954560995101929),\n",
|
||
" ('2987', 0.9953479766845703),\n",
|
||
" ('2881', 0.9951083660125732),\n",
|
||
" ('2886', 0.9950577616691589),\n",
|
||
" ('3094', 0.994985044002533)]"
|
||
]
|
||
},
|
||
"execution_count": 23,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"song_id = 2172\n",
|
||
"\n",
|
||
"# Ask the model for songs similar to song #2172\n",
|
||
"model.wv.most_similar(positive=str(song_id))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/"
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 321,
|
||
"status": "ok",
|
||
"timestamp": 1719642762615,
|
||
"user": {
|
||
"displayName": "Maarten Grootendorst",
|
||
"userId": "11015108362723620659"
|
||
},
|
||
"user_tz": -120
|
||
},
|
||
"id": "AMiY6isXqKk4",
|
||
"outputId": "0f465f20-ada8-4fa8-92d6-f72966d03aa4"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"title Fade To Black\n",
|
||
"artist Metallica\n",
|
||
"Name: 2172 , dtype: object\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(songs_df.iloc[2172])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 237
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 556,
|
||
"status": "ok",
|
||
"timestamp": 1719642918281,
|
||
"user": {
|
||
"displayName": "Maarten Grootendorst",
|
||
"userId": "11015108362723620659"
|
||
},
|
||
"user_tz": -120
|
||
},
|
||
"id": "aOzWENxr2Fl3",
|
||
"outputId": "0b1ac29a-14f7-4e30-e153-e8f35ca97d7e"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"application/vnd.google.colaboratory.intrinsic+json": {
|
||
"summary": "{\n \"name\": \"print_recommendations(2172)\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"id\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"2640 \",\n \"2715 \",\n \"3167 \"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Red Barchetta\",\n \"Rainbow In The Dark\",\n \"Unchained\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"artist\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Rush\",\n \"Dio\",\n \"Van Halen\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}",
|
||
"type": "dataframe"
|
||
},
|
||
"text/html": [
|
||
"\n",
|
||
" <div id=\"df-94b64d84-06f0-49f5-a721-a51ab661e5c4\" class=\"colab-df-container\">\n",
|
||
" <div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>title</th>\n",
|
||
" <th>artist</th>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>id</th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>2849</th>\n",
|
||
" <td>Run To The Hills</td>\n",
|
||
" <td>Iron Maiden</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2640</th>\n",
|
||
" <td>Red Barchetta</td>\n",
|
||
" <td>Rush</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3167</th>\n",
|
||
" <td>Unchained</td>\n",
|
||
" <td>Van Halen</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5549</th>\n",
|
||
" <td>November Rain</td>\n",
|
||
" <td>Guns N' Roses</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2715</th>\n",
|
||
" <td>Rainbow In The Dark</td>\n",
|
||
" <td>Dio</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" <div class=\"colab-df-buttons\">\n",
|
||
"\n",
|
||
" <div class=\"colab-df-container\">\n",
|
||
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-94b64d84-06f0-49f5-a721-a51ab661e5c4')\"\n",
|
||
" title=\"Convert this dataframe to an interactive table.\"\n",
|
||
" style=\"display:none;\">\n",
|
||
"\n",
|
||
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
|
||
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
|
||
" </svg>\n",
|
||
" </button>\n",
|
||
"\n",
|
||
" <style>\n",
|
||
" .colab-df-container {\n",
|
||
" display:flex;\n",
|
||
" gap: 12px;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-convert {\n",
|
||
" background-color: #E8F0FE;\n",
|
||
" border: none;\n",
|
||
" border-radius: 50%;\n",
|
||
" cursor: pointer;\n",
|
||
" display: none;\n",
|
||
" fill: #1967D2;\n",
|
||
" height: 32px;\n",
|
||
" padding: 0 0 0 0;\n",
|
||
" width: 32px;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-convert:hover {\n",
|
||
" background-color: #E2EBFA;\n",
|
||
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
|
||
" fill: #174EA6;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-buttons div {\n",
|
||
" margin-bottom: 4px;\n",
|
||
" }\n",
|
||
"\n",
|
||
" [theme=dark] .colab-df-convert {\n",
|
||
" background-color: #3B4455;\n",
|
||
" fill: #D2E3FC;\n",
|
||
" }\n",
|
||
"\n",
|
||
" [theme=dark] .colab-df-convert:hover {\n",
|
||
" background-color: #434B5C;\n",
|
||
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
|
||
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
|
||
" fill: #FFFFFF;\n",
|
||
" }\n",
|
||
" </style>\n",
|
||
"\n",
|
||
" <script>\n",
|
||
" const buttonEl =\n",
|
||
" document.querySelector('#df-94b64d84-06f0-49f5-a721-a51ab661e5c4 button.colab-df-convert');\n",
|
||
" buttonEl.style.display =\n",
|
||
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
|
||
"\n",
|
||
" async function convertToInteractive(key) {\n",
|
||
" const element = document.querySelector('#df-94b64d84-06f0-49f5-a721-a51ab661e5c4');\n",
|
||
" const dataTable =\n",
|
||
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
|
||
" [key], {});\n",
|
||
" if (!dataTable) return;\n",
|
||
"\n",
|
||
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
|
||
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
|
||
" + ' to learn more about interactive tables.';\n",
|
||
" element.innerHTML = '';\n",
|
||
" dataTable['output_type'] = 'display_data';\n",
|
||
" await google.colab.output.renderOutput(dataTable, element);\n",
|
||
" const docLink = document.createElement('div');\n",
|
||
" docLink.innerHTML = docLinkHtml;\n",
|
||
" element.appendChild(docLink);\n",
|
||
" }\n",
|
||
" </script>\n",
|
||
" </div>\n",
|
||
"\n",
|
||
"\n",
|
||
"<div id=\"df-66b2f5cc-45c9-44ee-b044-69f0e59b123a\">\n",
|
||
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-66b2f5cc-45c9-44ee-b044-69f0e59b123a')\"\n",
|
||
" title=\"Suggest charts\"\n",
|
||
" style=\"display:none;\">\n",
|
||
"\n",
|
||
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
|
||
" width=\"24px\">\n",
|
||
" <g>\n",
|
||
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
|
||
" </g>\n",
|
||
"</svg>\n",
|
||
" </button>\n",
|
||
"\n",
|
||
"<style>\n",
|
||
" .colab-df-quickchart {\n",
|
||
" --bg-color: #E8F0FE;\n",
|
||
" --fill-color: #1967D2;\n",
|
||
" --hover-bg-color: #E2EBFA;\n",
|
||
" --hover-fill-color: #174EA6;\n",
|
||
" --disabled-fill-color: #AAA;\n",
|
||
" --disabled-bg-color: #DDD;\n",
|
||
" }\n",
|
||
"\n",
|
||
" [theme=dark] .colab-df-quickchart {\n",
|
||
" --bg-color: #3B4455;\n",
|
||
" --fill-color: #D2E3FC;\n",
|
||
" --hover-bg-color: #434B5C;\n",
|
||
" --hover-fill-color: #FFFFFF;\n",
|
||
" --disabled-bg-color: #3B4455;\n",
|
||
" --disabled-fill-color: #666;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-quickchart {\n",
|
||
" background-color: var(--bg-color);\n",
|
||
" border: none;\n",
|
||
" border-radius: 50%;\n",
|
||
" cursor: pointer;\n",
|
||
" display: none;\n",
|
||
" fill: var(--fill-color);\n",
|
||
" height: 32px;\n",
|
||
" padding: 0;\n",
|
||
" width: 32px;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-quickchart:hover {\n",
|
||
" background-color: var(--hover-bg-color);\n",
|
||
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
|
||
" fill: var(--button-hover-fill-color);\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-quickchart-complete:disabled,\n",
|
||
" .colab-df-quickchart-complete:disabled:hover {\n",
|
||
" background-color: var(--disabled-bg-color);\n",
|
||
" fill: var(--disabled-fill-color);\n",
|
||
" box-shadow: none;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-spinner {\n",
|
||
" border: 2px solid var(--fill-color);\n",
|
||
" border-color: transparent;\n",
|
||
" border-bottom-color: var(--fill-color);\n",
|
||
" animation:\n",
|
||
" spin 1s steps(1) infinite;\n",
|
||
" }\n",
|
||
"\n",
|
||
" @keyframes spin {\n",
|
||
" 0% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-bottom-color: var(--fill-color);\n",
|
||
" border-left-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" 20% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-left-color: var(--fill-color);\n",
|
||
" border-top-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" 30% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-left-color: var(--fill-color);\n",
|
||
" border-top-color: var(--fill-color);\n",
|
||
" border-right-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" 40% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-right-color: var(--fill-color);\n",
|
||
" border-top-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" 60% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-right-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" 80% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-right-color: var(--fill-color);\n",
|
||
" border-bottom-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" 90% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-bottom-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"\n",
|
||
" <script>\n",
|
||
" async function quickchart(key) {\n",
|
||
" const quickchartButtonEl =\n",
|
||
" document.querySelector('#' + key + ' button');\n",
|
||
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
|
||
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
|
||
" try {\n",
|
||
" const charts = await google.colab.kernel.invokeFunction(\n",
|
||
" 'suggestCharts', [key], {});\n",
|
||
" } catch (error) {\n",
|
||
" console.error('Error during call to suggestCharts:', error);\n",
|
||
" }\n",
|
||
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
|
||
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
|
||
" }\n",
|
||
" (() => {\n",
|
||
" let quickchartButtonEl =\n",
|
||
" document.querySelector('#df-66b2f5cc-45c9-44ee-b044-69f0e59b123a button');\n",
|
||
" quickchartButtonEl.style.display =\n",
|
||
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
|
||
" })();\n",
|
||
" </script>\n",
|
||
"</div>\n",
|
||
"\n",
|
||
" </div>\n",
|
||
" </div>\n"
|
||
],
|
||
"text/plain": [
|
||
" title artist\n",
|
||
"id \n",
|
||
"2849 Run To The Hills Iron Maiden\n",
|
||
"2640 Red Barchetta Rush\n",
|
||
"3167 Unchained Van Halen\n",
|
||
"5549 November Rain Guns N' Roses\n",
|
||
"2715 Rainbow In The Dark Dio"
|
||
]
|
||
},
|
||
"execution_count": 39,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"import numpy as np\n",
|
||
"\n",
|
||
"def print_recommendations(song_id):\n",
|
||
" similar_songs = np.array(\n",
|
||
" model.wv.most_similar(positive=str(song_id),topn=5)\n",
|
||
" )[:,0]\n",
|
||
" return songs_df.iloc[similar_songs]\n",
|
||
"\n",
|
||
"# Extract recommendations\n",
|
||
"print_recommendations(2172)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 310
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 681,
|
||
"status": "ok",
|
||
"timestamp": 1719642181255,
|
||
"user": {
|
||
"displayName": "Maarten Grootendorst",
|
||
"userId": "11015108362723620659"
|
||
},
|
||
"user_tz": -120
|
||
},
|
||
"id": "xqrzQQ-m1EJ5",
|
||
"outputId": "3cf4967d-f510-4772-cb11-4166d16c6956"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"title Fade To Black\n",
|
||
"artist Metallica\n",
|
||
"Name: 2172 , dtype: object\n",
|
||
"['2849' '2640' '3167' '5549' '2715']\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.google.colaboratory.intrinsic+json": {
|
||
"summary": "{\n \"name\": \"print_recommendations(2172)\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"id\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"2640 \",\n \"2715 \",\n \"3167 \"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Red Barchetta\",\n \"Rainbow In The Dark\",\n \"Unchained\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"artist\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Rush\",\n \"Dio\",\n \"Van Halen\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}",
|
||
"type": "dataframe"
|
||
},
|
||
"text/html": [
|
||
"\n",
|
||
" <div id=\"df-c38e0eb4-9c39-45f5-aa32-dbd65ad89576\" class=\"colab-df-container\">\n",
|
||
" <div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>title</th>\n",
|
||
" <th>artist</th>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>id</th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>2849</th>\n",
|
||
" <td>Run To The Hills</td>\n",
|
||
" <td>Iron Maiden</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2640</th>\n",
|
||
" <td>Red Barchetta</td>\n",
|
||
" <td>Rush</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3167</th>\n",
|
||
" <td>Unchained</td>\n",
|
||
" <td>Van Halen</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5549</th>\n",
|
||
" <td>November Rain</td>\n",
|
||
" <td>Guns N' Roses</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2715</th>\n",
|
||
" <td>Rainbow In The Dark</td>\n",
|
||
" <td>Dio</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" <div class=\"colab-df-buttons\">\n",
|
||
"\n",
|
||
" <div class=\"colab-df-container\">\n",
|
||
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-c38e0eb4-9c39-45f5-aa32-dbd65ad89576')\"\n",
|
||
" title=\"Convert this dataframe to an interactive table.\"\n",
|
||
" style=\"display:none;\">\n",
|
||
"\n",
|
||
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
|
||
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
|
||
" </svg>\n",
|
||
" </button>\n",
|
||
"\n",
|
||
" <style>\n",
|
||
" .colab-df-container {\n",
|
||
" display:flex;\n",
|
||
" gap: 12px;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-convert {\n",
|
||
" background-color: #E8F0FE;\n",
|
||
" border: none;\n",
|
||
" border-radius: 50%;\n",
|
||
" cursor: pointer;\n",
|
||
" display: none;\n",
|
||
" fill: #1967D2;\n",
|
||
" height: 32px;\n",
|
||
" padding: 0 0 0 0;\n",
|
||
" width: 32px;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-convert:hover {\n",
|
||
" background-color: #E2EBFA;\n",
|
||
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
|
||
" fill: #174EA6;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-buttons div {\n",
|
||
" margin-bottom: 4px;\n",
|
||
" }\n",
|
||
"\n",
|
||
" [theme=dark] .colab-df-convert {\n",
|
||
" background-color: #3B4455;\n",
|
||
" fill: #D2E3FC;\n",
|
||
" }\n",
|
||
"\n",
|
||
" [theme=dark] .colab-df-convert:hover {\n",
|
||
" background-color: #434B5C;\n",
|
||
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
|
||
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
|
||
" fill: #FFFFFF;\n",
|
||
" }\n",
|
||
" </style>\n",
|
||
"\n",
|
||
" <script>\n",
|
||
" const buttonEl =\n",
|
||
" document.querySelector('#df-c38e0eb4-9c39-45f5-aa32-dbd65ad89576 button.colab-df-convert');\n",
|
||
" buttonEl.style.display =\n",
|
||
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
|
||
"\n",
|
||
" async function convertToInteractive(key) {\n",
|
||
" const element = document.querySelector('#df-c38e0eb4-9c39-45f5-aa32-dbd65ad89576');\n",
|
||
" const dataTable =\n",
|
||
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
|
||
" [key], {});\n",
|
||
" if (!dataTable) return;\n",
|
||
"\n",
|
||
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
|
||
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
|
||
" + ' to learn more about interactive tables.';\n",
|
||
" element.innerHTML = '';\n",
|
||
" dataTable['output_type'] = 'display_data';\n",
|
||
" await google.colab.output.renderOutput(dataTable, element);\n",
|
||
" const docLink = document.createElement('div');\n",
|
||
" docLink.innerHTML = docLinkHtml;\n",
|
||
" element.appendChild(docLink);\n",
|
||
" }\n",
|
||
" </script>\n",
|
||
" </div>\n",
|
||
"\n",
|
||
"\n",
|
||
"<div id=\"df-dbb90c85-6dc6-4ec9-a4c5-ebcbdb0a5897\">\n",
|
||
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-dbb90c85-6dc6-4ec9-a4c5-ebcbdb0a5897')\"\n",
|
||
" title=\"Suggest charts\"\n",
|
||
" style=\"display:none;\">\n",
|
||
"\n",
|
||
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
|
||
" width=\"24px\">\n",
|
||
" <g>\n",
|
||
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
|
||
" </g>\n",
|
||
"</svg>\n",
|
||
" </button>\n",
|
||
"\n",
|
||
"<style>\n",
|
||
" .colab-df-quickchart {\n",
|
||
" --bg-color: #E8F0FE;\n",
|
||
" --fill-color: #1967D2;\n",
|
||
" --hover-bg-color: #E2EBFA;\n",
|
||
" --hover-fill-color: #174EA6;\n",
|
||
" --disabled-fill-color: #AAA;\n",
|
||
" --disabled-bg-color: #DDD;\n",
|
||
" }\n",
|
||
"\n",
|
||
" [theme=dark] .colab-df-quickchart {\n",
|
||
" --bg-color: #3B4455;\n",
|
||
" --fill-color: #D2E3FC;\n",
|
||
" --hover-bg-color: #434B5C;\n",
|
||
" --hover-fill-color: #FFFFFF;\n",
|
||
" --disabled-bg-color: #3B4455;\n",
|
||
" --disabled-fill-color: #666;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-quickchart {\n",
|
||
" background-color: var(--bg-color);\n",
|
||
" border: none;\n",
|
||
" border-radius: 50%;\n",
|
||
" cursor: pointer;\n",
|
||
" display: none;\n",
|
||
" fill: var(--fill-color);\n",
|
||
" height: 32px;\n",
|
||
" padding: 0;\n",
|
||
" width: 32px;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-quickchart:hover {\n",
|
||
" background-color: var(--hover-bg-color);\n",
|
||
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
|
||
" fill: var(--button-hover-fill-color);\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-quickchart-complete:disabled,\n",
|
||
" .colab-df-quickchart-complete:disabled:hover {\n",
|
||
" background-color: var(--disabled-bg-color);\n",
|
||
" fill: var(--disabled-fill-color);\n",
|
||
" box-shadow: none;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-spinner {\n",
|
||
" border: 2px solid var(--fill-color);\n",
|
||
" border-color: transparent;\n",
|
||
" border-bottom-color: var(--fill-color);\n",
|
||
" animation:\n",
|
||
" spin 1s steps(1) infinite;\n",
|
||
" }\n",
|
||
"\n",
|
||
" @keyframes spin {\n",
|
||
" 0% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-bottom-color: var(--fill-color);\n",
|
||
" border-left-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" 20% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-left-color: var(--fill-color);\n",
|
||
" border-top-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" 30% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-left-color: var(--fill-color);\n",
|
||
" border-top-color: var(--fill-color);\n",
|
||
" border-right-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" 40% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-right-color: var(--fill-color);\n",
|
||
" border-top-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" 60% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-right-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" 80% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-right-color: var(--fill-color);\n",
|
||
" border-bottom-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" 90% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-bottom-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"\n",
|
||
" <script>\n",
|
||
" async function quickchart(key) {\n",
|
||
" const quickchartButtonEl =\n",
|
||
" document.querySelector('#' + key + ' button');\n",
|
||
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
|
||
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
|
||
" try {\n",
|
||
" const charts = await google.colab.kernel.invokeFunction(\n",
|
||
" 'suggestCharts', [key], {});\n",
|
||
" } catch (error) {\n",
|
||
" console.error('Error during call to suggestCharts:', error);\n",
|
||
" }\n",
|
||
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
|
||
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
|
||
" }\n",
|
||
" (() => {\n",
|
||
" let quickchartButtonEl =\n",
|
||
" document.querySelector('#df-dbb90c85-6dc6-4ec9-a4c5-ebcbdb0a5897 button');\n",
|
||
" quickchartButtonEl.style.display =\n",
|
||
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
|
||
" })();\n",
|
||
" </script>\n",
|
||
"</div>\n",
|
||
"\n",
|
||
" </div>\n",
|
||
" </div>\n"
|
||
],
|
||
"text/plain": [
|
||
" title artist\n",
|
||
"id \n",
|
||
"2849 Run To The Hills Iron Maiden\n",
|
||
"2640 Red Barchetta Rush\n",
|
||
"3167 Unchained Van Halen\n",
|
||
"5549 November Rain Guns N' Roses\n",
|
||
"2715 Rainbow In The Dark Dio"
|
||
]
|
||
},
|
||
"execution_count": 31,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"print_recommendations(2172)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 310
|
||
},
|
||
"executionInfo": {
|
||
"elapsed": 316,
|
||
"status": "ok",
|
||
"timestamp": 1719642205517,
|
||
"user": {
|
||
"displayName": "Maarten Grootendorst",
|
||
"userId": "11015108362723620659"
|
||
},
|
||
"user_tz": -120
|
||
},
|
||
"id": "TIHiN62g1NMi",
|
||
"outputId": "c548f528-6e2e-4a46-89e0-6599395d6419"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"title California Love (w\\/ Dr. Dre & Roger Troutman)\n",
|
||
"artist 2Pac\n",
|
||
"Name: 842 , dtype: object\n",
|
||
"['5668' '413' '5661' '330' '886']\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"application/vnd.google.colaboratory.intrinsic+json": {
|
||
"summary": "{\n \"name\": \"print_recommendations(842)\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"id\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"413 \",\n \"886 \",\n \"5661 \"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"If I Ruled The World (Imagine That) (w\\\\/ Lauryn Hill)\",\n \"Heartless\",\n \"Sweet Dreams\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"artist\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"Nas\",\n \"Kanye West\",\n \"The Game\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}",
|
||
"type": "dataframe"
|
||
},
|
||
"text/html": [
|
||
"\n",
|
||
" <div id=\"df-1afa899d-2db1-434a-a095-9b7ade3d2589\" class=\"colab-df-container\">\n",
|
||
" <div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>title</th>\n",
|
||
" <th>artist</th>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>id</th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>5668</th>\n",
|
||
" <td>How We Do (w\\/ 50 Cent)</td>\n",
|
||
" <td>The Game</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>413</th>\n",
|
||
" <td>If I Ruled The World (Imagine That) (w\\/ Laury...</td>\n",
|
||
" <td>Nas</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5661</th>\n",
|
||
" <td>Sweet Dreams</td>\n",
|
||
" <td>Beyonce</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>330</th>\n",
|
||
" <td>Hate It Or Love It (w\\/ 50 Cent)</td>\n",
|
||
" <td>The Game</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>886</th>\n",
|
||
" <td>Heartless</td>\n",
|
||
" <td>Kanye West</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>\n",
|
||
" <div class=\"colab-df-buttons\">\n",
|
||
"\n",
|
||
" <div class=\"colab-df-container\">\n",
|
||
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-1afa899d-2db1-434a-a095-9b7ade3d2589')\"\n",
|
||
" title=\"Convert this dataframe to an interactive table.\"\n",
|
||
" style=\"display:none;\">\n",
|
||
"\n",
|
||
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
|
||
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
|
||
" </svg>\n",
|
||
" </button>\n",
|
||
"\n",
|
||
" <style>\n",
|
||
" .colab-df-container {\n",
|
||
" display:flex;\n",
|
||
" gap: 12px;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-convert {\n",
|
||
" background-color: #E8F0FE;\n",
|
||
" border: none;\n",
|
||
" border-radius: 50%;\n",
|
||
" cursor: pointer;\n",
|
||
" display: none;\n",
|
||
" fill: #1967D2;\n",
|
||
" height: 32px;\n",
|
||
" padding: 0 0 0 0;\n",
|
||
" width: 32px;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-convert:hover {\n",
|
||
" background-color: #E2EBFA;\n",
|
||
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
|
||
" fill: #174EA6;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-buttons div {\n",
|
||
" margin-bottom: 4px;\n",
|
||
" }\n",
|
||
"\n",
|
||
" [theme=dark] .colab-df-convert {\n",
|
||
" background-color: #3B4455;\n",
|
||
" fill: #D2E3FC;\n",
|
||
" }\n",
|
||
"\n",
|
||
" [theme=dark] .colab-df-convert:hover {\n",
|
||
" background-color: #434B5C;\n",
|
||
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
|
||
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
|
||
" fill: #FFFFFF;\n",
|
||
" }\n",
|
||
" </style>\n",
|
||
"\n",
|
||
" <script>\n",
|
||
" const buttonEl =\n",
|
||
" document.querySelector('#df-1afa899d-2db1-434a-a095-9b7ade3d2589 button.colab-df-convert');\n",
|
||
" buttonEl.style.display =\n",
|
||
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
|
||
"\n",
|
||
" async function convertToInteractive(key) {\n",
|
||
" const element = document.querySelector('#df-1afa899d-2db1-434a-a095-9b7ade3d2589');\n",
|
||
" const dataTable =\n",
|
||
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
|
||
" [key], {});\n",
|
||
" if (!dataTable) return;\n",
|
||
"\n",
|
||
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
|
||
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
|
||
" + ' to learn more about interactive tables.';\n",
|
||
" element.innerHTML = '';\n",
|
||
" dataTable['output_type'] = 'display_data';\n",
|
||
" await google.colab.output.renderOutput(dataTable, element);\n",
|
||
" const docLink = document.createElement('div');\n",
|
||
" docLink.innerHTML = docLinkHtml;\n",
|
||
" element.appendChild(docLink);\n",
|
||
" }\n",
|
||
" </script>\n",
|
||
" </div>\n",
|
||
"\n",
|
||
"\n",
|
||
"<div id=\"df-a8ceaf3a-b291-4c01-adfc-895ceccda974\">\n",
|
||
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-a8ceaf3a-b291-4c01-adfc-895ceccda974')\"\n",
|
||
" title=\"Suggest charts\"\n",
|
||
" style=\"display:none;\">\n",
|
||
"\n",
|
||
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
|
||
" width=\"24px\">\n",
|
||
" <g>\n",
|
||
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
|
||
" </g>\n",
|
||
"</svg>\n",
|
||
" </button>\n",
|
||
"\n",
|
||
"<style>\n",
|
||
" .colab-df-quickchart {\n",
|
||
" --bg-color: #E8F0FE;\n",
|
||
" --fill-color: #1967D2;\n",
|
||
" --hover-bg-color: #E2EBFA;\n",
|
||
" --hover-fill-color: #174EA6;\n",
|
||
" --disabled-fill-color: #AAA;\n",
|
||
" --disabled-bg-color: #DDD;\n",
|
||
" }\n",
|
||
"\n",
|
||
" [theme=dark] .colab-df-quickchart {\n",
|
||
" --bg-color: #3B4455;\n",
|
||
" --fill-color: #D2E3FC;\n",
|
||
" --hover-bg-color: #434B5C;\n",
|
||
" --hover-fill-color: #FFFFFF;\n",
|
||
" --disabled-bg-color: #3B4455;\n",
|
||
" --disabled-fill-color: #666;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-quickchart {\n",
|
||
" background-color: var(--bg-color);\n",
|
||
" border: none;\n",
|
||
" border-radius: 50%;\n",
|
||
" cursor: pointer;\n",
|
||
" display: none;\n",
|
||
" fill: var(--fill-color);\n",
|
||
" height: 32px;\n",
|
||
" padding: 0;\n",
|
||
" width: 32px;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-quickchart:hover {\n",
|
||
" background-color: var(--hover-bg-color);\n",
|
||
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
|
||
" fill: var(--button-hover-fill-color);\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-quickchart-complete:disabled,\n",
|
||
" .colab-df-quickchart-complete:disabled:hover {\n",
|
||
" background-color: var(--disabled-bg-color);\n",
|
||
" fill: var(--disabled-fill-color);\n",
|
||
" box-shadow: none;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .colab-df-spinner {\n",
|
||
" border: 2px solid var(--fill-color);\n",
|
||
" border-color: transparent;\n",
|
||
" border-bottom-color: var(--fill-color);\n",
|
||
" animation:\n",
|
||
" spin 1s steps(1) infinite;\n",
|
||
" }\n",
|
||
"\n",
|
||
" @keyframes spin {\n",
|
||
" 0% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-bottom-color: var(--fill-color);\n",
|
||
" border-left-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" 20% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-left-color: var(--fill-color);\n",
|
||
" border-top-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" 30% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-left-color: var(--fill-color);\n",
|
||
" border-top-color: var(--fill-color);\n",
|
||
" border-right-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" 40% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-right-color: var(--fill-color);\n",
|
||
" border-top-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" 60% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-right-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" 80% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-right-color: var(--fill-color);\n",
|
||
" border-bottom-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" 90% {\n",
|
||
" border-color: transparent;\n",
|
||
" border-bottom-color: var(--fill-color);\n",
|
||
" }\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"\n",
|
||
" <script>\n",
|
||
" async function quickchart(key) {\n",
|
||
" const quickchartButtonEl =\n",
|
||
" document.querySelector('#' + key + ' button');\n",
|
||
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
|
||
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
|
||
" try {\n",
|
||
" const charts = await google.colab.kernel.invokeFunction(\n",
|
||
" 'suggestCharts', [key], {});\n",
|
||
" } catch (error) {\n",
|
||
" console.error('Error during call to suggestCharts:', error);\n",
|
||
" }\n",
|
||
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
|
||
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
|
||
" }\n",
|
||
" (() => {\n",
|
||
" let quickchartButtonEl =\n",
|
||
" document.querySelector('#df-a8ceaf3a-b291-4c01-adfc-895ceccda974 button');\n",
|
||
" quickchartButtonEl.style.display =\n",
|
||
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
|
||
" })();\n",
|
||
" </script>\n",
|
||
"</div>\n",
|
||
"\n",
|
||
" </div>\n",
|
||
" </div>\n"
|
||
],
|
||
"text/plain": [
|
||
" title artist\n",
|
||
"id \n",
|
||
"5668 How We Do (w\\/ 50 Cent) The Game\n",
|
||
"413 If I Ruled The World (Imagine That) (w\\/ Laury... Nas\n",
|
||
"5661 Sweet Dreams Beyonce\n",
|
||
"330 Hate It Or Love It (w\\/ 50 Cent) The Game\n",
|
||
"886 Heartless Kanye West"
|
||
]
|
||
},
|
||
"execution_count": 32,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"print_recommendations(842)"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"accelerator": "GPU",
|
||
"colab": {
|
||
"gpuType": "T4",
|
||
"provenance": []
|
||
},
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.12.9"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 4
|
||
}
|