ner-study/Processing.ipynb

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "89761690",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'C:\\\\Users\\\\Monoid\\\\anaconda3\\\\envs\\\\nn\\\\python.exe'"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import sys\n",
    "sys.executable"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2303b263",
   "metadata": {},
   "source": [
    "First, check which Python environment is running."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9d4f2a3",
   "metadata": {},
   "source": [
    "Personally, the hardest part of this was setting up the environment. I installed with `conda install`, but transformers 2.1.1 was installed instead of version 4.0.0. Could this be related to [Issue : Support transformers > 2.1.1 on Windows](https://github.com/conda-forge/transformers-feedstock/issues/16)? Anyway, to work around it I installed transformers first and then pytorch. It runs on Python 3.7."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f3ab8b44",
   "metadata": {},
   "outputs": [],
   "source": [
    "from read_data import readKoreanDataAll"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a5deb97e",
   "metadata": {},
   "outputs": [],
   "source": [
    "train, dev, test = readKoreanDataAll()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3b5cede",
   "metadata": {},
   "source": [
    "Load the data. The train, dev, and test splits are each held in a plain list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "13165c8c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Sentence(word=['특히', '김병현', '은', '4', '회', '말', '에', '무', '기력', '하', '게', '6', '실점', '하', '면서'], pos=['MAG', 'NNP', 'JX', 'SN', 'NNB', 'NNG', 'JKB', 'XPN', 'NNG', 'XSA', 'EC', 'SN', 'NNG', 'XSV', 'EC'], namedEntity=['B', 'B', 'I', 'B', 'I', 'I', 'I', 'B', 'I', 'I', 'I', 'B', 'I', 'I', 'I'], detail=['O', 'B-PS', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2baf1e3",
   "metadata": {},
   "source": [
    "The item at index 0 looks like the output above.\n",
    "Each sentence is split into `word`, `pos`, `namedEntity`, and `detail`."
   ]
  },
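  {
   "cell_type": "markdown",
   "id": "a0c1e5d1",
   "metadata": {},
   "source": [
    "The `Sentence` type is defined in `read_data`, which is not shown in this notebook. Judging only from the repr above, it behaves like a named tuple holding four parallel lists. The next cell is a hypothetical sketch of such a type (`SentenceSketch` is a made-up name, not the actual class) and shows how the parallel lists can be walked together."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a0c1e5d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List, NamedTuple\n",
    "\n",
    "\n",
    "class SentenceSketch(NamedTuple):\n",
    "    # Hypothetical stand-in for the Sentence type; field names are copied from the repr above.\n",
    "    word: List[str]         # morphemes\n",
    "    pos: List[str]          # part-of-speech tag per morpheme\n",
    "    namedEntity: List[str]  # 'B' or 'I' per morpheme\n",
    "    detail: List[str]       # label per morpheme, e.g. 'B-PS' or 'O'\n",
    "\n",
    "\n",
    "# First three entries of train[0], copied from the output above.\n",
    "sample = SentenceSketch(\n",
    "    word=['특히', '김병현', '은'],\n",
    "    pos=['MAG', 'NNP', 'JX'],\n",
    "    namedEntity=['B', 'B', 'I'],\n",
    "    detail=['O', 'B-PS', 'O'],\n",
    ")\n",
    "\n",
    "# The four lists are parallel, so zip pairs each morpheme with its tags.\n",
    "for w, p, ne, d in zip(sample.word, sample.pos, sample.namedEntity, sample.detail):\n",
    "    print(w, p, ne, d)"
   ]
  },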
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "3ee47f72",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "특히 김병현 은 4 회 말 에 무 기력 하 게 6 실점 하 면서\n"
     ]
    }
   ],
   "source": [
    "sentence0 = \" \".join(train[0].word)\n",
    "print(sentence0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d7bf4a99",
   "metadata": {},
   "source": [
    "Joined with spaces, a single sentence looks like this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "2939cd62",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "da89bd0e",
   "metadata": {},
   "outputs": [],
   "source": [
    "longest_sentence_index = np.argmax([len(lst.word) for lst in train])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "3e4006b1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "245"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(train[longest_sentence_index].word)"
   ]
  },
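  {
   "cell_type": "markdown",
   "id": "d54d0558",
   "metadata": {},
   "source": [
    "The longest sentence has 245 morphemes. BERT accepts at most 512 tokens, so this looks safe, but 245 counts morphemes rather than WordPiece tokens; one morpheme can become several tokens, so the real token count should be checked after tokenization."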
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "f6460490",
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import BertTokenizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "c3506c83",
   "metadata": {},
   "outputs": [],
   "source": [
    "PRETRAINED_MODEL_NAME = 'bert-base-multilingual-cased'\n",
    "tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f3045a1",
   "metadata": {},
   "source": [
    "The code above loads the tokenizer. Now let's try it out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "c7b02470",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['특히', '김', '##병', '##현', '은', '4', '회', '말', '에', '무', '기', '##력', '하', '게', '6', '실', '##점', '하', '면', '##서']\n"
     ]
    }
   ],
   "source": [
    "morph_to_tokens = tokenizer.tokenize(sentence0)\n",
    "print(morph_to_tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78ba07a0",
   "metadata": {},
   "source": [
    "It works: morphemes such as 김병현 are split into WordPiece subwords, with continuation pieces marked by `##`."
   ]
  },
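  {
   "cell_type": "markdown",
   "id": "a0c1e5d3",
   "metadata": {},
   "source": [
    "Because one morpheme can become several subwords, the 15 per-morpheme labels in `detail` no longer line up one-to-one with the 20 tokens above. This notebook does not cover alignment; the next cell is only a sketch of one common approach, under stated assumptions: each morpheme's label goes on its first subword, and continuation subwords get a placeholder label. `KNOWN_SPLITS`, `toy_tokenize`, `align_labels`, and the `'X'` placeholder are hypothetical names. The splits are copied from the output above so the cell runs without the real tokenizer; with the real one, `tokenizer.tokenize` could be passed as the `tokenize` argument instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a0c1e5d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Callable, List, Tuple\n",
    "\n",
    "# Splits copied from the tokenizer output above; every other morpheme stays whole.\n",
    "KNOWN_SPLITS = {\n",
    "    '김병현': ['김', '##병', '##현'],\n",
    "    '기력': ['기', '##력'],\n",
    "    '실점': ['실', '##점'],\n",
    "    '면서': ['면', '##서'],\n",
    "}\n",
    "\n",
    "\n",
    "def toy_tokenize(morpheme: str) -> List[str]:\n",
    "    # Hypothetical stand-in for tokenizing a single morpheme.\n",
    "    return KNOWN_SPLITS.get(morpheme, [morpheme])\n",
    "\n",
    "\n",
    "def align_labels(words: List[str], labels: List[str],\n",
    "                 tokenize: Callable[[str], List[str]],\n",
    "                 continuation: str = 'X') -> Tuple[List[str], List[str]]:\n",
    "    tokens, aligned = [], []\n",
    "    for word, label in zip(words, labels):\n",
    "        pieces = tokenize(word)\n",
    "        tokens.extend(pieces)\n",
    "        # The first subword keeps the label; the rest get the placeholder.\n",
    "        aligned.extend([label] + [continuation] * (len(pieces) - 1))\n",
    "    return tokens, aligned\n",
    "\n",
    "\n",
    "# Morphemes and labels of train[0], copied from the output above.\n",
    "sketch_words = ['특히', '김병현', '은', '4', '회', '말', '에', '무', '기력', '하', '게', '6', '실점', '하', '면서']\n",
    "sketch_detail = ['O', 'B-PS'] + ['O'] * 13\n",
    "\n",
    "sketch_tokens, sketch_labels = align_labels(sketch_words, sketch_detail, toy_tokenize)\n",
    "assert len(sketch_tokens) == len(sketch_labels) == 20\n",
    "print(list(zip(sketch_tokens, sketch_labels)))"
   ]
  },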
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "0dbc6dc5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[39671,\n",
       " 8935,\n",
       " 73380,\n",
       " 30842,\n",
       " 9632,\n",
       " 125,\n",
       " 9998,\n",
       " 9251,\n",
       " 9559,\n",
       " 9294,\n",
       " 8932,\n",
       " 28143,\n",
       " 9952,\n",
       " 8872,\n",
       " 127,\n",
       " 9489,\n",
       " 34907,\n",
       " 9952,\n",
       " 9279,\n",
       " 12424]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inputs = tokenizer.convert_tokens_to_ids(morph_to_tokens)\n",
    "inputs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6bd8ca5b",
   "metadata": {},
   "source": [
    "Token IDs can be obtained from the tokens like this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "d7f74b0c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "torch.Size([1, 22])\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'input_ids': tensor([[  101, 39671,  8935, 73380, 30842,  9632,   125,  9998,  9251,  9559,\n",
       "          9294,  8932, 28143,  9952,  8872,   127,  9489, 34907,  9952,  9279,\n",
       "         12424,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inputs = tokenizer(sentence0, return_tensors='pt')\n",
    "print(inputs['input_ids'].size())\n",
    "inputs"
   ]
  },
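  {
   "cell_type": "markdown",
   "id": "1a840377",
   "metadata": {},
   "source": [
    "Alternatively, the IDs can be obtained by calling the tokenizer directly. The difference is that the \\[CLS\\] token (101) and the \\[SEP\\] token (102) are inserted automatically. It also builds the attention_mask and returns tensors."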
   ]
  },
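  {
   "cell_type": "markdown",
   "id": "a0c1e5d5",
   "metadata": {},
   "source": [
    "To make the difference concrete, the next cell rebuilds the encoded output by hand from the ID list shown earlier: it wraps the IDs in 101 (`[CLS]`) and 102 (`[SEP]`) and creates an all-ones `attention_mask` and all-zeros `token_type_ids`. It is an illustrative sketch with plain Python lists, not what the library does internally, and the variable names are made up."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a0c1e5d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# IDs copied from the convert_tokens_to_ids output above.\n",
    "plain_ids = [39671, 8935, 73380, 30842, 9632, 125, 9998, 9251, 9559, 9294,\n",
    "             8932, 28143, 9952, 8872, 127, 9489, 34907, 9952, 9279, 12424]\n",
    "\n",
    "CLS_ID, SEP_ID = 101, 102  # values seen in the encoded output above\n",
    "\n",
    "manual = {\n",
    "    'input_ids': [CLS_ID] + plain_ids + [SEP_ID],\n",
    "    'token_type_ids': [0] * (len(plain_ids) + 2),  # single sentence: all segment 0\n",
    "    'attention_mask': [1] * (len(plain_ids) + 2),  # no padding: every position is real\n",
    "}\n",
    "\n",
    "# 20 tokens plus the two special tokens gives the 22 seen in torch.Size([1, 22]).\n",
    "assert len(manual['input_ids']) == 22\n",
    "print(manual)"
   ]
  },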
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "e7ab65d2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "torch.Size([1, 22])\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'input_ids': tensor([[  101, 39671,  8935, 73380, 30842,  9632,   125,  9998,  9251,  9559,\n",
       "          9294,  8932, 28143,  9952,  8872,   127,  9489, 34907,  9952,  9279,\n",
       "         12424,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inputs = tokenizer(sentence0, return_tensors='pt', padding='longest', truncation=True)\n",
    "print(inputs['input_ids'].size())\n",
    "inputs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e06a1a7",
   "metadata": {},
   "source": [
    "The tokenizer has a padding option and a truncation option. Truncation makes no difference unless the input exceeds 512 tokens.\n",
    "\n",
    "The padding option can be set as follows.\n",
    "\n",
    "padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:\n",
    "- True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence if provided).\n",
    "- 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.\n",
    "- False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).\n",
    "\n"
   ]
  },
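  {
   "cell_type": "markdown",
   "id": "a0c1e5d7",
   "metadata": {},
   "source": [
    "As a rough illustration of `padding='longest'`, the next cell pads a small batch of ID lists to the length of the longest one and builds the matching attention mask. It is a plain-Python sketch, assuming right-side padding and a pad ID of 0 (the `pad_token_id` reported in the model config); `pad_to_longest` is a hypothetical helper, not a library function. With a single sequence there is nothing to pad, which is why the size above stays at [1, 22]."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a0c1e5d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Dict, List\n",
    "\n",
    "\n",
    "def pad_to_longest(batch: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:\n",
    "    longest = max(len(ids) for ids in batch)\n",
    "    input_ids, attention_mask = [], []\n",
    "    for ids in batch:\n",
    "        n_pad = longest - len(ids)\n",
    "        input_ids.append(ids + [pad_id] * n_pad)             # pad on the right\n",
    "        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding\n",
    "    return {'input_ids': input_ids, 'attention_mask': attention_mask}\n",
    "\n",
    "\n",
    "# Two short ID sequences built from IDs that appear in the outputs above.\n",
    "toy_batch = [[101, 39671, 102], [101, 8935, 73380, 30842, 102]]\n",
    "padded = pad_to_longest(toy_batch)\n",
    "print(padded['input_ids'])       # [[101, 39671, 102, 0, 0], [101, 8935, 73380, 30842, 102]]\n",
    "print(padded['attention_mask'])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]"
   ]
  },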
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "74505ddd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['[CLS]',\n",
       " '특히',\n",
       " '김',\n",
       " '##병',\n",
       " '##현',\n",
       " '은',\n",
       " '4',\n",
       " '회',\n",
       " '말',\n",
       " '에',\n",
       " '무',\n",
       " '기',\n",
       " '##력',\n",
       " '하',\n",
       " '게',\n",
       " '6',\n",
       " '실',\n",
       " '##점',\n",
       " '하',\n",
       " '면',\n",
       " '##서',\n",
       " '[SEP]']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0bd66d52",
   "metadata": {},
   "source": [
    "To turn the IDs back into tokens, use `convert_ids_to_tokens` as above."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4a1d9f8",
   "metadata": {},
   "source": [
    "Now let's use BERT."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "3dba1967",
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import BertModel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "c762970c",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']\n",
      "- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
      "- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
     ]
    }
   ],
   "source": [
    "PRETRAINED_MODEL_NAME = 'bert-base-multilingual-cased'\n",
    "bert = BertModel.from_pretrained(PRETRAINED_MODEL_NAME)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22aea316",
   "metadata": {},
   "source": [
    "Load BERT. We use bert-base-multilingual-cased."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "2e9e3b82",
   "metadata": {},
   "outputs": [],
   "source": [
    "outputs = bert(**inputs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "6798d3d7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "odict_keys(['last_hidden_state', 'pooler_output'])"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "outputs.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "7b1f9a60",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([1, 22, 768])"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "outputs['last_hidden_state'].size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "14315cd6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([1, 768])"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "outputs['pooler_output'].size()"
   ]
  },
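  {
   "cell_type": "markdown",
   "id": "ef641edf",
   "metadata": {},
   "source": [
    "The representation is 768-dimensional. `last_hidden_state` has shape (1, 22, 768): one 768-dimensional vector for each of the 22 tokens, including `[CLS]` and `[SEP]`. `pooler_output` has shape (1, 768): a single vector for the whole input."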
   ]
  },
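  {
   "cell_type": "markdown",
   "id": "a0c1e5d9",
   "metadata": {},
   "source": [
    "The next cell sketches how a tensor of this shape is typically sliced. It uses a random tensor in place of the real `last_hidden_state`, so it does not depend on the model. The names `fake_hidden`, `cls_vector`, and `token_vectors` are made up for illustration. Note that `pooler_output` is not simply the raw first-token vector: in BERT it is that vector passed through an extra dense layer with a tanh activation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a0c1e5da",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "# Random stand-in with the same shape as outputs['last_hidden_state'] above.\n",
    "fake_hidden = torch.randn(1, 22, 768)  # (batch, tokens, hidden_size)\n",
    "\n",
    "cls_vector = fake_hidden[:, 0]        # first position ([CLS]) -> (1, 768)\n",
    "token_vectors = fake_hidden[0, 1:-1]  # drop [CLS] and [SEP] -> (20, 768)\n",
    "\n",
    "print(cls_vector.size())     # torch.Size([1, 768])\n",
    "print(token_vectors.size())  # torch.Size([20, 768])"
   ]
  },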
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "2b173e84",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BertConfig {\n",
       "  \"_name_or_path\": \"bert-base-multilingual-cased\",\n",
       "  \"architectures\": [\n",
       "    \"BertForMaskedLM\"\n",
       "  ],\n",
       "  \"attention_probs_dropout_prob\": 0.1,\n",
       "  \"classifier_dropout\": null,\n",
       "  \"directionality\": \"bidi\",\n",
       "  \"hidden_act\": \"gelu\",\n",
       "  \"hidden_dropout_prob\": 0.1,\n",
       "  \"hidden_size\": 768,\n",
       "  \"initializer_range\": 0.02,\n",
       "  \"intermediate_size\": 3072,\n",
       "  \"layer_norm_eps\": 1e-12,\n",
       "  \"max_position_embeddings\": 512,\n",
       "  \"model_type\": \"bert\",\n",
       "  \"num_attention_heads\": 12,\n",
       "  \"num_hidden_layers\": 12,\n",
       "  \"pad_token_id\": 0,\n",
       "  \"pooler_fc_size\": 768,\n",
       "  \"pooler_num_attention_heads\": 12,\n",
       "  \"pooler_num_fc_layers\": 3,\n",
       "  \"pooler_size_per_head\": 128,\n",
       "  \"pooler_type\": \"first_token_transform\",\n",
       "  \"position_embedding_type\": \"absolute\",\n",
       "  \"transformers_version\": \"4.16.2\",\n",
       "  \"type_vocab_size\": 2,\n",
       "  \"use_cache\": true,\n",
       "  \"vocab_size\": 119547\n",
       "}"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bert.config"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7a9944a",
   "metadata": {},
   "source": [
    "Let's take a quick look at the config before moving on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d48edc53",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
init 2022-02-13 17:34:03 +09:00