View file src/colab/copie_de_distilbert_tf.py - Download

# -*- coding: utf-8 -*-
"""Copie de distilbert_tf

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1A_mUkj9V32-w6yIcNtVcVaEdPfmH8pDY

## Classifying text with DistilBERT and Tensorflow
"""

# Commented out IPython magic to ensure Python compatibility.
# %%capture
# !python3 -m venv venv
# !source venv/bin/activate
# !pip install tensorflow transformers

import tensorflow as tf
from tensorflow.keras import activations, optimizers, losses
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
import pickle

"""This notebook explores binary text classification using `TFDistilBertForSequenceClassification` from the 🤗Transformers library.

This notebook is broken up into 5 sections:
1. Preprocessing the data
2. Fine-tuning the model
3. Testing the model
4. Using the fine-tuned model to predict new samples
5. Saving and loading the model for future use

Let's start off by taking a look at our dataset. In this example we consider a small corpus of 10 Yelp reviews: 5 positive (class 1) and 5 negative (class 0). BERT (and its variants like DistilBERT) can be a [great tool to use when you have a shortage of training data](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html) ... that said, don't expect great results with just 10 reviews! Interchanging `x` and `y` with your own dataset is recommended 🙂
"""

x = [
     'Great customer service! The food was delicious! Definitely a come again.',
     'The VEGAN options are super fire!!! And the plates come in big portions. Very pleased with this spot, I\'ll definitely be ordering again',
     'Come on, this place is family owned and operated, they are super friendly, the tacos are bomb.',
     'This is such a great restaurant. Multiple times during days that we don\'t want to cook, we\'ve done takeout here and it\'s been amazing. It\'s fast and delicious.',
     'Staff is really nice. Food is way better than average. Good cost benefit.',
     'pricing for this, while relatively inexpensive for a Las Vegas attraction, is completely over the top.',
     'At such a *fine* institution, I find the lack of knowledge and respect for the art appalling',
     'If I could give one star I would...I walked out before my food arrived the customer service was horrible!',
     'Wow the slowest drive thru I\'ve ever been at WOWWWW. Horrible I won\'t be coming back here ever again',
     'Service: 1 out of 5 stars. They will mess up your order, not have it ready after 30 mins calling them before. Worst ran family business Ive ever seen.'
]

y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

"""# 1. Preprocessing the data

Since we are using the [DisitlBERT](https://huggingface.co/transformers/model_doc/distilbert.html) model, the first preprocessing step is to convert each review string in our dataset into a tuple containing the review's [input ids](https://huggingface.co/transformers/glossary.html#input-ids) and [attention masks](https://huggingface.co/transformers/glossary.html#attention-mask) (click the links to learn what these mean!). Luckily, these can easily be computed using the [DistilBertTokenizer](https://huggingface.co/transformers/model_doc/distilbert.html#distilberttokenizer) from 🤗Transformers.

Below is an example usage of this tokenizer on the first review. Here `MAX_LEN` specifies the length of each tokenized review. If a tokenized review is shorter than `MAX_LEN`, then it is padded with zeros. If a tokenized review is greater than `MAX_LEN`, then it is truncated (this is why we have `truncation=True` and `padding=True`).
"""

MODEL_NAME = 'distilbert-base-uncased'
MAX_LEN = 20

review = x[0]

tkzr = DistilBertTokenizer.from_pretrained(MODEL_NAME)

inputs = tkzr(review, max_length=MAX_LEN, truncation=True, padding=True)

print(f'review: \'{review}\'')
print(f'input ids: {inputs["input_ids"]}')
print(f'attention mask: {inputs["attention_mask"]}')

"""Now we must apply this transformation to each review in our corpus. To do this we define a function `construct_encodings`, which maps the tokenizer to each review and aggregates them in `encodings`:"""

def construct_encodings(x, tkzr, max_len, trucation=True, padding=True):
    return tkzr(x, max_length=max_len, truncation=trucation, padding=padding)

encodings = construct_encodings(x, tkzr, max_len=MAX_LEN)

"""The first stage of preprocessing is done! The second stage is converting our `encodings` and `y` (which holds the classes of the reviews) into a [Tensorflow Dataset object](https://www.tensorflow.org/api_docs/python/tf/data/Dataset). Below is a function to do this:


"""

def construct_tfdataset(encodings, y=None):
    if y:
        return tf.data.Dataset.from_tensor_slices((dict(encodings),y))
    else:
        # this case is used when making predictions on unseen samples after training
        return tf.data.Dataset.from_tensor_slices(dict(encodings))

tfdataset = construct_tfdataset(encodings, y)

"""The third and final preprocessing step is to create [training](https://developers.google.com/machine-learning/glossary#training-set) and [test sets](https://developers.google.com/machine-learning/glossary#test-set):"""

TEST_SPLIT = 0.2
BATCH_SIZE = 2

train_size = int(len(x) * (1-TEST_SPLIT))

tfdataset = tfdataset.shuffle(len(x))
tfdataset_train = tfdataset.take(train_size)
tfdataset_test = tfdataset.skip(train_size)

tfdataset_train = tfdataset_train.batch(BATCH_SIZE)
tfdataset_test = tfdataset_test.batch(BATCH_SIZE)

"""Our data is finally ready. Now we can do the fun part: model fitting! 🙌 🙌 🙌

# 2. Fine-tuning the model

Fine-tuning the model is as easy as instantiating a model instance, [optimizer](https://developers.google.com/machine-learning/glossary#optimizer), and [loss](https://developers.google.com/machine-learning/glossary#loss), and then compiling/fitting:
"""

# Commented out IPython magic to ensure Python compatibility.
# %%script echo Disabled
# N_EPOCHS = 2
# 
# model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME)
# optimizer = optimizers.Adam(learning_rate=3e-5)
# loss = losses.SparseCategoricalCrossentropy(from_logits=True)
# model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
# 
# model.fit(tfdataset_train, batch_size=BATCH_SIZE, epochs=N_EPOCHS)

N_EPOCHS = 2

model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME)
# Change the optimizer definition
optimizer_config = {
    'class_name': 'Adam',
    'config': {
        'learning_rate': 3e-5
    }
}
loss = losses.SparseCategoricalCrossentropy(from_logits=True)

# Pass the optimizer as a string identifier
model.compile(optimizer='adam', loss=loss, metrics=['accuracy']) # Removed optimizer=optimizer

model.fit(tfdataset_train, batch_size=BATCH_SIZE, epochs=N_EPOCHS)

"""## 3. Testing the model

Now we can use our test set to evaluate the performance of the model. Don't expect great results from such a small training set! 😅
"""

benchmarks = model.evaluate(tfdataset_test, return_dict=True, batch_size=BATCH_SIZE)
print(benchmarks)

"""# 4. Using the fine-tuned model to predict new samples

Now that we've trained the model, we can use it to predict new samples. Below is a function whose output is a classifier we can use for predictions:
"""

def create_predictor(model, model_name, max_len):
  tkzr = DistilBertTokenizer.from_pretrained(model_name)
  def predict_proba(text):
      x = [text]

      encodings = construct_encodings(x, tkzr, max_len=max_len)
      tfdataset = construct_tfdataset(encodings)
      tfdataset = tfdataset.batch(1)

      preds = model.predict(tfdataset).logits
      preds = activations.softmax(tf.convert_to_tensor(preds)).numpy()
      return preds[0][0]

  return predict_proba

clf = create_predictor(model, MODEL_NAME, MAX_LEN)
print(clf('this restaurant has horrible food'))

"""## 5. Saving and loading the model for future use

We use the built-in [save](https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.save_pretrained) and [load](https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained) methods from 🤗Transformers to serialize the model and restore it for future use:

"""

model.save_pretrained('./model/clf')
with open('./model/info.pkl', 'wb') as f:
    pickle.dump((MODEL_NAME, MAX_LEN), f)

"""We save `MODEL_NAME` and `MAX_LEN` for tokenizing/formatting down the line. Below we see how to reload the model for future use:"""

new_model = TFDistilBertForSequenceClassification.from_pretrained('./model/clf')
model_name, max_len = pickle.load(open('./model/info.pkl', 'rb'))

clf = create_predictor(new_model, model_name, max_len)
print(clf('this restaurant has horrible food'))

"""Unsuprisingly, we get the same prediction as before!

## Congrats -- you just made a binary text classifier with 🤗Transformers! 🤜💥🤛
"""