6.6. Using BERT to Extract Fixed Feature Vectors (like ELMo)
In certain cases, rather than fine-tuning the entire pre-trained model end-to-end, it can be beneficial to obtain pre-trained contextual embeddings, which are fixed contextual representations of each input token generated from the hidden layers of the pre-trained model. This should also mitigate most of the out-of-memory issues.
As an example, we include the script extract_features.py, which can be used like this:
# Sentence A and Sentence B are separated by the ||| delimiter for sentence
# pair tasks like question answering and entailment.
# For single sentence inputs, put one sentence per line and DON'T use the
# delimiter.
echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt
python extract_features.py \
--input_file=/tmp/input.txt \
--output_file=/tmp/output.jsonl \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--layers=-1,-2,-3,-4 \
--max_seq_length=128 \
--batch_size=8
This will create a JSON file (one line per line of input) containing the BERT activations from each Transformer layer specified by layers (-1 is the final hidden layer of the Transformer, etc.).
Note that this script will produce very large output files (by default, around 15kb for every input token).
If you need to maintain alignment between the original and tokenized words (for projecting training labels), see the Tokenization section below.
Note: You may see a message like
Could not find trained model in model_dir: /tmp/tmpuB5g5c, running initialization to predict.
This message is expected; it just means that we are using the init_from_checkpoint() API rather than the saved model API. If you don't specify a checkpoint or specify an invalid checkpoint, this script will complain.
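To make the output format concrete, here is a minimal sketch (not part of the repository) of reading /tmp/output.jsonl and collecting the final-layer vector for each token. It assumes the JSON layout written by extract_features.py, where each line holds a "features" list with per-token "layers" entries keyed by "index" and "values":
import json
import numpy as np

with open("/tmp/output.jsonl") as f:
  for line in f:
    example = json.loads(line)
    tokens = []
    vectors = []
    for feature in example["features"]:
      tokens.append(feature["token"])
      # "layers" has one entry per value passed to --layers; index -1 is
      # the final hidden layer.
      top_layer = [l for l in feature["layers"] if l["index"] == -1][0]
      vectors.append(top_layer["values"])
    # Shape: [num_tokens, hidden_size] -- the fixed features for this line.
    embeddings = np.array(vectors)
    print(tokens, embeddings.shape)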
6.6.1. Tokenization
For sentence-level (or sentence-pair) tasks, tokenization is very simple. Just follow the example code in run_classifier.py and extract_features.py. The basic procedure for sentence-level tasks is as follows (a short sketch in code appears after the list):
1. Instantiate an instance of tokenizer = tokenization.FullTokenizer.
2. Tokenize the raw text with tokens = tokenizer.tokenize(raw_text).
3. Truncate to the maximum sequence length. (You can use up to 512, but you probably want to use shorter if possible for memory and speed reasons.)
4. Add the [CLS] and [SEP] tokens in the right place.
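Here is a minimal sketch of those four steps for a single sentence; vocab_file and raw_text are placeholders you would supply (vocab_file pointing at $BERT_BASE_DIR/vocab.txt):
import tokenization  # tokenization.py from this repository

# Placeholders: vocab_file points at $BERT_BASE_DIR/vocab.txt and raw_text
# is the sentence to encode.
tokenizer = tokenization.FullTokenizer(
    vocab_file=vocab_file, do_lower_case=True)

max_seq_length = 128
tokens = tokenizer.tokenize(raw_text)

# Truncate, leaving room for the [CLS] and [SEP] tokens.
if len(tokens) > max_seq_length - 2:
  tokens = tokens[0:(max_seq_length - 2)]

tokens = ["[CLS]"] + tokens + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(tokens)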
Word-level and span-level tasks (e.g., SQuAD and NER) are more complex, since you need to maintain alignment between your input text and output text so that you can project your training labels. SQuAD is a particularly complex example because the input labels are character-based, and SQuAD paragraphs are often longer than our maximum sequence length. See the code in run_squad.py for how we handle this.
Before we describe the general recipe for handling word-level tasks, it’s important to understand what exactly our tokenizer is doing. It has three main steps:
1. Text normalization: Convert all whitespace characters to spaces, and (for the Uncased model) lowercase the input and strip out accent markers. E.g., John Johanson's, → john johanson's,
2. Punctuation splitting: Split all punctuation characters on both sides (i.e., add whitespace around all punctuation characters). Punctuation characters are defined as (a) anything with a P* Unicode class, (b) any non-letter/number/space ASCII character (e.g., characters like $ which are technically not punctuation). E.g., john johanson's, → john johanson ' s ,
3. WordPiece tokenization: Apply whitespace tokenization to the output of the above procedure, and apply WordPiece tokenization to each token separately. (Our implementation is directly based on the one from tensor2tensor.) E.g., john johanson ' s , → john johan ##son ' s ,
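To see these stages individually, note that FullTokenizer in tokenization.py builds on a basic tokenizer (steps 1 and 2) and a WordPiece tokenizer (step 3); the sketch below assumes those sub-tokenizers are exposed as the basic_tokenizer and wordpiece_tokenizer attributes, and the exact word pieces depend on the vocabulary:
import tokenization

# vocab_file is a placeholder for $BERT_BASE_DIR/vocab.txt.
tokenizer = tokenization.FullTokenizer(
    vocab_file=vocab_file, do_lower_case=True)

text = "John Johanson's, "

# Steps 1 and 2: text normalization plus punctuation splitting.
basic = tokenizer.basic_tokenizer.tokenize(text)
# e.g. ["john", "johanson", "'", "s", ","]

# Step 3: WordPiece tokenization, applied to each token separately.
pieces = []
for token in basic:
  pieces.extend(tokenizer.wordpiece_tokenizer.tokenize(token))
# e.g. ["john", "johan", "##son", "'", "s", ","] with the uncased vocab

# Equivalent to calling tokenizer.tokenize(text) directly.
assert pieces == tokenizer.tokenize(text)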
The advantage of this scheme is that it is “compatible” with most existing English tokenizers. For example, imagine that you have a part-of-speech tagging task which looks like this:
Input: John Johanson 's house
Labels: NNP NNP POS NN
The tokenized output will look like this:
Tokens: john johan ##son ' s house
Crucially, this would be the same output as if the raw text were John Johanson's house (with no space before the 's).
If you have a pre-tokenized representation with word-level annotations, you can simply tokenize each input word independently, and deterministically maintain an original-to-tokenized alignment:
### Input
orig_tokens = ["John", "Johanson", "'s", "house"]
labels = ["NNP", "NNP", "POS", "NN"]
### Output
bert_tokens = []
# Token map will be an int -> int mapping between the `orig_tokens` index and
# the `bert_tokens` index.
orig_to_tok_map = []
tokenizer = tokenization.FullTokenizer(
    vocab_file=vocab_file, do_lower_case=True)

bert_tokens.append("[CLS]")
for orig_token in orig_tokens:
  orig_to_tok_map.append(len(bert_tokens))
  bert_tokens.extend(tokenizer.tokenize(orig_token))
bert_tokens.append("[SEP]")
# bert_tokens == ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
# orig_to_tok_map == [1, 2, 4, 6]
Now orig_to_tok_map can be used to project labels to the tokenized representation.
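For example, one simple way (a sketch, not taken from run_classifier.py or run_squad.py) to project the word-level labels onto the WordPiece tokens is to copy each label to the first piece of its word and give every other position a placeholder label:
# "X" is an arbitrary placeholder label for [CLS], [SEP], and non-initial
# word pieces; what you use there depends on your task.
bert_labels = ["X"] * len(bert_tokens)
for word_index, token_index in enumerate(orig_to_tok_map):
  bert_labels[token_index] = labels[word_index]

# bert_labels == ["X", "NNP", "NNP", "X", "POS", "X", "NN", "X"]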
There are common English tokenization schemes which will cause a slight mismatch between your input tokenization and how BERT was pre-trained. For example, if your input tokenization splits off contractions like do n't, this will cause a mismatch. If it is possible to do so, you should pre-process your data to convert these back to raw-looking text; but if it's not possible, this mismatch is likely not a big deal.
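For example, one simple (and admittedly approximate) way to undo PTB-style contraction splitting before running the BERT tokenizer is a handful of string substitutions; the patterns below are illustrative only and do not cover every case:
import re

def undo_ptb_contractions(text):
  """Approximately converts PTB-style splits back to raw-looking text."""
  text = re.sub(r" n't\b", "n't", text)                 # do n't  -> don't
  text = re.sub(r" '(s|re|ve|ll|d|m)\b", r"'\1", text)  # John 's -> John's
  return text

print(undo_ptb_contractions("I do n't like John 's house"))
# I don't like John's house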