Showing posts with label bert. Show all posts

6/29/2023

text summarization datasets

**Paper:**

https://arxiv.org/abs/1908.08345


**Dataset:**

1) the CNN/DailyMail news highlights dataset: somewhat extractive

- News articles paired with highlights that provide a brief overview of each article

- Input document: limited to 512 tokens

- https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail


2) the New York Times Annotated Corpus (NYT): somewhat extractive

- Contains 110,540 articles with abstract summaries

- Input document: limited to 800 tokens

- https://research.google/resources/datasets/ny-times-annotated-corpus/


3) XSum: Abstractive

- 226,711 news articles, each paired with a one-sentence summary answering the question ‘What is this article about?’

- Input document: limited to 512 tokens

- https://github.com/google-research-datasets/xsum_hallucination_annotations
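All three datasets cap the input length before the document reaches the model. As a minimal illustration only (not the paper's actual preprocessing, which uses the model's subword tokenizer), truncation to a token budget can be sketched with whitespace tokens:

```python
def truncate_tokens(text, max_tokens=512):
    # Illustrative only: split on whitespace and keep the first max_tokens.
    # Real preprocessing would count subword tokens from the model's tokenizer.
    tokens = text.split()
    return tokens[:max_tokens]

doc = "word " * 1000
print(len(truncate_tokens(doc)))  # prints 512
```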

5/09/2022

BERT Tokenizer, string to token, token to string

 

Tokenizer example: string to tokens, tokens to IDs, and tokens back to a string.

..

from transformers import AutoTokenizer

# The output below (ids starting with 0 and ending with 2, plus the 'Ġ'
# space marker) matches a RoBERTa-style BPE tokenizer; loading 'roberta-base'
# here is an assumption, since the original post does not name the checkpoint.
tokenizer = AutoTokenizer.from_pretrained('roberta-base')

text = "I am e/mail"
# text = "I am a e-mail"
tokens = tokenizer.tokenize(text)
print(f'Tokens: {tokens}')
print(f'Tokens length: {len(tokens)}')
encoding = tokenizer.encode(text)
print(f'Encoding: {encoding}')
print(f'Encoding length: {len(encoding)}')
tok_text = tokenizer.convert_tokens_to_string(tokens)
print(f'token to string: {tok_text}')

..

output:

Tokens: ['I', 'Ġam', 'Ġe', '/', 'mail']
Tokens length: 5
Encoding: [0, 100, 524, 364, 73, 6380, 2]
Encoding length: 7
token to string: I am e/mail
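The 'Ġ' prefix in the tokens above marks a token that begins with a space in GPT-2/RoBERTa-style BPE vocabularies. A simplified sketch of how such tokens are rejoined into a string (the real convert_tokens_to_string also performs byte-level decoding):

```python
def tokens_to_string(tokens):
    # 'Ġ' marks a leading space; replace it with a space and join the pieces.
    return "".join(t.replace("Ġ", " ") for t in tokens).lstrip()

print(tokens_to_string(['I', 'Ġam', 'Ġe', '/', 'mail']))  # prints: I am e/mail
```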

--
Thank you.
www.marearts.com

6/09/2020

sentence embedding, sentence to vector using bert

Refer to the source code below.

.
#pip install -U sentence-transformers
#https://github.com/UKPLab/sentence-transformers
from sentence_transformers import SentenceTransformer

# Load a sentence model (based on BERT)
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Embed a list of sentences
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

# The result is a list of sentence embeddings as numpy arrays
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding.shape, type(embedding))
    print("")
.

The result looks like this:
Sentence: This framework generates embeddings for each input sentence
Embedding: (768,) <class 'numpy.ndarray'>

Sentence: Sentences are passed as a list of string.
Embedding: (768,) <class 'numpy.ndarray'>

Sentence: The quick brown fox jumps over the lazy dog.
Embedding: (768,) <class 'numpy.ndarray'>
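A common next step with these 768-dimensional vectors is comparing sentences by cosine similarity. A minimal sketch in plain Python (the short vectors below are stand-ins for the real embeddings):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical directions give similarity 1.0; orthogonal directions give 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # prints 1.0
```

In practice you would pass two rows of `sentence_embeddings` to this function (or use a vectorized numpy version for speed).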