Efficiently Converting LayoutLMv3 OCR Model Output Logits with Strides and Split Tokens

Refer to the following code:


import torch
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

# Load the pre-trained model and its processor (image processor + tokenizer).
# The base checkpoint has an untrained classification head, so use a
# fine-tuned checkpoint here to get meaningful labels.
model_name = "microsoft/layoutlmv3-base"
processor = AutoProcessor.from_pretrained(model_name)  # runs OCR by default (requires pytesseract)
model = LayoutLMv3ForTokenClassification.from_pretrained(model_name)
model.eval()

# Load the image and preprocess it. Pages longer than max_length are split
# into windows; stride is the number of tokens consecutive windows share.
# The values 512 and 128 are example settings.
image = Image.open("example.png").convert("RGB")
inputs = processor(
    image,
    truncation=True,
    max_length=512,
    stride=128,
    padding="max_length",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    return_tensors="pt",
)

# With overflowing tokens the image is repeated once per window, so stack
# the copies into a single (num_windows, 3, H, W) tensor
pixel_values = torch.stack(list(inputs["pixel_values"]))

# Pass the preprocessed inputs through the model and obtain the output logits
with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"],
        bbox=inputs["bbox"],
        attention_mask=inputs["attention_mask"],
        pixel_values=pixel_values,
    )

# Convert the logits to probabilities; shape (num_windows, seq_len, num_labels)
token_probs = torch.softmax(outputs.logits, dim=-1)

# Sum the probabilities of all tokens that belong to the same word, across
# all windows. word_ids() returns None for special and padding tokens.
word_probs = {}
word_boxes = {}
for window_idx in range(token_probs.shape[0]):
    for token_idx, word_id in enumerate(inputs.word_ids(batch_index=window_idx)):
        if word_id is None:
            continue
        word_probs[word_id] = word_probs.get(word_id, 0) + token_probs[window_idx, token_idx]
        word_boxes[word_id] = inputs["bbox"][window_idx, token_idx].tolist()

# Assign each word the label of the class with the highest summed probability
word_labels = []
for word_id in sorted(word_probs):
    word_label = word_probs[word_id].argmax().item()
    word_labels.append((word_id, word_boxes[word_id], model.config.id2label[word_label]))
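
For context on what a stride means when a long token sequence is split: the sketch below is a simplified illustration, not the tokenizer's actual implementation. The helper name `sliding_windows` is hypothetical, and the sketch ignores the special tokens that a real tokenizer adds to every window. It only shows that each window repeats the last `stride` tokens of the previous one.

```python
def sliding_windows(token_ids, max_length, stride):
    """Split token_ids into windows of at most max_length tokens.

    Each window starts with the last `stride` tokens of the previous
    window, so consecutive windows overlap by `stride` tokens.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride  # how far the window start advances
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the sequence
        start += step
    return windows


# Ten toy token ids, windows of four tokens, two tokens shared
windows = sliding_windows(list(range(10)), max_length=4, stride=2)
for window in windows:
    print(window)
```

Because tokens in the overlap appear in two windows, their predictions have to be merged (or one copy dropped) before labels are reported per word.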


To convert the output logits to a word-level format, you will need to perform several steps:

  1. Obtain the list of token ids and corresponding bounding boxes from the input that was passed to the LayoutLMv3 model.

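  2. If the page is too long for a single pass, split it into overlapping windows; the stride parameter sets how many tokens consecutive windows share, and each token keeps the bounding box of its word.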

  3. Use the attention_mask tensor to identify which output tokens correspond to valid input tokens, and select the corresponding output logits.

  4. If the input sequence was split into multiple windows (each window is one row of the batch), collect the output logits of every window in the original order.

  5. Convert the selected output logits to probabilities using the softmax function.

  6. Map tokens back to words, since one word may be split into several tokens, and assign each word the label of the class with the highest combined probability over its tokens.
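
Steps 3 to 6 can be sketched in plain Python on toy data. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper names `softmax` and `word_level_labels` are hypothetical, the logits are assumed to have been converted to nested lists (for example with `outputs.logits.tolist()`), and a per-window list of word indices is assumed to be available, with `None` marking special and padding tokens (fast tokenizers in transformers expose this through `word_ids()`). Averaging the token probabilities is one possible way to combine split tokens; taking only the first token of each word is another common choice.

```python
import math
from collections import defaultdict


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def word_level_labels(window_logits, window_word_ids):
    """Average token probabilities per word across all windows.

    window_logits:   one [seq_len][num_labels] nested list per window
    window_word_ids: one list per window; None marks special/padding tokens
    Returns {word_index: (label_id, mean_probability)}.
    """
    sums = {}
    counts = defaultdict(int)
    for logits, word_ids in zip(window_logits, window_word_ids):
        for token_logits, word_id in zip(logits, word_ids):
            if word_id is None:
                continue  # skip special and padding positions
            probs = softmax(token_logits)
            if word_id not in sums:
                sums[word_id] = [0.0] * len(probs)
            sums[word_id] = [s + p for s, p in zip(sums[word_id], probs)]
            counts[word_id] += 1
    result = {}
    for word_id in sorted(sums):
        mean = [s / counts[word_id] for s in sums[word_id]]
        label_id = max(range(len(mean)), key=mean.__getitem__)
        result[word_id] = (label_id, mean[label_id])
    return result


# Toy data: two windows, two labels. Word 1 is split into two tokens
# and also appears in the overlap between the two windows.
window_logits = [
    [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]],  # special, word 0, word 1, word 1
    [[0.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 0.0]],  # special, word 1, word 2, padding
]
window_word_ids = [
    [None, 0, 1, 1],
    [None, 1, 2, None],
]
labels = word_level_labels(window_logits, window_word_ids)
for word_id, (label_id, prob) in labels.items():
    print(word_id, label_id, round(prob, 3))
```

In this toy example word 1 receives three votes (two split tokens in the first window and one repeated token in the second), and all three are averaged into a single word-level prediction.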

Thank you.