How to generate text using Large Language Models: A guide to custom text generation

Viraj Kadam
4 min read · Sep 19, 2024


Most of us know and use the text generation pipeline that ships with the Hugging Face Transformers library. In production environments, however, we often need to customise how text is generated. In this post, we will build our own text generation pipeline.

First, let us list the high-level steps involved in generating text from a prompt:
1. Pre-process the input prompt(s) and tokenize them with the tokenizer.
2. Pass the tokenized prompt, along with the required generation parameters, to the model for generation.
3. Decode the generated tokens from the model and post-process them to produce the final output.
We will write a method for each of these steps and then combine them in a single generate method.

Before we write the individual methods, we define a class that will hold them.

import gc
import torch
from torch.nn.attention import SDPBackend  # requires PyTorch 2.3+ for torch.nn.attention

class textGen_pipeline:
    def __init__(self, model, tokenizer):
        self.model = model.eval()          # inference mode only
        self.tokenizer = tokenizer
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.framework = 'pt'              # return PyTorch tensors from the tokenizer

The constructor takes the model and its corresponding tokenizer, puts the model in evaluation mode, and records the device the pipeline will run on.
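
For example, the model and tokenizer could be loaded as follows. This is a minimal sketch: the checkpoint name is just a placeholder, and any Hugging Face causal language model with a chat template should work.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,   # half precision to reduce memory use
).to('cuda' if torch.cuda.is_available() else 'cpu')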

Now let's define the methods that perform each of these tasks.

Pre-process and Tokenize

class Chat:
    """This class is intended to be used internally by the pipeline and not exposed to users.
    We wrap chats in this type because the rest of the pipeline code tends to assume that a
    list of messages is a batch of samples rather than messages in the same conversation."""

    def __init__(self, messages: list):
        for message in messages:
            if not ("role" in message and "content" in message):
                raise ValueError("When passing chat dicts as input, each dict must have a 'role' and 'content' key.")
        self.messages = messages
# preprocess is a method of textGen_pipeline
def preprocess(
    self,
    prompt,
    add_special_tokens=None,
    truncation=True,
    padding=None,
    max_length=None,
):
    # Only pass non-None tokenizer kwargs, so that we fall back to the tokenizer's defaults
    tokenizer_kwargs = {
        "add_special_tokens": add_special_tokens,
        "truncation": truncation,
        "padding": padding,
        "max_length": max_length,
    }
    tokenizer_kwargs = {key: value for key, value in tokenizer_kwargs.items() if value is not None}

    if isinstance(prompt, (Chat, list)):
        tokenizer_kwargs.pop("add_special_tokens", None)  # the chat template adds special tokens itself
        messages = prompt.messages if isinstance(prompt, Chat) else prompt
        inputs = self.tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            return_dict=True,
            return_tensors=self.framework,
            **tokenizer_kwargs,
        )
    else:
        inputs = self.tokenizer(prompt, return_tensors=self.framework, **tokenizer_kwargs)

    inputs["prompt"] = prompt
    return inputs

Here we tokenize the input, which can be a plain string, a list of chat messages, or a Chat object. The chat formats matter because they let the tokenizer apply the model's chat template, with its system and user roles.
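
As a sketch, these are the three kinds of prompts the method accepts (the message contents are purely illustrative):

# 1. A plain string prompt
plain_prompt = "Write a haiku about the sea."

# 2. A list of chat messages, routed through the tokenizer's chat template
chat_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about the sea."},
]

# 3. The same messages wrapped in the internal Chat helper
chat_prompt = Chat(chat_messages)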

Model Output

# _forward is a method of textGen_pipeline
def _forward(self,
             model_inputs,
             **generate_kwargs):
    input_ids = model_inputs["input_ids"].to(self.device)
    attention_mask = model_inputs.get("attention_mask", None)
    if attention_mask is not None:
        attention_mask = attention_mask.to(self.device)
    prompt_text = model_inputs.pop("prompt")  # not needed for generation itself

    with torch.no_grad():
        # Let PyTorch pick the fastest available scaled-dot-product-attention backend
        with torch.nn.attention.sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.MATH, SDPBackend.EFFICIENT_ATTENTION]):
            generated_sequence = self.model.generate(input_ids=input_ids,
                                                     attention_mask=attention_mask,
                                                     **generate_kwargs)

    del input_ids, attention_mask; gc.collect(); torch.cuda.empty_cache()
    return generated_sequence.detach().cpu()

In this block, we take the tokenized inputs, move them to the device, and pass them to the model together with the generation parameters. The no_grad context, the SDPA backend selection, and the explicit cleanup of GPU tensors keep inference fast and memory-efficient.
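
For reference, here are a few commonly used keyword arguments that can be forwarded to model.generate through generate_kwargs. The values shown are illustrative, not recommendations.

generate_kwargs = {
    "max_new_tokens": 256,                    # cap on how many new tokens are generated
    "do_sample": True,                        # sample instead of greedy decoding
    "temperature": 0.7,                       # softens the sampling distribution
    "top_p": 0.9,                             # nucleus sampling
    "pad_token_id": tokenizer.eos_token_id,   # avoids a warning for models without a pad token
}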

Decoding and post-processing model output

# decode_output is a method of textGen_pipeline
def decode_output(self,
                  sequence,
                  input_ids):
    # Decode the full generated sequence (prompt + continuation)
    text = self.tokenizer.decode(
        sequence,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )

    # Length of the decoded prompt, used to strip it from the output text
    prompt_length = len(
        self.tokenizer.decode(
            input_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True,
        )
    )

    return {"role": "assistant",
            "content": text[prompt_length:]}

Generate Text

Finally, we combine all the above methods in a generate method, which takes a prompt and generation kwargs (keyword arguments) as input and returns the generated text.

# generate is a method of textGen_pipeline
def generate(self,
             prompt,
             **generate_kwargs):
    model_inputs = self.preprocess(prompt=prompt)
    model_output = self._forward(model_inputs, **generate_kwargs)

    output = self.decode_output(model_output[0], model_inputs['input_ids'][0])
    del model_inputs, model_output; gc.collect(); torch.cuda.empty_cache()
    return output.get("content")

The full class looks like this:

import gc
import torch
from torch.nn.attention import SDPBackend  # requires PyTorch 2.3+

class textGen_pipeline:
    def __init__(self, model, tokenizer):
        self.model = model.eval()
        self.tokenizer = tokenizer
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.framework = 'pt'

    def preprocess(
        self,
        prompt,
        add_special_tokens=None,
        truncation=True,
        padding=None,
        max_length=None,
    ):
        # Only pass non-None tokenizer kwargs, so that we fall back to the tokenizer's defaults
        tokenizer_kwargs = {
            "add_special_tokens": add_special_tokens,
            "truncation": truncation,
            "padding": padding,
            "max_length": max_length,
        }
        tokenizer_kwargs = {key: value for key, value in tokenizer_kwargs.items() if value is not None}

        if isinstance(prompt, (Chat, list)):
            tokenizer_kwargs.pop("add_special_tokens", None)  # the chat template adds special tokens itself
            messages = prompt.messages if isinstance(prompt, Chat) else prompt
            inputs = self.tokenizer.apply_chat_template(
                messages,
                add_generation_prompt=True,
                return_dict=True,
                return_tensors=self.framework,
                **tokenizer_kwargs,
            )
        else:
            inputs = self.tokenizer(prompt, return_tensors=self.framework, **tokenizer_kwargs)

        inputs["prompt"] = prompt
        return inputs

    def _forward(self,
                 model_inputs,
                 **generate_kwargs):
        input_ids = model_inputs["input_ids"].to(self.device)
        attention_mask = model_inputs.get("attention_mask", None)
        if attention_mask is not None:
            attention_mask = attention_mask.to(self.device)
        prompt_text = model_inputs.pop("prompt")

        with torch.no_grad():
            with torch.nn.attention.sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.MATH, SDPBackend.EFFICIENT_ATTENTION]):
                generated_sequence = self.model.generate(input_ids=input_ids,
                                                         attention_mask=attention_mask,
                                                         **generate_kwargs)

        del input_ids, attention_mask; gc.collect(); torch.cuda.empty_cache()
        return generated_sequence.detach().cpu()

    def decode_output(self,
                      sequence,
                      input_ids):
        text = self.tokenizer.decode(
            sequence,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True,
        )

        prompt_length = len(
            self.tokenizer.decode(
                input_ids,
                skip_special_tokens=True,
                clean_up_tokenization_spaces=True,
            )
        )

        return {"role": "assistant",
                "content": text[prompt_length:]}

    def generate(self,
                 prompt,
                 **generate_kwargs):
        model_inputs = self.preprocess(prompt=prompt)
        model_output = self._forward(model_inputs, **generate_kwargs)

        output = self.decode_output(model_output[0], model_inputs['input_ids'][0])
        del model_inputs, model_output; gc.collect(); torch.cuda.empty_cache()
        return output.get("content")

To use the pipeline, we initialize it with a model and its tokenizer and call generate, for example:
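
# A minimal usage sketch, assuming `model` and `tokenizer` were loaded as shown earlier.
pipe = textGen_pipeline(model, tokenizer)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain beam search in two sentences."},
]

reply = pipe.generate(Chat(messages), max_new_tokens=128, do_sample=False)
print(reply)

The same call also works with a plain string prompt, in which case the text is tokenized directly without the chat template.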

