Language Model API#

Language models (LMs) are models trained to predict the next token in a sequence. They are the models being evaluated in this project.

Two useful abstractions are defined in this module (a usage sketch follows the list):

  • ClassifierModel:

A model that classifies text by producing a logit for each of a fixed set of choices.

  • GenerativeModel:

A model that generates text: an LLM in the usual sense.
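A minimal sketch of how the two abstractions are typically consumed. The helper functions and prompts here are illustrative, not part of the library; only generate, classify, and choices come from the API documented below:

    from bocoel.models.lms import ClassifierModel, GenerativeModel


    def complete(lm: GenerativeModel, prompts: list[str]) -> list[str]:
        # Each response is a continuation of its prompt.
        return list(lm.generate(prompts))


    def pick_choice(clf: ClassifierModel, prompts: list[str]) -> list[str]:
        # classify returns logits of shape [len(prompts), len(choices)];
        # take the highest-scoring choice for each prompt.
        logits = clf.classify(prompts)
        return [clf.choices[i] for i in logits.argmax(axis=-1)]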

class bocoel.models.lms.HuggingfaceCausalLM(model_path: str, batch_size: int, device: str, add_sep_token: bool = False)[source]#

The Huggingface implementation of a causal language model. This is a wrapper around the Huggingface transformers library; the model is pulled from the Huggingface Hub if it is not available locally.

FIXME:

add_sep_token might cause Huggingface to fail with an index-out-of-range error. It is still unclear how this can occur, since [SEP] is a special token.

__init__(model_path: str, batch_size: int, device: str, add_sep_token: bool = False) None[source]#
Parameters:
  • model_path – The path to the model, or a Huggingface Hub model id.

  • batch_size – The batch size to use.

  • device – The device to use.

  • add_sep_token – Whether to add the [SEP] token.
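A hypothetical instantiation, assuming "gpt2" as an example Hub id (any causal LM checkpoint should work):

    from bocoel.models.lms import HuggingfaceCausalLM

    lm = HuggingfaceCausalLM(
        model_path="gpt2",    # pulled from the Huggingface Hub if not cached
        batch_size=8,         # prompts are processed 8 at a time
        device="cpu",         # or "cuda" for GPU inference
        add_sep_token=False,  # see the FIXME above before enabling
    )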

class bocoel.models.lms.HuggingfaceGenerativeLM(model_path: str, batch_size: int, device: str, add_sep_token: bool = False)[source]#

The generative model backed by huggingface’s transformers library.

Since Huggingface's tokenizer pads on the left for generation, padded inputs do not get the same positional embeddings as unpadded ones, so batched generation may not exactly match generating one prompt at a time. If identical results to one-by-one generation are required, use a batch size of 1.

__init__(model_path: str, batch_size: int, device: str, add_sep_token: bool = False) None[source]#
Parameters:
  • model_path – The path to the model, or a Huggingface Hub model id.

  • batch_size – The batch size to use.

  • device – The device to use.

  • add_sep_token – Whether to add the [SEP] token.

generate(prompts: Sequence[str], /) Sequence[str][source]#

Generate a response for each prompt. The returned sequence has the same length as prompts, and each response is a continuation of its prompt: the prompt is a prefix of the response.

Parameters:

prompts – The prompts to generate from.

Returns:

The generated responses, one per prompt.

Todo

Add logits.
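A sketch of generate in use; the model id and prompts are assumptions. batch_size=1 is chosen to sidestep the left-padding caveat noted above:

    from bocoel.models.lms import HuggingfaceGenerativeLM

    lm = HuggingfaceGenerativeLM("gpt2", batch_size=1, device="cpu")

    prompts = ["The capital of France is", "2 + 2 ="]
    responses = lm.generate(prompts)

    assert len(responses) == len(prompts)
    for prompt, response in zip(prompts, responses):
        # Each response is a continuation of its prompt.
        assert response.startswith(prompt)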

class bocoel.models.lms.HuggingfaceLogitsLM(model_path: str, batch_size: int, device: str, choices: Sequence[str], add_sep_token: bool = False)[source]#

Logits classification model backed by Huggingface's transformers library. The model uses the logits of the tokens corresponding to each choice as its output; for example, with choices = ['1', '2', '3', '4', '5'], the logits of those five tokens are produced for every input in the current batch.

__init__(model_path: str, batch_size: int, device: str, choices: Sequence[str], add_sep_token: bool = False) None[source]#
Parameters:
  • model_path – The path to the model, or a Huggingface Hub model id.

  • batch_size – The batch size to use.

  • device – The device to use.

  • choices – The choices to classify.

  • add_sep_token – Whether to add the [SEP] token.

property choices: Sequence[str]#

The choices for this language model.

Returns:

The choices for this language model.
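A sketch of logits-based classification, assuming HuggingfaceLogitsLM fulfills the ClassifierModel.classify contract documented below; the model id, prompt, and digit choices are illustrative:

    from bocoel.models.lms import HuggingfaceLogitsLM

    clf = HuggingfaceLogitsLM(
        model_path="gpt2",
        batch_size=4,
        device="cpu",
        choices=["1", "2", "3", "4", "5"],
    )

    logits = clf.classify(["Rate this review from 1 to 5: Great product!"])
    # One row per prompt, one column per choice.
    assert logits.shape == (1, len(clf.choices))
    best = clf.choices[logits[0].argmax()]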

class bocoel.models.lms.HuggingfaceSequenceLM(model_path: str, device: str, choices: Sequence[str], add_sep_token: bool = False)[source]#

The sequence classification model backed by huggingface’s transformers library.

__init__(model_path: str, device: str, choices: Sequence[str], add_sep_token: bool = False) None[source]#
property choices: Sequence[str]#

The choices for this language model.

Returns:

The choices for this language model.
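A sketch contrasting the sequence-classification variant, which presumably scores text with a sequence-classification head rather than next-token logits and takes no batch_size, matching the signature above. The model id is an assumption, and classify is assumed to follow the ClassifierModel contract:

    from bocoel.models.lms import HuggingfaceSequenceLM

    clf = HuggingfaceSequenceLM(
        model_path="distilbert-base-uncased",  # example checkpoint, not required
        device="cpu",
        choices=["negative", "positive"],
    )
    logits = clf.classify(["I loved this movie."])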

class bocoel.models.lms.HuggingfaceTokenizer(model_path: str, device: str, add_sep_token: bool)[source]#

A tokenizer for Huggingface models.

__init__(model_path: str, device: str, add_sep_token: bool) None[source]#
Parameters:
  • model_path – The path to the model, or a Huggingface Hub model id.

  • device – The device to use.

  • add_sep_token – Whether to add the [SEP] token.

Raises:

ImportError – If the transformers library is not installed.

to(device: str, /) HuggingfaceTokenizer[source]#

Move the tokenizer to the given device.

Parameters:

device – The device to move to.

tokenize(prompts: Sequence[str], /, max_length: int | None = None) BatchEncoding[source]#

Tokenize, pad, truncate, and move the encoded results to the configured device. The return value is a BatchEncoding, although this is not marked in the type hint because transformers is an optional dependency.

Parameters:
  • prompts – The prompts to tokenize.

  • max_length – The maximum length to truncate to, if given.

Returns:

The tokenized prompts.

Return type:

(BatchEncoding)
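A sketch of tokenize; the model id is an assumption, and the attribute access on the result relies on the documented BatchEncoding return type:

    from bocoel.models.lms import HuggingfaceTokenizer

    tokenizer = HuggingfaceTokenizer("gpt2", device="cpu", add_sep_token=False)

    encoded = tokenizer.tokenize(["Hello world", "Hi"], max_length=16)
    print(encoded.input_ids.shape)       # padded to a common length
    print(encoded.attention_mask.shape)  # 0 where padding was added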

encode(prompts: Sequence[str], /, return_tensors: str | None = None, add_special_tokens: bool = True) list[int][source]#

Encode the given prompts.

Parameters:
  • prompts – The prompts to encode.

  • return_tensors – The tensor format to return (passed through to the underlying tokenizer); None returns plain Python lists.

  • add_special_tokens – Whether to add special tokens.

Returns:

The encoded prompts.

Return type:

(list[int] by default; tensors when return_tensors is set)

decode(outputs: Any, /, skip_special_tokens: bool = True) str[source]#

Decode the given outputs.

Parameters:
  • outputs – The outputs to decode.

  • skip_special_tokens – Whether to skip special tokens.

Returns:

The decoded outputs.

batch_decode(outputs: Any, /, skip_special_tokens: bool = True) list[str][source]#

Batch decode the given outputs.

Parameters:
  • outputs – The outputs to decode.

  • skip_special_tokens – Whether to skip special tokens.

Returns:

The batch decoded outputs.
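A sketch of an encode/decode round trip using the three methods above; the model id and prompt are assumptions:

    from bocoel.models.lms import HuggingfaceTokenizer

    tokenizer = HuggingfaceTokenizer("gpt2", device="cpu", add_sep_token=False)

    ids = tokenizer.encode(["Hello world"])  # list[int] by default
    text = tokenizer.decode(ids, skip_special_tokens=True)

    # batch_decode handles several outputs at once.
    texts = tokenizer.batch_decode([ids], skip_special_tokens=True)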

class bocoel.models.lms.ClassifierModel(*args, **kwargs)[source]#
classify(prompts: Sequence[str], /) ndarray[Any, dtype[_ScalarType_co]][source]#

Classify the given prompts.

Parameters:

prompts – The prompts to classify.

Returns:

The logits for each prompt and choice.

Raises:

ValueError – If the shape of the logits is not [len(prompts), len(choices)].

abstract property choices: Sequence[str]#

The choices for this language model.

Returns:

The choices for this language model.
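A sketch that turns the raw logits from classify into per-choice probabilities; the softmax step is a standard assumption, not something the API performs for you:

    import numpy as np

    from bocoel.models.lms import ClassifierModel


    def choice_probabilities(clf: ClassifierModel, prompts: list[str]) -> np.ndarray:
        logits = clf.classify(prompts)
        # The documented shape contract: [len(prompts), len(choices)].
        assert logits.shape == (len(prompts), len(clf.choices))
        # Softmax over the choice axis yields probabilities per prompt.
        exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
        return exp / exp.sum(axis=-1, keepdims=True)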

__init__(*args, **kwargs)#
class bocoel.models.lms.GenerativeModel(*args, **kwargs)[source]#
abstract generate(prompts: Sequence[str], /) Sequence[str][source]#

Generate a response for each prompt. The returned sequence has the same length as prompts, and each response is a continuation of its prompt: the prompt is a prefix of the response.

Parameters:

prompts – The prompts to generate from.

Returns:

The generated responses, one per prompt.

Todo

Add logits.

__init__(*args, **kwargs)#
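A minimal sketch of a custom GenerativeModel; EchoModel is purely illustrative and trivially satisfies the contract that each response extends its prompt:

    from typing import Sequence

    from bocoel.models.lms import GenerativeModel


    class EchoModel(GenerativeModel):
        def generate(self, prompts: Sequence[str], /) -> Sequence[str]:
            # Each "response" is a (degenerate) continuation of its prompt.
            return [prompt + "..." for prompt in prompts]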