API Reference
Namespaces
dotjson
Main namespace containing the public API.
Classes
BatchedTokenSet
An opaque class containing the information needed by the dotjson
LogitsProcessor to efficiently mask a batch of logits vectors.
Constructors
explicit BatchedTokenSet(rust::Box<internal::BatchedTokenSet> token_set);
Methods
std::vector<bool> contains(u_int32_t token_id);
- Description: Returns a vector of booleans indicating whether a given token is marked as allowed in each token set
- Parameters:
  - token_id: The token id to check
- Returns: A vector of booleans with length equal to the batch size, where each element indicates whether the token is allowed in the corresponding batch element
[!TIP]
This can be used to stop early if the EOS token is available in the set of allowed tokens. For example, if the EOS token has ID 42, you can use

```cpp
if (batched_token_set.contains(42)[0]) {
  // EOS token is allowed in the first batch element; stop this sequence early
}
```

Early stopping may not be appropriate for all use cases. Typically, you will want to stop when the model has sampled the EOS token, not when it is first available.
std::vector<std::size_t> num_allowed();
- Description: Returns a vector of length batch_size containing the number of allowed tokens in each token set
- Returns: A vector of integers with length equal to the batch size, where each element indicates the number of allowed tokens in the corresponding batch element
[!TIP]
num_allowed can be useful for skipping model forward passes: if only one token is allowed in a batch element, you can sample that token directly without evaluating the model.
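This shortcut can be sketched without the library. Assuming one batch element's allowed-token mask is available as a `std::vector<bool>` (a hypothetical stand-in for querying a BatchedTokenSet per token id), the forced token can be detected like this:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// If exactly one token is allowed for a batch element, that token can be
// emitted directly, skipping the model forward pass. `allowed` is a
// hypothetical stand-in for one batch element's allowed-token mask.
std::optional<uint32_t> forced_token(const std::vector<bool> &allowed) {
    std::optional<uint32_t> only;
    for (uint32_t id = 0; id < allowed.size(); ++id) {
        if (allowed[id]) {
            if (only) return std::nullopt; // more than one token is allowed
            only = id;
        }
    }
    return only; // empty if no token is allowed at all
}
```

In a generation loop you would call num_allowed() first and only take this path when it reports exactly one allowed token for the element.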
Vocabulary
A vocabulary for constructing an index.
Constructors
Vocabulary(
std::string &model,
const std::string &revision = "",
const std::vector<std::pair<std::string, std::string>> &user_agent = {},
const std::string &auth_token = "");
- Description: Constructs a serializable vocabulary object for a given tokenizer
- Parameters:
  - model: The name of the tokenizer in the Hugging Face hub
  - revision: (default: "") The specific revision of the tokenizer on the hf-hub
  - user_agent: (default: {}) The user agent info in the form of a dictionary or a single string. It will be completed with information about the installed packages.
  - auth_token: (default: "") A valid user access token for the hf-hub.
- Throws: This will throw a std::exception if the vocabulary cannot be built. This will typically happen in one of two cases: the tokenizer could not be found in the Hugging Face hub, or the tokenizer has unsupported features.
Vocabulary(std::vector<std::pair<std::string, u_int32_t>> &dictionary,
u_int32_t eos_token_id);
- Description: Constructs a serializable vocabulary object from a set of (Token, TokenId) pairs
- Parameters:
  - dictionary: The pairs of (Token, TokenId).
  - eos_token_id: The token id of the end-of-string token. This must NOT be present in the dictionary.
- Throws: This will throw a std::exception if the vocabulary cannot be built. This will typically happen if the provided dictionary contains the EOS token id or is otherwise invalid.
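Since the constructor throws when the EOS token id appears among the pairs, a quick pre-check can give a friendlier error path. This is a sketch with a hypothetical helper, not part of the dotjson API:

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical pre-check: the EOS token id must not appear among the
// (Token, TokenId) pairs passed to the Vocabulary constructor.
bool dictionary_is_valid(
    const std::vector<std::pair<std::string, uint32_t>> &dictionary,
    uint32_t eos_token_id) {
    for (const auto &entry : dictionary) {
        if (entry.second == eos_token_id) return false; // EOS must stay out
    }
    return true;
}
```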
explicit Vocabulary(std::filesystem::path &path);
- Description: Constructs a vocabulary object from a serialized object on disk
- Parameters:
  - path: Path to the serialized object
- Throws: This will throw a std::exception in the event that a properly serialized vocabulary cannot be found at the specified path.
Methods
void serialize_to_disk(std::string &path);
- Description: Serialize the Vocabulary object to disk
- Parameters:
  - path: The path of the file you want to serialize your vocabulary to
- Throws: This will throw a std::exception in the event that the serialization fails.
u_int32_t max_token_id();
- Description: Returns the largest token id in the vocabulary. This is used for bounds checking in the logits processor.
u_int32_t eos_token_id();
- Description: Returns the EOS (end of string) TokenId for the vocabulary. This can be useful for checking termination.
Index
An index for constructing a LogitsProcessor.
Constructors
Index(std::string &schema, Vocabulary &vocabulary,
bool disallow_whitespace = true);
- Description: Constructs a serializable index object for a given JSON schema
- Parameters:
  - schema: A JSON schema that generations should match
  - vocabulary: A Vocabulary object encoding information about the model's tokenizer
  - disallow_whitespace: (default: true) Don't generate JSON containing extra whitespace (such as spaces after commas or line breaks after {)
- Throws: This will throw a std::exception if the index cannot be built. This will typically happen in one of two cases: the JSON schema is malformed, or it contains unsupported features.
[!NOTE]
Using disallow_whitespace=true (the default) may cause unanticipated model performance issues, as it disables formatting that may be natural for the model.
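To make the flag concrete: the two strings below encode the same JSON value, but with disallow_whitespace=true only the compact form can be generated. The helper is a simplistic illustration (it ignores whitespace inside string values) and is not part of the library:

```cpp
#include <string>

// Simplistic check used only for illustration: does the serialized JSON
// contain any whitespace characters at all?
bool contains_whitespace(const std::string &json) {
    return json.find_first_of(" \t\n") != std::string::npos;
}
```

With the flag on, a generation like `{"x":1,"y":2}` is reachable while the pretty-printed equivalent is not.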
explicit Index(std::filesystem::path &path);
- Description: Constructs an index object from a serialized object on disk
- Parameters:
  - path: Path to the serialized object
- Throws: This will throw a std::exception in the event that a properly serialized index cannot be found at the specified path.
Methods
void serialize_to_disk(std::string &path);
- Description: Serialize the Index object to disk
- Parameters:
  - path: The path of the file you want to serialize your index to
- Throws: This will throw a std::exception in the event that the serialization fails.
Guide
A guide class that reads batches of TokenIds and produces BatchedTokenSets
that can be used in the logits processor.
Constructors
Guide(const Index &index, size_t batch_size) noexcept;
- Description: Constructs the Guide object
- Parameters:
  - index: The index used for developing the mask
  - batch_size: The batch size for the logits updates
Methods
BatchedTokenSet get_start_tokensets() noexcept;
- Description: Constructs the set of allowed tokens needed to generate the first token.
BatchedTokenSet get_next_tokensets(const std::vector<u_int32_t> &token_ids);
- Description: Read a new batch of tokens in and produce the mask
- Parameters:
  - token_ids: The vector of TokenIds that have just been sampled.
- Throws: This will throw an exception when the wrong number of tokens is given or when at least one of the tokens is not in the allowed token set.
LogitsProcessor
A logits processor that modifies logits arrays in place for structured generation.
Constructors
LogitsProcessor(Guide &guide, u_int16_t mask_value) noexcept;
- Description: Constructs the logits processor
- Parameters:
  - guide: The Guide object used to produce the token sets.
  - mask_value: The u_int16_t equivalent of -std::numeric_limits<float>::infinity().
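The documentation does not spell out the logit encoding, but if the u_int16_t values are IEEE 754 binary16 (half-precision) bit patterns — a common choice for LLM logits — negative infinity is the pattern with the sign bit set, an all-ones exponent, and a zero mantissa. This is a sketch under that assumption:

```cpp
#include <cstdint>

// IEEE 754 binary16 negative infinity: sign = 1, exponent = 11111,
// mantissa = 0000000000. Assumes the logits are binary16 bit patterns.
constexpr uint16_t half_negative_infinity() {
    return static_cast<uint16_t>((1u << 15) | (0x1Fu << 10)); // 0xFC00
}
```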
Methods
void operator()(std::vector<std::span<u_int16_t>> &logits,
BatchedTokenSet &token_set);
- Description: Adaptively compute the mask and apply it in place to the logits array
- Parameters:
  - logits: A vector of std::span<u_int16_t> containing the logits computed after reading the tokens in context. The vector's length must equal the batch_size used to initialize the processor, and the spans must all have the same size.
  - token_set: The set of tokens produced by the guide.
- Throws: A std::exception will be thrown on bounds errors.
Example Usage
Here is a minimal example of how to use the dotjson library:
```cpp
// Create vocabulary and index
std::string model = "gpt2";
std::string schema =
    "{\"type\":\"object\",\"properties\":{\"x\":{\"type\":\"integer\"}}}";
dotjson::Vocabulary vocabulary(model);
dotjson::Index index(schema, vocabulary);

// Create guide and processor
std::size_t batch_size = 1;
u_int16_t mask_value = 0; // substitute an appropriate mask value
dotjson::Guide guide(index, batch_size);
dotjson::LogitsProcessor processor(guide, mask_value);

// Get the initial token set
dotjson::BatchedTokenSet token_set = guide.get_start_tokensets();

// Initial logits vector (produced by the LLM)
std::vector<std::span<u_int16_t>> logits;
// Populate logits...

// Mask the logits in place using the token set
processor(logits, token_set);

// Sample tokens from the processed logits (sample_tokens is user-provided)
std::vector<u_int32_t> sampled_tokens = sample_tokens(logits);

// Get the next token set based on the sampled tokens
token_set = guide.get_next_tokensets(sampled_tokens);
```