Usage
dotjson exposes five primary classes:
- dotjson::Vocabulary to prepare the vocabulary.
- dotjson::Index to compile an index for a given schema and model.
- dotjson::Guide to produce the set of allowed tokens at each generation step.
- dotjson::BatchedTokenSet to represent the allowed tokens at each generation step.
- dotjson::LogitsProcessor to mask logits for tokens inconsistent with the provided schema.
Tip
dotjson::Vocabulary and dotjson::Index are both serializable. See the API reference for details.
General program flow
Programs using dotjson follow a similar pattern:
- Initialize the Vocabulary using a model's HuggingFace identifier.
- Compile the Index using the schema and vocabulary.
- Create a Guide and a LogitsProcessor.
- For each step in the inference loop:
  - Perform a forward pass on your language model and place the results for each batch in the logits vector.
  - Retrieve the set of allowed tokens:
    - If this is the first step, get the initial set of allowed tokens from the guide with guide.get_start_tokensets().
    - Otherwise, get the next set of allowed tokens using the most recently sampled tokens from the guide with guide.get_next_tokensets(sampled_tokens).
  - Call the processor on the logits vector using the current token set with processor(logits, token_set).
  - Choose new tokens for each sequence in the batch using the logits vector, via your preferred sampling method.
Tip
Visit the example page for a complete example of how to use dotjson.
Compiling an index
Construct a vocabulary
// Basic usage with just the model name
std::string model = "NousResearch/Hermes-3-Llama-3.1-8B";
dotjson::Vocabulary vocabulary(model);
// Advanced usage with all optional parameters
std::string model = "NousResearch/Hermes-3-Llama-3.1-8B";
std::string revision = "main"; // specific model revision
std::vector<std::pair<std::string, std::string>> user_agent = {
{"application", "my-app-name"},
{"version", "1.0.0"}
};
std::string auth_token = "hf_..."; // HuggingFace access token
dotjson::Vocabulary vocabulary(model, revision, user_agent, auth_token);
model must be the model identifier on the HuggingFace hub, such as "NousResearch/Hermes-3-Llama-3.1-8B". dotjson will download the tokenizer for this model if it is not currently cached on disk.
The user_agent parameter identifies your application when making requests to the HuggingFace hub. The library automatically augments your user agent with information about installed packages.
Alternatively, you can construct a vocabulary from a custom dictionary of token-to-ID mappings:
std::vector<std::pair<std::string, u_int32_t>> dictionary = {
{"hello", 1},
{"world", 2},
// ... more token mappings
};
u_int32_t eos_token_id = 3;
dotjson::Vocabulary vocabulary(dictionary, eos_token_id);
Note
When constructing a vocabulary from a dictionary, the EOS token ID must not be present in the dictionary. The dictionary must be valid and contain unique token-to-ID mappings.
Vocabulary contains two convenience functions:
- Vocabulary::eos_token_id() returns the EOS token ID, used to determine when generation should terminate.
- Vocabulary::max_token_id() returns the maximum token ID, used to determine the size of the logits vector.
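Both can be queried directly from a constructed vocabulary; a minimal sketch, reusing the vocabulary object from the snippets above:

```cpp
// Query the vocabulary's convenience functions
u_int32_t eos_id = vocabulary.eos_token_id(); // stop generation once this token is sampled
u_int32_t max_id = vocabulary.max_token_id(); // use this to size each per-batch logits span
```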
Construct an index
Pass the schema you wish to compile and the vocabulary to create an index:
std::string schema = "{\"type\":\"object\",\"properties\":{\"x\":{\"type\":\"integer\"}}}";
dotjson::Index index(schema, vocabulary);
schema must be a string containing a valid JSON schema. See the list of supported JSON Schema features here.
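For illustration, a schema with several properties compiles the same way. This is a sketch with a made-up schema; confirm that any features you rely on appear in the supported-features list:

```cpp
// A slightly richer (hypothetical) schema: an object with a string and an integer field
std::string user_schema =
    "{\"type\":\"object\","
    "\"properties\":{"
    "\"name\":{\"type\":\"string\"},"
    "\"age\":{\"type\":\"integer\"}}}";
dotjson::Index user_index(user_schema, vocabulary);
```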
By default, the generated JSON disallows extra whitespace (spaces after commas, line breaks after {, etc.), so output is compact. To relax this and allow whitespace in the generated JSON, set the disallow_whitespace parameter to false:
// Allow whitespace in the generated JSON
dotjson::Index index(schema, vocabulary, false);
Note
An index should be compiled whenever a new schema is received. Compilation fails for invalid schemas; please see the API reference for information on the exceptions thrown in that case.
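A defensive compilation step might look like the following sketch. The exact exception types are documented in the API reference; here we assume only that they derive from std::exception:

```cpp
#include <iostream>

// Guard index compilation against invalid schemas.
// Assumption: dotjson's schema errors derive from std::exception;
// see the API reference for the exact exception types.
try {
    dotjson::Index index(schema, vocabulary);
} catch (const std::exception& e) {
    std::cerr << "Schema compilation failed: " << e.what() << std::endl;
}
```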
Loading and saving an index
If you use the same schema repeatedly, you can serialize the corresponding Index instance and save it to disk to avoid repeated compilations.
To save an index after compilation:
// Compile a vocabulary and index
std::string schema = "{\"type\":\"object\",\"properties\":{\"x\":{\"type\":\"integer\"}}}";
std::string model = "NousResearch/Hermes-3-Llama-3.1-8B";
dotjson::Vocabulary vocabulary(model);
dotjson::Index index(schema, vocabulary);
// Save the index to a file on disk
std::filesystem::path path = "path/to/file";
index.serialize_to_disk(path);
To load an index from disk:
// Load the index
std::filesystem::path path = "path/to/file";
dotjson::Index index(path);
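Combining the two, a common pattern is to cache compiled indexes on disk. A sketch, where cache_path is a hypothetical location of your choosing:

```cpp
#include <filesystem>

// Load a previously serialized index if present; otherwise compile
// it from the schema and save it for next time.
std::filesystem::path cache_path = "cache/schema.idx"; // hypothetical location
if (std::filesystem::exists(cache_path)) {
    dotjson::Index index(cache_path);
    // ... use the loaded index ...
} else {
    dotjson::Index index(schema, vocabulary);
    index.serialize_to_disk(cache_path);
    // ... use the freshly compiled index ...
}
```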
Prepare the set of allowed tokens
dotjson uses a Guide to generate sets of allowed tokens at each inference step, as determined by previously sampled tokens.
Guides separate the logic of determining allowed tokens from the logic of biasing the model’s logits. The distinction is important because it allows for parallel computation of allowed tokens during the model’s forward pass, rather than during the logit processing step.
The Guide is constructed with an Index and a batch size:
// Create a guide from the index with the batch size
std::size_t batch_size = 1;
dotjson::Guide guide(index, batch_size);
The Guide generates a BatchedTokenSet at each step, which contains the set of valid token IDs for each sequence in the batch.
The guide is used in two ways:
Generate the initial set of allowed tokens
// Get the initial set of allowed tokens
dotjson::BatchedTokenSet initial_token_set = guide.get_start_tokensets();
Generate the next set of allowed tokens
// Get the next set of allowed tokens
dotjson::BatchedTokenSet next_token_set = guide.get_next_tokensets(sampled_tokens);
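Because token-set computation is decoupled from logit processing, it can in principle be overlapped with the model's forward pass. The sketch below uses std::async and a hypothetical forward_pass() helper; whether a Guide may be advanced from another thread is an assumption here, so verify the library's thread-safety guarantees in the API reference first:

```cpp
#include <future>

// Sketch: compute the next token set while the model's forward pass runs.
// forward_pass() is a hypothetical stand-in for your model call, and we
// ASSUME the Guide may be advanced from another thread.
auto future_tokens = std::async(std::launch::async, [&guide, &sampled_tokens] {
    return guide.get_next_tokensets(sampled_tokens);
});
std::vector<std::span<u_int16_t>> logits = forward_pass(); // runs concurrently
dotjson::BatchedTokenSet token_set = future_tokens.get();
// ... mask the logits with the token set, as described below ...
```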
Important
A BatchedTokenSet should never be reused after new tokens have been sampled. Always get a fresh token set from the guide after each sampling step; reusing a stale token set results in silent failures where tokens are masked incorrectly.
Note
guide.get_start_tokensets() can only be called once. If you need to restart generation, create a new Guide instance.
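For example, restarting generation reuses the compiled Index but requires a fresh Guide:

```cpp
// Restart generation: the compiled Index is reused, but a new Guide is
// required because get_start_tokensets() can only be called once per Guide
dotjson::Guide fresh_guide(index, batch_size);
dotjson::BatchedTokenSet token_set = fresh_guide.get_start_tokensets();
```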
Using BatchedTokenSet methods
The BatchedTokenSet class provides two utility methods to inspect the allowed tokens:
// Check if a specific token is allowed in each batch
u_int32_t token_to_check = 42;
std::vector<bool> is_allowed = token_set.contains(token_to_check);
// is_allowed[i] is true if token 42 is allowed in batch i
// Get the number of allowed tokens in each batch
std::vector<std::size_t> allowed_count = token_set.num_allowed();
// allowed_count[i] contains the number of allowed tokens in batch i
These methods can be useful for debugging or for implementing custom token sampling strategies. For example, if only one token is allowed in a batch element (using num_allowed), you can sample that token directly without evaluating the model, as in the sketch below.
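The methods above expose no way to enumerate the allowed tokens directly, so the scan with contains() in this sketch is purely illustrative and costs one lookup per token ID:

```cpp
// Fast-path sketch: when exactly one token is allowed for a sequence,
// emit it without running the model.
std::vector<std::size_t> allowed_count = token_set.num_allowed();
for (std::size_t i = 0; i < allowed_count.size(); ++i) {
    if (allowed_count[i] != 1) continue;
    // Illustrative scan: one contains() lookup per token ID
    for (u_int32_t t = 0; t <= vocabulary.max_token_id(); ++t) {
        if (token_set.contains(t)[i]) {
            // t is the only valid next token for sequence i
            break;
        }
    }
}
```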
Preallocate the logits
Logits must be stored in a std::vector<std::span<u_int16_t>>. Each element of logits is a span over the token logits for one sequence in the batch.
A dummy initialization, which uses constant logits vectors for demonstration purposes, follows:
// Set the number of sequences in the batch
std::size_t n_batches = 1;
// Get the vocabulary size
u_int32_t vocab_size = vocabulary.max_token_id();
// Allocate a single vector for all batches
std::vector<u_int16_t> all_logits(n_batches * vocab_size, 1);
std::vector<std::span<u_int16_t>> logits(n_batches);
// Create non-overlapping spans pointing to sections of the vector
for (std::size_t i = 0; i < n_batches; ++i) {
logits[i] = std::span<u_int16_t>(all_logits.data() + i * vocab_size, vocab_size);
}
For multiple batches, an alternative approach is to use separate vectors for each batch:
// Store the vectors in a container that persists outside the loop
std::vector<std::vector<u_int16_t>> batch_logits_storage(n_batches);
std::vector<std::span<u_int16_t>> logits(n_batches);
// Initialize each batch's logits:
for (std::size_t i = 0; i < n_batches; ++i) {
batch_logits_storage[i] = std::vector<u_int16_t>(vocab_size, 1);
logits[i] = std::span<u_int16_t>(batch_logits_storage[i].data(), batch_logits_storage[i].size());
}
Important
Each span must not overlap in memory with another span. For example, the following will throw an exception:
std::vector<u_int16_t> logits_base_test(50258, 3);
std::span<u_int16_t> logits_span_test(logits_base_test.data(), 50257);
std::span<u_int16_t> logits_span2_test(logits_base_test.data() + 1, 50257);
std::vector<std::span<u_int16_t>> logits_test = {logits_span_test, logits_span2_test };
Construct a logit processor
The logit processor is a function that modifies the logits in-place based on the set of allowed tokens. It is constructed with a Guide object and the mask value:
// Set the batch size
std::size_t batch_size = 1;
// The value to use for masking (aka disabling tokens)
// This will be determined by your quantization scheme,
// but in this example we use 0
u_int16_t mask_value = 0;
// Create the processor using the guide object
dotjson::LogitsProcessor processor(guide, mask_value);
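The snippet above uses 0 as the mask value. One way to make that choice explicit is to take the minimum representable logit for your scheme; a sketch, noting that the right value ultimately depends on how your sampler interprets logits:

```cpp
#include <limits>

// Make the mask value explicit: for unsigned 16-bit logits, the minimum
// representable value (0) marks a token as impossible under most samplers.
// Adjust to whatever your quantization scheme treats as zero probability.
u_int16_t mask_value = std::numeric_limits<u_int16_t>::min(); // == 0
dotjson::LogitsProcessor processor(guide, mask_value);
```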
LogitsProcessor is a function that masks tokens that are inconsistent with the schema.
To use the processor, call:
processor(logits, token_set);
This will modify the logits vector in place, using the token_set to determine which tokens to allow.
For example, suppose logits is the following after the model's forward pass:
// Single logits vector, with a three-token vocabulary
logits = { {1, 2, 3} };
Assuming token ID 1 is not in the allowed token set, logit processing will replace its logit with the mask value:
processor(logits, token_set);
// logits = {{1, 0, 3}};
Example program
// Import dotjson
// Note: The include path may vary depending on your installation method
// For system installation, use: #include "dotjson.hpp"
// For local installation, use: #include "vendor/dotjson/include/dotjson.hpp"
#include "../src/dotjson.hpp"
#include <iostream>
#include <vector>
#include <limits>
#include <span>
// A function that returns logits in the appropriate form,
// taking the full sequence history as input for the model's forward pass
std::vector<std::span<u_int16_t>> get_logits(std::string model, const std::vector<std::vector<u_int32_t>>& sequences) {
// This is your language model's forward pass.
// It should use the sequences (token history) to compute the next token logits
// ...
// Return the computed logits
std::vector<std::span<u_int16_t>> logits;
// ... populate logits from your forward pass ...
return logits;
}
std::vector<u_int32_t> sample_tokens(std::vector<std::span<u_int16_t>> &logits) {
// A function that returns the next tokens generated from a given set of logits,
// such as multinomial or greedy sampling.
std::vector<u_int32_t> sampled_tokens;
// ... populate sampled_tokens from your sampling method ...
return sampled_tokens;
}
int main() {
// Specify the model to use -- this downloads the tokenizer
// from HuggingFace.
std::string model = "gpt2";
// Specify a JSON schema to use.
std::string schema = "{\"type\":\"object\",\"properties\":{\"x\":{\"type\":\"integer\"}}}";
// Compile the index and vocabulary for the schema.
dotjson::Vocabulary vocabulary(model);
dotjson::Index index(schema, vocabulary);
// Specify the mask value to use for disabling tokens.
// Tokens that don't match the schema will have their logit values set to this.
u_int16_t mask_value = 0;
std::size_t batch_size = 1;
// Create a guide to generate sets of allowed tokens
dotjson::Guide guide(index, batch_size);
// Create a logit processor using the guide
dotjson::LogitsProcessor processor(guide, mask_value);
// Get the initial set of allowed tokens
dotjson::BatchedTokenSet token_set = guide.get_start_tokensets();
// Initialize sequence tracking for each batch
std::vector<std::vector<u_int32_t>> sequences(batch_size);
// Maximum allowed sequence length
const size_t max_sequence_length = 1024;
// Initialize the first logits based on empty sequences
// (this is your language model's forward pass)
std::vector<std::span<u_int16_t>> logits = get_logits(model, sequences);
// Process the initial logits with the token set
processor(logits, token_set);
// Sample the first set of tokens
std::vector<u_int32_t> sampled_tokens = sample_tokens(logits);
// Add sampled tokens to the sequences history
for (size_t i = 0; i < batch_size; i++) {
sequences[i].push_back(sampled_tokens[i]);
}
// Tracking boolean to know when to end generation
bool is_completed = false;
// A complete inference loop would look like:
while (!is_completed) {
// Get the next set of allowed tokens based on the
// most recently sampled tokens
token_set = guide.get_next_tokensets(sampled_tokens);
// Get the next set of logits from the model
// The full sequences history is used for the model's context
logits = get_logits(model, sequences);
// Process logits with the current token set
processor(logits, token_set);
// After processing, only tokens that are valid according to the schema
// will retain their original values. Others will be set to mask_value.
// Sample the next tokens
sampled_tokens = sample_tokens(logits);
// Add the new tokens to the sequence history
for (size_t i = 0; i < batch_size; i++) {
sequences[i].push_back(sampled_tokens[i]);
// Check for completion: EOS token or maximum sequence length reached
if (sampled_tokens[i] == vocabulary.eos_token_id() ||
sequences[i].size() >= max_sequence_length) {
is_completed = true;
}
}
}
// Process the generated sequences to return the final
// completion to the user
for (size_t i = 0; i < batch_size; i++) {
// Convert token IDs back to text
// ...
}
}
Need help?
- Email us at [email protected]
- Your dedicated Slack channel
- Schedule a call with us here