Usage
dotjson exposes five primary classes:
- dotjson::Vocabulary to prepare the vocabulary.
- dotjson::Index to compile an index for a given schema and model.
- dotjson::Guide to produce the set of allowed tokens at each generation step.
- dotjson::BatchedTokenSet to represent the allowed tokens at each generation step.
- dotjson::LogitsProcessor to mask logits for tokens inconsistent with the provided schema.
Tip
dotjson::Vocabulary and dotjson::Index are both serializable. See the API reference for details.
General program flow
Programs using dotjson follow a similar pattern:
- Initialize the Vocabulary using a model's HuggingFace identifier.
- Compile the Index using the schema and vocabulary.
- Create a Guide and a LogitsProcessor.
- For each step in the inference loop:
  - Perform a forward pass on your language model and place the results for each batch in the logits vector.
  - Retrieve the set of allowed tokens:
    - If this is the first step, get the initial set of allowed tokens from the guide with guide.get_start_tokensets().
    - Otherwise, get the next set of allowed tokens using the most recently sampled tokens from the guide with guide.get_next_tokensets(sampled_tokens).
  - Call the processor on the logits vector using the current token set with processor(logits, token_set).
  - Choose new tokens for each sequence in the batch using the logits vector, via your preferred sampling method.
Tip
Visit the example page for a complete example of how to use dotjson.
Compiling an index
Construct a vocabulary
// Basic usage with just the model name
std::string model = "NousResearch/Hermes-3-Llama-3.1-8B";
dotjson::Vocabulary vocabulary(model);
// Advanced usage with all optional parameters
std::string model = "NousResearch/Hermes-3-Llama-3.1-8B";
std::string revision = "main"; // specific model revision
std::vector<std::pair<std::string, std::string>> user_agent = {
{"application", "my-app-name"},
{"version", "1.0.0"}
};
std::string auth_token = "hf_..."; // HuggingFace access token
dotjson::Vocabulary vocabulary(model, revision, user_agent, auth_token);
model must be the model identifier on the HuggingFace hub, such as "NousResearch/Hermes-3-Llama-3.1-8B". dotjson will download the tokenizer for this model if it is not currently cached on disk.
The user_agent parameter identifies your application when making requests to the HuggingFace hub. The library automatically augments your user agent with information about installed packages.
Alternatively, you can construct a vocabulary from a custom dictionary of token-to-ID mappings:
std::vector<std::pair<std::string, u_int32_t>> dictionary = {
{"hello", 1},
{"world", 2},
// ... more token mappings
};
u_int32_t eos_token_id = 3;
dotjson::Vocabulary vocabulary(dictionary, eos_token_id);
Note
When constructing a vocabulary from a dictionary, the EOS token ID must not be present in the dictionary. The dictionary must be valid and contain unique token-to-ID mappings.
Vocabulary contains two convenience functions:
- Vocabulary::eos_token_id() returns the EOS token ID, used to determine when generation should terminate.
- Vocabulary::max_token_id() returns the maximum token ID, used to determine the size of the logits vector.
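Both can be queried directly from a constructed vocabulary; a minimal sketch, reusing the vocabulary object from the snippets above:

```cpp
// Query the vocabulary's convenience functions
u_int32_t eos_id = vocabulary.eos_token_id(); // stop generation once this token is sampled
u_int32_t max_id = vocabulary.max_token_id(); // use this to size each per-batch logits span
```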
Construct an index
Pass the schema you wish to compile and the vocabulary to create an index:
std::string schema = "{\"type\":\"object\",\"properties\":{\"x\":{\"type\":\"integer\"}}}";
dotjson::Index index(schema, vocabulary);
schema must be a string containing a valid JSON schema. See the list of supported JSON Schema features here.
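For illustration, a schema with several properties compiles the same way. This is a sketch with a made-up schema; confirm that any features you rely on appear in the supported-features list:

```cpp
// A slightly richer (hypothetical) schema: an object with a string and an integer field
std::string user_schema =
    "{\"type\":\"object\","
    "\"properties\":{"
    "\"name\":{\"type\":\"string\"},"
    "\"age\":{\"type\":\"integer\"}}}";
dotjson::Index user_index(user_schema, vocabulary);
```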
By default, the generated JSON disallows extra whitespace (spaces after commas, line breaks after {, etc.), so output is compact. To relax this and allow whitespace in the generated JSON, set the disallow_whitespace parameter to false:
// Allow whitespace in the generated JSON
dotjson::Index index(schema, vocabulary, false);
Note
An index should be compiled whenever a new schema is received. Compilation fails for invalid schemas; please see the API reference for information on the exceptions thrown in that case.
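A defensive compilation step might look like the following sketch. The exact exception types are documented in the API reference; here we assume only that they derive from std::exception:

```cpp
#include <iostream>

// Guard index compilation against invalid schemas.
// Assumption: dotjson's schema errors derive from std::exception;
// see the API reference for the exact exception types.
try {
    dotjson::Index index(schema, vocabulary);
} catch (const std::exception& e) {
    std::cerr << "Schema compilation failed: " << e.what() << std::endl;
}
```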
Loading and saving an index
If you use the same schema repeatedly, you can serialize the corresponding Index instance and save it to disk to avoid repeated compilations.
To save an index after compilation:
// Compile a vocabulary and index
std::string schema = "{\"type\":\"object\",\"properties\":{\"x\":{\"type\":\"integer\"}}}";
std::string model = "NousResearch/Hermes-3-Llama-3.1-8B";
dotjson::Vocabulary vocabulary(model);
dotjson::Index index(schema, vocabulary);
// Save the index to a file on disk
std::filesystem::path path = "path/to/file";
index.serialize_to_disk(path);
To load an index from disk:
// Load the index
std::filesystem::path path = "path/to/file";
dotjson::Index index(path);
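Combining the two, a common pattern is to cache compiled indexes on disk. A sketch, where cache_path is a hypothetical location of your choosing:

```cpp
#include <filesystem>

// Load a previously serialized index if present; otherwise compile
// it from the schema and save it for next time.
std::filesystem::path cache_path = "cache/schema.idx"; // hypothetical location
if (std::filesystem::exists(cache_path)) {
    dotjson::Index index(cache_path);
    // ... use the loaded index ...
} else {
    dotjson::Index index(schema, vocabulary);
    index.serialize_to_disk(cache_path);
    // ... use the freshly compiled index ...
}
```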
Prepare the set of allowed tokens
dotjson uses a Guide to generate sets of allowed tokens at each inference step, as determined by previously sampled tokens.
Guides separate the logic of determining allowed tokens from the logic of biasing the model’s logits. The distinction is important because it allows for parallel computation of allowed tokens during the model’s forward pass, rather than during the logit processing step.
The Guide is constructed with an Index and a batch size:
// Create a guide from the index with the batch size
std::size_t batch_size = 1;
dotjson::Guide guide(index, batch_size);
The Guide generates a BatchedTokenSet at each step, which contains the set of valid token IDs for each sequence in the batch.
The guide is used in two ways:
Generate the initial set of allowed tokens
// Get the initial set of allowed tokens
dotjson::BatchedTokenSet initial_token_set = guide.get_start_tokensets();
Generate the next set of allowed tokens
// Get the next set of allowed tokens
dotjson::BatchedTokenSet next_token_set = guide.get_next_tokensets(sampled_tokens);
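Because token-set computation is decoupled from logit processing, it can in principle be overlapped with the model's forward pass. The sketch below uses std::async and a hypothetical forward_pass() helper; whether a Guide may be advanced from another thread is an assumption here, so verify the library's thread-safety guarantees in the API reference first:

```cpp
#include <future>

// Sketch: compute the next token set while the model's forward pass runs.
// forward_pass() is a hypothetical stand-in for your model call, and we
// ASSUME the Guide may be advanced from another thread.
auto future_tokens = std::async(std::launch::async, [&guide, &sampled_tokens] {
    return guide.get_next_tokensets(sampled_tokens);
});
std::vector<std::span<u_int16_t>> logits = forward_pass(); // runs concurrently
dotjson::BatchedTokenSet token_set = future_tokens.get();
// ... mask the logits with the token set, as described below ...
```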
Important
A BatchedTokenSet should never be reused after new tokens have been sampled. Always get a fresh token set from the guide after each sampling step; reusing a stale token set results in silent failures where tokens are masked incorrectly.
Note
guide.get_start_tokensets() can only be called once. If you need to restart generation, create a new Guide instance.
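For example, restarting generation reuses the compiled Index but requires a fresh Guide:

```cpp
// Restart generation: the compiled Index is reused, but a new Guide is
// required because get_start_tokensets() can only be called once per Guide
dotjson::Guide fresh_guide(index, batch_size);
dotjson::BatchedTokenSet token_set = fresh_guide.get_start_tokensets();
```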
Using BatchedTokenSet methods
The BatchedTokenSet class provides two utility methods to inspect the allowed tokens:
// Check if a specific token is allowed in each batch
u_int32_t token_to_check = 42;
std::vector<bool> is_allowed = token_set.contains(token_to_check);
// is_allowed[i] is true if token 42 is allowed in batch i
// Get the number of allowed tokens in each batch
std::vector<std::size_t> allowed_count = token_set.num_allowed();
// allowed_count[i] contains the number of allowed tokens in batch i
These methods can be useful for debugging or for implementing custom token sampling strategies. For example, if only one token is allowed in a batch element (using num_allowed), you can sample that token directly without evaluating the model, as in the sketch below.
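The methods above expose no way to enumerate the allowed tokens directly, so the scan with contains() in this sketch is purely illustrative and costs one lookup per token ID:

```cpp
// Fast-path sketch: when exactly one token is allowed for a sequence,
// emit it without running the model.
std::vector<std::size_t> allowed_count = token_set.num_allowed();
for (std::size_t i = 0; i < allowed_count.size(); ++i) {
    if (allowed_count[i] != 1) continue;
    // Illustrative scan: one contains() lookup per token ID
    for (u_int32_t t = 0; t <= vocabulary.max_token_id(); ++t) {
        if (token_set.contains(t)[i]) {
            // t is the only valid next token for sequence i
            break;
        }
    }
}
```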
Preallocate the logits
Logits must be stored in a std::vector<std::span<u_int16_t>>. Each element of logits is a span over the token logits for one sequence in the batch.
A dummy initialization, which uses constant logits vectors for demonstration purposes, follows:
// Set the number of sequences in the batch
std::size_t n_batches = 1;
// Get the vocabulary size
u_int32_t vocab_size = vocabulary.max_token_id();
// Allocate a single vector for all batches
std::vector<u_int16_t> all_logits(n_batches * vocab_size, 1);
std::vector<std::span<u_int16_t>> logits(n_batches);
// Create non-overlapping spans pointing to sections of the vector
for (std::size_t i = 0; i < n_batches; ++i) {
logits[i] = std::span<u_int16_t>(all_logits.data() + i * vocab_size, vocab_size);
}
For multiple batches, an alternative approach is to use separate vectors for each batch:
// Store the vectors in a container that persists outside the loop
std::vector<std::vector<u_int16_t>> batch_logits_storage(n_batches);
std::vector<std::span<u_int16_t>> logits(n_batches);
// Initialize each batch's logits:
for (std::size_t i = 0; i < n_batches; ++i) {
batch_logits_storage[i] = std::vector<u_int16_t>(vocab_size, 1);
logits[i] = std::span<u_int16_t>(batch_logits_storage[i].data(), batch_logits_storage[i].size());
}
Important
Each span must not overlap in memory with another span. For example, the following will throw an exception:
std::vector<u_int16_t> logits_base_test(50258, 3);
std::span<u_int16_t> logits_span_test(logits_base_test.data(), 50257);
std::span<u_int16_t> logits_span2_test(logits_base_test.data() + 1, 50257);
std::vector<std::span<u_int16_t>> logits_test = {logits_span_test, logits_span2_test };
Construct a logit processor
The logit processor is a function that modifies the logits in-place based on the set of allowed tokens. It is constructed with a Guide object and the mask value:
// Set the batch size
std::size_t batch_size = 1;
// The value to use for masking (aka disabling tokens)
// This will be determined by your quantization scheme,
// but in this example we use 0
u_int16_t mask_value = 0;
// Create the processor using the guide object
dotjson::LogitsProcessor processor(guide, mask_value);
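The snippet above uses 0 as the mask value. One way to make that choice explicit is to take the minimum representable logit for your scheme; a sketch, noting that the right value ultimately depends on how your sampler interprets logits:

```cpp
#include <limits>

// Make the mask value explicit: for unsigned 16-bit logits, the minimum
// representable value (0) marks a token as impossible under most samplers.
// Adjust to whatever your quantization scheme treats as zero probability.
u_int16_t mask_value = std::numeric_limits<u_int16_t>::min(); // == 0
dotjson::LogitsProcessor processor(guide, mask_value);
```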
LogitsProcessor is a function that masks tokens that are inconsistent with the schema.
To use the processor, call:
processor(logits, token_set);
This will modify the logits vector in place, using the token_set to determine which tokens to allow.
For example, suppose logits is the following after the model's forward pass:
// Single logits vector, with a three-token vocabulary
logits = { {1, 2, 3} };
Assuming token ID 1 is not in the allowed token set, logit processing will replace its logit with the mask value:
processor(logits, token_set);
// logits = {{1, 0, 3}};
Example program
// Import dotjson
// Note: The include path may vary depending on your installation method
// For system installation, use: #include "dotjson.hpp"
// For local installation, use: #include "vendor/dotjson/include/dotjson.hpp"
#include "../src/dotjson.hpp"
#include <iostream>
#include <vector>
#include <limits>
#include <span>
// A function that returns logits in the appropriate form,
// taking the full sequence history as input for the model's forward pass
std::vector<std::span<u_int16_t>> get_logits(std::string model, const std::vector<std::vector<u_int32_t>>& sequences) {
// This is your language model's forward pass.
// It should use the sequences (token history) to compute the next token logits
// ...
// Return the computed logits
std::vector<std::span<u_int16_t>> logits;
// ... populate logits from your forward pass ...
return logits;
}
std::vector<u_int32_t> sample_tokens(std::vector<std::span<u_int16_t>> &logits) {
// A function that returns the next tokens generated from a given set of logits,
// such as multinomial or greedy sampling.
std::vector<u_int32_t> sampled_tokens;
// ... populate sampled_tokens from your sampling method ...
return sampled_tokens;
}
int main() {
// Specify the model to use -- this downloads the tokenizer
// from HuggingFace.
std::string model = "gpt2";
// Specify a JSON schema to use.
std::string schema = "{\"type\":\"object\",\"properties\":{\"x\":{\"type\":\"integer\"}}}";
// Compile the index and vocabulary for the schema.
dotjson::Vocabulary vocabulary(model);
dotjson::Index index(schema, vocabulary);
// Specify the mask value to use for disabling tokens.
// Tokens that don't match the schema will have their logit values set to this.
u_int16_t mask_value = 0;
std::size_t batch_size = 1;
// Create a guide to generate sets of allowed tokens
dotjson::Guide guide(index, batch_size);
// Create a logit processor using the guide
dotjson::LogitsProcessor processor(guide, mask_value);
// Get the initial set of allowed tokens
dotjson::BatchedTokenSet token_set = guide.get_start_tokensets();
// Initialize sequence tracking for each batch
std::vector<std::vector<u_int32_t>> sequences(batch_size);
// Maximum allowed sequence length
const size_t max_sequence_length = 1024;
// Initialize the first logits based on empty sequences
// (this is your language model's forward pass)
std::vector<std::span<u_int16_t>> logits = get_logits(model, sequences);
// Process the initial logits with the token set
processor(logits, token_set);
// Sample the first set of tokens
std::vector<u_int32_t> sampled_tokens = sample_tokens(logits);
// Add sampled tokens to the sequences history
for (size_t i = 0; i < batch_size; i++) {
sequences[i].push_back(sampled_tokens[i]);
}
// Tracking boolean to know when to end generation
bool is_completed = false;
// A complete inference loop would look like:
while (!is_completed) {
// Get the next set of allowed tokens based on the
// most recently sampled tokens
token_set = guide.get_next_tokensets(sampled_tokens);
// Get the next set of logits from the model
// The full sequences history is used for the model's context
logits = get_logits(model, sequences);
// Process logits with the current token set
processor(logits, token_set);
// After processing, only tokens that are valid according to the schema
// will retain their original values. Others will be set to mask_value.
// Sample the next tokens
sampled_tokens = sample_tokens(logits);
// Add the new tokens to the sequence history
for (size_t i = 0; i < batch_size; i++) {
sequences[i].push_back(sampled_tokens[i]);
// Check for completion: EOS token or maximum sequence length reached
if (sampled_tokens[i] == vocabulary.eos_token_id() ||
sequences[i].size() >= max_sequence_length) {
is_completed = true;
}
}
}
// Process the generated sequences to return the final
// completion to the user
for (size_t i = 0; i < batch_size; i++) {
// Convert token IDs back to text
// ...
}
}
Need help?
- Email us at [email protected]
- Your dedicated Slack channel
- Schedule a call with us here