Native backend for the lloyal inference platform.
Prebuilt llama.cpp binaries for 13 platform/GPU combinations, exposing a `SessionContext` that powers the `@lloyal-labs/sdk` inference primitives (`Branch`, `BranchStore`, `Session`, `Rerank`) and the `@lloyal-labs/lloyal-agents` multi-agent framework. Built on liblloyal, a header-only C++20 inference kernel for llama.cpp.

All SDK and agent exports are re-exported from this package for convenience — `import { Branch, runAgents } from "@lloyal-labs/lloyal.node"` works out of the box.
```sh
npm install @lloyal-labs/lloyal.node
```

Prebuilt binaries for 13 platform/GPU combinations. GPU selection happens at runtime, not install time.
| Platform | Arch | Acceleration |
|---|---|---|
| macOS | arm64 | Metal |
| macOS | x64 | CPU |
| Linux | x64 | CPU / CUDA / Vulkan |
| Linux | arm64 | CPU / CUDA / Vulkan |
| Windows | x64 | CPU / CUDA / Vulkan |
| Windows | arm64 | CPU / Vulkan |
```ts
import { createContext } from "@lloyal-labs/lloyal.node";
import { Branch, BranchStore } from "@lloyal-labs/sdk";

const ctx = await createContext({ modelPath: "./model.gguf", nSeqMax: 4 });
const store = new BranchStore(ctx);

const root = Branch.create(ctx, 0, { temperature: 0.8 });
await root.prefill(await ctx.tokenize("Explain quantum entanglement"));

// Fork and generate — all branches in lockstep, 1 GPU call per step
const branches = await Promise.all([root.fork(), root.fork(), root.fork()]);

for (;;) {
  const live = branches.filter((b) => !b.disposed);
  if (!live.length) break;

  const produced = live.map((b) => ({ b, ...b.produce() }));
  for (const p of produced.filter((p) => p.isStop)) await p.b.prune();

  const items = produced
    .filter((p) => !p.isStop)
    .map((p) => {
      p.b.accept(p.token);
      return [p.b, p.token];
    });
  await store.commit(items);
}
```

Or for single-branch generation, `Branch` is an async iterable:

```ts
for await (const { token, text } of branch) {
  process.stdout.write(text);
}
```

See `@lloyal-labs/sdk` for the full Branch API, continuous tree batching, KV tenancy, and topology documentation.
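The lockstep loop above extends naturally to best-of-N selection: fork N candidates, score each completion, keep the best. Below is a minimal sketch against structural stand-ins for the `Branch`/`BranchStore` surface shown above; treating `modelSurprisal()` as a per-token `Branch` method is an assumption (the name appears in the SDK's metrics exports, but its exact placement isn't shown here).

```ts
// Structural stand-ins mirroring the SDK surface used in the example above.
// modelSurprisal() as a Branch instance method is an assumption.
interface BranchLike {
  disposed: boolean;
  fork(): Promise<BranchLike>;
  produce(): { token: number; isStop: boolean };
  accept(token: number): void;
  prune(): Promise<void>;
  modelSurprisal(): number; // surprisal (nats) of the last accepted token
}
interface StoreLike {
  commit(items: Array<[BranchLike, number]>): Promise<void>;
}

// Fork n candidates, generate in lockstep, score each completion by mean
// surprisal, and return the token sequence of the most probable candidate.
async function bestOfN(
  root: BranchLike,
  store: StoreLike,
  n: number,
): Promise<number[]> {
  type Cand = { b: BranchLike; tokens: number[]; surprisal: number };
  const cands: Cand[] = [];
  for (let i = 0; i < n; i++) {
    cands.push({ b: await root.fork(), tokens: [], surprisal: 0 });
  }
  for (;;) {
    const live = cands.filter((c) => !c.b.disposed);
    if (!live.length) break;
    const items: Array<[BranchLike, number]> = [];
    for (const c of live) {
      const { token, isStop } = c.b.produce();
      if (isStop) { await c.b.prune(); continue; } // finished candidate
      c.b.accept(token);
      c.tokens.push(token);
      c.surprisal += c.b.modelSurprisal();
      items.push([c.b, token]);
    }
    if (items.length) await store.commit(items); // one GPU call per step
  }
  // Lowest mean surprisal == highest average token probability.
  cands.sort(
    (x, y) =>
      x.surprisal / Math.max(1, x.tokens.length) -
      y.surprisal / Math.max(1, y.tokens.length),
  );
  return cands[0].tokens;
}
```

Lowest mean surprisal corresponds to the highest average token probability under the model, which is one reasonable (though not the only) selection criterion.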
`createContext` returns a `SessionContext` — the native interface to llama.cpp. You can use it directly without the SDK's `Branch`/`BranchStore` layer:
```ts
import { createContext } from "@lloyal-labs/lloyal.node";

const ctx = await createContext({ modelPath: "./model.gguf", nSeqMax: 4 });

// Chat templates — model-agnostic formatting + tool calling
const { prompt, grammar, format } = await ctx.formatChat(messages, {
  addGenerationPrompt: true,
  tools: [{ type: "function", function: { name: "search", parameters: schema } }],
});
const { content, toolCalls } = await ctx.parseChatOutput(output, format);

// Branch primitives — what the SDK's Branch class wraps
const handle = ctx._branchCreate(0, samplerParams);
await ctx._branchPrefill(handle, tokens);
const token = ctx._branchSample(handle);
const text = ctx.tokenToText(token);
const isStop = ctx.isStopToken(token);
ctx._branchAccept(handle, token);
const logits = ctx._branchGetLogits(handle); // Float32Array(vocabSize)
const entropy = ctx._branchModelEntropy(handle);
const child = ctx._branchFork(handle);

// Store primitives — what the SDK's BranchStore wraps
await ctx._storeCommit([handle1, handle2], [tok1, tok2]); // N branches, 1 GPU call
await ctx._storePrefill([handle], [tokens]);
await ctx._storeRetainOnly(winner);
const available = ctx._storeAvailable();

// KV cache — snapshot, copy, persist
await ctx.kvSeqCopy(0, 1); // share prefix across sequences
await ctx.kvCacheSave(); // snapshot for rollback
await ctx.kvCacheLoad(); // restore checkpoint
await ctx.kvCacheWriteFile("cache.bin"); // persist to disk

// Embeddings
const embeddings = await ctx.encode("query text");
const dim = ctx.getEmbeddingDimension();

// Grammar + tokenizer
const gbnf = await ctx.jsonSchemaToGrammar(schema); // renamed to avoid clashing with `grammar` above
const promptTokens = await ctx.tokenize("Hello world");
const sep = await ctx.getTurnSeparator();
```

Native-only (not in SDK):
- `createContext(options)` — load a GGUF model, return a `SessionContext`
- `loadBinary(options?)` — explicit GPU variant selection with automatic fallback
- Prebuilt binaries for 13 platform/GPU combinations
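The `_branch*` and `_store*` primitives compose into a complete single-branch generation loop. Here is a sketch against a structural stand-in for `SessionContext`; the method names follow the listing above, but the exact signatures — and whether a `_storeCommit` step is required per token for a lone branch — are assumptions.

```ts
// CtxLike mirrors the SessionContext methods shown above; signatures are assumed.
interface CtxLike {
  tokenize(text: string): Promise<number[]>;
  _branchCreate(seqId: number, samplerParams?: unknown): number;
  _branchPrefill(handle: number, tokens: number[]): Promise<void>;
  _branchSample(handle: number): number;
  _branchAccept(handle: number, token: number): void;
  _storeCommit(handles: number[], tokens: number[]): Promise<void>;
  tokenToText(token: number): string;
  isStopToken(token: number): boolean;
}

// Greedy-style generation: sample, stop-check, accept, commit, detokenize.
async function generateText(
  ctx: CtxLike,
  prompt: string,
  maxTokens = 256,
): Promise<string> {
  const handle = ctx._branchCreate(0);
  await ctx._branchPrefill(handle, await ctx.tokenize(prompt));
  let out = "";
  for (let i = 0; i < maxTokens; i++) {
    const token = ctx._branchSample(handle); // sample next token
    if (ctx.isStopToken(token)) break;       // EOS / stop token ends the stream
    ctx._branchAccept(handle, token);        // advance the branch state
    await ctx._storeCommit([handle], [token]); // decode step (1 branch, 1 call)
    out += ctx.tokenToText(token);
  }
  return out;
}
```

This mirrors the accept-then-commit ordering used in the multi-branch SDK example earlier in this README.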
Re-exported from `@lloyal-labs/sdk`:
- `Branch`, `BranchStore`, `Session`, `Rerank`
- Per-token metrics: `modelEntropy()`, `modelSurprisal()`, `samplingPerplexity`
- Chat formatting: `formatChat()`, `parseChatOutput()`
- Grammar: `jsonSchemaToGrammar()`, `setGrammar()`
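These metrics relate by a standard identity: perplexity is the exponential of mean per-token surprisal (in nats). A pure helper — hypothetical name, not part of the SDK — for aggregating per-token surprisal readings:

```ts
// Perplexity from per-token surprisals, assuming natural-log (nats) base:
//   PPL = exp( (1/N) * sum_i surprisal_i )
function perplexityFromSurprisals(surprisals: number[]): number {
  if (surprisals.length === 0) return NaN; // undefined for an empty sequence
  const mean = surprisals.reduce((acc, s) => acc + s, 0) / surprisals.length;
  return Math.exp(mean);
}
```

A model that assigned uniform probability over 4 tokens at every step (surprisal ln 4 each) would score a perplexity of exactly 4.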
Re-exported from `@lloyal-labs/lloyal-agents`:
- `runAgents`, `useAgentPool`, `generate`, `diverge`, `createToolkit`
- Structured concurrency DAG via Effection generators
- In-loop orchestration: agents as branches of a single running process
```ts
import { loadBinary, createContext } from "@lloyal-labs/lloyal.node";

// Automatic — uses Metal on macOS, CPU elsewhere
const ctx = await createContext({ modelPath: "./model.gguf" });

// Explicit CUDA
const binding = loadBinary({ gpuVariant: "cuda" });
const cudaCtx = await binding.createContext({ modelPath: "./model.gguf" });
// Falls back to CPU with a warning if the CUDA runtime is not available
```

| Example | Pattern |
|---|---|
| `entropy/` | `modelEntropy()` mid-generation as control signal |
| `chat/` | Interactive streaming chat |
| `embed/` | Text embeddings extraction |
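The `embed/` example pairs naturally with a similarity measure: `encode()` returns a vector, and comparing two embeddings is plain arithmetic. A generic cosine-similarity helper (not part of the package API):

```ts
// Cosine similarity between two equal-length embedding vectors.
// Returns a value in [-1, 1]; 1 means identical direction.
function cosineSimilarity(
  a: Float32Array | number[],
  b: Float32Array | number[],
): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom; // zero vector => similarity 0
}
```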
```sh
npx tsx examples/best-of-n/best-of-n.ts
npx tsx examples/chat/chat.ts ./model.gguf
```

Integration tests run real inference across architectures:
| Architecture | Test Model | Template |
|---|---|---|
| Llama | Llama 3.2 1B | llama3 |
| Phi | Phi 3.5 Mini | phi3 |
| Qwen | Qwen 3 1.7B | chatml |
| Gemma | Gemma 3 1B | gemma |
| SmolLM | SmolLM2 1.7B | chatml |
| Ministral | Ministral 3B | mistral |
See distribution.md for details.
| Package | Description |
|---|---|
| `@lloyal-labs/sdk` | Backend-agnostic inference primitives (Branch, BranchStore, Session, Rerank) |
| `@lloyal-labs/lloyal-agents` | Multi-agent framework — in-loop orchestration via structured concurrency |
| liblloyal | Header-only C++20 inference kernel for llama.cpp |
| lloyal.node | This package — native backend + prebuilt binaries |
| nitro-llama | React Native backend via Nitro Modules |
| tsampler | Reference sampler implementation |
See CONTRIBUTING.md for development setup and release process.
Apache 2.0 — See LICENSE for details.