Native backend for the lloyal inference platform.
Prebuilt llama.cpp binaries for 13 platform/GPU combinations, exposing a `SessionContext` that powers the `@lloyal-labs/sdk` inference primitives (`Branch`, `BranchStore`, `Session`, `Rerank`) and the `@lloyal-labs/lloyal-agents` multi-agent framework. Built on liblloyal, a header-only C++20 inference kernel for llama.cpp.

All SDK and agent exports are re-exported from this package for convenience — `import { Branch, runAgents } from "@lloyal-labs/lloyal.node"` works out of the box.
```sh
npm install @lloyal-labs/lloyal.node
```

Prebuilt binaries for 13 platform/GPU combinations. GPU selection happens at runtime, not install time.
| Platform | Arch | Acceleration |
|---|---|---|
| macOS | arm64 | Metal |
| macOS | x64 | CPU |
| Linux | x64 | CPU / CUDA / Vulkan |
| Linux | arm64 | CPU / CUDA / Vulkan |
| Windows | x64 | CPU / CUDA / Vulkan |
| Windows | arm64 | CPU / Vulkan |
```ts
import { createContext } from "@lloyal-labs/lloyal.node";
import { Branch, BranchStore } from "@lloyal-labs/sdk";

const ctx = await createContext({ modelPath: "./model.gguf", nSeqMax: 4 });
const store = new BranchStore(ctx);

const root = Branch.create(ctx, 0, { temperature: 0.8 });
await root.prefill(await ctx.tokenize("Explain quantum entanglement"));

// Fork and generate — all branches in lockstep, 1 GPU call per step
const branches = await Promise.all([root.fork(), root.fork(), root.fork()]);

for (;;) {
  const live = branches.filter((b) => !b.disposed);
  if (!live.length) break;

  const produced = live.map((b) => ({ b, ...b.produce() }));
  for (const p of produced.filter((p) => p.isStop)) await p.b.prune();

  const items = produced
    .filter((p) => !p.isStop)
    .map((p) => {
      p.b.accept(p.token);
      return [p.b, p.token];
    });
  await store.commit(items);
}
```

Or for single-branch generation, `Branch` is an async iterable:

```ts
for await (const { token, text } of branch) {
  process.stdout.write(text);
}
```

See `@lloyal-labs/sdk` for the full Branch API, continuous tree batching, KV tenancy, and topology documentation.
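The lockstep loop above extends naturally to best-of-N selection: fork N candidates, score each completion, keep the best. Below is a minimal sketch against structural stand-ins for the `Branch`/`BranchStore` surface shown above; treating `modelSurprisal()` as a per-token `Branch` method is an assumption (the name appears in the SDK's metrics exports, but its exact placement isn't shown here).

```ts
// Structural stand-ins mirroring the SDK surface used in the example above.
// modelSurprisal() as a Branch instance method is an assumption.
interface BranchLike {
  disposed: boolean;
  fork(): Promise<BranchLike>;
  produce(): { token: number; isStop: boolean };
  accept(token: number): void;
  prune(): Promise<void>;
  modelSurprisal(): number; // surprisal (nats) of the last accepted token
}
interface StoreLike {
  commit(items: Array<[BranchLike, number]>): Promise<void>;
}

// Fork n candidates, generate in lockstep, score each completion by mean
// surprisal, and return the token sequence of the most probable candidate.
async function bestOfN(
  root: BranchLike,
  store: StoreLike,
  n: number,
): Promise<number[]> {
  type Cand = { b: BranchLike; tokens: number[]; surprisal: number };
  const cands: Cand[] = [];
  for (let i = 0; i < n; i++) {
    cands.push({ b: await root.fork(), tokens: [], surprisal: 0 });
  }
  for (;;) {
    const live = cands.filter((c) => !c.b.disposed);
    if (!live.length) break;
    const items: Array<[BranchLike, number]> = [];
    for (const c of live) {
      const { token, isStop } = c.b.produce();
      if (isStop) { await c.b.prune(); continue; } // finished candidate
      c.b.accept(token);
      c.tokens.push(token);
      c.surprisal += c.b.modelSurprisal();
      items.push([c.b, token]);
    }
    if (items.length) await store.commit(items); // one GPU call per step
  }
  // Lowest mean surprisal == highest average token probability.
  cands.sort(
    (x, y) =>
      x.surprisal / Math.max(1, x.tokens.length) -
      y.surprisal / Math.max(1, y.tokens.length),
  );
  return cands[0].tokens;
}
```

Lowest mean surprisal corresponds to the highest average token probability under the model, which is one reasonable (though not the only) selection criterion.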
`createContext` returns a `SessionContext` — the native interface to llama.cpp. You can use it directly without the SDK's `Branch`/`BranchStore` layer:
```ts
import { createContext } from "@lloyal-labs/lloyal.node";

const ctx = await createContext({ modelPath: "./model.gguf", nSeqMax: 4 });

// Chat templates — model-agnostic formatting + tool calling
const { prompt, grammar, format } = await ctx.formatChat(messages, {
  addGenerationPrompt: true,
  tools: [{ type: "function", function: { name: "search", parameters: schema } }],
});
const { content, toolCalls } = await ctx.parseChatOutput(output, format);

// Branch primitives — what the SDK's Branch class wraps
const handle = ctx._branchCreate(0, samplerParams);
await ctx._branchPrefill(handle, tokens);
const token = ctx._branchSample(handle);
const text = ctx.tokenToText(token);
const isStop = ctx.isStopToken(token);
ctx._branchAccept(handle, token);
const logits = ctx._branchGetLogits(handle); // Float32Array(vocabSize)
const entropy = ctx._branchModelEntropy(handle);
const child = ctx._branchFork(handle);

// Store primitives — what the SDK's BranchStore wraps
await ctx._storeCommit([handle1, handle2], [tok1, tok2]); // N branches, 1 GPU call
await ctx._storePrefill([handle], [tokens]);
await ctx._storeRetainOnly(winner);
const available = ctx._storeAvailable();

// KV cache — snapshot, copy, persist
await ctx.kvSeqCopy(0, 1); // share prefix across sequences
await ctx.kvCacheSave(); // snapshot for rollback
await ctx.kvCacheLoad(); // restore checkpoint
await ctx.kvCacheWriteFile("cache.bin"); // persist to disk

// Embeddings
const embeddings = await ctx.encode("query text");
const dim = ctx.getEmbeddingDimension();

// Grammar + tokenizer
const gbnf = await ctx.jsonSchemaToGrammar(schema); // renamed to avoid clashing with `grammar` above
const promptTokens = await ctx.tokenize("Hello world");
const sep = await ctx.getTurnSeparator();
```

Native-only (not in SDK):
- `createContext(options)` — load a GGUF model, return a `SessionContext`
- `loadBinary(options?)` — explicit GPU variant selection with automatic fallback
- Prebuilt binaries for 13 platform/GPU combinations
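The `_branch*` and `_store*` primitives compose into a complete single-branch generation loop. Here is a sketch against a structural stand-in for `SessionContext`; the method names follow the listing above, but the exact signatures — and whether a `_storeCommit` step is required per token for a lone branch — are assumptions.

```ts
// CtxLike mirrors the SessionContext methods shown above; signatures are assumed.
interface CtxLike {
  tokenize(text: string): Promise<number[]>;
  _branchCreate(seqId: number, samplerParams?: unknown): number;
  _branchPrefill(handle: number, tokens: number[]): Promise<void>;
  _branchSample(handle: number): number;
  _branchAccept(handle: number, token: number): void;
  _storeCommit(handles: number[], tokens: number[]): Promise<void>;
  tokenToText(token: number): string;
  isStopToken(token: number): boolean;
}

// Greedy-style generation: sample, stop-check, accept, commit, detokenize.
async function generateText(
  ctx: CtxLike,
  prompt: string,
  maxTokens = 256,
): Promise<string> {
  const handle = ctx._branchCreate(0);
  await ctx._branchPrefill(handle, await ctx.tokenize(prompt));
  let out = "";
  for (let i = 0; i < maxTokens; i++) {
    const token = ctx._branchSample(handle); // sample next token
    if (ctx.isStopToken(token)) break;       // EOS / stop token ends the stream
    ctx._branchAccept(handle, token);        // advance the branch state
    await ctx._storeCommit([handle], [token]); // decode step (1 branch, 1 call)
    out += ctx.tokenToText(token);
  }
  return out;
}
```

This mirrors the accept-then-commit ordering used in the multi-branch SDK example earlier in this README.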
Re-exported from `@lloyal-labs/sdk`:
- `Branch`, `BranchStore`, `Session`, `Rerank`
- Per-token metrics: `modelEntropy()`, `modelSurprisal()`, `samplingPerplexity`
- Chat formatting: `formatChat()`, `parseChatOutput()`
- Grammar: `jsonSchemaToGrammar()`, `setGrammar()`
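These metrics relate by a standard identity: perplexity is the exponential of mean per-token surprisal (in nats). A pure helper — hypothetical name, not part of the SDK — for aggregating per-token surprisal readings:

```ts
// Perplexity from per-token surprisals, assuming natural-log (nats) base:
//   PPL = exp( (1/N) * sum_i surprisal_i )
function perplexityFromSurprisals(surprisals: number[]): number {
  if (surprisals.length === 0) return NaN; // undefined for an empty sequence
  const mean = surprisals.reduce((acc, s) => acc + s, 0) / surprisals.length;
  return Math.exp(mean);
}
```

A model that assigned uniform probability over 4 tokens at every step (surprisal ln 4 each) would score a perplexity of exactly 4.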
Re-exported from `@lloyal-labs/lloyal-agents`:
- `runAgents`, `useAgentPool`, `generate`, `diverge`, `createToolkit`
- Structured concurrency DAG via Effection generators
- In-loop orchestration: agents as branches of a single running process
```ts
import { loadBinary, createContext } from "@lloyal-labs/lloyal.node";

// Automatic — uses Metal on macOS, CPU elsewhere
const ctx = await createContext({ modelPath: "./model.gguf" });

// Explicit CUDA
const binding = loadBinary({ gpuVariant: "cuda" });
const cudaCtx = await binding.createContext({ modelPath: "./model.gguf" });
// Falls back to CPU with a warning if the CUDA runtime is not available
```

| Example | Pattern |
|---|---|
| `entropy/` | `modelEntropy()` mid-generation as control signal |
| `chat/` | Interactive streaming chat |
| `embed/` | Text embeddings extraction |
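The `embed/` example pairs naturally with a similarity measure: `encode()` returns a vector, and comparing two embeddings is plain arithmetic. A generic cosine-similarity helper (not part of the package API):

```ts
// Cosine similarity between two equal-length embedding vectors.
// Returns a value in [-1, 1]; 1 means identical direction.
function cosineSimilarity(
  a: Float32Array | number[],
  b: Float32Array | number[],
): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom; // zero vector => similarity 0
}
```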
```sh
npx tsx examples/best-of-n/best-of-n.ts
npx tsx examples/chat/chat.ts ./model.gguf
```

Integration tests run real inference across architectures:
| Architecture | Test Model | Template |
|---|---|---|
| Llama | Llama 3.2 1B | llama3 |
| Phi | Phi 3.5 Mini | phi3 |
| Qwen | Qwen 3 1.7B | chatml |
| Gemma | Gemma 3 1B | gemma |
| SmolLM | SmolLM2 1.7B | chatml |
| Ministral | Ministral 3B | mistral |
See distribution.md for details.
| Package | Description |
|---|---|
| `@lloyal-labs/sdk` | Backend-agnostic inference primitives (Branch, BranchStore, Session, Rerank) |
| `@lloyal-labs/lloyal-agents` | Multi-agent framework — in-loop orchestration via structured concurrency |
| liblloyal | Header-only C++20 inference kernel for llama.cpp |
| lloyal.node | This package — native backend + prebuilt binaries |
| nitro-llama | React Native backend via Nitro Modules |
| tsampler | Reference sampler implementation |
See CONTRIBUTING.md for development setup and release process.
Apache 2.0 — See LICENSE for details.