The rampart-langtools modules¶
Preface¶
Acknowledgment¶
The rampart-langtools package provides three modules built on best-in-class machine-learning libraries:
- The rampart-llamacpp module is built on llama.cpp, the C/C++ LLM inference engine created by Georgi Gerganov and contributors.
- The rampart-faiss module is built on FAISS, the library for efficient similarity search and clustering of dense vectors from Meta AI Research.
- The rampart-sentencepiece module is built on SentencePiece, the unsupervised text tokenizer from Google.
The authors of Rampart extend their thanks to the authors and contributors of each of these libraries.
License¶
The llama.cpp library is licensed under the MIT License. The FAISS library is licensed under the MIT License. The SentencePiece library is licensed under the Apache 2.0 License.
The rampart-langtools modules are released under the MIT license.
What do they do?¶
Together the three modules provide the building blocks for semantic search and local LLM inference inside Rampart:
- rampart-llamacpp runs GGUF models directly inside the rampart process: text embedding (initEmbed), reranking (initRerank) and text generation (initGen).
- rampart-faiss builds, trains, saves and searches vector indexes, from small exact-search indexes to compressed indexes holding hundreds of millions of vectors.
- rampart-sentencepiece tokenizes text into subword pieces (and back) using a SentencePiece model.
A typical pipeline embeds documents with rampart-llamacpp, stores the vectors in a rampart-faiss index (and/or a rampart-sql table), searches that index with an embedded query vector, and optionally reranks the results with a reranking model.
Note that the rampart-sql module has this pipeline built in: its embed() SQL function generates
vectors in the SQL engine using this package’s rampart-llamacpp module, and its CREATE VECTOR
INDEX / LIKEV similarity search uses a FAISS-backed (IVFPQ) index internally. When
vectors live in a SQL table, that integrated path is usually the simplest choice — see
Vector Search in the rampart-sql documentation. The modules documented
here are for using the same engines directly: custom or standalone faiss indexes, embedding
outside the SQL engine, reranking and text generation.
Platform Availability¶
The rampart-langtools modules are not available on the following platforms:
- macOS x86_64 (Intel Macs).
- 32-bit ARM Linux (the
raspberry_pi_os-buster-armv7lbuild).
On macOS (Apple Silicon):
- initEmbed requires macOS 12 (Monterey) or later.
-
initGen requires macOS 15 (Sequoia) or
later, due to a Metal regression in older versions of macOS. The requirement is checked at
runtime and
initGenwill throw a descriptive error on macOS 14 and below.
On Linux, the modules are available in CPU and CUDA (GPU) builds. GPU-only features (such as the faiss idx.enableGpu() function) are noted where applicable.
The rampart-llamacpp module¶
Loading the module is a simple matter of using the require() function:
var llamacpp = require("rampart-llamacpp");
Models are standard GGUF files, as downloaded from, e.g., Hugging Face.
modelInfo¶
The
modelInfofunction returns a model’s key parameters by reading only its GGUF metadata and vocabulary. It does not load the weight tensors, so there is no GPU upload and the call is fast even for multi-gigabyte models.Usage:
var llamacpp = require("rampart-llamacpp"); var info = llamacpp.modelInfo(path);Where
pathis a String, the path to a.ggufmodel file.
- Return Value:
An Object with the following properties:
embedDim- A Number, the size of the vector that initEmbed’s embedding functions produce. It prefers the GGUFembedding_length_outof a projection head, falling back toembedding_length.hiddenDim- A Number, the model hidden size.nCtxTrain- A Number, the trained context length.nLayer- A Number, the number of layers.arch- A String, the GGUFgeneral.architecture(e.g."bert","llama").pooling- A String, the model’s declared pooling type:"none","mean","cls","last","rank"or"unspecified".nParams- A Number, the parameter count.Example:
var llamacpp = require("rampart-llamacpp"); var info = llamacpp.modelInfo("bge-m3-FP16.gguf"); /* { embedDim: 1024, hiddenDim: 1024, nCtxTrain: 8192, nLayer: 24, arch: "bert", pooling: "cls", nParams: 566703104 } */ /* size vector storage from the model itself: */ var vecDim = info.embedDim;
initEmbed¶
The
initEmbedfunction loads an embedding model and returns a handle used to convert text into semantic vectors.Usage:
var llamacpp = require("rampart-llamacpp"); var emb = llamacpp.initEmbed(path[, options]);Where:
pathis a String, the path to a.ggufembedding model.
optionsis an optional Object accepting all of the settings in Common Model and Context Options below, plus the following embedding-specific options:
pooling- A String, one of"none","mean","cls","last"or"rank"(--poolingin llama-server). Default:"mean".attention- A String,"causal"or"non-causal"(--attention).The legacy option names
nctx,ubatch,nthreadsandnthreads_batchare accepted as aliases fornCtx,nUBatch,threadsandthreadsBatch.If
nCtxis not given, the context size defaults to the model’s trained maximum, capped at 8192 tokens.Model weights are shared: loading the same model file from several handles or rampart threads keeps a single copy of the weights in memory. A handle that is copied to another rampart thread transparently builds its own per-thread context on first use.
- Note:
- The rampart-sql
embed()SQL function runs this same engine inside the SQL module:sql.set({llamaEmbed: '/path/to/model.gguf'})loads the model through rampart-llamacpp andembed(?)then produces the same vectors as emb.embedTextToFp16Buf()’savgVec. When embedding rows of a SQL table, that path avoids round-trips through JavaScript — see Generating embeddings and the llamaEmbed property.- Return Value:
- An Object (the embedding handle) with the functions emb.embedTextToFp16Buf(), emb.embedTextToFp32Buf(), emb.embedTextToNumbers() and emb.destroy() documented below.
Example:
var llamacpp = require("rampart-llamacpp"); var emb = llamacpp.initEmbed("all-minilm-l6-v2_f16.gguf"); var v = emb.embedTextToFp16Buf("about a paragraph of text ..."); /* v = { vecs: [vec1, vec2, ...], avgVec: avgOfVecs } If the text fits the model's context window, v.vecs.length == 1 and v.vecs[0] == v.avgVec. */ /* store the vector, e.g., in a sql table */ sql.exec("insert into vecs values(?,?,?)", [v.avgVec, docId, text]); /* unload when no longer needed */ emb.destroy();
emb.embedTextToFp16Buf()¶
Embed text and return the vector(s) packed as 16-bit floats (little-endian) in a Buffer. This is the most compact format and is directly usable by rampart.vector functions, the rampart-sql
vecdist()function and the faiss idx.addFp16() / idx.searchFp16() functions.Usage:
var ret = emb.embedTextToFp16Buf(text);Where
textis a String (or a Buffer containing text) to be embedded.If the tokenized text does not fit the model’s context window, it is split into overlapping chunks (one eighth of a window of overlap) and one vector is produced per chunk. Each vector is L2-normalized.
- Return Value:
An Object with the following properties:
vecs- An Array of Buffers, one vector per chunk (2 * embedDimbytes each).avgVec- A Buffer. If only one chunk was produced, the same vector asvecs[0]. Otherwise the re-normalized average of all the chunk vectors, suitable as a single whole-document vector.For empty or whitespace-only input,
vecsis an empty Array andavgVecis not set.
emb.embedTextToFp32Buf()¶
The same as emb.embedTextToFp16Buf() except that vectors are packed as 32-bit floats (
4 * embedDimbytes per vector).Usage:
var ret = emb.embedTextToFp32Buf(text);
emb.embedTextToNumbers()¶
The same as emb.embedTextToFp16Buf() except that each vector is returned as an Array of Numbers rather than a packed Buffer.
Usage:
var ret = emb.embedTextToNumbers(text);
emb.destroy()¶
Free the model context (and release the model weights if this was the last handle using them). Using the handle after calling
destroy()throws an error. Handles are also freed automatically when garbage collected, but for large models it is good practice to free them deterministically.Usage:
emb.destroy();
initRerank¶
The
initRerankfunction loads a reranking model (such as bge-reranker-v2-m3) and returns a handle used to score how well documents answer a query. Reranking is commonly applied to the top results of a vector search to improve final ordering.Usage:
var llamacpp = require("rampart-llamacpp"); var rr = llamacpp.initRerank(path[, options]);Where:
pathis a String, the path to a.ggufreranking model.optionsis an optional Object accepting the same settings as initEmbed above. For reranking,poolingdefaults to"rank", the context size defaults to the model’s trained maximum capped at 1024 tokens, andnUBatchdefaults to 512. Input longer thannUBatchtokens is truncated.
- Return Value:
- An Object (the reranker handle) with the functions rr.rerank() and
destroy()(as in emb.destroy()).Example:
var llamacpp = require("rampart-llamacpp"); var rr = llamacpp.initRerank("bge-reranker-v2-m3-Q8_0.gguf"); var question = "How tall is the Eiffel Tower?"; /* score a single document */ var score = rr.rerank(question, "The Eiffel Tower is 330 metres tall."); /* score several documents at once */ var scored = rr.rerank(question, [ "The Eiffel Tower is 330 metres tall.", "Gustave Eiffel also designed bridges.", "Paris is the capital of France." ]); /* [ { document: "The Eiffel Tower is ...", score: 4.21 }, { document: "Gustave Eiffel also ...", score: -1.7 }, ... ] */ rr.destroy();
rr.rerank()¶
Score one or more documents against a query.
Usage:
var score = rr.rerank(query, document); var scored = rr.rerank(query, documents[, scoresOnly]);Where:
queryis a String, the question or search text.document/documentsis a String (a single document) or an Array of Strings.scoresOnlyis an optional Boolean (only meaningful with an Array). Defaultfalse.
- Return Value:
- Given a single String document: a Number, the relevance score. Higher is more relevant; the range depends on the model (scores are typically logits, not probabilities).
- Given an Array of documents: an Array of Objects, each
{document: String, score: Number}, in the same order as the input.- Given an Array and
scoresOnly=true: an Array of Numbers.
initGen¶
The
initGenfunction loads a text-generation model and returns a handle for synchronous and streaming generation.Experimental.
initGen,predictandpredictAsyncare under active development and the API may change. Vision / multimodal input (mmproj) is not supported.
initGenruns a single shared, continuously-batched engine on a dedicated rampart thread. When the returned handle is shared across rampart threads (e.g. server threads), their requests are transparently pooled into that one engine: one copy of the model in memory, batched decoding.nSeqMaxsets how many requests may decode together.On macOS,
initGenrequires macOS 15 (Sequoia) or later — see Platform Availability.Usage:
var llamacpp = require("rampart-llamacpp"); var gen = llamacpp.initGen(path[, options]);Where:
pathis a String, the path to a.gguftext-generation model.optionsis an optional Object accepting all of the settings in Common Model and Context Options below, plus the following generation-specific options:
jinja- A Boolean, whether to apply the model’s chat template via Jinja whenmessagesare given to gen.predict(). Default:true.chatTemplate- A String, a custom Jinja chat template overriding the model’s built-in template.chatTemplateFile- A String, a file from which to read the custom chat template.
- Return Value:
- An Object (the gen handle) with the properties
nCtxandnVocab(Numbers, the resolved context size and vocabulary size) and the functions gen.predict(), gen.predictAsync(), gen.getLast() and gen.destroy().Example:
var llamacpp = require("rampart-llamacpp"); var gen = llamacpp.initGen("gemma-3-4b-it-Q4_K_M.gguf", { nCtx: 4096, nSeqMax: 4 /* up to 4 requests batched together */ }); /* synchronous: blocks and returns the full text */ var text = gen.predict({ messages: [ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "What is the capital of France?" } ], maxTokens: 128 }); rampart.utils.printf("%s\n", text); /* streaming: tokens are delivered as they are produced */ var h = gen.predictAsync( { prompt: "Explain how a combustion engine works.", maxTokens: 256, temp: 0.7 }, function(res) { /* per token */ if (!res.done && !res.error) rampart.utils.printf("%s", res.token); }, function(res) { /* done: res.fullText, res.error */ rampart.utils.printf("\n[done]\n"); } ); /* h.cancel(); -- stop this generation early */ gen.destroy();
gen.predict()¶
Generate text synchronously. The call blocks the current thread until generation completes and returns the full generated text. On a server, this blocks the worker thread’s event loop — use gen.predictAsync() in hot handlers.
Usage:
var text = gen.predict(options);Where
optionsis an Object with the following properties (one ofpromptormessagesis required):
prompt- A String, a plain text prompt.messages- An Array of{role: String, content: String}Objects, chat-style messages. The model’s chat template is applied (see thejinjaoption of initGen).maxTokens- A Number, the maximum number of tokens to generate. Default:512.temp- A Number, the sampling temperature.topP- A Number, top-p (nucleus) sampling.topK- A Number, top-k sampling.minP- A Number, min-p sampling.repeatPenalty- A Number, the repetition penalty.repeatLastN- A Number, how many recent tokens the repetition penalty considers.seed- A Number, the RNG seed. If not given, a random seed is used per request.stop- An Array of Strings; generation stops when any of them is produced.addAssistant- A Boolean, whether to append the assistant generation prompt when applying a chat template. Default:true.Unset sampling options use the model/engine defaults.
- Return Value:
- A String, the full generated text. If the engine reports an error, the returned string is
"[gen err:<message>]".
gen.predictAsync()¶
Generate text asynchronously, streaming tokens as they are produced. The call returns immediately; callbacks fire from the event loop. Multiple in-flight calls (across threads, or from one event loop) batch together through the shared engine.
Usage:
var h = gen.predictAsync(options, perToken[, final]);Where:
optionsis the same Object accepted by gen.predict().perTokenis a Function, called once per generated token with an Object:
token- A String, the token text.done- A Boolean,falsefor token callbacks.finalis an optional Function, called once when the generation ends, with an Object:
fullText- A String, the complete generated text.error- A String, set if the generation failed.
- Return Value:
- An Object with a single function
cancel(), which stops this generation early and frees its slot in the engine.
gen.getLast()¶
Return the full text of the last completed gen.predict() call made through this handle on the current thread.
Usage:
var text = gen.getLast();
gen.destroy()¶
Shut down the shared generation engine and free the model context. The handle (and any copies of it on other threads) must not be used afterwards.
Usage:
gen.destroy();
Common Model and Context Options¶
initEmbed, initRerank and initGen accept a common set of model-loading and context options. Most map 1:1 onto the matching
llama-servercommand-line flag (the camelCase of the flag); see the llama.cpp server documentation for full descriptions of each.
Option llama-server flag Value gpuLayers--gpu-layersNumber. Layers to offload to the GPU ( -1= all).mainGpu--main-gpuNumber. GPU device to use. splitMode--split-modeString: "none","layer"or"row".useMmap--no-mmapBoolean. Memory-map the model file. useMlock--mlockBoolean. Lock the model in RAM. checkTensors--check-tensorsBoolean. Validate tensor data while loading. nCtx--ctx-sizeNumber. Context size in tokens ( 0or-1= the model’s trained maximum).nBatch--batch-sizeNumber. Logical batch size. nUBatch--ubatch-sizeNumber. Physical (micro-)batch size. nSeqMax--parallelNumber. Maximum parallel sequences (initGen: how many requests decode together). threads--threadsNumber. Threads for generation. threadsBatch--threads-batchNumber. Threads for batch/prompt processing. flashAttn--flash-attnBoolean or String: "on","off"or"auto".cacheTypeK--cache-type-kString. KV cache type for K: "f32","f16","bf16","q8_0","q4_0","q4_1","q5_0","q5_1"or"iq4_nl".cacheTypeV--cache-type-vString. KV cache type for V (same values). offloadKqv--no-kv-offloadBoolean. Offload the KV cache to the GPU. offloadKQVis accepted as an alias.opOffload--op-offloadBoolean. Offload host-tensor operations. kvUnified--kv-unifiedBoolean. Use a unified KV cache. ropeScaling--rope-scalingString: "none","linear","yarn"or"longrope".ropeFreqBase--rope-freq-baseNumber. ropeFreqScale--rope-freq-scaleNumber. yarnExtFactor--yarn-ext-factorNumber. yarnAttnFactor--yarn-attn-factorNumber. yarnBetaFast--yarn-beta-fastNumber. yarnBetaSlow--yarn-beta-slowNumber. yarnOrigCtx--yarn-orig-ctxNumber.
getLog¶
llama.cpp produces log output during model loading and initialization. This output is captured in an internal buffer rather than printed to stdout/stderr.
getLogretrieves the captured log.Usage:
var llamacpp = require("rampart-llamacpp"); var emb = llamacpp.initEmbed("all-minilm-l6-v2_f16.gguf"); var log = llamacpp.getLog(); console.log(log);
- Return Value:
- A String, the captured log output.
- Note:
- The log buffer has a maximum size of 40KB. If it overflows, the oldest half of the log is discarded and the first line will read
WARN: log overflow. The log buffer is process-global (all threads write to the same, mutex-protected buffer), butgetLog/resetLogare only serviceable from the module object of the thread that first loaded the module; on other threads they throw.
Environment Variables¶
RAMPART_LLAMA_CUDA_GRAPHS- On CUDA builds, ggml caches a captured CUDA graph per compute-graph shape and only evicts entries after they have been idle for 10 seconds. Batched embedding and reranking decode a stream of varying shapes, which fills that cache faster than it drains and makes GPU memory climb until it runs out. CUDA graphs only speed up single-stream text generation, so initEmbed, initRerank and the rampart-sql embedding path disable them automatically. A process that only calls initGen never disables them, so generation performance is unaffected. SetRAMPART_LLAMA_CUDA_GRAPHS(to any value) to opt out and keep CUDA graphs on even for embedding/reranking, accepting the memory growth above. The setting is process-global and read once at startup. It has no effect on CPU or Metal (Apple) builds.RAMPART_METAL_RESIDENCY- On macOS, Metal “residency sets” are disabled by default to avoid an assertion at process exit. SetRAMPART_METAL_RESIDENCY=1to keep the feature (a marginal performance optimization) enabled.
The rampart-faiss module¶
Loading the module is a simple matter of using the require() function:
var faiss = require("rampart-faiss");
Note that the rampart-sql module’s built-in Vector Indexes (the IVFPQ
backend of CREATE
VECTOR INDEX) are also FAISS indexes,
maintained automatically as table rows change. Use rampart-faiss directly when the vectors do not
live in a SQL table, or when a custom index type, metric or training regime is needed.
openFactory¶
The
openFactoryfunction creates a new (empty) vector index from a FAISS index factory description string. See also the FAISS guidelines for choosing an index.Usage:
var faiss = require("rampart-faiss"); var idx = faiss.openFactory(description, dimensions[, metricType]);Where:
descriptionis a String, the factory description (e.g."Flat","IDMap2,Flat","IDMap2,OPQ96,IVF262144,PQ48"). It is highly recommended to includeIDMaporIDMap2so that arbitrary ids can be stored with each vector; otherwise ids are assigned sequentially starting at0.dimensionsis a positive Number, the vector dimension (e.g.384for all-minilm-l6-v2 embeddings — see modelInfo to read it from the embedding model).metricTypeis an optional String, the distance metric (case-insensitive):
"innerProduct"or"ip"- inner (dot) product (the default). Equivalent to cosine similarity when vectors are L2-normalized, as the initEmbed vectors are."l2"- squared euclidean distance."l1","manhattan"or"cityBlock"- L1 distance."linf"or"infinity"- L-infinity distance."lp"- Lp distance."canberra","brayCurtis","jensenShannon"- additional metrics supported by FAISS.
- Return Value:
- An Object (the index handle) with the property idx.settings and the functions idx.addFp32(), idx.addFp16(), idx.searchFp32(), idx.searchFp16(), idx.save() and (on CUDA builds) idx.enableGpu(). If the index type requires training, the handle additionally has the idx.trainer() function;
idx.trainerisundefinedfor index types that need no training (such asFlat).Example:
var faiss = require("rampart-faiss"); /* a small exact-search index storing our own doc ids */ var idx = faiss.openFactory("IDMap2,Flat", 4); /* add fp32 vectors (Buffer of 4 floats = 16 bytes each) */ idx.addFp32(101, new Float32Array([1,0,0,0]).buffer); idx.addFp32(102, new Float32Array([0,1,0,0]).buffer); /* search: top 2 results for a query vector */ var res = idx.searchFp32(new Float32Array([0.9,0.1,0,0]).buffer, 2); /* res = [ { id: 101, distance: 0.9 }, { id: 102, distance: 0.1 } ] */ idx.save("myindex.faiss");
openIndexFromFile¶
The
openIndexFromFilefunction loads an index previously written with idx.save().Usage:
var faiss = require("rampart-faiss"); var idx = faiss.openIndexFromFile(filename[, readOnly]);Where:
filenameis a String, the path of the saved index.readOnlyis an optional Boolean. Iftrue, the index is opened read-only and memory-mapped, serving searches directly from disk rather than loading the entire index into RAM. Default:false.
- Return Value:
- An index handle, as returned by openFactory.
Example:
var faiss = require("rampart-faiss"); var llamacpp = require("rampart-llamacpp"); var idx = faiss.openIndexFromFile("myindex.faiss", true); var emb = llamacpp.initEmbed("all-minilm-l6-v2_f16.gguf"); var v = emb.embedTextToFp16Buf("my search query"); var res = idx.searchFp16(v.avgVec, 10); res.forEach(function(r) { rampart.utils.printf("id=%s distance=%f\n", r.id, r.distance); });
idx.settings¶
A read-only Object describing the open index:
dimension- A Number, the vector dimension.count- A Number, the number of vectors currently in the index (updated by idx.addFp32() / idx.addFp16()).metricType- A String, the distance metric (see openFactory).type- A String, the detected index type (e.g."Flat","IVFFlat","IVFPQ","HNSWFlat","LSH").map- A String,"IDMap"or"IDMap2"if the index stores arbitrary ids. Not set otherwise.PQm,PQbits- Numbers, product-quantization parameters. Only set for PQ-based index types.onGpu,gpuDevice- set after a successful idx.enableGpu() call.
idx.addFp32()¶
Add one vector of 32-bit floats to the index.
Usage:
var id = idx.addFp32(id, buffer);Where:
idis a Number or String (for ids larger than the float precision of a Number), the 64-bit id to associate with the vector. Pass-1to have an id assigned sequentially (0,1,2, …). Arbitrary (non-sequential) ids require anIDMap/IDMap2index — see openFactory.bufferis a Buffer of4 * dimensionbytes: the vector packed as little-endian 32-bit floats (e.g. from emb.embedTextToFp32Buf() or aFloat32Array’sbuffer).If the index requires training (see idx.trainer()) it must be trained before vectors can be added.
- Return Value:
- A Number, the id under which the vector was stored (the passed
id, or the assigned sequential id when-1was passed).
idx.addFp16()¶
The same as idx.addFp32() except that
bufferis2 * dimensionbytes: the vector packed as little-endian 16-bit (half-precision) floats, e.g. from emb.embedTextToFp16Buf() or rampart.vector conversion functions. The vector is converted to 32-bit floats before insertion (FAISS indexes store fp32).Usage:
var id = idx.addFp16(id, buffer);
idx.searchFp32()¶
Search the index for the nearest vectors to a query vector of 32-bit floats.
Usage:
var results = idx.searchFp32(buffer[, nResults[, nProbe]]);Where:
bufferis a Buffer of4 * dimensionbytes, the query vector (see idx.addFp32()).nResultsis an optional positive Number, the maximum number of results to return. Default:10.nProbeis an optional positive Number. For IVF index types, how many inverted-list cells to probe (more = better recall, slower search). Ignored for non-IVF indexes.
- Return Value:
An Array of Objects, best match first:
id- A Number, the id stored with the vector.distance- A Number, the metric value (for the defaultinnerProductmetric, larger is more similar; forl2, smaller is more similar).Fewer than
nResultsentries are returned if the index holds fewer vectors.
idx.searchFp16()¶
The same as idx.searchFp32() except that
bufferis the query vector packed as 16-bit floats (2 * dimensionbytes).Usage:
var results = idx.searchFp16(buffer[, nResults[, nProbe]]);
idx.save()¶
Write the index to a file, which may later be loaded with openIndexFromFile. If the index has been moved to the GPU with idx.enableGpu(), it is converted back to a CPU index for saving (the in-memory index stays on the GPU).
Usage:
idx.save(filename);Where
filenameis a String, the path of the file to write.
idx.enableGpu()¶
Move the index to a GPU. Only available on CUDA builds of the module; on CPU builds the function does not exist (check with
if (idx.enableGpu)). Note that not every FAISS index type is supported on the GPU.Usage:
if (idx.enableGpu) idx.enableGpu(device);Where
deviceis an optional Number, the GPU device id. Default:0.
- Return Value:
trueupon success. After the call,idx.settings.onGpuistrueandidx.settings.gpuDeviceis set. An error is thrown on failure.
idx.trainer()¶
Index types that partition or compress the vector space (IVF, PQ, OPQ, LSH and similar) must be trained on a representative sample of vectors before any can be added. For such indexes the handle returned by openFactory includes a
trainerfunction; for index types that need no training (e.g.Flat)idx.trainerisundefined.The trainer accumulates training vectors in a file, so that millions of training vectors need not be held in memory, and so that an interrupted run can be resumed from the same file.
Usage:
if (idx.trainer) { var trainer = new idx.trainer(path); ... }Where
pathis an optional String:
- A directory: a new training file named
faisstrainingdata.<n>.<pid>is created there. Default:"/tmp".- An existing training file (from a previous run): its vectors are reloaded and
trainer.settings.loadedRowsis set to the number of vectors found. Newly added vectors are appended.- A non-existent path: it is created as a new (empty) training file.
The training file is not deleted automatically; it may be reused by a later run, or removed with rampart.utils.rmFile when no longer wanted.
- Return Value:
- An Object (the trainer handle) with the read-only property
trainFile(a String, the path of the training data file) and the functions trainer.addTrainingfp32(), trainer.addTrainingfp16() and trainer.train().Example:
var faiss = require("rampart-faiss"); /* an IVF index over 384-dim vectors: requires training */ var idx = faiss.openFactory("IDMap2,IVF4096,Flat", 384); if (idx.trainer) { var trainer = new idx.trainer("./tdata"); /* feed a representative sample of vectors */ sql.exec("select Vec from vecs", {maxRows: 1000000}, function(row) { trainer.addTrainingfp16(row.Vec); }); trainer.train(); /* may take a while */ } /* now vectors may be added */ sql.exec("select Id, Vec from vecs", {maxRows: -1}, function(row) { idx.addFp16(row.Id, row.Vec); }); idx.save("vecs-ivf4096.faiss");
trainer.addTrainingfp32()¶
Append one 32-bit float vector to the training file.
Usage:
trainer.addTrainingfp32(buffer);Where
bufferis a Buffer of4 * dimensionbytes (see idx.addFp32()).
trainer.addTrainingfp16()¶
Append one 16-bit float vector to the training file. The vector is converted to 32-bit floats before being written.
Usage:
trainer.addTrainingfp16(buffer);Where
bufferis a Buffer of2 * dimensionbytes (see idx.addFp16()).
trainer.train()¶
Train the index from all the vectors accumulated in the training file (including vectors reloaded from a previous run). Depending on the index type and the number of training vectors, training can take from seconds to many hours.
Usage:
trainer.train();An error is thrown if no vectors have been added, or if the training file’s size is not a multiple of the vector size.
The rampart-sentencepiece module¶
Loading the module is a simple matter of using the require() function:
var sp = require("rampart-sentencepiece");
init¶
The
initfunction loads a SentencePiece model file and returns a handle for encoding text into subword pieces.Usage:
var sp = require("rampart-sentencepiece"); var encoder = sp.init(path);Where
pathis a String, the path to a SentencePiece model (e.g. sentencepiece.bpe.model from the bge-m3 repository).
- Return Value:
- An Object (the encoder handle) with the function encoder.encode().
Example:
var sp = require("rampart-sentencepiece"); var encoder = sp.init("./sentencepiece.bpe.model"); var pieces = encoder.encode("hello there you goat"); /* [ "▁hell", "o", "▁there", "▁you", "▁go", "at" ] */ var text = sp.decode(pieces); /* "hello there you goat" */
encoder.encode()¶
Encode text into subword pieces using the loaded model. In the pieces, the character
▁(U+2581, “lower one eighth block”) marks the start of a word.Usage:
var pieces = encoder.encode(text[, asString]);Where:
textis a String, the text to encode.asStringis an optional Boolean. Iftrue, the pieces are returned as a single space-separated String instead of an Array. Default:false.
- Return Value:
- An Array of Strings (one per piece), or a single space-separated String if
asStringistrue. Either form is accepted by decode.
decode¶
Reassemble encoded pieces into text. Decoding is purely textual (pieces are concatenated and the
▁word-start markers become spaces), so no model handle is needed and the function lives on the module object itself.Usage:
var sp = require("rampart-sentencepiece"); var text = sp.decode(pieces);Where
piecesis an Array of Strings, or a space-separated String of pieces, as produced by encoder.encode().
- Return Value:
- A String, the decoded text.
Putting It Together¶
A compact end-to-end example: embed documents, index them, and serve semantic search queries.
rampart.globalize(rampart.utils);
var llamacpp = require("rampart-llamacpp");
var faiss = require("rampart-faiss");
var emb = llamacpp.initEmbed("all-minilm-l6-v2_f16.gguf");
var dim = llamacpp.modelInfo("all-minilm-l6-v2_f16.gguf").embedDim;
var docs = [
"The Eiffel Tower is 330 metres tall.",
"Gustave Eiffel also designed bridges.",
"Semantic search finds meaning, not just words."
];
/* build the index */
var idx = faiss.openFactory("IDMap2,Flat", dim);
docs.forEach(function(text, i) {
var v = emb.embedTextToFp16Buf(text);
idx.addFp16(i, v.avgVec);
});
/* search it */
var q = emb.embedTextToFp16Buf("How high is the Eiffel Tower?");
var res = idx.searchFp16(q.avgVec, 2);
res.forEach(function(r) {
printf("%f %s\n", r.distance, docs[r.id]);
});
For a larger, trained index (tens of millions of vectors) see the idx.trainer() example above; for improving the final ordering of the top results, see initRerank.