Readme

Part of Val Town Semantic Search.

Generates OpenAI embeddings for all public vals and stores them in Neon using the pgvector extension.

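  • Create the vals_embeddings table in Neon if it doesn't already exist. The script does not do this itself; the CREATE TABLE statement is given in a comment at the top of the code.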
  • Get all val names from the database of public vals, made by Achille Lacoin.
  • Get all val names from the vals_embeddings table and compute the difference (which ones are missing).
  • Iterate through all missing vals, get their code, get embeddings from OpenAI, and store the result in Neon.
  • The stored embeddings can then be searched using janpaul123/semanticSearchNeon.
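
The "compute the difference" and batching steps above can be sketched in isolation, without a database. This is a minimal illustration, not the val's actual code: the `ValRef` interface, the `missingValBatches` helper, and the sample data are hypothetical names introduced here, while the `author!!name!!version` id scheme follows the script.

```typescript
// Hypothetical shape of a val row; field names follow the SELECT in the script.
interface ValRef {
  author_username: string;
  name: string;
  version: number;
}

// Same id scheme as the script: author, name, and version joined by "!!".
function idForVal(val: ValRef): string {
  return `${val.author_username}!!${val.name}!!${val.version}`;
}

// Keep only vals whose id is not already stored, then split them into
// batches of at most batchSize entries.
function missingValBatches(
  allVals: ValRef[],
  existingIds: Set<string>,
  batchSize: number,
): ValRef[][] {
  const missing = allVals.filter((val) => !existingIds.has(idForVal(val)));
  const batches: ValRef[][] = [];
  for (let i = 0; i < missing.length; i += batchSize) {
    batches.push(missing.slice(i, i + batchSize));
  }
  return batches;
}

// Made-up sample data: one val is already embedded, two are missing.
const exampleVals: ValRef[] = [
  { author_username: "alice", name: "a", version: 1 },
  { author_username: "alice", name: "b", version: 2 },
  { author_username: "bob", name: "c", version: 1 },
];
const exampleBatches = missingValBatches(exampleVals, new Set(["alice!!a!!1"]), 1);
console.log(exampleBatches.map((batch) => batch.map(idForVal)));
```

Because the version is part of the id, a val that has been edited since its last embedding counts as missing and is embedded again under a new id.
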
import { decode as base64Decode } from "https://deno.land/std@0.166.0/encoding/base64.ts";
import { Client } from "https://deno.land/x/postgres/mod.ts";
import getValCode from "https://esm.town/v/janpaul123/getValCode";
import { sqlToJSON } from "https://esm.town/v/nbbaier/sqliteExportHelpers?v=22";
import { db as allValsDb } from "https://esm.town/v/sqlite/db?v=9";
import OpenAI from "npm:openai";
import { truncateMessage } from "npm:openai-tokens";

// The table must already exist; create it once with:
// CREATE TABLE vals_embeddings (id TEXT PRIMARY KEY, embedding VECTOR(1536));

export default async function() {
  const dimensions = 1536;
  const batchSize = 100;

  const client = new Client(Deno.env.get("NEON_URL_VALSEMBEDDINGS"));
  await client.connect();

  try {
    // All public vals with non-trivial code.
    const allVals = await sqlToJSON(
      await allValsDb.execute("SELECT author_username, name, version FROM vals WHERE LENGTH(code) > 10 ORDER BY name"),
    ) as any;

    // Ids that already have an embedding stored in Neon.
    const existingEmbeddingsIds = new Set(
      (await client.queryObject<{ id: string }>`SELECT id FROM vals_embeddings`).rows.map(row => row.id),
    );

    function idForVal(val: any): string {
      return `${val.author_username}!!${val.name}!!${val.version}`;
    }

    // Collect vals without a stored embedding into batches of at most batchSize.
    const newValsBatches: any[][] = [[]];
    let currentBatch = newValsBatches[0];
    for (const val of allVals) {
      if (existingEmbeddingsIds.has(idForVal(val))) continue;
      if (currentBatch.length >= batchSize) {
        currentBatch = [];
        newValsBatches.push(currentBatch);
      }
      currentBatch.push(val);
    }

    const openai = new OpenAI();
    for (const newValsBatch of newValsBatches) {
      // Vals within a batch are embedded concurrently; batches run one after another.
      await Promise.all(newValsBatch.map(async (val) => {
        // await is harmless if getValCode turns out to be synchronous.
        const code = await getValCode(val);
        const embedding = await openai.embeddings.create({
          model: "text-embedding-3-small",
          input: truncateMessage(code, "text-embedding-3-small"),
          encoding_format: "base64",
          dimensions,
        });

        // The base64 payload decodes to raw 32-bit floats.
        const embeddingBinary = new Float32Array(base64Decode(embedding.data[0].embedding as any).buffer);
        if (embeddingBinary.length !== dimensions) {
          throw new Error(`Invalid embeddingBinary.length: ${embeddingBinary.length}`);
        }

        // pgvector accepts vectors in the text form "[v1,v2,...]".
        const id = idForVal(val);
        const embeddedBinaryString = `[${embeddingBinary.join(",")}]`;
        await client
          .queryObject`INSERT INTO vals_embeddings (id, embedding) VALUES (${id}, ${embeddedBinaryString})`;
        console.log(`Processed ${id}..`);
      }));
      console.log("Finished batch..");
    }
    console.log(`Finished, we have indexed ${allVals.length} records`);
  } finally {
    // Close the connection even if a batch fails.
    await client.end();
  }
}
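
The conversion from decoded bytes to the string that is inserted into the `embedding` column can be tried on its own. The sketch below is an illustration under stated assumptions, not part of the val: `bytesToFloat32` and `toPgvectorLiteral` are hypothetical helper names, the sample vector is made up, and the bytes are assumed to be little-endian 32-bit floats read on a little-endian host.

```typescript
// Reinterpret raw bytes (for example the output of a base64 decoder) as
// 32-bit floats. Copying first guarantees the underlying buffer starts at
// offset 0 and contains only these bytes.
function bytesToFloat32(bytes: Uint8Array): Float32Array {
  if (bytes.byteLength % 4 !== 0) {
    throw new Error(`Byte length ${bytes.byteLength} is not a multiple of 4`);
  }
  return new Float32Array(bytes.slice().buffer);
}

// Format a vector in pgvector's text input form, e.g. "[1,2,3]",
// after checking that it has the expected number of dimensions.
function toPgvectorLiteral(vector: Float32Array, expectedDimensions: number): string {
  if (vector.length !== expectedDimensions) {
    throw new Error(`Expected ${expectedDimensions} dimensions, got ${vector.length}`);
  }
  return `[${Array.from(vector).join(",")}]`;
}

// Made-up three-dimensional vector standing in for a 1536-dimensional embedding.
const sample = new Float32Array([0.5, -1, 2]);
const sampleBytes = new Uint8Array(sample.buffer);
const literal = toPgvectorLiteral(bytesToFloat32(sampleBytes), 3);
console.log(literal);
```

Checking the length before inserting mirrors the guard in the script: a vector of the wrong size would otherwise only be rejected by the database, since the column is declared as VECTOR(1536).
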
June 17, 2024