[HN Gopher] New and Improved Embedding Model for OpenAI
___________________________________________________________________
 
New and Improved Embedding Model for OpenAI
 
Author : craigkerstiens
Score  : 62 points
Date   : 2022-12-15 18:13 UTC (4 hours ago)
 
web link (openai.com)
w3m dump (openai.com)
 
| tibbar wrote:
| Note that this model is not able to write code.
| 
| Prompt: "Write a C program to RSA encrypt a string."
| 
| text-embedding-ada-002:
| 
| %-C C C C C C C C
| 
| C
| 
| C
| 
| [...many more lines of "C", elided...]
| 
| C C C aC C C
| 
| C
| 
| m m m m m m m m m m m m m m m m m m mm m C mC
| 
| (Note: a relatively short max length)
| 
| Compare text-davinci-003:
| 
| #include <stdio.h>
| #include <stdlib.h>
| 
| // Utility function to find GCD of 'a' and 'b'
| int gcd(int a, int b)
| {
|     if (a == 0)
|         return b;
|     return gcd(b % a, a);
| }
| 
| // Function to find Modular Multiplicative
| // Inverse of 'a' under modulo 'm'
| int modInverse(int a, int m)
| {
|     a = a % m;
|     for (int x = 1; x < m; x++)
|         if ((a * x) % m == 1)
|             return x;
|     return -1;
| }
| 
| // Function to compute a/b under modulo m
| int modDivide(int a, int b, int m)
| {
|     a = a % m;
|     int inv = modInverse(b, m);
|     return (inv * a) % m;
| }
| 
| // Function to calculate (a^b)%c
| long long int power(int a, int b, int c)
| {
|     int res = 1;
|     a = a % c;
|     while (b > 0) {
|         if (b & 1)
|             res = modDivide(res * a, 1, c);
|         b = b >> 1;
|         a = modDivide(a * a, 1, c);
|     }
|     return res;
| }
| 
| // C function for extended Euclidean Algorithm
| int gcdExtended(int a, int b, int *x, int *y)
| {
|     if (a == 0) {
|         *x = 0;
|         *y = 1;
|         return b;
|     }
|     int x1, y1;
|     int gcd = gcdExtended(b % a, a, &x1, &y1);
|     *x = y1 - (b / a) * x1;
|     *y = x1;
|     return gcd;
| }
| 
| // Function to generate public and private keys
| void generateKeys(int p, int q, int *e, int *d, int *n)
| {
|     // n = p * q
|     *n = p * q;
|     // Computing phi(n) = (p-1)*(q-1)
|     int phi = (p - 1) * (q - 1);
|     // Choose an integer e such that
|     // e and phi(n) are co-prime
|     *e = 2;
|     while (*e < phi) {
|         if (gcd(*e, phi) == 1)
|             break;
|         else
|             ++*e;
|     }
|     // Compute d such that it satisfies
|     // d*e = 1 + k * phi(n)
|     int k = 2; // Fixed value
|     *d = (1 + (k * phi)) / *e;
| }
| 
| // Encryption Function
| long int RSA_Encrypt(int msg, int e, int n)
| {
|     // Cipher Text c = (msg ^ e) % n
|     long int c = power(msg, e, n);
|     return c;
| }
| 
| // Decryption Function
| long int RSA_Decrypt(int c, int d, int n)
| {
|     // Message msg = (c ^ d) % n
|     long int msg = power(c, d, n);
|     return msg;
| }
| 
| // Driver program
| int main()
| {
|     // Two random prime numbers
|     int p = 3;
|     int q = 7;
|     // Message to be encrypted
|     int msg = 15;
|     // Encryption key (e, n)
|     int e, d, n;
|     // Calculate n and phi
|     generateKeys(p, q, &e, &d, &n);
|     // Encryption
|     long int c = RSA_Encrypt(msg, e, n);
|     printf("Encrypted message is: %ld\n", c);
|     // Decryption
|     long int m = RSA_Decrypt(c, d, n);
|     printf("Original Message is: %ld\n", m);
|     return 0;
| }
 
  | varunkmohan wrote:
  | It's an embedding model, so it generates vector embeddings,
  | not text completions. That's to be expected.
 
| lee101 wrote:
| Also check out the embedding model from https://text-generator.io
| 
| It supports some things that OpenAI's can't: it retrieves any
| images linked from a web page and analyses them (including text
| inside images) to inform the embedding.
 
| IanCal wrote:
| Once I've got embeddings, my naive next step would be to do
| cosine similarity for comparisons/search/anything that requires
| a distance. I see they do that in some examples.
| 
| Is that the standard approach these days? Are there newer default
| approaches that tend to work better?
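Cosine similarity is still the usual first step. A minimal numpy sketch, with toy 3-d vectors standing in for the 1536-d ada-002 embeddings (which OpenAI reports are normalized to unit length, so a plain dot product would also work):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" standing in for real 1536-d vectors.
query = np.array([1.0, 0.0, 0.0])
docs = np.array([[0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0]])

scores = [cosine_similarity(query, d) for d in docs]
best = int(np.argmax(scores))  # index of the most similar document
```

Here `best` is 0, since the first document points nearly the same way as the query.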
 
  | visarga wrote:
  | You can also apply clustering or train a classification model
  | based on embeddings.
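A nearest-centroid classifier is about the simplest version of that idea; a sketch over hypothetical labeled vectors (toy 2-d data, not OpenAI API calls):

```python
import numpy as np

# Hypothetical labeled embeddings for two classes
# (real ada-002 vectors have 1536 dimensions).
X = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])

# "Training": average the embeddings of each class into a centroid.
centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def classify(v: np.ndarray) -> int:
    # Assign v to the class whose centroid is closest.
    return int(np.argmin(np.linalg.norm(centroids - v, axis=1)))
```

A vector near `[0.95, 0.05]` lands in class 0; one near `[0.1, 1.0]` lands in class 1.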
 
| evergreener wrote:
| Does anyone know how OpenAI (and others) are extending the
| context windows of things like ChatGPT so far? E.g. if you
| exceed 2048/8192 (subword) tokens, does the model just chunk the
| inputs and evaluate each chunk separately? Is context/state
| maintained across chunks? I've never seen anyone actually
| explain this.
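I haven't seen OpenAI document this either. A common client-side workaround is sliding-window chunking with overlap, where each window is evaluated independently; a sketch of that pattern (an assumption about common practice, not a description of ChatGPT's internals):

```python
def chunk_tokens(tokens, max_len=2048, overlap=256):
    # Split a token sequence into overlapping windows so each one
    # fits the model's context. The overlap shares some context
    # between neighbouring chunks, but no hidden state is carried
    # across them -- each window is evaluated independently.
    step = max_len - overlap
    stop = max(len(tokens) - overlap, 1)
    return [tokens[i:i + max_len] for i in range(0, stop, step)]

chunks = chunk_tokens(list(range(5000)))
# Three windows: tokens 0..2047, 1792..3839, 3584..4999
```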
 
| bcjordan wrote:
| The embeddings/search APIs seem super powerful; I've been
| meaning to play around with them more. I wonder how their
| performance compares to ElasticSearch and other text
| search/classification offerings out there.
 
  | gk1 wrote:
  | The Search and Classification (and Answers) APIs were
  | deprecated last week.[1]
  | 
  | They were never in serious competition with Elastic, as far as
  | search goes. If you wanted to build a semantic search
  | application using OpenAI embeddings, the more common (and
  | scalable) method is to index those embeddings in a vector
  | database like Pinecone.[2] In fact that's what OpenAI
  | recommends to anyone who needs to transition off their Search
  | API.
  | 
  | [1] https://help.openai.com/en/articles/6272952-search-
  | transitio...
  | 
  | [2] https://docs.pinecone.io/docs/openai
 
    | thirdtrigger wrote:
    | Agreed - one can also use Weaviate which comes with an OOTB
    | OpenAI module leveraging the embeddings end-point
    | https://weaviate.io/developers/weaviate/current/retriever-
    | ve...
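At heart, what those vector databases provide is nearest-neighbour search over stored embeddings. A brute-force stand-in makes the idea concrete (toy code; the method names are illustrative, not the Pinecone or Weaviate API):

```python
import numpy as np

class TinyVectorIndex:
    # Brute-force stand-in for a vector database: stores normalized
    # embeddings and returns the ids of the top_k nearest vectors by
    # cosine similarity (a dot product after normalization).
    def __init__(self):
        self.ids, self.vecs = [], []

    def upsert(self, id_, vec):
        v = np.asarray(vec, dtype=float)
        self.ids.append(id_)
        self.vecs.append(v / np.linalg.norm(v))

    def query(self, vec, top_k=3):
        q = np.asarray(vec, dtype=float)
        q = q / np.linalg.norm(q)
        scores = np.stack(self.vecs) @ q
        order = np.argsort(scores)[::-1][:top_k]
        return [self.ids[i] for i in order]

index = TinyVectorIndex()
index.upsert("doc-a", [1.0, 0.0])
index.upsert("doc-b", [0.0, 1.0])
index.upsert("doc-c", [0.7, 0.7])
nearest = index.query([0.9, 0.1], top_k=2)  # -> ["doc-a", "doc-c"]
```

Real vector databases replace the linear scan with approximate-nearest-neighbour indexes so this stays fast at millions of vectors.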
 
| dr_dshiv wrote:
| What effect will this have for connecting concepts between books?
| Either through summarization or topic mapping?
 
| gok wrote:
| > Longer context. The context length of the new model is
| increased by a factor of four, from 2048 to 8192, making it more
| convenient to work with long documents.
| 
| 8192 words is getting into the range of short stories or a
| master's thesis, which opens the door to some interesting
| applications.
 
  | drusepth wrote:
  | Important to note that these tokens _can_ be whole words, but
  | a word is often split into multiple tokens, so 8192 tokens =
  | 8192 words isn't strictly correct.
  | 
  | That said, your point stands. Most short stories are low-to-
  | mid four-digit word counts, and a jump from 2048 tokens to
  | 8192 squarely fits in that window.
  | 
  | As someone who's been working on multi-layered approaches to
  | using GPT-like models for long text generation (e.g. synopsis
  | -> outline -> paragraph expansions) to get around the limited
  | context window, it'll be interesting to see if people will keep
  | working towards that end or if it'll all become a moot point as
  | the effective context window continues to scale up.
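The exact token-to-word ratio depends on the tokenizer and the text, but a commonly cited rule of thumb for English is roughly 0.75 words per token (an assumed average, not an OpenAI figure), which gives a quick back-of-envelope conversion:

```python
WORDS_PER_TOKEN = 0.75  # assumed English-text average; varies widely

for ctx in (2048, 8192):
    approx_words = int(ctx * WORDS_PER_TOKEN)
    print(f"{ctx} tokens is roughly {approx_words} words")
```

By that estimate, the old 2048-token window covered about 1536 words and the new 8192-token window about 6144 words, i.e. squarely short-story territory.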
 
___________________________________________________________________
(page generated 2022-12-15 23:01 UTC)