[HN Gopher] New and Improved Embedding Model for OpenAI
___________________________________________________________________
 
New and Improved Embedding Model for OpenAI
 
Author : craigkerstiens
Score  : 62 points
Date   : 2022-12-15 18:13 UTC (4 hours ago)
 
web link (openai.com)
w3m dump (openai.com)
 
| tibbar wrote:
| Note that this model is not able to write code.
| 
| Prompt: "Write a C program to RSA encrypt a string."
| 
| text-embedding-ada-002:
| 
| %-C C C C C C C C
| 
| C
| 
| C
| 
| [...many more lines of "C", elided...]
| 
| C C C aC C C
| 
| C
| 
| m m m m m m m m m m m m m m m m m m mm m C mC
| 
| (Note: a relatively short max length)
| 
| Compare text-davinci-003:
| 
| #include <stdio.h>
| #include <stdlib.h>
| 
| // Utility function to find GCD of 'a' and 'b'
| int gcd(int a, int b)
| {
|     if (a == 0)
|         return b;
|     return gcd(b % a, a);
| }
| 
| // Function to find Modular Multiplicative
| // Inverse of 'a' under modulo 'm'
| int modInverse(int a, int m)
| {
|     a = a % m;
|     for (int x = 1; x < m; x++)
|         if ((a * x) % m == 1)
|             return x;
|     return -1;
| }
| 
| // Function to compute a/b under modulo m
| int modDivide(int a, int b, int m)
| {
|     a = a % m;
|     int inv = modInverse(b, m);
|     return (inv * a) % m;
| }
| 
| // Function to calculate (a^b)%c
| long long int power(int a, int b, int c)
| {
|     int res = 1;
|     a = a % c;
|     while (b > 0) {
|         if (b & 1)
|             res = modDivide(res * a, 1, c);
|         b = b >> 1;
|         a = modDivide(a * a, 1, c);
|     }
|     return res;
| }
| 
| // C function for extended Euclidean Algorithm
| int gcdExtended(int a, int b, int *x, int *y)
| {
|     if (a == 0) {
|         *x = 0;
|         *y = 1;
|         return b;
|     }
|     int x1, y1;
|     int gcd = gcdExtended(b % a, a, &x1, &y1);
|     *x = y1 - (b / a) * x1;
|     *y = x1;
|     return gcd;
| }
| 
| // Function to generate public and private keys
| void generateKeys(int p, int q, int *e, int *d, int *n)
| {
|     // n = p * q
|     *n = p * q;
|     // Computing phi(n) = (p-1)*(q-1)
|     int phi = (p - 1) * (q - 1);
|     // Choose an integer e such that
|     // e and phi(n) are co-prime
|     *e = 2;
|     while (*e < phi) {
|         if (gcd(*e, phi) == 1)
|             break;
|         else
|             ++*e;
|     }
|     // Compute d such that it satisfies
|     // d*e = 1 + k * phi(n)
|     int k = 2; // Fixed value
|     *d = (1 + (k * phi)) / *e;
| }
| 
| // Encryption Function
| long int RSA_Encrypt(int msg, int e, int n)
| {
|     // Cipher Text c = (msg ^ e) % n
|     long int c = power(msg, e, n);
|     return c;
| }
| 
| // Decryption Function
| long int RSA_Decrypt(int c, int d, int n)
| {
|     // Message msg = (c ^ d) % n
|     long int msg = power(c, d, n);
|     return msg;
| }
| 
| // Driver program
| int main()
| {
|     // Two random prime numbers
|     int p = 3;
|     int q = 7;
|     // Message to be encrypted
|     int msg = 15;
|     // Encryption key (e, n)
|     int e, d, n;
|     // Calculate n and phi
|     generateKeys(p, q, &e, &d, &n);
|     // Encryption
|     long int c = RSA_Encrypt(msg, e, n);
|     printf("Encrypted message is: %ld\n", c);
|     // Decryption
|     long int m = RSA_Decrypt(c, d, n);
|     printf("Original Message is: %ld\n", m);
|     return 0;
| }
 
  | varunkmohan wrote:
  | It's an embedding model, so it generates vector embeddings,
  | not text completions. That's to be expected.
 
| lee101 wrote:
| Also check out the embedding model from https://text-generator.io
| 
| It supports some things that OpenAI's can't: it retrieves any
| images linked from a web page and analyses them (including text
| inside images) to inform the embedding.
 
| IanCal wrote:
| Once I've got embeddings, my naive next step would be to do
| cosine similarity for comparisons/search/anything that requires
| a distance. I see they do that in some examples.
| 
| Is that the standard approach these days? Are there newer default
| approaches that tend to work better?
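Cosine similarity is still the usual first step. A minimal numpy sketch, with toy 3-d vectors standing in for the 1536-d ada-002 embeddings (which OpenAI reports are normalized to unit length, so a plain dot product would also work):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" standing in for real 1536-d vectors.
query = np.array([1.0, 0.0, 0.0])
docs = np.array([[0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0]])

scores = [cosine_similarity(query, d) for d in docs]
best = int(np.argmax(scores))  # index of the most similar document
```

Here `best` is 0, since the first document points nearly the same way as the query.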
 
  | visarga wrote:
  | You can also apply clustering or train a classification model
  | based on embeddings.
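A nearest-centroid classifier is about the simplest version of that idea; a sketch over hypothetical labeled vectors (toy 2-d data, not OpenAI API calls):

```python
import numpy as np

# Hypothetical labeled embeddings for two classes
# (real ada-002 vectors have 1536 dimensions).
X = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])

# "Training": average the embeddings of each class into a centroid.
centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def classify(v: np.ndarray) -> int:
    # Assign v to the class whose centroid is closest.
    return int(np.argmin(np.linalg.norm(centroids - v, axis=1)))
```

A vector near `[0.95, 0.05]` lands in class 0; one near `[0.1, 1.0]` lands in class 1.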
 
| evergreener wrote:
| Does anyone know how OpenAI (and others) are extending the
| context windows of things like ChatGPT so far? E.g. if you
| exceed 2048/8192 (subword) tokens, does the model just chunk the
| inputs and evaluate each chunk separately? Is context/state
| maintained across chunks? I've never seen anyone actually
| explain this.
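I haven't seen OpenAI document this either. A common client-side workaround is sliding-window chunking with overlap, where each window is evaluated independently; a sketch of that pattern (an assumption about common practice, not a description of ChatGPT's internals):

```python
def chunk_tokens(tokens, max_len=2048, overlap=256):
    # Split a token sequence into overlapping windows so each one
    # fits the model's context. The overlap shares some context
    # between neighbouring chunks, but no hidden state is carried
    # across them -- each window is evaluated independently.
    step = max_len - overlap
    stop = max(len(tokens) - overlap, 1)
    return [tokens[i:i + max_len] for i in range(0, stop, step)]

chunks = chunk_tokens(list(range(5000)))
# Three windows: tokens 0..2047, 1792..3839, 3584..4999
```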
 
| bcjordan wrote:
| The embeddings/search APIs seem super powerful; I've been
| meaning to play around with them more. I wonder how their
| performance compares to ElasticSearch and other text
| search/classification offerings out there.
 
  | gk1 wrote:
  | The Search and Classification (and Answers) APIs were
  | deprecated last week.[1]
  | 
  | They were never in serious competition with Elastic, as far as
  | search goes. If you wanted to build a semantic search
  | application using OpenAI embeddings, the more common (and
  | scalable) method is to index those embeddings in a vector
  | database like Pinecone.[2] In fact that's what OpenAI
  | recommends to anyone who needs to transition off their Search
  | API.
  | 
  | [1] https://help.openai.com/en/articles/6272952-search-
  | transitio...
  | 
  | [2] https://docs.pinecone.io/docs/openai
 
    | thirdtrigger wrote:
    | Agreed - one can also use Weaviate which comes with an OOTB
    | OpenAI module leveraging the embeddings end-point
    | https://weaviate.io/developers/weaviate/current/retriever-
    | ve...
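At heart, what those vector databases provide is nearest-neighbour search over stored embeddings. A brute-force stand-in makes the idea concrete (toy code; the method names are illustrative, not the Pinecone or Weaviate API):

```python
import numpy as np

class TinyVectorIndex:
    # Brute-force stand-in for a vector database: stores normalized
    # embeddings and returns the ids of the top_k nearest vectors by
    # cosine similarity (a dot product after normalization).
    def __init__(self):
        self.ids, self.vecs = [], []

    def upsert(self, id_, vec):
        v = np.asarray(vec, dtype=float)
        self.ids.append(id_)
        self.vecs.append(v / np.linalg.norm(v))

    def query(self, vec, top_k=3):
        q = np.asarray(vec, dtype=float)
        q = q / np.linalg.norm(q)
        scores = np.stack(self.vecs) @ q
        order = np.argsort(scores)[::-1][:top_k]
        return [self.ids[i] for i in order]

index = TinyVectorIndex()
index.upsert("doc-a", [1.0, 0.0])
index.upsert("doc-b", [0.0, 1.0])
index.upsert("doc-c", [0.7, 0.7])
nearest = index.query([0.9, 0.1], top_k=2)  # -> ["doc-a", "doc-c"]
```

Real vector databases replace the linear scan with approximate-nearest-neighbour indexes so this stays fast at millions of vectors.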
 
| dr_dshiv wrote:
| What effect will this have for connecting concepts between books?
| Either through summarization or topic mapping?
 
| gok wrote:
| > Longer context. The context length of the new model is
| increased by a factor of four, from 2048 to 8192, making it more
| convenient to work with long documents.
| 
| 8192 words is getting into the range of short stories or a
| master's thesis, which opens the door to some interesting
| applications.
 
  | drusepth wrote:
  | Important to note that these tokens _can_ be whole words, but
  | a word is often split into multiple tokens, so 8192 tokens =
  | 8192 words isn't strictly correct.
  | 
  | That said, your point stands. Most short stories are low-to-
  | mid four-digit word counts, and a jump from 2048 tokens to
  | 8192 squarely fits in that window.
  | 
  | As someone who's been working on multi-layered approaches to
  | using GPT-like models for long text generation (e.g. synopsis
  | -> outline -> paragraph expansions) to get around the limited
  | context window, it'll be interesting to see if people will keep
  | working towards that end or if it'll all become a moot point as
  | the effective context window continues to scale up.
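The exact token-to-word ratio depends on the tokenizer and the text, but a commonly cited rule of thumb for English is roughly 0.75 words per token (an assumed average, not an OpenAI figure), which gives a quick back-of-envelope conversion:

```python
WORDS_PER_TOKEN = 0.75  # assumed English-text average; varies widely

for ctx in (2048, 8192):
    approx_words = int(ctx * WORDS_PER_TOKEN)
    print(f"{ctx} tokens is roughly {approx_words} words")
```

By that estimate, the old 2048-token window covered about 1536 words and the new 8192-token window about 6144 words, i.e. squarely short-story territory.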
 
___________________________________________________________________
(page generated 2022-12-15 23:01 UTC)