How does it work?
Mindplug is built using OpenAI to generate embeddings and Pinecone to store the embedding data. To allow for efficient data storage, these technologies are combined with Supabase, which keeps track of the stored objects. Supabase stores the vector IDs, which allows Mindplug to efficiently retrieve vectors from any collection. Going deeper, these IDs will also allow Mindplug to easily export vector data out of any collection.
Mindplug is designed with transparency and data ownership in mind. We do not take control of your data, but rather help you interface with it. In the near future, you will be able to easily export your data at any time.
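As a rough illustration, here is a minimal sketch of how those stored IDs could be used to pull vectors back out of a collection. It assumes the Pinecone and Supabase JavaScript SDKs and a hypothetical Supabase table named `vectors` with `vector_id` and `collection` columns; these names are illustrative, not Mindplug's actual schema.

```typescript
import { Pinecone } from "@pinecone-database/pinecone";
import { createClient } from "@supabase/supabase-js";

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

// Sketch: export all vectors of a collection by first reading the tracked IDs
// from Supabase, then fetching the full vectors from Pinecone by ID.
async function exportCollection(userNamespace: string, collection: string) {
  // The "vectors" table and its columns are hypothetical names for illustration.
  const { data, error } = await supabase
    .from("vectors")
    .select("vector_id")
    .eq("collection", collection);
  if (error) throw error;

  const ids = (data ?? []).map((row) => row.vector_id as string);
  const result = await pinecone.index("mindplug").namespace(userNamespace).fetch(ids);
  return result.records; // id -> { values, metadata }
}
```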
Storing plain text is a simple process: OpenAI generates the embeddings, Pinecone stores them, and Supabase tracks them. To interface with data from other sources, such as PDFs and webpages, the data must first be converted into plain text.
To convert PDFs into plain text, Langchain's PDF interface is used to return the text on each page of the PDF. For webpages, we use Playwright to crawl and extract the content. Both methods are efficient and scalable.
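As a sketch of what this extraction step could look like, assuming the JavaScript versions of these libraries (Langchain's `PDFLoader` and Playwright's headless Chromium); the exact interfaces Mindplug uses may differ:

```typescript
import { PDFLoader } from "langchain/document_loaders/fs/pdf";
import { chromium } from "playwright";

// Return the plain text of a PDF, one string per page.
async function pdfToText(path: string): Promise<string[]> {
  const loader = new PDFLoader(path); // loads one document per page
  const pages = await loader.load();
  return pages.map((page) => page.pageContent);
}

// Render a webpage in a headless browser and return its visible text.
async function webpageToText(url: string): Promise<string> {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded" });
    return await page.innerText("body");
  } finally {
    await browser.close();
  }
}
```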
Before embeddings are generated for this data, it is first chunked, or split, into smaller objects. To ensure each object keeps the full context of the text being embedded, Langchain's recursive splitting is used. This type of splitting has been shown to produce the best results across different contexts. By default, the chunk size is 1024 for all sources; this size has been shown to preserve context well across different sources.
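A minimal sketch of this chunking step, using Langchain's recursive character splitter with the 1024 chunk size mentioned above (the overlap value shown is an illustrative assumption, not Mindplug's actual setting):

```typescript
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Split extracted text into ~1024-character chunks before embedding.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1024,
  chunkOverlap: 100, // illustrative value
});

async function chunkText(text: string): Promise<string[]> {
  return splitter.splitText(text);
}
```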
Our goal is to allow everyone around the world to experience the power of embeddings, so we provide free storage on the starting plan. Given the 1024 chunk size, this is easily enough for anyone to embed a few websites and articles.
Why 1024? On some websites, the information is sparse and spread across the page. Generating more fine-grained vectors on such data results in a loss of meaning. The same is true for plain text and PDF data. A broad context suits a large range of use cases, while narrow-context storage is best only for specific ones: it works best on a small set of data that has been pre-processed for a particular use case. In the future, we may explore use cases for dynamic sizing of data at 256 or 512 chunk sizes.
Using this chunked data, OpenAI's 'text-embedding-ada-002' model embeds each chunk into a vector of size 1536. These embeddings are then uploaded to Pinecone with an auto-generated unique ID for each embedding object. After uploading, these IDs are stored in Supabase.
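Putting these steps together, a hedged sketch of the embed, upsert, and track flow might look like the following; the index name, namespace scheme, Supabase table, and metadata fields are assumptions for illustration:

```typescript
import { randomUUID } from "crypto";
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";
import { createClient } from "@supabase/supabase-js";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

async function storeChunks(userNamespace: string, collection: string, chunks: string[]) {
  // 1. Embed every chunk with text-embedding-ada-002 (1536-dimensional vectors).
  const response = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: chunks,
  });

  // 2. Upsert the vectors into Pinecone, each with an auto-generated unique ID.
  const records = response.data.map((item, i) => ({
    id: randomUUID(),
    values: item.embedding,
    metadata: { text: chunks[i], collection }, // illustrative metadata
  }));
  await pinecone.index("mindplug").namespace(userNamespace).upsert(records);

  // 3. Track the IDs in Supabase so vectors can be retrieved or exported later.
  //    The "vectors" table and its columns are hypothetical names.
  await supabase
    .from("vectors")
    .insert(records.map((r) => ({ vector_id: r.id, collection })));

  return records.map((r) => r.id);
}
```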
All of these processes happen in Mindplug's backend, which is available to the public as an API. Pseudo-code snippets for each endpoint are provided in the docs to give users a better understanding of how their data is handled.
Mindplug uses Pinecone. How do we minimize costs to provide the best user experience?
Pinecone is a vector database that charges per index. An index is a unit of storage that supports different projects, and each project can be further sub-divided into namespaces. An index runs on a server called a pod. There are different pod tiers, optimized for different use cases. In our case, we optimize for storage using S1 pods, since users may upload an arbitrary amount of data.
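For illustration, creating a storage-optimized, pod-based index with the Pinecone Node.js SDK might look roughly like this; the environment value is a placeholder and the exact spec shape depends on the SDK version:

```typescript
import { Pinecone } from "@pinecone-database/pinecone";

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });

// Sketch: a single S1 (storage-optimized) pod-based index sized for
// 1536-dimensional ada-002 embeddings. "us-east-1-aws" is a placeholder.
await pinecone.createIndex({
  name: "mindplug",
  dimension: 1536,
  metric: "cosine",
  spec: {
    pod: {
      environment: "us-east-1-aws",
      podType: "s1.x1",
      pods: 1,
    },
  },
});
```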
On this S1 pod, Mindplug automatically creates an index called Mindplug. The projects and collections in your account live in a shared space on this S1 pod, divided into namespaces for each specific user. This allows Mindplug to make the most of the space available per index, avoiding the creation of additional pods and thus additional charges.
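In practice, that per-user isolation can be expressed by scoping every Pinecone call to the user's namespace; the naming scheme below is an assumption for illustration:

```typescript
import { Pinecone } from "@pinecone-database/pinecone";

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });

// Every read and write is scoped to the requesting user's namespace,
// so many users share one index without seeing each other's vectors.
// The "user-<id>" naming scheme is illustrative.
async function searchUserData(userId: string, queryEmbedding: number[]) {
  return pinecone
    .index("mindplug")
    .namespace(`user-${userId}`)
    .query({ vector: queryEmbedding, topK: 5, includeMetadata: true });
}
```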
Pinecone automatically encrypts embedding data. Please see Pinecone's security policy for how your embedding data is treated: https://www.pinecone.io/security/. While the data lives in a shared space across different users, it is never used for any purpose other than completing the user's API requests.