Using Weaviate vector search over 60 million+ academic papers to build scalable knowledge graph search

Keenious is a search engine designed for students and researchers. Delivered as a plugin directly inside the text editor, it analyzes the entire document as you work and surfaces highly relevant results.

Introduction

Unlike traditional search, the Keenious academic search engine strikes a balance between directly relevant results (keyword matches, etc.) and similarity results that are semantically related to the input document. The similarity results drive continuous discovery and exploration of research and topics, which is essential for researchers and students at all levels.

Finding the right balance is difficult because the user’s intent is itself uncertain, so there is no single obviously correct way to strike it. We rely on our own understanding, repeated testing and, most importantly, user feedback to find the best combination.

Recently we have been exploring ways to perform semantic search that do not rely on text-based vectors. The motivation is to let users explore and discover research and topics in depth without having a document to search with. We usually call this the cold-start case: the user has no paper or document text to search from.

Developing such a solution has been a major focus for us: one where users need neither a document nor a search query, and can easily discover personalized research suggestions from a single prompt or input (a favourite paper, a topic, etc.). The solution we have settled on combines a knowledge graph (KG) with a fast vector search engine (Weaviate).

In this article we briefly introduce knowledge graphs and how we use them at Keenious, focusing on how we use Weaviate to scale our graph embeddings into a search engine. This includes an overview of Weaviate and why it is the right tool for this task.

A quick introduction to Keenious

Before we start, you may not be familiar with Keenious and what it does. Keenious is a tool that analyzes your writing and, within seconds, shows you the most relevant research from millions of online publications.

We believe learning is not a static process, and research should not be either. With Keenious, every document becomes a search query. Our plugin analyzes the text as you write it and finds the most relevant research for you at each step. Keenious uncovers hidden gems by surfacing cross-disciplinary topics and research fields.

If you need to search for something more specific, you can run the search on any individual sentence or passage of the document. This narrows the search while keeping it relevant to the rest of the document.

Academic knowledge graphs

Knowledge graphs (KGs) are a deep topic that we will not cover in detail here; interested readers can check the articles published on our public account Deephub-Imba. Below we have summarized a few key points about knowledge graphs to help you follow the rest of this article.

Knowledge graphs (also called semantic networks) are a way to connect many different real-world entities and concepts and to relate them to one another while distinguishing between their types. Almost any concept/entity/system can be abstracted as a knowledge graph.

More formally, a KG is a heterogeneous graph, in which there can be multiple types of nodes and/or edges. For example, the MovieLens dataset can be abstracted as a KG made up of several node types, such as movies, actors, genres and languages, with edges specifying the relationship between two entities, e.g. actor → appeared in → movie.

This is in contrast to homogeneous graphs, where all nodes and edges are of the same type, for example a friendship (social) network.
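To make the distinction concrete, here is a toy sketch (not Keenious data) of a heterogeneous graph represented as typed (head, relation, tail) triples, in the spirit of the MovieLens example above:

```python
# Toy heterogeneous graph: node types are encoded in the "type:name" prefix,
# and each edge is a typed (head, relation, tail) triple.
triples = [
    ("actor:Keanu Reeves", "appeared_in", "movie:The Matrix"),
    ("movie:The Matrix", "has_genre", "genre:Sci-Fi"),
    ("movie:The Matrix", "in_language", "language:English"),
]

# Recover the sets of node types and edge types from the triples.
node_types = {node.split(":")[0] for h, _, t in triples for node in (h, t)}
edge_types = {relation for _, relation, _ in triples}
print(node_types)  # e.g. {'actor', 'movie', 'genre', 'language'} (set order may vary)
print(edge_types)  # e.g. {'appeared_in', 'has_genre', 'in_language'}
```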

The main entities (nodes) and relationships (edges) of our knowledge graph are academic papers and the paper-specific metadata that enriches the graph.

Using these nodes and edge relationships, we can build an academic knowledge graph and train custom models to generate very rich graph embeddings. Each embedding represents a unique node of a given type, including every paper in our dataset. These embeddings are the foundation of this search, and they are what we feed into Weaviate so that every unique entity in our KG can be found.

Weaviate vector search engine

Weaviate is a vector search engine that helps power AI-driven search and discovery. Vector search is at an important and very interesting crossroads: it is maturing into mainstream search technology because its benefits are undeniable.

Just as the inverted index changed the way we do full-text search, vector search engines like Weaviate are powering the next generation of search over unstructured data such as text, images and knowledge graphs.

Data objects and hybrid index search

Fundamentally, Weaviate’s architecture has been carefully thought out from the start. Data objects in Weaviate follow a class-property structure, which lets every object be queried natively with GraphQL and makes complex filtered and scalar queries efficient. In fact, the combination of a traditional inverted index with a vector index is what really makes Weaviate stand out: in a single query, users can choose to include or exclude data objects with specific scalar values (text, numbers, etc.) from the vector search.
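As a minimal sketch of what such a hybrid query can look like (Python client, v3-style syntax; the "Paper" class, its properties and the filter values are hypothetical):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# Embedding produced elsewhere by our own model; truncated toy values here.
query_vector = [0.12, -0.34, 0.56, 0.78]

result = (
    client.query
    .get("Paper", ["title", "year"])              # hypothetical class and properties
    .with_near_vector({"vector": query_vector})   # the vector part of the query
    .with_where({                                 # scalar filter in the same query
        "path": ["year"],
        "operator": "GreaterThanEqual",
        "valueInt": 2018,
    })
    .with_limit(10)
    .do()
)
print(result["data"]["Get"]["Paper"])
```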

Pluggable vector indexes

Another interesting aspect of Weaviate’s design is that its API is highly modular. The vector indexing API is structured as a plugin system, which allows Weaviate to adapt as vector search continues to improve.

Weaviate’s current vector index type is HNSW, a state-of-the-art approximate nearest neighbour (ANN) vector search algorithm. ANN search is a very active research area, and new index architectures that improve recall and efficiency are proposed all the time. Because Weaviate’s vector index API is backend-agnostic, when newer and better ANN indexes are added to Weaviate in the future, users should be able to switch with minimal changes to their setup (perhaps essentially none at all).

I think designing the API so it can accommodate any future vector index is a very good choice. Too many text search engines are still stuck with retrieval methods from more than 20 years ago that have long since been surpassed, simply because the code is too tightly coupled to be replaced.

Modules and integrations

Beyond Weaviate’s overall modular approach, its search functionality is also built on modules, including many vectorization modules that convert data into vectors. Weaviate ships with some very powerful modules that cover several common conversion steps, and you can also create your own modules to stitch together a custom pipeline. Here are some of the common modules (a configuration sketch follows the list):

text2vec-contextionary: a very interesting module that essentially stores a representation of each data object together with its context in the database.

text2vec-transformers: uses the rich embedding models from sentence-transformers to create a paragraph/document embedding for every text object imported into Weaviate, so developers do not have to write inference code themselves.

img2vec-neural: similar to text2vec, this module uses large pre-trained computer vision models to vectorize images, seamlessly enabling semantic search over any image.
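As a rough sketch of how one of these modules is wired up, assuming a Weaviate instance with the text2vec-transformers module enabled and a hypothetical "Article" class:

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# Hypothetical class whose text properties are vectorized at import time by
# the text2vec-transformers module, so no client-side inference code is needed.
article_class = {
    "class": "Article",
    "vectorizer": "text2vec-transformers",
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "abstract", "dataType": ["text"]},
    ],
}
client.schema.create_class(article_class)

# Objects added from now on are embedded automatically by the module.
client.data_object.create(
    {"title": "A toy article", "abstract": "Some abstract text."},
    "Article",
)
```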

Horizontal scaling

Note: at the time of writing, the horizontal scaling feature had just landed in the release candidate v1.8.0-rc0, with a stable release expected by autumn 2021.

Although our use case currently fits on a single-node Weaviate instance, we ultimately need a vector search solution that can scale without limit. This is notoriously tricky in the world of vector search: many vector search algorithms hit a ceiling once the number of vectors reaches a certain order of magnitude. Weaviate was designed with horizontal scaling into a cluster of nodes in mind, very much like what Elasticsearch does for text search today. In the scalable version of Weaviate, an index is broken down into many shards, effectively small ANN indexes, which can then be distributed across multiple nodes.

With this setup there is effectively no limit on the number of objects that can be added to a Weaviate cluster, because it can scale out as needed without sacrificing performance.
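As a sketch of what this might look like from the client side once horizontal scaling is available (v1.8.0+), a class can be told how many shards to spread itself across via its sharding configuration; the class name and values below are illustrative, and the exact keys should be checked against the docs of the version you run:

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# Illustrative class split into 4 shards that Weaviate can distribute
# across the nodes of the cluster.
paper_class = {
    "class": "Paper",
    "vectorizer": "none",
    "shardingConfig": {
        "desiredCount": 4,
    },
}
client.schema.create_class(paper_class)
```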

Horizontal scalability is the most critical feature for putting a vector search engine into production for real, and Weaviate is well positioned here. The entire codebase, including the custom HNSW implementation, is written in Go, a language well suited to large, scalable systems.

Using Weaviate to power knowledge graph search

For anyone considering Weaviate, it is worth noting that it can be memory-hungry, though this depends on the “mode” you need. If you are adding new objects to the index, i.e. doing heavy writes, memory consumption can be very high. To deal with this, we restart Weaviate after large batches of inserts, since once the vectors have been inserted they no longer all need to be held in memory.

Let’s discuss how Keenious actually uses Weaviate to power our upcoming knowledge graph search feature.

Although Weaviate has all the modules mentioned earlier, at its core it is a pure vector-native database and search engine. Since we have trained custom models to generate rich embedding vectors for our project, we simply import all of our vectors directly into Weaviate without any conversion. On a single-node deployment we have already indexed more than 60 million documents this way.
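A minimal sketch of this import path, assuming a hypothetical "Paper" class with the vectorizer set to "none" so Weaviate stores our own graph embeddings as-is (Python client, v3-style syntax; the IDs, properties and vectors below are toy values):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# "vectorizer": "none" tells Weaviate to store the vectors we supply instead
# of computing its own, which is what we want for custom graph embeddings.
client.schema.create_class({
    "class": "Paper",
    "vectorizer": "none",
    "properties": [{"name": "title", "dataType": ["text"]}],
})

# (uuid, title, embedding) triples produced by our own model; toy values here.
papers = [
    ("7c7a1f0e-0c0a-4bd0-9a0a-0c5a4b1e2f3d", "A sample paper", [0.1, 0.2, 0.3, 0.4]),
]

client.batch.configure(batch_size=100)
with client.batch as batch:
    for paper_id, title, embedding in papers:
        batch.add_data_object(
            {"title": title},
            "Paper",
            uuid=paper_id,
            vector=embedding,   # the pre-computed graph embedding
        )
```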

At present we build the index and database on our own custom workstation, affectionately known to everyone at Keenious as Goku. As a next step we will migrate this work to Kubernetes. Weaviate provides a very handy tool for quickly generating a basic docker-compose file, which makes getting started smooth.

Weaviate offers many different clients, depending on the language you are comfortable with. The Python, Go, Java, JavaScript and CLI clients are all mature products.

Some tips for optimizing Weaviate

At present there are four main vector index parameters worth tuning. They are:

efConstruction

maxConnections

ef

vectorCacheMaxObjects

The first three parameters come directly from HNSW itself and are specific to the algorithm, so see the original paper for a more detailed explanation of each. The main effect of these parameters, especially efConstruction and maxConnections, is the trade-off between recall/accuracy on one side and resource usage/import time on the other.

Increasing maxConnections usually improves index quality, but it also increases the size of the HNSW graph in memory. If you cannot afford the extra memory, efConstruction may be the parameter to increase instead, although raising it lengthens import time.

The ef parameter only comes into play at search time, and the right value depends on the number of objects in the index and your latency requirements. We found that using a higher efConstruction value during indexing lets us get away with a lower ef value at search time.

Be careful with vectorCacheMaxObjects. When indexing, you almost certainly want it to be greater than or equal to the number of objects in your dataset, but when Weaviate is only being used for search it can be beneficial to keep it lower to save memory: not all vectors need to be held in memory, since the HNSW graph does the heavy lifting and the vectors are only used to compute the final distance scores.
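For reference, here is a sketch of where these four parameters live, namely in a class’s vectorIndexConfig; the class name and values below are purely illustrative, not recommendations:

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# Illustrative values only: tune them against your own recall, memory and
# import-time requirements.
paper_class = {
    "class": "Paper",
    "vectorizer": "none",
    "vectorIndexConfig": {
        "efConstruction": 256,               # higher -> better index quality, slower import
        "maxConnections": 64,                # higher -> better recall, bigger in-memory graph
        "ef": 128,                           # search-time quality vs. latency trade-off
        "vectorCacheMaxObjects": 100000000,  # >= dataset size while indexing
    },
}
client.schema.create_class(paper_class)
```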

Summary

At Keenious we are very happy with the quality of Weaviate’s vector search and with all the additional features built on top of it, which really are game-changers for vector search.

Choosing Weaviate has let us focus fully on building great features for our search engine, features built on the 60 million+ knowledge graph embeddings we store in Weaviate. Being able to tackle academic search problems with a cutting-edge product rather than worrying about the technical plumbing is great.

We will soon (late 2021) be releasing some very interesting features built on this foundation, and we look forward to sharing them with you once they launch, so stay tuned.

Author: Charles Pierse