❗❗ This repo will no longer be maintained. Please visit https://github.com/milvus-io/bootcamp ❗❗
This solution implements hybrid search of vectors and structured data based on Milvus Partitions.
The following example uses feature vectors and structured data to simulate human faces. For a given vector (which can be regarded as a feature vector extracted from a human face) and its corresponding properties, this example places the vector in a specific partition based on those properties. Vectors in a partition share the same properties. During a query, this example searches for the top 10 similar vectors in the partition that matches the specified properties (time when the face image is acquired, gender, with/without glasses).
- Install Milvus 0.10.4
- pip3 install numpy
- pip3 install faker
This example uses the ANN_SIFT1B dataset.
- Download location: http://corpus-texmex.irisa.fr/
- Base dataset: ANN_SIFT1B Base_set
- Query dataset: ANN_SIFT1B Query_set
Note: You can also use other data files in `.bvecs` format.
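For readers who bring their own data, the sketch below shows one way to load a `.bvecs` file with numpy. It assumes the layout published with the dataset: each record is a 4-byte little-endian integer holding the dimension, followed by that many unsigned 8-bit components. The function name `read_bvecs` and the `count` parameter are illustrative and are not taken from the scripts in this example.

```python
import numpy as np


def read_bvecs(path, count=None):
    """Return up to `count` vectors from a .bvecs file as an (n, dim) uint8 array."""
    # Memory-map the file so that only the requested records are actually read.
    raw = np.memmap(path, dtype=np.uint8, mode="r")

    # The first 4 bytes of every record store the vector dimension.
    dim = int.from_bytes(raw[:4].tobytes(), "little")
    record = 4 + dim
    if dim <= 0 or raw.size % record != 0:
        raise ValueError("file size does not match the .bvecs record layout")

    total = raw.size // record
    take = total if count is None else min(count, total)

    # Reshape into one row per record, then drop the 4-byte dimension header.
    block = raw[: take * record].reshape(take, record)
    return np.array(block[:, 4:])
```

Passing a `count` keeps memory use bounded, which matters for a base set of this size.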
This example contains two scripts: `partition_import.py` and `partition_query.py`. `partition_import.py` defines the data properties and imports data to Milvus in partitions. `partition_query.py` performs hybrid search based on customizable conditions.
The raw data contains feature vectors from 1,000,000 face images. This example code randomly generates properties (time when the image is acquired, gender, with/without glasses) for these vectors. Every 100,000 vectors share the same properties, so the table imported to Milvus has 10 partitions of 100,000 vectors each. (Partitions can hold different numbers of vectors; equal sizes are used here only for simplicity.) Each partition has a partition name and a partition tag. A partition tag is a string composed of the properties of the corresponding partition.
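The partitioning scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than the code in `partition_import.py`: it uses the standard `random` module instead of faker, draws dates from an arbitrary 60-day window, and joins the properties with `/` to form the tag. The helper names and the tag layout are hypothetical.

```python
import random
from datetime import date, timedelta


def random_properties(rng):
    """Draw one (date, sex, glasses) combination as strings."""
    day = date(2019, 11, 1) + timedelta(days=rng.randrange(60))  # assumed date window
    sex = rng.choice(["male", "female"])
    glasses = rng.choice(["True", "False"])
    return day.isoformat(), sex, glasses


def make_partition_tag(day, sex, glasses):
    """Compose a tag from the properties (hypothetical 'date/sex/glasses' layout)."""
    return f"{day}/{sex}/{glasses}"


def plan_partitions(vec_num, base_len, seed=0):
    """Assign each block of `base_len` vectors a partition name and a unique tag."""
    rng = random.Random(seed)
    used_tags = set()
    plan = []
    for index, start in enumerate(range(0, vec_num, base_len)):
        tag = make_partition_tag(*random_properties(rng))
        while tag in used_tags:  # redraw so that every partition gets its own tag
            tag = make_partition_tag(*random_properties(rng))
        used_tags.add(tag)
        plan.append({
            "partition_name": f"partition{index}",
            "partition_tag": tag,
            "start": start,                          # first vector index in the block
            "end": min(start + base_len, vec_num),   # one past the last vector index
        })
    return plan
```

With `vec_num=1_000_000` and `base_len=100_000` the plan has 10 entries, `partition0` to `partition9`, each covering 100,000 vectors.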
Before running this script, you need to check the variables and change their value per your environment and data to ensure that the script runs correctly.
Variable description
Variable | Description |
---|---|
`MILVUS_TABLE` | Name of the table created in Milvus. |
`FILE_PATH` | Path of the dataset to be imported to Milvus. |
`VEC_NUM` | Number of vectors to import into the table. Should not exceed the number of vectors in the local dataset. |
`BASE_LEN` | Number of vectors imported to the table in each batch. Equal to the number of vectors in a partition. |
`VEC_DIM` | Dimension of the vectors in the table. You must set this value to match the dimension of the imported vectors. |
`SERVER_ADDR` | Address of the Milvus server. |
`SERVER_PORT` | Port of the Milvus server. |
Run
$ python3 partition_import.py
As the output of the script shows, the 1,000,000 vectors imported to Milvus are divided into 10 partitions. `partition_tag` is a string such as '2019-11-20', and `partition0` to `partition9` are the `partition_name` values of the corresponding partitions. (Both `partition_tag` and `partition_name` are user-defined.)
Variable description
Variable | Description |
---|---|
`MILVUS_TABLE` | Name of the table in Milvus to be queried. Must be the same as the table name created in `partition_import.py`. |
`QUERY_PATH` | Path of the query dataset. |
`SERVER_ADDR` | Address of the Milvus server. |
`SERVER_PORT` | Port of the Milvus server. |
`TOP_K` | Number of most similar vectors to return for the query vector. |
Variables
Parameter | Long form | Description |
---|---|---|
`-n` | `--num` | Index of the vector to query in the query dataset. |
`-s` | `--sex` | Gender of the face: `male` or `female`. |
`-t` | `--time` | Time when the image is acquired, such as `2019-04-05`. |
`-g` | `--glasses` | Whether the face is with glasses: `True` or `False`. |
`-q` | `--query` | Runs the query. No value needed for this parameter. |
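A command-line interface with these parameters could be declared as in the sketch below. It uses `argparse` from the standard library for illustration; the actual `partition_query.py` may parse its arguments differently, and the function name `build_parser` is an assumption.

```python
import argparse


def build_parser():
    """Declare the query parameters listed in the table above."""
    parser = argparse.ArgumentParser(
        description="Hybrid search restricted by face properties.")
    parser.add_argument("-n", "--num", type=int, default=0,
                        help="index of the vector to query in the query dataset")
    parser.add_argument("-s", "--sex", choices=["male", "female"],
                        help="gender of the face")
    parser.add_argument("-t", "--time",
                        help="acquisition date, such as 2019-04-05")
    parser.add_argument("-g", "--glasses", choices=["True", "False"],
                        help="whether the face is with glasses")
    parser.add_argument("-q", "--query", action="store_true",
                        help="run the query")
    return parser


# Parse an explicit argument list instead of sys.argv to keep the sketch self-contained.
args = build_parser().parse_args(
    ["-n", "0", "-s", "male", "-g", "False", "-t", "2019-11-24", "-q"])
print(args.num, args.sex, args.glasses, args.time, args.query)
```

Properties that are omitted on the command line stay `None`, which a query script can treat as "no restriction on this property".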
Example
The following command queries the top 10 vectors that are most similar to the 0th vector in the query dataset, restricted to male faces without glasses whose images were acquired on 2019-11-24:
$ python3 partition_query.py -n 0 -s male -g False -t 2019-11-24 -q
In the query result, `id` is the position of the vector in the order of insertion into the table, and `distance` is the Euclidean distance between the query vector and the returned vector.
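To make the meaning of `id` and `distance` concrete, the following brute-force sketch ranks the vectors of one partition by Euclidean distance to a query vector. It is a reference computation in numpy for small inputs, not a description of how Milvus performs the search internally; the function name `top_k_l2` is illustrative.

```python
import numpy as np


def top_k_l2(query, vectors, top_k=10):
    """Return (id, distance) pairs for the `top_k` vectors nearest to `query`."""
    vectors = np.asarray(vectors, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)

    # Euclidean distance from the query to every row of `vectors`.
    distances = np.linalg.norm(vectors - query, axis=1)

    # A stable sort keeps insertion order among equal distances.
    order = np.argsort(distances, kind="stable")[:top_k]
    return [(int(i), float(distances[i])) for i in order]
```

Here the `id` of a result is simply the row index, mirroring the insertion order described above.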
The following command queries the top 10 vectors that are most similar to the 10th vector in the query dataset, restricted to male faces whose images were acquired on 2019-12-07:
$ python3 partition_query.py -n 10 -s male -t 2019-12-07 -q
The following command queries the top 10 vectors that are most similar to the 50th vector in the query dataset, restricted to faces without glasses whose images were acquired on 2019-11-29:
$ python3 partition_query.py -n 50 -g False -t 2019-11-29 -q
The following command queries the top 10 vectors that are most similar to the 306th vector in the query dataset, restricted to female faces with glasses:
$ python3 partition_query.py -n 306 -s female -g True -q
The following command queries the top 10 vectors that are most similar to the 255th vector in the query dataset, restricted to male faces:
$ python3 partition_query.py -n 255 -s male -q
The following command queries the top 10 vectors that are most similar to the 3450th vector in the query dataset, with no restriction on properties:
$ python3 partition_query.py -n 3450 -q
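The examples above differ only in which properties are specified: each omitted property widens the search to more partitions. One way to express that selection is sketched below. It assumes, purely for illustration, that every partition tag has the form `date/sex/glasses`; the real tag layout used by the scripts may differ, and `matching_tags` is a hypothetical helper.

```python
def matching_tags(all_tags, time=None, sex=None, glasses=None):
    """Return the partition tags compatible with the properties that were given.

    A property left as None places no restriction on the partitions.
    Tags are assumed to look like '2019-11-24/male/False' (hypothetical layout).
    """
    wanted = (time, sex, glasses)
    selected = []
    for tag in all_tags:
        fields = tag.split("/")
        # Keep the tag if every specified property equals the matching field.
        if all(w is None or w == f for w, f in zip(wanted, fields)):
            selected.append(tag)
    return selected


tags = ["2019-11-24/male/False", "2019-12-07/male/True", "2019-11-29/female/False"]
print(matching_tags(tags, sex="male"))   # both male partitions
print(matching_tags(tags))               # no filter: every partition
```

The resulting list is what a query script would hand to the search call to restrict it to those partitions; an empty list means that no partition holds vectors with the requested properties.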