Semantic Search

An open-source Python library for semantic search, featuring:

  • FAISS for fast vector similarity search.
  • SentenceTransformers for high-quality embeddings.
  • Pluggable database backends (MongoDB, SQLite, Redis, PostgreSQL, MySQL).

With this library, you can efficiently store, index, and retrieve documents based on semantic similarity.


Table of Contents

  1. Features
  2. Databases & FAISS Table
  3. Installation
  4. Usage
  5. Testing
  6. FAQ

Features

  • Multi-DB Support – Choose MongoDB, SQLite, Redis, PostgreSQL, or MySQL.
  • FAISS Index – Build a fast, in-memory index for vector searches.
  • Easy API – add_document(), build_faiss_index(), retrieve().
  • Scalable – Handle millions of embeddings with FAISS.
  • Open Source – Contributions are welcome!

Databases & FAISS Table

| Database   | Best For                           | Advantages                                     | Considerations                                   |
|------------|------------------------------------|------------------------------------------------|--------------------------------------------------|
| MongoDB    | JSON-like docs, horizontal scaling | Great for big data; flexible schema            | No built-in vector search; use an external index |
| SQLite     | Lightweight local storage          | Easy setup; single-file DB                     | Not ideal for concurrent writes                  |
| Redis      | Fast in-memory caching             | Extremely fast reads; good for ephemeral data  | Data stored in RAM; must handle persistence      |
| PostgreSQL | Traditional SQL, robust & reliable | Optional pgvector extension; ACID-compliant    | Requires indexing optimization for large data    |
| MySQL      | Widely used SQL store              | Scalable; familiar to many devs                | No native vector index; manual approach needed   |

FAISS:

  • IndexFlatL2 is used for demonstration (simple and effective for smaller datasets).
  • For larger datasets, consider IVF, HNSW, or GPU-based indexing (see the sketch below).
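
As a point of reference, here is a minimal sketch of the two index styles using the FAISS API directly (outside this library); the dimension and parameters are illustrative assumptions:

import faiss
import numpy as np

dim = 384  # embedding dimension (illustrative)
embeddings = np.random.rand(10000, dim).astype(np.float32)

# Exact search: simple and effective for smaller collections.
flat_index = faiss.IndexFlatL2(dim)
flat_index.add(embeddings)

# Approximate search: IVF partitions vectors into nlist clusters and must be trained first.
nlist = 100
quantizer = faiss.IndexFlatL2(dim)
ivf_index = faiss.IndexIVFFlat(quantizer, dim, nlist)
ivf_index.train(embeddings)
ivf_index.add(embeddings)
ivf_index.nprobe = 10  # clusters searched per query; higher means better recall, slower search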

Installation

  1. Install from PyPI:

     pip install pysemantic-search
  2. Or install from source for development:

    git clone /~https://github.com/username/semantic-search.git
    cd semantic-search
    pip install -r requirements.txt
    
    Make sure to adjust dependencies (faiss-cpu vs. faiss-gpu) depending on your environment.
    If you plan to use PyTorch on GPU, install torch with CUDA support.
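
     For example, using the standard FAISS wheels on PyPI:

     pip install faiss-cpu   # CPU-only environments
     pip install faiss-gpu   # CUDA-enabled environments (match your CUDA version)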

Usage

Below is a basic example using MongoDB as the database backend. You can switch to other backends by changing db_type.

from semantic_search import SemanticSearch, DatabaseFactory

# 1. Create a database connection (MongoDB example)
db = DatabaseFactory.create_database(
    db_type="mongodb",
    mongo_uri="mongodb://localhost:27017/",
    db_name="semantic_db",
    collection_name="documents"
)

# 2. Initialize SemanticSearch with the DB
search_engine = SemanticSearch(database=db)

# 3. If documents already exist, build FAISS index
try:
    search_engine.build_faiss_index()
except ValueError:
    print("No existing documents found. The FAISS index was not built.")

# 4. Add a new document
search_engine.add_document("Deep learning for NLP is a powerful tool.")

# 5. Retrieve similar documents
query = "Best techniques for NLP deep learning?"
results = search_engine.retrieve(query, top_k=3)
print("Results:", results)

Testing

This library includes pytest tests in the tests/ directory. To run them locally:

  1. Install dev dependencies (pytest, pylint, etc.) from requirements.txt.
  2. Run the test suite:
python -m pytest tests/

FAQ

1. Why do I see a type error about faiss_index.add(x)?

FAISS’s Python bindings are generated via SWIG, so static type checkers (like Pyright) think the signature is add(n, x). We add # type: ignore to bypass this false positive. Runtime usage works fine with add(x).
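
For illustration, a minimal standalone sketch of the workaround (the variable names and dimension are assumptions, not the library's internals):

import faiss
import numpy as np

dim = 384
faiss_index = faiss.IndexFlatL2(dim)
embeddings = np.random.rand(10, dim).astype(np.float32)

# Pyright infers the SWIG signature add(n, x); at runtime add(x) is correct.
faiss_index.add(embeddings)  # type: ignore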

2. Is .cpu() needed for embeddings?

  • By default, SentenceTransformers returns NumPy arrays if you pass convert_to_tensor=False, so .cpu() is not needed.
  • If you use convert_to_tensor=True, you get a PyTorch tensor. Convert it with:
    embedding = embedding.cpu().numpy().astype(np.float32)
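
A minimal sketch of both paths with SentenceTransformers (the model name is illustrative):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model

# Default / convert_to_tensor=False: a NumPy array, no .cpu() needed.
embedding = model.encode("Deep learning for NLP", convert_to_tensor=False)
embedding = np.asarray(embedding, dtype=np.float32)

# convert_to_tensor=True: a PyTorch tensor; move it to CPU before converting for FAISS.
tensor_embedding = model.encode("Deep learning for NLP", convert_to_tensor=True)
embedding = tensor_embedding.cpu().numpy().astype(np.float32)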

3. Which database is best?

  • MongoDB or PostgreSQL for large data.
  • Redis for fast in-memory lookups.
  • SQLite for small local apps.

4. Can I store embeddings in the DB & FAISS on disk?

The library currently keeps the IndexFlatL2 index in memory only. If you need persistence, FAISS itself can serialize any index (including IndexFlatL2) to disk via faiss.write_index()/faiss.read_index(), or you can store the raw embeddings (e.g. in HDF5) and rebuild the index at startup.
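
A minimal sketch of persisting and reloading an index with the standard FAISS I/O helpers (the file name and dimension are illustrative):

import faiss
import numpy as np

dim = 384
index = faiss.IndexFlatL2(dim)
index.add(np.random.rand(1000, dim).astype(np.float32))

# Persist the index to disk and load it back later.
faiss.write_index(index, "semantic.index")
index = faiss.read_index("semantic.index")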

5. How do I contribute?

Fork this repo, create a new branch, add tests in tests/, and submit a PR with your changes.

Reference

/~https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/README.md
