feat: disable matcher predictor for category
raphael0202 committed Aug 29, 2023
1 parent 2d729d4 commit 07ada5b
Showing 9 changed files with 14 additions and 92 deletions.
20 changes: 1 addition & 19 deletions doc/explanations/category-prediction.md
@@ -4,24 +4,6 @@ Knowing the category of each product is critically important at Open Food Facts,

In Open Food Facts, more than 12,500 categories exist in the [category taxonomy](https://static.openfoodfacts.org/data/taxonomies/categories.full.json) (as of March 2023). Category prediction using product metadata was one of the first projects developed as part of Robotoff in 2018.

Two complementary approaches currently exist in production to predict categories: a matching-based approach and a machine learning one.

## Matcher

A simple "matcher" algorithm is used to predict categories from product names. This used to be done using Elasticsearch, but it is now included directly in the Robotoff codebase [^matcher]. It currently works for the following languages: `fr`, `en`, `de`, `es`, `it`, `nl`.
The product name and all category names in target languages are preprocessed with the following pipeline:

- lowercasing
- language-specific stop word removal
- language-specific lookup-based lemmatization: fast and independent of part of speech for speed and simplicity
- text normalization and accent stripping

Then a category is predicted if the category name is a substring of the product name.

Many false positives came from the fact that some category names are also ingredients: the category *fraise* matched the product name *jus de fraise*. To prevent this, a non-full match (a full match means the two preprocessed strings are identical) is only allowed for an ingredient category if the match starts at the beginning of the product name. There are still false positives in English, as adjectives come before nouns (ex: *strawberry juice*), so partial matching for ingredient categories is disabled for English.
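
The matching rules described in this removed section can be sketched roughly as follows. This is an illustrative reconstruction, not the code in `robotoff.prediction.category.matcher`: the function names, the tiny `STOP_WORDS` and `LEMMAS` tables, and the `is_ingredient` flag are all assumptions made for the example.

```python
import unicodedata

# Hypothetical, tiny lookup tables; the real matcher relies on
# language-specific resources that are not shown in this diff.
STOP_WORDS = {"fr": {"de", "la", "le", "aux"}}
LEMMAS = {"fr": {"fraises": "fraise"}}


def strip_accents(text: str) -> str:
    """Remove combining accent marks after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def preprocess(text: str, lang: str) -> str:
    """Lowercase, drop stop words, lemmatize by lookup, strip accents."""
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS.get(lang, set())]
    lemmas = LEMMAS.get(lang, {})
    tokens = [lemmas.get(t, t) for t in tokens]
    return strip_accents(" ".join(tokens))


def match(
    product_name: str, category_name: str, lang: str, is_ingredient: bool = False
) -> bool:
    """Return True if the category name matches the product name."""
    product = preprocess(product_name, lang)
    category = preprocess(category_name, lang)
    if not category or category not in product:
        return False
    if is_ingredient and product != category:
        # Partial match on an ingredient category: disabled for English,
        # otherwise only accepted at the start of the product name.
        if lang == "en":
            return False
        return product.startswith(category)
    return True
```

With these assumed tables, `match("Jus de fraise", "Fraise", "fr", is_ingredient=True)` is rejected because the match does not start the product name, which is the *jus de fraise* case the text describes.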

## ML prediction

A neural network model is used to predict categories [^neural]. Details about the model training, results and model assets are available on the [model robotoff-models release page](/~https://github.com/openfoodfacts/robotoff-models/releases/tag/keras-category-classifier-image-embeddings-3.0).

This model takes as inputs (all inputs are optional):
@@ -53,6 +35,6 @@ Here is a summary on the milestones in category detection:
- 2022-10 | Remove Elasticsearch-based category predictor, switch to custom model in Robotoff codebase

- 2023-03 | Deployment of the [v3 model](/~https://github.com/openfoodfacts/robotoff-models/releases/tag/keras-category-classifier-image-embeddings-3.0)
- 2023-08 | Disabling of the `matcher` predictor: an analysis through Hunger Games showed that most errors came from the `matcher` predictor, and that the `neural` predictor usually gave accurate predictions for the products on which the `matcher` predictor failed.

[^matcher]: see `robotoff.prediction.category.matcher`
[^neural]: see `robotoff.prediction.category.neural`
8 changes: 1 addition & 7 deletions doc/introduction/architecture.md
@@ -75,15 +75,9 @@ Robotoff is also notified by Product Opener every time a product is updated or d

Robotoff also depends on the following services:

- a single node Elasticsearch instance, used to:
- infer the product category from the product name, using an improved string matching algorithm. [^predict_category] (used in conjunction with ML detection)
- index all logos to run ANN search for automatic logo classification [^logos]
- a single node Elasticsearch instance, used to index all logos to run ANN search for automatic logo classification [^logos]
- a Triton instance, used to serve object detection models (nutriscore, nutrition-table, universal-logo-detector) [^robotoff_ml].
- a Tensorflow Serving instance, used to serve the category detection model. We're going to get rid of Tensorflow Serving once a new categorizer is trained. [^robotoff_ml]
- [robotoff-ann](/~https://github.com/openfoodfacts/robotoff-ann/) which uses an approximate KNN approach to predict logo label
- MongoDB, to fetch the product latest version without querying Product Opener API.


[^predict_category]: see `robotoff.prediction.category.matcher`

[^robotoff_ml]: see `docker/ml.yml`
2 changes: 1 addition & 1 deletion robotoff/app/api.py
@@ -565,7 +565,7 @@ def on_post(self, req: falcon.Request, resp: falcon.Response):
f"category predictor is only available for 'off' server type (here: '{server_type.name}')"
)

predictors: list[str] = req.media.get("predictors") or ["neural", "matcher"]
predictors: list[str] = req.media.get("predictors") or ["neural"]
neural_model_name = None
if (neural_model_name_str := req.media.get("neural_model_name")) is not None:
neural_model_name = NeuralCategoryClassifierModel[neural_model_name_str]
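
The effect of the `or` fallback in this hunk can be checked in isolation. The sketch below uses a plain dict in place of `req.media`, and `resolve_predictors` is a hypothetical helper name, not a function in Robotoff.

```python
def resolve_predictors(media: dict) -> list[str]:
    """Mirror the handler expression with the new default.

    A missing key, None, or an empty list are all falsy, so each of them
    falls back to the default list.
    """
    return media.get("predictors") or ["neural"]


print(resolve_predictors({}))
print(resolve_predictors({"predictors": []}))
print(resolve_predictors({"predictors": ["matcher"]}))
```

This hunk only changes the default; whether an explicitly requested `matcher` is still honoured further down the handler is not visible here.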
15 changes: 0 additions & 15 deletions robotoff/cli/main.py
@@ -113,21 +113,6 @@ def generate_ocr_predictions(
)


@app.command()
def predict_category(output: str) -> None:
"""Predict categories from the product JSONL dataset stored in `datasets`
directory."""
from robotoff import settings
from robotoff.prediction.category.matcher import predict_from_dataset
from robotoff.products import ProductDataset
from robotoff.utils import dump_jsonl

dataset = ProductDataset(settings.JSONL_DATASET_PATH)
insights = predict_from_dataset(dataset)
dict_insights = (i.to_dict() for i in insights)
dump_jsonl(output, dict_insights)


@app.command()
def download_dataset(minify: bool = False) -> None:
"""Download Open Food Facts dataset and save it in `datasets` directory."""
7 changes: 7 additions & 0 deletions robotoff/prediction/category/matcher.py
@@ -1,3 +1,10 @@
"""A simple "matcher" algorithm, used to predict categories from product names.
It is currently disabled: on Hunger Games, most categorization errors came from
the matcher predictor, and the neural categorizer almost always returns more
accurate predictions for products on which the matcher predictor fails.
"""

import datetime
import functools
import itertools
34 changes: 1 addition & 33 deletions robotoff/scheduler/__init__.py
@@ -13,21 +13,15 @@

from robotoff import settings, slack
from robotoff.insights.annotate import UPDATED_ANNOTATION_RESULT, annotate
from robotoff.insights.importer import (
BrandInsightImporter,
import_insights,
is_valid_insight_image,
)
from robotoff.insights.importer import BrandInsightImporter, is_valid_insight_image
from robotoff.metrics import (
ensure_influx_database,
save_facet_metrics,
save_insight_metrics,
)
from robotoff.models import Prediction, ProductInsight, db
from robotoff.prediction.category.matcher import predict_from_dataset
from robotoff.products import (
Product,
ProductDataset,
fetch_dataset,
get_min_product_store,
has_dataset_changed,
@@ -294,26 +288,6 @@ def _update_data():
logger.exception("Exception during product dataset refresh")


def generate_insights() -> None:
"""Generate and import category insights from the latest dataset dump, for
products added at day-1."""
logger.info("Generating new category insights")

datetime_threshold = datetime.datetime.utcnow().replace(
hour=0, minute=0, second=0, microsecond=0
) - datetime.timedelta(days=1)
dataset = ProductDataset(settings.JSONL_DATASET_PATH)
product_predictions_iter = predict_from_dataset(dataset, datetime_threshold)

with db:
import_result = import_insights(
product_predictions_iter,
# Currently the JSONL dataset is OFF-only
server_type=ServerType.off,
)
logger.info(import_result)
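
The threshold computed in the removed `generate_insights` job is midnight of the previous day, which is what "products added at day-1" refers to. A small standalone sketch of the same arithmetic follows; `day_minus_one_threshold` is a hypothetical name, and the current time is passed in as an argument so the result is reproducible.

```python
import datetime


def day_minus_one_threshold(now: datetime.datetime) -> datetime.datetime:
    """Truncate `now` to midnight, then step back one day."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - datetime.timedelta(days=1)


# Example: a job firing at 10:15 on 2023-08-29 would use 2023-08-28 00:00.
print(day_minus_one_threshold(datetime.datetime(2023, 8, 29, 10, 15)))
```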


def transform_insight_iter(insights_iter: Iterable[dict]):
for insight in insights_iter:
for field, value in insight.items():
@@ -366,12 +340,6 @@ def run():
max_instances=1,
)

# This job generates category insights using matcher algorithm from the
# last Product Opener data dump.
scheduler.add_job(
generate_insights, "cron", day="*", hour="10", minute=15, max_instances=1
)

scheduler.add_job(
generate_quality_facets,
"cron",
2 changes: 0 additions & 2 deletions robotoff/workers/main.py
@@ -23,10 +23,8 @@ def load_resources(refresh: bool = False):
logger.info("Loading resources in memory...")

from robotoff import brands, logos, taxonomy
from robotoff.prediction.category import matcher
from robotoff.prediction.object_detection import ObjectDetectionModelRegistry

matcher.load_resources()
taxonomy.load_resources()
logos.load_resources()
brands.load_resources()
10 changes: 3 additions & 7 deletions robotoff/workers/tasks/product_updated.py
@@ -3,7 +3,6 @@
from robotoff.insights.extraction import get_predictions_from_product_name
from robotoff.insights.importer import import_insights, refresh_insights
from robotoff.models import with_db
from robotoff.prediction.category.matcher import predict as predict_category_matcher
from robotoff.prediction.category.neural.category_classifier import CategoryClassifier
from robotoff.products import get_product
from robotoff.redis import Lock, LockedResourceException
@@ -55,7 +54,7 @@ def update_insights_job(product_id: ProductIdentifier):
)


def add_category_insight(product_id: ProductIdentifier, product: JSONType):
def add_category_insight(product_id: ProductIdentifier, product: JSONType) -> None:
"""Predict categories for product and import predicted category insight.
:param product_id: identifier of the product
@@ -68,21 +67,18 @@ def add_category_insight(product_id: ProductIdentifier, product: JSONType):
)
return

logger.info("Predicting product categories...")
# predict category using matching algorithm on product name
product_predictions = predict_category_matcher(product)

# predict category using neural model
try:
neural_predictions, _ = CategoryClassifier(
get_taxonomy(TaxonomyType.category.name)
).predict(product, product_id)
product_predictions += neural_predictions
product_predictions = neural_predictions
except requests.exceptions.HTTPError as e:
resp = e.response
logger.error(
f"Category classifier returned an error: {resp.status_code}: %s", resp.text
)
return

if len(product_predictions) < 1:
return
8 changes: 0 additions & 8 deletions tests/unit/workers/tasks/test_product_updated.py
@@ -16,10 +16,6 @@


def test_add_category_insight_no_insights(mocker):
mocker.patch(
"robotoff.workers.tasks.product_updated.predict_category_matcher",
return_value=[],
)
mocker.patch(
"robotoff.workers.tasks.product_updated.CategoryClassifier.predict",
return_value=([], {}),
@@ -43,10 +39,6 @@ def test_add_category_insight_with_ml_insights(mocker):
confidence=0.9,
server_type=DEFAULT_PRODUCT_ID.server_type,
)
mocker.patch(
"robotoff.workers.tasks.product_updated.predict_category_matcher",
return_value=[],
)
mocker.patch(
"robotoff.workers.tasks.product_updated.CategoryClassifier.predict",
return_value=([expected_prediction], {}),