Building AI-powered recognition models for the B2B marketplace «Platferrum»
Last year, digitalization projects accounted for 5.7% of all AI implementations in key industrial sectors. At the same time, marketplaces are experiencing significant growth: they already capture 70% of B2C online trade, and the model is now expanding into the broader business landscape. Platferrum, a marketplace for rolled metal, sits at the intersection of both trends. In this case study, we explain how we helped implement AI-based recognition models and how many types of rolled metal products the AI now recognizes.
Client
Platferrum is a marketplace for rolled metal products. The service launched in October 2022 and was the first project of its kind.
Problem
The Platferrum marketplace accommodates a variety of methods for completing product documentation: each supplier fills out product descriptions in 1C or another system according to their own habits and preferences.
As a result, the Platferrum team encountered a wide range of descriptions for the same product. For example, rebar could be listed as "rebar," "reb.150," "rebar 150," "reb. iron 150," and so on. In the first few months after Platferrum's launch, the project team managed the editing and normalization of product cards manually, handled by the site's content managers.
However, when the number of suppliers exceeded 70 and the number of offers surpassed 20,000, manual processing became unsustainable. Requiring suppliers to use standardized wording would disrupt their business processes and risk damaging their loyalty. Supplier convenience is a priority for the marketplace: without suppliers there is no product assortment, and without a product assortment there are no customers. Meanwhile, every new supplier added to the influx of data, further reducing the throughput of manual processing.
There were three options to solve the problem: hire more content managers, which would increase payroll costs; dedicate the project's development team to implement an AI solution in-house, which would consume valuable IT resources; or outsource the task.
Task
The SimbirSoft team had previously assisted with the implementation of platform features, so Platferrum approached us to develop two machine learning (ML) modules:
Product Description Recognition and Matching Module:
This module analyzes purchase requests for metal products from suppliers, identifies product descriptions and quantities, matches the found supplier product cards (SPCs) with corresponding reference product cards (RPCs) in the database, and determines the product category. If the product is successfully recognized, the supplier's card is added to the database and linked to the reference card. If not, a decision is made regarding the addition of a new reference card.
PDF Table Recognition Module: This module extracts information for scoring from suppliers' financial statements, which are submitted in PDF format.
Typically, client financial statements are presented in tables, so the AI scoring task is divided into two stages: table recognition and data extraction.
Challenges of Working with Rolled Metal Product Nomenclature
When we began working on the module, the database already contained approximately 120,000 SPCs, matched manually or algorithmically via per-supplier configurations. Creating and maintaining these configurations was inconvenient and labor-intensive, so an automatic matching module was the only scalable solution.
Each SPC and RPC contains a brief textual description and a set of specific attributes characterizing the product.
Card Description Example:
"Electric-welded straight-seam pipe 18x1x6000 AISI 304 GOST 11068-81"
Attributes:
"Electric-welded straight-seam pipe" - product category
"18x1x6000" - dimensions
"AISI 304" - steel grade
"GOST 11068-81" - GOST standard
The task of recognizing SPCs is complex because the order of the attributes can vary, and each product category has its own set of attributes. Some data might be missing, and the way attributes are written can differ. Furthermore, words in these documents are often abbreviated, product lists are simplified (e.g., product names are combined), and errors are common.
Card matching is complicated by the fact that the database contains about 30,000 reference cards with very similar product descriptions. Sometimes the difference between descriptions comes down to a single character, usually related to size, GOST standard, or steel grade. The database may also contain duplicates and cards with incorrect descriptions.
The task is further complicated by the variety of supplier request formats: text (.txt), documents (.doc, .docx, .xlsx, .xls), images (.jpg, .png), and even .pdf.
Examples of text requests
- "Beam 20Sh1 STO ASChM 2093 (S245) L=5370 7 pcs"
- "Pipes (detailed list in the attached file) volume negotiable"
- "Tubing NKT 102x6.5x9500 mm with coupling and thread gr. D st.20 GOST 633-80 1.5 tons"
- "Seamless pipe inner diameter 340mm wall not less than 12mm volume negotiable"
- "Need pipe scraps, used, clean, restored, stored without ovals, dents, ellipses without insulation._x000D_
108*5mm-1m,108*9mm-1m,219*6mm-1m,219*9mm-1m,530*8mm-1m,530*12mm-1m,820*10-1m,820*12mm-1m,1220*12mm-1m,820*19mm_x000D_
1m,_x000D_1220*12-1m,1220*19mm-1m volume negotiable"
Each format required a specific approach:
- Text requests: May contain additional information about the supplier, not just the product description. Intelligent text processing is needed to filter this information.
- Excel files: Require extraction of the necessary data from tables. This necessitates accurate conversion of tables to text format without disrupting the document structure.
- Images: Usually scans or photos of requests that need to be recognized and converted to text.
- PDF files: Can be either scans or text. Both image and text processing are required.
The language models needed to be trained to understand the nuances of industrial nomenclature.
Solution
We approached the problem of recognizing products in text requests as a Named Entity Recognition (NER) task.
In our case, the named entities are product descriptions, their quantities, and units of measurement.
In principle, any modern NLP model can be fine-tuned for this task. We chose BERT-like Russian language models, rubert-tiny2 and sbert_large_nlu_ru, because they offer a good balance of speed and quality. Large language models (LLMs) such as YandexGPT and Llama were also considered, but finding a universal prompt for YandexGPT proved difficult, while Llama demanded substantial computational resources and was hard to integrate into the existing infrastructure.
To train the model, we annotated a dedicated dataset of approximately 4,000 text requests from various suppliers. We used data preprocessing and augmentation techniques, dropout regularization, and early stopping to combat overfitting.
The model training was based on PyTorch, PyTorch Lightning, and Transformers. We chose the PyTorch ecosystem due to the framework's popularity, the availability of pre-trained Russian language models, and its large community.
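For illustration, here is a minimal sketch of the token-classification setup with the Transformers library. It assumes the public cointegrated/rubert-tiny2 checkpoint and a hypothetical BIO label set; the project's actual annotation scheme may differ:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical BIO labels for products, quantities, and units of measurement
LABELS = ["O", "B-PRODUCT", "I-PRODUCT", "B-QTY", "I-QTY", "B-UNIT", "I-UNIT"]

tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
model = AutoModelForTokenClassification.from_pretrained(
    "cointegrated/rubert-tiny2",
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)
```

From here, fine-tuning follows the standard token-classification recipe: tokenize the annotated requests, align labels to subword tokens, and train with cross-entropy loss and early stopping on a validation split.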
Requests can be submitted not only as text but also as files, so we needed to implement parsing mechanisms.
For this, we used several Python libraries and tools, combined in a simple dispatch by file type (sketched after the list):
- .doc, .docx - docx2txt, python-docx, LibreOffice
- .xls, .xlsx - pandas
- .pdf - pymupdf, pdf2image
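A simplified dispatch over these formats might look as follows. This is a sketch: it assumes one request per file and .doc files converted to .docx with LibreOffice beforehand; the production parsers handle more edge cases:

```python
import docx2txt
import fitz  # pymupdf
import pandas as pd

def extract_text(path: str) -> str:
    """Route a supplier request file to the appropriate parser (illustrative sketch)."""
    if path.endswith(".docx"):  # .doc files are converted with LibreOffice first (not shown)
        return docx2txt.process(path)
    if path.endswith((".xls", ".xlsx")):
        # Flatten every sheet into tab-separated lines, preserving the row structure
        sheets = pd.read_excel(path, sheet_name=None, header=None, dtype=str)
        return "\n".join(
            df.fillna("").astype(str).apply("\t".join, axis=1).str.cat(sep="\n")
            for df in sheets.values()
        )
    if path.endswith(".pdf"):  # scanned PDFs go to the OCR branch described below
        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)
    raise ValueError(f"Unsupported format: {path}")
```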
Documents can contain not only text but also images, such as photos of requests with various annotations or scans. Therefore, the text first needs to be recognized by an OCR model before being fed to the NLP model.
We used the popular Tesseract OCR tool for this, as it has relatively high performance, although it is less accurate than alternatives like EasyOCR.
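In its simplest form, this OCR step is a thin wrapper over pytesseract (a sketch; lang="rus" assumes the Russian traineddata pack is installed alongside Tesseract):

```python
import pytesseract
from PIL import Image

def ocr_image(path: str) -> str:
    # Recognize Russian text on a scan or photo of a request
    return pytesseract.image_to_string(Image.open(path), lang="rus")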
Requests are often presented in tables, where each row specifies the product name, and quantities and units of measurement are in separate columns. In these cases, using an NLP model is not always necessary because we can identify entities simply by column names, if available. A tool for recognizing tables from images and scans would be very useful here, as they are often found in PDF documents, for example. We borrowed this functionality from the second scoring module described below.
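A sketch of such column-based extraction, with a hypothetical alias dictionary (the real alias set is domain-specific and broader):

```python
# Hypothetical aliases mapping common Russian header spellings to entity fields
COLUMN_ALIASES = {
    "description": {"наименование", "товар", "описание"},
    "quantity": {"кол-во", "количество", "объем"},
    "unit": {"ед.", "ед. изм.", "единица"},
}

def map_columns(headers: list[str]) -> dict[str, int]:
    """Map recognized table headers to entity fields; tables without a match fall back to the NER model."""
    mapping = {}
    for idx, header in enumerate(headers):
        key = header.strip().lower()
        for field, aliases in COLUMN_ALIASES.items():
            if key in aliases:
                mapping[field] = idx
    return mapping
```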
The resulting product recognition pipeline chains these steps: format-specific parsing, OCR for images and scans, column-based extraction where table headers suffice, and the NER model for free-form text.
We addressed the problem of matching SPCs and RPCs as a product classification task, where each product represents a separate data category.
Thus, with approximately 30,000 RPCs in the database, we have the same number of categories. The samples within each category consist of one RPC and all SPCs matched to it, as these are simply different ways of describing the same product.
For model training, we used the 120,000 already matched SPCs in the database. It's easy to calculate that there are only 4 examples per category on average, which is very limited. Furthermore, the frequency of each product's occurrence is unevenly distributed: some products, like steel sheets, are very popular, while others, like steel valves, have only one entry.
Therefore, to avoid overfitting the model, we filtered out infrequently occurring products and applied data augmentation techniques.
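The augmentations imitate the noise seen in real supplier data; here is a sketch of two typical ones (the project's actual augmentation set may differ):

```python
import random

def augment_description(text: str) -> str:
    """Illustrative augmentations imitating supplier noise: shuffled attribute
    order and dropped attributes."""
    tokens = text.split()
    # Attribute order varies between suppliers, so shuffling everything after
    # the category name produces plausible variants
    if len(tokens) > 2 and random.random() < 0.5:
        head, rest = tokens[0], tokens[1:]
        random.shuffle(rest)
        tokens = [head] + rest
    # Randomly drop one attribute to imitate incomplete descriptions
    if len(tokens) > 2 and random.random() < 0.3:
        tokens.pop(random.randrange(1, len(tokens)))
    return " ".join(tokens)
```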
We trained the classification model using a classic approach: we took a pre-trained rubert-tiny2, added a linear layer with an output dimension equal to the number of categories, and a softmax layer at the output. However, before applying the linear layer, we needed to aggregate the output embeddings from BERT, for example, by combining averaging and max pooling.
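A sketch of this architecture in PyTorch, following the description above; the pooling combination is one reasonable reading of "combining averaging and max pooling":

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ProductClassifier(nn.Module):
    """rubert-tiny2 encoder + mean/max pooling + linear head (illustrative)."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("cointegrated/rubert-tiny2")
        hidden = self.encoder.config.hidden_size
        self.head = nn.Linear(2 * hidden, num_classes)  # 2x: mean and max pooling concatenated

    def embed(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        mean_pool = (out * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        max_pool = out.masked_fill(mask == 0, -1e9).max(dim=1).values
        return torch.cat([mean_pool, max_pool], dim=-1)

    def forward(self, input_ids, attention_mask):
        # Logits; softmax is applied inside the cross-entropy loss during training
        return self.head(self.embed(input_ids, attention_mask))
```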
After training the classification model, we discarded the linear layer and found the closest products using cosine similarity between the embeddings at the output of BERT.
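Matching then reduces to a nearest-neighbor search over these embeddings:

```python
import torch.nn.functional as F

def match_to_rpc(spc_emb, rpc_embs, top_k=5):
    """Return the top_k reference cards closest to a supplier card by cosine similarity."""
    sims = F.cosine_similarity(spc_emb.unsqueeze(0), rpc_embs, dim=-1)
    return sims.topk(top_k)
```

In practice, a similarity threshold would decide whether the best match is accepted or the card is flagged for a new reference card; the threshold value is a tuning decision not shown here.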
As a result, we created two modules that learned to recognize data in supplier cards, match them with reference cards, and handle phrases not related to the card, such as greetings or well-wishes.
Searching for data in tables boils down to finding the column and row with the required code and textual description. Using ML for such a search in this case is redundant and would lead to unnecessary errors, so we focused on the task of table recognition.
Detecting tables with borders is a common computer vision task, but here we had an additional challenge: finding tables without borders and correctly recognizing cells. Applying text recognition algorithms directly to the document is not feasible, as it is crucial to preserve the table structure for subsequent processing. It's important to consider that the PDF format is complex; pages can have different spatial orientations and slight rotation angles, and documents can be searchable, not requiring ML-based recognition. Therefore, additional preprocessing is required.
To solve the recognition problem, we prepared and annotated a dataset that includes table and cell classes and trained corresponding detector models on it. The resulting pipeline is as follows:
1. Preprocessor: Reads the input PDF document and preprocesses it:
- Determines the type: scanned or not.
- If scanned, splits it into grayscale page images, corrects page orientation and rotation angle.
- If searchable PDF, parsing is performed using the pymupdf library, and the following steps are skipped.
2. Table Detector based on YOLOv8: The model finds the boundaries of all tables in the document and determines their type: clear or unclear.
3. Cell Detector within the table:
- If the table has clear borders, a CV border detection algorithm is applied, and the model is not required.
- If the table has unclear borders, a cell border detector model based on YOLOv5 is applied.
4. Text Recognition block within cells using EasyOCR: Consists of two models:
- A pre-trained text detector based on the CRAFT model.
- A pre-trained text recognition model based on a CRNN architecture.
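To make the first two stages concrete, here is a compressed sketch of the searchable-PDF check and the table-detection call. The weights path and class encoding are hypothetical, and deskewing is omitted:

```python
import fitz  # pymupdf
from pdf2image import convert_from_path
from ultralytics import YOLO

table_detector = YOLO("weights/table_detector.pt")  # hypothetical path to the trained YOLOv8 weights

def is_searchable(path: str) -> bool:
    """Treat the PDF as searchable if any page carries an extractable text layer."""
    with fitz.open(path) as doc:
        return any(page.get_text().strip() for page in doc)

def find_tables(pdf_path: str):
    """Yield (page_index, box_tensor, class_tensor) for each scanned page."""
    if is_searchable(pdf_path):
        return  # searchable PDFs are parsed directly with pymupdf
    for i, page in enumerate(convert_from_path(pdf_path, grayscale=True)):
        result = table_detector(page)[0]
        # result.boxes.cls could encode the table type (clear vs. unclear borders)
        yield i, result.boxes.xyxy, result.boxes.cls
```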
A significant challenge was the low performance of the pipeline: processing a multi-page scanned PDF document could take up to several minutes. The text recognition stage consumed the most time. However, we managed to partially optimize it and double the performance by tuning the EasyOCR models.
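Tuning of this kind typically means batching recognition and constraining its output; a sketch, with settings that are illustrative rather than the project's exact configuration:

```python
import easyocr

# gpu=True assumes a CUDA device; recognition dominates runtime, so batching pays off
reader = easyocr.Reader(["ru", "en"], gpu=True)

def read_cells(cell_images):
    """cell_images: numpy arrays cropped from the page by the cell detector."""
    # detail=0 returns plain strings; batch_size trades memory for throughput
    return [" ".join(reader.readtext(img, detail=0, batch_size=16)) for img in cell_images]
```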
During development, we noticed a peculiarity of the detector: it returned cell boxes ordered by confidence score rather than by position on the page (left to right, top to bottom). Once we understood this behavior, we developed an algorithm that reorders the cells into rows for analysis and message generation.
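The reordering itself is straightforward once the behavior is understood: cluster boxes by vertical position, then sort each cluster horizontally (y_tolerance is an illustrative threshold):

```python
def boxes_to_rows(boxes, y_tolerance=10):
    """Group cell boxes (x1, y1, x2, y2) into rows top-to-bottom, then left-to-right,
    countering the detector returning boxes in confidence order."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if rows and abs(box[1] - rows[-1][0][1]) <= y_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b[0]) for row in rows]
```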
The result is a tool capable of processing any PDF document containing financial statements and automatically extracting the necessary data.
Result
The modules are functioning correctly and handling the influx of suppliers, which has already exceeded 170. The product range is expanding, attracting new users.
Without AI and process automation, supporting the platform's growth would have been significantly more expensive; here, artificial intelligence is no longer a trend but a necessity.
By 2030, according to the Ministry of Energy's forecasts, the share of enterprises using artificial intelligence in production will grow to 80%. But even today, in a competitive market and with a shortage of skilled personnel, AI implementation becomes a strong advantage.