Entity name matching @AMF

Organized by candres


Development phase starts: Jan. 13, 2021, 5 p.m. UTC

Final phase starts: March 12, 2021, 4 p.m. UTC

Competition ends: March 12, 2021, 8 p.m. UTC



The Autorité des Marchés Financiers (AMF) regulates the French financial marketplace, its participants and the investment products distributed via the markets, and ensures that investors are properly informed. It is also a driving force behind regulatory change at both European and international levels. As an independent public authority, the AMF has regulatory powers and a substantial level of financial and managerial independence.



As a regulator, the AMF receives terabytes of financial markets data and, in order to look at the big picture, needs to combine data from different sources. One of the most important ways to combine these data is to use the Legal Entity Identifier (LEI), in other words, the identification code of each market participant.

Unfortunately, some of these data are parsed from text documents (such as corporate operations reporting) that do not systematically require the LEI. When no LEI is given, we can at least extract the name of the market participant (sometimes as free text) and then try to identify the correct LEI by looking it up in the official LEI database, maintained by the Global Legal Entity Identifier Foundation (GLEIF).

The LEI, whose global system the GLEIF oversees, is designed to uniquely and unambiguously identify participants in financial transactions (more information at https://www.gleif.org/en).

Let’s take a fake example…

The QUEEN ELIZABETH CASTLE OF MEY TRUST is an entity whose LEI is 2138002N4IATSZWC6M22.

Let’s imagine that the AMF receives a document as follows:

“The trust in charge of the Queen Elizabeth Castle Of Mey announces that it has issued bonds worth EUR 230 million in December 2020.”

In this case, we can identify the entity as “Queen Elizabeth Castle Of Mey”, but there are two problems: first, we do not have the related LEI; second, the name is not exactly the same as in the official register (the GLEIF), since the last word, “TRUST”, is missing.

Goal of the challenge

The goal of the challenge is to predict the LEI of an entity from a (probably) approximate description of its name.

Given a list of entities whose LEI is missing and an extract of the GLEIF, the challenger is invited to predict the LEI of each entity.
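As a rough illustration of the task, the sketch below matches an approximate name against a tiny register using Python's standard `difflib` module. It is only a naive baseline under stated assumptions, not the organizers' method: the function name `predict_lei`, the `cutoff` value, and the second register entry (with its placeholder LEI) are hypothetical.

```python
import difflib

# Toy extract of the register: official name -> LEI.
# Only the first entry comes from the example above; the second is made up.
GLEIF_EXTRACT = {
    "QUEEN ELIZABETH CASTLE OF MEY TRUST": "2138002N4IATSZWC6M22",
    "EXAMPLE HOLDINGS LIMITED": "HYPOTHETICALLEI00001",
}


def predict_lei(approx_name, register, cutoff=0.6):
    """Return the LEI of the closest official name, or None if nothing is close enough."""
    # Compare in lowercase so that capitalisation does not affect the similarity.
    by_lower = {name.lower(): lei for name, lei in register.items()}
    matches = difflib.get_close_matches(
        approx_name.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[matches[0]] if matches else None


print(predict_lei("Queen Elizabeth Castle Of Mey", GLEIF_EXTRACT))
```

Comparing every approximate name with every official name this way scales poorly on a full register, which is why some form of candidate reduction is usually needed.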

One interesting intermediate step could be to detect clusters among the entities of the official database. These clusters could gather entities that belong to the same group. It could then be easier to find the most likely LEI of an entity within its cluster rather than in the whole database.
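One very crude way to form such clusters is to group official names by their first few tokens after dropping legal-form words. The sketch below assumes this heuristic purely for illustration; the `LEGAL_FORMS` list, the function names, and every entity and LEI in the toy register are made up.

```python
from collections import defaultdict

# Hypothetical list of legal-form words ignored when grouping; extend as needed.
LEGAL_FORMS = {"sa", "sas", "ltd", "limited", "plc", "inc", "gmbh", "trust"}


def cluster_key(name, n_tokens=2):
    """Use the first n non-legal-form tokens of a name as a crude group key."""
    tokens = [t for t in name.lower().split() if t not in LEGAL_FORMS]
    return " ".join(tokens[:n_tokens])


def build_clusters(register):
    """Map each cluster key to the list of (name, LEI) pairs sharing it."""
    clusters = defaultdict(list)
    for name, lei in register.items():
        clusters[cluster_key(name)].append((name, lei))
    return dict(clusters)


# Made-up register entries with placeholder LEIs.
register = {
    "ACME BANK SA": "PLACEHOLDERLEI000001",
    "ACME BANK ASSET MANAGEMENT SAS": "PLACEHOLDERLEI000002",
    "GLOBEX INSURANCE LTD": "PLACEHOLDERLEI000003",
}
clusters = build_clusters(register)

# Only the entities in the matching cluster need to be compared in detail.
candidates = clusters.get(cluster_key("Acme Bank Asset Mgmt"), [])
```

A real solution would likely need a more robust grouping (for example, one based on name similarity), since entities of the same group do not always share their leading tokens.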




There are 2 phases:

  • Phase 1: development phase. We provide you with labeled training data and unlabeled validation and test data. Make predictions for both datasets. However, you will receive feedback on your performance on the validation set only. The performance of your LAST submission will be displayed on the leaderboard.
  • Phase 2: final phase. You do not need to do anything. Your last submission of phase 1 will be automatically forwarded. Your performance on the test set will appear on the leaderboard when the organizers finish checking the submissions.

You must submit 2 files:

  • submission_val.csv containing the prediction for the val.csv dataset
  • submission_test.csv containing the prediction for the test.csv dataset

Each submitted file must be a CSV file with a single column named “lei” that contains the predicted LEIs. The predictions must be in the same order as the rows of val.csv/test.csv, and the number of lines must be the same.


val.csv    submission_val.csv
name       lei
name1      lei1
name2      lei2
name3      lei3
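The file format described above can be produced with Python's standard `csv` module. This is a minimal sketch that assumes the input file has a header row with a `name` column, as in the layout shown; `write_submission` and its `predict` argument are hypothetical names for illustration.

```python
import csv


def write_submission(input_path, output_path, predict):
    """Write one predicted LEI per input row, preserving the input order."""
    # Read the approximate names in file order (assumes a "name" header column).
    with open(input_path, newline="", encoding="utf-8") as src:
        names = [row["name"] for row in csv.DictReader(src)]

    # Write a single "lei" column with exactly one line per input name.
    with open(output_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["lei"])
        for name in names:
            writer.writerow([predict(name)])
    return len(names)
```

For example, `write_submission("val.csv", "submission_val.csv", my_model)` would write one line per row of val.csv, where `my_model` stands for any function mapping a name to a predicted LEI.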


Submissions must be made before the end of phase 1. You may make up to 5 submissions per day and 100 in total.

This challenge is governed by the general ChaLearn contest rules.


Our approach to this problem is to clean the text (lowercasing, removing accents and punctuation) and to create a TF-IDF vector representation of each name. We then compute the Euclidean distances between the approximate names and all the GLEIF names.
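The steps above can be sketched in pure Python as follows. This is not the organizers' code: the use of character trigrams as TF-IDF terms, the smoothed IDF formula, the L2 normalisation, and all function names are assumptions made for illustration.

```python
import math
import string
import unicodedata
from collections import Counter


def clean(text):
    """Lowercase, strip accents and replace punctuation with spaces."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    return " ".join(text.split())


def ngrams(text, n=3):
    """Character n-grams of the cleaned text (an assumed choice of TF-IDF terms)."""
    text = clean(text)
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]


def fit_idf(names):
    """Smoothed inverse document frequency of each n-gram over the register."""
    df = Counter(g for name in names for g in set(ngrams(name)))
    return {g: math.log((1 + len(names)) / (1 + c)) + 1 for g, c in df.items()}


def tfidf(name, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen n-grams are dropped."""
    tf = Counter(g for g in ngrams(name) if g in idf)
    vec = {g: c * idf[g] for g, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {g: v / norm for g, v in vec.items()}


def euclidean(u, v):
    """Euclidean distance between two sparse vectors."""
    return math.sqrt(sum((u.get(g, 0.0) - v.get(g, 0.0)) ** 2 for g in u.keys() | v.keys()))


def nearest(approx_name, names):
    """Return the official name whose TF-IDF vector is closest to the query."""
    idf = fit_idf(names)
    query = tfidf(approx_name, idf)
    return min(names, key=lambda n: euclidean(query, tfidf(n, idf)))
```

Calling `nearest("Queen Elizabeth Castle Of Mey", names)` on a list of official names returns the closest one, whose LEI can then be looked up. The official vectors are recomputed for every query here to keep the sketch short; in practice they would be computed once.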

The main issue with this method is computation time. We try to fix this issue using filtering rules and approximate distances.
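The exact filtering rules are not described here, so the sketch below shows one possible rule only: an inverted index over tokens, used to keep just the official names that share a sufficiently rare token with the query before any distance is computed. The function names, the `max_df` threshold, and the entity names are all hypothetical.

```python
from collections import defaultdict


def build_index(names):
    """Inverted index: token -> set of positions of the names containing it."""
    index = defaultdict(set)
    for pos, name in enumerate(names):
        for token in set(name.lower().split()):
            index[token].add(pos)
    return index


def candidates(approx_name, names, index, max_df=0.5):
    """Keep names sharing at least one sufficiently rare token with the query."""
    limit = max(1, int(max_df * len(names)))
    hits = set()
    for token in set(approx_name.lower().split()):
        postings = index.get(token, set())
        if 0 < len(postings) <= limit:  # skip very common tokens such as "limited"
            hits |= postings
    return [names[pos] for pos in sorted(hits)]


# Made-up official names for illustration.
names = [
    "ACME BANK LIMITED",
    "ACME INSURANCE LIMITED",
    "GLOBEX ENERGY LIMITED",
    "INITECH LIMITED",
]
index = build_index(names)
shortlist = candidates("Globex Energy", names, index)
```

Distances then only need to be computed against the shortlist rather than the whole register, at the cost of missing matches that share no rare token with the query.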

The performance of our method is about 64%.

Link to the GitHub example here.


LEI: Legal Entity Identifier, the identification code of a market participant.

Approximate name: a text roughly equal to the market participant's name.

GLEIF: the Global Legal Entity Identifier Foundation, the official LEI database.


Start: Jan. 13, 2021, 5 p.m.

Description: Development phase: create models and submit them, or directly submit results on validation and/or test data; feedback is provided on the validation set only.


Start: March 12, 2021, 4 p.m.

Description: Final phase: submissions from the previous phase are automatically cloned and used to compute the final score. The results on the test set will be revealed when the organizers make them available.

Competition Ends

March 12, 2021, 8 p.m.
