For the user, the process is pretty simple. You feed unstructured text (news articles, blog posts, research reports, etc.) into the extraction engine. Intelligent Tagging analyzes your text, processes the extracted information, and returns semantic metadata in the RDF, JSON, or N3 format.

What does metadata mean? What is metadata tagging?

Metadata is “data about data.” For example, the title and author of a book are metadata about the book. In information systems, a tag is a non-hierarchical keyword or term assigned to a piece of information (such as an Internet bookmark, a digital image, or a computer file). This kind of metadata helps describe an item and allows it to be found again by browsing or searching.

Metadata can describe many different types of information. Intelligent Tagging metadata, for example, could describe:

Entities: (Companies, people, places, products, etc.)

Relationships: (John Doe works for Acme Corp.)

Facts: (John Doe is a 42-year-old male CFO.)

Events: (Jane Doe was appointed a Board member of Acme Corp.)

Topics: (Story is about M&As in the Pharma industry)

By tagging various details you can start creating a strong metadata store that can be constantly maintained and searched for intelligent insight.

What is Entity and Relationship Recognition?

During processing, Intelligent Tagging automatically scans and analyzes the input text, searching for mentions of things like companies, people, cities, industries, products, deals, alliances, company earnings announcements, company layoffs, IPOs, stock splits, business relationships, etc.

Intelligent Tagging classifies mentions of straightforward things like companies, people, cities, telephone numbers, etc. as Entities; more complex mentions that indicate relationships between things are classified as Relations. Some examples of relations are: deals, IPOs, analyst recommendations, company reorganizations, product recalls.

What is the Relevance Tag?

All extracted entities have an associated Relevance tag that indicates how centric the entity is to the containing document. Relevance scores range from 0 to 1. The higher the score, the more relevant the entity is to the containing document.

The subject attribute of the Relevance Tag indicates the entity that is the subject of the relevance score.

The relevance score
Currently, the following values are supported:

1 – This value is reserved for the company identified as the reporting company in a document with a predefined format such as an SEC report.

0.8 – Entities defined as having high relevance receive this score. For example: a company that is mentioned in the title; a company mentioned prominently or frequently in the document.

0.5, 0.2 – These scores are assigned to entities that are not highly relevant to the story. For example: a company that is mentioned only once towards the end of a long document; a company mentioned in passing; a company mentioned as a representing law firm, financial advisor, underwriter, etc.

Note: If the tagging output contains multiple companies with a relevance of 0.5 or 0.2, it may indicate that the input document is made up of multiple paragraphs or sections, each discussing a single company in detail (for example, Top News or Breaking News type stories), or that the input document includes charts or tables that display multiple companies. In this case, your use case determines whether the company tag is of interest.

0 – This value is reserved for entities identified as irrelevant to the story. For example, mentions of companies as rating agencies, reporting agencies, stock exchanges, and social applications receive a score of 0.
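
The supported values listed above can be captured in a small lookup table. The following is a minimal sketch, not part of the product: the function name and the label strings are illustrative assumptions, and only the numeric values come from the list above.

```python
# Labels are illustrative; only the numeric values come from the documentation.
RELEVANCE_LABELS = {
    1.0: "reporting company",
    0.8: "high relevance",
    0.5: "low relevance",
    0.2: "low relevance",
    0.0: "irrelevant to the story",
}


def describe_relevance(score):
    """Return a short label for a documented relevance value."""
    # Accept numbers or numeric strings; anything else is reported as unsupported.
    return RELEVANCE_LABELS.get(float(score), "unsupported value")
```

For example, describe_relevance(0.8) returns "high relevance", and a value outside the documented set returns "unsupported value".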

What are Social Tags?

Social tags attempt to emulate how a person would categorize a specific piece of content. For example, if you submit a story about Barack Obama and a piece of legislation, at least one reasonable tag would be U.S. Legislation. A story about the relative merits of BMWs, Ferraris, and Porsches would probably be tagged with sports cars, luxury makes, auto racing, and motorsport.

The story about the Apple Watch Launch generated the following social tags: IOS, Smartwatches, Wearable Computers, Human-computer interaction, Ubiquitous computing, Consumer electronics, Apple Inc., Wearable Technology, and Apple system on a chip.

The SocialTag function does not identify individual items within the text, but rather attempts to provide common sense tags for the piece of content as a whole.

Social tags are derived from the Wikipedia folksonomy. They are periodically updated to keep them current.

What are Topic Tags?

The Intelligent Tagging classification engine identifies the topic or topics that are being discussed in the document. For example, Macroeconomics, Equities, Sports, Entertainment, Politics, Oil & Gas Products, Mergers/Acquisitions/Takeovers, Computer Hardware, Consumer Financial Services, Software and IT Services, etc.

A DocCat (topic) tag is designed to give a general notion of what an input document is about. There is no specific entity recognition in the text, but rather deduction about what the text is about.

The reference list of topics is drawn from the RCS (Refinitiv Classification Services) taxonomy and the IPTC (International Press Telecommunications Council) news taxonomy. Each identified topic results in a Topic (DocCat) tag. It is possible that multiple topics will be identified, or that no topic will be identified if the document does not discuss anything currently defined by the relevant taxonomies.

What is the difference between an Entity Tag and a Social Tag?

An entity extraction is based on an explicit mention in the text, whereas a Social Tag describes what the document is about as a whole. An entity is extracted only if the entity is mentioned in the text. A social tag that identifies a particular subject may be assigned regardless of whether the subject is explicitly mentioned in the text. For example, a news article that discusses a bankruptcy without using the word “bankruptcy” may still be assigned a bankruptcy-related social tag. The same subject may be identified by entity tags and/or social tags in the Intelligent Tagging output.

What is the difference between an Entity Tag and a Topic Tag?

An entity extraction is based on an explicit mention in the text, whereas a topic tag classifies what the document is about as a whole. An entity is extracted only if the entity is mentioned in the text. A topic tag may be assigned regardless of whether the subject is explicitly mentioned in the text. The same subject may be identified by entity tags and/or topic tags in the Intelligent Tagging output.

For example, if the input document mentions German companies like Volkswagen and SAP, the tagging output might include a DocCat (topic) tag for Germany. But unless there is an explicit mention of Germany in the text, the tagging output will not include an em/e/Country (country entity) tag for Germany.

Continuing with this example, if there is an explicit mention of Germany in the text, then the tagging output may include both an em/e/Country (country entity) tag for Germany and a DocCat (topic) tag for Germany. In this case, both tags would output the same RCS code for Germany, as the RCS codes for both tag types come from the same taxonomy.

When entity tags and topic tags overlap, which should I use?

The same subject (for example, a specific country or company) may be identified by entity tags and/or topic tags in the Intelligent Tagging output. Your use-case determines whether the entity tags or the topic tags or both are of interest.

Entity tags can be used to:

Filter content according to whether or not a specific entity is mentioned.

For example, you could use the entity tags to retrieve all documents that mention a particular company or country.

Highlight the entity instances in the narrative text (markup highlighting).

Detect all mentions of an entity.

Topic tags can be used to:

Filter documents according to whether or not they are about a specific topic.

For example, you could use the topic tags to retrieve all the documents that are about a particular company or country.

So depending on your use case, you might use one or both metadata tag types, and it is important to note that there may be overlap.

What is the difference between a Social Tag and a Topic Tag?

Both Social Tags and Topic Tags describe what the text is about as a whole. However, the reference lists are different. For Topic Tags, the reference list of topics is drawn from the RCS (Refinitiv Classification Services) taxonomy and the International Press Telecommunications Council (IPTC) news taxonomy. Social Tags are derived from a recent listing of the entire Wikipedia English Language page database – the Wikipedia Folksonomy. The Wikipedia Folksonomy includes more current-event topics such as “Trump – Russia dossier,” while the RCS taxonomy includes more rigid topics such as “Mergers & Acquisitions.”

Which tags best support my use case: Topic tags, Entity tags, or SocialTags?

It depends on your use case.

Entity tags can be used to:

Filter content according to whether or not a specific entity is mentioned.

For example, you can use entity tags to retrieve all documents that mention a particular company or country.

Highlight the entity instances in the narrative text (markup highlighting).

Detect all mentions of an entity.

Topic tags:

Topic tags can be used to filter documents according to whether or not they are about a specific topic.

For example, you could use the topic tags to retrieve all the documents that are about a particular company or country.

The list of possible topics is drawn from the RCS (Refinitiv Classification Services) taxonomy and/or the International Press Telecommunications Council (IPTC) news taxonomy.

SocialTags:

The SocialTag function does not identify individual items within the text, but rather attempts to provide common sense tags, derived from Wikipedia categories or articles, for the piece of content as a whole. So, you might use SocialTags as secondary filters after applying a query, to see if you can narrow down the result set.

Social Tags are derived from a recent listing of the entire Wikipedia English Language page database (the Wikipedia Folksonomy). The Wikipedia Folksonomy includes more current-event topics such as “Trump – Russia dossier,” while the RCS taxonomy includes more rigid topics such as “Mergers & Acquisitions.”

So depending on your use case, you might use one or all metadata tag types, and it is important to note that there may be overlap.

For more in-depth information, please see the API User Guide.

What is Slugline tagging?

A slugline is a tag that describes a news story. For example: USA-KENYA-TRUMP, CHINA-BANKS/CCB-RESULTS. A single slugline is assigned by an editor to each Reuters news article as part of the publishing process.

Intelligent Tagging classifies documents using Reuters sluglines, providing another way to consistently classify news documents across multiple sources.

The reference list of sluglines is comprised of the slugs assigned to Reuters news stories in the past three months (a rolling window of the last three months). The reference list is updated on a daily basis.

Slugline Tagging is a premium feature.

How do I enable Slugline tagging?

To enable slugline classification, use the x-calais-selectiveTags header to pass the value "slugline".

What is the difference between a Topic Tag and a Slugline Tag?

Both DocCat (topic) tags and Slugline tags are used to describe what the document is about as a whole. However, the reference lists are different. For Slugline Tagging, the reference list is comprised of the slugs assigned to Reuters news stories in the past three months (a rolling window of the past three months). The reference list is updated on a daily basis.

For Topic Tagging, the reference list of topics is drawn from the RCS (Refinitiv Classification Services) taxonomy, and the International Press Telecommunications Council (IPTC) news taxonomy. These are well-defined, established, and more static taxonomies that do not frequently change.

What is the RCS Taxonomy?

The Intelligent Tagging classification engine supports a few hundred topics from the RCS (Refinitiv Classification Services) taxonomy.

Some of the RCS topic classifiers are optimized for tagging news stories or research reports.

The list of supported RCS topics can be downloaded by premium users from MyRefinitiv.

Topic Tagging - What is the Confidence Score?

The Topic tag confidence score is a value from 0 to 1. The value indicates the probability that the topic is indeed discussed in the text and also how centric the topic is to the text. The higher the value, the higher the probability.

The consuming application can use this score to achieve higher accuracy results by ignoring instances with scores below a specified level.

Which Intelligent Tagging metadata concepts are continuously enhanced?

Intelligent Tagging supports an extensive set of (approximately 100) metadata types. We actively focus development efforts on enhancing the focused list of concepts that are the most important to our customers.

Priority 1

The following metadata concepts are actively enhanced and tuned:

Entities: Company, Country, Currency, CurrencyPair, MarketIndex, Person

Classification: Topic, SocialTag, Slugline

Priority 2

The following metadata concepts may be tuned on demand, but not too often:

Entities: City, PharmaceuticalDrug

Relations: Acquisition, Alliance, Bankruptcy, Buybacks, CompanyAffiliates, CompanyEarningsAnnouncement, CompanyLayoffs, CreditRating, Deal, Dividend, IPO, JointVenture, Merger, NaturalDisaster, PersonCareer

Priority 3

The rest of the metadata concepts are experimental and are not maintained (P/R issues are not fixed).

The complete list of Entity and Relation types is available in the API User Guide.

Are RCS classification topics continually enhanced?

This depends on the topic type. The RCS topics that are optimized for News or Research Reports are enhanced, fixed, and republished by customers with the Self Service Classification tool.

The legacy RCS topics are not actively maintained. But they are gradually being replaced by optimized topics that will then be maintained as described above.

The complete list of supported RCS topics can be downloaded by premium users from MyRefinitiv.

How do I submit PDF documents for best results?

Use the following header values to optimize tagging for PDF documents (in addition to the other mandatory headers):

Content-Type: application/pdf

This header value is mandatory.

x-calais-contentClass: research

Use of this header is highly recommended for best quality output when the input files are research reports in PDF format.

x-calais-pdftagzone: true

Use of this header is optional. It extends tagging to tables in PDF documents. By default, the tagging mechanism does not parse tables. This header is supported only when the input files are in PDF format and x-calais-contentClass=research.
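
As a rough illustration of how these header values fit together, the sketch below builds the PDF-specific part of a request header set. The function name is an assumption, and the other mandatory headers (which are not listed in this answer) are deliberately left out and must be added separately.

```python
def build_pdf_headers(research=True, tag_tables=False):
    """Return the PDF-specific request headers described above.

    Other mandatory headers are NOT included here and must be merged in
    by the caller.
    """
    headers = {"Content-Type": "application/pdf"}  # mandatory for PDF input
    if research:
        headers["x-calais-contentClass"] = "research"
    if tag_tables:
        # Table tagging is supported only for PDF input with contentClass=research.
        if not research:
            raise ValueError(
                "x-calais-pdftagzone requires x-calais-contentClass=research"
            )
        headers["x-calais-pdftagzone"] = "true"
    return headers
```

Calling build_pdf_headers(tag_tables=True) yields all three headers; asking for table tagging without the research content class raises an error instead of silently sending an unsupported combination.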

How do I tag Research Reports?

To tag Research Reports, use the following headers (in addition to the mandatory headers):

x-calais-contentClass: research

x-calais-source (Specify a source to trigger optimized extraction.)

If the research reports are in PDF format, also use the following headers for the best quality tagging output:

Content-Type: application/pdf

x-calais-pdftagzone: true

What is the best way to submit news articles?

News articles should be submitted with the x-calais-contentClass header set to news, in addition to the mandatory headers.

This is currently the default setting; however, we highly recommend setting the header explicitly and specifying the proper content class to obtain the best results. In addition, for news articles in text format, we highly recommend wrapping the input text in the XML tags used by Intelligent Tagging (<Title></Title> and <Body></Body>), making sure to include a <Title></Title> tag, and then submitting the file for processing.
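
The wrapping step can be sketched as follows. This is an assumption-laden example: the function name is hypothetical, and whether an enclosing root element is required, and which Content-Type to send with the wrapped text, are not stated in this answer and should be checked in the API User Guide.

```python
from xml.sax.saxutils import escape


def wrap_news(title, body):
    """Wrap a plain-text news article in <Title> and <Body> tags.

    Special characters (&, <, >) are escaped so the article text cannot be
    mistaken for markup.
    """
    return f"<Title>{escape(title)}</Title>\n<Body>{escape(body)}</Body>"
```

The result is sent together with x-calais-contentClass set to news and the mandatory headers.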

What is the best way to submit emails?

To effectively submit email content:

Remove any encoding from the content before submission. For optimal performance, we also highly recommend removing e-mail header residue, such as e-mail addresses surrounded by less-than/greater-than signs.

Define the following request headers (in addition to the other mandatory headers):

- Content-Type: text/raw
- x-calais-contentClass: news

Please note: Intelligent Tagging is trained and optimized to process content classes such as news stories, research documents, filings, etc. Intelligent Tagging is less familiar with this type of input. Intelligent Tagging will process the text but the quality of the output is not guaranteed to be of the same quality as for the known and supported content classes.
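
One way to approach the cleaning step is a small pre-processing function that removes e-mail addresses wrapped in angle brackets. This is a minimal sketch under stated assumptions: the function name and the regular expression are illustrative, and real e-mail content usually needs further cleanup (for example, decoding encoded bodies) that is not shown here.

```python
import re

# Matches an address-like token enclosed in angle brackets, e.g. <jane@example.com>.
ANGLE_ADDRESS = re.compile(r"<[^<>\s@]+@[^<>\s@]+>")


def strip_bracketed_addresses(text):
    """Remove <address@domain> tokens and tidy the spaces they leave behind."""
    cleaned = ANGLE_ADDRESS.sub("", text)
    # Collapse runs of spaces or tabs created by the removal.
    return re.sub(r"[ \t]{2,}", " ", cleaned).strip()
```

Text that merely contains less-than or greater-than signs, without an address between them, is left unchanged.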

How can I tag a very large corpus of documents?

Write an application that sends files to the Intelligent Tagging RESTful API in an efficient manner.

Browse the Intelligent Tagging Developer Community for sample code that illustrates how to send a request to Intelligent Tagging. In particular, see the Intelligent Tagging Demo Program on the Downloads page and the sample code on the Tutorials page.

For hosted Intelligent Tagging and Open Calais, the processing capacity is defined by your license. When writing the code, you’ll want to maximize capacity to the extent possible, according to your processing quota.

Intelligent Tagging On Premise supports 4 concurrent requests. Design your application to send requests at a rate high enough to keep all 4 concurrent request slots busy, but no higher.
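
A bounded worker pool is one simple way to keep a fixed number of requests in flight. The sketch below is an assumption, not the official sample code: send_document is a hypothetical callable that you supply (it performs the actual HTTP request), and the limit of 4 reflects the On Premise figure above; for hosted deployments, use the limit defined by your license.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_REQUESTS = 4  # On Premise limit quoted above; adjust for your license


def tag_corpus(documents, send_document, max_workers=MAX_CONCURRENT_REQUESTS):
    """Send every document, keeping at most max_workers requests in flight.

    send_document is a caller-supplied function that submits one document
    and returns its tagging output. Results come back in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send_document, documents))
```

Because the pool never runs more than max_workers calls at once, the application uses the available capacity without exceeding it. Error handling and retries are omitted for brevity.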

Which parts of submitted documents does Intelligent Tagging ignore? Do I need to remove anything before submission?

The answer depends on the input format:

text/raw: Intelligent Tagging processes all text submitted in this format. So before submission, remove all the text you do not want Intelligent Tagging to process, such as photo captions, advertisements, and other elements that do not contribute to the narrative context of the story itself.

For emails, we recommend removing any encoding before submission, along with email headers such as e-mail addresses surrounded by greater than/less than signs.

text/html: Intelligent Tagging ignores HTML tags (the tags themselves, not the tag contents), captions, advertisements, and images.

text/xml: Intelligent Tagging ignores everything but the contents of the XML title and body tags.

application/pdf: Intelligent Tagging attempts to identify the narrative text in the document and tag it, and ignores non-narrative text, footers, disclaimers, and images.

What is a PermID URI?

A Refinitiv PermID is a machine-readable, 64-bit number used to create a unique reference to a piece of information. A PermID will never change over time. Behind the scenes, Intelligent Tagging leverages the PermID for tagging organizations, people, and other entities.

You can use the PermIDs exposed in the Intelligent Tagging output to access related data from the high-quality, curated data in the Refinitiv data sets.

We provide no-cost access to PermIDs via the Open PermID initiative, which also includes the Open Calais tagging capability. (Open Calais is the free, limited version of Intelligent Tagging.)

On the Open PermID site (https://permid.org), each PermID is mapped to its own URI, which can be used to extract even more information about your tagged entities.

Browse the Open PermID site to see how PermID URIs can enhance your tagging experience.

What is the PermID attribute in the tagging output?

The PermID attribute is output by multiple Intelligent Tagging metadata types. A PermID is a Refinitiv unique ID which is applied consistently across all documents processed by Intelligent Tagging.

The PermID value identifies an entity type (for example, the Organization entity), or a specific entity (for example, Bank of Japan), depending on the metadata type.

PermID in Entity Markup Tags

In entity markup tags, the PermID attribute value identifies the entity type.

For example, the PermID attribute value of the em/e/Organization tag is the unique ID of the entity type (organization), and not of the organization itself (Bank of Japan).

In fact, every em/e/Organization (Organization entity) tag exposes the same PermID attribute value—the Refinitiv unique ID (PermID) for the entity type, “Organization.” This enables linking all extractions of a particular metadata type, and can be used to build a knowledge graph.

The following Entity Markup tags output the PermID attribute value that identifies the entity type:

em/e/MarketIndex

em/e/Organization

em/e/Person

em/r/CompanyEarningsAnnouncement

em/r/Deal

em/r/IPO

PermID in Entity Resolution and Industry Tags

In entity resolution (disambiguation) tags and Industry tags, the PermID attribute value identifies a specific entity or industry.

For example, the PermID attribute value in the er/Organization (Organization Resolution) tag is the Refinitiv unique ID for Bank of Japan.

The ID can be used to extract information about the entity or industry from the Refinitiv data set. The ID also supports linkage across documents processed by Intelligent Tagging.

The following resolution (disambiguation) tags output the PermID attribute value that identifies the resolved entity (i.e. the unique ID of a specific company, city, deal, person, etc.):

er/Company

er/Continent

er/Deal

er/geo/City

er/geo/Country

er/Organization

er/geo/ProvinceOrState

er/TopmostPublicParentCompany

What is the ForEndUserDisplay Attribute?

The ForEndUserDisplay attribute is present in Topic (DocCat), Entity (em/e), Relation (em/r), SocialTag, and Slugline tags. The attribute value (true or false) is a recommendation of whether the tag is suitable as a search item for a specific document (true) or whether the metadata is primarily of use for aggregation and analytics on large quantities of documents (false).

Some metadata types have a set forenduserdisplay value, while for other metadata types, the forenduserdisplay value is dynamic:

DocCat (Topic) tags – RCS topics optimized with (DIY) Self Service Classification – the forenduserdisplay value is always true.

SocialTags – the forenduserdisplay value is always true.

Slugline tags – the forenduserdisplay value is always true.

Entities and Relations - Some entity and relation metadata types have set forenduserdisplay values, while other entity and relation metadata types have dynamic forenduserdisplay values, determined per instance.

For example:

Every instance of the em/r/Dividend tag (the Dividend relation tag) defines forenduserdisplay as true.

Each instance of the em/e/Company tag is assigned a forenduserdisplay value based on a confidence score calculated for that specific company tag during extraction.

See ForEndUserDisplay attribute in the guide, Intelligent Tagging Semantic Metadata Tags for further information.

Generally speaking, the forenduserdisplay=true status is assigned to metadata types that consistently provide high precision (>80%) results based on our tests.

Continue reading for more detailed information and best practice recommendations for each metadata type.

Best Practice Recommendations

If your use case requires high precision, you can make use of the ForEndUserDisplay attribute to filter the tagging output according to your needs.

The ForEndUserDisplay value is not relevant to the following types of use cases (that is, there is no need to filter out tags with the ForEndUserDisplay=false value):

Aggregation/graph use cases, or use cases where you can gain confidence by processing multiple examples.

For example, looking for trends over time.

Use cases where you may not want to lose hints/indications and can afford a higher rate of precision errors.

For example, monitoring a specific company, industry or country where you don’t want to miss anything.

ForEndUserDisplay attribute in DocCat (Topic) Tags

Topics optimized with (DIY) Self Service Classification are assigned the forEndUserDisplay=true value.

The DocCat tag also assigns a confidence score on a scale of 0 to 1. The value indicates the probability that the topic is indeed discussed in the text and is centric to the text. The higher the value, the higher the probability.

For DocCat (topic) tags, we suggest using the forEndUserDisplay value together with the score value to filter the tagging output to suit your use case.

For example, if your use case requires a high level of precision, we suggest the following:

If the forEndUserDisplay=true and the score is greater than 0.5, use the tag.

If the forEndUserDisplay=false and the score is greater than 0.9, use the tag.

For a semi-automated editorial workflow, if the forEndUserDisplay=false and the score is less than 0.9, we suggest marking the tags as maybe being about the topic.

For a fully automated workflow, we suggest ignoring tags with the forEndUserDisplay=false and a score of less than 0.9.
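
The suggested rules for a high-precision use case can be written as a small decision function. This is a sketch under assumptions: the function name and the return labels are illustrative, and two cases the rules above do not cover (forEndUserDisplay=true with a score of 0.5 or less, and forEndUserDisplay=false with a score of exactly 0.9) are handled conservatively here.

```python
def decide_topic_tag(for_end_user_display, score, semi_automated=False):
    """Return "use", "maybe", or "ignore" for a DocCat (topic) tag."""
    if for_end_user_display:
        # Documented rule: true and score > 0.5 -> use the tag.
        # Assumption: lower scores are ignored (not covered by the rules above).
        return "use" if score > 0.5 else "ignore"
    if score > 0.9:
        # Documented rule: false and score > 0.9 -> use the tag.
        return "use"
    # Documented rule: false and a lower score -> "maybe" in a semi-automated
    # editorial workflow, ignored in a fully automated one.
    # Assumption: a score of exactly 0.9 is treated the same way.
    return "maybe" if semi_automated else "ignore"
```

For example, a tag with forEndUserDisplay=false and a score of 0.7 is marked "maybe" for editorial review when semi_automated is True, and dropped otherwise.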

ForEndUserDisplay attribute in Social Tags

The forEndUserDisplay value is always true for social tags.

ForEndUserDisplay attribute in Entity and Relation Tags

For entities and relations, the forEndUserDisplay value is determined as follows:

em/e/Company, em/e/Person, em/e/PharmaceuticalDrug, em/r/Bankruptcy, em/r/Deal, em/r/IPO: forenduserdisplay value is defined per tag instance, based on the confidencelevel value. For example, Intelligent Tagging determines and assigns the appropriate forEndUserDisplay value to each instance of the em/e/Company tag in the tagging output.

All other entities and relations: Each metadata tag type has a set value.

The list of values is found in the guide Intelligent Tagging Semantic Metadata Tags, under ForEndUserDisplay attribute in Entity and Relation Tags.

For example, every em/r/Dividend tag is assigned the forenduserdisplay=true value.

As already mentioned, the forenduserdisplay=true status is assigned to metadata types that consistently provide high precision (>80%) results.

ForEndUserDisplay attribute in Slugline Tags

The forenduserdisplay value is always true for Slugline tags.

(The Slugline tag is a premium metadata type.)

Recommendations
If your use case requires a high level of precision, we suggest using the forenduserdisplay value to filter the tagging output as follows:

forenduserdisplay=true: Use all the tags that define forenduserdisplay=true.

forenduserdisplay=false: For a semi-automated editorial workflow, we suggest marking these tags as maybe being about the entity or relation; for a fully automated workflow, we suggest ignoring these tags.
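
These recommendations amount to splitting the tags into two groups. In the sketch below, each tag is represented as a plain dictionary with a forenduserdisplay key; that layout is a simplified stand-in for the real tagging output, and the function names are illustrative.

```python
def is_true(value):
    """Interpret a flag that may arrive as a boolean or as the text "true"/"false"."""
    return str(value).strip().lower() == "true"


def split_by_display_flag(tags):
    """Return (keep, maybe): tags to use as-is, and tags to review or ignore."""
    keep, maybe = [], []
    for tag in tags:
        target = keep if is_true(tag.get("forenduserdisplay")) else maybe
        target.append(tag)
    return keep, maybe
```

A semi-automated editorial workflow would route the second list to a reviewer, while a fully automated workflow would discard it.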

What are RICs?

Reuters Instrument Codes (RICs) are unique identifiers for financial instruments and indices. A RIC is made up primarily of the security's ticker symbol, followed by a period and an exchange code based on the name of the stock exchange using that ticker. For instance, IBM.N is a valid RIC, referring to IBM being traded on the New York Stock Exchange. IBM.L refers to the same stock trading on the London Stock Exchange. The exchange code used in the RIC is proprietary to Thomson Reuters.
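
The ticker-period-exchange structure described above can be illustrated with a short helper. The function name is hypothetical, and the helper deliberately rejects any RIC that does not follow this basic pattern rather than guessing at its meaning.

```python
def split_ric(ric):
    """Split a RIC of the form TICKER.EXCHANGE into its two parts."""
    # Split on the last period so the exchange code is whatever follows it.
    ticker, separator, exchange = ric.rpartition(".")
    if not separator or not ticker or not exchange:
        raise ValueError(f"not in ticker.exchange form: {ric!r}")
    return ticker, exchange
```

For instance, split_ric("IBM.N") returns ("IBM", "N") and split_ric("IBM.L") returns ("IBM", "L").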

Does Intelligent Tagging tag RICs in text?

Company extraction based on a primary ticker symbol or any RIC mention in the text is an Intelligent Tagging capability that was released in 2018. The “recognizedas” attribute of the em/e/Company (company extraction) tag indicates if a company extraction was based on a ticker, RIC, or company name mention in the text. You can use this attribute value to filter your output. (Please note that Ticker extraction is a premium feature.)

Does Intelligent Tagging tag tickers in text? Should I activate this feature?

Company extraction based on a ticker symbol appearing in the text is an Intelligent Tagging capability that was released in 2018. This feature is available to premium users, and is enabled with the x-calais-EnableTickerExtraction header.

Please note that Intelligent Tagging identifies and extracts primary tickers mentioned in the text. For example, “Falabella,” the primary ticker of the company SACI Falabella, is identified as a company mention, whereas “FALABE-OSA,” a non-primary ticker, is not extracted.

The recommendation of whether or not to use ticker extraction depends on your use case.

For example, if your content primarily mentions companies by name rather than by ticker symbol, then ticker extraction may cause noise that you might want to avoid.

If your content contains a lot of abbreviations, we recommend performing content quality tests before enabling ticker extraction in a production environment, to ensure that Intelligent Tagging is accurately identifying which of the abbreviations are ticker mentions.

As of June 2019, you can trigger recall-oriented ticker extraction optimized for research content such as analyst emails. This feature is available to premium users, and is enabled with the x-calais-EnableTickerRecallOriented header. (This feature is not optimized for documents like long research reports.)

To trigger this workflow, define the x-calais-EnableTickerRecallOriented header in addition to the x-calais-EnableTickerExtraction header. Both headers are mandatory for this workflow. If you define the x-calais-EnableTickerRecallOriented header but don’t define the x-calais-EnableTickerExtraction header, there will be no ticker extractions (company extractions based on ticker mentions in the text) at all. When both headers are defined, mentions of both primary and non-primary tickers in the text are identified as companies.

Please remember to enable company tagging and to define all mandatory headers.
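
Because the recall-oriented header has no effect on its own, it can help to build both ticker headers in one place. The following sketch makes two assumptions that should be verified against the API User Guide: that the headers take the value "true", and the exact spelling of the recall-oriented header name. The function name is illustrative, and the mandatory headers are not included.

```python
def ticker_headers(recall_oriented=False):
    """Return the ticker-extraction headers; merge them with the mandatory headers."""
    # Assumption: the headers are switched on with the value "true".
    headers = {"x-calais-EnableTickerExtraction": "true"}
    if recall_oriented:
        # Only ever set together with the base ticker-extraction header.
        headers["x-calais-EnableTickerRecallOriented"] = "true"
    return headers
```

Building the headers this way makes it impossible to request the recall-oriented workflow without also enabling ticker extraction.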

In company tagging, how are RICs related to PermIDs?

PermIDs and RICs are two different unique IDs that can be used to unambiguously identify a company. The er/company (company disambiguation) tag exposes both the PermID (permid attribute) and the RIC (primaryric attribute).

What is the Company Entity resolved to?

Intelligent Tagging extracts companies and maps each of them to the corresponding company in the Refinitiv Organization Authority, a universe of over 5 million organizations. This mapping is exposed in the er/company (company disambiguation) tag. The company is identified by name and by PermID.

How do I use the Company Relevance score?

For the highest accuracy, we recommend ranking companies based on the relevance score and not on the relevancecont score. A score of 0.2 should be considered the same as a score of 0.5.

If your use case requires high precision, you can filter for High Relevance company tags only.

If you are monitoring a specific company and do not want to miss any hints/indications, you should not filter company tags by relevance score.
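
The ranking advice above can be sketched as follows. Each company is represented as a dictionary with a relevance key, which is a simplified stand-in for the real tagging output; the function names are illustrative.

```python
def rank_companies(companies):
    """Sort companies by relevance, highest first, treating 0.2 the same as 0.5."""
    def ranking_key(company):
        relevance = company["relevance"]
        return 0.5 if relevance == 0.2 else relevance

    # sorted() is stable, so companies with equal keys keep their input order.
    return sorted(companies, key=ranking_key, reverse=True)


def high_relevance_only(companies):
    """Keep only companies scored 0.8 or above (high-precision use cases)."""
    return [company for company in companies if company["relevance"] >= 0.8]
```

When monitoring a specific company, skip high_relevance_only and work with the full ranked list instead.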

How can I use the Relevance and Confidence scores to filter the tagging output?

Relevance Score: Intelligent Tagging uses the Relevance score to indicate how relevant an extracted entity is to the text as a whole.

Confidence Score: The Confidence score indicates how confident Intelligent Tagging is that the entity type is accurate. For example, for a company extraction, the Confidence score indicates how confident Intelligent Tagging is that the extracted company is indeed a company.

Customers must determine their own tolerance for noise. Do they want to pay attention only to companies that are highly relevant (0.8) and have a confidence of 0.8 and above?

Or are they able to tolerate more noise and include a confidence of 0.6 and above? This is up to customers to determine, based on their individual use case.

The consuming application can achieve higher accuracy results by ignoring instances with confidence and relevance scores below a specified level. However, raising the threshold boosts Precision at the expense of Recall.

So if your use case requires high precision, you can filter accordingly. But if, for example, you are monitoring a specific company and do not want to miss any hints/indications, you would define a lower threshold, or none at all.
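
A threshold filter of this kind might look like the sketch below. The dictionary keys (relevance and confidencelevel) are a simplified stand-in for the real tagging output, the function name is illustrative, and the default thresholds of zero mean that nothing is filtered unless you ask for it.

```python
def filter_entities(entities, min_relevance=0.0, min_confidence=0.0):
    """Keep entities whose relevance and confidence both meet the thresholds.

    Higher thresholds favor precision; lower thresholds favor recall.
    """
    return [
        entity
        for entity in entities
        if entity["relevance"] >= min_relevance
        and entity["confidencelevel"] >= min_confidence
    ]
```

A high-precision consumer might call filter_entities(entities, min_relevance=0.8, min_confidence=0.8), while a monitoring use case would keep the defaults and review everything.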

What is the reference list of companies and organizations that Intelligent Tagging resolves to?

Intelligent Tagging maps extracted companies and organizations to the corresponding company/organization and unique ID in the Refinitiv Organization Authority (OA). The OA is a data set that contains around 5 million public and private companies and organizations, and is constantly updated.

What is Zero Relevance and how can I use it to improve Company Tagging?

In company tagging, the relevance score indicates how centric the extracted company is to the story. The relevance score “0” is assigned to a company that is mentioned in the text but that is irrelevant to the story.

For example, mentions of companies such as rating agencies, reporting agencies, stock exchanges, and social applications are usually not relevant to the story.

You can filter company tags by relevance score, according to your use case.

We recommend using the x-calais-source header, used to optimize extraction from a number of different sources. This header is also used by Intelligent Tagging to better calculate the relevance score if the source of the news document (for example, Reuters) or of the research report (for example, Morgan Stanley) is mentioned in the text.

In this case, the relevance score is usually affected, and likely to be “0” because generally speaking, the source company is less likely to be relevant to the story itself. The relevance score assigned to related companies (such as subsidiaries of the source company) may be affected as well.

What edge does company tagging provide users in the financial domain?

Electronic discovery is presently challenged by huge volumes of unstructured data. Discovery teams are overwhelmed by content. Keyword searches are used to discover specific phrases or documents within gigabytes of documents, files, and emails. There is increasing pressure to speed up the discovery process and reduce overall discovery costs. This is where entity recognition can help.

Entity recognition can significantly speed up the rate of discovery. In this context, ‘entities’ are predefined concepts such as companies, organizations, places, people, and products.

The Intelligent Tagging API includes Company Tagging, an entity-recognition capability which identifies and tags mentions of companies in the text.

Potential benefits of company tagging for the financial domain:

Enables more focused discovery when looking for information relevant to a specific company or companies.

Enables precise and targeted searches for specific companies.

Makes it easy to identify companies and the relationships between companies and also between companies and people.

Reduces the time required to comb through the unstructured data and analyze the content.

Slashes the time and effort required to quickly target relevant documents or form a review.

What's the max document size I can submit?

The maximum allowed document size is defined by the Intelligent Tagging package and your license.

When a document that exceeds the maximum size limit is submitted, it is not processed by Intelligent Tagging, and the relevant error message is generated.

However, in addition to the size limit, there is a timeout threshold of 180 seconds (3 minutes). If the input document is too complex (contains too many entities and relations) to be processed within the defined time limit, it is considered “too large” to process and a timeout error is generated. An example is a 1MB text file that contains 1000 entities.

First, try resubmitting the document (maximum three retries) with a sleep of 750 milliseconds between resubmissions. If the resubmission does not work, try splitting the document into smaller parts for processing.

The license defines the following maximum document sizes:

Free Open Calais: HTML - 100KB; XML - 100KB; Raw text - 100KB; PDF - Not supported

Hosted Intelligent Tagging: HTML - 500KB; XML - 500KB; Raw text - 500KB; PDF - 500KB

Intelligent Tagging On Premise: HTML - 45MB; XML - 1.5MB; Raw text - 1.5MB; PDF - 45MB
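A client can check a document against the size limits listed above before submitting it. The sketch below is illustrative only: the lookup table, the function name, and the assumption that 1KB means 1024 bytes are not taken from the product documentation, and the grouping of values by license follows the order in which they are listed here.

```python
KB = 1024        # assumption: the limits are expressed in binary kilobytes
MB = 1024 * KB

# Maximum document size in bytes per license and input format.
# None means the format is not supported for that license.
MAX_SIZE = {
    "Free Open Calais": {
        "HTML": 100 * KB, "XML": 100 * KB, "Raw text": 100 * KB, "PDF": None,
    },
    "Hosted Intelligent Tagging": {
        "HTML": 500 * KB, "XML": 500 * KB, "Raw text": 500 * KB, "PDF": 500 * KB,
    },
    "Intelligent Tagging On Premise": {
        "HTML": 45 * MB, "XML": int(1.5 * MB), "Raw text": int(1.5 * MB), "PDF": 45 * MB,
    },
}


def fits_size_limit(license_name, input_format, size_bytes):
    """Return True if a document of size_bytes may be submitted."""
    limit = MAX_SIZE[license_name][input_format]
    return limit is not None and size_bytes <= limit
```

For example, fits_size_limit("Hosted Intelligent Tagging", "PDF", 300 * KB) returns True, while any PDF is rejected for "Free Open Calais". Passing the size check does not rule out a timeout on a very dense document.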

What is the smallest document size I can submit?

Intelligent Tagging will accept a document for processing as small as one character. However, while there is no minimum number of characters rule, Intelligent Tagging is optimized to operate on narrative text documents, and may be unable to accurately tag submitted content that is too short. A longer document provides more context for the tagging engine and provides better results.

Is topic tagging (classification) supported for really short documents?

Topic classifiers optimized for News Stories or Research Reports can be used to classify documents as short as one sentence.

The legacy topic classifiers require a minimum context of 500 characters.

How does RCS topic classification handle long documents? Does it apply topic tags based on just part of the document or on the entire document?

RCS topic tags are assigned based on the entire document with one exception:

Some of the optimized-for-news topic classifiers “read” only the first 1000 characters of the document to determine whether or not the document is about the topic.

All the other topic classifiers (default, optimized for news, and optimized for research) assign tags based on the entire document.

The list of all supported RCS topics can be downloaded by premium users from MyRefinitiv. The topics which are assigned based on the first 1000 characters are indicated in the list.

If I concatenate multiple stories to reduce the number of submissions, will that affect the quality of tagging output?

This approach has advantages and disadvantages.

Advantages:

If you are processing over 100,000 very short articles, concatenation could allow you to stretch your daily quota (hosted users).

If the individual articles are indeed very short, tagging output may be improved, because Intelligent Tagging operates better on submissions containing more narrative text.

Disadvantages:

Concatenation of different articles distorts the context. Thus, for example, company tagging quality might be reduced because the relevance score and resolution rely on the context.

A single document that concatenates multiple stories about different subjects will not return accurate Topic tags for all of the subjects in the document.

What is the Content Type?

The Content-Type header is a mandatory header, used to indicate the format of the input document. The content type value (text/html, text/xml, application/pdf, or text/raw) determines exactly how Intelligent Tagging processes the input document prior to tagging (zoning, conversion, cleaning etc.). This in turn affects the tagging output. So it is important to specify the correct content type.

For example, if text with HTML or proprietary XML tags is submitted, setting the Content-Type value properly allows the Intelligent Tagging engine to do the best it can to strip those tags, processing only the narrative text, and ignoring any markup.

PDF files must be submitted as PDF, or a content type error is returned.

Plain text files can be submitted as "text/raw," but tagging results may be better if you first wrap the input text in Intelligent Tagging XML tags (make sure to use a <Title></Title> tag), and then submit the file for processing.
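One way to avoid sending the wrong Content-Type is to derive it from the input file. The sketch below uses the four content type values named above; the mapping from file extensions to those values and the function name are assumptions made for illustration, so adjust them to how your documents are actually stored.

```python
from pathlib import Path

# Assumed mapping from file extension to the content type values listed above.
CONTENT_TYPES = {
    ".html": "text/html",
    ".htm": "text/html",
    ".xml": "text/xml",
    ".pdf": "application/pdf",
}


def content_type_for(filename):
    """Pick a Content-Type value from the file extension.

    Anything unrecognized is treated as plain text ("text/raw").
    """
    suffix = Path(filename).suffix.lower()
    return CONTENT_TYPES.get(suffix, "text/raw")


# Header dictionary ready to pass to whichever HTTP client you use.
headers = {"Content-Type": content_type_for("report.pdf")}
```

An extension is only a hint. If a ".txt" file actually contains HTML markup, submitting it as "text/html" is likely to give cleaner input to the tagging engine.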

Where can I find a list of error messages that can be returned by Intelligent Tagging?

The list of error messages that can be returned by Intelligent Tagging is found in the API User Guide.

What do I do if I receive an error message?

Generally speaking, many error messages returned by Intelligent Tagging are specific to the state of the system at the moment the request was sent, and may not occur again if the request is re-submitted. So our recommendation is to respond to most errors by waiting 750 milliseconds, then resubmitting the request that returned an error.

What errors cannot be resolved by re-submitting the request?

Errors such as timeouts, server busy, and server unavailable may be resolved by resubmitting the request. But errors such as an incorrect token or an exceeded daily quota would not be resolved by resubmitting the same request.

If you are a licensed Intelligent Tagging user with a technical issue, please contact our Technical Support team at: https://my.refinitiv.com.

What do I do if I get HTTP Error 500 (request timeout)?

This error may be recoverable, if the timeout was due to heavy utilization of the system.

Try resubmitting the document (maximum three retries) with a sleep of 750 milliseconds between resubmissions.

If the resubmission does not work, it may be that the input document is too complex (contains too many entities and relations) to be processed within the defined time limit. In this case, you can try splitting the document into smaller parts for processing.
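The resubmission advice above (at most three retries, 750 milliseconds apart) can be sketched as a small helper. The helper takes the submit function as a parameter, so it does not assume any particular HTTP client. RetryableError and fake_submit are hypothetical names introduced for this example; in real code you would raise RetryableError from your own submit function when the service reports a timeout or a busy server.

```python
import time


class RetryableError(Exception):
    """Hypothetical marker for errors worth resubmitting (timeout, server busy)."""


def submit_with_retries(submit, document, max_retries=3, delay_seconds=0.75,
                        sleep=time.sleep):
    """Call submit(document), retrying after RetryableError up to max_retries times."""
    for attempt in range(max_retries + 1):  # first try plus the retries
        try:
            return submit(document)
        except RetryableError:
            if attempt == max_retries:
                raise  # retries exhausted; the caller may split the document
            sleep(delay_seconds)


# Demo with a stand-in submit function that times out twice, then succeeds.
_failures_left = [2]


def fake_submit(document):
    if _failures_left[0] > 0:
        _failures_left[0] -= 1
        raise RetryableError("timeout")
    return f"tagged: {document}"


pauses = []  # records the requested delays instead of really sleeping
result = submit_with_retries(fake_submit, "some text", sleep=pauses.append)
```

Other exceptions, such as an incorrect token, pass straight through without a retry, because resubmitting the same request would not resolve them.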

I have noticed specific problems with the tagging output. What should I do?

Premium Intelligent Tagging customers (Hosted or On Premise): Presales/implementation customers can contact their assigned Refinitiv Presales Engineer for assistance. Production customers should contact us at https://my.refinitiv.com.

Free Open Calais customers: Please “Ask a question” on the Developer Portal.

I am submitting a large document which processes for three minutes but consistently returns a timeout error. What is the problem?

Intelligent Tagging has a three-minute maximum processing time for documents. A document that times out repeatedly might be too large or complex to be processed. For example, a large document (hundreds of kilobytes of text) containing hundreds or thousands of entity mentions might take Intelligent Tagging longer than three minutes to process. If this is the case, the only option is to split the file into smaller portions for processing.
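A simple way to split a plain-text document is to cut at paragraph boundaries, so that each part stays readable on its own. The sketch below is one possible approach, not a method prescribed by Intelligent Tagging. The function name, the use of blank lines as paragraph separators, and the character-based limit are all assumptions.

```python
def split_document(text, max_chars):
    """Split text into parts of at most max_chars characters.

    Parts are cut at blank lines where possible. A single paragraph that is
    longer than max_chars is cut into fixed-size slices.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")

    parts, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars:
            current = candidate  # the paragraph still fits in the current part
            continue
        if current:
            parts.append(current)
        while len(paragraph) > max_chars:
            parts.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        current = paragraph
    if current:
        parts.append(current)
    return parts


parts = split_document("aaaa\n\nbbbb\n\ncccccccccccc", max_chars=10)
```

Each part is then submitted as a separate request. Because each part is tagged without the rest of the text, scores that depend on document context may differ from what the whole document would have produced.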

I cannot connect to the Intelligent Tagging Hosted server. What should I do?

First, check the state of the client application and network connectivity. The Intelligent Tagging hosted platform has proven to be extremely resilient, fault tolerant, and stable. It is rarely inaccessible to customers. If you have verified the state of the client application and network connectivity, and the connectivity issue persists, please contact us at https://my.refinitiv.com.

I am receiving HTTP 429 errors in response to Intelligent Tagging requests. Why?

An HTTP 429 error is generated when the daily request quota or the per-second request quota is exceeded.

The error message indicates which quota is exceeded:

“You exceeded the allowed quota of <#> requests per day” indicates a daily request quota error.

“You have exceeded the concurrent request limit for your license key” indicates a per-second request quota error.
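A client can tell the two quota errors apart by inspecting the message text. The sketch below matches on fragments of the two messages quoted above. The function name and return values are made up for this example, and the exact wording of the messages may vary, so treat the matching as a best-effort heuristic.

```python
def classify_quota_error(message):
    """Return "daily", "per-second", or "unknown" for an HTTP 429 message."""
    text = message.lower()
    if "requests per day" in text:
        return "daily"
    if "concurrent request limit" in text:
        return "per-second"
    return "unknown"
```

The distinction matters for the response. A per-second error may clear after a short pause, while a daily quota error will not be resolved by resubmitting the same request.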

An error message says the daily quota has been exceeded. How can I track the number of requests made?

The hosted Intelligent Tagging license allows up to a certain number of document submissions per day. Intelligent Tagging responses include HTTP headers that indicate the daily quota defined by your license, and the number of submissions already made.

x-permid-quota-daily: Indicates the daily quota defined by your license.

x-permid-quota-used: Indicates the number of submissions already made.

Please note that these headers are available to hosted Intelligent Tagging users only. For all other deployment options, these headers are not supported and not relevant.
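For hosted users, the two headers above are enough to compute how many submissions remain for the day. The sketch below assumes that both header values are plain integers sent as text, which the description above does not state explicitly. The function name is hypothetical.

```python
def remaining_daily_quota(headers):
    """Compute the remaining submissions from the two quota headers.

    headers is any mapping of response header names to values. Names are
    compared case-insensitively, as HTTP header names are.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    daily = int(normalized["x-permid-quota-daily"])
    used = int(normalized["x-permid-quota-used"])
    return max(daily - used, 0)


# Hand-written sample headers, not a real response.
sample_headers = {"x-permid-quota-daily": "5000", "x-permid-quota-used": "1200"}
remaining = remaining_daily_quota(sample_headers)
```

A KeyError from this function means the headers are absent, which is expected for any deployment other than hosted.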