3.3.3. Representation 2: Seed Outcomes + list phrases
The features used in representation 2 are exactly the same as in representation 1; however, the annotations differ for every noun phrase that appears in the same list as a seed term. Phrases that appear in the same list as an outcome are also annotated as an outcome. Seed outcome phrases that are not part of a list are also included in this representation, using the same label as in representation 1. Thus the number of phrases marked as an outcome in representation 2 will be greater than or equal to the number in representation 1, because all of the initial seed terms and any phrase or abbreviation from the same list as an outcome will be labeled as an outcome.
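The relabeling rule for representation 2 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `label_representation_2`, the input format (noun phrases plus the lists they were grouped into), exact lower-cased matching against seed terms, and the example data are all assumptions made for this sketch.

```python
def label_representation_2(phrases, lists, seed_terms):
    """Return the set of phrases labeled as outcomes under representation 2.

    phrases    -- all noun phrases in a sentence (hypothetical input format)
    lists      -- list of lists; each inner list holds the phrases of one list
    seed_terms -- seed outcome terms, matched case-insensitively (assumption)
    """
    seeds = {s.lower() for s in seed_terms}
    # Representation 1 labels: only phrases that are seed terms are outcomes.
    outcomes = {p for p in phrases if p.lower() in seeds}
    # Representation 2: every member of a list that contains a seed term
    # inherits the outcome label.
    for members in lists:
        if any(m.lower() in seeds for m in members):
            outcomes.update(members)
    return outcomes


# Hypothetical example data, not taken from the corpus.
phrases = ["patients", "progression-free survival", "toxicity", "response rate"]
lists = [["progression-free survival", "toxicity", "response rate"]]
seeds = {"progression-free survival"}

rep1 = {p for p in phrases if p.lower() in seeds}
rep2 = label_representation_2(phrases, lists, seeds)
```

In this example `rep2` contains the three list members while `rep1` contains only the seed term, which reflects the "greater than or equal to" relationship described above.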
3.3.4. Representation 3: List phrases and features
The order in which an outcome occurs within a list is irrelevant, but as shown in Fig. 1, the order of phrases within a list can dramatically change the features used to represent the target outcome noun phrase. In representation 3, the boundaries of a list are used to establish the terms that are used before and after the outcome phrase. In sentence 1, the outcome phrase progression-free survival is part of a list that starts immediately before candidate polymorphisms and ends immediately after toxicity. Thus, in representation 3, the word features are the same for all four nouns in the list from sentence 1, and all four nouns would be labeled as an outcome (see Table 2).
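The effect of using list boundaries for context can be illustrated with a short sketch. The function names, the token-index representation of a list span, the context window of two tokens, and the example sentence are all assumptions chosen for illustration; the paper's actual feature extraction may differ.

```python
def boundary_context(tokens, start, end, window=2):
    """Return the context tokens just outside the span [start, end].

    start, end -- inclusive token indices of the span
    window     -- number of context tokens taken on each side (assumed value)
    """
    before = tokens[max(0, start - window):start]
    after = tokens[end + 1:end + 1 + window]
    return before, after


# Hypothetical sentence, not taken from the corpus.
tokens = ["we", "assessed", "response", ",", "toxicity", "and",
          "survival", "in", "all", "patients"]
list_span = (2, 6)      # "response , toxicity and survival"
toxicity_span = (4, 4)  # the single phrase "toxicity"

# Context taken around the phrase itself depends on its position in the list.
local = boundary_context(tokens, *toxicity_span)
# Context taken around the whole list is shared by every phrase in the list.
shared = boundary_context(tokens, *list_span)
```

With phrase-level context, `toxicity` is described by its list neighbors; with list-level context, every phrase in the list receives the same before and after words regardless of its position.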
The models were created using the training data. The first evaluation was applied to test sets 1 and 2 and used the initial seed terms as the gold standard; classification performance is reported using the standard metrics of accuracy, precision, recall, and F1 (the harmonic mean of precision and recall) with respect to predicting survivorship. Accurately predicting the outcomes from the initial set of seed terms is just an intermediate step towards building computational models that can identify overall outcomes. Thus, the second evaluation used a gold standard of overall outcomes that was established by manually reviewing every noun phrase from the method section within the set of abstracts.
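For reference, the four metrics can be computed from binary labels as in the sketch below. The function name `classification_metrics`, the boolean label encoding (True for an outcome phrase), and the example labels are assumptions for illustration only; they are not results from the study.

```python
def classification_metrics(gold, predicted):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # true positives
    fp = sum(1 for g, p in pairs if not g and p)      # false positives
    fn = sum(1 for g, p in pairs if g and not p)      # false negatives
    tn = sum(1 for g, p in pairs if not g and not p)  # true negatives

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for five noun phrases.
gold = [True, True, False, False, True]
predicted = [True, False, True, False, True]
metrics = classification_metrics(gold, predicted)
```

Undefined ratios (for example, precision when nothing is predicted positive) are returned as 0.0 here, which is one common convention among several.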
C. Blake and R. Kehm
The manual evaluation strategy does not scale to tens of thousands of abstracts. However, the manual evaluation revealed that the initial set of survivorship seed terms provided an excellent approximation of outcomes: only 3 of the 500 abstracts used a survivorship seed term in the wrong context (i.e. as a method, not an outcome). Thus a similar keyword-based strategy was used to develop a silver standard. Specifically, abbreviations, words, and phrases identified in the manual gold standard or predicted by the machine learning models when trained with the initial survivorship seed terms were used to identify outcome phrases from anywhere in the abstract. For example, response was identified in the manual evaluation, and any phrase containing the word response was included in the silver standard. The precision and recall of the silver standard were calculated with respect to the noun phrases identified in the manual gold standard.
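The keyword-based silver standard and its comparison against the manual gold standard can be sketched as below. The function names, the case-insensitive whole-word matching rule, and the example phrases are assumptions for illustration; the source says only that any phrase containing a keyword such as response was included.

```python
import re


def build_silver_standard(noun_phrases, keywords):
    """Select noun phrases that contain any keyword as a whole word.

    Whole-word, case-insensitive matching is an assumption of this sketch;
    it keeps an abbreviation such as "OS" from matching inside "dose".
    """
    patterns = [re.compile(r"\b" + re.escape(k) + r"\b", re.IGNORECASE)
                for k in keywords]
    return {p for p in noun_phrases
            if any(pat.search(p) for pat in patterns)}


def precision_recall(silver, gold):
    """Precision and recall of the silver set relative to the gold set."""
    tp = len(silver & gold)
    precision = tp / len(silver) if silver else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example data, not taken from the corpus.
noun_phrases = ["objective response", "response rate", "dose",
                "overall survival", "tamoxifen"]
keywords = ["response", "OS"]
gold = {"objective response", "overall survival"}

silver = build_silver_standard(noun_phrases, keywords)
precision, recall = precision_recall(silver, gold)
```

Here the silver standard picks up both phrases containing response, one of which is in the gold set, while overall survival is missed because only its abbreviation was supplied as a keyword.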
The models were evaluated with respect to three different standards. The first evaluation used the initial seed terms as a gold standard, the second used the manually identified overall outcomes as a gold standard, and the third used a silver standard that was derived from the gold standards.
4.1. Initial seed term evaluation
The first evaluation considers the initial set of seed terms that focus on survivorship outcomes and enables us to establish if there are enough terms identified to build a machine learning model that can accurately predict survivorship outcomes. Both the list and machine learning approaches require an initial set of seed terms but, as with many language phenomena, specific outcome phrases are not distributed equally within the collection. The top 10 phrases (overall survival, OS, progression free survival, PFS, survival, disease free survival, DFS, RFS, EFS, and event free survival) accounted for 73% of the initial seed outcome terms in the training set. However, these frequently occurring terms are also identified in the standards related to breast cancer outcomes. Thus, the proposed approach effectively leverages the disproportionate usage of frequent outcomes to identify a long tail of terms that are unlikely to be identified manually. To further illustrate this point, 70% of the distinct phrases in the training and test sets appear only once.
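The two distributional statistics mentioned above (the share of occurrences covered by the most frequent phrases, and the share of distinct phrases that occur only once) can be computed as in the following sketch. The function name `frequency_profile`, the lower-casing of phrases, and the example data are assumptions for illustration and do not reproduce the paper's counts.

```python
from collections import Counter


def frequency_profile(phrases, top_n=10):
    """Return (top_share, singleton_share) for a list of phrase occurrences.

    top_share       -- fraction of all occurrences covered by the top_n phrases
    singleton_share -- fraction of distinct phrases that occur exactly once
    """
    counts = Counter(p.lower() for p in phrases)
    if not counts:
        return 0.0, 0.0
    total = sum(counts.values())
    top_share = sum(c for _, c in counts.most_common(top_n)) / total
    singletons = sum(1 for c in counts.values() if c == 1)
    singleton_share = singletons / len(counts)
    return top_share, singleton_share


# Hypothetical phrase occurrences: two frequent terms and a short tail.
occurrences = ["OS"] * 5 + ["PFS"] * 3 + ["toxicity", "pain"]
top_share, singleton_share = frequency_profile(occurrences, top_n=2)
```

In this toy example the top two phrases cover 8 of 10 occurrences, and 2 of the 4 distinct phrases occur only once, which mirrors the head-heavy, long-tailed pattern described in the text.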