3. Materials and methods
The experiments use a four-step process, shown in Fig. 1, in which the abstracts and initial seed terms are first identified and then used to train models. The models are then evaluated with respect to a manually verified set of outcomes, the initial seed terms, and new outcomes. Prior to running any experiments, a sample of 100 sentences from the method section (the development set), each of which included at least one survivorship term, was manually inspected to identify potential representation strategies for the machine learning models.
A training set was created comprising 1000 abstracts drawn at random from the set of abstracts that (a) included at least 5 sentences and (b) contained a seed survivor term within the method section. The first test set (test set 1) was created with 500 PMIDs using the same criteria as the training set. Both a text mining expert (CB) and a breast cancer expert (RK) manually reviewed the test set to create the manual gold standard. The review process revealed that some survival terms (e.g. survival curve, survival information and survival analysis) were always used in the context of describing a method instead of an outcome, and these were thus removed from the set of initial seed survival terms. The manual evaluation also revealed that systematic reviews and meta-analyses should be excluded. The final manual gold standard of overall outcomes included 453 abstracts.
Fig. 1. Experimental design.
Informed by the additional review constraints and the revised set of survivorship terms, the training set was resampled using constraints (a) and (b) and an additional publication type constraint: (c) the publication type was not a review or meta-analysis. A second test set was created that contained 500 method sections selected using constraints (a) and (c); however, the survivor term could appear anywhere in the abstract (i.e. not only in the method section). PMIDs were mutually exclusive across all training and test sets.
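The sampling constraints can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the `Abstract` record, the example seed terms, the publication type labels, and the helper names are all hypothetical, and simple substring matching stands in for whatever term matching was actually applied. It shows constraints (a) to (c), the relaxed term location used for the second test set, and one way to keep PMIDs mutually exclusive across sets.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Abstract:
    # Hypothetical record layout; the study's data format is not described.
    pmid: str
    sentences: list[str]          # every sentence in the abstract
    method_sentences: list[str]   # sentences from the method section only
    publication_types: set[str] = field(default_factory=set)


# Placeholder seed terms and publication type labels (assumptions).
SEED_TERMS = {"overall survival", "disease-free survival"}
EXCLUDED_TYPES = {"Review", "Meta-Analysis"}


def has_seed_term(sentences, seed_terms=SEED_TERMS):
    """Return True if any seed term occurs in the given sentences."""
    text = " ".join(sentences).lower()
    return any(term in text for term in seed_terms)


def eligible(abstract, *, method_only):
    """Apply constraints (a), (b) and (c) to one abstract.

    method_only=True requires the seed term in the method section;
    method_only=False allows it anywhere in the abstract.
    """
    long_enough = len(abstract.sentences) >= 5                       # (a)
    scope = abstract.method_sentences if method_only else abstract.sentences
    has_term = has_seed_term(scope)                                  # (b)
    not_review = not (abstract.publication_types & EXCLUDED_TYPES)   # (c)
    return long_enough and has_term and not_review


def draw_disjoint(abstracts, size, used_pmids, *, method_only, rng):
    """Randomly draw `size` eligible abstracts whose PMIDs are still unused."""
    pool = [
        a for a in abstracts
        if a.pmid not in used_pmids and eligible(a, method_only=method_only)
    ]
    sample = rng.sample(pool, size)  # raises ValueError if the pool is too small
    used_pmids.update(a.pmid for a in sample)
    return sample
```

Sharing a single `used_pmids` set across successive calls is what keeps the resulting sets disjoint; passing a seeded `random.Random` instance as `rng` makes a draw repeatable.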
3.1. Identify and pre-process abstracts
The American Cancer Association (ACA) characterizes breast cancer treatments as either local, where surgery or radiation is used, or systemic, where treatments fall into three types: hormone therapy, chemotherapy, or targeted treatments. Abstracts from each of these three systemic treatment types were identified so that the degree to which outcomes might be influenced by the underlying biological mechanisms could be explored. The ACA website was searched to find 5 target drugs from amongst the different treatments. Abstracts from the 2016 distribution of MEDLINE were searched using the brand name and generic term for each of the drugs, as described below.
The hormone therapies Tamoxifen (Nolvadex), Raloxifene (Evista) and Bazedoxifene (Duavee) were searched. Tamoxifen is used with pre- and postmenopausal women for early, locally advanced, and metastatic breast cancer. It is also prescribed for ductal carcinoma in situ (DCIS), which is a non-invasive breast cancer, and to reduce the risk of breast cancer for high-risk patients who have yet to develop the disease. All of the hormone drugs operate as selective estrogen receptor modulators (SERMs), and Tamoxifen and Raloxifene are used prophylactically to reduce risk, even if cancer has not yet developed or been detected. Bazedoxifene is a repurposed osteoporosis treatment that is described as a selective estrogen receptor degrader (SERD), which targets the estrogen receptor for destruction. The two chemotherapy drugs Doxorubicin (Adriamycin) and Docetaxel (Docefrez, Taxotere) are both used to treat early and locally advanced breast cancer and metastatic breast cancer. The fifth and final drug was the monoclonal antibody Trastuzumab (Herceptin), an example of a targeted therapy where specific vulnerabilities of the tumor cells are exploited for therapeutic benefit. The brand and generic names for each drug were used as search criteria, which returned 88,727 abstracts (see Fig. 1).
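The name-based search can be illustrated with a short Python sketch. How MEDLINE was actually queried is not described here, so the sketch assumes abstracts are already available as plain text and uses case-insensitive whole-word matching; the `DRUGS` layout and the function names are hypothetical, while the drug names themselves are those listed above.

```python
import re

# Generic name -> brand names, grouped by treatment type, as listed in the text.
DRUGS = {
    "hormone therapy": {
        "tamoxifen": ["nolvadex"],
        "raloxifene": ["evista"],
        "bazedoxifene": ["duavee"],
    },
    "chemotherapy": {
        "doxorubicin": ["adriamycin"],
        "docetaxel": ["docefrez", "taxotere"],
    },
    "targeted therapy": {
        "trastuzumab": ["herceptin"],
    },
}


def build_patterns(drugs=DRUGS):
    """Compile one case-insensitive pattern per drug, covering all its names."""
    patterns = {}
    for treatment_type, names in drugs.items():
        for generic, brands in names.items():
            alternatives = "|".join(re.escape(n) for n in [generic, *brands])
            regex = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
            patterns[generic] = (treatment_type, regex)
    return patterns


def drugs_mentioned(text, patterns):
    """Map each drug named in `text` (by generic name) to its treatment type."""
    return {
        generic: treatment_type
        for generic, (treatment_type, regex) in patterns.items()
        if regex.search(text)
    }


if __name__ == "__main__":
    patterns = build_patterns()
    example = "Patients received Herceptin followed by docetaxel."
    print(drugs_mentioned(example, patterns))
```

Keying the result by generic name means that a brand-name mention and a generic-name mention of the same drug are counted together, which keeps abstracts grouped by treatment type for later comparison.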