DeepHealth will present seven abstracts during RSNA 2022.

Monday, November 28, 9:30 am

M1-STCE Presentation with Q&A

Purpose
A variety of AI tools are FDA-cleared for screening mammography, including triage (CADt), computer-aided detection (CADe), and diagnosis (CADx) products, but little data exists on large-scale deployments of such tools in clinical practice. Although some AI triage products have shown promising results in domains such as intracranial hemorrhage, there are fewer indications of immediate value in triage for mammography, perhaps because screening populations have very low expected disease prevalence. Here we sought to assess radiologists’ performance by binary triage category and by more granular AI categories.

Materials and Methods
An FDA-cleared CADt algorithm was deployed at 151 clinical sites across multiple US states. Data were collected from 519,281 screening mammograms interpreted by 223 radiologists over a period of more than 10 months. The data collected for each mammogram included the product outputs (“suspicious” or “not suspicious”), the underlying AI numeric scores, radiologists’ BI-RADS scores, and all biopsy outcomes within 6 months. The AI scores were also retrospectively binned into four suspicion levels containing ~25% (“Minimal”), ~50% (“Low”), ~20% (“Intermediate”), and ~5% (“High”) of all mammograms. Cancer detection rate (CDR) and recall rate were assessed for each suspicion level.
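The two headline metrics here reduce to simple ratios. A minimal sketch in Python (with hypothetical counts, not the study's data) illustrates how CDR and recall rate would be computed for one suspicion bin:

```python
# Sketch of the per-bin metrics described above (hypothetical counts, not study data).
# CDR = biopsy-confirmed cancers per 1,000 screens; recall rate = recalled / screened.

def bin_metrics(n_screens, n_recalled, n_cancers_detected):
    """Return (CDR per 1,000 screens, recall rate in %) for one suspicion bin."""
    cdr = 1000.0 * n_cancers_detected / n_screens
    recall_rate = 100.0 * n_recalled / n_screens
    return cdr, recall_rate

# Hypothetical "Minimal" bin: 130,000 screens, 7,000 recalls, 25 detected cancers
cdr, rr = bin_metrics(130_000, 7_000, 25)
print(f"CDR: {cdr:.2f} per 1,000; recall rate: {rr:.2f}%")
```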

Results
The observed CDR, total cancers detected, recall rate, and total patients recalled are reported by suspicion level. The CDR (per 1,000 screens) increased sharply with increasing AI suspicion (0.19, 0.61, 4.58, and 72.38), while the recall rate (%) increased incrementally with AI suspicion (5.45, 9.82, 17.65, 33.72). There was a dramatic difference in CDR between the Minimal and High categories (381 times greater) even though those categories had approximately the same number of recalls (7,003 and 7,947). The Low and Intermediate categories showed a similar pattern (23,137 and 23,219 recalls), albeit less pronounced (CDR 8 times greater).

Conclusion
In a group of more than half a million women, AI was able to reliably categorize patients into four cancer suspicion levels. Given the dramatic difference in performance at the different suspicion levels, radiologists could potentially increase CDR by focusing on High and Intermediate cases, while also lowering recall rates by up to 50% by reducing recalls of Minimal and Low cases, which represented only 7% of all cancers. The greater granularity provided by four categories will likely aid radiologists significantly more than a simple binary triage flag.

Clinical Relevance
AI for mammography can indicate cancer suspicion reliably to clinicians as supported by large scale clinical data. Changing behavior by suspicion level could lead to quality improvements for patients.


Radiologist performance measures by AI categories at 151 clinical sites (N=519,281 screening mammograms).
The top row shows radiologist performance measures across the two binary triage categories, while the bottom row shows the same metrics across the more granular set of four categories. The difference in performance is more pronounced with four categories.

Monday, November 28, 9:30-10:30 am

M3-SSBR03 Presentation with Q&A

Authors
Jorge Onieva, MSc, Leeann Louis, Benjamin Reece, Greg Sorensen, William Lotter

Purpose
Model drift is a challenge in artificial intelligence (AI) in which an AI model’s behavior changes over time due to a number of possible factors, including changes in the input data distribution (for example, changes in a screening population over time). We analyzed the outputs of a commercially available AI triage software for mammography to assess their stability in real-world clinical use. The software was deployed for one year across a large number of screening exams, allowing a comparison of the AI model’s outputs over time.

Materials and Methods
A total of 303,222 studies (302,612 digital breast tomosynthesis and 610 full-field digital mammography) were classified as “suspicious” or “not suspicious” by the software at 66 clinical sites across the United States. We aggregated the AI model’s predictions per month (from May 2021 to April 2022) and calculated the proportion of studies classified as “suspicious” at each of the 3 operating points used by the software. 95% confidence intervals for the proportions were computed using the Wilson method, and the standard deviation of the proportions across the 12 months was calculated. Finally, we investigated the similarity of the distributions by computing the Jensen-Shannon distance between the proportions during the first and last 6 months of use of the software. This distance is defined on a [0, 1] range, where 0 indicates identical distributions.
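Both statistics named here are standard; the sketch below (standard library only, with hypothetical counts rather than the study's data) shows one way to compute a Wilson score interval and a base-2 Jensen-Shannon distance:

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score 95% confidence interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

def js_distance(p, q):
    """Jensen-Shannon distance with base-2 logs, so the range is [0, 1];
    0 means the two distributions are identical."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):  # Kullback-Leibler divergence in bits
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return math.sqrt((kl(p, m) + kl(q, m)) / 2)

# Hypothetical month: 1,200 of 25,000 studies flagged "suspicious"
lo, hi = wilson_ci(1_200, 25_000)
print(f"Wilson 95% CI: ({lo:.4f}, {hi:.4f})")

# Hypothetical "suspicious"/"not suspicious" proportions, first vs. last 6 months
print(f"JS distance: {js_distance([0.048, 0.952], [0.050, 0.950]):.5f}")
```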

Results
The proportion of cases classified as “suspicious” each month remained stable over the course of the year. For each operating point, the standard deviation of the proportion over the 12 months was 0.0041, 0.0021, and 0.0019, respectively. Across all months, the proportions were within a range of 0.015 (1.5 percentage points) for each operating point, meaning that the proportion for any given month differed by less than 1.5 percentage points from any other month at the same operating point. The Jensen-Shannon distances between the first and second 6-month periods were very close to zero for all three operating points (0.0153, 0.00935, and 0.00808), further quantifying that the proportion of studies output as “suspicious” at each operating point remained stable over time.

Figure 1.
Proportion of studies classified as “suspicious” for the software with 95% confidence intervals (Wilson method). Each line represents one operating point.

Conclusion
The proportion of cases classified as “suspicious” by the AI model remained stable over a full year of deployment, indicating that no significant model drift was observed.

Clinical Relevance
AI has the potential to improve screening mammography, but to do so, it must be reliable and robust over time. These results indicate AI model stability at a scale that has not been previously measured.

Monday, November 28, 12:15-12:45 pm

MSA-SPBR Digital Poster

Authors
Melina Tsitsiklis, MD, Leeann Louis, Jiye G. Kim, Bryan Haslam, A. Gregory Sorensen

Purpose
High breast density is associated with both increased breast cancer risk and decreased sensitivity of screening mammography, contributing to higher overall breast cancer rates and interval cancer rates. The Breast Cancer Surveillance Consortium (BCSC) previously showed that interval cancer rates are higher in women with dense breasts undergoing screening with full-field digital mammography, but little data is available for digital breast tomosynthesis (DBT). We sought to compare cancer rates by breast density in women undergoing screening with DBT, using a dataset comparable to BCSC.

Materials and Methods
Retrospective data were collected comprising one DBT screening mammogram from each of 559,791 women undergoing screening between 2017 and 2021 at over 200 clinical locations, excluding first screens and requiring at least one year of follow-up. Screen-detected cancer rate and interval cancer rate were calculated per 1,000 women. Analyses were separated between women with nondense (fatty or scattered fibroglandular) and dense (heterogeneously dense or very dense) breast tissue, following BI-RADS 5th Edition guidelines.

Results
Across all age groups, women with dense breasts had a 1.8x higher recall rate, 1.4x higher CDR, and 2.9x higher interval cancer rate compared to women with nondense breasts. CDR was higher in women with dense breasts in all age brackets (dense/nondense CDR ratio: age 40-49: 2.0; 50-59: 1.8; 60-69: 1.4; 70-74: 1.4; p < 0.001 for all differences). Interval cancer rate was higher in women with dense breasts in all age brackets (dense/nondense interval cancer rate ratio: age 40-49: 2.5; 50-59: 3.2; 60-69: 2.8; 70-74: 3.7; p < 0.01 for all differences). While the positive predictive value of recalls (PPV1) was similar for women with dense and nondense breasts in age groups above 50, women aged 40-49 with dense breasts who were recalled were more likely to have cancer detected than women with nondense breasts (dense/nondense PPV1 ratio: 1.17).

Conclusion
In a group of more than half a million women, those with dense breasts undergoing screening mammography with DBT showed increased cancer detection rates and interval cancer rates compared to women without dense breasts, across all age groups. Women aged 40-49 with dense breasts had a higher recall rate but also a higher likelihood of a cancer diagnosis than women without dense breasts. Mammographic breast density, even with DBT, remains associated with higher rates of cancer.

Clinical Relevance
Mammographic breast density is a strong risk factor for breast cancer and for interval breast cancers, even when DBT is used. Our results suggest women aged 40-49 with dense breasts particularly benefit from screening mammography.

Figure 1.
A. Cancer detection rate (per 1,000 screens); B. Recall rate (%); C. Interval cancer rate (per 1,000 screens); D. Positive predictive value of recalls (%) in nondense (fatty or scattered fibroglandular) vs. dense (heterogeneously dense or very dense) breasts of women across ages. Numbers in parentheses indicate 95% confidence intervals as calculated with the Adjusted Wald method.

Tuesday, November 29, 9-9:30 am

T1-STCE Presentation with Q&A

Authors
Leeann Louis, Bryan Haslam, Jiye G. Kim, A. Greg Sorensen

Purpose
There have been many studies of mammography AI reporting retrospective performance, reader study performance, or clinical pilot data, but little data exists showing that large-scale deployment helps radiologists. Such large-scale data will likely help radiologists build confidence that AI will generalize to their clinical practice. Our objective in this study was to evaluate performance changes in a large-scale deployment of a CADe/x AI device used for mammography screening.

Materials and Method
An FDA-cleared CADe/x AI algorithm providing lesion localization and cancer suspicion levels for digital breast tomosynthesis screening mammography was deployed at 147 clinical sites across a variety of US states. Data were collected from all screening mammograms that had at least 30 days of follow-up to allow for diagnostic work-up (n=61,961), and included the AI product’s suspicion levels, radiologists’ BI-RADS scores, and all biopsy outcomes available for the patients screened. (These are late-breaking results and will be updated as more data are collected.) For reference, the suspicion levels in ascending order are 1, 2, 3, and 4 and correspond to approximately 25%, 50%, 20%, and 5% of all mammograms. For comparison, mammograms from the 14-month period prior to deployment of the AI were also retrospectively analyzed with the same AI software. Performance measures from 185 radiologists were compared before and after deployment. Because of the recent large-scale deployment of the AI algorithm, there was limited follow-up time for cancer diagnosis (diagnostic work-up including biopsy confirmation) for the mammograms read with the AI. For this reason, the cancer detection window for all exams was limited to 30 days both before and after deployment. Significant differences were calculated using a chi-squared test with a p<0.05 cutoff for significance.
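The pre/post comparison described above is a standard 2x2 chi-squared test. A minimal sketch (standard library only; the counts are hypothetical, not the study's data) is:

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test for a 2x2 table [[a, b], [c, d]].
    Returns (statistic, p-value), using the df=1 identity
    P(X > x) = erfc(sqrt(x / 2))."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: cancers detected vs. not, before and after deployment
#           cancer   no cancer
# before      120      29,880
# after       155      29,845
stat, p = chi2_2x2(120, 29_880, 155, 29_845)
print(f"chi2 = {stat:.3f}, p = {p:.4f}")
```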

Results
The cancer detection rate (CDR) increased by 26% after deployment compared to the CDR before deployment (p=0.026). This was primarily driven by a higher CDR in the level 4 suspicion category, which rose from 23.50 to 29.10 per 1,000 cases. There was also a slight increase in recall rate from 11.8% to 12.8% (p<0.001) that did not change the overall PPV1 (1.3% before vs. 1.5% after deployment, p=0.191).

Conclusion
These initial results show that, directionally, radiologists are beginning to detect more cancers even early after the deployment of an AI tool that provides suspicion categories as well as CAD markings. An increase in recall rate was expected as radiologists become accustomed to the product, similar to what has been observed during the adoption of other new technologies such as DBT. Our results provide early encouraging evidence that radiologists in clinical practice will benefit at scale, though further monitoring is ongoing and will be reported.

Clinical Relevance
AI for mammography using suspicion levels and localization can help radiologists detect more cancer.

Radiologist performance measures by AI categories at 147 clinical sites. Black lines indicate the 95% confidence interval. CDR overall increases post-deployment, primarily due to an increase in CDR for level 4 suspicion exams.

Wednesday, November 30, 9:30-10:30 am

Session W3-SSBR08 Presentation with Q&A

Authors
Leeann D. Louis, Bryan Haslam, Jiye G. Kim, William Lotter, A. Gregory Sorensen

Purpose
Breast cancer screening mammography using digital breast tomosynthesis (DBT) is thought to be an improvement over full-field digital mammography (FFDM), leading to an increased cancer detection rate (CDR) and a reduced recall rate (RR). Although improved performance has been shown at a population level, less evidence is available that DBT provides an improvement for different patient subgroups, especially for patients of different racial, ethnic, or economic backgrounds. The goal of this study was to determine whether the improvements from DBT also hold across race, ethnicity, and income.

Materials and Method
Retrospective data from 2017-2021 were collected from over 200 clinical sites across the United States. Outcomes measured included RR (%), CDR (per 1,000 women screened), interval cancer rate (ICR, per 1,000 women screened), and positive predictive value of recalls (PPV, %). Race and ethnicity were taken from self-reported intake forms. Income was estimated using the median income of each patient’s zip code as tabulated in the 2021 US Census. Analyses were performed by modality (DBT vs. FFDM), as well as by race/ethnicity and income. Comparisons were performed using chi-squared tests with FDR correction for multiple tests.

Results
Data included screening mammograms (excluding first screens) from 906,413 women (median age 60.5±10.5 years); at least 1 year of follow-up was available for all screens, and 64% of the exams were DBT. DBT had a similar RR (DBT 7.9% vs. FFDM 7.5%), higher CDR (DBT 4.35 vs. FFDM 3.11), and higher PPV (DBT 5.50% vs. FFDM 4.14%). A key strength of the dataset was its coverage of the screening population by race (White n=785,352, Black or African American n=304,296, Asian n=151,267), ethnicity (Hispanic or Latino n=277,147), and income level (<$50k: n=224,355; $50-75k: n=503,773; $75-100k: n=374,179; >$100k: n=410,324). Stratifying by these features revealed that DBT had significantly higher CDR and PPV for all races, ethnicities, and income levels (p < 0.001 in all groups). ICR did not differ significantly between DBT and FFDM overall or for any racial, ethnic, or income group.

Conclusion
Consistent with prior studies, breast cancer screening with DBT had a higher CDR and PPV compared with FFDM. This performance improvement with DBT was consistent across races, ethnicities, and socioeconomic status.

Clinical Relevance
Screening using DBT improves performance over FFDM regardless of race, ethnicity, or income.

Figure 1.
Screening mammography outcomes analyzed by race / ethnicity and modality.

Wednesday, November 30, 1:30-2:00 pm

Session W3-SSBR08 Presentation with Q&A

Authors
Jiye G. Kim, Ryan Shnitman, Leeann Louis, Yun Boyer, William Lotter, Bryan Haslam

Purpose
While there is growing evidence that AI shows promise in aiding radiologists in detecting breast cancer on screening mammography, radiologists are eager to know when AI might miss cancer in clinical practice. Here, we studied an FDA-cleared AI device deployed at 137 US sites to better characterize when it detected cancer and, importantly, when it did not detect cancer in clinical use.

Materials and Method
AI results from more than 610,500 screening DBT mammography exams were analyzed. False negative exams were defined as screening exams that the AI did not flag as suspicious but that were followed by malignant pathology. True positive exams were defined as those that the AI flagged as suspicious and that were followed by malignant pathology. Clinical information for the false negative exams was compared with that of a sample of randomly selected true positive exams. The clinical information included family history of cancer, breast density, BI-RADS scores, visibility of findings on mammograms, lesion type, whether or not the exam was read with prior exams, whether or not the screening mammogram was interpreted alongside another modality, cancer pathology, and immunohistochemistry profile.

Results
Across the dataset, a total of 2,358 patients were diagnosed with breast cancer, of which 2,198 (93.2%) were flagged by the AI as suspicious (true positives) and 160 (6.8%) were not flagged as suspicious (false negatives). Compared to true positives, false negatives were more often read with priors (78.6% of true positives vs. 86.3% of false negatives), more often involved lesions not visible on mammograms (2.0% vs. 11.9%), and more often involved asymmetries (12.5% vs. 39.0%). Other clinical factors were comparable across true positives and false negatives.

Conclusion
While the AI flagged the vast majority of cancer exams as suspicious, false negatives were more likely to be 1) read with priors, indicating the unique role of radiologists in detecting changes over time, 2) not visible on mammograms (e.g., only visible on ultrasound), and 3) asymmetry lesions, highlighting that some lesions may be more challenging for the AI to detect. These results were not unexpected, given that the AI was not explicitly trained to 1) compare exams across time, 2) detect lesions on other imaging (e.g., ultrasound), or 3) compare lesions across lateralities. Understanding the strengths and weaknesses of AI can help radiologists interpret screening mammograms optimally by complementing AI.

Clinical Relevance
Transparency is essential for clinical adoption of AI. Large prospective datasets can help promote trust and guide how best to use the technology, so as to realize the potential of AI to improve patient care.

Thursday, December 1, 9:00-9:30 am

R2-SPBR Digital Poster

Authors
Mina Moussavi, Leeann Louis, Bryan Haslam, A. Gregory Sorensen

Purpose
Screening guidelines are increasingly being scrutinized for appropriateness across racial groups, as evidence suggests that cancer incidence and mortality may differ by race. Any changes to guidelines might need to consider adherence to current guidelines among these racial groups to more realistically address cancer disparities. We sought to compare the percentage of patients returning for recommended annual screening by race/ethnicity in a large and diverse population across the US.

Materials and Method
Screening mammography data were retrospectively collected from 186 clinical sites across the United States from 2017-2021. Screening exams from 1,391,248 patients above the age of 35 years (mean age 58.2 ± 11.0 years) were collected. We reviewed patients with an initial exam prior to June 2019 and a subsequent screening exam 9-30 months later. The interval between initial and subsequent screening mammograms was defined as annual if the exams were 9-18 months apart and biennial if they were 19-30 months apart. We then analyzed the proportion of returning patients who were screened at the recommended annual cadence, broken down by self-reported race and ethnicity. Confidence intervals were calculated using the adjusted Wald method.
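The adjusted Wald (Agresti-Coull) interval mentioned here amounts to adding roughly two successes and two failures before applying the usual normal approximation. A minimal sketch with hypothetical counts (not the study's data) is:

```python
import math

def adjusted_wald_ci(k, n, z=1.96):
    """Adjusted Wald (Agresti-Coull) 95% CI for a proportion k/n:
    add z^2/2 successes and z^2 trials, then apply the Wald formula."""
    n_adj = n + z**2
    p_adj = (k + z**2 / 2) / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half, p_adj + half

# Hypothetical group: 300,000 of 396,000 patients returned within 9-18 months
lo, hi = adjusted_wald_ci(300_000, 396_000)
print(f"95% CI: ({100*lo:.1f}%, {100*hi:.1f}%)")
```

With samples this large, the interval is very tight around the observed proportion, which is why the confidence intervals in Table 1 span only a few tenths of a percentage point.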

Results
White patients (n=396,372) returned annually more often than Black (n=159,823) and Asian (n=81,276) patients, at 75.8%, 71.4%, and 71.8%, respectively. The proportions for Hispanic/Latino (n=152,126), Native Hawaiian/Pacific Islander (n=6,066), and American Indian/Alaska Native (n=4,332) patients were the lowest of all groups, at 69.6%, 69.1%, and 68.3%, respectively.

Conclusion
Black women, Asian women, and women from other non-White populations return for screening at a meaningfully lower rate than White women. As revisions to guidelines are being considered, these data on real-world adherence to existing guidelines might better inform any changes.

Clinical Relevance
Racial disparities in breast cancer screening adherence continue to persist and should be considered when updating mammography screening guidelines.

Table 1.  
Proportion of patients who returned for an annual screening mammogram (9-18 months after their prior screening exam) by self-reported race/ethnicity from 2017-2021. One exam between January 2017 and June 2019 was randomly selected from each of the 1,391,248 patients across 186 sites to represent the initial screen and used as the starting time point for the screening interval calculation during the 9-30 month follow-up period. 95% confidence intervals are indicated in parentheses.

Race / Ethnicity — Proportion of Women Who Returned Within 9-18 Months, % (95% CI)
White 75.8 (75.6-75.9)
Black or African American 71.4 (71.2-71.7)
American Indian or Alaska Native 68.3 (66.9-69.6)
Asian 71.8 (71.5-72.1)
Native Hawaiian or other Pacific Islander 69.1 (67.9-70.2)
Multiple Race 69.3 (67.8-70.7)
Other Race 68.2 (67.5-68.8)
Hispanic or Latinx 69.6 (69.4-69.9)