Real-world data sources, such as electronic health records (EHRs), can offer meaningful insights into the impact of COVID-19 infection on patients (pts) with cancer. Newly developed ICD codes are useful for identifying COVID-19 diagnoses in EHRs; however, there is concern over lagged clinical uptake of these codes and over testing performed outside the EHR system that goes uncaptured. Both may lead to underestimation of COVID-19 diagnoses in EHRs, thereby mischaracterizing the burden of COVID-19 infection on pts with cancer. Using the nationwide Flatiron Health EHR-derived de-identified database, we constructed and refined a natural language processing (NLP) algorithm that detected ~2400 pts with COVID-19-related terms in unstructured clinical notes from Feb 1 to Aug 30, 2020. We manually reviewed charts for 350 randomly selected pts and confirmed documented COVID-19 diagnoses in 88 (positive predictive value [PPV] = 25%, 95% CI = 21-30%). Applying this PPV to all NLP-detected pts yields an estimated cohort of 600 pts, nearly five times larger than the cohort estimated using ICD codes alone. Our work highlights the challenges of detecting COVID-19 diagnoses in oncology EHRs with ICD codes, and an opportunity to leverage unstructured data to improve cohort selection.
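
The PPV and estimated cohort size reported above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the abstract does not say which confidence interval method was used, so the Wilson score interval here is an assumption, and the function and variable names are illustrative. The count of 2400 is the approximate figure given in the abstract.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (assumed method)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Counts taken from the abstract.
flagged_by_nlp = 2400   # approximate number of pts flagged by the NLP algorithm
charts_reviewed = 350   # randomly selected pts with manual chart review
confirmed = 88          # pts with a documented COVID-19 diagnosis

# PPV = confirmed diagnoses / charts reviewed.
ppv = confirmed / charts_reviewed
ci_low, ci_high = wilson_interval(confirmed, charts_reviewed)

# Extrapolate the PPV from the reviewed sample to all flagged pts.
estimated_cohort = flagged_by_nlp * ppv

print(f"PPV = {ppv:.0%} (95% CI {ci_low:.0%}-{ci_high:.0%})")
print(f"Estimated cohort size: about {estimated_cohort:.0f} pts")
```

Under these assumptions the sketch gives a PPV of about 25% with an interval of roughly 21-30%, and an extrapolated cohort of about 600 pts, consistent with the figures in the abstract.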