Scenario Generation
Scenarios for the Avicenna model contain three primary constructs: locations, population and mobility. These constructs form the social network graph of the population for a given area and define the vectors by which disease may progress. The quality of a given model prediction is directly tied to the fidelity and quality of the underlying data used to construct a scenario. In its most basic form, a scenario could be defined that incorporates the entirety of a population within a single location (country) with random interactions. In this case, the model would produce results similar to a traditional mathematical model. Once we begin to include real-world locations, individual demographics, realistic mobility and consider human behavior, the predictions become much more particular to the unique characteristics of a given time and place.
Avicenna Execution Parameters
Each scenario execution provides the following global parameters that impact different mechanics in the Avicenna epidemiological model.
start-quarantine: time at which to start global quarantine measures, e.g. stay at home orders
start-day: logical first day of the year in the model
stay-home-duration: duration of initial stay home period
psick: probability agents are sick at model start
pwwell: probability agents are worried well
wwell_thresh: worried well threshold
wwell_interval: worried well duration
pwwsick: probability work while sick
report: statistics reporting interval
quarantine_compliance: compliance with stay at home measures
These parameters provide information to the scenario on when certain major events may occur, such as shutdown orders, or provide averages for use in determining if an individual person will be worried well, or will comply with quarantine measures.
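As an illustration, the parameters above could be grouped into a single configuration object. The following is a minimal sketch, not the Avicenna configuration format: the class name, the use of days as the time unit, the value ranges, and all example values are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class ExecutionParameters:
    """Global scenario parameters; field names mirror the list above."""
    start_quarantine: int         # day on which global quarantine measures begin (unit assumed: days)
    start_day: int                # logical first day of the year in the model
    stay_home_duration: int       # length of the initial stay-home period (unit assumed: days)
    psick: float                  # probability an agent is sick at model start
    pwwell: float                 # probability an agent is worried well
    wwell_thresh: float           # worried well threshold (scale not specified in the source)
    wwell_interval: int           # worried well duration (unit assumed: days)
    pwwsick: float                # probability an agent works while sick
    report: int                   # statistics reporting interval (unit assumed: days)
    quarantine_compliance: float  # compliance with stay-home measures (assumed fraction in [0, 1])

    def validate(self) -> None:
        """Raise ValueError for out-of-range probabilities, negative durations or a non-positive report interval."""
        for name in ("psick", "pwwell", "pwwsick", "quarantine_compliance"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
        for name in ("stay_home_duration", "wwell_interval"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be non-negative")
        if self.report <= 0:
            raise ValueError("report interval must be positive")


# Illustrative values only; they are not Avicenna defaults.
example = ExecutionParameters(
    start_quarantine=30, start_day=1, stay_home_duration=45,
    psick=0.001, pwwell=0.05, wwell_thresh=0.1, wwell_interval=7,
    pwwsick=0.2, report=1, quarantine_compliance=0.8,
)
example.validate()
```

Validating the probabilities up front catches a mistyped value before a long scenario run starts.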
Disease Definition
The Avicenna epidemiological model enables configuration of the natural disease stages. The stages for a disease in Avicenna are:
1) Incubating
2) Symptomatic
3) Recovered
The Susceptible phase is defined with default values for each parameter. More stages can be defined if the modeler wants to incorporate higher fidelity within the natural disease. Within each stage, multiple parameters can be defined, including:
1) Start Multiplier
2) Stop Multiplier
3) Minimum Duration
4) Maximum Duration
5) Mortality Rate
6) Eligible for Inoculation
7) Progress Trumps Inoculation
8) Hospital Treatment
9) Is Symptomatic
10) Days in Hospital
The start and stop multipliers indicate the maximum / minimum transmission probability between any two individuals. The minimum / maximum duration defines the bounds, in days, on how long a stage lasts. The mortality rate provides an overall mortality rate for a disease stage. Individual demographics, when available, override the given mortality. “Eligible for Inoculation” is a Boolean parameter that indicates if individuals in this stage may be inoculated. “Progress Trumps Inoculation” indicates if disease progression should override the effects of any prophylaxis. “Hospital Treatment” is the probability that individuals will seek out hospital treatment for symptoms. “Is Symptomatic” is a Boolean that indicates if this disease stage produces symptoms. “Days in Hospital” indicates the average number of days individuals remain hospitalized in a given stage.
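To make the stage parameters concrete, the sketch below models one stage as a record and shows one way the duration bounds and multipliers might be used. The class and function names are hypothetical, and both the uniform draw of the duration and the linear interpolation between the start and stop multipliers are assumptions of this example; the source does not state how Avicenna applies these values. All numbers are illustrative.

```python
import random
from dataclasses import dataclass


@dataclass
class DiseaseStage:
    """One natural disease stage; fields follow the parameter list above."""
    name: str
    start_multiplier: float
    stop_multiplier: float
    min_duration_days: int
    max_duration_days: int
    mortality_rate: float
    eligible_for_inoculation: bool
    progress_trumps_inoculation: bool
    hospital_treatment: float   # probability of seeking hospital treatment
    is_symptomatic: bool
    days_in_hospital: float     # average length of hospitalization


def sample_duration(stage: DiseaseStage, rng: random.Random) -> int:
    """Draw a stage length in days, uniformly within the bounds (assumption)."""
    return rng.randint(stage.min_duration_days, stage.max_duration_days)


def transmission_multiplier(stage: DiseaseStage, day_in_stage: int, duration: int) -> float:
    """Interpolate linearly from start to stop multiplier over the stage (assumption)."""
    if duration <= 1:
        return stage.start_multiplier
    fraction = day_in_stage / (duration - 1)
    return stage.start_multiplier + fraction * (stage.stop_multiplier - stage.start_multiplier)


# Illustrative values only; they do not describe any real disease.
symptomatic = DiseaseStage(
    name="Symptomatic", start_multiplier=1.0, stop_multiplier=0.5,
    min_duration_days=4, max_duration_days=10, mortality_rate=0.01,
    eligible_for_inoculation=False, progress_trumps_inoculation=True,
    hospital_treatment=0.15, is_symptomatic=True, days_in_hospital=6.0,
)
rng = random.Random(42)
duration = sample_duration(symptomatic, rng)
first_day = transmission_multiplier(symptomatic, 0, duration)
last_day = transmission_multiplier(symptomatic, duration - 1, duration)
```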
Locations
Preparing a scenario for an Avicenna model is a challenge in its own right. A typical scenario may include between 10 and 30 different data sources. We employ data fusion to align the data in both space and time. We frequently have to geocode addresses to determine their latitude and longitude. We must also support both aggregation and disaggregation of the data. Finally, we employ geospatial analysis to determine the inclusion and overlap between disparate data levels.
Whenever possible, we use the Federal Information Processing Standard (FIPS) ID to identify spatial regions. Specific locations that do not have a FIPS ID, such as hospitals, are uniquely identified through a combination of factors, such as city, state, zip code, name and location type.
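A composite identifier of this kind can be sketched as a normalized key built from the listed factors. The function name, the normalization rules, and the choice of a truncated SHA-1 digest are assumptions of this example, not a description of how Avicenna builds its identifiers. The hospital name and address fields are made up.

```python
import hashlib


def location_key(name: str, location_type: str, city: str, state: str, zip_code: str) -> str:
    """Build a stable identifier for a location that has no FIPS ID."""
    parts = [name, location_type, city, state, zip_code]
    # Uppercase and collapse whitespace so trivial formatting differences
    # between data sources do not produce different keys.
    normalized = "|".join(" ".join(part.upper().split()) for part in parts)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:16]


# Hypothetical records for the same facility from two sources.
key_a = location_key("General  Hospital", "hospital", "Springfield", "il", "62701")
key_b = location_key("general hospital", "HOSPITAL", " Springfield ", "IL", "62701")
key_c = location_key("General Hospital", "nursing home", "Springfield", "IL", "62701")
```

Here `key_a` and `key_b` are equal because the records differ only in case and spacing, while `key_c` differs because the location type is part of the key.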
In Avicenna, there are two types of locations:
1) Geospatial regions, and
2) Geospatial points
Based on the U.S. model, the primary layers of a scenario are geospatial regions. These regions have the following conceptual meanings:
1) Country
2) State or Territory
3) Core-Based Statistical Area (CBSA), i.e., a metropolitan or micropolitan area
4) Zip Code
5) County
6) Census Tract or Neighborhood
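Part of this hierarchy is encoded directly in Census identifiers: an 11-digit census tract GEOID consists of the 2-digit state FIPS code, the 3-digit county code and the 6-digit tract code. The helper below is a small sketch of deriving the parent regions from a tract identifier; the function name is hypothetical, and the tract code in the example is made up. CBSA and zip code are not part of the GEOID and would need a separate lookup.

```python
def split_tract_geoid(geoid: str) -> dict:
    """Derive the state and county FIPS IDs from an 11-digit tract GEOID."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected an 11-digit tract GEOID, got {geoid!r}")
    return {
        "state": geoid[:2],    # 2-digit state FIPS code
        "county": geoid[:5],   # state code + 3-digit county code
        "tract": geoid,        # full 11-digit tract identifier
    }


# 06 = California, 06037 = Los Angeles County; the tract code 123456 is invented.
parents = split_tract_geoid("06037123456")
```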
Individual locations can be defined in any layer and are purely conceptual; they may represent homes, workplaces or schools, to name a few. Each individual location is a specific point in space. The sources for this data are DHS, USDA and CMS. The following locations are currently defined in Avicenna:
1) Fire Stations
2) Law Enforcement
3) EMS Stations
4) Long-term Care Facilities, e.g., Nursing Homes
5) Hospitals
6) Poultry Producers
Each individual location is first geocoded to lat / lon and then aligned to each regional area to support geospatial visualization.
Modeling locations is heavily dependent on the data source information we have available, and we must account for varying degrees of fidelity in the information available. We then geocode specific addresses and align them with their state, CBSA, zip code and census tract.
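Aligning a geocoded point with a region comes down to a point-in-polygon test against the region boundary. The sketch below uses the standard ray-casting method on a single boundary ring; it is an illustration only and not necessarily the method Avicenna uses. The function names and the toy square regions are assumptions, and real boundary files contain multi-part polygons and holes that this example ignores.

```python
from typing import Dict, List, Optional, Tuple

Ring = List[Tuple[float, float]]  # boundary vertices as (lon, lat) pairs


def point_in_polygon(lon: float, lat: float, ring: Ring) -> bool:
    """Ray casting: count how many edges a ray heading east from the point crosses."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Only edges that straddle the point's latitude can be crossed.
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def assign_region(lon: float, lat: float, regions: Dict[str, Ring]) -> Optional[str]:
    """Return the ID of the first region containing the point, or None."""
    for region_id, ring in regions.items():
        if point_in_polygon(lon, lat, ring):
            return region_id
    return None


# Two adjacent unit squares standing in for census tracts (hypothetical IDs).
toy_regions = {
    "tract-A": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "tract-B": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}
hospital_region = assign_region(1.5, 0.5, toy_regions)
```

Running the same test against each layer (state, CBSA, zip code, census tract) yields the full set of regional alignments for a location.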
Population
When building an Avicenna scenario we start with the background population, and then break out members of that population into more specific cohorts. The three general population sets we define are:
1) Background population, based on Census Bureau totals by census tract
2) Location-based populations, e.g., healthcare workers
3) Workforce population
The background population forms the basis for the number of people in the model. The U.S. Census Bureau provides information by census tract, or at the neighborhood level, including the number of resident persons, the number of commuters, and basic demographic data such as race, age distribution and blood type percentages. In the base case, these demographics are assigned to individuals randomly. We also provide, for a fee, much higher fidelity demographics that are not based on random selection. These paid services, sourced from our data partners, provide thousands of features for each person over the age of 18 in the United States.
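The random base-case assignment can be sketched as independent weighted draws from each tract's published distributions. The function name, the attribute names, and the shares below are all hypothetical; drawing each attribute independently is a simplifying assumption of this example.

```python
import random
from typing import Dict, List


def assign_demographics(
    n_people: int,
    distributions: Dict[str, Dict[str, float]],
    rng: random.Random,
) -> List[Dict[str, str]]:
    """Give each person one category per attribute, drawn by tract-level share."""
    people = []
    for _ in range(n_people):
        person = {}
        for attribute, shares in distributions.items():
            categories = list(shares)
            weights = list(shares.values())
            # random.choices performs a weighted draw; weights need not sum to 1.
            person[attribute] = rng.choices(categories, weights=weights, k=1)[0]
        people.append(person)
    return people


# Made-up shares for a single census tract.
tract_distributions = {
    "age_group": {"0-17": 0.22, "18-64": 0.61, "65+": 0.17},
    "commuter": {"yes": 0.45, "no": 0.55},
}
residents = assign_demographics(1000, tract_distributions, random.Random(7))
```

With enough residents per tract, the assigned categories approximate the published shares, while any single individual's attributes remain random.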
Location-based populations are dataset dependent. For example, we fuse four datasets to provide physician populations per hospital across the U.S. With location-based populations, we typically only know where they spend their day outside of their home. We then leverage the mobility model to backtrack these persons to their home areas, and account for them as a subset of their home population.
Workforce populations are built from data about individuals that comes from our commercial customers. In these cases, the data is only available to that particular customer. Companies provide us with the home and work locations for all workers, and we then account for them as a subset within the background population. The advantage of this proprietary, private data is that we know both the specific work and home locations and can assign them as such. This enables us to determine when and where a company's workforce will be impacted by the spread of the disease.
Mobility
The mobility model in an Avicenna scenario defines how people move between locations. The mobility model is primarily based on U.S. Department of Transportation statistics that define the daily mobility between census tracts. This data is updated annually and provides the primary mobility patterns for all individuals across the United States.
Mobility for the subset populations is then modified and / or augmented to include new locations. Each individual in an Avicenna scenario has a defined mobility vector that determines the locations each person will spend time in each day.
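One way to picture a mobility vector is as a home tract plus a daytime tract drawn in proportion to the observed tract-to-tract flows. The sketch below illustrates that idea under stated assumptions: the function names, the (home, destination, count) row layout, the two-element vector, and the tract IDs are all hypothetical and do not describe Avicenna's internal representation.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_flow_table(od_rows: Iterable[Tuple[str, str, int]]) -> Dict[str, Dict[str, int]]:
    """Aggregate (home_tract, destination_tract, count) rows by home tract."""
    flows: Dict[str, Dict[str, int]] = defaultdict(dict)
    for home, destination, count in od_rows:
        flows[home][destination] = flows[home].get(destination, 0) + count
    return flows


def assign_mobility_vector(
    home_tract: str, flows: Dict[str, Dict[str, int]], rng: random.Random
) -> List[str]:
    """Return [home, daytime location], sampling the daytime tract by flow volume."""
    destinations = flows.get(home_tract)
    if not destinations:
        return [home_tract]  # no recorded flows: the person stays in the home tract
    tracts = list(destinations)
    weights = [destinations[t] for t in tracts]
    daytime = rng.choices(tracts, weights=weights, k=1)[0]
    return [home_tract, daytime]


# Toy origin-destination rows with invented tract IDs and counts.
rows = [("T1", "T1", 50), ("T1", "T2", 30), ("T1", "T3", 20), ("T2", "T1", 10)]
flow_table = build_flow_table(rows)
vector = assign_mobility_vector("T1", flow_table, random.Random(3))
```

For a subset population such as hospital staff, the sampled daytime tract would instead be replaced or augmented with the known work location.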
Data
The three primary datasets used to define Avicenna scenarios are:
1) U.S. Census Bureau Population Data
2) U.S. Department of Transportation Local Area Transportation Data (LODES7)
3) U.S. Census Bureau Census Tract Cartographic Boundary Files
The population data defines where people live. The transportation data defines how people move between census tracts on a daily basis. The census tract (CT) cartographic data defines the boundaries of each neighborhood in the U.S. It is used to geocode locations within a region, as well as to provide metadata about each CT, including zip code, CBSA, county and state information.
To build reliable healthcare population data, we bring in multiple datasets from the Centers for Medicare & Medicaid Services (CMS), including:
1) Medicare Hospital Compare
2) Medicare Nursing Home Compare
3) CMS National Plan and Provider Enumeration System Database
4) CMS Physician Compare
A complete list of all healthcare workers either does not exist or is not publicly available. Although the data is incomplete, the majority of physicians are registered with CMS, and our dataset encompasses nearly 6.4 million workers. Having the complete population is often not necessary to determine the pandemic's impact on the healthcare worker population. If Avicenna shows that 25 out of 100 modeled workers at a given hospital are impacted, but the hospital has 200 workers in reality, it can be assumed that about 25% of those 200 workers, or roughly 50 people, are likely to have been infected. In other words, Avicenna captures the when and where of the impact, if not always its full extent.
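The proportional reasoning in this example can be written out as a small calculation. The function name is hypothetical, and the approach assumes the modeled workers are representative of the full staff at the location.

```python
def extrapolate_impacted(modeled_impacted: int, modeled_total: int, actual_total: int) -> float:
    """Scale the modeled impact rate up to the real headcount at a location."""
    if modeled_total <= 0:
        raise ValueError("modeled_total must be positive")
    impact_rate = modeled_impacted / modeled_total
    return impact_rate * actual_total


# 25 of 100 modeled workers impacted, 200 workers actually on staff.
estimated_impacted = extrapolate_impacted(25, 100, 200)  # 0.25 * 200 = 50.0
```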
We provide two different healthcare worker datasets: Physicians Only and All CMS Healthcare Workers. The Physicians Only dataset includes over 1 million physicians specifically aligned and geocoded to their primary hospital location. The All CMS Healthcare Workers dataset includes over 6 million healthcare workers and covers nursing, pharmacy, school doctors, dentists, chiropractors and many others. Lucd has developed a complete taxonomy of healthcare workers to aid in identifying subset populations within the overall healthcare worker dataset. This enables analysts to break out different specialties, different nursing professions, technicians and more to understand the impact of disease spread within sub-populations of healthcare workers.