Solving the “Big Hot Data Mess”

By Anthony Calamito; Christopher Tucker, Ph.D.; and Abe Usher

This article was originally published in USGIF’s State & Future of GEOINT Report 2017.

You can’t talk about GEOINT these days without acknowledging the explosion in new big data sources or the accumulation of traditional data sources into large, hard-to-manage data repositories splintered across multiple networks. Big data is also being fragmented by security half-measures and otherwise made generally inaccessible to all of the newfangled big data solutions with which everyone is so enamored. In short, you can’t talk about GEOINT these days without talking about the “big hot data mess” the GEOINT Community currently faces. In this article, we raise more questions than we answer, because the answers have so far proven elusive.

Senior leaders, enterprise architects, technology vendors, and software experts are promising to make GEOINT data analysis faster, better, and cheaper, and to provide amazing insights never before possible. They promise to let us collaborate in new and interesting ways using GEOINT data. And they promise to magically have this data flow to the very edge of every network on which the mission is conducted — until, that is, they see the current state of our data.

These new technologies assume all entities, including the National Geospatial-Intelligence Agency (NGA), have actually acquired/licensed the right data and have meaningful access to this wide array of data. These technologies also assume the data hasn’t been squirreled away into countless different physical storage environments on multiple networks with no concern for how many redundant copies of the data have been and continue to be generated. This is compounded by the fact that the full metadata needed to help solve analytic problems is not always available.

The global GEOINT Community — intelligence professionals, warfighters, humanitarians, first responders, municipalities, and businesses — yearns for the wonders of ubiquitous, secure, and time-dominant access, big data analytics, machine learning, and everything else they hear about in the latest Silicon Valley tech press. So, how can the GEOINT Community reach this technical nirvana that has become our new base expectation? How can we understand the big hot data mess and take concrete steps to transform our basic GEOINT infrastructure to comport with modern technological expectations?

The Kitchen Metaphor for Data Challenges

To understand the GEOINT Community’s data challenges, we must have a clear understanding of how impact and value are produced. The value creation process of deriving intelligence from data is much like the operations of a well-run kitchen. Chefs (subject matter experts) use utensils to process and combine raw ingredients using repeatable recipes to produce nutritious, delicious food. Similarly, analysts use technology tools with specific methodology to process and combine raw GEOINT data to produce relevant intelligence products. In the GEOINT Community, we have great “chefs” with excellent “recipes,” but we don’t have a good handle on our “ingredients” (data).

Not everyone knows where to find the ingredients they need to do their job. Some ingredients are stored in the wrong place — like storing ketchup in a freezer where it is rendered useless, or burying spices in the backyard where they will never be discovered by other chefs. Think of a talented chef who repeatedly makes peanut butter and jelly sandwiches because those are the only ingredients she can find or has access to. As a result of our “ingredient challenges,” we are extremely limited in the advanced “utensils” (tools) we can bring to bear.

Know the Data

Why doesn’t every GEOINT desktop have access to every piece of relevant spatiotemporal data that exists, whether government-generated, commercial, or open source? Does the GEOINT Community have a grasp of the massive proliferation of data that is occurring? Does it at least have an exhaustive accounting of what exists, even if it doesn’t have the actual data? Does the community know who is the primary source of the data and not the middleman? What are the business and legal terms (the data licenses) under which it could gain access to each?

Do governments or businesses have a contract vehicle that allows for immediate, time-dominant data access? How do government and commercial entities share and exchange data? How can citizens provide free services back to the government? How can citizens and corporations pay for government-collected or collated data so the government can continue to provide it in an easily consumed form, and so commercial entities can profit from government-provided data? How can the government leverage citizen scientists to collect, correct, and update unclassified data sets open to the public? What are the privacy implications of unclassified data being made publicly available?

Has the massive proliferation of such sources of data outstripped the GEOINT enterprise’s ability to maintain such an ongoing assessment? It’s unclear. However, the confusion spawned by this proliferation and our haphazard grasp of it contributes to the big hot data mess. Does NGA have access to the newest, hottest, best source of data? Of course it does. Somewhere. But whom do I ask for it, and how can I discover this data?

Buy the Data

NGA’s proposed Commercial Initiative to Buy Operationally Responsive GEOINT (CIBORG) vehicle for acquiring data may solve the problem of U.S. government access to this proliferation of data. It is too soon to tell, but perhaps CIBORG will provide transparency with regard to the terms under which NGA and National System for Geospatial Intelligence (NSG) partners can rapidly acquire every kind of spatiotemporal data under the sun. Perhaps it will become clear what it means to have each data source available to the U.S. national security community, international partners, humanitarian partners, and indeed the whole of government and even private citizen use. Will this be the moment when NGA proactively, vigorously, and exhaustively builds a dynamic acquisition vehicle that provides the kind of transparency needed to clean up this big hot data mess? Actions, not rhetoric, will tell the tale over time.

Crowdsourcing the Data

With the popularity of citizen science and the desire for more transparency within government, how can organizations like NGA better leverage crowdsourcing as a means to create and collect data? Initiatives like OpenStreetMap have proven the value of leveraging a community of users from around the globe for creating data sets in areas that have been underserved, are too dangerous to visit, or have not been a focus of data creation.

So, what changes to policy are needed to ensure valuable crowdsourced data sets like OpenStreetMap and others are considered valid, timely data sources like those created by NGA? Will NGA open its unclassified data sets and enable citizen scientists to verify and edit them as needed? With a growing number of autonomous data sensors and an increasingly capable citizen science initiative, how will NGA adapt and leverage crowdsourced data sets as much as possible?

Migrate the (Legacy) Data

Assuming NGA understood all the data sources and mastered their acquisition, we then have the huge burden of the legacy/heritage environments that splinter the management of this data across many networks, file systems, databases, and APIs. This burden makes the timely, efficient, and effective use of big data questionable at best. Plenty of baroque technological strategies have been pursued in the past two decades to wicker these legacy/heritage environments together so seamless data access could be achieved. However, it is the cloud — and, for the GEOINT Community, the Intelligence Community Information Technology Enterprise (IC ITE) cloud — that finally offers the promise, but not yet the reality, of migrating data into an environment that will allow the community to take advantage of modern technologies and strategies. IC ITE offers hope that at least parts of the big hot data mess may soon end. But the path ahead remains challenging.

Cloud Manage the Data

The authors recommend a four-step process to begin to address complex data challenges:

1. Mission needs inventory: Create specific user stories that define the most common activities that support common GEOINT mission threads.

2. Data inventory: Inventory government, commercial, and public GEOINT data sources.

3. “Unlock” analytics: Decouple data from analytics by storing GEOINT content in IC ITE cloud-based open storage systems (e.g., Hadoop, HBase, Accumulo, Elasticsearch) that provide multiple ways of accessing content such as ArcMap, QGIS, full-text search, Google Earth, etc.

4. Simplify data discovery: Put significant effort into communicating to analysts, software engineers, data scientists, and leaders how to access data for each GEOINT mission thread.
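As a rough illustration of steps 1, 2, and 4, the sketch below models a tiny in-memory data inventory that can be queried by mission thread. It is only a sketch under stated assumptions: the `DataSource` and `Catalog` names, the field layout, the mission thread labels, and the "Example Commercial Imagery" entry are all hypothetical and do not describe any existing NGA or IC ITE system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """One inventory entry (hypothetical schema, for illustration only)."""
    name: str
    origin: str            # "government", "commercial", or "public"
    license_terms: str     # the terms under which the data may be used
    mission_threads: tuple # mission threads this source is known to support


class Catalog:
    """A minimal inventory keyed by source name."""

    def __init__(self):
        self._sources = {}

    def register(self, source):
        # Step 2: record each known source exactly once, by name.
        self._sources[source.name] = source

    def discover(self, mission_thread):
        # Step 4: list the sources that support a given mission thread.
        return sorted(
            s.name for s in self._sources.values()
            if mission_thread in s.mission_threads
        )

    def gaps(self, mission_threads):
        # Step 1 meets step 2: mission threads with no known data source.
        return [t for t in mission_threads if not self.discover(t)]


catalog = Catalog()
catalog.register(DataSource(
    name="OpenStreetMap",
    origin="public",
    license_terms="ODbL",
    mission_threads=("humanitarian mapping", "route planning"),
))
catalog.register(DataSource(
    name="Example Commercial Imagery",  # hypothetical entry
    origin="commercial",
    license_terms="per-contract",       # placeholder, not a real license
    mission_threads=("humanitarian mapping",),
))

print(catalog.discover("humanitarian mapping"))
print(catalog.gaps(["route planning", "maritime monitoring"]))
```

Even a toy inventory like this makes the gaps visible: a mission thread that returns no sources is a concrete acquisition or discovery question rather than a vague complaint.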

Once the transition to the IC ITE cloud occurs, the U.S. government GEOINT Community will be able to consistently apply new and evolving big data and machine learning techniques on every data source, at global scale, and at whatever temporal density the data supports. Because the IC ITE cloud will exist at every level of classification, powerful technologies will allow for data to be stored at the level of its classification, with seamless cross-domain access for people and processes on every higher network.

Open Geospatial Consortium (OGC) web services and other kinds of micro-services will be enabled on this data and deployed on the elastic cloud within powerful containerization strategies that provide unprecedented flexibility and scalability.
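To make the idea of an OGC web service concrete, here is a minimal sketch of how a client might assemble a Web Map Service (WMS) 1.3.0 GetMap request. The endpoint `https://example.org/wms`, the layer name `roads`, and the helper name `build_getmap_url` are assumptions for illustration; the query parameter names come from the WMS standard. The sketch uses `CRS:84`, for which bounding boxes are given in longitude, latitude order.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_getmap_url(endpoint, layer, bbox, width=512, height=512):
    """Build a WMS 1.3.0 GetMap URL.

    bbox is (min_lon, min_lat, max_lon, max_lat) in CRS:84.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",          # empty value requests the default style
        "CRS": "CRS:84",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical endpoint and layer, used only to show the request shape.
url = build_getmap_url("https://example.org/wms", "roads",
                       (-77.2, 38.8, -76.9, 39.0))

# Parse the URL back to confirm the parameters survive encoding.
query = parse_qs(urlparse(url).query, keep_blank_values=True)
print(query["REQUEST"], query["BBOX"])
```

Because the request is just a URL with standardized parameters, any client that speaks the standard can consume the same service without knowing how the data is stored behind it.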

Suddenly, the data will be easily exposed for cataloging and a wide range of indexing schemes that will revolutionize discovery and access. This will also enable a real discussion about new ways individuals, teams, and communities with a vast array of processes can collaboratively interact with each other among the data. The age of the big hot data mess will be over. But, what will it allow us to do?

Leap Forward in Advanced Analytics

The face of GEOINT will be radically transformed by decoupling data analytics from data storage by moving relevant data into an elastic cloud with simple standards for data structure and access. The GEOINT Community will be able to fully exploit the global wealth of data generated about the planet every second of every day, to provide our nation time-dominant decision advantage in the realm of international affairs.

An endless variety of analytic algorithms will run concurrently in real time and serve many different mission sets. Machine learning will enable the augmentation of human analytic capabilities, sifting through the endless deluge of data, finding the known, and queuing up the unknown for analysts to solve. And geospatial narratives will be fed and constantly updated by these processes, collaboratively curated by the modern analytic workforce. The volume of continuously dispatched data will be enormous. The fidelity of data derived from it will be unparalleled, and its update cycle will be significantly faster than today. This will be the era in which GEOINT accelerates intelligence insight to action as never before imagined.

To learn more about USGIF, visit the Foundation’s website and follow us on Facebook, Twitter, or LinkedIn.