Wrangling Big Metadata

Finally, the NYT, in its recent article

For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights

is getting to some of the nitty-gritty, giving folks transparency into what big data means, some of its challenges, and why it is generating such buzz. For those of us who have been cleaning data for a while, though, the wrangling caveat is certainly old news.

In response, this Xentity blog article

I Always Wanted to be a Data Janitor

makes the case well that the future challenge, beyond automating the wrangling, is quality metadata and the presentation of context.

All too often when searching for data, I find myself poring over metadata to judge quickly whether a dataset fits my needs, only to end up taking the time to preview and investigate it anyway. This is because the metadata is at once too general (what was created) and too technical (how it was created), without providing the context of why it was created. Formats like FGDC and MARC records have gone so far into standardization that context is completely lost, just like the biology example in the Xentity article. What happens when the source information someone needs is embedded in methodology that doesn’t chunk easily into pre-defined categories? The data gets used out of context. Keywording provides a slightly more dynamic way to present the data, but there is only so much we can do to predict what our users are seeking. Abstracts in professional journals and publications swing to the other end of the spectrum, often providing hints about methodology but lacking any semblance of order or categorization.

As the Xentity article points out, “there is a major current lack of true incentive other than the right thing to do to assure the data is tagged properly.” What data creators who double as data users know is that using metadata to convey the context and methodology of the data can reduce the potential for misuse and increase the likelihood that data will find its way to creative and effective uses.

The key to getting people to create meaningful uses and solutions with the data we wrangle is providing adequate dataset descriptors and inventories. These quick review tools fit best into the brainstorming exercises that are essential to the visioning process. If I head down a rabbit hole every time I investigate whether a dataset is relevant to my current quest, I’m far more likely to miss a good idea or wind up on a different course altogether.
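To make the idea of a quick-review descriptor concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the field names (what, how, why) simply mirror the distinction drawn above and are not FGDC or MARC elements, the `quick_review` helper is hypothetical, and the two inventory entries are made up.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetDescriptor:
    """Hypothetical descriptor; field names are illustrative, not FGDC or MARC elements."""
    title: str
    what: str      # what was created (the general description)
    how: str       # how it was created (methodology and capture)
    why: str       # why it was created (the context that often goes missing)
    updated: str   # date of last update, as an ISO 8601 string
    keywords: list[str] = field(default_factory=list)


def quick_review(inventory, term):
    """Return titles of datasets whose description, purpose, or keywords mention the term."""
    term = term.lower()
    hits = []
    for d in inventory:
        searchable = " ".join([d.what, d.why, *d.keywords]).lower()
        if term in searchable:
            hits.append(d.title)
    return hits


# Made-up inventory entries, for illustration only.
inventory = [
    DatasetDescriptor(
        title="Stream gauges",
        what="Point locations of stream gauges",
        how="GPS field survey",
        why="Built to support flood warning for downstream towns",
        updated="2014-06-01",
        keywords=["hydrology", "water"],
    ),
    DatasetDescriptor(
        title="Parcel boundaries",
        what="Polygon boundaries of tax parcels",
        how="Digitized from county plats",
        why="Built to support property tax assessment",
        updated="2013-11-15",
        keywords=["cadastral", "land"],
    ),
]

print(quick_review(inventory, "flood"))
```

Because the search looks at the "why" field, a brainstorming query like "flood" surfaces the gauge dataset even though neither its title nor its keywords mention flooding. That is the kind of match a purely technical record would miss.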

The data provider must find its incentive in successfully matching resource and content to the end user, which means envisioning the task of the visualizer.

To this end, let us briefly consider the processing of data and the use of data to visualize relevance and importance, as highlighted in this Nate Silver article:

FiveThirtyEight Life – What the Fox Knows

Silver describes four steps to transforming data, which I find very similar to the four basic tasks of a cartographer. For NYT readers, “wrangling” comes before, or is akin to, collection and simplification.

FiveThirtyEight: Collection, Organization, Explanation, Generalization

Cartographic Principles: Simplification, Classification, Generalization, Symbolization

These steps transform highly involved and complex systems and collections of data into readable and comprehensible bites.

In the case of journalism, Silver highlights his use of data analysis to inform the public: the future of news reporting lies in conveying important facts derived from quantitative and qualitative assessments. Making the “news a little nerdier” means “figuring out how to make data journalism vivid and accessible to a broad audience without sacrificing rigor and accuracy.”

As mentioned before, there are of course real dangers in this transformation of quantities into comprehension: some information specialists present data they have unwittingly misinterpreted or intentionally skewed. The same goes for how data is presented; as Mark Monmonier describes, simple graphical errors can send the wrong message. Useful, descriptive, and accurate metadata not only provides an avenue to promote creativity, but also provides a mechanism to communicate the essential details of timeliness and methodology.

Only when we know the source of our data is sound can we visualize it fully and accurately. This brings us to conclude with the work of Edward Tufte, who teaches that the presentation of our work is just as important as the verification that our work is accurate. If we cannot convey what we have found, then our work is for naught.

Data analysts and developers are looking for information that will fit into this process and come together with other processed data to convey a concept, or to give people a tool for something they couldn’t investigate or accomplish otherwise. App developers looking to present economic tools to consumers need information that transforms seamlessly and is updated as frequently as possible, while researchers require adequate descriptions of methodology and capture to ensure that the work is repeatable and the results are sound.

Certainly we must categorize the data we publish and make every effort to conform to the standards required by search engines, to gain as many efficiencies from automation as possible. But as with anything, there is always a point at which a human just has to sit down to read and review. Better metadata and cataloging, combined with the efficiencies of automation, can be the winning combination that leads to creative and effective uses for the multitudes of modern data.

Margaret Spyker
