

Cheap or low-quality external data comes with poor or no metadata, and small organizations lack the resources to add high-quality metadata to their datasets. Metadata – information about the history of the data, and about how it can be technically and legally reused – has a hidden cost. As Jeffrey Pomerantz puts it: “Data is potential information, analogous to potential energy: work is required to release it.” In small organizations, where there are no separate data science and data engineering roles, data is usually processed and documented by (junior) analysts or researchers, and the documentation steps are usually omitted completely. This is a very important source of the gap between Big Tech and small organizations: the data usually ends up very expensive, ill-formatted, and not readable by the machine learning and AI applications that could use it.
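
To make “information about the history of the data and about how it can be reused” concrete, here is a minimal sketch in Python of a Dublin Core-inspired metadata record attached to a small table. The field names loosely follow the Dublin Core element set; the dataset and every value below are invented for illustration, not taken from any real source.

```python
# A hypothetical dataset: two rows of an invented indicator.
dataset = [
    {"country": "AT", "year": 2020, "value": 3.2},
    {"country": "NL", "year": 2020, "value": 2.7},
]

# A minimal metadata record (field names inspired by Dublin Core).
metadata = {
    # Descriptive metadata: what the data is and who made it.
    "title": "Example indicator by country, 2020",
    "creator": "Jane Analyst",
    "contributor": ["John Reviewer"],
    "date": "2024-05-01",                           # date of this release
    # Legal metadata: how the data may be reused.
    "rights": "CC-BY-4.0",
    "source": "https://example.com/raw-download",   # where the raw file came from
    # Technical metadata: how to read the file correctly.
    "format": "text/csv",
    "description": "Values are percentages; missing values were imputed.",
}
```

Even a record this small answers the two questions the paragraph raises: where the data came from, and under what terms it may be reused.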

Big data creates inequality and injustice. One aspect of this inequality is the cost of data processing and documentation – a greatly underestimated, and usually unreported, cost item. When was a file downloaded from the internet? What has happened to it since? Are there updates? Was a bibliographical reference created for citations? Were missing values imputed? Were currencies converted? Who knows any of this – who created the dataset, and who contributed to it? Which version of a spreadsheet file is an intermediate one, and which is the final one, checked and approved by a senior manager? This is the price that project managers and analysts pay for the inadequate documentation of their data assets. Like Sisyphus, who was punished by being forced to roll an immense boulder up a hill only for it to roll down every time it neared the top, they are condemned to redo the same work again and again.
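
The questions above become cheap to answer if every processing step is recorded at the moment it happens, rather than reconstructed from memory later. Below is a minimal sketch, in Python, of such a provenance log. The class names (ProvenanceEvent, DatasetHistory), the helper method log, and all recorded values are assumptions made for illustration; they are not an established library API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceEvent:
    """One recorded processing step in the life of a data file."""
    action: str       # e.g. "download", "impute", "convert-currency", "approve"
    agent: str        # who performed the step
    detail: str       # what exactly happened
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class DatasetHistory:
    """A provenance log: what happened to this file, when, and by whom?"""
    source_url: str
    events: list = field(default_factory=list)
    status: str = "draft"   # flipped to "final" once a senior manager approves

    def log(self, action: str, agent: str, detail: str) -> None:
        self.events.append(ProvenanceEvent(action, agent, detail))


# Hypothetical usage: each step is logged as it happens, so the
# questions in the text no longer depend on anyone's memory.
history = DatasetHistory(source_url="https://example.com/raw-data.csv")
history.log("download", "junior.analyst", "retrieved raw CSV from source_url")
history.log("impute", "junior.analyst", "filled 12 missing values with column means")
history.log("convert-currency", "junior.analyst", "converted EUR to USD at the 2024-05-01 rate")
history.log("approve", "senior.manager", "checked and approved as final")
history.status = "final"

for event in history.events:
    print(event.timestamp, event.action, event.agent, "-", event.detail)
```

The point of the sketch is not the particular data structure but the discipline: documentation written as a side effect of the work costs minutes, while documentation reconstructed afterwards costs the Sisyphean effort described above.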
