In his best seller, “Thank You for Being Late,” Thomas Friedman traced the advent of big data analysis and its potential. He contrasted Google’s system with Hadoop’s. Google’s system is proprietary and closed source, runs only in Google’s data centers, and powers everything from basic search to facial identification, spelling correction, translation, and image recognition. Hadoop’s system, on the other hand, is open source and is run by everyone else, leveraging millions of cheap servers to do big data analytics. Today tech giants such as IBM and Oracle have standardized on Hadoop and contribute to its open-source community. And since there is so much less friction on an open-source platform, and so many more minds working on it compared with a proprietary system, it has expanded with unbelievable speed.
Hadoop scaled big data thanks to another critical development as well: the transformation of unstructured data. Before Hadoop, most big companies paid little attention to unstructured data. Instead, they relied on relational databases such as Oracle’s, queried with SQL—a computer language that came out of IBM in the seventies—to store, manage, and query massive amounts of structured data and spreadsheets. “SQL” stands for Structured Query Language. In a structured database the software tells you what each piece of data is. In a bank system it tells you “this is a check,” “this is a transaction,” “this is a balance.” Everything sits in a defined structure, so the software can quickly find your latest check deposit.
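To make the idea of a structured query concrete, here is a minimal sketch in Python using the standard-library SQLite module; the bank table, its columns, and the sample rows are hypothetical, chosen only to illustrate how labeled, structured data lets software find “your latest check deposit” with a single query.

```python
import sqlite3

# Hypothetical schema: every row is labeled by the software
# ("this is a check," "this is a transaction," and so on).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        account_id INTEGER,
        kind       TEXT,   -- 'check', 'withdrawal', 'deposit', ...
        amount     REAL,
        posted_on  TEXT    -- ISO date string
    )
""")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [
        (42, "check",      150.00, "2018-03-01"),
        (42, "withdrawal",  80.00, "2018-03-10"),
        (42, "check",      200.00, "2018-04-02"),
    ],
)

# Because the data is structured, the latest check deposit is one query away.
row = conn.execute(
    """
    SELECT amount, posted_on
    FROM transactions
    WHERE account_id = ? AND kind = 'check'
    ORDER BY posted_on DESC
    LIMIT 1
    """,
    (42,),
).fetchone()
print("Latest check deposit:", row)  # -> (200.0, '2018-04-02')
```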
Unstructured data was anything you could not query with SQL. Unstructured data was a mess. It meant you just vacuumed up everything out there that you could digitize and store, without any particular structure. But Hadoop enabled data analysts to search all that unstructured data and find the patterns. This ability to sift mountains of unstructured data, without necessarily knowing what you were looking at, and still be able to query it, get answers back, and identify patterns was a profound breakthrough. As Doug Cutting, the creator of Lucene and a father of big data, put it, Hadoop came along and told users: “Give me your digits, structured and unstructured, and we will make sense of them. So, for instance, a credit card company like Visa was constantly searching for fraud, and it had software that could query a thirty- or sixty-day window, but it could not afford to go beyond that. Hadoop brought a scale that was not there before. Once Visa installed Hadoop it could query four or five years, and it suddenly found the biggest fraud pattern it had ever found by having a longer window. Hadoop enabled the same tools that people already knew how to use to be used at a scale and affordability that did not exist before. That is why Hadoop is now the main operating system for data analytics, supporting both structured and unstructured data. We used to throw away data because it was too costly to store, especially unstructured data. Now that we can sort it all and find patterns in it, everything is worth vacuuming up and saving. If you look at the quantity of data that people are creating and connecting to and the new software tools for analyzing it—they’re all growing at least exponentially.”
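As a rough illustration of the kind of long-window scan described above, here is a minimal MapReduce-style sketch in Python that totals transactions per card across an arbitrarily long log. The tab-separated log format is made up for illustration, and Visa’s actual systems and fraud rules are of course not public; the point is only the map-then-reduce pattern that Hadoop made affordable at scale.

```python
import sys
from itertools import groupby

# Sketch of the MapReduce pattern Hadoop popularized. Each input line is assumed
# to be a made-up record: card_id<TAB>merchant<TAB>amount<TAB>date.

def mapper(lines):
    """Map step: emit (card_id, 1) for every well-formed transaction record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 4:
            card_id, _merchant, _amount, _date = fields
            yield card_id, 1

def reducer(pairs):
    """Reduce step: total transactions per card across the whole multi-year window."""
    for card_id, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield card_id, sum(count for _, count in group)

if __name__ == "__main__":
    # Locally this reads records from stdin; on a cluster, Hadoop Streaming would
    # run the same mapper and reducer across years of log files in parallel,
    # which is what makes the longer fraud window affordable.
    for card_id, total in reducer(mapper(sys.stdin)):
        print(f"{card_id}\t{total}")
```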
In a recent article in the Financial Times (February 1, 2018) entitled “Mapping the economy in real time,” Robin Wigglesworth reported that data companies are converting digital information into instantaneous signals of economic activity. Though doubts over accuracy persist, the trend could help governments make quicker and better decisions. Commenting on the Billion Prices Project, run by M.I.T. professors, the author writes that it is only one example of a broader trend of trawling the swelling sea of big data for clues on how companies, industries or entire economies are performing. Some data are already providing useful, if imperfect, insights. But some experts forecast that the digital fingerprints of our online lives could ultimately be crunched into a real-time map of economic trends that makes present-day data look as archaic as the railway freight information of the 1920s.
The trail of our digital exhaust is incomprehensibly vast. The world’s annual data generation is estimated to be doubling every year, and the overall size will reach 44 zettabytes (that’s trillions of gigabytes) by 2020, according to a study by International Data Corporation. If all this information were placed on high-end tablet computers, the pile would reach from Earth to the moon more than six times over. The potential for big data analysis is dizzying. Social media feeds can be used to build real-time gauges of sentiment. Satellites in space can see which ships dock where and when, whether oil tanks are full or empty, the quality of a crop, or even the productivity of a blast furnace. Credit card purchases and email receipts show retail spending. Job listings from hundreds of thousands of career sites and corporate websites can reveal employment patterns. And smartphones send location data that show where we are at any given time. In time, the “internet of things” could reveal our daily eating habits through web-connected fridges.
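As a back-of-the-envelope check on the tablet-stack comparison, the arithmetic below works through the claim; the tablet capacity (128 GB) and thickness (about 7.5 mm) are assumptions for illustration, not figures taken from the IDC study.

```python
# Rough check of the "Earth to the moon more than six times" comparison.
ZETTABYTE = 10**21                 # bytes
data_2020 = 44 * ZETTABYTE         # IDC forecast cited above
tablet_capacity = 128 * 10**9      # bytes per high-end tablet (assumption)
tablet_thickness_m = 0.0075        # metres per tablet (assumption)
distance_to_moon_m = 384_400_000   # average Earth-moon distance

tablets_needed = data_2020 / tablet_capacity
stack_height_m = tablets_needed * tablet_thickness_m
print(f"Tablets needed: {tablets_needed:.2e}")
print(f"Stack height: {stack_height_m / 1000:,.0f} km")
print(f"Earth-moon trips: {stack_height_m / distance_to_moon_m:.1f}")
# With these assumptions the stack covers the distance roughly 6 to 7 times,
# consistent with the figure quoted in the article.
```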
Mining these data sets was once the preserve of sophisticated “quantitative hedge funds.” But some finance ministries, central banks and statistics agencies are now starting to dabble in the field in order to understand the economic tides better and more swiftly—a development that could have significant public policy implications as well as influence corporate strategies. For example, Cargill, the agricultural trading giant—which, in partnership with Jollibee, is making a major investment in Batangas in what could be the largest poultry processing plant—is hiring data scientists to find ways to profit from the scraps of information picked up as food commodities flow through its factories, silos and ports. In another recent article in the Financial Times (January 29, 2018), Gregory Meyer reports that Cargill is attempting to better exploit the seven petabytes of information in its proprietary data network. Using information ranging from shipping patterns to the sound of shrimp eating, the company believes data scientists can help it turn a bigger profit. Among the initiatives is machine learning, a branch of artificial intelligence that sifts through vast data sets to find patterns that can guide decisions. The LinkedIn page of Tyler Deutsch, Cargill’s global data science leader, says that he oversees more than a dozen employees building machine learning models of the agriculture, food and commodities trading industries globally.
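To give a concrete, if simplified, picture of what machine learning on acoustic data such as the shrimp recordings might involve, here is a small sketch using scikit-learn; the synthetic audio clips and the two features are illustrative assumptions, not Cargill’s actual data or pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def synthetic_clip(feeding, n=4000):
    """Fake one-second hydrophone clip: 'feeding' clips carry extra click noise."""
    noise = rng.normal(0.0, 0.1, n)
    if feeding:
        clicks = (rng.random(n) < 0.02) * rng.normal(0.0, 1.0, n)
        return noise + clicks
    return noise

def features(clip):
    """Two simple acoustic features: RMS energy and zero-crossing rate."""
    rms = np.sqrt(np.mean(clip ** 2))
    zcr = np.mean(np.abs(np.diff(np.sign(clip)))) / 2
    return [rms, zcr]

# Build a small labeled dataset: 1 = shrimp feeding, 0 = quiet tank.
labels = [1] * 200 + [0] * 200
X = np.array([features(synthetic_clip(label)) for label in labels])
y = np.array(labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
# A classifier like this could, in principle, flag when feeding activity drops,
# the kind of decision the company describes wanting to automate.
```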
Justin Kershaw, Cargill’s chief information officer, envisages using machine learning for tasks including finding the best shipping routes, reading satellite images to assess crops’ vigor and interpreting microphone recordings of shrimp to let farmers know when to add more fish feed, one of Cargill’s products. “Shrimp make a sound when they eat,” Mr. Kershaw told the Financial Times. “In the Cargill data platform, we are collecting acoustical information about shrimp and analyzing that.” From these examples, it is obvious that the potential of big data analysis for more enlightened decisions, whether in government, business or any other sector of society, is unlimited. I hope that efforts to increase the supply of professionals at the different levels of the data analytics industry will be given the highest priority by both the academe and the user sectors. For comments, my email address is bernardo.villegas@uap.asia.