Big Data Characteristics (3V, 5V, 10V, 14V)

In recent years, data and analytics technologies have become indispensable assets for businesses around the world, most of which now have to deal with huge amounts of data.

More than simple data, these are large datasets drawn from sources inside and outside the company and generated by machines or people. In a company's digital transformation, a data-driven approach is indispensable for building marketing strategies grounded in facts.

For these and other reasons that we will see in the course of the article, in 2001 Doug Laney described the characteristics of Big Data, at the time as the 3 Vs: Volume, Velocity and Variety. Over time, the 3 Vs became the 5 Vs of Big Data, and researchers and analysts then introduced further Vs to capture the growing particularity of this multiform, dynamic data, which is increasingly complex to manage.

Brief history of Big Data

To set the context before diving into the characteristics of Big Data, let’s look at how we arrived at today’s large volumes of data.

Since the earliest civilizations, people have needed data in order to make better decisions and gain advantages. The ancient Egyptians and the Roman Empire were among the first. Around 300 BC the famous Library of Alexandria was founded, probably the first attempt to collect data. It appears to have held over 100,000 “books”.

The Roman Empire, for its part, saw the first forms of statistical data analysis, used to predict the most likely enemy insurrections and prepare the armies to face them. This can be seen as a forerunner of predictive data analysis.

But the first to work on the statistical analysis of data was John Graunt, a London scholar who pioneered the so-called “political arithmetic” and promoted statistics and biometrics. As early as 1663 he handled a large volume of data while studying the bubonic plague that had struck Europe. By 1800, the overabundance of data produced by the annual censuses was already clearly perceived.

In 1881 Herman Hollerith, an American engineer, invented the first data tabulating machine based on punched cards. The goal was to reduce the computational effort.

Over the course of the 20th century, data evolved at an impressive and unexpected speed, becoming a lever of progress. In 1965, the US government built the first data center to store millions of sets of fingerprints and tax returns.

The term “Big Data” appeared in the 1990s. The American computer scientist John R. Mashey seems to have been the first to introduce and popularize it. From a knowledge perspective, data involves a melting pot of subjects such as mathematics, statistics, and data analysis techniques.

Since the beginning of the 21st century, data has grown sharply in both the volume and the speed with which it is generated. The way it is accessed has changed as well.

Until the 1950s, data analysis was done manually and on paper. The first data centers and relational databases, tools for collecting and aggregating data, were born between the 1960s and 1970s.

A few decades later, between 2005 and 2008, came the rise of websites and social networks such as Facebook and YouTube. Then, with the growth of the Internet of Things (IoT), more and more connected devices and objects began producing large volumes of data from many different sources, all to be captured, processed and stored. In 2013, the total amount of data in the world reached 4.4 zettabytes.

Big data grows day by day because the many activities and operations we all carry out produce an enormous amount of data: from mobile devices to sensors, from call centers to web servers, from websites and e-commerce to social networks, to cite just a few examples.

This data is very large, fast-moving and difficult to manage with traditional databases and existing technologies. This is why more and more companies undergoing digital transformation feel the need to adopt non-traditional technologies that can extract, manage and process terabytes of data, even in a fraction of a second.

The first 3 Vs of Big Data

Based on a 2001 study, the analyst Doug Laney defined the characteristics of Big Data according to the 3V model: Volume, Variety, Velocity.

1 V: Volume

Volume refers to the size of the data, i.e. the amount of data collected and stored, whether generated by humans or machines. It comes from various sources, including IoT devices (always-connected sensors), industrial equipment, applications, cloud services, websites, social media, videos and scientific instruments, commercial and banking transactions, financial market movements, etc.

In the past, storage was a major problem because of the physical limits of the space available for data. Over time, fortunately, advanced technologies such as data lakes and Hadoop have emerged and have now become standard tools for storing, processing and analyzing data.

2 V: Variety

The volume and velocity of data are important factors for a business, but big data also involves processing different types of data gathered from various sources.

Variety concerns the diversity of formats, sources and structures. The pieces of information that make up big data differ greatly from one another, and each has its own origin.

Data sources can be both internal and external. And this heterogeneity can become critical in building a data warehouse.

Data is available in a great variety of formats, such as numerical data, text documents, images, videos, tweets, emails, audio files, blog posts, social network comments, IoT sensor readings, etc.

By variety, then, we mean a great diversity of data types coming from different sources and with different structures.

We can classify Big Data into three types: structured, semi-structured and unstructured or raw data.

Structured data is traditional data, ordered and conforming to a formal structure. It is data stored in relational database systems. For example, a bank statement includes the date, time, and amount.

Semi-structured data is only partially organized: it does not follow the rigid schema of a relational table but still carries some structure of its own, as in log files, JSON files, CSV files, etc.

Unstructured data is unorganized data that cannot fit into relational databases: text files, emails, photos, movies, voice messages, audio files.
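The three types can be illustrated with a minimal Python sketch. The table name, the sample values and the JSON fields are invented for illustration only; the sketch just shows that a structured row follows a fixed schema, a semi-structured record describes its own fields, and unstructured text has no schema at all.

```python
import json
import sqlite3

# Structured: a row that conforms to a fixed schema in a relational table.
# The "statement" table and its values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statement (date TEXT, time TEXT, amount REAL)")
conn.execute("INSERT INTO statement VALUES ('2024-01-15', '09:30', -42.50)")
structured_row = conn.execute(
    "SELECT date, time, amount FROM statement"
).fetchone()

# Semi-structured: self-describing JSON whose fields may vary per record.
raw_json = '{"user": "anna", "tags": ["iot", "sensor"], "reading": {"temp_c": 21.4}}'
semi_structured = json.loads(raw_json)

# Unstructured: free text with no schema; only generic operations
# (here, counting whitespace-separated words) apply without further processing.
unstructured = "Hi, my order arrived late and the box was damaged. Please call me."
word_count = len(unstructured.split())

print(structured_row)
print(semi_structured["reading"]["temp_c"])
print(word_count)
```

The structured row can be queried by column name, the JSON record has to be navigated key by key, and the free text needs dedicated analysis before it yields any fields.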

To give an example, most of the data available on the Web is unstructured: social media posts, blogs, photos, tweets and videos are not structured data. It is often estimated that around 80% of the world’s data is unstructured.

It is important to distinguish between the various data sources:

  • Streamed data comes from the Internet of Things (IoT) and other connected devices, such as wearables, smart cars, medical devices, industrial sensors, etc.
  • Social media data originates from activities on Facebook, YouTube, Instagram, and others, in the form of images, video, text, and audio in an unstructured or semi-structured form.
  • Public data originates from open data sources, such as ISTAT, the open data portals of the European Union and the US government, and the CIA’s World Factbook.

To handle this diversity and to make sense of big data, tools more advanced than simple spreadsheets are needed, such as data analytics, the process of extracting value from this mass of information.

3 V: Velocity

This aspect indicates the speed with which data is produced. In addition to the sheer amount of incoming data, the speed at which it arrives also matters.

Data velocity refers to the rate at which data and information flow in and out of interconnected systems in real time, and therefore to the ever-increasing rate at which data must be received, processed, stored and analyzed.

Data sets must be managed in a timely manner, in real time, especially when dealing with RFID, sensor and IoT systems capable of generating data at very high speed.

In this regard, imagine a machine learning service that constantly uses a flow of data, or a social media platform with billions of people uploading and posting photos 24/7.
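As a rough illustration of velocity as a rate, the following Python sketch counts how many events arrived in the most recent time window. The `RateMeter` class, the 10-second window and the sample timestamps are all hypothetical; a real streaming system would rely on a dedicated platform rather than an in-memory queue.

```python
from collections import deque


class RateMeter:
    """Tracks events seen in the last `window_s` seconds (hypothetical helper)."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self.timestamps = deque()

    def record(self, ts: float) -> None:
        """Register an event at time `ts` (seconds) and drop expired ones."""
        self.timestamps.append(ts)
        # Keep only events strictly newer than (ts - window_s).
        while self.timestamps and self.timestamps[0] <= ts - self.window_s:
            self.timestamps.popleft()

    def rate(self) -> float:
        """Average events per second over the current window."""
        return len(self.timestamps) / self.window_s


meter = RateMeter(window_s=10.0)
for ts in [0.0, 1.0, 2.0, 9.0, 11.5, 12.0]:  # invented arrival times
    meter.record(ts)

current_rate = meter.rate()
print(current_rate)
```

Only the events that fall inside the window contribute to the rate, so the figure reflects how fast data is arriving now rather than how much has accumulated overall.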

The speed of access to data has a direct impact on the ability to form a clear and comprehensive picture and to make timely and accurate business decisions.

A small amount of good data processed in real time produces better results than a large volume of data that takes too long to acquire and analyze.

Other Vs were later added to the characteristics of Big Data to better capture its inherent complexity.

4 V: Veracity

Veracity refers to the quality, integrity and accuracy of the data collected. The problem most often felt is the ambiguity and indeterminacy of big data, which comes from multiple sources and in different formats: analyzing the acquired data is useless if the data is not reliable and well-founded.

Accuracy and reliability are harder to control for many forms of big data. For example, posts with hashtags, abbreviations, typos and slang phrases are very frequent on social networks.

To ensure the veracity of the data, eliminating incomplete and indeterminate records, intelligent technologies are needed rather than traditional tools, whose adoption would also be far more costly.

When we mention traditional technologies, we mean relational databases (RDBMS, Relational Database Management Systems) and predictive analysis and data mining tools with strong limitations in the face of the growth of data volumes or the lack of table structures.
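The idea of filtering out incomplete or implausible records can be sketched in a few lines of Python. The records, the field names and the validity rules (a non-empty email and an age between 0 and 120) are assumptions made up for this example; real veracity checks depend on the data and the business context.

```python
# Hypothetical customer records with typical quality problems.
records = [
    {"id": 1, "email": "anna@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},                # incomplete: no email
    {"id": 3, "email": "luca@example.com", "age": -5},  # implausible age
    {"id": 4, "email": "sara@example.com", "age": 41},
]


def is_reliable(rec: dict) -> bool:
    """Return True if the record passes the (assumed) quality rules."""
    has_email = bool(rec.get("email"))
    age = rec.get("age")
    plausible_age = isinstance(age, int) and 0 <= age <= 120
    return has_email and plausible_age


clean = [r for r in records if is_reliable(r)]
quality_ratio = len(clean) / len(records)

print([r["id"] for r in clean])
print(quality_ratio)
```

The ratio of clean records to total records gives a simple, if crude, indicator of how much of a dataset can be trusted for analysis.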

5 V: Variability

Variability is closely related to variety. Because the data comes from different sources, inconsistent data must be told apart from useful data, i.e. the data that matters for its informative or predictive value to the company.

6 V: Value

The most important “V” in terms of impact on the company is value: the data’s ability to reveal insights and patterns that make the business more competitive and deliver concrete results.

Concrete results are obtained only when data is transformed into valuable information, from which knowledge can be drawn to make targeted decisions and turn them into actions and activities. To do this, you need analytical tools.

It is imperative for every business to evaluate the cost of investments in big data technologies and management, as well as to weigh the value they can bring. It is not so much the amount of data we collect that counts as the value we can draw from it to guide decisions and actions.

The 10 Vs of Big Data

In 2014 the data scientist Kirk Borne listed 10 Vs of big data:

Volume, Velocity, Value, Variety, Veracity, Variability, Validity, Venue, Vocabulary, and Vagueness.

We comment only on the added Vs.

7 V: Venue

It refers to the different systems or platforms where data is stored, processed and analyzed. The type of venue used for big data depends on the business needs and the type of data being processed.

8 V: Vocabulary

It refers to the need to share terminology and semantics to describe and define data models and structures.

9 V: Vagueness

The V for vagueness refers to the difficulty of pinning down data precisely, due to its fuzzy or imprecise nature.

The data in question may be partial, uncertain or incomplete, and may not be suitable for traditional analysis. This vagueness can be caused by a number of factors, such as the imprecision of data sources, the variability of the data itself or the complexity of the data acquisition and management processes.

The 14 Vs of Big Data

All the characteristics of big data (the 14 Vs) have been listed and defined by researchers and data scientists in order to explain all their complexity, and to manage them in the most effective way possible.

Volume, Velocity, Value, Variety, Veracity, Validity, Volatility, Visualization, Virality, Viscosity, Variability, Venue, Vocabulary, Vagueness.

We comment only on the added Vs.

10 V: Volatility

Volatility refers to the value of data that changes quickly because new data is continuously being produced (e.g. data from IoT sensors can be highly volatile since it is generated in real time and can change rapidly).

11 V: Visualization

It refers to the process of representing large amounts of data in order to make it more understandable and explorable. Big data visualization is a critical part of the data analysis process, because it can help analysts identify patterns, trends, and relationships that would otherwise be hard to find in the data.

12 V: Virality

It refers to the speed with which data is transmitted, disseminated and received for use.

13 V: Viscosity

It refers to the event delay, i.e. the time discrepancy between the event that occurred and the event described, which can be a source of obstacles in data management.

The difficulty in data management can be aggravated by the time difference between the real event and its description. For example, if data logging occurs some time after the event, there may be a loss of important information or a reduction in data accuracy.
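The event delay described here can be measured as the difference between the time an event occurred and the time it was recorded. The following Python sketch uses invented timestamp pairs; the variable names are hypothetical and the approach is only one simple way to quantify viscosity.

```python
from datetime import datetime
from statistics import mean

# Hypothetical (occurred_at, recorded_at) pairs in ISO 8601 format.
events = [
    ("2024-03-01T10:00:00", "2024-03-01T10:00:02"),
    ("2024-03-01T10:05:00", "2024-03-01T10:05:10"),
    ("2024-03-01T10:10:00", "2024-03-01T10:11:00"),
]

# Delay in seconds between each real event and its recorded description.
delays = [
    (datetime.fromisoformat(recorded) - datetime.fromisoformat(occurred)).total_seconds()
    for occurred, recorded in events
]

mean_delay = mean(delays)
max_delay = max(delays)

print(delays)
print(mean_delay, max_delay)
```

Tracking the mean and the maximum delay makes it possible to spot sources whose data is logged too late to be accurate or useful.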

In conclusion, big data is a fundamental asset for every company, characterized by the aspects we have highlighted with the 5 Vs plus the later additions that help explain its complexity.

Increasingly sought after for its value, this data needs technologies capable of dealing with all stages of the big data life cycle: acquisition (or data ingestion), storage and organization, transformation and analysis.

Analysis should not stop at the descriptive kind which, although important, looks to the past by observing what happened and measuring its consequences. The added value of big data comes from advanced analysis techniques projected into the future, i.e. predictive and prescriptive, which shed light on aspects that the company can anticipate to avoid risky or inconvenient choices.
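The difference between looking at the past and projecting into the future can be shown with a toy Python example. The monthly sales figures are invented, and a least-squares straight line is used here only as the simplest possible stand-in for predictive analysis; real predictive models are usually far richer.

```python
from statistics import mean

# Hypothetical monthly sales for four consecutive months.
monthly_sales = [100.0, 110.0, 120.0, 130.0]
months = list(range(len(monthly_sales)))  # 0, 1, 2, 3

# Descriptive analysis: summarize what already happened.
average_sales = mean(monthly_sales)

# Predictive analysis (toy version): fit a least-squares line and extrapolate.
x_bar = mean(months)
y_bar = mean(monthly_sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(months, monthly_sales)) / sum(
    (x - x_bar) ** 2 for x in months
)
intercept = y_bar - slope * x_bar
forecast_next = intercept + slope * len(monthly_sales)  # estimate for month 4

print(average_sales)
print(forecast_next)
```

The average describes the past four months, while the fitted line produces an estimate for a month that has not happened yet, which is the kind of forward-looking signal a company can act on.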

Big data technologies are able to analyze data in a fast, deep and granular way, and they are also much more flexible and affordable in terms of storage and software licenses.

The technologies companies need must also ensure data quality, governance and archiving, as well as the preparation of data for analysis. Some data can be stored locally in traditional data warehouses, while for other data much more convenient, flexible and lower-cost solutions will be adopted.

Examples include data archiving technologies such as data lakes, the open-source Hadoop and Spark frameworks for collecting, storing and processing large volumes of structured and unstructured data, and cloud computing platforms, which make big data processing simpler and cheaper because they are managed by the provider and billed on a pay-per-use basis, with no upfront costs.

Artificial intelligence algorithms are used to obtain useful information from data in the most diverse fields, from production to marketing to finance, to name just a few.