Monday, February 22, 2016

Big Data Basics

I had the pleasure of attending Old Dominion University's High Performance Computing Day last week, a symposium focused on the possibilities of research computing. At HPC Day, the keynote "Data in Modern Times" was given by Zachary Brown, and it was an excellent reminder of some Big Data foundations for all of us.

Summarized from my notes:

"Big Data" is a new construction. In the past [really tempted to say "ye olden times" here] SQL-driven relational databases were enough--users could derive whatever insight they needed from the database with a structured query, and call it a day.

These days, a number of factors, often known as the "Many V's," help define, or operationalize, what people mean by "Big Data." Zachary focused on three:

Velocity

New data comes in fast--often too fast to store it all without some processing to determine what's worth keeping (see the short sketch after these definitions).

Volume

There's so much data that it's unwieldy to handle with standard or legacy methods.

Variety

The data isn't neatly organized. It may be heavily text-based, it may include lots of different attributes, or it may otherwise not fit into nice boxes for analysis.
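
Returning to velocity for a moment: the keynote's point about processing data before storing it can be pictured with a tiny filter over a simulated stream. This is only a hedged sketch in Python--the event source, the "worth keeping" rule, and the threshold are all invented for illustration, not anything Zachary presented.

import random

def event_stream(n=1_000_000):
    """Stand-in for a firehose of incoming sensor readings."""
    for i in range(n):
        yield {"id": i, "value": random.random()}

def worth_keeping(event, threshold=0.99):
    """Keep only unusually high readings instead of storing everything."""
    return event["value"] > threshold

kept = [e for e in event_stream() if worth_keeping(e)]
print(f"stored {len(kept)} of 1,000,000 events")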

As a side note: as data science evolves, other organizations and pundits are suggesting additional V's. It's important to note that the original 3 V's were measures of magnitude, as discussed here. "Strategy V's" attempt to address other facets of the problem, such as Veracity (can the data be trusted, and to what level?), Value (why are we bothering to work with the big data anyway?), Variability (do we understand the context of the data?), and (my favorite) Visualization (how can we present the data in a way that carries meaning?). More on all 7 V's here.

The keynote continued by addressing ways to "counter" each of the three V's:


To address the needs of Big Data in…  →  We must be…
Velocity  →  Agile
Volume  →  Scalable
Variety  →  Flexible

Agility, with regard to Big Data, is far more about finding the right tools for the job than about "Agile" software development or other "Agile" methods. Embarking on big data analysis can be overwhelming, so preparing to "drink from the firehose" requires dexterity.

Scalability addresses volume by planning ahead for an architecture that can be extended as the data continues to grow. Many of the data tools available today work across computer clusters--multiple machines working together to run analyses. That type of architecture works well with a few machines, and just as well with dozens, depending on the needs of the researcher.
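
The keynote stayed tool-agnostic, but cluster frameworks such as Apache Spark illustrate the idea: the same analysis code runs on one laptop or on dozens of machines, and only the cluster address changes. Here's a minimal Python sketch, assuming PySpark is installed and a local corpus.txt file exists; the cluster URL in the comment is a hypothetical example.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("word-count-sketch")
    # "local[*]" uses all cores on one machine; pointing at a cluster manager,
    # e.g. "spark://head-node:7077" (hypothetical address), scales the same code out.
    .master("local[*]")
    .getOrCreate()
)

# Classic word count: read text, split into words, count occurrences across the cluster.
lines = spark.read.text("corpus.txt").rdd.map(lambda row: row[0])
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)

print(counts.take(10))  # a small sample of (word, count) pairs
spark.stop()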

Flexibility tackles variety by bringing a range of methods to bear on Big Data. Unstructured, text-based data? Use machine learning. Messy data? Tools like OpenRefine can help scrub it. Flexible methods let data scientists adjust their approach to the problem at hand rather than cleave to a one-size-fits-really-very-few method.
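
As one concrete (and hedged) illustration of the "messy data" case: the kind of clustering-and-merging clean-up that OpenRefine does interactively can also be sketched in pandas. The messy values and the canonical mapping below are invented for illustration.

import pandas as pd

df = pd.DataFrame({"state": ["Virginia", "virginia ", "VA", "Va.", "N. Carolina"]})

# Normalize case and whitespace, then map known variants onto one canonical spelling,
# which is the same sort of merge step OpenRefine's clustering feature automates.
canonical = {
    "virginia": "Virginia",
    "va": "Virginia",
    "va.": "Virginia",
    "n. carolina": "North Carolina",
}
df["state_clean"] = (
    df["state"].str.strip().str.lower().map(canonical).fillna(df["state"].str.strip())
)
print(df)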

All in all, the keynote at ODU's High Performance Computing Day served as a great reminder of the concerns surrounding Big Data, some key aspects of defining "Big Data" as a concept, and some strategic thinking for addressing its challenges. Next steps include putting these reminders into practice as we all handle Big (and smaller) Data in work and life!
