Edgar Furtado edgardfurtado

Edgar Furtado - Boston, MA

Big Data, with its staggering velocity, variety, and extreme volume, has grown into the discipline of Data Science.  When it comes to Data Science, 21st-century industries have to ensure that they can apply Artificial Intelligence to real-time processing.  Handling high volume is a matter of acquiring the right hardware and infrastructure.  Accommodating variety is more complicated and requires subject-matter expertise.  Organizations can acquire big data hardware and warehousing processes to get data into usable shape, and experts can turn variety into big success, but it can also be an opportunity for big failure.

With the introduction of Artificial Intelligence, data scientists now have tools built on 12 important families of algorithms: Regression, Instance-based, Regularization, Decision Tree, Bayesian, Clustering, Association Rule Learning, Artificial Neural Network, Deep Learning, Dimensionality Reduction, Ensemble, and Other Algorithms.

During the early stages of developing a data science warehouse, a lot of time and resources are spent analyzing various incremental levels of metadata sources, learning the methodologies involved, and working out how the data can be profiled on a dashboard.  At the end of all this, data scientists come up with a reporting model designed specifically for the enterprise in question.  Each of these process levels involves various metrics and decisions about which data will be included in the model and which will be excluded from the new warehouse.

Will this new warehousing process answer questions on what needs to be archived or excluded from a reporting system?  The job of a data lake is that of a reservoir: it holds petabytes of data, not just data that can be used at a given reporting time but also data that will be used in the future.  There is potential for a data science warehouse to become a revenue-generating stream through future analysis of whatever comes out of it.

Will it be possible to sustain data lake hardware, which is far different from that of a data warehouse?  Will it be cheaper to buy commodity, off-the-shelf servers combined with cheap storage, making scaling from terabytes to petabytes fairly economical?  For low-budget organizations, cheaper brands will meet the need until they can prove to management that Big Data deserves a line in the operating budget.

Data Science within Industry

A lot of traditional data extraction takes place from various systems, with enormous quantitative metrics attached to each transactional event.  Traditional data warehouses, however, are more than likely unable to support non-traditional data sources such as social network activity, Twitter text, web server logs, and other sensor-related data.  All of these fall in a different domain that a warehouse cannot support, which is where a data lake comes in.  Not being able to harvest non-traditional data sources can put a strain on an enterprise reporting model.

Data Science Challenges?

Making a quick buck out of Big Data may not be that easy.  Is there value in it?  The Big Data hype has put businesses on notice to aggressively hire and retain big data analytics professionals who can handle implementation and management.

How far are we from a full-cycle implementation?  It also appears that Big Data solutions are easy to implement and reduce time to value.  This new Data Science industry has leaders that are able to offer an ideal solution for handling structured, unstructured, and semi-structured data.  It is also worth looking at other off-the-shelf options, or JBOS (just a bunch of servers).  Can they do the trick?

Data scientists who use tools from the Hadoop ecosystem, such as Hive, Pig, and MapReduce, to explore data and investigate relationships are looking for patterns and trends in data of all sizes, from megabytes to petabytes.
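
To make the MapReduce idea concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the function names and the sample log lines are made up for illustration, and a real job would run the same map, shuffle, and reduce steps across many machines rather than in a single process.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one log line."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over a list of text records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical log lines, used only to show the pattern.
records = ["error disk full", "warning disk slow", "error network down"]
counts = word_count(records)
print(counts)
```

Counting how often a term shows up is about the simplest "pattern" one can look for, but the same three steps apply to larger questions such as sessions per user or failures per device.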

Let us look at the benefits.

Every organization is looking for proof of value in the form of return on investment (POV = ROI).  Knowing that “Big Data” can provide value to business entities by showing them a return on investment will give management and directors more reason to approve its operating costs in the boardroom.
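
As a rough illustration of the arithmetic behind an ROI figure, the sketch below uses the common definition ROI = (gain - cost) / cost. The dollar amounts are hypothetical and are not taken from any real project.

```python
def roi(gain, cost):
    """Return on investment as a fraction: (gain - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (gain - cost) / cost


# Hypothetical figures: a 200,000 spend that returns 260,000 in measurable value.
project_cost = 200_000
project_gain = 260_000
project_roi = roi(project_gain, project_cost)
print(f"ROI: {project_roi:.0%}")
```

The hard part in practice is agreeing on what counts as "gain"; the formula itself is simple.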

Senior management also expects insights that increase value: converting prospects, reducing churn, upselling more, improving customer experience, and raising marketing efficiency.  All of these factors result in tangible benefits such as exponential revenue streams, efficiency, and loyalty.  These are many benefits for return on investment.

At an Enterprise level, businesses should also discuss and expect increases in IT and end-user productivity.  This translates into benefits!  Big Data research polls have also indicated that organizations have documented (with independent research firms) that as many as 20% of employees (IT and business) benefit directly from increased productivity, thanks to insights that can be quickly generated and implemented.  That’s another benefit.

At an Enterprise level, businesses with tight budgets can look into offerings to determine whether options like pre-built functions or applications, or industry-knowledgeable professional services, are readily available and affordable.  This is another benefit for low-budget Enterprises.

On the other hand, Enterprises have considered Hadoop for many reasons, such as low cost, scalability, and flexibility.  The Hadoop Distributed File System (HDFS) accepts files of any type and format, unlike traditional data warehouses, which require a schema up front.  With this flexibility, HDFS lends itself to a potentially revolutionary use case known as the data lake.  In a Data Lake, enterprises use HDFS to store and process previously unused data and combine legacy data in new ways.
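
The difference between requiring a schema up front and accepting files of any format can be sketched in a few lines of Python. This is only an in-memory illustration of the "schema on read" idea, not HDFS code; the `raw_lake` list, the field names, and the sample records are all assumptions made for the example.

```python
import csv
import io
import json

# Raw payloads land in the "lake" untouched, in whatever format they arrived in.
# (Hypothetical sample data; a real lake would hold files, not strings.)
raw_lake = [
    ("json", '{"user": "ana", "amount": "12.50"}'),
    ("csv", "user,amount\nbob,7.25\n"),
]


def read_with_schema(fmt, payload):
    """Apply the schema only at read time, yielding (user, amount) rows."""
    if fmt == "json":
        rows = [json.loads(payload)]
    elif fmt == "csv":
        rows = list(csv.DictReader(io.StringIO(payload)))
    else:
        raise ValueError(f"unsupported format: {fmt}")
    for row in rows:
        yield row["user"], float(row["amount"])


def read_lake(lake):
    """Read every stored payload through the same read-time schema."""
    return [row for fmt, payload in lake for row in read_with_schema(fmt, payload)]


rows = read_lake(raw_lake)
print(rows)
```

Nothing was rejected or reshaped when the data was stored; the structure is imposed only when someone asks a question of it, which is what lets previously unused data be combined later in new ways.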

This process of discovery, preparation, analysis, and reporting has become the workflow of data science.  In a Data Lake environment, Data Analysts can study log files and geolocation data, social media feeds, and sensor data.  They can also crunch through neat tabular data, completely unstructured text, and everything in between.

Extracting data in repeated phases can yield meaningful statistical and descriptive analyses, predictive models, and visualizations for internal and external audiences.

Ultimately, Data Lakes and Big Data Science form the basis for data-driven, company-wide decisions, which is where organizations will benefit most.


Selected References:

http://www.emc.com/big-data

http://pivotal.io/big-data/hadoop/press-release/data-lake-apache-hadoop

https://www.gesoftware.com/industrial-data-lake

http://www.kdnuggets.com/2014/06/data-lakes-vs-data-warehouses.html

https://www.healthcatalyst.com/data-lake-vs-data-warehouse-right-for-healthcare

http://www.analyticsengines.com/developer_blog/data-lake-vs-data-warehouse/

http://www.networkcomputing.com/storage/big-data-for-it-operations-data-lakes-or-data-warehouse/a/d-id/1320313

                                                                  

Abhi Nandan AbhizNandan
Nice information related to data science and industry 4.0. Thanks for sharing!