PRO+ Premium Content/E-Books

April 2016

Tools to tackle big data problems

Storage for big data often consists of scale-out NAS or object storage, and many look to commodity hardware as a cost-effective way of capturing petabytes of information. One of the most challenging big data problems is that storage systems must not only hold massive quantities of data, but also perform well enough to enable real-time analysis. Big data analytics often requires processes and people with specific skill sets, but there are software tools for analytics disciplines such as predictive analytics, data mining, text analytics and statistical analysis.

Because big data can scale to petabytes of capacity, organizations are looking for ways to manage it all that are easier and less expensive than traditional scale-out NAS. Object storage and software-defined storage are frequently mentioned as tools that can help remedy big data problems. Both can add the intelligence required for analyzing data and take advantage of low-cost disk storage.
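The "intelligence" object storage adds comes largely from two things: objects are retrieved by a unique identifier rather than a file path, and each object carries rich metadata that can drive lookups. A minimal sketch of that model is below; every class and method name here is illustrative, not taken from any real object-store API.

```python
import uuid

# Minimal sketch of the object storage model: each object gets a unique
# identifier and arbitrary metadata, and is retrieved by ID alone -- the
# caller never needs to know where the bytes physically live.
class ObjectStore:
    def __init__(self):
        self._objects = {}  # object ID -> (data, metadata)

    def put(self, data: bytes, **metadata) -> str:
        object_id = str(uuid.uuid4())        # unique identifier, not a path
        self._objects[object_id] = (data, metadata)
        return object_id

    def get(self, object_id: str) -> bytes:
        data, _ = self._objects[object_id]
        return data

    def search(self, **criteria):
        # Metadata-driven lookup -- the kind of query a plain file system
        # cannot answer without scanning directory trees.
        return [oid for oid, (_, md) in self._objects.items()
                if all(md.get(k) == v for k, v in criteria.items())]

store = ObjectStore()
oid = store.put(b"sensor reading 42", source="sensor", region="eu-west")
```

Because the identifier is location-independent, a real implementation can place replicas of the object anywhere, including across geographic sites, without changing how callers retrieve it.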

Data lakes can help manage those big data problems, but here is what you need to know before making the leap. Data lakes are strongly associated with Hadoop, and use the open source software as a replacement for traditional data warehouses. Hadoop clusters are based on commodity hardware and can hold structured, unstructured and semi-structured data. This makes Hadoop a good choice for log files, web clickstreams, sensor data, social media posts and other applications that produce big data, but there are drawbacks to keep in mind.

CHAPTERS AVAILABLE FOR FREE ACCESS

  • Cataloging the drawbacks to Hadoop data analysis

    Data is growing at record rates, with no signs of slowing. But what good is having petabytes of data if you can’t gain business advantage from it? Accurate analysis can deliver significant business results, but it requires the right tools and techniques. Effective data analytics requires strategies for storing and managing large volumes of structured and unstructured data, and a method of analyzing it to unlock business value.

    Storage for big data often consists of scale-out NAS or object storage, and many look to commodity hardware as a cost-effective way of capturing petabytes of information. Big data storage systems must not only be capable of holding large quantities of data but they also must perform well enough to enable real-time analysis. Bandwidth and response times are critical factors, and other aspects such as the cloud and Hadoop may play a role depending on the type of data being stored and analyzed.

    Indeed, there are positives and negatives to Hadoop data analysis, but the fact is, there is no magic bullet software to handle big data analytics. It often requires processes and people with specific skill sets, and often tools beyond standard business intelligence and analytics applications. However, there are software tools for analytics disciplines such as predictive analytics, data mining, text analytics and statistical analysis. And for unstructured and semi-structured data that doesn’t fit well in traditional relational databases, Hadoop and other related technologies are gaining in popularity.

    Take a closer look at Hadoop data analysis, particularly at enterprise concerns. You’ll come away with a better understanding of the Hadoop Distributed File System and the role it plays in Hadoop data analysis.

    Download
  • Hadoop Distributed File System options for big data

    Because big data can scale to petabytes of capacity, organizations are looking to manage it in ways that are easier and less expensive than traditional scale-out NAS. Object storage and software-defined storage are frequently mentioned as big data tools. Both can add intelligence required for analyzing data and take advantage of low-cost disk storage.

    An object storage system handles files differently than a traditional file system. Servers use unique identifiers to find objects, which use metadata in a far more detailed way than file systems do. The unique identifiers mean objects can be geographically dispersed because they can be retrieved without the storage system knowing their physical location. That makes objects a good choice for large data stores or data stored in a cloud.

    Software-defined storage has many forms and use cases, but it applies to big data when used to pool and manage data across off-the-shelf commodity hardware. Because the management and analytics happen in software appliances, the hardware can be cheap, deep disk without bells and whistles.

    Perhaps the best-known option available is the Apache Hadoop Distributed File System (HDFS), a Java-based file system designed for use in Hadoop clusters. HDFS currently scales to 200 petabytes and can support single Hadoop clusters of 4,000 nodes. It combines high performance, large scale and low cost, a trio that most enterprise arrays cannot deliver simultaneously.

    In this chapter of "Tools to Tackle Big Data Troubles," we look at some core HDFS features, three HDFS commercial distributions and other Hadoop storage-related tools and their related applications.

    Download
  • Hadoop alternatives now offer data center-grade storage

    Data is growing at record rates with no signs of slowing. But what good is having petabytes of data if you can't gain business advantage from it? Accurate analysis can deliver significant business results, but it requires the right tools and techniques. Effective data analytics requires strategies for storing and managing large volumes of structured and unstructured data, and a method of analyzing it to unlock business value.

    Data lakes are strongly associated with Hadoop and use the open source software as a replacement for traditional data warehouses. Hadoop clusters are based on commodity hardware and can hold structured, unstructured and semi-structured data. This makes Hadoop a good choice for log files, web clickstreams, sensor data, social media posts and other types of applications that produce big data. Until recently, Hadoop alternatives were few and far between.

    Still, Hadoop implementations that are not well planned can produce data swamps instead of lakes. Hadoop was not developed to run on shared storage, and storage vendors must tweak their arrays to support the Hadoop Distributed File System, fostering the rise of Hadoop alternatives. Also, Hadoop does not have data governance built in as many data warehouse tools do, allowing Hadoop alternatives to bridge the gap.

    Download
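The HDFS design mentioned in the chapter descriptions above rests on a simple idea: files are split into large fixed-size blocks, and each block is replicated across several DataNodes built from commodity hardware. The sketch below illustrates that layout; the block size and replication factor mirror common HDFS defaults, the round-robin placement is a simplification (real HDFS placement is rack-aware), and the node names are made up.

```python
# Sketch of how HDFS lays out a file: split into fixed-size blocks, each
# replicated across several DataNodes.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default
REPLICATION = 3                  # default HDFS replication factor

def plan_blocks(file_size: int, datanodes: list) -> list:
    """Return a placement plan: one entry per block, each listing the
    REPLICATION DataNodes that hold a copy of that block."""
    n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
    plan = []
    for b in range(n_blocks):
        # Simple round-robin placement; real HDFS also considers rack
        # topology and free space on each node.
        replicas = [datanodes[(b + r) % len(datanodes)]
                    for r in range(REPLICATION)]
        plan.append(replicas)
    return plan

nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = plan_blocks(300 * 1024 * 1024, nodes)  # 300 MB file -> 3 blocks
```

Losing any single node leaves at least two copies of every block intact, which is why HDFS can treat cheap commodity disk failures as routine rather than catastrophic.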
