Hadoop and the future of the data warehouse
How important is Hadoop in the latest generation of data warehouses?
Bigger, faster, cheaper: that is the promise of distributed data processing infrastructures such as Hadoop or MongoDB. Typical applications include
- sentiment analysis,
- recommendation engines, and
- real-time personalization in e-commerce.
How can Hadoop save costs and create value in the traditional business intelligence (BI) and data warehouse (DWH) environment?
Myths about Hadoop
First, it is important to know what Hadoop really is and what it isn't. Anyone who talks about Hadoop usually means a whole portfolio of solutions. The core consists of the distributed file system HDFS and the processing framework YARN, which succeeded the original MapReduce engine. On top of that sits a bewildering variety of tools, such as
- Hive: a module for SQL-like access to structured data
- Pig: a high-level language for creating distributed computing jobs
- HBase: a column store in the style of Google BigTable
- Mahout: a toolkit for data mining
And much more. Technology consultants can help you find your way through this jungle and pick the right modules for your needs.
Hadoop is, so to speak, the Linux of data processing. It is not a single tool but a whole platform, around which an ecosystem of tools has evolved. Distributions from vendors such as Hortonworks, Cloudera or MapR make this platform usable in the enterprise.
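The MapReduce model at the heart of the platform can be sketched in a few lines. The following is a minimal, local simulation of a word-count job, assuming nothing beyond the standard library: a mapper emits key-value pairs, a shuffle phase groups them by key, and a reducer aggregates each group. On a real cluster, Hadoop itself performs the shuffle and runs the two functions in parallel across nodes (for example via Hadoop Streaming); all names here are illustrative.

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word in one input line.
    for word in line.split():
        yield word.lower(), 1

def reducer(key, values):
    # Aggregate all counts collected for one key.
    return key, sum(values)

def run_job(lines):
    # Shuffle phase: group mapper output by key.
    # On a real cluster the framework does this between map and reduce.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

counts = run_job(["Hadoop scales out", "Hadoop stores raw data"])
print(counts["hadoop"])  # both input lines mention Hadoop once -> 2
```

The same division of labor — stateless per-record mapping, framework-managed grouping, per-key reduction — is what lets the job scale out simply by adding nodes.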
Hadoop's key strength
What is the key to Hadoop's success? The answer is simple: data storage in a cluster of commodity hardware scales with the needs of the application. As soon as
- particularly large amounts of data or
- unstructured data
need to be processed, the solution shows its advantages. This is especially true if a later increase in data volume or complexity is built into the business model.
Case 1: Hadoop as a staging area in the ETL process
In the staging area of an ETL process, large amounts of data have to be stored temporarily and processed quickly. At the same time, the requirements placed on a staging area differ from those placed on an operational database. Current data warehouses often use a classic relational database for staging. That is expensive, and in the worst case additional costs for complex extraction steps or bottlenecks in processing capacity inhibit the productivity and creativity of the analysts. Using Hadoop instead yields cheaper storage, lets large amounts of data be processed, and makes complex questions tractable.
In particular, your own data can be enriched with external influencing factors such as press coverage, weather or trends.
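A common staging layout keeps the raw delivery files untouched and merely partitions them, typically by load date, so that later ETL steps can cheaply select only the slice they need. The sketch below simulates such a layout with an in-memory dictionary; the path scheme and field names are invented for illustration, and a real system would write the partitions to HDFS instead.

```python
from collections import defaultdict

def stage(records):
    # Assign each raw record to a partition path such as
    # staging/sales/dt=2014-05-01 (a typical date-partitioned layout).
    partitions = defaultdict(list)
    for rec in records:
        path = f"staging/sales/dt={rec['load_date']}"
        partitions[path].append(rec)
    return partitions

raw = [
    {"load_date": "2014-05-01", "amount": 10},
    {"load_date": "2014-05-01", "amount": 20},
    {"load_date": "2014-05-02", "amount": 5},
]
staged = stage(raw)
print(sorted(staged))  # one partition per load date
```

Because the raw records are stored unchanged, any later extraction step can be re-run against the original data at any time.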
Case 2: Hadoop as an ELT worker
The use of Hadoop as an ELT worker is similar to its use as a cost-effective staging area, but additionally leverages the improved processing capacity: the transformations run as an ELT process, quasi in-database, on the Hadoop nodes instead of in a separate ETL tool. This makes computationally intensive analyses feasible that were previously out of reach, for example questions about complicated correlations and links in the data.
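The difference to classic ETL is where the transformation runs. In practice the step below would be a Hive or Pig job executing on the cluster that already holds the staged data; here a plain Python group-by stands in for that job, with invented field names.

```python
from collections import defaultdict

def revenue_per_customer(rows):
    # Aggregate next to the data instead of shipping every row
    # to a separate ETL server first.
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)

staged = [
    {"customer": "A", "amount": 10.0},
    {"customer": "A", "amount": 5.0},
    {"customer": "B", "amount": 7.5},
]
print(revenue_per_customer(staged))  # {'A': 15.0, 'B': 7.5}
```

Only the small aggregated result leaves the cluster and is loaded into the data warehouse, which is what makes such computations affordable at large data volumes.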
Case 3: Hadoop preprocessing unstructured data
Hadoop plays to its strengths when it comes to analyzing unstructured data. The reason is the distributed infrastructure, which enables parallel processing on multiple nodes. Complexity is managed through a cost-effective hardware scale-out (more servers) instead of an expensive hardware scale-up (larger servers). A software scale-up, for example higher license costs for volume licenses, is avoided as well.
An analysis of unstructured data is usually necessary, especially when it comes to enriching business data. This can be, for example, the preparation of the company's own content such as documentation or e-mail communication. Further examples would be portal logs, press feeds or machine data. The Hadoop cluster enables this data to be preprocessed and made available for use in the data warehouse.
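Preprocessing here mostly means turning semi-structured or unstructured input into records the data warehouse can load. As one sketch, the parser below handles lines in the common Apache access-log layout; the field selection is illustrative, and lines that do not match are skipped rather than aborting the batch, which matters when a cluster chews through millions of lines.

```python
import re

# Pattern for the common Apache access-log prefix:
# ip ident user [timestamp] "METHOD path protocol" status
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+" (?P<status>\d{3})'
)

def parse(lines):
    # Yield one structured record per parseable log line.
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            yield {**m.groupdict(), "status": int(m.group("status"))}

sample = '127.0.0.1 - - [10/Oct/2014:13:55:36 +0200] "GET /index.html HTTP/1.1" 200'
record = next(parse([sample]))
print(record["path"], record["status"])  # /index.html 200
```

In a Hadoop job, each node would run this parsing over its local block of the log files, and only the structured output would flow onward into the DWH.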
Case 4: Hadoop as a database for fine-grained data
Use cases with a particularly high volume of data have so far often been blocked when using classic database technology, because the incoming data rates simply cannot be stored quickly enough, let alone analyzed. Examples are the searchable storage of large document collections and the filing of server or machine logs.
In these cases, Apache HBase shines. It is a schema-free key-value or wide-column store from the Hadoop environment that follows the Google BigTable paradigm. The great advantage of this solution is that enormous amounts of data can be stored in a redundant and fail-safe manner in a short time. The form of storage also enables high-performance analysis of the data afterwards. If real-time analyses are required, the data can additionally be processed live by a stream processor such as Apache Storm or S4.
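What makes such a store fast is less the API than the row-key design: HBase keeps rows sorted by key, so choosing keys well turns common queries into short range scans. A frequently used pattern for machine logs is "host plus inverted timestamp", so the newest entries per host sort first; the sketch below illustrates only that key scheme, with an invented upper bound, not the HBase client itself.

```python
MAX_TS = 10**10  # any bound larger than every expected epoch timestamp

def row_key(host, epoch_seconds):
    # Inverting the timestamp makes lexicographic order = newest-first,
    # so a scan starting at "host:" sees recent log entries immediately.
    return f"{host}:{MAX_TS - epoch_seconds:010d}"

older = row_key("web01", 1400000000)
newer = row_key("web01", 1400000100)
# The later event yields the lexicographically smaller key:
print(newer < older)  # True
```

With such a layout, "the last n log entries of host X" becomes a scan over a handful of adjacent rows instead of a full-table query.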
Case 5: Hadoop as a long-term active archive for raw data
Classic database solutions have one problem: they are expensive for large amounts of data. Many companies therefore archive their old data offline, either on tape or on disk. In the worst case, the data is even deleted.
Those who act in this way are wasting great potential. One way to realize this potential is to use Hadoop as an active archive. Thanks to commodity hardware, the costs remain manageable and the data can be accessed at any time. The possible applications have enormous business value:
- Batch analyses of the entire database since the beginning of data recording
- ETL processing of subsets of the data in data marts
- Interactive exploration of the raw data
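The first of these uses can be sketched concisely: because the raw history stays online, a batch job can simply iterate over every load partition since recording began. The partition naming and record shape below are invented for illustration; on a real cluster this loop would be a MapReduce or Hive job over the archive directories.

```python
# Illustrative archive: one partition per load period, oldest first.
archive = {
    "dt=2012-01": [{"amount": 100}],
    "dt=2013-01": [{"amount": 150}],
    "dt=2014-01": [{"amount": 250}],
}

def total_since(archive, first_partition):
    # Scan every partition from the given one onward; the partition
    # names are chosen to sort chronologically as strings.
    return sum(
        rec["amount"]
        for part, recs in archive.items()
        if part >= first_partition
        for rec in recs
    )

print(total_since(archive, "dt=2012-01"))  # full history: 500
```

The same scan restricted to a recent partition range would feed an ETL step into a data mart, the second use listed above.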
The exploratory analysis of raw data is still a niche application that requires in-depth technical know-how. All other analyses require no additional qualification from the business user: the data is available in the familiar form, only far more detailed and with a long-term view of the past. The final expansion stage of such a data store is the vision of a universal enterprise data hub, a kind of bus on which all of the company's information converges, is automatically processed and stored for the future. All of this is already possible with today's technology.
Hadoop does not eliminate the need for data warehouses or relational databases. Its current role lies more in the technical and business-oriented preparation of data, which is then made available in the data warehouse. It thus supplements the DWH and enhances it for the business user.
Contact: Dipl.-Phys. Johannes Knauf, Consultant BI, Ancud IT-Beratung GmbH, [email protected]