Finally started to play with big data and Pentaho, in my specific case on Cloudera CDH3u4. At Mozilla we have a few clusters of over 80 machines that we're using to back up a bunch of services.
Debugging MapReduce tasks
It took me a while to get my head around how Kettle integrates with MapReduce tasks. Once I did, the first thing I noticed is how hard it is to know what's actually happening. Until Matt Casters and friends get the chance to implement PDI-9148, we need to do things manually, as in inspecting logs, etc.
My first approach was writing to text files. I tested direct output to HDFS, but for some reason it didn't work. Writing directly to the local file system means the output gets spread across all the cluster nodes. This approach generally sucks.
I also thought about adding some hand-made logic in a JavaScript step, but then looked at the Write To Log step. This step works, but it has a major flaw: there is no way to limit its output. If we have millions of rows, we'll end up with a huge log, and that's not good.
An improved Write To Log step
If it's not there, just do it yourself; the code is open. So I did. I added the ability to specify a limit on the step's output, which is very useful for inspecting what the dataset looks like inside a map or reduce task. Once I deployed this change to my cluster, this is what my TaskTracker log looks like (running this with the previous Write To Log version ended up crashing my browser and generating almost half a gigabyte of log files). It shows the first 5 rows of our dataset, with their keys and values:
I'll work with the Kettle team to get this into the main code line; hopefully it will make it into 4.4.1 and 5.0. This is PDI-9195.
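To make the idea concrete, here's a minimal, hypothetical sketch of the row-limit logic. The class and field names (RowLimitedLogger, logLimit, maybeLog) are made up for illustration and are not the actual Kettle Write To Log code; the behaviour is the point: log rows until the configured limit is hit, then stay quiet while rows keep flowing downstream.

    // Hypothetical, self-contained sketch of the row-limit idea.
    // The real change lives in Kettle's Write To Log step; names here are illustrative.
    public class RowLimitedLogger {

        private final long logLimit;   // 0 means "no limit", mirroring the new step option
        private long rowsLogged = 0;

        public RowLimitedLogger(long logLimit) {
            this.logLimit = logLimit;
        }

        // Write the row to the log only while we are under the configured limit;
        // rows above the limit are not logged, but nothing stops them downstream.
        public void maybeLog(Object[] row) {
            if (logLimit > 0 && rowsLogged >= logLimit) {
                return;
            }
            StringBuilder line = new StringBuilder();
            for (Object field : row) {
                line.append(field).append('\t');
            }
            System.out.println(line.toString().trim());
            rowsLogged++;
        }

        public static void main(String[] args) {
            RowLimitedLogger logger = new RowLimitedLogger(5); // keep only the first 5 rows
            for (int i = 0; i < 1_000_000; i++) {
                logger.maybeLog(new Object[] { "key-" + i, "value-" + i });
            }
        }
    }

With the limit set to 5 this prints five lines instead of a million, which is the difference between a readable TaskTracker log and the half a gigabyte of log files I mentioned above.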