So now you can tell pdi to capture that runtime information (core CE capability) as well as consume it through the new MetaIntegration bridge (EE) so that you can visualize it alongside with your broader lineage context
And how to enable that? Here's all the info you need:
The first section comes directly from help.pentaho.com's lineage page
Setting up Data Lineage
Pentaho now offers you the ability to visualize the end-to-end flow of your data across PDI transformations and jobs, providing you with valuable insights to help you maintain meaningful data. This ability to track your data from source systems to target applications allows you take advantage of third-party tools, such as Meta Integration Technology (MITI) and yEd, to track and view specific data.
Once lineage tracking is enabled, PDI will generate a .graphml file every time you run a transformation. You can then open this file using a third-party tool, such as yEd, to view a tree diagram of the data. By parsing through and teasing out the different parts of the graph, you can gain an end-to-end view into a specific element of data from origin to target. This ability can aid you in both data lineage and impact analysis:
- Data lineage provides the ability to discover the origins of an element of data and describes the sequence of jobs and transformations which have occurred up to the point of the request for the lineage information.
- Impact analysis is the reverse flow of information which can be used to trace the use and consumption of a data item, typically for the purpose of managing change or assessing and auditing access.
Sample Use Cases
Data lineage and impact analysis can be applicable in several ways.
As an ETL Developer:
- There are changes in my source system, such as fields which are added, deleted and renamed. What parts of my ETL processes need to adapt? (Impact Analysis)
- I need additional information in my target system, such as for reports. What sources are can provide this additional information? (Data Lineage)
As a Data Steward:
- There is a need for auditability and transparency to determine where data is coming from. A global, company-wide, metadata repository needs data lineage information from different systems and applications, i.e. very fine-grained metadata.
- What elements (fields, tables, etc.) in my ETL processes are never used? How many times is a specific element used in some or all of my ETL processes?
As a Report/Business User:
- Is my data accurate?
- I want to find reports which include specific information from a source, such as a field. This process is "data discovery." For example, are there any data sources which include sales and gender? Are there any reports which include sales and zip codes?
As a Troubleshooting Operator:
- The numbers in the report are wrong (or supposed to be wrong). What processes (transformations, jobs) are involved to help me determine where these numbers are coming from?
- A job or transformation did not finish successfully. What target tables and fields are affected which are used in the reports?
As an Administrator:
- For documentation and auditing purposes, I want to have a report on external sources and target fields, tables, and databases of my ETL processes. I need the data for a specific date and version.
- To ensure compliance, I want to validate naming conventions of artifacts (fields, tables, etc.)
- For integration into third-party data lineage tools, I want a flexible way of exporting the collected data lineage information.
Architecture
Pentaho's data lineage capabilities allow us to take advantage of tools from Metadata Integration Technology (MITI). If you use a lot of different systems and applications, you can track and visualize specific data across these systems using Pentaho' lineage capabilities and third party tools such as MITI and yEd.

Setup
Modify …\system\karaf\etc\pentaho.metaverse.cfg (Client & DI-Server when needed):
- You need to enable lineage explicitly by setting lineage.execution.runtime = on
- Modify the default folder for lineage GraphML files accordingly: lineage.execution.output.folder=./pentaho-lineage-output
- Set lineage.execution.generation.strategy=latest (by default)
After the execution of a job or transformation, the GraphML files are generated in the defined folder.
API
It is also possible to access the GraphML information via a DI-Server API. There are REST endpoints available to retrieve the lineage related artifacts.
Below are some example curl commands which exercise the REST endpoints available on the DI Server. These calls use basic authentication. For more information on the various ways to authenticate with a BA or DI server, see the "Authenticate with the Server Before Making Service Calls" topic on the Pentaho Documentation site.
For more detailed information about the REST endpoints available, you can got to the Pentaho Wiki to view the attached Enunciate file.
Get all lineage related artifacts
curl --header "Authorization: Basic YWRtaW46cGFzc3dvcmQ="http://localhost:9080/pentaho-di/osgi/cxf/lineage/api/download/all
Get all lineage from a given date forward
curl --header "Authorization: Basic YWRtaW46cGFzc3dvcmQ="http://localhost:9080/pentaho-di/osgi/cxf/lineage/api/download/all/20150706
Get all lineage between 2 dates
curl --header "Authorization: Basic YWRtaW46cGFzc3dvcmQ="http://localhost:9080/pentaho-di/osgi/cxf/lineage/api/download/all/20150101/20150706
Get all of the lineage artifacts for a specific file in the DI repo
curl --request POST --header "Content-Type: application/json" --header "Authorization: Basic YWRtaW46cGFzc3dvcmQ=" --data '{"path": "/LOCAL DI REPO/home/admin/dataGrid-dummy"}'http://localhost:9080/pentaho-di/osgi/cxf/lineage/api/download/file
Get all lineage related artifacts for a specific file in the DI repo between 2 dates
curl --request POST --header "Content-Type: application/json" --header "Authorization: Basic YWRtaW46cGFzc3dvcmQ=" --data '{"path": "/LOCAL DI REPO/home/admin/dataGrid-dummy"}'http://localhost:9080/pentaho-di/osgi/cxf/lineage/api/download/file/20150701/20150707
Invalid date request
curl --header "Authorization: Basic YWRtaW46cGFzc3dvcmQ="http://localhost:9080/pentaho-di/osgi/cxf/lineage/api/download/all/20159999
Setting up yEd to consume the generated data
The above tutorial shows how to generate the data. Next is the consuming part. EE customers can leverage the MetaIntegration bridge, but everyone can also use yEd to read that info.There are some steps and configurations needed to do that, though - or else you'll be puzzled to just see a single box, which is clearly... hum... less than ideal :p
The .cnfx files in this folder are property mapper configurations that enable consistent rendering of Pentaho metaverse graphs.
To install the properties configuration:
- Download this .cnfx configuration file
- Open yEd
- Navigate to the Edit... menu, Properties Mapper ... 3a. If you are just trying to update existing configurations, remove the old ones first. Then...
- Click the import icon on the top left panel of the dialog to import the properties configuration to your yEd installation.
- Open your .graphml file in yEd.
- Navigate to the Edit... menu, Properties Mapper ...
- Select the configuration on the left, and click the Apply button at the bottom right of the dialog to apply the selected configuration formatting. You will need to do this once for each configuration you would like to apply.
Taking it further
In order for this lineage information to be better and better, we encourage the plugin contributors to also contemplate the lineage information on their submissionsCheers
-pedro