Working on projects that are developed, verified, tested and deployed across different environments can be a complex task. In addition, sharing details of production systems with too many people in an organisation is never a good idea.
In this article I will try to outline my suggestions on how we can manage the details used in an IT Project that uses Talend Data Integration as its ETL tool. As a result of adopting these suggestions, it should be possible to deliver a solution into production without disclosing any production details to developers.
Talend's solution is to implement different environment 'names' within the same context, or 'configured contexts' as they call them; each of these has a different set of values for the variables. Unfortunately, this methodology implies that context information is embedded into jobs. In my view, this makes long-term maintenance difficult and is very error prone. When following the documentation to create a context that supports multiple environments, the result will be similar to the figure below:
fig 1: Talend's default context creation for multiple environments
A context can be created with one set of values for each 'configured context' (environment); the figure above shows the 'configure context' window. This makes it possible to run a job with a different set of values by choosing the appropriate 'configured context'.
The negative side of this approach is that all variable values are contained within your jobs, and when you export your jobs they are also added to the exported archives. So your carefully managed login details end up in files that are stored in multiple locations, and at the very least your development team knows all these details, because they had to create the contexts in the job.
On the maintenance front, if it is necessary to change the value of a variable such as a directory, it is also necessary to change the context, update all the jobs that have been exported using that context, and then re-export them. This can become a tedious maintenance task.
Maintenance of context values becomes more of an issue when we consider that a context can be used by different projects. Now maintenance has to take into account the impact across multiple projects as well.
Instead of using contexts embedded within jobs (the Talend way described above), we can store context values in flat files, databases or another form of storage, thus externalising the values of variables. This approach has multiple advantages:
- simpler storage model, which can be modified without any impact on job logic
- single location for shared contexts
- no need to re-deploy jobs that use edited contexts
- sensitive details are kept guarded from everyone, even from the development team
My suggested approach requires an entry point on your favoured OS that provides the location of a context file containing the variable values for the current ETL environment. Whatever your choice of OS, this file has to contain the details that enable the retrieval of all other context values. In this illustration, our file contains the details of a database connection, and a table in that database maintains all variable values for that environment.
This methodology has been successfully applied to Talend in multiple projects, and it has also worked successfully on multiple projects using SSIS (SQL Server Integration Services). As a preference, I created an environment variable that contains the path to a simple configuration file holding the details needed to populate a MariaDB database connection. This database has tables that contain all other context information, as well as other information used during ETL processes. For our purposes, the environment variable can be created on Linux/Debian in /etc/profile or /init/rc.local.
The value of the path can be retrieved with standard Java methods, independently of the operating system. The job can therefore locate the file by reading the environment variable (for example with System.getenv) and then open it like any other file.
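As a minimal sketch of this step, the Java below resolves the file location from an environment variable and reads the connection details from it. The variable name ETL_CONTEXT_FILE, the class name and the key=value file layout are assumptions made for illustration only; they are not taken from the original job.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Properties;

class ContextFileLocator {

    // Hypothetical variable name; use whatever name your environment defines.
    static final String ENV_VAR = "ETL_CONTEXT_FILE";

    /** Resolves the context file path from a map of environment variables. */
    static Path resolveContextFile(Map<String, String> env) {
        String location = env.get(ENV_VAR);
        if (location == null || location.trim().isEmpty()) {
            throw new IllegalStateException("Environment variable " + ENV_VAR + " is not set");
        }
        return Paths.get(location.trim());
    }

    /** Reads key=value pairs (assumed layout, e.g. host, port, user) from the file. */
    static Properties loadConnectionDetails(Path file) throws IOException {
        Properties details = new Properties();
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            details.load(reader);
        }
        return details;
    }

    public static void main(String[] args) throws IOException {
        // System.getenv() returns the variables of the current process on any OS.
        Path file = resolveContextFile(System.getenv());
        Properties details = loadConnectionDetails(file);
        // Print only the key names so that no secret ends up in a log.
        System.out.println("Loaded keys: " + details.stringPropertyNames());
    }
}
```

Passing the environment in as a map keeps the lookup easy to test, and failing fast when the variable is missing avoids a job silently running with empty connection details.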
This could be implemented in a Talend job as suggested here:
fig 2: Talend context loading from Operating System variable
Once we have a valid connection to the database containing all other context values, we can load them. The default structure of a Talend context is a set of key and value pairs.
This structure will be expanded with the addition of a context filter, which will be used to return only one of the contexts at a time.
So our database has a table for storing contexts, similar to this:
CREATE TABLE CONTEXT_ETL_CONFIG (
    VAR_KEY VARCHAR(254) NOT NULL,
    VAR_VALUE VARCHAR(254) DEFAULT NULL,
    CONTEXT_FILTER VARCHAR(254) NOT NULL
);
ALTER TABLE CONTEXT_ETL_CONFIG
    ADD CONSTRAINT PK_VAR_KEY_BY_CONTEXT_FILTER PRIMARY KEY (CONTEXT_FILTER, VAR_KEY);
The above structure allows Talend to retrieve a single context using a query like:
"SELECT VAR_KEY, VAR_VALUE FROM CONTEXT_ETL_CONFIG WHERE CONTEXT_FILTER = '" + context.v_currContextToLoad + "'"
The returned dataset can then be loaded to populate a context with the correct values, selected through the variable context.v_currContextToLoad. In each environment, a different set of values will be loaded.
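To make the effect of the filter concrete, here is a small in-memory sketch of what the query returns for one value of context.v_currContextToLoad. It does not connect to a database; the Row class, the filter names and the sample values are all invented for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ContextFilterSketch {

    /** One row of CONTEXT_ETL_CONFIG. */
    static final class Row {
        final String varKey;
        final String varValue;
        final String contextFilter;

        Row(String varKey, String varValue, String contextFilter) {
            this.varKey = varKey;
            this.varValue = varValue;
            this.contextFilter = contextFilter;
        }
    }

    /** In-memory equivalent of: SELECT VAR_KEY, VAR_VALUE ... WHERE CONTEXT_FILTER = ? */
    static Map<String, String> selectContext(List<Row> table, String contextToLoad) {
        Map<String, String> context = new LinkedHashMap<>();
        for (Row row : table) {
            if (row.contextFilter.equals(contextToLoad)) {
                context.put(row.varKey, row.varValue);
            }
        }
        return context;
    }

    public static void main(String[] args) {
        // Invented sample rows; real values live only in each environment's database.
        List<Row> table = Arrays.asList(
            new Row("db_host", "db.example.internal", "TARGET_DB"),
            new Row("db_port", "3306", "TARGET_DB"),
            new Row("input_dir", "/data/in", "FILES"));
        System.out.println(selectContext(table, "TARGET_DB"));
    }
}
```

Because the primary key is (CONTEXT_FILTER, VAR_KEY), each key appears at most once per filter, so the resulting map never has to resolve duplicates within a single context.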
This concept can be further expanded by having a list of contexts (context filters) passed as a parameter into this job, and so each context can be loaded in turn within a loop.
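The loop could look roughly like the sketch below. It assumes the list arrives as a comma-separated string and that, when two contexts define the same key, the one loaded later wins; both are choices made for this illustration rather than anything prescribed by Talend. The loader function stands in for the database query.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class ContextListSketch {

    /** Splits a comma-separated parameter such as "COMMON,TARGET_DB" into filters. */
    static List<String> parseFilters(String parameter) {
        List<String> filters = new ArrayList<>();
        if (parameter == null) {
            return filters;
        }
        for (String part : parameter.split(",")) {
            String filter = part.trim();
            if (!filter.isEmpty()) {
                filters.add(filter);
            }
        }
        return filters;
    }

    /** Loads each context in turn; keys from later contexts override earlier ones. */
    static Map<String, String> loadAll(List<String> filters,
                                       Function<String, Map<String, String>> loader) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (String filter : filters) {
            merged.putAll(loader.apply(filter));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Stand-in for the database lookup, with invented sample values.
        Map<String, Map<String, String>> store = new HashMap<>();
        store.put("COMMON", Collections.singletonMap("log_dir", "/var/log/etl"));
        store.put("TARGET_DB", Collections.singletonMap("db_host", "db.example.internal"));

        Map<String, String> context = loadAll(
            parseFilters("COMMON, TARGET_DB"),
            filter -> store.getOrDefault(filter, Collections.emptyMap()));
        System.out.println(context);
    }
}
```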
Once contexts are no longer embedded within Talend jobs, it will be necessary to ensure that standards and naming conventions are adhered to.
In this case, only a small number of trusted people need to know the details of the production system and have access to it in order to guarantee that Talend jobs will work correctly once they have been successfully tested.
These trusted employees will be able to check that all the required details used in a solution are available in all environments. This work will also diminish over time as contexts are shared and validated by multiple projects and job deployments.
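Part of that check can be automated. The sketch below compares the keys a job expects with the keys actually loaded for an environment and reports what is missing, without ever printing a value. The class name, key names and sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ContextValidationSketch {

    /** Returns the required keys that are absent or empty in the loaded context. */
    static Set<String> missingKeys(Set<String> requiredKeys, Map<String, String> loaded) {
        Set<String> missing = new TreeSet<>();
        for (String key : requiredKeys) {
            String value = loaded.get(key);
            if (value == null || value.isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Keys the job expects, following the agreed naming convention (invented here).
        Set<String> required = new HashSet<>(Arrays.asList("db_host", "db_port", "db_user"));

        // Values loaded for one environment (invented here).
        Map<String, String> loaded = new HashMap<>();
        loaded.put("db_host", "db.example.internal");
        loaded.put("db_port", "3306");

        // Only key names are reported, never the values themselves.
        System.out.println("Missing keys: " + missingKeys(required, loaded));
    }
}
```

Running a check like this in every environment before deployment lets a trusted person confirm completeness without exposing production values to the development team.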
I believe the above methodology provides great benefits in terms of maintainability, security and flexibility.
As a job is deployed through different environments, there is no need to edit it, so testing can be done safely. At the same time, contexts stored centrally can be managed securely: only a small number of individuals need access to production environment details, because jobs collect the correct values automatically.
The methodology above has proved to be a timesaver while improving the overall management of secure details.
I hope the above information is relevant to other Talend Professionals out there.
If you would like to suggest an improvement or a correction - please add your comment below.
Nicolas @ BrainPowered