Talk:Extract, transform, load
This is the talk page for discussing improvements to the Extract, transform, load article. This is not a forum for general discussion of the article's subject. |
Article policies
|
Find sources: Google (books · news · scholar · free images · WP refs) · FENS · JSTOR · TWL |
This article is rated Start-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects: | |||||||||||||||||||||||||||||||||||||||||
|
General audience rework
[edit]Due to the technical nature of ETL GA compliance may not be possible. However, the content should not be sales literature for software companies including open source projects. I plan on sinking that content to the bottom into a list of ETL apps.
- The diagram used here is not indicative of using multiple data mediums for I/O. Specifically Input of csv files, spreadsheets, databases, other data streams.
- I will take the lead on sanitizing this article
Mkevlar (talk 4:34, 1 April 2022 (UTC)
Jaspersoft ETL
[edit]IMO it makes sense to include Jaspersoft ETL in the list of Proprietary ETL frameworks —Preceding unsigned comment added by 178.236.241.148 (talk) 12:55, 16 February 2011 (UTC)
Refer ELT
[edit]I'm not an expert on this subject. But you guys who are experts should update this to include at least a reference to ELT (Extract, Load, Transform). You can find many articles about this and many huge companies using this approach for many-hundreds-terabytes datawarehouses. —Preceding unsigned comment added by 201.48.130.65 (talk) 20:00, 28 December 2009 (UTC)
The original term "ETL" was popularized in the early 1990's by W.H. Inmon in "Building the Data Warehouse." Many advocate using the term "ELT" because because the term "load" is typically used to describe the step of loading the data into the RDBMS. For example, in Oracle, the "SQL Loader" tool is used to load external data into the database. After loading comes the transformation processing--hence ELT. — Preceding unsigned comment added by Randygrenier (talk • contribs) 15:34, 10 May 2015 (UTC)
I have changed the content of the section "ETL vs ELT". I have also edited and added references. I can not verify that ELT is the modern approach nor that there are only two broad categories: ETL and ELT. Instead I have referenced what Amazon Web Services says: "ELT is a version of ETL." Also, I think that "ETL" has become more of an abstract concept than a detailed instruction of how to do it. One important distinction is between the transformations that you do to cleans, standardize and conform data vs. the transformations that you do to create queries and data marts. It is hard for me to understand that we can load totally un-transformed data into the backoffice. Rather, some kind of cleansing, standardization and conforming has to be done. But, if we consider these transformations as belonging to extraction, then I can find it meaningful to talk about ELT into the backoffice and do all complex queries (on conformed data) and all creation of data marts in the "frontend" area.
I give you one realistic example of how it may look like and why it is problematic to see the short abbreviation as a description of what to do.
If you consider a solution where you extract many files from a source system into files in a datalake and then transform some of the files when you want to load them into tables in what you may call "extract area" and then continue to transform the tables in the "extract area" into cleansed, standardized and conformed tables in a "backoffice" area and then finally have complex queries to load the data into the "frontend" area, then you could call the whole process ETL (for the "extract area") - ETL (for the "backoffice" area) - ELT (for the frontend area). That will give us something like ETL-ETL-ELT. And if you store the results of the ELT somewhere then you would end up with ETL-ETL-ELT-L.
I have been looking for references on this subject but I have not found any. Please feel free to inform me if there is any. I have the impresseion that ELT is more of a salespitch than a serious architectural design. I remember that Oracle tried to launch their ODI ETL tool (many years ago) with the message that it was an ELT tool.
TemporaryMobility 10:46, 15 November 2018 (UTC) — Preceding unsigned comment added by TemporaryMovement (talk • contribs)
Citation Needed??
[edit]Specifically referring to the statement, "Increasingly, companies are buying ETL tools to help in the creation of ETL processes.[citation needed],"... Really? I'm not sure we need citations when statements are common sense. Companies are always looking for ways to increase the bottom line and IT departments to increase the "plug-and-play" supportability of components. Certainly this is not accomplished through creating (or allowing to be created) tons of proprietary, in-house scripts. So, obviously, companies that discover ETL tools will buy ETL tools to create ETL processes. I wouldn't expect to see "Wind is the movement of air [citation needed]" That's just silly. 71.67.189.53 (talk) 01:42, 24 October 2009 (UTC)
- I do think that the citation is needed to validate the term "Increasingly" used in the paragraph. I believe there is still massive amount of manual creation of ETL without the use of ETL tools so some proof that this is diminishing and the use of COTS products are filling in that part of the marketshare would be useful. 157.203.43.103 (talk) 14:35, 11 March 2010 (UTC)
- I'm with the guy above. As someone who has worked in the industry for 20+ years, there's an awful lot of "ETL Tool evaluation" goes on, but companies (IMO) still shy away from the up-front cost, and err towards sticking with their in-house scripted solutions. This might be changing, but we do need a citation. If it were my $2 I'd say it's not on the increase. Michaelfromtheuk (talk) 09:18, 17 July 2012 (UTC)
Totally agree. In fact, there are organizations that have aborted the use of ETL tools (after much expense) because of their lack of flexibility compared to programming in the RDBMS language (e.g. PL/SQL, Transact SQL, etc.) — Preceding unsigned comment added by Randygrenier (talk • contribs) 15:36, 10 May 2015 (UTC)
External links to other tools
[edit]I suggest that deleting the "External links to other tools" section is long overdue. It has become a spam magnet and a link repository, and Wikipedia is not a link repository. —Veyklevar 05:24, 12 May 2006 (UTC)
Very much agreed. I'd do it myself, but I'm biased.
I'd suggest that rather than a list of links, what would make more sense is a list of "major products" by market share - with citation, although this might be hard to come by. From my own experience (which is over 20 years) I'd say the top 3 are Oracle, IBM, and Ab Initio (which isn't even mentioned). Michaelfromtheuk (talk) 09:51, 5 March 2012 (UTC)
Copyvio
[edit]Some of the "Challenges" section of this article has been copied verbatim from [1]. Haakon 11:57, 22 May 2006 (UTC)
The Image on this page looks like it has been copied from some book. Is that a copyright violation? I don't know how to sign this comment, sorry. — Preceding unsigned comment added by 208.91.190.85 (talk) 18:25, 1 May 2013 (UTC)
Some ETL Tools
[edit]This list of ETL tools is a little selection of the usable products. A recent software is'nt present in this list : Talend Open Studio. I have find severals english source of information of this product and its show the relevance of this software:
- https://analyticscanvas.com — Preceding unsigned comment added by 184.147.110.172 (talk) 15:38, 23 March 2021 (UTC)
The Official website is http://www.talend.com/default.html
I'm actually working to this company, it would be thus preferable that I'm not the author of this addition!
Can somebody it help me to complete article ETL by adding this reference?
Ocarbone 13:36, 17 October 2006 (UTC)
Some more ETL Tools offering companies are: http://www.oracle.com/technetwork/middleware/data-integrator/overview/index.html https://www.informatica.com/ http://www.cloveretl.com/ https://adeptia.com http://support.sas.com/software/products/etls/ https://www.mulesoft.com/
Please add some more providers if you know. — Preceding unsigned comment added by 125.63.70.135 (talk) 11:34, 7 March 2017 (UTC)
Hadoop based tools
[edit]Under the section that lists open source tools, shouldn't we mention tools like Cloudbase, Hive, and Pig (all based off Hadoop) as well? —Preceding unsigned comment added by Saurabhnanda (talk • contribs) 04:51, 2 June 2009 (UTC)
It certainly raises an interesting question, and big-data tools can have a part to play in processing very large data-sets into aggregate sets more useful for analysis, however Hadoop et al. typically doesn't perform the Extract or Load steps particularly well, and a traditional ETL tool is about orchestrating the motion, as well as transformation of data. Gradientdescent (talk) 05:21, 22 June 2012 (UTC)
Kettle
[edit]Kettle seems to be a synonym for Pentaho Data Integration. Useful to mention this somewhere in the tool list, and maybe also have a disambiguation page for "Kettle". March 2010, ThP —Preceding unsigned comment added by 88.101.4.80 (talk) 10:30, 19 March 2010 (UTC)
Performance Section
[edit]In addition to the criticism that has been posted as to the "recipe" nature of part of this section (do this first, then do that, etc), as a database consultant with almost 20 years experience, I actually disagree with the fundamental point the author is making here. He says the "bottleneck" is in the loading of the database. Firstly, the "speed" of a load is rarely the step that takes the longest time in any migration / data-cleansing type project. It's the *analysis* that must be performed to map data and data types(!) from disparate sources into a new "consolidated" target database. This involves many many man-hours and meetings and talking to subject-mater experts, etc. In other words, the "technical" aspect of actually loading the final data is in a sense the *easiest* part of the project. It's the figuring out of all the nasty problems that crop up *before* one has a clean mapping of old-->new data and old--> new data types that take up the most time and intellectual effort. Finally, with the ever increasing speeds of computers and ever increasing quantities of fast memory, technical "loading" of data (yes, even with constraints etc needing to be created) is simply not even in the top 5 problems on a person's mind when working on such a project (ETL).
In conclusion, there are many difficult intellectual problems associated with ETL. However the one mentioned by the author (DB load speed) is probably the least among them. Finally, is it our job to "rank" these problems (associated in a project)? I'm not sure. —Preceding unsigned comment added by Frumiousfalafel (talk • contribs) 19:34, 26 October 2009 (UTC)
I agree with you - Wikipedia shouldn't be a recipe book of 'how to do', so any section like this ought to simply mention (by way of example) the types of challenges and activities that occur within ETL, rather than describing any specific process, as it's rare that "one solution fits all" So maybe a 'Challenges' or 'Activities' section detailing the what typically goes on, maybe linking into how this fits into an organisation. There could possibly be a refer-back to 'IT maturity' (I think Gartner publish something on what identifies the evolution of company IT policies - though bound to be behind their pay-wall) Michaelfromtheuk (talk) 09:57, 5 March 2012 (UTC)
ETL Implementation
[edit]I am very confused that whether this project is just similar to "Data Mining in heteregeneous Distributed System" or is completely different. —Preceding unsigned comment added by 116.75.132.149 (talk) 18:20, 11 August 2010 (UTC)
Dubious
[edit]The section Transform says: "Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female), this calls for automated Data cleansing; no manual cleansing occurs during ETL". The Data cleansing article talks about removing or correcting invalid information, which I would see as different and separate from transcoding. In other words if the data source has "Male" and "Female" and the destination "M" and "F" then this change is transcoding but removing a record that has "Hermaphrodite" in the column would be data cleansing. -- Q Chris (talk) 15:22, 11 April 2011 (UTC)
Disable Triggers
[edit]The author claims "Disable triggers (disable trigger ...) in the target database tables during the load. Simulate their effect as a separate step." to gain speed. I don't consider this being a good practice, because triggers ensure database integrity, and if the ETL process fails after the initial step, this might mess up the database completely. — Preceding unsigned comment added by 88.1.46.185 (talk) 08:12, 20 October 2011 (UTC)
ETL Alternatives
[edit]A web search backs up my feeling that the alternative 'data model recasting' listed in the section is not notable and the references cannot be considered reliable. Gradientdescent (talk) 02:12, 5 June 2012 (UTC)
ETL Description Outdated
[edit]This page is not incorrect but it does not contain sufficient citations and is very outdated. It focuses almost exclusively on ETL from the aughts between relational data structures and traditional data warehouse. I don't have space to correct it right now but I tagged it as a needing citations and I hope someone can come in and update this, cite it, and make it more modern. When I have space I will try to come back and spiff it up. :-) Alex Jackl (talk) 13:29, 27 May 2019 (UTC)
I added some citations and generalized the target in the load section. I didn't think this was controversial but apparently there are other opinions on that. Lets talk it through! Alex Jackl (talk) 14:38, 3 June 2019 (UTC)
- I was mostly concerned that you inserted what looked like a signed talk page comment directly into the article. I'm also concerned (though less so) that the two sources you added seem to be unsigned, self published websites. - MrOllie (talk) 14:51, 3 June 2019 (UTC)
- In hindsight I shouldn't have added the signed note right in the page. Good catch and I removed it. I agree that those are not the ideal citations (though they have reasonable content) but the article is severely lacking in in-line citations so I thought better self-published, yet content-ful citations that supported the material than no citations. Thanks MrOllie Alex Jackl (talk) 15:12, 3 June 2019 (UTC)