ETL for Predictive Analytics


Before any datamining activity, and before building any predictive model, the dataminer's first task is to put all the available data into the proper format (with TIMi or StarDust, this usually means obtaining one single, very large dataset). Once you have a dataset, you can analyse it with your favourite predictive analytics tool (…and this should be TIMi or StarDust!).

Anatella is an ETL tool built especially for analytical purposes and predictive datamining. It includes some features (data transformations and meta-data transformations) that are unique and extremely valuable in this field (see Appendix B of the Anatella Quick User's Guide to learn more about this subject). Anatella is the only ETL tool that offers you enough flexibility to create the complex data transformations required for predictive analysis. Inside Anatella, you can even use the powerful and flexible JavaScript language to easily create new, extremely complex data transformations.

To keep the data preparation time required before predictive analytics to a minimum, you should federate all your data into a single, very large dataset that will be used to:

  • Create all your predictive models.
  • Create all your segmentation models.
  • Perform all the necessary ad-hoc statistical analysis.

This large dataset is typically built using Anatella because it is the only ETL tool with no limit on the number of columns.
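As a rough illustration of what "federating" data into one wide dataset means, here is a minimal sketch in plain Python (it is not Anatella code, and it says nothing about how Anatella works internally). The table names, column names and values are hypothetical; each source table is assumed to be keyed by a customer id.

```python
# Hypothetical source tables, each keyed by a customer id.
customers = {1: {"age": 34, "city": "Ghent"}, 2: {"age": 51, "city": "Liege"}}
billing = {1: {"avg_invoice": 42.5}, 3: {"avg_invoice": 17.0}}
usage = {2: {"calls_per_month": 120}}


def federate(tables):
    """Full outer join of {key: {column: value}} tables into one wide table."""
    # Collect every column, prefixed by its table name, in first-seen order.
    columns = []
    for name, table in tables.items():
        for row in table.values():
            for col in row:
                full = f"{name}.{col}"
                if full not in columns:
                    columns.append(full)

    # One output row per key found in any table; missing values stay None.
    keys = sorted({key for table in tables.values() for key in table})
    wide = {}
    for key in keys:
        wide[key] = {col: None for col in columns}
        for name, table in tables.items():
            for col, value in table.get(key, {}).items():
                wide[key][f"{name}.{col}"] = value
    return columns, wide


columns, wide = federate(
    {"customers": customers, "billing": billing, "usage": usage}
)
for key, row in wide.items():
    print(key, row)
```

With real databases, the number of columns produced this way grows quickly, which is why the column limit of the tool matters.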

To create this large dataset, you will usually manipulate tables that may contain thousands of columns, because these tables are "consolidations" of all your databases. That's exactly where Anatella has one enormous advantage over nearly all the ETL tools currently available on the market. Indeed, nearly all ETL tools force you to specify "by hand", very precisely, the type (string, double, integer, …) of each column (usually by looking at the first 100 rows). When you have a few dozen columns, that's OK, but when you have thousands of them, it's not possible anymore! From this point of view, Anatella is very nice because you don't have to specify the type of every column. We summarize this unique functionality as follows: "Anatella is meta-data free".
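To see why deciding a column's type from only its first rows is fragile, consider the following sketch. It is plain Python, not the logic of any particular ETL tool: the `guess_type` helper and the sample column are hypothetical, and the three type names simply mirror the ones mentioned above.

```python
def guess_type(values):
    """Return the narrowest of 'integer', 'double', 'string' that fits all values."""
    kind = "integer"
    for text in values:
        try:
            int(text)
        except ValueError:
            try:
                float(text)
                kind = "double"  # numeric, but not an integer
            except ValueError:
                return "string"  # one non-numeric value is enough
    return kind


# Hypothetical column: 10,000 integer-looking values, then one alphanumeric code.
column = [str(i) for i in range(10_000)] + ["B-2041"]

sampled = guess_type(column[:10_000])  # what a sample-based guess would see
actual = guess_type(column)            # what the whole column really contains

print(sampled, actual)
```

Here the sample suggests an integer column while the full column has to be treated as a string, so a transformation built on the sampled guess would fail (or silently lose data) only once it reaches the last row.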

When I was still using classical ETL tools, I lost count of how many times I screwed up a data transformation because the meta-data were incorrectly "guessed" by the tool (because it looked only at the first 10,000 rows!). This is especially stupid when some of the datasets that I work with have 10,000,000 rows! It is even more irritating when you have to wait two hours before noticing that the transformation didn't work! This will never happen with Anatella, for 3 reasons:

  • Anatella is meta-data free.
  • Anatella offers you (…most of the time instantaneously) a complete preview of the whole table at any point in your data transformation graph.
  • The Anatella table preview shows you new rows as soon as they have been computed, while the data transformation script is still running. This saves you from waiting 2 hours for nothing, because you can see the results of your data transformation immediately!
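The idea of previewing rows while a transformation is still running can be sketched with Python generators. This is only an analogy under stated assumptions, not a description of how Anatella implements its preview: the `transform` and `preview` functions and the row fields are hypothetical.

```python
import itertools


def transform(rows):
    """Lazily yield each transformed row as soon as it has been computed."""
    for row in rows:
        yield {"id": row["id"], "amount_eur": round(row["amount_cents"] / 100, 2)}


def preview(rows, limit):
    """Pull only the first `limit` rows from a pipeline that is still running."""
    return list(itertools.islice(rows, limit))


# Hypothetical 10,000,000-row source; rows are produced on demand, not stored.
source = ({"id": i, "amount_cents": i * 250} for i in range(10_000_000))

first_rows = preview(transform(source), 3)
print(first_rows)
```

Because each row is produced on demand, the first results can be inspected right away, and a mistake in the transformation shows up after a handful of rows rather than after the whole table has been processed.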

Building a new predictive model is an iterative process: you usually run your datamining tool several times before obtaining your "final" predictive model. Sometimes (very often, in fact), you notice that you need to change your learning dataset slightly to produce a correct model. To re-create a new, "improved" dataset, you need an "agile" ETL tool. Anatella is the most agile of all ETL tools and is ideally suited for all predictive analytics tasks.