Committed to Code

DataCleaner is an Open Source application for profiling, validating and comparing data. These activities help you administer and monitor your data quality in order to ensure that your data is useful and applicable to your business situation.

DataCleaner is the free alternative to software for master data management (MDM) methodologies, data warehousing (DW) projects, statistical research, preparation for extract-transform-load (ETL) activities and more.

Code Analysis


Recent Highlights

Avatar

Large commit — Merging 3.0 branch into trunk

More than 1000 lines of source code were added or removed in this commit.

In commit /p/datacleaner/commits/178331261 by Kasper Sørensen (Using name ‘kasper’) on 2012-05-15 (8 days ago)

See all highlights…


News

Free training in data profiling, cleansing and more

Human Inference is arranging two sessions of free online training, for people wishing to learn about data profiling, data cleansing, deduplication and more ... And of course, how you do it all in DataCleaner.

The two sessions are (click a ... [More] link to read more and to register):

 May 29th, free training for European and APAC timezones.
 June 7th, free training for US timezones.

We hope to see a lot of people join in on the training, which we hope will be a good and fun event, where you'll learn about data quality, data quality tools and get a chance to say hello to other community members. [Less]


DataCleaner 2.5.2 released

DataCleaner 2.5.2 has just been released. The DataCleaner 2.5.2 release is a minor release, but does contain some significant feature improvements and enhancements. Here's a walkthrough of this release:

Apache CouchDB support

We've ... [More] added support for the NoSQL database  Apache CouchDB. DataCleaner supports both reading from, analyzing and writing to your CouchDB instances.

Connect to CouchDB databases

Update table writer

Following our previous efforts to bring ETLightweight-style features into DataCleaner, we've added a writer which updates records in a table. You can use this for example to insert or update records based on specific conditions.

Like the Insert into table writer, the new DataCleaner Update table writer is not restricted to SQL-based databases, but any datastore type which supports writing (currently relational databases, CSV files, Excel spreadsheets, MongoDB databases and MongoDB databases), but the semantics are the same as with a traditional UPDATE TABLE statement in SQL.

Drill-to-detail information saved in result files

When using the Save result feature of DataCleaner 2.5, some users experienced that their drill-to-detail information was lost. In DataCleaner 2.5.2 we now also persist this information, making your DQ archives much more valuable when investigating historic data incidents.

Improved EasyDQ error handling

The EasyDQ components have been improved in terms of error handling. If a momentary network issue occurs or another similar issue causes a few records to fail, the EasyDQ components will now gracefully recover and most importantly - your batch work will prevail even in spite of errors.

Table mapping for NoSQL datastores

Since CouchDB and MongoDB are not table based, but have a more dynamic structure we provide two approaches to working with them: The default, which is to let DataCleaner autodetect a table structure, and the advanced which allows you to manually specify your desired table structure. Previously the advanced option was only available through XML configuration, but now the user interface contains appropriate dialogs for doing this directly in the application.

We hope you enjoy the new 2.5.2 version of DataCleaner. Go get it now at the  downloads page. [Less]


DataCleaner adds data profiling to Pentaho

Today we announce an exciting new partnership with Pentaho, the leading open source Business Intelligence and Business Analytics stack! For the past years Human Inference, members of the DataCleaner community and Pentaho have been in close contact to ... [More] design  a new data quality package for the Pentaho Suite. DataCleaner plays a key part in this new solution.

DataCleaner’s integration in Pentaho is primarily focused on the open source ETL product, Pentaho Data Integration (aka Kettle). Pentaho and Human Inference will be running a joint webinar on May 10th to tell everyone about all the new features ( register for the webinar here), but until then – here’s a summary!

Profile ETL steps using DataCleaner

When working with ETL you often find yourself asking what kinds of values to expect for a particular transformation. With the data quality package for Pentaho we offer a unique integration of profiling and ETL: Simply right click any step in your transformation, select ‘Profile’, and it will start up DataCleaner with the data available for profiling, which the step produces! Not only is this a great feature for Pentaho Data Integration, it is also a one-of-a-kind in the ETL space. We are very excited to see this great use of embedding DataCleaner into other applications.

Right click any step to profile

Execute DataCleaner job

Another great feature in the Pentaho data quality package is that you now orchestrate and execute DataCleaner jobs using Pentaho Data Integration. This makes it significantly easier to manage scheduled executions, data quality monitoring and orchestration of multiple DataCleaner jobs. Mix and match DataCleaner’s DQ jobs with Kettle’s transformations and you’ve got the best of both worlds.

Execute DataCleaner jobs as part of your ETL flow

EasyDQ integration

Additionally, the data quality package for Pentaho contains the  EasyDQ cleansing functions as ETL steps, similar to what you know from their DataCleaner counterparts.

Deduplication and merging via DataCleaner

In addition to embedding DataCleaner for profiling of steps, you can also start up DataCleaner when browsing databases in Pentaho Data Integration. This will create a database connection which is appropriate for more in-depth interactions with the Database. For example, you can use it to find duplicates in your source or destination databases.

Detect duplicates in your sources

For more information:

The press release from Pentaho:
 Pentaho announces new Data Quality solution

Installation instructions and information from Pentaho:
 Pentaho wiki: Human Inference

Example of using the DataCleaner profiler with Pentaho:
 Pentaho wiki: Kettle Data Profiling with DataCleaner

Information about the EasyDQ functions for Pentaho:
 EasyDQ Pentaho page [Less]


Minor improvements and bugfixes in version 2.5.1 of DataCleaner

Today we've released DataCleaner 2.5.1. This is a maintenance release with only minor bugfixes and improvements. But nevertheless we encourage users to upgrade!

Here are the news in DataCleaner 2.5.1:

A bug was fixed in the Table ... [More] lookup transformation, which caused it to be unable to have multiple output columns.
CSV file escape characters have been made configurable.
A minor bug pertaining to empty strings in the Concatenator was fixed.
Support for the  Cubrid database was added.
The converter transformations was adapted to be able to work on multiple fields, not just single fields.

For more information, please refer to the 2.5.1 milestone in the trac system.

We hope you enjoy the new version of DataCleaner! [Less]


DataCleaner 2.5 is out!

Today we announce the general availability of DataCleaner 2.5! This release is the result of months of hard work by the core DataCleaner crew, the EasyDQ group and the community at large.

Let’s get straight to the “What’s new” ... [More] question. There are plenty of major improvements in this release:

Saving results to disk
With DataCleaner 2.5 you can save, archive and share your analysis results. This is not only a time-saver for those who used to do manual exporting of analysis results, but it is also a means to improve your methodology around handling profiling results, sharing them with colleagues and for archiving historically profiles of your data.

Saving is implemented so that future versions and/or custom solutions can take advantage of the results and potentially use it for scheduled profiling, data quality monitoring and more.

Data structure transformers
With the rise of Big Data and NoSQL databases comes more advanced data structures. In next generation databases we see key/value pairs and list structures that are cumbersome to deal with in tools built for traditional relational data. To solve these issues DataCleaner 2.5 ships with a new set of “data structure” transformers, which allow you to easily wrap and unwrap structures, to be able to get to the parts that you want to analyze or process.

The data structure transformers also include parsers and writers for JSON data, which is one of the more common representations of NoSQL datastructures.

Filters and transformers are now all "Transformations"
Since DataCleaner 2.0 we’ve been pushing the idea of transformers and filters. The strength of these two types of components were evident from a technical perspective, but for the end-user the distinction has shown to be distracting from its main use-case: To process data in a flow of actions. Therefore DataCleaner 2.5 has consolidated these two terms, and made them available in a common metaphor for the user: Transformations. This means that the user will no longer have to look in multiple menus to find the component he is looking for.

New EasyDQ transformations: Merge duplicates and Due diligence check
The EasyDQ on-demand data quality platform team has also been busy. We present to you three new functions and an optional extension for the advanced users.

First is the Merge duplicates transformation. With this transformation you can turn your results from Duplicate detection into merged, golden records! The merge component is designed to handle a hierarchy of criteria when merging to make sure that critieria such as well-formedness, update date and manual overriding is taken into account.

Secondly we’ve introduced two services for Due diligence checks. These are transformations which will help you validate that the people you are engaging business with are not connected to sanction lists of terrorists, narcotics trafficking and other security threats.

These new features, as well as the other EasyDQ functions, are described in detail in the  EasyDQ reference documentation.

Lastly, there's a new extension available, the  EasyDQ essentials, which we recommend as a handy extra toolkit for those that want to go deep diving into the features of EasyDQ.

Defining datastore properties on the command line
One of the areas that have been heavily enforced in the later releases of DataCleaner is the command line interface. Using this interface you can set up DataCleaner to execute in all environments, in a scheduled or managed fashion. In DataCleaner 2.5 we’ve also made it possible to override datastore properties from the command line. Why? Because it allows you to reuse the same job on different datastore definitions. If you are for example scanning a directory for CSV files, and want to run a DataCleaner job on each file, this is a solution for you. Refer to  the documentation for further explanation and examples.

Drill to detail information in value distribution results
The Value distribution analyzer now contains a drill to detail option, to make it possible to see the source records for each value in the distribution. This greatly helps usability when doing explorative data profiling.

Database-specific connection panels
The dialogs for setting up database connections have been enhanced with database-specific connection properties. This makes it a lot easier for the end-user to connect to a database without having to know the details of constructing a connection URL.

Database-specific configuration panels have been created for MySQL, PostgreSQL, Microsoft SQL Server and Oracle. Other database types are supported using the traditional way of connecting, as in previous versions of DataCleaner.

Execution and scheduling of DataCleaner jobs using Pentaho Data Integration
Pentaho Data Integration (PDI, aka. Kettle) is an open source ETL product that the EasyDQ and DataCleaner team has had a lot of interactions with. For the DataCleaner 2.5 release we are now announcing that in next version of Pentaho Data Integration you will be able to execute and schedule DataCleaner jobs using Pentaho’s infrastructure.

While this is not available, released software as of today, we are looking forward to telling you more about this in the near future!

For those still reading, we also did some minor improvements in DataCleaner 2.5:

We’ve added some number transformations for generating IDs, incrementing numbers and more.
Implemented a Date range filter, similar to the Number range and String range filters.
Support for matching against Synonym catalogs in Reference data matcher (which is previously known as the Matching analyzer).
Now all components have flow visualizations in their configuration panel. This feature helps retain the overview when working with large analysis jobs.
The sample data (the ‘orderdb’ database) has been reworked to contain better examples of data quality issues.
User experience improvements; more elegant dialog designs and trimming of window layout.

We hope you all enjoy the new release of DataCleaner 2.5. Please let us know what you think on the  forums, or on our  LinkedIn group, or on  Google Plus, or on  Blogger, or  tweet it, or... [Less]


Read all DataCleaner articles…

Edit RSS feeds.