Tuesday, April 2, 2013

SSIS : Fuzzy Grouping Transformation

Fuzzy Grouping Transformation

Use of fuzzy grouping
I wanted to learn, how to use Fuzzy grouping transformation. But I was not getting any scenario, where this transformation could be used. Then while going through Google search, I got some scenario in which this transformation can be used. The scenarios are as follows:
·         Scenario1: Say there are various clothes shops in a market. Some sales man sells cloths in shops. One wants to get the salesman, who has sold maximum clothes. For this the records of sale for each salesman in shop1 and shop2 is as follows:
Shop1:                                                                  shop2:               

        But there are some names, which misspelled .For getting the total sale of each salesman, we will need to use fuzzy grouping, which will group the names which are similar and will assign a common name to the similar names, so after that we can aggregate the sales of each salesman by that common name.
For example; jayant and jayanath are similar, in which one of name misspelled. So the total sale of jayant (jayanth ) is 107.
·         Another scenario can be, when country name is misspelled or if company names misspelled etc.
Implementation of Fuzzy Grouping Transformation
Here I am going to implement the first scenario1, in which we will be doing fuzzy grouping of shop1 and shop2 by Name, and will calculate the total sales made by each salesman. So the result should look like
For implementation, I have followed below steps:
Step1: Add a DFT in your package
Step 2: For source, I have used one excel connection for shop2, and one Oledb connection for Shop1.
I did this, so that we can make this scenario a bit difficult, and we could learn some more.
Before using OleDB connection, let’s create the table and insert the data in SSMS, as below

For using Excel connection insert data of shop2 into one excel sheet of an excelfile

Now we will add Oledb source and Excel source in the DFT, and will configure both

       

Step3:  Union all records of both the shops
As we want to union the records of both the shops, so that we can apply fuzzy grouping on all the records of both shops.
For this, we need to use ‘Union All’ Transformation. But for doing Union, both the source should have same data type for their columns. For making the data type of the mapping columns similar, I am using Derived Column Transformation for both the source.
Now we will be typecasting column Sales of both the shops into one data type(in Derived column transformation) for this double click on derived column transformation, and then add one derived column say ‘sales1’ and set expression , as shown below:
Where (DT_R8)sales means, we are type casting sales column of each to DT_R8 data type, which is nothing but double precision float data type. This is one for both source’s column sales.
Now, we can apply on both the shops, Union All editor looks like below:

Step 4: Add Fuzzy grouping transformation and double click on it and go ‘Coulmn’ tab
                               
In the columns tab,
check the checkbox on left of the column, for applying fuzzy matching on that column, here  I have check the checkbox of name, that means the fuzzy matching will be applied on name column, as you can see match type as ‘Fuzzy’. 
Check the checkbox on right of the column (ie; Pass through), for using those columns in your result. As here I have checked column sales, because in the result we want to get total sale of each sales man , so this column will also be used.
In the Advanced tab
We can set the similarity threshold, token delimiters, etc.
·         Similarity threshold as 0.80: that means only matches which are at least 80% similar to the searched will only be displayed.
·         Token Delimiter: This is default set, as shown below. The delimiters below will break each sentence into multiple words if they found the delimiters, and then the matching will be done.
·         _key_in, a column that uniquely identifies each row.
·         _key_out, a column that identifies a group of duplicate rows. The _key_out column has the value of the _key_in column in the canonical data row. Rows with the same value in _key_out are part of the same group. The _key_outvalue for a group corresponds to the value of _key_in in the canonical data row.
·         _score, a value between 0 and 1 that indicates the similarity of the input row to the canonical row.



Step 5: Add Aggregate transformation, Excel Destination
Aggregate transformation is used for getting the aggregate sum of sale per person.
Before that add a ‘Data Viewer’ on the connector of Fuzzy grouping and Aggregate transformation. So that we can see some result from fuzzy grouping.
Configure the Aggregate transformation
Apply operation sum on Sales column, and Group by on Name_Clean. That means, sales will be sum by Name _Clean. Where Name_Clean is the column generated by fuzzy transformation, along with name, s ales etc., which contains value of name which are part of the same group. Rows with the same value in _key_out are part of the same group.
Step 6: Execute DFT.
After Fuzzy grouping, one Data viewer is used, which gets popup while execution. Here in the Data view we can see that, Name_Clean is Jayanth for both ‘jayant’ and ‘jayanth’. Similarly Name_Clean is ‘sidhart’ for both ‘sidharth’ and ‘sidhart’.

When we continue execution, then one more data viewer is in the connector between Aggregate transformation, and Excel Destination. That data view gets pop up
That shows the result after aggregate, as Sidhart has total 107 sales, and jayanth has 154 sales
This is how the whole task executes, and stores the result in Excel destination


Please provide your feedback for the post, if you find this post useful. Also Post your query or scenario, i will be happy to help.

35 comments:

  1. Thanks a lot ,Ragini!!!!It was really helpful.

    ReplyDelete
  2. It is great sample concern to fuzzy grouping. It is very detailss. Thanks so much

    ReplyDelete
  3. This comment has been removed by the author.

    ReplyDelete
  4. Thank you very much for posting this article.. it is very useful

    ReplyDelete
  5. a bit difficult but clearly explained, Very useful for me. Thanks a lot.

    ReplyDelete
  6. Hi Thanks for posting a good example, I got one error while creating the package, While connecting the different file sources to UNION ALL , I am getting an ERROR : Does not match the metadata for the associated output column (while combing the name and sales column steps) can you help me out in that step.

    regards,
    Lokesh

    ReplyDelete
    Replies
    1. having the same problem, did you solve the problem?

      Delete
  7. Very helpful article to implement fuzzy logic for shop data..
    Thank you very much ragini...

    Regards,
    Rajesh Yadav

    ReplyDelete
  8. You are a great Teacher..tons of claps from me..awesome article.

    ReplyDelete
  9. complexity simplified ... kudos ...

    ReplyDelete
  10. can you please clarify on one thing that ...... how jayanth and sidhart got the score = 1 ?

    ReplyDelete
  11. Good article. Very helpful

    ReplyDelete
  12. What a great article. Please post more.

    ReplyDelete
  13. Nice article, clearly explained. Thanks

    ReplyDelete
  14. Super explanation. Very useful article. Keep going. Thanks

    ReplyDelete
  15. Thanks alot for this article. Very useful. once again thank u

    ReplyDelete
  16. Really good blog thanks a lot

    ReplyDelete
  17. How to calculate similarity?? with example

    ReplyDelete
  18. Really good blog. All concepts are explained very well.

    ReplyDelete
  19. Very good and great.
    But cant we use Data converstion instead of Derived column. Since this needs just type casting and derived column is generally used for adding a new column.

    Regards
    Nikhila

    ReplyDelete
  20. detailed explanation.. thank u so much..

    ReplyDelete
  21. Hi thanks for posting a good example, I am getting error while executing the package, While connecting from sources file (derived column) to union all , I got an error: "does not match the metadata for the associated output column" combination of the name and sales column steps... can you help me for this step.

    ReplyDelete
  22. Apperciate your effort.. Very detailed explanation..
    Minor correction for the post..Sidhart has total 107 sales, and jayanth has 154 sales replace the number in post..Sidhart has total 154 sales, and jayanth has 107 sales

    ReplyDelete
  23. I really appreciate the way you explained, even any beginners can understand this transformation. Well Done!!

    ReplyDelete
  24. I really appreciate the way you explained, even any beginners can understand this transformation. Well Done!!

    ReplyDelete
  25. Shahriar MahfooziMarch 9, 2016 at 7:43 PM

    Thanks, it's very useful!

    ReplyDelete
  26. Great presentation to understand quickly and easily.

    ReplyDelete