Donnerstag, 2. Juni 2011

Tutorial 4: How to import RSS feeds into MS SQL 2008 (Part 4)

Part 1 | Part 2 | Part 3 | Part 4

The foregoing parts explained how to create the SSIS package to import 
RSS feeds. The 4th and last part of this series covers how to setup the database
tables and stored procedures, which are called be the import package.


The headline table

The table 'fan.TutorialHeadline' is the destination of the RSS feed import. See
it's definition below.

-------------------------------------
-- final import destination
-------------------------------------

CREATE TABLE fan.TutorialHeadline
(
     
id int PRIMARY KEY IDENTITY (0,1),
     
publisher varchar(50),
     title
varchar(255NOT NULL,
     
content varchar(MAX) NULL,
     import_date datetime  DEFAULT GETDATE(),
     
linkurl varchar(255NOT NULL,
     merge_hash
binary(16NULL
ON [PRIMARY]
GO

CREATE
NONCLUSTERED INDEX [Ix_merge_hash] ON [fan].[TutorialHeadline]
(
     
[merge_hash] ASC
)
GO


The import buffer table

To avoid duplicate entries we do not update the headline table directly. We
import the RSS feed data into another import table and merge just the missing
rows into the headline table. In this tutorial the definition of both the
headline and the import tables are identical. See the definition of the import
table below.

-------------------------------------
-- import buffer table
-------------------------------------

CREATE TABLE fan.TutorialImport
(
    
id int PRIMARY KEY IDENTITY (0,1),
    publisher varchar(50),
    title 
varchar(255NOT NULL,
    content varchar(MAXNULL,
    import_date datetime DEFAULT GETDATE(),
    linkurl varchar(255NOT NULL,
    merge_hash binary(16NULL
ON [PRIMARY]
GO

CREATE
NONCLUSTERED INDEX [Ix_merge_hash] ON [fan].[TutorialImport]
(
     
[merge_hash] ASC
)
GO


The hashing function

You may have noticed the indexed column merge_hash in both tables. Hence RSS
doesn't provide a unique ID to identify a RSS data item, I create a unique
hashcode, which is calculated from the feed title and content using MD5. The
function is used by the stored procedure which imports the data. See the
function definition below.

-----------------------------------
--- hash function
-----------------------------------


CREATE
FUNCTION fan.TutorialHash
(@title varchar(100), @content varchar(MAX))
RETURNS binary(16)
AS
BEGIN
    -- create merge hash
    
DECLARE @hash binary(16);
    DECLARE @tc8000 varchar(8000);    SET @tc8000 CONVERT(varchar(8000), cast(@title as varchar(MAX)) ' ' + @content);
    SET @hash CAST(HASHBYTES('MD5',@tc8000AS binary(16));
    RETURN @hash
    END
GO


The import procedure

The stored procedure 'fan.TutorialAddHeadline' is called by SSIS package
created in the previous parts of this tutorial. The procedure is executed for
each item in a RSS feed. It creates the unique hashcode and creates a new row in
the import buffer table.

------------------------------------------
-- TutorialAddHeadline procedure
------------------------------------------

CREATE
PROCEDURE fan.TutorialAddHeadline
@title
varchar(100), @content varchar(MAX), @publisher varchar(50), @linkurl varchar(255)
AS
BEGIN
    -- create merge hash
    
DECLARE @hash binary(16);
    SET @hash fan.TutorialHash(@title@content);

    -- create headline    INSERT INTO fan.TutorialImport (publishertitlecontentlinkurlmerge_hash)
    VALUES (@publisher@title@content@linkurl@hash);
END
GO


Merge recently imported and existing data

The stored procedure 'fan.TutorialMerge' copies all rows from the import
buffer table to the destination table. It uses the MERGE command, which is new
in MS SQL 2008. The MERGE command uses the calculated hash code to find missing
rows in the destination table and inserts only new rows. After the merge the
import buffer table is cleared.
This procedure has to be called by SSIS package after the import. The next
step describes how to add this last task to the SSIS package. See the merge
procedure code listing below.


---------------------------------------
-- merge imported and existing data
---------------------------------------


CREATE
PROCEDURE fan.TutorialMerge
AS
BEGIN
    MERGE 
fan.TutorialHeadline USING fan.TutorialImport m
    ON h.merge_hash m.merge_hash
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (publishertitlecontentimport_datelinkurlmerge_hash)
          VALUES (m.publisherm.titlem.contentm.import_datem.linkurlm.merge_hash);
    DELETE FROM fan.TutorialImport;
END
GO


Adding the merge procedure to the SSIS package

The merge procedure 'fan.TutorialMerge' has to be executed after the import
to update the RSS data. Follow the steps below to add it to the SSIS package.
1. Add a new 'Execute SQL' taks to the SSIS package
2. Click on the foreach loop container and wait for the arrow to appear. Then
drag it on the 'Execute SQL' task
3. Right click the 'Execute SQL' task and choose edit
4. Make sure the task connects to your database and then edit the attribute
SQLStatement and enter the following code. Replace 'CelebrityDB' by your
database name.

USE CelebrityDB -- replace db name
GO

EXEC fan.TutorialMerge


5. Click OK to close the editor

The SSIS package is now complete. You can now deploy it on the server. When
you execute the package,  it should look like the one on the image below.

Executed SSIS package:


Final Result

After the import the RSS feed data is stored in the table 'fan.TutorialHeadline'.
You can read the RSS feeds now with your SQL Server Management Studio.

Result of the RSS import:




Thank you for reading this tutorial!

More tutorials are coming soon. Your feedback helps me a lot to improve this
blog. Please leave a comment!

Kommentare:

  1. Awesome Tutorial, very helpful, has given me my first taste of programming for years! XML, RSS, SQL, Stored procedures and fuctions, hash generator, this tutorial has it all! Only point for any other newbies, fan.xxx should be dbo.xxx in english!

    AntwortenLöschen
  2. Thanks for the great tutorials. Download feed is working fine, but i am not able to pull data from XML to Table using Import RSS File Script Task. Its not giving any error.

    Any help ?

    AntwortenLöschen
  3. I think they changed the format of the RSS feed.
    If you read the XML file, there are no references to ""
    I got SOMETHING into the [TutorialHeadline] table by doing the following
    Look for the line which reads :
    XmlNodeList itemList = doc.GetElementsByTagName("item");

    Change it to :
    XmlNodeList itemList = doc.GetElementsByTagName("entry");

    this does not fix the issue in its entirety, but it does get it to do something.

    AntwortenLöschen
  4. After reading this post its seems very easy that how to import RSS feed into MS SQL. Thanks for sharing this with us, Carry on the great work.

    Regards,
    Healthcare Websites

    AntwortenLöschen