Tuesday, February 21, 2012

RSS Text Data Mining

Hi,

Somewhere in the literature surrounding SQL Server 2005 DM, I saw a reference to a project that collected and mined RSS data.

Does such a project exist?

Thanks,

Bob

Various people in the group (myself included) have played with such projects but haven't published anything. (For one thing, all the RSS content is copyrighted by the RSS provider, which makes publishing difficult.) In any case, the methodology is fairly simple: use Integration Services to read the RSS feed and pump it through the text mining transforms, then use the results in data mining, OLAP, or reporting.

The specifics of the latter part depend on the problem you are trying to solve and the information you have.

For example, I created a package once (with an alpha version of SQL Server - it no longer works) that did the following:

Data Flow 1
1: Read from an RSS feed
2: Ignore previously read items
3: Run term extraction to find the terms in the feed
4: Do a lookup to see if the terms were already found
5: Datestamp the new terms and add them to the term dictionary

Data Flow 2
1: Read from the same RSS feed
2: Ignore previously read items
3: Store RSS item identifiers
4: Do term lookup to identify the terms in the feed items
5: Append the transactions (item id, term) in my transaction table
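
The two data flows above can be sketched outside Integration Services as plain Python. This is only an illustration under stated assumptions, not the original package: the word-splitting `extract_terms` function is a crude stand-in for the text mining transforms, and the names `read_items`, `data_flow_1`, `data_flow_2`, and `SAMPLE_FEED` are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

# Assumption: a tiny stopword list stands in for real term extraction.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "on", "is"}

# Hypothetical feed content, used only to exercise the sketch.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><guid>a1</guid><title>Mining RSS data</title>
<description>Text mining of feeds</description></item>
<item><guid>a2</guid><title>OLAP cubes</title>
<description>Cubes for mining trends</description></item>
</channel></rss>"""

def read_items(rss_xml):
    """Yield (item_id, text) for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        item_id = item.findtext("guid") or item.findtext("link")
        parts = (item.findtext("title"), item.findtext("description"))
        yield item_id, " ".join(p for p in parts if p)

def extract_terms(text):
    """Return the distinct lowercase words in text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def data_flow_1(rss_xml, seen_items, term_dictionary, today):
    """Add terms from unread items to the dictionary with a first-seen date."""
    for item_id, text in read_items(rss_xml):
        if item_id in seen_items:           # ignore previously read items
            continue
        seen_items.add(item_id)
        for term in extract_terms(text):
            # setdefault keeps the original date for terms already found
            term_dictionary.setdefault(term, today)

def data_flow_2(rss_xml, stored_items, term_dictionary, transactions):
    """Store new item ids and append one (item_id, term) row per known term."""
    for item_id, text in read_items(rss_xml):
        if item_id in stored_items:         # ignore previously read items
            continue
        stored_items.add(item_id)
        for term in sorted(extract_terms(text)):
            if term in term_dictionary:     # term lookup against the dictionary
                transactions.append((item_id, term))

seen_by_flow_1, stored_items = set(), set()
term_dictionary, transactions = {}, []
for _ in range(2):  # the second pass over the same feed adds nothing new
    data_flow_1(SAMPLE_FEED, seen_by_flow_1, term_dictionary, date(2012, 2, 21))
    data_flow_2(SAMPLE_FEED, stored_items, term_dictionary, transactions)
```

Each flow keeps its own record of items it has already read, so running both against the same feed repeatedly leaves the term dictionary and the transaction table unchanged after the first pass.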

From this system I created cubes to analyze term trends, reports showing when terms were first identified, mining models to cluster documents, etc.
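
As a rough idea of what the term-trend and first-identified outputs contain, here is a minimal Python sketch over hand-made rows. It assumes each stored item also carries a publication date (the `item_dates` mapping), which the description above does not state; the sample rows and the function names `term_trend` and `first_seen_report` are hypothetical.

```python
from collections import Counter

# Hypothetical rows shaped like the term dictionary and transaction table.
term_dictionary = {"mining": "2012-02-20", "olap": "2012-02-21"}
item_dates = {"a1": "2012-02-20", "a2": "2012-02-21", "a3": "2012-02-21"}
transactions = [("a1", "mining"), ("a2", "mining"),
                ("a2", "olap"), ("a3", "mining")]

def term_trend(transactions, item_dates):
    """Count how many items mention each term on each date."""
    counts = Counter()
    for item_id, term in transactions:
        counts[(term, item_dates[item_id])] += 1
    return counts

def first_seen_report(term_dictionary):
    """List (term, first_seen) pairs, oldest first, ties broken by term."""
    return sorted(term_dictionary.items(), key=lambda kv: (kv[1], kv[0]))

trend = term_trend(transactions, item_dates)
report = first_seen_report(term_dictionary)
```

A cube would slice the same counts by more dimensions (feed, author, week), but the underlying measure is this per-term, per-date item count.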
