Intelligently Detect Duplicate Content Using PHP
I have built a web scraper that takes a website or RSS feed, parses its
contents, extracts the relevant information and saves it into a database.
This is a personal experiment with no real purpose beyond seeing how
advanced an intelligent, anonymous web scraper I can build. Afterwards I
will open source the code for others to learn from.
The problem is that I am currently scraping 3 news websites. When it comes
to breaking news, there is a high chance (especially if it's a big story)
that all 3 websites will write their own interpretation of it, but
ultimately it's the same news.
I have been trying to come up with a solution that can detect, as best as
it can, when an incoming article covers a story that has already been
imported from another news website, and perhaps associate the links with
that story (other sites also wrote about this: link1, link2).
Is there a tried and tested way of detecting whether two or more pieces of
content are effectively the same? I've written some pseudo-code, but
unfortunately I'm not a strong enough developer to turn it into
something that works.
Here is my thinking:
1. A link to a website is parsed
2. Generic words are stripped out and keywords left in (company names,
   countries, etc.)
3. The remaining words are then counted and a score is calculated
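A rough sketch of those steps in PHP, to make the idea concrete. It assumes the article text has already been extracted from the page. The function names (keywordCounts, cosineSimilarity), the tiny stop-word list and the choice of cosine similarity as the "score" are all assumptions made for illustration, not a tried and tested solution.

```php
<?php
// Hypothetical helpers; the stop-word list below is a tiny illustrative sample.

/** Turn article text into a keyword => count map (a possible "snapshot"). */
function keywordCounts(string $text, array $stopWords): array
{
    // Lowercase (ASCII only), then split on anything that is not a letter or digit.
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    // Drop the generic words and count what is left.
    return array_count_values(array_diff($words, $stopWords));
}

/** Cosine similarity between two keyword => count maps, from 0.0 to 1.0. */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $sumA = 0.0;
    $sumB = 0.0;
    foreach ($a as $word => $count) {
        $sumA += $count * $count;
        if (isset($b[$word])) {
            $dot += $count * $b[$word]; // only shared keywords add to the score
        }
    }
    foreach ($b as $count) {
        $sumB += $count * $count;
    }
    if ($sumA == 0.0 || $sumB == 0.0) {
        return 0.0; // an article with no keywords matches nothing
    }
    return $dot / (sqrt($sumA) * sqrt($sumB));
}

$stopWords = ['the', 'a', 'an', 'in', 'of', 'to', 'and', 'has', 'is'];
$first  = keywordCounts('Apple has opened a new store in Paris', $stopWords);
$second = keywordCounts('A new Apple store is opening in Paris', $stopWords);

printf("%.2f\n", cosineSimilarity($first, $second)); // prints 0.80
```

The keyword map could be stored next to each article (as JSON, say) so that new articles are only compared against stored maps. The score at which two articles count as the same story would have to be tuned by hand against real data.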
That's where my thinking hits a roadblock. How do I efficiently create a
snapshot of a page and then compare it to the content I've already imported
into my database? That is how I think it needs to be done.
Perhaps I am over-thinking this and I merely need to check if articles
have similar titles?
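For comparison, the title-only check could be sketched as below. It uses PHP's built-in similar_text(); the helper names (normaliseTitle, titleSimilarity) are hypothetical. It would probably miss the same story reported under differently worded headlines, so it may work better as a cheap first pass than as the whole answer.

```php
<?php
// Hypothetical helpers for a title-only duplicate check.

/** Normalise a headline: lowercase, strip punctuation, collapse whitespace. */
function normaliseTitle(string $title): string
{
    $title = strtolower($title);
    // Replace anything that is not a letter, digit or whitespace with a space.
    $title = preg_replace('/[^\p{L}\p{N}\s]+/u', ' ', $title) ?? $title;
    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/u', ' ', $title) ?? $title);
}

/** Similarity of two headlines as a percentage (0 to 100), via similar_text(). */
function titleSimilarity(string $a, string $b): float
{
    $percent = 0.0;
    similar_text(normaliseTitle($a), normaliseTitle($b), $percent);
    return (float) $percent;
}

$score = titleSimilarity('Apple opens Paris store!', 'Apple opens new store in Paris');
echo $score >= 70.0 ? "possible duplicate\n" : "probably different\n";
```

The 70 percent cut-off is an arbitrary assumption and would need tuning. similar_text() can also be slow on long strings, which matters little for headlines but would for full articles.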