Ticket #653 (new enhancement)

Opened 1 year ago

Last modified 5 months ago

trackback spam filter is too conservative

Reported by: russ Assigned to:
Priority: critical Milestone:
Component: ambra Version: 0.8.2-SNAPSHOT
Keywords: trackback Cc:
Blocking: Blocked By:

Description

works:

  • /article/info%3Adoi%2F10.1371%2Fjournal.pone.0000873

doesn't work but should (IMO)

  • /article/info:doi/10.1371/journal.pone.0000873
    • it's a valid URL that actually resolves to the article on topaz
  • /article/fetchArticle.action?articleURI=info%3Adoi%2F10.1371%2Fjournal.pone.0000873
  • /plosone-doi-resolver/10.1371%2Fjournal.pone.0000873
  • http://<your doi resolver host>/10.1371%2Fjournal.pone.0000873

Change History

09/20/07 17:35:12 changed by amit

  • owner changed from jsuttor to russ.

Not sure what is being asked here. Each article exposes a trackback link which it expects outside parties to link to. Are you asking for multiple links to be exposed by an article, or are you expecting that we should also accept trackbacks as valid if they make use of other links (which we have not published)? Afraid my knowledge is very limited here so please walk me through.

09/21/07 09:51:06 changed by russ

  • owner changed from russ to amit.

the trackback system has a spam filter. when the trackback is submitted, topaz scrapes the referring post and rejects the trackback if it can't find the article URL in the referring post.

however, the only article URL that the spam filter accepts is the canonical, unfortunately encoded, URL.

ideally, trackback spam filter should accept all valid URLs that resolve to the article - including unencoded, fetchArticle.action, and doi resolver versions.
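
One way to read "all valid URLs that resolve to the article" is to reduce every link to the DOI it points at and compare DOIs instead of raw strings. The following is a minimal sketch in Python, not the Topaz code: the function names are made up for illustration, and treating any path of the form /<doi> as a DOI resolver link is an assumption.

```python
import re
from urllib.parse import parse_qs, unquote, urlsplit

INFO_DOI = "info:doi/"
# Assumption: a DOI resolver serves articles at /<doi>, e.g. /10.1371/journal...
BARE_DOI = re.compile(r"^/(10\.\d+/.+)$")


def doi_from_url(url):
    """Return the DOI an article URL refers to, or None if no known form matches."""
    parts = urlsplit(url)
    # Undo %3A and %2F so the encoded and unencoded spellings look alike.
    path = unquote(parts.path)

    # /article/fetchArticle.action?articleURI=info:doi/<doi>
    if path == "/article/fetchArticle.action":
        uri = parse_qs(parts.query).get("articleURI", [""])[0]
        return uri[len(INFO_DOI):] if uri.startswith(INFO_DOI) else None

    # /article/info:doi/<doi> and /plosone-doi-resolver/<doi>
    for prefix in ("/article/" + INFO_DOI, "/plosone-doi-resolver/"):
        if path.startswith(prefix):
            return path[len(prefix):]

    # http://<resolver host>/<doi>
    bare = BARE_DOI.match(path)
    return bare.group(1) if bare else None


def links_to_article(url, doi):
    """True if the URL resolves to the article with the given DOI."""
    return doi_from_url(url) == doi
```

With this, all of the URL forms listed in the description map to 10.1371/journal.pone.0000873, while an unrelated link maps to None and is rejected.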

09/21/07 13:32:41 changed by amit

  • owner changed from amit to russ.

This was a deliberate choice. Because the location (and hence the URL) of the article can shift around, we don't want trackbacks to have broken links. The canonical URL guarantees that the link to the article will be valid. Any other URL can and will change, which means there are now multiple places where the rules need to be synchronized.

The entire purpose of the DOI resolver was to allow flexibility in terms of URLs. Is there a good business reason to provide support for other URLs?

09/23/07 14:04:41 changed by russ

i don't really understand what you mean by the location of an article can shift around. can you give me an example?

the point of the trackback spam filter is to avoid spam, not to avoid blog posts with broken links. it's our (plos's) responsibility to make sure old article urls are forwarded correctly - topaz is allowed to be in flux, as early adopters we pay a price in apache redirects.

the good business reason is like this:

right now, no one in the blog community is able to successfully create a trackback on plosone because the rules are too stringent. i mean no one. we've got some from bora and that's it.

thus, a feature that we added to curry favor is actually making people hate us more.

this is happening for two reasons.

  • topaz made a really bad choice for the new article URL, it's longer than it needs to be and it's not human readable.
  • the trackback spam filter rejects a variety of perfectly good article URLs.

to actually be useful, and not cause more problems than it solves, the trackback spam filter needs to accept a variety of valid URLs for an article.

perhaps there should be a section of the config file where regexes can be set up.

if that's really hard, another option would be to remove the trackback spam filter and provide a facility to moderate new trackbacks - ideally proactively, although reactively is fine too for a start.

09/23/07 14:05:12 changed by russ

  • owner changed from russ to amit.

09/25/07 15:55:32 changed by amit

  • milestone set to 0.81.

09/25/07 17:48:42 changed by amit

  • owner changed from amit to russ.

Here is a rough analysis of what will need to be done (and hence might not make it for 0.8.1):

  • We are utilizing the Apache LinkBackExtractor utility to check if the refererURL has the article URL in it. We will have to modify the Apache code to extract all links from within the page.
  • Match each link against a list of regular expressions
  • On a match, let the page go through
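
The three steps above can be sketched roughly as follows. This is Python for illustration only and does not use or modify LinkBackExtractor; the pattern templates, the {doi} placeholder and the function names are hypothetical, standing in for whatever a config-file section of regexes would hold.

```python
import re
from html.parser import HTMLParser

# Regex fragments accepting either the raw or the percent-encoded character.
COLON = "(?::|%3[Aa])"
SLASH = "(?:/|%2[Ff])"

# Hypothetical templates; {doi} is replaced by a fragment matching the DOI.
PATTERN_TEMPLATES = [
    "/article/info" + COLON + "doi" + SLASH + "{doi}$",
    r"/article/fetchArticle\.action\?articleURI=info" + COLON + "doi" + SLASH + "{doi}$",
    "/plosone-doi-resolver/{doi}$",
]


class LinkCollector(HTMLParser):
    """Step 1: collect the href of every <a> tag in the referring page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def compile_patterns(doi):
    """Build the regex list for one article from the templates."""
    doi_fragment = SLASH.join(re.escape(part) for part in doi.split("/"))
    return [re.compile(t.replace("{doi}", doi_fragment)) for t in PATTERN_TEMPLATES]


def trackback_allowed(page_html, doi):
    """Steps 2 and 3: let the page through if any link matches any pattern."""
    collector = LinkCollector()
    collector.feed(page_html)
    patterns = compile_patterns(doi)
    return any(p.search(link) for link in collector.links for p in patterns)
```

Only real links count here: a page that mentions the DOI in plain text but links elsewhere is still rejected, which keeps the filter doing its anti-spam job.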

Given the list of things that might need to be done for 0.8.1, this just might not make it for the next release. Russ, handing it off to you just to make sure you understand.

09/25/07 22:31:48 changed by amit

After looking at what else needs to be done, suspect it will be difficult to get this done for 0.8.1. Leaving it here just in case we have time to deal with it, but doubt it.

09/26/07 13:59:53 changed by russ

  • owner changed from russ to amit.

in that case, can we please turn it off altogether for 0.81?

09/26/07 14:05:47 changed by amit

  • owner changed from amit to russ.

Sure, but that would mean spam getting in, and you would have to go in by hand and delete it via ITql scripts, as we don't have any provision right now to delete trackbacks. Please confirm with Mark P. and let us know.

09/26/07 17:44:43 changed by russ

  • owner changed from russ to rich.

09/26/07 17:46:35 changed by russ

rich's call perhaps after talking with bora. we'll have reporting in by .8.1 that bora can use to go through the new trackbacks pretty easily, and i think i can write a script to kill a specific annotation pretty easily, so i'm personally in favor of manual moderation.

thanks!

09/26/07 17:48:14 changed by russ

note that we haven't seen any actual spam trackbacks in the logs yet - only rejects of trackbacks that should be considered valid.

of course, once a spammer finds an open target you start getting a lot, so it could be a real issue at some point.

09/27/07 14:48:39 changed by rich

  • owner changed from rich to amit.

Even though we haven't received any spam, we shouldn't remove the spam filter. This is asking for trouble. And since we have no notification system, it will be difficult to find spam trackbacks. Bora reviewed this ticket and said that it will be "horrible" to remove the spam filter.

Since trackbacks are working (albeit not with the best filter), let's go status quo until we have the new reg exps in 0.81 (or .82).

09/27/07 16:52:31 changed by amit

I agree. Removing the spam filter will be "horrible". Depending on how 0.8-rc3 and 0.8.1-SNAPSHOT fare, we will try and get this in 0.8.1. If not, it will be 0.8.2.

09/29/07 10:02:12 changed by amit

  • milestone deleted.

Moving it out of 0.8.1.

10/29/07 20:56:03 changed by amit

  • owner changed from amit to russ.
  • version changed from 0.8 to 0.8.2-SNAPSHOT.

11/08/07 17:46:39 changed by rich

  • owner changed from russ to alex.
  • priority changed from high to medium.

02/25/08 12:30:48 changed by alex

  • status changed from new to assigned.

06/19/08 15:20:14 changed by amit

  • status changed from assigned to new.
  • blockedby changed.
  • milestone set to 0.9.0.
  • owner changed from alex to pradeep.
  • type changed from defect to enhancement.
  • blocking changed.

Pradeep, let's chat about this as I think one option is to pass the scraped URL via the URL rewrite filter before checking for validity.
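
As a rough illustration of that option, the sketch below runs a scraped path through an ordered list of rewrite rules and then re-encodes it, so that every accepted form ends up as the canonical URL before the comparison. It is Python with a hand-written rule table, not the actual URL rewrite filter or its configuration; the rules and the function names are assumptions.

```python
import re
from urllib.parse import quote, unquote

# Hypothetical rewrite rules: (pattern, replacement), first match wins.
REWRITE_RULES = [
    (re.compile(r"^/article/fetchArticle\.action\?articleURI=(.+)$"), r"/article/\1"),
    (re.compile(r"^/plosone-doi-resolver/(.+)$"), r"/article/info:doi/\1"),
]

ARTICLE_PREFIX = "/article/"


def rewrite(path):
    """Apply the first matching rewrite rule, or return the path unchanged."""
    for pattern, replacement in REWRITE_RULES:
        if pattern.match(path):
            return pattern.sub(replacement, path)
    return path


def canonical_path(path):
    """Rewrite a scraped path, then percent-encode the article URI part."""
    rewritten = rewrite(path)
    if not rewritten.startswith(ARTICLE_PREFIX):
        return rewritten
    article_uri = unquote(rewritten[len(ARTICLE_PREFIX):])
    # safe="" makes quote() encode ":" and "/" as %3A and %2F.
    return ARTICLE_PREFIX + quote(article_uri, safe="")
```

The validity check then stays a plain equality test against the one canonical URL, and the set of accepted forms lives in the rule table.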

06/19/08 18:46:53 changed by amit

  • owner changed from pradeep to amit.

This has become much more complicated because of the urlrewrite capabilities. I have sent an email to the urlrewrite developers. Taking it off Pradeep's plate.

06/23/08 15:04:43 changed by amit

  • priority changed from medium to low.

06/23/08 15:41:53 changed by amit

  • owner deleted.
  • priority changed from low to critical.
  • milestone deleted.