Ticket #618 (new enhancement)

Opened 1 year ago

Last modified 3 months ago

make article URL human friendly - no escaped characters and as short as possible

Reported by: russ Assigned to: rich
Priority: low Milestone:
Component: ambra Version: 0.8
Keywords: Cc:
Blocking: Blocked By:

Description

we have a nice, new, shorter article URL in 0.8:

http://plosone-dev.plos.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0000443

can you tell me why the slashes are escaped?

slashes must be escaped in query strings. however, unescaped slashes would be perfectly valid in the new article URL. although we suggest a filesystem hierarchy by using slashes, it's not at all necessary for slashes in a URL to correspond with a filesystem.

one point of having a short url is to make life easier for humans. in our feedback queue, people complain as much about the escaped characters as they do about the length of the URL.
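For reference, a minimal sketch of how the escaped form arises. It assumes the article DOI is info:doi/10.1371/journal.pone.0000443 (read off the URL above) and uses Python's standard urllib.parse purely for illustration; it is not the code ambra uses.

```python
from urllib.parse import quote, unquote

# DOI assumed from the URL quoted in the description.
doi = "info:doi/10.1371/journal.pone.0000443"

# Escaping every reserved character yields the form seen in the 0.8 URL.
escaped = quote(doi, safe="")

# Leaving ":" and "/" unescaped keeps the DOI readable inside the path.
readable = quote(doi, safe="/:")

print("/article/" + escaped)
print("/article/" + readable)

# Decoding the escaped form recovers the original DOI.
print(unquote(escaped) == doi)
```

The only difference between the two printed paths is whether ":" and "/" are percent-encoded.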

Change History

08/28/07 10:23:19 changed by amit

  • milestone set to 0.8.

(follow-up: ↓ 4 ) 08/29/07 14:27:02 changed by jsuttor

  • status changed from new to closed.
  • resolution set to fixed.

the URI schema is /article/doi. our DOIs contain: info:doi/.../... the ":" and "/" are reserved chars and must be encoded in a URI path.

also, with DOIs, if they were modeled on hierarchical paths, not all logic (think clients) would know when to stop the DOI and let the rest of the URI begin.

when putting the DOI in a URI, it's almost like encoding a URI in a URI.
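A small sketch of the boundary problem described here. The trailing "comments" segment is hypothetical, added only to show what happens when something follows the DOI in the path.

```python
from urllib.parse import unquote


def path_segments(path):
    """Split a URI path on literal slashes, then decode each segment."""
    return [unquote(seg) for seg in path.strip("/").split("/")]


# "comments" is an invented trailing segment, not an actual ambra URL.
encoded = "/article/info%3Adoi%2F10.1371%2Fjournal.pone.0000443/comments"
plain = "/article/info:doi/10.1371/journal.pone.0000443/comments"

# Encoded: the DOI survives as a single segment.
print(path_segments(encoded))

# Unencoded: the DOI is spread over three segments, so a client needs
# extra knowledge to tell where the DOI stops and "comments" begins.
print(path_segments(plain))
```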

shouldn't people be using a DOI resolver anyway? :)

08/29/07 16:47:09 changed by russ

  • status changed from closed to reopened.
  • resolution deleted.

no, that's not entirely true. : is reserved in all URIs but / is not.

from http://www.w3.org/Addressing/URL/4_URI_Recommentations.html:

The slash ("/", ASCII 2F hex) character is reserved for the delimiting of substrings whose relationship is hierarchical. This enables partial forms of the URI. Substrings consisting of single or double dots ("." or "..") are similarly reserved.

The significance of the slash between two segments is that the segment of the path to the left is more significant than the segment of the path to the right. ("Significance" in this case refers solely to closeness to the root of the hierarchical structure and makes no value judgement!) 

in fact, the slash does delimit a hierarchical relationship within the DOI, so it's perfectly correct to use it un-encoded in a URI.

of course, if you're using it in the query string of a URL it still needs to be encoded.

(in reply to: ↑ 2 ) 08/29/07 16:50:52 changed by russ

Replying to jsuttor:

shouldn't people be using a DOI resolver anyway? :)

not IMO. my understanding is that the DOI system serves a couple of purposes:

  1. it makes it easy to find an article if you just have the DOI
  2. it allows dois to be remapped if a particular host goes away (eg, if plos folds crossref can remap our DOIs to pubmed)

it might be nice if plos were to use the doi resolver as its canonical URL, but i don't think it's necessary and we have a lot of incentives not to:

  • branding
  • the historical instability of the crossref doi servers
  • the historical instability of the plos doi resolver (in the bad mulgara era, mulgara would go down and hose the doi resolver, but plosone would still serve cached pages)

08/29/07 17:02:09 changed by russ

hmm.

i see your point though - the DOI is a URI, and it's being stuck into another URI (the URL)

the encoding helps note where the nesting begins.

maybe we don't need the whole DOI in the URL?

09/01/07 23:17:52 changed by amit

  • owner changed from jsuttor to russ.
  • status changed from reopened to new.

No. I am afraid we do. Making assumptions about the DOI can potentially cause logic problems when things are mixed in. Maybe something to think about when PLoS assigns new DOIs...:)

09/04/07 21:58:41 changed by russ

  • owner changed from russ to jsuttor.

i totally disagree.

a URL should be human-readable. it should be as short as possible and without any weird, hard-to-remember characters.

between the hostname and the path (www.plosone.org/article) all that's necessary to construct an article doi is the final numeric prefix.

for the general, non-plos case, configuration options can be used to tell the filter how to construct a doi from whatever comes after article, similar to how config options use regexes to determine which journal context we're in.
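A rough sketch of what such a configuration could look like. Everything here is hypothetical: the DOI_TEMPLATES table, the doi_from_short_url function, and the /article/0000443 URL shape are invented for illustration and are not existing ambra options.

```python
import re

# Hypothetical configuration: hostname pattern -> DOI template.
DOI_TEMPLATES = [
    (re.compile(r"(^|\.)plosone\.org$"),
     "info:doi/10.1371/journal.pone.{suffix}"),
]


def doi_from_short_url(host, path):
    """Return the full DOI for a short /article/<digits> path, or None."""
    match = re.fullmatch(r"/article/(\d+)", path)
    if match is None:
        return None
    for host_pattern, template in DOI_TEMPLATES:
        # The hostname selects the journal context, as the regex-based
        # journal configuration mentioned above does.
        if host_pattern.search(host):
            return template.format(suffix=match.group(1))
    return None


print(doi_from_short_url("www.plosone.org", "/article/0000443"))
```

A host that matches no configured pattern returns None, so the filter could fall back to the existing full-DOI URL handling.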

i'm not clear which things could cause what logic problems when mixed in. could you provide an example?

if the point of changing the article URLs was to solve the long URL problem for humans, then topaz hasn't succeeded yet (and i think this is a problem that should be solved).

however, there could be some other good reasons for this that i'm not aware of. please let me know!

since we haven't made this release live yet, now would be a great time to reconsider...

09/04/07 23:39:30 changed by amit

  • owner changed from jsuttor to russ.

I agree on the URL being readable, but I am not sure that using the suffix (I am presuming you meant the suffix and not the prefix) actually makes it any more readable or easy to remember. The mixed case I was talking about is serving objects under other DOI prefixes through the same platform, in which case the numeric suffix is not sufficient. Of course, that can be mitigated by providing another path element which maps to the prefix via a map, but the basic point is that our current gamut of use cases is too limited for us to make assumptions about the DOI.
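The mitigation mentioned above, an extra path element that selects the DOI prefix, could be sketched as follows. The PREFIX_MAP keys, the second prefix, and the /article/<key>/<suffix> URL shape are all invented for illustration.

```python
# Hypothetical map from a short path element to a full DOI prefix.
PREFIX_MAP = {
    "pone": "info:doi/10.1371/journal.pone.",
    "other": "info:doi/10.9999/example.",  # invented prefix, not a real one
}


def doi_from_path(path):
    """Resolve /article/<key>/<suffix> to a full DOI, or None if unknown."""
    parts = path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "article":
        return None
    prefix = PREFIX_MAP.get(parts[1])
    if prefix is None:
        return None
    return prefix + parts[2]


print(doi_from_path("/article/pone/0000443"))
print(doi_from_path("/article/other/42"))
```

The suffix alone (for example /article/0000443) is rejected here, which reflects the point that it is not sufficient once more than one prefix is served.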

When we actually have time to expose the REST api, this can be revisited.

09/05/07 15:01:53 changed by russ

  • owner changed from russ to amit.
  • summary changed from why do we have escaped slashes in new-style article URL? to make article URL human friendly - no escaped characters and as short as possible.
  • type changed from clarification to enhancement.
  • milestone deleted.

sounds good (and yes i meant suffix :).

using suffix alone gets rid of the encoded characters which i think is the minimum for a human-friendly URL - so really you could just remove the 'info:' and stop encoding the slashes.

however, i think the shortest URL possible is the best.

removing milestone and updating parameters.

10/29/07 20:58:45 changed by amit

  • owner changed from amit to russ.

11/08/07 17:46:09 changed by rich

  • owner changed from russ to alex.

06/19/08 15:15:35 changed by amit

  • status changed from new to closed.
  • resolution set to fixed.
  • blocking changed.
  • blockedby changed.

With the integration of urlrewrite you can map any number of URLs
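The ticket does not record which rules, if any, were configured. As a language-neutral illustration of what an inbound rewrite rule of this kind does, here is a Python sketch; it is not urlrewrite configuration syntax, and the friendly /article/pone.0000443 shape and the RULE pattern are assumptions.

```python
import re
from urllib.parse import quote

# Hypothetical rule: friendly inbound path -> existing escaped article path.
RULE = re.compile(r"^/article/(pone\.\d+)$")


def rewrite(path):
    """Map a friendly article path to the escaped form; pass others through."""
    match = RULE.match(path)
    if match is None:
        return path
    doi = "info:doi/10.1371/journal." + match.group(1)
    return "/article/" + quote(doi, safe="")


print(rewrite("/article/pone.0000443"))
print(rewrite("/home"))
```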

07/28/08 16:16:26 changed by pradeep

(In [6245]) Rename files with 'Plosone' in its name. Addresses #618.

09/11/08 14:01:26 changed by russ

  • status changed from closed to reopened.
  • resolution deleted.

please provide some documentation for this fix.

it's also unclear if you are proposing we use urlrewrite in the default case to remap this url, in which case the ticket should remain open until the job is complete, or if you are suggesting this as a post-install configuration step (in which case, again, documentation please).

09/11/08 14:01:34 changed by russ

  • owner changed from alex to amit.
  • status changed from reopened to new.

09/11/08 15:39:11 changed by russ

  • owner changed from amit to rich.

i'm also not convinced this fix is good. for example, if we use global url rewriting to change the display of urls on the site, people will be unable to post trackbacks since the canonical url will not be the same as the URL displayed.

it's also not enough to provide an alias that can be used outside of the site in emails, etc. the canonical URL must change to something less strange and unreadable.

note that both machines and humans find this URL hard to read - special configuration is required in apache to support URLs with '%' in them, and some google analytics features fail due to our URLs. it's just bad practice to have canonical URLs with embedded URIs.