Recently, while working on Findsi, I created a database design that included a table that basically listed all url's in the system. What seemed like a simple table turned out pretty hard to create. I specifically required my table to hold a unique list of url's, but did not want to use the url itself as the primary key. So i Created something like this:
- Id : guid : newid(); < PK
- Url : varchar(max); < DESIRED UNIQUE INDEX
Obviously setting the primary key to the Id column was not the problem here. Creating a Unique Index on the Url caused an error stating that only 900 bytes were allowed in each index, what meant either storing only url's shorter than 900 bytes - not an option as i do not control the input of the system - or finding another solution.
After some serious Googling i concluded two things. First, i was definately not the only one having this issue and secondly, people were not really willing to comment on the subject. All proposed solutions consisted of in someway hashing the url, and setting the constraint on a column containing the resulting hash. At first i believed this would solve my problem, but typically hash algorythms use the first n bytes for their encryption. Thus, using an 128-bit MD5 hash on two urls with a length longer than 128 (actually 126 i believe, but i'm not sure) chars, having these 1st 128 chars equally filled, would result in the same hash.
Eventually I got some unexpected help from a University Professor who specializes in Information Retrieval. He assureld me I was not the only one facing this issue and suggested I created two hashes (in code) for each url. One for the first n bytes, the other for the last n bytes. The risk of colissions, using the combined hash result, so he predicted, was extremely low (though not impossible).
- StartHash + EndHash: char(256); < PK
- Url : varchar(max);
So far, it has worked like a charm for me!
42054b03-f8dc-486b-b2b7-e49cc4cb8c3f|3|5.0