Saturday, June 12, 2010

Datasheet sites holding millions of webpages

Earlier this week, I tried to upload the source code of a web application to a remote server in mainland China that serves customers there.

The original web application was developed by one of my colleagues. It stores nearly one million datasheets and information pages for electronic components, in PDF or HTML format. The application uses the "HTTP 404 File Not Found" error handler in Microsoft Internet Information Services (IIS) to map the requested URL to a file inside a large tree-like directory.

Generally speaking, the most straightforward approach is to place all the files in a single directory configured as the document root of the IIS web server. However, the file system becomes less responsive as the number of files grows. In an informal test, we found that the file system added a delay of about one minute once the number of files in a single folder reached one million.

For us, a static page that takes a minute to load from a web server into a client application (e.g. a web browser) is far too slow.

To reduce the number of files within a folder, the files should be redistributed into a tree-like folder structure, i.e. folders with sub-folders. In our case, we created 36 folders, each containing 36 sub-folders, so the final layer has 36 x 36 = 1296 folders. The number of files in a single folder thus drops from one million to below 1000. Now the file system responds too quickly for a human to notice any delay.
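The arithmetic above can be checked with a short Python sketch. It assumes the 36 folder names are the lowercase letters a-z plus the digits 0-9 (the post only gives the count), and that files spread evenly across folders, which real filenames probably do not.

```python
import string

# Assumption: the 36 folder names are a-z and 0-9; the post only says "36".
ALPHABET = string.ascii_lowercase + string.digits

TOTAL_FILES = 1_000_000  # "nearly one million" files, rounded up

# Two levels of 36 folders each give 36 x 36 leaf folders.
leaf_folders = len(ALPHABET) ** 2

# Average files per leaf folder, assuming a perfectly even spread.
avg_files_per_folder = TOTAL_FILES / leaf_folders

print(leaf_folders)
print(round(avg_files_per_folder))
```

Under an even spread the average stays below 1000 per folder. Filenames that cluster on a few common prefixes would push some folders well above that average.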

To stick to the original naming conventions, the original URLs of the files have been kept. Users don't need to remember folder names when typing a URL directly into a web browser, and the original client software can still use its existing database to connect to the server.

Now we have to consider how to map a URL to the actual file system. Whether you store the mapping in a database table or calculate it with a formula, and whether you use IIS, Apache, lighttpd, etc., most web servers require you to code the "HTTP 404 File Not Found" handler to establish the mapping. (Of course, there are always exceptions. For instance, if you use the Python Django framework, the mapping is defined elsewhere.)

In our case, we don't want to maintain an extra database server, so we simply store the files in the tree-like directories according to the first two letters of the filename. For example, if the file is index.pdf, the first two letters are "i" and "n", so the file is stored as ./i/n/index.pdf under the document root. If the server is www.example.com and the document root is c:\www, the URL-to-filesystem mapping is: http://www.example.com/index.pdf maps to c:\www\i\n\index.pdf.
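As a rough illustration of this formula-based mapping, here is a minimal Python sketch. It is not the code that runs in our IIS 404 handler. The function name url_to_path is hypothetical, and lowercasing the two folder letters is an assumption. How to handle filenames whose first two characters are not letters or digits is not covered here.

```python
import ntpath  # Windows-style path joining, works on any platform
from urllib.parse import urlparse


def url_to_path(url, doc_root=r"c:\www"):
    """Map a flat URL to its location in the two-level folder tree.

    Hypothetical helper: the name and the lowercasing rule are assumptions.
    """
    # Take the last path segment of the URL as the filename.
    filename = urlparse(url).path.rsplit("/", 1)[-1]
    if len(filename) < 2:
        raise ValueError("filename needs at least two characters: %r" % filename)

    # The first two characters choose the folder and the sub-folder.
    first, second = filename[0].lower(), filename[1].lower()
    return ntpath.join(doc_root, first, second, filename)


print(url_to_path("http://www.example.com/index.pdf"))
```

A 404 handler could call a function like this with the requested URL and return the file at the resulting path if it exists, or a genuine "not found" response otherwise.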

For details on how to configure an IIS web server to control HTTP 404 File Not Found handling, please refer to the following article:
Create 404 Custom Error for IIS 5.0
