Saturday, June 26, 2010

Instability due to DNS update and code update

Currently, I am rearranging the source code of most of the sites under the buss.hk domain. I tried to use a masked domain name that maps different sub-domains to a single server. Since it takes time for the DNS information to propagate, some of my websites seem to have stopped working and have shown a white screen for a long time.

It's time to review the DNS strategy. I think I would rather map a masked domain to a dynamic domain using CNAME. Then the DNS response should become faster, and in the future I should avoid adding subdomains directly in the DNS server.

Bye for now, happy programming,
Cloudgen

Saturday, June 12, 2010

The difference between IIS 5.0 and IIS 5.1

As I mentioned in the previous blog article, I've set up a web application for handling nearly one million PDF files and web pages. Here comes the second part of the story:


The company that I am working for has a remote office in mainland China. Currently, most of the server-grade computers there are being used for other purposes and there is no spare server for now. I searched my company's current network diagram and finally located a Windows 2000 server which has been used for hosting some domains. I can use this server to host the web application mentioned earlier.


As some of you may know, the Internet Information Server (IIS) in Windows 2000 cannot be upgraded to any other version. It has to be IIS 5.0 (i.e. the version number is 5.0).


After I had found the Windows 2000 server, I started by adding a new virtual host to it. Then I mapped the .htm extension to be handled by asp.dll for testing, and I changed the current HTTP 404 File Not Found handler to a URL located at /404.asp.


Then we started testing our configuration. We typed a URL that didn't exist on the web server, for instance test.pdf. The web browser received the content generated by the 404.asp file, which was what we expected. However, when we typed another URL, say test.htm, the web browser did not receive the expected content generated by the 404.asp file. Instead, it first received an HTTP 302 redirection to http://www.example.com/404.asp?404,http://www.example.com/test.htm, which was not what we expected.
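To make the redirect target above a bit more concrete, here is a minimal sketch in Python (not the ASP code we actually used) of how a 404 handler could recover the originally requested URL from that query string. The function name parse_404_query is made up for illustration. The separator is a comma as written in this article, but other write-ups show a semicolon there, so the sketch accepts both as an assumption.

```python
import re


def parse_404_query(query_string):
    """Split a string such as '404,http://host/page' into (status, url)."""
    # Assumption: a three-digit status, one separator (',' or ';'),
    # then the URL that the client originally asked for.
    match = re.fullmatch(r"(\d{3})[,;](.+)", query_string)
    if match is None:
        raise ValueError(f"unexpected 404 handler query string: {query_string!r}")
    return int(match.group(1)), match.group(2)


status, original_url = parse_404_query("404,http://www.example.com/test.htm")
print(status, original_url)
```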

We suspected that this phenomenon came from some special feature of IIS 5.0. In order to find out more, we did a comprehensive Google search. However, none of the articles described this phenomenon, until I found an old article about IIS 4.0 describing a similar situation. We confirmed this behavior by setting up another Windows 2000 server.

From our point of view, we wanted to avoid the HTTP 302 because most of the client software written specifically for the website cannot handle an HTTP 302 redirect. However, at the time of writing this article, we didn't have a spare machine with Windows Server 2003 installed.

To cope with this problem, I chose another PC which had Windows XP Professional installed. As you may know, Windows XP Professional only allows 10 concurrent connections and should not be used as a professional web application server.

I installed Lighttpd, which is one of my favorite web servers, on the Windows XP Professional machine together with PHP 5. Then I started writing a reverse proxy server which listens on port 80 and feeds the requests to IIS, which was listening on port 8080.
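The idea of the reverse proxy can be sketched in a few lines. This is not the PHP code running under Lighttpd; it is a rough stand-alone illustration using only Python's standard library, and the names make_handler and serve, as well as the 127.0.0.1 upstream host, are assumptions for the example. It only handles GET requests. Because urlopen follows redirects by itself, a 302 coming from the upstream server is absorbed by the proxy and the client only sees the final content.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.error import HTTPError
from urllib.request import urlopen


def make_handler(upstream):
    """Build a request handler that forwards GET requests to `upstream`."""

    class ReverseProxyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # self.path holds the path and query string exactly as requested.
            try:
                with urlopen(upstream + self.path, timeout=30) as resp:
                    status, headers, body = resp.status, resp.headers, resp.read()
            except HTTPError as err:
                # Error responses (404, 500, ...) are passed through as they are.
                status, headers, body = err.code, err.headers, err.read()
            self.send_response(status)
            self.send_header(
                "Content-Type",
                headers.get("Content-Type", "application/octet-stream"),
            )
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return ReverseProxyHandler


def serve(listen_port=80, upstream="http://127.0.0.1:8080"):
    """Listen on `listen_port` and feed every request to `upstream`."""
    server = ThreadingHTTPServer(("", listen_port), make_handler(upstream))
    server.serve_forever()
```

Calling serve() starts the proxy on port 80 with the upstream on port 8080, the same port layout as described above.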

I kept all the code in ASP because I didn't want to rewrite everything.

After I had finished the code of the reverse proxy server, which is written in PHP and runs under Lighttpd, I started my beta test to see if this combination would work.

It was good news and bad news at the same time. IIS 5.1 shows quite different behavior from IIS 5.0. There was no more HTTP 302 redirection when handling dynamic pages. That means I don't need the reverse proxy any more under IIS 5.1.

In this case, IIS 5.1 is working exactly the same as IIS 6.0.

So, I learned a lesson this time: never trust the version numbers.

Happy Programming,
Cloudgen

Datasheet sites holding millions of webpages

At the beginning of this week, I tried to upload the source code of a web application to a remote server located in mainland China, serving the customers there.

The original web application was developed by one of my colleagues. The web application stores nearly one million datasheets and information pages about electronic components in PDF or HTML format. It makes use of the "HTTP 404 File Not Found" error handler in Microsoft Internet Information Server (IIS) to map the requested URL to a file inside a big tree-like directory.

Generally speaking, it would be quite straightforward to simply place all the files under a single directory configured as the document root of the IIS web server. However, the file system becomes less responsive as the number of files grows larger and larger. In an informal test, we found that the file system added a delay of about 1 minute when the number of files in a single folder grew to one million.

For us, a static page that takes a minute to load from a web server into client software (e.g. a web browser) means the response is too slow.

In order to reduce the number of files within a folder, files should be relocated and distributed into a tree-like folder structure, i.e. folders with sub-folders. In our case, we created 36 folders, each of which holds 36 sub-folders. Thus in the final layer there are 36 x 36, a total of 1296 folders, and the number of files in a single folder drops from one million to below 1000. Now the response time of the file system is too fast to be noticed by a human.
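The arithmetic behind these numbers can be checked with a short Python sketch. The function name leaf_stats is made up for illustration, and the average assumes the files are spread evenly over the folders, which real filenames will only approximate.

```python
def leaf_stats(total_files, fanout=36, levels=2):
    """Return (number of leaf folders, average files per leaf folder)."""
    leaf_folders = fanout ** levels  # 36 folders x 36 sub-folders
    return leaf_folders, total_files / leaf_folders


folders, average = leaf_stats(1_000_000)
print(folders, round(average))  # prints: 1296 772
```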

In order to stick to the original naming conventions, the original URLs of the files have been kept. Users don't need to remember the folder names when they type a URL directly into a web browser, and the original client software can still use the existing database for connecting to the server.

Now we have to consider the mapping of URLs to the actual file system. No matter whether you use a database table to store the mapping or a formula to calculate it, and whether you use IIS, Apache, Lighttpd, etc., most web servers require you to code the "HTTP 404 File Not Found" handler to establish the mapping. (Of course, there are always exceptions. For instance, if you use the Python Django framework, the mapping is defined elsewhere.)

In our case, we didn't want to maintain an extra database server. So we simply store the files in the tree-like directories according to the first two letters of the filename. For example, if the file is index.pdf, the first two letters are "i" and "n", so the file will be stored as ./i/n/index.pdf under the document root. If the server is www.example.com and the document root is c:\www, the URL-to-filesystem mapping is: http://www.example.com/index.pdf maps to c:\www\i\n\index.pdf.
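The mapping rule can be written down as a small function. This is a Python sketch of the rule only, not the ASP handler itself, and the name url_to_path is made up for illustration. It assumes that the 36 folders are the letters a to z plus the digits 0 to 9, and that filenames are treated as lowercase.

```python
import string
from pathlib import PureWindowsPath
from urllib.parse import unquote, urlsplit

DOC_ROOT = PureWindowsPath(r"c:\www")  # document root from the example
FOLDER_NAMES = string.ascii_lowercase + string.digits  # assumed: 36 names


def url_to_path(url, doc_root=DOC_ROOT):
    """Map a flat URL to its place in the two-level folder tree."""
    # Keep only the last path segment: http://host/index.pdf -> index.pdf
    filename = unquote(urlsplit(url).path).rsplit("/", 1)[-1].lower()
    first_two = filename[:2]
    if len(first_two) < 2 or any(ch not in FOLDER_NAMES for ch in first_two):
        raise ValueError(f"cannot map {url!r} to a folder")
    return doc_root / first_two[0] / first_two[1] / filename


print(url_to_path("http://www.example.com/index.pdf"))
```

Since the folder is calculated from the filename alone, no lookup table or database is needed, which is the reason we picked this scheme.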

To see how to configure an IIS web server for further control of the HTTP 404 File Not Found handling, please refer to the following articles:
Create 404 Custom Error for IIS 5.0