Since Linux Fan and I became friends, I have learned a lot about Linux from
him; he taught me how to use the OS to run a business that provides
Internet services. As discussed in the text, Linux is a free computer
operating system created by tens of thousands of volunteer programmers
around the world. Developed under the GNU General Public License, the
source code for Linux is freely available to everyone. Because of its
robustness and secure features, its popularity among computer users has
gained momentum in recent years. Its use ranges from desktop applications
to movie animations to sophisticated distributed computing. With Linux,
you can do a task with zero software cost that could cost you thousands or
even millions of dollars if you did it in MS Windows. Moreover, your chance
of being infected by a virus is much smaller, as most Linux applications are
written to well-studied standards.
Some may argue that Linux is more difficult to use than MS Windows
and consequently, it is not worthwhile to spend a significant amount of time
to study Linux to save a few thousand dollars. It is unfortunate that many
people are not aware that by making the effort to learn something, you add
value to yourself; the more difficult the material, the more valuable it is.
There's not much difference in productivity between a worker who has
worked in a fast food restaurant for a month and one who has worked for ten
years. However, there's a huge difference for a corresponding pair working
on the design of microprocessors. It is true that it is very easy to install an
application in Windows; all you need to do is click an icon and the
installer takes care of the rest. But you do not know what it is doing,
whether it is vulnerable to virus attack, or whether the manufacturer has
used it to collect your personal data. You gain no knowledge from the
installation process. On the other hand, each time you install a Linux
application, you learn something new. If a company has created an
environment where every employee can add value to herself, it adds value to
itself too. Linux Fan observed that the emergence of Linux might create a
new economic model. In recent years, the world has been evolving into an
information-centric community. The transition is not totally painless. Tens
of thousands of professionals who once thought their college degrees earned
them a measure of security find themselves financially strained and
emotionally exasperated as jobs continue to evaporate. Many of them work
part time in grocery or hardware stores earning minimum wage and make
searching for work their full-time job, only to find that their old jobs are
gone forever. Here is a much better alternative. Instead of working full time
to search for jobs, Linux Fan suggested that they could learn Linux and use
it to serve others, which could generate a small amount of income at the
beginning and eventually a self-paid high salary after they have added
enough value to themselves.
If you are interested in starting a business to provide Internet services, you
must first convince yourself of what Linux Fan and I believe: there are
unlimited kinds of services that you can provide via the Internet. But
regardless of the services you offer, there are some basics that you may
always need to provide on your site. Readers may be curious about how long
it will take to master all these basics. It really depends on your background
and how much time you plan to spend on your study. Suppose you do this on
a part-time basis, spending about three to four hours a day on the project. If
you have a university degree in science, are familiar with a contemporary
programming language like C++, and have taken a couple of core computer
science courses such as data structures and file systems, it may take about
three to four years to master the techniques; if your degree is in Computer
Science, it may take about two to three years. You may protest, "That's too
long. I can build a sophisticated site utilizing available commercial software
in a much shorter time." True. If you are working for a company, it is not a
bad idea to ask your company to purchase expensive software applications
so that you can take the short route. However, if you want to start your own
business and want to be successful like Linux Fan, taking the slow route may
be a better approach. First, there's always a trade-off between the ease of use
and the functionality of an application; Visual Basic is easy to learn
and use, but it limits how sophisticated your programs can be. Second, by
making the effort to learn something, you add value to yourself; the more
difficult the material, the more valuable it is. Third, after mastering the
techniques, you can easily customize your site and add innovative features to
it. You can build on top of what you know and further extend your
knowledge; this would tilt the learning curve and make it more difficult for
your competitors to catch up. Fourth, the cost of commercial applications
may become a heavy burden to you and substantially raise the risk level of
your business; you may not even be able to survive the first wave of
competition.
Once you have mastered the basics, you become a free person. You
control your destiny and work for your mission. Of course, you do not need
to wait until you have learned all the techniques. You can start with a small
site, serving a few customers at the beginning. As time goes on, you will
be more experienced and knowledgeable. You can then make your system
more secure and robust or you can add more features to it and serve more
customers.
Apache Server
I shall start by discussing how to construct a web server. I assume
that you already know the basics of Linux.
As one can easily see from the Netcraft survey
(http://www.netcraft.com/Survey/), the most widely used web server is the
Apache HTTP Server (http://www.apache.org/). Derived from the popular
NCSA httpd server, Apache dominates the web, currently accounting for
about 60% of active web sites. It is included in many
Linux distributions and is generally installed by default. However, it is
always better to download the latest version from http://www.apache.org or
its mirror sites. The installation is a learning process. You will then know
what's going on and can upgrade or configure it with ease in the future.
To handle all the packages systematically, you may create a directory, say
'/download', to hold all the downloaded distributions. You can then create a
separate working directory, change into it, and unpack the package with the
following command:
gunzip -c /download/package_name.tar.gz | tar xvf -
This command unpacks the package into your working directory without
changing the original downloaded package. If something goes wrong, you
can start all over again. In general, an unpacked package includes a
'README' file and an 'INSTALL' file. You can simply follow the instructions
in those files to install the package. In most cases, it is fairly straightforward.
There are many books discussing the configuration and administration of an
Apache server (see, for example, Linux Apache Web Server Administration
(Linux Library) by Charles Aulds, Sybex, Nov. 2000), which can be
purchased online or from a local bookstore. The Apache Software
Foundation web site (http://www.apache.org/) contains plenty of information
about the server and related projects.
After you have learnt the HTTP basics, you have to learn about HTTPS,
which uses SSL (Secure Sockets Layer) to transmit data in a secure way;
this prepares you to conduct secure e-commerce (see
http://www.openssl.org). These are foundations of your knowledge of the
Apache server and after you have acquired the basic concepts, you should be
able to construct a fancy personal web site. However, your knowledge is not
enough to build a serious commercial site.
One crucial topic about Apache that you need to learn is writing modules to
extend it; there's a good book on this by Stein and MacEachern (L.
Stein and D. MacEachern, Writing Apache Modules with Perl and C,
O'Reilly & Associates, 1999). Writing Apache modules lets you go beyond
simple CGI scripting; because a module runs inside the server process
instead of being launched anew for each request, it can perform many times
better than the fastest conventional CGI scripts. By using the Apache
API, whose memory pools are released automatically when a request ends,
you can guard your modules against memory leaks. You can develop
Apache modules to process images, make secure transactions, stream
data, or add any innovative features that you can think of. One may also
achieve this functionality using Java Servlets, but the Apache modules
approach gives you much better performance and robustness. It may take
you about six months to master the Apache basics.
PHP Programming
The next topic that you want to master is PHP scripting
(http://www.php.net). When I ran an ISP (Internet Service Provider), I
started out using ASP (Active Server Pages) to do server-side scripting, but I
eventually gave it up because of its cumbersome syntax and limited
support in environments other than Windows. I later switched to PHP
scripting and found that it is a better and more powerful language for web
programming. One nice feature of it is that you can use classes to construct
web pages. Very often, beginners tend to use functions to accomplish all the
work with one file containing one function; a directory may contain a few
hundred files. Scripts written in this way make tracing, debugging and
maintenance very difficult. A better way is to group related functions into a
class and make use of inheritance to organize your class structures in a
coherent way. On the other hand, you should not make your classes too
large, as that will slow down your server and consume substantial
resources. Another alternative to ASP and PHP is JSP (Java Server Pages)
scripting. However, JSP needs to work with Java Servlets to realize its power.
This makes JSP less convenient to use and gives it worse performance. There's a
tendency for large independent web sites like yahoo.com to standardize their
development using PHP.
Writing PHP scripts is relatively easy if you already know C. It may take
you one week to two months to become proficient in writing PHP
programs.
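The idea of grouping related functions into a class and using inheritance to
share a page layout can be sketched as follows. The sketch is written in
Python rather than PHP, purely for illustration, and the class names Page and
ProductPage are hypothetical; the same structure carries over to PHP classes.

```python
from html import escape


class Page:
    """Base class: holds the layout that every page on the site shares."""

    def __init__(self, title):
        self.title = title

    def header(self):
        return f"<html><head><title>{escape(self.title)}</title></head><body>"

    def body(self):
        # Subclasses override this to supply page-specific content.
        return ""

    def footer(self):
        return "</body></html>"

    def render(self):
        return self.header() + self.body() + self.footer()


class ProductPage(Page):
    """Inherits the shared layout and defines only what differs."""

    def __init__(self, title, products):
        super().__init__(title)
        self.products = products

    def body(self):
        items = "".join(f"<li>{escape(name)}</li>" for name in self.products)
        return f"<ul>{items}</ul>"


if __name__ == "__main__":
    print(ProductPage("Catalogue", ["Hosting", "Email"]).render())
```

A change to the header or footer is then made once in Page, and every page
class derived from it picks it up.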
Qmail
By now you are able to do significant work on web programming. It is time
for you to learn to build and extend an email server. Almost all Internet
Service Providers provide some kind of email services. My favorite email
server is Qmail (http://www.qmail.org), which is a modern email server with
robust and secure features. It is written in a modular way so that users can
easily extend or modify its functions. In Qmail, the mail-sending and
mail-receiving servers are decoupled and work independently. However, it
does take substantial effort to master. If you can
successfully set up a useful email system using Qmail, you know how email
works and you know what you are doing. Moreover, to use it for commercial
purposes, you will most likely need to make modifications to it. In many
cases, you may want to integrate your email system with your database so
that you can utilize the advanced features of a contemporary database to
search or to authenticate users.
There are two files that you may want to modify. One is the
checkpassword.c program, which is used to authenticate users when they
retrieve emails; you can easily modify it to authenticate a user against a
database instead of a file. Another program that you may need to modify is
qmail-smtpd.c, the SMTP daemon that accepts the messages your users send.
The original
program does not require users to authenticate before sending an email; it
only checks whether the user's IP address is allowed to relay emails; if so, the
user can send emails; otherwise the request is denied. This becomes
impractical if you have users coming from many different places. Therefore,
you may want to modify this file so that a login-password authentication is
required when a user wants to send emails via your server. Of course, after
the modification, you need to inform your users that they need to set their
mail clients, such as Outlook Express, to authenticate when sending
emails. The modification requires some hacking of the package but it is not
too difficult to do. By hacking the files, you will also learn more about the
system and have a better understanding of how email servers work. Again
you will add value to yourself.
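The two modifications described above, checking a login against a database
and requiring a password before relaying, can be sketched as follows. This is
not qmail's code: it is a minimal Python illustration in which an in-memory
SQLite database stands in for your real database, and the table layout and
the function names (add_user, authenticate, may_relay) are assumptions made
for the example.

```python
import hashlib
import hmac
import os
import sqlite3

ITERATIONS = 100_000  # assumed work factor for this sketch


def hash_password(password, salt):
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)


def create_user_table(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "login TEXT PRIMARY KEY, salt BLOB NOT NULL, pw_hash BLOB NOT NULL)"
    )


def add_user(conn, login, password):
    # Store a random salt and the salted hash, never the password itself.
    salt = os.urandom(16)
    conn.execute(
        "INSERT INTO users (login, salt, pw_hash) VALUES (?, ?, ?)",
        (login, salt, hash_password(password, salt)),
    )


def authenticate(conn, login, password):
    """Return True only if the login exists and the password matches."""
    row = conn.execute(
        "SELECT salt, pw_hash FROM users WHERE login = ?", (login,)
    ).fetchone()
    if row is None:
        return False
    salt, stored = row
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(stored, hash_password(password, salt))


def may_relay(conn, client_ip, allowed_ips, login=None, password=None):
    """Allow relaying for trusted addresses or for authenticated users."""
    if client_ip in allowed_ips:
        return True
    if login is None or password is None:
        return False
    return authenticate(conn, login, password)
```

The first branch of may_relay corresponds to the original IP-based check; the
second is the login-password check that lets users send mail from anywhere.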
You may also want to provide a web-based email service to your clients.
Such an application can be developed using PHP. It can be easy or difficult,
depending on the features you want to provide with your web-based email
service. A good reference on email programming is Programming Internet
Email by David Wood, O'Reilly, August 1999. It may take you about eight
months or longer to develop applications for competitive email services. If
you want to take a shortcut, you may also use some freely available
packages, such as vpopmail (http://inter7.com/vpopmail.html), which manages
virtual mail domains for qmail, or SquirrelMail
(http://www.squirrelmail.org), a web-based email client. There's a site that
rates web-based email packages at
http://www.hotscripts.com/PHP/Scripts_and_Programs/Email_Systems/Web-based_Email
PostgreSQL
The next topic you want to learn is the use of a database. It is almost
impossible to build a commercial site without using a database engine.
MySQL (http://www.mysql.com) has been a popular open source database.
However, I recommend that you use the PostgreSQL database
(http://www.postgresql.org), which has transactions and a better security
model. There are some issues with this package. Often a new
release is not one hundred percent backward compatible with old versions; each
time I upgrade, I have to make minor modifications to my scripts,
which really is a headache. Also, it seems that many versions have serious
memory leak problems. Despite these defects, I still feel that PostgreSQL
is the best open source database available today and is powerful and secure
enough to do sophisticated tasks. For a comprehensive description of
PostgreSQL, you may refer to the book, Practical PostgreSQL by J.C.
Worsley and J.D. Drake, O'Reilly, January 2002.
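To see why transactions matter for a commercial site, consider moving credit
between two customer accounts: either both updates must happen or neither.
The following is a minimal sketch of that idea in Python. It uses the
standard sqlite3 module only as a stand-in so that it runs without a server;
the accounts table and the transfer function are hypothetical, and with
PostgreSQL you would issue the same kind of statements through a PostgreSQL
driver.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move credit between accounts atomically; undo everything on failure."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Raising here rolls back the debit made just above.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False
```

Without a transaction, a failure between the two updates would leave the
debit applied and the credit missing.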
A few years ago when the web began to emerge, the field was full of
fancy names like 'three tier model', 'middleware', and 'Object Request
Broker'. I was deceived by the names and I tried to access a database using
middleware that has fancy names like ODBC, JDBC, and Data Request
Broker. In the end, I found that none of these were necessary, and that the
more layers you introduce into a system, the more errors you can introduce
with them. It
is fine to use standards like ODBC and JDBC if you work in a big
company. But if you design, build and maintain the system yourself,
it is not necessary to use the middleware. I later gave up all the
stuff with fancy names and accessed the PostgreSQL database via its native
C interface, which is a lot more straightforward and less error-prone. The
PHP module already has built-in functions to access a PostgreSQL database,
and that makes life even easier.
If you have written a remote client program in Java and you don't want to
use JDBC to access your web site database, your Java program can access it
via a PHP script, which is also straightforward and simple. Or in case you
need to transmit the data in large quantities, you can learn some socket
programming (discussed below) and write a simple server in C to access the
database; your remote Java client can 'talk' to your C server which 'talks' to
the database. Of course, if you like, your remote Java client can also
communicate with an Apache module that you have developed to access
the database.
It may take you three to nine months to master the use of PostgreSQL.
Java and Network Programming
After you have covered the above topics, you should have a reasonable web
site that may be able to provide Internet Services to others. (Of course, I
assume that you have also learnt some related minor topics such as HTML,
JavaScript and XML.) However, you cannot do much with it if you do not
enrich your programming skills. At this point, I recommend that you put
more effort into writing better programs so that you become proficient in both
C/C++ and Java. Most likely, you need to use C/C++ to develop server
programs and Java to develop client applications for your users. Though you
may use Java to develop server-side programs, in many cases C/C++ is a
better choice; it gives you finer control, and the programs thus developed
generally run significantly faster. On the other hand, Java is a better choice for
developing client programs. Java programs are platform independent and
can be embedded in a browser as applets.
One crucial topic here is socket programming, which allows you to write
a server program to communicate with clients at remote sites. For example,
you can write a chat server in C/C++ and a chat client in Java as an applet.
Your remote user uses a browser to start the chat client applet and sends
information to the chat server, which then broadcasts the data to all other chat
clients. At the same time you need to learn more about networking and be
proficient in configuring your name server. The following are some good
references on this topic:
1. Paul Albitz and Cricket Liu, DNS and BIND, Fourth Edition,
O'Reilly, April 2001.
2. W. Richard Stevens, Unix Network Programming, Volume 1,
Second Edition, Prentice Hall, 1998.
3. Neil Matthew and Rick Stones, Beginning Linux Programming,
Wrox Press Ltd., 1996.
4. Warren W. Gay, Linux Socket Programming by Example, Que,
2000.
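The chat server described above, which receives data from one client and
broadcasts it to all the others, can be sketched in a few dozen lines. The
sketch below is in Python rather than C/C++ only to keep it short; the
ChatServer class and its method names are hypothetical, and a production
server would also need message framing, buffering for slow clients, and
error handling that are omitted here.

```python
import selectors
import socket


class ChatServer:
    """Accepts TCP clients and relays every message to all other clients."""

    def __init__(self, host="127.0.0.1", port=0):
        self.selector = selectors.DefaultSelector()
        self.listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        self.listener.bind((host, port))  # port 0 lets the OS pick a free port
        self.listener.listen()
        self.listener.setblocking(False)
        self.selector.register(self.listener, selectors.EVENT_READ)
        self.clients = set()

    @property
    def address(self):
        return self.listener.getsockname()

    def poll(self, timeout=0.1):
        """Handle whichever sockets are ready, then return."""
        for key, _ in self.selector.select(timeout):
            if key.fileobj is self.listener:
                self._accept()
            else:
                self._relay(key.fileobj)

    def _accept(self):
        conn, _ = self.listener.accept()
        # Client sockets stay blocking for simplicity, so one very slow
        # client could stall the loop; acceptable for a sketch only.
        self.selector.register(conn, selectors.EVENT_READ)
        self.clients.add(conn)

    def _relay(self, sender):
        try:
            data = sender.recv(4096)
        except ConnectionError:
            data = b""
        if not data:  # an empty read means the client disconnected
            self._drop(sender)
            return
        for client in list(self.clients):
            if client is not sender:
                try:
                    client.sendall(data)
                except OSError:
                    self._drop(client)

    def _drop(self, client):
        self.selector.unregister(client)
        self.clients.discard(client)
        client.close()

    def close(self):
        for client in list(self.clients):
            self._drop(client)
        self.selector.unregister(self.listener)
        self.listener.close()
        self.selector.close()


if __name__ == "__main__":
    server = ChatServer(port=5000)  # the port number is an arbitrary choice
    while True:
        server.poll()
```

Any TCP client, including a Java applet, can connect to the listening port;
whatever one client sends is forwarded to every other connected client.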
There are many good books on Java on the market. The following are a
few that are appropriate for beginners.
1. Cay S. Horstmann and G. Cornell, Core Java, Volume 1 & 2, Sun
Microsystems Press, 1999.
2. David M. Geary, Graphic Java, Mastering the JFC, Volume 1 &
2, Third Edition, Sun Microsystems Press, 1999.
3. Jacquie Barker, Beginning Java Objects, Wrox Press Ltd., 2000.
The official Java site, http://java.sun.com, contains the latest news and other
relevant information about Java. The site
http://jakarta.apache.org
provides information about server-side solutions for the Java platform.
This process may take you three to twelve months.
Clustering
Now with the knowledge you have gained, you may have built a fairly
sophisticated commercial site. However, your learning process is not
complete, and your site will be of limited use until you have learnt
clustering and related technologies. If you are serious about your business,
you may plan to serve millions of customers in the long run. This means that
you need many machines to accomplish your goal and you want your system
to be scalable and reliable. When more customers come, you simply add
more machines. Also, you want your system to be up 24 hours a day.
There are a few approaches to address this problem. A simple and effective
way to accomplish this is to establish a virtual server, which actually
consists of a cluster of machines. Effectively, the cluster of machines
behaves as a single virtual server, which is exposed to end users. When one
of the machines goes down, it is automatically removed from the cluster. When a
new machine is added, the cluster automatically includes it to help share the
load. An end user will not know when a machine is removed from or added to
the cluster. You can build a Linux Virtual Server (LVS) with Linux machines
using the patches provided by the web site
http://www.linuxvirtualserver.org
which hosts the Linux Virtual Server Project. There is a special member
called the 'director' in an LVS system. A user first contacts the director, which
directs the request to a member of the cluster. Subsequently, the user may
communicate directly or indirectly with the selected cluster machine. Of
course all of this is transparent to the user. She only sees a single virtual
server and thinks that she always communicates with a single machine. As
you may have noticed, the director could become a single point of failure for
the system. If it is down, a user will find that the system has ceased to
function. To maintain high availability, which means ensuring that the whole
system still functions properly when any one node in the system fails, you
may add a redundant director to your system. You can then run a 'heartbeat'
between the two directors
so that when the primary fails, the secondary takes over its tasks
automatically. Details about high availability can be found at the site
http://www.linux-ha.org
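The behaviour of the director described above can be illustrated with a small
model. The sketch below is not LVS code (LVS does this work inside the Linux
kernel); it is a hypothetical Python model of round-robin dispatching, which
is one of the scheduling policies LVS offers, together with the heartbeat
rule for choosing between two directors. The class, function and server names
are made up for the example.

```python
class Director:
    """Round-robin request dispatcher over a pool of real servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._next = 0  # index of the server that gets the next request

    def add(self, server):
        # A new machine simply joins the rotation.
        if server not in self.servers:
            self.servers.append(server)

    def remove(self, server):
        # A failed machine leaves the rotation; keep the pointer consistent.
        if server in self.servers:
            index = self.servers.index(server)
            self.servers.remove(server)
            if index < self._next:
                self._next -= 1

    def pick(self):
        if not self.servers:
            raise RuntimeError("no real servers available")
        self._next %= len(self.servers)
        server = self.servers[self._next]
        self._next += 1
        return server


def active_director(primary_alive, primary="director-1", backup="director-2"):
    """Heartbeat rule: the backup takes over only when the primary is silent."""
    return primary if primary_alive else backup
```

From the end user's point of view nothing changes when remove or add is
called; requests simply continue to be answered by whichever machines remain
in the pool.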
The remaining question is how to manage the files in your cluster. If
you need a sophisticated distributed file system with fault tolerance, you
may consider Coda developed by CMU (http://www.coda.cs.cmu.edu). But
if your main concern is high availability, you may consider InterMezzo
(http://www.inter-mezzo.org), which is a file system with a focus on high
availability. You can use InterMezzo to replicate files across your servers. It
can also be used for mobile computing, which means that you can develop your
scripts and programs on your own personal machine. After you have
thoroughly tested your scripts, you can connect your machine to the
networked cluster; your new scripts will be replicated across the servers in
the network. Or if you update a file in one machine, the changes will be
propagated to all other members. This makes the maintenance and
administration of your cluster a lot easier and less error-prone.
It may take you about six months to learn to build a highly available
Linux Virtual Server cluster with InterMezzo deployed.
Others
After you have mastered all of the above, you are ready to provide reliable
Internet Services to others. However, this is not the end of your learning
process. Which other topics you need to learn depends on your business. One
common feature you may need is streaming capability for your web site.
Data streaming in general uses the Real-time Transport Protocol (RTP) to
transmit data. Unlike general-purpose protocols such as HTTP or FTP, RTP is
designed to transmit media streams that have strict timing requirements.
Applications for data streaming can be conveniently developed using the Java
Media Framework (JMF) (see, for example, Linden DeCarmo, Core Java
Media Framework, Prentice Hall, 1999). You can learn more about data
streaming from the site http://www.real.com, which also provides a free basic
streaming server. Other relevant sites include
http://www.shoutcast.com
and
http://www.icecast.org.
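The timing information that distinguishes RTP from HTTP or FTP is carried in
a fixed 12-byte header in front of every media packet: a version field, a
payload type, a sequence number, a timestamp and a source identifier. The
sketch below packs and unpacks that fixed header in Python. The function
names are hypothetical, and the sketch ignores header extensions, padding and
RTCP, so treat it as an illustration of the layout rather than a usable RTP
stack.

```python
import struct

RTP_VERSION = 2
HEADER = struct.Struct("!BBHII")  # 12-byte fixed header, network byte order


def pack_rtp(payload_type, sequence, timestamp, ssrc, payload, marker=False):
    """Build an RTP packet with no padding, extension or CSRC entries."""
    byte0 = RTP_VERSION << 6
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = HEADER.pack(byte0, byte1, sequence & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + payload


def unpack_rtp(packet):
    """Split a packet into its header fields and payload."""
    byte0, byte1, sequence, timestamp, ssrc = HEADER.unpack_from(packet)
    if byte0 >> 6 != RTP_VERSION:
        raise ValueError("not an RTP version 2 packet")
    csrc_count = byte0 & 0x0F  # each CSRC entry adds 4 bytes after the header
    offset = HEADER.size + 4 * csrc_count
    return {
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "sequence": sequence,    # lets the receiver detect loss and reordering
        "timestamp": timestamp,  # lets the receiver play samples at the right time
        "ssrc": ssrc,            # identifies the stream's source
        "payload": packet[offset:],
    }
```

The sequence number and timestamp are what allow a receiver to reorder
packets and schedule playback, which a plain byte stream over HTTP does not
provide.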
Another feature that you may be interested in is text-to-speech synthesis
(TTS) and speech recognition. TTS means converting text to
speech. A useful open-source package for TTS called 'flite' has been
developed by Carnegie Mellon University (CMU); see
http://www.speech.cs.cmu.edu/flite/index.html,
http://www.speech.cs.cmu.edu/flite/index.html/flite/flite.html,
http://www.speech.cs.cmu.edu/hephaestus.html
Concerning speech recognition, there's an open-source project called
'Sphinx' under way at CMU. It provides a collection of real-time speech
recognition engines, an acoustic model trainer, and documentation for
building related acoustic models.
If you need to add standard encryption technologies to your service, you may
study OpenPGP, which was originally derived from PGP (Pretty Good
Privacy), first created by Phil Zimmermann in 1991. You may refer to the
site
http://www.openpgp.org
for further information. A Java implementation of OpenPGP can be found at
http://www.cryptix.org
If you want to use a simple XML-based protocol to let applications
exchange information over HTTP, you may study SOAP, which stands for
Simple Object Access Protocol. Currently, there are many ways for
applications to communicate. Well-known methods like DCOM and
CORBA use Remote Procedure Calls (RPC) for objects to exchange
information, which may give rise to compatibility and security problems. A
better way to communicate between applications is over HTTP, as HTTP is
supported by all web browsers and servers. SOAP was created to
accomplish this and provides a way to communicate between applications
running on different operating systems, with different technologies and
programming languages. You may refer to the sites
http://ws.apache.org/soap/
http://www.w3.org/TR/SOAP/
for more information.
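A SOAP message is simply an XML document with an Envelope element that
contains a Body, and the Body carries the call and its parameters. The
following minimal sketch builds and reads such a message with Python's
standard XML library. The envelope namespace is the SOAP 1.1 one; the method
name, the service namespace and the parameter used in the usage example are
invented for illustration, and real services usually describe their messages
in more detail than this.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope


def build_request(method, service_ns, params):
    """Return the XML bytes of a SOAP request calling `method`."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{service_ns}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


def parse_request(xml_bytes):
    """Extract the method name and parameters from a SOAP request."""
    envelope = ET.fromstring(xml_bytes)
    body = envelope.find(f"{{{SOAP_NS}}}Body")
    call = body[0]  # the first child of Body is the call element
    method = call.tag.split("}", 1)[-1]  # strip the namespace part
    return method, {child.tag: child.text for child in call}


if __name__ == "__main__":
    # Hypothetical method and namespace, used only for this example.
    message = build_request("GetQuote", "urn:example:quotes", {"symbol": "RHAT"})
    print(message.decode("utf-8"))
    print(parse_request(message))
```

Because the message is plain text sent in an HTTP POST, the client and the
server can be written in different languages and run on different operating
systems.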
If you need to control your systems' traffic, refer to
http://www.lartc.org/howto
After you have learned all these, you should have added a lot of value to
yourself; you have become a free person and can work on something you
feel significant and interesting. Some day, tell us your success stories.