= A Stroll Through Some of the Ideas and Terminology Surrounding Xml and Alexis
please send comments and questions to: matthew@ella-associates.org
INTRODUCTION
This document attempts to explain some of the concepts and technical ideas
behind the Alexis System. Alexis is java-based xml-oriented system designed
to facilate the collaborative writing of dictionaries and other xml-oriented
projects. This document tries to follow a middle ground between technical
detail and clear explanation. It is designed for those people who need to
approximately understand how the Alexis system works and what advantages
it can bring to teams of lexicographers and editors. There is also an emphasis
in this document on explaining terminology which is often confusing.
In reality, there is often no exact or precise definition for a piece
of terminology and it is only necessary to understand its general or
approximate meaning.
AN EXPLANATION OF XML AND WHERE IT CAME FROM.
I will begin this explanation of XML by referring heavily to HTML. This is
because a great deal of the terminology and concepts behind HTML are
applicable to XML. (As will be made clear later, in some cases HTML actually
is a form of XML)
If you click on the menu item 'view source' or 'view page source' in your Browser (such
as Microsoft Internet Explorer or Netscape Navigator, or Opera, etc) when you are viewing
a web-page on the Internet, you will see something like this:
-->>
--<<
This is a small example of something called HTML, which stands for Hyper-Text Mark-up Language.
All the text between the 'pointy brackets' (<>), which are also known as 'angle brackets' or
'greater and less than signs', is called 'mark-up'. When I say 'between' I mean all the text
which begins with a < and ends with a >. A piece of mark-up, such as '',
is referred to as a 'tag', or 'mark-up tag'. You will notice above that some of the
text is not enclosed by greater and less than signs; For example, the word 'Fun' and
the words 'Games' and 'Horoscopes'. This text is called 'the content', or 'the content
of the HTML page' or something similar. When you look at the HTML page in your Browser
you will not actually see any of the mark-up. You will only see the 'content'. But
the mark-up affects the way that the content appears or behaves. For example, in the
HTML above, the word 'Fun' will appear to be 'bold', that is the text will be darker
and thicker that the normal text on the page. This is because of the mark-up tags
and . These tags tell your Browser that the text between them should be displayed
as bold. This process, which your Browser carries out, of turning mark-up tags into
a web-page is called 'rendering' or 'rendering the html'.
The first tag '' is called the 'opening tag' and the second tag '' is called the 'closing tag'.
All 'closing tags' have the '/' character immediately after the first 'angle bracket' (<).
Almost all HTML tags are matching pairs of opening and closing tags, but not all. In the example above
there is an opening tag '' and its matching closing tag is ''
The word 'Fun' is said to be 'enclosed' by the two tags.
In the HTML above, we can see the line
Games,
In this line, the text 'href="http://www.yahoo.com" is known as an 'attribute' of the tag.
The text 'href' is called the 'name' of the 'attribute' and the text 'http://www.yahoo.com' is called
the 'value' of the attribute. Together this is referred to as a 'name/value pair'.
You may think that I am un-necessarily introducing confusing terminology for no reason. But all this
terminology is important when talking about XML, which, in the long run, is more important than
Html.
A set of mark-up tags is referred to as a 'mark-up language'. HTML is not the only
mark-up language. For example, there is a mark-up language called 'SGML' which
stands for Standard Generalised Mark-up Language, which has been used by publishing
companies for many years. What is the point of these mark-up languages? Why do
we need them? I'm glad you asked. In the past mark-up languages have been mainly
used to 'format' text documents. By 'formatting' I mean the process of taking a
text document that looks as if it has been written on a typewriter, and applying
effects like making some text 'italic', making the title bigger, putting different
parts of the document into different 'fonts' and so on. The majority of HTML tags
are designed for this purpose; they change the appearance or the 'presentation' of
the text which they enclose. You may well ask at this point, why don't we just use
'Microsoft Word' if we want to format text documents? We don't have to learn some
obscure mark-up language when we use Microsoft Word. Microsoft Word uses its own
mark-up language, which the user never sees. One advantage of a mark-up language
like HTML is that it is a system which no one company or person controls.
Another advantage of a mark-up language such as html, as opposed to say the microsoft word
markup language is that it is 'human-readable'. For example, if you get a Microsoft Word
document and you open it in a text editor like Microsoft Notepad, you will notice that
there are all sorts of strange codes and incomprehensible symbols. This is because Notepad
is actually displaying the 'mark-up codes'. You will also be immediately aware that it is
not possible to make any sense of the codes and symbols. On the other hand, if you look
at the source for an HTML page, although the codes may look at first quite complicated, with
a small amount of knowledge it is quite possible to understand what each of the codes is
doing. This is what is meant by 'human-readability'. This is an important characteristic of
HTML and XML.
AN IMPORTANT CONCEPT
In some situations, there will be a set of rules that govern a particular area. But there may
also be a set of rules which govern how those first rules should be made. An example of this is
the constitution of a country and the laws which have effect in the country. The constitution
is, in a sense, a set of rules about how other rules may be created. The Constitution
of a country may state, for example, that you cannot make a law which takes away the
right of a person to freedom of speech, and that any law which attempts to do that
will not be a valid law. This is a very important idea when understanding XML.
XML stands for Extended Mark-up Language. But XML is not really a mark-up language like
HTML. Actually it is a set of rules stating how mark-up languages like HTML should be
created. It defines a set of rules that people should follow when creating mark-up
languages. This may seem very abstract and difficult to understand at first, but its
not. The rules that XML sets out are really pretty simple: it says,
Every opening tag must have a closing tag or, if not, the tag must look like this
in other words it has to end in '/>';
All closing tags must begin with '';
All tags must use the < and > signs as their starting and ending characters (the same as HTML);
Any tag may have 'attributes' but that the 'values' of all attributes must be enclosed in
'double quotes' (which is the " character);
And a few other rules as well.
If a text document follows these rules then it is called 'well-formed'. This is an important
word. Remember it. The rules of XML are called 'strict' because if they are broken in
even the smallest and most in-offensive possible way, then the document ceases to be
'XML' and becomes what is affectionately called 'Tag Soup'. Most HTML documents are
actually 'Tag Soup' because they don't follow all the rules of XML. For example,
the HTML displayed above breaks the rules of XML in a number of ways: The tag
(which creates a horizontal line on the page) has no closing tag. Also, many of the
values of attributes are not enclosed in the " character, such as in the case of the tag
''. In order to obey the rules of XML it should be '
'
If we fixed all these problems, then we would have an HTML document which was also an
XML document.
Lets look at an example of a 'well-formed' (but very short) XML document
-->>
- Dr. Boggles Magical Washing Powder
- Bananas
- Apples
--<<
This is a mark-up language which obeys all the rules of XML. Strictly speaking it
is called an 'instance of XML' or 'an implementation of XML' or 'an application
of XML'. For example it might be given the name 'ShoppingML' which would stand for
Shopping Mark-up Language. But in practice it is extremely common to hear the text
above referred to simply as 'XML' or as an 'XML document'. When you hear the phrase
'XML' it is actually more likely that the speaker is referring to something like
what you can see above, rather than referring to the set of rules which is what XML
really is.
The purpose of XML, unlike HTML, is not to 'format' the appearance of text documents.
The purpose of XML is to store and transmit data in a way that is reliable and
yet flexible. One of the great problems that has occurred in computer systems, has
been the inability of different types of computers and different types of programs
to communicate with one another. For example if you create a Microsoft Word
document on a Microsoft Windows computer and then open that same document on a
Apple Macintosh computer you will often find that a lot of the formatting does
not look the same. Because of the rise of the Internet, it has become more and
more important that all different types of computers with different types of
operating systems should be able to communicate to each other and transfer
data without that data becoming corrupted or distorted in any way. This is
one of the problems which XML solves.
XML is a powerful system because it allows data to be easily 'transformed',
'manipulated' and exchanged between different types of computers and
programs.
XML is very important in the Alexis system for various reasons. Firstly,
Alexis uses XML to transfer all data from the 'server' to the 'client'.
(I will try to explain these terms later). Also, it is customary in the
field of lexicography to store and manipulate the data for lists of
words as XML or SGML. Alexis conforms to this standard by providing comprehensive
XML editing facilities in the 'user-interface' and by storing the word
data as XML in the back-end database.
OF SERVERS AND CLIENTS
Two words which are very frequently 'bandied about' with respect to computers:
'client' and 'server' and the compound form 'client/server'.
These two words represent quite an important concept when dealing with
'networked' software and this concept is important in understanding the
Alexis system.
On the internet, some computers provide 'services' to other computers and
to other programs. A simple example of a 'service' is a program (and
computer) which outputs an accurate representation of the current time.
In other words, some computer on the internet will ask another computer
on the internet 'what is the time, precisely' and the computer that is
asked will reply '4:20:22 am Greenwich Mean Time'. This is a simple but
important example because it demostrates a very common operation on the
Internet. The computer which does the asking is called the 'Client' or
the 'Client Computer' and the computer which replies is called the
'Server' or the 'Server Computer'. The process of asking a question of
another computer is called 'making a request' and the reply is often called
'a response'. In reality, when a computer makes 'requests' they normally don't do so
in plain english (or in any other human language). Instead they might
send a message such as 'GET TIME' and the server will respond
'CURRENT TIME: 4:20:22 am'. Although these messages look like plain english,
they are not really. For example, if the client send a message
'whats the time please', the server computer would not understand it and
would probably return a message like 'INVALID REQUEST'. In other words,
the server only understands a very limited number request messages and
the requests have to be in exactly the right format. For example if the
client computer asks 'GET Time' (using lower case letters), the server
may well not understand this request at all.
This system of sending special request and response messages is called
a 'protocol'. Protocols are extremely important on the Internet because
they allow computers who otherwise know nothing about each other to
communicate. In summary, a client computer makes requests for services
of a server computer using a particular protocol, and the server computer
responds to the request using the same protocol.
I have been referring so far to client and server computers. But computers dont
actually do anything by themselves. It is actually the programs that carry out
actions. In other words, it is actually a program which is running on the client
computer which sends a request to the server computer, and it is actually a program
running on the server computer which replies. For this reason, these programs
are known respectively as 'client programs' (or 'client application', or 'client
software') and 'server programs/applications/software'. To ensure that everybody
becomes completely confused, it is very common not to include the second part of
these phrases. That is, it is common to simply refer to a 'server' or a 'client'
without specifying if the speaker is refering to a computer or to a program. This
does cause confusion and it is necessary to work out which is being referred to from
the context of the sentence.
An example of a client program (or application) is a Web Browser like
Netscape Navigator. The Web Browser makes requests for web pages from some
Web Server located somewhere on the Internet. It does this using a special protocol called
HTTP (Hyper Text Transfer Protocol). And the Web Server responds using the
same protocol. The user of the Web Browser does not actually get to see
the requests being made. They are made 'behind the scenes' as soon as you
type a Web Address in the 'location bar' of your Browser. The actual
requests are quite simple and look something like
GET www.google.com/index.html
The Web Server (for example: Apache) which replies to this request is an example of a
'Server Application (or Program)'.
The Alexis system, like the majority of network based programs, has
a client program and a server program. These are, in reality, completely
seperate programs which can (and probably should) run on completely
different computers. While the Client and Server programs dont actually
need each other to run, they are not able to do anything useful unless
they are able to talk to each other using the Alexis protocol.
The Alexis protocol is actually an implementation of XML (or
colloquially, is XML). In other words, the Alexis Client program
will make a request to the Alexis Server program (the name of which
is 'Chakriya') in approximately the following way
-->>
a*
--<<
What this request means, is that the Client program is asking for
all words which match a certain pattern. In this case the pattern
is that the words should start with the letter 'a'.
The client sends this request to the Server which is located on some
computer on the Internet and the Server responds by sending to the
client all of the words in a particular dictionary which start
with the letter 'a'. The user of the Alexis system (who is using the
client application), can then edit those words and when he or she has
finished editing them he or she can send a request to the Server that
his/ her changes should be saved on the Server.
The advantage of this system of having separate server and client
programs is that it allows for easy collaboration between different
users who can be located anywhere in the world, as long as they
have a connection to the Internet. There is no reason why more than
one Client program cannot be run at the same time and therefore
many different users can use the system at the same time without
being geographically located in the same place. The Server program
(Chakriya) acts like a manager or orchestra conductor for each and
all of the client programs. The Server program stores and manages the
work carried out by each user. In other words, while a user (in this
case, some lexicographer located anywhere in the world) carries out
all her work on her own computer, all the changes that she makes to
different entries (words) in the dictionary which she is working on,
actually get saved to the 'central' Server Computer. In this way,
all the other lexicographers and editors who are working on the same
dictionary (or other xml related project) can see and use the work
that has been done by that person.
This use of a client/ server 'architecture' (as it is known) also eliminates
the possiblity for conflicts between the work carried out by different users.
That is, the Server program makes sure that one user does not over-write
or destroy the work carried out by another user.
ALL ABOUT DATABASES
Data is any set of patterns that have meaning to a human being.
A 'database' is a collection of data usually about a particular
subject or field. In information technology, a computer program
which stores and manages data is refered to as a 'database system' or
'database management system' or 'database server'. As the last term
suggests, most modern database programs use the Client/ Server
system which I have explained above. That is, there is one part
of the program which stores and manages the data which is called
the database server and which responds to requests for that data
from 'Client' programs. The Server program and the Client
program do not have to be on the same computer.
In the same way as other client and server programs the Database
Server and Client use a 'protocol' to communicate to each other.
There are two main types of modern Database systems: The 'hierarchical'
database, and the 'relational' database. A hierarchical database 'looks'
like a (up-side down) tree, something like this:
-->>
village
-- name = almetlla de mar
-- residents
person1
-- first name = paz
-- second name = martinez
person2
-- first name = nuria
-- second name = fill
person2
-- first name = james
-- second name = sweetapple
-- bars
bar1
-- name = pica pica
-- proprietor = montse
bar2
... etc
--<<
Another example of a hierarchical database is the files on a computer.
If you expand each of the folders (directories) on your computer using
a program like Microsoft Explorer you will see the same 'tree' like structure.
Each of the folders is like a branch of the tree and the files are like the
leaves on a tree (but not as attractive).
With a small or large amount of brain-racking you will also notice that the
XML documents which are displayed above are also examples of a
hierarchical database structure. Historically (in the 1970's and early 1980's)
hierarchical databases used to be the most important and most used. However,
more recently the other type of database has become the most common, that is
the 'relational' database. Finally, with the rise of XML, the pendulum is
swinging back towards hierarchical databases (please excuse the cliche).
In order to understand Alexis you really don't have to know anything about
relational databases since only the programmer actually has to deal with them.
But in any case here is a brief explanation:
A 'relational' database looks like a collection of 'tables', in other words,
bits of data divided into rows and columns, like this
-->>
first name last name place of residence favorite bar
---------- --------- ------------------ ------------
paz martinez almetlla de mar pica pica
nuria fill almetlla de mar The English Bar
james sweetapple almetlla de mar N/A
--<<
As you can see, the actual data is pretty much the same as for the hierarchical
database, but the way that it is arranged is different.
To get data in and out of a relational database program you use a thing called
'Structured Query Language', which is usually refered to as 'SQL'. It looks like this:
SELECT 'first name', 'last name'
FROM 'Residents'
WHERE 'place of residence' = 'almetlla de mar'
This is called a 'query'. As you can see it is a very simple language to use since
it is almost the same as English. The 'query' above would get you the following data:
-->>
first name last name
---------- ---------
paz martinez
nuria fill
james sweetapple
--<<
Alexis uses a combination of a hierarchical data structure (XML) and a 'relational'
data structure in order to store the words of a dictionary or other data. But the
user of the Alexis system will only ever have anything to do with the hierarchical
system, since the relational database program (called 'MySQL') is hidden or 'behind-the-scenes'.
... possibly to be continued
OTHER INFORMATION
please see the file
http://www.ella-associates.org/alexis-info/docs/glossary.html
for a list of technical terms and phrases and explanations
WHATS MISSING
This section contains concepts that possibly should be covered but aren't
validation of xml, xml schemas and dtds, parsing of xml.
Methods of transforming xml.
... possibly to be continued
please send comments or corrections to: matthew@ella-associates.org
SOME LINKS
http://cocoon.apache.org/2.0/introduction.html
This is another 'colloquial' style introduction to XML and an explanation
of why it is important and why to use to it. The strength of this document
is that it doesn't assume any previous knowledge. The actual English of
the document is fairly poor.
COMMENTS
Added by: mjb, on Thursday, 15 May 2003, 09:22 PM
This is probably a little bit too dense