Some comments on TCP/IP and Linux Protocol Implementation: Systems Code for the Linux Internet by John Crowcroft and Iain Phillips

This file contains some comments on the book: TCP/IP and Linux Protocol Implementation: Systems Code for the Linux Internet by John Crowcroft and Iain Phillips, Wiley, November 2001, ISBN: 0-471-40882-4.

These are personal comments and are primarily meant to encourage a revision in the book and enable my students and others to read critically. Rest assured that the I have sent comments to one of the authors and to the publisher.

After having read Jon's earlier 1996 on-line book I was looking forward to reading this new book. Unfortunately, based on the first three chapters I find it to be one of the mostly poorly written books which I have read in some time. Perhaps this is because I expected too much. Given Vinton Cerf's quote on the back I expect that it was at least up to the level of Stevens' TCP/IP Illustrated Vol. 2 -- however it is not. (Of course since the quote does not even acknowledge that there is already a book entitled "TCP/IP Illustrated" which I think shows either ignorance or arrogance.)

Comments part 1

One of the first features one notices is that the cross references to structure definitions (inside front cover of Stevens) and the macro and function definitions (inside back cover of Stevens) are missing -- in fact most of the structures, macros, and functions are not even in the index of this book. Now one might say that this is because of the extensive cross references in the text which will let you find the corresponding code/definition/... -- unfortunately this is one of the worst books in terms of the lack of such cross references. The cross references given are much too vague and in some cases seem to be without a target. (More details below.)

On page 45 the example code is divided into 7 figures. However, this seems completely unnecessary:

  1. the first division occurs following line 521 -- Why?
  2. line 522 is omitted (such a missing line sometimes occurs with other divisions {such as 2213 on page 59, 2316 on page 61} and sometimes not) Why?
  3. where is the commentary that would follow such a division? For example, there is no explaination of why the gotos at lines 521, 527, 566 shouldn't have simply had the code which is at 686-689, 675-677, and 670-673 simply inserted in place {note there are other examples of this, but I have cited only three} On some machines a reason to do such a goto X; label Y: ... ... label X: ... goto Y; is because you wish to avoid using the stack (generally during startup before any off chip writable memory has been enabled -- however, that is not the case here; therefore why is this done?)

So after all the code in figures 3.7-3.17 there are 7 lines of text which really don't say much of any relevance to this 7+ pages of text.

When later on page 77 in subsection 3.2.1 the authors refer to write() on the socket mapping into a system call the authors say simply "see Chapter 2", it would have been far more useful to say "see section 2.5 and figure 2.17". Similarily in the same subsection there is a reference to tcp_sendmsg() descheduling the process and the reference is also "see Chapter 2", but where does it show the descheduling?

Note that in the paragraph noted above, the word "this" is missing between "socket" and "maps"; it is the "write" operation which maps into this system call.

On page 66, when discussing the real-time clock, there is a forward reference to 'see later in "TCP Initial Sequence Number" for example', but there is no page number indicated, nor is this string in the index, nor is it in the table of contents, nor can I easily find it when scanning through the code in Chapter 7. Yet in my own case, I find it interesting to know what happens if you don't have a real-time clock which can retain time across reboots. The implication is that you will not correctly generate a different initial TCP sequence number from when you lasted booted. Which of course is an interesting issue, but the reader can't easily see this in the code.

On page 54, the authors show (starting at line 498) the bottom of what is probably the definition of smp_init() -- why show the define of smp_init() for the case of a uniprocessor {and why should t be an empty do {} while (0) ?}

On page 56, the authors begin figure 2.21 without having said what function is being defined. If the authors had shown the function's first line and followed it by a row of ellipsis it would be much easier to understand {this of course applies to all other places where the authors begin a figure in the midst of a definition}.

On page 57, the authors begin figure 2.22 with some code which is the body of a loop which is printing out the names of the block drivers. Why show this code? (Especially without stating what function it is in!)

Similarily why show the code at the bottom of this figure lines 509-510?

On page 58, the authors again don't identify what code this, but one can guess it is the generic IOCTL processing - but it would be good to say this or even better say this and what function it is in. For example, simply adding the following lines would have shown the lines in the function defintion above 2151:

      /**
       *        dev_ioctl       -       network device ioctl
       *        @cmd: command to issue
       *        @arg: pointer to a struct ifreq in user space
       *
       *        Issue ioctl functions to devices. This is normally called by the
       *        user space syscall interfaces but can sometimes be useful for 
       *        other purposes. The return value is the return from the syscall if
       *        positive or a negative errno code on error.
       */
      
    int dev_ioctl(unsigned int cmd, void *arg)
      {
        struct ifreq ifr;
        int ret;
        char *colon;

The comment in line 2161 says that these are atomic operations and don't require locking, but then the code in lines 2174 and 2176 does a lock/unlock!

Of course one could even go farther and say that the IOCTLs in 2165-2172 simply _get_ information while those in 2192-2203 _set_ values; hence the reason for the capability check at line 2204.

On page 60, when processing what I think are Jean Tourrilhes' Wireless Extensions in lines 2253-2255, the code checks to see if the result of the dev_ifsioc() was NULL and if the command was get (IW_IS_GET), then it returns some frequency information by copying it back into user space and then returns "-EFAULT". Why should this system call report EFAULT?

On page 62, it would be useful to say what the meaning of the bit __LINK_STATE_PRESENT is. For example, is it simply telling the driver that the network device is registered or does it actually have something to do with the device's link state? (It is the former, but you have to know a lot to know that -- especially given the name.)

The copy editing was very sloppy. This can be seen in the "tcp_send\_skb()" on page 77 and "NF_hook\_slow( ... )" on page 78. I would guess that these are left over from TeX markup.

In the text just above figure 3.3 on page 77, why is there no section number or page reference -- again just a chapter reference. Is it because this text was written by one author who just hoped that this would be described somewhere in chapter 7 by the other author? With regard to figure 3.3 itself, it would be good to have a note that the skb_* and other socket buffer related functions are discussed in section 5.6 pp 184-196. But in fact skb_push is not obviously there! Note also that the index does list sk_buff strutures, but that the starting page number is 184 and not 185 as shown in the index.

Again on page 78 in subsection 3.2.3 it would be very useful to know where (section number and page) in chapter 8 to look.

In figure 3.4 on page 78, I think that line 360 needs some explaination. The 4 << 12 is putting 4 into the version field, but it is putting a constant 5 in to the IP header length field (IHL) (i.e., 5 << 8), but in the line above there is a check for options, thus this code should probably be (5 + (((opt->optlen)+3) >>2 )) << 8; since the length of the header increases by multiples of 4 bytes (32 bit words) as you add options and opt->optlen is in bytes the +3 makes sure that you end up with the right number of words since IHL is in words.

In conjunction with figure 3.5 the authors might say what the purpose of this code is. In addition, that the flow control ("fc") uses a backoff based on the function ffz(). But where is ffz() described? It is in fact one of the bitops functions:

      ffz --  find first zero in word. 
      unsigned long ffz (unsigned long word);

So why is it complementing xoff and then using ffz rather than simply using ffs (find first set)? What is this algorithm in fact doing? It is walking along the bits of xoff from left to right turning the bit off and making a call to netdev_fc_slots[i].stimul() based on the i'th bit. This table was filled in in the routine:

   int netdev_register_fc(struct device *dev, void (*stimul)(struct device *dev))

In conjunction with figure 3.6 on page 79 is would be very nice if the authors had said that "cng" is short for "congestion", thus get_sample_stats() returns a value which indicates if the interface is experiencing congestion because the queue length is growing (i.e., the backlog (blog) is increasing). The authors might also explain that this code is smoothing the statistics in line 1053 to compute average backlog (avg_blog). In addition, based on the random number (rd) the code picks one of the slots from netdev_max_backlog and if this is on average occupied then the code will return a value indicating NET_RX_DROP or NET_RX_CN_HIGH depending on the congestion conditions, this will later be used to decide if it should drop a packet which it has already received. There should be a cross reference to where the authors will discuss this.

In figure 3.8 on page 81, "different of" should probably be "other than". since th eassumption is that the older protocols are either NET_BH or TIME_BH.

In the text at the bottom of page 81 it would be nice to have useful cross references to net_rx_action() and ip_rcv().

Similarily in subsection 3.2.7 it would be really good to include the file name and line number where these functions are and if they are included in the book then the page number should also be given.

In subsection 3.2.8 it would be interesting if the authors explained why the assignment of a function to sk->data_ready provides the triggering.

In subsection 3.2.9 the authors so some tcpdump output but the authors don't explain it, comment on it, or use it for more than filling space.

In figure 3.10 line 502, why should the state be compared against TCP_ESTABLISHED when this is UDP code? Is this constant being overloaded for use in the UDP case also?

In subsection 3.3.3, page 86, there is again another tcpdump without explaination. What purpose does it serve?

In subsection 3.4.1, page 87, there is again another tcpdump without explaination. What purpose does it serve?

In section 3.5, page 88, there is again another tcpdump. Why does it have line numbers? Why doesn't it correspond to the output generated in figure 3.18 (specifically it does not show the SIGALRM {probably ^C} that occured as shown in fig. 3.18 line 51 nor does it should the output of the statistics fig. 3.18 line 54-56.

What is the relation between "bells" and "brahms" since figure 3.16 shows the ping of "bells" and figure 3.17 line 4 shows "brahms" answering an icmp echo request?

In fig. 3.18 line 11 why all the white space? Why is the connect in line 11 refused and what is its purpose?

In fig. 3.18 why is /etc/resolv.conf checked in lines 8&9 before nsswitch.conf is check? (this would seem to prevent nsswitch from being able to specify files before DNS).

With regard to fig. 3.18 lines 19, 26, 27 - there should be an explaination of why these ports: 53(domain), 1025 (generally the first dynamically assigned port), and 1031 (inetinfo) are used in these connects?

Overall it seems as if the writing was only half done and almost none of the good editing seems to have been done. While one might say that adding more explainations would have made the book even longer -- I would say that without the unexplained code and examples the book would have been much thinner.

Comments part 2

On page 93, there is again an error with "\#" which should of course just be a "#".

The comment in line 46 on page 99, should probably be "wait for server to set up listen" (which the server does in line 93 on page 100). It would also be useful to have a figure and some discussion of this example (aside from the 1 sentence on page 101 and the 2 sentences on page 100).

Using select to simply provide a timeout does not seem like a very useful example. In addition, why not show the select function template - so that one knows what the parameters are.

     int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *errorfds, struct timeval *timeout);

In figure 4.8 on page 103, why not use a symbolic file decriptor rather than a constant? (in this case 0).

With regard to figure 4.10 on page 104, why is there no discussion or explaination of F_SETOWN and F_SETFL?
from: /usr/include/sys/fcntl.h

...
#  define F_SETFL	4	/* Set file status flags */
...

#  define F_SETOWN     12	/* set pgrp ID to receive SIGURG signals */

In figure 4.11 on page 105, line 119 should probably be:

   __pad[__SOCK_SIZE__ - sizeof(sa_family_t) -
         sizeof(unsigned short int) - sizeof(struct in_addr)]

Since sa_family_t might not always be a short int.

Also there should be an explaination that the purpose of this computation is to produce an overall structure which has a fixed size (i.e., __SOCK_SIZE__ ).

In figure 4.12 on page 105, it might be nice to comment that the reason for the union is to allow access as either bytes, shorts, or words.

On page 112, in the first sentence of 5.3 there is a missing "device" between "network" and "interrupts".

On page 114, it is difficult to see what figure 5.8 contributes.

On page 116 in figure 5.10 at line 69, it would be very interesting to have an explaination of what the code does; especially as it involves a function as the target of the assignment:

       softirq_active(cpu) =& ~ active

A similar comment applies to line 115 of figure 5.12 on page 117. It would also be interesting to have some commentary as to why softirq_mask(cpu) is a 32 element bit vector. In fact is would be very interesting to have some commentary on what this procedure open_softirq() really does.

In figure 5.13 on page 118 in line 45, "talklet_trylock()" should be "tasklet_trylock()".

On page 122, there seems to be a word missing between "and using" and "to upcall".

On page 127, figure 5.24 is the same as figure 3.5 on page 79. Why is this code shown again? Why is there still no detailed explaination of what it is doing?

Comments part 3

In conjunction with figure 5.25, the structure of the I/O registers for the device should be explained. This should include the purpose of each of these registers.

What is the meaning of "EL1" in figure 5.25?

In figure 5.28 line 319, why is the command register send an AX_LOOP+1? Why not simply AX_LOOP? What is the purpose of this "1" and why doesn't it have a symbolic name?

In figure 5.29 in lines 346-347, why not use this in conjunction with multicast?

In figure 5.34 line 534 the freeing of this buffer is a very significant event - why is this not cross referenced with the text?

In figure 5.34 and others, the state of lp->loading is quite important but there is no explaination of what these starts are, what they mean, and there is no reason why they should not have symbolic names!

In figure 5.35 the is a test against el_debug, but there is no overall description of the debugging levels and what the purpose of each is. It would also be useful if in conjunction with this figure and figure 5.36, 5.37, ... there were an explaination of txsr, gp, rp, axsr, ... .

There is no explaination of why the processing should give up after 16 collisions. Why?

In figure 5.40, what is PIO? (programmable I/O, i.e., using the processor to move the data rather than using DMA)

In figure 5.41 why repeat lines 739-747 as they have appeared earlier earlier.

In figure 5.42 the authors should comment that skb == NULL means that there are no buffers available.

What is the purpose of lines 879-882 in figure 5.43?

In figure 5.44 there should be an explaination of the names and commentary on the results.

In figure 5.45, why are almost 1/30 of the packets received have errors? Why are there almost 10% overruns?

Why is figure 5.43 part of figure 5.46? Lines 915.. are part of initializing the device - there should be some commentary on this. Why fake a device structure and what does Space.c do? Part of figure 5.47 is really lines 915-935 from the previous figure --- since this is part of the module support.

What is the purpose of the list of authors in figure 5.48?

What is the purpose of the copyright notice in figure 5.49?

It would be interesting if there were commentary of why open should be disallowed from /proc in line 121 of figure 5.51.

Is showing the #ifdef critical in figure 5.52?

Again, why use null loops in the expansions shown in lines 173-176 of figure 5.53?

The value of MAX_SOC_ADDR in figure 5.54 is the maximum length of the address structure.

In the text at the top of page 148, there should be further description of why the check only happens when you move from user to kernel space.

What is the meaning of "1003.1g" in line 260 of figure 5.56?

Comments part 4

Why is the information in lines 268-270 of figure 5.57 the information that is returned for a statfs on a socket. The structure is:

struct statfs {
        long f_type;
        long f_bsize;
        long f_blocks;
        long f_bfree;
        long f_bavail;
        long f_files;
        long f_ffree;
        __kernel_fsid_t f_fsid;
        long f_namelen;
        long f_spare[6];
};

Isn't there more interesting status information that could be returned via f_spare?

What is the purpose of the code in figure 5.58 sockfs_read_super()?

Shouldn't sockfs_delete_dentry return ENOSYS rather than 1, in figure 5.59?

What does the code in fiure 5.61 really do?

In figure 5.62, there is a missing "this" between "may close" and "file descriptor". Why can't we use locking to eliminate the race condition?

In figure 5.63, why is the maximum name length 32, and what is it not a defined constant?

What is mode "3" in figure 5.64 and why isn't it a defined constant?

Figure 5.69 talks about /proc providing a back door, but earlier the code said that an open via /proc was disallowed.

What are the functions scm_send, scm_destroy, and scm_recv and what do they do?

In figure 5.83 there is a missing word "hence" between "inode is NULL" and "we were closing". In addition, why is the first argument of the sock_fasync() a -1? Since the prototype for the function is:

static int sock_fasync(int fd, struct file *filp, int on);

So why use a -1 as the file descriptor? (especially when "on = 0", when the code does not even use the file descriptor fd).

In figure 5.90, why is the size of the module_name explicitly 30? Why not a declared constant?

Why the emoticon "8)" in line 903 of figure 5.92?

A TeX command "tt" slipped though on page 170, just above figure 5.96.

A missing word in line 1026 of the comment in figure 5.97 on pg. 171, "space" should be between "move it to user" and "at the very end".

What is the purpose of the section on "BSDisms" starting on page 175? What do they do? Why include them?

A missing word in line 1489 of the comment in figure 5.109 on page 179, "operation" should be between "control" and "on a socket descriptor".

What is the purpose of the structure declaring the length of the argument lists in lines 1505..1510 of figure 5.109?

In figure 5.110, what is the saving of 20%? What is it in comparison to and why would it be important?

In figure 5.115, in line 1674 shouldn't "Wan" be "WAN"? Why init the wanrouter code on line 1678? (i.e., why should this happen inside socket_init?)

On page 185 the authors state how important the sk_buff structure is, but if so, then why did they discuss it earlier in the book? (as Stevens does with mbufs)

What is the purpose of count in figure 5.122?

Why are the data structures aligned on 16 byte boundaries in line 189 of figure 5.122?

What is the code in figure 5.124 doing?

Figure 5.128 is the same as figure 5.122!

Why doesn't the code in figure 5.131 copy the content? The authors state on page 193 that this is the case, but don't explain its significance.

What are __tcp_ehash and __tcp_bhash? Why are there two different hash structures? (if "b" is for "bind", then what is "e" for?)

In section 5.7.1 on page 199, the authors state: "In any case, one of the most formidible data structures in the Linux communications protocol stack implementation is contained in th innocently named file sock.h." But they don't say what struct this is!

There is no need for a comma after "IP" in the sentence just above figure 5.141 on page 199.

In conjunction with figure 5.147 it would be good if the authors described the coupling of the conditionals to the inclusion of different protocols into the kernel, i.e. If you want Apple Talk, then you have to set CONFIG_ATALK_MODULE to true. There should also be a reference to the file where all this configuration can be done and where descriptions of the meaning of these choices might be found.

In figure 5.148 why repeat lines 610-648 (since they were already in figure 5.147)?

In figure 5.155 it would be good if you described the __x() vs. the x() versions of the functions.

What is the purpose of section 5.7.6?

Comments part 5

Pg. 216, fiure 6.2, what is dst->hh ?

Pg. 217, figure 6.4, what is the BUG_TRAP() ?

Pg. 218, figure 6.6, IP_INC_STATS(), you should explain that IpOutRequests, etc. are SNMP variables

Pg. 219, figure 6.9, "We use to do this" ==> "We used to do this"

Pg. 233, figure 6.31, Why doesn't ICMP use this function?

Pg. 238, figure 6.38, MF = More Fragments or offset therefore reassemble

Pg. 239, figure 6.39, Why isn't st a struct with a processor ID field? Why are the number of input and output packets both incremented?

Pg. 239, figure 6.40 You should explain what matirans are.

Pg. 241, figure 6.43, What are fastroute obstacles?

Pf. 242, figure 6.44, "must reply an ICMP" ==> "must reply with an ICMP" and "telling that the packet's" ==> "telling the sender that the packet's"

Pgg. 243, figure 6.45, you should explain that NF_HOOK has to do the actual sending

Pg. 253, the discussion of reassembly is actually very interesting (that this is the only timer) - hence all the rest of the IP limis are simply counters

Pg. 257, explain traceback and what the security hole is. figure 6.64, line 28, "function is allowed be" ==> "function is allowed to be"

Pg. 264, figure 6.74, explain about Timestamp vs. timestamp and record route (which inserts this nodes address).

Pg. 270, figure 6.82, Explain CMSGs (Console Messages) and how they are accessed.

Pg. 274-275 why is this example chopped up into pieces?

Pg. 288, figure 6.109 you should explain what these values are -- i.e., why is the sturcture initalized with these values.

Pg. 289, figure 6.110 TR == Token Ring -- why isn't this inside a --> -- conditional and only included if you have configured token ring --> -- support (similarily for the FDDI code).

Pg. 292, figure 6.117, who calls arp_find?

Pg. 296, figure 6.123 explain pln == 4

Pg. 304, figure 6.135, explain why you use the constant 0xFFFFFFUL (an unsigned long) - why isn't this a defined constant?

Pg. 345, figure 6.195, explain IPMR

Pg. 386, figure 6.263, cplain what the values in the initialized structures mean. Why these values?

Pg. 480, figure 7.5 explain why there is a gettimeofday() call - it is for the timeout --- see section 7.3.6

Pg. 481, section 7.5 explain what "As UDP is a datagram based protocol, ... " -- this means that the "connection" is only concerned with setting up local state, since there is no connection in the sense of TCP. (note that this same comment applies to the text just after the figure on page 482) Relate the comment here about the timeout to figure 7.5 on the previous page.

Pg. 486, why does the figure say it is "a Web browse with Lynx"? Explain. Also explain the several gettimeofday() calls.

Pg. 487, "rather inefficient on bandwidth use" ==> "rather inefficient with respect to bandwidth use"

Pg. 489, you state that "as soon as the sender starts sending too fast ...", but in fact the loss only occurs is the queues are not long enough.

Pg. 490, section 7.4.2, the cross reference at the end of the first paragraph is wrong it should be a reference to part of fgure 7.1 at the top of page 489.

Pg. 493, in the first line it should be examining the tcpdump from pg. 489.

Pg. 493, figure 7.19, line 259 - what is the reason for laughing?

Pg. 512 the commentary should spell out "VJ" an explain the comment on line 1313 of fiure 7.40 about why it "failed".

Pg. 536, figure 7.65 why does it generate erratics ACKs?

Pg. 549, "through at lease one path" ==> "through at least one path"

Pg. 551, "made my some experts" ==> "made by some experts"

Comments part 6

[A long pause (~1 year) until I could get enough enthusiasm to finish the book.]

Pg. 618, "read" ==> "lead"

Pg. 618, "has carried out" ==> "has received"

Pg. 620, Section 8.4.2 should be tied to figure 8.99

Pg. 644, The should be a comment following the statement "TCP sources will always attempt to fill up such links." that says "However, the output rate can't exceed the rate at which data is generated (at least for very long)! Thus if the source is actually slow, then in fact you may not need TCP's flow control."

Pg. 838, figure 9.288 repeats figure 9.285


Offical disclaimer required by KTH central administration

Detta är en personlig hemsida och åsikter framförda här eller i tillhandahållna länkar representerar inte KTH.

This is a personal homepage. Opinions expressed here or implied by links provided, do not represent the official views of KTH.

For information contact maguire@it.kth.se
© 2002 and 2003 G. Q. Maguire Jr., KTH/IMIT All Rights Reserved.

Last modified: 2003.11.02 11:31:22 MET

Valid HTML 4.01!