CSCI 1312 (Introduction to Programming for Engineering), Fall 2015:
Homework 6

Credit:
20 points.

Reading

Be sure you have read (or at least skimmed) the assigned readings from chapter 7.

Programming Problems

Do the following programming problems. You will end up with at least one code file per problem. Submit your program source (and any other needed files) by sending mail to bmassing@cs.trinity.edu, with each file as an attachment. Please use a subject line that mentions the course and the assignment (e.g., "csci 1312 homework 6" or "CS1 hw6"). You can develop your programs on any system that provides the needed functionality, but I will test them on one of the department's Linux machines, so you should probably make sure they work in that environment before turning them in.

Yes, this writeup is long. But I think the code you write need not be, and it's an interesting problem!

You may have heard claims that E is the most frequently used letter in English text, followed by T, and so forth. Your mission for this assignment is to write two programs that together will allow you to find out how well this claim holds up in practice (and, okay, to give you some practice working with files in C): one program to analyze a single file, and one to merge results from one or more executions of the first program.

(Why two programs? Mostly pedagogical reasons.) This is not a trivial assignment, but I am providing a function and an example program that I think will make it doable. If you capture the output of the second program in a file (perhaps using Linux output redirection!), you can use the Linux command
sort -n -r filename
to display the results so that the most-often-used letter comes first, and so on. One place to look for interesting (or at least non-trivial) text files is Project Gutenberg, though you should be careful to get the plain-text version of whatever book(s) you select (really plain text, not UTF-8).

Note: For pedagogical reasons I do want you to write two programs; I don't think you'll learn quite as much writing only one. Also for pedagogical reasons -- and because I think it works out better for you and for me -- please go along with the part of the writeup that says that these programs get filenames from command-line arguments rather than by prompting the user as you've done in previous programs.

  1. (10 points) The first program analyzes a single input file and produces an output file. It should get the names of both files from command-line arguments. The format of the output file is probably easiest to understand with an example. If the input file looks like this:
    testing 1 2 3 4?
    
    TESTING 4 3 2 1!
    
    the output file should look like this:
    24 total text characters
    2 e
    2 g
    2 i
    2 n
    2 s
    4 t
    

    I recommend that you use an array of 26 counters, one for each character of the Roman alphabet. To help(?) you, the file alpha_index.c contains a function that examines a character read from the input file and either returns its index into the alphabet (0 for "a" or "A", 1 for "b" or "B", etc.), or -1 if the character is not alphabetic. You can either copy and paste the function from this file into your program, or you can put the file in your directory and use the line #include "alpha_index.c" in your program to have the compiler include it with your code. [1]

    Of course(?), the program should check that the user supplied two command-line arguments and that the input and output files could be opened.

    Hints:

  2. (10 points) The second program (to merge results from one or more executions of the first program) simply combines results in the obvious(?) way. Given two input files produced by the first program (sample1-step1-out.txt and sample2-step1-out.txt in this example), output should look something like the following:
    processing input file sample1-step1-out.txt
      24 total text characters
    processing input file sample2-step1-out.txt
      56 total text characters
    
    summary:
    
    3 a (3.7500%)
    1 c (1.2500%)
    2 d (2.5000%)
    8 e (10.0000%)
    2 f (2.5000%)
    3 g (3.7500%)
    3 h (3.7500%)
    6 i (7.5000%)
    2 l (2.5000%)
    2 m (2.5000%)
    4 n (5.0000%)
    9 o (11.2500%)
    2 p (2.5000%)
    4 r (5.0000%)
    5 s (6.2500%)
    11 t (13.7500%)
    1 w (1.2500%)
    1 y (1.2500%)
    
    11 non-alphabetic text characters (13.7500%)
    
    Of course(?), the program should check that the user supplied at least one command-line argument and that all of the input files could be opened. It probably should also check, as it reads through the input files, that each one is in the right format (output of the first program), since otherwise the program might easily crash. It doesn't have to do this all at once: it's probably simpler to process the input files one at a time, and it's okay if the program starts processing and producing output and then bails out if it encounters an error.

    Hints:

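For problem 1, the counting step might be sketched roughly as follows. This is only an illustration under stated assumptions, not the provided code and not the required solution: letter_index is a hypothetical stand-in for alpha_index that assumes an ASCII-like character set, the names count_letters, write_counts, and analyze are made up for this sketch, and treating "text characters" as "non-whitespace characters" is a guess based on the sample (which reports 24 for the input shown).

```c
#include <ctype.h>
#include <stdio.h>

#define NUM_LETTERS 26

static const char LETTERS[] = "abcdefghijklmnopqrstuvwxyz";

/* Hypothetical stand-in for the provided alpha_index function: returns
 * 0..25 for a letter and -1 otherwise.  ASSUMPTION: letters are
 * contiguous in the character set (true for ASCII). */
int letter_index(int ch) {
    if (ch >= 'a' && ch <= 'z') return ch - 'a';
    if (ch >= 'A' && ch <= 'Z') return ch - 'A';
    return -1;
}

/* Read all of `in`, incrementing counts[] for each letter.  Returns the
 * number of non-whitespace characters seen (ASSUMPTION: that is what
 * "total text characters" means). */
long count_letters(FILE *in, long counts[NUM_LETTERS]) {
    long total = 0;
    int ch;  /* int rather than char so EOF can be told apart from data */
    while ((ch = fgetc(in)) != EOF) {
        if (isspace(ch)) continue;
        total++;
        int idx = letter_index(ch);
        if (idx >= 0) counts[idx]++;
    }
    return total;
}

/* Write the total, then one "count letter" line per letter that occurs. */
void write_counts(FILE *out, long total, const long counts[NUM_LETTERS]) {
    fprintf(out, "%ld total text characters\n", total);
    for (int i = 0; i < NUM_LETTERS; i++) {
        if (counts[i] > 0) fprintf(out, "%ld %c\n", counts[i], LETTERS[i]);
    }
}

/* Open both files, reporting any failure; returns 0 on success, 1 on error. */
int analyze(const char *in_name, const char *out_name) {
    FILE *in = fopen(in_name, "r");
    if (in == NULL) {
        fprintf(stderr, "cannot open input file %s\n", in_name);
        return 1;
    }
    FILE *out = fopen(out_name, "w");
    if (out == NULL) {
        fprintf(stderr, "cannot open output file %s\n", out_name);
        fclose(in);
        return 1;
    }
    long counts[NUM_LETTERS] = {0};
    long total = count_letters(in, counts);
    write_counts(out, total, counts);
    fclose(in);
    fclose(out);
    return 0;
}
```

A main function built on this would check that argc is 3, print a usage message if it is not, and otherwise return the result of analyze(argv[1], argv[2]).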

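For problem 2, reading one output file of the first program and checking its format as you go might look something like the sketch below. Again this is an illustration, not the required solution: the names merge_one, percent, and print_summary are hypothetical, and computing the non-alphabetic count as "total minus the sum of the letter counts" is an assumption that happens to match the sample (80 total, 69 letters, 11 non-alphabetic).

```c
#include <stdio.h>
#include <string.h>

#define NUM_LETTERS 26

static const char ALPHABET[] = "abcdefghijklmnopqrstuvwxyz";

/* Percentage of `whole` represented by `part` (0 if whole is 0). */
double percent(long part, long whole) {
    return whole > 0 ? 100.0 * part / whole : 0.0;
}

/* Read one file in the first program's output format, adding its numbers
 * into counts[] and *total.  Returns 0 on success, -1 on a format error
 * (in which case counts[] and *total may be partially updated). */
int merge_one(FILE *in, long counts[NUM_LETTERS], long *total) {
    char line[128];
    long n;
    int end = 0;

    /* First line must be "<n> total text characters".  %n is only stored
     * if everything before it matched, so end stays 0 on a mismatch. */
    if (fgets(line, sizeof line, in) == NULL) return -1;
    if (sscanf(line, "%ld total text characters%n", &n, &end) != 1
            || end == 0 || n < 0) {
        return -1;
    }
    *total += n;

    /* Remaining lines must be "<count> <lowercase letter>". */
    while (fgets(line, sizeof line, in) != NULL) {
        char letter;
        const char *p;
        if (line[strspn(line, " \t\r\n")] == '\0') continue;  /* blank */
        if (sscanf(line, "%ld %c", &n, &letter) != 2 || n < 0) return -1;
        p = strchr(ALPHABET, letter);
        if (letter == '\0' || p == NULL) return -1;
        counts[p - ALPHABET] += n;
    }
    return 0;
}

/* Print per-letter counts with percentages, then the non-alphabetic line. */
void print_summary(FILE *out, const long counts[NUM_LETTERS], long total) {
    long letters = 0;
    for (int i = 0; i < NUM_LETTERS; i++) {
        if (counts[i] > 0) {
            fprintf(out, "%ld %c (%.4f%%)\n",
                    counts[i], ALPHABET[i], percent(counts[i], total));
            letters += counts[i];
        }
    }
    fprintf(out, "\n%ld non-alphabetic text characters (%.4f%%)\n",
            total - letters, percent(total - letters, total));
}
```

A main function would loop over argv[1] through argv[argc-1], open each file, call merge_one on it (bailing out with a message if the file cannot be opened or merge_one returns -1), and call print_summary once at the end.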

Footnotes

[1]
I suspect that many programmers would instead use the library function isalpha and do something that works directly with the characters as small integers. I do it this way because the fine print for isalpha suggests that in some circumstances it might recognize as "alphabetic" characters I don't mean to include. Also, while it's true that, these days, in the overwhelming majority of C implementations the representations of the characters of the Roman alphabet are contiguous (e.g., 'z' is 'a'+25), that's not guaranteed by the standard, and my function would work even if they weren't. Also I think it's kind of a clever use of pointer arithmetic!
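
To make the contrast concrete, here is a sketch of the two approaches side by side. The contents of alpha_index.c are not reproduced in this writeup, so lookup_index below is only a guess at the kind of pointer arithmetic meant (find the character in a string of letters and subtract pointers); both function names are hypothetical.

```c
#include <ctype.h>
#include <string.h>

/* Lookup-based version: makes no assumption about how letters are
 * encoded, because the index comes from the position in the string. */
int lookup_index(char ch) {
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const char *p;
    if (ch == '\0') return -1;  /* strchr would match the terminator */
    if ((p = strchr(lower, ch)) != NULL) return (int)(p - lower);
    if ((p = strchr(upper, ch)) != NULL) return (int)(p - upper);
    return -1;
}

/* isalpha-based version: shorter, but it depends on the locale's idea
 * of "alphabetic" and on 'a'..'z' being contiguous. */
int isalpha_index(char ch) {
    unsigned char uc = (unsigned char)ch;
    if (!isalpha(uc)) return -1;
    return tolower(uc) - 'a';
}
```

In the default "C" locale on an ASCII system the two functions agree for every character; they can differ only in the circumstances the footnote describes.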


Berna Massingill
2015-12-04