CSCI 1312 (Introduction to Programming for Engineering), Fall 2018:
Homework 10

Credit:
50 points.

Reading

Be sure you have read (or at least skimmed) the assigned readings from chapters 7 and 12.

Programming Problems

Do the following programming problems. You will end up with at least one code file per problem. Submit your program source (and any other needed files) by sending mail to bmassing@cs.trinity.edu with each file as an attachment. Please use a subject line that mentions the course and the assignment (e.g., ``csci 1312 hw 10'' or ``CS1 hw 10''). You can develop your programs on any system that provides the needed functionality, but I will test them on one of the department's Linux machines, so you should probably make sure they work in that environment before turning them in.

Yes, this writeup is long. But I think the code you write need not be, and it's an interesting problem!

You may have heard claims that E is the most frequently-used character in English text, followed by T, and so forth. Your mission for this assignment is to write two programs that together will allow you to find out how true this claim is for selected text (and, okay, to give you practice working with some course topics):

(Why two programs? Mostly pedagogical reasons.) Writing the programs from scratch is nontrivial (though you could probably do it), so to make it more doable I'm providing starter code that reduces what you need to do and also gives you some practice with UNIX make, discussed in class. Once you have an output file produced by the second program, you can use the Linux command
sort -n -r outfilename
to display the results in a way that shows the most-often-used letter first, etc.

To give you some practice working with structs in C, I want you to do this problem using an array of a structs, with each struct containing a letter and a count of how many times it occurs in the input(s). Since you need such an array in both programs, as well as code to look up a particular letter and increment its counter, it would seem to make sense to have a ``library'' used by both programs that declares/defines the struct and some needed functions. I've written code that declares the needed struct and declares some functions for building and operating on the needed array and also starter code for the two programs. Your mission will be to fill in the missing pieces. There are several ways to combine this ``library'' code with the two programs, but what I want you to do is to use the Linux utility make, as discussed in class. Starter code, with FIXME comments showing where you need to add code:

Rather than copying or downloading each of these files separately, you'll probably find it easier to download the ZIP file hw10.zip and unzip it with unzip hw10.zip. If you prefer to download individual files, NOTE that you should use your browser's ``download'' or ``save'' function to obtain the Makefile rather than copying and pasting text. This is because copy-and-paste will likely replace the tab characters in the file with spaces, with bad consequences (since tabs are semantically significant in makefiles.)

The Makefile includes instructions for ``building'' the project. Note that just using gcc with a single program, as we've been doing, won't work, but once you have all the above files downloaded, typing make will produce two executables, countalpha and mergecounts, that you can run (although they won't do anything very interesting). You might try that before starting to write code.

Instructions for specific files you need to change:

  1. (5 points) The first file you need to change is alphacounters.c, which provides code for functions declared in alphacounters.h. (Note that alphacounters.h also includes comments describing what these functions do -- very important!) There's only one function you need to write code for, the one that given a character finds the element of the array for it and increments its counter, and I'm hoping that the functions I'm providing code for will give you some hints about how to work with the array. You can check that your code at least compiles by typing make again.

  2. (20 points) The next file you need to change is the code for the first program, the one that analyzes a single input file and produces an output file. The starter code checks that there are two command-line arguments (filenames for input and output) and opens the input file. Add code to do the following: This is probably easiest to understand with examples. If the input file looks like this:
    testing 1 2 3 4?
    
    TESTING 4 3 2 1!
    
    the output file should look like this:
    e 2
    g 2
    i 2
    n 2
    s 2
    t 4
    
    and the program should print this:
    alphabet 'abcdefghijklmnopqrstuvwxyz'
    14 alphabetic characters, 36 total characters
    
    And if the input file looks like this:
    Now is the time for all good persons
    to come to the aid of their party!
    
    the output file should look like this:
    a 3
    c 1
    d 2
    e 6
    f 2
    g 1
    h 3
    i 4
    l 2
    m 2
    n 2
    o 9
    p 2
    r 4
    s 3
    t 7
    w 1
    y 1
    
    and the program should print this:
    alphabet 'abcdefghijklmnopqrstuvwxyz'
    55 alphabetic characters, 72 total characters
    

  3. (20 points) The last file you need to change is the code for the second program, the one that merges output from repeated executions of the first program. The starter code checks that there is at least one command-line argument, builds the array of structs, and calls a function processfile for each input filename to process that single file. Add code to do the following:

    About reading lines from the input file, the obvious way to read a character and a long from a file is with fscanf and a format string of "%c %ld", but this doesn't work well after the first line, because the %c picks up the newline character at the end of the second line. There are several ways to cope with that; simplest may be to read into a character array of size 2 using format string "%1s %ld".

    Here too this is probably easiest to understand with an example. Given the two output files shown earlier, the program should combine them to produce an output file containing

    3 a
    1 c
    2 d
    8 e
    2 f
    3 g
    3 h
    6 i
    2 l
    2 m
    4 n
    9 o
    2 p
    4 r
    5 s
    11 t
    1 w
    1 y
    
    and print this:
    alphabet 'abcdefghijklmnopqrstuvwxyz'
    processing input file sample1-out.txt
    processing input file sample2-out.txt
    
    Finally, if you check for errors in the input files for the second problem (optional, for extra credit), the program should give an error message for every line of this input file:
    hello
    x
    100
    x 1000x
    

  4. (5 points) Finally, you should try your programs with some non-trivial input. The Project Gutenberg web site http://www.gutenberg.org is a good source of freely-available text. I downloaded copies of two books (one by Jane Austen, one by P.G. Wodehouse) in UTF-8 format, converted to plain-text, and made another ZIP file hw10-data.zip with the results. Run your programs on these two files and send me your output files (results of running countalpha on each of the input files, and result of running mergecounts to combine them).

    If you find this sort of thing interesting, you could download additional books and try the program with them. I used the following command to convert from UTF-8 to really-plain-ASCII-text:

    iconv -f UTF-8 -t US-ASCII -c infile.txt >outfile.txt

    Or word-processing programs will also export to plain text, though if you try that route you should probably open the resulting file in vim and make sure it looks like text.

    If you do this, send me your additional input files for extra credit.

Honor Code Statement

Include the Honor Code pledge or just the word ``pledged'', plus at least one of the following about collaboration and help (as many as apply).1Text in italics is explanatory or something for you to fill in. For programming assignments, this should go in the body of the e-mail or in a plain-text file honor-code.txt (no word-processor files please).

Essay

Include a brief essay (a sentence or two is fine, though you can write as much as you like) telling me what about the assignment you found interesting, difficult, or otherwise noteworthy. For programming assignments, it should go in the body of the e-mail or in a plain-text file essay.txt (no word-processor files please).



Footnotes

... apply).1
Credit where credit is due: I based the wording of this list on a posting to a SIGCSE mailing list. SIGCSE is the ACM's Special Interest Group on CS Education.


Berna Massingill
2018-11-25