Two Pass Algorithms#
Sometimes as you are working through one step in a loop, you find that you need information that will be available only on a subsequent step. Often the solution is to split your algorithm into two passes through the data. Loop through once to gather the information you need, then loop through again to use that information.
A Simple Example#
Suppose we had a set of scores for a project:
Name |
Score |
---|---|
Albert |
30 |
Bertal |
24 |
Cheri |
32 |
Dolores |
22 |
Ethan |
20 |
———- |
——- |
The scores might be represented as a list of tuples:
scores = [
("Albert", 30),
("Bertal", 24),
("Cheri", 32),
("Dolores", 22),
("Ethan", 20)
]
Suppose we wished to normalize these scores, replacing the raw score by a a ratio to the average (arithmetic mean) score. (This is a very bad idea, but we will suppose it for the sake of the example.) Since Cheri’s score is 125% of the average score, her score should be represented as 1.25. Since Ethan’s score is 78% of the average score, his score should be represented as 0.75.
Name |
Score |
---|---|
Albert |
1.17 |
Bertal |
0.94 |
Cheri |
1.25 |
Dolores |
0.86 |
Ethan |
0.78 |
We might begin to write a simple function to build the list of tuples with normalized scores:
def normalize_scores(scores: List[Tuple[str, Number]]) -> List[Tuple[str, float]]:
"""Normalized scores are fraction of average"""
result = [ ]
for student, score in scores:
normalized = score / mean ## I need the mean here!
result.append((student, normalized))
return result
To make this work, we need the mean value. We could try calculating it as we loop through the entries:
def normalize_scores(scores: List[Tuple[str, Number]]) -> List[Tuple[str, float]]:
"""Normalized scores are fraction of average"""
result = [ ]
total = 0
count = 0
for student, score in scores:
count += 1
total += score
mean = total / count # This is not correct!
normalized = score / mean
result.append((student, normalized))
return result
This doesn’t work … it divides the score by the sum so far rather than the sum of all entries.
We could get very fancy with a list comprehension to calculate the sum and count each time we need them:
def bad_normalize(scores: List[Tuple[str, Number]]) -> List[Tuple[str, float]]:
"""Looks cute, but this algorithm is quadratic in length of list!"""
result = [ ]
count = len(scores)
for student, score in scores:
total = sum([score for name, score in scores])
mean = total / count
normalized = score / mean
result.append((student, normalized))
return result
This looks clever, but it is actually very bad. While the code looks like
it only has one loop, it is actually nested loops, because the
sum
and the list comprehension are actually loops through the
list of scores. We won’t notice the inefficiency for a list of
five scores. We would surely notice for
a list of 1000 scores.
The fix is very simple: We need to calculate the total just once. We loop through the list of scores once to get the total (and perhaps also the count), and then loop through again to produce the normalized list:
def normalize_scores(scores: List[Tuple[str, Number]]) -> List[Tuple[str, float]]:
"""Normalized scores are fraction of average"""
# Pass 1: Gather the summary information
total = 0
count = 0
for student, score in scores:
count += 1
total += score
mean = total / count
# Pass 2: Use the summary information
result = [ ]
for student, score in scores:
normalized = score / mean
result.append((student, normalized))
return result
There are still two loops in this solution, but now they are not nested loops. Suppose there are n entries in the list; nested loops will touch each entry n2 times, but one loop followed by another loop through the same items will touch each entry only 2n times. For a list of 1000 items, the difference is 1,000,000 vs 2000 operations.
Summarizing in a Table#
In the example above, we just needed a couple of variables
(total
and count
) to hold the summary information. Often
we will need more. In that case, instead of a few summary
variables, we may need a table of summary information.
A typical real-life example requiring a table of summary information is producing cross references in text formatting. For example, the source code for a book to be formatted by the LaTeX document processing system might include a cross-reference entry like this:
As we will explain in more detail later (page \ref{sec:binary}),
executable code in computer memory is indistinguishable
from data.
Another part of the document source code might indicate the intended page number:
\Section{Binary Encoding Of Everything You can Imagine}
\label{sec:binary}
Everything is binary. Everything. That too.
And it's all a big undifferentiated soup of numbers
until we decide how to interpret it.
The text formatting system has to essentially format the whole text twice.
On the first pass it may discover that the label sec:binary
appears on
page 42. This might be before or after the cross reference; either way
it saves the pair ("sec:binary", 42)
in a table, and uses that table
on the second pass when it can substitute the page number for the
reference.
As a less realistic but simpler example, suppose we have a list of fruits like this:
fruit_bowl = ["banana", "orange", "banana", "kiwi", "banana", "orange"]
We want “m of n” style formatting for each item in the bowl:
banana 1 of 3
orange 1 of 2
banana 2 of 3
kiwi 1 of 1
banana 3 of 3
orange 2 of 2
We will write a function itemize_fruits
to produce the
printed list of fruits in a bowl.
def itemize_fruits(bowl: List[str]):
"""Print the fruits as
banana 1 of 3
orange 1 of 2
etc
"""
Keeping a count is not difficult, but we need two counts for each fruit:
the total count, and the count of items encountered so far.
In the first pass through the list of fruits, we will produce the
table of total counts:
# Pass 1
full_counts = { } # We'll keep summary information here
for fruit in bowl:
# Ensure there is an entry for the fruit
if fruit not in full_counts:
full_counts[fruit] = 0
full_counts[fruit] += 1
In the second pass, we will recompute counts and store them in a separate table.
# Pass 2
cur_counts = { }
for fruit in bowl:
# Ensure this is an entry for the fruit
if fruit not in cur_counts:
cur_counts[fruit] = 0
cur_counts[fruit] += 1
We could make a third pass through the list to do the actual printing, but it is not necessary. We can print as soon as we have computed the the second count, in the same loop:
def itemize_fruits(bowl: List[str]):
"""Print the fruits as
banana 1 of 3
orange 1 of 2
etc
"""
# Pass 1
full_counts = { } # We'll keep summary information here
for fruit in bowl:
# Ensure there is an entry for the fruit
if fruit not in full_counts:
full_counts[fruit] = 0
full_counts[fruit] += 1
# Pass 2
cur_counts = { }
for fruit in bowl:
# Ensure this is an entry for the fruit
if fruit not in cur_counts:
cur_counts[fruit] = 0
cur_counts[fruit] += 1
# Now we have both pieces of information
print(f"{fruit} {cur_counts[fruit]} of {full_counts[fruit]}")
Application to the Assembler Project#
In week 8 of CIS 211, you will build the address resolution logic of an assembler, a program that translates Duck Machine assembly code to machine code. What you will create actually is the first phase of the assembler, which resolves labels to pc-relative addresses. Given an input program like this:
# Determine the maximum of two integers.
# This program triggers memory-mapped IO to
# read integers from the keyboard and to
# write integers to the display.
LOAD r1,r0,r0[510] # Trigger read from console
LOAD r2,r0,r0[510] # Trigger read from console
SUB r0,r1,r2[0]
JUMP/P r1bigger
STORE r2,r0,r0[511] # Trigger write to console
HALT r0,r0,r0
r1bigger:
STORE r1,r0,r0[511] # Trigger write to console
HALT r0,r0,r0
Your program will translate the label r1bigger
into a
PC-relative address, resulting in a modified (“resolved”)
version of the program, like this:
# Determine the maximum of two integers.
# This program triggers memory-mapped IO to
# read integers from the keyboard and to
# write integers to the display.
LOAD r1,r0,r0[510] # Trigger read from console
LOAD r2,r0,r0[510] # Trigger read from console
SUB r0,r1,r2[0]
ADD/P r15,r0,r15[3] #Jump to r1bigger
STORE r2,r0,r0[511] # Trigger write to console
HALT r0,r0,r0
r1bigger:
STORE r1,r0,r0[511] # Trigger write to console
HALT r0,r0,r0
Note that the JUMP/P
pseudo-instruction has been translated
to the real Duck Machine instruction ADD/P
(add if positive),
and that the instruction adds 3 to the program counter
in register 15. The relative address 3 is the difference
between address 6 at which the label r1bigger
appears
address 3 at which the jump appears.
Just as we counted fruits twice above to itemize them,
your address resolver will need to count instructions
twice to determine the addresses. In the first pass,
you will create a table associating labels like r1bigger
with addresses by counting the number of machine instructions
to be produced. In the second pass, you will count
machine instructions again, and also produce the
modified program code.
Summary#
We often use multi-pass (two-pass or more) algorithms when a step in a loop requires information that will not be available until a future step. We make (at least) one “pass” (loop) through the whole data to gather the needed information, and then a second “pass” to use it. Sometimes we only need a summary variable or two, but often we need a table to hold whatever information we need to summarize.
Example Code#
Sample source code is available for this chapter.