PDP-11/45 RSTS/E boot problem

Mon Feb 4 04:28:08 CST 2019

So I've been helping Fritz look into his -11/45 problem, and things have
gotten to a point where I'd like to reach out for help, more eyes, etc.

I have to say, I spent almost a decade at the start of my career working on
PDP-11 hardware ('new build' DMA devices, as well as fixing broken stuff), and
software, and this is, I think, the most confusing and difficult problem I
have _ever_ seen on one. Hence the above...

What's _particularly_ confusing and difficult is that it seems like _three_
separate, un-related things all go wrong at exactly (2 of 3) or close to (the
other) the same time. And the machine now passes all the diagnostics that have
been thrown at it, particularly the KT11 and RK11 diagnostics (why this is
important will become clear). So here's what we've found to date.

The failure we're looking at is that an attempt to execute the 'ls' command
under Unix V6 fails; it gets a memory mangement fault, and dumps core.

AFAICT, the shell successfully forks, and its attempt to do an exec() of 'ls'
sort of works (more below), but a few instructions in, we get the MM fault - but
there's even more wrong when that happens (details toward the end below).

I've been looking at the core dump produced by the process, which gives me the
registers at the time of the trap, the user's stack, etc - but not a copy of
the binary code - the 'ls' command is a so-called 'pure text', i.e. the binary
is segregated into separate, potentially shared, read-only 'segment(s)' (only
1 in this case) of the PDP-11's User mode address space, and is not included
in the process dump.

(I use the term 'segment', which is actually what DEC called them in the first
version of the PDP-11/45 processor handbook, because that's what they are, not
pages, as pages are on most systems. I assume they changed to 'page' for
marketing reasons. And please, can we hold debate about this and focus on the
problem? Thanks! :-)

I do have the ability to look at the binary that it _should_ be executing, by
examining the command in its file. Also, Fritz has worked out that he can
patch the MM trap vector (before trying to do the 'ls') to halt the machine
when it happens, so he can read out all the KT11 registers, look at the actual
program in main memory, etc.

First oddity - the problem is dependent on the location of the command in main
memory! If Fritz says "sleep 360 &", to run a trivial command in the
background, and _then_ says 'ls' - it works (so we know the binary of 'ls' on
disk is OK)! We _think_ this is because the process executing the 'sleep'
takes up a chunk of main memory, and thus changes the location of the process
executing the 'ls'.

The problem is that I'm reluctant to try and change anything (e.g. to have the
OS print out anything) because that will change the location of things, and we
may (likely?) will not get the problem. With nothing changed, it _reliably_
fails - I've looked at two different core dumps, and all the essential data
(registers, user stack etc) are identical. The KT11 registers all seems to be
the same, too.

So, on to details.

I'm pretty sure the command only gets a few instructions in before it blows
up.  Here are the process' registers, and the _entire_ contents of the user
mode stack:

R0 177770
R1 0
R2 0
R3 0
R4 34
R5 444
SP 177760
PC 010210

060: 000000 000020 000001 177770 177774 177777 071554 000000

010210 turns out to be the first word in 'csv', which is an internal routine
which PDP-11 C uses to build a stack frame - _every_ C routine starts with
a "JSR R5, CSV" instruction as the first thing it does.

So looking at the stack (which looks good; it contains a valid 'argc' and 'argv'
that the process would be started with), and the registers, I'm pretty sure
it does these starting instuctions OK:

  start:
        setd
        mov     sp,r0
        mov     (r0),-(sp)
        tst     (r0)+
        mov     r0,2(sp)
        jsr     pc,_main

  _main:
        jsr     r5,csv

and then blows up on:

  csv:
	mov	r5,r0

So it's the 8th instruction in that blows up (*): but not only is what's in
memory at that location _not_ 'mov r5,r0', it also gets an MM trap that
makes no sense.

(*: In user mode: if you don't have an FPP, the first one will trap, which
UNIX ignores.)

Fritz has looked at the KT11 register when the trap happens, and the PARs and
PDRs all look good. The SSRs contain:

    > SSR's: 040143 000000 010210 000000

SSR2 gives the PC at the time of the fault (again 010210); SSR0 shows:

 Abort - segment (page) length error
 User mode
 Segment (Page) 1

which is the first thing that's wrong - neither the instruction that's
_supposed_ to be there (next), nor the one that's _actually_ there, contains
any reference to segment 1!

The _actual_ code it's trying to execute is:

    > 171600: 016162 004767 000224 000414 006700 006152 006702 006144

(Per UISA0, text base is 0161400, plus a PC of 010210, gives us 0171610, which
is right in the middle there.) That does not, alas, look anything _at all_
like what's _supposed_ to be there, which is:

010200:   110024
          10400         mov r4,r0
          167           jmp 10226 (cret)
          16
PC->      10500         mov r5,r0				(start of CSV)
          10605         mov sp,r5
          10446         mov r4,-(sp)
          10346         mov r3,-(sp)

So somehow the command (at least, this part of it - Fritz is going to check on
the first few instructions, but I'm pretty sure they will be OK) has gotten
read in wrong - but that's the least of our problems!  06700 is 'SXT R0', and
neither that nor 'MOV R5, R0' can _possibly_ cause an MM violation - least of
all one on segment 1 (this code is in segment 0)!

I could see there having been an error reading in the command binary (e.g. maybe
the RK11 has an issue), but WTF is happening here?

Just to make things triply confusing, R5 contains trash! The 'JSR R5, CSV' _should
have put the old PC in R5; but that call to CSV is at 030, so R5 _should_ contain
034, not 0444.

Needless to say, this is a real head-scratcher. What's confusing the heck out
of me are the three separate issues, all happening together - R5 contains
junk, the spurious (?) MM trap, etc.

The bad command binary in main memory could be caused by any number of things:
to get it, Unix reads file system blocks off the disk into buffers in low
memory, and then writes them out to the user's memory with MTPI. So an RK11
glitch could be doing it, but also a KT11 problem, etc.

I'm having a hard time seeing a common thread here - maybe a KT11 issue? But
how would that cause R5 to contain trash? That should only involve the KB11.
And the JSR R5, CSV must have been executed more-or-less OK, otherwise how did
it wind up at CSV?

I was wondering if some noise could be causing it - some sort of pattern
sensitiity - but how is it bashing R5 _and_ causing a spurious MM trap? That's
some glitch!

Most of the data above (e.g. SSR contents at trap time) has been re-checked,
and Fritz is going to check the rest (e.g. actual main memory contents for the
start of the code, and the user's stack - to check that the process' core dump
worked OK - although given the consistent stack contents, I'm expecting those
to be good).

I suggested to him that the time had come to apply the logic analyzer; I'd
love to see (from the IR in the CPU) the instruction that faults, and where it
came from. And also what the bus cycle is that's causing the fault; is it the
instruction fetch (possibly) or something that instruction is trying to do?

Does anyone have any comments/insight that could help work out what's going on
here? Or suggestions on things to look at? If so, thanks!

	Noel