diff --git a/src/sys/ufs/ffs/README.snapshot b/src/sys/ufs/ffs/README.snapshot
new file mode 100644
index 0000000..2ca5a19
--- /dev/null
+++ b/src/sys/ufs/ffs/README.snapshot
@@ -0,0 +1,110 @@
+$FreeBSD: src/sys/ufs/ffs/README.snapshot,v 1.4 2002/12/12 00:31:45 trhodes Exp $
+
+Snapshot Status
+
+As is detailed in the operational information below, snapshots
+are definitely alpha-test code and are NOT yet ready for production
+use. Much remains to be done to make them really useful, but I
+wanted to let folks get a chance to try it out and start reporting
+bugs and other shortcomings. Such reports should be sent to
+Kirk McKusick <mckusick@mckusick.com>.
+
+
+Snapshot Copyright Restrictions
+
+Snapshots have been introduced to FreeBSD with a `Berkeley-style'
+copyright. The file implementing snapshots resides in the sys/ufs/ffs
+directory and is compiled into the generic kernel by default.
+
+
+Using Snapshots
+
+To create a snapshot of your /var filesystem, run the command:
+
+	mount -u -o snapshot /var/snapshot/snap1 /var
+
+This command will take a snapshot of your /var filesystem and
+leave it in the file /var/snapshot/snap1. Note that snapshot
+files must be created in the filesystem that is being snapshotted.
+I use the convention of putting a `snapshot' directory at the
+root of each filesystem into which I can place snapshots.
+You may create up to 20 snapshots per filesystem. Active snapshots
+are recorded in the superblock, so they persist across unmount
+and remount operations and across system reboots. When you
+are done with a snapshot, it can be removed with the `rm'
+command. Snapshots may be removed in any order; however, you
+may not get back all the space contained in the snapshot, as
+another snapshot may claim some of the blocks that it is releasing.
+Note that the `schg' flag is set on snapshots to ensure that
+not even the root user can write to them. The unlink command
+makes an exception for snapshot files in that it allows them
+to be removed even though they have the `schg' flag set, so it
+is not necessary to clear the `schg' flag before removing a
+snapshot file.
+
+Once you have taken a snapshot, there are three interesting
+things that you can do with it:
+
+1) Run fsck on the snapshot file. Assuming that the filesystem
+   was clean when it was mounted, you should always get a clean
+   (and unchanging) result from running fsck on the snapshot.
+   If you are running with soft updates and rebooted after a
+   crash without cleaning up the filesystem, then fsck of the
+   snapshot may find missing blocks and inodes or inodes with
+   link counts that are too high. I have not yet added the
+   system calls to allow fsck to add these missing resources
+   back to the filesystem; that will be added once the basic
+   snapshot code is working properly. So, view those reports
+   as informational for now.
+
+2) Run dump on the snapshot. You will get a dump that is
+   consistent with the filesystem as of the timestamp of the
+   snapshot.
+
+3) Mount the snapshot as a frozen image of the filesystem.
+   To mount the snapshot /var/snapshot/snap1:
+
+	mdconfig -a -t vnode -f /var/snapshot/snap1 -u 4
+	mount -r /dev/md4 /mnt
+
+   You can now cruise around your frozen /var filesystem
+   at /mnt. Everything will be in the same state that it
+   was at the time the snapshot was taken. The one exception
+   is that any earlier snapshots will appear as zero length
+   files. When you are done with the mounted snapshot:
+
+	umount /mnt
+	mdconfig -d -u 4
+
+   Note that under some circumstances, the process accessing
+   the frozen filesystem may deadlock. I am aware of this
+   problem, but the solution is not simple. It requires
+   using buffer read locks rather than exclusive locks when
+   traversing the inode indirect blocks. Until this problem
+   is fixed, you should avoid putting mounted snapshots into
+   production.
+
+
+Performance
+
+It takes about 30 seconds to create a snapshot of an 8GB filesystem.
+Of that time, 25 seconds is spent in preparation; filesystem activity
+is only suspended for the final 5 seconds of that period. Snapshot
+removal of an 8GB filesystem takes about two minutes. Filesystem
+activity is never suspended during snapshot removal.
+
+The suspend time may be extended by several minutes if a process
+is in the midst of removing many files, as all the soft updates
+backlog must be cleared. Generally, snapshots do not slow the system
+down appreciably except when removing many small files (i.e., any
+file less than 96KB whose last block is a fragment) that are claimed
+by a snapshot. Here, the snapshot code must make a copy of every
+released fragment, which slows the rate of file removal to about
+twenty files per second once the soft updates backlog limit is
+reached.
+
+
+How Snapshots Work
+
+For more general information on snapshots, please see:
+	http://www.mckusick.com/softdep/
diff --git a/src/sys/ufs/ffs/README.softupdates b/src/sys/ufs/ffs/README.softupdates
new file mode 100644
index 0000000..a965f4f
--- /dev/null
+++ b/src/sys/ufs/ffs/README.softupdates
@@ -0,0 +1,58 @@
+$FreeBSD: src/sys/ufs/ffs/README.softupdates,v 1.9 2000/07/08 02:31:21 mckusick Exp $
+
+Using Soft Updates
+
+To enable the soft updates feature in your kernel, add option
+SOFTUPDATES to your kernel configuration.
+
+Once you are running a kernel with soft update support, you need to enable
+it for whichever filesystems you wish to run with the soft update policy.
+This is done with the -n option to tunefs(8) on the UNMOUNTED filesystems;
+e.g., from single-user mode you'd do something like:
+
+	tunefs -n enable /usr
+
+This permanently enables soft updates on the /usr filesystem (or at least
+until a corresponding ``tunefs -n disable'' is done).
+
+
+Soft Updates Copyright Restrictions
+
+As of June 2000, the restrictive copyright has been removed and
+replaced with a `Berkeley-style' copyright. The files implementing
+soft updates now reside in the sys/ufs/ffs directory and are
+compiled into the generic kernel by default.
+
+
+Soft Updates Status
+
+The soft updates code has been running in production on many
+systems for the past two years, generally quite successfully.
+The two currently known shortcomings are:
+
+1) On filesystems that are chronically full, the two-minute lag
+   from the time a file is deleted until its free space shows up
+   will result in premature filesystem-full failures. This
+   failure mode is most evident in small filesystems such as
+   the root. For this reason, use of soft updates is not
+   recommended on the root filesystem (see the example below).
+
+2) If your system routinely runs parallel processes, each of which
+   removes many files, the kernel memory rate limiting code may
+   not be able to slow removal operations to a level sustainable
+   by the disk subsystem. The result is that the kernel runs out
+   of memory and hangs.
+
+Both of these problems are being addressed, but have not yet
+been resolved. There are no other known problems at this time.
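+
+If you have already enabled soft updates on your root filesystem and
+want to follow the recommendation in (1) above, the same procedure
+turns it off again. The sketch below is one way to do it from
+single-user mode, where the root filesystem is still mounted
+read-only; the mount point form assumes the root filesystem is
+listed in /etc/fstab (the raw device name may be given instead):
+
+	tunefs -n disable /
+
+The flag is stored in the superblock and takes effect the next time
+the filesystem is mounted.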
+ + +How Soft Updates Work + +For more general information on soft updates, please see: + http://www.mckusick.com/softdep/ + http://www.ece.cmu.edu/~ganger/papers/CSE-TR-254-95/ + +-- +Marshall Kirk McKusick +July 2000 diff --git a/src/sys/ufs/ffs/ffs_alloc.c b/src/sys/ufs/ffs/ffs_alloc.c new file mode 100644 index 0000000..485d679 --- /dev/null +++ b/src/sys/ufs/ffs/ffs_alloc.c @@ -0,0 +1,2362 @@ +/* + * Copyright (c) 2002 Networks Associates Technology, Inc. + * All rights reserved. + * + * This software was developed for the FreeBSD Project by Marshall + * Kirk McKusick and Network Associates Laboratories, the Security + * Research Division of Network Associates, Inc. under DARPA/SPAWAR + * contract N66001-01-C-8035 ("CBOSS"), as part of the DARPA CHATS + * research program + * + * Copyright (c) 1982, 1986, 1989, 1993 + * The Regents of the University of California. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. All advertising materials mentioning features or use of this software + * must display the following acknowledgement: + * This product includes software developed by the University of + * California, Berkeley and its contributors. + * 4. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. 
+ * + * @(#)ffs_alloc.c 8.18 (Berkeley) 5/26/95 + */ + +#include +__FBSDID("$FreeBSD: src/sys/ufs/ffs/ffs_alloc.c,v 1.116 2003/10/31 07:25:06 truckman Exp $"); + +#include "opt_quota.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include + +#include +#include + +typedef ufs2_daddr_t allocfcn_t(struct inode *ip, int cg, ufs2_daddr_t bpref, + int size); + +static ufs2_daddr_t ffs_alloccg(struct inode *, int, ufs2_daddr_t, int); +static ufs2_daddr_t + ffs_alloccgblk(struct inode *, struct buf *, ufs2_daddr_t); +#ifdef DIAGNOSTIC +static int ffs_checkblk(struct inode *, ufs2_daddr_t, long); +#endif +static ufs2_daddr_t ffs_clusteralloc(struct inode *, int, ufs2_daddr_t, int); +static ino_t ffs_dirpref(struct inode *); +static ufs2_daddr_t ffs_fragextend(struct inode *, int, ufs2_daddr_t, int, int); +static void ffs_fserr(struct fs *, ino_t, char *); +static ufs2_daddr_t ffs_hashalloc + (struct inode *, int, ufs2_daddr_t, int, allocfcn_t *); +static ufs2_daddr_t ffs_nodealloccg(struct inode *, int, ufs2_daddr_t, int); +static ufs1_daddr_t ffs_mapsearch(struct fs *, struct cg *, ufs2_daddr_t, int); +static int ffs_reallocblks_ufs1(struct vop_reallocblks_args *); +static int ffs_reallocblks_ufs2(struct vop_reallocblks_args *); + +/* + * Allocate a block in the filesystem. + * + * The size of the requested block is given, which must be some + * multiple of fs_fsize and <= fs_bsize. + * A preference may be optionally specified. If a preference is given + * the following hierarchy is used to allocate a block: + * 1) allocate the requested block. + * 2) allocate a rotationally optimal block in the same cylinder. + * 3) allocate a block in the same cylinder group. + * 4) quadradically rehash into other cylinder groups, until an + * available block is located. + * If no block preference is given the following heirarchy is used + * to allocate a block: + * 1) allocate a block in the cylinder group that contains the + * inode for the file. + * 2) quadradically rehash into other cylinder groups, until an + * available block is located. + */ +int +ffs_alloc(ip, lbn, bpref, size, cred, bnp) + struct inode *ip; + ufs2_daddr_t lbn, bpref; + int size; + struct ucred *cred; + ufs2_daddr_t *bnp; +{ + struct fs *fs; + ufs2_daddr_t bno; + int cg, reclaimed; +#ifdef QUOTA + int error; +#endif + + *bnp = 0; + fs = ip->i_fs; +#ifdef DIAGNOSTIC + if ((u_int)size > fs->fs_bsize || fragoff(fs, size) != 0) { + printf("dev = %s, bsize = %ld, size = %d, fs = %s\n", + devtoname(ip->i_dev), (long)fs->fs_bsize, size, + fs->fs_fsmnt); + panic("ffs_alloc: bad size"); + } + if (cred == NOCRED) + panic("ffs_alloc: missing credential"); +#endif /* DIAGNOSTIC */ + reclaimed = 0; +retry: + if (size == fs->fs_bsize && fs->fs_cstotal.cs_nbfree == 0) + goto nospace; + if (suser_cred(cred, PRISON_ROOT) && + freespace(fs, fs->fs_minfree) - numfrags(fs, size) < 0) + goto nospace; +#ifdef QUOTA + error = chkdq(ip, btodb(size), cred, 0); + if (error) + return (error); +#endif + if (bpref >= fs->fs_size) + bpref = 0; + if (bpref == 0) + cg = ino_to_cg(fs, ip->i_number); + else + cg = dtog(fs, bpref); + bno = ffs_hashalloc(ip, cg, bpref, size, ffs_alloccg); + if (bno > 0) { + DIP(ip, i_blocks) += btodb(size); + ip->i_flag |= IN_CHANGE | IN_UPDATE; + *bnp = bno; + return (0); + } +#ifdef QUOTA + /* + * Restore user's disk quota because allocation failed. 
+ */ + (void) chkdq(ip, -btodb(size), cred, FORCE); +#endif +nospace: + if (fs->fs_pendingblocks > 0 && reclaimed == 0) { + reclaimed = 1; + softdep_request_cleanup(fs, ITOV(ip)); + goto retry; + } + ffs_fserr(fs, ip->i_number, "filesystem full"); + uprintf("\n%s: write failed, filesystem is full\n", fs->fs_fsmnt); + return (ENOSPC); +} + +/* + * Reallocate a fragment to a bigger size + * + * The number and size of the old block is given, and a preference + * and new size is also specified. The allocator attempts to extend + * the original block. Failing that, the regular block allocator is + * invoked to get an appropriate block. + */ +int +ffs_realloccg(ip, lbprev, bprev, bpref, osize, nsize, cred, bpp) + struct inode *ip; + ufs2_daddr_t lbprev; + ufs2_daddr_t bprev; + ufs2_daddr_t bpref; + int osize, nsize; + struct ucred *cred; + struct buf **bpp; +{ + struct vnode *vp; + struct fs *fs; + struct buf *bp; + int cg, request, error, reclaimed; + ufs2_daddr_t bno; + + *bpp = 0; + vp = ITOV(ip); + fs = ip->i_fs; +#ifdef DIAGNOSTIC + if (vp->v_mount->mnt_kern_flag & MNTK_SUSPENDED) + panic("ffs_realloccg: allocation on suspended filesystem"); + if ((u_int)osize > fs->fs_bsize || fragoff(fs, osize) != 0 || + (u_int)nsize > fs->fs_bsize || fragoff(fs, nsize) != 0) { + printf( + "dev = %s, bsize = %ld, osize = %d, nsize = %d, fs = %s\n", + devtoname(ip->i_dev), (long)fs->fs_bsize, osize, + nsize, fs->fs_fsmnt); + panic("ffs_realloccg: bad size"); + } + if (cred == NOCRED) + panic("ffs_realloccg: missing credential"); +#endif /* DIAGNOSTIC */ + reclaimed = 0; +retry: + if (suser_cred(cred, PRISON_ROOT) && + freespace(fs, fs->fs_minfree) - numfrags(fs, nsize - osize) < 0) + goto nospace; + if (bprev == 0) { + printf("dev = %s, bsize = %ld, bprev = %jd, fs = %s\n", + devtoname(ip->i_dev), (long)fs->fs_bsize, (intmax_t)bprev, + fs->fs_fsmnt); + panic("ffs_realloccg: bad bprev"); + } + /* + * Allocate the extra space in the buffer. + */ + error = bread(vp, lbprev, osize, NOCRED, &bp); + if (error) { + brelse(bp); + return (error); + } + + if (bp->b_blkno == bp->b_lblkno) { + if (lbprev >= NDADDR) + panic("ffs_realloccg: lbprev out of range"); + bp->b_blkno = fsbtodb(fs, bprev); + } + +#ifdef QUOTA + error = chkdq(ip, btodb(nsize - osize), cred, 0); + if (error) { + brelse(bp); + return (error); + } +#endif + /* + * Check for extension in the existing location. + */ + cg = dtog(fs, bprev); + bno = ffs_fragextend(ip, cg, bprev, osize, nsize); + if (bno) { + if (bp->b_blkno != fsbtodb(fs, bno)) + panic("ffs_realloccg: bad blockno"); + DIP(ip, i_blocks) += btodb(nsize - osize); + ip->i_flag |= IN_CHANGE | IN_UPDATE; + allocbuf(bp, nsize); + bp->b_flags |= B_DONE; + bzero((char *)bp->b_data + osize, (u_int)nsize - osize); + *bpp = bp; + return (0); + } + /* + * Allocate a new disk location. + */ + if (bpref >= fs->fs_size) + bpref = 0; + switch ((int)fs->fs_optim) { + case FS_OPTSPACE: + /* + * Allocate an exact sized fragment. Although this makes + * best use of space, we will waste time relocating it if + * the file continues to grow. If the fragmentation is + * less than half of the minimum free reserve, we choose + * to begin optimizing for time. 
+ */ + request = nsize; + if (fs->fs_minfree <= 5 || + fs->fs_cstotal.cs_nffree > + (off_t)fs->fs_dsize * fs->fs_minfree / (2 * 100)) + break; + log(LOG_NOTICE, "%s: optimization changed from SPACE to TIME\n", + fs->fs_fsmnt); + fs->fs_optim = FS_OPTTIME; + break; + case FS_OPTTIME: + /* + * At this point we have discovered a file that is trying to + * grow a small fragment to a larger fragment. To save time, + * we allocate a full sized block, then free the unused portion. + * If the file continues to grow, the `ffs_fragextend' call + * above will be able to grow it in place without further + * copying. If aberrant programs cause disk fragmentation to + * grow within 2% of the free reserve, we choose to begin + * optimizing for space. + */ + request = fs->fs_bsize; + if (fs->fs_cstotal.cs_nffree < + (off_t)fs->fs_dsize * (fs->fs_minfree - 2) / 100) + break; + log(LOG_NOTICE, "%s: optimization changed from TIME to SPACE\n", + fs->fs_fsmnt); + fs->fs_optim = FS_OPTSPACE; + break; + default: + printf("dev = %s, optim = %ld, fs = %s\n", + devtoname(ip->i_dev), (long)fs->fs_optim, fs->fs_fsmnt); + panic("ffs_realloccg: bad optim"); + /* NOTREACHED */ + } + bno = ffs_hashalloc(ip, cg, bpref, request, ffs_alloccg); + if (bno > 0) { + bp->b_blkno = fsbtodb(fs, bno); + if (!DOINGSOFTDEP(vp)) + ffs_blkfree(fs, ip->i_devvp, bprev, (long)osize, + ip->i_number); + if (nsize < request) + ffs_blkfree(fs, ip->i_devvp, bno + numfrags(fs, nsize), + (long)(request - nsize), ip->i_number); + DIP(ip, i_blocks) += btodb(nsize - osize); + ip->i_flag |= IN_CHANGE | IN_UPDATE; + allocbuf(bp, nsize); + bp->b_flags |= B_DONE; + bzero((char *)bp->b_data + osize, (u_int)nsize - osize); + *bpp = bp; + return (0); + } +#ifdef QUOTA + /* + * Restore user's disk quota because allocation failed. + */ + (void) chkdq(ip, -btodb(nsize - osize), cred, FORCE); +#endif + brelse(bp); +nospace: + /* + * no space available + */ + if (fs->fs_pendingblocks > 0 && reclaimed == 0) { + reclaimed = 1; + softdep_request_cleanup(fs, vp); + goto retry; + } + ffs_fserr(fs, ip->i_number, "filesystem full"); + uprintf("\n%s: write failed, filesystem is full\n", fs->fs_fsmnt); + return (ENOSPC); +} + +/* + * Reallocate a sequence of blocks into a contiguous sequence of blocks. + * + * The vnode and an array of buffer pointers for a range of sequential + * logical blocks to be made contiguous is given. The allocator attempts + * to find a range of sequential blocks starting as close as possible + * from the end of the allocation for the logical block immediately + * preceding the current range. If successful, the physical block numbers + * in the buffer pointers and in the inode are changed to reflect the new + * allocation. If unsuccessful, the allocation is left unchanged. The + * success in doing the reallocation is returned. Note that the error + * return is not reflected back to the user. Rather the previous block + * allocation will be used. 
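+ *
+ * Whether reallocation is attempted at all is controlled by the
+ * vfs.ffs.doreallocblks sysctl defined below; vfs.ffs.doasyncfree
+ * controls whether the updated indirect blocks and inode are
+ * written asynchronously or synchronously.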
+ */ + +SYSCTL_NODE(_vfs, OID_AUTO, ffs, CTLFLAG_RW, 0, "FFS filesystem"); + +static int doasyncfree = 1; +SYSCTL_INT(_vfs_ffs, OID_AUTO, doasyncfree, CTLFLAG_RW, &doasyncfree, 0, ""); + +static int doreallocblks = 1; +SYSCTL_INT(_vfs_ffs, OID_AUTO, doreallocblks, CTLFLAG_RW, &doreallocblks, 0, ""); + +#ifdef DEBUG +static volatile int prtrealloc = 0; +#endif + +int +ffs_reallocblks(ap) + struct vop_reallocblks_args /* { + struct vnode *a_vp; + struct cluster_save *a_buflist; + } */ *ap; +{ + + if (doreallocblks == 0) + return (ENOSPC); + if (VTOI(ap->a_vp)->i_ump->um_fstype == UFS1) + return (ffs_reallocblks_ufs1(ap)); + return (ffs_reallocblks_ufs2(ap)); +} + +static int +ffs_reallocblks_ufs1(ap) + struct vop_reallocblks_args /* { + struct vnode *a_vp; + struct cluster_save *a_buflist; + } */ *ap; +{ + struct fs *fs; + struct inode *ip; + struct vnode *vp; + struct buf *sbp, *ebp; + ufs1_daddr_t *bap, *sbap, *ebap = 0; + struct cluster_save *buflist; + ufs_lbn_t start_lbn, end_lbn; + ufs1_daddr_t soff, newblk, blkno; + ufs2_daddr_t pref; + struct indir start_ap[NIADDR + 1], end_ap[NIADDR + 1], *idp; + int i, len, start_lvl, end_lvl, ssize; + + vp = ap->a_vp; + ip = VTOI(vp); + fs = ip->i_fs; + if (fs->fs_contigsumsize <= 0) + return (ENOSPC); + buflist = ap->a_buflist; + len = buflist->bs_nchildren; + start_lbn = buflist->bs_children[0]->b_lblkno; + end_lbn = start_lbn + len - 1; +#ifdef DIAGNOSTIC + for (i = 0; i < len; i++) + if (!ffs_checkblk(ip, + dbtofsb(fs, buflist->bs_children[i]->b_blkno), fs->fs_bsize)) + panic("ffs_reallocblks: unallocated block 1"); + for (i = 1; i < len; i++) + if (buflist->bs_children[i]->b_lblkno != start_lbn + i) + panic("ffs_reallocblks: non-logical cluster"); + blkno = buflist->bs_children[0]->b_blkno; + ssize = fsbtodb(fs, fs->fs_frag); + for (i = 1; i < len - 1; i++) + if (buflist->bs_children[i]->b_blkno != blkno + (i * ssize)) + panic("ffs_reallocblks: non-physical cluster %d", i); +#endif + /* + * If the latest allocation is in a new cylinder group, assume that + * the filesystem has decided to move and do not force it back to + * the previous cylinder group. + */ + if (dtog(fs, dbtofsb(fs, buflist->bs_children[0]->b_blkno)) != + dtog(fs, dbtofsb(fs, buflist->bs_children[len - 1]->b_blkno))) + return (ENOSPC); + if (ufs_getlbns(vp, start_lbn, start_ap, &start_lvl) || + ufs_getlbns(vp, end_lbn, end_ap, &end_lvl)) + return (ENOSPC); + /* + * Get the starting offset and block map for the first block. + */ + if (start_lvl == 0) { + sbap = &ip->i_din1->di_db[0]; + soff = start_lbn; + } else { + idp = &start_ap[start_lvl - 1]; + if (bread(vp, idp->in_lbn, (int)fs->fs_bsize, NOCRED, &sbp)) { + brelse(sbp); + return (ENOSPC); + } + sbap = (ufs1_daddr_t *)sbp->b_data; + soff = idp->in_off; + } + /* + * Find the preferred location for the cluster. + */ + pref = ffs_blkpref_ufs1(ip, start_lbn, soff, sbap); + /* + * If the block range spans two block maps, get the second map. + */ + if (end_lvl == 0 || (idp = &end_ap[end_lvl - 1])->in_off + 1 >= len) { + ssize = len; + } else { +#ifdef DIAGNOSTIC + if (start_ap[start_lvl-1].in_lbn == idp->in_lbn) + panic("ffs_reallocblk: start == end"); +#endif + ssize = len - (idp->in_off + 1); + if (bread(vp, idp->in_lbn, (int)fs->fs_bsize, NOCRED, &ebp)) + goto fail; + ebap = (ufs1_daddr_t *)ebp->b_data; + } + /* + * Search the block map looking for an allocation of the desired size. 
+ */ + if ((newblk = ffs_hashalloc(ip, dtog(fs, pref), pref, + len, ffs_clusteralloc)) == 0) + goto fail; + /* + * We have found a new contiguous block. + * + * First we have to replace the old block pointers with the new + * block pointers in the inode and indirect blocks associated + * with the file. + */ +#ifdef DEBUG + if (prtrealloc) + printf("realloc: ino %d, lbns %jd-%jd\n\told:", ip->i_number, + (intmax_t)start_lbn, (intmax_t)end_lbn); +#endif + blkno = newblk; + for (bap = &sbap[soff], i = 0; i < len; i++, blkno += fs->fs_frag) { + if (i == ssize) { + bap = ebap; + soff = -i; + } +#ifdef DIAGNOSTIC + if (!ffs_checkblk(ip, + dbtofsb(fs, buflist->bs_children[i]->b_blkno), fs->fs_bsize)) + panic("ffs_reallocblks: unallocated block 2"); + if (dbtofsb(fs, buflist->bs_children[i]->b_blkno) != *bap) + panic("ffs_reallocblks: alloc mismatch"); +#endif +#ifdef DEBUG + if (prtrealloc) + printf(" %d,", *bap); +#endif + if (DOINGSOFTDEP(vp)) { + if (sbap == &ip->i_din1->di_db[0] && i < ssize) + softdep_setup_allocdirect(ip, start_lbn + i, + blkno, *bap, fs->fs_bsize, fs->fs_bsize, + buflist->bs_children[i]); + else + softdep_setup_allocindir_page(ip, start_lbn + i, + i < ssize ? sbp : ebp, soff + i, blkno, + *bap, buflist->bs_children[i]); + } + *bap++ = blkno; + } + /* + * Next we must write out the modified inode and indirect blocks. + * For strict correctness, the writes should be synchronous since + * the old block values may have been written to disk. In practise + * they are almost never written, but if we are concerned about + * strict correctness, the `doasyncfree' flag should be set to zero. + * + * The test on `doasyncfree' should be changed to test a flag + * that shows whether the associated buffers and inodes have + * been written. The flag should be set when the cluster is + * started and cleared whenever the buffer or inode is flushed. + * We can then check below to see if it is set, and do the + * synchronous write only when it has been cleared. + */ + if (sbap != &ip->i_din1->di_db[0]) { + if (doasyncfree) + bdwrite(sbp); + else + bwrite(sbp); + } else { + ip->i_flag |= IN_CHANGE | IN_UPDATE; + if (!doasyncfree) + UFS_UPDATE(vp, 1); + } + if (ssize < len) { + if (doasyncfree) + bdwrite(ebp); + else + bwrite(ebp); + } + /* + * Last, free the old blocks and assign the new blocks to the buffers. 
+ */ +#ifdef DEBUG + if (prtrealloc) + printf("\n\tnew:"); +#endif + for (blkno = newblk, i = 0; i < len; i++, blkno += fs->fs_frag) { + if (!DOINGSOFTDEP(vp)) + ffs_blkfree(fs, ip->i_devvp, + dbtofsb(fs, buflist->bs_children[i]->b_blkno), + fs->fs_bsize, ip->i_number); + buflist->bs_children[i]->b_blkno = fsbtodb(fs, blkno); +#ifdef DIAGNOSTIC + if (!ffs_checkblk(ip, + dbtofsb(fs, buflist->bs_children[i]->b_blkno), fs->fs_bsize)) + panic("ffs_reallocblks: unallocated block 3"); +#endif +#ifdef DEBUG + if (prtrealloc) + printf(" %d,", blkno); +#endif + } +#ifdef DEBUG + if (prtrealloc) { + prtrealloc--; + printf("\n"); + } +#endif + return (0); + +fail: + if (ssize < len) + brelse(ebp); + if (sbap != &ip->i_din1->di_db[0]) + brelse(sbp); + return (ENOSPC); +} + +static int +ffs_reallocblks_ufs2(ap) + struct vop_reallocblks_args /* { + struct vnode *a_vp; + struct cluster_save *a_buflist; + } */ *ap; +{ + struct fs *fs; + struct inode *ip; + struct vnode *vp; + struct buf *sbp, *ebp; + ufs2_daddr_t *bap, *sbap, *ebap = 0; + struct cluster_save *buflist; + ufs_lbn_t start_lbn, end_lbn; + ufs2_daddr_t soff, newblk, blkno, pref; + struct indir start_ap[NIADDR + 1], end_ap[NIADDR + 1], *idp; + int i, len, start_lvl, end_lvl, ssize; + + vp = ap->a_vp; + ip = VTOI(vp); + fs = ip->i_fs; + if (fs->fs_contigsumsize <= 0) + return (ENOSPC); + buflist = ap->a_buflist; + len = buflist->bs_nchildren; + start_lbn = buflist->bs_children[0]->b_lblkno; + end_lbn = start_lbn + len - 1; +#ifdef DIAGNOSTIC + for (i = 0; i < len; i++) + if (!ffs_checkblk(ip, + dbtofsb(fs, buflist->bs_children[i]->b_blkno), fs->fs_bsize)) + panic("ffs_reallocblks: unallocated block 1"); + for (i = 1; i < len; i++) + if (buflist->bs_children[i]->b_lblkno != start_lbn + i) + panic("ffs_reallocblks: non-logical cluster"); + blkno = buflist->bs_children[0]->b_blkno; + ssize = fsbtodb(fs, fs->fs_frag); + for (i = 1; i < len - 1; i++) + if (buflist->bs_children[i]->b_blkno != blkno + (i * ssize)) + panic("ffs_reallocblks: non-physical cluster %d", i); +#endif + /* + * If the latest allocation is in a new cylinder group, assume that + * the filesystem has decided to move and do not force it back to + * the previous cylinder group. + */ + if (dtog(fs, dbtofsb(fs, buflist->bs_children[0]->b_blkno)) != + dtog(fs, dbtofsb(fs, buflist->bs_children[len - 1]->b_blkno))) + return (ENOSPC); + if (ufs_getlbns(vp, start_lbn, start_ap, &start_lvl) || + ufs_getlbns(vp, end_lbn, end_ap, &end_lvl)) + return (ENOSPC); + /* + * Get the starting offset and block map for the first block. + */ + if (start_lvl == 0) { + sbap = &ip->i_din2->di_db[0]; + soff = start_lbn; + } else { + idp = &start_ap[start_lvl - 1]; + if (bread(vp, idp->in_lbn, (int)fs->fs_bsize, NOCRED, &sbp)) { + brelse(sbp); + return (ENOSPC); + } + sbap = (ufs2_daddr_t *)sbp->b_data; + soff = idp->in_off; + } + /* + * Find the preferred location for the cluster. + */ + pref = ffs_blkpref_ufs2(ip, start_lbn, soff, sbap); + /* + * If the block range spans two block maps, get the second map. + */ + if (end_lvl == 0 || (idp = &end_ap[end_lvl - 1])->in_off + 1 >= len) { + ssize = len; + } else { +#ifdef DIAGNOSTIC + if (start_ap[start_lvl-1].in_lbn == idp->in_lbn) + panic("ffs_reallocblk: start == end"); +#endif + ssize = len - (idp->in_off + 1); + if (bread(vp, idp->in_lbn, (int)fs->fs_bsize, NOCRED, &ebp)) + goto fail; + ebap = (ufs2_daddr_t *)ebp->b_data; + } + /* + * Search the block map looking for an allocation of the desired size. 
+ */ + if ((newblk = ffs_hashalloc(ip, dtog(fs, pref), pref, + len, ffs_clusteralloc)) == 0) + goto fail; + /* + * We have found a new contiguous block. + * + * First we have to replace the old block pointers with the new + * block pointers in the inode and indirect blocks associated + * with the file. + */ +#ifdef DEBUG + if (prtrealloc) + printf("realloc: ino %d, lbns %jd-%jd\n\told:", ip->i_number, + (intmax_t)start_lbn, (intmax_t)end_lbn); +#endif + blkno = newblk; + for (bap = &sbap[soff], i = 0; i < len; i++, blkno += fs->fs_frag) { + if (i == ssize) { + bap = ebap; + soff = -i; + } +#ifdef DIAGNOSTIC + if (!ffs_checkblk(ip, + dbtofsb(fs, buflist->bs_children[i]->b_blkno), fs->fs_bsize)) + panic("ffs_reallocblks: unallocated block 2"); + if (dbtofsb(fs, buflist->bs_children[i]->b_blkno) != *bap) + panic("ffs_reallocblks: alloc mismatch"); +#endif +#ifdef DEBUG + if (prtrealloc) + printf(" %jd,", (intmax_t)*bap); +#endif + if (DOINGSOFTDEP(vp)) { + if (sbap == &ip->i_din2->di_db[0] && i < ssize) + softdep_setup_allocdirect(ip, start_lbn + i, + blkno, *bap, fs->fs_bsize, fs->fs_bsize, + buflist->bs_children[i]); + else + softdep_setup_allocindir_page(ip, start_lbn + i, + i < ssize ? sbp : ebp, soff + i, blkno, + *bap, buflist->bs_children[i]); + } + *bap++ = blkno; + } + /* + * Next we must write out the modified inode and indirect blocks. + * For strict correctness, the writes should be synchronous since + * the old block values may have been written to disk. In practise + * they are almost never written, but if we are concerned about + * strict correctness, the `doasyncfree' flag should be set to zero. + * + * The test on `doasyncfree' should be changed to test a flag + * that shows whether the associated buffers and inodes have + * been written. The flag should be set when the cluster is + * started and cleared whenever the buffer or inode is flushed. + * We can then check below to see if it is set, and do the + * synchronous write only when it has been cleared. + */ + if (sbap != &ip->i_din2->di_db[0]) { + if (doasyncfree) + bdwrite(sbp); + else + bwrite(sbp); + } else { + ip->i_flag |= IN_CHANGE | IN_UPDATE; + if (!doasyncfree) + UFS_UPDATE(vp, 1); + } + if (ssize < len) { + if (doasyncfree) + bdwrite(ebp); + else + bwrite(ebp); + } + /* + * Last, free the old blocks and assign the new blocks to the buffers. + */ +#ifdef DEBUG + if (prtrealloc) + printf("\n\tnew:"); +#endif + for (blkno = newblk, i = 0; i < len; i++, blkno += fs->fs_frag) { + if (!DOINGSOFTDEP(vp)) + ffs_blkfree(fs, ip->i_devvp, + dbtofsb(fs, buflist->bs_children[i]->b_blkno), + fs->fs_bsize, ip->i_number); + buflist->bs_children[i]->b_blkno = fsbtodb(fs, blkno); +#ifdef DIAGNOSTIC + if (!ffs_checkblk(ip, + dbtofsb(fs, buflist->bs_children[i]->b_blkno), fs->fs_bsize)) + panic("ffs_reallocblks: unallocated block 3"); +#endif +#ifdef DEBUG + if (prtrealloc) + printf(" %jd,", (intmax_t)blkno); +#endif + } +#ifdef DEBUG + if (prtrealloc) { + prtrealloc--; + printf("\n"); + } +#endif + return (0); + +fail: + if (ssize < len) + brelse(ebp); + if (sbap != &ip->i_din2->di_db[0]) + brelse(sbp); + return (ENOSPC); +} + +/* + * Allocate an inode in the filesystem. + * + * If allocating a directory, use ffs_dirpref to select the inode. + * If allocating in a directory, the following hierarchy is followed: + * 1) allocate the preferred inode. + * 2) allocate an inode in the same cylinder group. + * 3) quadradically rehash into other cylinder groups, until an + * available inode is located. 
+ * If no inode preference is given the following heirarchy is used + * to allocate an inode: + * 1) allocate an inode in cylinder group 0. + * 2) quadradically rehash into other cylinder groups, until an + * available inode is located. + */ +int +ffs_valloc(pvp, mode, cred, vpp) + struct vnode *pvp; + int mode; + struct ucred *cred; + struct vnode **vpp; +{ + struct inode *pip; + struct fs *fs; + struct inode *ip; + struct timespec ts; + ino_t ino, ipref; + int cg, error; + + *vpp = NULL; + pip = VTOI(pvp); + fs = pip->i_fs; + if (fs->fs_cstotal.cs_nifree == 0) + goto noinodes; + + if ((mode & IFMT) == IFDIR) + ipref = ffs_dirpref(pip); + else + ipref = pip->i_number; + if (ipref >= fs->fs_ncg * fs->fs_ipg) + ipref = 0; + cg = ino_to_cg(fs, ipref); + /* + * Track number of dirs created one after another + * in a same cg without intervening by files. + */ + if ((mode & IFMT) == IFDIR) { + if (fs->fs_contigdirs[cg] < 255) + fs->fs_contigdirs[cg]++; + } else { + if (fs->fs_contigdirs[cg] > 0) + fs->fs_contigdirs[cg]--; + } + ino = (ino_t)ffs_hashalloc(pip, cg, ipref, mode, + (allocfcn_t *)ffs_nodealloccg); + if (ino == 0) + goto noinodes; + error = VFS_VGET(pvp->v_mount, ino, LK_EXCLUSIVE, vpp); + if (error) { + UFS_VFREE(pvp, ino, mode); + return (error); + } + ip = VTOI(*vpp); + if (ip->i_mode) { + printf("mode = 0%o, inum = %lu, fs = %s\n", + ip->i_mode, (u_long)ip->i_number, fs->fs_fsmnt); + panic("ffs_valloc: dup alloc"); + } + if (DIP(ip, i_blocks) && (fs->fs_flags & FS_UNCLEAN) == 0) { /* XXX */ + printf("free inode %s/%lu had %ld blocks\n", + fs->fs_fsmnt, (u_long)ino, (long)DIP(ip, i_blocks)); + DIP(ip, i_blocks) = 0; + } + ip->i_flags = 0; + DIP(ip, i_flags) = 0; + /* + * Set up a new generation number for this inode. + */ + if (ip->i_gen == 0 || ++ip->i_gen == 0) + ip->i_gen = arc4random() / 2 + 1; + DIP(ip, i_gen) = ip->i_gen; + if (fs->fs_magic == FS_UFS2_MAGIC) { + vfs_timestamp(&ts); + ip->i_din2->di_birthtime = ts.tv_sec; + ip->i_din2->di_birthnsec = ts.tv_nsec; + } + return (0); +noinodes: + ffs_fserr(fs, pip->i_number, "out of inodes"); + uprintf("\n%s: create/symlink failed, no inodes free\n", fs->fs_fsmnt); + return (ENOSPC); +} + +/* + * Find a cylinder group to place a directory. + * + * The policy implemented by this algorithm is to allocate a + * directory inode in the same cylinder group as its parent + * directory, but also to reserve space for its files inodes + * and data. Restrict the number of directories which may be + * allocated one after another in the same cylinder group + * without intervening allocation of files. + * + * If we allocate a first level directory then force allocation + * in another cylinder group. + */ +static ino_t +ffs_dirpref(pip) + struct inode *pip; +{ + struct fs *fs; + int cg, prefcg, dirsize, cgsize; + int avgifree, avgbfree, avgndir, curdirsize; + int minifree, minbfree, maxndir; + int mincg, minndir; + int maxcontigdirs; + + fs = pip->i_fs; + + avgifree = fs->fs_cstotal.cs_nifree / fs->fs_ncg; + avgbfree = fs->fs_cstotal.cs_nbfree / fs->fs_ncg; + avgndir = fs->fs_cstotal.cs_ndir / fs->fs_ncg; + + /* + * Force allocation in another cg if creating a first level dir. 
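+	 * (That is, when the parent is the root directory, the new
+	 * top-level directory is placed in the cylinder group with the
+	 * fewest directories, searching from a random starting group,
+	 * among those groups with at least an average number of free
+	 * inodes and blocks. This spreads top-level trees across the
+	 * disk.)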
+ */ + ASSERT_VOP_LOCKED(ITOV(pip), "ffs_dirpref"); + if (ITOV(pip)->v_vflag & VV_ROOT) { + prefcg = arc4random() % fs->fs_ncg; + mincg = prefcg; + minndir = fs->fs_ipg; + for (cg = prefcg; cg < fs->fs_ncg; cg++) + if (fs->fs_cs(fs, cg).cs_ndir < minndir && + fs->fs_cs(fs, cg).cs_nifree >= avgifree && + fs->fs_cs(fs, cg).cs_nbfree >= avgbfree) { + mincg = cg; + minndir = fs->fs_cs(fs, cg).cs_ndir; + } + for (cg = 0; cg < prefcg; cg++) + if (fs->fs_cs(fs, cg).cs_ndir < minndir && + fs->fs_cs(fs, cg).cs_nifree >= avgifree && + fs->fs_cs(fs, cg).cs_nbfree >= avgbfree) { + mincg = cg; + minndir = fs->fs_cs(fs, cg).cs_ndir; + } + return ((ino_t)(fs->fs_ipg * mincg)); + } + + /* + * Count various limits which used for + * optimal allocation of a directory inode. + */ + maxndir = min(avgndir + fs->fs_ipg / 16, fs->fs_ipg); + minifree = avgifree - avgifree / 4; + if (minifree < 1) + minifree = 1; + minbfree = avgbfree - avgbfree / 4; + if (minbfree < 1) + minbfree = 1; + cgsize = fs->fs_fsize * fs->fs_fpg; + dirsize = fs->fs_avgfilesize * fs->fs_avgfpdir; + curdirsize = avgndir ? (cgsize - avgbfree * fs->fs_bsize) / avgndir : 0; + if (dirsize < curdirsize) + dirsize = curdirsize; + maxcontigdirs = min((avgbfree * fs->fs_bsize) / dirsize, 255); + if (fs->fs_avgfpdir > 0) + maxcontigdirs = min(maxcontigdirs, + fs->fs_ipg / fs->fs_avgfpdir); + if (maxcontigdirs == 0) + maxcontigdirs = 1; + + /* + * Limit number of dirs in one cg and reserve space for + * regular files, but only if we have no deficit in + * inodes or space. + */ + prefcg = ino_to_cg(fs, pip->i_number); + for (cg = prefcg; cg < fs->fs_ncg; cg++) + if (fs->fs_cs(fs, cg).cs_ndir < maxndir && + fs->fs_cs(fs, cg).cs_nifree >= minifree && + fs->fs_cs(fs, cg).cs_nbfree >= minbfree) { + if (fs->fs_contigdirs[cg] < maxcontigdirs) + return ((ino_t)(fs->fs_ipg * cg)); + } + for (cg = 0; cg < prefcg; cg++) + if (fs->fs_cs(fs, cg).cs_ndir < maxndir && + fs->fs_cs(fs, cg).cs_nifree >= minifree && + fs->fs_cs(fs, cg).cs_nbfree >= minbfree) { + if (fs->fs_contigdirs[cg] < maxcontigdirs) + return ((ino_t)(fs->fs_ipg * cg)); + } + /* + * This is a backstop when we have deficit in space. + */ + for (cg = prefcg; cg < fs->fs_ncg; cg++) + if (fs->fs_cs(fs, cg).cs_nifree >= avgifree) + return ((ino_t)(fs->fs_ipg * cg)); + for (cg = 0; cg < prefcg; cg++) + if (fs->fs_cs(fs, cg).cs_nifree >= avgifree) + break; + return ((ino_t)(fs->fs_ipg * cg)); +} + +/* + * Select the desired position for the next block in a file. The file is + * logically divided into sections. The first section is composed of the + * direct blocks. Each additional section contains fs_maxbpg blocks. + * + * If no blocks have been allocated in the first section, the policy is to + * request a block in the same cylinder group as the inode that describes + * the file. If no blocks have been allocated in any other section, the + * policy is to place the section in a cylinder group with a greater than + * average number of free blocks. An appropriate cylinder group is found + * by using a rotor that sweeps the cylinder groups. When a new group of + * blocks is needed, the sweep begins in the cylinder group following the + * cylinder group from which the previous allocation was made. The sweep + * continues until a cylinder group with greater than the average number + * of free blocks is found. 
If the allocation is for the first block in an + * indirect block, the information on the previous allocation is unavailable; + * here a best guess is made based upon the logical block number being + * allocated. + * + * If a section is already partially allocated, the policy is to + * contiguously allocate fs_maxcontig blocks. The end of one of these + * contiguous blocks and the beginning of the next is laid out + * contiguously if possible. + */ +ufs2_daddr_t +ffs_blkpref_ufs1(ip, lbn, indx, bap) + struct inode *ip; + ufs_lbn_t lbn; + int indx; + ufs1_daddr_t *bap; +{ + struct fs *fs; + int cg; + int avgbfree, startcg; + + fs = ip->i_fs; + if (indx % fs->fs_maxbpg == 0 || bap[indx - 1] == 0) { + if (lbn < NDADDR + NINDIR(fs)) { + cg = ino_to_cg(fs, ip->i_number); + return (fs->fs_fpg * cg + fs->fs_frag); + } + /* + * Find a cylinder with greater than average number of + * unused data blocks. + */ + if (indx == 0 || bap[indx - 1] == 0) + startcg = + ino_to_cg(fs, ip->i_number) + lbn / fs->fs_maxbpg; + else + startcg = dtog(fs, bap[indx - 1]) + 1; + startcg %= fs->fs_ncg; + avgbfree = fs->fs_cstotal.cs_nbfree / fs->fs_ncg; + for (cg = startcg; cg < fs->fs_ncg; cg++) + if (fs->fs_cs(fs, cg).cs_nbfree >= avgbfree) { + fs->fs_cgrotor = cg; + return (fs->fs_fpg * cg + fs->fs_frag); + } + for (cg = 0; cg <= startcg; cg++) + if (fs->fs_cs(fs, cg).cs_nbfree >= avgbfree) { + fs->fs_cgrotor = cg; + return (fs->fs_fpg * cg + fs->fs_frag); + } + return (0); + } + /* + * We just always try to lay things out contiguously. + */ + return (bap[indx - 1] + fs->fs_frag); +} + +/* + * Same as above, but for UFS2 + */ +ufs2_daddr_t +ffs_blkpref_ufs2(ip, lbn, indx, bap) + struct inode *ip; + ufs_lbn_t lbn; + int indx; + ufs2_daddr_t *bap; +{ + struct fs *fs; + int cg; + int avgbfree, startcg; + + fs = ip->i_fs; + if (indx % fs->fs_maxbpg == 0 || bap[indx - 1] == 0) { + if (lbn < NDADDR + NINDIR(fs)) { + cg = ino_to_cg(fs, ip->i_number); + return (fs->fs_fpg * cg + fs->fs_frag); + } + /* + * Find a cylinder with greater than average number of + * unused data blocks. + */ + if (indx == 0 || bap[indx - 1] == 0) + startcg = + ino_to_cg(fs, ip->i_number) + lbn / fs->fs_maxbpg; + else + startcg = dtog(fs, bap[indx - 1]) + 1; + startcg %= fs->fs_ncg; + avgbfree = fs->fs_cstotal.cs_nbfree / fs->fs_ncg; + for (cg = startcg; cg < fs->fs_ncg; cg++) + if (fs->fs_cs(fs, cg).cs_nbfree >= avgbfree) { + fs->fs_cgrotor = cg; + return (fs->fs_fpg * cg + fs->fs_frag); + } + for (cg = 0; cg <= startcg; cg++) + if (fs->fs_cs(fs, cg).cs_nbfree >= avgbfree) { + fs->fs_cgrotor = cg; + return (fs->fs_fpg * cg + fs->fs_frag); + } + return (0); + } + /* + * We just always try to lay things out contiguously. + */ + return (bap[indx - 1] + fs->fs_frag); +} + +/* + * Implement the cylinder overflow algorithm. + * + * The policy implemented by this algorithm is: + * 1) allocate the block in its requested cylinder group. + * 2) quadradically rehash on the cylinder group number. + * 3) brute force search for a free block. 
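+ *
+ * For example, starting from the requested cylinder group cg, step 2
+ * probes cg + 1, cg + 3, cg + 7, cg + 15, ... (mod fs_ncg), doubling
+ * the step each time; step 3 then scans the remaining groups
+ * linearly, starting at cg + 2.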
+ */ +/*VARARGS5*/ +static ufs2_daddr_t +ffs_hashalloc(ip, cg, pref, size, allocator) + struct inode *ip; + int cg; + ufs2_daddr_t pref; + int size; /* size for data blocks, mode for inodes */ + allocfcn_t *allocator; +{ + struct fs *fs; + ufs2_daddr_t result; + int i, icg = cg; + +#ifdef DIAGNOSTIC + if (ITOV(ip)->v_mount->mnt_kern_flag & MNTK_SUSPENDED) + panic("ffs_hashalloc: allocation on suspended filesystem"); +#endif + fs = ip->i_fs; + /* + * 1: preferred cylinder group + */ + result = (*allocator)(ip, cg, pref, size); + if (result) + return (result); + /* + * 2: quadratic rehash + */ + for (i = 1; i < fs->fs_ncg; i *= 2) { + cg += i; + if (cg >= fs->fs_ncg) + cg -= fs->fs_ncg; + result = (*allocator)(ip, cg, 0, size); + if (result) + return (result); + } + /* + * 3: brute force search + * Note that we start at i == 2, since 0 was checked initially, + * and 1 is always checked in the quadratic rehash. + */ + cg = (icg + 2) % fs->fs_ncg; + for (i = 2; i < fs->fs_ncg; i++) { + result = (*allocator)(ip, cg, 0, size); + if (result) + return (result); + cg++; + if (cg == fs->fs_ncg) + cg = 0; + } + return (0); +} + +/* + * Determine whether a fragment can be extended. + * + * Check to see if the necessary fragments are available, and + * if they are, allocate them. + */ +static ufs2_daddr_t +ffs_fragextend(ip, cg, bprev, osize, nsize) + struct inode *ip; + int cg; + ufs2_daddr_t bprev; + int osize, nsize; +{ + struct fs *fs; + struct cg *cgp; + struct buf *bp; + long bno; + int frags, bbase; + int i, error; + u_int8_t *blksfree; + + fs = ip->i_fs; + if (fs->fs_cs(fs, cg).cs_nffree < numfrags(fs, nsize - osize)) + return (0); + frags = numfrags(fs, nsize); + bbase = fragnum(fs, bprev); + if (bbase > fragnum(fs, (bprev + frags - 1))) { + /* cannot extend across a block boundary */ + return (0); + } + error = bread(ip->i_devvp, fsbtodb(fs, cgtod(fs, cg)), + (int)fs->fs_cgsize, NOCRED, &bp); + if (error) { + brelse(bp); + return (0); + } + cgp = (struct cg *)bp->b_data; + if (!cg_chkmagic(cgp)) { + brelse(bp); + return (0); + } + bp->b_xflags |= BX_BKGRDWRITE; + cgp->cg_old_time = cgp->cg_time = time_second; + bno = dtogd(fs, bprev); + blksfree = cg_blksfree(cgp); + for (i = numfrags(fs, osize); i < frags; i++) + if (isclr(blksfree, bno + i)) { + brelse(bp); + return (0); + } + /* + * the current fragment can be extended + * deduct the count on fragment being extended into + * increase the count on the remaining fragment (if any) + * allocate the extended piece + */ + for (i = frags; i < fs->fs_frag - bbase; i++) + if (isclr(blksfree, bno + i)) + break; + cgp->cg_frsum[i - numfrags(fs, osize)]--; + if (i != frags) + cgp->cg_frsum[i - frags]++; + for (i = numfrags(fs, osize); i < frags; i++) { + clrbit(blksfree, bno + i); + cgp->cg_cs.cs_nffree--; + fs->fs_cstotal.cs_nffree--; + fs->fs_cs(fs, cg).cs_nffree--; + } + fs->fs_fmod = 1; + if (DOINGSOFTDEP(ITOV(ip))) + softdep_setup_blkmapdep(bp, fs, bprev); + if (fs->fs_active != 0) + atomic_clear_int(&ACTIVECGNUM(fs, cg), ACTIVECGOFF(cg)); + bdwrite(bp); + return (bprev); +} + +/* + * Determine whether a block can be allocated. + * + * Check to see if a block of the appropriate size is available, + * and if it is, allocate it. 
+ */ +static ufs2_daddr_t +ffs_alloccg(ip, cg, bpref, size) + struct inode *ip; + int cg; + ufs2_daddr_t bpref; + int size; +{ + struct fs *fs; + struct cg *cgp; + struct buf *bp; + ufs1_daddr_t bno; + ufs2_daddr_t blkno; + int i, allocsiz, error, frags; + u_int8_t *blksfree; + + fs = ip->i_fs; + if (fs->fs_cs(fs, cg).cs_nbfree == 0 && size == fs->fs_bsize) + return (0); + error = bread(ip->i_devvp, fsbtodb(fs, cgtod(fs, cg)), + (int)fs->fs_cgsize, NOCRED, &bp); + if (error) { + brelse(bp); + return (0); + } + cgp = (struct cg *)bp->b_data; + if (!cg_chkmagic(cgp) || + (cgp->cg_cs.cs_nbfree == 0 && size == fs->fs_bsize)) { + brelse(bp); + return (0); + } + bp->b_xflags |= BX_BKGRDWRITE; + cgp->cg_old_time = cgp->cg_time = time_second; + if (size == fs->fs_bsize) { + blkno = ffs_alloccgblk(ip, bp, bpref); + if (fs->fs_active != 0) + atomic_clear_int(&ACTIVECGNUM(fs, cg), ACTIVECGOFF(cg)); + bdwrite(bp); + return (blkno); + } + /* + * check to see if any fragments are already available + * allocsiz is the size which will be allocated, hacking + * it down to a smaller size if necessary + */ + blksfree = cg_blksfree(cgp); + frags = numfrags(fs, size); + for (allocsiz = frags; allocsiz < fs->fs_frag; allocsiz++) + if (cgp->cg_frsum[allocsiz] != 0) + break; + if (allocsiz == fs->fs_frag) { + /* + * no fragments were available, so a block will be + * allocated, and hacked up + */ + if (cgp->cg_cs.cs_nbfree == 0) { + brelse(bp); + return (0); + } + blkno = ffs_alloccgblk(ip, bp, bpref); + bno = dtogd(fs, blkno); + for (i = frags; i < fs->fs_frag; i++) + setbit(blksfree, bno + i); + i = fs->fs_frag - frags; + cgp->cg_cs.cs_nffree += i; + fs->fs_cstotal.cs_nffree += i; + fs->fs_cs(fs, cg).cs_nffree += i; + fs->fs_fmod = 1; + cgp->cg_frsum[i]++; + if (fs->fs_active != 0) + atomic_clear_int(&ACTIVECGNUM(fs, cg), ACTIVECGOFF(cg)); + bdwrite(bp); + return (blkno); + } + bno = ffs_mapsearch(fs, cgp, bpref, allocsiz); + if (bno < 0) { + brelse(bp); + return (0); + } + for (i = 0; i < frags; i++) + clrbit(blksfree, bno + i); + cgp->cg_cs.cs_nffree -= frags; + fs->fs_cstotal.cs_nffree -= frags; + fs->fs_cs(fs, cg).cs_nffree -= frags; + fs->fs_fmod = 1; + cgp->cg_frsum[allocsiz]--; + if (frags != allocsiz) + cgp->cg_frsum[allocsiz - frags]++; + blkno = cg * fs->fs_fpg + bno; + if (DOINGSOFTDEP(ITOV(ip))) + softdep_setup_blkmapdep(bp, fs, blkno); + if (fs->fs_active != 0) + atomic_clear_int(&ACTIVECGNUM(fs, cg), ACTIVECGOFF(cg)); + bdwrite(bp); + return (blkno); +} + +/* + * Allocate a block in a cylinder group. + * + * This algorithm implements the following policy: + * 1) allocate the requested block. + * 2) allocate a rotationally optimal block in the same cylinder. + * 3) allocate the next available block on the block rotor for the + * specified cylinder group. + * Note that this routine only allocates fs_bsize blocks; these + * blocks may be fragmented by the routine that allocates them. 
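+ *
+ * (Step 2 is historical: no rotational positioning is done with the
+ * current layout, so the code below simply takes the requested block
+ * if it is free, and otherwise takes the next available block found
+ * by ffs_mapsearch() and advances the rotor to it.)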
+ */ +static ufs2_daddr_t +ffs_alloccgblk(ip, bp, bpref) + struct inode *ip; + struct buf *bp; + ufs2_daddr_t bpref; +{ + struct fs *fs; + struct cg *cgp; + ufs1_daddr_t bno; + ufs2_daddr_t blkno; + u_int8_t *blksfree; + + fs = ip->i_fs; + cgp = (struct cg *)bp->b_data; + blksfree = cg_blksfree(cgp); + if (bpref == 0 || dtog(fs, bpref) != cgp->cg_cgx) { + bpref = cgp->cg_rotor; + } else { + bpref = blknum(fs, bpref); + bno = dtogd(fs, bpref); + /* + * if the requested block is available, use it + */ + if (ffs_isblock(fs, blksfree, fragstoblks(fs, bno))) + goto gotit; + } + /* + * Take the next available block in this cylinder group. + */ + bno = ffs_mapsearch(fs, cgp, bpref, (int)fs->fs_frag); + if (bno < 0) + return (0); + cgp->cg_rotor = bno; +gotit: + blkno = fragstoblks(fs, bno); + ffs_clrblock(fs, blksfree, (long)blkno); + ffs_clusteracct(fs, cgp, blkno, -1); + cgp->cg_cs.cs_nbfree--; + fs->fs_cstotal.cs_nbfree--; + fs->fs_cs(fs, cgp->cg_cgx).cs_nbfree--; + fs->fs_fmod = 1; + blkno = cgp->cg_cgx * fs->fs_fpg + bno; + if (DOINGSOFTDEP(ITOV(ip))) + softdep_setup_blkmapdep(bp, fs, blkno); + return (blkno); +} + +/* + * Determine whether a cluster can be allocated. + * + * We do not currently check for optimal rotational layout if there + * are multiple choices in the same cylinder group. Instead we just + * take the first one that we find following bpref. + */ +static ufs2_daddr_t +ffs_clusteralloc(ip, cg, bpref, len) + struct inode *ip; + int cg; + ufs2_daddr_t bpref; + int len; +{ + struct fs *fs; + struct cg *cgp; + struct buf *bp; + int i, run, bit, map, got; + ufs2_daddr_t bno; + u_char *mapp; + int32_t *lp; + u_int8_t *blksfree; + + fs = ip->i_fs; + if (fs->fs_maxcluster[cg] < len) + return (0); + if (bread(ip->i_devvp, fsbtodb(fs, cgtod(fs, cg)), (int)fs->fs_cgsize, + NOCRED, &bp)) + goto fail; + cgp = (struct cg *)bp->b_data; + if (!cg_chkmagic(cgp)) + goto fail; + bp->b_xflags |= BX_BKGRDWRITE; + /* + * Check to see if a cluster of the needed size (or bigger) is + * available in this cylinder group. + */ + lp = &cg_clustersum(cgp)[len]; + for (i = len; i <= fs->fs_contigsumsize; i++) + if (*lp++ > 0) + break; + if (i > fs->fs_contigsumsize) { + /* + * This is the first time looking for a cluster in this + * cylinder group. Update the cluster summary information + * to reflect the true maximum sized cluster so that + * future cluster allocation requests can avoid reading + * the cylinder group map only to find no clusters. + */ + lp = &cg_clustersum(cgp)[len - 1]; + for (i = len - 1; i > 0; i--) + if (*lp-- > 0) + break; + fs->fs_maxcluster[cg] = i; + goto fail; + } + /* + * Search the cluster map to find a big enough cluster. + * We take the first one that we find, even if it is larger + * than we need as we prefer to get one close to the previous + * block allocation. We do not search before the current + * preference point as we do not want to allocate a block + * that is allocated before the previous one (as we will + * then have to wait for another pass of the elevator + * algorithm before it will be read). We prefer to fail and + * be recalled to try an allocation in the next cylinder group. 
+ */ + if (dtog(fs, bpref) != cg) + bpref = 0; + else + bpref = fragstoblks(fs, dtogd(fs, blknum(fs, bpref))); + mapp = &cg_clustersfree(cgp)[bpref / NBBY]; + map = *mapp++; + bit = 1 << (bpref % NBBY); + for (run = 0, got = bpref; got < cgp->cg_nclusterblks; got++) { + if ((map & bit) == 0) { + run = 0; + } else { + run++; + if (run == len) + break; + } + if ((got & (NBBY - 1)) != (NBBY - 1)) { + bit <<= 1; + } else { + map = *mapp++; + bit = 1; + } + } + if (got >= cgp->cg_nclusterblks) + goto fail; + /* + * Allocate the cluster that we have found. + */ + blksfree = cg_blksfree(cgp); + for (i = 1; i <= len; i++) + if (!ffs_isblock(fs, blksfree, got - run + i)) + panic("ffs_clusteralloc: map mismatch"); + bno = cg * fs->fs_fpg + blkstofrags(fs, got - run + 1); + if (dtog(fs, bno) != cg) + panic("ffs_clusteralloc: allocated out of group"); + len = blkstofrags(fs, len); + for (i = 0; i < len; i += fs->fs_frag) + if (ffs_alloccgblk(ip, bp, bno + i) != bno + i) + panic("ffs_clusteralloc: lost block"); + if (fs->fs_active != 0) + atomic_clear_int(&ACTIVECGNUM(fs, cg), ACTIVECGOFF(cg)); + bdwrite(bp); + return (bno); + +fail: + brelse(bp); + return (0); +} + +/* + * Determine whether an inode can be allocated. + * + * Check to see if an inode is available, and if it is, + * allocate it using the following policy: + * 1) allocate the requested inode. + * 2) allocate the next available inode after the requested + * inode in the specified cylinder group. + */ +static ufs2_daddr_t +ffs_nodealloccg(ip, cg, ipref, mode) + struct inode *ip; + int cg; + ufs2_daddr_t ipref; + int mode; +{ + struct fs *fs; + struct cg *cgp; + struct buf *bp, *ibp; + u_int8_t *inosused; + struct ufs2_dinode *dp2; + int error, start, len, loc, map, i; + + fs = ip->i_fs; + if (fs->fs_cs(fs, cg).cs_nifree == 0) + return (0); + error = bread(ip->i_devvp, fsbtodb(fs, cgtod(fs, cg)), + (int)fs->fs_cgsize, NOCRED, &bp); + if (error) { + brelse(bp); + return (0); + } + cgp = (struct cg *)bp->b_data; + if (!cg_chkmagic(cgp) || cgp->cg_cs.cs_nifree == 0) { + brelse(bp); + return (0); + } + bp->b_xflags |= BX_BKGRDWRITE; + cgp->cg_old_time = cgp->cg_time = time_second; + inosused = cg_inosused(cgp); + if (ipref) { + ipref %= fs->fs_ipg; + if (isclr(inosused, ipref)) + goto gotit; + } + start = cgp->cg_irotor / NBBY; + len = howmany(fs->fs_ipg - cgp->cg_irotor, NBBY); + loc = skpc(0xff, len, &inosused[start]); + if (loc == 0) { + len = start + 1; + start = 0; + loc = skpc(0xff, len, &inosused[0]); + if (loc == 0) { + printf("cg = %d, irotor = %ld, fs = %s\n", + cg, (long)cgp->cg_irotor, fs->fs_fsmnt); + panic("ffs_nodealloccg: map corrupted"); + /* NOTREACHED */ + } + } + i = start + len - loc; + map = inosused[i]; + ipref = i * NBBY; + for (i = 1; i < (1 << NBBY); i <<= 1, ipref++) { + if ((map & i) == 0) { + cgp->cg_irotor = ipref; + goto gotit; + } + } + printf("fs = %s\n", fs->fs_fsmnt); + panic("ffs_nodealloccg: block not in map"); + /* NOTREACHED */ +gotit: + if (DOINGSOFTDEP(ITOV(ip))) + softdep_setup_inomapdep(bp, ip, cg * fs->fs_ipg + ipref); + setbit(inosused, ipref); + cgp->cg_cs.cs_nifree--; + fs->fs_cstotal.cs_nifree--; + fs->fs_cs(fs, cg).cs_nifree--; + fs->fs_fmod = 1; + if ((mode & IFMT) == IFDIR) { + cgp->cg_cs.cs_ndir++; + fs->fs_cstotal.cs_ndir++; + fs->fs_cs(fs, cg).cs_ndir++; + } + /* + * Check to see if we need to initialize more inodes. 
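+	 * In UFS2, inode blocks are initialized lazily: cg_initediblk
+	 * counts how many inodes in this cylinder group have had their
+	 * on-disk dinode blocks written. When the inode just allocated
+	 * comes within one block's worth of inodes of that boundary,
+	 * the next block of inodes is zeroed, given fresh generation
+	 * numbers, and queued for writing.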
+ */ + if (fs->fs_magic == FS_UFS2_MAGIC && + ipref + INOPB(fs) > cgp->cg_initediblk && + cgp->cg_initediblk < cgp->cg_niblk) { + ibp = getblk(ip->i_devvp, fsbtodb(fs, + ino_to_fsba(fs, cg * fs->fs_ipg + cgp->cg_initediblk)), + (int)fs->fs_bsize, 0, 0, 0); + bzero(ibp->b_data, (int)fs->fs_bsize); + dp2 = (struct ufs2_dinode *)(ibp->b_data); + for (i = 0; i < INOPB(fs); i++) { + dp2->di_gen = arc4random() / 2 + 1; + dp2++; + } + bawrite(ibp); + cgp->cg_initediblk += INOPB(fs); + } + if (fs->fs_active != 0) + atomic_clear_int(&ACTIVECGNUM(fs, cg), ACTIVECGOFF(cg)); + bdwrite(bp); + return (cg * fs->fs_ipg + ipref); +} + +/* + * check if a block is free + */ +static int +ffs_isfreeblock(struct fs *fs, u_char *cp, ufs1_daddr_t h) +{ + + switch ((int)fs->fs_frag) { + case 8: + return (cp[h] == 0); + case 4: + return ((cp[h >> 1] & (0x0f << ((h & 0x1) << 2))) == 0); + case 2: + return ((cp[h >> 2] & (0x03 << ((h & 0x3) << 1))) == 0); + case 1: + return ((cp[h >> 3] & (0x01 << (h & 0x7))) == 0); + default: + panic("ffs_isfreeblock"); + } + return (0); +} + +/* + * Free a block or fragment. + * + * The specified block or fragment is placed back in the + * free map. If a fragment is deallocated, a possible + * block reassembly is checked. + */ +void +ffs_blkfree(fs, devvp, bno, size, inum) + struct fs *fs; + struct vnode *devvp; + ufs2_daddr_t bno; + long size; + ino_t inum; +{ + struct cg *cgp; + struct buf *bp; + ufs1_daddr_t fragno, cgbno; + ufs2_daddr_t cgblkno; + int i, cg, blk, frags, bbase; + u_int8_t *blksfree; + dev_t dev; + + cg = dtog(fs, bno); + if (devvp->v_type != VCHR) { + /* devvp is a snapshot */ + dev = VTOI(devvp)->i_devvp->v_rdev; + cgblkno = fragstoblks(fs, cgtod(fs, cg)); + } else { + /* devvp is a normal disk device */ + dev = devvp->v_rdev; + cgblkno = fsbtodb(fs, cgtod(fs, cg)); + ASSERT_VOP_LOCKED(devvp, "ffs_blkfree"); + if ((devvp->v_vflag & VV_COPYONWRITE) && + ffs_snapblkfree(fs, devvp, bno, size, inum)) + return; + VOP_FREEBLKS(devvp, fsbtodb(fs, bno), size); + } +#ifdef DIAGNOSTIC + if (dev->si_mountpoint && + (dev->si_mountpoint->mnt_kern_flag & MNTK_SUSPENDED)) + panic("ffs_blkfree: deallocation on suspended filesystem"); + if ((u_int)size > fs->fs_bsize || fragoff(fs, size) != 0 || + fragnum(fs, bno) + numfrags(fs, size) > fs->fs_frag) { + printf("dev=%s, bno = %jd, bsize = %ld, size = %ld, fs = %s\n", + devtoname(dev), (intmax_t)bno, (long)fs->fs_bsize, + size, fs->fs_fsmnt); + panic("ffs_blkfree: bad size"); + } +#endif + if ((u_int)bno >= fs->fs_size) { + printf("bad block %jd, ino %lu\n", (intmax_t)bno, + (u_long)inum); + ffs_fserr(fs, inum, "bad block"); + return; + } + if (bread(devvp, cgblkno, (int)fs->fs_cgsize, NOCRED, &bp)) { + brelse(bp); + return; + } + cgp = (struct cg *)bp->b_data; + if (!cg_chkmagic(cgp)) { + brelse(bp); + return; + } + bp->b_xflags |= BX_BKGRDWRITE; + cgp->cg_old_time = cgp->cg_time = time_second; + cgbno = dtogd(fs, bno); + blksfree = cg_blksfree(cgp); + if (size == fs->fs_bsize) { + fragno = fragstoblks(fs, cgbno); + if (!ffs_isfreeblock(fs, blksfree, fragno)) { + if (devvp->v_type != VCHR) { + /* devvp is a snapshot */ + brelse(bp); + return; + } + printf("dev = %s, block = %jd, fs = %s\n", + devtoname(dev), (intmax_t)bno, fs->fs_fsmnt); + panic("ffs_blkfree: freeing free block"); + } + ffs_setblock(fs, blksfree, fragno); + ffs_clusteracct(fs, cgp, fragno, 1); + cgp->cg_cs.cs_nbfree++; + fs->fs_cstotal.cs_nbfree++; + fs->fs_cs(fs, cg).cs_nbfree++; + } else { + bbase = cgbno - fragnum(fs, cgbno); + /* + * decrement the counts 
associated with the old frags + */ + blk = blkmap(fs, blksfree, bbase); + ffs_fragacct(fs, blk, cgp->cg_frsum, -1); + /* + * deallocate the fragment + */ + frags = numfrags(fs, size); + for (i = 0; i < frags; i++) { + if (isset(blksfree, cgbno + i)) { + printf("dev = %s, block = %jd, fs = %s\n", + devtoname(dev), (intmax_t)(bno + i), + fs->fs_fsmnt); + panic("ffs_blkfree: freeing free frag"); + } + setbit(blksfree, cgbno + i); + } + cgp->cg_cs.cs_nffree += i; + fs->fs_cstotal.cs_nffree += i; + fs->fs_cs(fs, cg).cs_nffree += i; + /* + * add back in counts associated with the new frags + */ + blk = blkmap(fs, blksfree, bbase); + ffs_fragacct(fs, blk, cgp->cg_frsum, 1); + /* + * if a complete block has been reassembled, account for it + */ + fragno = fragstoblks(fs, bbase); + if (ffs_isblock(fs, blksfree, fragno)) { + cgp->cg_cs.cs_nffree -= fs->fs_frag; + fs->fs_cstotal.cs_nffree -= fs->fs_frag; + fs->fs_cs(fs, cg).cs_nffree -= fs->fs_frag; + ffs_clusteracct(fs, cgp, fragno, 1); + cgp->cg_cs.cs_nbfree++; + fs->fs_cstotal.cs_nbfree++; + fs->fs_cs(fs, cg).cs_nbfree++; + } + } + fs->fs_fmod = 1; + if (fs->fs_active != 0) + atomic_clear_int(&ACTIVECGNUM(fs, cg), ACTIVECGOFF(cg)); + bdwrite(bp); +} + +#ifdef DIAGNOSTIC +/* + * Verify allocation of a block or fragment. Returns true if block or + * fragment is allocated, false if it is free. + */ +static int +ffs_checkblk(ip, bno, size) + struct inode *ip; + ufs2_daddr_t bno; + long size; +{ + struct fs *fs; + struct cg *cgp; + struct buf *bp; + ufs1_daddr_t cgbno; + int i, error, frags, free; + u_int8_t *blksfree; + + fs = ip->i_fs; + if ((u_int)size > fs->fs_bsize || fragoff(fs, size) != 0) { + printf("bsize = %ld, size = %ld, fs = %s\n", + (long)fs->fs_bsize, size, fs->fs_fsmnt); + panic("ffs_checkblk: bad size"); + } + if ((u_int)bno >= fs->fs_size) + panic("ffs_checkblk: bad block %jd", (intmax_t)bno); + error = bread(ip->i_devvp, fsbtodb(fs, cgtod(fs, dtog(fs, bno))), + (int)fs->fs_cgsize, NOCRED, &bp); + if (error) + panic("ffs_checkblk: cg bread failed"); + cgp = (struct cg *)bp->b_data; + if (!cg_chkmagic(cgp)) + panic("ffs_checkblk: cg magic mismatch"); + bp->b_xflags |= BX_BKGRDWRITE; + blksfree = cg_blksfree(cgp); + cgbno = dtogd(fs, bno); + if (size == fs->fs_bsize) { + free = ffs_isblock(fs, blksfree, fragstoblks(fs, cgbno)); + } else { + frags = numfrags(fs, size); + for (free = 0, i = 0; i < frags; i++) + if (isset(blksfree, cgbno + i)) + free++; + if (free != 0 && free != frags) + panic("ffs_checkblk: partially free fragment"); + } + brelse(bp); + return (!free); +} +#endif /* DIAGNOSTIC */ + +/* + * Free an inode. + */ +int +ffs_vfree(pvp, ino, mode) + struct vnode *pvp; + ino_t ino; + int mode; +{ + if (DOINGSOFTDEP(pvp)) { + softdep_freefile(pvp, ino, mode); + return (0); + } + return (ffs_freefile(VTOI(pvp)->i_fs, VTOI(pvp)->i_devvp, ino, mode)); +} + +/* + * Do the actual free operation. + * The specified inode is placed back in the free map. 
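+ * The cylinder group may be addressed either through the filesystem
+ * device vnode or through a snapshot vnode, so the caller passes in
+ * the vnode through which it should be read.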
+ */ +int +ffs_freefile(fs, devvp, ino, mode) + struct fs *fs; + struct vnode *devvp; + ino_t ino; + int mode; +{ + struct cg *cgp; + struct buf *bp; + ufs2_daddr_t cgbno; + int error, cg; + u_int8_t *inosused; + dev_t dev; + + cg = ino_to_cg(fs, ino); + if (devvp->v_type != VCHR) { + /* devvp is a snapshot */ + dev = VTOI(devvp)->i_devvp->v_rdev; + cgbno = fragstoblks(fs, cgtod(fs, cg)); + } else { + /* devvp is a normal disk device */ + dev = devvp->v_rdev; + cgbno = fsbtodb(fs, cgtod(fs, cg)); + } + if ((u_int)ino >= fs->fs_ipg * fs->fs_ncg) + panic("ffs_freefile: range: dev = %s, ino = %lu, fs = %s", + devtoname(dev), (u_long)ino, fs->fs_fsmnt); + if ((error = bread(devvp, cgbno, (int)fs->fs_cgsize, NOCRED, &bp))) { + brelse(bp); + return (error); + } + cgp = (struct cg *)bp->b_data; + if (!cg_chkmagic(cgp)) { + brelse(bp); + return (0); + } + bp->b_xflags |= BX_BKGRDWRITE; + cgp->cg_old_time = cgp->cg_time = time_second; + inosused = cg_inosused(cgp); + ino %= fs->fs_ipg; + if (isclr(inosused, ino)) { + printf("dev = %s, ino = %lu, fs = %s\n", devtoname(dev), + (u_long)ino + cg * fs->fs_ipg, fs->fs_fsmnt); + if (fs->fs_ronly == 0) + panic("ffs_freefile: freeing free inode"); + } + clrbit(inosused, ino); + if (ino < cgp->cg_irotor) + cgp->cg_irotor = ino; + cgp->cg_cs.cs_nifree++; + fs->fs_cstotal.cs_nifree++; + fs->fs_cs(fs, cg).cs_nifree++; + if ((mode & IFMT) == IFDIR) { + cgp->cg_cs.cs_ndir--; + fs->fs_cstotal.cs_ndir--; + fs->fs_cs(fs, cg).cs_ndir--; + } + fs->fs_fmod = 1; + if (fs->fs_active != 0) + atomic_clear_int(&ACTIVECGNUM(fs, cg), ACTIVECGOFF(cg)); + bdwrite(bp); + return (0); +} + +/* + * Check to see if a file is free. + */ +int +ffs_checkfreefile(fs, devvp, ino) + struct fs *fs; + struct vnode *devvp; + ino_t ino; +{ + struct cg *cgp; + struct buf *bp; + ufs2_daddr_t cgbno; + int ret, cg; + u_int8_t *inosused; + + cg = ino_to_cg(fs, ino); + if (devvp->v_type != VCHR) { + /* devvp is a snapshot */ + cgbno = fragstoblks(fs, cgtod(fs, cg)); + } else { + /* devvp is a normal disk device */ + cgbno = fsbtodb(fs, cgtod(fs, cg)); + } + if ((u_int)ino >= fs->fs_ipg * fs->fs_ncg) + return (1); + if (bread(devvp, cgbno, (int)fs->fs_cgsize, NOCRED, &bp)) { + brelse(bp); + return (1); + } + cgp = (struct cg *)bp->b_data; + if (!cg_chkmagic(cgp)) { + brelse(bp); + return (1); + } + inosused = cg_inosused(cgp); + ino %= fs->fs_ipg; + ret = isclr(inosused, ino); + brelse(bp); + return (ret); +} + +/* + * Find a block of the specified size in the specified cylinder group. + * + * It is a panic if a request is made to find a block if none are + * available. 
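+ * The free map is scanned using a precomputed fragment table, starting
+ * at the preferred block (or at the rotor if no preference is given)
+ * and wrapping around to the start of the map if necessary.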
+ */ +static ufs1_daddr_t +ffs_mapsearch(fs, cgp, bpref, allocsiz) + struct fs *fs; + struct cg *cgp; + ufs2_daddr_t bpref; + int allocsiz; +{ + ufs1_daddr_t bno; + int start, len, loc, i; + int blk, field, subfield, pos; + u_int8_t *blksfree; + + /* + * find the fragment by searching through the free block + * map for an appropriate bit pattern + */ + if (bpref) + start = dtogd(fs, bpref) / NBBY; + else + start = cgp->cg_frotor / NBBY; + blksfree = cg_blksfree(cgp); + len = howmany(fs->fs_fpg, NBBY) - start; + loc = scanc((u_int)len, (u_char *)&blksfree[start], + (u_char *)fragtbl[fs->fs_frag], + (u_char)(1 << (allocsiz - 1 + (fs->fs_frag % NBBY)))); + if (loc == 0) { + len = start + 1; + start = 0; + loc = scanc((u_int)len, (u_char *)&blksfree[0], + (u_char *)fragtbl[fs->fs_frag], + (u_char)(1 << (allocsiz - 1 + (fs->fs_frag % NBBY)))); + if (loc == 0) { + printf("start = %d, len = %d, fs = %s\n", + start, len, fs->fs_fsmnt); + panic("ffs_alloccg: map corrupted"); + /* NOTREACHED */ + } + } + bno = (start + len - loc) * NBBY; + cgp->cg_frotor = bno; + /* + * found the byte in the map + * sift through the bits to find the selected frag + */ + for (i = bno + NBBY; bno < i; bno += fs->fs_frag) { + blk = blkmap(fs, blksfree, bno); + blk <<= 1; + field = around[allocsiz]; + subfield = inside[allocsiz]; + for (pos = 0; pos <= fs->fs_frag - allocsiz; pos++) { + if ((blk & field) == subfield) + return (bno + pos); + field <<= 1; + subfield <<= 1; + } + } + printf("bno = %lu, fs = %s\n", (u_long)bno, fs->fs_fsmnt); + panic("ffs_alloccg: block not in map"); + return (-1); +} + +/* + * Update the cluster map because of an allocation or free. + * + * Cnt == 1 means free; cnt == -1 means allocating. + */ +void +ffs_clusteracct(fs, cgp, blkno, cnt) + struct fs *fs; + struct cg *cgp; + ufs1_daddr_t blkno; + int cnt; +{ + int32_t *sump; + int32_t *lp; + u_char *freemapp, *mapp; + int i, start, end, forw, back, map, bit; + + if (fs->fs_contigsumsize <= 0) + return; + freemapp = cg_clustersfree(cgp); + sump = cg_clustersum(cgp); + /* + * Allocate or clear the actual block. + */ + if (cnt > 0) + setbit(freemapp, blkno); + else + clrbit(freemapp, blkno); + /* + * Find the size of the cluster going forward. + */ + start = blkno + 1; + end = start + fs->fs_contigsumsize; + if (end >= cgp->cg_nclusterblks) + end = cgp->cg_nclusterblks; + mapp = &freemapp[start / NBBY]; + map = *mapp++; + bit = 1 << (start % NBBY); + for (i = start; i < end; i++) { + if ((map & bit) == 0) + break; + if ((i & (NBBY - 1)) != (NBBY - 1)) { + bit <<= 1; + } else { + map = *mapp++; + bit = 1; + } + } + forw = i - start; + /* + * Find the size of the cluster going backward. + */ + start = blkno - 1; + end = start - fs->fs_contigsumsize; + if (end < 0) + end = -1; + mapp = &freemapp[start / NBBY]; + map = *mapp--; + bit = 1 << (start % NBBY); + for (i = start; i > end; i--) { + if ((map & bit) == 0) + break; + if ((i & (NBBY - 1)) != 0) { + bit >>= 1; + } else { + map = *mapp--; + bit = 1 << (NBBY - 1); + } + } + back = start - i; + /* + * Account for old cluster and the possibly new forward and + * back clusters. + */ + i = back + forw + 1; + if (i > fs->fs_contigsumsize) + i = fs->fs_contigsumsize; + sump[i] += cnt; + if (back > 0) + sump[back] -= cnt; + if (forw > 0) + sump[forw] -= cnt; + /* + * Update cluster summary information. 
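+	 * Record the largest cluster size still available in this cylinder
+	 * group so that the cluster allocator can quickly reject groups
+	 * that cannot satisfy a request.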
+ */ + lp = &sump[fs->fs_contigsumsize]; + for (i = fs->fs_contigsumsize; i > 0; i--) + if (*lp-- > 0) + break; + fs->fs_maxcluster[cgp->cg_cgx] = i; +} + +/* + * Fserr prints the name of a filesystem with an error diagnostic. + * + * The form of the error message is: + * fs: error message + */ +static void +ffs_fserr(fs, inum, cp) + struct fs *fs; + ino_t inum; + char *cp; +{ + struct thread *td = curthread; /* XXX */ + struct proc *p = td->td_proc; + + log(LOG_ERR, "pid %d (%s), uid %d inumber %d on %s: %s\n", + p->p_pid, p->p_comm, td->td_ucred->cr_uid, inum, fs->fs_fsmnt, cp); +} + +/* + * This function provides the capability for the fsck program to + * update an active filesystem. Six operations are provided: + * + * adjrefcnt(inode, amt) - adjusts the reference count on the + * specified inode by the specified amount. Under normal + * operation the count should always go down. Decrementing + * the count to zero will cause the inode to be freed. + * adjblkcnt(inode, amt) - adjust the number of blocks used to + * by the specifed amount. + * freedirs(inode, count) - directory inodes [inode..inode + count - 1] + * are marked as free. Inodes should never have to be marked + * as in use. + * freefiles(inode, count) - file inodes [inode..inode + count - 1] + * are marked as free. Inodes should never have to be marked + * as in use. + * freeblks(blockno, size) - blocks [blockno..blockno + size - 1] + * are marked as free. Blocks should never have to be marked + * as in use. + * setflags(flags, set/clear) - the fs_flags field has the specified + * flags set (second parameter +1) or cleared (second parameter -1). + */ + +static int sysctl_ffs_fsck(SYSCTL_HANDLER_ARGS); + +SYSCTL_PROC(_vfs_ffs, FFS_ADJ_REFCNT, adjrefcnt, CTLFLAG_WR|CTLTYPE_STRUCT, + 0, 0, sysctl_ffs_fsck, "S,fsck", "Adjust Inode Reference Count"); + +SYSCTL_NODE(_vfs_ffs, FFS_ADJ_BLKCNT, adjblkcnt, CTLFLAG_WR, + sysctl_ffs_fsck, "Adjust Inode Used Blocks Count"); + +SYSCTL_NODE(_vfs_ffs, FFS_DIR_FREE, freedirs, CTLFLAG_WR, + sysctl_ffs_fsck, "Free Range of Directory Inodes"); + +SYSCTL_NODE(_vfs_ffs, FFS_FILE_FREE, freefiles, CTLFLAG_WR, + sysctl_ffs_fsck, "Free Range of File Inodes"); + +SYSCTL_NODE(_vfs_ffs, FFS_BLK_FREE, freeblks, CTLFLAG_WR, + sysctl_ffs_fsck, "Free Range of Blocks"); + +SYSCTL_NODE(_vfs_ffs, FFS_SET_FLAGS, setflags, CTLFLAG_WR, + sysctl_ffs_fsck, "Change Filesystem Flags"); + +#ifdef DEBUG +static int fsckcmds = 0; +SYSCTL_INT(_debug, OID_AUTO, fsckcmds, CTLFLAG_RW, &fsckcmds, 0, ""); +#endif /* DEBUG */ + +static int +sysctl_ffs_fsck(SYSCTL_HANDLER_ARGS) +{ + struct fsck_cmd cmd; + struct ufsmount *ump; + struct vnode *vp; + struct inode *ip; + struct mount *mp; + struct fs *fs; + ufs2_daddr_t blkno; + long blkcnt, blksize; + struct file *fp; + int filetype, error; + + if (req->newlen > sizeof cmd) + return (EBADRPC); + if ((error = SYSCTL_IN(req, &cmd, sizeof cmd)) != 0) + return (error); + if (cmd.version != FFS_CMD_VERSION) + return (ERPCMISMATCH); + if ((error = getvnode(curproc->p_fd, cmd.handle, &fp)) != 0) + return (error); + vn_start_write(fp->f_data, &mp, V_WAIT); + if (mp == 0 || strncmp(mp->mnt_stat.f_fstypename, "ufs", MFSNAMELEN)) { + vn_finished_write(mp); + fdrop(fp, curthread); + return (EINVAL); + } + if (mp->mnt_flag & MNT_RDONLY) { + vn_finished_write(mp); + fdrop(fp, curthread); + return (EROFS); + } + ump = VFSTOUFS(mp); + fs = ump->um_fs; + filetype = IFREG; + + switch (oidp->oid_number) { + + case FFS_SET_FLAGS: +#ifdef DEBUG + if (fsckcmds) + printf("%s: %s flags\n", 
mp->mnt_stat.f_mntonname, + cmd.size > 0 ? "set" : "clear"); +#endif /* DEBUG */ + if (cmd.size > 0) + fs->fs_flags |= (long)cmd.value; + else + fs->fs_flags &= ~(long)cmd.value; + break; + + case FFS_ADJ_REFCNT: +#ifdef DEBUG + if (fsckcmds) { + printf("%s: adjust inode %jd count by %jd\n", + mp->mnt_stat.f_mntonname, (intmax_t)cmd.value, + (intmax_t)cmd.size); + } +#endif /* DEBUG */ + if ((error = VFS_VGET(mp, (ino_t)cmd.value, LK_EXCLUSIVE, &vp))) + break; + ip = VTOI(vp); + ip->i_nlink += cmd.size; + DIP(ip, i_nlink) = ip->i_nlink; + ip->i_effnlink += cmd.size; + ip->i_flag |= IN_CHANGE; + if (DOINGSOFTDEP(vp)) + softdep_change_linkcnt(ip); + vput(vp); + break; + + case FFS_ADJ_BLKCNT: +#ifdef DEBUG + if (fsckcmds) { + printf("%s: adjust inode %jd block count by %jd\n", + mp->mnt_stat.f_mntonname, (intmax_t)cmd.value, + (intmax_t)cmd.size); + } +#endif /* DEBUG */ + if ((error = VFS_VGET(mp, (ino_t)cmd.value, LK_EXCLUSIVE, &vp))) + break; + ip = VTOI(vp); + DIP(ip, i_blocks) += cmd.size; + ip->i_flag |= IN_CHANGE; + vput(vp); + break; + + case FFS_DIR_FREE: + filetype = IFDIR; + /* fall through */ + + case FFS_FILE_FREE: +#ifdef DEBUG + if (fsckcmds) { + if (cmd.size == 1) + printf("%s: free %s inode %d\n", + mp->mnt_stat.f_mntonname, + filetype == IFDIR ? "directory" : "file", + (ino_t)cmd.value); + else + printf("%s: free %s inodes %d-%d\n", + mp->mnt_stat.f_mntonname, + filetype == IFDIR ? "directory" : "file", + (ino_t)cmd.value, + (ino_t)(cmd.value + cmd.size - 1)); + } +#endif /* DEBUG */ + while (cmd.size > 0) { + if ((error = ffs_freefile(fs, ump->um_devvp, cmd.value, + filetype))) + break; + cmd.size -= 1; + cmd.value += 1; + } + break; + + case FFS_BLK_FREE: +#ifdef DEBUG + if (fsckcmds) { + if (cmd.size == 1) + printf("%s: free block %jd\n", + mp->mnt_stat.f_mntonname, + (intmax_t)cmd.value); + else + printf("%s: free blocks %jd-%jd\n", + mp->mnt_stat.f_mntonname, + (intmax_t)cmd.value, + (intmax_t)cmd.value + cmd.size - 1); + } +#endif /* DEBUG */ + blkno = cmd.value; + blkcnt = cmd.size; + blksize = fs->fs_frag - (blkno % fs->fs_frag); + while (blkcnt > 0) { + if (blksize > blkcnt) + blksize = blkcnt; + ffs_blkfree(fs, ump->um_devvp, blkno, + blksize * fs->fs_fsize, ROOTINO); + blkno += blksize; + blkcnt -= blksize; + blksize = fs->fs_frag; + } + break; + + default: +#ifdef DEBUG + if (fsckcmds) { + printf("Invalid request %d from fsck\n", + oidp->oid_number); + } +#endif /* DEBUG */ + error = EINVAL; + break; + + } + fdrop(fp, curthread); + vn_finished_write(mp); + return (error); +} diff --git a/src/sys/ufs/ffs/ffs_balloc.c b/src/sys/ufs/ffs/ffs_balloc.c new file mode 100644 index 0000000..6f6a328 --- /dev/null +++ b/src/sys/ufs/ffs/ffs_balloc.c @@ -0,0 +1,887 @@ +/* + * Copyright (c) 2002 Networks Associates Technology, Inc. + * All rights reserved. + * + * This software was developed for the FreeBSD Project by Marshall + * Kirk McKusick and Network Associates Laboratories, the Security + * Research Division of Network Associates, Inc. under DARPA/SPAWAR + * contract N66001-01-C-8035 ("CBOSS"), as part of the DARPA CHATS + * research program + * + * Copyright (c) 1982, 1986, 1989, 1993 + * The Regents of the University of California. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. 
Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. All advertising materials mentioning features or use of this software + * must display the following acknowledgement: + * This product includes software developed by the University of + * California, Berkeley and its contributors. + * 4. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * @(#)ffs_balloc.c 8.8 (Berkeley) 6/16/95 + */ + +#include +__FBSDID("$FreeBSD: src/sys/ufs/ffs/ffs_balloc.c,v 1.43 2003/08/15 20:03:19 phk Exp $"); + +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include + +#include +#include + +/* + * Balloc defines the structure of filesystem storage + * by allocating the physical blocks on a device given + * the inode and the logical block number in a file. + * This is the allocation strategy for UFS1. Below is + * the allocation strategy for UFS2. + */ +int +ffs_balloc_ufs1(struct vnode *vp, off_t startoffset, int size, + struct ucred *cred, int flags, struct buf **bpp) +{ + struct inode *ip; + struct ufs1_dinode *dp; + ufs_lbn_t lbn, lastlbn; + struct fs *fs; + ufs1_daddr_t nb; + struct buf *bp, *nbp; + struct indir indirs[NIADDR + 2]; + int deallocated, osize, nsize, num, i, error; + ufs2_daddr_t newb; + ufs1_daddr_t *bap, pref; + ufs1_daddr_t *allocib, *blkp, *allocblk, allociblk[NIADDR + 1]; + int unwindidx = -1; + struct thread *td = curthread; /* XXX */ + + ip = VTOI(vp); + dp = ip->i_din1; + fs = ip->i_fs; + lbn = lblkno(fs, startoffset); + size = blkoff(fs, startoffset) + size; + if (size > fs->fs_bsize) + panic("ffs_balloc_ufs1: blk too big"); + *bpp = NULL; + if (flags & IO_EXT) + return (EOPNOTSUPP); + if (lbn < 0) + return (EFBIG); + + /* + * If the next write will extend the file into a new block, + * and the file is currently composed of a fragment + * this fragment has to be extended to be a full block. 
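+	 * The fragment is grown in place with ffs_realloccg and the inode
+	 * size is rounded up to cover the full block before the requested
+	 * block is allocated.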
+ */ + lastlbn = lblkno(fs, ip->i_size); + if (lastlbn < NDADDR && lastlbn < lbn) { + nb = lastlbn; + osize = blksize(fs, ip, nb); + if (osize < fs->fs_bsize && osize > 0) { + error = ffs_realloccg(ip, nb, dp->di_db[nb], + ffs_blkpref_ufs1(ip, lastlbn, (int)nb, + &dp->di_db[0]), osize, (int)fs->fs_bsize, cred, &bp); + if (error) + return (error); + if (DOINGSOFTDEP(vp)) + softdep_setup_allocdirect(ip, nb, + dbtofsb(fs, bp->b_blkno), dp->di_db[nb], + fs->fs_bsize, osize, bp); + ip->i_size = smalllblktosize(fs, nb + 1); + dp->di_size = ip->i_size; + dp->di_db[nb] = dbtofsb(fs, bp->b_blkno); + ip->i_flag |= IN_CHANGE | IN_UPDATE; + if (flags & IO_SYNC) + bwrite(bp); + else + bawrite(bp); + } + } + /* + * The first NDADDR blocks are direct blocks + */ + if (lbn < NDADDR) { + if (flags & BA_METAONLY) + panic("ffs_balloc_ufs1: BA_METAONLY for direct block"); + nb = dp->di_db[lbn]; + if (nb != 0 && ip->i_size >= smalllblktosize(fs, lbn + 1)) { + error = bread(vp, lbn, fs->fs_bsize, NOCRED, &bp); + if (error) { + brelse(bp); + return (error); + } + bp->b_blkno = fsbtodb(fs, nb); + *bpp = bp; + return (0); + } + if (nb != 0) { + /* + * Consider need to reallocate a fragment. + */ + osize = fragroundup(fs, blkoff(fs, ip->i_size)); + nsize = fragroundup(fs, size); + if (nsize <= osize) { + error = bread(vp, lbn, osize, NOCRED, &bp); + if (error) { + brelse(bp); + return (error); + } + bp->b_blkno = fsbtodb(fs, nb); + } else { + error = ffs_realloccg(ip, lbn, dp->di_db[lbn], + ffs_blkpref_ufs1(ip, lbn, (int)lbn, + &dp->di_db[0]), osize, nsize, cred, &bp); + if (error) + return (error); + if (DOINGSOFTDEP(vp)) + softdep_setup_allocdirect(ip, lbn, + dbtofsb(fs, bp->b_blkno), nb, + nsize, osize, bp); + } + } else { + if (ip->i_size < smalllblktosize(fs, lbn + 1)) + nsize = fragroundup(fs, size); + else + nsize = fs->fs_bsize; + error = ffs_alloc(ip, lbn, + ffs_blkpref_ufs1(ip, lbn, (int)lbn, &dp->di_db[0]), + nsize, cred, &newb); + if (error) + return (error); + bp = getblk(vp, lbn, nsize, 0, 0, 0); + bp->b_blkno = fsbtodb(fs, newb); + if (flags & BA_CLRBUF) + vfs_bio_clrbuf(bp); + if (DOINGSOFTDEP(vp)) + softdep_setup_allocdirect(ip, lbn, newb, 0, + nsize, 0, bp); + } + dp->di_db[lbn] = dbtofsb(fs, bp->b_blkno); + ip->i_flag |= IN_CHANGE | IN_UPDATE; + *bpp = bp; + return (0); + } + /* + * Determine the number of levels of indirection. + */ + pref = 0; + if ((error = ufs_getlbns(vp, lbn, indirs, &num)) != 0) + return(error); +#ifdef DIAGNOSTIC + if (num < 1) + panic ("ffs_balloc_ufs1: ufs_getlbns returned indirect block"); +#endif + /* + * Fetch the first indirect block allocating if necessary. + */ + --num; + nb = dp->di_ib[indirs[0].in_off]; + allocib = NULL; + allocblk = allociblk; + if (nb == 0) { + pref = ffs_blkpref_ufs1(ip, lbn, 0, (ufs1_daddr_t *)0); + if ((error = ffs_alloc(ip, lbn, pref, (int)fs->fs_bsize, + cred, &newb)) != 0) + return (error); + nb = newb; + *allocblk++ = nb; + bp = getblk(vp, indirs[1].in_lbn, fs->fs_bsize, 0, 0, 0); + bp->b_blkno = fsbtodb(fs, nb); + vfs_bio_clrbuf(bp); + if (DOINGSOFTDEP(vp)) { + softdep_setup_allocdirect(ip, NDADDR + indirs[0].in_off, + newb, 0, fs->fs_bsize, 0, bp); + bdwrite(bp); + } else { + /* + * Write synchronously so that indirect blocks + * never point at garbage. + */ + if (DOINGASYNC(vp)) + bdwrite(bp); + else if ((error = bwrite(bp)) != 0) + goto fail; + } + allocib = &dp->di_ib[indirs[0].in_off]; + *allocib = nb; + ip->i_flag |= IN_CHANGE | IN_UPDATE; + } + /* + * Fetch through the indirect blocks, allocating as necessary. 
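+	 * Each newly allocated indirect block is written out (or handed to
+	 * the soft updates code) before the pointer to it is installed, and
+	 * is remembered in allociblk[] so a failed allocation can be unwound.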
+ */ + for (i = 1;;) { + error = bread(vp, + indirs[i].in_lbn, (int)fs->fs_bsize, NOCRED, &bp); + if (error) { + brelse(bp); + goto fail; + } + bap = (ufs1_daddr_t *)bp->b_data; + nb = bap[indirs[i].in_off]; + if (i == num) + break; + i += 1; + if (nb != 0) { + bqrelse(bp); + continue; + } + if (pref == 0) + pref = ffs_blkpref_ufs1(ip, lbn, 0, (ufs1_daddr_t *)0); + if ((error = + ffs_alloc(ip, lbn, pref, (int)fs->fs_bsize, cred, &newb)) != 0) { + brelse(bp); + goto fail; + } + nb = newb; + *allocblk++ = nb; + nbp = getblk(vp, indirs[i].in_lbn, fs->fs_bsize, 0, 0, 0); + nbp->b_blkno = fsbtodb(fs, nb); + vfs_bio_clrbuf(nbp); + if (DOINGSOFTDEP(vp)) { + softdep_setup_allocindir_meta(nbp, ip, bp, + indirs[i - 1].in_off, nb); + bdwrite(nbp); + } else { + /* + * Write synchronously so that indirect blocks + * never point at garbage. + */ + if ((error = bwrite(nbp)) != 0) { + brelse(bp); + goto fail; + } + } + bap[indirs[i - 1].in_off] = nb; + if (allocib == NULL && unwindidx < 0) + unwindidx = i - 1; + /* + * If required, write synchronously, otherwise use + * delayed write. + */ + if (flags & IO_SYNC) { + bwrite(bp); + } else { + if (bp->b_bufsize == fs->fs_bsize) + bp->b_flags |= B_CLUSTEROK; + bdwrite(bp); + } + } + /* + * If asked only for the indirect block, then return it. + */ + if (flags & BA_METAONLY) { + *bpp = bp; + return (0); + } + /* + * Get the data block, allocating if necessary. + */ + if (nb == 0) { + pref = ffs_blkpref_ufs1(ip, lbn, indirs[i].in_off, &bap[0]); + error = ffs_alloc(ip, + lbn, pref, (int)fs->fs_bsize, cred, &newb); + if (error) { + brelse(bp); + goto fail; + } + nb = newb; + *allocblk++ = nb; + nbp = getblk(vp, lbn, fs->fs_bsize, 0, 0, 0); + nbp->b_blkno = fsbtodb(fs, nb); + if (flags & BA_CLRBUF) + vfs_bio_clrbuf(nbp); + if (DOINGSOFTDEP(vp)) + softdep_setup_allocindir_page(ip, lbn, bp, + indirs[i].in_off, nb, 0, nbp); + bap[indirs[i].in_off] = nb; + /* + * If required, write synchronously, otherwise use + * delayed write. + */ + if (flags & IO_SYNC) { + bwrite(bp); + } else { + if (bp->b_bufsize == fs->fs_bsize) + bp->b_flags |= B_CLUSTEROK; + bdwrite(bp); + } + *bpp = nbp; + return (0); + } + brelse(bp); + if (flags & BA_CLRBUF) { + int seqcount = (flags & BA_SEQMASK) >> BA_SEQSHIFT; + if (seqcount && (vp->v_mount->mnt_flag & MNT_NOCLUSTERR) == 0) { + error = cluster_read(vp, ip->i_size, lbn, + (int)fs->fs_bsize, NOCRED, + MAXBSIZE, seqcount, &nbp); + } else { + error = bread(vp, lbn, (int)fs->fs_bsize, NOCRED, &nbp); + } + if (error) { + brelse(nbp); + goto fail; + } + } else { + nbp = getblk(vp, lbn, fs->fs_bsize, 0, 0, 0); + nbp->b_blkno = fsbtodb(fs, nb); + } + *bpp = nbp; + return (0); +fail: + /* + * If we have failed to allocate any blocks, simply return the error. + * This is the usual case and avoids the need to fsync the file. + */ + if (allocblk == allociblk && allocib == NULL && unwindidx == -1) + return (error); + /* + * If we have failed part way through block allocation, we + * have to deallocate any indirect blocks that we have allocated. + * We have to fsync the file before we start to get rid of all + * of its dependencies so that we do not leave them dangling. + * We have to sync it at the end so that the soft updates code + * does not find any untracked changes. Although this is really + * slow, running out of disk space is not expected to be a common + * occurence. The error return from fsync is ignored as we already + * have an error to return to the user. 
+ */ + (void) VOP_FSYNC(vp, cred, MNT_WAIT, td); + for (deallocated = 0, blkp = allociblk; blkp < allocblk; blkp++) { + ffs_blkfree(fs, ip->i_devvp, *blkp, fs->fs_bsize, ip->i_number); + deallocated += fs->fs_bsize; + } + if (allocib != NULL) { + *allocib = 0; + } else if (unwindidx >= 0) { + int r; + + r = bread(vp, indirs[unwindidx].in_lbn, + (int)fs->fs_bsize, NOCRED, &bp); + if (r) { + panic("Could not unwind indirect block, error %d", r); + brelse(bp); + } else { + bap = (ufs1_daddr_t *)bp->b_data; + bap[indirs[unwindidx].in_off] = 0; + if (flags & IO_SYNC) { + bwrite(bp); + } else { + if (bp->b_bufsize == fs->fs_bsize) + bp->b_flags |= B_CLUSTEROK; + bdwrite(bp); + } + } + } + if (deallocated) { +#ifdef QUOTA + /* + * Restore user's disk quota because allocation failed. + */ + (void) chkdq(ip, -btodb(deallocated), cred, FORCE); +#endif + dp->di_blocks -= btodb(deallocated); + ip->i_flag |= IN_CHANGE | IN_UPDATE; + } + (void) VOP_FSYNC(vp, cred, MNT_WAIT, td); + return (error); +} + +/* + * Balloc defines the structure of file system storage + * by allocating the physical blocks on a device given + * the inode and the logical block number in a file. + * This is the allocation strategy for UFS2. Above is + * the allocation strategy for UFS1. + */ +int +ffs_balloc_ufs2(struct vnode *vp, off_t startoffset, int size, + struct ucred *cred, int flags, struct buf **bpp) +{ + struct inode *ip; + struct ufs2_dinode *dp; + ufs_lbn_t lbn, lastlbn; + struct fs *fs; + struct buf *bp, *nbp; + struct indir indirs[NIADDR + 2]; + ufs2_daddr_t nb, newb, *bap, pref; + ufs2_daddr_t *allocib, *blkp, *allocblk, allociblk[NIADDR + 1]; + int deallocated, osize, nsize, num, i, error; + int unwindidx = -1; + struct thread *td = curthread; /* XXX */ + + ip = VTOI(vp); + dp = ip->i_din2; + fs = ip->i_fs; + lbn = lblkno(fs, startoffset); + size = blkoff(fs, startoffset) + size; + if (size > fs->fs_bsize) + panic("ffs_balloc_ufs2: blk too big"); + *bpp = NULL; + if (lbn < 0) + return (EFBIG); + + /* + * Check for allocating external data. + */ + if (flags & IO_EXT) { + if (lbn >= NXADDR) + return (EFBIG); + /* + * If the next write will extend the data into a new block, + * and the data is currently composed of a fragment + * this fragment has to be extended to be a full block. + */ + lastlbn = lblkno(fs, dp->di_extsize); + if (lastlbn < lbn) { + nb = lastlbn; + osize = sblksize(fs, dp->di_extsize, nb); + if (osize < fs->fs_bsize && osize > 0) { + error = ffs_realloccg(ip, -1 - nb, + dp->di_extb[nb], + ffs_blkpref_ufs2(ip, lastlbn, (int)nb, + &dp->di_extb[0]), osize, + (int)fs->fs_bsize, cred, &bp); + if (error) + return (error); + if (DOINGSOFTDEP(vp)) + softdep_setup_allocext(ip, nb, + dbtofsb(fs, bp->b_blkno), + dp->di_extb[nb], + fs->fs_bsize, osize, bp); + dp->di_extsize = smalllblktosize(fs, nb + 1); + dp->di_extb[nb] = dbtofsb(fs, bp->b_blkno); + bp->b_xflags |= BX_ALTDATA; + ip->i_flag |= IN_CHANGE | IN_UPDATE; + if (flags & IO_SYNC) + bwrite(bp); + else + bawrite(bp); + } + } + /* + * All blocks are direct blocks + */ + if (flags & BA_METAONLY) + panic("ffs_balloc_ufs2: BA_METAONLY for ext block"); + nb = dp->di_extb[lbn]; + if (nb != 0 && dp->di_extsize >= smalllblktosize(fs, lbn + 1)) { + error = bread(vp, -1 - lbn, fs->fs_bsize, NOCRED, &bp); + if (error) { + brelse(bp); + return (error); + } + bp->b_blkno = fsbtodb(fs, nb); + bp->b_xflags |= BX_ALTDATA; + *bpp = bp; + return (0); + } + if (nb != 0) { + /* + * Consider need to reallocate a fragment. 
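+			 * Note that buffers for the external attribute area
+			 * are addressed at negative logical block numbers
+			 * (-1 - lbn) and are marked BX_ALTDATA.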
+ */ + osize = fragroundup(fs, blkoff(fs, dp->di_extsize)); + nsize = fragroundup(fs, size); + if (nsize <= osize) { + error = bread(vp, -1 - lbn, osize, NOCRED, &bp); + if (error) { + brelse(bp); + return (error); + } + bp->b_blkno = fsbtodb(fs, nb); + bp->b_xflags |= BX_ALTDATA; + } else { + error = ffs_realloccg(ip, -1 - lbn, + dp->di_extb[lbn], + ffs_blkpref_ufs2(ip, lbn, (int)lbn, + &dp->di_extb[0]), osize, nsize, cred, &bp); + if (error) + return (error); + bp->b_xflags |= BX_ALTDATA; + if (DOINGSOFTDEP(vp)) + softdep_setup_allocext(ip, lbn, + dbtofsb(fs, bp->b_blkno), nb, + nsize, osize, bp); + } + } else { + if (dp->di_extsize < smalllblktosize(fs, lbn + 1)) + nsize = fragroundup(fs, size); + else + nsize = fs->fs_bsize; + error = ffs_alloc(ip, lbn, + ffs_blkpref_ufs2(ip, lbn, (int)lbn, &dp->di_extb[0]), + nsize, cred, &newb); + if (error) + return (error); + bp = getblk(vp, -1 - lbn, nsize, 0, 0, 0); + bp->b_blkno = fsbtodb(fs, newb); + bp->b_xflags |= BX_ALTDATA; + if (flags & BA_CLRBUF) + vfs_bio_clrbuf(bp); + if (DOINGSOFTDEP(vp)) + softdep_setup_allocext(ip, lbn, newb, 0, + nsize, 0, bp); + } + dp->di_extb[lbn] = dbtofsb(fs, bp->b_blkno); + ip->i_flag |= IN_CHANGE | IN_UPDATE; + *bpp = bp; + return (0); + } + /* + * If the next write will extend the file into a new block, + * and the file is currently composed of a fragment + * this fragment has to be extended to be a full block. + */ + lastlbn = lblkno(fs, ip->i_size); + if (lastlbn < NDADDR && lastlbn < lbn) { + nb = lastlbn; + osize = blksize(fs, ip, nb); + if (osize < fs->fs_bsize && osize > 0) { + error = ffs_realloccg(ip, nb, dp->di_db[nb], + ffs_blkpref_ufs2(ip, lastlbn, (int)nb, + &dp->di_db[0]), osize, (int)fs->fs_bsize, + cred, &bp); + if (error) + return (error); + if (DOINGSOFTDEP(vp)) + softdep_setup_allocdirect(ip, nb, + dbtofsb(fs, bp->b_blkno), + dp->di_db[nb], + fs->fs_bsize, osize, bp); + ip->i_size = smalllblktosize(fs, nb + 1); + dp->di_size = ip->i_size; + dp->di_db[nb] = dbtofsb(fs, bp->b_blkno); + ip->i_flag |= IN_CHANGE | IN_UPDATE; + if (flags & IO_SYNC) + bwrite(bp); + else + bawrite(bp); + } + } + /* + * The first NDADDR blocks are direct blocks + */ + if (lbn < NDADDR) { + if (flags & BA_METAONLY) + panic("ffs_balloc_ufs2: BA_METAONLY for direct block"); + nb = dp->di_db[lbn]; + if (nb != 0 && ip->i_size >= smalllblktosize(fs, lbn + 1)) { + error = bread(vp, lbn, fs->fs_bsize, NOCRED, &bp); + if (error) { + brelse(bp); + return (error); + } + bp->b_blkno = fsbtodb(fs, nb); + *bpp = bp; + return (0); + } + if (nb != 0) { + /* + * Consider need to reallocate a fragment. 
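+			 * If the existing fragment is already big enough it
+			 * is simply read in; otherwise it is grown in place
+			 * with ffs_realloccg.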
+ */ + osize = fragroundup(fs, blkoff(fs, ip->i_size)); + nsize = fragroundup(fs, size); + if (nsize <= osize) { + error = bread(vp, lbn, osize, NOCRED, &bp); + if (error) { + brelse(bp); + return (error); + } + bp->b_blkno = fsbtodb(fs, nb); + } else { + error = ffs_realloccg(ip, lbn, dp->di_db[lbn], + ffs_blkpref_ufs2(ip, lbn, (int)lbn, + &dp->di_db[0]), osize, nsize, cred, &bp); + if (error) + return (error); + if (DOINGSOFTDEP(vp)) + softdep_setup_allocdirect(ip, lbn, + dbtofsb(fs, bp->b_blkno), nb, + nsize, osize, bp); + } + } else { + if (ip->i_size < smalllblktosize(fs, lbn + 1)) + nsize = fragroundup(fs, size); + else + nsize = fs->fs_bsize; + error = ffs_alloc(ip, lbn, + ffs_blkpref_ufs2(ip, lbn, (int)lbn, + &dp->di_db[0]), nsize, cred, &newb); + if (error) + return (error); + bp = getblk(vp, lbn, nsize, 0, 0, 0); + bp->b_blkno = fsbtodb(fs, newb); + if (flags & BA_CLRBUF) + vfs_bio_clrbuf(bp); + if (DOINGSOFTDEP(vp)) + softdep_setup_allocdirect(ip, lbn, newb, 0, + nsize, 0, bp); + } + dp->di_db[lbn] = dbtofsb(fs, bp->b_blkno); + ip->i_flag |= IN_CHANGE | IN_UPDATE; + *bpp = bp; + return (0); + } + /* + * Determine the number of levels of indirection. + */ + pref = 0; + if ((error = ufs_getlbns(vp, lbn, indirs, &num)) != 0) + return(error); +#ifdef DIAGNOSTIC + if (num < 1) + panic ("ffs_balloc_ufs2: ufs_getlbns returned indirect block"); +#endif + /* + * Fetch the first indirect block allocating if necessary. + */ + --num; + nb = dp->di_ib[indirs[0].in_off]; + allocib = NULL; + allocblk = allociblk; + if (nb == 0) { + pref = ffs_blkpref_ufs2(ip, lbn, 0, (ufs2_daddr_t *)0); + if ((error = ffs_alloc(ip, lbn, pref, (int)fs->fs_bsize, + cred, &newb)) != 0) + return (error); + nb = newb; + *allocblk++ = nb; + bp = getblk(vp, indirs[1].in_lbn, fs->fs_bsize, 0, 0, 0); + bp->b_blkno = fsbtodb(fs, nb); + vfs_bio_clrbuf(bp); + if (DOINGSOFTDEP(vp)) { + softdep_setup_allocdirect(ip, NDADDR + indirs[0].in_off, + newb, 0, fs->fs_bsize, 0, bp); + bdwrite(bp); + } else { + /* + * Write synchronously so that indirect blocks + * never point at garbage. + */ + if (DOINGASYNC(vp)) + bdwrite(bp); + else if ((error = bwrite(bp)) != 0) + goto fail; + } + allocib = &dp->di_ib[indirs[0].in_off]; + *allocib = nb; + ip->i_flag |= IN_CHANGE | IN_UPDATE; + } + /* + * Fetch through the indirect blocks, allocating as necessary. + */ + for (i = 1;;) { + error = bread(vp, + indirs[i].in_lbn, (int)fs->fs_bsize, NOCRED, &bp); + if (error) { + brelse(bp); + goto fail; + } + bap = (ufs2_daddr_t *)bp->b_data; + nb = bap[indirs[i].in_off]; + if (i == num) + break; + i += 1; + if (nb != 0) { + bqrelse(bp); + continue; + } + if (pref == 0) + pref = ffs_blkpref_ufs2(ip, lbn, 0, (ufs2_daddr_t *)0); + if ((error = + ffs_alloc(ip, lbn, pref, (int)fs->fs_bsize, cred, &newb)) != 0) { + brelse(bp); + goto fail; + } + nb = newb; + *allocblk++ = nb; + nbp = getblk(vp, indirs[i].in_lbn, fs->fs_bsize, 0, 0, 0); + nbp->b_blkno = fsbtodb(fs, nb); + vfs_bio_clrbuf(nbp); + if (DOINGSOFTDEP(vp)) { + softdep_setup_allocindir_meta(nbp, ip, bp, + indirs[i - 1].in_off, nb); + bdwrite(nbp); + } else { + /* + * Write synchronously so that indirect blocks + * never point at garbage. + */ + if ((error = bwrite(nbp)) != 0) { + brelse(bp); + goto fail; + } + } + bap[indirs[i - 1].in_off] = nb; + if (allocib == NULL && unwindidx < 0) + unwindidx = i - 1; + /* + * If required, write synchronously, otherwise use + * delayed write. 
+ */ + if (flags & IO_SYNC) { + bwrite(bp); + } else { + if (bp->b_bufsize == fs->fs_bsize) + bp->b_flags |= B_CLUSTEROK; + bdwrite(bp); + } + } + /* + * If asked only for the indirect block, then return it. + */ + if (flags & BA_METAONLY) { + *bpp = bp; + return (0); + } + /* + * Get the data block, allocating if necessary. + */ + if (nb == 0) { + pref = ffs_blkpref_ufs2(ip, lbn, indirs[i].in_off, &bap[0]); + error = ffs_alloc(ip, + lbn, pref, (int)fs->fs_bsize, cred, &newb); + if (error) { + brelse(bp); + goto fail; + } + nb = newb; + *allocblk++ = nb; + nbp = getblk(vp, lbn, fs->fs_bsize, 0, 0, 0); + nbp->b_blkno = fsbtodb(fs, nb); + if (flags & BA_CLRBUF) + vfs_bio_clrbuf(nbp); + if (DOINGSOFTDEP(vp)) + softdep_setup_allocindir_page(ip, lbn, bp, + indirs[i].in_off, nb, 0, nbp); + bap[indirs[i].in_off] = nb; + /* + * If required, write synchronously, otherwise use + * delayed write. + */ + if (flags & IO_SYNC) { + bwrite(bp); + } else { + if (bp->b_bufsize == fs->fs_bsize) + bp->b_flags |= B_CLUSTEROK; + bdwrite(bp); + } + *bpp = nbp; + return (0); + } + brelse(bp); + /* + * If requested clear invalid portions of the buffer. If we + * have to do a read-before-write (typical if BA_CLRBUF is set), + * try to do some read-ahead in the sequential case to reduce + * the number of I/O transactions. + */ + if (flags & BA_CLRBUF) { + int seqcount = (flags & BA_SEQMASK) >> BA_SEQSHIFT; + if (seqcount && (vp->v_mount->mnt_flag & MNT_NOCLUSTERR) == 0) { + error = cluster_read(vp, ip->i_size, lbn, + (int)fs->fs_bsize, NOCRED, + MAXBSIZE, seqcount, &nbp); + } else { + error = bread(vp, lbn, (int)fs->fs_bsize, NOCRED, &nbp); + } + if (error) { + brelse(nbp); + goto fail; + } + } else { + nbp = getblk(vp, lbn, fs->fs_bsize, 0, 0, 0); + nbp->b_blkno = fsbtodb(fs, nb); + } + *bpp = nbp; + return (0); +fail: + /* + * If we have failed to allocate any blocks, simply return the error. + * This is the usual case and avoids the need to fsync the file. + */ + if (allocblk == allociblk && allocib == NULL && unwindidx == -1) + return (error); + /* + * If we have failed part way through block allocation, we + * have to deallocate any indirect blocks that we have allocated. + * We have to fsync the file before we start to get rid of all + * of its dependencies so that we do not leave them dangling. + * We have to sync it at the end so that the soft updates code + * does not find any untracked changes. Although this is really + * slow, running out of disk space is not expected to be a common + * occurence. The error return from fsync is ignored as we already + * have an error to return to the user. + */ + (void) VOP_FSYNC(vp, cred, MNT_WAIT, td); + for (deallocated = 0, blkp = allociblk; blkp < allocblk; blkp++) { + ffs_blkfree(fs, ip->i_devvp, *blkp, fs->fs_bsize, ip->i_number); + deallocated += fs->fs_bsize; + } + if (allocib != NULL) { + *allocib = 0; + } else if (unwindidx >= 0) { + int r; + + r = bread(vp, indirs[unwindidx].in_lbn, + (int)fs->fs_bsize, NOCRED, &bp); + if (r) { + panic("Could not unwind indirect block, error %d", r); + brelse(bp); + } else { + bap = (ufs2_daddr_t *)bp->b_data; + bap[indirs[unwindidx].in_off] = 0; + if (flags & IO_SYNC) { + bwrite(bp); + } else { + if (bp->b_bufsize == fs->fs_bsize) + bp->b_flags |= B_CLUSTEROK; + bdwrite(bp); + } + } + } + if (deallocated) { +#ifdef QUOTA + /* + * Restore user's disk quota because allocation failed. 
+ */ + (void) chkdq(ip, -btodb(deallocated), cred, FORCE); +#endif + dp->di_blocks -= btodb(deallocated); + ip->i_flag |= IN_CHANGE | IN_UPDATE; + } + (void) VOP_FSYNC(vp, cred, MNT_WAIT, td); + return (error); +} diff --git a/src/sys/ufs/ffs/ffs_extern.h b/src/sys/ufs/ffs/ffs_extern.h new file mode 100644 index 0000000..a06335d --- /dev/null +++ b/src/sys/ufs/ffs/ffs_extern.h @@ -0,0 +1,132 @@ +/*- + * Copyright (c) 1991, 1993, 1994 + * The Regents of the University of California. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. All advertising materials mentioning features or use of this software + * must display the following acknowledgement: + * This product includes software developed by the University of + * California, Berkeley and its contributors. + * 4. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. 
+ * + * @(#)ffs_extern.h 8.6 (Berkeley) 3/30/95 + * $FreeBSD: src/sys/ufs/ffs/ffs_extern.h,v 1.55 2003/02/22 00:29:50 mckusick Exp $ + */ + +#ifndef _UFS_FFS_EXTERN_H +#define _UFS_FFS_EXTERN_H + +struct buf; +struct cg; +struct fid; +struct fs; +struct inode; +struct malloc_type; +struct mount; +struct thread; +struct sockaddr; +struct statfs; +struct ucred; +struct vnode; +struct vop_fsync_args; +struct vop_reallocblks_args; +struct vop_copyonwrite_args; + +int ffs_alloc(struct inode *, + ufs2_daddr_t, ufs2_daddr_t, int, struct ucred *, ufs2_daddr_t *); +int ffs_balloc_ufs1(struct vnode *a_vp, off_t a_startoffset, int a_size, + struct ucred *a_cred, int a_flags, struct buf **a_bpp); +int ffs_balloc_ufs2(struct vnode *a_vp, off_t a_startoffset, int a_size, + struct ucred *a_cred, int a_flags, struct buf **a_bpp); +int ffs_blkatoff(struct vnode *, off_t, char **, struct buf **); +void ffs_blkfree(struct fs *, struct vnode *, ufs2_daddr_t, long, ino_t); +ufs2_daddr_t ffs_blkpref_ufs1(struct inode *, ufs_lbn_t, int, ufs1_daddr_t *); +ufs2_daddr_t ffs_blkpref_ufs2(struct inode *, ufs_lbn_t, int, ufs2_daddr_t *); +int ffs_checkfreefile(struct fs *, struct vnode *, ino_t); +void ffs_clrblock(struct fs *, u_char *, ufs1_daddr_t); +void ffs_clusteracct (struct fs *, struct cg *, ufs1_daddr_t, int); +vfs_fhtovp_t ffs_fhtovp; +int ffs_flushfiles(struct mount *, int, struct thread *); +void ffs_fragacct(struct fs *, int, int32_t [], int); +int ffs_freefile(struct fs *, struct vnode *, ino_t, int); +int ffs_isblock(struct fs *, u_char *, ufs1_daddr_t); +void ffs_load_inode(struct buf *, struct inode *, struct fs *, ino_t); +int ffs_mountroot(void); +vfs_mount_t ffs_mount; +int ffs_reallocblks(struct vop_reallocblks_args *); +int ffs_realloccg(struct inode *, ufs2_daddr_t, ufs2_daddr_t, + ufs2_daddr_t, int, int, struct ucred *, struct buf **); +void ffs_setblock(struct fs *, u_char *, ufs1_daddr_t); +int ffs_snapblkfree(struct fs *, struct vnode *, ufs2_daddr_t, long, ino_t); +void ffs_snapremove(struct vnode *vp); +int ffs_snapshot(struct mount *mp, char *snapfile); +void ffs_snapshot_mount(struct mount *mp); +void ffs_snapshot_unmount(struct mount *mp); +vfs_statfs_t ffs_statfs; +vfs_sync_t ffs_sync; +int ffs_truncate(struct vnode *, off_t, int, struct ucred *, struct thread *); +vfs_unmount_t ffs_unmount; +int ffs_update(struct vnode *, int); +int ffs_valloc(struct vnode *, int, struct ucred *, struct vnode **); + +int ffs_vfree(struct vnode *, ino_t, int); +vfs_vget_t ffs_vget; +vfs_vptofh_t ffs_vptofh; + +extern vop_t **ffs_vnodeop_p; +extern vop_t **ffs_specop_p; +extern vop_t **ffs_fifoop_p; + +/* + * Soft update function prototypes. 
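+ * These hooks are implemented by the soft updates code and are called
+ * from the FFS allocation, truncation, and fsync paths to record and
+ * resolve update dependencies.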
+ */ +void softdep_initialize(void); +void softdep_uninitialize(void); +int softdep_mount(struct vnode *, struct mount *, struct fs *, + struct ucred *); +int softdep_flushworklist(struct mount *, int *, struct thread *); +int softdep_flushfiles(struct mount *, int, struct thread *); +void softdep_update_inodeblock(struct inode *, struct buf *, int); +void softdep_load_inodeblock(struct inode *); +void softdep_freefile(struct vnode *, ino_t, int); +int softdep_request_cleanup(struct fs *, struct vnode *); +void softdep_setup_freeblocks(struct inode *, off_t, int); +void softdep_setup_inomapdep(struct buf *, struct inode *, ino_t); +void softdep_setup_blkmapdep(struct buf *, struct fs *, ufs2_daddr_t); +void softdep_setup_allocdirect(struct inode *, ufs_lbn_t, ufs2_daddr_t, + ufs2_daddr_t, long, long, struct buf *); +void softdep_setup_allocext(struct inode *, ufs_lbn_t, ufs2_daddr_t, + ufs2_daddr_t, long, long, struct buf *); +void softdep_setup_allocindir_meta(struct buf *, struct inode *, + struct buf *, int, ufs2_daddr_t); +void softdep_setup_allocindir_page(struct inode *, ufs_lbn_t, + struct buf *, int, ufs2_daddr_t, ufs2_daddr_t, struct buf *); +void softdep_fsync_mountdev(struct vnode *); +int softdep_sync_metadata(struct vop_fsync_args *); +/* XXX incorrectly moved to mount.h - should be indirect function */ +#if 0 +int softdep_fsync(struct vnode *vp); +#endif + +#endif /* !_UFS_FFS_EXTERN_H */ diff --git a/src/sys/ufs/ffs/ffs_inode.c b/src/sys/ufs/ffs/ffs_inode.c new file mode 100644 index 0000000..5b76166 --- /dev/null +++ b/src/sys/ufs/ffs/ffs_inode.c @@ -0,0 +1,641 @@ +/* + * Copyright (c) 1982, 1986, 1989, 1993 + * The Regents of the University of California. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. All advertising materials mentioning features or use of this software + * must display the following acknowledgement: + * This product includes software developed by the University of + * California, Berkeley and its contributors. + * 4. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. 
+ * + * @(#)ffs_inode.c 8.13 (Berkeley) 4/21/95 + */ + +#include +__FBSDID("$FreeBSD: src/sys/ufs/ffs/ffs_inode.c,v 1.91 2003/10/18 14:10:27 phk Exp $"); + +#include "opt_quota.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include +#include +#include +#include +#include + +#include +#include + +static int ffs_indirtrunc(struct inode *, ufs2_daddr_t, ufs2_daddr_t, + ufs2_daddr_t, int, ufs2_daddr_t *); + +/* + * Update the access, modified, and inode change times as specified by the + * IN_ACCESS, IN_UPDATE, and IN_CHANGE flags respectively. Write the inode + * to disk if the IN_MODIFIED flag is set (it may be set initially, or by + * the timestamp update). The IN_LAZYMOD flag is set to force a write + * later if not now. If we write now, then clear both IN_MODIFIED and + * IN_LAZYMOD to reflect the presumably successful write, and if waitfor is + * set, then wait for the write to complete. + */ +int +ffs_update(vp, waitfor) + struct vnode *vp; + int waitfor; +{ + struct fs *fs; + struct buf *bp; + struct inode *ip; + int error; + +#ifdef DEBUG_VFS_LOCKS + if ((vp->v_iflag & VI_XLOCK) == 0) + ASSERT_VOP_LOCKED(vp, "ffs_update"); +#endif + ufs_itimes(vp); + ip = VTOI(vp); + if ((ip->i_flag & IN_MODIFIED) == 0 && waitfor == 0) + return (0); + ip->i_flag &= ~(IN_LAZYMOD | IN_MODIFIED); + fs = ip->i_fs; + if (fs->fs_ronly) + return (0); + /* + * Ensure that uid and gid are correct. This is a temporary + * fix until fsck has been changed to do the update. + */ + if (fs->fs_magic == FS_UFS1_MAGIC && /* XXX */ + fs->fs_old_inodefmt < FS_44INODEFMT) { /* XXX */ + ip->i_din1->di_ouid = ip->i_uid; /* XXX */ + ip->i_din1->di_ogid = ip->i_gid; /* XXX */ + } /* XXX */ + error = bread(ip->i_devvp, fsbtodb(fs, ino_to_fsba(fs, ip->i_number)), + (int)fs->fs_bsize, NOCRED, &bp); + if (error) { + brelse(bp); + return (error); + } + if (DOINGSOFTDEP(vp)) + softdep_update_inodeblock(ip, bp, waitfor); + else if (ip->i_effnlink != ip->i_nlink) + panic("ffs_update: bad link cnt"); + if (ip->i_ump->um_fstype == UFS1) + *((struct ufs1_dinode *)bp->b_data + + ino_to_fsbo(fs, ip->i_number)) = *ip->i_din1; + else + *((struct ufs2_dinode *)bp->b_data + + ino_to_fsbo(fs, ip->i_number)) = *ip->i_din2; + if (waitfor && !DOINGASYNC(vp)) { + return (bwrite(bp)); + } else if (vm_page_count_severe() || buf_dirty_count_severe()) { + return (bwrite(bp)); + } else { + if (bp->b_bufsize == fs->fs_bsize) + bp->b_flags |= B_CLUSTEROK; + bdwrite(bp); + return (0); + } +} + +#define SINGLE 0 /* index of single indirect block */ +#define DOUBLE 1 /* index of double indirect block */ +#define TRIPLE 2 /* index of triple indirect block */ +/* + * Truncate the inode oip to at most length size, freeing the + * disk blocks. + */ +int +ffs_truncate(vp, length, flags, cred, td) + struct vnode *vp; + off_t length; + int flags; + struct ucred *cred; + struct thread *td; +{ + struct vnode *ovp = vp; + struct inode *oip; + ufs2_daddr_t bn, lbn, lastblock, lastiblock[NIADDR], indir_lbn[NIADDR]; + ufs2_daddr_t oldblks[NDADDR + NIADDR], newblks[NDADDR + NIADDR]; + ufs2_daddr_t count, blocksreleased = 0, datablocks; + struct fs *fs; + struct buf *bp; + int needextclean, softdepslowdown, extblocks; + int offset, size, level, nblocks; + int i, error, allerror; + off_t osize; + + oip = VTOI(ovp); + fs = oip->i_fs; + if (length < 0) + return (EINVAL); + /* + * Historically clients did not have to specify which data + * they were truncating. 
So, if not specified, we assume + * traditional behavior, e.g., just the normal data. + */ + if ((flags & (IO_EXT | IO_NORMAL)) == 0) + flags |= IO_NORMAL; + /* + * If we are truncating the extended-attributes, and cannot + * do it with soft updates, then do it slowly here. If we are + * truncating both the extended attributes and the file contents + * (e.g., the file is being unlinked), then pick it off with + * soft updates below. + */ + needextclean = 0; + softdepslowdown = DOINGSOFTDEP(ovp) && softdep_slowdown(ovp); + extblocks = 0; + datablocks = DIP(oip, i_blocks); + if (fs->fs_magic == FS_UFS2_MAGIC && oip->i_din2->di_extsize > 0) { + extblocks = btodb(fragroundup(fs, oip->i_din2->di_extsize)); + datablocks -= extblocks; + } + if ((flags & IO_EXT) && extblocks > 0) { + if (DOINGSOFTDEP(ovp) && softdepslowdown == 0 && length == 0) { + if ((flags & IO_NORMAL) == 0) { + softdep_setup_freeblocks(oip, length, IO_EXT); + return (0); + } + needextclean = 1; + } else { + if (length != 0) + panic("ffs_truncate: partial trunc of extdata"); + if ((error = VOP_FSYNC(ovp, cred, MNT_WAIT, td)) != 0) + return (error); + osize = oip->i_din2->di_extsize; + oip->i_din2->di_blocks -= extblocks; +#ifdef QUOTA + (void) chkdq(oip, -extblocks, NOCRED, 0); +#endif + vinvalbuf(ovp, V_ALT, cred, td, 0, 0); + oip->i_din2->di_extsize = 0; + for (i = 0; i < NXADDR; i++) { + oldblks[i] = oip->i_din2->di_extb[i]; + oip->i_din2->di_extb[i] = 0; + } + oip->i_flag |= IN_CHANGE | IN_UPDATE; + if ((error = ffs_update(ovp, 1))) + return (error); + for (i = 0; i < NXADDR; i++) { + if (oldblks[i] == 0) + continue; + ffs_blkfree(fs, oip->i_devvp, oldblks[i], + sblksize(fs, osize, i), oip->i_number); + } + } + } + if ((flags & IO_NORMAL) == 0) + return (0); + if (length > fs->fs_maxfilesize) + return (EFBIG); + if (ovp->v_type == VLNK && + (oip->i_size < ovp->v_mount->mnt_maxsymlinklen || + datablocks == 0)) { +#ifdef DIAGNOSTIC + if (length != 0) + panic("ffs_truncate: partial truncate of symlink"); +#endif + bzero(SHORTLINK(oip), (u_int)oip->i_size); + oip->i_size = 0; + DIP(oip, i_size) = 0; + oip->i_flag |= IN_CHANGE | IN_UPDATE; + if (needextclean) + softdep_setup_freeblocks(oip, length, IO_EXT); + return (UFS_UPDATE(ovp, 1)); + } + if (oip->i_size == length) { + oip->i_flag |= IN_CHANGE | IN_UPDATE; + if (needextclean) + softdep_setup_freeblocks(oip, length, IO_EXT); + return (UFS_UPDATE(ovp, 0)); + } + if (fs->fs_ronly) + panic("ffs_truncate: read-only filesystem"); +#ifdef QUOTA + error = getinoquota(oip); + if (error) + return (error); +#endif + if ((oip->i_flags & SF_SNAPSHOT) != 0) + ffs_snapremove(ovp); + ovp->v_lasta = ovp->v_clen = ovp->v_cstart = ovp->v_lastw = 0; + if (DOINGSOFTDEP(ovp)) { + if (length > 0 || softdepslowdown) { + /* + * If a file is only partially truncated, then + * we have to clean up the data structures + * describing the allocation past the truncation + * point. Finding and deallocating those structures + * is a lot of work. Since partial truncation occurs + * rarely, we solve the problem by syncing the file + * so that it will have no data structures left. + */ + if ((error = VOP_FSYNC(ovp, cred, MNT_WAIT, td)) != 0) + return (error); + if (oip->i_flag & IN_SPACECOUNTED) + fs->fs_pendingblocks -= datablocks; + } else { +#ifdef QUOTA + (void) chkdq(oip, -datablocks, NOCRED, 0); +#endif + softdep_setup_freeblocks(oip, length, needextclean ? + IO_EXT | IO_NORMAL : IO_NORMAL); + vinvalbuf(ovp, needextclean ? 
0 : V_NORMAL, + cred, td, 0, 0); + oip->i_flag |= IN_CHANGE | IN_UPDATE; + return (ffs_update(ovp, 0)); + } + } + osize = oip->i_size; + /* + * Lengthen the size of the file. We must ensure that the + * last byte of the file is allocated. Since the smallest + * value of osize is 0, length will be at least 1. + */ + if (osize < length) { + vnode_pager_setsize(ovp, length); + flags |= BA_CLRBUF; + error = UFS_BALLOC(ovp, length - 1, 1, cred, flags, &bp); + if (error) + return (error); + oip->i_size = length; + DIP(oip, i_size) = length; + if (bp->b_bufsize == fs->fs_bsize) + bp->b_flags |= B_CLUSTEROK; + if (flags & IO_SYNC) + bwrite(bp); + else + bawrite(bp); + oip->i_flag |= IN_CHANGE | IN_UPDATE; + return (UFS_UPDATE(ovp, 1)); + } + /* + * Shorten the size of the file. If the file is not being + * truncated to a block boundary, the contents of the + * partial block following the end of the file must be + * zero'ed in case it ever becomes accessible again because + * of subsequent file growth. Directories however are not + * zero'ed as they should grow back initialized to empty. + */ + offset = blkoff(fs, length); + if (offset == 0) { + oip->i_size = length; + DIP(oip, i_size) = length; + } else { + lbn = lblkno(fs, length); + flags |= BA_CLRBUF; + error = UFS_BALLOC(ovp, length - 1, 1, cred, flags, &bp); + if (error) { + return (error); + } + /* + * When we are doing soft updates and the UFS_BALLOC + * above fills in a direct block hole with a full sized + * block that will be truncated down to a fragment below, + * we must flush out the block dependency with an FSYNC + * so that we do not get a soft updates inconsistency + * when we create the fragment below. + */ + if (DOINGSOFTDEP(ovp) && lbn < NDADDR && + fragroundup(fs, blkoff(fs, length)) < fs->fs_bsize && + (error = VOP_FSYNC(ovp, cred, MNT_WAIT, td)) != 0) + return (error); + oip->i_size = length; + DIP(oip, i_size) = length; + size = blksize(fs, oip, lbn); + if (ovp->v_type != VDIR) + bzero((char *)bp->b_data + offset, + (u_int)(size - offset)); + /* Kirk's code has reallocbuf(bp, size, 1) here */ + allocbuf(bp, size); + if (bp->b_bufsize == fs->fs_bsize) + bp->b_flags |= B_CLUSTEROK; + if (flags & IO_SYNC) + bwrite(bp); + else + bawrite(bp); + } + /* + * Calculate index into inode's block list of + * last direct and indirect blocks (if any) + * which we want to keep. Lastblock is -1 when + * the file is truncated to 0. + */ + lastblock = lblkno(fs, length + fs->fs_bsize - 1) - 1; + lastiblock[SINGLE] = lastblock - NDADDR; + lastiblock[DOUBLE] = lastiblock[SINGLE] - NINDIR(fs); + lastiblock[TRIPLE] = lastiblock[DOUBLE] - NINDIR(fs) * NINDIR(fs); + nblocks = btodb(fs->fs_bsize); + /* + * Update file and block pointers on disk before we start freeing + * blocks. If we crash before free'ing blocks below, the blocks + * will be returned to the free list. lastiblock values are also + * normalized to -1 for calls to ffs_indirtrunc below. + */ + for (level = TRIPLE; level >= SINGLE; level--) { + oldblks[NDADDR + level] = DIP(oip, i_ib[level]); + if (lastiblock[level] < 0) { + DIP(oip, i_ib[level]) = 0; + lastiblock[level] = -1; + } + } + for (i = 0; i < NDADDR; i++) { + oldblks[i] = DIP(oip, i_db[i]); + if (i > lastblock) + DIP(oip, i_db[i]) = 0; + } + oip->i_flag |= IN_CHANGE | IN_UPDATE; + allerror = UFS_UPDATE(ovp, 1); + + /* + * Having written the new inode to disk, save its new configuration + * and put back the old block pointers long enough to process them. 
+ * Note that we save the new block configuration so we can check it + * when we are done. + */ + for (i = 0; i < NDADDR; i++) { + newblks[i] = DIP(oip, i_db[i]); + DIP(oip, i_db[i]) = oldblks[i]; + } + for (i = 0; i < NIADDR; i++) { + newblks[NDADDR + i] = DIP(oip, i_ib[i]); + DIP(oip, i_ib[i]) = oldblks[NDADDR + i]; + } + oip->i_size = osize; + DIP(oip, i_size) = osize; + + error = vtruncbuf(ovp, cred, td, length, fs->fs_bsize); + if (error && (allerror == 0)) + allerror = error; + + /* + * Indirect blocks first. + */ + indir_lbn[SINGLE] = -NDADDR; + indir_lbn[DOUBLE] = indir_lbn[SINGLE] - NINDIR(fs) - 1; + indir_lbn[TRIPLE] = indir_lbn[DOUBLE] - NINDIR(fs) * NINDIR(fs) - 1; + for (level = TRIPLE; level >= SINGLE; level--) { + bn = DIP(oip, i_ib[level]); + if (bn != 0) { + error = ffs_indirtrunc(oip, indir_lbn[level], + fsbtodb(fs, bn), lastiblock[level], level, &count); + if (error) + allerror = error; + blocksreleased += count; + if (lastiblock[level] < 0) { + DIP(oip, i_ib[level]) = 0; + ffs_blkfree(fs, oip->i_devvp, bn, fs->fs_bsize, + oip->i_number); + blocksreleased += nblocks; + } + } + if (lastiblock[level] >= 0) + goto done; + } + + /* + * All whole direct blocks or frags. + */ + for (i = NDADDR - 1; i > lastblock; i--) { + long bsize; + + bn = DIP(oip, i_db[i]); + if (bn == 0) + continue; + DIP(oip, i_db[i]) = 0; + bsize = blksize(fs, oip, i); + ffs_blkfree(fs, oip->i_devvp, bn, bsize, oip->i_number); + blocksreleased += btodb(bsize); + } + if (lastblock < 0) + goto done; + + /* + * Finally, look for a change in size of the + * last direct block; release any frags. + */ + bn = DIP(oip, i_db[lastblock]); + if (bn != 0) { + long oldspace, newspace; + + /* + * Calculate amount of space we're giving + * back as old block size minus new block size. + */ + oldspace = blksize(fs, oip, lastblock); + oip->i_size = length; + DIP(oip, i_size) = length; + newspace = blksize(fs, oip, lastblock); + if (newspace == 0) + panic("ffs_truncate: newspace"); + if (oldspace - newspace > 0) { + /* + * Block number of space to be free'd is + * the old block # plus the number of frags + * required for the storage we're keeping. + */ + bn += numfrags(fs, newspace); + ffs_blkfree(fs, oip->i_devvp, bn, oldspace - newspace, + oip->i_number); + blocksreleased += btodb(oldspace - newspace); + } + } +done: +#ifdef DIAGNOSTIC + for (level = SINGLE; level <= TRIPLE; level++) + if (newblks[NDADDR + level] != DIP(oip, i_ib[level])) + panic("ffs_truncate1"); + for (i = 0; i < NDADDR; i++) + if (newblks[i] != DIP(oip, i_db[i])) + panic("ffs_truncate2"); + VI_LOCK(ovp); + if (length == 0 && + (fs->fs_magic != FS_UFS2_MAGIC || oip->i_din2->di_extsize == 0) && + (!TAILQ_EMPTY(&ovp->v_dirtyblkhd) || + !TAILQ_EMPTY(&ovp->v_cleanblkhd))) + panic("ffs_truncate3"); + VI_UNLOCK(ovp); +#endif /* DIAGNOSTIC */ + /* + * Put back the real size. + */ + oip->i_size = length; + DIP(oip, i_size) = length; + DIP(oip, i_blocks) -= blocksreleased; + + if (DIP(oip, i_blocks) < 0) /* sanity */ + DIP(oip, i_blocks) = 0; + oip->i_flag |= IN_CHANGE; +#ifdef QUOTA + (void) chkdq(oip, -blocksreleased, NOCRED, 0); +#endif + return (allerror); +} + +/* + * Release blocks associated with the inode ip and stored in the indirect + * block bn. Blocks are free'd in LIFO order up to (but not including) + * lastbn. If level is greater than SINGLE, the block is an indirect block + * and recursive calls to indirtrunc must be used to cleanse other indirect + * blocks. 
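 * As a worked example of the index arithmetic used below (assuming
 * NINDIR(fs) == 2048, i.e. 16K blocks with 64-bit block pointers):
 * for a double indirect block (level == DOUBLE) the loop computes
 * factor == 2048, so a caller passing lastbn == 5000 gets
 * last = 5000 / 2048 = 2.  Entries 2047 down to 3 are then freed
 * outright, entry 2 is descended into recursively with
 * lastbn == 5000 % 2048 == 904, and entries 0 and 1 are left alone.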
+ */ +static int +ffs_indirtrunc(ip, lbn, dbn, lastbn, level, countp) + struct inode *ip; + ufs2_daddr_t lbn, lastbn; + ufs2_daddr_t dbn; + int level; + ufs2_daddr_t *countp; +{ + struct buf *bp; + struct fs *fs = ip->i_fs; + struct vnode *vp; + caddr_t copy = NULL; + int i, nblocks, error = 0, allerror = 0; + ufs2_daddr_t nb, nlbn, last; + ufs2_daddr_t blkcount, factor, blocksreleased = 0; + ufs1_daddr_t *bap1 = NULL; + ufs2_daddr_t *bap2 = NULL; +# define BAP(ip, i) (((ip)->i_ump->um_fstype == UFS1) ? bap1[i] : bap2[i]) + + /* + * Calculate index in current block of last + * block to be kept. -1 indicates the entire + * block so we need not calculate the index. + */ + factor = 1; + for (i = SINGLE; i < level; i++) + factor *= NINDIR(fs); + last = lastbn; + if (lastbn > 0) + last /= factor; + nblocks = btodb(fs->fs_bsize); + /* + * Get buffer of block pointers, zero those entries corresponding + * to blocks to be free'd, and update on disk copy first. Since + * double(triple) indirect before single(double) indirect, calls + * to bmap on these blocks will fail. However, we already have + * the on disk address, so we have to set the b_blkno field + * explicitly instead of letting bread do everything for us. + */ + vp = ITOV(ip); + bp = getblk(vp, lbn, (int)fs->fs_bsize, 0, 0, 0); + if ((bp->b_flags & B_CACHE) == 0) { + curproc->p_stats->p_ru.ru_inblock++; /* pay for read */ + bp->b_iocmd = BIO_READ; + bp->b_flags &= ~B_INVAL; + bp->b_ioflags &= ~BIO_ERROR; + if (bp->b_bcount > bp->b_bufsize) + panic("ffs_indirtrunc: bad buffer size"); + bp->b_blkno = dbn; + vfs_busy_pages(bp, 0); + bp->b_iooffset = dbtob(bp->b_blkno); + VOP_STRATEGY(bp->b_vp, bp); + error = bufwait(bp); + } + if (error) { + brelse(bp); + *countp = 0; + return (error); + } + + if (ip->i_ump->um_fstype == UFS1) + bap1 = (ufs1_daddr_t *)bp->b_data; + else + bap2 = (ufs2_daddr_t *)bp->b_data; + if (lastbn != -1) { + MALLOC(copy, caddr_t, fs->fs_bsize, M_TEMP, M_WAITOK); + bcopy((caddr_t)bp->b_data, copy, (u_int)fs->fs_bsize); + for (i = last + 1; i < NINDIR(fs); i++) + BAP(ip, i) = 0; + if (DOINGASYNC(vp)) { + bawrite(bp); + } else { + error = bwrite(bp); + if (error) + allerror = error; + } + if (ip->i_ump->um_fstype == UFS1) + bap1 = (ufs1_daddr_t *)copy; + else + bap2 = (ufs2_daddr_t *)copy; + } + + /* + * Recursively free totally unused blocks. + */ + for (i = NINDIR(fs) - 1, nlbn = lbn + 1 - i * factor; i > last; + i--, nlbn += factor) { + nb = BAP(ip, i); + if (nb == 0) + continue; + if (level > SINGLE) { + if ((error = ffs_indirtrunc(ip, nlbn, fsbtodb(fs, nb), + (ufs2_daddr_t)-1, level - 1, &blkcount)) != 0) + allerror = error; + blocksreleased += blkcount; + } + ffs_blkfree(fs, ip->i_devvp, nb, fs->fs_bsize, ip->i_number); + blocksreleased += nblocks; + } + + /* + * Recursively free last partial block. + */ + if (level > SINGLE && lastbn >= 0) { + last = lastbn % factor; + nb = BAP(ip, i); + if (nb != 0) { + error = ffs_indirtrunc(ip, nlbn, fsbtodb(fs, nb), + last, level - 1, &blkcount); + if (error) + allerror = error; + blocksreleased += blkcount; + } + } + if (copy != NULL) { + FREE(copy, M_TEMP); + } else { + bp->b_flags |= B_INVAL | B_NOCACHE; + brelse(bp); + } + + *countp = blocksreleased; + return (allerror); +} diff --git a/src/sys/ufs/ffs/ffs_rawread.c b/src/sys/ufs/ffs/ffs_rawread.c new file mode 100644 index 0000000..5819e82 --- /dev/null +++ b/src/sys/ufs/ffs/ffs_rawread.c @@ -0,0 +1,498 @@ +/*- + * Copyright (c) 2000-2003 Tor Egge + * All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +#include +__FBSDID("$FreeBSD: src/sys/ufs/ffs/ffs_rawread.c,v 1.12 2003/11/15 09:28:09 phk Exp $"); + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include + +static int ffs_rawread_readahead(struct vnode *vp, + caddr_t udata, + off_t offset, + size_t len, + struct thread *td, + struct buf *bp, + caddr_t sa); +static int ffs_rawread_main(struct vnode *vp, + struct uio *uio); + +static int ffs_rawread_sync(struct vnode *vp, struct thread *td); + +int ffs_rawread(struct vnode *vp, struct uio *uio, int *workdone); + +void ffs_rawread_setup(void); + +static void ffs_rawreadwakeup(struct buf *bp); + + +SYSCTL_DECL(_vfs_ffs); + +static int ffsrawbufcnt = 4; +SYSCTL_INT(_vfs_ffs, OID_AUTO, ffsrawbufcnt, CTLFLAG_RD, &ffsrawbufcnt, 0, + "Buffers available for raw reads"); + +static int allowrawread = 1; +SYSCTL_INT(_vfs_ffs, OID_AUTO, allowrawread, CTLFLAG_RW, &allowrawread, 0, + "Flag to enable raw reads"); + +static int rawreadahead = 1; +SYSCTL_INT(_vfs_ffs, OID_AUTO, rawreadahead, CTLFLAG_RW, &rawreadahead, 0, + "Flag to enable readahead for long raw reads"); + + +void +ffs_rawread_setup(void) +{ + ffsrawbufcnt = (nswbuf > 100 ) ? 
(nswbuf - (nswbuf >> 4)) : nswbuf - 8; +} + + +static int +ffs_rawread_sync(struct vnode *vp, struct thread *td) +{ + int spl; + int error; + int upgraded; + + GIANT_REQUIRED; + /* Check for dirty mmap, pending writes and dirty buffers */ + spl = splbio(); + VI_LOCK(vp); + if (vp->v_numoutput > 0 || + !TAILQ_EMPTY(&vp->v_dirtyblkhd) || + (vp->v_iflag & VI_OBJDIRTY) != 0) { + splx(spl); + VI_UNLOCK(vp); + + if (VOP_ISLOCKED(vp, td) != LK_EXCLUSIVE) { + upgraded = 1; + /* Upgrade to exclusive lock, this might block */ + VOP_LOCK(vp, LK_UPGRADE | LK_NOPAUSE, td); + } else + upgraded = 0; + + + /* Attempt to msync mmap() regions to clean dirty mmap */ + VI_LOCK(vp); + if ((vp->v_iflag & VI_OBJDIRTY) != 0) { + struct vm_object *obj; + VI_UNLOCK(vp); + if (VOP_GETVOBJECT(vp, &obj) == 0) { + VM_OBJECT_LOCK(obj); + vm_object_page_clean(obj, 0, 0, OBJPC_SYNC); + VM_OBJECT_UNLOCK(obj); + } + VI_LOCK(vp); + } + + /* Wait for pending writes to complete */ + spl = splbio(); + while (vp->v_numoutput) { + vp->v_iflag |= VI_BWAIT; + error = msleep((caddr_t)&vp->v_numoutput, + VI_MTX(vp), + PRIBIO + 1, + "rawrdfls", 0); + if (error != 0) { + splx(spl); + VI_UNLOCK(vp); + if (upgraded != 0) + VOP_LOCK(vp, LK_DOWNGRADE, td); + return (error); + } + } + /* Flush dirty buffers */ + if (!TAILQ_EMPTY(&vp->v_dirtyblkhd)) { + splx(spl); + VI_UNLOCK(vp); + if ((error = VOP_FSYNC(vp, NOCRED, MNT_WAIT, td)) != 0) { + if (upgraded != 0) + VOP_LOCK(vp, LK_DOWNGRADE, td); + return (error); + } + VI_LOCK(vp); + spl = splbio(); + if (vp->v_numoutput > 0 || + !TAILQ_EMPTY(&vp->v_dirtyblkhd)) + panic("ffs_rawread_sync: dirty bufs"); + } + splx(spl); + VI_UNLOCK(vp); + if (upgraded != 0) + VOP_LOCK(vp, LK_DOWNGRADE, td); + } else { + splx(spl); + VI_UNLOCK(vp); + } + return 0; +} + + +static int +ffs_rawread_readahead(struct vnode *vp, + caddr_t udata, + off_t offset, + size_t len, + struct thread *td, + struct buf *bp, + caddr_t sa) +{ + int error; + u_int iolen; + off_t blockno; + int blockoff; + int bsize; + struct vnode *dp; + int bforwards; + struct inode *ip; + ufs2_daddr_t blkno; + + GIANT_REQUIRED; + bsize = vp->v_mount->mnt_stat.f_iosize; + + ip = VTOI(vp); + dp = ip->i_devvp; + + iolen = ((vm_offset_t) udata) & PAGE_MASK; + bp->b_bcount = len; + if (bp->b_bcount + iolen > bp->b_kvasize) { + bp->b_bcount = bp->b_kvasize; + if (iolen != 0) + bp->b_bcount -= PAGE_SIZE; + } + bp->b_flags = 0; /* XXX necessary ? 
*/ + bp->b_iocmd = BIO_READ; + bp->b_iodone = ffs_rawreadwakeup; + bp->b_data = udata; + bp->b_saveaddr = sa; + blockno = offset / bsize; + blockoff = (offset % bsize) / DEV_BSIZE; + if ((daddr_t) blockno != blockno) { + return EINVAL; /* blockno overflow */ + } + + bp->b_lblkno = bp->b_blkno = blockno; + + error = ufs_bmaparray(vp, bp->b_lblkno, &blkno, NULL, &bforwards, NULL); + if (error != 0) + return error; + if (blkno == -1) { + + /* Fill holes with NULs to preserve semantics */ + + if (bp->b_bcount + blockoff * DEV_BSIZE > bsize) + bp->b_bcount = bsize - blockoff * DEV_BSIZE; + bp->b_bufsize = bp->b_bcount; + + if (vmapbuf(bp) < 0) + return EFAULT; + + if (ticks - PCPU_GET(switchticks) >= hogticks) + uio_yield(); + bzero(bp->b_data, bp->b_bufsize); + + /* Mark operation completed (similar to bufdone()) */ + + bp->b_resid = 0; + bp->b_flags |= B_DONE; + return 0; + } + bp->b_blkno = blkno + blockoff; + bp->b_offset = bp->b_iooffset = (blkno + blockoff) * DEV_BSIZE; + + if (bp->b_bcount + blockoff * DEV_BSIZE > bsize * (1 + bforwards)) + bp->b_bcount = bsize * (1 + bforwards) - blockoff * DEV_BSIZE; + bp->b_bufsize = bp->b_bcount; + bp->b_dev = dp->v_rdev; + + if (vmapbuf(bp) < 0) + return EFAULT; + + if (dp->v_type == VCHR) + (void) VOP_SPECSTRATEGY(dp, bp); + else + (void) VOP_STRATEGY(dp, bp); + return 0; +} + + +static int +ffs_rawread_main(struct vnode *vp, + struct uio *uio) +{ + int error, nerror; + struct buf *bp, *nbp, *tbp; + caddr_t sa, nsa, tsa; + u_int iolen; + int spl; + caddr_t udata; + long resid; + off_t offset; + struct thread *td; + + GIANT_REQUIRED; + td = uio->uio_td ? uio->uio_td : curthread; + udata = uio->uio_iov->iov_base; + resid = uio->uio_resid; + offset = uio->uio_offset; + + /* + * keep the process from being swapped + */ + PHOLD(td->td_proc); + + error = 0; + nerror = 0; + + bp = NULL; + nbp = NULL; + sa = NULL; + nsa = NULL; + + while (resid > 0) { + + if (bp == NULL) { /* Setup first read */ + /* XXX: Leave some bufs for swap */ + bp = getpbuf(&ffsrawbufcnt); + sa = bp->b_data; + bp->b_vp = vp; + error = ffs_rawread_readahead(vp, udata, offset, + resid, td, bp, sa); + if (error != 0) + break; + + if (resid > bp->b_bufsize) { /* Setup fist readahead */ + /* XXX: Leave bufs for swap */ + if (rawreadahead != 0) + nbp = trypbuf(&ffsrawbufcnt); + else + nbp = NULL; + if (nbp != NULL) { + nsa = nbp->b_data; + nbp->b_vp = vp; + + nerror = ffs_rawread_readahead(vp, + udata + + bp->b_bufsize, + offset + + bp->b_bufsize, + resid - + bp->b_bufsize, + td, + nbp, + nsa); + if (nerror) { + relpbuf(nbp, &ffsrawbufcnt); + nbp = NULL; + } + } + } + } + + spl = splbio(); + bwait(bp, PRIBIO, "rawrd"); + splx(spl); + + vunmapbuf(bp); + + iolen = bp->b_bcount - bp->b_resid; + if (iolen == 0 && (bp->b_ioflags & BIO_ERROR) == 0) { + nerror = 0; /* Ignore possible beyond EOF error */ + break; /* EOF */ + } + + if ((bp->b_ioflags & BIO_ERROR) != 0) { + error = bp->b_error; + break; + } + resid -= iolen; + udata += iolen; + offset += iolen; + if (iolen < bp->b_bufsize) { + /* Incomplete read. 
Try to read remaining part */ + error = ffs_rawread_readahead(vp, + udata, + offset, + bp->b_bufsize - iolen, + td, + bp, + sa); + if (error != 0) + break; + } else if (nbp != NULL) { /* Complete read with readahead */ + + tbp = bp; + bp = nbp; + nbp = tbp; + + tsa = sa; + sa = nsa; + nsa = tsa; + + if (resid <= bp->b_bufsize) { /* No more readaheads */ + relpbuf(nbp, &ffsrawbufcnt); + nbp = NULL; + } else { /* Setup next readahead */ + nerror = ffs_rawread_readahead(vp, + udata + + bp->b_bufsize, + offset + + bp->b_bufsize, + resid - + bp->b_bufsize, + td, + nbp, + nsa); + if (nerror != 0) { + relpbuf(nbp, &ffsrawbufcnt); + nbp = NULL; + } + } + } else if (nerror != 0) {/* Deferred Readahead error */ + break; + } else if (resid > 0) { /* More to read, no readahead */ + error = ffs_rawread_readahead(vp, udata, offset, + resid, td, bp, sa); + if (error != 0) + break; + } + } + + if (bp != NULL) + relpbuf(bp, &ffsrawbufcnt); + if (nbp != NULL) { /* Run down readahead buffer */ + spl = splbio(); + bwait(nbp, PRIBIO, "rawrd"); + splx(spl); + vunmapbuf(nbp); + relpbuf(nbp, &ffsrawbufcnt); + } + + if (error == 0) + error = nerror; + PRELE(td->td_proc); + uio->uio_iov->iov_base = udata; + uio->uio_resid = resid; + uio->uio_offset = offset; + return error; +} + + +int +ffs_rawread(struct vnode *vp, + struct uio *uio, + int *workdone) +{ + if (allowrawread != 0 && + uio->uio_iovcnt == 1 && + uio->uio_segflg == UIO_USERSPACE && + uio->uio_resid == uio->uio_iov->iov_len && + (((uio->uio_td != NULL) ? uio->uio_td : curthread)->td_flags & + TDF_DEADLKTREAT) == 0) { + int secsize; /* Media sector size */ + off_t filebytes; /* Bytes left of file */ + int blockbytes; /* Bytes left of file in full blocks */ + int partialbytes; /* Bytes in last partial block */ + int skipbytes; /* Bytes not to read in ffs_rawread */ + struct inode *ip; + int error; + + + /* Only handle sector aligned reads */ + ip = VTOI(vp); + secsize = ip->i_devvp->v_rdev->si_bsize_phys; + if ((uio->uio_offset & (secsize - 1)) == 0 && + (uio->uio_resid & (secsize - 1)) == 0) { + + /* Sync dirty pages and buffers if needed */ + error = ffs_rawread_sync(vp, + (uio->uio_td != NULL) ? + uio->uio_td : curthread); + if (error != 0) + return error; + + /* Check for end of file */ + if (ip->i_size > uio->uio_offset) { + filebytes = ip->i_size - uio->uio_offset; + + /* No special eof handling needed ? */ + if (uio->uio_resid <= filebytes) { + *workdone = 1; + return ffs_rawread_main(vp, uio); + } + + partialbytes = ((unsigned int) ip->i_size) % + ip->i_fs->fs_bsize; + blockbytes = (int) filebytes - partialbytes; + if (blockbytes > 0) { + skipbytes = uio->uio_resid - + blockbytes; + uio->uio_resid = blockbytes; + error = ffs_rawread_main(vp, uio); + uio->uio_resid += skipbytes; + if (error != 0) + return error; + /* Read remaining part using buffer */ + } + } + } + } + *workdone = 0; + return 0; +} + + +static void +ffs_rawreadwakeup(struct buf *bp) +{ + bdone(bp); +} diff --git a/src/sys/ufs/ffs/ffs_snapshot.c b/src/sys/ufs/ffs/ffs_snapshot.c new file mode 100644 index 0000000..d195b6e --- /dev/null +++ b/src/sys/ufs/ffs/ffs_snapshot.c @@ -0,0 +1,2114 @@ +/* + * Copyright 2000 Marshall Kirk McKusick. All Rights Reserved. 
+ * + * Further information about snapshots can be obtained from: + * + * Marshall Kirk McKusick http://www.mckusick.com/softdep/ + * 1614 Oxford Street mckusick@mckusick.com + * Berkeley, CA 94709-1608 +1-510-843-9542 + * USA + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY MARSHALL KIRK MCKUSICK ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL MARSHALL KIRK MCKUSICK BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * @(#)ffs_snapshot.c 8.11 (McKusick) 7/23/00 + */ + +#include +__FBSDID("$FreeBSD: src/sys/ufs/ffs/ffs_snapshot.c,v 1.76 2003/11/13 03:56:32 alc Exp $"); + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include + +#include +#include + +#define KERNCRED thread0.td_ucred +#define DEBUG 1 + +static int cgaccount(int, struct vnode *, struct buf *, int); +static int expunge_ufs1(struct vnode *, struct inode *, struct fs *, + int (*)(struct vnode *, ufs1_daddr_t *, ufs1_daddr_t *, struct fs *, + ufs_lbn_t, int), int); +static int indiracct_ufs1(struct vnode *, struct vnode *, int, + ufs1_daddr_t, ufs_lbn_t, ufs_lbn_t, ufs_lbn_t, ufs_lbn_t, struct fs *, + int (*)(struct vnode *, ufs1_daddr_t *, ufs1_daddr_t *, struct fs *, + ufs_lbn_t, int), int); +static int fullacct_ufs1(struct vnode *, ufs1_daddr_t *, ufs1_daddr_t *, + struct fs *, ufs_lbn_t, int); +static int snapacct_ufs1(struct vnode *, ufs1_daddr_t *, ufs1_daddr_t *, + struct fs *, ufs_lbn_t, int); +static int mapacct_ufs1(struct vnode *, ufs1_daddr_t *, ufs1_daddr_t *, + struct fs *, ufs_lbn_t, int); +static int expunge_ufs2(struct vnode *, struct inode *, struct fs *, + int (*)(struct vnode *, ufs2_daddr_t *, ufs2_daddr_t *, struct fs *, + ufs_lbn_t, int), int); +static int indiracct_ufs2(struct vnode *, struct vnode *, int, + ufs2_daddr_t, ufs_lbn_t, ufs_lbn_t, ufs_lbn_t, ufs_lbn_t, struct fs *, + int (*)(struct vnode *, ufs2_daddr_t *, ufs2_daddr_t *, struct fs *, + ufs_lbn_t, int), int); +static int fullacct_ufs2(struct vnode *, ufs2_daddr_t *, ufs2_daddr_t *, + struct fs *, ufs_lbn_t, int); +static int snapacct_ufs2(struct vnode *, ufs2_daddr_t *, ufs2_daddr_t *, + struct fs *, ufs_lbn_t, int); +static int mapacct_ufs2(struct vnode *, ufs2_daddr_t *, ufs2_daddr_t *, + struct fs *, ufs_lbn_t, int); +static int ffs_copyonwrite(struct vnode *, struct buf *); +static int readblock(struct buf *, 
ufs2_daddr_t); + +/* + * To ensure the consistency of snapshots across crashes, we must + * synchronously write out copied blocks before allowing the + * originals to be modified. Because of the rather severe speed + * penalty that this imposes, the following flag allows this + * crash persistence to be disabled. + */ +int dopersistence = 0; + +#ifdef DEBUG +#include +SYSCTL_INT(_debug, OID_AUTO, dopersistence, CTLFLAG_RW, &dopersistence, 0, ""); +static int snapdebug = 0; +SYSCTL_INT(_debug, OID_AUTO, snapdebug, CTLFLAG_RW, &snapdebug, 0, ""); +int collectsnapstats = 0; +SYSCTL_INT(_debug, OID_AUTO, collectsnapstats, CTLFLAG_RW, &collectsnapstats, + 0, ""); +#endif /* DEBUG */ + +/* + * Create a snapshot file and initialize it for the filesystem. + */ +int +ffs_snapshot(mp, snapfile) + struct mount *mp; + char *snapfile; +{ + ufs2_daddr_t numblks, blkno, *blkp, *snapblklist; + int error, cg, snaploc; + int i, size, len, loc; + int flag = mp->mnt_flag; + struct timespec starttime = {0, 0}, endtime; + char saved_nice = 0; + long redo = 0, snaplistsize = 0; + int32_t *lp; + void *space; + struct fs *copy_fs = NULL, *fs = VFSTOUFS(mp)->um_fs; + struct snaphead *snaphead; + struct thread *td = curthread; + struct inode *ip, *xp; + struct buf *bp, *nbp, *ibp, *sbp = NULL; + struct nameidata nd; + struct mount *wrtmp; + struct vattr vat; + struct vnode *vp, *xvp, *nvp, *devvp; + struct uio auio; + struct iovec aiov; + + /* + * Need to serialize access to snapshot code per filesystem. + */ + /* + * Assign a snapshot slot in the superblock. + */ + for (snaploc = 0; snaploc < FSMAXSNAP; snaploc++) + if (fs->fs_snapinum[snaploc] == 0) + break; + if (snaploc == FSMAXSNAP) + return (ENOSPC); + /* + * Create the snapshot file. + */ +restart: + NDINIT(&nd, CREATE, LOCKPARENT | LOCKLEAF, UIO_USERSPACE, snapfile, td); + if ((error = namei(&nd)) != 0) + return (error); + if (nd.ni_vp != NULL) { + vput(nd.ni_vp); + error = EEXIST; + } + if (nd.ni_dvp->v_mount != mp) + error = EXDEV; + if (error) { + NDFREE(&nd, NDF_ONLY_PNBUF); + if (nd.ni_dvp == nd.ni_vp) + vrele(nd.ni_dvp); + else + vput(nd.ni_dvp); + return (error); + } + VATTR_NULL(&vat); + vat.va_type = VREG; + vat.va_mode = S_IRUSR; + vat.va_vaflags |= VA_EXCLUSIVE; + if (VOP_GETWRITEMOUNT(nd.ni_dvp, &wrtmp)) + wrtmp = NULL; + if (wrtmp != mp) + panic("ffs_snapshot: mount mismatch"); + if (vn_start_write(NULL, &wrtmp, V_NOWAIT) != 0) { + NDFREE(&nd, NDF_ONLY_PNBUF); + vput(nd.ni_dvp); + if ((error = vn_start_write(NULL, &wrtmp, + V_XSLEEP | PCATCH)) != 0) + return (error); + goto restart; + } + VOP_LEASE(nd.ni_dvp, td, KERNCRED, LEASE_WRITE); + error = VOP_CREATE(nd.ni_dvp, &nd.ni_vp, &nd.ni_cnd, &vat); + vput(nd.ni_dvp); + if (error) { + NDFREE(&nd, NDF_ONLY_PNBUF); + vn_finished_write(wrtmp); + return (error); + } + vp = nd.ni_vp; + ip = VTOI(vp); + devvp = ip->i_devvp; + /* + * Allocate and copy the last block contents so as to be able + * to set size to that of the filesystem. + */ + numblks = howmany(fs->fs_size, fs->fs_frag); + error = UFS_BALLOC(vp, lblktosize(fs, (off_t)(numblks - 1)), + fs->fs_bsize, KERNCRED, BA_CLRBUF, &bp); + if (error) + goto out; + ip->i_size = lblktosize(fs, (off_t)numblks); + DIP(ip, i_size) = ip->i_size; + ip->i_flag |= IN_CHANGE | IN_UPDATE; + if ((error = readblock(bp, numblks - 1)) != 0) + goto out; + bawrite(bp); + /* + * Preallocate critical data structures so that we can copy + * them in without further allocation after we suspend all + * operations on the filesystem. 
We would like to just release + * the allocated buffers without writing them since they will + * be filled in below once we are ready to go, but this upsets + * the soft update code, so we go ahead and write the new buffers. + * + * Allocate all indirect blocks and mark all of them as not + * needing to be copied. + */ + for (blkno = NDADDR; blkno < numblks; blkno += NINDIR(fs)) { + error = UFS_BALLOC(vp, lblktosize(fs, (off_t)blkno), + fs->fs_bsize, td->td_ucred, BA_METAONLY, &ibp); + if (error) + goto out; + bawrite(ibp); + } + /* + * Allocate copies for the superblock and its summary information. + */ + error = UFS_BALLOC(vp, fs->fs_sblockloc, fs->fs_sbsize, KERNCRED, + 0, &nbp); + if (error) + goto out; + bawrite(nbp); + blkno = fragstoblks(fs, fs->fs_csaddr); + len = howmany(fs->fs_cssize, fs->fs_bsize); + for (loc = 0; loc < len; loc++) { + error = UFS_BALLOC(vp, lblktosize(fs, (off_t)(blkno + loc)), + fs->fs_bsize, KERNCRED, 0, &nbp); + if (error) + goto out; + bawrite(nbp); + } + /* + * Allocate all cylinder group blocks. + */ + for (cg = 0; cg < fs->fs_ncg; cg++) { + error = UFS_BALLOC(vp, lfragtosize(fs, cgtod(fs, cg)), + fs->fs_bsize, KERNCRED, 0, &nbp); + if (error) + goto out; + bawrite(nbp); + } + /* + * Copy all the cylinder group maps. Although the + * filesystem is still active, we hope that only a few + * cylinder groups will change between now and when we + * suspend operations. Thus, we will be able to quickly + * touch up the few cylinder groups that changed during + * the suspension period. + */ + len = howmany(fs->fs_ncg, NBBY); + MALLOC(fs->fs_active, int *, len, M_DEVBUF, M_WAITOK); + bzero(fs->fs_active, len); + for (cg = 0; cg < fs->fs_ncg; cg++) { + error = UFS_BALLOC(vp, lfragtosize(fs, cgtod(fs, cg)), + fs->fs_bsize, KERNCRED, 0, &nbp); + if (error) + goto out; + error = cgaccount(cg, vp, nbp, 1); + bawrite(nbp); + if (error) + goto out; + } + /* + * Change inode to snapshot type file. + */ + ip->i_flags |= SF_SNAPSHOT; + DIP(ip, i_flags) = ip->i_flags; + ip->i_flag |= IN_CHANGE | IN_UPDATE; + /* + * Ensure that the snapshot is completely on disk. + * Since we have marked it as a snapshot it is safe to + * unlock it as no process will be allowed to write to it. + */ + if ((error = VOP_FSYNC(vp, KERNCRED, MNT_WAIT, td)) != 0) + goto out; + VOP_UNLOCK(vp, 0, td); + /* + * All allocations are done, so we can now snapshot the system. + * + * Recind nice scheduling while running with the filesystem suspended. + */ + if (td->td_ksegrp->kg_nice > 0) { + PROC_LOCK(td->td_proc); + mtx_lock_spin(&sched_lock); + saved_nice = td->td_ksegrp->kg_nice; + sched_nice(td->td_ksegrp, 0); + mtx_unlock_spin(&sched_lock); + PROC_UNLOCK(td->td_proc); + } + /* + * Suspend operation on filesystem. + */ + for (;;) { + vn_finished_write(wrtmp); + if ((error = vfs_write_suspend(vp->v_mount)) != 0) { + vn_start_write(NULL, &wrtmp, V_WAIT); + vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, td); + goto out; + } + if (mp->mnt_kern_flag & MNTK_SUSPENDED) + break; + vn_start_write(NULL, &wrtmp, V_WAIT); + } + vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, td); + if (collectsnapstats) + nanotime(&starttime); + /* + * First, copy all the cylinder group maps that have changed. 
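 * To make the two-pass scheme concrete: the first pass, run before
 * the suspension above, copied every cylinder group and set its bit
 * in fs_active; a group's bit is cleared again if the group is
 * modified in the meantime.  So on a filesystem with, say, 128
 * cylinder groups of which only three changed while the preliminary
 * copies were being made, the loop below finds three cleared bits,
 * counts redo up to three, and recopies just those groups with
 * cgaccount pass two.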
+ */ + for (cg = 0; cg < fs->fs_ncg; cg++) { + if ((ACTIVECGNUM(fs, cg) & ACTIVECGOFF(cg)) != 0) + continue; + redo++; + error = UFS_BALLOC(vp, lfragtosize(fs, cgtod(fs, cg)), + fs->fs_bsize, KERNCRED, 0, &nbp); + if (error) + goto out1; + error = cgaccount(cg, vp, nbp, 2); + bawrite(nbp); + if (error) + goto out1; + } + /* + * Grab a copy of the superblock and its summary information. + * We delay writing it until the suspension is released below. + */ + error = bread(vp, lblkno(fs, fs->fs_sblockloc), fs->fs_bsize, + KERNCRED, &sbp); + if (error) { + brelse(sbp); + sbp = NULL; + goto out1; + } + loc = blkoff(fs, fs->fs_sblockloc); + copy_fs = (struct fs *)(sbp->b_data + loc); + bcopy(fs, copy_fs, fs->fs_sbsize); + if ((fs->fs_flags & (FS_UNCLEAN | FS_NEEDSFSCK)) == 0) + copy_fs->fs_clean = 1; + size = fs->fs_bsize < SBLOCKSIZE ? fs->fs_bsize : SBLOCKSIZE; + if (fs->fs_sbsize < size) + bzero(&sbp->b_data[loc + fs->fs_sbsize], size - fs->fs_sbsize); + size = blkroundup(fs, fs->fs_cssize); + if (fs->fs_contigsumsize > 0) + size += fs->fs_ncg * sizeof(int32_t); + space = malloc((u_long)size, M_UFSMNT, M_WAITOK); + copy_fs->fs_csp = space; + bcopy(fs->fs_csp, copy_fs->fs_csp, fs->fs_cssize); + (char *)space += fs->fs_cssize; + loc = howmany(fs->fs_cssize, fs->fs_fsize); + i = fs->fs_frag - loc % fs->fs_frag; + len = (i == fs->fs_frag) ? 0 : i * fs->fs_fsize; + if (len > 0) { + if ((error = bread(devvp, fsbtodb(fs, fs->fs_csaddr + loc), + len, KERNCRED, &bp)) != 0) { + brelse(bp); + free(copy_fs->fs_csp, M_UFSMNT); + bawrite(sbp); + sbp = NULL; + goto out1; + } + bcopy(bp->b_data, space, (u_int)len); + (char *)space += len; + bp->b_flags |= B_INVAL | B_NOCACHE; + brelse(bp); + } + if (fs->fs_contigsumsize > 0) { + copy_fs->fs_maxcluster = lp = space; + for (i = 0; i < fs->fs_ncg; i++) + *lp++ = fs->fs_contigsumsize; + } + /* + * We must check for active files that have been unlinked + * (e.g., with a zero link count). We have to expunge all + * trace of these files from the snapshot so that they are + * not reclaimed prematurely by fsck or unnecessarily dumped. + * We turn off the MNTK_SUSPENDED flag to avoid a panic from + * spec_strategy about writing on a suspended filesystem. + * Note that we skip unlinked snapshot files as they will + * be handled separately below. + * + * We also calculate the needed size for the snapshot list. + */ + snaplistsize = fs->fs_ncg + howmany(fs->fs_cssize, fs->fs_bsize) + + FSMAXSNAP + 1 /* superblock */ + 1 /* last block */ + 1 /* size */; + mp->mnt_kern_flag &= ~MNTK_SUSPENDED; + MNT_ILOCK(mp); +loop: + for (xvp = TAILQ_FIRST(&mp->mnt_nvnodelist); xvp; xvp = nvp) { + /* + * Make sure this vnode wasn't reclaimed in getnewvnode(). + * Start over if it has (it won't be on the list anymore). + */ + if (xvp->v_mount != mp) + goto loop; + nvp = TAILQ_NEXT(xvp, v_nmntvnodes); + VI_LOCK(xvp); + MNT_IUNLOCK(mp); + if ((xvp->v_iflag & VI_XLOCK) || + xvp->v_usecount == 0 || xvp->v_type == VNON || + (VTOI(xvp)->i_flags & SF_SNAPSHOT)) { + VI_UNLOCK(xvp); + MNT_ILOCK(mp); + continue; + } + if (snapdebug) + vprint("ffs_snapshot: busy vnode", xvp); + if (vn_lock(xvp, LK_EXCLUSIVE | LK_INTERLOCK, td) != 0) { + MNT_ILOCK(mp); + goto loop; + } + if (VOP_GETATTR(xvp, &vat, td->td_ucred, td) == 0 && + vat.va_nlink > 0) { + VOP_UNLOCK(xvp, 0, td); + MNT_ILOCK(mp); + continue; + } + xp = VTOI(xvp); + if (ffs_checkfreefile(copy_fs, vp, xp->i_number)) { + VOP_UNLOCK(xvp, 0, td); + MNT_ILOCK(mp); + continue; + } + /* + * If there is a fragment, clear it here. 
+ */ + blkno = 0; + loc = howmany(xp->i_size, fs->fs_bsize) - 1; + if (loc < NDADDR) { + len = fragroundup(fs, blkoff(fs, xp->i_size)); + if (len < fs->fs_bsize) { + ffs_blkfree(copy_fs, vp, DIP(xp, i_db[loc]), + len, xp->i_number); + blkno = DIP(xp, i_db[loc]); + DIP(xp, i_db[loc]) = 0; + } + } + snaplistsize += 1; + if (xp->i_ump->um_fstype == UFS1) + error = expunge_ufs1(vp, xp, copy_fs, fullacct_ufs1, + BLK_NOCOPY); + else + error = expunge_ufs2(vp, xp, copy_fs, fullacct_ufs2, + BLK_NOCOPY); + if (blkno) + DIP(xp, i_db[loc]) = blkno; + if (!error) + error = ffs_freefile(copy_fs, vp, xp->i_number, + xp->i_mode); + VOP_UNLOCK(xvp, 0, td); + if (error) { + free(copy_fs->fs_csp, M_UFSMNT); + bawrite(sbp); + sbp = NULL; + goto out1; + } + MNT_ILOCK(mp); + } + MNT_IUNLOCK(mp); + /* + * If there already exist snapshots on this filesystem, grab a + * reference to their shared lock. If this is the first snapshot + * on this filesystem, we need to allocate a lock for the snapshots + * to share. In either case, acquire the snapshot lock and give + * up our original private lock. + */ + VI_LOCK(devvp); + snaphead = &devvp->v_rdev->si_snapshots; + if ((xp = TAILQ_FIRST(snaphead)) != NULL) { + VI_LOCK(vp); + vp->v_vnlock = ITOV(xp)->v_vnlock; + VI_UNLOCK(devvp); + } else { + struct lock *lkp; + + VI_UNLOCK(devvp); + MALLOC(lkp, struct lock *, sizeof(struct lock), M_UFSMNT, + M_WAITOK); + lockinit(lkp, PVFS, "snaplk", VLKTIMEOUT, + LK_CANRECURSE | LK_NOPAUSE); + VI_LOCK(vp); + vp->v_vnlock = lkp; + } + vn_lock(vp, LK_INTERLOCK | LK_EXCLUSIVE | LK_RETRY, td); + transferlockers(&vp->v_lock, vp->v_vnlock); + lockmgr(&vp->v_lock, LK_RELEASE, NULL, td); + /* + * If this is the first snapshot on this filesystem, then we need + * to allocate the space for the list of preallocated snapshot blocks. + * This list will be refined below, but this preliminary one will + * keep us out of deadlock until the full one is ready. + */ + if (xp == NULL) { + MALLOC(snapblklist, daddr_t *, snaplistsize * sizeof(daddr_t), + M_UFSMNT, M_WAITOK); + blkp = &snapblklist[1]; + *blkp++ = lblkno(fs, fs->fs_sblockloc); + blkno = fragstoblks(fs, fs->fs_csaddr); + for (cg = 0; cg < fs->fs_ncg; cg++) { + if (fragstoblks(fs, cgtod(fs, cg) > blkno)) + break; + *blkp++ = fragstoblks(fs, cgtod(fs, cg)); + } + len = howmany(fs->fs_cssize, fs->fs_bsize); + for (loc = 0; loc < len; loc++) + *blkp++ = blkno + loc; + for (; cg < fs->fs_ncg; cg++) + *blkp++ = fragstoblks(fs, cgtod(fs, cg)); + snapblklist[0] = blkp - snapblklist; + VI_LOCK(devvp); + if (devvp->v_rdev->si_snapblklist != NULL) + panic("ffs_snapshot: non-empty list"); + devvp->v_rdev->si_snapblklist = snapblklist; + devvp->v_rdev->si_snaplistsize = blkp - snapblklist; + VI_UNLOCK(devvp); + } + /* + * Record snapshot inode. Since this is the newest snapshot, + * it must be placed at the end of the list. + */ + VI_LOCK(devvp); + fs->fs_snapinum[snaploc] = ip->i_number; + if (ip->i_nextsnap.tqe_prev != 0) + panic("ffs_snapshot: %d already on list", ip->i_number); + TAILQ_INSERT_TAIL(snaphead, ip, i_nextsnap); + devvp->v_rdev->si_copyonwrite = ffs_copyonwrite; + devvp->v_vflag |= VV_COPYONWRITE; + VI_UNLOCK(devvp); + ASSERT_VOP_LOCKED(vp, "ffs_snapshot vp"); + vp->v_vflag |= VV_SYSTEM; +out1: + /* + * Resume operation on filesystem. 
+ */ + vfs_write_resume(vp->v_mount); + vn_start_write(NULL, &wrtmp, V_WAIT); + if (collectsnapstats && starttime.tv_sec > 0) { + nanotime(&endtime); + timespecsub(&endtime, &starttime); + printf("%s: suspended %ld.%03ld sec, redo %ld of %d\n", + vp->v_mount->mnt_stat.f_mntonname, (long)endtime.tv_sec, + endtime.tv_nsec / 1000000, redo, fs->fs_ncg); + } + if (sbp == NULL) + goto out; + /* + * Copy allocation information from all the snapshots in + * this snapshot and then expunge them from its view. + */ + snaphead = &devvp->v_rdev->si_snapshots; + TAILQ_FOREACH(xp, snaphead, i_nextsnap) { + if (xp == ip) + break; + if (xp->i_ump->um_fstype == UFS1) + error = expunge_ufs1(vp, xp, fs, snapacct_ufs1, + BLK_SNAP); + else + error = expunge_ufs2(vp, xp, fs, snapacct_ufs2, + BLK_SNAP); + if (error) { + fs->fs_snapinum[snaploc] = 0; + goto done; + } + } + /* + * Allocate space for the full list of preallocated snapshot blocks. + */ + MALLOC(snapblklist, daddr_t *, snaplistsize * sizeof(daddr_t), + M_UFSMNT, M_WAITOK); + ip->i_snapblklist = &snapblklist[1]; + /* + * Expunge the blocks used by the snapshots from the set of + * blocks marked as used in the snapshot bitmaps. Also, collect + * the list of allocated blocks in i_snapblklist. + */ + if (ip->i_ump->um_fstype == UFS1) + error = expunge_ufs1(vp, ip, copy_fs, mapacct_ufs1, BLK_SNAP); + else + error = expunge_ufs2(vp, ip, copy_fs, mapacct_ufs2, BLK_SNAP); + if (error) { + fs->fs_snapinum[snaploc] = 0; + FREE(snapblklist, M_UFSMNT); + goto done; + } + if (snaplistsize < ip->i_snapblklist - snapblklist) + panic("ffs_snapshot: list too small"); + snaplistsize = ip->i_snapblklist - snapblklist; + snapblklist[0] = snaplistsize; + ip->i_snapblklist = 0; + /* + * Write out the list of allocated blocks to the end of the snapshot. + */ + auio.uio_iov = &aiov; + auio.uio_iovcnt = 1; + aiov.iov_base = (void *)snapblklist; + aiov.iov_len = snaplistsize * sizeof(daddr_t); + auio.uio_resid = aiov.iov_len;; + auio.uio_offset = ip->i_size; + auio.uio_segflg = UIO_SYSSPACE; + auio.uio_rw = UIO_WRITE; + auio.uio_td = td; + if ((error = VOP_WRITE(vp, &auio, IO_UNIT, td->td_ucred)) != 0) { + fs->fs_snapinum[snaploc] = 0; + FREE(snapblklist, M_UFSMNT); + goto done; + } + /* + * Write the superblock and its summary information + * to the snapshot. + */ + blkno = fragstoblks(fs, fs->fs_csaddr); + len = howmany(fs->fs_cssize, fs->fs_bsize); + space = copy_fs->fs_csp; + for (loc = 0; loc < len; loc++) { + error = bread(vp, blkno + loc, fs->fs_bsize, KERNCRED, &nbp); + if (error) { + brelse(nbp); + fs->fs_snapinum[snaploc] = 0; + FREE(snapblklist, M_UFSMNT); + goto done; + } + bcopy(space, nbp->b_data, fs->fs_bsize); + space = (char *)space + fs->fs_bsize; + bawrite(nbp); + } + /* + * As this is the newest list, it is the most inclusive, so + * should replace the previous list. 
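 * For reference, the list written just above is a bare array of
 * daddr_t values appended at what was the snapshot's end of file:
 * element zero is the element count itself (snaplistsize) and the
 * remaining entries are block numbers, in filesystem-block units,
 * that the snapshot has claimed or preallocated.  The copy-on-write
 * code consults this list so that writes to blocks already owned by
 * a snapshot can be passed through without further copying.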
+ */ + VI_LOCK(devvp); + space = devvp->v_rdev->si_snapblklist; + devvp->v_rdev->si_snapblklist = snapblklist; + devvp->v_rdev->si_snaplistsize = snaplistsize; + VI_UNLOCK(devvp); + if (space != NULL) + FREE(space, M_UFSMNT); +done: + free(copy_fs->fs_csp, M_UFSMNT); + bawrite(sbp); +out: + if (saved_nice > 0) { + PROC_LOCK(td->td_proc); + mtx_lock_spin(&sched_lock); + sched_nice(td->td_ksegrp, saved_nice); + mtx_unlock_spin(&sched_lock); + PROC_UNLOCK(td->td_proc); + } + if (fs->fs_active != 0) { + FREE(fs->fs_active, M_DEVBUF); + fs->fs_active = 0; + } + mp->mnt_flag = flag; + if (error) + (void) UFS_TRUNCATE(vp, (off_t)0, 0, NOCRED, td); + (void) VOP_FSYNC(vp, KERNCRED, MNT_WAIT, td); + if (error) + vput(vp); + else + VOP_UNLOCK(vp, 0, td); + vn_finished_write(wrtmp); + return (error); +} + +/* + * Copy a cylinder group map. All the unallocated blocks are marked + * BLK_NOCOPY so that the snapshot knows that it need not copy them + * if they are later written. If passno is one, then this is a first + * pass, so only setting needs to be done. If passno is 2, then this + * is a revision to a previous pass which must be undone as the + * replacement pass is done. + */ +static int +cgaccount(cg, vp, nbp, passno) + int cg; + struct vnode *vp; + struct buf *nbp; + int passno; +{ + struct buf *bp, *ibp; + struct inode *ip; + struct cg *cgp; + struct fs *fs; + ufs2_daddr_t base, numblks; + int error, len, loc, indiroff; + + ip = VTOI(vp); + fs = ip->i_fs; + error = bread(ip->i_devvp, fsbtodb(fs, cgtod(fs, cg)), + (int)fs->fs_cgsize, KERNCRED, &bp); + if (error) { + brelse(bp); + return (error); + } + cgp = (struct cg *)bp->b_data; + if (!cg_chkmagic(cgp)) { + brelse(bp); + return (EIO); + } + atomic_set_int(&ACTIVECGNUM(fs, cg), ACTIVECGOFF(cg)); + bcopy(bp->b_data, nbp->b_data, fs->fs_cgsize); + if (fs->fs_cgsize < fs->fs_bsize) + bzero(&nbp->b_data[fs->fs_cgsize], + fs->fs_bsize - fs->fs_cgsize); + if (passno == 2) + nbp->b_flags |= B_VALIDSUSPWRT; + numblks = howmany(fs->fs_size, fs->fs_frag); + len = howmany(fs->fs_fpg, fs->fs_frag); + base = cg * fs->fs_fpg / fs->fs_frag; + if (base + len >= numblks) + len = numblks - base - 1; + loc = 0; + if (base < NDADDR) { + for ( ; loc < NDADDR; loc++) { + if (ffs_isblock(fs, cg_blksfree(cgp), loc)) + DIP(ip, i_db[loc]) = BLK_NOCOPY; + else if (passno == 2 && DIP(ip, i_db[loc])== BLK_NOCOPY) + DIP(ip, i_db[loc]) = 0; + else if (passno == 1 && DIP(ip, i_db[loc])== BLK_NOCOPY) + panic("ffs_snapshot: lost direct block"); + } + } + error = UFS_BALLOC(vp, lblktosize(fs, (off_t)(base + loc)), + fs->fs_bsize, KERNCRED, BA_METAONLY, &ibp); + if (error) { + brelse(bp); + return (error); + } + indiroff = (base + loc - NDADDR) % NINDIR(fs); + for ( ; loc < len; loc++, indiroff++) { + if (indiroff >= NINDIR(fs)) { + if (passno == 2) + ibp->b_flags |= B_VALIDSUSPWRT; + bawrite(ibp); + error = UFS_BALLOC(vp, + lblktosize(fs, (off_t)(base + loc)), + fs->fs_bsize, KERNCRED, BA_METAONLY, &ibp); + if (error) { + brelse(bp); + return (error); + } + indiroff = 0; + } + if (ip->i_ump->um_fstype == UFS1) { + if (ffs_isblock(fs, cg_blksfree(cgp), loc)) + ((ufs1_daddr_t *)(ibp->b_data))[indiroff] = + BLK_NOCOPY; + else if (passno == 2 && ((ufs1_daddr_t *)(ibp->b_data)) + [indiroff] == BLK_NOCOPY) + ((ufs1_daddr_t *)(ibp->b_data))[indiroff] = 0; + else if (passno == 1 && ((ufs1_daddr_t *)(ibp->b_data)) + [indiroff] == BLK_NOCOPY) + panic("ffs_snapshot: lost indirect block"); + continue; + } + if (ffs_isblock(fs, cg_blksfree(cgp), loc)) + ((ufs2_daddr_t 
*)(ibp->b_data))[indiroff] = BLK_NOCOPY; + else if (passno == 2 && + ((ufs2_daddr_t *)(ibp->b_data)) [indiroff] == BLK_NOCOPY) + ((ufs2_daddr_t *)(ibp->b_data))[indiroff] = 0; + else if (passno == 1 && + ((ufs2_daddr_t *)(ibp->b_data)) [indiroff] == BLK_NOCOPY) + panic("ffs_snapshot: lost indirect block"); + } + bqrelse(bp); + if (passno == 2) + ibp->b_flags |= B_VALIDSUSPWRT; + bdwrite(ibp); + return (0); +} + +/* + * Before expunging a snapshot inode, note all the + * blocks that it claims with BLK_SNAP so that fsck will + * be able to account for those blocks properly and so + * that this snapshot knows that it need not copy them + * if the other snapshot holding them is freed. This code + * is reproduced once each for UFS1 and UFS2. + */ +static int +expunge_ufs1(snapvp, cancelip, fs, acctfunc, expungetype) + struct vnode *snapvp; + struct inode *cancelip; + struct fs *fs; + int (*acctfunc)(struct vnode *, ufs1_daddr_t *, ufs1_daddr_t *, + struct fs *, ufs_lbn_t, int); + int expungetype; +{ + int i, error, indiroff; + ufs_lbn_t lbn, rlbn; + ufs2_daddr_t len, blkno, numblks, blksperindir; + struct ufs1_dinode *dip; + struct thread *td = curthread; + struct buf *bp; + + /* + * Prepare to expunge the inode. If its inode block has not + * yet been copied, then allocate and fill the copy. + */ + lbn = fragstoblks(fs, ino_to_fsba(fs, cancelip->i_number)); + blkno = 0; + if (lbn < NDADDR) { + blkno = VTOI(snapvp)->i_din1->di_db[lbn]; + } else { + td->td_pflags |= TDP_COWINPROGRESS; + error = UFS_BALLOC(snapvp, lblktosize(fs, (off_t)lbn), + fs->fs_bsize, KERNCRED, BA_METAONLY, &bp); + td->td_pflags &= ~TDP_COWINPROGRESS; + if (error) + return (error); + indiroff = (lbn - NDADDR) % NINDIR(fs); + blkno = ((ufs1_daddr_t *)(bp->b_data))[indiroff]; + bqrelse(bp); + } + if (blkno != 0) { + if ((error = bread(snapvp, lbn, fs->fs_bsize, KERNCRED, &bp))) + return (error); + } else { + error = UFS_BALLOC(snapvp, lblktosize(fs, (off_t)lbn), + fs->fs_bsize, KERNCRED, 0, &bp); + if (error) + return (error); + if ((error = readblock(bp, lbn)) != 0) + return (error); + } + /* + * Set a snapshot inode to be a zero length file, regular files + * to be completely unallocated. + */ + dip = (struct ufs1_dinode *)bp->b_data + + ino_to_fsbo(fs, cancelip->i_number); + if (expungetype == BLK_NOCOPY) + dip->di_mode = 0; + dip->di_size = 0; + dip->di_blocks = 0; + dip->di_flags &= ~SF_SNAPSHOT; + bzero(&dip->di_db[0], (NDADDR + NIADDR) * sizeof(ufs1_daddr_t)); + bdwrite(bp); + /* + * Now go through and expunge all the blocks in the file + * using the function requested. + */ + numblks = howmany(cancelip->i_size, fs->fs_bsize); + if ((error = (*acctfunc)(snapvp, &cancelip->i_din1->di_db[0], + &cancelip->i_din1->di_db[NDADDR], fs, 0, expungetype))) + return (error); + if ((error = (*acctfunc)(snapvp, &cancelip->i_din1->di_ib[0], + &cancelip->i_din1->di_ib[NIADDR], fs, -1, expungetype))) + return (error); + blksperindir = 1; + lbn = -NDADDR; + len = numblks - NDADDR; + rlbn = NDADDR; + for (i = 0; len > 0 && i < NIADDR; i++) { + error = indiracct_ufs1(snapvp, ITOV(cancelip), i, + cancelip->i_din1->di_ib[i], lbn, rlbn, len, + blksperindir, fs, acctfunc, expungetype); + if (error) + return (error); + blksperindir *= NINDIR(fs); + lbn -= blksperindir + 1; + len -= blksperindir; + rlbn += blksperindir; + } + return (0); +} + +/* + * Descend an indirect block chain for vnode cancelvp accounting for all + * its indirect blocks in snapvp. 
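 * As a worked illustration of the caller's bookkeeping (assuming
 * NDADDR == 12 and NINDIR(fs) == 2048): expunge_ufs1() starts with
 * lbn = -12 and blksperindir = 1 for the single indirect block,
 * then steps to lbn = -12 - (2048 + 1) = -2061 for the double
 * indirect and to lbn = -2061 - (2048 * 2048 + 1) for the triple,
 * matching the negative logical-block numbers that UFS uses for
 * indirect metadata, while rlbn tracks the first data block each
 * level maps (12, then 2060, and so on).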
+ */ +static int +indiracct_ufs1(snapvp, cancelvp, level, blkno, lbn, rlbn, remblks, + blksperindir, fs, acctfunc, expungetype) + struct vnode *snapvp; + struct vnode *cancelvp; + int level; + ufs1_daddr_t blkno; + ufs_lbn_t lbn; + ufs_lbn_t rlbn; + ufs_lbn_t remblks; + ufs_lbn_t blksperindir; + struct fs *fs; + int (*acctfunc)(struct vnode *, ufs1_daddr_t *, ufs1_daddr_t *, + struct fs *, ufs_lbn_t, int); + int expungetype; +{ + int error, num, i; + ufs_lbn_t subblksperindir; + struct indir indirs[NIADDR + 2]; + ufs1_daddr_t last, *bap; + struct buf *bp; + + if (blkno == 0) { + if (expungetype == BLK_NOCOPY) + return (0); + panic("indiracct_ufs1: missing indir"); + } + if ((error = ufs_getlbns(cancelvp, rlbn, indirs, &num)) != 0) + return (error); + if (lbn != indirs[num - 1 - level].in_lbn || num < 2) + panic("indiracct_ufs1: botched params"); + /* + * We have to expand bread here since it will deadlock looking + * up the block number for any blocks that are not in the cache. + */ + bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0); + bp->b_blkno = fsbtodb(fs, blkno); + if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 && + (error = readblock(bp, fragstoblks(fs, blkno)))) { + brelse(bp); + return (error); + } + /* + * Account for the block pointers in this indirect block. + */ + last = howmany(remblks, blksperindir); + if (last > NINDIR(fs)) + last = NINDIR(fs); + MALLOC(bap, ufs1_daddr_t *, fs->fs_bsize, M_DEVBUF, M_WAITOK); + bcopy(bp->b_data, (caddr_t)bap, fs->fs_bsize); + bqrelse(bp); + error = (*acctfunc)(snapvp, &bap[0], &bap[last], fs, + level == 0 ? rlbn : -1, expungetype); + if (error || level == 0) + goto out; + /* + * Account for the block pointers in each of the indirect blocks + * in the levels below us. + */ + subblksperindir = blksperindir / NINDIR(fs); + for (lbn++, level--, i = 0; i < last; i++) { + error = indiracct_ufs1(snapvp, cancelvp, level, bap[i], lbn, + rlbn, remblks, subblksperindir, fs, acctfunc, expungetype); + if (error) + goto out; + rlbn += blksperindir; + lbn -= blksperindir; + remblks -= blksperindir; + } +out: + FREE(bap, M_DEVBUF); + return (error); +} + +/* + * Do both snap accounting and map accounting. + */ +static int +fullacct_ufs1(vp, oldblkp, lastblkp, fs, lblkno, exptype) + struct vnode *vp; + ufs1_daddr_t *oldblkp, *lastblkp; + struct fs *fs; + ufs_lbn_t lblkno; + int exptype; /* BLK_SNAP or BLK_NOCOPY */ +{ + int error; + + if ((error = snapacct_ufs1(vp, oldblkp, lastblkp, fs, lblkno, exptype))) + return (error); + return (mapacct_ufs1(vp, oldblkp, lastblkp, fs, lblkno, exptype)); +} + +/* + * Identify a set of blocks allocated in a snapshot inode. 
+ */ +static int +snapacct_ufs1(vp, oldblkp, lastblkp, fs, lblkno, expungetype) + struct vnode *vp; + ufs1_daddr_t *oldblkp, *lastblkp; + struct fs *fs; + ufs_lbn_t lblkno; + int expungetype; /* BLK_SNAP or BLK_NOCOPY */ +{ + struct inode *ip = VTOI(vp); + ufs1_daddr_t blkno, *blkp; + ufs_lbn_t lbn; + struct buf *ibp; + int error; + + for ( ; oldblkp < lastblkp; oldblkp++) { + blkno = *oldblkp; + if (blkno == 0 || blkno == BLK_NOCOPY || blkno == BLK_SNAP) + continue; + lbn = fragstoblks(fs, blkno); + if (lbn < NDADDR) { + blkp = &ip->i_din1->di_db[lbn]; + ip->i_flag |= IN_CHANGE | IN_UPDATE; + } else { + error = UFS_BALLOC(vp, lblktosize(fs, (off_t)lbn), + fs->fs_bsize, KERNCRED, BA_METAONLY, &ibp); + if (error) + return (error); + blkp = &((ufs1_daddr_t *)(ibp->b_data)) + [(lbn - NDADDR) % NINDIR(fs)]; + } + /* + * If we are expunging a snapshot vnode and we + * find a block marked BLK_NOCOPY, then it is + * one that has been allocated to this snapshot after + * we took our current snapshot and can be ignored. + */ + if (expungetype == BLK_SNAP && *blkp == BLK_NOCOPY) { + if (lbn >= NDADDR) + brelse(ibp); + } else { + if (*blkp != 0) + panic("snapacct_ufs1: bad block"); + *blkp = expungetype; + if (lbn >= NDADDR) + bdwrite(ibp); + } + } + return (0); +} + +/* + * Account for a set of blocks allocated in a snapshot inode. + */ +static int +mapacct_ufs1(vp, oldblkp, lastblkp, fs, lblkno, expungetype) + struct vnode *vp; + ufs1_daddr_t *oldblkp, *lastblkp; + struct fs *fs; + ufs_lbn_t lblkno; + int expungetype; +{ + ufs1_daddr_t blkno; + struct inode *ip; + ino_t inum; + int acctit; + + ip = VTOI(vp); + inum = ip->i_number; + if (lblkno == -1) + acctit = 0; + else + acctit = 1; + for ( ; oldblkp < lastblkp; oldblkp++, lblkno++) { + blkno = *oldblkp; + if (blkno == 0 || blkno == BLK_NOCOPY) + continue; + if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP) + *ip->i_snapblklist++ = lblkno; + if (blkno == BLK_SNAP) + blkno = blkstofrags(fs, lblkno); + ffs_blkfree(fs, vp, blkno, fs->fs_bsize, inum); + } + return (0); +} + +/* + * Before expunging a snapshot inode, note all the + * blocks that it claims with BLK_SNAP so that fsck will + * be able to account for those blocks properly and so + * that this snapshot knows that it need not copy them + * if the other snapshot holding them is freed. This code + * is reproduced once each for UFS1 and UFS2. + */ +static int +expunge_ufs2(snapvp, cancelip, fs, acctfunc, expungetype) + struct vnode *snapvp; + struct inode *cancelip; + struct fs *fs; + int (*acctfunc)(struct vnode *, ufs2_daddr_t *, ufs2_daddr_t *, + struct fs *, ufs_lbn_t, int); + int expungetype; +{ + int i, error, indiroff; + ufs_lbn_t lbn, rlbn; + ufs2_daddr_t len, blkno, numblks, blksperindir; + struct ufs2_dinode *dip; + struct thread *td = curthread; + struct buf *bp; + + /* + * Prepare to expunge the inode. If its inode block has not + * yet been copied, then allocate and fill the copy. 
+ */ + lbn = fragstoblks(fs, ino_to_fsba(fs, cancelip->i_number)); + blkno = 0; + if (lbn < NDADDR) { + blkno = VTOI(snapvp)->i_din2->di_db[lbn]; + } else { + td->td_pflags |= TDP_COWINPROGRESS; + error = UFS_BALLOC(snapvp, lblktosize(fs, (off_t)lbn), + fs->fs_bsize, KERNCRED, BA_METAONLY, &bp); + td->td_pflags &= ~TDP_COWINPROGRESS; + if (error) + return (error); + indiroff = (lbn - NDADDR) % NINDIR(fs); + blkno = ((ufs2_daddr_t *)(bp->b_data))[indiroff]; + bqrelse(bp); + } + if (blkno != 0) { + if ((error = bread(snapvp, lbn, fs->fs_bsize, KERNCRED, &bp))) + return (error); + } else { + error = UFS_BALLOC(snapvp, lblktosize(fs, (off_t)lbn), + fs->fs_bsize, KERNCRED, 0, &bp); + if (error) + return (error); + if ((error = readblock(bp, lbn)) != 0) + return (error); + } + /* + * Set a snapshot inode to be a zero length file, regular files + * to be completely unallocated. + */ + dip = (struct ufs2_dinode *)bp->b_data + + ino_to_fsbo(fs, cancelip->i_number); + if (expungetype == BLK_NOCOPY) + dip->di_mode = 0; + dip->di_size = 0; + dip->di_blocks = 0; + dip->di_flags &= ~SF_SNAPSHOT; + bzero(&dip->di_db[0], (NDADDR + NIADDR) * sizeof(ufs2_daddr_t)); + bdwrite(bp); + /* + * Now go through and expunge all the blocks in the file + * using the function requested. + */ + numblks = howmany(cancelip->i_size, fs->fs_bsize); + if ((error = (*acctfunc)(snapvp, &cancelip->i_din2->di_db[0], + &cancelip->i_din2->di_db[NDADDR], fs, 0, expungetype))) + return (error); + if ((error = (*acctfunc)(snapvp, &cancelip->i_din2->di_ib[0], + &cancelip->i_din2->di_ib[NIADDR], fs, -1, expungetype))) + return (error); + blksperindir = 1; + lbn = -NDADDR; + len = numblks - NDADDR; + rlbn = NDADDR; + for (i = 0; len > 0 && i < NIADDR; i++) { + error = indiracct_ufs2(snapvp, ITOV(cancelip), i, + cancelip->i_din2->di_ib[i], lbn, rlbn, len, + blksperindir, fs, acctfunc, expungetype); + if (error) + return (error); + blksperindir *= NINDIR(fs); + lbn -= blksperindir + 1; + len -= blksperindir; + rlbn += blksperindir; + } + return (0); +} + +/* + * Descend an indirect block chain for vnode cancelvp accounting for all + * its indirect blocks in snapvp. + */ +static int +indiracct_ufs2(snapvp, cancelvp, level, blkno, lbn, rlbn, remblks, + blksperindir, fs, acctfunc, expungetype) + struct vnode *snapvp; + struct vnode *cancelvp; + int level; + ufs2_daddr_t blkno; + ufs_lbn_t lbn; + ufs_lbn_t rlbn; + ufs_lbn_t remblks; + ufs_lbn_t blksperindir; + struct fs *fs; + int (*acctfunc)(struct vnode *, ufs2_daddr_t *, ufs2_daddr_t *, + struct fs *, ufs_lbn_t, int); + int expungetype; +{ + int error, num, i; + ufs_lbn_t subblksperindir; + struct indir indirs[NIADDR + 2]; + ufs2_daddr_t last, *bap; + struct buf *bp; + + if (blkno == 0) { + if (expungetype == BLK_NOCOPY) + return (0); + panic("indiracct_ufs2: missing indir"); + } + if ((error = ufs_getlbns(cancelvp, rlbn, indirs, &num)) != 0) + return (error); + if (lbn != indirs[num - 1 - level].in_lbn || num < 2) + panic("indiracct_ufs2: botched params"); + /* + * We have to expand bread here since it will deadlock looking + * up the block number for any blocks that are not in the cache. + */ + bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0); + bp->b_blkno = fsbtodb(fs, blkno); + if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 && + (error = readblock(bp, fragstoblks(fs, blkno)))) { + brelse(bp); + return (error); + } + /* + * Account for the block pointers in this indirect block. 
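 * For instance, at the top of a double indirect chain each pointer
 * spans blksperindir == 2048 data blocks with the default 16K/8-byte
 * layout, so remblks == 5000 remaining blocks gives
 * last = howmany(5000, 2048) = 3 and only the first three pointers
 * of this indirect block are handed to the accounting function.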
+ */ + last = howmany(remblks, blksperindir); + if (last > NINDIR(fs)) + last = NINDIR(fs); + MALLOC(bap, ufs2_daddr_t *, fs->fs_bsize, M_DEVBUF, M_WAITOK); + bcopy(bp->b_data, (caddr_t)bap, fs->fs_bsize); + bqrelse(bp); + error = (*acctfunc)(snapvp, &bap[0], &bap[last], fs, + level == 0 ? rlbn : -1, expungetype); + if (error || level == 0) + goto out; + /* + * Account for the block pointers in each of the indirect blocks + * in the levels below us. + */ + subblksperindir = blksperindir / NINDIR(fs); + for (lbn++, level--, i = 0; i < last; i++) { + error = indiracct_ufs2(snapvp, cancelvp, level, bap[i], lbn, + rlbn, remblks, subblksperindir, fs, acctfunc, expungetype); + if (error) + goto out; + rlbn += blksperindir; + lbn -= blksperindir; + remblks -= blksperindir; + } +out: + FREE(bap, M_DEVBUF); + return (error); +} + +/* + * Do both snap accounting and map accounting. + */ +static int +fullacct_ufs2(vp, oldblkp, lastblkp, fs, lblkno, exptype) + struct vnode *vp; + ufs2_daddr_t *oldblkp, *lastblkp; + struct fs *fs; + ufs_lbn_t lblkno; + int exptype; /* BLK_SNAP or BLK_NOCOPY */ +{ + int error; + + if ((error = snapacct_ufs2(vp, oldblkp, lastblkp, fs, lblkno, exptype))) + return (error); + return (mapacct_ufs2(vp, oldblkp, lastblkp, fs, lblkno, exptype)); +} + +/* + * Identify a set of blocks allocated in a snapshot inode. + */ +static int +snapacct_ufs2(vp, oldblkp, lastblkp, fs, lblkno, expungetype) + struct vnode *vp; + ufs2_daddr_t *oldblkp, *lastblkp; + struct fs *fs; + ufs_lbn_t lblkno; + int expungetype; /* BLK_SNAP or BLK_NOCOPY */ +{ + struct inode *ip = VTOI(vp); + ufs2_daddr_t blkno, *blkp; + ufs_lbn_t lbn; + struct buf *ibp; + int error; + + for ( ; oldblkp < lastblkp; oldblkp++) { + blkno = *oldblkp; + if (blkno == 0 || blkno == BLK_NOCOPY || blkno == BLK_SNAP) + continue; + lbn = fragstoblks(fs, blkno); + if (lbn < NDADDR) { + blkp = &ip->i_din2->di_db[lbn]; + ip->i_flag |= IN_CHANGE | IN_UPDATE; + } else { + error = UFS_BALLOC(vp, lblktosize(fs, (off_t)lbn), + fs->fs_bsize, KERNCRED, BA_METAONLY, &ibp); + if (error) + return (error); + blkp = &((ufs2_daddr_t *)(ibp->b_data)) + [(lbn - NDADDR) % NINDIR(fs)]; + } + /* + * If we are expunging a snapshot vnode and we + * find a block marked BLK_NOCOPY, then it is + * one that has been allocated to this snapshot after + * we took our current snapshot and can be ignored. + */ + if (expungetype == BLK_SNAP && *blkp == BLK_NOCOPY) { + if (lbn >= NDADDR) + brelse(ibp); + } else { + if (*blkp != 0) + panic("snapacct_ufs2: bad block"); + *blkp = expungetype; + if (lbn >= NDADDR) + bdwrite(ibp); + } + } + return (0); +} + +/* + * Account for a set of blocks allocated in a snapshot inode. + */ +static int +mapacct_ufs2(vp, oldblkp, lastblkp, fs, lblkno, expungetype) + struct vnode *vp; + ufs2_daddr_t *oldblkp, *lastblkp; + struct fs *fs; + ufs_lbn_t lblkno; + int expungetype; +{ + ufs2_daddr_t blkno; + struct inode *ip; + ino_t inum; + int acctit; + + ip = VTOI(vp); + inum = ip->i_number; + if (lblkno == -1) + acctit = 0; + else + acctit = 1; + for ( ; oldblkp < lastblkp; oldblkp++, lblkno++) { + blkno = *oldblkp; + if (blkno == 0 || blkno == BLK_NOCOPY) + continue; + if (acctit && expungetype == BLK_SNAP && blkno != BLK_SNAP) + *ip->i_snapblklist++ = lblkno; + if (blkno == BLK_SNAP) + blkno = blkstofrags(fs, lblkno); + ffs_blkfree(fs, vp, blkno, fs->fs_bsize, inum); + } + return (0); +} + +/* + * Decrement extra reference on snapshot when last name is removed. + * It will not be freed until the last open reference goes away. 
+ */ +void +ffs_snapgone(ip) + struct inode *ip; +{ + struct inode *xp; + struct fs *fs; + int snaploc; + + /* + * Find snapshot in incore list. + */ + TAILQ_FOREACH(xp, &ip->i_devvp->v_rdev->si_snapshots, i_nextsnap) + if (xp == ip) + break; + if (xp != NULL) + vrele(ITOV(ip)); + else if (snapdebug) + printf("ffs_snapgone: lost snapshot vnode %d\n", + ip->i_number); + /* + * Delete snapshot inode from superblock. Keep list dense. + */ + fs = ip->i_fs; + for (snaploc = 0; snaploc < FSMAXSNAP; snaploc++) + if (fs->fs_snapinum[snaploc] == ip->i_number) + break; + if (snaploc < FSMAXSNAP) { + for (snaploc++; snaploc < FSMAXSNAP; snaploc++) { + if (fs->fs_snapinum[snaploc] == 0) + break; + fs->fs_snapinum[snaploc - 1] = fs->fs_snapinum[snaploc]; + } + fs->fs_snapinum[snaploc - 1] = 0; + } +} + +/* + * Prepare a snapshot file for being removed. + */ +void +ffs_snapremove(vp) + struct vnode *vp; +{ + struct inode *ip; + struct vnode *devvp; + struct lock *lkp; + struct buf *ibp; + struct fs *fs; + struct thread *td = curthread; + ufs2_daddr_t numblks, blkno, dblk, *snapblklist; + int error, loc, last; + + ip = VTOI(vp); + fs = ip->i_fs; + devvp = ip->i_devvp; + /* + * If active, delete from incore list (this snapshot may + * already have been in the process of being deleted, so + * would not have been active). + * + * Clear copy-on-write flag if last snapshot. + */ + if (ip->i_nextsnap.tqe_prev != 0) { + VI_LOCK(devvp); + lockmgr(&vp->v_lock, LK_INTERLOCK | LK_EXCLUSIVE, + VI_MTX(devvp), td); + VI_LOCK(devvp); + TAILQ_REMOVE(&devvp->v_rdev->si_snapshots, ip, i_nextsnap); + ip->i_nextsnap.tqe_prev = 0; + lkp = vp->v_vnlock; + vp->v_vnlock = &vp->v_lock; + lockmgr(lkp, LK_RELEASE, NULL, td); + if (TAILQ_FIRST(&devvp->v_rdev->si_snapshots) != 0) { + VI_UNLOCK(devvp); + } else { + snapblklist = devvp->v_rdev->si_snapblklist; + devvp->v_rdev->si_snapblklist = 0; + devvp->v_rdev->si_snaplistsize = 0; + devvp->v_rdev->si_copyonwrite = 0; + devvp->v_vflag &= ~VV_COPYONWRITE; + lockmgr(lkp, LK_DRAIN|LK_INTERLOCK, VI_MTX(devvp), td); + lockmgr(lkp, LK_RELEASE, NULL, td); + lockdestroy(lkp); + FREE(lkp, M_UFSMNT); + FREE(snapblklist, M_UFSMNT); + } + } + /* + * Clear all BLK_NOCOPY fields. Pass any block claims to other + * snapshots that want them (see ffs_snapblkfree below). 
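ffs_snapgone() above drops a departed snapshot from the superblock's fs_snapinum list by sliding the later entries down, keeping the list dense. Below is a standalone sketch of that array manipulation, with a hypothetical fixed-size list standing in for fs_snapinum and FSMAXSNAP.

/* Standalone sketch of the "keep the list dense" removal in ffs_snapgone(). */
#include <stdio.h>

#define SK_MAXSNAP 20			/* stand-in for FSMAXSNAP */

static void
sk_snapinum_remove(int snapinum[SK_MAXSNAP], int inum)
{
	int loc;

	/* Find the entry being removed. */
	for (loc = 0; loc < SK_MAXSNAP; loc++)
		if (snapinum[loc] == inum)
			break;
	if (loc == SK_MAXSNAP)
		return;			/* not listed */
	/* Slide the remaining entries down and zero the freed slot. */
	for (loc++; loc < SK_MAXSNAP; loc++) {
		if (snapinum[loc] == 0)
			break;
		snapinum[loc - 1] = snapinum[loc];
	}
	snapinum[loc - 1] = 0;
}

int
main(void)
{
	int list[SK_MAXSNAP] = { 5, 9, 17, 0 };
	int i;

	sk_snapinum_remove(list, 9);
	for (i = 0; i < SK_MAXSNAP && list[i] != 0; i++)
		printf("slot %d: inode %d\n", i, list[i]);
	return (0);
}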
+ */ + for (blkno = 1; blkno < NDADDR; blkno++) { + dblk = DIP(ip, i_db[blkno]); + if (dblk == BLK_NOCOPY || dblk == BLK_SNAP) + DIP(ip, i_db[blkno]) = 0; + else if ((dblk == blkstofrags(fs, blkno) && + ffs_snapblkfree(fs, ip->i_devvp, dblk, fs->fs_bsize, + ip->i_number))) { + DIP(ip, i_blocks) -= btodb(fs->fs_bsize); + DIP(ip, i_db[blkno]) = 0; + } + } + numblks = howmany(ip->i_size, fs->fs_bsize); + for (blkno = NDADDR; blkno < numblks; blkno += NINDIR(fs)) { + error = UFS_BALLOC(vp, lblktosize(fs, (off_t)blkno), + fs->fs_bsize, KERNCRED, BA_METAONLY, &ibp); + if (error) + continue; + if (fs->fs_size - blkno > NINDIR(fs)) + last = NINDIR(fs); + else + last = fs->fs_size - blkno; + for (loc = 0; loc < last; loc++) { + if (ip->i_ump->um_fstype == UFS1) { + dblk = ((ufs1_daddr_t *)(ibp->b_data))[loc]; + if (dblk == BLK_NOCOPY || dblk == BLK_SNAP) + ((ufs1_daddr_t *)(ibp->b_data))[loc]= 0; + else if ((dblk == blkstofrags(fs, blkno) && + ffs_snapblkfree(fs, ip->i_devvp, dblk, + fs->fs_bsize, ip->i_number))) { + ip->i_din1->di_blocks -= + btodb(fs->fs_bsize); + ((ufs1_daddr_t *)(ibp->b_data))[loc]= 0; + } + continue; + } + dblk = ((ufs2_daddr_t *)(ibp->b_data))[loc]; + if (dblk == BLK_NOCOPY || dblk == BLK_SNAP) + ((ufs2_daddr_t *)(ibp->b_data))[loc] = 0; + else if ((dblk == blkstofrags(fs, blkno) && + ffs_snapblkfree(fs, ip->i_devvp, dblk, + fs->fs_bsize, ip->i_number))) { + ip->i_din2->di_blocks -= btodb(fs->fs_bsize); + ((ufs2_daddr_t *)(ibp->b_data))[loc] = 0; + } + } + bawrite(ibp); + } + /* + * Clear snapshot flag and drop reference. + */ + ip->i_flags &= ~SF_SNAPSHOT; + DIP(ip, i_flags) = ip->i_flags; + ip->i_flag |= IN_CHANGE | IN_UPDATE; +} + +/* + * Notification that a block is being freed. Return zero if the free + * should be allowed to proceed. Return non-zero if the snapshot file + * wants to claim the block. The block will be claimed if it is an + * uncopied part of one of the snapshots. It will be freed if it is + * either a BLK_NOCOPY or has already been copied in all of the snapshots. + * If a fragment is being freed, then all snapshots that care about + * it must make a copy since a snapshot file can only claim full sized + * blocks. Note that if more than one snapshot file maps the block, + * we can pick one at random to claim it. Since none of the snapshots + * can change, we are assurred that they will all see the same unmodified + * image. When deleting a snapshot file (see ffs_snapremove above), we + * must push any of these claimed blocks to one of the other snapshots + * that maps it. These claimed blocks are easily identified as they will + * have a block number equal to their logical block number within the + * snapshot. A copied block can never have this property because they + * must always have been allocated from a BLK_NOCOPY location. + */ +int +ffs_snapblkfree(fs, devvp, bno, size, inum) + struct fs *fs; + struct vnode *devvp; + ufs2_daddr_t bno; + long size; + ino_t inum; +{ + struct buf *ibp, *cbp, *savedcbp = 0; + struct thread *td = curthread; + struct inode *ip; + struct vnode *vp = NULL; + ufs_lbn_t lbn; + ufs2_daddr_t blkno; + int indiroff = 0, snapshot_locked = 0, error = 0, claimedblk = 0; + struct snaphead *snaphead; + + lbn = fragstoblks(fs, bno); +retry: + VI_LOCK(devvp); + snaphead = &devvp->v_rdev->si_snapshots; + TAILQ_FOREACH(ip, snaphead, i_nextsnap) { + vp = ITOV(ip); + /* + * Lookup block being written. 
+ */ + if (lbn < NDADDR) { + blkno = DIP(ip, i_db[lbn]); + } else { + if (snapshot_locked == 0 && + lockmgr(vp->v_vnlock, + LK_INTERLOCK | LK_EXCLUSIVE | LK_SLEEPFAIL, + VI_MTX(devvp), td) != 0) + goto retry; + snapshot_locked = 1; + td->td_pflags |= TDP_COWINPROGRESS; + error = UFS_BALLOC(vp, lblktosize(fs, (off_t)lbn), + fs->fs_bsize, KERNCRED, BA_METAONLY, &ibp); + td->td_pflags &= ~TDP_COWINPROGRESS; + if (error) + break; + indiroff = (lbn - NDADDR) % NINDIR(fs); + if (ip->i_ump->um_fstype == UFS1) + blkno=((ufs1_daddr_t *)(ibp->b_data))[indiroff]; + else + blkno=((ufs2_daddr_t *)(ibp->b_data))[indiroff]; + } + /* + * Check to see if block needs to be copied. + */ + if (blkno == 0) { + /* + * A block that we map is being freed. If it has not + * been claimed yet, we will claim or copy it (below). + */ + claimedblk = 1; + } else if (blkno == BLK_SNAP) { + /* + * No previous snapshot claimed the block, + * so it will be freed and become a BLK_NOCOPY + * (don't care) for us. + */ + if (claimedblk) + panic("snapblkfree: inconsistent block type"); + if (snapshot_locked == 0 && + lockmgr(vp->v_vnlock, + LK_INTERLOCK | LK_EXCLUSIVE | LK_NOWAIT, + VI_MTX(devvp), td) != 0) { + if (lbn >= NDADDR) + bqrelse(ibp); + vn_lock(vp, LK_EXCLUSIVE | LK_SLEEPFAIL, td); + goto retry; + } + snapshot_locked = 1; + if (lbn < NDADDR) { + DIP(ip, i_db[lbn]) = BLK_NOCOPY; + ip->i_flag |= IN_CHANGE | IN_UPDATE; + } else if (ip->i_ump->um_fstype == UFS1) { + ((ufs1_daddr_t *)(ibp->b_data))[indiroff] = + BLK_NOCOPY; + bdwrite(ibp); + } else { + ((ufs2_daddr_t *)(ibp->b_data))[indiroff] = + BLK_NOCOPY; + bdwrite(ibp); + } + continue; + } else /* BLK_NOCOPY or default */ { + /* + * If the snapshot has already copied the block + * (default), or does not care about the block, + * it is not needed. + */ + if (lbn >= NDADDR) + bqrelse(ibp); + continue; + } + /* + * If this is a full size block, we will just grab it + * and assign it to the snapshot inode. Otherwise we + * will proceed to copy it. See explanation for this + * routine as to why only a single snapshot needs to + * claim this block. + */ + if (snapshot_locked == 0 && + lockmgr(vp->v_vnlock, + LK_INTERLOCK | LK_EXCLUSIVE | LK_NOWAIT, + VI_MTX(devvp), td) != 0) { + if (lbn >= NDADDR) + bqrelse(ibp); + vn_lock(vp, LK_EXCLUSIVE | LK_SLEEPFAIL, td); + goto retry; + } + snapshot_locked = 1; + if (size == fs->fs_bsize) { +#ifdef DEBUG + if (snapdebug) + printf("%s %d lbn %jd from inum %d\n", + "Grabonremove: snapino", ip->i_number, + (intmax_t)lbn, inum); +#endif + if (lbn < NDADDR) { + DIP(ip, i_db[lbn]) = bno; + } else if (ip->i_ump->um_fstype == UFS1) { + ((ufs1_daddr_t *)(ibp->b_data))[indiroff] = bno; + bdwrite(ibp); + } else { + ((ufs2_daddr_t *)(ibp->b_data))[indiroff] = bno; + bdwrite(ibp); + } + DIP(ip, i_blocks) += btodb(size); + ip->i_flag |= IN_CHANGE | IN_UPDATE; + VOP_UNLOCK(vp, 0, td); + return (1); + } + if (lbn >= NDADDR) + bqrelse(ibp); + /* + * Allocate the block into which to do the copy. Note that this + * allocation will never require any additional allocations for + * the snapshot inode. 
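The branches just shown give ffs_snapblkfree() its whole policy: an uncopied block (pointer still zero) is claimed outright if a full block is being released, or copied if only a fragment is (the copy path follows below); a BLK_SNAP pointer is downgraded to BLK_NOCOPY and the free proceeds; anything else means this snapshot no longer needs the block. A standalone decision sketch follows, using illustrative sentinel values rather than the kernel's definitions.

/*
 * Standalone sketch of the per-snapshot decision made by ffs_snapblkfree()
 * above.  Sentinel values and names are illustrative stand-ins.
 */
#include <stdio.h>
#include <stdint.h>

#define SK_BLK_NOCOPY (-1)	/* snapshot does not care about this block */
#define SK_BLK_SNAP   (-2)	/* uncopied block belonging to the snapshot */

enum sk_action { SK_FREE_IT, SK_CLAIM_IT, SK_COPY_IT };

/*
 * snapptr is the snapshot's current pointer for the freed logical block;
 * fullsize is nonzero when a full filesystem block is being released.
 */
static enum sk_action
sk_snapblkfree(int64_t snapptr, int fullsize)
{
	if (snapptr == 0)
		/* Uncopied part of the snapshot: claim a block, copy a frag. */
		return (fullsize ? SK_CLAIM_IT : SK_COPY_IT);
	if (snapptr == SK_BLK_SNAP)
		/* Nothing claimed it; it becomes "don't care" here. */
		return (SK_FREE_IT);
	/* SK_BLK_NOCOPY or an already-copied block: the free may proceed. */
	return (SK_FREE_IT);
}

int
main(void)
{
	printf("full, uncopied: %d\n", sk_snapblkfree(0, 1));
	printf("frag, uncopied: %d\n", sk_snapblkfree(0, 0));
	printf("BLK_SNAP:       %d\n", sk_snapblkfree(SK_BLK_SNAP, 1));
	printf("already copied: %d\n", sk_snapblkfree(12345, 1));
	return (0);
}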
+ */ + td->td_pflags |= TDP_COWINPROGRESS; + error = UFS_BALLOC(vp, lblktosize(fs, (off_t)lbn), + fs->fs_bsize, KERNCRED, 0, &cbp); + td->td_pflags &= ~TDP_COWINPROGRESS; + if (error) + break; +#ifdef DEBUG + if (snapdebug) + printf("%s%d lbn %jd %s %d size %ld to blkno %jd\n", + "Copyonremove: snapino ", ip->i_number, + (intmax_t)lbn, "for inum", inum, size, + (intmax_t)cbp->b_blkno); +#endif + /* + * If we have already read the old block contents, then + * simply copy them to the new block. Note that we need + * to synchronously write snapshots that have not been + * unlinked, and hence will be visible after a crash, + * to ensure their integrity. + */ + if (savedcbp != 0) { + bcopy(savedcbp->b_data, cbp->b_data, fs->fs_bsize); + bawrite(cbp); + if (dopersistence && ip->i_effnlink > 0) + (void) VOP_FSYNC(vp, KERNCRED, MNT_WAIT, td); + continue; + } + /* + * Otherwise, read the old block contents into the buffer. + */ + if ((error = readblock(cbp, lbn)) != 0) { + bzero(cbp->b_data, fs->fs_bsize); + bawrite(cbp); + if (dopersistence && ip->i_effnlink > 0) + (void) VOP_FSYNC(vp, KERNCRED, MNT_WAIT, td); + break; + } + savedcbp = cbp; + } + /* + * Note that we need to synchronously write snapshots that + * have not been unlinked, and hence will be visible after + * a crash, to ensure their integrity. + */ + if (savedcbp) { + vp = savedcbp->b_vp; + bawrite(savedcbp); + if (dopersistence && VTOI(vp)->i_effnlink > 0) + (void) VOP_FSYNC(vp, KERNCRED, MNT_WAIT, td); + } + /* + * If we have been unable to allocate a block in which to do + * the copy, then return non-zero so that the fragment will + * not be freed. Although space will be lost, the snapshot + * will stay consistent. + */ + if (snapshot_locked) + VOP_UNLOCK(vp, 0, td); + else + VI_UNLOCK(devvp); + return (error); +} + +/* + * Associate snapshot files when mounting. + */ +void +ffs_snapshot_mount(mp) + struct mount *mp; +{ + struct ufsmount *ump = VFSTOUFS(mp); + struct vnode *devvp = ump->um_devvp; + struct fs *fs = ump->um_fs; + struct thread *td = curthread; + struct snaphead *snaphead; + struct vnode *vp; + struct inode *ip, *xp; + struct uio auio; + struct iovec aiov; + void *snapblklist; + char *reason; + daddr_t snaplistsize; + int error, snaploc, loc; + + /* + * XXX The following needs to be set before UFS_TRUNCATE or + * VOP_READ can be called. + */ + mp->mnt_stat.f_iosize = fs->fs_bsize; + /* + * Process each snapshot listed in the superblock. + */ + vp = NULL; + snaphead = &devvp->v_rdev->si_snapshots; + for (snaploc = 0; snaploc < FSMAXSNAP; snaploc++) { + if (fs->fs_snapinum[snaploc] == 0) + break; + if ((error = VFS_VGET(mp, fs->fs_snapinum[snaploc], + LK_EXCLUSIVE, &vp)) != 0){ + printf("ffs_snapshot_mount: vget failed %d\n", error); + continue; + } + ip = VTOI(vp); + if ((ip->i_flags & SF_SNAPSHOT) == 0 || ip->i_size == + lblktosize(fs, howmany(fs->fs_size, fs->fs_frag))) { + if ((ip->i_flags & SF_SNAPSHOT) == 0) { + reason = "non-snapshot"; + } else { + reason = "old format snapshot"; + (void)UFS_TRUNCATE(vp, (off_t)0, 0, NOCRED, td); + (void)VOP_FSYNC(vp, KERNCRED, MNT_WAIT, td); + } + printf("ffs_snapshot_mount: %s inode %d\n", + reason, fs->fs_snapinum[snaploc]); + vput(vp); + vp = NULL; + for (loc = snaploc + 1; loc < FSMAXSNAP; loc++) { + if (fs->fs_snapinum[loc] == 0) + break; + fs->fs_snapinum[loc - 1] = fs->fs_snapinum[loc]; + } + fs->fs_snapinum[loc - 1] = 0; + snaploc--; + continue; + } + /* + * If there already exist snapshots on this filesystem, grab a + * reference to their shared lock. 
If this is the first snapshot + * on this filesystem, we need to allocate a lock for the + * snapshots to share. In either case, acquire the snapshot + * lock and give up our original private lock. + */ + VI_LOCK(devvp); + if ((xp = TAILQ_FIRST(snaphead)) != NULL) { + VI_LOCK(vp); + vp->v_vnlock = ITOV(xp)->v_vnlock; + VI_UNLOCK(devvp); + } else { + struct lock *lkp; + + VI_UNLOCK(devvp); + MALLOC(lkp, struct lock *, sizeof(struct lock), + M_UFSMNT, M_WAITOK); + lockinit(lkp, PVFS, "snaplk", VLKTIMEOUT, + LK_CANRECURSE | LK_NOPAUSE); + VI_LOCK(vp); + vp->v_vnlock = lkp; + } + vn_lock(vp, LK_INTERLOCK | LK_EXCLUSIVE | LK_RETRY, td); + transferlockers(&vp->v_lock, vp->v_vnlock); + lockmgr(&vp->v_lock, LK_RELEASE, NULL, td); + /* + * Link it onto the active snapshot list. + */ + VI_LOCK(devvp); + if (ip->i_nextsnap.tqe_prev != 0) + panic("ffs_snapshot_mount: %d already on list", + ip->i_number); + else + TAILQ_INSERT_TAIL(snaphead, ip, i_nextsnap); + vp->v_vflag |= VV_SYSTEM; + VI_UNLOCK(devvp); + VOP_UNLOCK(vp, 0, td); + } + /* + * No usable snapshots found. + */ + if (vp == NULL) + return; + /* + * Allocate the space for the block hints list. We always want to + * use the list from the newest snapshot. + */ + auio.uio_iov = &aiov; + auio.uio_iovcnt = 1; + aiov.iov_base = (void *)&snaplistsize; + aiov.iov_len = sizeof(snaplistsize); + auio.uio_resid = aiov.iov_len; + auio.uio_offset = + lblktosize(fs, howmany(fs->fs_size, fs->fs_frag)); + auio.uio_segflg = UIO_SYSSPACE; + auio.uio_rw = UIO_READ; + auio.uio_td = td; + vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, td); + if ((error = VOP_READ(vp, &auio, IO_UNIT, td->td_ucred)) != 0) { + printf("ffs_snapshot_mount: read_1 failed %d\n", error); + VOP_UNLOCK(vp, 0, td); + return; + } + MALLOC(snapblklist, void *, snaplistsize * sizeof(daddr_t), + M_UFSMNT, M_WAITOK); + auio.uio_iovcnt = 1; + aiov.iov_base = snapblklist; + aiov.iov_len = snaplistsize * sizeof (daddr_t); + auio.uio_resid = aiov.iov_len; + auio.uio_offset -= sizeof(snaplistsize); + if ((error = VOP_READ(vp, &auio, IO_UNIT, td->td_ucred)) != 0) { + printf("ffs_snapshot_mount: read_2 failed %d\n", error); + VOP_UNLOCK(vp, 0, td); + FREE(snapblklist, M_UFSMNT); + return; + } + VOP_UNLOCK(vp, 0, td); + VI_LOCK(devvp); + ASSERT_VOP_LOCKED(devvp, "ffs_snapshot_mount"); + devvp->v_rdev->si_snaplistsize = snaplistsize; + devvp->v_rdev->si_snapblklist = (daddr_t *)snapblklist; + devvp->v_rdev->si_copyonwrite = ffs_copyonwrite; + devvp->v_vflag |= VV_COPYONWRITE; + VI_UNLOCK(devvp); +} + +/* + * Disassociate snapshot files when unmounting. 
+ */ +void +ffs_snapshot_unmount(mp) + struct mount *mp; +{ + struct vnode *devvp = VFSTOUFS(mp)->um_devvp; + struct snaphead *snaphead = &devvp->v_rdev->si_snapshots; + struct lock *lkp = NULL; + struct inode *xp; + struct vnode *vp; + + VI_LOCK(devvp); + while ((xp = TAILQ_FIRST(snaphead)) != 0) { + vp = ITOV(xp); + lkp = vp->v_vnlock; + vp->v_vnlock = &vp->v_lock; + TAILQ_REMOVE(snaphead, xp, i_nextsnap); + xp->i_nextsnap.tqe_prev = 0; + if (xp->i_effnlink > 0) { + VI_UNLOCK(devvp); + vrele(vp); + VI_LOCK(devvp); + } + } + if (devvp->v_rdev->si_snapblklist != NULL) { + FREE(devvp->v_rdev->si_snapblklist, M_UFSMNT); + devvp->v_rdev->si_snapblklist = NULL; + devvp->v_rdev->si_snaplistsize = 0; + } + if (lkp != NULL) { + lockdestroy(lkp); + FREE(lkp, M_UFSMNT); + } + ASSERT_VOP_LOCKED(devvp, "ffs_snapshot_unmount"); + devvp->v_rdev->si_copyonwrite = 0; + devvp->v_vflag &= ~VV_COPYONWRITE; + VI_UNLOCK(devvp); +} + +/* + * Check for need to copy block that is about to be written, + * copying the block if necessary. + */ +static int +ffs_copyonwrite(devvp, bp) + struct vnode *devvp; + struct buf *bp; +{ + struct snaphead *snaphead; + struct buf *ibp, *cbp, *savedcbp = 0; + struct thread *td = curthread; + struct fs *fs; + struct inode *ip; + struct vnode *vp = 0; + ufs2_daddr_t lbn, blkno, *snapblklist; + int lower, upper, mid, indiroff, snapshot_locked = 0, error = 0; + + if (td->td_pflags & TDP_COWINPROGRESS) + panic("ffs_copyonwrite: recursive call"); + /* + * First check to see if it is in the preallocated list. + * By doing this check we avoid several potential deadlocks. + */ + VI_LOCK(devvp); + snaphead = &devvp->v_rdev->si_snapshots; + ip = TAILQ_FIRST(snaphead); + fs = ip->i_fs; + lbn = fragstoblks(fs, dbtofsb(fs, bp->b_blkno)); + snapblklist = devvp->v_rdev->si_snapblklist; + upper = devvp->v_rdev->si_snaplistsize - 1; + lower = 1; + while (lower <= upper) { + mid = (lower + upper) / 2; + if (snapblklist[mid] == lbn) + break; + if (snapblklist[mid] < lbn) + lower = mid + 1; + else + upper = mid - 1; + } + if (lower <= upper) { + VI_UNLOCK(devvp); + return (0); + } + /* + * Not in the precomputed list, so check the snapshots. + */ +retry: + TAILQ_FOREACH(ip, snaphead, i_nextsnap) { + vp = ITOV(ip); + /* + * We ensure that everything of our own that needs to be + * copied will be done at the time that ffs_snapshot is + * called. Thus we can skip the check here which can + * deadlock in doing the lookup in UFS_BALLOC. + */ + if (bp->b_vp == vp) + continue; + /* + * Check to see if block needs to be copied. We do not have + * to hold the snapshot lock while doing this lookup as it + * will never require any additional allocations for the + * snapshot inode. 
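The preallocated-list check at the top of ffs_copyonwrite() above is an ordinary binary search over a sorted array of logical block numbers; a hit means no snapshot needs a copy, so the write can proceed without taking any snapshot locks. A standalone sketch follows; index 0 is treated as reserved only to mirror the kernel loop, which starts searching at index 1.

/*
 * Standalone sketch of the sorted-list lookup done at the top of
 * ffs_copyonwrite() above.
 */
#include <stdio.h>
#include <stdint.h>

/* Return nonzero if lbn appears in list[1..listsize-1] (sorted ascending). */
static int
sk_in_snapblklist(const int64_t *list, int listsize, int64_t lbn)
{
	int lower = 1, upper = listsize - 1, mid;

	while (lower <= upper) {
		mid = (lower + upper) / 2;
		if (list[mid] == lbn)
			return (1);
		if (list[mid] < lbn)
			lower = mid + 1;
		else
			upper = mid - 1;
	}
	return (0);
}

int
main(void)
{
	int64_t list[] = { 0 /* reserved */, 14, 113, 4080, 9125 };

	printf("113: %d, 4081: %d\n",
	    sk_in_snapblklist(list, 5, 113),
	    sk_in_snapblklist(list, 5, 4081));
	return (0);
}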
+ */ + if (lbn < NDADDR) { + blkno = DIP(ip, i_db[lbn]); + } else { + if (snapshot_locked == 0 && + lockmgr(vp->v_vnlock, + LK_INTERLOCK | LK_EXCLUSIVE | LK_SLEEPFAIL, + VI_MTX(devvp), td) != 0) { + VI_LOCK(devvp); + goto retry; + } + snapshot_locked = 1; + td->td_pflags |= TDP_COWINPROGRESS; + error = UFS_BALLOC(vp, lblktosize(fs, (off_t)lbn), + fs->fs_bsize, KERNCRED, BA_METAONLY, &ibp); + td->td_pflags &= ~TDP_COWINPROGRESS; + if (error) + break; + indiroff = (lbn - NDADDR) % NINDIR(fs); + if (ip->i_ump->um_fstype == UFS1) + blkno=((ufs1_daddr_t *)(ibp->b_data))[indiroff]; + else + blkno=((ufs2_daddr_t *)(ibp->b_data))[indiroff]; + bqrelse(ibp); + } +#ifdef DIAGNOSTIC + if (blkno == BLK_SNAP && bp->b_lblkno >= 0) + panic("ffs_copyonwrite: bad copy block"); +#endif + if (blkno != 0) + continue; + /* + * Allocate the block into which to do the copy. Since + * multiple processes may all try to copy the same block, + * we have to recheck our need to do a copy if we sleep + * waiting for the lock. + * + * Because all snapshots on a filesystem share a single + * lock, we ensure that we will never be in competition + * with another process to allocate a block. + */ + if (snapshot_locked == 0 && + lockmgr(vp->v_vnlock, + LK_INTERLOCK | LK_EXCLUSIVE | LK_SLEEPFAIL, + VI_MTX(devvp), td) != 0) { + VI_LOCK(devvp); + goto retry; + } + snapshot_locked = 1; + td->td_pflags |= TDP_COWINPROGRESS; + error = UFS_BALLOC(vp, lblktosize(fs, (off_t)lbn), + fs->fs_bsize, KERNCRED, 0, &cbp); + td->td_pflags &= ~TDP_COWINPROGRESS; + if (error) + break; +#ifdef DEBUG + if (snapdebug) { + printf("Copyonwrite: snapino %d lbn %jd for ", + ip->i_number, (intmax_t)lbn); + if (bp->b_vp == devvp) + printf("fs metadata"); + else + printf("inum %d", VTOI(bp->b_vp)->i_number); + printf(" lblkno %jd to blkno %jd\n", + (intmax_t)bp->b_lblkno, (intmax_t)cbp->b_blkno); + } +#endif + /* + * If we have already read the old block contents, then + * simply copy them to the new block. Note that we need + * to synchronously write snapshots that have not been + * unlinked, and hence will be visible after a crash, + * to ensure their integrity. + */ + if (savedcbp != 0) { + bcopy(savedcbp->b_data, cbp->b_data, fs->fs_bsize); + bawrite(cbp); + if (dopersistence && ip->i_effnlink > 0) + (void) VOP_FSYNC(vp, KERNCRED, MNT_WAIT, td); + continue; + } + /* + * Otherwise, read the old block contents into the buffer. + */ + if ((error = readblock(cbp, lbn)) != 0) { + bzero(cbp->b_data, fs->fs_bsize); + bawrite(cbp); + if (dopersistence && ip->i_effnlink > 0) + (void) VOP_FSYNC(vp, KERNCRED, MNT_WAIT, td); + break; + } + savedcbp = cbp; + } + /* + * Note that we need to synchronously write snapshots that + * have not been unlinked, and hence will be visible after + * a crash, to ensure their integrity. + */ + if (savedcbp) { + vp = savedcbp->b_vp; + bawrite(savedcbp); + if (dopersistence && VTOI(vp)->i_effnlink > 0) + (void) VOP_FSYNC(vp, KERNCRED, MNT_WAIT, td); + } + if (snapshot_locked) + VOP_UNLOCK(vp, 0, td); + else + VI_UNLOCK(devvp); + return (error); +} + +/* + * Read the specified block into the given buffer. + * Much of this boiler-plate comes from bwrite(). 
+ */ +static int +readblock(bp, lbn) + struct buf *bp; + ufs2_daddr_t lbn; +{ + struct uio auio; + struct iovec aiov; + struct thread *td = curthread; + struct inode *ip = VTOI(bp->b_vp); + + aiov.iov_base = bp->b_data; + aiov.iov_len = bp->b_bcount; + auio.uio_iov = &aiov; + auio.uio_iovcnt = 1; + auio.uio_offset = dbtob(fsbtodb(ip->i_fs, blkstofrags(ip->i_fs, lbn))); + auio.uio_resid = bp->b_bcount; + auio.uio_rw = UIO_READ; + auio.uio_segflg = UIO_SYSSPACE; + auio.uio_td = td; + return (physio(ip->i_devvp->v_rdev, &auio, 0)); +} diff --git a/src/sys/ufs/ffs/ffs_softdep.c b/src/sys/ufs/ffs/ffs_softdep.c new file mode 100644 index 0000000..51c92d3 --- /dev/null +++ b/src/sys/ufs/ffs/ffs_softdep.c @@ -0,0 +1,5933 @@ +/* + * Copyright 1998, 2000 Marshall Kirk McKusick. All Rights Reserved. + * + * The soft updates code is derived from the appendix of a University + * of Michigan technical report (Gregory R. Ganger and Yale N. Patt, + * "Soft Updates: A Solution to the Metadata Update Problem in File + * Systems", CSE-TR-254-95, August 1995). + * + * Further information about soft updates can be obtained from: + * + * Marshall Kirk McKusick http://www.mckusick.com/softdep/ + * 1614 Oxford Street mckusick@mckusick.com + * Berkeley, CA 94709-1608 +1-510-843-9542 + * USA + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY MARSHALL KIRK MCKUSICK ``AS IS'' AND ANY + * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL MARSHALL KIRK MCKUSICK BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * from: @(#)ffs_softdep.c 9.59 (McKusick) 6/21/00 + */ + +#include +__FBSDID("$FreeBSD: src/sys/ufs/ffs/ffs_softdep.c,v 1.149 2003/10/23 21:14:08 jhb Exp $"); + +/* + * For now we want the safety net that the DIAGNOSTIC and DEBUG flags provide. + */ +#ifndef DIAGNOSTIC +#define DIAGNOSTIC +#endif +#ifndef DEBUG +#define DEBUG +#endif + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* + * These definitions need to be adapted to the system to which + * this file is being ported. + */ +/* + * malloc types defined for the softdep system. 
+ */ +static MALLOC_DEFINE(M_PAGEDEP, "pagedep","File page dependencies"); +static MALLOC_DEFINE(M_INODEDEP, "inodedep","Inode dependencies"); +static MALLOC_DEFINE(M_NEWBLK, "newblk","New block allocation"); +static MALLOC_DEFINE(M_BMSAFEMAP, "bmsafemap","Block or frag allocated from cyl group map"); +static MALLOC_DEFINE(M_ALLOCDIRECT, "allocdirect","Block or frag dependency for an inode"); +static MALLOC_DEFINE(M_INDIRDEP, "indirdep","Indirect block dependencies"); +static MALLOC_DEFINE(M_ALLOCINDIR, "allocindir","Block dependency for an indirect block"); +static MALLOC_DEFINE(M_FREEFRAG, "freefrag","Previously used frag for an inode"); +static MALLOC_DEFINE(M_FREEBLKS, "freeblks","Blocks freed from an inode"); +static MALLOC_DEFINE(M_FREEFILE, "freefile","Inode deallocated"); +static MALLOC_DEFINE(M_DIRADD, "diradd","New directory entry"); +static MALLOC_DEFINE(M_MKDIR, "mkdir","New directory"); +static MALLOC_DEFINE(M_DIRREM, "dirrem","Directory entry deleted"); +static MALLOC_DEFINE(M_NEWDIRBLK, "newdirblk","Unclaimed new directory block"); + +#define M_SOFTDEP_FLAGS (M_WAITOK | M_USE_RESERVE) + +#define D_PAGEDEP 0 +#define D_INODEDEP 1 +#define D_NEWBLK 2 +#define D_BMSAFEMAP 3 +#define D_ALLOCDIRECT 4 +#define D_INDIRDEP 5 +#define D_ALLOCINDIR 6 +#define D_FREEFRAG 7 +#define D_FREEBLKS 8 +#define D_FREEFILE 9 +#define D_DIRADD 10 +#define D_MKDIR 11 +#define D_DIRREM 12 +#define D_NEWDIRBLK 13 +#define D_LAST D_NEWDIRBLK + +/* + * translate from workitem type to memory type + * MUST match the defines above, such that memtype[D_XXX] == M_XXX + */ +static struct malloc_type *memtype[] = { + M_PAGEDEP, + M_INODEDEP, + M_NEWBLK, + M_BMSAFEMAP, + M_ALLOCDIRECT, + M_INDIRDEP, + M_ALLOCINDIR, + M_FREEFRAG, + M_FREEBLKS, + M_FREEFILE, + M_DIRADD, + M_MKDIR, + M_DIRREM, + M_NEWDIRBLK +}; + +#define DtoM(type) (memtype[type]) + +/* + * Names of malloc types. + */ +#define TYPENAME(type) \ + ((unsigned)(type) < D_LAST ? memtype[type]->ks_shortdesc : "???") +/* + * End system adaptaion definitions. + */ + +/* + * Internal function prototypes. 
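DtoM(type) above is a straight array index, so memtype[] must list the malloc types in exactly the same order as the D_* constants. Below is a standalone sketch of the same translate-by-index idiom with a cheap consistency check; the names are illustrative, not the kernel's.

/*
 * Standalone sketch of the workitem-type-to-allocator-type translation
 * used above: the table is indexed by the type constant, so the two
 * lists must be kept in the same order.
 */
#include <stdio.h>

enum sk_wktype { SK_PAGEDEP, SK_INODEDEP, SK_NEWBLK, SK_LAST = SK_NEWBLK };

static const char *sk_memtype[] = {	/* MUST match enum sk_wktype order */
	"pagedep",
	"inodedep",
	"newblk",
};

#define SK_DtoM(type)		(sk_memtype[type])
#define SK_TYPENAME(type) \
	((unsigned)(type) <= SK_LAST ? sk_memtype[type] : "???")

int
main(void)
{
	/* Cheap consistency check: one table entry per workitem type. */
	if (sizeof(sk_memtype) / sizeof(sk_memtype[0]) != SK_LAST + 1) {
		printf("table out of sync\n");
		return (1);
	}
	printf("type %d allocates from \"%s\"\n", SK_INODEDEP,
	    SK_DtoM(SK_INODEDEP));
	printf("bogus type prints as \"%s\"\n", SK_TYPENAME(99));
	return (0);
}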
+ */ +static void softdep_error(char *, int); +static void drain_output(struct vnode *, int); +static struct buf *getdirtybuf(struct buf **, struct mtx *, int); +static void clear_remove(struct thread *); +static void clear_inodedeps(struct thread *); +static int flush_pagedep_deps(struct vnode *, struct mount *, + struct diraddhd *); +static int flush_inodedep_deps(struct fs *, ino_t); +static int flush_deplist(struct allocdirectlst *, int, int *); +static int handle_written_filepage(struct pagedep *, struct buf *); +static void diradd_inode_written(struct diradd *, struct inodedep *); +static int handle_written_inodeblock(struct inodedep *, struct buf *); +static void handle_allocdirect_partdone(struct allocdirect *); +static void handle_allocindir_partdone(struct allocindir *); +static void initiate_write_filepage(struct pagedep *, struct buf *); +static void handle_written_mkdir(struct mkdir *, int); +static void initiate_write_inodeblock_ufs1(struct inodedep *, struct buf *); +static void initiate_write_inodeblock_ufs2(struct inodedep *, struct buf *); +static void handle_workitem_freefile(struct freefile *); +static void handle_workitem_remove(struct dirrem *, struct vnode *); +static struct dirrem *newdirrem(struct buf *, struct inode *, + struct inode *, int, struct dirrem **); +static void free_diradd(struct diradd *); +static void free_allocindir(struct allocindir *, struct inodedep *); +static void free_newdirblk(struct newdirblk *); +static int indir_trunc(struct freeblks *, ufs2_daddr_t, int, ufs_lbn_t, + ufs2_daddr_t *); +static void deallocate_dependencies(struct buf *, struct inodedep *); +static void free_allocdirect(struct allocdirectlst *, + struct allocdirect *, int); +static int check_inode_unwritten(struct inodedep *); +static int free_inodedep(struct inodedep *); +static void handle_workitem_freeblocks(struct freeblks *, int); +static void merge_inode_lists(struct allocdirectlst *,struct allocdirectlst *); +static void setup_allocindir_phase2(struct buf *, struct inode *, + struct allocindir *); +static struct allocindir *newallocindir(struct inode *, int, ufs2_daddr_t, + ufs2_daddr_t); +static void handle_workitem_freefrag(struct freefrag *); +static struct freefrag *newfreefrag(struct inode *, ufs2_daddr_t, long); +static void allocdirect_merge(struct allocdirectlst *, + struct allocdirect *, struct allocdirect *); +static struct bmsafemap *bmsafemap_lookup(struct buf *); +static int newblk_lookup(struct fs *, ufs2_daddr_t, int, struct newblk **); +static int inodedep_lookup(struct fs *, ino_t, int, struct inodedep **); +static int pagedep_lookup(struct inode *, ufs_lbn_t, int, struct pagedep **); +static void pause_timer(void *); +static int request_cleanup(int, int); +static int process_worklist_item(struct mount *, int); +static void add_to_worklist(struct worklist *); + +/* + * Exported softdep operations. + */ +static void softdep_disk_io_initiation(struct buf *); +static void softdep_disk_write_complete(struct buf *); +static void softdep_deallocate_dependencies(struct buf *); +static void softdep_move_dependencies(struct buf *, struct buf *); +static int softdep_count_dependencies(struct buf *bp, int); + +/* + * Locking primitives. + * + * For a uniprocessor, all we need to do is protect against disk + * interrupts. For a multiprocessor, this lock would have to be + * a mutex. A single mutex is used throughout this file, though + * finer grain locking could be used if contention warranted it. 
+ * + * For a multiprocessor, the sleep call would accept a lock and + * release it after the sleep processing was complete. In a uniprocessor + * implementation there is no such interlock, so we simple mark + * the places where it needs to be done with the `interlocked' form + * of the lock calls. Since the uniprocessor sleep already interlocks + * the spl, there is nothing that really needs to be done. + */ +#ifndef /* NOT */ DEBUG +static struct lockit { + int lkt_spl; +} lk = { 0 }; +#define ACQUIRE_LOCK(lk) (lk)->lkt_spl = splbio() +#define FREE_LOCK(lk) splx((lk)->lkt_spl) + +#else /* DEBUG */ +#define NOHOLDER ((struct thread *)-1) +#define SPECIAL_FLAG ((struct thread *)-2) +static struct lockit { + int lkt_spl; + struct thread *lkt_held; +} lk = { 0, NOHOLDER }; + +static void acquire_lock(struct lockit *); +static void free_lock(struct lockit *); +void softdep_panic(char *); + +#define ACQUIRE_LOCK(lk) acquire_lock(lk) +#define FREE_LOCK(lk) free_lock(lk) + +static void +acquire_lock(lk) + struct lockit *lk; +{ + struct thread *holder; + + if (lk->lkt_held != NOHOLDER) { + holder = lk->lkt_held; + FREE_LOCK(lk); + if (holder == curthread) + panic("softdep_lock: locking against myself"); + else + panic("softdep_lock: lock held by %p", holder); + } + lk->lkt_spl = splbio(); + lk->lkt_held = curthread; +} + +static void +free_lock(lk) + struct lockit *lk; +{ + + if (lk->lkt_held == NOHOLDER) + panic("softdep_unlock: lock not held"); + lk->lkt_held = NOHOLDER; + splx(lk->lkt_spl); +} + +/* + * Function to release soft updates lock and panic. + */ +void +softdep_panic(msg) + char *msg; +{ + + if (lk.lkt_held != NOHOLDER) + FREE_LOCK(&lk); + panic(msg); +} +#endif /* DEBUG */ + +static int interlocked_sleep(struct lockit *, int, void *, struct mtx *, int, + const char *, int); + +/* + * When going to sleep, we must save our SPL so that it does + * not get lost if some other process uses the lock while we + * are sleeping. We restore it after we have slept. This routine + * wraps the interlocking with functions that sleep. The list + * below enumerates the available set of operations. + */ +#define UNKNOWN 0 +#define SLEEP 1 +#define LOCKBUF 2 + +static int +interlocked_sleep(lk, op, ident, mtx, flags, wmesg, timo) + struct lockit *lk; + int op; + void *ident; + struct mtx *mtx; + int flags; + const char *wmesg; + int timo; +{ + struct thread *holder; + int s, retval; + + s = lk->lkt_spl; +# ifdef DEBUG + if (lk->lkt_held == NOHOLDER) + panic("interlocked_sleep: lock not held"); + lk->lkt_held = NOHOLDER; +# endif /* DEBUG */ + switch (op) { + case SLEEP: + retval = msleep(ident, mtx, flags, wmesg, timo); + break; + case LOCKBUF: + retval = BUF_LOCK((struct buf *)ident, flags, mtx); + break; + default: + panic("interlocked_sleep: unknown operation"); + } +# ifdef DEBUG + if (lk->lkt_held != NOHOLDER) { + holder = lk->lkt_held; + FREE_LOCK(lk); + if (holder == curthread) + panic("interlocked_sleep: locking against self"); + else + panic("interlocked_sleep: lock held by %p", holder); + } + lk->lkt_held = curthread; +# endif /* DEBUG */ + lk->lkt_spl = s; + return (retval); +} + +/* + * Place holder for real semaphores. 
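Both the DEBUG variant of the lock above and the semaphore placeholder that follows lean on the same bookkeeping: remember which thread holds the resource and panic on recursion or on release by a non-owner. Here is a minimal userland sketch of that holder-tracking pattern; pthreads stand in for kernel threads, and the real mutual exclusion (splbio) is deliberately omitted since only the sanity checks are being illustrated.

/*
 * Standalone sketch of the holder-tracking idiom used by the DEBUG lock
 * above and the semaphore placeholder below.  Illustrative only.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct sk_lockit {
	int		held;		/* is the lock currently owned? */
	pthread_t	holder;		/* owning thread when held != 0 */
};

static struct sk_lockit sk_lk;

static void
sk_acquire(struct sk_lockit *lk)
{
	if (lk->held) {
		if (pthread_equal(lk->holder, pthread_self()))
			fprintf(stderr, "sk_acquire: locking against myself\n");
		else
			fprintf(stderr, "sk_acquire: lock already held\n");
		abort();
	}
	lk->held = 1;
	lk->holder = pthread_self();
}

static void
sk_release(struct sk_lockit *lk)
{
	if (!lk->held) {
		fprintf(stderr, "sk_release: lock not held\n");
		abort();
	}
	lk->held = 0;
}

int
main(void)
{
	sk_acquire(&sk_lk);
	sk_release(&sk_lk);
	printf("acquire/release ok\n");
	return (0);
}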
+ */ +struct sema { + int value; + struct thread *holder; + char *name; + int prio; + int timo; +}; +static void sema_init(struct sema *, char *, int, int); +static int sema_get(struct sema *, struct lockit *); +static void sema_release(struct sema *); + +static void +sema_init(semap, name, prio, timo) + struct sema *semap; + char *name; + int prio, timo; +{ + + semap->holder = NOHOLDER; + semap->value = 0; + semap->name = name; + semap->prio = prio; + semap->timo = timo; +} + +static int +sema_get(semap, interlock) + struct sema *semap; + struct lockit *interlock; +{ + + if (semap->value++ > 0) { + if (interlock != NULL) { + interlocked_sleep(interlock, SLEEP, (caddr_t)semap, + NULL, semap->prio, semap->name, + semap->timo); + FREE_LOCK(interlock); + } else { + tsleep(semap, semap->prio, semap->name, + semap->timo); + } + return (0); + } + semap->holder = curthread; + if (interlock != NULL) + FREE_LOCK(interlock); + return (1); +} + +static void +sema_release(semap) + struct sema *semap; +{ + + if (semap->value <= 0 || semap->holder != curthread) { + if (lk.lkt_held != NOHOLDER) + FREE_LOCK(&lk); + panic("sema_release: not held"); + } + if (--semap->value > 0) { + semap->value = 0; + wakeup(semap); + } + semap->holder = NOHOLDER; +} + +/* + * Worklist queue management. + * These routines require that the lock be held. + */ +#ifndef /* NOT */ DEBUG +#define WORKLIST_INSERT(head, item) do { \ + (item)->wk_state |= ONWORKLIST; \ + LIST_INSERT_HEAD(head, item, wk_list); \ +} while (0) +#define WORKLIST_REMOVE(item) do { \ + (item)->wk_state &= ~ONWORKLIST; \ + LIST_REMOVE(item, wk_list); \ +} while (0) +#define WORKITEM_FREE(item, type) FREE(item, DtoM(type)) + +#else /* DEBUG */ +static void worklist_insert(struct workhead *, struct worklist *); +static void worklist_remove(struct worklist *); +static void workitem_free(struct worklist *, int); + +#define WORKLIST_INSERT(head, item) worklist_insert(head, item) +#define WORKLIST_REMOVE(item) worklist_remove(item) +#define WORKITEM_FREE(item, type) workitem_free((struct worklist *)item, type) + +static void +worklist_insert(head, item) + struct workhead *head; + struct worklist *item; +{ + + if (lk.lkt_held == NOHOLDER) + panic("worklist_insert: lock not held"); + if (item->wk_state & ONWORKLIST) { + FREE_LOCK(&lk); + panic("worklist_insert: already on list"); + } + item->wk_state |= ONWORKLIST; + LIST_INSERT_HEAD(head, item, wk_list); +} + +static void +worklist_remove(item) + struct worklist *item; +{ + + if (lk.lkt_held == NOHOLDER) + panic("worklist_remove: lock not held"); + if ((item->wk_state & ONWORKLIST) == 0) { + FREE_LOCK(&lk); + panic("worklist_remove: not on list"); + } + item->wk_state &= ~ONWORKLIST; + LIST_REMOVE(item, wk_list); +} + +static void +workitem_free(item, type) + struct worklist *item; + int type; +{ + + if (item->wk_state & ONWORKLIST) { + if (lk.lkt_held != NOHOLDER) + FREE_LOCK(&lk); + panic("workitem_free: still on list"); + } + if (item->wk_type != type) { + if (lk.lkt_held != NOHOLDER) + FREE_LOCK(&lk); + panic("workitem_free: type mismatch"); + } + FREE(item, DtoM(type)); +} +#endif /* DEBUG */ + +/* + * Workitem queue management + */ +static struct workhead softdep_workitem_pending; +static struct worklist *worklist_tail; +static int num_on_worklist; /* number of worklist items to be processed */ +static int softdep_worklist_busy; /* 1 => trying to do unmount */ +static int softdep_worklist_req; /* serialized waiters */ +static int max_softdeps; /* maximum number of structs before slowdown */ +static int 
maxindirdeps = 50; /* max number of indirdeps before slowdown */ +static int tickdelay = 2; /* number of ticks to pause during slowdown */ +static int proc_waiting; /* tracks whether we have a timeout posted */ +static int *stat_countp; /* statistic to count in proc_waiting timeout */ +static struct callout_handle handle; /* handle on posted proc_waiting timeout */ +static struct thread *filesys_syncer; /* proc of filesystem syncer process */ +static int req_clear_inodedeps; /* syncer process flush some inodedeps */ +#define FLUSH_INODES 1 +static int req_clear_remove; /* syncer process flush some freeblks */ +#define FLUSH_REMOVE 2 +#define FLUSH_REMOVE_WAIT 3 +/* + * runtime statistics + */ +static int stat_worklist_push; /* number of worklist cleanups */ +static int stat_blk_limit_push; /* number of times block limit neared */ +static int stat_ino_limit_push; /* number of times inode limit neared */ +static int stat_blk_limit_hit; /* number of times block slowdown imposed */ +static int stat_ino_limit_hit; /* number of times inode slowdown imposed */ +static int stat_sync_limit_hit; /* number of synchronous slowdowns imposed */ +static int stat_indir_blk_ptrs; /* bufs redirtied as indir ptrs not written */ +static int stat_inode_bitmap; /* bufs redirtied as inode bitmap not written */ +static int stat_direct_blk_ptrs;/* bufs redirtied as direct ptrs not written */ +static int stat_dir_entry; /* bufs redirtied as dir entry cannot write */ +#ifdef DEBUG +#include +#include +SYSCTL_INT(_debug, OID_AUTO, max_softdeps, CTLFLAG_RW, &max_softdeps, 0, ""); +SYSCTL_INT(_debug, OID_AUTO, tickdelay, CTLFLAG_RW, &tickdelay, 0, ""); +SYSCTL_INT(_debug, OID_AUTO, maxindirdeps, CTLFLAG_RW, &maxindirdeps, 0, ""); +SYSCTL_INT(_debug, OID_AUTO, worklist_push, CTLFLAG_RW, &stat_worklist_push, 0,""); +SYSCTL_INT(_debug, OID_AUTO, blk_limit_push, CTLFLAG_RW, &stat_blk_limit_push, 0,""); +SYSCTL_INT(_debug, OID_AUTO, ino_limit_push, CTLFLAG_RW, &stat_ino_limit_push, 0,""); +SYSCTL_INT(_debug, OID_AUTO, blk_limit_hit, CTLFLAG_RW, &stat_blk_limit_hit, 0, ""); +SYSCTL_INT(_debug, OID_AUTO, ino_limit_hit, CTLFLAG_RW, &stat_ino_limit_hit, 0, ""); +SYSCTL_INT(_debug, OID_AUTO, sync_limit_hit, CTLFLAG_RW, &stat_sync_limit_hit, 0, ""); +SYSCTL_INT(_debug, OID_AUTO, indir_blk_ptrs, CTLFLAG_RW, &stat_indir_blk_ptrs, 0, ""); +SYSCTL_INT(_debug, OID_AUTO, inode_bitmap, CTLFLAG_RW, &stat_inode_bitmap, 0, ""); +SYSCTL_INT(_debug, OID_AUTO, direct_blk_ptrs, CTLFLAG_RW, &stat_direct_blk_ptrs, 0, ""); +SYSCTL_INT(_debug, OID_AUTO, dir_entry, CTLFLAG_RW, &stat_dir_entry, 0, ""); +#endif /* DEBUG */ + +/* + * Add an item to the end of the work queue. + * This routine requires that the lock be held. + * This is the only routine that adds items to the list. + * The following routine is the only one that removes items + * and does so in order from first to last. + */ +static void +add_to_worklist(wk) + struct worklist *wk; +{ + + if (wk->wk_state & ONWORKLIST) { + if (lk.lkt_held != NOHOLDER) + FREE_LOCK(&lk); + panic("add_to_worklist: already on list"); + } + wk->wk_state |= ONWORKLIST; + if (LIST_FIRST(&softdep_workitem_pending) == NULL) + LIST_INSERT_HEAD(&softdep_workitem_pending, wk, wk_list); + else + LIST_INSERT_AFTER(worklist_tail, wk, wk_list); + worklist_tail = wk; + num_on_worklist += 1; +} + +/* + * Process that runs once per second to handle items in the background queue. + * + * Note that we ensure that everything is done in the order in which they + * appear in the queue. 
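add_to_worklist() above gets FIFO ordering out of a head-insertion list by remembering the most recently appended item: the first work item goes at the head, every later one is inserted after the remembered tail. A standalone sketch of the same trick with plain pointers:

/*
 * Standalone sketch of the FIFO-via-tail-pointer idiom used by
 * add_to_worklist() above.
 */
#include <stdio.h>
#include <stdlib.h>

struct sk_work {
	struct sk_work *next;
	int		id;
};

static struct sk_work *sk_head;		/* first item to process */
static struct sk_work *sk_tail;		/* last item appended */
static int sk_num_on_worklist;

static void
sk_add_to_worklist(struct sk_work *wk)
{
	wk->next = NULL;
	if (sk_head == NULL)
		sk_head = wk;		/* list was empty: insert at head */
	else
		sk_tail->next = wk;	/* otherwise append after the tail */
	sk_tail = wk;
	sk_num_on_worklist++;
}

int
main(void)
{
	int i;

	for (i = 1; i <= 3; i++) {
		struct sk_work *wk = calloc(1, sizeof(*wk));

		wk->id = i;
		sk_add_to_worklist(wk);
	}
	/* Items come off in the order they were queued. */
	while (sk_head != NULL) {
		struct sk_work *wk = sk_head;

		sk_head = wk->next;
		printf("processing item %d\n", wk->id);
		free(wk);
	}
	return (0);
}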
The code below depends on this property to ensure + * that blocks of a file are freed before the inode itself is freed. This + * ordering ensures that no new triples will be generated + * until all the old ones have been purged from the dependency lists. + */ +int +softdep_process_worklist(matchmnt) + struct mount *matchmnt; +{ + struct thread *td = curthread; + int cnt, matchcnt, loopcount; + long starttime; + + /* + * Record the process identifier of our caller so that we can give + * this process preferential treatment in request_cleanup below. + */ + filesys_syncer = td; + matchcnt = 0; + + /* + * There is no danger of having multiple processes run this + * code, but we have to single-thread it when softdep_flushfiles() + * is in operation to get an accurate count of the number of items + * related to its mount point that are in the list. + */ + if (matchmnt == NULL) { + if (softdep_worklist_busy < 0) + return(-1); + softdep_worklist_busy += 1; + } + + /* + * If requested, try removing inode or removal dependencies. + */ + if (req_clear_inodedeps) { + clear_inodedeps(td); + req_clear_inodedeps -= 1; + wakeup_one(&proc_waiting); + } + if (req_clear_remove) { + clear_remove(td); + req_clear_remove -= 1; + wakeup_one(&proc_waiting); + } + loopcount = 1; + starttime = time_second; + while (num_on_worklist > 0) { + if ((cnt = process_worklist_item(matchmnt, 0)) == -1) + break; + else + matchcnt += cnt; + + /* + * If a umount operation wants to run the worklist + * accurately, abort. + */ + if (softdep_worklist_req && matchmnt == NULL) { + matchcnt = -1; + break; + } + + /* + * If requested, try removing inode or removal dependencies. + */ + if (req_clear_inodedeps) { + clear_inodedeps(td); + req_clear_inodedeps -= 1; + wakeup_one(&proc_waiting); + } + if (req_clear_remove) { + clear_remove(td); + req_clear_remove -= 1; + wakeup_one(&proc_waiting); + } + /* + * We do not generally want to stop for buffer space, but if + * we are really being a buffer hog, we will stop and wait. + */ + if (loopcount++ % 128 == 0) + bwillwrite(); + /* + * Never allow processing to run for more than one + * second. Otherwise the other syncer tasks may get + * excessively backlogged. + */ + if (starttime != time_second && matchmnt == NULL) { + matchcnt = -1; + break; + } + } + if (matchmnt == NULL) { + softdep_worklist_busy -= 1; + if (softdep_worklist_req && softdep_worklist_busy == 0) + wakeup(&softdep_worklist_req); + } + return (matchcnt); +} + +/* + * Process one item on the worklist. + */ +static int +process_worklist_item(matchmnt, flags) + struct mount *matchmnt; + int flags; +{ + struct worklist *wk, *wkend; + struct mount *mp; + struct vnode *vp; + int matchcnt = 0; + + /* + * If we are being called because of a process doing a + * copy-on-write, then it is not safe to write as we may + * recurse into the copy-on-write routine. + */ + if (curthread->td_pflags & TDP_COWINPROGRESS) + return (-1); + ACQUIRE_LOCK(&lk); + /* + * Normally we just process each item on the worklist in order. + * However, if we are in a situation where we cannot lock any + * inodes, we have to skip over any dirrem requests whose + * vnodes are resident and locked. 
+ */ + vp = NULL; + LIST_FOREACH(wk, &softdep_workitem_pending, wk_list) { + if (wk->wk_state & INPROGRESS) + continue; + if ((flags & LK_NOWAIT) == 0 || wk->wk_type != D_DIRREM) + break; + wk->wk_state |= INPROGRESS; + FREE_LOCK(&lk); + VFS_VGET(WK_DIRREM(wk)->dm_mnt, WK_DIRREM(wk)->dm_oldinum, + LK_NOWAIT | LK_EXCLUSIVE, &vp); + ACQUIRE_LOCK(&lk); + wk->wk_state &= ~INPROGRESS; + if (vp != NULL) + break; + } + if (wk == 0) { + FREE_LOCK(&lk); + return (-1); + } + /* + * Remove the item to be processed. If we are removing the last + * item on the list, we need to recalculate the tail pointer. + * As this happens rarely and usually when the list is short, + * we just run down the list to find it rather than tracking it + * in the above loop. + */ + WORKLIST_REMOVE(wk); + if (wk == worklist_tail) { + LIST_FOREACH(wkend, &softdep_workitem_pending, wk_list) + if (LIST_NEXT(wkend, wk_list) == NULL) + break; + worklist_tail = wkend; + } + num_on_worklist -= 1; + FREE_LOCK(&lk); + switch (wk->wk_type) { + + case D_DIRREM: + /* removal of a directory entry */ + mp = WK_DIRREM(wk)->dm_mnt; + if (vn_write_suspend_wait(NULL, mp, V_NOWAIT)) + panic("%s: dirrem on suspended filesystem", + "process_worklist_item"); + if (mp == matchmnt) + matchcnt += 1; + handle_workitem_remove(WK_DIRREM(wk), vp); + break; + + case D_FREEBLKS: + /* releasing blocks and/or fragments from a file */ + mp = WK_FREEBLKS(wk)->fb_mnt; + if (vn_write_suspend_wait(NULL, mp, V_NOWAIT)) + panic("%s: freeblks on suspended filesystem", + "process_worklist_item"); + if (mp == matchmnt) + matchcnt += 1; + handle_workitem_freeblocks(WK_FREEBLKS(wk), flags & LK_NOWAIT); + break; + + case D_FREEFRAG: + /* releasing a fragment when replaced as a file grows */ + mp = WK_FREEFRAG(wk)->ff_mnt; + if (vn_write_suspend_wait(NULL, mp, V_NOWAIT)) + panic("%s: freefrag on suspended filesystem", + "process_worklist_item"); + if (mp == matchmnt) + matchcnt += 1; + handle_workitem_freefrag(WK_FREEFRAG(wk)); + break; + + case D_FREEFILE: + /* releasing an inode when its link count drops to 0 */ + mp = WK_FREEFILE(wk)->fx_mnt; + if (vn_write_suspend_wait(NULL, mp, V_NOWAIT)) + panic("%s: freefile on suspended filesystem", + "process_worklist_item"); + if (mp == matchmnt) + matchcnt += 1; + handle_workitem_freefile(WK_FREEFILE(wk)); + break; + + default: + panic("%s_process_worklist: Unknown type %s", + "softdep", TYPENAME(wk->wk_type)); + /* NOTREACHED */ + } + return (matchcnt); +} + +/* + * Move dependencies from one buffer to another. + */ +static void +softdep_move_dependencies(oldbp, newbp) + struct buf *oldbp; + struct buf *newbp; +{ + struct worklist *wk, *wktail; + + if (LIST_FIRST(&newbp->b_dep) != NULL) + panic("softdep_move_dependencies: need merge code"); + wktail = 0; + ACQUIRE_LOCK(&lk); + while ((wk = LIST_FIRST(&oldbp->b_dep)) != NULL) { + LIST_REMOVE(wk, wk_list); + if (wktail == 0) + LIST_INSERT_HEAD(&newbp->b_dep, wk, wk_list); + else + LIST_INSERT_AFTER(wktail, wk, wk_list); + wktail = wk; + } + FREE_LOCK(&lk); +} + +/* + * Purge the work list of all items associated with a particular mount point. + */ +int +softdep_flushworklist(oldmnt, countp, td) + struct mount *oldmnt; + int *countp; + struct thread *td; +{ + struct vnode *devvp; + int count, error = 0; + + /* + * Await our turn to clear out the queue, then serialize access. 
+ */ + while (softdep_worklist_busy) { + softdep_worklist_req += 1; + tsleep(&softdep_worklist_req, PRIBIO, "softflush", 0); + softdep_worklist_req -= 1; + } + softdep_worklist_busy = -1; + /* + * Alternately flush the block device associated with the mount + * point and process any dependencies that the flushing + * creates. We continue until no more worklist dependencies + * are found. + */ + *countp = 0; + devvp = VFSTOUFS(oldmnt)->um_devvp; + while ((count = softdep_process_worklist(oldmnt)) > 0) { + *countp += count; + vn_lock(devvp, LK_EXCLUSIVE | LK_RETRY, td); + error = VOP_FSYNC(devvp, td->td_ucred, MNT_WAIT, td); + VOP_UNLOCK(devvp, 0, td); + if (error) + break; + } + softdep_worklist_busy = 0; + if (softdep_worklist_req) + wakeup(&softdep_worklist_req); + return (error); +} + +/* + * Flush all vnodes and worklist items associated with a specified mount point. + */ +int +softdep_flushfiles(oldmnt, flags, td) + struct mount *oldmnt; + int flags; + struct thread *td; +{ + int error, count, loopcnt; + + error = 0; + + /* + * Alternately flush the vnodes associated with the mount + * point and process any dependencies that the flushing + * creates. In theory, this loop can happen at most twice, + * but we give it a few extra just to be sure. + */ + for (loopcnt = 10; loopcnt > 0; loopcnt--) { + /* + * Do another flush in case any vnodes were brought in + * as part of the cleanup operations. + */ + if ((error = ffs_flushfiles(oldmnt, flags, td)) != 0) + break; + if ((error = softdep_flushworklist(oldmnt, &count, td)) != 0 || + count == 0) + break; + } + /* + * If we are unmounting then it is an error to fail. If we + * are simply trying to downgrade to read-only, then filesystem + * activity can keep us busy forever, so we just fail with EBUSY. + */ + if (loopcnt == 0) { + if (oldmnt->mnt_kern_flag & MNTK_UNMOUNT) + panic("softdep_flushfiles: looping"); + error = EBUSY; + } + return (error); +} + +/* + * Structure hashing. + * + * There are three types of structures that can be looked up: + * 1) pagedep structures identified by mount point, inode number, + * and logical block. + * 2) inodedep structures identified by mount point and inode number. + * 3) newblk structures identified by mount point and + * physical block number. + * + * The "pagedep" and "inodedep" dependency structures are hashed + * separately from the file blocks and inodes to which they correspond. + * This separation helps when the in-memory copy of an inode or + * file block must be replaced. It also obviates the need to access + * an inode or file page when simply updating (or de-allocating) + * dependency structures. Lookup of newblk structures is needed to + * find newly allocated blocks when trying to associate them with + * their allocdirect or allocindir structure. + * + * The lookup routines optionally create and hash a new instance when + * an existing entry is not found. + */ +#define DEPALLOC 0x0001 /* allocate structure if lookup fails */ +#define NODELAY 0x0002 /* cannot do background work */ + +/* + * Structures and routines associated with pagedep caching. + */ +LIST_HEAD(pagedep_hashhead, pagedep) *pagedep_hashtbl; +u_long pagedep_hash; /* size of hash table - 1 */ +#define PAGEDEP_HASH(mp, inum, lbn) \ + (&pagedep_hashtbl[((((register_t)(mp)) >> 13) + (inum) + (lbn)) & \ + pagedep_hash]) +static struct sema pagedep_in_progress; + +/* + * Look up a pagedep. Return 1 if found, 0 if not found or found + * when asked to allocate but not associated with any buffer. 
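PAGEDEP_HASH() above folds the mount pointer, inode number, and logical block number into an index, masking with pagedep_hash (the table size minus one, so the table size is a power of two). Below is a standalone sketch of the same chained lookup-or-allocate pattern, simplified to a single table with no locking and an illustrative key of (inode number, logical block number).

/*
 * Standalone sketch of the chained-hash lookup-or-allocate pattern used
 * for the pagedep/inodedep/newblk structures above.  Illustrative only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define SK_HASHSIZE 64			/* must be a power of two */
#define SK_HASH(inum, lbn) (((inum) + (lbn)) & (SK_HASHSIZE - 1))

struct sk_pagedep {
	struct sk_pagedep *next;
	uint32_t	   inum;
	int64_t		   lbn;
};

static struct sk_pagedep *sk_table[SK_HASHSIZE];

/* Return the entry for (inum, lbn), creating it if not present. */
static struct sk_pagedep *
sk_pagedep_lookup(uint32_t inum, int64_t lbn)
{
	struct sk_pagedep **head = &sk_table[SK_HASH(inum, lbn)];
	struct sk_pagedep *pd;

	for (pd = *head; pd != NULL; pd = pd->next)
		if (pd->inum == inum && pd->lbn == lbn)
			return (pd);
	pd = calloc(1, sizeof(*pd));
	pd->inum = inum;
	pd->lbn = lbn;
	pd->next = *head;		/* insert at head of the chain */
	*head = pd;
	return (pd);
}

int
main(void)
{
	struct sk_pagedep *a = sk_pagedep_lookup(1234, 7);
	struct sk_pagedep *b = sk_pagedep_lookup(1234, 7);

	printf("same entry returned: %s\n", a == b ? "yes" : "no");
	return (0);
}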
+ * If not found, allocate if DEPALLOC flag is passed. + * Found or allocated entry is returned in pagedeppp. + * This routine must be called with splbio interrupts blocked. + */ +static int +pagedep_lookup(ip, lbn, flags, pagedeppp) + struct inode *ip; + ufs_lbn_t lbn; + int flags; + struct pagedep **pagedeppp; +{ + struct pagedep *pagedep; + struct pagedep_hashhead *pagedephd; + struct mount *mp; + int i; + +#ifdef DEBUG + if (lk.lkt_held == NOHOLDER) + panic("pagedep_lookup: lock not held"); +#endif + mp = ITOV(ip)->v_mount; + pagedephd = PAGEDEP_HASH(mp, ip->i_number, lbn); +top: + LIST_FOREACH(pagedep, pagedephd, pd_hash) + if (ip->i_number == pagedep->pd_ino && + lbn == pagedep->pd_lbn && + mp == pagedep->pd_mnt) + break; + if (pagedep) { + *pagedeppp = pagedep; + if ((flags & DEPALLOC) != 0 && + (pagedep->pd_state & ONWORKLIST) == 0) + return (0); + return (1); + } + if ((flags & DEPALLOC) == 0) { + *pagedeppp = NULL; + return (0); + } + if (sema_get(&pagedep_in_progress, &lk) == 0) { + ACQUIRE_LOCK(&lk); + goto top; + } + MALLOC(pagedep, struct pagedep *, sizeof(struct pagedep), M_PAGEDEP, + M_SOFTDEP_FLAGS|M_ZERO); + pagedep->pd_list.wk_type = D_PAGEDEP; + pagedep->pd_mnt = mp; + pagedep->pd_ino = ip->i_number; + pagedep->pd_lbn = lbn; + LIST_INIT(&pagedep->pd_dirremhd); + LIST_INIT(&pagedep->pd_pendinghd); + for (i = 0; i < DAHASHSZ; i++) + LIST_INIT(&pagedep->pd_diraddhd[i]); + ACQUIRE_LOCK(&lk); + LIST_INSERT_HEAD(pagedephd, pagedep, pd_hash); + sema_release(&pagedep_in_progress); + *pagedeppp = pagedep; + return (0); +} + +/* + * Structures and routines associated with inodedep caching. + */ +LIST_HEAD(inodedep_hashhead, inodedep) *inodedep_hashtbl; +static u_long inodedep_hash; /* size of hash table - 1 */ +static long num_inodedep; /* number of inodedep allocated */ +#define INODEDEP_HASH(fs, inum) \ + (&inodedep_hashtbl[((((register_t)(fs)) >> 13) + (inum)) & inodedep_hash]) +static struct sema inodedep_in_progress; + +/* + * Look up an inodedep. Return 1 if found, 0 if not found. + * If not found, allocate if DEPALLOC flag is passed. + * Found or allocated entry is returned in inodedeppp. + * This routine must be called with splbio interrupts blocked. + */ +static int +inodedep_lookup(fs, inum, flags, inodedeppp) + struct fs *fs; + ino_t inum; + int flags; + struct inodedep **inodedeppp; +{ + struct inodedep *inodedep; + struct inodedep_hashhead *inodedephd; + int firsttry; + +#ifdef DEBUG + if (lk.lkt_held == NOHOLDER) + panic("inodedep_lookup: lock not held"); +#endif + firsttry = 1; + inodedephd = INODEDEP_HASH(fs, inum); +top: + LIST_FOREACH(inodedep, inodedephd, id_hash) + if (inum == inodedep->id_ino && fs == inodedep->id_fs) + break; + if (inodedep) { + *inodedeppp = inodedep; + return (1); + } + if ((flags & DEPALLOC) == 0) { + *inodedeppp = NULL; + return (0); + } + /* + * If we are over our limit, try to improve the situation. 
+ */ + if (num_inodedep > max_softdeps && firsttry && (flags & NODELAY) == 0 && + request_cleanup(FLUSH_INODES, 1)) { + firsttry = 0; + goto top; + } + if (sema_get(&inodedep_in_progress, &lk) == 0) { + ACQUIRE_LOCK(&lk); + goto top; + } + num_inodedep += 1; + MALLOC(inodedep, struct inodedep *, sizeof(struct inodedep), + M_INODEDEP, M_SOFTDEP_FLAGS); + inodedep->id_list.wk_type = D_INODEDEP; + inodedep->id_fs = fs; + inodedep->id_ino = inum; + inodedep->id_state = ALLCOMPLETE; + inodedep->id_nlinkdelta = 0; + inodedep->id_savedino1 = NULL; + inodedep->id_savedsize = -1; + inodedep->id_savedextsize = -1; + inodedep->id_buf = NULL; + LIST_INIT(&inodedep->id_pendinghd); + LIST_INIT(&inodedep->id_inowait); + LIST_INIT(&inodedep->id_bufwait); + TAILQ_INIT(&inodedep->id_inoupdt); + TAILQ_INIT(&inodedep->id_newinoupdt); + TAILQ_INIT(&inodedep->id_extupdt); + TAILQ_INIT(&inodedep->id_newextupdt); + ACQUIRE_LOCK(&lk); + LIST_INSERT_HEAD(inodedephd, inodedep, id_hash); + sema_release(&inodedep_in_progress); + *inodedeppp = inodedep; + return (0); +} + +/* + * Structures and routines associated with newblk caching. + */ +LIST_HEAD(newblk_hashhead, newblk) *newblk_hashtbl; +u_long newblk_hash; /* size of hash table - 1 */ +#define NEWBLK_HASH(fs, inum) \ + (&newblk_hashtbl[((((register_t)(fs)) >> 13) + (inum)) & newblk_hash]) +static struct sema newblk_in_progress; + +/* + * Look up a newblk. Return 1 if found, 0 if not found. + * If not found, allocate if DEPALLOC flag is passed. + * Found or allocated entry is returned in newblkpp. + */ +static int +newblk_lookup(fs, newblkno, flags, newblkpp) + struct fs *fs; + ufs2_daddr_t newblkno; + int flags; + struct newblk **newblkpp; +{ + struct newblk *newblk; + struct newblk_hashhead *newblkhd; + + newblkhd = NEWBLK_HASH(fs, newblkno); +top: + LIST_FOREACH(newblk, newblkhd, nb_hash) + if (newblkno == newblk->nb_newblkno && fs == newblk->nb_fs) + break; + if (newblk) { + *newblkpp = newblk; + return (1); + } + if ((flags & DEPALLOC) == 0) { + *newblkpp = NULL; + return (0); + } + if (sema_get(&newblk_in_progress, 0) == 0) + goto top; + MALLOC(newblk, struct newblk *, sizeof(struct newblk), + M_NEWBLK, M_SOFTDEP_FLAGS); + newblk->nb_state = 0; + newblk->nb_fs = fs; + newblk->nb_newblkno = newblkno; + LIST_INSERT_HEAD(newblkhd, newblk, nb_hash); + sema_release(&newblk_in_progress); + *newblkpp = newblk; + return (0); +} + +/* + * Executed during filesystem system initialization before + * mounting any filesystems. 
+ */ +void +softdep_initialize() +{ + + LIST_INIT(&mkdirlisthd); + LIST_INIT(&softdep_workitem_pending); + max_softdeps = desiredvnodes * 4; + pagedep_hashtbl = hashinit(desiredvnodes / 5, M_PAGEDEP, + &pagedep_hash); + sema_init(&pagedep_in_progress, "pagedep", PRIBIO, 0); + inodedep_hashtbl = hashinit(desiredvnodes, M_INODEDEP, &inodedep_hash); + sema_init(&inodedep_in_progress, "inodedep", PRIBIO, 0); + newblk_hashtbl = hashinit(64, M_NEWBLK, &newblk_hash); + sema_init(&newblk_in_progress, "newblk", PRIBIO, 0); + + /* hooks through which the main kernel code calls us */ + softdep_process_worklist_hook = softdep_process_worklist; + softdep_fsync_hook = softdep_fsync; + + /* initialise bioops hack */ + bioops.io_start = softdep_disk_io_initiation; + bioops.io_complete = softdep_disk_write_complete; + bioops.io_deallocate = softdep_deallocate_dependencies; + bioops.io_movedeps = softdep_move_dependencies; + bioops.io_countdeps = softdep_count_dependencies; +} + +/* + * Executed after all filesystems have been unmounted during + * filesystem module unload. + */ +void +softdep_uninitialize() +{ + + softdep_process_worklist_hook = NULL; + softdep_fsync_hook = NULL; + hashdestroy(pagedep_hashtbl, M_PAGEDEP, pagedep_hash); + hashdestroy(inodedep_hashtbl, M_INODEDEP, inodedep_hash); + hashdestroy(newblk_hashtbl, M_NEWBLK, newblk_hash); +} + +/* + * Called at mount time to notify the dependency code that a + * filesystem wishes to use it. + */ +int +softdep_mount(devvp, mp, fs, cred) + struct vnode *devvp; + struct mount *mp; + struct fs *fs; + struct ucred *cred; +{ + struct csum_total cstotal; + struct cg *cgp; + struct buf *bp; + int error, cyl; + + mp->mnt_flag &= ~MNT_ASYNC; + mp->mnt_flag |= MNT_SOFTDEP; + /* + * When doing soft updates, the counters in the + * superblock may have gotten out of sync, so we have + * to scan the cylinder groups and recalculate them. + */ + if (fs->fs_clean != 0) + return (0); + bzero(&cstotal, sizeof cstotal); + for (cyl = 0; cyl < fs->fs_ncg; cyl++) { + if ((error = bread(devvp, fsbtodb(fs, cgtod(fs, cyl)), + fs->fs_cgsize, cred, &bp)) != 0) { + brelse(bp); + return (error); + } + cgp = (struct cg *)bp->b_data; + cstotal.cs_nffree += cgp->cg_cs.cs_nffree; + cstotal.cs_nbfree += cgp->cg_cs.cs_nbfree; + cstotal.cs_nifree += cgp->cg_cs.cs_nifree; + cstotal.cs_ndir += cgp->cg_cs.cs_ndir; + fs->fs_cs(fs, cyl) = cgp->cg_cs; + brelse(bp); + } +#ifdef DEBUG + if (bcmp(&cstotal, &fs->fs_cstotal, sizeof cstotal)) + printf("%s: superblock summary recomputed\n", fs->fs_fsmnt); +#endif + bcopy(&cstotal, &fs->fs_cstotal, sizeof cstotal); + return (0); +} + +/* + * Protecting the freemaps (or bitmaps). + * + * To eliminate the need to execute fsck before mounting a filesystem + * after a power failure, one must (conservatively) guarantee that the + * on-disk copy of the bitmaps never indicate that a live inode or block is + * free. So, when a block or inode is allocated, the bitmap should be + * updated (on disk) before any new pointers. When a block or inode is + * freed, the bitmap should not be updated until all pointers have been + * reset. The latter dependency is handled by the delayed de-allocation + * approach described below for block and inode de-allocation. The former + * dependency is handled by calling the following procedure when a block or + * inode is allocated. When an inode is allocated an "inodedep" is created + * with its DEPCOMPLETE flag cleared until its bitmap is written to disk. 
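When the filesystem was not marked clean, softdep_mount() above re-derives the superblock totals by summing the per-cylinder-group counts, since soft updates may have let the cached totals drift. A standalone sketch of that recomputation follows; the structure and field names are illustrative stand-ins for the csum/csum_total fields.

/*
 * Standalone sketch of the summary recomputation done by softdep_mount()
 * above.  Field and type names are illustrative stand-ins.
 */
#include <stdio.h>

struct sk_csum {
	long ndir;	/* directories */
	long nbfree;	/* free blocks */
	long nifree;	/* free inodes */
	long nffree;	/* free fragments */
};

static struct sk_csum
sk_recompute_totals(const struct sk_csum *cg, int ncg)
{
	struct sk_csum total = { 0, 0, 0, 0 };
	int i;

	/* Re-sum the per-cylinder-group counts into fresh totals. */
	for (i = 0; i < ncg; i++) {
		total.ndir += cg[i].ndir;
		total.nbfree += cg[i].nbfree;
		total.nifree += cg[i].nifree;
		total.nffree += cg[i].nffree;
	}
	return (total);
}

int
main(void)
{
	struct sk_csum cgs[2] = { { 3, 100, 50, 8 }, { 1, 42, 60, 2 } };
	struct sk_csum t = sk_recompute_totals(cgs, 2);

	printf("ndir %ld nbfree %ld nifree %ld nffree %ld\n",
	    t.ndir, t.nbfree, t.nifree, t.nffree);
	return (0);
}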
+ * Each "inodedep" is also inserted into the hash indexing structure so + * that any additional link additions can be made dependent on the inode + * allocation. + * + * The ufs filesystem maintains a number of free block counts (e.g., per + * cylinder group, per cylinder and per pair) + * in addition to the bitmaps. These counts are used to improve efficiency + * during allocation and therefore must be consistent with the bitmaps. + * There is no convenient way to guarantee post-crash consistency of these + * counts with simple update ordering, for two main reasons: (1) The counts + * and bitmaps for a single cylinder group block are not in the same disk + * sector. If a disk write is interrupted (e.g., by power failure), one may + * be written and the other not. (2) Some of the counts are located in the + * superblock rather than the cylinder group block. So, we focus our soft + * updates implementation on protecting the bitmaps. When mounting a + * filesystem, we recompute the auxiliary counts from the bitmaps. + */ + +/* + * Called just after updating the cylinder group block to allocate an inode. + */ +void +softdep_setup_inomapdep(bp, ip, newinum) + struct buf *bp; /* buffer for cylgroup block with inode map */ + struct inode *ip; /* inode related to allocation */ + ino_t newinum; /* new inode number being allocated */ +{ + struct inodedep *inodedep; + struct bmsafemap *bmsafemap; + + /* + * Create a dependency for the newly allocated inode. + * Panic if it already exists as something is seriously wrong. + * Otherwise add it to the dependency list for the buffer holding + * the cylinder group map from which it was allocated. + */ + ACQUIRE_LOCK(&lk); + if ((inodedep_lookup(ip->i_fs, newinum, DEPALLOC|NODELAY, &inodedep))) { + FREE_LOCK(&lk); + panic("softdep_setup_inomapdep: found inode"); + } + inodedep->id_buf = bp; + inodedep->id_state &= ~DEPCOMPLETE; + bmsafemap = bmsafemap_lookup(bp); + LIST_INSERT_HEAD(&bmsafemap->sm_inodedephd, inodedep, id_deps); + FREE_LOCK(&lk); +} + +/* + * Called just after updating the cylinder group block to + * allocate block or fragment. + */ +void +softdep_setup_blkmapdep(bp, fs, newblkno) + struct buf *bp; /* buffer for cylgroup block with block map */ + struct fs *fs; /* filesystem doing allocation */ + ufs2_daddr_t newblkno; /* number of newly allocated block */ +{ + struct newblk *newblk; + struct bmsafemap *bmsafemap; + + /* + * Create a dependency for the newly allocated block. + * Add it to the dependency list for the buffer holding + * the cylinder group map from which it was allocated. + */ + if (newblk_lookup(fs, newblkno, DEPALLOC, &newblk) != 0) + panic("softdep_setup_blkmapdep: found block"); + ACQUIRE_LOCK(&lk); + newblk->nb_bmsafemap = bmsafemap = bmsafemap_lookup(bp); + LIST_INSERT_HEAD(&bmsafemap->sm_newblkhd, newblk, nb_deps); + FREE_LOCK(&lk); +} + +/* + * Find the bmsafemap associated with a cylinder group buffer. + * If none exists, create one. The buffer must be locked when + * this routine is called and this routine must be called with + * splbio interrupts blocked. 
+ */ +static struct bmsafemap * +bmsafemap_lookup(bp) + struct buf *bp; +{ + struct bmsafemap *bmsafemap; + struct worklist *wk; + +#ifdef DEBUG + if (lk.lkt_held == NOHOLDER) + panic("bmsafemap_lookup: lock not held"); +#endif + LIST_FOREACH(wk, &bp->b_dep, wk_list) + if (wk->wk_type == D_BMSAFEMAP) + return (WK_BMSAFEMAP(wk)); + FREE_LOCK(&lk); + MALLOC(bmsafemap, struct bmsafemap *, sizeof(struct bmsafemap), + M_BMSAFEMAP, M_SOFTDEP_FLAGS); + bmsafemap->sm_list.wk_type = D_BMSAFEMAP; + bmsafemap->sm_list.wk_state = 0; + bmsafemap->sm_buf = bp; + LIST_INIT(&bmsafemap->sm_allocdirecthd); + LIST_INIT(&bmsafemap->sm_allocindirhd); + LIST_INIT(&bmsafemap->sm_inodedephd); + LIST_INIT(&bmsafemap->sm_newblkhd); + ACQUIRE_LOCK(&lk); + WORKLIST_INSERT(&bp->b_dep, &bmsafemap->sm_list); + return (bmsafemap); +} + +/* + * Direct block allocation dependencies. + * + * When a new block is allocated, the corresponding disk locations must be + * initialized (with zeros or new data) before the on-disk inode points to + * them. Also, the freemap from which the block was allocated must be + * updated (on disk) before the inode's pointer. These two dependencies are + * independent of each other and are needed for all file blocks and indirect + * blocks that are pointed to directly by the inode. Just before the + * "in-core" version of the inode is updated with a newly allocated block + * number, a procedure (below) is called to setup allocation dependency + * structures. These structures are removed when the corresponding + * dependencies are satisfied or when the block allocation becomes obsolete + * (i.e., the file is deleted, the block is de-allocated, or the block is a + * fragment that gets upgraded). All of these cases are handled in + * procedures described later. + * + * When a file extension causes a fragment to be upgraded, either to a larger + * fragment or to a full block, the on-disk location may change (if the + * previous fragment could not simply be extended). In this case, the old + * fragment must be de-allocated, but not until after the inode's pointer has + * been updated. In most cases, this is handled by later procedures, which + * will construct a "freefrag" structure to be added to the workitem queue + * when the inode update is complete (or obsolete). The main exception to + * this is when an allocation occurs while a pending allocation dependency + * (for the same block pointer) remains. This case is handled in the main + * allocation dependency setup procedure by immediately freeing the + * unreferenced fragments. 
+ */
+void
+softdep_setup_allocdirect(ip, lbn, newblkno, oldblkno, newsize, oldsize, bp)
+	struct inode *ip;	/* inode to which block is being added */
+	ufs_lbn_t lbn;		/* block pointer within inode */
+	ufs2_daddr_t newblkno;	/* disk block number being added */
+	ufs2_daddr_t oldblkno;	/* previous block number, 0 unless frag */
+	long newsize;		/* size of new block */
+	long oldsize;		/* size of old block */
+	struct buf *bp;		/* bp for allocated block */
+{
+	struct allocdirect *adp, *oldadp;
+	struct allocdirectlst *adphead;
+	struct bmsafemap *bmsafemap;
+	struct inodedep *inodedep;
+	struct pagedep *pagedep;
+	struct newblk *newblk;
+
+	MALLOC(adp, struct allocdirect *, sizeof(struct allocdirect),
+	    M_ALLOCDIRECT, M_SOFTDEP_FLAGS|M_ZERO);
+	adp->ad_list.wk_type = D_ALLOCDIRECT;
+	adp->ad_lbn = lbn;
+	adp->ad_newblkno = newblkno;
+	adp->ad_oldblkno = oldblkno;
+	adp->ad_newsize = newsize;
+	adp->ad_oldsize = oldsize;
+	adp->ad_state = ATTACHED;
+	LIST_INIT(&adp->ad_newdirblk);
+	if (newblkno == oldblkno)
+		adp->ad_freefrag = NULL;
+	else
+		adp->ad_freefrag = newfreefrag(ip, oldblkno, oldsize);
+
+	if (newblk_lookup(ip->i_fs, newblkno, 0, &newblk) == 0)
+		panic("softdep_setup_allocdirect: lost block");
+
+	ACQUIRE_LOCK(&lk);
+	inodedep_lookup(ip->i_fs, ip->i_number, DEPALLOC | NODELAY, &inodedep);
+	adp->ad_inodedep = inodedep;
+
+	if (newblk->nb_state == DEPCOMPLETE) {
+		adp->ad_state |= DEPCOMPLETE;
+		adp->ad_buf = NULL;
+	} else {
+		bmsafemap = newblk->nb_bmsafemap;
+		adp->ad_buf = bmsafemap->sm_buf;
+		LIST_REMOVE(newblk, nb_deps);
+		LIST_INSERT_HEAD(&bmsafemap->sm_allocdirecthd, adp, ad_deps);
+	}
+	LIST_REMOVE(newblk, nb_hash);
+	FREE(newblk, M_NEWBLK);
+
+	WORKLIST_INSERT(&bp->b_dep, &adp->ad_list);
+	if (lbn >= NDADDR) {
+		/* allocating an indirect block */
+		if (oldblkno != 0) {
+			FREE_LOCK(&lk);
+			panic("softdep_setup_allocdirect: non-zero indir");
+		}
+	} else {
+		/*
+		 * Allocating a direct block.
+		 *
+		 * If we are allocating a directory block, then we must
+		 * allocate an associated pagedep to track additions and
+		 * deletions.
+		 */
+		if ((ip->i_mode & IFMT) == IFDIR &&
+		    pagedep_lookup(ip, lbn, DEPALLOC, &pagedep) == 0)
+			WORKLIST_INSERT(&bp->b_dep, &pagedep->pd_list);
+	}
+	/*
+	 * The list of allocdirects must be kept in sorted and ascending
+	 * order so that the rollback routines can quickly determine the
+	 * first uncommitted block (the size of the file stored on disk
+	 * ends at the end of the lowest committed fragment, or if there
+	 * are no fragments, at the end of the highest committed block).
+	 * Since files generally grow, the typical case is that the new
+	 * block is to be added at the end of the list. We speed this
+	 * special case by checking against the last allocdirect in the
+	 * list before laboriously traversing the list looking for the
+	 * insertion point.
+ */ + adphead = &inodedep->id_newinoupdt; + oldadp = TAILQ_LAST(adphead, allocdirectlst); + if (oldadp == NULL || oldadp->ad_lbn <= lbn) { + /* insert at end of list */ + TAILQ_INSERT_TAIL(adphead, adp, ad_next); + if (oldadp != NULL && oldadp->ad_lbn == lbn) + allocdirect_merge(adphead, adp, oldadp); + FREE_LOCK(&lk); + return; + } + TAILQ_FOREACH(oldadp, adphead, ad_next) { + if (oldadp->ad_lbn >= lbn) + break; + } + if (oldadp == NULL) { + FREE_LOCK(&lk); + panic("softdep_setup_allocdirect: lost entry"); + } + /* insert in middle of list */ + TAILQ_INSERT_BEFORE(oldadp, adp, ad_next); + if (oldadp->ad_lbn == lbn) + allocdirect_merge(adphead, adp, oldadp); + FREE_LOCK(&lk); +} + +/* + * Replace an old allocdirect dependency with a newer one. + * This routine must be called with splbio interrupts blocked. + */ +static void +allocdirect_merge(adphead, newadp, oldadp) + struct allocdirectlst *adphead; /* head of list holding allocdirects */ + struct allocdirect *newadp; /* allocdirect being added */ + struct allocdirect *oldadp; /* existing allocdirect being checked */ +{ + struct worklist *wk; + struct freefrag *freefrag; + struct newdirblk *newdirblk; + +#ifdef DEBUG + if (lk.lkt_held == NOHOLDER) + panic("allocdirect_merge: lock not held"); +#endif + if (newadp->ad_oldblkno != oldadp->ad_newblkno || + newadp->ad_oldsize != oldadp->ad_newsize || + newadp->ad_lbn >= NDADDR) { + FREE_LOCK(&lk); + panic("%s %jd != new %jd || old size %ld != new %ld", + "allocdirect_merge: old blkno", + (intmax_t)newadp->ad_oldblkno, + (intmax_t)oldadp->ad_newblkno, + newadp->ad_oldsize, oldadp->ad_newsize); + } + newadp->ad_oldblkno = oldadp->ad_oldblkno; + newadp->ad_oldsize = oldadp->ad_oldsize; + /* + * If the old dependency had a fragment to free or had never + * previously had a block allocated, then the new dependency + * can immediately post its freefrag and adopt the old freefrag. + * This action is done by swapping the freefrag dependencies. + * The new dependency gains the old one's freefrag, and the + * old one gets the new one and then immediately puts it on + * the worklist when it is freed by free_allocdirect. It is + * not possible to do this swap when the old dependency had a + * non-zero size but no previous fragment to free. This condition + * arises when the new block is an extension of the old block. + * Here, the first part of the fragment allocated to the new + * dependency is part of the block currently claimed on disk by + * the old dependency, so cannot legitimately be freed until the + * conditions for the new dependency are fulfilled. + */ + if (oldadp->ad_freefrag != NULL || oldadp->ad_oldblkno == 0) { + freefrag = newadp->ad_freefrag; + newadp->ad_freefrag = oldadp->ad_freefrag; + oldadp->ad_freefrag = freefrag; + } + /* + * If we are tracking a new directory-block allocation, + * move it from the old allocdirect to the new allocdirect. + */ + if ((wk = LIST_FIRST(&oldadp->ad_newdirblk)) != NULL) { + newdirblk = WK_NEWDIRBLK(wk); + WORKLIST_REMOVE(&newdirblk->db_list); + if (LIST_FIRST(&oldadp->ad_newdirblk) != NULL) + panic("allocdirect_merge: extra newdirblk"); + WORKLIST_INSERT(&newadp->ad_newdirblk, &newdirblk->db_list); + } + free_allocdirect(adphead, oldadp, 0); +} + +/* + * Allocate a new freefrag structure if needed. 
+ */ +static struct freefrag * +newfreefrag(ip, blkno, size) + struct inode *ip; + ufs2_daddr_t blkno; + long size; +{ + struct freefrag *freefrag; + struct fs *fs; + + if (blkno == 0) + return (NULL); + fs = ip->i_fs; + if (fragnum(fs, blkno) + numfrags(fs, size) > fs->fs_frag) + panic("newfreefrag: frag size"); + MALLOC(freefrag, struct freefrag *, sizeof(struct freefrag), + M_FREEFRAG, M_SOFTDEP_FLAGS); + freefrag->ff_list.wk_type = D_FREEFRAG; + freefrag->ff_state = 0; + freefrag->ff_inum = ip->i_number; + freefrag->ff_mnt = ITOV(ip)->v_mount; + freefrag->ff_blkno = blkno; + freefrag->ff_fragsize = size; + return (freefrag); +} + +/* + * This workitem de-allocates fragments that were replaced during + * file block allocation. + */ +static void +handle_workitem_freefrag(freefrag) + struct freefrag *freefrag; +{ + struct ufsmount *ump = VFSTOUFS(freefrag->ff_mnt); + + ffs_blkfree(ump->um_fs, ump->um_devvp, freefrag->ff_blkno, + freefrag->ff_fragsize, freefrag->ff_inum); + FREE(freefrag, M_FREEFRAG); +} + +/* + * Set up a dependency structure for an external attributes data block. + * This routine follows much of the structure of softdep_setup_allocdirect. + * See the description of softdep_setup_allocdirect above for details. + */ +void +softdep_setup_allocext(ip, lbn, newblkno, oldblkno, newsize, oldsize, bp) + struct inode *ip; + ufs_lbn_t lbn; + ufs2_daddr_t newblkno; + ufs2_daddr_t oldblkno; + long newsize; + long oldsize; + struct buf *bp; +{ + struct allocdirect *adp, *oldadp; + struct allocdirectlst *adphead; + struct bmsafemap *bmsafemap; + struct inodedep *inodedep; + struct newblk *newblk; + + MALLOC(adp, struct allocdirect *, sizeof(struct allocdirect), + M_ALLOCDIRECT, M_SOFTDEP_FLAGS|M_ZERO); + adp->ad_list.wk_type = D_ALLOCDIRECT; + adp->ad_lbn = lbn; + adp->ad_newblkno = newblkno; + adp->ad_oldblkno = oldblkno; + adp->ad_newsize = newsize; + adp->ad_oldsize = oldsize; + adp->ad_state = ATTACHED | EXTDATA; + LIST_INIT(&adp->ad_newdirblk); + if (newblkno == oldblkno) + adp->ad_freefrag = NULL; + else + adp->ad_freefrag = newfreefrag(ip, oldblkno, oldsize); + + if (newblk_lookup(ip->i_fs, newblkno, 0, &newblk) == 0) + panic("softdep_setup_allocext: lost block"); + + ACQUIRE_LOCK(&lk); + inodedep_lookup(ip->i_fs, ip->i_number, DEPALLOC | NODELAY, &inodedep); + adp->ad_inodedep = inodedep; + + if (newblk->nb_state == DEPCOMPLETE) { + adp->ad_state |= DEPCOMPLETE; + adp->ad_buf = NULL; + } else { + bmsafemap = newblk->nb_bmsafemap; + adp->ad_buf = bmsafemap->sm_buf; + LIST_REMOVE(newblk, nb_deps); + LIST_INSERT_HEAD(&bmsafemap->sm_allocdirecthd, adp, ad_deps); + } + LIST_REMOVE(newblk, nb_hash); + FREE(newblk, M_NEWBLK); + + WORKLIST_INSERT(&bp->b_dep, &adp->ad_list); + if (lbn >= NXADDR) { + FREE_LOCK(&lk); + panic("softdep_setup_allocext: lbn %lld > NXADDR", + (long long)lbn); + } + /* + * The list of allocdirects must be kept in sorted and ascending + * order so that the rollback routines can quickly determine the + * first uncommitted block (the size of the file stored on disk + * ends at the end of the lowest committed fragment, or if there + * are no fragments, at the end of the highest committed block). + * Since files generally grow, the typical case is that the new + * block is to be added at the end of the list. We speed this + * special case by checking against the last allocdirect in the + * list before laboriously traversing the list looking for the + * insertion point. 
+ */ + adphead = &inodedep->id_newextupdt; + oldadp = TAILQ_LAST(adphead, allocdirectlst); + if (oldadp == NULL || oldadp->ad_lbn <= lbn) { + /* insert at end of list */ + TAILQ_INSERT_TAIL(adphead, adp, ad_next); + if (oldadp != NULL && oldadp->ad_lbn == lbn) + allocdirect_merge(adphead, adp, oldadp); + FREE_LOCK(&lk); + return; + } + TAILQ_FOREACH(oldadp, adphead, ad_next) { + if (oldadp->ad_lbn >= lbn) + break; + } + if (oldadp == NULL) { + FREE_LOCK(&lk); + panic("softdep_setup_allocext: lost entry"); + } + /* insert in middle of list */ + TAILQ_INSERT_BEFORE(oldadp, adp, ad_next); + if (oldadp->ad_lbn == lbn) + allocdirect_merge(adphead, adp, oldadp); + FREE_LOCK(&lk); +} + +/* + * Indirect block allocation dependencies. + * + * The same dependencies that exist for a direct block also exist when + * a new block is allocated and pointed to by an entry in a block of + * indirect pointers. The undo/redo states described above are also + * used here. Because an indirect block contains many pointers that + * may have dependencies, a second copy of the entire in-memory indirect + * block is kept. The buffer cache copy is always completely up-to-date. + * The second copy, which is used only as a source for disk writes, + * contains only the safe pointers (i.e., those that have no remaining + * update dependencies). The second copy is freed when all pointers + * are safe. The cache is not allowed to replace indirect blocks with + * pending update dependencies. If a buffer containing an indirect + * block with dependencies is written, these routines will mark it + * dirty again. It can only be successfully written once all the + * dependencies are removed. The ffs_fsync routine in conjunction with + * softdep_sync_metadata work together to get all the dependencies + * removed so that a file can be successfully written to disk. Three + * procedures are used when setting up indirect block pointer + * dependencies. The division is necessary because of the organization + * of the "balloc" routine and because of the distinction between file + * pages and file metadata blocks. + */ + +/* + * Allocate a new allocindir structure. + */ +static struct allocindir * +newallocindir(ip, ptrno, newblkno, oldblkno) + struct inode *ip; /* inode for file being extended */ + int ptrno; /* offset of pointer in indirect block */ + ufs2_daddr_t newblkno; /* disk block number being added */ + ufs2_daddr_t oldblkno; /* previous block number, 0 if none */ +{ + struct allocindir *aip; + + MALLOC(aip, struct allocindir *, sizeof(struct allocindir), + M_ALLOCINDIR, M_SOFTDEP_FLAGS|M_ZERO); + aip->ai_list.wk_type = D_ALLOCINDIR; + aip->ai_state = ATTACHED; + aip->ai_offset = ptrno; + aip->ai_newblkno = newblkno; + aip->ai_oldblkno = oldblkno; + aip->ai_freefrag = newfreefrag(ip, oldblkno, ip->i_fs->fs_bsize); + return (aip); +} + +/* + * Called just before setting an indirect block pointer + * to a newly allocated file page. 
+ */ +void +softdep_setup_allocindir_page(ip, lbn, bp, ptrno, newblkno, oldblkno, nbp) + struct inode *ip; /* inode for file being extended */ + ufs_lbn_t lbn; /* allocated block number within file */ + struct buf *bp; /* buffer with indirect blk referencing page */ + int ptrno; /* offset of pointer in indirect block */ + ufs2_daddr_t newblkno; /* disk block number being added */ + ufs2_daddr_t oldblkno; /* previous block number, 0 if none */ + struct buf *nbp; /* buffer holding allocated page */ +{ + struct allocindir *aip; + struct pagedep *pagedep; + + aip = newallocindir(ip, ptrno, newblkno, oldblkno); + ACQUIRE_LOCK(&lk); + /* + * If we are allocating a directory page, then we must + * allocate an associated pagedep to track additions and + * deletions. + */ + if ((ip->i_mode & IFMT) == IFDIR && + pagedep_lookup(ip, lbn, DEPALLOC, &pagedep) == 0) + WORKLIST_INSERT(&nbp->b_dep, &pagedep->pd_list); + WORKLIST_INSERT(&nbp->b_dep, &aip->ai_list); + FREE_LOCK(&lk); + setup_allocindir_phase2(bp, ip, aip); +} + +/* + * Called just before setting an indirect block pointer to a + * newly allocated indirect block. + */ +void +softdep_setup_allocindir_meta(nbp, ip, bp, ptrno, newblkno) + struct buf *nbp; /* newly allocated indirect block */ + struct inode *ip; /* inode for file being extended */ + struct buf *bp; /* indirect block referencing allocated block */ + int ptrno; /* offset of pointer in indirect block */ + ufs2_daddr_t newblkno; /* disk block number being added */ +{ + struct allocindir *aip; + + aip = newallocindir(ip, ptrno, newblkno, 0); + ACQUIRE_LOCK(&lk); + WORKLIST_INSERT(&nbp->b_dep, &aip->ai_list); + FREE_LOCK(&lk); + setup_allocindir_phase2(bp, ip, aip); +} + +/* + * Called to finish the allocation of the "aip" allocated + * by one of the two routines above. + */ +static void +setup_allocindir_phase2(bp, ip, aip) + struct buf *bp; /* in-memory copy of the indirect block */ + struct inode *ip; /* inode for file being extended */ + struct allocindir *aip; /* allocindir allocated by the above routines */ +{ + struct worklist *wk; + struct indirdep *indirdep, *newindirdep; + struct bmsafemap *bmsafemap; + struct allocindir *oldaip; + struct freefrag *freefrag; + struct newblk *newblk; + ufs2_daddr_t blkno; + + if (bp->b_lblkno >= 0) + panic("setup_allocindir_phase2: not indir blk"); + for (indirdep = NULL, newindirdep = NULL; ; ) { + ACQUIRE_LOCK(&lk); + LIST_FOREACH(wk, &bp->b_dep, wk_list) { + if (wk->wk_type != D_INDIRDEP) + continue; + indirdep = WK_INDIRDEP(wk); + break; + } + if (indirdep == NULL && newindirdep) { + indirdep = newindirdep; + WORKLIST_INSERT(&bp->b_dep, &indirdep->ir_list); + newindirdep = NULL; + } + FREE_LOCK(&lk); + if (indirdep) { + if (newblk_lookup(ip->i_fs, aip->ai_newblkno, 0, + &newblk) == 0) + panic("setup_allocindir: lost block"); + ACQUIRE_LOCK(&lk); + if (newblk->nb_state == DEPCOMPLETE) { + aip->ai_state |= DEPCOMPLETE; + aip->ai_buf = NULL; + } else { + bmsafemap = newblk->nb_bmsafemap; + aip->ai_buf = bmsafemap->sm_buf; + LIST_REMOVE(newblk, nb_deps); + LIST_INSERT_HEAD(&bmsafemap->sm_allocindirhd, + aip, ai_deps); + } + LIST_REMOVE(newblk, nb_hash); + FREE(newblk, M_NEWBLK); + aip->ai_indirdep = indirdep; + /* + * Check to see if there is an existing dependency + * for this block. If there is, merge the old + * dependency into the new one. 
+ */ + if (aip->ai_oldblkno == 0) + oldaip = NULL; + else + + LIST_FOREACH(oldaip, &indirdep->ir_deplisthd, ai_next) + if (oldaip->ai_offset == aip->ai_offset) + break; + freefrag = NULL; + if (oldaip != NULL) { + if (oldaip->ai_newblkno != aip->ai_oldblkno) { + FREE_LOCK(&lk); + panic("setup_allocindir_phase2: blkno"); + } + aip->ai_oldblkno = oldaip->ai_oldblkno; + freefrag = aip->ai_freefrag; + aip->ai_freefrag = oldaip->ai_freefrag; + oldaip->ai_freefrag = NULL; + free_allocindir(oldaip, NULL); + } + LIST_INSERT_HEAD(&indirdep->ir_deplisthd, aip, ai_next); + if (ip->i_ump->um_fstype == UFS1) + ((ufs1_daddr_t *)indirdep->ir_savebp->b_data) + [aip->ai_offset] = aip->ai_oldblkno; + else + ((ufs2_daddr_t *)indirdep->ir_savebp->b_data) + [aip->ai_offset] = aip->ai_oldblkno; + FREE_LOCK(&lk); + if (freefrag != NULL) + handle_workitem_freefrag(freefrag); + } + if (newindirdep) { + brelse(newindirdep->ir_savebp); + WORKITEM_FREE((caddr_t)newindirdep, D_INDIRDEP); + } + if (indirdep) + break; + MALLOC(newindirdep, struct indirdep *, sizeof(struct indirdep), + M_INDIRDEP, M_SOFTDEP_FLAGS); + newindirdep->ir_list.wk_type = D_INDIRDEP; + newindirdep->ir_state = ATTACHED; + if (ip->i_ump->um_fstype == UFS1) + newindirdep->ir_state |= UFS1FMT; + LIST_INIT(&newindirdep->ir_deplisthd); + LIST_INIT(&newindirdep->ir_donehd); + if (bp->b_blkno == bp->b_lblkno) { + ufs_bmaparray(bp->b_vp, bp->b_lblkno, &blkno, bp, + NULL, NULL); + bp->b_blkno = blkno; + } + newindirdep->ir_savebp = + getblk(ip->i_devvp, bp->b_blkno, bp->b_bcount, 0, 0, 0); + BUF_KERNPROC(newindirdep->ir_savebp); + bcopy(bp->b_data, newindirdep->ir_savebp->b_data, bp->b_bcount); + } +} + +/* + * Block de-allocation dependencies. + * + * When blocks are de-allocated, the on-disk pointers must be nullified before + * the blocks are made available for use by other files. (The true + * requirement is that old pointers must be nullified before new on-disk + * pointers are set. We chose this slightly more stringent requirement to + * reduce complexity.) Our implementation handles this dependency by updating + * the inode (or indirect block) appropriately but delaying the actual block + * de-allocation (i.e., freemap and free space count manipulation) until + * after the updated versions reach stable storage. After the disk is + * updated, the blocks can be safely de-allocated whenever it is convenient. + * This implementation handles only the common case of reducing a file's + * length to zero. Other cases are handled by the conventional synchronous + * write approach. + * + * The ffs implementation with which we worked double-checks + * the state of the block pointers and file size as it reduces + * a file's length. Some of this code is replicated here in our + * soft updates implementation. The freeblks->fb_chkcnt field is + * used to transfer a part of this information to the procedure + * that eventually de-allocates the blocks. + * + * This routine should be called from the routine that shortens + * a file's length, before the inode's size or block pointers + * are modified. It will save the block pointer information for + * later release and zero the inode so that the calling routine + * can release it. 
+ */ +void +softdep_setup_freeblocks(ip, length, flags) + struct inode *ip; /* The inode whose length is to be reduced */ + off_t length; /* The new length for the file */ + int flags; /* IO_EXT and/or IO_NORMAL */ +{ + struct freeblks *freeblks; + struct inodedep *inodedep; + struct allocdirect *adp; + struct vnode *vp; + struct buf *bp; + struct fs *fs; + ufs2_daddr_t extblocks, datablocks; + int i, delay, error; + + fs = ip->i_fs; + if (length != 0) + panic("softdep_setup_freeblocks: non-zero length"); + MALLOC(freeblks, struct freeblks *, sizeof(struct freeblks), + M_FREEBLKS, M_SOFTDEP_FLAGS|M_ZERO); + freeblks->fb_list.wk_type = D_FREEBLKS; + freeblks->fb_uid = ip->i_uid; + freeblks->fb_previousinum = ip->i_number; + freeblks->fb_devvp = ip->i_devvp; + freeblks->fb_mnt = ITOV(ip)->v_mount; + extblocks = 0; + if (fs->fs_magic == FS_UFS2_MAGIC) + extblocks = btodb(fragroundup(fs, ip->i_din2->di_extsize)); + datablocks = DIP(ip, i_blocks) - extblocks; + if ((flags & IO_NORMAL) == 0) { + freeblks->fb_oldsize = 0; + freeblks->fb_chkcnt = 0; + } else { + freeblks->fb_oldsize = ip->i_size; + ip->i_size = 0; + DIP(ip, i_size) = 0; + freeblks->fb_chkcnt = datablocks; + for (i = 0; i < NDADDR; i++) { + freeblks->fb_dblks[i] = DIP(ip, i_db[i]); + DIP(ip, i_db[i]) = 0; + } + for (i = 0; i < NIADDR; i++) { + freeblks->fb_iblks[i] = DIP(ip, i_ib[i]); + DIP(ip, i_ib[i]) = 0; + } + /* + * If the file was removed, then the space being freed was + * accounted for then (see softdep_filereleased()). If the + * file is merely being truncated, then we account for it now. + */ + if ((ip->i_flag & IN_SPACECOUNTED) == 0) + fs->fs_pendingblocks += datablocks; + } + if ((flags & IO_EXT) == 0) { + freeblks->fb_oldextsize = 0; + } else { + freeblks->fb_oldextsize = ip->i_din2->di_extsize; + ip->i_din2->di_extsize = 0; + freeblks->fb_chkcnt += extblocks; + for (i = 0; i < NXADDR; i++) { + freeblks->fb_eblks[i] = ip->i_din2->di_extb[i]; + ip->i_din2->di_extb[i] = 0; + } + } + DIP(ip, i_blocks) -= freeblks->fb_chkcnt; + /* + * Push the zero'ed inode to to its disk buffer so that we are free + * to delete its dependencies below. Once the dependencies are gone + * the buffer can be safely released. + */ + if ((error = bread(ip->i_devvp, + fsbtodb(fs, ino_to_fsba(fs, ip->i_number)), + (int)fs->fs_bsize, NOCRED, &bp)) != 0) { + brelse(bp); + softdep_error("softdep_setup_freeblocks", error); + } + if (ip->i_ump->um_fstype == UFS1) + *((struct ufs1_dinode *)bp->b_data + + ino_to_fsbo(fs, ip->i_number)) = *ip->i_din1; + else + *((struct ufs2_dinode *)bp->b_data + + ino_to_fsbo(fs, ip->i_number)) = *ip->i_din2; + /* + * Find and eliminate any inode dependencies. + */ + ACQUIRE_LOCK(&lk); + (void) inodedep_lookup(fs, ip->i_number, DEPALLOC, &inodedep); + if ((inodedep->id_state & IOSTARTED) != 0) { + FREE_LOCK(&lk); + panic("softdep_setup_freeblocks: inode busy"); + } + /* + * Add the freeblks structure to the list of operations that + * must await the zero'ed inode being written to disk. If we + * still have a bitmap dependency (delay == 0), then the inode + * has never been written to disk, so we can process the + * freeblks below once we have deleted the dependencies. + */ + delay = (inodedep->id_state & DEPCOMPLETE); + if (delay) + WORKLIST_INSERT(&inodedep->id_bufwait, &freeblks->fb_list); + /* + * Because the file length has been truncated to zero, any + * pending block allocation dependency structures associated + * with this inode are obsolete and can simply be de-allocated. 
+ * We must first merge the two dependency lists to get rid of + * any duplicate freefrag structures, then purge the merged list. + * If we still have a bitmap dependency, then the inode has never + * been written to disk, so we can free any fragments without delay. + */ + if (flags & IO_NORMAL) { + merge_inode_lists(&inodedep->id_newinoupdt, + &inodedep->id_inoupdt); + while ((adp = TAILQ_FIRST(&inodedep->id_inoupdt)) != 0) + free_allocdirect(&inodedep->id_inoupdt, adp, delay); + } + if (flags & IO_EXT) { + merge_inode_lists(&inodedep->id_newextupdt, + &inodedep->id_extupdt); + while ((adp = TAILQ_FIRST(&inodedep->id_extupdt)) != 0) + free_allocdirect(&inodedep->id_extupdt, adp, delay); + } + FREE_LOCK(&lk); + bdwrite(bp); + /* + * We must wait for any I/O in progress to finish so that + * all potential buffers on the dirty list will be visible. + * Once they are all there, walk the list and get rid of + * any dependencies. + */ + vp = ITOV(ip); + ACQUIRE_LOCK(&lk); + VI_LOCK(vp); + drain_output(vp, 1); +restart: + TAILQ_FOREACH(bp, &vp->v_dirtyblkhd, b_vnbufs) { + if (((flags & IO_EXT) == 0 && (bp->b_xflags & BX_ALTDATA)) || + ((flags & IO_NORMAL) == 0 && + (bp->b_xflags & BX_ALTDATA) == 0)) + continue; + if ((bp = getdirtybuf(&bp, VI_MTX(vp), MNT_WAIT)) == NULL) + goto restart; + (void) inodedep_lookup(fs, ip->i_number, 0, &inodedep); + deallocate_dependencies(bp, inodedep); + bp->b_flags |= B_INVAL | B_NOCACHE; + FREE_LOCK(&lk); + brelse(bp); + ACQUIRE_LOCK(&lk); + VI_LOCK(vp); + goto restart; + } + VI_UNLOCK(vp); + if (inodedep_lookup(fs, ip->i_number, 0, &inodedep) != 0) + (void) free_inodedep(inodedep); + FREE_LOCK(&lk); + /* + * If the inode has never been written to disk (delay == 0), + * then we can process the freeblks now that we have deleted + * the dependencies. + */ + if (!delay) + handle_workitem_freeblocks(freeblks, 0); +} + +/* + * Reclaim any dependency structures from a buffer that is about to + * be reallocated to a new vnode. The buffer must be locked, thus, + * no I/O completion operations can occur while we are manipulating + * its associated dependencies. The mutex is held so that other I/O's + * associated with related dependencies do not occur. + */ +static void +deallocate_dependencies(bp, inodedep) + struct buf *bp; + struct inodedep *inodedep; +{ + struct worklist *wk; + struct indirdep *indirdep; + struct allocindir *aip; + struct pagedep *pagedep; + struct dirrem *dirrem; + struct diradd *dap; + int i; + + while ((wk = LIST_FIRST(&bp->b_dep)) != NULL) { + switch (wk->wk_type) { + + case D_INDIRDEP: + indirdep = WK_INDIRDEP(wk); + /* + * None of the indirect pointers will ever be visible, + * so they can simply be tossed. GOINGAWAY ensures + * that allocated pointers will be saved in the buffer + * cache until they are freed. Note that they will + * only be able to be found by their physical address + * since the inode mapping the logical address will + * be gone. The save buffer used for the safe copy + * was allocated in setup_allocindir_phase2 using + * the physical address so it could be used for this + * purpose. Hence we swap the safe copy with the real + * copy, allowing the safe copy to be freed and holding + * on to the real copy for later use in indir_trunc. 
+ */ + if (indirdep->ir_state & GOINGAWAY) { + FREE_LOCK(&lk); + panic("deallocate_dependencies: already gone"); + } + indirdep->ir_state |= GOINGAWAY; + VFSTOUFS(bp->b_vp->v_mount)->um_numindirdeps += 1; + while ((aip = LIST_FIRST(&indirdep->ir_deplisthd)) != 0) + free_allocindir(aip, inodedep); + if (bp->b_lblkno >= 0 || + bp->b_blkno != indirdep->ir_savebp->b_lblkno) { + FREE_LOCK(&lk); + panic("deallocate_dependencies: not indir"); + } + bcopy(bp->b_data, indirdep->ir_savebp->b_data, + bp->b_bcount); + WORKLIST_REMOVE(wk); + WORKLIST_INSERT(&indirdep->ir_savebp->b_dep, wk); + continue; + + case D_PAGEDEP: + pagedep = WK_PAGEDEP(wk); + /* + * None of the directory additions will ever be + * visible, so they can simply be tossed. + */ + for (i = 0; i < DAHASHSZ; i++) + while ((dap = + LIST_FIRST(&pagedep->pd_diraddhd[i]))) + free_diradd(dap); + while ((dap = LIST_FIRST(&pagedep->pd_pendinghd)) != 0) + free_diradd(dap); + /* + * Copy any directory remove dependencies to the list + * to be processed after the zero'ed inode is written. + * If the inode has already been written, then they + * can be dumped directly onto the work list. + */ + LIST_FOREACH(dirrem, &pagedep->pd_dirremhd, dm_next) { + LIST_REMOVE(dirrem, dm_next); + dirrem->dm_dirinum = pagedep->pd_ino; + if (inodedep == NULL || + (inodedep->id_state & ALLCOMPLETE) == + ALLCOMPLETE) + add_to_worklist(&dirrem->dm_list); + else + WORKLIST_INSERT(&inodedep->id_bufwait, + &dirrem->dm_list); + } + if ((pagedep->pd_state & NEWBLOCK) != 0) { + LIST_FOREACH(wk, &inodedep->id_bufwait, wk_list) + if (wk->wk_type == D_NEWDIRBLK && + WK_NEWDIRBLK(wk)->db_pagedep == + pagedep) + break; + if (wk != NULL) { + WORKLIST_REMOVE(wk); + free_newdirblk(WK_NEWDIRBLK(wk)); + } else { + FREE_LOCK(&lk); + panic("deallocate_dependencies: " + "lost pagedep"); + } + } + WORKLIST_REMOVE(&pagedep->pd_list); + LIST_REMOVE(pagedep, pd_hash); + WORKITEM_FREE(pagedep, D_PAGEDEP); + continue; + + case D_ALLOCINDIR: + free_allocindir(WK_ALLOCINDIR(wk), inodedep); + continue; + + case D_ALLOCDIRECT: + case D_INODEDEP: + FREE_LOCK(&lk); + panic("deallocate_dependencies: Unexpected type %s", + TYPENAME(wk->wk_type)); + /* NOTREACHED */ + + default: + FREE_LOCK(&lk); + panic("deallocate_dependencies: Unknown type %s", + TYPENAME(wk->wk_type)); + /* NOTREACHED */ + } + } +} + +/* + * Free an allocdirect. Generate a new freefrag work request if appropriate. + * This routine must be called with splbio interrupts blocked. + */ +static void +free_allocdirect(adphead, adp, delay) + struct allocdirectlst *adphead; + struct allocdirect *adp; + int delay; +{ + struct newdirblk *newdirblk; + struct worklist *wk; + +#ifdef DEBUG + if (lk.lkt_held == NOHOLDER) + panic("free_allocdirect: lock not held"); +#endif + if ((adp->ad_state & DEPCOMPLETE) == 0) + LIST_REMOVE(adp, ad_deps); + TAILQ_REMOVE(adphead, adp, ad_next); + if ((adp->ad_state & COMPLETE) == 0) + WORKLIST_REMOVE(&adp->ad_list); + if (adp->ad_freefrag != NULL) { + if (delay) + WORKLIST_INSERT(&adp->ad_inodedep->id_bufwait, + &adp->ad_freefrag->ff_list); + else + add_to_worklist(&adp->ad_freefrag->ff_list); + } + if ((wk = LIST_FIRST(&adp->ad_newdirblk)) != NULL) { + newdirblk = WK_NEWDIRBLK(wk); + WORKLIST_REMOVE(&newdirblk->db_list); + if (LIST_FIRST(&adp->ad_newdirblk) != NULL) + panic("free_allocdirect: extra newdirblk"); + if (delay) + WORKLIST_INSERT(&adp->ad_inodedep->id_bufwait, + &newdirblk->db_list); + else + free_newdirblk(newdirblk); + } + WORKITEM_FREE(adp, D_ALLOCDIRECT); +} + +/* + * Free a newdirblk. 
Clear the NEWBLOCK flag on its associated pagedep. + * This routine must be called with splbio interrupts blocked. + */ +static void +free_newdirblk(newdirblk) + struct newdirblk *newdirblk; +{ + struct pagedep *pagedep; + struct diradd *dap; + int i; + +#ifdef DEBUG + if (lk.lkt_held == NOHOLDER) + panic("free_newdirblk: lock not held"); +#endif + /* + * If the pagedep is still linked onto the directory buffer + * dependency chain, then some of the entries on the + * pd_pendinghd list may not be committed to disk yet. In + * this case, we will simply clear the NEWBLOCK flag and + * let the pd_pendinghd list be processed when the pagedep + * is next written. If the pagedep is no longer on the buffer + * dependency chain, then all the entries on the pd_pending + * list are committed to disk and we can free them here. + */ + pagedep = newdirblk->db_pagedep; + pagedep->pd_state &= ~NEWBLOCK; + if ((pagedep->pd_state & ONWORKLIST) == 0) + while ((dap = LIST_FIRST(&pagedep->pd_pendinghd)) != NULL) + free_diradd(dap); + /* + * If no dependencies remain, the pagedep will be freed. + */ + for (i = 0; i < DAHASHSZ; i++) + if (LIST_FIRST(&pagedep->pd_diraddhd[i]) != NULL) + break; + if (i == DAHASHSZ && (pagedep->pd_state & ONWORKLIST) == 0) { + LIST_REMOVE(pagedep, pd_hash); + WORKITEM_FREE(pagedep, D_PAGEDEP); + } + WORKITEM_FREE(newdirblk, D_NEWDIRBLK); +} + +/* + * Prepare an inode to be freed. The actual free operation is not + * done until the zero'ed inode has been written to disk. + */ +void +softdep_freefile(pvp, ino, mode) + struct vnode *pvp; + ino_t ino; + int mode; +{ + struct inode *ip = VTOI(pvp); + struct inodedep *inodedep; + struct freefile *freefile; + + /* + * This sets up the inode de-allocation dependency. + */ + MALLOC(freefile, struct freefile *, sizeof(struct freefile), + M_FREEFILE, M_SOFTDEP_FLAGS); + freefile->fx_list.wk_type = D_FREEFILE; + freefile->fx_list.wk_state = 0; + freefile->fx_mode = mode; + freefile->fx_oldinum = ino; + freefile->fx_devvp = ip->i_devvp; + freefile->fx_mnt = ITOV(ip)->v_mount; + if ((ip->i_flag & IN_SPACECOUNTED) == 0) + ip->i_fs->fs_pendinginodes += 1; + + /* + * If the inodedep does not exist, then the zero'ed inode has + * been written to disk. If the allocated inode has never been + * written to disk, then the on-disk inode is zero'ed. In either + * case we can free the file immediately. + */ + ACQUIRE_LOCK(&lk); + if (inodedep_lookup(ip->i_fs, ino, 0, &inodedep) == 0 || + check_inode_unwritten(inodedep)) { + FREE_LOCK(&lk); + handle_workitem_freefile(freefile); + return; + } + WORKLIST_INSERT(&inodedep->id_inowait, &freefile->fx_list); + FREE_LOCK(&lk); +} + +/* + * Check to see if an inode has never been written to disk. If + * so free the inodedep and return success, otherwise return failure. + * This routine must be called with splbio interrupts blocked. + * + * If we still have a bitmap dependency, then the inode has never + * been written to disk. Drop the dependency as it is no longer + * necessary since the inode is being deallocated. We set the + * ALLCOMPLETE flags since the bitmap now properly shows that the + * inode is not allocated. Even if the inode is actively being + * written, it has been rolled back to its zero'ed state, so we + * are ensured that a zero inode is what is on the disk. For short + * lived files, this change will usually result in removing all the + * dependencies from the inode so that it can be freed immediately. 
+ */
+static int
+check_inode_unwritten(inodedep)
+	struct inodedep *inodedep;
+{
+
+	if ((inodedep->id_state & DEPCOMPLETE) != 0 ||
+	    LIST_FIRST(&inodedep->id_pendinghd) != NULL ||
+	    LIST_FIRST(&inodedep->id_bufwait) != NULL ||
+	    LIST_FIRST(&inodedep->id_inowait) != NULL ||
+	    TAILQ_FIRST(&inodedep->id_inoupdt) != NULL ||
+	    TAILQ_FIRST(&inodedep->id_newinoupdt) != NULL ||
+	    TAILQ_FIRST(&inodedep->id_extupdt) != NULL ||
+	    TAILQ_FIRST(&inodedep->id_newextupdt) != NULL ||
+	    inodedep->id_nlinkdelta != 0)
+		return (0);
+	inodedep->id_state |= ALLCOMPLETE;
+	LIST_REMOVE(inodedep, id_deps);
+	inodedep->id_buf = NULL;
+	if (inodedep->id_state & ONWORKLIST)
+		WORKLIST_REMOVE(&inodedep->id_list);
+	if (inodedep->id_savedino1 != NULL) {
+		FREE(inodedep->id_savedino1, M_INODEDEP);
+		inodedep->id_savedino1 = NULL;
+	}
+	if (free_inodedep(inodedep) == 0) {
+		FREE_LOCK(&lk);
+		panic("check_inode_unwritten: busy inode");
+	}
+	return (1);
+}
+
+/*
+ * Try to free an inodedep structure. Return 1 if it could be freed.
+ */
+static int
+free_inodedep(inodedep)
+	struct inodedep *inodedep;
+{
+
+	if ((inodedep->id_state & ONWORKLIST) != 0 ||
+	    (inodedep->id_state & ALLCOMPLETE) != ALLCOMPLETE ||
+	    LIST_FIRST(&inodedep->id_pendinghd) != NULL ||
+	    LIST_FIRST(&inodedep->id_bufwait) != NULL ||
+	    LIST_FIRST(&inodedep->id_inowait) != NULL ||
+	    TAILQ_FIRST(&inodedep->id_inoupdt) != NULL ||
+	    TAILQ_FIRST(&inodedep->id_newinoupdt) != NULL ||
+	    TAILQ_FIRST(&inodedep->id_extupdt) != NULL ||
+	    TAILQ_FIRST(&inodedep->id_newextupdt) != NULL ||
+	    inodedep->id_nlinkdelta != 0 || inodedep->id_savedino1 != NULL)
+		return (0);
+	LIST_REMOVE(inodedep, id_hash);
+	WORKITEM_FREE(inodedep, D_INODEDEP);
+	num_inodedep -= 1;
+	return (1);
+}
+
+/*
+ * This workitem routine performs the block de-allocation.
+ * The workitem is added to the pending list after the updated
+ * inode block has been written to disk. As mentioned above,
+ * checks regarding the number of blocks de-allocated (compared
+ * to the number of blocks allocated for the file) are also
+ * performed in this function.
+ */
+static void
+handle_workitem_freeblocks(freeblks, flags)
+	struct freeblks *freeblks;
+	int flags;
+{
+	struct inode *ip;
+	struct vnode *vp;
+	struct fs *fs;
+	int i, nblocks, level, bsize;
+	ufs2_daddr_t bn, blocksreleased = 0;
+	int error, allerror = 0;
+	ufs_lbn_t baselbns[NIADDR], tmpval;
+
+	fs = VFSTOUFS(freeblks->fb_mnt)->um_fs;
+	tmpval = 1;
+	baselbns[0] = NDADDR;
+	for (i = 1; i < NIADDR; i++) {
+		tmpval *= NINDIR(fs);
+		baselbns[i] = baselbns[i - 1] + tmpval;
+	}
+	nblocks = btodb(fs->fs_bsize);
+	blocksreleased = 0;
+	/*
+	 * Release all extended attribute blocks or frags.
+	 */
+	if (freeblks->fb_oldextsize > 0) {
+		for (i = (NXADDR - 1); i >= 0; i--) {
+			if ((bn = freeblks->fb_eblks[i]) == 0)
+				continue;
+			bsize = sblksize(fs, freeblks->fb_oldextsize, i);
+			ffs_blkfree(fs, freeblks->fb_devvp, bn, bsize,
+			    freeblks->fb_previousinum);
+			blocksreleased += btodb(bsize);
+		}
+	}
+	/*
+	 * Release all data blocks or frags.
+	 */
+	if (freeblks->fb_oldsize > 0) {
+		/*
+		 * Indirect blocks first.
+		 */
+		for (level = (NIADDR - 1); level >= 0; level--) {
+			if ((bn = freeblks->fb_iblks[level]) == 0)
+				continue;
+			if ((error = indir_trunc(freeblks, fsbtodb(fs, bn),
+			    level, baselbns[level], &blocksreleased)) != 0)
+				allerror = error;
+			ffs_blkfree(fs, freeblks->fb_devvp, bn, fs->fs_bsize,
+			    freeblks->fb_previousinum);
+			fs->fs_pendingblocks -= nblocks;
+			blocksreleased += nblocks;
+		}
+		/*
+		 * All direct blocks or frags.
+ */ + for (i = (NDADDR - 1); i >= 0; i--) { + if ((bn = freeblks->fb_dblks[i]) == 0) + continue; + bsize = sblksize(fs, freeblks->fb_oldsize, i); + ffs_blkfree(fs, freeblks->fb_devvp, bn, bsize, + freeblks->fb_previousinum); + fs->fs_pendingblocks -= btodb(bsize); + blocksreleased += btodb(bsize); + } + } + /* + * If we still have not finished background cleanup, then check + * to see if the block count needs to be adjusted. + */ + if (freeblks->fb_chkcnt != blocksreleased && + (fs->fs_flags & FS_UNCLEAN) != 0 && + VFS_VGET(freeblks->fb_mnt, freeblks->fb_previousinum, + (flags & LK_NOWAIT) | LK_EXCLUSIVE, &vp) == 0) { + ip = VTOI(vp); + DIP(ip, i_blocks) += freeblks->fb_chkcnt - blocksreleased; + ip->i_flag |= IN_CHANGE; + vput(vp); + } + +#ifdef DIAGNOSTIC + if (freeblks->fb_chkcnt != blocksreleased && + ((fs->fs_flags & FS_UNCLEAN) == 0 || (flags & LK_NOWAIT) != 0)) + printf("handle_workitem_freeblocks: block count\n"); + if (allerror) + softdep_error("handle_workitem_freeblks", allerror); +#endif /* DIAGNOSTIC */ + + WORKITEM_FREE(freeblks, D_FREEBLKS); +} + +/* + * Release blocks associated with the inode ip and stored in the indirect + * block dbn. If level is greater than SINGLE, the block is an indirect block + * and recursive calls to indirtrunc must be used to cleanse other indirect + * blocks. + */ +static int +indir_trunc(freeblks, dbn, level, lbn, countp) + struct freeblks *freeblks; + ufs2_daddr_t dbn; + int level; + ufs_lbn_t lbn; + ufs2_daddr_t *countp; +{ + struct buf *bp; + struct fs *fs; + struct worklist *wk; + struct indirdep *indirdep; + ufs1_daddr_t *bap1 = 0; + ufs2_daddr_t nb, *bap2 = 0; + ufs_lbn_t lbnadd; + int i, nblocks, ufs1fmt; + int error, allerror = 0; + + fs = VFSTOUFS(freeblks->fb_mnt)->um_fs; + lbnadd = 1; + for (i = level; i > 0; i--) + lbnadd *= NINDIR(fs); + /* + * Get buffer of block pointers to be freed. This routine is not + * called until the zero'ed inode has been written, so it is safe + * to free blocks as they are encountered. Because the inode has + * been zero'ed, calls to bmap on these blocks will fail. So, we + * have to use the on-disk address and the block device for the + * filesystem to look them up. If the file was deleted before its + * indirect blocks were all written to disk, the routine that set + * us up (deallocate_dependencies) will have arranged to leave + * a complete copy of the indirect block in memory for our use. + * Otherwise we have to read the blocks in from the disk. + */ +#ifdef notyet + bp = getblk(freeblks->fb_devvp, dbn, (int)fs->fs_bsize, 0, 0, + GB_NOCREAT); +#else + bp = incore(freeblks->fb_devvp, dbn); +#endif + ACQUIRE_LOCK(&lk); + if (bp != NULL && (wk = LIST_FIRST(&bp->b_dep)) != NULL) { + if (wk->wk_type != D_INDIRDEP || + (indirdep = WK_INDIRDEP(wk))->ir_savebp != bp || + (indirdep->ir_state & GOINGAWAY) == 0) { + FREE_LOCK(&lk); + panic("indir_trunc: lost indirdep"); + } + WORKLIST_REMOVE(wk); + WORKITEM_FREE(indirdep, D_INDIRDEP); + if (LIST_FIRST(&bp->b_dep) != NULL) { + FREE_LOCK(&lk); + panic("indir_trunc: dangling dep"); + } + VFSTOUFS(freeblks->fb_mnt)->um_numindirdeps -= 1; + FREE_LOCK(&lk); + } else { +#ifdef notyet + if (bp) + brelse(bp); +#endif + FREE_LOCK(&lk); + error = bread(freeblks->fb_devvp, dbn, (int)fs->fs_bsize, + NOCRED, &bp); + if (error) { + brelse(bp); + return (error); + } + } + /* + * Recursively free indirect blocks. 
+ */ + if (VFSTOUFS(freeblks->fb_mnt)->um_fstype == UFS1) { + ufs1fmt = 1; + bap1 = (ufs1_daddr_t *)bp->b_data; + } else { + ufs1fmt = 0; + bap2 = (ufs2_daddr_t *)bp->b_data; + } + nblocks = btodb(fs->fs_bsize); + for (i = NINDIR(fs) - 1; i >= 0; i--) { + if (ufs1fmt) + nb = bap1[i]; + else + nb = bap2[i]; + if (nb == 0) + continue; + if (level != 0) { + if ((error = indir_trunc(freeblks, fsbtodb(fs, nb), + level - 1, lbn + (i * lbnadd), countp)) != 0) + allerror = error; + } + ffs_blkfree(fs, freeblks->fb_devvp, nb, fs->fs_bsize, + freeblks->fb_previousinum); + fs->fs_pendingblocks -= nblocks; + *countp += nblocks; + } + bp->b_flags |= B_INVAL | B_NOCACHE; + brelse(bp); + return (allerror); +} + +/* + * Free an allocindir. + * This routine must be called with splbio interrupts blocked. + */ +static void +free_allocindir(aip, inodedep) + struct allocindir *aip; + struct inodedep *inodedep; +{ + struct freefrag *freefrag; + +#ifdef DEBUG + if (lk.lkt_held == NOHOLDER) + panic("free_allocindir: lock not held"); +#endif + if ((aip->ai_state & DEPCOMPLETE) == 0) + LIST_REMOVE(aip, ai_deps); + if (aip->ai_state & ONWORKLIST) + WORKLIST_REMOVE(&aip->ai_list); + LIST_REMOVE(aip, ai_next); + if ((freefrag = aip->ai_freefrag) != NULL) { + if (inodedep == NULL) + add_to_worklist(&freefrag->ff_list); + else + WORKLIST_INSERT(&inodedep->id_bufwait, + &freefrag->ff_list); + } + WORKITEM_FREE(aip, D_ALLOCINDIR); +} + +/* + * Directory entry addition dependencies. + * + * When adding a new directory entry, the inode (with its incremented link + * count) must be written to disk before the directory entry's pointer to it. + * Also, if the inode is newly allocated, the corresponding freemap must be + * updated (on disk) before the directory entry's pointer. These requirements + * are met via undo/redo on the directory entry's pointer, which consists + * simply of the inode number. + * + * As directory entries are added and deleted, the free space within a + * directory block can become fragmented. The ufs filesystem will compact + * a fragmented directory block to make space for a new entry. When this + * occurs, the offsets of previously added entries change. Any "diradd" + * dependency structures corresponding to these entries must be updated with + * the new offsets. + */ + +/* + * This routine is called after the in-memory inode's link + * count has been incremented, but before the directory entry's + * pointer to the inode has been set. + */ +int +softdep_setup_directory_add(bp, dp, diroffset, newinum, newdirbp, isnewblk) + struct buf *bp; /* buffer containing directory block */ + struct inode *dp; /* inode for directory */ + off_t diroffset; /* offset of new entry in directory */ + ino_t newinum; /* inode referenced by new directory entry */ + struct buf *newdirbp; /* non-NULL => contents of new mkdir */ + int isnewblk; /* entry is in a newly allocated block */ +{ + int offset; /* offset of new entry within directory block */ + ufs_lbn_t lbn; /* block in directory containing new entry */ + struct fs *fs; + struct diradd *dap; + struct allocdirect *adp; + struct pagedep *pagedep; + struct inodedep *inodedep; + struct newdirblk *newdirblk = 0; + struct mkdir *mkdir1, *mkdir2; + + /* + * Whiteouts have no dependencies. 
+ */ + if (newinum == WINO) { + if (newdirbp != NULL) + bdwrite(newdirbp); + return (0); + } + + fs = dp->i_fs; + lbn = lblkno(fs, diroffset); + offset = blkoff(fs, diroffset); + MALLOC(dap, struct diradd *, sizeof(struct diradd), M_DIRADD, + M_SOFTDEP_FLAGS|M_ZERO); + dap->da_list.wk_type = D_DIRADD; + dap->da_offset = offset; + dap->da_newinum = newinum; + dap->da_state = ATTACHED; + if (isnewblk && lbn < NDADDR && fragoff(fs, diroffset) == 0) { + MALLOC(newdirblk, struct newdirblk *, sizeof(struct newdirblk), + M_NEWDIRBLK, M_SOFTDEP_FLAGS); + newdirblk->db_list.wk_type = D_NEWDIRBLK; + newdirblk->db_state = 0; + } + if (newdirbp == NULL) { + dap->da_state |= DEPCOMPLETE; + ACQUIRE_LOCK(&lk); + } else { + dap->da_state |= MKDIR_BODY | MKDIR_PARENT; + MALLOC(mkdir1, struct mkdir *, sizeof(struct mkdir), M_MKDIR, + M_SOFTDEP_FLAGS); + mkdir1->md_list.wk_type = D_MKDIR; + mkdir1->md_state = MKDIR_BODY; + mkdir1->md_diradd = dap; + MALLOC(mkdir2, struct mkdir *, sizeof(struct mkdir), M_MKDIR, + M_SOFTDEP_FLAGS); + mkdir2->md_list.wk_type = D_MKDIR; + mkdir2->md_state = MKDIR_PARENT; + mkdir2->md_diradd = dap; + /* + * Dependency on "." and ".." being written to disk. + */ + mkdir1->md_buf = newdirbp; + ACQUIRE_LOCK(&lk); + LIST_INSERT_HEAD(&mkdirlisthd, mkdir1, md_mkdirs); + WORKLIST_INSERT(&newdirbp->b_dep, &mkdir1->md_list); + FREE_LOCK(&lk); + bdwrite(newdirbp); + /* + * Dependency on link count increase for parent directory + */ + ACQUIRE_LOCK(&lk); + if (inodedep_lookup(fs, dp->i_number, 0, &inodedep) == 0 + || (inodedep->id_state & ALLCOMPLETE) == ALLCOMPLETE) { + dap->da_state &= ~MKDIR_PARENT; + WORKITEM_FREE(mkdir2, D_MKDIR); + } else { + LIST_INSERT_HEAD(&mkdirlisthd, mkdir2, md_mkdirs); + WORKLIST_INSERT(&inodedep->id_bufwait,&mkdir2->md_list); + } + } + /* + * Link into parent directory pagedep to await its being written. + */ + if (pagedep_lookup(dp, lbn, DEPALLOC, &pagedep) == 0) + WORKLIST_INSERT(&bp->b_dep, &pagedep->pd_list); + dap->da_pagedep = pagedep; + LIST_INSERT_HEAD(&pagedep->pd_diraddhd[DIRADDHASH(offset)], dap, + da_pdlist); + /* + * Link into its inodedep. Put it on the id_bufwait list if the inode + * is not yet written. If it is written, do the post-inode write + * processing to put it on the id_pendinghd list. + */ + (void) inodedep_lookup(fs, newinum, DEPALLOC, &inodedep); + if ((inodedep->id_state & ALLCOMPLETE) == ALLCOMPLETE) + diradd_inode_written(dap, inodedep); + else + WORKLIST_INSERT(&inodedep->id_bufwait, &dap->da_list); + if (isnewblk) { + /* + * Directories growing into indirect blocks are rare + * enough and the frequency of new block allocation + * in those cases even more rare, that we choose not + * to bother tracking them. Rather we simply force the + * new directory entry to disk. + */ + if (lbn >= NDADDR) { + FREE_LOCK(&lk); + /* + * We only have a new allocation when at the + * beginning of a new block, not when we are + * expanding into an existing block. + */ + if (blkoff(fs, diroffset) == 0) + return (1); + return (0); + } + /* + * We only have a new allocation when at the beginning + * of a new fragment, not when we are expanding into an + * existing fragment. Also, there is nothing to do if we + * are already tracking this block. + */ + if (fragoff(fs, diroffset) != 0) { + FREE_LOCK(&lk); + return (0); + } + if ((pagedep->pd_state & NEWBLOCK) != 0) { + WORKITEM_FREE(newdirblk, D_NEWDIRBLK); + FREE_LOCK(&lk); + return (0); + } + /* + * Find our associated allocdirect and have it track us. 
+ */ + if (inodedep_lookup(fs, dp->i_number, 0, &inodedep) == 0) + panic("softdep_setup_directory_add: lost inodedep"); + adp = TAILQ_LAST(&inodedep->id_newinoupdt, allocdirectlst); + if (adp == NULL || adp->ad_lbn != lbn) { + FREE_LOCK(&lk); + panic("softdep_setup_directory_add: lost entry"); + } + pagedep->pd_state |= NEWBLOCK; + newdirblk->db_pagedep = pagedep; + WORKLIST_INSERT(&adp->ad_newdirblk, &newdirblk->db_list); + } + FREE_LOCK(&lk); + return (0); +} + +/* + * This procedure is called to change the offset of a directory + * entry when compacting a directory block which must be owned + * exclusively by the caller. Note that the actual entry movement + * must be done in this procedure to ensure that no I/O completions + * occur while the move is in progress. + */ +void +softdep_change_directoryentry_offset(dp, base, oldloc, newloc, entrysize) + struct inode *dp; /* inode for directory */ + caddr_t base; /* address of dp->i_offset */ + caddr_t oldloc; /* address of old directory location */ + caddr_t newloc; /* address of new directory location */ + int entrysize; /* size of directory entry */ +{ + int offset, oldoffset, newoffset; + struct pagedep *pagedep; + struct diradd *dap; + ufs_lbn_t lbn; + + ACQUIRE_LOCK(&lk); + lbn = lblkno(dp->i_fs, dp->i_offset); + offset = blkoff(dp->i_fs, dp->i_offset); + if (pagedep_lookup(dp, lbn, 0, &pagedep) == 0) + goto done; + oldoffset = offset + (oldloc - base); + newoffset = offset + (newloc - base); + + LIST_FOREACH(dap, &pagedep->pd_diraddhd[DIRADDHASH(oldoffset)], da_pdlist) { + if (dap->da_offset != oldoffset) + continue; + dap->da_offset = newoffset; + if (DIRADDHASH(newoffset) == DIRADDHASH(oldoffset)) + break; + LIST_REMOVE(dap, da_pdlist); + LIST_INSERT_HEAD(&pagedep->pd_diraddhd[DIRADDHASH(newoffset)], + dap, da_pdlist); + break; + } + if (dap == NULL) { + + LIST_FOREACH(dap, &pagedep->pd_pendinghd, da_pdlist) { + if (dap->da_offset == oldoffset) { + dap->da_offset = newoffset; + break; + } + } + } +done: + bcopy(oldloc, newloc, entrysize); + FREE_LOCK(&lk); +} + +/* + * Free a diradd dependency structure. This routine must be called + * with splbio interrupts blocked. + */ +static void +free_diradd(dap) + struct diradd *dap; +{ + struct dirrem *dirrem; + struct pagedep *pagedep; + struct inodedep *inodedep; + struct mkdir *mkdir, *nextmd; + +#ifdef DEBUG + if (lk.lkt_held == NOHOLDER) + panic("free_diradd: lock not held"); +#endif + WORKLIST_REMOVE(&dap->da_list); + LIST_REMOVE(dap, da_pdlist); + if ((dap->da_state & DIRCHG) == 0) { + pagedep = dap->da_pagedep; + } else { + dirrem = dap->da_previous; + pagedep = dirrem->dm_pagedep; + dirrem->dm_dirinum = pagedep->pd_ino; + add_to_worklist(&dirrem->dm_list); + } + if (inodedep_lookup(VFSTOUFS(pagedep->pd_mnt)->um_fs, dap->da_newinum, + 0, &inodedep) != 0) + (void) free_inodedep(inodedep); + if ((dap->da_state & (MKDIR_PARENT | MKDIR_BODY)) != 0) { + for (mkdir = LIST_FIRST(&mkdirlisthd); mkdir; mkdir = nextmd) { + nextmd = LIST_NEXT(mkdir, md_mkdirs); + if (mkdir->md_diradd != dap) + continue; + dap->da_state &= ~mkdir->md_state; + WORKLIST_REMOVE(&mkdir->md_list); + LIST_REMOVE(mkdir, md_mkdirs); + WORKITEM_FREE(mkdir, D_MKDIR); + } + if ((dap->da_state & (MKDIR_PARENT | MKDIR_BODY)) != 0) { + FREE_LOCK(&lk); + panic("free_diradd: unfound ref"); + } + } + WORKITEM_FREE(dap, D_DIRADD); +} + +/* + * Directory entry removal dependencies. 
+ * + * When removing a directory entry, the entry's inode pointer must be + * zero'ed on disk before the corresponding inode's link count is decremented + * (possibly freeing the inode for re-use). This dependency is handled by + * updating the directory entry but delaying the inode count reduction until + * after the directory block has been written to disk. After this point, the + * inode count can be decremented whenever it is convenient. + */ + +/* + * This routine should be called immediately after removing + * a directory entry. The inode's link count should not be + * decremented by the calling procedure -- the soft updates + * code will do this task when it is safe. + */ +void +softdep_setup_remove(bp, dp, ip, isrmdir) + struct buf *bp; /* buffer containing directory block */ + struct inode *dp; /* inode for the directory being modified */ + struct inode *ip; /* inode for directory entry being removed */ + int isrmdir; /* indicates if doing RMDIR */ +{ + struct dirrem *dirrem, *prevdirrem; + + /* + * Allocate a new dirrem if appropriate and ACQUIRE_LOCK. + */ + dirrem = newdirrem(bp, dp, ip, isrmdir, &prevdirrem); + + /* + * If the COMPLETE flag is clear, then there were no active + * entries and we want to roll back to a zeroed entry until + * the new inode is committed to disk. If the COMPLETE flag is + * set then we have deleted an entry that never made it to + * disk. If the entry we deleted resulted from a name change, + * then the old name still resides on disk. We cannot delete + * its inode (returned to us in prevdirrem) until the zeroed + * directory entry gets to disk. The new inode has never been + * referenced on the disk, so can be deleted immediately. + */ + if ((dirrem->dm_state & COMPLETE) == 0) { + LIST_INSERT_HEAD(&dirrem->dm_pagedep->pd_dirremhd, dirrem, + dm_next); + FREE_LOCK(&lk); + } else { + if (prevdirrem != NULL) + LIST_INSERT_HEAD(&dirrem->dm_pagedep->pd_dirremhd, + prevdirrem, dm_next); + dirrem->dm_dirinum = dirrem->dm_pagedep->pd_ino; + FREE_LOCK(&lk); + handle_workitem_remove(dirrem, NULL); + } +} + +/* + * Allocate a new dirrem if appropriate and return it along with + * its associated pagedep. Called without a lock, returns with lock. + */ +static long num_dirrem; /* number of dirrem allocated */ +static struct dirrem * +newdirrem(bp, dp, ip, isrmdir, prevdirremp) + struct buf *bp; /* buffer containing directory block */ + struct inode *dp; /* inode for the directory being modified */ + struct inode *ip; /* inode for directory entry being removed */ + int isrmdir; /* indicates if doing RMDIR */ + struct dirrem **prevdirremp; /* previously referenced inode, if any */ +{ + int offset; + ufs_lbn_t lbn; + struct diradd *dap; + struct dirrem *dirrem; + struct pagedep *pagedep; + + /* + * Whiteouts have no deletion dependencies. + */ + if (ip == NULL) + panic("newdirrem: whiteout"); + /* + * If we are over our limit, try to improve the situation. + * Limiting the number of dirrem structures will also limit + * the number of freefile and freeblks structures. + */ + if (num_dirrem > max_softdeps / 2) + (void) request_cleanup(FLUSH_REMOVE, 0); + num_dirrem += 1; + MALLOC(dirrem, struct dirrem *, sizeof(struct dirrem), + M_DIRREM, M_SOFTDEP_FLAGS|M_ZERO); + dirrem->dm_list.wk_type = D_DIRREM; + dirrem->dm_state = isrmdir ? 
RMDIR : 0; + dirrem->dm_mnt = ITOV(ip)->v_mount; + dirrem->dm_oldinum = ip->i_number; + *prevdirremp = NULL; + + ACQUIRE_LOCK(&lk); + lbn = lblkno(dp->i_fs, dp->i_offset); + offset = blkoff(dp->i_fs, dp->i_offset); + if (pagedep_lookup(dp, lbn, DEPALLOC, &pagedep) == 0) + WORKLIST_INSERT(&bp->b_dep, &pagedep->pd_list); + dirrem->dm_pagedep = pagedep; + /* + * Check for a diradd dependency for the same directory entry. + * If present, then both dependencies become obsolete and can + * be de-allocated. Check for an entry on both the pd_dirraddhd + * list and the pd_pendinghd list. + */ + + LIST_FOREACH(dap, &pagedep->pd_diraddhd[DIRADDHASH(offset)], da_pdlist) + if (dap->da_offset == offset) + break; + if (dap == NULL) { + + LIST_FOREACH(dap, &pagedep->pd_pendinghd, da_pdlist) + if (dap->da_offset == offset) + break; + if (dap == NULL) + return (dirrem); + } + /* + * Must be ATTACHED at this point. + */ + if ((dap->da_state & ATTACHED) == 0) { + FREE_LOCK(&lk); + panic("newdirrem: not ATTACHED"); + } + if (dap->da_newinum != ip->i_number) { + FREE_LOCK(&lk); + panic("newdirrem: inum %d should be %d", + ip->i_number, dap->da_newinum); + } + /* + * If we are deleting a changed name that never made it to disk, + * then return the dirrem describing the previous inode (which + * represents the inode currently referenced from this entry on disk). + */ + if ((dap->da_state & DIRCHG) != 0) { + *prevdirremp = dap->da_previous; + dap->da_state &= ~DIRCHG; + dap->da_pagedep = pagedep; + } + /* + * We are deleting an entry that never made it to disk. + * Mark it COMPLETE so we can delete its inode immediately. + */ + dirrem->dm_state |= COMPLETE; + free_diradd(dap); + return (dirrem); +} + +/* + * Directory entry change dependencies. + * + * Changing an existing directory entry requires that an add operation + * be completed first followed by a deletion. The semantics for the addition + * are identical to the description of adding a new entry above except + * that the rollback is to the old inode number rather than zero. Once + * the addition dependency is completed, the removal is done as described + * in the removal routine above. + */ + +/* + * This routine should be called immediately after changing + * a directory entry. The inode's link count should not be + * decremented by the calling procedure -- the soft updates + * code will perform this task when it is safe. + */ +void +softdep_setup_directory_change(bp, dp, ip, newinum, isrmdir) + struct buf *bp; /* buffer containing directory block */ + struct inode *dp; /* inode for the directory being modified */ + struct inode *ip; /* inode for directory entry being removed */ + ino_t newinum; /* new inode number for changed entry */ + int isrmdir; /* indicates if doing RMDIR */ +{ + int offset; + struct diradd *dap = NULL; + struct dirrem *dirrem, *prevdirrem; + struct pagedep *pagedep; + struct inodedep *inodedep; + + offset = blkoff(dp->i_fs, dp->i_offset); + + /* + * Whiteouts do not need diradd dependencies. + */ + if (newinum != WINO) { + MALLOC(dap, struct diradd *, sizeof(struct diradd), + M_DIRADD, M_SOFTDEP_FLAGS|M_ZERO); + dap->da_list.wk_type = D_DIRADD; + dap->da_state = DIRCHG | ATTACHED | DEPCOMPLETE; + dap->da_offset = offset; + dap->da_newinum = newinum; + } + + /* + * Allocate a new dirrem and ACQUIRE_LOCK. 
+ */ + dirrem = newdirrem(bp, dp, ip, isrmdir, &prevdirrem); + pagedep = dirrem->dm_pagedep; + /* + * The possible values for isrmdir: + * 0 - non-directory file rename + * 1 - directory rename within same directory + * inum - directory rename to new directory of given inode number + * When renaming to a new directory, we are both deleting and + * creating a new directory entry, so the link count on the new + * directory should not change. Thus we do not need the followup + * dirrem which is usually done in handle_workitem_remove. We set + * the DIRCHG flag to tell handle_workitem_remove to skip the + * followup dirrem. + */ + if (isrmdir > 1) + dirrem->dm_state |= DIRCHG; + + /* + * Whiteouts have no additional dependencies, + * so just put the dirrem on the correct list. + */ + if (newinum == WINO) { + if ((dirrem->dm_state & COMPLETE) == 0) { + LIST_INSERT_HEAD(&pagedep->pd_dirremhd, dirrem, + dm_next); + } else { + dirrem->dm_dirinum = pagedep->pd_ino; + add_to_worklist(&dirrem->dm_list); + } + FREE_LOCK(&lk); + return; + } + + /* + * If the COMPLETE flag is clear, then there were no active + * entries and we want to roll back to the previous inode until + * the new inode is committed to disk. If the COMPLETE flag is + * set, then we have deleted an entry that never made it to disk. + * If the entry we deleted resulted from a name change, then the old + * inode reference still resides on disk. Any rollback that we do + * needs to be to that old inode (returned to us in prevdirrem). If + * the entry we deleted resulted from a create, then there is + * no entry on the disk, so we want to roll back to zero rather + * than the uncommitted inode. In either of the COMPLETE cases we + * want to immediately free the unwritten and unreferenced inode. + */ + if ((dirrem->dm_state & COMPLETE) == 0) { + dap->da_previous = dirrem; + } else { + if (prevdirrem != NULL) { + dap->da_previous = prevdirrem; + } else { + dap->da_state &= ~DIRCHG; + dap->da_pagedep = pagedep; + } + dirrem->dm_dirinum = pagedep->pd_ino; + add_to_worklist(&dirrem->dm_list); + } + /* + * Link into its inodedep. Put it on the id_bufwait list if the inode + * is not yet written. If it is written, do the post-inode write + * processing to put it on the id_pendinghd list. + */ + if (inodedep_lookup(dp->i_fs, newinum, DEPALLOC, &inodedep) == 0 || + (inodedep->id_state & ALLCOMPLETE) == ALLCOMPLETE) { + dap->da_state |= COMPLETE; + LIST_INSERT_HEAD(&pagedep->pd_pendinghd, dap, da_pdlist); + WORKLIST_INSERT(&inodedep->id_pendinghd, &dap->da_list); + } else { + LIST_INSERT_HEAD(&pagedep->pd_diraddhd[DIRADDHASH(offset)], + dap, da_pdlist); + WORKLIST_INSERT(&inodedep->id_bufwait, &dap->da_list); + } + FREE_LOCK(&lk); +} + +/* + * Called whenever the link count on an inode is changed. + * It creates an inode dependency so that the new reference(s) + * to the inode cannot be committed to disk until the updated + * inode has been written. + */ +void +softdep_change_linkcnt(ip) + struct inode *ip; /* the inode with the increased link count */ +{ + struct inodedep *inodedep; + + ACQUIRE_LOCK(&lk); + (void) inodedep_lookup(ip->i_fs, ip->i_number, DEPALLOC, &inodedep); + if (ip->i_nlink < ip->i_effnlink) { + FREE_LOCK(&lk); + panic("softdep_change_linkcnt: bad delta"); + } + inodedep->id_nlinkdelta = ip->i_nlink - ip->i_effnlink; + FREE_LOCK(&lk); +} + +/* + * Called when the effective link count and the reference count + * on an inode drops to zero. 
At this point there are no names + * referencing the file in the filesystem and no active file + * references. The space associated with the file will be freed + * as soon as the necessary soft dependencies are cleared. + */ +void +softdep_releasefile(ip) + struct inode *ip; /* inode with the zero effective link count */ +{ + struct inodedep *inodedep; + struct fs *fs; + int extblocks; + + if (ip->i_effnlink > 0) + panic("softdep_filerelease: file still referenced"); + /* + * We may be called several times as the real reference count + * drops to zero. We only want to account for the space once. + */ + if (ip->i_flag & IN_SPACECOUNTED) + return; + /* + * We have to deactivate a snapshot otherwise copyonwrites may + * add blocks and the cleanup may remove blocks after we have + * tried to account for them. + */ + if ((ip->i_flags & SF_SNAPSHOT) != 0) + ffs_snapremove(ITOV(ip)); + /* + * If we are tracking an nlinkdelta, we have to also remember + * whether we accounted for the freed space yet. + */ + ACQUIRE_LOCK(&lk); + if ((inodedep_lookup(ip->i_fs, ip->i_number, 0, &inodedep))) + inodedep->id_state |= SPACECOUNTED; + FREE_LOCK(&lk); + fs = ip->i_fs; + extblocks = 0; + if (fs->fs_magic == FS_UFS2_MAGIC) + extblocks = btodb(fragroundup(fs, ip->i_din2->di_extsize)); + ip->i_fs->fs_pendingblocks += DIP(ip, i_blocks) - extblocks; + ip->i_fs->fs_pendinginodes += 1; + ip->i_flag |= IN_SPACECOUNTED; +} + +/* + * This workitem decrements the inode's link count. + * If the link count reaches zero, the file is removed. + */ +static void +handle_workitem_remove(dirrem, xp) + struct dirrem *dirrem; + struct vnode *xp; +{ + struct thread *td = curthread; + struct inodedep *inodedep; + struct vnode *vp; + struct inode *ip; + ino_t oldinum; + int error; + + if ((vp = xp) == NULL && + (error = VFS_VGET(dirrem->dm_mnt, dirrem->dm_oldinum, LK_EXCLUSIVE, + &vp)) != 0) { + softdep_error("handle_workitem_remove: vget", error); + return; + } + ip = VTOI(vp); + ACQUIRE_LOCK(&lk); + if ((inodedep_lookup(ip->i_fs, dirrem->dm_oldinum, 0, &inodedep)) == 0){ + FREE_LOCK(&lk); + panic("handle_workitem_remove: lost inodedep"); + } + /* + * Normal file deletion. + */ + if ((dirrem->dm_state & RMDIR) == 0) { + ip->i_nlink--; + DIP(ip, i_nlink) = ip->i_nlink; + ip->i_flag |= IN_CHANGE; + if (ip->i_nlink < ip->i_effnlink) { + FREE_LOCK(&lk); + panic("handle_workitem_remove: bad file delta"); + } + inodedep->id_nlinkdelta = ip->i_nlink - ip->i_effnlink; + FREE_LOCK(&lk); + vput(vp); + num_dirrem -= 1; + WORKITEM_FREE(dirrem, D_DIRREM); + return; + } + /* + * Directory deletion. Decrement reference count for both the + * just deleted parent directory entry and the reference for ".". + * Next truncate the directory to length zero. When the + * truncation completes, arrange to have the reference count on + * the parent decremented to account for the loss of "..". + */ + ip->i_nlink -= 2; + DIP(ip, i_nlink) = ip->i_nlink; + ip->i_flag |= IN_CHANGE; + if (ip->i_nlink < ip->i_effnlink) { + FREE_LOCK(&lk); + panic("handle_workitem_remove: bad dir delta"); + } + inodedep->id_nlinkdelta = ip->i_nlink - ip->i_effnlink; + FREE_LOCK(&lk); + if ((error = UFS_TRUNCATE(vp, (off_t)0, 0, td->td_ucred, td)) != 0) + softdep_error("handle_workitem_remove: truncate", error); + /* + * Rename a directory to a new parent. Since, we are both deleting + * and creating a new directory entry, the link count on the new + * directory should not change. Thus we skip the followup dirrem. 
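+ * The DIRCHG flag tested just below was set for this case by
+ * softdep_setup_directory_change when isrmdir was greater than one.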
+ */ + if (dirrem->dm_state & DIRCHG) { + vput(vp); + num_dirrem -= 1; + WORKITEM_FREE(dirrem, D_DIRREM); + return; + } + /* + * If the inodedep does not exist, then the zero'ed inode has + * been written to disk. If the allocated inode has never been + * written to disk, then the on-disk inode is zero'ed. In either + * case we can remove the file immediately. + */ + ACQUIRE_LOCK(&lk); + dirrem->dm_state = 0; + oldinum = dirrem->dm_oldinum; + dirrem->dm_oldinum = dirrem->dm_dirinum; + if (inodedep_lookup(ip->i_fs, oldinum, 0, &inodedep) == 0 || + check_inode_unwritten(inodedep)) { + FREE_LOCK(&lk); + vput(vp); + handle_workitem_remove(dirrem, NULL); + return; + } + WORKLIST_INSERT(&inodedep->id_inowait, &dirrem->dm_list); + FREE_LOCK(&lk); + vput(vp); +} + +/* + * Inode de-allocation dependencies. + * + * When an inode's link count is reduced to zero, it can be de-allocated. We + * found it convenient to postpone de-allocation until after the inode is + * written to disk with its new link count (zero). At this point, all of the + * on-disk inode's block pointers are nullified and, with careful dependency + * list ordering, all dependencies related to the inode will be satisfied and + * the corresponding dependency structures de-allocated. So, if/when the + * inode is reused, there will be no mixing of old dependencies with new + * ones. This artificial dependency is set up by the block de-allocation + * procedure above (softdep_setup_freeblocks) and completed by the + * following procedure. + */ +static void +handle_workitem_freefile(freefile) + struct freefile *freefile; +{ + struct fs *fs; + struct inodedep *idp; + int error; + + fs = VFSTOUFS(freefile->fx_mnt)->um_fs; +#ifdef DEBUG + ACQUIRE_LOCK(&lk); + error = inodedep_lookup(fs, freefile->fx_oldinum, 0, &idp); + FREE_LOCK(&lk); + if (error) + panic("handle_workitem_freefile: inodedep survived"); +#endif + fs->fs_pendinginodes -= 1; + if ((error = ffs_freefile(fs, freefile->fx_devvp, freefile->fx_oldinum, + freefile->fx_mode)) != 0) + softdep_error("handle_workitem_freefile", error); + WORKITEM_FREE(freefile, D_FREEFILE); +} + +/* + * Disk writes. + * + * The dependency structures constructed above are most actively used when file + * system blocks are written to disk. No constraints are placed on when a + * block can be written, but unsatisfied update dependencies are made safe by + * modifying (or replacing) the source memory for the duration of the disk + * write. When the disk write completes, the memory block is again brought + * up-to-date. + * + * In-core inode structure reclamation. + * + * Because there are a finite number of "in-core" inode structures, they are + * reused regularly. By transferring all inode-related dependencies to the + * in-memory inode block and indexing them separately (via "inodedep"s), we + * can allow "in-core" inode structures to be reused at any time and avoid + * any increase in contention. + * + * Called just before entering the device driver to initiate a new disk I/O. + * The buffer must be locked, thus, no I/O completion operations can occur + * while we are manipulating its associated dependencies. + */ +static void +softdep_disk_io_initiation(bp) + struct buf *bp; /* structure describing disk write to occur */ +{ + struct worklist *wk, *nextwk; + struct indirdep *indirdep; + struct inodedep *inodedep; + + /* + * We only care about write operations. There should never + * be dependencies for reads. 
+ */ + if (bp->b_iocmd == BIO_READ) + panic("softdep_disk_io_initiation: read"); + /* + * Do any necessary pre-I/O processing. + */ + for (wk = LIST_FIRST(&bp->b_dep); wk; wk = nextwk) { + nextwk = LIST_NEXT(wk, wk_list); + switch (wk->wk_type) { + + case D_PAGEDEP: + initiate_write_filepage(WK_PAGEDEP(wk), bp); + continue; + + case D_INODEDEP: + inodedep = WK_INODEDEP(wk); + if (inodedep->id_fs->fs_magic == FS_UFS1_MAGIC) + initiate_write_inodeblock_ufs1(inodedep, bp); + else + initiate_write_inodeblock_ufs2(inodedep, bp); + continue; + + case D_INDIRDEP: + indirdep = WK_INDIRDEP(wk); + if (indirdep->ir_state & GOINGAWAY) + panic("disk_io_initiation: indirdep gone"); + /* + * If there are no remaining dependencies, this + * will be writing the real pointers, so the + * dependency can be freed. + */ + if (LIST_FIRST(&indirdep->ir_deplisthd) == NULL) { + indirdep->ir_savebp->b_flags |= + B_INVAL | B_NOCACHE; + brelse(indirdep->ir_savebp); + /* inline expand WORKLIST_REMOVE(wk); */ + wk->wk_state &= ~ONWORKLIST; + LIST_REMOVE(wk, wk_list); + WORKITEM_FREE(indirdep, D_INDIRDEP); + continue; + } + /* + * Replace up-to-date version with safe version. + */ + MALLOC(indirdep->ir_saveddata, caddr_t, bp->b_bcount, + M_INDIRDEP, M_SOFTDEP_FLAGS); + ACQUIRE_LOCK(&lk); + indirdep->ir_state &= ~ATTACHED; + indirdep->ir_state |= UNDONE; + bcopy(bp->b_data, indirdep->ir_saveddata, bp->b_bcount); + bcopy(indirdep->ir_savebp->b_data, bp->b_data, + bp->b_bcount); + FREE_LOCK(&lk); + continue; + + case D_MKDIR: + case D_BMSAFEMAP: + case D_ALLOCDIRECT: + case D_ALLOCINDIR: + continue; + + default: + panic("handle_disk_io_initiation: Unexpected type %s", + TYPENAME(wk->wk_type)); + /* NOTREACHED */ + } + } +} + +/* + * Called from within the procedure above to deal with unsatisfied + * allocation dependencies in a directory. The buffer must be locked, + * thus, no I/O completion operations can occur while we are + * manipulating its associated dependencies. + */ +static void +initiate_write_filepage(pagedep, bp) + struct pagedep *pagedep; + struct buf *bp; +{ + struct diradd *dap; + struct direct *ep; + int i; + + if (pagedep->pd_state & IOSTARTED) { + /* + * This can only happen if there is a driver that does not + * understand chaining. Here biodone will reissue the call + * to strategy for the incomplete buffers. + */ + printf("initiate_write_filepage: already started\n"); + return; + } + pagedep->pd_state |= IOSTARTED; + ACQUIRE_LOCK(&lk); + for (i = 0; i < DAHASHSZ; i++) { + LIST_FOREACH(dap, &pagedep->pd_diraddhd[i], da_pdlist) { + ep = (struct direct *) + ((char *)bp->b_data + dap->da_offset); + if (ep->d_ino != dap->da_newinum) { + FREE_LOCK(&lk); + panic("%s: dir inum %d != new %d", + "initiate_write_filepage", + ep->d_ino, dap->da_newinum); + } + if (dap->da_state & DIRCHG) + ep->d_ino = dap->da_previous->dm_oldinum; + else + ep->d_ino = 0; + dap->da_state &= ~ATTACHED; + dap->da_state |= UNDONE; + } + } + FREE_LOCK(&lk); +} + +/* + * Version of initiate_write_inodeblock that handles UFS1 dinodes. + * Note that any bug fixes made to this routine must be done in the + * version found below. + * + * Called from within the procedure above to deal with unsatisfied + * allocation dependencies in an inodeblock. The buffer must be + * locked, thus, no I/O completion operations can occur while we + * are manipulating its associated dependencies. 
+ */ +static void +initiate_write_inodeblock_ufs1(inodedep, bp) + struct inodedep *inodedep; + struct buf *bp; /* The inode block */ +{ + struct allocdirect *adp, *lastadp; + struct ufs1_dinode *dp; + struct fs *fs; + ufs_lbn_t i, prevlbn = 0; + int deplist; + + if (inodedep->id_state & IOSTARTED) + panic("initiate_write_inodeblock_ufs1: already started"); + inodedep->id_state |= IOSTARTED; + fs = inodedep->id_fs; + dp = (struct ufs1_dinode *)bp->b_data + + ino_to_fsbo(fs, inodedep->id_ino); + /* + * If the bitmap is not yet written, then the allocated + * inode cannot be written to disk. + */ + if ((inodedep->id_state & DEPCOMPLETE) == 0) { + if (inodedep->id_savedino1 != NULL) + panic("initiate_write_inodeblock_ufs1: I/O underway"); + MALLOC(inodedep->id_savedino1, struct ufs1_dinode *, + sizeof(struct ufs1_dinode), M_INODEDEP, M_SOFTDEP_FLAGS); + *inodedep->id_savedino1 = *dp; + bzero((caddr_t)dp, sizeof(struct ufs1_dinode)); + return; + } + /* + * If no dependencies, then there is nothing to roll back. + */ + inodedep->id_savedsize = dp->di_size; + inodedep->id_savedextsize = 0; + if (TAILQ_FIRST(&inodedep->id_inoupdt) == NULL) + return; + /* + * Set the dependencies to busy. + */ + ACQUIRE_LOCK(&lk); + for (deplist = 0, adp = TAILQ_FIRST(&inodedep->id_inoupdt); adp; + adp = TAILQ_NEXT(adp, ad_next)) { +#ifdef DIAGNOSTIC + if (deplist != 0 && prevlbn >= adp->ad_lbn) { + FREE_LOCK(&lk); + panic("softdep_write_inodeblock: lbn order"); + } + prevlbn = adp->ad_lbn; + if (adp->ad_lbn < NDADDR && + dp->di_db[adp->ad_lbn] != adp->ad_newblkno) { + FREE_LOCK(&lk); + panic("%s: direct pointer #%jd mismatch %d != %jd", + "softdep_write_inodeblock", + (intmax_t)adp->ad_lbn, + dp->di_db[adp->ad_lbn], + (intmax_t)adp->ad_newblkno); + } + if (adp->ad_lbn >= NDADDR && + dp->di_ib[adp->ad_lbn - NDADDR] != adp->ad_newblkno) { + FREE_LOCK(&lk); + panic("%s: indirect pointer #%jd mismatch %d != %jd", + "softdep_write_inodeblock", + (intmax_t)adp->ad_lbn - NDADDR, + dp->di_ib[adp->ad_lbn - NDADDR], + (intmax_t)adp->ad_newblkno); + } + deplist |= 1 << adp->ad_lbn; + if ((adp->ad_state & ATTACHED) == 0) { + FREE_LOCK(&lk); + panic("softdep_write_inodeblock: Unknown state 0x%x", + adp->ad_state); + } +#endif /* DIAGNOSTIC */ + adp->ad_state &= ~ATTACHED; + adp->ad_state |= UNDONE; + } + /* + * The on-disk inode cannot claim to be any larger than the last + * fragment that has been written. Otherwise, the on-disk inode + * might have fragments that were not the last block in the file + * which would corrupt the filesystem. 
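+ * If a rollback for a direct block ends at an old fragment, the loop
+ * below trims di_size to fs_bsize * ad_lbn + ad_oldsize, clears the
+ * remaining di_db pointers, and clears all of the di_ib pointers.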
+ */ + for (lastadp = NULL, adp = TAILQ_FIRST(&inodedep->id_inoupdt); adp; + lastadp = adp, adp = TAILQ_NEXT(adp, ad_next)) { + if (adp->ad_lbn >= NDADDR) + break; + dp->di_db[adp->ad_lbn] = adp->ad_oldblkno; + /* keep going until hitting a rollback to a frag */ + if (adp->ad_oldsize == 0 || adp->ad_oldsize == fs->fs_bsize) + continue; + dp->di_size = fs->fs_bsize * adp->ad_lbn + adp->ad_oldsize; + for (i = adp->ad_lbn + 1; i < NDADDR; i++) { +#ifdef DIAGNOSTIC + if (dp->di_db[i] != 0 && (deplist & (1 << i)) == 0) { + FREE_LOCK(&lk); + panic("softdep_write_inodeblock: lost dep1"); + } +#endif /* DIAGNOSTIC */ + dp->di_db[i] = 0; + } + for (i = 0; i < NIADDR; i++) { +#ifdef DIAGNOSTIC + if (dp->di_ib[i] != 0 && + (deplist & ((1 << NDADDR) << i)) == 0) { + FREE_LOCK(&lk); + panic("softdep_write_inodeblock: lost dep2"); + } +#endif /* DIAGNOSTIC */ + dp->di_ib[i] = 0; + } + FREE_LOCK(&lk); + return; + } + /* + * If we have zero'ed out the last allocated block of the file, + * roll back the size to the last currently allocated block. + * We know that this last allocated block is a full-sized as + * we already checked for fragments in the loop above. + */ + if (lastadp != NULL && + dp->di_size <= (lastadp->ad_lbn + 1) * fs->fs_bsize) { + for (i = lastadp->ad_lbn; i >= 0; i--) + if (dp->di_db[i] != 0) + break; + dp->di_size = (i + 1) * fs->fs_bsize; + } + /* + * The only dependencies are for indirect blocks. + * + * The file size for indirect block additions is not guaranteed. + * Such a guarantee would be non-trivial to achieve. The conventional + * synchronous write implementation also does not make this guarantee. + * Fsck should catch and fix discrepancies. Arguably, the file size + * can be over-estimated without destroying integrity when the file + * moves into the indirect blocks (i.e., is large). If we want to + * postpone fsck, we are stuck with this argument. + */ + for (; adp; adp = TAILQ_NEXT(adp, ad_next)) + dp->di_ib[adp->ad_lbn - NDADDR] = 0; + FREE_LOCK(&lk); +} + +/* + * Version of initiate_write_inodeblock that handles UFS2 dinodes. + * Note that any bug fixes made to this routine must be done in the + * version found above. + * + * Called from within the procedure above to deal with unsatisfied + * allocation dependencies in an inodeblock. The buffer must be + * locked, thus, no I/O completion operations can occur while we + * are manipulating its associated dependencies. + */ +static void +initiate_write_inodeblock_ufs2(inodedep, bp) + struct inodedep *inodedep; + struct buf *bp; /* The inode block */ +{ + struct allocdirect *adp, *lastadp; + struct ufs2_dinode *dp; + struct fs *fs; + ufs_lbn_t i, prevlbn = 0; + int deplist; + + if (inodedep->id_state & IOSTARTED) + panic("initiate_write_inodeblock_ufs2: already started"); + inodedep->id_state |= IOSTARTED; + fs = inodedep->id_fs; + dp = (struct ufs2_dinode *)bp->b_data + + ino_to_fsbo(fs, inodedep->id_ino); + /* + * If the bitmap is not yet written, then the allocated + * inode cannot be written to disk. + */ + if ((inodedep->id_state & DEPCOMPLETE) == 0) { + if (inodedep->id_savedino2 != NULL) + panic("initiate_write_inodeblock_ufs2: I/O underway"); + MALLOC(inodedep->id_savedino2, struct ufs2_dinode *, + sizeof(struct ufs2_dinode), M_INODEDEP, M_SOFTDEP_FLAGS); + *inodedep->id_savedino2 = *dp; + bzero((caddr_t)dp, sizeof(struct ufs2_dinode)); + return; + } + /* + * If no dependencies, then there is nothing to roll back. 
+ */ + inodedep->id_savedsize = dp->di_size; + inodedep->id_savedextsize = dp->di_extsize; + if (TAILQ_FIRST(&inodedep->id_inoupdt) == NULL && + TAILQ_FIRST(&inodedep->id_extupdt) == NULL) + return; + /* + * Set the ext data dependencies to busy. + */ + ACQUIRE_LOCK(&lk); + for (deplist = 0, adp = TAILQ_FIRST(&inodedep->id_extupdt); adp; + adp = TAILQ_NEXT(adp, ad_next)) { +#ifdef DIAGNOSTIC + if (deplist != 0 && prevlbn >= adp->ad_lbn) { + FREE_LOCK(&lk); + panic("softdep_write_inodeblock: lbn order"); + } + prevlbn = adp->ad_lbn; + if (dp->di_extb[adp->ad_lbn] != adp->ad_newblkno) { + FREE_LOCK(&lk); + panic("%s: direct pointer #%jd mismatch %jd != %jd", + "softdep_write_inodeblock", + (intmax_t)adp->ad_lbn, + (intmax_t)dp->di_extb[adp->ad_lbn], + (intmax_t)adp->ad_newblkno); + } + deplist |= 1 << adp->ad_lbn; + if ((adp->ad_state & ATTACHED) == 0) { + FREE_LOCK(&lk); + panic("softdep_write_inodeblock: Unknown state 0x%x", + adp->ad_state); + } +#endif /* DIAGNOSTIC */ + adp->ad_state &= ~ATTACHED; + adp->ad_state |= UNDONE; + } + /* + * The on-disk inode cannot claim to be any larger than the last + * fragment that has been written. Otherwise, the on-disk inode + * might have fragments that were not the last block in the ext + * data which would corrupt the filesystem. + */ + for (lastadp = NULL, adp = TAILQ_FIRST(&inodedep->id_extupdt); adp; + lastadp = adp, adp = TAILQ_NEXT(adp, ad_next)) { + dp->di_extb[adp->ad_lbn] = adp->ad_oldblkno; + /* keep going until hitting a rollback to a frag */ + if (adp->ad_oldsize == 0 || adp->ad_oldsize == fs->fs_bsize) + continue; + dp->di_extsize = fs->fs_bsize * adp->ad_lbn + adp->ad_oldsize; + for (i = adp->ad_lbn + 1; i < NXADDR; i++) { +#ifdef DIAGNOSTIC + if (dp->di_extb[i] != 0 && (deplist & (1 << i)) == 0) { + FREE_LOCK(&lk); + panic("softdep_write_inodeblock: lost dep1"); + } +#endif /* DIAGNOSTIC */ + dp->di_extb[i] = 0; + } + lastadp = NULL; + break; + } + /* + * If we have zero'ed out the last allocated block of the ext + * data, roll back the size to the last currently allocated block. + * We know that this last allocated block is a full-sized as + * we already checked for fragments in the loop above. + */ + if (lastadp != NULL && + dp->di_extsize <= (lastadp->ad_lbn + 1) * fs->fs_bsize) { + for (i = lastadp->ad_lbn; i >= 0; i--) + if (dp->di_extb[i] != 0) + break; + dp->di_extsize = (i + 1) * fs->fs_bsize; + } + /* + * Set the file data dependencies to busy. 
+ */ + for (deplist = 0, adp = TAILQ_FIRST(&inodedep->id_inoupdt); adp; + adp = TAILQ_NEXT(adp, ad_next)) { +#ifdef DIAGNOSTIC + if (deplist != 0 && prevlbn >= adp->ad_lbn) { + FREE_LOCK(&lk); + panic("softdep_write_inodeblock: lbn order"); + } + prevlbn = adp->ad_lbn; + if (adp->ad_lbn < NDADDR && + dp->di_db[adp->ad_lbn] != adp->ad_newblkno) { + FREE_LOCK(&lk); + panic("%s: direct pointer #%jd mismatch %jd != %jd", + "softdep_write_inodeblock", + (intmax_t)adp->ad_lbn, + (intmax_t)dp->di_db[adp->ad_lbn], + (intmax_t)adp->ad_newblkno); + } + if (adp->ad_lbn >= NDADDR && + dp->di_ib[adp->ad_lbn - NDADDR] != adp->ad_newblkno) { + FREE_LOCK(&lk); + panic("%s indirect pointer #%jd mismatch %jd != %jd", + "softdep_write_inodeblock:", + (intmax_t)adp->ad_lbn - NDADDR, + (intmax_t)dp->di_ib[adp->ad_lbn - NDADDR], + (intmax_t)adp->ad_newblkno); + } + deplist |= 1 << adp->ad_lbn; + if ((adp->ad_state & ATTACHED) == 0) { + FREE_LOCK(&lk); + panic("softdep_write_inodeblock: Unknown state 0x%x", + adp->ad_state); + } +#endif /* DIAGNOSTIC */ + adp->ad_state &= ~ATTACHED; + adp->ad_state |= UNDONE; + } + /* + * The on-disk inode cannot claim to be any larger than the last + * fragment that has been written. Otherwise, the on-disk inode + * might have fragments that were not the last block in the file + * which would corrupt the filesystem. + */ + for (lastadp = NULL, adp = TAILQ_FIRST(&inodedep->id_inoupdt); adp; + lastadp = adp, adp = TAILQ_NEXT(adp, ad_next)) { + if (adp->ad_lbn >= NDADDR) + break; + dp->di_db[adp->ad_lbn] = adp->ad_oldblkno; + /* keep going until hitting a rollback to a frag */ + if (adp->ad_oldsize == 0 || adp->ad_oldsize == fs->fs_bsize) + continue; + dp->di_size = fs->fs_bsize * adp->ad_lbn + adp->ad_oldsize; + for (i = adp->ad_lbn + 1; i < NDADDR; i++) { +#ifdef DIAGNOSTIC + if (dp->di_db[i] != 0 && (deplist & (1 << i)) == 0) { + FREE_LOCK(&lk); + panic("softdep_write_inodeblock: lost dep2"); + } +#endif /* DIAGNOSTIC */ + dp->di_db[i] = 0; + } + for (i = 0; i < NIADDR; i++) { +#ifdef DIAGNOSTIC + if (dp->di_ib[i] != 0 && + (deplist & ((1 << NDADDR) << i)) == 0) { + FREE_LOCK(&lk); + panic("softdep_write_inodeblock: lost dep3"); + } +#endif /* DIAGNOSTIC */ + dp->di_ib[i] = 0; + } + FREE_LOCK(&lk); + return; + } + /* + * If we have zero'ed out the last allocated block of the file, + * roll back the size to the last currently allocated block. + * We know that this last allocated block is a full-sized as + * we already checked for fragments in the loop above. + */ + if (lastadp != NULL && + dp->di_size <= (lastadp->ad_lbn + 1) * fs->fs_bsize) { + for (i = lastadp->ad_lbn; i >= 0; i--) + if (dp->di_db[i] != 0) + break; + dp->di_size = (i + 1) * fs->fs_bsize; + } + /* + * The only dependencies are for indirect blocks. + * + * The file size for indirect block additions is not guaranteed. + * Such a guarantee would be non-trivial to achieve. The conventional + * synchronous write implementation also does not make this guarantee. + * Fsck should catch and fix discrepancies. Arguably, the file size + * can be over-estimated without destroying integrity when the file + * moves into the indirect blocks (i.e., is large). If we want to + * postpone fsck, we are stuck with this argument. 
+ */ + for (; adp; adp = TAILQ_NEXT(adp, ad_next)) + dp->di_ib[adp->ad_lbn - NDADDR] = 0; + FREE_LOCK(&lk); +} + +/* + * This routine is called during the completion interrupt + * service routine for a disk write (from the procedure called + * by the device driver to inform the filesystem caches of + * a request completion). It should be called early in this + * procedure, before the block is made available to other + * processes or other routines are called. + */ +static void +softdep_disk_write_complete(bp) + struct buf *bp; /* describes the completed disk write */ +{ + struct worklist *wk; + struct workhead reattach; + struct newblk *newblk; + struct allocindir *aip; + struct allocdirect *adp; + struct indirdep *indirdep; + struct inodedep *inodedep; + struct bmsafemap *bmsafemap; + + /* + * If an error occurred while doing the write, then the data + * has not hit the disk and the dependencies cannot be unrolled. + */ + if ((bp->b_ioflags & BIO_ERROR) != 0 && (bp->b_flags & B_INVAL) == 0) + return; +#ifdef DEBUG + if (lk.lkt_held != NOHOLDER) + panic("softdep_disk_write_complete: lock is held"); + lk.lkt_held = SPECIAL_FLAG; +#endif + LIST_INIT(&reattach); + while ((wk = LIST_FIRST(&bp->b_dep)) != NULL) { + WORKLIST_REMOVE(wk); + switch (wk->wk_type) { + + case D_PAGEDEP: + if (handle_written_filepage(WK_PAGEDEP(wk), bp)) + WORKLIST_INSERT(&reattach, wk); + continue; + + case D_INODEDEP: + if (handle_written_inodeblock(WK_INODEDEP(wk), bp)) + WORKLIST_INSERT(&reattach, wk); + continue; + + case D_BMSAFEMAP: + bmsafemap = WK_BMSAFEMAP(wk); + while ((newblk = LIST_FIRST(&bmsafemap->sm_newblkhd))) { + newblk->nb_state |= DEPCOMPLETE; + newblk->nb_bmsafemap = NULL; + LIST_REMOVE(newblk, nb_deps); + } + while ((adp = + LIST_FIRST(&bmsafemap->sm_allocdirecthd))) { + adp->ad_state |= DEPCOMPLETE; + adp->ad_buf = NULL; + LIST_REMOVE(adp, ad_deps); + handle_allocdirect_partdone(adp); + } + while ((aip = + LIST_FIRST(&bmsafemap->sm_allocindirhd))) { + aip->ai_state |= DEPCOMPLETE; + aip->ai_buf = NULL; + LIST_REMOVE(aip, ai_deps); + handle_allocindir_partdone(aip); + } + while ((inodedep = + LIST_FIRST(&bmsafemap->sm_inodedephd)) != NULL) { + inodedep->id_state |= DEPCOMPLETE; + LIST_REMOVE(inodedep, id_deps); + inodedep->id_buf = NULL; + } + WORKITEM_FREE(bmsafemap, D_BMSAFEMAP); + continue; + + case D_MKDIR: + handle_written_mkdir(WK_MKDIR(wk), MKDIR_BODY); + continue; + + case D_ALLOCDIRECT: + adp = WK_ALLOCDIRECT(wk); + adp->ad_state |= COMPLETE; + handle_allocdirect_partdone(adp); + continue; + + case D_ALLOCINDIR: + aip = WK_ALLOCINDIR(wk); + aip->ai_state |= COMPLETE; + handle_allocindir_partdone(aip); + continue; + + case D_INDIRDEP: + indirdep = WK_INDIRDEP(wk); + if (indirdep->ir_state & GOINGAWAY) { + lk.lkt_held = NOHOLDER; + panic("disk_write_complete: indirdep gone"); + } + bcopy(indirdep->ir_saveddata, bp->b_data, bp->b_bcount); + FREE(indirdep->ir_saveddata, M_INDIRDEP); + indirdep->ir_saveddata = 0; + indirdep->ir_state &= ~UNDONE; + indirdep->ir_state |= ATTACHED; + while ((aip = LIST_FIRST(&indirdep->ir_donehd)) != 0) { + handle_allocindir_partdone(aip); + if (aip == LIST_FIRST(&indirdep->ir_donehd)) { + lk.lkt_held = NOHOLDER; + panic("disk_write_complete: not gone"); + } + } + WORKLIST_INSERT(&reattach, wk); + if ((bp->b_flags & B_DELWRI) == 0) + stat_indir_blk_ptrs++; + bdirty(bp); + continue; + + default: + lk.lkt_held = NOHOLDER; + panic("handle_disk_write_complete: Unknown type %s", + TYPENAME(wk->wk_type)); + /* NOTREACHED */ + } + } + /* + * Reattach any requests that 
must be redone. + */ + while ((wk = LIST_FIRST(&reattach)) != NULL) { + WORKLIST_REMOVE(wk); + WORKLIST_INSERT(&bp->b_dep, wk); + } +#ifdef DEBUG + if (lk.lkt_held != SPECIAL_FLAG) + panic("softdep_disk_write_complete: lock lost"); + lk.lkt_held = NOHOLDER; +#endif +} + +/* + * Called from within softdep_disk_write_complete above. Note that + * this routine is always called from interrupt level with further + * splbio interrupts blocked. + */ +static void +handle_allocdirect_partdone(adp) + struct allocdirect *adp; /* the completed allocdirect */ +{ + struct allocdirectlst *listhead; + struct allocdirect *listadp; + struct inodedep *inodedep; + long bsize, delay; + + if ((adp->ad_state & ALLCOMPLETE) != ALLCOMPLETE) + return; + if (adp->ad_buf != NULL) { + lk.lkt_held = NOHOLDER; + panic("handle_allocdirect_partdone: dangling dep"); + } + /* + * The on-disk inode cannot claim to be any larger than the last + * fragment that has been written. Otherwise, the on-disk inode + * might have fragments that were not the last block in the file + * which would corrupt the filesystem. Thus, we cannot free any + * allocdirects after one whose ad_oldblkno claims a fragment as + * these blocks must be rolled back to zero before writing the inode. + * We check the currently active set of allocdirects in id_inoupdt + * or id_extupdt as appropriate. + */ + inodedep = adp->ad_inodedep; + bsize = inodedep->id_fs->fs_bsize; + if (adp->ad_state & EXTDATA) + listhead = &inodedep->id_extupdt; + else + listhead = &inodedep->id_inoupdt; + TAILQ_FOREACH(listadp, listhead, ad_next) { + /* found our block */ + if (listadp == adp) + break; + /* continue if ad_oldlbn is not a fragment */ + if (listadp->ad_oldsize == 0 || + listadp->ad_oldsize == bsize) + continue; + /* hit a fragment */ + return; + } + /* + * If we have reached the end of the current list without + * finding the just finished dependency, then it must be + * on the future dependency list. Future dependencies cannot + * be freed until they are moved to the current list. + */ + if (listadp == NULL) { +#ifdef DEBUG + if (adp->ad_state & EXTDATA) + listhead = &inodedep->id_newextupdt; + else + listhead = &inodedep->id_newinoupdt; + TAILQ_FOREACH(listadp, listhead, ad_next) + /* found our block */ + if (listadp == adp) + break; + if (listadp == NULL) { + lk.lkt_held = NOHOLDER; + panic("handle_allocdirect_partdone: lost dep"); + } +#endif /* DEBUG */ + return; + } + /* + * If we have found the just finished dependency, then free + * it along with anything that follows it that is complete. + * If the inode still has a bitmap dependency, then it has + * never been written to disk, hence the on-disk inode cannot + * reference the old fragment so we can free it without delay. + */ + delay = (inodedep->id_state & DEPCOMPLETE); + for (; adp; adp = listadp) { + listadp = TAILQ_NEXT(adp, ad_next); + if ((adp->ad_state & ALLCOMPLETE) != ALLCOMPLETE) + return; + free_allocdirect(listhead, adp, delay); + } +} + +/* + * Called from within softdep_disk_write_complete above. Note that + * this routine is always called from interrupt level with further + * splbio interrupts blocked. 
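+ * If the indirect block is still rolled back (UNDONE), the allocindir is
+ * parked on ir_donehd until the write of the indirect block completes;
+ * otherwise its new block number is installed in the saved copy held in
+ * ir_savebp and the allocindir is freed.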
+ */ +static void +handle_allocindir_partdone(aip) + struct allocindir *aip; /* the completed allocindir */ +{ + struct indirdep *indirdep; + + if ((aip->ai_state & ALLCOMPLETE) != ALLCOMPLETE) + return; + if (aip->ai_buf != NULL) { + lk.lkt_held = NOHOLDER; + panic("handle_allocindir_partdone: dangling dependency"); + } + indirdep = aip->ai_indirdep; + if (indirdep->ir_state & UNDONE) { + LIST_REMOVE(aip, ai_next); + LIST_INSERT_HEAD(&indirdep->ir_donehd, aip, ai_next); + return; + } + if (indirdep->ir_state & UFS1FMT) + ((ufs1_daddr_t *)indirdep->ir_savebp->b_data)[aip->ai_offset] = + aip->ai_newblkno; + else + ((ufs2_daddr_t *)indirdep->ir_savebp->b_data)[aip->ai_offset] = + aip->ai_newblkno; + LIST_REMOVE(aip, ai_next); + if (aip->ai_freefrag != NULL) + add_to_worklist(&aip->ai_freefrag->ff_list); + WORKITEM_FREE(aip, D_ALLOCINDIR); +} + +/* + * Called from within softdep_disk_write_complete above to restore + * in-memory inode block contents to their most up-to-date state. Note + * that this routine is always called from interrupt level with further + * splbio interrupts blocked. + */ +static int +handle_written_inodeblock(inodedep, bp) + struct inodedep *inodedep; + struct buf *bp; /* buffer containing the inode block */ +{ + struct worklist *wk, *filefree; + struct allocdirect *adp, *nextadp; + struct ufs1_dinode *dp1 = NULL; + struct ufs2_dinode *dp2 = NULL; + int hadchanges, fstype; + + if ((inodedep->id_state & IOSTARTED) == 0) { + lk.lkt_held = NOHOLDER; + panic("handle_written_inodeblock: not started"); + } + inodedep->id_state &= ~IOSTARTED; + inodedep->id_state |= COMPLETE; + if (inodedep->id_fs->fs_magic == FS_UFS1_MAGIC) { + fstype = UFS1; + dp1 = (struct ufs1_dinode *)bp->b_data + + ino_to_fsbo(inodedep->id_fs, inodedep->id_ino); + } else { + fstype = UFS2; + dp2 = (struct ufs2_dinode *)bp->b_data + + ino_to_fsbo(inodedep->id_fs, inodedep->id_ino); + } + /* + * If we had to rollback the inode allocation because of + * bitmaps being incomplete, then simply restore it. + * Keep the block dirty so that it will not be reclaimed until + * all associated dependencies have been cleared and the + * corresponding updates written to disk. + */ + if (inodedep->id_savedino1 != NULL) { + if (fstype == UFS1) + *dp1 = *inodedep->id_savedino1; + else + *dp2 = *inodedep->id_savedino2; + FREE(inodedep->id_savedino1, M_INODEDEP); + inodedep->id_savedino1 = NULL; + if ((bp->b_flags & B_DELWRI) == 0) + stat_inode_bitmap++; + bdirty(bp); + return (1); + } + /* + * Roll forward anything that had to be rolled back before + * the inode could be updated. 
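+ * Each allocdirect remaining on id_inoupdt or id_extupdt carries the new
+ * block number to install; the rolled back on-disk values are sanity
+ * checked before being replaced.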
+ */ + hadchanges = 0; + for (adp = TAILQ_FIRST(&inodedep->id_inoupdt); adp; adp = nextadp) { + nextadp = TAILQ_NEXT(adp, ad_next); + if (adp->ad_state & ATTACHED) { + lk.lkt_held = NOHOLDER; + panic("handle_written_inodeblock: new entry"); + } + if (fstype == UFS1) { + if (adp->ad_lbn < NDADDR) { + if (dp1->di_db[adp->ad_lbn]!=adp->ad_oldblkno) { + lk.lkt_held = NOHOLDER; + panic("%s %s #%jd mismatch %d != %jd", + "handle_written_inodeblock:", + "direct pointer", + (intmax_t)adp->ad_lbn, + dp1->di_db[adp->ad_lbn], + (intmax_t)adp->ad_oldblkno); + } + dp1->di_db[adp->ad_lbn] = adp->ad_newblkno; + } else { + if (dp1->di_ib[adp->ad_lbn - NDADDR] != 0) { + lk.lkt_held = NOHOLDER; + panic("%s: %s #%jd allocated as %d", + "handle_written_inodeblock", + "indirect pointer", + (intmax_t)adp->ad_lbn - NDADDR, + dp1->di_ib[adp->ad_lbn - NDADDR]); + } + dp1->di_ib[adp->ad_lbn - NDADDR] = + adp->ad_newblkno; + } + } else { + if (adp->ad_lbn < NDADDR) { + if (dp2->di_db[adp->ad_lbn]!=adp->ad_oldblkno) { + lk.lkt_held = NOHOLDER; + panic("%s: %s #%jd %s %jd != %jd", + "handle_written_inodeblock", + "direct pointer", + (intmax_t)adp->ad_lbn, "mismatch", + (intmax_t)dp2->di_db[adp->ad_lbn], + (intmax_t)adp->ad_oldblkno); + } + dp2->di_db[adp->ad_lbn] = adp->ad_newblkno; + } else { + if (dp2->di_ib[adp->ad_lbn - NDADDR] != 0) { + lk.lkt_held = NOHOLDER; + panic("%s: %s #%jd allocated as %jd", + "handle_written_inodeblock", + "indirect pointer", + (intmax_t)adp->ad_lbn - NDADDR, + (intmax_t) + dp2->di_ib[adp->ad_lbn - NDADDR]); + } + dp2->di_ib[adp->ad_lbn - NDADDR] = + adp->ad_newblkno; + } + } + adp->ad_state &= ~UNDONE; + adp->ad_state |= ATTACHED; + hadchanges = 1; + } + for (adp = TAILQ_FIRST(&inodedep->id_extupdt); adp; adp = nextadp) { + nextadp = TAILQ_NEXT(adp, ad_next); + if (adp->ad_state & ATTACHED) { + lk.lkt_held = NOHOLDER; + panic("handle_written_inodeblock: new entry"); + } + if (dp2->di_extb[adp->ad_lbn] != adp->ad_oldblkno) { + lk.lkt_held = NOHOLDER; + panic("%s: direct pointers #%jd %s %jd != %jd", + "handle_written_inodeblock", + (intmax_t)adp->ad_lbn, "mismatch", + (intmax_t)dp2->di_extb[adp->ad_lbn], + (intmax_t)adp->ad_oldblkno); + } + dp2->di_extb[adp->ad_lbn] = adp->ad_newblkno; + adp->ad_state &= ~UNDONE; + adp->ad_state |= ATTACHED; + hadchanges = 1; + } + if (hadchanges && (bp->b_flags & B_DELWRI) == 0) + stat_direct_blk_ptrs++; + /* + * Reset the file size to its most up-to-date value. + */ + if (inodedep->id_savedsize == -1 || inodedep->id_savedextsize == -1) { + lk.lkt_held = NOHOLDER; + panic("handle_written_inodeblock: bad size"); + } + if (fstype == UFS1) { + if (dp1->di_size != inodedep->id_savedsize) { + dp1->di_size = inodedep->id_savedsize; + hadchanges = 1; + } + } else { + if (dp2->di_size != inodedep->id_savedsize) { + dp2->di_size = inodedep->id_savedsize; + hadchanges = 1; + } + if (dp2->di_extsize != inodedep->id_savedextsize) { + dp2->di_extsize = inodedep->id_savedextsize; + hadchanges = 1; + } + } + inodedep->id_savedsize = -1; + inodedep->id_savedextsize = -1; + /* + * If there were any rollbacks in the inode block, then it must be + * marked dirty so that its will eventually get written back in + * its correct form. + */ + if (hadchanges) + bdirty(bp); + /* + * Process any allocdirects that completed during the update. 
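+ * Passing the head of each list is sufficient; handle_allocdirect_partdone
+ * frees the leading run of fully completed entries.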
+ */ + if ((adp = TAILQ_FIRST(&inodedep->id_inoupdt)) != NULL) + handle_allocdirect_partdone(adp); + if ((adp = TAILQ_FIRST(&inodedep->id_extupdt)) != NULL) + handle_allocdirect_partdone(adp); + /* + * Process deallocations that were held pending until the + * inode had been written to disk. Freeing of the inode + * is delayed until after all blocks have been freed to + * avoid creation of new triples + * before the old ones have been deleted. + */ + filefree = NULL; + while ((wk = LIST_FIRST(&inodedep->id_bufwait)) != NULL) { + WORKLIST_REMOVE(wk); + switch (wk->wk_type) { + + case D_FREEFILE: + /* + * We defer adding filefree to the worklist until + * all other additions have been made to ensure + * that it will be done after all the old blocks + * have been freed. + */ + if (filefree != NULL) { + lk.lkt_held = NOHOLDER; + panic("handle_written_inodeblock: filefree"); + } + filefree = wk; + continue; + + case D_MKDIR: + handle_written_mkdir(WK_MKDIR(wk), MKDIR_PARENT); + continue; + + case D_DIRADD: + diradd_inode_written(WK_DIRADD(wk), inodedep); + continue; + + case D_FREEBLKS: + case D_FREEFRAG: + case D_DIRREM: + add_to_worklist(wk); + continue; + + case D_NEWDIRBLK: + free_newdirblk(WK_NEWDIRBLK(wk)); + continue; + + default: + lk.lkt_held = NOHOLDER; + panic("handle_written_inodeblock: Unknown type %s", + TYPENAME(wk->wk_type)); + /* NOTREACHED */ + } + } + if (filefree != NULL) { + if (free_inodedep(inodedep) == 0) { + lk.lkt_held = NOHOLDER; + panic("handle_written_inodeblock: live inodedep"); + } + add_to_worklist(filefree); + return (0); + } + + /* + * If no outstanding dependencies, free it. + */ + if (free_inodedep(inodedep) || + (TAILQ_FIRST(&inodedep->id_inoupdt) == 0 && + TAILQ_FIRST(&inodedep->id_extupdt) == 0)) + return (0); + return (hadchanges); +} + +/* + * Process a diradd entry after its dependent inode has been written. + * This routine must be called with splbio interrupts blocked. + */ +static void +diradd_inode_written(dap, inodedep) + struct diradd *dap; + struct inodedep *inodedep; +{ + struct pagedep *pagedep; + + dap->da_state |= COMPLETE; + if ((dap->da_state & ALLCOMPLETE) == ALLCOMPLETE) { + if (dap->da_state & DIRCHG) + pagedep = dap->da_previous->dm_pagedep; + else + pagedep = dap->da_pagedep; + LIST_REMOVE(dap, da_pdlist); + LIST_INSERT_HEAD(&pagedep->pd_pendinghd, dap, da_pdlist); + } + WORKLIST_INSERT(&inodedep->id_pendinghd, &dap->da_list); +} + +/* + * Handle the completion of a mkdir dependency. + */ +static void +handle_written_mkdir(mkdir, type) + struct mkdir *mkdir; + int type; +{ + struct diradd *dap; + struct pagedep *pagedep; + + if (mkdir->md_state != type) { + lk.lkt_held = NOHOLDER; + panic("handle_written_mkdir: bad type"); + } + dap = mkdir->md_diradd; + dap->da_state &= ~type; + if ((dap->da_state & (MKDIR_PARENT | MKDIR_BODY)) == 0) + dap->da_state |= DEPCOMPLETE; + if ((dap->da_state & ALLCOMPLETE) == ALLCOMPLETE) { + if (dap->da_state & DIRCHG) + pagedep = dap->da_previous->dm_pagedep; + else + pagedep = dap->da_pagedep; + LIST_REMOVE(dap, da_pdlist); + LIST_INSERT_HEAD(&pagedep->pd_pendinghd, dap, da_pdlist); + } + LIST_REMOVE(mkdir, md_mkdirs); + WORKITEM_FREE(mkdir, D_MKDIR); +} + +/* + * Called from within softdep_disk_write_complete above. + * A write operation was just completed. Removed inodes can + * now be freed and associated block pointers may be committed. + * Note that this routine is always called from interrupt level + * with further splbio interrupts blocked. 
+ */ +static int +handle_written_filepage(pagedep, bp) + struct pagedep *pagedep; + struct buf *bp; /* buffer containing the written page */ +{ + struct dirrem *dirrem; + struct diradd *dap, *nextdap; + struct direct *ep; + int i, chgs; + + if ((pagedep->pd_state & IOSTARTED) == 0) { + lk.lkt_held = NOHOLDER; + panic("handle_written_filepage: not started"); + } + pagedep->pd_state &= ~IOSTARTED; + /* + * Process any directory removals that have been committed. + */ + while ((dirrem = LIST_FIRST(&pagedep->pd_dirremhd)) != NULL) { + LIST_REMOVE(dirrem, dm_next); + dirrem->dm_dirinum = pagedep->pd_ino; + add_to_worklist(&dirrem->dm_list); + } + /* + * Free any directory additions that have been committed. + * If it is a newly allocated block, we have to wait until + * the on-disk directory inode claims the new block. + */ + if ((pagedep->pd_state & NEWBLOCK) == 0) + while ((dap = LIST_FIRST(&pagedep->pd_pendinghd)) != NULL) + free_diradd(dap); + /* + * Uncommitted directory entries must be restored. + */ + for (chgs = 0, i = 0; i < DAHASHSZ; i++) { + for (dap = LIST_FIRST(&pagedep->pd_diraddhd[i]); dap; + dap = nextdap) { + nextdap = LIST_NEXT(dap, da_pdlist); + if (dap->da_state & ATTACHED) { + lk.lkt_held = NOHOLDER; + panic("handle_written_filepage: attached"); + } + ep = (struct direct *) + ((char *)bp->b_data + dap->da_offset); + ep->d_ino = dap->da_newinum; + dap->da_state &= ~UNDONE; + dap->da_state |= ATTACHED; + chgs = 1; + /* + * If the inode referenced by the directory has + * been written out, then the dependency can be + * moved to the pending list. + */ + if ((dap->da_state & ALLCOMPLETE) == ALLCOMPLETE) { + LIST_REMOVE(dap, da_pdlist); + LIST_INSERT_HEAD(&pagedep->pd_pendinghd, dap, + da_pdlist); + } + } + } + /* + * If there were any rollbacks in the directory, then it must be + * marked dirty so that its will eventually get written back in + * its correct form. + */ + if (chgs) { + if ((bp->b_flags & B_DELWRI) == 0) + stat_dir_entry++; + bdirty(bp); + return (1); + } + /* + * If we are not waiting for a new directory block to be + * claimed by its inode, then the pagedep will be freed. + * Otherwise it will remain to track any new entries on + * the page in case they are fsync'ed. + */ + if ((pagedep->pd_state & NEWBLOCK) == 0) { + LIST_REMOVE(pagedep, pd_hash); + WORKITEM_FREE(pagedep, D_PAGEDEP); + } + return (0); +} + +/* + * Writing back in-core inode structures. + * + * The filesystem only accesses an inode's contents when it occupies an + * "in-core" inode structure. These "in-core" structures are separate from + * the page frames used to cache inode blocks. Only the latter are + * transferred to/from the disk. So, when the updated contents of the + * "in-core" inode structure are copied to the corresponding in-memory inode + * block, the dependencies are also transferred. The following procedure is + * called when copying a dirty "in-core" inode to a cached inode block. + */ + +/* + * Called when an inode is loaded from disk. If the effective link count + * differed from the actual link count when it was last flushed, then we + * need to ensure that the correct effective link count is put back. + */ +void +softdep_load_inodeblock(ip) + struct inode *ip; /* the "in_core" copy of the inode */ +{ + struct inodedep *inodedep; + + /* + * Check for alternate nlink count. 
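+ * Subtracting id_nlinkdelta restores the effective link count that was
+ * in force when the inode was last flushed.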
+ */ + ip->i_effnlink = ip->i_nlink; + ACQUIRE_LOCK(&lk); + if (inodedep_lookup(ip->i_fs, ip->i_number, 0, &inodedep) == 0) { + FREE_LOCK(&lk); + return; + } + ip->i_effnlink -= inodedep->id_nlinkdelta; + if (inodedep->id_state & SPACECOUNTED) + ip->i_flag |= IN_SPACECOUNTED; + FREE_LOCK(&lk); +} + +/* + * This routine is called just before the "in-core" inode + * information is to be copied to the in-memory inode block. + * Recall that an inode block contains several inodes. If + * the force flag is set, then the dependencies will be + * cleared so that the update can always be made. Note that + * the buffer is locked when this routine is called, so we + * will never be in the middle of writing the inode block + * to disk. + */ +void +softdep_update_inodeblock(ip, bp, waitfor) + struct inode *ip; /* the "in_core" copy of the inode */ + struct buf *bp; /* the buffer containing the inode block */ + int waitfor; /* nonzero => update must be allowed */ +{ + struct inodedep *inodedep; + struct worklist *wk; + struct buf *ibp; + int error; + + /* + * If the effective link count is not equal to the actual link + * count, then we must track the difference in an inodedep while + * the inode is (potentially) tossed out of the cache. Otherwise, + * if there is no existing inodedep, then there are no dependencies + * to track. + */ + ACQUIRE_LOCK(&lk); + if (inodedep_lookup(ip->i_fs, ip->i_number, 0, &inodedep) == 0) { + FREE_LOCK(&lk); + if (ip->i_effnlink != ip->i_nlink) + panic("softdep_update_inodeblock: bad link count"); + return; + } + if (inodedep->id_nlinkdelta != ip->i_nlink - ip->i_effnlink) { + FREE_LOCK(&lk); + panic("softdep_update_inodeblock: bad delta"); + } + /* + * Changes have been initiated. Anything depending on these + * changes cannot occur until this inode has been written. + */ + inodedep->id_state &= ~COMPLETE; + if ((inodedep->id_state & ONWORKLIST) == 0) + WORKLIST_INSERT(&bp->b_dep, &inodedep->id_list); + /* + * Any new dependencies associated with the incore inode must + * now be moved to the list associated with the buffer holding + * the in-memory copy of the inode. Once merged process any + * allocdirects that are completed by the merger. + */ + merge_inode_lists(&inodedep->id_newinoupdt, &inodedep->id_inoupdt); + if (TAILQ_FIRST(&inodedep->id_inoupdt) != NULL) + handle_allocdirect_partdone(TAILQ_FIRST(&inodedep->id_inoupdt)); + merge_inode_lists(&inodedep->id_newextupdt, &inodedep->id_extupdt); + if (TAILQ_FIRST(&inodedep->id_extupdt) != NULL) + handle_allocdirect_partdone(TAILQ_FIRST(&inodedep->id_extupdt)); + /* + * Now that the inode has been pushed into the buffer, the + * operations dependent on the inode being written to disk + * can be moved to the id_bufwait so that they will be + * processed when the buffer I/O completes. + */ + while ((wk = LIST_FIRST(&inodedep->id_inowait)) != NULL) { + WORKLIST_REMOVE(wk); + WORKLIST_INSERT(&inodedep->id_bufwait, wk); + } + /* + * Newly allocated inodes cannot be written until the bitmap + * that allocates them have been written (indicated by + * DEPCOMPLETE being set in id_state). If we are doing a + * forced sync (e.g., an fsync on a file), we force the bitmap + * to be written so that the update can be done. 
+ */ + if ((inodedep->id_state & DEPCOMPLETE) != 0 || waitfor == 0) { + FREE_LOCK(&lk); + return; + } + ibp = inodedep->id_buf; + ibp = getdirtybuf(&ibp, NULL, MNT_WAIT); + FREE_LOCK(&lk); + if (ibp && (error = BUF_WRITE(ibp)) != 0) + softdep_error("softdep_update_inodeblock: bwrite", error); + if ((inodedep->id_state & DEPCOMPLETE) == 0) + panic("softdep_update_inodeblock: update failed"); +} + +/* + * Merge the a new inode dependency list (such as id_newinoupdt) into an + * old inode dependency list (such as id_inoupdt). This routine must be + * called with splbio interrupts blocked. + */ +static void +merge_inode_lists(newlisthead, oldlisthead) + struct allocdirectlst *newlisthead; + struct allocdirectlst *oldlisthead; +{ + struct allocdirect *listadp, *newadp; + + newadp = TAILQ_FIRST(newlisthead); + for (listadp = TAILQ_FIRST(oldlisthead); listadp && newadp;) { + if (listadp->ad_lbn < newadp->ad_lbn) { + listadp = TAILQ_NEXT(listadp, ad_next); + continue; + } + TAILQ_REMOVE(newlisthead, newadp, ad_next); + TAILQ_INSERT_BEFORE(listadp, newadp, ad_next); + if (listadp->ad_lbn == newadp->ad_lbn) { + allocdirect_merge(oldlisthead, newadp, + listadp); + listadp = newadp; + } + newadp = TAILQ_FIRST(newlisthead); + } + while ((newadp = TAILQ_FIRST(newlisthead)) != NULL) { + TAILQ_REMOVE(newlisthead, newadp, ad_next); + TAILQ_INSERT_TAIL(oldlisthead, newadp, ad_next); + } +} + +/* + * If we are doing an fsync, then we must ensure that any directory + * entries for the inode have been written after the inode gets to disk. + */ +int +softdep_fsync(vp) + struct vnode *vp; /* the "in_core" copy of the inode */ +{ + struct inodedep *inodedep; + struct pagedep *pagedep; + struct worklist *wk; + struct diradd *dap; + struct mount *mnt; + struct vnode *pvp; + struct inode *ip; + struct buf *bp; + struct fs *fs; + struct thread *td = curthread; + int error, flushparent; + ino_t parentino; + ufs_lbn_t lbn; + + ip = VTOI(vp); + fs = ip->i_fs; + ACQUIRE_LOCK(&lk); + if (inodedep_lookup(fs, ip->i_number, 0, &inodedep) == 0) { + FREE_LOCK(&lk); + return (0); + } + if (LIST_FIRST(&inodedep->id_inowait) != NULL || + LIST_FIRST(&inodedep->id_bufwait) != NULL || + TAILQ_FIRST(&inodedep->id_extupdt) != NULL || + TAILQ_FIRST(&inodedep->id_newextupdt) != NULL || + TAILQ_FIRST(&inodedep->id_inoupdt) != NULL || + TAILQ_FIRST(&inodedep->id_newinoupdt) != NULL) { + FREE_LOCK(&lk); + panic("softdep_fsync: pending ops"); + } + for (error = 0, flushparent = 0; ; ) { + if ((wk = LIST_FIRST(&inodedep->id_pendinghd)) == NULL) + break; + if (wk->wk_type != D_DIRADD) { + FREE_LOCK(&lk); + panic("softdep_fsync: Unexpected type %s", + TYPENAME(wk->wk_type)); + } + dap = WK_DIRADD(wk); + /* + * Flush our parent if this directory entry has a MKDIR_PARENT + * dependency or is contained in a newly allocated block. + */ + if (dap->da_state & DIRCHG) + pagedep = dap->da_previous->dm_pagedep; + else + pagedep = dap->da_pagedep; + mnt = pagedep->pd_mnt; + parentino = pagedep->pd_ino; + lbn = pagedep->pd_lbn; + if ((dap->da_state & (MKDIR_BODY | COMPLETE)) != COMPLETE) { + FREE_LOCK(&lk); + panic("softdep_fsync: dirty"); + } + if ((dap->da_state & MKDIR_PARENT) || + (pagedep->pd_state & NEWBLOCK)) + flushparent = 1; + else + flushparent = 0; + /* + * If we are being fsync'ed as part of vgone'ing this vnode, + * then we will not be able to release and recover the + * vnode below, so we just have to give up on writing its + * directory entry out. 
It will eventually be written, just + * not now, but then the user was not asking to have it + * written, so we are not breaking any promises. + */ + if (vp->v_iflag & VI_XLOCK) + break; + /* + * We prevent deadlock by always fetching inodes from the + * root, moving down the directory tree. Thus, when fetching + * our parent directory, we first try to get the lock. If + * that fails, we must unlock ourselves before requesting + * the lock on our parent. See the comment in ufs_lookup + * for details on possible races. + */ + FREE_LOCK(&lk); + if (VFS_VGET(mnt, parentino, LK_NOWAIT | LK_EXCLUSIVE, &pvp)) { + VOP_UNLOCK(vp, 0, td); + error = VFS_VGET(mnt, parentino, LK_EXCLUSIVE, &pvp); + vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, td); + if (error != 0) + return (error); + } + /* + * All MKDIR_PARENT dependencies and all the NEWBLOCK pagedeps + * that are contained in direct blocks will be resolved by + * doing a UFS_UPDATE. Pagedeps contained in indirect blocks + * may require a complete sync'ing of the directory. So, we + * try the cheap and fast UFS_UPDATE first, and if that fails, + * then we do the slower VOP_FSYNC of the directory. + */ + if (flushparent) { + if ((error = UFS_UPDATE(pvp, 1)) != 0) { + vput(pvp); + return (error); + } + if ((pagedep->pd_state & NEWBLOCK) && + (error = VOP_FSYNC(pvp, td->td_ucred, MNT_WAIT, td))) { + vput(pvp); + return (error); + } + } + /* + * Flush directory page containing the inode's name. + */ + error = bread(pvp, lbn, blksize(fs, VTOI(pvp), lbn), td->td_ucred, + &bp); + if (error == 0) + error = BUF_WRITE(bp); + else + brelse(bp); + vput(pvp); + if (error != 0) + return (error); + ACQUIRE_LOCK(&lk); + if (inodedep_lookup(fs, ip->i_number, 0, &inodedep) == 0) + break; + } + FREE_LOCK(&lk); + return (0); +} + +/* + * Flush all the dirty bitmaps associated with the block device + * before flushing the rest of the dirty blocks so as to reduce + * the number of dependencies that will have to be rolled back. + */ +void +softdep_fsync_mountdev(vp) + struct vnode *vp; +{ + struct buf *bp, *nbp; + struct worklist *wk; + + if (!vn_isdisk(vp, NULL)) + panic("softdep_fsync_mountdev: vnode not a disk"); + ACQUIRE_LOCK(&lk); + VI_LOCK(vp); + for (bp = TAILQ_FIRST(&vp->v_dirtyblkhd); bp; bp = nbp) { + nbp = TAILQ_NEXT(bp, b_vnbufs); + /* + * If it is already scheduled, skip to the next buffer. + */ + if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL)) + continue; + + if ((bp->b_flags & B_DELWRI) == 0) { + FREE_LOCK(&lk); + panic("softdep_fsync_mountdev: not dirty"); + } + /* + * We are only interested in bitmaps with outstanding + * dependencies. + */ + if ((wk = LIST_FIRST(&bp->b_dep)) == NULL || + wk->wk_type != D_BMSAFEMAP || + (bp->b_vflags & BV_BKGRDINPROG)) { + BUF_UNLOCK(bp); + continue; + } + VI_UNLOCK(vp); + bremfree(bp); + FREE_LOCK(&lk); + (void) bawrite(bp); + ACQUIRE_LOCK(&lk); + /* + * Since we may have slept during the I/O, we need + * to start from a known point. + */ + VI_LOCK(vp); + nbp = TAILQ_FIRST(&vp->v_dirtyblkhd); + } + drain_output(vp, 1); + VI_UNLOCK(vp); + FREE_LOCK(&lk); +} + +/* + * This routine is called when we are trying to synchronously flush a + * file. This routine must eliminate any filesystem metadata dependencies + * so that the syncing routine can succeed by pushing the dirty blocks + * associated with the file. If any I/O errors occur, they are returned. 
+ */ +int +softdep_sync_metadata(ap) + struct vop_fsync_args /* { + struct vnode *a_vp; + struct ucred *a_cred; + int a_waitfor; + struct thread *a_td; + } */ *ap; +{ + struct vnode *vp = ap->a_vp; + struct pagedep *pagedep; + struct allocdirect *adp; + struct allocindir *aip; + struct buf *bp, *nbp; + struct worklist *wk; + int i, error, waitfor; + + /* + * Check whether this vnode is involved in a filesystem + * that is doing soft dependency processing. + */ + if (!vn_isdisk(vp, NULL)) { + if (!DOINGSOFTDEP(vp)) + return (0); + } else + if (vp->v_rdev->si_mountpoint == NULL || + (vp->v_rdev->si_mountpoint->mnt_flag & MNT_SOFTDEP) == 0) + return (0); + /* + * Ensure that any direct block dependencies have been cleared. + */ + ACQUIRE_LOCK(&lk); + if ((error = flush_inodedep_deps(VTOI(vp)->i_fs, VTOI(vp)->i_number))) { + FREE_LOCK(&lk); + return (error); + } + /* + * For most files, the only metadata dependencies are the + * cylinder group maps that allocate their inode or blocks. + * The block allocation dependencies can be found by traversing + * the dependency lists for any buffers that remain on their + * dirty buffer list. The inode allocation dependency will + * be resolved when the inode is updated with MNT_WAIT. + * This work is done in two passes. The first pass grabs most + * of the buffers and begins asynchronously writing them. The + * only way to wait for these asynchronous writes is to sleep + * on the filesystem vnode which may stay busy for a long time + * if the filesystem is active. So, instead, we make a second + * pass over the dependencies blocking on each write. In the + * usual case we will be blocking against a write that we + * initiated, so when it is done the dependency will have been + * resolved. Thus the second pass is expected to end quickly. + */ + waitfor = MNT_NOWAIT; +top: + /* + * We must wait for any I/O in progress to finish so that + * all potential buffers on the dirty list will be visible. + */ + VI_LOCK(vp); + drain_output(vp, 1); + bp = getdirtybuf(&TAILQ_FIRST(&vp->v_dirtyblkhd), + VI_MTX(vp), MNT_WAIT); + if (bp == NULL) { + VI_UNLOCK(vp); + FREE_LOCK(&lk); + return (0); + } + /* While syncing snapshots, we must allow recursive lookups */ + bp->b_lock.lk_flags |= LK_CANRECURSE; +loop: + /* + * As we hold the buffer locked, none of its dependencies + * will disappear. 
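+ * Most of the dependency types handled below are resolved by
+ * locating the buffer that the dependency is waiting on and pushing
+ * it to disk: asynchronously (bawrite) on the first pass and
+ * synchronously (BUF_WRITE) on the second. Inodedep and pagedep
+ * dependencies are instead flushed through their own helper
+ * routines, flush_inodedep_deps() and flush_pagedep_deps().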
+ */ + LIST_FOREACH(wk, &bp->b_dep, wk_list) { + switch (wk->wk_type) { + + case D_ALLOCDIRECT: + adp = WK_ALLOCDIRECT(wk); + if (adp->ad_state & DEPCOMPLETE) + continue; + nbp = adp->ad_buf; + nbp = getdirtybuf(&nbp, NULL, waitfor); + if (nbp == NULL) + continue; + FREE_LOCK(&lk); + if (waitfor == MNT_NOWAIT) { + bawrite(nbp); + } else if ((error = BUF_WRITE(nbp)) != 0) { + break; + } + ACQUIRE_LOCK(&lk); + continue; + + case D_ALLOCINDIR: + aip = WK_ALLOCINDIR(wk); + if (aip->ai_state & DEPCOMPLETE) + continue; + nbp = aip->ai_buf; + nbp = getdirtybuf(&nbp, NULL, waitfor); + if (nbp == NULL) + continue; + FREE_LOCK(&lk); + if (waitfor == MNT_NOWAIT) { + bawrite(nbp); + } else if ((error = BUF_WRITE(nbp)) != 0) { + break; + } + ACQUIRE_LOCK(&lk); + continue; + + case D_INDIRDEP: + restart: + + LIST_FOREACH(aip, &WK_INDIRDEP(wk)->ir_deplisthd, ai_next) { + if (aip->ai_state & DEPCOMPLETE) + continue; + nbp = aip->ai_buf; + nbp = getdirtybuf(&nbp, NULL, MNT_WAIT); + if (nbp == NULL) + goto restart; + FREE_LOCK(&lk); + if ((error = BUF_WRITE(nbp)) != 0) { + break; + } + ACQUIRE_LOCK(&lk); + goto restart; + } + continue; + + case D_INODEDEP: + if ((error = flush_inodedep_deps(WK_INODEDEP(wk)->id_fs, + WK_INODEDEP(wk)->id_ino)) != 0) { + FREE_LOCK(&lk); + break; + } + continue; + + case D_PAGEDEP: + /* + * We are trying to sync a directory that may + * have dependencies on both its own metadata + * and/or dependencies on the inodes of any + * recently allocated files. We walk its diradd + * lists pushing out the associated inode. + */ + pagedep = WK_PAGEDEP(wk); + for (i = 0; i < DAHASHSZ; i++) { + if (LIST_FIRST(&pagedep->pd_diraddhd[i]) == 0) + continue; + if ((error = + flush_pagedep_deps(vp, pagedep->pd_mnt, + &pagedep->pd_diraddhd[i]))) { + FREE_LOCK(&lk); + break; + } + } + continue; + + case D_MKDIR: + /* + * This case should never happen if the vnode has + * been properly sync'ed. However, if this function + * is used at a place where the vnode has not yet + * been sync'ed, this dependency can show up. So, + * rather than panic, just flush it. + */ + nbp = WK_MKDIR(wk)->md_buf; + nbp = getdirtybuf(&nbp, NULL, waitfor); + if (nbp == NULL) + continue; + FREE_LOCK(&lk); + if (waitfor == MNT_NOWAIT) { + bawrite(nbp); + } else if ((error = BUF_WRITE(nbp)) != 0) { + break; + } + ACQUIRE_LOCK(&lk); + continue; + + case D_BMSAFEMAP: + /* + * This case should never happen if the vnode has + * been properly sync'ed. However, if this function + * is used at a place where the vnode has not yet + * been sync'ed, this dependency can show up. So, + * rather than panic, just flush it. 
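+ * The recovery is the same as for the D_MKDIR case above: locate
+ * the buffer that the dependency is waiting on and push it to disk.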
+ */ + nbp = WK_BMSAFEMAP(wk)->sm_buf; + nbp = getdirtybuf(&nbp, NULL, waitfor); + if (nbp == NULL) + continue; + FREE_LOCK(&lk); + if (waitfor == MNT_NOWAIT) { + bawrite(nbp); + } else if ((error = BUF_WRITE(nbp)) != 0) { + break; + } + ACQUIRE_LOCK(&lk); + continue; + + default: + FREE_LOCK(&lk); + panic("softdep_sync_metadata: Unknown type %s", + TYPENAME(wk->wk_type)); + /* NOTREACHED */ + } + /* We reach here only in error and unlocked */ + if (error == 0) + panic("softdep_sync_metadata: zero error"); + bp->b_lock.lk_flags &= ~LK_CANRECURSE; + bawrite(bp); + return (error); + } + VI_LOCK(vp); + nbp = getdirtybuf(&TAILQ_NEXT(bp, b_vnbufs), VI_MTX(vp), MNT_WAIT); + if (nbp == NULL) + VI_UNLOCK(vp); + FREE_LOCK(&lk); + bp->b_lock.lk_flags &= ~LK_CANRECURSE; + bawrite(bp); + ACQUIRE_LOCK(&lk); + if (nbp != NULL) { + bp = nbp; + goto loop; + } + /* + * The brief unlock is to allow any pent up dependency + * processing to be done. Then proceed with the second pass. + */ + if (waitfor == MNT_NOWAIT) { + waitfor = MNT_WAIT; + FREE_LOCK(&lk); + ACQUIRE_LOCK(&lk); + goto top; + } + + /* + * If we have managed to get rid of all the dirty buffers, + * then we are done. For certain directories and block + * devices, we may need to do further work. + * + * We must wait for any I/O in progress to finish so that + * all potential buffers on the dirty list will be visible. + */ + VI_LOCK(vp); + drain_output(vp, 1); + if (TAILQ_FIRST(&vp->v_dirtyblkhd) == NULL) { + VI_UNLOCK(vp); + FREE_LOCK(&lk); + return (0); + } + VI_UNLOCK(vp); + + FREE_LOCK(&lk); + /* + * If we are trying to sync a block device, some of its buffers may + * contain metadata that cannot be written until the contents of some + * partially written files have been written to disk. The only easy + * way to accomplish this is to sync the entire filesystem (luckily + * this happens rarely). + */ + if (vn_isdisk(vp, NULL) && + vp->v_rdev->si_mountpoint && !VOP_ISLOCKED(vp, NULL) && + (error = VFS_SYNC(vp->v_rdev->si_mountpoint, MNT_WAIT, ap->a_cred, + ap->a_td)) != 0) + return (error); + return (0); +} + +/* + * Flush the dependencies associated with an inodedep. + * Called with splbio blocked. + */ +static int +flush_inodedep_deps(fs, ino) + struct fs *fs; + ino_t ino; +{ + struct inodedep *inodedep; + int error, waitfor; + + /* + * This work is done in two passes. The first pass grabs most + * of the buffers and begins asynchronously writing them. The + * only way to wait for these asynchronous writes is to sleep + * on the filesystem vnode which may stay busy for a long time + * if the filesystem is active. So, instead, we make a second + * pass over the dependencies blocking on each write. In the + * usual case we will be blocking against a write that we + * initiated, so when it is done the dependency will have been + * resolved. Thus the second pass is expected to end quickly. + * We give a brief window at the top of the loop to allow + * any pending I/O to complete. + */ + for (error = 0, waitfor = MNT_NOWAIT; ; ) { + if (error) + return (error); + FREE_LOCK(&lk); + ACQUIRE_LOCK(&lk); + if (inodedep_lookup(fs, ino, 0, &inodedep) == 0) + return (0); + if (flush_deplist(&inodedep->id_inoupdt, waitfor, &error) || + flush_deplist(&inodedep->id_newinoupdt, waitfor, &error) || + flush_deplist(&inodedep->id_extupdt, waitfor, &error) || + flush_deplist(&inodedep->id_newextupdt, waitfor, &error)) + continue; + /* + * If pass2, we are done, otherwise do pass 2. 
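+ * That is, once the MNT_WAIT (second) pass has completed we are
+ * done; otherwise switch waitfor from MNT_NOWAIT to MNT_WAIT and
+ * go around the loop once more.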
+ */ + if (waitfor == MNT_WAIT) + break; + waitfor = MNT_WAIT; + } + /* + * Try freeing inodedep in case all dependencies have been removed. + */ + if (inodedep_lookup(fs, ino, 0, &inodedep) != 0) + (void) free_inodedep(inodedep); + return (0); +} + +/* + * Flush an inode dependency list. + * Called with splbio blocked. + */ +static int +flush_deplist(listhead, waitfor, errorp) + struct allocdirectlst *listhead; + int waitfor; + int *errorp; +{ + struct allocdirect *adp; + struct buf *bp; + + TAILQ_FOREACH(adp, listhead, ad_next) { + if (adp->ad_state & DEPCOMPLETE) + continue; + bp = adp->ad_buf; + bp = getdirtybuf(&bp, NULL, waitfor); + if (bp == NULL) { + if (waitfor == MNT_NOWAIT) + continue; + return (1); + } + FREE_LOCK(&lk); + if (waitfor == MNT_NOWAIT) { + bawrite(bp); + } else if ((*errorp = BUF_WRITE(bp)) != 0) { + ACQUIRE_LOCK(&lk); + return (1); + } + ACQUIRE_LOCK(&lk); + return (1); + } + return (0); +} + +/* + * Eliminate a pagedep dependency by flushing out all its diradd dependencies. + * Called with splbio blocked. + */ +static int +flush_pagedep_deps(pvp, mp, diraddhdp) + struct vnode *pvp; + struct mount *mp; + struct diraddhd *diraddhdp; +{ + struct thread *td = curthread; + struct inodedep *inodedep; + struct ufsmount *ump; + struct diradd *dap; + struct vnode *vp; + int error = 0; + struct buf *bp; + ino_t inum; + + ump = VFSTOUFS(mp); + while ((dap = LIST_FIRST(diraddhdp)) != NULL) { + /* + * Flush ourselves if this directory entry + * has a MKDIR_PARENT dependency. + */ + if (dap->da_state & MKDIR_PARENT) { + FREE_LOCK(&lk); + if ((error = UFS_UPDATE(pvp, 1)) != 0) + break; + ACQUIRE_LOCK(&lk); + /* + * If that cleared dependencies, go on to next. + */ + if (dap != LIST_FIRST(diraddhdp)) + continue; + if (dap->da_state & MKDIR_PARENT) { + FREE_LOCK(&lk); + panic("flush_pagedep_deps: MKDIR_PARENT"); + } + } + /* + * A newly allocated directory must have its "." and + * ".." entries written out before its name can be + * committed in its parent. We do not want or need + * the full semantics of a synchronous VOP_FSYNC as + * that may end up here again, once for each directory + * level in the filesystem. Instead, we push the blocks + * and wait for them to clear. We have to fsync twice + * because the first call may choose to defer blocks + * that still have dependencies, but deferral will + * happen at most once. + */ + inum = dap->da_newinum; + if (dap->da_state & MKDIR_BODY) { + FREE_LOCK(&lk); + if ((error = VFS_VGET(mp, inum, LK_EXCLUSIVE, &vp))) + break; + if ((error=VOP_FSYNC(vp, td->td_ucred, MNT_NOWAIT, td)) || + (error=VOP_FSYNC(vp, td->td_ucred, MNT_NOWAIT, td))) { + vput(vp); + break; + } + VI_LOCK(vp); + drain_output(vp, 0); + VI_UNLOCK(vp); + vput(vp); + ACQUIRE_LOCK(&lk); + /* + * If that cleared dependencies, go on to next. + */ + if (dap != LIST_FIRST(diraddhdp)) + continue; + if (dap->da_state & MKDIR_BODY) { + FREE_LOCK(&lk); + panic("flush_pagedep_deps: MKDIR_BODY"); + } + } + /* + * Flush the inode on which the directory entry depends. + * Having accounted for MKDIR_PARENT and MKDIR_BODY above, + * the only remaining dependency is that the updated inode + * count must get pushed to disk. The inode has already + * been pushed into its inode buffer (via VOP_UPDATE) at + * the time of the reference count change. So we need only + * locate that buffer, ensure that there will be no rollback + * caused by a bitmap dependency, then write the inode buffer. 
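+ * The cylinder group buffer holding the bitmap dependency (id_buf),
+ * if there is one, is written first; the inode block itself is then
+ * read back in and written synchronously.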
+ */ + if (inodedep_lookup(ump->um_fs, inum, 0, &inodedep) == 0) { + FREE_LOCK(&lk); + panic("flush_pagedep_deps: lost inode"); + } + /* + * If the inode still has bitmap dependencies, + * push them to disk. + */ + if ((inodedep->id_state & DEPCOMPLETE) == 0) { + bp = inodedep->id_buf; + bp = getdirtybuf(&bp, NULL, MNT_WAIT); + FREE_LOCK(&lk); + if (bp && (error = BUF_WRITE(bp)) != 0) + break; + ACQUIRE_LOCK(&lk); + if (dap != LIST_FIRST(diraddhdp)) + continue; + } + /* + * If the inode is still sitting in a buffer waiting + * to be written, push it to disk. + */ + FREE_LOCK(&lk); + if ((error = bread(ump->um_devvp, + fsbtodb(ump->um_fs, ino_to_fsba(ump->um_fs, inum)), + (int)ump->um_fs->fs_bsize, NOCRED, &bp)) != 0) { + brelse(bp); + break; + } + if ((error = BUF_WRITE(bp)) != 0) + break; + ACQUIRE_LOCK(&lk); + /* + * If we have failed to get rid of all the dependencies + * then something is seriously wrong. + */ + if (dap == LIST_FIRST(diraddhdp)) { + FREE_LOCK(&lk); + panic("flush_pagedep_deps: flush failed"); + } + } + if (error) + ACQUIRE_LOCK(&lk); + return (error); +} + +/* + * A large burst of file addition or deletion activity can drive the + * memory load excessively high. First attempt to slow things down + * using the techniques below. If that fails, this routine requests + * the offending operations to fall back to running synchronously + * until the memory load returns to a reasonable level. + */ +int +softdep_slowdown(vp) + struct vnode *vp; +{ + int max_softdeps_hard; + + max_softdeps_hard = max_softdeps * 11 / 10; + if (num_dirrem < max_softdeps_hard / 2 && + num_inodedep < max_softdeps_hard && + VFSTOUFS(vp->v_mount)->um_numindirdeps < maxindirdeps) + return (0); + if (VFSTOUFS(vp->v_mount)->um_numindirdeps >= maxindirdeps) + speedup_syncer(); + stat_sync_limit_hit += 1; + return (1); +} + +/* + * Called by the allocation routines when they are about to fail + * in the hope that we can free up some disk space. + * + * First check to see if the work list has anything on it. If it has, + * clean up entries until we successfully free some space. Because this + * process holds inodes locked, we cannot handle any remove requests + * that might block on a locked inode as that could lead to deadlock. + * If the worklist yields no free space, encourage the syncer daemon + * to help us. In no event will we try for longer than tickdelay seconds. + */ +int +softdep_request_cleanup(fs, vp) + struct fs *fs; + struct vnode *vp; +{ + long starttime; + ufs2_daddr_t needed; + + needed = fs->fs_cstotal.cs_nbfree + fs->fs_contigsumsize; + starttime = time_second + tickdelay; + /* + * If we are being called because of a process doing a + * copy-on-write, then it is not safe to update the vnode + * as we may recurse into the copy-on-write routine. + */ + if (!(curthread->td_pflags & TDP_COWINPROGRESS) && + UFS_UPDATE(vp, 1) != 0) + return (0); + while (fs->fs_pendingblocks > 0 && fs->fs_cstotal.cs_nbfree <= needed) { + if (time_second > starttime) + return (0); + if (num_on_worklist > 0 && + process_worklist_item(NULL, LK_NOWAIT) != -1) { + stat_worklist_push += 1; + continue; + } + request_cleanup(FLUSH_REMOVE_WAIT, 0); + } + return (1); +} + +/* + * If memory utilization has gotten too high, deliberately slow things + * down and speed up the I/O processing. + */ +static int +request_cleanup(resource, islocked) + int resource; + int islocked; +{ + struct thread *td = curthread; + + /* + * We never hold up the filesystem syncer process. 
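+ * The syncer is the process that performs the cleanup requested
+ * below, so making it wait on itself here would be self-defeating.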
+ */ + if (td == filesys_syncer) + return (0); + /* + * First check to see if the work list has gotten backlogged. + * If it has, co-opt this process to help clean up two entries. + * Because this process may hold inodes locked, we cannot + * handle any remove requests that might block on a locked + * inode as that could lead to deadlock. + */ + if (num_on_worklist > max_softdeps / 10) { + if (islocked) + FREE_LOCK(&lk); + process_worklist_item(NULL, LK_NOWAIT); + process_worklist_item(NULL, LK_NOWAIT); + stat_worklist_push += 2; + if (islocked) + ACQUIRE_LOCK(&lk); + return(1); + } + /* + * Next, we attempt to speed up the syncer process. If that + * is successful, then we allow the process to continue. + */ + if (speedup_syncer() && resource != FLUSH_REMOVE_WAIT) + return(0); + /* + * If we are resource constrained on inode dependencies, try + * flushing some dirty inodes. Otherwise, we are constrained + * by file deletions, so try accelerating flushes of directories + * with removal dependencies. We would like to do the cleanup + * here, but we probably hold an inode locked at this point and + * that might deadlock against one that we try to clean. So, + * the best that we can do is request the syncer daemon to do + * the cleanup for us. + */ + switch (resource) { + + case FLUSH_INODES: + stat_ino_limit_push += 1; + req_clear_inodedeps += 1; + stat_countp = &stat_ino_limit_hit; + break; + + case FLUSH_REMOVE: + case FLUSH_REMOVE_WAIT: + stat_blk_limit_push += 1; + req_clear_remove += 1; + stat_countp = &stat_blk_limit_hit; + break; + + default: + if (islocked) + FREE_LOCK(&lk); + panic("request_cleanup: unknown type"); + } + /* + * Hopefully the syncer daemon will catch up and awaken us. + * We wait at most tickdelay before proceeding in any case. + */ + if (islocked == 0) + ACQUIRE_LOCK(&lk); + proc_waiting += 1; + if (handle.callout == NULL) + handle = timeout(pause_timer, 0, tickdelay > 2 ? tickdelay : 2); + interlocked_sleep(&lk, SLEEP, (caddr_t)&proc_waiting, NULL, PPAUSE, + "softupdate", 0); + proc_waiting -= 1; + if (islocked == 0) + FREE_LOCK(&lk); + return (1); +} + +/* + * Awaken processes pausing in request_cleanup and clear proc_waiting + * to indicate that there is no longer a timer running. + */ +static void +pause_timer(arg) + void *arg; +{ + + *stat_countp += 1; + wakeup_one(&proc_waiting); + if (proc_waiting > 0) + handle = timeout(pause_timer, 0, tickdelay > 2 ? tickdelay : 2); + else + handle.callout = NULL; +} + +/* + * Flush out a directory with at least one removal dependency in an effort to + * reduce the number of dirrem, freefile, and freeblks dependency structures. 
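+ * The victim directory is chosen by a round-robin scan of the
+ * pagedep hash chains that resumes where the previous call left off.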
+ */ +static void +clear_remove(td) + struct thread *td; +{ + struct pagedep_hashhead *pagedephd; + struct pagedep *pagedep; + static int next = 0; + struct mount *mp; + struct vnode *vp; + int error, cnt; + ino_t ino; + + ACQUIRE_LOCK(&lk); + for (cnt = 0; cnt < pagedep_hash; cnt++) { + pagedephd = &pagedep_hashtbl[next++]; + if (next >= pagedep_hash) + next = 0; + LIST_FOREACH(pagedep, pagedephd, pd_hash) { + if (LIST_FIRST(&pagedep->pd_dirremhd) == NULL) + continue; + mp = pagedep->pd_mnt; + ino = pagedep->pd_ino; + if (vn_start_write(NULL, &mp, V_NOWAIT) != 0) + continue; + FREE_LOCK(&lk); + if ((error = VFS_VGET(mp, ino, LK_EXCLUSIVE, &vp))) { + softdep_error("clear_remove: vget", error); + vn_finished_write(mp); + return; + } + if ((error = VOP_FSYNC(vp, td->td_ucred, MNT_NOWAIT, td))) + softdep_error("clear_remove: fsync", error); + VI_LOCK(vp); + drain_output(vp, 0); + VI_UNLOCK(vp); + vput(vp); + vn_finished_write(mp); + return; + } + } + FREE_LOCK(&lk); +} + +/* + * Clear out a block of dirty inodes in an effort to reduce + * the number of inodedep dependency structures. + */ +static void +clear_inodedeps(td) + struct thread *td; +{ + struct inodedep_hashhead *inodedephd; + struct inodedep *inodedep; + static int next = 0; + struct mount *mp; + struct vnode *vp; + struct fs *fs; + int error, cnt; + ino_t firstino, lastino, ino; + + ACQUIRE_LOCK(&lk); + /* + * Pick a random inode dependency to be cleared. + * We will then gather up all the inodes in its block + * that have dependencies and flush them out. + */ + for (cnt = 0; cnt < inodedep_hash; cnt++) { + inodedephd = &inodedep_hashtbl[next++]; + if (next >= inodedep_hash) + next = 0; + if ((inodedep = LIST_FIRST(inodedephd)) != NULL) + break; + } + if (inodedep == NULL) + return; + /* + * Ugly code to find mount point given pointer to superblock. + */ + fs = inodedep->id_fs; + TAILQ_FOREACH(mp, &mountlist, mnt_list) + if ((mp->mnt_flag & MNT_SOFTDEP) && fs == VFSTOUFS(mp)->um_fs) + break; + /* + * Find the last inode in the block with dependencies. + */ + firstino = inodedep->id_ino & ~(INOPB(fs) - 1); + for (lastino = firstino + INOPB(fs) - 1; lastino > firstino; lastino--) + if (inodedep_lookup(fs, lastino, 0, &inodedep) != 0) + break; + /* + * Asynchronously push all but the last inode with dependencies. + * Synchronously push the last inode with dependencies to ensure + * that the inode block gets written to free up the inodedeps. + */ + for (ino = firstino; ino <= lastino; ino++) { + if (inodedep_lookup(fs, ino, 0, &inodedep) == 0) + continue; + FREE_LOCK(&lk); + if (vn_start_write(NULL, &mp, V_NOWAIT) != 0) + continue; + if ((error = VFS_VGET(mp, ino, LK_EXCLUSIVE, &vp)) != 0) { + softdep_error("clear_inodedeps: vget", error); + vn_finished_write(mp); + return; + } + if (ino == lastino) { + if ((error = VOP_FSYNC(vp, td->td_ucred, MNT_WAIT, td))) + softdep_error("clear_inodedeps: fsync1", error); + } else { + if ((error = VOP_FSYNC(vp, td->td_ucred, MNT_NOWAIT, td))) + softdep_error("clear_inodedeps: fsync2", error); + VI_LOCK(vp); + drain_output(vp, 0); + VI_UNLOCK(vp); + } + vput(vp); + vn_finished_write(mp); + ACQUIRE_LOCK(&lk); + } + FREE_LOCK(&lk); +} + +/* + * Function to determine if the buffer has outstanding dependencies + * that will cause a roll-back if the buffer is written. If wantcount + * is set, return number of dependencies, otherwise just yes or no. 
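+ * A non-zero return therefore means that writing the buffer now
+ * would force at least one dependency to be rolled back before the
+ * write and rolled forward again once it completes.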
+ */ +static int +softdep_count_dependencies(bp, wantcount) + struct buf *bp; + int wantcount; +{ + struct worklist *wk; + struct inodedep *inodedep; + struct indirdep *indirdep; + struct allocindir *aip; + struct pagedep *pagedep; + struct diradd *dap; + int i, retval; + + retval = 0; + ACQUIRE_LOCK(&lk); + LIST_FOREACH(wk, &bp->b_dep, wk_list) { + switch (wk->wk_type) { + + case D_INODEDEP: + inodedep = WK_INODEDEP(wk); + if ((inodedep->id_state & DEPCOMPLETE) == 0) { + /* bitmap allocation dependency */ + retval += 1; + if (!wantcount) + goto out; + } + if (TAILQ_FIRST(&inodedep->id_inoupdt)) { + /* direct block pointer dependency */ + retval += 1; + if (!wantcount) + goto out; + } + if (TAILQ_FIRST(&inodedep->id_extupdt)) { + /* direct block pointer dependency */ + retval += 1; + if (!wantcount) + goto out; + } + continue; + + case D_INDIRDEP: + indirdep = WK_INDIRDEP(wk); + + LIST_FOREACH(aip, &indirdep->ir_deplisthd, ai_next) { + /* indirect block pointer dependency */ + retval += 1; + if (!wantcount) + goto out; + } + continue; + + case D_PAGEDEP: + pagedep = WK_PAGEDEP(wk); + for (i = 0; i < DAHASHSZ; i++) { + + LIST_FOREACH(dap, &pagedep->pd_diraddhd[i], da_pdlist) { + /* directory entry dependency */ + retval += 1; + if (!wantcount) + goto out; + } + } + continue; + + case D_BMSAFEMAP: + case D_ALLOCDIRECT: + case D_ALLOCINDIR: + case D_MKDIR: + /* never a dependency on these blocks */ + continue; + + default: + FREE_LOCK(&lk); + panic("softdep_check_for_rollback: Unexpected type %s", + TYPENAME(wk->wk_type)); + /* NOTREACHED */ + } + } +out: + FREE_LOCK(&lk); + return retval; +} + +/* + * Acquire exclusive access to a buffer. + * Must be called with splbio blocked. + * Return acquired buffer or NULL on failure. mtx, if provided, will be + * released on success but held on failure. + */ +static struct buf * +getdirtybuf(bpp, mtx, waitfor) + struct buf **bpp; + struct mtx *mtx; + int waitfor; +{ + struct buf *bp; + int error; + + /* + * XXX This code and the code that calls it need to be reviewed to + * verify its use of the vnode interlock. + */ + + for (;;) { + if ((bp = *bpp) == NULL) + return (0); + if (bp->b_vp == NULL) + backtrace(); + if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL) == 0) { + if ((bp->b_vflags & BV_BKGRDINPROG) == 0) + break; + BUF_UNLOCK(bp); + if (waitfor != MNT_WAIT) + return (NULL); + /* + * The mtx argument must be bp->b_vp's mutex in + * this case. + */ +#ifdef DEBUG_VFS_LOCKS + if (bp->b_vp->v_type != VCHR) + ASSERT_VI_LOCKED(bp->b_vp, "getdirtybuf"); +#endif + bp->b_vflags |= BV_BKGRDWAIT; + interlocked_sleep(&lk, SLEEP, &bp->b_xflags, mtx, + PRIBIO, "getbuf", 0); + continue; + } + if (waitfor != MNT_WAIT) + return (NULL); + if (mtx) { + error = interlocked_sleep(&lk, LOCKBUF, bp, mtx, + LK_EXCLUSIVE | LK_SLEEPFAIL | LK_INTERLOCK, 0, 0); + mtx_lock(mtx); + } else + error = interlocked_sleep(&lk, LOCKBUF, bp, NULL, + LK_EXCLUSIVE | LK_SLEEPFAIL, 0, 0); + if (error != ENOLCK) { + FREE_LOCK(&lk); + panic("getdirtybuf: inconsistent lock"); + } + } + if ((bp->b_flags & B_DELWRI) == 0) { + BUF_UNLOCK(bp); + return (NULL); + } + if (mtx) + mtx_unlock(mtx); + bremfree(bp); + return (bp); +} + +/* + * Wait for pending output on a vnode to complete. + * Must be called with vnode lock and interlock locked. 
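+ * The wait is implemented by sleeping on v_numoutput until the
+ * count of writes in progress on the vnode drains to zero.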
+ */ +static void +drain_output(vp, islocked) + struct vnode *vp; + int islocked; +{ + ASSERT_VOP_LOCKED(vp, "drain_output"); + ASSERT_VI_LOCKED(vp, "drain_output"); + + if (!islocked) + ACQUIRE_LOCK(&lk); + while (vp->v_numoutput) { + vp->v_iflag |= VI_BWAIT; + interlocked_sleep(&lk, SLEEP, (caddr_t)&vp->v_numoutput, + VI_MTX(vp), PRIBIO + 1, "drainvp", 0); + } + if (!islocked) + FREE_LOCK(&lk); +} + +/* + * Called whenever a buffer that is being invalidated or reallocated + * contains dependencies. This should only happen if an I/O error has + * occurred. The routine is called with the buffer locked. + */ +static void +softdep_deallocate_dependencies(bp) + struct buf *bp; +{ + + if ((bp->b_ioflags & BIO_ERROR) == 0) + panic("softdep_deallocate_dependencies: dangling deps"); + softdep_error(bp->b_vp->v_mount->mnt_stat.f_mntonname, bp->b_error); + panic("softdep_deallocate_dependencies: unrecovered I/O error"); +} + +/* + * Function to handle asynchronous write errors in the filesystem. + */ +static void +softdep_error(func, error) + char *func; + int error; +{ + + /* XXX should do something better! */ + printf("%s: got error %d while accessing filesystem\n", func, error); +} diff --git a/src/sys/ufs/ffs/ffs_softdep_stub.c b/src/sys/ufs/ffs/ffs_softdep_stub.c new file mode 100644 index 0000000..486bc58 --- /dev/null +++ b/src/sys/ufs/ffs/ffs_softdep_stub.c @@ -0,0 +1,317 @@ +/* + * Copyright 1998 Marshall Kirk McKusick. All Rights Reserved. + * + * The soft updates code is derived from the appendix of a University + * of Michigan technical report (Gregory R. Ganger and Yale N. Patt, + * "Soft Updates: A Solution to the Metadata Update Problem in File + * Systems", CSE-TR-254-95, August 1995). + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. None of the names of McKusick, Ganger, or the University of Michigan + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY MARSHALL KIRK MCKUSICK ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL MARSHALL KIRK MCKUSICK BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * from: @(#)ffs_softdep_stub.c 9.1 (McKusick) 7/10/97 + */ + +#include +__FBSDID("$FreeBSD: src/sys/ufs/ffs/ffs_softdep_stub.c,v 1.27 2003/06/11 06:31:28 obrien Exp $"); + +/* + * Use this file as ffs_softdep.c if you do not wish the real ffs_softdep.c + * to be included in your system. 
(e.g for legal reasons ) + * The real files are in /usr/src/contrib/sys/softupdates. + * You must copy them here before you can use soft updates. + * Read the README for legal and technical information. + */ + +#include "opt_ffs.h" +#if (SOFTUPDATES == 0 ) /* SOFTUPDATES not configured in, use these stubs. */ +#include +#include +#include +#include +#include +#include +#include +#include + +int +softdep_flushfiles(oldmnt, flags, td) + struct mount *oldmnt; + int flags; + struct thread *td; +{ + + panic("softdep_flushfiles called"); +} + +int +softdep_mount(devvp, mp, fs, cred) + struct vnode *devvp; + struct mount *mp; + struct fs *fs; + struct ucred *cred; +{ + + return (0); +} + +void +softdep_initialize() +{ + + return; +} + +void +softdep_uninitialize() +{ + + return; +} + +void +softdep_setup_inomapdep(bp, ip, newinum) + struct buf *bp; + struct inode *ip; + ino_t newinum; +{ + + panic("softdep_setup_inomapdep called"); +} + +void +softdep_setup_blkmapdep(bp, fs, newblkno) + struct buf *bp; + struct fs *fs; + ufs2_daddr_t newblkno; +{ + + panic("softdep_setup_blkmapdep called"); +} + +void +softdep_setup_allocdirect(ip, lbn, newblkno, oldblkno, newsize, oldsize, bp) + struct inode *ip; + ufs_lbn_t lbn; + ufs2_daddr_t newblkno; + ufs2_daddr_t oldblkno; + long newsize; + long oldsize; + struct buf *bp; +{ + + panic("softdep_setup_allocdirect called"); +} + +void +softdep_setup_allocext(ip, lbn, newblkno, oldblkno, newsize, oldsize, bp) + struct inode *ip; + ufs_lbn_t lbn; + ufs2_daddr_t newblkno; + ufs2_daddr_t oldblkno; + long newsize; + long oldsize; + struct buf *bp; +{ + + panic("softdep_setup_allocdirect called"); +} + +void +softdep_setup_allocindir_page(ip, lbn, bp, ptrno, newblkno, oldblkno, nbp) + struct inode *ip; + ufs_lbn_t lbn; + struct buf *bp; + int ptrno; + ufs2_daddr_t newblkno; + ufs2_daddr_t oldblkno; + struct buf *nbp; +{ + + panic("softdep_setup_allocindir_page called"); +} + +void +softdep_setup_allocindir_meta(nbp, ip, bp, ptrno, newblkno) + struct buf *nbp; + struct inode *ip; + struct buf *bp; + int ptrno; + ufs2_daddr_t newblkno; +{ + + panic("softdep_setup_allocindir_meta called"); +} + +void +softdep_setup_freeblocks(ip, length, flags) + struct inode *ip; + off_t length; + int flags; +{ + + panic("softdep_setup_freeblocks called"); +} + +void +softdep_freefile(pvp, ino, mode) + struct vnode *pvp; + ino_t ino; + int mode; +{ + + panic("softdep_freefile called"); +} + +int +softdep_setup_directory_add(bp, dp, diroffset, newinum, newdirbp, isnewblk) + struct buf *bp; + struct inode *dp; + off_t diroffset; + ino_t newinum; + struct buf *newdirbp; + int isnewblk; +{ + + panic("softdep_setup_directory_add called"); +} + +void +softdep_change_directoryentry_offset(dp, base, oldloc, newloc, entrysize) + struct inode *dp; + caddr_t base; + caddr_t oldloc; + caddr_t newloc; + int entrysize; +{ + + panic("softdep_change_directoryentry_offset called"); +} + +void +softdep_setup_remove(bp, dp, ip, isrmdir) + struct buf *bp; + struct inode *dp; + struct inode *ip; + int isrmdir; +{ + + panic("softdep_setup_remove called"); +} + +void +softdep_setup_directory_change(bp, dp, ip, newinum, isrmdir) + struct buf *bp; + struct inode *dp; + struct inode *ip; + ino_t newinum; + int isrmdir; +{ + + panic("softdep_setup_directory_change called"); +} + +void +softdep_change_linkcnt(ip) + struct inode *ip; +{ + + panic("softdep_change_linkcnt called"); +} + +void +softdep_load_inodeblock(ip) + struct inode *ip; +{ + + panic("softdep_load_inodeblock called"); +} + +void 
+softdep_update_inodeblock(ip, bp, waitfor) + struct inode *ip; + struct buf *bp; + int waitfor; +{ + + panic("softdep_update_inodeblock called"); +} + +void +softdep_fsync_mountdev(vp) + struct vnode *vp; +{ + + return; +} + +int +softdep_flushworklist(oldmnt, countp, td) + struct mount *oldmnt; + int *countp; + struct thread *td; +{ + + *countp = 0; + return (0); +} + +int +softdep_sync_metadata(ap) + struct vop_fsync_args /* { + struct vnode *a_vp; + struct ucred *a_cred; + int a_waitfor; + struct thread *a_td; + } */ *ap; +{ + + return (0); +} + +int +softdep_slowdown(vp) + struct vnode *vp; +{ + + panic("softdep_slowdown called"); +} + +void +softdep_releasefile(ip) + struct inode *ip; /* inode with the zero effective link count */ +{ + + panic("softdep_releasefile called"); +} + +int +softdep_request_cleanup(fs, vp) + struct fs *fs; + struct vnode *vp; +{ + + return (0); +} +#endif /* SOFTUPDATES not configured in */ diff --git a/src/sys/ufs/ffs/ffs_subr.c b/src/sys/ufs/ffs/ffs_subr.c new file mode 100644 index 0000000..a772bde --- /dev/null +++ b/src/sys/ufs/ffs/ffs_subr.c @@ -0,0 +1,292 @@ +/* + * Copyright (c) 1982, 1986, 1989, 1993 + * The Regents of the University of California. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. All advertising materials mentioning features or use of this software + * must display the following acknowledgement: + * This product includes software developed by the University of + * California, Berkeley and its contributors. + * 4. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * @(#)ffs_subr.c 8.5 (Berkeley) 3/21/95 + */ + +#include +__FBSDID("$FreeBSD: src/sys/ufs/ffs/ffs_subr.c,v 1.37 2003/06/11 06:31:28 obrien Exp $"); + +#include + +#ifndef _KERNEL +#include +#include +#include "fsck.h" +#else +#include "opt_ddb.h" + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +#ifdef DDB +void ffs_checkoverlap(struct buf *, struct inode *); +#endif + +/* + * Return buffer with the contents of block "offset" from the beginning of + * directory "ip". 
If "res" is non-zero, fill it in with a pointer to the + * remaining space in the directory. + */ +int +ffs_blkatoff(vp, offset, res, bpp) + struct vnode *vp; + off_t offset; + char **res; + struct buf **bpp; +{ + struct inode *ip; + struct fs *fs; + struct buf *bp; + ufs_lbn_t lbn; + int bsize, error; + + ip = VTOI(vp); + fs = ip->i_fs; + lbn = lblkno(fs, offset); + bsize = blksize(fs, ip, lbn); + + *bpp = NULL; + error = bread(vp, lbn, bsize, NOCRED, &bp); + if (error) { + brelse(bp); + return (error); + } + if (res) + *res = (char *)bp->b_data + blkoff(fs, offset); + *bpp = bp; + return (0); +} + +/* + * Load up the contents of an inode and copy the appropriate pieces + * to the incore copy. + */ +void +ffs_load_inode(bp, ip, fs, ino) + struct buf *bp; + struct inode *ip; + struct fs *fs; + ino_t ino; +{ + + if (ip->i_ump->um_fstype == UFS1) { + *ip->i_din1 = + *((struct ufs1_dinode *)bp->b_data + ino_to_fsbo(fs, ino)); + ip->i_mode = ip->i_din1->di_mode; + ip->i_nlink = ip->i_din1->di_nlink; + ip->i_size = ip->i_din1->di_size; + ip->i_flags = ip->i_din1->di_flags; + ip->i_gen = ip->i_din1->di_gen; + ip->i_uid = ip->i_din1->di_uid; + ip->i_gid = ip->i_din1->di_gid; + } else { + *ip->i_din2 = + *((struct ufs2_dinode *)bp->b_data + ino_to_fsbo(fs, ino)); + ip->i_mode = ip->i_din2->di_mode; + ip->i_nlink = ip->i_din2->di_nlink; + ip->i_size = ip->i_din2->di_size; + ip->i_flags = ip->i_din2->di_flags; + ip->i_gen = ip->i_din2->di_gen; + ip->i_uid = ip->i_din2->di_uid; + ip->i_gid = ip->i_din2->di_gid; + } +} +#endif /* KERNEL */ + +/* + * Update the frsum fields to reflect addition or deletion + * of some frags. + */ +void +ffs_fragacct(fs, fragmap, fraglist, cnt) + struct fs *fs; + int fragmap; + int32_t fraglist[]; + int cnt; +{ + int inblk; + int field, subfield; + int siz, pos; + + inblk = (int)(fragtbl[fs->fs_frag][fragmap]) << 1; + fragmap <<= 1; + for (siz = 1; siz < fs->fs_frag; siz++) { + if ((inblk & (1 << (siz + (fs->fs_frag % NBBY)))) == 0) + continue; + field = around[siz]; + subfield = inside[siz]; + for (pos = siz; pos <= fs->fs_frag; pos++) { + if ((fragmap & field) == subfield) { + fraglist[siz] += cnt; + pos += siz; + field <<= siz; + subfield <<= siz; + } + field <<= 1; + subfield <<= 1; + } + } +} + +#ifdef DDB +void +ffs_checkoverlap(bp, ip) + struct buf *bp; + struct inode *ip; +{ + struct buf *ebp, *ep; + ufs2_daddr_t start, last; + struct vnode *vp; + + ebp = &buf[nbuf]; + start = bp->b_blkno; + last = start + btodb(bp->b_bcount) - 1; + for (ep = buf; ep < ebp; ep++) { + if (ep == bp || (ep->b_flags & B_INVAL) || + ep->b_vp == NULLVP) + continue; + vp = ip->i_devvp; + /* look for overlap */ + if (ep->b_bcount == 0 || ep->b_blkno > last || + ep->b_blkno + btodb(ep->b_bcount) <= start) + continue; + vprint("Disk overlap", vp); + printf("\tstart %jd, end %jd overlap start %jd, end %jd\n", + (intmax_t)start, (intmax_t)last, (intmax_t)ep->b_blkno, + (intmax_t)(ep->b_blkno + btodb(ep->b_bcount) - 1)); + panic("ffs_checkoverlap: Disk buffer overlap"); + } +} +#endif /* DDB */ + +/* + * block operations + * + * check if a block is available + */ +int +ffs_isblock(fs, cp, h) + struct fs *fs; + unsigned char *cp; + ufs1_daddr_t h; +{ + unsigned char mask; + + switch ((int)fs->fs_frag) { + case 8: + return (cp[h] == 0xff); + case 4: + mask = 0x0f << ((h & 0x1) << 2); + return ((cp[h >> 1] & mask) == mask); + case 2: + mask = 0x03 << ((h & 0x3) << 1); + return ((cp[h >> 2] & mask) == mask); + case 1: + mask = 0x01 << (h & 0x7); + return ((cp[h >> 3] & mask) == mask); + default: 
+ panic("ffs_isblock"); + } + return (0); +} + +/* + * take a block out of the map + */ +void +ffs_clrblock(fs, cp, h) + struct fs *fs; + u_char *cp; + ufs1_daddr_t h; +{ + + switch ((int)fs->fs_frag) { + case 8: + cp[h] = 0; + return; + case 4: + cp[h >> 1] &= ~(0x0f << ((h & 0x1) << 2)); + return; + case 2: + cp[h >> 2] &= ~(0x03 << ((h & 0x3) << 1)); + return; + case 1: + cp[h >> 3] &= ~(0x01 << (h & 0x7)); + return; + default: + panic("ffs_clrblock"); + } +} + +/* + * put a block into the map + */ +void +ffs_setblock(fs, cp, h) + struct fs *fs; + unsigned char *cp; + ufs1_daddr_t h; +{ + + switch ((int)fs->fs_frag) { + + case 8: + cp[h] = 0xff; + return; + case 4: + cp[h >> 1] |= (0x0f << ((h & 0x1) << 2)); + return; + case 2: + cp[h >> 2] |= (0x03 << ((h & 0x3) << 1)); + return; + case 1: + cp[h >> 3] |= (0x01 << (h & 0x7)); + return; + default: + panic("ffs_setblock"); + } +} diff --git a/src/sys/ufs/ffs/ffs_tables.c b/src/sys/ufs/ffs/ffs_tables.c new file mode 100644 index 0000000..3df835c --- /dev/null +++ b/src/sys/ufs/ffs/ffs_tables.c @@ -0,0 +1,141 @@ +/* + * Copyright (c) 1982, 1986, 1993 + * The Regents of the University of California. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. All advertising materials mentioning features or use of this software + * must display the following acknowledgement: + * This product includes software developed by the University of + * California, Berkeley and its contributors. + * 4. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * @(#)ffs_tables.c 8.1 (Berkeley) 6/11/93 + */ + +#include +__FBSDID("$FreeBSD: src/sys/ufs/ffs/ffs_tables.c,v 1.10 2003/06/11 06:31:28 obrien Exp $"); + +#include +#include +#include + +/* + * Bit patterns for identifying fragments in the block map + * used as ((map & around) == inside) + */ +int around[9] = { + 0x3, 0x7, 0xf, 0x1f, 0x3f, 0x7f, 0xff, 0x1ff, 0x3ff +}; +int inside[9] = { + 0x0, 0x2, 0x6, 0xe, 0x1e, 0x3e, 0x7e, 0xfe, 0x1fe +}; + +/* + * Given a block map bit pattern, the frag tables tell whether a + * particular size fragment is available. 
+ * + * used as: + * if ((1 << (size - 1)) & fragtbl[fs->fs_frag][map] { + * at least one fragment of the indicated size is available + * } + * + * These tables are used by the scanc instruction on the VAX to + * quickly find an appropriate fragment. + */ +static u_char fragtbl124[256] = { + 0x00, 0x16, 0x16, 0x2a, 0x16, 0x16, 0x26, 0x4e, + 0x16, 0x16, 0x16, 0x3e, 0x2a, 0x3e, 0x4e, 0x8a, + 0x16, 0x16, 0x16, 0x3e, 0x16, 0x16, 0x36, 0x5e, + 0x16, 0x16, 0x16, 0x3e, 0x3e, 0x3e, 0x5e, 0x9e, + 0x16, 0x16, 0x16, 0x3e, 0x16, 0x16, 0x36, 0x5e, + 0x16, 0x16, 0x16, 0x3e, 0x3e, 0x3e, 0x5e, 0x9e, + 0x2a, 0x3e, 0x3e, 0x2a, 0x3e, 0x3e, 0x2e, 0x6e, + 0x3e, 0x3e, 0x3e, 0x3e, 0x2a, 0x3e, 0x6e, 0xaa, + 0x16, 0x16, 0x16, 0x3e, 0x16, 0x16, 0x36, 0x5e, + 0x16, 0x16, 0x16, 0x3e, 0x3e, 0x3e, 0x5e, 0x9e, + 0x16, 0x16, 0x16, 0x3e, 0x16, 0x16, 0x36, 0x5e, + 0x16, 0x16, 0x16, 0x3e, 0x3e, 0x3e, 0x5e, 0x9e, + 0x26, 0x36, 0x36, 0x2e, 0x36, 0x36, 0x26, 0x6e, + 0x36, 0x36, 0x36, 0x3e, 0x2e, 0x3e, 0x6e, 0xae, + 0x4e, 0x5e, 0x5e, 0x6e, 0x5e, 0x5e, 0x6e, 0x4e, + 0x5e, 0x5e, 0x5e, 0x7e, 0x6e, 0x7e, 0x4e, 0xce, + 0x16, 0x16, 0x16, 0x3e, 0x16, 0x16, 0x36, 0x5e, + 0x16, 0x16, 0x16, 0x3e, 0x3e, 0x3e, 0x5e, 0x9e, + 0x16, 0x16, 0x16, 0x3e, 0x16, 0x16, 0x36, 0x5e, + 0x16, 0x16, 0x16, 0x3e, 0x3e, 0x3e, 0x5e, 0x9e, + 0x16, 0x16, 0x16, 0x3e, 0x16, 0x16, 0x36, 0x5e, + 0x16, 0x16, 0x16, 0x3e, 0x3e, 0x3e, 0x5e, 0x9e, + 0x3e, 0x3e, 0x3e, 0x3e, 0x3e, 0x3e, 0x3e, 0x7e, + 0x3e, 0x3e, 0x3e, 0x3e, 0x3e, 0x3e, 0x7e, 0xbe, + 0x2a, 0x3e, 0x3e, 0x2a, 0x3e, 0x3e, 0x2e, 0x6e, + 0x3e, 0x3e, 0x3e, 0x3e, 0x2a, 0x3e, 0x6e, 0xaa, + 0x3e, 0x3e, 0x3e, 0x3e, 0x3e, 0x3e, 0x3e, 0x7e, + 0x3e, 0x3e, 0x3e, 0x3e, 0x3e, 0x3e, 0x7e, 0xbe, + 0x4e, 0x5e, 0x5e, 0x6e, 0x5e, 0x5e, 0x6e, 0x4e, + 0x5e, 0x5e, 0x5e, 0x7e, 0x6e, 0x7e, 0x4e, 0xce, + 0x8a, 0x9e, 0x9e, 0xaa, 0x9e, 0x9e, 0xae, 0xce, + 0x9e, 0x9e, 0x9e, 0xbe, 0xaa, 0xbe, 0xce, 0x8a, +}; + +static u_char fragtbl8[256] = { + 0x00, 0x01, 0x01, 0x02, 0x01, 0x01, 0x02, 0x04, + 0x01, 0x01, 0x01, 0x03, 0x02, 0x03, 0x04, 0x08, + 0x01, 0x01, 0x01, 0x03, 0x01, 0x01, 0x03, 0x05, + 0x02, 0x03, 0x03, 0x02, 0x04, 0x05, 0x08, 0x10, + 0x01, 0x01, 0x01, 0x03, 0x01, 0x01, 0x03, 0x05, + 0x01, 0x01, 0x01, 0x03, 0x03, 0x03, 0x05, 0x09, + 0x02, 0x03, 0x03, 0x02, 0x03, 0x03, 0x02, 0x06, + 0x04, 0x05, 0x05, 0x06, 0x08, 0x09, 0x10, 0x20, + 0x01, 0x01, 0x01, 0x03, 0x01, 0x01, 0x03, 0x05, + 0x01, 0x01, 0x01, 0x03, 0x03, 0x03, 0x05, 0x09, + 0x01, 0x01, 0x01, 0x03, 0x01, 0x01, 0x03, 0x05, + 0x03, 0x03, 0x03, 0x03, 0x05, 0x05, 0x09, 0x11, + 0x02, 0x03, 0x03, 0x02, 0x03, 0x03, 0x02, 0x06, + 0x03, 0x03, 0x03, 0x03, 0x02, 0x03, 0x06, 0x0a, + 0x04, 0x05, 0x05, 0x06, 0x05, 0x05, 0x06, 0x04, + 0x08, 0x09, 0x09, 0x0a, 0x10, 0x11, 0x20, 0x40, + 0x01, 0x01, 0x01, 0x03, 0x01, 0x01, 0x03, 0x05, + 0x01, 0x01, 0x01, 0x03, 0x03, 0x03, 0x05, 0x09, + 0x01, 0x01, 0x01, 0x03, 0x01, 0x01, 0x03, 0x05, + 0x03, 0x03, 0x03, 0x03, 0x05, 0x05, 0x09, 0x11, + 0x01, 0x01, 0x01, 0x03, 0x01, 0x01, 0x03, 0x05, + 0x01, 0x01, 0x01, 0x03, 0x03, 0x03, 0x05, 0x09, + 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x07, + 0x05, 0x05, 0x05, 0x07, 0x09, 0x09, 0x11, 0x21, + 0x02, 0x03, 0x03, 0x02, 0x03, 0x03, 0x02, 0x06, + 0x03, 0x03, 0x03, 0x03, 0x02, 0x03, 0x06, 0x0a, + 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x07, + 0x02, 0x03, 0x03, 0x02, 0x06, 0x07, 0x0a, 0x12, + 0x04, 0x05, 0x05, 0x06, 0x05, 0x05, 0x06, 0x04, + 0x05, 0x05, 0x05, 0x07, 0x06, 0x07, 0x04, 0x0c, + 0x08, 0x09, 0x09, 0x0a, 0x09, 0x09, 0x0a, 0x0c, + 0x10, 0x11, 0x11, 0x12, 0x20, 0x21, 0x40, 0x80, +}; + +/* + * The actual 
fragtbl array. + */ +u_char *fragtbl[MAXFRAG + 1] = { + 0, fragtbl124, fragtbl124, 0, fragtbl124, 0, 0, 0, fragtbl8, +}; diff --git a/src/sys/ufs/ffs/ffs_vfsops.c b/src/sys/ufs/ffs/ffs_vfsops.c new file mode 100644 index 0000000..e3db44c --- /dev/null +++ b/src/sys/ufs/ffs/ffs_vfsops.c @@ -0,0 +1,1562 @@ +/* + * Copyright (c) 1989, 1991, 1993, 1994 + * The Regents of the University of California. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. All advertising materials mentioning features or use of this software + * must display the following acknowledgement: + * This product includes software developed by the University of + * California, Berkeley and its contributors. + * 4. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. 
+ * + * @(#)ffs_vfsops.c 8.31 (Berkeley) 5/20/95 + */ + +#include +__FBSDID("$FreeBSD: src/sys/ufs/ffs/ffs_vfsops.c,v 1.225.2.1 2003/12/12 02:23:22 truckman Exp $"); + +#include "opt_mac.h" +#include "opt_quota.h" +#include "opt_ufs.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include + +#include +#include + +#include +#include +#include + +uma_zone_t uma_inode, uma_ufs1, uma_ufs2; + +static int ffs_sbupdate(struct ufsmount *, int); + int ffs_reload(struct mount *,struct ucred *,struct thread *); +static int ffs_mountfs(struct vnode *, struct mount *, struct thread *); +static void ffs_oldfscompat_read(struct fs *, struct ufsmount *, + ufs2_daddr_t); +static void ffs_oldfscompat_write(struct fs *, struct ufsmount *); +static void ffs_ifree(struct ufsmount *ump, struct inode *ip); +static vfs_init_t ffs_init; +static vfs_uninit_t ffs_uninit; +static vfs_extattrctl_t ffs_extattrctl; + +static struct vfsops ufs_vfsops = { + .vfs_extattrctl = ffs_extattrctl, + .vfs_fhtovp = ffs_fhtovp, + .vfs_init = ffs_init, + .vfs_mount = ffs_mount, + .vfs_quotactl = ufs_quotactl, + .vfs_root = ufs_root, + .vfs_start = ufs_start, + .vfs_statfs = ffs_statfs, + .vfs_sync = ffs_sync, + .vfs_uninit = ffs_uninit, + .vfs_unmount = ffs_unmount, + .vfs_vget = ffs_vget, + .vfs_vptofh = ffs_vptofh, +}; + +VFS_SET(ufs_vfsops, ufs, 0); + +/* + * ffs_mount + * + * Called when mounting local physical media + * + * PARAMETERS: + * mountroot + * mp mount point structure + * path NULL (flag for root mount!!!) + * data + * ndp + * p process (user credentials check [statfs]) + * + * mount + * mp mount point structure + * path path to mount point + * data pointer to argument struct in user space + * ndp mount point namei() return (used for + * credentials on reload), reused to look + * up block device. + * p process (user credentials check) + * + * RETURNS: 0 Success + * !0 error number (errno.h) + * + * LOCK STATE: + * + * ENTRY + * mount point is locked + * EXIT + * mount point is locked + * + * NOTES: + * A NULL path can be used for a flag since the mount + * system call will fail with EFAULT in copyinstr in + * namei() if it is a genuine NULL from the user. + */ +int +ffs_mount(mp, path, data, ndp, td) + struct mount *mp; /* mount struct pointer*/ + char *path; /* path to mount point*/ + caddr_t data; /* arguments to FS specific mount*/ + struct nameidata *ndp; /* mount point credentials*/ + struct thread *td; /* process requesting mount*/ +{ + size_t size; + struct vnode *devvp; + struct ufs_args args; + struct ufsmount *ump = 0; + struct fs *fs; + int error, flags; + mode_t accessmode; + + if (uma_inode == NULL) { + uma_inode = uma_zcreate("FFS inode", + sizeof(struct inode), NULL, NULL, NULL, NULL, + UMA_ALIGN_PTR, 0); + uma_ufs1 = uma_zcreate("FFS1 dinode", + sizeof(struct ufs1_dinode), NULL, NULL, NULL, NULL, + UMA_ALIGN_PTR, 0); + uma_ufs2 = uma_zcreate("FFS2 dinode", + sizeof(struct ufs2_dinode), NULL, NULL, NULL, NULL, + UMA_ALIGN_PTR, 0); + } + /* + * Use NULL path to indicate we are mounting the root filesystem. 
+ */ + if (path == NULL) { + if ((error = bdevvp(rootdev, &rootvp))) { + printf("ffs_mountroot: can't find rootvp\n"); + return (error); + } + + if ((error = ffs_mountfs(rootvp, mp, td)) != 0) + return (error); + (void)VFS_STATFS(mp, &mp->mnt_stat, td); + return (0); + } + + /* + * Mounting non-root filesystem or updating a filesystem + */ + if ((error = copyin(data, (caddr_t)&args, sizeof(struct ufs_args)))!= 0) + return (error); + + /* + * If updating, check whether changing from read-only to + * read/write; if there is no device name, that's all we do. + */ + if (mp->mnt_flag & MNT_UPDATE) { + ump = VFSTOUFS(mp); + fs = ump->um_fs; + devvp = ump->um_devvp; + if (fs->fs_ronly == 0 && (mp->mnt_flag & MNT_RDONLY)) { + if ((error = vn_start_write(NULL, &mp, V_WAIT)) != 0) + return (error); + /* + * Flush any dirty data. + */ + if ((error = VFS_SYNC(mp, MNT_WAIT, + td->td_ucred, td)) != 0) { + vn_finished_write(mp); + return (error); + } + /* + * Check for and optionally get rid of files open + * for writing. + */ + flags = WRITECLOSE; + if (mp->mnt_flag & MNT_FORCE) + flags |= FORCECLOSE; + if (mp->mnt_flag & MNT_SOFTDEP) { + error = softdep_flushfiles(mp, flags, td); + } else { + error = ffs_flushfiles(mp, flags, td); + } + if (error) { + vn_finished_write(mp); + return (error); + } + if (fs->fs_pendingblocks != 0 || + fs->fs_pendinginodes != 0) { + printf("%s: %s: blocks %jd files %d\n", + fs->fs_fsmnt, "update error", + (intmax_t)fs->fs_pendingblocks, + fs->fs_pendinginodes); + fs->fs_pendingblocks = 0; + fs->fs_pendinginodes = 0; + } + fs->fs_ronly = 1; + if ((fs->fs_flags & (FS_UNCLEAN | FS_NEEDSFSCK)) == 0) + fs->fs_clean = 1; + if ((error = ffs_sbupdate(ump, MNT_WAIT)) != 0) { + fs->fs_ronly = 0; + fs->fs_clean = 0; + vn_finished_write(mp); + return (error); + } + vn_finished_write(mp); + } + if ((mp->mnt_flag & MNT_RELOAD) && + (error = ffs_reload(mp, ndp->ni_cnd.cn_cred, td)) != 0) + return (error); + if (fs->fs_ronly && (mp->mnt_kern_flag & MNTK_WANTRDWR)) { + /* + * If upgrade to read-write by non-root, then verify + * that user has necessary permissions on the device. + */ + if (suser(td)) { + vn_lock(devvp, LK_EXCLUSIVE | LK_RETRY, td); + if ((error = VOP_ACCESS(devvp, VREAD | VWRITE, + td->td_ucred, td)) != 0) { + VOP_UNLOCK(devvp, 0, td); + return (error); + } + VOP_UNLOCK(devvp, 0, td); + } + fs->fs_flags &= ~FS_UNCLEAN; + if (fs->fs_clean == 0) { + fs->fs_flags |= FS_UNCLEAN; + if ((mp->mnt_flag & MNT_FORCE) || + ((fs->fs_flags & FS_NEEDSFSCK) == 0 && + (fs->fs_flags & FS_DOSOFTDEP))) { + printf("WARNING: %s was not %s\n", + fs->fs_fsmnt, "properly dismounted"); + } else { + printf( +"WARNING: R/W mount of %s denied. Filesystem is not clean - run fsck\n", + fs->fs_fsmnt); + return (EPERM); + } + } + if ((error = vn_start_write(NULL, &mp, V_WAIT)) != 0) + return (error); + fs->fs_ronly = 0; + fs->fs_clean = 0; + if ((error = ffs_sbupdate(ump, MNT_WAIT)) != 0) { + vn_finished_write(mp); + return (error); + } + /* check to see if we need to start softdep */ + if ((fs->fs_flags & FS_DOSOFTDEP) && + (error = softdep_mount(devvp, mp, fs, td->td_ucred))){ + vn_finished_write(mp); + return (error); + } + if (fs->fs_snapinum[0] != 0) + ffs_snapshot_mount(mp); + vn_finished_write(mp); + } + /* + * Soft updates is incompatible with "async", + * so if we are doing softupdates stop the user + * from setting the async flag in an update. + * Softdep_mount() clears it in an initial mount + * or ro->rw remount. 
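+ * (An async mount allows metadata writes to be issued in arbitrary
+ * order, which would defeat the ordering that soft updates exists
+ * to enforce.)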
+ */ + if (mp->mnt_flag & MNT_SOFTDEP) + mp->mnt_flag &= ~MNT_ASYNC; + /* + * If not updating name, process export requests. + */ + if (args.fspec == 0) + return (vfs_export(mp, &args.export)); + /* + * If this is a snapshot request, take the snapshot. + */ + if (mp->mnt_flag & MNT_SNAPSHOT) + return (ffs_snapshot(mp, args.fspec)); + } + + /* + * Not an update, or updating the name: look up the name + * and verify that it refers to a sensible block device. + */ + NDINIT(ndp, LOOKUP, FOLLOW, UIO_USERSPACE, args.fspec, td); + if ((error = namei(ndp)) != 0) + return (error); + NDFREE(ndp, NDF_ONLY_PNBUF); + devvp = ndp->ni_vp; + if (!vn_isdisk(devvp, &error)) { + vrele(devvp); + return (error); + } + + /* + * If mount by non-root, then verify that user has necessary + * permissions on the device. + */ + if (suser(td)) { + accessmode = VREAD; + if ((mp->mnt_flag & MNT_RDONLY) == 0) + accessmode |= VWRITE; + vn_lock(devvp, LK_EXCLUSIVE | LK_RETRY, td); + if ((error = VOP_ACCESS(devvp, accessmode, td->td_ucred, td))!= 0){ + vput(devvp); + return (error); + } + VOP_UNLOCK(devvp, 0, td); + } + + if (mp->mnt_flag & MNT_UPDATE) { + /* + * Update only + * + * If it's not the same vnode, or at least the same device + * then it's not correct. + */ + + if (devvp != ump->um_devvp && + devvp->v_rdev != ump->um_devvp->v_rdev) + error = EINVAL; /* needs translation */ + vrele(devvp); + if (error) + return (error); + } else { + /* + * New mount + * + * We need the name for the mount point (also used for + * "last mounted on") copied in. If an error occurs, + * the mount point is discarded by the upper level code. + * Note that vfs_mount() populates f_mntonname for us. + */ + if ((error = ffs_mountfs(devvp, mp, td)) != 0) { + vrele(devvp); + return (error); + } + } + /* + * Save "mounted from" device name info for mount point (NULL pad). + */ + copyinstr(args.fspec, mp->mnt_stat.f_mntfromname, MNAMELEN - 1, &size); + bzero( mp->mnt_stat.f_mntfromname + size, MNAMELEN - size); + /* + * Initialize filesystem stat information in mount struct. + */ + (void)VFS_STATFS(mp, &mp->mnt_stat, td); + return (0); +} + +/* + * Reload all incore data for a filesystem (used after running fsck on + * the root filesystem and finding things to fix). The filesystem must + * be mounted read-only. + * + * Things to do to update the mount: + * 1) invalidate all cached meta-data. + * 2) re-read superblock from disk. + * 3) re-read summary information from disk. + * 4) invalidate all inactive vnodes. + * 5) invalidate all cached file data. + * 6) re-read inode data for all active vnodes. + */ +int +ffs_reload(mp, cred, td) + struct mount *mp; + struct ucred *cred; + struct thread *td; +{ + struct vnode *vp, *nvp, *devvp; + struct inode *ip; + void *space; + struct buf *bp; + struct fs *fs, *newfs; + ufs2_daddr_t sblockloc; + int i, blks, size, error; + int32_t *lp; + + if ((mp->mnt_flag & MNT_RDONLY) == 0) + return (EINVAL); + /* + * Step 1: invalidate all cached meta-data. + */ + devvp = VFSTOUFS(mp)->um_devvp; + vn_lock(devvp, LK_EXCLUSIVE | LK_RETRY, td); + error = vinvalbuf(devvp, 0, cred, td, 0, 0); + VOP_UNLOCK(devvp, 0, td); + if (error) + panic("ffs_reload: dirty1"); + + /* + * Only VMIO the backing device if the backing device is a real + * block device. + */ + if (vn_isdisk(devvp, NULL)) { + vn_lock(devvp, LK_EXCLUSIVE | LK_RETRY, td); + vfs_object_create(devvp, td, td->td_ucred); + VOP_UNLOCK(devvp, 0, td); + } + + /* + * Step 2: re-read superblock from disk. 
+ */ + fs = VFSTOUFS(mp)->um_fs; + if ((error = bread(devvp, btodb(fs->fs_sblockloc), fs->fs_sbsize, + NOCRED, &bp)) != 0) + return (error); + newfs = (struct fs *)bp->b_data; + if ((newfs->fs_magic != FS_UFS1_MAGIC && + newfs->fs_magic != FS_UFS2_MAGIC) || + newfs->fs_bsize > MAXBSIZE || + newfs->fs_bsize < sizeof(struct fs)) { + brelse(bp); + return (EIO); /* XXX needs translation */ + } + /* + * Copy pointer fields back into superblock before copying in XXX + * new superblock. These should really be in the ufsmount. XXX + * Note that important parameters (eg fs_ncg) are unchanged. + */ + newfs->fs_csp = fs->fs_csp; + newfs->fs_maxcluster = fs->fs_maxcluster; + newfs->fs_contigdirs = fs->fs_contigdirs; + newfs->fs_active = fs->fs_active; + /* The file system is still read-only. */ + newfs->fs_ronly = 1; + sblockloc = fs->fs_sblockloc; + bcopy(newfs, fs, (u_int)fs->fs_sbsize); + brelse(bp); + mp->mnt_maxsymlinklen = fs->fs_maxsymlinklen; + ffs_oldfscompat_read(fs, VFSTOUFS(mp), sblockloc); + if (fs->fs_pendingblocks != 0 || fs->fs_pendinginodes != 0) { + printf("%s: reload pending error: blocks %jd files %d\n", + fs->fs_fsmnt, (intmax_t)fs->fs_pendingblocks, + fs->fs_pendinginodes); + fs->fs_pendingblocks = 0; + fs->fs_pendinginodes = 0; + } + + /* + * Step 3: re-read summary information from disk. + */ + blks = howmany(fs->fs_cssize, fs->fs_fsize); + space = fs->fs_csp; + for (i = 0; i < blks; i += fs->fs_frag) { + size = fs->fs_bsize; + if (i + fs->fs_frag > blks) + size = (blks - i) * fs->fs_fsize; + error = bread(devvp, fsbtodb(fs, fs->fs_csaddr + i), size, + NOCRED, &bp); + if (error) + return (error); + bcopy(bp->b_data, space, (u_int)size); + space = (char *)space + size; + brelse(bp); + } + /* + * We no longer know anything about clusters per cylinder group. + */ + if (fs->fs_contigsumsize > 0) { + lp = fs->fs_maxcluster; + for (i = 0; i < fs->fs_ncg; i++) + *lp++ = fs->fs_contigsumsize; + } + +loop: + MNT_ILOCK(mp); + for (vp = TAILQ_FIRST(&mp->mnt_nvnodelist); vp != NULL; vp = nvp) { + if (vp->v_mount != mp) { + MNT_IUNLOCK(mp); + goto loop; + } + nvp = TAILQ_NEXT(vp, v_nmntvnodes); + VI_LOCK(vp); + if (vp->v_iflag & VI_XLOCK) { + VI_UNLOCK(vp); + continue; + } + MNT_IUNLOCK(mp); + /* + * Step 4: invalidate all inactive vnodes. + */ + if (vp->v_usecount == 0) { + vgonel(vp, td); + goto loop; + } + /* + * Step 5: invalidate all cached file data. + */ + if (vget(vp, LK_EXCLUSIVE | LK_INTERLOCK, td)) { + goto loop; + } + if (vinvalbuf(vp, 0, cred, td, 0, 0)) + panic("ffs_reload: dirty2"); + /* + * Step 6: re-read inode data for all active vnodes. + */ + ip = VTOI(vp); + error = + bread(devvp, fsbtodb(fs, ino_to_fsba(fs, ip->i_number)), + (int)fs->fs_bsize, NOCRED, &bp); + if (error) { + VOP_UNLOCK(vp, 0, td); + vrele(vp); + return (error); + } + ffs_load_inode(bp, ip, fs, ip->i_number); + ip->i_effnlink = ip->i_nlink; + brelse(bp); + VOP_UNLOCK(vp, 0, td); + vrele(vp); + MNT_ILOCK(mp); + } + MNT_IUNLOCK(mp); + return (0); +} + +/* + * Possible superblock locations ordered from most to least likely. + */ +static int sblock_try[] = SBLOCKSEARCH; + +/* + * Common code for mount and mountroot + */ +static int +ffs_mountfs(devvp, mp, td) + struct vnode *devvp; + struct mount *mp; + struct thread *td; +{ + struct ufsmount *ump; + struct buf *bp; + struct fs *fs; + dev_t dev; + void *space; + ufs2_daddr_t sblockloc; + int error, i, blks, size, ronly; + int32_t *lp; + struct ucred *cred; + size_t strsize; + int ncount; + + dev = devvp->v_rdev; + cred = td ? 
td->td_ucred : NOCRED; + /* + * Disallow multiple mounts of the same device. + * Disallow mounting of a device that is currently in use + * (except for root, which might share swap device for miniroot). + * Flush out any old buffers remaining from a previous use. + */ + error = vfs_mountedon(devvp); + if (error) + return (error); + ncount = vcount(devvp); + + if (ncount > 1 && devvp != rootvp) + return (EBUSY); + vn_lock(devvp, LK_EXCLUSIVE | LK_RETRY, td); + error = vinvalbuf(devvp, V_SAVE, cred, td, 0, 0); + VOP_UNLOCK(devvp, 0, td); + if (error) + return (error); + + /* + * Only VMIO the backing device if the backing device is a real + * block device. + * Note that it is optional that the backing device be VMIOed. This + * increases the opportunity for metadata caching. + */ + if (vn_isdisk(devvp, NULL)) { + vn_lock(devvp, LK_EXCLUSIVE | LK_RETRY, td); + vfs_object_create(devvp, td, cred); + VOP_UNLOCK(devvp, 0, td); + } + + ronly = (mp->mnt_flag & MNT_RDONLY) != 0; + vn_lock(devvp, LK_EXCLUSIVE | LK_RETRY, td); + /* + * XXX: We don't re-VOP_OPEN in FREAD|FWRITE mode if the filesystem + * XXX: is subsequently remounted, so open it FREAD|FWRITE from the + * XXX: start to avoid getting trashed later on. + */ +#ifdef notyet + error = VOP_OPEN(devvp, ronly ? FREAD : FREAD|FWRITE, FSCRED, td, -1); +#else + error = VOP_OPEN(devvp, FREAD|FWRITE, FSCRED, td, -1); +#endif + VOP_UNLOCK(devvp, 0, td); + if (error) + return (error); + if (devvp->v_rdev->si_iosize_max != 0) + mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max; + if (mp->mnt_iosize_max > MAXPHYS) + mp->mnt_iosize_max = MAXPHYS; + + bp = NULL; + ump = NULL; + fs = NULL; + sblockloc = 0; + /* + * Try reading the superblock in each of its possible locations. + */ + for (i = 0; sblock_try[i] != -1; i++) { + if ((error = bread(devvp, sblock_try[i] / DEV_BSIZE, SBLOCKSIZE, + cred, &bp)) != 0) + goto out; + fs = (struct fs *)bp->b_data; + sblockloc = sblock_try[i]; + if ((fs->fs_magic == FS_UFS1_MAGIC || + (fs->fs_magic == FS_UFS2_MAGIC && + (fs->fs_sblockloc == sblockloc || + (fs->fs_old_flags & FS_FLAGS_UPDATED) == 0))) && + fs->fs_bsize <= MAXBSIZE && + fs->fs_bsize >= sizeof(struct fs)) + break; + brelse(bp); + bp = NULL; + } + if (sblock_try[i] == -1) { + error = EINVAL; /* XXX needs translation */ + goto out; + } + fs->fs_fmod = 0; + fs->fs_flags &= ~FS_INDEXDIRS; /* no support for directory indicies */ + fs->fs_flags &= ~FS_UNCLEAN; + if (fs->fs_clean == 0) { + fs->fs_flags |= FS_UNCLEAN; + if (ronly || (mp->mnt_flag & MNT_FORCE) || + ((fs->fs_flags & FS_NEEDSFSCK) == 0 && + (fs->fs_flags & FS_DOSOFTDEP))) { + printf( +"WARNING: %s was not properly dismounted\n", + fs->fs_fsmnt); + } else { + printf( +"WARNING: R/W mount of %s denied. 
Filesystem is not clean - run fsck\n", + fs->fs_fsmnt); + error = EPERM; + goto out; + } + if ((fs->fs_pendingblocks != 0 || fs->fs_pendinginodes != 0) && + (mp->mnt_flag & MNT_FORCE)) { + printf("%s: lost blocks %jd files %d\n", fs->fs_fsmnt, + (intmax_t)fs->fs_pendingblocks, + fs->fs_pendinginodes); + fs->fs_pendingblocks = 0; + fs->fs_pendinginodes = 0; + } + } + if (fs->fs_pendingblocks != 0 || fs->fs_pendinginodes != 0) { + printf("%s: mount pending error: blocks %jd files %d\n", + fs->fs_fsmnt, (intmax_t)fs->fs_pendingblocks, + fs->fs_pendinginodes); + fs->fs_pendingblocks = 0; + fs->fs_pendinginodes = 0; + } + ump = malloc(sizeof *ump, M_UFSMNT, M_WAITOK | M_ZERO); + ump->um_fs = malloc((u_long)fs->fs_sbsize, M_UFSMNT, + M_WAITOK); + if (fs->fs_magic == FS_UFS1_MAGIC) { + ump->um_fstype = UFS1; + ump->um_balloc = ffs_balloc_ufs1; + } else { + ump->um_fstype = UFS2; + ump->um_balloc = ffs_balloc_ufs2; + } + ump->um_blkatoff = ffs_blkatoff; + ump->um_truncate = ffs_truncate; + ump->um_update = ffs_update; + ump->um_valloc = ffs_valloc; + ump->um_vfree = ffs_vfree; + ump->um_ifree = ffs_ifree; + bcopy(bp->b_data, ump->um_fs, (u_int)fs->fs_sbsize); + if (fs->fs_sbsize < SBLOCKSIZE) + bp->b_flags |= B_INVAL | B_NOCACHE; + brelse(bp); + bp = NULL; + fs = ump->um_fs; + ffs_oldfscompat_read(fs, ump, sblockloc); + fs->fs_ronly = ronly; + size = fs->fs_cssize; + blks = howmany(size, fs->fs_fsize); + if (fs->fs_contigsumsize > 0) + size += fs->fs_ncg * sizeof(int32_t); + size += fs->fs_ncg * sizeof(u_int8_t); + space = malloc((u_long)size, M_UFSMNT, M_WAITOK); + fs->fs_csp = space; + for (i = 0; i < blks; i += fs->fs_frag) { + size = fs->fs_bsize; + if (i + fs->fs_frag > blks) + size = (blks - i) * fs->fs_fsize; + if ((error = bread(devvp, fsbtodb(fs, fs->fs_csaddr + i), size, + cred, &bp)) != 0) { + free(fs->fs_csp, M_UFSMNT); + goto out; + } + bcopy(bp->b_data, space, (u_int)size); + space = (char *)space + size; + brelse(bp); + bp = NULL; + } + if (fs->fs_contigsumsize > 0) { + fs->fs_maxcluster = lp = space; + for (i = 0; i < fs->fs_ncg; i++) + *lp++ = fs->fs_contigsumsize; + space = lp; + } + size = fs->fs_ncg * sizeof(u_int8_t); + fs->fs_contigdirs = (u_int8_t *)space; + bzero(fs->fs_contigdirs, size); + fs->fs_active = NULL; + mp->mnt_data = (qaddr_t)ump; + mp->mnt_stat.f_fsid.val[0] = fs->fs_id[0]; + mp->mnt_stat.f_fsid.val[1] = fs->fs_id[1]; + if (fs->fs_id[0] == 0 || fs->fs_id[1] == 0 || + vfs_getvfs(&mp->mnt_stat.f_fsid)) + vfs_getnewfsid(mp); + mp->mnt_maxsymlinklen = fs->fs_maxsymlinklen; + mp->mnt_flag |= MNT_LOCAL; + if ((fs->fs_flags & FS_MULTILABEL) != 0) +#ifdef MAC + mp->mnt_flag |= MNT_MULTILABEL; +#else + printf( +"WARNING: %s: multilabel flag on fs but no MAC support\n", + fs->fs_fsmnt); +#endif + if ((fs->fs_flags & FS_ACLS) != 0) +#ifdef UFS_ACL + mp->mnt_flag |= MNT_ACLS; +#else + printf( +"WARNING: %s: ACLs flag on fs but no ACLs support\n", + fs->fs_fsmnt); +#endif + ump->um_mountp = mp; + ump->um_dev = dev; + ump->um_devvp = devvp; + ump->um_nindir = fs->fs_nindir; + ump->um_bptrtodb = fs->fs_fsbtodb; + ump->um_seqinc = fs->fs_frag; + for (i = 0; i < MAXQUOTAS; i++) + ump->um_quotas[i] = NULLVP; +#ifdef UFS_EXTATTR + ufs_extattr_uepm_init(&ump->um_extattr); +#endif + devvp->v_rdev->si_mountpoint = mp; + + /* + * Set FS local "last mounted on" information (NULL pad) + */ + copystr( mp->mnt_stat.f_mntonname, /* mount point*/ + fs->fs_fsmnt, /* copy area*/ + sizeof(fs->fs_fsmnt) - 1, /* max size*/ + &strsize); /* real size*/ + bzero( fs->fs_fsmnt + strsize, 
sizeof(fs->fs_fsmnt) - strsize); + + if( mp->mnt_flag & MNT_ROOTFS) { + /* + * Root mount; update timestamp in mount structure. + * this will be used by the common root mount code + * to update the system clock. + */ + mp->mnt_time = fs->fs_time; + } + + if (ronly == 0) { + if ((fs->fs_flags & FS_DOSOFTDEP) && + (error = softdep_mount(devvp, mp, fs, cred)) != 0) { + free(fs->fs_csp, M_UFSMNT); + goto out; + } + if (fs->fs_snapinum[0] != 0) + ffs_snapshot_mount(mp); + fs->fs_fmod = 1; + fs->fs_clean = 0; + (void) ffs_sbupdate(ump, MNT_WAIT); + } +#ifdef UFS_EXTATTR +#ifdef UFS_EXTATTR_AUTOSTART + /* + * + * Auto-starting does the following: + * - check for /.attribute in the fs, and extattr_start if so + * - for each file in .attribute, enable that file with + * an attribute of the same name. + * Not clear how to report errors -- probably eat them. + * This would all happen while the filesystem was busy/not + * available, so would effectively be "atomic". + */ + (void) ufs_extattr_autostart(mp, td); +#endif /* !UFS_EXTATTR_AUTOSTART */ +#endif /* !UFS_EXTATTR */ + return (0); +out: + devvp->v_rdev->si_mountpoint = NULL; + if (bp) + brelse(bp); + /* XXX: see comment above VOP_OPEN */ +#ifdef notyet + (void)VOP_CLOSE(devvp, ronly ? FREAD : FREAD|FWRITE, cred, td); +#else + (void)VOP_CLOSE(devvp, FREAD|FWRITE, cred, td); +#endif + if (ump) { + free(ump->um_fs, M_UFSMNT); + free(ump, M_UFSMNT); + mp->mnt_data = (qaddr_t)0; + } + return (error); +} + +#include +int bigcgs = 0; +SYSCTL_INT(_debug, OID_AUTO, bigcgs, CTLFLAG_RW, &bigcgs, 0, ""); + +/* + * Sanity checks for loading old filesystem superblocks. + * See ffs_oldfscompat_write below for unwound actions. + * + * XXX - Parts get retired eventually. + * Unfortunately new bits get added. + */ +static void +ffs_oldfscompat_read(fs, ump, sblockloc) + struct fs *fs; + struct ufsmount *ump; + ufs2_daddr_t sblockloc; +{ + off_t maxfilesize; + + /* + * If not yet done, update fs_flags location and value of fs_sblockloc. + */ + if ((fs->fs_old_flags & FS_FLAGS_UPDATED) == 0) { + fs->fs_flags = fs->fs_old_flags; + fs->fs_old_flags |= FS_FLAGS_UPDATED; + fs->fs_sblockloc = sblockloc; + } + /* + * If not yet done, update UFS1 superblock with new wider fields. + */ + if (fs->fs_magic == FS_UFS1_MAGIC && fs->fs_maxbsize != fs->fs_bsize) { + fs->fs_maxbsize = fs->fs_bsize; + fs->fs_time = fs->fs_old_time; + fs->fs_size = fs->fs_old_size; + fs->fs_dsize = fs->fs_old_dsize; + fs->fs_csaddr = fs->fs_old_csaddr; + fs->fs_cstotal.cs_ndir = fs->fs_old_cstotal.cs_ndir; + fs->fs_cstotal.cs_nbfree = fs->fs_old_cstotal.cs_nbfree; + fs->fs_cstotal.cs_nifree = fs->fs_old_cstotal.cs_nifree; + fs->fs_cstotal.cs_nffree = fs->fs_old_cstotal.cs_nffree; + } + if (fs->fs_magic == FS_UFS1_MAGIC && + fs->fs_old_inodefmt < FS_44INODEFMT) { + fs->fs_maxfilesize = (u_quad_t) 1LL << 39; + fs->fs_qbmask = ~fs->fs_bmask; + fs->fs_qfmask = ~fs->fs_fmask; + } + if (fs->fs_magic == FS_UFS1_MAGIC) { + ump->um_savedmaxfilesize = fs->fs_maxfilesize; + maxfilesize = (u_int64_t)0x40000000 * fs->fs_bsize - 1; + if (fs->fs_maxfilesize > maxfilesize) + fs->fs_maxfilesize = maxfilesize; + } + /* Compatibility for old filesystems */ + if (fs->fs_avgfilesize <= 0) + fs->fs_avgfilesize = AVFILESIZ; + if (fs->fs_avgfpdir <= 0) + fs->fs_avgfpdir = AFPDIR; + if (bigcgs) { + fs->fs_save_cgsize = fs->fs_cgsize; + fs->fs_cgsize = fs->fs_bsize; + } +} + +/* + * Unwinding superblock updates for old filesystems. + * See ffs_oldfscompat_read above for details. + * + * XXX - Parts get retired eventually. 
+ * Unfortunately new bits get added. + */ +static void +ffs_oldfscompat_write(fs, ump) + struct fs *fs; + struct ufsmount *ump; +{ + + /* + * Copy back UFS2 updated fields that UFS1 inspects. + */ + if (fs->fs_magic == FS_UFS1_MAGIC) { + fs->fs_old_time = fs->fs_time; + fs->fs_old_cstotal.cs_ndir = fs->fs_cstotal.cs_ndir; + fs->fs_old_cstotal.cs_nbfree = fs->fs_cstotal.cs_nbfree; + fs->fs_old_cstotal.cs_nifree = fs->fs_cstotal.cs_nifree; + fs->fs_old_cstotal.cs_nffree = fs->fs_cstotal.cs_nffree; + fs->fs_maxfilesize = ump->um_savedmaxfilesize; + } + if (bigcgs) { + fs->fs_cgsize = fs->fs_save_cgsize; + fs->fs_save_cgsize = 0; + } +} + +/* + * unmount system call + */ +int +ffs_unmount(mp, mntflags, td) + struct mount *mp; + int mntflags; + struct thread *td; +{ + struct ufsmount *ump = VFSTOUFS(mp); + struct fs *fs; + int error, flags; + + flags = 0; + if (mntflags & MNT_FORCE) { + flags |= FORCECLOSE; + } +#ifdef UFS_EXTATTR + if ((error = ufs_extattr_stop(mp, td))) { + if (error != EOPNOTSUPP) + printf("ffs_unmount: ufs_extattr_stop returned %d\n", + error); + } else { + ufs_extattr_uepm_destroy(&ump->um_extattr); + } +#endif + if (mp->mnt_flag & MNT_SOFTDEP) { + if ((error = softdep_flushfiles(mp, flags, td)) != 0) + return (error); + } else { + if ((error = ffs_flushfiles(mp, flags, td)) != 0) + return (error); + } + fs = ump->um_fs; + if (fs->fs_pendingblocks != 0 || fs->fs_pendinginodes != 0) { + printf("%s: unmount pending error: blocks %jd files %d\n", + fs->fs_fsmnt, (intmax_t)fs->fs_pendingblocks, + fs->fs_pendinginodes); + fs->fs_pendingblocks = 0; + fs->fs_pendinginodes = 0; + } + if (fs->fs_ronly == 0) { + fs->fs_clean = fs->fs_flags & (FS_UNCLEAN|FS_NEEDSFSCK) ? 0 : 1; + error = ffs_sbupdate(ump, MNT_WAIT); + if (error) { + fs->fs_clean = 0; + return (error); + } + } + ump->um_devvp->v_rdev->si_mountpoint = NULL; + + vinvalbuf(ump->um_devvp, V_SAVE, NOCRED, td, 0, 0); + /* XXX: see comment above VOP_OPEN */ +#ifdef notyet + error = VOP_CLOSE(ump->um_devvp, fs->fs_ronly ? FREAD : FREAD|FWRITE, + NOCRED, td); +#else + error = VOP_CLOSE(ump->um_devvp, FREAD|FWRITE, NOCRED, td); +#endif + + vrele(ump->um_devvp); + + free(fs->fs_csp, M_UFSMNT); + free(fs, M_UFSMNT); + free(ump, M_UFSMNT); + mp->mnt_data = (qaddr_t)0; + mp->mnt_flag &= ~MNT_LOCAL; + return (error); +} + +/* + * Flush out all the files in a filesystem. + */ +int +ffs_flushfiles(mp, flags, td) + struct mount *mp; + int flags; + struct thread *td; +{ + struct ufsmount *ump; + int error; + + ump = VFSTOUFS(mp); +#ifdef QUOTA + if (mp->mnt_flag & MNT_QUOTA) { + int i; + error = vflush(mp, 0, SKIPSYSTEM|flags); + if (error) + return (error); + for (i = 0; i < MAXQUOTAS; i++) { + if (ump->um_quotas[i] == NULLVP) + continue; + quotaoff(td, mp, i); + } + /* + * Here we fall through to vflush again to ensure + * that we have gotten rid of all the system vnodes. + */ + } +#endif + ASSERT_VOP_LOCKED(ump->um_devvp, "ffs_flushfiles"); + if (ump->um_devvp->v_vflag & VV_COPYONWRITE) { + if ((error = vflush(mp, 0, SKIPSYSTEM | flags)) != 0) + return (error); + ffs_snapshot_unmount(mp); + /* + * Here we fall through to vflush again to ensure + * that we have gotten rid of all the system vnodes. + */ + } + /* + * Flush all the files. + */ + if ((error = vflush(mp, 0, flags)) != 0) + return (error); + /* + * Flush filesystem metadata. 
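+ *
+ * This is done by fsync'ing the device vnode so that any remaining
+ * dirty filesystem buffers are written out.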
+ */ + vn_lock(ump->um_devvp, LK_EXCLUSIVE | LK_RETRY, td); + error = VOP_FSYNC(ump->um_devvp, td->td_ucred, MNT_WAIT, td); + VOP_UNLOCK(ump->um_devvp, 0, td); + return (error); +} + +/* + * Get filesystem statistics. + */ +int +ffs_statfs(mp, sbp, td) + struct mount *mp; + struct statfs *sbp; + struct thread *td; +{ + struct ufsmount *ump; + struct fs *fs; + + ump = VFSTOUFS(mp); + fs = ump->um_fs; + if (fs->fs_magic != FS_UFS1_MAGIC && fs->fs_magic != FS_UFS2_MAGIC) + panic("ffs_statfs"); + sbp->f_version = STATFS_VERSION; + sbp->f_bsize = fs->fs_fsize; + sbp->f_iosize = fs->fs_bsize; + sbp->f_blocks = fs->fs_dsize; + sbp->f_bfree = fs->fs_cstotal.cs_nbfree * fs->fs_frag + + fs->fs_cstotal.cs_nffree + dbtofsb(fs, fs->fs_pendingblocks); + sbp->f_bavail = freespace(fs, fs->fs_minfree) + + dbtofsb(fs, fs->fs_pendingblocks); + sbp->f_files = fs->fs_ncg * fs->fs_ipg - ROOTINO; + sbp->f_ffree = fs->fs_cstotal.cs_nifree + fs->fs_pendinginodes; + sbp->f_namemax = NAME_MAX; + if (sbp != &mp->mnt_stat) { + sbp->f_flags = mp->mnt_flag & MNT_VISFLAGMASK; + sbp->f_type = mp->mnt_vfc->vfc_typenum; + sbp->f_syncwrites = mp->mnt_stat.f_syncwrites; + sbp->f_asyncwrites = mp->mnt_stat.f_asyncwrites; + sbp->f_syncreads = mp->mnt_stat.f_syncreads; + sbp->f_asyncreads = mp->mnt_stat.f_asyncreads; + sbp->f_owner = mp->mnt_stat.f_owner; + sbp->f_fsid = mp->mnt_stat.f_fsid; + bcopy((caddr_t)mp->mnt_stat.f_fstypename, + (caddr_t)&sbp->f_fstypename[0], MFSNAMELEN); + bcopy((caddr_t)mp->mnt_stat.f_mntonname, + (caddr_t)&sbp->f_mntonname[0], MNAMELEN); + bcopy((caddr_t)mp->mnt_stat.f_mntfromname, + (caddr_t)&sbp->f_mntfromname[0], MNAMELEN); + } + return (0); +} + +/* + * Go through the disk queues to initiate sandbagged IO; + * go through the inodes to write those that have been modified; + * initiate the writing of the super block if it has been modified. + * + * Note: we are always called with the filesystem marked `MPBUSY'. + */ +int +ffs_sync(mp, waitfor, cred, td) + struct mount *mp; + int waitfor; + struct ucred *cred; + struct thread *td; +{ + struct vnode *nvp, *vp, *devvp; + struct inode *ip; + struct ufsmount *ump = VFSTOUFS(mp); + struct fs *fs; + int error, count, wait, lockreq, allerror = 0; + + fs = ump->um_fs; + if (fs->fs_fmod != 0 && fs->fs_ronly != 0) { /* XXX */ + printf("fs = %s\n", fs->fs_fsmnt); + panic("ffs_sync: rofs mod"); + } + /* + * Write back each (modified) inode. + */ + wait = 0; + lockreq = LK_EXCLUSIVE | LK_NOWAIT; + if (waitfor == MNT_WAIT) { + wait = 1; + lockreq = LK_EXCLUSIVE; + } + lockreq |= LK_INTERLOCK; + MNT_ILOCK(mp); +loop: + for (vp = TAILQ_FIRST(&mp->mnt_nvnodelist); vp != NULL; vp = nvp) { + /* + * If the vnode that we are about to sync is no longer + * associated with this mount point, start over. + */ + if (vp->v_mount != mp) + goto loop; + + /* + * Depend on the mntvnode_slock to keep things stable enough + * for a quick test. Since there might be hundreds of + * thousands of vnodes, we cannot afford even a subroutine + * call unless there's a good chance that we have work to do. 
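+ *
+ * In this code the per-mount interlock, taken with MNT_ILOCK() above,
+ * is what provides that stability.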
+ */ + nvp = TAILQ_NEXT(vp, v_nmntvnodes); + VI_LOCK(vp); + if (vp->v_iflag & VI_XLOCK) { + VI_UNLOCK(vp); + continue; + } + ip = VTOI(vp); + if (vp->v_type == VNON || ((ip->i_flag & + (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 && + TAILQ_EMPTY(&vp->v_dirtyblkhd))) { + VI_UNLOCK(vp); + continue; + } + MNT_IUNLOCK(mp); + if ((error = vget(vp, lockreq, td)) != 0) { + MNT_ILOCK(mp); + if (error == ENOENT) + goto loop; + continue; + } + if ((error = VOP_FSYNC(vp, cred, waitfor, td)) != 0) + allerror = error; + VOP_UNLOCK(vp, 0, td); + vrele(vp); + MNT_ILOCK(mp); + if (TAILQ_NEXT(vp, v_nmntvnodes) != nvp) + goto loop; + } + MNT_IUNLOCK(mp); + /* + * Force stale filesystem control information to be flushed. + */ + if (waitfor == MNT_WAIT) { + if ((error = softdep_flushworklist(ump->um_mountp, &count, td))) + allerror = error; + /* Flushed work items may create new vnodes to clean */ + if (allerror == 0 && count) { + MNT_ILOCK(mp); + goto loop; + } + } +#ifdef QUOTA + qsync(mp); +#endif + devvp = ump->um_devvp; + VI_LOCK(devvp); + if (waitfor != MNT_LAZY && + (devvp->v_numoutput > 0 || TAILQ_FIRST(&devvp->v_dirtyblkhd))) { + vn_lock(devvp, LK_EXCLUSIVE | LK_RETRY | LK_INTERLOCK, td); + if ((error = VOP_FSYNC(devvp, cred, waitfor, td)) != 0) + allerror = error; + VOP_UNLOCK(devvp, 0, td); + if (allerror == 0 && waitfor == MNT_WAIT) { + MNT_ILOCK(mp); + goto loop; + } + } else + VI_UNLOCK(devvp); + /* + * Write back modified superblock. + */ + if (fs->fs_fmod != 0 && (error = ffs_sbupdate(ump, waitfor)) != 0) + allerror = error; + return (allerror); +} + +int +ffs_vget(mp, ino, flags, vpp) + struct mount *mp; + ino_t ino; + int flags; + struct vnode **vpp; +{ + struct thread *td = curthread; /* XXX */ + struct fs *fs; + struct inode *ip; + struct ufsmount *ump; + struct buf *bp; + struct vnode *vp; + dev_t dev; + int error; + + ump = VFSTOUFS(mp); + dev = ump->um_dev; + + /* + * We do not lock vnode creation as it is believed to be too + * expensive for such rare case as simultaneous creation of vnode + * for same ino by different processes. We just allow them to race + * and check later to decide who wins. Let the race begin! + */ + if ((error = ufs_ihashget(dev, ino, flags, vpp)) != 0) + return (error); + if (*vpp != NULL) + return (0); + + /* + * If this MALLOC() is performed after the getnewvnode() + * it might block, leaving a vnode with a NULL v_data to be + * found by ffs_sync() if a sync happens to fire right then, + * which will cause a panic because ffs_sync() blindly + * dereferences vp->v_data (as well it should). + */ + ip = uma_zalloc(uma_inode, M_WAITOK); + + /* Allocate a new vnode/inode. */ + error = getnewvnode("ufs", mp, ffs_vnodeop_p, &vp); + if (error) { + *vpp = NULL; + uma_zfree(uma_inode, ip); + return (error); + } + bzero((caddr_t)ip, sizeof(struct inode)); + /* + * FFS supports recursive locking. + */ + vp->v_vnlock->lk_flags |= LK_CANRECURSE; + vp->v_data = ip; + ip->i_vnode = vp; + ip->i_ump = ump; + ip->i_fs = fs = ump->um_fs; + ip->i_dev = dev; + ip->i_number = ino; +#ifdef QUOTA + { + int i; + for (i = 0; i < MAXQUOTAS; i++) + ip->i_dquot[i] = NODQUOT; + } +#endif + /* + * Exclusively lock the vnode before adding to hash. Note, that we + * must not release nor downgrade the lock (despite flags argument + * says) till it is fully initialized. + */ + lockmgr(vp->v_vnlock, LK_EXCLUSIVE, (struct mtx *)0, td); + + /* + * Atomicaly (in terms of ufs_hash operations) check the hash for + * duplicate of vnode being created and add it to the hash. 
If a + * duplicate vnode was found, it will be vget()ed from hash for us. + */ + if ((error = ufs_ihashins(ip, flags, vpp)) != 0) { + vput(vp); + *vpp = NULL; + return (error); + } + + /* We lost the race, then throw away our vnode and return existing */ + if (*vpp != NULL) { + vput(vp); + return (0); + } + + /* Read in the disk contents for the inode, copy into the inode. */ + error = bread(ump->um_devvp, fsbtodb(fs, ino_to_fsba(fs, ino)), + (int)fs->fs_bsize, NOCRED, &bp); + if (error) { + /* + * The inode does not contain anything useful, so it would + * be misleading to leave it on its hash chain. With mode + * still zero, it will be unlinked and returned to the free + * list by vput(). + */ + brelse(bp); + vput(vp); + *vpp = NULL; + return (error); + } + if (ip->i_ump->um_fstype == UFS1) + ip->i_din1 = uma_zalloc(uma_ufs1, M_WAITOK); + else + ip->i_din2 = uma_zalloc(uma_ufs2, M_WAITOK); + ffs_load_inode(bp, ip, fs, ino); + if (DOINGSOFTDEP(vp)) + softdep_load_inodeblock(ip); + else + ip->i_effnlink = ip->i_nlink; + bqrelse(bp); + + /* + * Initialize the vnode from the inode, check for aliases. + * Note that the underlying vnode may have changed. + */ + error = ufs_vinit(mp, ffs_specop_p, ffs_fifoop_p, &vp); + if (error) { + vput(vp); + *vpp = NULL; + return (error); + } + /* + * Finish inode initialization. + */ + VREF(ip->i_devvp); + /* + * Set up a generation number for this inode if it does not + * already have one. This should only happen on old filesystems. + */ + if (ip->i_gen == 0) { + ip->i_gen = arc4random() / 2 + 1; + if ((vp->v_mount->mnt_flag & MNT_RDONLY) == 0) { + ip->i_flag |= IN_MODIFIED; + DIP(ip, i_gen) = ip->i_gen; + } + } + /* + * Ensure that uid and gid are correct. This is a temporary + * fix until fsck has been changed to do the update. + */ + if (fs->fs_magic == FS_UFS1_MAGIC && /* XXX */ + fs->fs_old_inodefmt < FS_44INODEFMT) { /* XXX */ + ip->i_uid = ip->i_din1->di_ouid; /* XXX */ + ip->i_gid = ip->i_din1->di_ogid; /* XXX */ + } /* XXX */ + +#ifdef MAC + if ((mp->mnt_flag & MNT_MULTILABEL) && ip->i_mode) { + /* + * If this vnode is already allocated, and we're running + * multi-label, attempt to perform a label association + * from the extended attributes on the inode. + */ + error = mac_associate_vnode_extattr(mp, vp); + if (error) { + /* ufs_inactive will release ip->i_devvp ref. */ + vput(vp); + *vpp = NULL; + return (error); + } + } +#endif + + *vpp = vp; + return (0); +} + +/* + * File handle to vnode + * + * Have to be really careful about stale file handles: + * - check that the inode number is valid + * - call ffs_vget() to get the locked inode + * - check for an unallocated inode (i_mode == 0) + * - check that the given client host has export rights and return + * those rights via. exflagsp and credanonp + */ +int +ffs_fhtovp(mp, fhp, vpp) + struct mount *mp; + struct fid *fhp; + struct vnode **vpp; +{ + struct ufid *ufhp; + struct fs *fs; + + ufhp = (struct ufid *)fhp; + fs = VFSTOUFS(mp)->um_fs; + if (ufhp->ufid_ino < ROOTINO || + ufhp->ufid_ino >= fs->fs_ncg * fs->fs_ipg) + return (ESTALE); + return (ufs_fhtovp(mp, ufhp, vpp)); +} + +/* + * Vnode pointer to File handle + */ +/* ARGSUSED */ +int +ffs_vptofh(vp, fhp) + struct vnode *vp; + struct fid *fhp; +{ + struct inode *ip; + struct ufid *ufhp; + + ip = VTOI(vp); + ufhp = (struct ufid *)fhp; + ufhp->ufid_len = sizeof(struct ufid); + ufhp->ufid_ino = ip->i_number; + ufhp->ufid_gen = ip->i_gen; + return (0); +} + +/* + * Initialize the filesystem. 
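+ *
+ * Soft updates state is set up first, then control passes to the
+ * generic ufs_init().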
+ */ +static int +ffs_init(vfsp) + struct vfsconf *vfsp; +{ + + softdep_initialize(); + return (ufs_init(vfsp)); +} + +/* + * Undo the work of ffs_init(). + */ +static int +ffs_uninit(vfsp) + struct vfsconf *vfsp; +{ + int ret; + + ret = ufs_uninit(vfsp); + softdep_uninitialize(); + return (ret); +} + +/* + * Write a superblock and associated information back to disk. + */ +static int +ffs_sbupdate(mp, waitfor) + struct ufsmount *mp; + int waitfor; +{ + struct fs *fs = mp->um_fs; + struct buf *bp; + int blks; + void *space; + int i, size, error, allerror = 0; + + if (fs->fs_ronly == 1 && + (mp->um_mountp->mnt_flag & (MNT_RDONLY | MNT_UPDATE)) != + (MNT_RDONLY | MNT_UPDATE)) + panic("ffs_sbupdate: write read-only filesystem"); + /* + * First write back the summary information. + */ + blks = howmany(fs->fs_cssize, fs->fs_fsize); + space = fs->fs_csp; + for (i = 0; i < blks; i += fs->fs_frag) { + size = fs->fs_bsize; + if (i + fs->fs_frag > blks) + size = (blks - i) * fs->fs_fsize; + bp = getblk(mp->um_devvp, fsbtodb(fs, fs->fs_csaddr + i), + size, 0, 0, 0); + bcopy(space, bp->b_data, (u_int)size); + space = (char *)space + size; + if (waitfor != MNT_WAIT) + bawrite(bp); + else if ((error = bwrite(bp)) != 0) + allerror = error; + } + /* + * Now write back the superblock itself. If any errors occurred + * up to this point, then fail so that the superblock avoids + * being written out as clean. + */ + if (allerror) + return (allerror); + if (fs->fs_magic == FS_UFS1_MAGIC && fs->fs_sblockloc != SBLOCK_UFS1 && + (fs->fs_flags & FS_FLAGS_UPDATED) == 0) { + printf("%s: correcting fs_sblockloc from %jd to %d\n", + fs->fs_fsmnt, fs->fs_sblockloc, SBLOCK_UFS1); + fs->fs_sblockloc = SBLOCK_UFS1; + } + if (fs->fs_magic == FS_UFS2_MAGIC && fs->fs_sblockloc != SBLOCK_UFS2 && + (fs->fs_flags & FS_FLAGS_UPDATED) == 0) { + printf("%s: correcting fs_sblockloc from %jd to %d\n", + fs->fs_fsmnt, fs->fs_sblockloc, SBLOCK_UFS2); + fs->fs_sblockloc = SBLOCK_UFS2; + } + bp = getblk(mp->um_devvp, btodb(fs->fs_sblockloc), (int)fs->fs_sbsize, + 0, 0, 0); + fs->fs_fmod = 0; + fs->fs_time = time_second; + bcopy((caddr_t)fs, bp->b_data, (u_int)fs->fs_sbsize); + ffs_oldfscompat_write((struct fs *)bp->b_data, mp); + if (waitfor != MNT_WAIT) + bawrite(bp); + else if ((error = bwrite(bp)) != 0) + allerror = error; + return (allerror); +} + +static int +ffs_extattrctl(struct mount *mp, int cmd, struct vnode *filename_vp, + int attrnamespace, const char *attrname, struct thread *td) +{ + +#ifdef UFS_EXTATTR + return (ufs_extattrctl(mp, cmd, filename_vp, attrnamespace, + attrname, td)); +#else + return (vfs_stdextattrctl(mp, cmd, filename_vp, attrnamespace, + attrname, td)); +#endif +} + +static void +ffs_ifree(struct ufsmount *ump, struct inode *ip) +{ + + if (ump->um_fstype == UFS1 && ip->i_din1 != NULL) + uma_zfree(uma_ufs1, ip->i_din1); + else if (ip->i_din2 != NULL) + uma_zfree(uma_ufs2, ip->i_din2); + uma_zfree(uma_inode, ip); +} diff --git a/src/sys/ufs/ffs/ffs_vnops.c b/src/sys/ufs/ffs/ffs_vnops.c new file mode 100644 index 0000000..5b2d02b --- /dev/null +++ b/src/sys/ufs/ffs/ffs_vnops.c @@ -0,0 +1,1815 @@ +/* + * Copyright (c) 2002, 2003 Networks Associates Technology, Inc. + * All rights reserved. + * + * This software was developed for the FreeBSD Project by Marshall + * Kirk McKusick and Network Associates Laboratories, the Security + * Research Division of Network Associates, Inc. 
under DARPA/SPAWAR + * contract N66001-01-C-8035 ("CBOSS"), as part of the DARPA CHATS + * research program + * + * Copyright (c) 1982, 1986, 1989, 1993 + * The Regents of the University of California. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. All advertising materials mentioning features or use of this software + * must display the following acknowledgement: + * This product includes software developed by the University of + * California, Berkeley and its contributors. + * 4. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * @(#)ffs_vnops.c 8.15 (Berkeley) 5/14/95 + */ + +#include +__FBSDID("$FreeBSD: src/sys/ufs/ffs/ffs_vnops.c,v 1.119 2003/10/04 20:38:32 alc Exp $"); + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include + +#include +#include +#include "opt_directio.h" + +#ifdef DIRECTIO +extern int ffs_rawread(struct vnode *vp, struct uio *uio, int *workdone); +#endif +static int ffs_fsync(struct vop_fsync_args *); +static int ffs_getpages(struct vop_getpages_args *); +static int ffs_read(struct vop_read_args *); +static int ffs_write(struct vop_write_args *); +static int ffs_extread(struct vnode *vp, struct uio *uio, int ioflag); +static int ffs_extwrite(struct vnode *vp, struct uio *uio, int ioflag, + struct ucred *cred); +static int ffsext_strategy(struct vop_strategy_args *); +static int ffs_closeextattr(struct vop_closeextattr_args *); +static int ffs_deleteextattr(struct vop_deleteextattr_args *); +static int ffs_getextattr(struct vop_getextattr_args *); +static int ffs_listextattr(struct vop_listextattr_args *); +static int ffs_openextattr(struct vop_openextattr_args *); +static int ffs_setextattr(struct vop_setextattr_args *); + + +/* Global vfs data structures for ufs. 
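+ *
+ * Three operation vectors follow: regular vnodes, device special files,
+ * and fifos. Operations not listed fall through to the generic ufs
+ * handlers named by the default entry in each table.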
*/ +vop_t **ffs_vnodeop_p; +static struct vnodeopv_entry_desc ffs_vnodeop_entries[] = { + { &vop_default_desc, (vop_t *) ufs_vnoperate }, + { &vop_fsync_desc, (vop_t *) ffs_fsync }, + { &vop_getpages_desc, (vop_t *) ffs_getpages }, + { &vop_read_desc, (vop_t *) ffs_read }, + { &vop_reallocblks_desc, (vop_t *) ffs_reallocblks }, + { &vop_write_desc, (vop_t *) ffs_write }, + { &vop_closeextattr_desc, (vop_t *) ffs_closeextattr }, + { &vop_deleteextattr_desc, (vop_t *) ffs_deleteextattr }, + { &vop_getextattr_desc, (vop_t *) ffs_getextattr }, + { &vop_listextattr_desc, (vop_t *) ffs_listextattr }, + { &vop_openextattr_desc, (vop_t *) ffs_openextattr }, + { &vop_setextattr_desc, (vop_t *) ffs_setextattr }, + { NULL, NULL } +}; +static struct vnodeopv_desc ffs_vnodeop_opv_desc = + { &ffs_vnodeop_p, ffs_vnodeop_entries }; + +vop_t **ffs_specop_p; +static struct vnodeopv_entry_desc ffs_specop_entries[] = { + { &vop_default_desc, (vop_t *) ufs_vnoperatespec }, + { &vop_fsync_desc, (vop_t *) ffs_fsync }, + { &vop_reallocblks_desc, (vop_t *) ffs_reallocblks }, + { &vop_strategy_desc, (vop_t *) ffsext_strategy }, + { &vop_closeextattr_desc, (vop_t *) ffs_closeextattr }, + { &vop_deleteextattr_desc, (vop_t *) ffs_deleteextattr }, + { &vop_getextattr_desc, (vop_t *) ffs_getextattr }, + { &vop_listextattr_desc, (vop_t *) ffs_listextattr }, + { &vop_openextattr_desc, (vop_t *) ffs_openextattr }, + { &vop_setextattr_desc, (vop_t *) ffs_setextattr }, + { NULL, NULL } +}; +static struct vnodeopv_desc ffs_specop_opv_desc = + { &ffs_specop_p, ffs_specop_entries }; + +vop_t **ffs_fifoop_p; +static struct vnodeopv_entry_desc ffs_fifoop_entries[] = { + { &vop_default_desc, (vop_t *) ufs_vnoperatefifo }, + { &vop_fsync_desc, (vop_t *) ffs_fsync }, + { &vop_reallocblks_desc, (vop_t *) ffs_reallocblks }, + { &vop_strategy_desc, (vop_t *) ffsext_strategy }, + { &vop_closeextattr_desc, (vop_t *) ffs_closeextattr }, + { &vop_deleteextattr_desc, (vop_t *) ffs_deleteextattr }, + { &vop_getextattr_desc, (vop_t *) ffs_getextattr }, + { &vop_listextattr_desc, (vop_t *) ffs_listextattr }, + { &vop_openextattr_desc, (vop_t *) ffs_openextattr }, + { &vop_setextattr_desc, (vop_t *) ffs_setextattr }, + { NULL, NULL } +}; +static struct vnodeopv_desc ffs_fifoop_opv_desc = + { &ffs_fifoop_p, ffs_fifoop_entries }; + +VNODEOP_SET(ffs_vnodeop_opv_desc); +VNODEOP_SET(ffs_specop_opv_desc); +VNODEOP_SET(ffs_fifoop_opv_desc); + +/* + * Synch an open file. + */ +/* ARGSUSED */ +static int +ffs_fsync(ap) + struct vop_fsync_args /* { + struct vnode *a_vp; + struct ucred *a_cred; + int a_waitfor; + struct thread *a_td; + } */ *ap; +{ + struct vnode *vp = ap->a_vp; + struct inode *ip = VTOI(vp); + struct buf *bp; + struct buf *nbp; + int s, error, wait, passes, skipmeta; + ufs_lbn_t lbn; + + wait = (ap->a_waitfor == MNT_WAIT); + if (vn_isdisk(vp, NULL)) { + lbn = INT_MAX; + if (vp->v_rdev->si_mountpoint != NULL && + (vp->v_rdev->si_mountpoint->mnt_flag & MNT_SOFTDEP)) + softdep_fsync_mountdev(vp); + } else { + lbn = lblkno(ip->i_fs, (ip->i_size + ip->i_fs->fs_bsize - 1)); + } + + /* + * Flush all dirty buffers associated with a vnode. 
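+ *
+ * For a synchronous request the dirty buffer list is walked at least
+ * twice: file data first, then metadata (buffers with negative logical
+ * block numbers).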
+ */ + passes = NIADDR + 1; + skipmeta = 0; + if (wait) + skipmeta = 1; + s = splbio(); + VI_LOCK(vp); +loop: + TAILQ_FOREACH(bp, &vp->v_dirtyblkhd, b_vnbufs) + bp->b_vflags &= ~BV_SCANNED; + for (bp = TAILQ_FIRST(&vp->v_dirtyblkhd); bp; bp = nbp) { + nbp = TAILQ_NEXT(bp, b_vnbufs); + /* + * Reasons to skip this buffer: it has already been considered + * on this pass, this pass is the first time through on a + * synchronous flush request and the buffer being considered + * is metadata, the buffer has dependencies that will cause + * it to be redirtied and it has not already been deferred, + * or it is already being written. + */ + if ((bp->b_vflags & BV_SCANNED) != 0) + continue; + bp->b_vflags |= BV_SCANNED; + if ((skipmeta == 1 && bp->b_lblkno < 0)) + continue; + if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL)) + continue; + if (!wait && LIST_FIRST(&bp->b_dep) != NULL && + (bp->b_flags & B_DEFERRED) == 0 && + buf_countdeps(bp, 0)) { + bp->b_flags |= B_DEFERRED; + BUF_UNLOCK(bp); + continue; + } + VI_UNLOCK(vp); + if ((bp->b_flags & B_DELWRI) == 0) + panic("ffs_fsync: not dirty"); + if (vp != bp->b_vp) + panic("ffs_fsync: vp != vp->b_vp"); + /* + * If this is a synchronous flush request, or it is not a + * file or device, start the write on this buffer immediatly. + */ + if (wait || (vp->v_type != VREG && vp->v_type != VBLK)) { + + /* + * On our final pass through, do all I/O synchronously + * so that we can find out if our flush is failing + * because of write errors. + */ + if (passes > 0 || !wait) { + if ((bp->b_flags & B_CLUSTEROK) && !wait) { + (void) vfs_bio_awrite(bp); + } else { + bremfree(bp); + splx(s); + (void) bawrite(bp); + s = splbio(); + } + } else { + bremfree(bp); + splx(s); + if ((error = bwrite(bp)) != 0) + return (error); + s = splbio(); + } + } else if ((vp->v_type == VREG) && (bp->b_lblkno >= lbn)) { + /* + * If the buffer is for data that has been truncated + * off the file, then throw it away. + */ + bremfree(bp); + bp->b_flags |= B_INVAL | B_NOCACHE; + splx(s); + brelse(bp); + s = splbio(); + } else + vfs_bio_awrite(bp); + + /* + * Since we may have slept during the I/O, we need + * to start from a known point. + */ + VI_LOCK(vp); + nbp = TAILQ_FIRST(&vp->v_dirtyblkhd); + } + /* + * If we were asked to do this synchronously, then go back for + * another pass, this time doing the metadata. + */ + if (skipmeta) { + skipmeta = 0; + goto loop; + } + + if (wait) { + while (vp->v_numoutput) { + vp->v_iflag |= VI_BWAIT; + msleep((caddr_t)&vp->v_numoutput, VI_MTX(vp), + PRIBIO + 4, "ffsfsn", 0); + } + VI_UNLOCK(vp); + + /* + * Ensure that any filesystem metatdata associated + * with the vnode has been written. + */ + splx(s); + if ((error = softdep_sync_metadata(ap)) != 0) + return (error); + s = splbio(); + + VI_LOCK(vp); + if (!TAILQ_EMPTY(&vp->v_dirtyblkhd)) { + /* + * Block devices associated with filesystems may + * have new I/O requests posted for them even if + * the vnode is locked, so no amount of trying will + * get them clean. Thus we give block devices a + * good effort, then just give up. For all other file + * types, go around and try again until it is clean. + */ + if (passes > 0) { + passes -= 1; + goto loop; + } +#ifdef DIAGNOSTIC + if (!vn_isdisk(vp, NULL)) + vprint("ffs_fsync: dirty", vp); +#endif + } + } + VI_UNLOCK(vp); + splx(s); + return (UFS_UPDATE(vp, wait)); +} + + +/* + * Vnode op for reading. 
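+ *
+ * Data is transferred one filesystem block at a time; cluster_read()
+ * or breadn() read-ahead is used when the access pattern permits.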
+ */ +/* ARGSUSED */ +static int +ffs_read(ap) + struct vop_read_args /* { + struct vnode *a_vp; + struct uio *a_uio; + int a_ioflag; + struct ucred *a_cred; + } */ *ap; +{ + struct vnode *vp; + struct inode *ip; + struct uio *uio; + struct fs *fs; + struct buf *bp; + ufs_lbn_t lbn, nextlbn; + off_t bytesinfile; + long size, xfersize, blkoffset; + int error, orig_resid; + int seqcount; + int ioflag; + vm_object_t object; + + vp = ap->a_vp; + uio = ap->a_uio; + ioflag = ap->a_ioflag; + if (ap->a_ioflag & IO_EXT) +#ifdef notyet + return (ffs_extread(vp, uio, ioflag)); +#else + panic("ffs_read+IO_EXT"); +#endif +#ifdef DIRECTIO + if ((ioflag & IO_DIRECT) != 0) { + int workdone; + + error = ffs_rawread(vp, uio, &workdone); + if (error != 0 || workdone != 0) + return error; + } +#endif + + GIANT_REQUIRED; + + seqcount = ap->a_ioflag >> 16; + ip = VTOI(vp); + +#ifdef DIAGNOSTIC + if (uio->uio_rw != UIO_READ) + panic("ffs_read: mode"); + + if (vp->v_type == VLNK) { + if ((int)ip->i_size < vp->v_mount->mnt_maxsymlinklen) + panic("ffs_read: short symlink"); + } else if (vp->v_type != VREG && vp->v_type != VDIR) + panic("ffs_read: type %d", vp->v_type); +#endif + fs = ip->i_fs; + if ((u_int64_t)uio->uio_offset > fs->fs_maxfilesize) + return (EFBIG); + + orig_resid = uio->uio_resid; + if (orig_resid <= 0) + return (0); + + object = vp->v_object; + + bytesinfile = ip->i_size - uio->uio_offset; + if (bytesinfile <= 0) { + if ((vp->v_mount->mnt_flag & MNT_NOATIME) == 0) + ip->i_flag |= IN_ACCESS; + return 0; + } + + if (object) { + vm_object_reference(object); + } + + /* + * Ok so we couldn't do it all in one vm trick... + * so cycle around trying smaller bites.. + */ + for (error = 0, bp = NULL; uio->uio_resid > 0; bp = NULL) { + if ((bytesinfile = ip->i_size - uio->uio_offset) <= 0) + break; + + lbn = lblkno(fs, uio->uio_offset); + nextlbn = lbn + 1; + + /* + * size of buffer. The buffer representing the + * end of the file is rounded up to the size of + * the block type ( fragment or full block, + * depending ). + */ + size = blksize(fs, ip, lbn); + blkoffset = blkoff(fs, uio->uio_offset); + + /* + * The amount we want to transfer in this iteration is + * one FS block less the amount of the data before + * our startpoint (duh!) + */ + xfersize = fs->fs_bsize - blkoffset; + + /* + * But if we actually want less than the block, + * or the file doesn't have a whole block more of data, + * then use the lesser number. + */ + if (uio->uio_resid < xfersize) + xfersize = uio->uio_resid; + if (bytesinfile < xfersize) + xfersize = bytesinfile; + + if (lblktosize(fs, nextlbn) >= ip->i_size) { + /* + * Don't do readahead if this is the end of the file. + */ + error = bread(vp, lbn, size, NOCRED, &bp); + } else if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERR) == 0) { + /* + * Otherwise if we are allowed to cluster, + * grab as much as we can. + * + * XXX This may not be a win if we are not + * doing sequential access. + */ + error = cluster_read(vp, ip->i_size, lbn, + size, NOCRED, uio->uio_resid, seqcount, &bp); + } else if (seqcount > 1) { + /* + * If we are NOT allowed to cluster, then + * if we appear to be acting sequentially, + * fire off a request for a readahead + * as well as a read. Note that the 4th and 5th + * arguments point to arrays of the size specified in + * the 6th argument. + */ + int nextsize = blksize(fs, ip, nextlbn); + error = breadn(vp, lbn, + size, &nextlbn, &nextsize, 1, NOCRED, &bp); + } else { + /* + * Failing all of the above, just read what the + * user asked for. 
Interestingly, the same as + * the first option above. + */ + error = bread(vp, lbn, size, NOCRED, &bp); + } + if (error) { + brelse(bp); + bp = NULL; + break; + } + + /* + * If IO_DIRECT then set B_DIRECT for the buffer. This + * will cause us to attempt to release the buffer later on + * and will cause the buffer cache to attempt to free the + * underlying pages. + */ + if (ioflag & IO_DIRECT) + bp->b_flags |= B_DIRECT; + + /* + * We should only get non-zero b_resid when an I/O error + * has occurred, which should cause us to break above. + * However, if the short read did not cause an error, + * then we want to ensure that we do not uiomove bad + * or uninitialized data. + */ + size -= bp->b_resid; + if (size < xfersize) { + if (size == 0) + break; + xfersize = size; + } + + { + /* + * otherwise use the general form + */ + error = + uiomove((char *)bp->b_data + blkoffset, + (int)xfersize, uio); + } + + if (error) + break; + + if ((ioflag & (IO_VMIO|IO_DIRECT)) && + (LIST_FIRST(&bp->b_dep) == NULL)) { + /* + * If there are no dependencies, and it's VMIO, + * then we don't need the buf, mark it available + * for freeing. The VM has the data. + */ + bp->b_flags |= B_RELBUF; + brelse(bp); + } else { + /* + * Otherwise let whoever + * made the request take care of + * freeing it. We just queue + * it onto another list. + */ + bqrelse(bp); + } + } + + /* + * This can only happen in the case of an error + * because the loop above resets bp to NULL on each iteration + * and on normal completion has not set a new value into it. + * so it must have come from a 'break' statement + */ + if (bp != NULL) { + if ((ioflag & (IO_VMIO|IO_DIRECT)) && + (LIST_FIRST(&bp->b_dep) == NULL)) { + bp->b_flags |= B_RELBUF; + brelse(bp); + } else { + bqrelse(bp); + } + } + + if (object) { + VM_OBJECT_LOCK(object); + vm_object_vndeallocate(object); + } + if ((error == 0 || uio->uio_resid != orig_resid) && + (vp->v_mount->mnt_flag & MNT_NOATIME) == 0) + ip->i_flag |= IN_ACCESS; + return (error); +} + +/* + * Vnode op for writing. 
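+ *
+ * Each pass of the loop below allocates backing store with UFS_BALLOC()
+ * and then copies the user data into the buffer with uiomove().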
+ */ +static int +ffs_write(ap) + struct vop_write_args /* { + struct vnode *a_vp; + struct uio *a_uio; + int a_ioflag; + struct ucred *a_cred; + } */ *ap; +{ + struct vnode *vp; + struct uio *uio; + struct inode *ip; + struct fs *fs; + struct buf *bp; + struct thread *td; + ufs_lbn_t lbn; + off_t osize; + int seqcount; + int blkoffset, error, extended, flags, ioflag, resid, size, xfersize; + vm_object_t object; + + vp = ap->a_vp; + uio = ap->a_uio; + ioflag = ap->a_ioflag; + if (ap->a_ioflag & IO_EXT) +#ifdef notyet + return (ffs_extwrite(vp, uio, ioflag, ap->a_cred)); +#else + panic("ffs_read+IO_EXT"); +#endif + + GIANT_REQUIRED; + + extended = 0; + seqcount = ap->a_ioflag >> 16; + ip = VTOI(vp); + + object = vp->v_object; + if (object) { + vm_object_reference(object); + } + +#ifdef DIAGNOSTIC + if (uio->uio_rw != UIO_WRITE) + panic("ffswrite: mode"); +#endif + + switch (vp->v_type) { + case VREG: + if (ioflag & IO_APPEND) + uio->uio_offset = ip->i_size; + if ((ip->i_flags & APPEND) && uio->uio_offset != ip->i_size) { + if (object) { + VM_OBJECT_LOCK(object); + vm_object_vndeallocate(object); + } + return (EPERM); + } + /* FALLTHROUGH */ + case VLNK: + break; + case VDIR: + panic("ffswrite: dir write"); + break; + default: + panic("ffswrite: type %p %d (%d,%d)", vp, (int)vp->v_type, + (int)uio->uio_offset, + (int)uio->uio_resid + ); + } + + fs = ip->i_fs; + if (uio->uio_offset < 0 || + (u_int64_t)uio->uio_offset + uio->uio_resid > fs->fs_maxfilesize) { + if (object) { + VM_OBJECT_LOCK(object); + vm_object_vndeallocate(object); + } + return (EFBIG); + } + /* + * Maybe this should be above the vnode op call, but so long as + * file servers have no limits, I don't think it matters. + */ + td = uio->uio_td; + if (vp->v_type == VREG && td && + uio->uio_offset + uio->uio_resid > + td->td_proc->p_rlimit[RLIMIT_FSIZE].rlim_cur) { + PROC_LOCK(td->td_proc); + psignal(td->td_proc, SIGXFSZ); + PROC_UNLOCK(td->td_proc); + if (object) { + VM_OBJECT_LOCK(object); + vm_object_vndeallocate(object); + } + return (EFBIG); + } + + resid = uio->uio_resid; + osize = ip->i_size; + if (seqcount > BA_SEQMAX) + flags = BA_SEQMAX << BA_SEQSHIFT; + else + flags = seqcount << BA_SEQSHIFT; + if ((ioflag & IO_SYNC) && !DOINGASYNC(vp)) + flags |= IO_SYNC; + + for (error = 0; uio->uio_resid > 0;) { + lbn = lblkno(fs, uio->uio_offset); + blkoffset = blkoff(fs, uio->uio_offset); + xfersize = fs->fs_bsize - blkoffset; + if (uio->uio_resid < xfersize) + xfersize = uio->uio_resid; + + if (uio->uio_offset + xfersize > ip->i_size) + vnode_pager_setsize(vp, uio->uio_offset + xfersize); + + /* + * We must perform a read-before-write if the transfer size + * does not cover the entire buffer. + */ + if (fs->fs_bsize > xfersize) + flags |= BA_CLRBUF; + else + flags &= ~BA_CLRBUF; +/* XXX is uio->uio_offset the right thing here? */ + error = UFS_BALLOC(vp, uio->uio_offset, xfersize, + ap->a_cred, flags, &bp); + if (error != 0) + break; + /* + * If the buffer is not valid we have to clear out any + * garbage data from the pages instantiated for the buffer. + * If we do not, a failed uiomove() during a write can leave + * the prior contents of the pages exposed to a userland + * mmap(). XXX deal with uiomove() errors a better way. 
+ */ + if ((bp->b_flags & B_CACHE) == 0 && fs->fs_bsize <= xfersize) + vfs_bio_clrbuf(bp); + if (ioflag & IO_DIRECT) + bp->b_flags |= B_DIRECT; + + if (uio->uio_offset + xfersize > ip->i_size) { + ip->i_size = uio->uio_offset + xfersize; + DIP(ip, i_size) = ip->i_size; + extended = 1; + } + + size = blksize(fs, ip, lbn) - bp->b_resid; + if (size < xfersize) + xfersize = size; + + error = + uiomove((char *)bp->b_data + blkoffset, (int)xfersize, uio); + if ((ioflag & (IO_VMIO|IO_DIRECT)) && + (LIST_FIRST(&bp->b_dep) == NULL)) { + bp->b_flags |= B_RELBUF; + } + + /* + * If IO_SYNC each buffer is written synchronously. Otherwise + * if we have a severe page deficiency write the buffer + * asynchronously. Otherwise try to cluster, and if that + * doesn't do it then either do an async write (if O_DIRECT), + * or a delayed write (if not). + */ + if (ioflag & IO_SYNC) { + (void)bwrite(bp); + } else if (vm_page_count_severe() || + buf_dirty_count_severe() || + (ioflag & IO_ASYNC)) { + bp->b_flags |= B_CLUSTEROK; + bawrite(bp); + } else if (xfersize + blkoffset == fs->fs_bsize) { + if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERW) == 0) { + bp->b_flags |= B_CLUSTEROK; + cluster_write(bp, ip->i_size, seqcount); + } else { + bawrite(bp); + } + } else if (ioflag & IO_DIRECT) { + bp->b_flags |= B_CLUSTEROK; + bawrite(bp); + } else { + bp->b_flags |= B_CLUSTEROK; + bdwrite(bp); + } + if (error || xfersize == 0) + break; + ip->i_flag |= IN_CHANGE | IN_UPDATE; + } + /* + * If we successfully wrote any data, and we are not the superuser + * we clear the setuid and setgid bits as a precaution against + * tampering. + */ + if (resid > uio->uio_resid && ap->a_cred && + suser_cred(ap->a_cred, PRISON_ROOT)) { + ip->i_mode &= ~(ISUID | ISGID); + DIP(ip, i_mode) = ip->i_mode; + } + if (resid > uio->uio_resid) + VN_KNOTE(vp, NOTE_WRITE | (extended ? NOTE_EXTEND : 0)); + if (error) { + if (ioflag & IO_UNIT) { + (void)UFS_TRUNCATE(vp, osize, + IO_NORMAL | (ioflag & IO_SYNC), + ap->a_cred, uio->uio_td); + uio->uio_offset -= resid - uio->uio_resid; + uio->uio_resid = resid; + } + } else if (resid > uio->uio_resid && (ioflag & IO_SYNC)) + error = UFS_UPDATE(vp, 1); + + if (object) { + VM_OBJECT_LOCK(object); + vm_object_vndeallocate(object); + } + + return (error); +} + +/* + * get page routine + */ +static int +ffs_getpages(ap) + struct vop_getpages_args *ap; +{ + off_t foff, physoffset; + int i, size, bsize; + struct vnode *dp, *vp; + vm_object_t obj; + vm_pindex_t pindex; + vm_page_t mreq; + int bbackwards, bforwards; + int pbackwards, pforwards; + int firstpage; + ufs2_daddr_t reqblkno, reqlblkno; + int poff; + int pcount; + int rtval; + int pagesperblock; + + GIANT_REQUIRED; + + pcount = round_page(ap->a_count) / PAGE_SIZE; + mreq = ap->a_m[ap->a_reqpage]; + + /* + * if ANY DEV_BSIZE blocks are valid on a large filesystem block, + * then the entire page is valid. Since the page may be mapped, + * user programs might reference data beyond the actual end of file + * occuring within the page. We have to zero that data. 
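+ *
+ * If the requested page is not already valid, ufs_bmaparray() maps the
+ * logical block to a physical block and the read is passed to the
+ * device vnode's VOP_GETPAGES(); holes (block -1) are zero-filled.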
+ */ + VM_OBJECT_LOCK(mreq->object); + if (mreq->valid) { + if (mreq->valid != VM_PAGE_BITS_ALL) + vm_page_zero_invalid(mreq, TRUE); + vm_page_lock_queues(); + for (i = 0; i < pcount; i++) { + if (i != ap->a_reqpage) { + vm_page_free(ap->a_m[i]); + } + } + vm_page_unlock_queues(); + VM_OBJECT_UNLOCK(mreq->object); + return VM_PAGER_OK; + } + VM_OBJECT_UNLOCK(mreq->object); + vp = ap->a_vp; + obj = vp->v_object; + bsize = vp->v_mount->mnt_stat.f_iosize; + pindex = mreq->pindex; + foff = IDX_TO_OFF(pindex) /* + ap->a_offset should be zero */; + + if (bsize < PAGE_SIZE) + return vnode_pager_generic_getpages(ap->a_vp, ap->a_m, + ap->a_count, + ap->a_reqpage); + + /* + * foff is the file offset of the required page + * reqlblkno is the logical block that contains the page + * poff is the index of the page into the logical block + */ + reqlblkno = foff / bsize; + poff = (foff % bsize) / PAGE_SIZE; + + dp = VTOI(vp)->i_devvp; + if (ufs_bmaparray(vp, reqlblkno, &reqblkno, 0, &bforwards, &bbackwards) + || (reqblkno == -1)) { + VM_OBJECT_LOCK(obj); + vm_page_lock_queues(); + for(i = 0; i < pcount; i++) { + if (i != ap->a_reqpage) + vm_page_free(ap->a_m[i]); + } + vm_page_unlock_queues(); + if (reqblkno == -1) { + if ((mreq->flags & PG_ZERO) == 0) + pmap_zero_page(mreq); + vm_page_undirty(mreq); + mreq->valid = VM_PAGE_BITS_ALL; + VM_OBJECT_UNLOCK(obj); + return VM_PAGER_OK; + } else { + VM_OBJECT_UNLOCK(obj); + return VM_PAGER_ERROR; + } + } + + physoffset = (off_t)reqblkno * DEV_BSIZE + poff * PAGE_SIZE; + pagesperblock = bsize / PAGE_SIZE; + /* + * find the first page that is contiguous... + * note that pbackwards is the number of pages that are contiguous + * backwards. + */ + firstpage = 0; + if (ap->a_count) { + pbackwards = poff + bbackwards * pagesperblock; + if (ap->a_reqpage > pbackwards) { + firstpage = ap->a_reqpage - pbackwards; + VM_OBJECT_LOCK(obj); + vm_page_lock_queues(); + for(i=0;ia_m[i]); + vm_page_unlock_queues(); + VM_OBJECT_UNLOCK(obj); + } + + /* + * pforwards is the number of pages that are contiguous + * after the current page. + */ + pforwards = (pagesperblock - (poff + 1)) + + bforwards * pagesperblock; + if (pforwards < (pcount - (ap->a_reqpage + 1))) { + VM_OBJECT_LOCK(obj); + vm_page_lock_queues(); + for( i = ap->a_reqpage + pforwards + 1; i < pcount; i++) + vm_page_free(ap->a_m[i]); + vm_page_unlock_queues(); + VM_OBJECT_UNLOCK(obj); + pcount = ap->a_reqpage + pforwards + 1; + } + + /* + * number of pages for I/O corrected for the non-contig pages at + * the beginning of the array. + */ + pcount -= firstpage; + } + + /* + * calculate the size of the transfer + */ + + size = pcount * PAGE_SIZE; + + if ((IDX_TO_OFF(ap->a_m[firstpage]->pindex) + size) > + obj->un_pager.vnp.vnp_size) + size = obj->un_pager.vnp.vnp_size - + IDX_TO_OFF(ap->a_m[firstpage]->pindex); + + physoffset -= foff; + rtval = VOP_GETPAGES(dp, &ap->a_m[firstpage], size, + (ap->a_reqpage - firstpage), physoffset); + + return (rtval); +} + +/* + * Extended attribute area reading. 
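+ *
+ * The extended attribute area is addressed with negative logical block
+ * numbers (-1 - lbn), as seen in the bread()/breadn() calls below.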
+ */ +static int +ffs_extread(struct vnode *vp, struct uio *uio, int ioflag) +{ + struct inode *ip; + struct ufs2_dinode *dp; + struct fs *fs; + struct buf *bp; + ufs_lbn_t lbn, nextlbn; + off_t bytesinfile; + long size, xfersize, blkoffset; + int error, orig_resid; + + GIANT_REQUIRED; + + ip = VTOI(vp); + fs = ip->i_fs; + dp = ip->i_din2; + +#ifdef DIAGNOSTIC + if (uio->uio_rw != UIO_READ || fs->fs_magic != FS_UFS2_MAGIC) + panic("ffs_extread: mode"); + +#endif + orig_resid = uio->uio_resid; + if (orig_resid <= 0) + return (0); + + bytesinfile = dp->di_extsize - uio->uio_offset; + if (bytesinfile <= 0) { + if ((vp->v_mount->mnt_flag & MNT_NOATIME) == 0) + ip->i_flag |= IN_ACCESS; + return 0; + } + + for (error = 0, bp = NULL; uio->uio_resid > 0; bp = NULL) { + if ((bytesinfile = dp->di_extsize - uio->uio_offset) <= 0) + break; + + lbn = lblkno(fs, uio->uio_offset); + nextlbn = lbn + 1; + + /* + * size of buffer. The buffer representing the + * end of the file is rounded up to the size of + * the block type ( fragment or full block, + * depending ). + */ + size = sblksize(fs, dp->di_extsize, lbn); + blkoffset = blkoff(fs, uio->uio_offset); + + /* + * The amount we want to transfer in this iteration is + * one FS block less the amount of the data before + * our startpoint (duh!) + */ + xfersize = fs->fs_bsize - blkoffset; + + /* + * But if we actually want less than the block, + * or the file doesn't have a whole block more of data, + * then use the lesser number. + */ + if (uio->uio_resid < xfersize) + xfersize = uio->uio_resid; + if (bytesinfile < xfersize) + xfersize = bytesinfile; + + if (lblktosize(fs, nextlbn) >= dp->di_extsize) { + /* + * Don't do readahead if this is the end of the info. + */ + error = bread(vp, -1 - lbn, size, NOCRED, &bp); + } else { + /* + * If we have a second block, then + * fire off a request for a readahead + * as well as a read. Note that the 4th and 5th + * arguments point to arrays of the size specified in + * the 6th argument. + */ + int nextsize = sblksize(fs, dp->di_extsize, nextlbn); + + nextlbn = -1 - nextlbn; + error = breadn(vp, -1 - lbn, + size, &nextlbn, &nextsize, 1, NOCRED, &bp); + } + if (error) { + brelse(bp); + bp = NULL; + break; + } + + /* + * If IO_DIRECT then set B_DIRECT for the buffer. This + * will cause us to attempt to release the buffer later on + * and will cause the buffer cache to attempt to free the + * underlying pages. + */ + if (ioflag & IO_DIRECT) + bp->b_flags |= B_DIRECT; + + /* + * We should only get non-zero b_resid when an I/O error + * has occurred, which should cause us to break above. + * However, if the short read did not cause an error, + * then we want to ensure that we do not uiomove bad + * or uninitialized data. + */ + size -= bp->b_resid; + if (size < xfersize) { + if (size == 0) + break; + xfersize = size; + } + + error = uiomove((char *)bp->b_data + blkoffset, + (int)xfersize, uio); + if (error) + break; + + if ((ioflag & (IO_VMIO|IO_DIRECT)) && + (LIST_FIRST(&bp->b_dep) == NULL)) { + /* + * If there are no dependencies, and it's VMIO, + * then we don't need the buf, mark it available + * for freeing. The VM has the data. + */ + bp->b_flags |= B_RELBUF; + brelse(bp); + } else { + /* + * Otherwise let whoever + * made the request take care of + * freeing it. We just queue + * it onto another list. + */ + bqrelse(bp); + } + } + + /* + * This can only happen in the case of an error + * because the loop above resets bp to NULL on each iteration + * and on normal completion has not set a new value into it. 
+ * so it must have come from a 'break' statement + */ + if (bp != NULL) { + if ((ioflag & (IO_VMIO|IO_DIRECT)) && + (LIST_FIRST(&bp->b_dep) == NULL)) { + bp->b_flags |= B_RELBUF; + brelse(bp); + } else { + bqrelse(bp); + } + } + + if ((error == 0 || uio->uio_resid != orig_resid) && + (vp->v_mount->mnt_flag & MNT_NOATIME) == 0) + ip->i_flag |= IN_ACCESS; + return (error); +} + +/* + * Extended attribute area writing. + */ +static int +ffs_extwrite(struct vnode *vp, struct uio *uio, int ioflag, struct ucred *ucred) +{ + struct inode *ip; + struct ufs2_dinode *dp; + struct fs *fs; + struct buf *bp; + ufs_lbn_t lbn; + off_t osize; + int blkoffset, error, flags, resid, size, xfersize; + + GIANT_REQUIRED; + + ip = VTOI(vp); + fs = ip->i_fs; + dp = ip->i_din2; + +#ifdef DIAGNOSTIC + if (uio->uio_rw != UIO_WRITE || fs->fs_magic != FS_UFS2_MAGIC) + panic("ext_write: mode"); +#endif + + if (ioflag & IO_APPEND) + uio->uio_offset = dp->di_extsize; + + if (uio->uio_offset < 0 || + (u_int64_t)uio->uio_offset + uio->uio_resid > NXADDR * fs->fs_bsize) + return (EFBIG); + + resid = uio->uio_resid; + osize = dp->di_extsize; + flags = IO_EXT; + if ((ioflag & IO_SYNC) && !DOINGASYNC(vp)) + flags |= IO_SYNC; + + for (error = 0; uio->uio_resid > 0;) { + lbn = lblkno(fs, uio->uio_offset); + blkoffset = blkoff(fs, uio->uio_offset); + xfersize = fs->fs_bsize - blkoffset; + if (uio->uio_resid < xfersize) + xfersize = uio->uio_resid; + + /* + * We must perform a read-before-write if the transfer size + * does not cover the entire buffer. + */ + if (fs->fs_bsize > xfersize) + flags |= BA_CLRBUF; + else + flags &= ~BA_CLRBUF; + error = UFS_BALLOC(vp, uio->uio_offset, xfersize, + ucred, flags, &bp); + if (error != 0) + break; + /* + * If the buffer is not valid we have to clear out any + * garbage data from the pages instantiated for the buffer. + * If we do not, a failed uiomove() during a write can leave + * the prior contents of the pages exposed to a userland + * mmap(). XXX deal with uiomove() errors a better way. + */ + if ((bp->b_flags & B_CACHE) == 0 && fs->fs_bsize <= xfersize) + vfs_bio_clrbuf(bp); + if (ioflag & IO_DIRECT) + bp->b_flags |= B_DIRECT; + + if (uio->uio_offset + xfersize > dp->di_extsize) + dp->di_extsize = uio->uio_offset + xfersize; + + size = sblksize(fs, dp->di_extsize, lbn) - bp->b_resid; + if (size < xfersize) + xfersize = size; + + error = + uiomove((char *)bp->b_data + blkoffset, (int)xfersize, uio); + if ((ioflag & (IO_VMIO|IO_DIRECT)) && + (LIST_FIRST(&bp->b_dep) == NULL)) { + bp->b_flags |= B_RELBUF; + } + + /* + * If IO_SYNC each buffer is written synchronously. Otherwise + * if we have a severe page deficiency write the buffer + * asynchronously. Otherwise try to cluster, and if that + * doesn't do it then either do an async write (if O_DIRECT), + * or a delayed write (if not). + */ + if (ioflag & IO_SYNC) { + (void)bwrite(bp); + } else if (vm_page_count_severe() || + buf_dirty_count_severe() || + xfersize + blkoffset == fs->fs_bsize || + (ioflag & (IO_ASYNC | IO_DIRECT))) + bawrite(bp); + else + bdwrite(bp); + if (error || xfersize == 0) + break; + ip->i_flag |= IN_CHANGE | IN_UPDATE; + } + /* + * If we successfully wrote any data, and we are not the superuser + * we clear the setuid and setgid bits as a precaution against + * tampering. 
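+ * This matches the handling of the setuid and setgid bits in the
+ * regular data write path.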
+ */ + if (resid > uio->uio_resid && ucred && + suser_cred(ucred, PRISON_ROOT)) { + ip->i_mode &= ~(ISUID | ISGID); + dp->di_mode = ip->i_mode; + } + if (error) { + if (ioflag & IO_UNIT) { + (void)UFS_TRUNCATE(vp, osize, + IO_EXT | (ioflag&IO_SYNC), ucred, uio->uio_td); + uio->uio_offset -= resid - uio->uio_resid; + uio->uio_resid = resid; + } + } else if (resid > uio->uio_resid && (ioflag & IO_SYNC)) + error = UFS_UPDATE(vp, 1); + return (error); +} + + +/* + * Vnode operating to retrieve a named extended attribute. + * + * Locate a particular EA (nspace:name) in the area (ptr:length), and return + * the length of the EA, and possibly the pointer to the entry and to the data. + */ +static int +ffs_findextattr(u_char *ptr, u_int length, int nspace, const char *name, u_char **eap, u_char **eac) +{ + u_char *p, *pe, *pn, *p0; + int eapad1, eapad2, ealength, ealen, nlen; + uint32_t ul; + + pe = ptr + length; + nlen = strlen(name); + + for (p = ptr; p < pe; p = pn) { + p0 = p; + bcopy(p, &ul, sizeof(ul)); + pn = p + ul; + /* make sure this entry is complete */ + if (pn > pe) + break; + p += sizeof(uint32_t); + if (*p != nspace) + continue; + p++; + eapad2 = *p++; + if (*p != nlen) + continue; + p++; + if (bcmp(p, name, nlen)) + continue; + ealength = sizeof(uint32_t) + 3 + nlen; + eapad1 = 8 - (ealength % 8); + if (eapad1 == 8) + eapad1 = 0; + ealength += eapad1; + ealen = ul - ealength - eapad2; + p += nlen + eapad1; + if (eap != NULL) + *eap = p0; + if (eac != NULL) + *eac = p; + return (ealen); + } + return(-1); +} + +static int +ffs_rdextattr(u_char **p, struct vnode *vp, struct thread *td, int extra) +{ + struct inode *ip; + struct ufs2_dinode *dp; + struct uio luio; + struct iovec liovec; + int easize, error; + u_char *eae; + + ip = VTOI(vp); + dp = ip->i_din2; + easize = dp->di_extsize; + + eae = malloc(easize + extra, M_TEMP, M_WAITOK); + + liovec.iov_base = eae; + liovec.iov_len = easize; + luio.uio_iov = &liovec; + luio.uio_iovcnt = 1; + luio.uio_offset = 0; + luio.uio_resid = easize; + luio.uio_segflg = UIO_SYSSPACE; + luio.uio_rw = UIO_READ; + luio.uio_td = td; + + error = ffs_extread(vp, &luio, IO_EXT | IO_SYNC); + if (error) { + free(eae, M_TEMP); + return(error); + } + *p = eae; + return (0); +} + +static int +ffs_open_ea(struct vnode *vp, struct ucred *cred, struct thread *td) +{ + struct inode *ip; + struct ufs2_dinode *dp; + int error; + + ip = VTOI(vp); + + if (ip->i_ea_area != NULL) + return (EBUSY); + dp = ip->i_din2; + error = ffs_rdextattr(&ip->i_ea_area, vp, td, 0); + if (error) + return (error); + ip->i_ea_len = dp->di_extsize; + ip->i_ea_error = 0; + return (0); +} + +/* + * Vnode extattr transaction commit/abort + */ +static int +ffs_close_ea(struct vnode *vp, int commit, struct ucred *cred, struct thread *td) +{ + struct inode *ip; + struct uio luio; + struct iovec liovec; + int error; + struct ufs2_dinode *dp; + + ip = VTOI(vp); + if (ip->i_ea_area == NULL) + return (EINVAL); + dp = ip->i_din2; + error = ip->i_ea_error; + if (commit && error == 0) { + if (cred == NOCRED) + cred = vp->v_mount->mnt_cred; + liovec.iov_base = ip->i_ea_area; + liovec.iov_len = ip->i_ea_len; + luio.uio_iov = &liovec; + luio.uio_iovcnt = 1; + luio.uio_offset = 0; + luio.uio_resid = ip->i_ea_len; + luio.uio_segflg = UIO_SYSSPACE; + luio.uio_rw = UIO_WRITE; + luio.uio_td = td; + /* XXX: I'm not happy about truncating to zero size */ + if (ip->i_ea_len < dp->di_extsize) + error = ffs_truncate(vp, 0, IO_EXT, cred, td); + error = ffs_extwrite(vp, &luio, IO_EXT | IO_SYNC, cred); + } + 
free(ip->i_ea_area, M_TEMP); + ip->i_ea_area = NULL; + ip->i_ea_len = 0; + ip->i_ea_error = 0; + return (error); +} + +/* + * Vnode extattr strategy routine for special devices and fifos. + * + * We need to check for a read or write of the external attributes. + * Otherwise we just fall through and do the usual thing. + */ +static int +ffsext_strategy(struct vop_strategy_args *ap) +/* +struct vop_strategy_args { + struct vnodeop_desc *a_desc; + struct vnode *a_vp; + struct buf *a_bp; +}; +*/ +{ + struct vnode *vp; + daddr_t lbn; + + KASSERT(ap->a_vp == ap->a_bp->b_vp, ("%s(%p != %p)", + __func__, ap->a_vp, ap->a_bp->b_vp)); + vp = ap->a_vp; + lbn = ap->a_bp->b_lblkno; + if (VTOI(vp)->i_fs->fs_magic == FS_UFS2_MAGIC && + lbn < 0 && lbn >= -NXADDR) + return (ufs_vnoperate((struct vop_generic_args *)ap)); + if (vp->v_type == VFIFO) + return (ufs_vnoperatefifo((struct vop_generic_args *)ap)); + return (ufs_vnoperatespec((struct vop_generic_args *)ap)); +} + +/* + * Vnode extattr transaction commit/abort + */ +static int +ffs_openextattr(struct vop_openextattr_args *ap) +/* +struct vop_openextattr_args { + struct vnodeop_desc *a_desc; + struct vnode *a_vp; + IN struct ucred *a_cred; + IN struct thread *a_td; +}; +*/ +{ + struct inode *ip; + struct fs *fs; + + ip = VTOI(ap->a_vp); + fs = ip->i_fs; + if (fs->fs_magic == FS_UFS1_MAGIC) + return (ufs_vnoperate((struct vop_generic_args *)ap)); + + if (ap->a_vp->v_type == VCHR) + return (EOPNOTSUPP); + + return (ffs_open_ea(ap->a_vp, ap->a_cred, ap->a_td)); +} + + +/* + * Vnode extattr transaction commit/abort + */ +static int +ffs_closeextattr(struct vop_closeextattr_args *ap) +/* +struct vop_closeextattr_args { + struct vnodeop_desc *a_desc; + struct vnode *a_vp; + int a_commit; + IN struct ucred *a_cred; + IN struct thread *a_td; +}; +*/ +{ + struct inode *ip; + struct fs *fs; + + ip = VTOI(ap->a_vp); + fs = ip->i_fs; + if (fs->fs_magic == FS_UFS1_MAGIC) + return (ufs_vnoperate((struct vop_generic_args *)ap)); + + if (ap->a_vp->v_type == VCHR) + return (EOPNOTSUPP); + + return (ffs_close_ea(ap->a_vp, ap->a_commit, ap->a_cred, ap->a_td)); +} + +/* + * Vnode operation to remove a named attribute. 
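+ *
+ * The record is located with ffs_findextattr() and cut out of the in-core
+ * copy of the EA area by sliding everything that follows it down over the
+ * hole; with ealength left at zero, the bcopy() below amounts to
+ * (a sketch of the effect only):
+ *
+ *	bcopy(p + ul, p, easize - (p - eae) - ul);
+ *	easize -= ul;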
+ */ +static int +ffs_deleteextattr(struct vop_deleteextattr_args *ap) +/* +vop_deleteextattr { + IN struct vnode *a_vp; + IN int a_attrnamespace; + IN const char *a_name; + IN struct ucred *a_cred; + IN struct thread *a_td; +}; +*/ +{ + struct inode *ip; + struct fs *fs; + uint32_t ealength, ul; + int ealen, olen, eapad1, eapad2, error, i, easize; + u_char *eae, *p; + int stand_alone; + + ip = VTOI(ap->a_vp); + fs = ip->i_fs; + + if (fs->fs_magic == FS_UFS1_MAGIC) + return (ufs_vnoperate((struct vop_generic_args *)ap)); + + if (ap->a_vp->v_type == VCHR) + return (EOPNOTSUPP); + + if (strlen(ap->a_name) == 0) + return (EINVAL); + + error = extattr_check_cred(ap->a_vp, ap->a_attrnamespace, + ap->a_cred, ap->a_td, IWRITE); + if (error) { + if (ip->i_ea_area != NULL && ip->i_ea_error == 0) + ip->i_ea_error = error; + return (error); + } + + if (ip->i_ea_area == NULL) { + error = ffs_open_ea(ap->a_vp, ap->a_cred, ap->a_td); + if (error) + return (error); + stand_alone = 1; + } else { + stand_alone = 0; + } + + ealength = eapad1 = ealen = eapad2 = 0; + + eae = malloc(ip->i_ea_len, M_TEMP, M_WAITOK); + bcopy(ip->i_ea_area, eae, ip->i_ea_len); + easize = ip->i_ea_len; + + olen = ffs_findextattr(eae, easize, ap->a_attrnamespace, ap->a_name, + &p, NULL); + if (olen == -1) { + /* delete but nonexistent */ + free(eae, M_TEMP); + if (stand_alone) + ffs_close_ea(ap->a_vp, 0, ap->a_cred, ap->a_td); + return(ENOATTR); + } + bcopy(p, &ul, sizeof ul); + i = p - eae + ul; + if (ul != ealength) { + bcopy(p + ul, p + ealength, easize - i); + easize += (ealength - ul); + } + if (easize > NXADDR * fs->fs_bsize) { + free(eae, M_TEMP); + if (stand_alone) + ffs_close_ea(ap->a_vp, 0, ap->a_cred, ap->a_td); + else if (ip->i_ea_error == 0) + ip->i_ea_error = ENOSPC; + return(ENOSPC); + } + p = ip->i_ea_area; + ip->i_ea_area = eae; + ip->i_ea_len = easize; + free(p, M_TEMP); + if (stand_alone) + error = ffs_close_ea(ap->a_vp, 1, ap->a_cred, ap->a_td); + return(error); +} + +/* + * Vnode operation to retrieve a named extended attribute. + */ +static int +ffs_getextattr(struct vop_getextattr_args *ap) +/* +vop_getextattr { + IN struct vnode *a_vp; + IN int a_attrnamespace; + IN const char *a_name; + INOUT struct uio *a_uio; + OUT size_t *a_size; + IN struct ucred *a_cred; + IN struct thread *a_td; +}; +*/ +{ + struct inode *ip; + struct fs *fs; + u_char *eae, *p; + unsigned easize; + int error, ealen, stand_alone; + + ip = VTOI(ap->a_vp); + fs = ip->i_fs; + + if (fs->fs_magic == FS_UFS1_MAGIC) + return (ufs_vnoperate((struct vop_generic_args *)ap)); + + if (ap->a_vp->v_type == VCHR) + return (EOPNOTSUPP); + + error = extattr_check_cred(ap->a_vp, ap->a_attrnamespace, + ap->a_cred, ap->a_td, IREAD); + if (error) + return (error); + + if (ip->i_ea_area == NULL) { + error = ffs_open_ea(ap->a_vp, ap->a_cred, ap->a_td); + if (error) + return (error); + stand_alone = 1; + } else { + stand_alone = 0; + } + eae = ip->i_ea_area; + easize = ip->i_ea_len; + + ealen = ffs_findextattr(eae, easize, ap->a_attrnamespace, ap->a_name, + NULL, &p); + if (ealen >= 0) { + error = 0; + if (ap->a_size != NULL) + *ap->a_size = ealen; + else if (ap->a_uio != NULL) + error = uiomove(p, ealen, ap->a_uio); + } else + error = ENOATTR; + if (stand_alone) + ffs_close_ea(ap->a_vp, 0, ap->a_cred, ap->a_td); + return(error); +} + +/* + * Vnode operation to retrieve extended attributes on a vnode. 
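+ *
+ * Each record in the EA area has the layout written by ffs_setextattr()
+ * and parsed by ffs_findextattr(); reconstructed from that code it is
+ * roughly (field names here are descriptive only):
+ *
+ *	uint32_t len;		total record length, a multiple of 8
+ *	u_char   nspace;	attribute namespace
+ *	u_char   contentpad;	bytes of zero padding after the data
+ *	u_char   namelen;	length of the attribute name
+ *	char     name[];	name, zero padded to an 8 byte boundary
+ *	...attribute data..., then contentpad bytes of zero padding
+ *
+ * The routine below walks these records and, for each attribute in the
+ * requested namespace, copies out the one-byte name length followed by
+ * the name itself.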
+ */ +static int +ffs_listextattr(struct vop_listextattr_args *ap) +/* +vop_listextattr { + IN struct vnode *a_vp; + IN int a_attrnamespace; + INOUT struct uio *a_uio; + OUT size_t *a_size; + IN struct ucred *a_cred; + IN struct thread *a_td; +}; +*/ +{ + struct inode *ip; + struct fs *fs; + u_char *eae, *p, *pe, *pn; + unsigned easize; + uint32_t ul; + int error, ealen, stand_alone; + + ip = VTOI(ap->a_vp); + fs = ip->i_fs; + + if (fs->fs_magic == FS_UFS1_MAGIC) + return (ufs_vnoperate((struct vop_generic_args *)ap)); + + if (ap->a_vp->v_type == VCHR) + return (EOPNOTSUPP); + + error = extattr_check_cred(ap->a_vp, ap->a_attrnamespace, + ap->a_cred, ap->a_td, IREAD); + if (error) + return (error); + + if (ip->i_ea_area == NULL) { + error = ffs_open_ea(ap->a_vp, ap->a_cred, ap->a_td); + if (error) + return (error); + stand_alone = 1; + } else { + stand_alone = 0; + } + eae = ip->i_ea_area; + easize = ip->i_ea_len; + + error = 0; + if (ap->a_size != NULL) + *ap->a_size = 0; + pe = eae + easize; + for(p = eae; error == 0 && p < pe; p = pn) { + bcopy(p, &ul, sizeof(ul)); + pn = p + ul; + if (pn > pe) + break; + p += sizeof(ul); + if (*p++ != ap->a_attrnamespace) + continue; + p++; /* pad2 */ + ealen = *p; + if (ap->a_size != NULL) { + *ap->a_size += ealen + 1; + } else if (ap->a_uio != NULL) { + error = uiomove(p, ealen + 1, ap->a_uio); + } + } + if (stand_alone) + ffs_close_ea(ap->a_vp, 0, ap->a_cred, ap->a_td); + return(error); +} + +/* + * Vnode operation to set a named attribute. + */ +static int +ffs_setextattr(struct vop_setextattr_args *ap) +/* +vop_setextattr { + IN struct vnode *a_vp; + IN int a_attrnamespace; + IN const char *a_name; + INOUT struct uio *a_uio; + IN struct ucred *a_cred; + IN struct thread *a_td; +}; +*/ +{ + struct inode *ip; + struct fs *fs; + uint32_t ealength, ul; + int ealen, olen, eapad1, eapad2, error, i, easize; + u_char *eae, *p; + int stand_alone; + + ip = VTOI(ap->a_vp); + fs = ip->i_fs; + + if (fs->fs_magic == FS_UFS1_MAGIC) + return (ufs_vnoperate((struct vop_generic_args *)ap)); + + if (ap->a_vp->v_type == VCHR) + return (EOPNOTSUPP); + + if (strlen(ap->a_name) == 0) + return (EINVAL); + + /* XXX Now unsupported API to delete EAs using NULL uio. 
*/ + if (ap->a_uio == NULL) + return (EOPNOTSUPP); + + error = extattr_check_cred(ap->a_vp, ap->a_attrnamespace, + ap->a_cred, ap->a_td, IWRITE); + if (error) { + if (ip->i_ea_area != NULL && ip->i_ea_error == 0) + ip->i_ea_error = error; + return (error); + } + + if (ip->i_ea_area == NULL) { + error = ffs_open_ea(ap->a_vp, ap->a_cred, ap->a_td); + if (error) + return (error); + stand_alone = 1; + } else { + stand_alone = 0; + } + + ealen = ap->a_uio->uio_resid; + ealength = sizeof(uint32_t) + 3 + strlen(ap->a_name); + eapad1 = 8 - (ealength % 8); + if (eapad1 == 8) + eapad1 = 0; + eapad2 = 8 - (ealen % 8); + if (eapad2 == 8) + eapad2 = 0; + ealength += eapad1 + ealen + eapad2; + + eae = malloc(ip->i_ea_len + ealength, M_TEMP, M_WAITOK); + bcopy(ip->i_ea_area, eae, ip->i_ea_len); + easize = ip->i_ea_len; + + olen = ffs_findextattr(eae, easize, + ap->a_attrnamespace, ap->a_name, &p, NULL); + if (olen == -1) { + /* new, append at end */ + p = eae + easize; + easize += ealength; + } else { + bcopy(p, &ul, sizeof ul); + i = p - eae + ul; + if (ul != ealength) { + bcopy(p + ul, p + ealength, easize - i); + easize += (ealength - ul); + } + } + if (easize > NXADDR * fs->fs_bsize) { + free(eae, M_TEMP); + if (stand_alone) + ffs_close_ea(ap->a_vp, 0, ap->a_cred, ap->a_td); + else if (ip->i_ea_error == 0) + ip->i_ea_error = ENOSPC; + return(ENOSPC); + } + bcopy(&ealength, p, sizeof(ealength)); + p += sizeof(ealength); + *p++ = ap->a_attrnamespace; + *p++ = eapad2; + *p++ = strlen(ap->a_name); + strcpy(p, ap->a_name); + p += strlen(ap->a_name); + bzero(p, eapad1); + p += eapad1; + error = uiomove(p, ealen, ap->a_uio); + if (error) { + free(eae, M_TEMP); + if (stand_alone) + ffs_close_ea(ap->a_vp, 0, ap->a_cred, ap->a_td); + else if (ip->i_ea_error == 0) + ip->i_ea_error = error; + return(error); + } + p += ealen; + bzero(p, eapad2); + + p = ip->i_ea_area; + ip->i_ea_area = eae; + ip->i_ea_len = easize; + free(p, M_TEMP); + if (stand_alone) + error = ffs_close_ea(ap->a_vp, 1, ap->a_cred, ap->a_td); + return(error); +} diff --git a/src/sys/ufs/ffs/fs.h b/src/sys/ufs/ffs/fs.h new file mode 100644 index 0000000..a207977 --- /dev/null +++ b/src/sys/ufs/ffs/fs.h @@ -0,0 +1,604 @@ +/* + * Copyright (c) 1982, 1986, 1993 + * The Regents of the University of California. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. All advertising materials mentioning features or use of this software + * must display the following acknowledgement: + * This product includes software developed by the University of + * California, Berkeley and its contributors. + * 4. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * @(#)fs.h 8.13 (Berkeley) 3/21/95 + * $FreeBSD: src/sys/ufs/ffs/fs.h,v 1.40 2003/11/16 07:08:27 wes Exp $ + */ + +#ifndef _UFS_FFS_FS_H_ +#define _UFS_FFS_FS_H_ + +/* + * Each disk drive contains some number of filesystems. + * A filesystem consists of a number of cylinder groups. + * Each cylinder group has inodes and data. + * + * A filesystem is described by its super-block, which in turn + * describes the cylinder groups. The super-block is critical + * data and is replicated in each cylinder group to protect against + * catastrophic loss. This is done at `newfs' time and the critical + * super-block data does not change, so the copies need not be + * referenced further unless disaster strikes. + * + * For filesystem fs, the offsets of the various blocks of interest + * are given in the super block as: + * [fs->fs_sblkno] Super-block + * [fs->fs_cblkno] Cylinder group block + * [fs->fs_iblkno] Inode blocks + * [fs->fs_dblkno] Data blocks + * The beginning of cylinder group cg in fs, is given by + * the ``cgbase(fs, cg)'' macro. + * + * Depending on the architecture and the media, the superblock may + * reside in any one of four places. For tiny media where every block + * counts, it is placed at the very front of the partition. Historically, + * UFS1 placed it 8K from the front to leave room for the disk label and + * a small bootstrap. For UFS2 it got moved to 64K from the front to leave + * room for the disk label and a bigger bootstrap, and for really piggy + * systems we check at 256K from the front if the first three fail. In + * all cases the size of the superblock will be SBLOCKSIZE. All values are + * given in byte-offset form, so they do not imply a sector size. The + * SBLOCKSEARCH specifies the order in which the locations should be searched. + */ +#define SBLOCK_FLOPPY 0 +#define SBLOCK_UFS1 8192 +#define SBLOCK_UFS2 65536 +#define SBLOCK_PIGGY 262144 +#define SBLOCKSIZE 8192 +#define SBLOCKSEARCH \ + { SBLOCK_UFS2, SBLOCK_UFS1, SBLOCK_FLOPPY, SBLOCK_PIGGY, -1 } + +/* + * Max number of fragments per block. This value is NOT tweakable. + */ +#define MAXFRAG 8 + +/* + * Addresses stored in inodes are capable of addressing fragments + * of `blocks'. File system blocks of at most size MAXBSIZE can + * be optionally broken into 2, 4, or 8 pieces, each of which is + * addressable; these pieces may be DEV_BSIZE, or some multiple of + * a DEV_BSIZE unit. + * + * Large files consist of exclusively large data blocks. To avoid + * undue wasted disk space, the last data block of a small file may be + * allocated as only as many fragments of a large block as are + * necessary. The filesystem format retains only a single pointer + * to such a fragment, which is a piece of a single large block that + * has been divided. The size of such a fragment is determinable from + * information in the inode, using the ``blksize(fs, ip, lbn)'' macro. 
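+ *
+ * For example, on a filesystem with 8192 byte blocks and 1024 byte
+ * fragments (fs_frag == 8), the last block of a 5000 byte file can be
+ * allocated as five contiguous fragments (5120 bytes) rather than a full
+ * 8192 byte block; fragroundup(fs, blkoff(fs, 5000)), using the macros
+ * defined later in this file, yields that 5120 byte size.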
+ * + * The filesystem records space availability at the fragment level; + * to determine block availability, aligned fragments are examined. + */ + +/* + * MINBSIZE is the smallest allowable block size. + * In order to insure that it is possible to create files of size + * 2^32 with only two levels of indirection, MINBSIZE is set to 4096. + * MINBSIZE must be big enough to hold a cylinder group block, + * thus changes to (struct cg) must keep its size within MINBSIZE. + * Note that super blocks are always of size SBSIZE, + * and that both SBSIZE and MAXBSIZE must be >= MINBSIZE. + */ +#define MINBSIZE 4096 + +/* + * The path name on which the filesystem is mounted is maintained + * in fs_fsmnt. MAXMNTLEN defines the amount of space allocated in + * the super block for this name. + */ +#define MAXMNTLEN 468 + +/* + * The volume name for this filesystem is maintained in fs_volname. + * MAXVOLLEN defines the length of the buffer allocated. + */ +#define MAXVOLLEN 32 + +/* + * There is a 128-byte region in the superblock reserved for in-core + * pointers to summary information. Originally this included an array + * of pointers to blocks of struct csum; now there are just a few + * pointers and the remaining space is padded with fs_ocsp[]. + * + * NOCSPTRS determines the size of this padding. One pointer (fs_csp) + * is taken away to point to a contiguous array of struct csum for + * all cylinder groups; a second (fs_maxcluster) points to an array + * of cluster sizes that is computed as cylinder groups are inspected, + * and the third points to an array that tracks the creation of new + * directories. A fourth pointer, fs_active, is used when creating + * snapshots; it points to a bitmap of cylinder groups for which the + * free-block bitmap has changed since the snapshot operation began. + */ +#define NOCSPTRS ((128 / sizeof(void *)) - 4) + +/* + * A summary of contiguous blocks of various sizes is maintained + * in each cylinder group. Normally this is set by the initial + * value of fs_maxcontig. To conserve space, a maximum summary size + * is set by FS_MAXCONTIG. + */ +#define FS_MAXCONTIG 16 + +/* + * MINFREE gives the minimum acceptable percentage of filesystem + * blocks which may be free. If the freelist drops below this level + * only the superuser may continue to allocate blocks. This may + * be set to 0 if no reserve of free blocks is deemed necessary, + * however throughput drops by fifty percent if the filesystem + * is run at between 95% and 100% full; thus the minimum default + * value of fs_minfree is 5%. However, to get good clustering + * performance, 10% is a better choice. hence we use 10% as our + * default value. With 10% free space, fragmentation is not a + * problem, so we choose to optimize for time. + */ +#define MINFREE 8 +#define DEFAULTOPT FS_OPTTIME + +/* + * Grigoriy Orlov has done some extensive work to fine + * tune the layout preferences for directories within a filesystem. + * His algorithm can be tuned by adjusting the following parameters + * which tell the system the average file size and the average number + * of files per directory. These defaults are well selected for typical + * filesystems, but may need to be tuned for odd cases like filesystems + * being used for sqiud caches or news spools. + */ +#define AVFILESIZ 16384 /* expected average file size */ +#define AFPDIR 64 /* expected number of files per directory */ + +/* + * The maximum number of snapshot nodes that can be associated + * with each filesystem. 
This limit affects only the number of + * snapshot files that can be recorded within the superblock so + * that they can be found when the filesystem is mounted. However, + * maintaining too many will slow the filesystem performance, so + * having this limit is a good idea. + */ +#define FSMAXSNAP 20 + +/* + * Used to identify special blocks in snapshots: + * + * BLK_NOCOPY - A block that was unallocated at the time the snapshot + * was taken, hence does not need to be copied when written. + * BLK_SNAP - A block held by another snapshot that is not needed by this + * snapshot. When the other snapshot is freed, the BLK_SNAP entries + * are converted to BLK_NOCOPY. These are needed to allow fsck to + * identify blocks that are in use by other snapshots (which are + * expunged from this snapshot). + */ +#define BLK_NOCOPY ((ufs2_daddr_t)(1)) +#define BLK_SNAP ((ufs2_daddr_t)(2)) + +/* + * Sysctl values for the fast filesystem. + */ +#define FFS_ADJ_REFCNT 1 /* adjust inode reference count */ +#define FFS_ADJ_BLKCNT 2 /* adjust inode used block count */ +#define FFS_BLK_FREE 3 /* free range of blocks in map */ +#define FFS_DIR_FREE 4 /* free specified dir inodes in map */ +#define FFS_FILE_FREE 5 /* free specified file inodes in map */ +#define FFS_SET_FLAGS 6 /* set filesystem flags */ +#define FFS_MAXID 7 /* number of valid ffs ids */ + +/* + * Command structure passed in to the filesystem to adjust filesystem values. + */ +#define FFS_CMD_VERSION 0x19790518 /* version ID */ +struct fsck_cmd { + int32_t version; /* version of command structure */ + int32_t handle; /* reference to filesystem to be changed */ + int64_t value; /* inode or block number to be affected */ + int64_t size; /* amount or range to be adjusted */ + int64_t spare; /* reserved for future use */ +}; + +/* + * Per cylinder group information; summarized in blocks allocated + * from first cylinder group data blocks. These blocks have to be + * read in from fs_csaddr (size fs_cssize) in addition to the + * super block. + */ +struct csum { + int32_t cs_ndir; /* number of directories */ + int32_t cs_nbfree; /* number of free blocks */ + int32_t cs_nifree; /* number of free inodes */ + int32_t cs_nffree; /* number of free frags */ +}; +struct csum_total { + int64_t cs_ndir; /* number of directories */ + int64_t cs_nbfree; /* number of free blocks */ + int64_t cs_nifree; /* number of free inodes */ + int64_t cs_nffree; /* number of free frags */ + int64_t cs_numclusters; /* number of free clusters */ + int64_t cs_spare[3]; /* future expansion */ +}; + +/* + * Super block for an FFS filesystem. 
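+ *
+ * A reader normally probes the offsets listed in SBLOCKSEARCH and takes
+ * the first candidate with a valid magic number.  A minimal userland
+ * sketch (illustrative only; error and consistency checks omitted):
+ *
+ *	static const int sblock_try[] = SBLOCKSEARCH;
+ *	struct fs sb;
+ *	int i;
+ *
+ *	for (i = 0; sblock_try[i] != -1; i++) {
+ *		pread(fd, &sb, sizeof(sb), (off_t)sblock_try[i]);
+ *		if (sb.fs_magic == FS_UFS1_MAGIC ||
+ *		    sb.fs_magic == FS_UFS2_MAGIC)
+ *			break;
+ *	}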
+ */ +struct fs { + int32_t fs_firstfield; /* historic filesystem linked list, */ + int32_t fs_unused_1; /* used for incore super blocks */ + int32_t fs_sblkno; /* offset of super-block in filesys */ + int32_t fs_cblkno; /* offset of cyl-block in filesys */ + int32_t fs_iblkno; /* offset of inode-blocks in filesys */ + int32_t fs_dblkno; /* offset of first data after cg */ + int32_t fs_old_cgoffset; /* cylinder group offset in cylinder */ + int32_t fs_old_cgmask; /* used to calc mod fs_ntrak */ + int32_t fs_old_time; /* last time written */ + int32_t fs_old_size; /* number of blocks in fs */ + int32_t fs_old_dsize; /* number of data blocks in fs */ + int32_t fs_ncg; /* number of cylinder groups */ + int32_t fs_bsize; /* size of basic blocks in fs */ + int32_t fs_fsize; /* size of frag blocks in fs */ + int32_t fs_frag; /* number of frags in a block in fs */ +/* these are configuration parameters */ + int32_t fs_minfree; /* minimum percentage of free blocks */ + int32_t fs_old_rotdelay; /* num of ms for optimal next block */ + int32_t fs_old_rps; /* disk revolutions per second */ +/* these fields can be computed from the others */ + int32_t fs_bmask; /* ``blkoff'' calc of blk offsets */ + int32_t fs_fmask; /* ``fragoff'' calc of frag offsets */ + int32_t fs_bshift; /* ``lblkno'' calc of logical blkno */ + int32_t fs_fshift; /* ``numfrags'' calc number of frags */ +/* these are configuration parameters */ + int32_t fs_maxcontig; /* max number of contiguous blks */ + int32_t fs_maxbpg; /* max number of blks per cyl group */ +/* these fields can be computed from the others */ + int32_t fs_fragshift; /* block to frag shift */ + int32_t fs_fsbtodb; /* fsbtodb and dbtofsb shift constant */ + int32_t fs_sbsize; /* actual size of super block */ + int32_t fs_spare1[2]; /* old fs_csmask */ + /* old fs_csshift */ + int32_t fs_nindir; /* value of NINDIR */ + int32_t fs_inopb; /* value of INOPB */ + int32_t fs_old_nspf; /* value of NSPF */ +/* yet another configuration parameter */ + int32_t fs_optim; /* optimization preference, see below */ + int32_t fs_old_npsect; /* # sectors/track including spares */ + int32_t fs_old_interleave; /* hardware sector interleave */ + int32_t fs_old_trackskew; /* sector 0 skew, per track */ + int32_t fs_id[2]; /* unique filesystem id */ +/* sizes determined by number of cylinder groups and their sizes */ + int32_t fs_old_csaddr; /* blk addr of cyl grp summary area */ + int32_t fs_cssize; /* size of cyl grp summary area */ + int32_t fs_cgsize; /* cylinder group size */ + int32_t fs_spare2; /* old fs_ntrak */ + int32_t fs_old_nsect; /* sectors per track */ + int32_t fs_old_spc; /* sectors per cylinder */ + int32_t fs_old_ncyl; /* cylinders in filesystem */ + int32_t fs_old_cpg; /* cylinders per group */ + int32_t fs_ipg; /* inodes per group */ + int32_t fs_fpg; /* blocks per group * fs_frag */ +/* this data must be re-computed after crashes */ + struct csum fs_old_cstotal; /* cylinder summary information */ +/* these fields are cleared at mount time */ + int8_t fs_fmod; /* super block modified flag */ + int8_t fs_clean; /* filesystem is clean flag */ + int8_t fs_ronly; /* mounted read-only flag */ + int8_t fs_old_flags; /* old FS_ flags */ + u_char fs_fsmnt[MAXMNTLEN]; /* name mounted on */ + u_char fs_volname[MAXVOLLEN]; /* volume name */ + u_int64_t fs_swuid; /* system-wide uid */ + int32_t fs_pad; /* due to alignment of fs_swuid */ +/* these fields retain the current block allocation info */ + int32_t fs_cgrotor; /* last cg searched */ + void *fs_ocsp[NOCSPTRS]; /* 
padding; was list of fs_cs buffers */ + u_int8_t *fs_contigdirs; /* # of contiguously allocated dirs */ + struct csum *fs_csp; /* cg summary info buffer for fs_cs */ + int32_t *fs_maxcluster; /* max cluster in each cyl group */ + u_int *fs_active; /* used by snapshots to track fs */ + int32_t fs_old_cpc; /* cyl per cycle in postbl */ + int32_t fs_maxbsize; /* maximum blocking factor permitted */ + int64_t fs_sparecon64[17]; /* old rotation block list head */ + int64_t fs_sblockloc; /* byte offset of standard superblock */ + struct csum_total fs_cstotal; /* cylinder summary information */ + ufs_time_t fs_time; /* last time written */ + int64_t fs_size; /* number of blocks in fs */ + int64_t fs_dsize; /* number of data blocks in fs */ + ufs2_daddr_t fs_csaddr; /* blk addr of cyl grp summary area */ + int64_t fs_pendingblocks; /* blocks in process of being freed */ + int32_t fs_pendinginodes; /* inodes in process of being freed */ + int32_t fs_snapinum[FSMAXSNAP];/* list of snapshot inode numbers */ + int32_t fs_avgfilesize; /* expected average file size */ + int32_t fs_avgfpdir; /* expected # of files per directory */ + int32_t fs_save_cgsize; /* save real cg size to use fs_bsize */ + int32_t fs_sparecon32[26]; /* reserved for future constants */ + int32_t fs_flags; /* see FS_ flags below */ + int32_t fs_contigsumsize; /* size of cluster summary array */ + int32_t fs_maxsymlinklen; /* max length of an internal symlink */ + int32_t fs_old_inodefmt; /* format of on-disk inodes */ + u_int64_t fs_maxfilesize; /* maximum representable file size */ + int64_t fs_qbmask; /* ~fs_bmask for use with 64-bit size */ + int64_t fs_qfmask; /* ~fs_fmask for use with 64-bit size */ + int32_t fs_state; /* validate fs_clean field */ + int32_t fs_old_postblformat; /* format of positional layout tables */ + int32_t fs_old_nrpos; /* number of rotational positions */ + int32_t fs_spare5[2]; /* old fs_postbloff */ + /* old fs_rotbloff */ + int32_t fs_magic; /* magic number */ +}; + +/* Sanity checking. */ +#ifdef CTASSERT +CTASSERT(sizeof(struct fs) == 1376); +#endif + +/* + * Filesystem identification + */ +#define FS_UFS1_MAGIC 0x011954 /* UFS1 fast filesystem magic number */ +#define FS_UFS2_MAGIC 0x19540119 /* UFS2 fast filesystem magic number */ +#define FS_BAD2_MAGIC 0x19960408 /* UFS2 incomplete newfs magic number */ +#define FS_OKAY 0x7c269d38 /* superblock checksum */ +#define FS_42INODEFMT -1 /* 4.2BSD inode format */ +#define FS_44INODEFMT 2 /* 4.4BSD inode format */ + +/* + * Preference for optimization. + */ +#define FS_OPTTIME 0 /* minimize allocation time */ +#define FS_OPTSPACE 1 /* minimize disk fragmentation */ + +/* + * Filesystem flags. + * + * The FS_UNCLEAN flag is set by the kernel when the filesystem was + * mounted with fs_clean set to zero. The FS_DOSOFTDEP flag indicates + * that the filesystem should be managed by the soft updates code. + * Note that the FS_NEEDSFSCK flag is set and cleared only by the + * fsck utility. It is set when background fsck finds an unexpected + * inconsistency which requires a traditional foreground fsck to be + * run. Such inconsistencies should only be found after an uncorrectable + * disk error. A foreground fsck will clear the FS_NEEDSFSCK flag when + * it has successfully cleaned up the filesystem. The kernel uses this + * flag to enforce that inconsistent filesystems be mounted read-only. + * The FS_INDEXDIRS flag when set indicates that the kernel maintains + * on-disk auxiliary indexes (such as B-trees) for speeding directory + * accesses. 
Kernels that do not support auxiliary indicies clear the + * flag to indicate that the indicies need to be rebuilt (by fsck) before + * they can be used. + * + * FS_ACLS indicates that ACLs are administratively enabled for the + * file system, so they should be loaded from extended attributes, + * observed for access control purposes, and be administered by object + * owners. FS_MULTILABEL indicates that the TrustedBSD MAC Framework + * should attempt to back MAC labels into extended attributes on the + * file system rather than maintain a single mount label for all + * objects. + */ +#define FS_UNCLEAN 0x01 /* filesystem not clean at mount */ +#define FS_DOSOFTDEP 0x02 /* filesystem using soft dependencies */ +#define FS_NEEDSFSCK 0x04 /* filesystem needs sync fsck before mount */ +#define FS_INDEXDIRS 0x08 /* kernel supports indexed directories */ +#define FS_ACLS 0x10 /* file system has ACLs enabled */ +#define FS_MULTILABEL 0x20 /* file system is MAC multi-label */ +#define FS_FLAGS_UPDATED 0x80 /* flags have been moved to new location */ + +/* + * Macros to access bits in the fs_active array. + */ +#define ACTIVECGNUM(fs, cg) ((fs)->fs_active[(cg) / (NBBY * sizeof(int))]) +#define ACTIVECGOFF(cg) (1 << ((cg) % (NBBY * sizeof(int)))) + +/* + * The size of a cylinder group is calculated by CGSIZE. The maximum size + * is limited by the fact that cylinder groups are at most one block. + * Its size is derived from the size of the maps maintained in the + * cylinder group and the (struct cg) size. + */ +#define CGSIZE(fs) \ + /* base cg */ (sizeof(struct cg) + sizeof(int32_t) + \ + /* old btotoff */ (fs)->fs_old_cpg * sizeof(int32_t) + \ + /* old boff */ (fs)->fs_old_cpg * sizeof(u_int16_t) + \ + /* inode map */ howmany((fs)->fs_ipg, NBBY) + \ + /* block map */ howmany((fs)->fs_fpg, NBBY) +\ + /* if present */ ((fs)->fs_contigsumsize <= 0 ? 0 : \ + /* cluster sum */ (fs)->fs_contigsumsize * sizeof(int32_t) + \ + /* cluster map */ howmany(fragstoblks(fs, (fs)->fs_fpg), NBBY))) + +/* + * The minimal number of cylinder groups that should be created. + */ +#define MINCYLGRPS 4 + +/* + * Convert cylinder group to base address of its global summary info. + */ +#define fs_cs(fs, indx) fs_csp[indx] + +/* + * Cylinder group block for a filesystem. 
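+ *
+ * The per-group allocation maps are not declared as fields; they are laid
+ * out in cg_space[] and located through the byte offsets kept in the
+ * cg_*off members, using the accessor macros that follow the structure.
+ * For example (a sketch), whether inode "ino" is in use in its cylinder
+ * group can be tested with:
+ *
+ *	isset(cg_inosused(cgp), ino % fs->fs_ipg)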
+ */ +#define CG_MAGIC 0x090255 +struct cg { + int32_t cg_firstfield; /* historic cyl groups linked list */ + int32_t cg_magic; /* magic number */ + int32_t cg_old_time; /* time last written */ + int32_t cg_cgx; /* we are the cgx'th cylinder group */ + int16_t cg_old_ncyl; /* number of cyl's this cg */ + int16_t cg_old_niblk; /* number of inode blocks this cg */ + int32_t cg_ndblk; /* number of data blocks this cg */ + struct csum cg_cs; /* cylinder summary information */ + int32_t cg_rotor; /* position of last used block */ + int32_t cg_frotor; /* position of last used frag */ + int32_t cg_irotor; /* position of last used inode */ + int32_t cg_frsum[MAXFRAG]; /* counts of available frags */ + int32_t cg_old_btotoff; /* (int32) block totals per cylinder */ + int32_t cg_old_boff; /* (u_int16) free block positions */ + int32_t cg_iusedoff; /* (u_int8) used inode map */ + int32_t cg_freeoff; /* (u_int8) free block map */ + int32_t cg_nextfreeoff; /* (u_int8) next available space */ + int32_t cg_clustersumoff; /* (u_int32) counts of avail clusters */ + int32_t cg_clusteroff; /* (u_int8) free cluster map */ + int32_t cg_nclusterblks; /* number of clusters this cg */ + int32_t cg_niblk; /* number of inode blocks this cg */ + int32_t cg_initediblk; /* last initialized inode */ + int32_t cg_sparecon32[3]; /* reserved for future use */ + ufs_time_t cg_time; /* time last written */ + int64_t cg_sparecon64[3]; /* reserved for future use */ + u_int8_t cg_space[1]; /* space for cylinder group maps */ +/* actually longer */ +}; + +/* + * Macros for access to cylinder group array structures + */ +#define cg_chkmagic(cgp) ((cgp)->cg_magic == CG_MAGIC) +#define cg_inosused(cgp) \ + ((u_int8_t *)((u_int8_t *)(cgp) + (cgp)->cg_iusedoff)) +#define cg_blksfree(cgp) \ + ((u_int8_t *)((u_int8_t *)(cgp) + (cgp)->cg_freeoff)) +#define cg_clustersfree(cgp) \ + ((u_int8_t *)((u_int8_t *)(cgp) + (cgp)->cg_clusteroff)) +#define cg_clustersum(cgp) \ + ((int32_t *)((u_int8_t *)(cgp) + (cgp)->cg_clustersumoff)) + +/* + * Turn filesystem block numbers into disk block addresses. + * This maps filesystem blocks to device size blocks. + */ +#define fsbtodb(fs, b) ((b) << (fs)->fs_fsbtodb) +#define dbtofsb(fs, b) ((b) >> (fs)->fs_fsbtodb) + +/* + * Cylinder group macros to locate things in cylinder groups. + * They calc filesystem addresses of cylinder group data structures. + */ +#define cgbase(fs, c) (((ufs2_daddr_t)(fs)->fs_fpg) * (c)) +#define cgdmin(fs, c) (cgstart(fs, c) + (fs)->fs_dblkno) /* 1st data */ +#define cgimin(fs, c) (cgstart(fs, c) + (fs)->fs_iblkno) /* inode blk */ +#define cgsblock(fs, c) (cgstart(fs, c) + (fs)->fs_sblkno) /* super blk */ +#define cgtod(fs, c) (cgstart(fs, c) + (fs)->fs_cblkno) /* cg block */ +#define cgstart(fs, c) \ + ((fs)->fs_magic == FS_UFS2_MAGIC ? cgbase(fs, c) : \ + (cgbase(fs, c) + (fs)->fs_old_cgoffset * ((c) & ~((fs)->fs_old_cgmask)))) + +/* + * Macros for handling inode numbers: + * inode number to filesystem block offset. + * inode number to cylinder group number. + * inode number to filesystem block address. + */ +#define ino_to_cg(fs, x) ((x) / (fs)->fs_ipg) +#define ino_to_fsba(fs, x) \ + ((ufs2_daddr_t)(cgimin(fs, ino_to_cg(fs, x)) + \ + (blkstofrags((fs), (((x) % (fs)->fs_ipg) / INOPB(fs)))))) +#define ino_to_fsbo(fs, x) ((x) % INOPB(fs)) + +/* + * Give cylinder group number for a filesystem block. + * Give cylinder group block number for a filesystem block. 
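+ *
+ * For example (a sketch), data block d lives in cylinder group
+ * dtog(fs, d), at offset dtogd(fs, d) within that group's maps.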
+ */ +#define dtog(fs, d) ((d) / (fs)->fs_fpg) +#define dtogd(fs, d) ((d) % (fs)->fs_fpg) + +/* + * Extract the bits for a block from a map. + * Compute the cylinder and rotational position of a cyl block addr. + */ +#define blkmap(fs, map, loc) \ + (((map)[(loc) / NBBY] >> ((loc) % NBBY)) & (0xff >> (NBBY - (fs)->fs_frag))) + +/* + * The following macros optimize certain frequently calculated + * quantities by using shifts and masks in place of divisions + * modulos and multiplications. + */ +#define blkoff(fs, loc) /* calculates (loc % fs->fs_bsize) */ \ + ((loc) & (fs)->fs_qbmask) +#define fragoff(fs, loc) /* calculates (loc % fs->fs_fsize) */ \ + ((loc) & (fs)->fs_qfmask) +#define lfragtosize(fs, frag) /* calculates ((off_t)frag * fs->fs_fsize) */ \ + (((off_t)(frag)) << (fs)->fs_fshift) +#define lblktosize(fs, blk) /* calculates ((off_t)blk * fs->fs_bsize) */ \ + (((off_t)(blk)) << (fs)->fs_bshift) +/* Use this only when `blk' is known to be small, e.g., < NDADDR. */ +#define smalllblktosize(fs, blk) /* calculates (blk * fs->fs_bsize) */ \ + ((blk) << (fs)->fs_bshift) +#define lblkno(fs, loc) /* calculates (loc / fs->fs_bsize) */ \ + ((loc) >> (fs)->fs_bshift) +#define numfrags(fs, loc) /* calculates (loc / fs->fs_fsize) */ \ + ((loc) >> (fs)->fs_fshift) +#define blkroundup(fs, size) /* calculates roundup(size, fs->fs_bsize) */ \ + (((size) + (fs)->fs_qbmask) & (fs)->fs_bmask) +#define fragroundup(fs, size) /* calculates roundup(size, fs->fs_fsize) */ \ + (((size) + (fs)->fs_qfmask) & (fs)->fs_fmask) +#define fragstoblks(fs, frags) /* calculates (frags / fs->fs_frag) */ \ + ((frags) >> (fs)->fs_fragshift) +#define blkstofrags(fs, blks) /* calculates (blks * fs->fs_frag) */ \ + ((blks) << (fs)->fs_fragshift) +#define fragnum(fs, fsb) /* calculates (fsb % fs->fs_frag) */ \ + ((fsb) & ((fs)->fs_frag - 1)) +#define blknum(fs, fsb) /* calculates rounddown(fsb, fs->fs_frag) */ \ + ((fsb) &~ ((fs)->fs_frag - 1)) + +/* + * Determine the number of available frags given a + * percentage to hold in reserve. + */ +#define freespace(fs, percentreserved) \ + (blkstofrags((fs), (fs)->fs_cstotal.cs_nbfree) + \ + (fs)->fs_cstotal.cs_nffree - \ + (((off_t)((fs)->fs_dsize)) * (percentreserved) / 100)) + +/* + * Determining the size of a file block in the filesystem. + */ +#define blksize(fs, ip, lbn) \ + (((lbn) >= NDADDR || (ip)->i_size >= smalllblktosize(fs, (lbn) + 1)) \ + ? (fs)->fs_bsize \ + : (fragroundup(fs, blkoff(fs, (ip)->i_size)))) +#define sblksize(fs, size, lbn) \ + (((lbn) >= NDADDR || (size) >= ((lbn) + 1) << (fs)->fs_bshift) \ + ? (fs)->fs_bsize \ + : (fragroundup(fs, blkoff(fs, (size))))) + + +/* + * Number of inodes in a secondary storage block/fragment. + */ +#define INOPB(fs) ((fs)->fs_inopb) +#define INOPF(fs) ((fs)->fs_inopb >> (fs)->fs_fragshift) + +/* + * Number of indirects in a filesystem block. + */ +#define NINDIR(fs) ((fs)->fs_nindir) + +extern int inside[], around[]; +extern u_char *fragtbl[]; + +#endif diff --git a/src/sys/ufs/ffs/softdep.h b/src/sys/ufs/ffs/softdep.h new file mode 100644 index 0000000..94567e2 --- /dev/null +++ b/src/sys/ufs/ffs/softdep.h @@ -0,0 +1,589 @@ +/* + * Copyright 1998, 2000 Marshall Kirk McKusick. All Rights Reserved. + * + * The soft updates code is derived from the appendix of a University + * of Michigan technical report (Gregory R. Ganger and Yale N. Patt, + * "Soft Updates: A Solution to the Metadata Update Problem in File + * Systems", CSE-TR-254-95, August 1995). 
+ *
+ * Further information about soft updates can be obtained from:
+ *
+ *	Marshall Kirk McKusick		http://www.mckusick.com/softdep/
+ *	1614 Oxford Street		mckusick@mckusick.com
+ *	Berkeley, CA 94709-1608		+1-510-843-9542
+ *	USA
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY MARSHALL KIRK MCKUSICK ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL MARSHALL KIRK MCKUSICK BE LIABLE FOR
+ * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * @(#)softdep.h	9.7 (McKusick) 6/21/00
+ * $FreeBSD: src/sys/ufs/ffs/softdep.h,v 1.16 2002/07/19 07:29:38 mckusick Exp $
+ */
+
+#include <sys/queue.h>
+
+/*
+ * Allocation dependencies are handled with undo/redo on the in-memory
+ * copy of the data. A particular data dependency is eliminated when
+ * it is ALLCOMPLETE: that is ATTACHED, DEPCOMPLETE, and COMPLETE.
+ *
+ * ATTACHED means that the data is not currently being written to
+ * disk. UNDONE means that the data has been rolled back to a safe
+ * state for writing to the disk. When the I/O completes, the data is
+ * restored to its current form and the state reverts to ATTACHED.
+ * The data must be locked throughout the rollback, I/O, and roll
+ * forward so that the rolled back information is never visible to
+ * user processes. The COMPLETE flag indicates that the item has been
+ * written. For example, a dependency that requires that an inode be
+ * written will be marked COMPLETE after the inode has been written
+ * to disk. The DEPCOMPLETE flag indicates that any other dependencies,
+ * such as the writing of a cylinder group map, have been completed.
+ * A dependency structure may be freed only when both it
+ * and its dependencies have completed and any rollbacks that are in
+ * progress have finished as indicated by the set of ALLCOMPLETE flags
+ * all being set. The two MKDIR flags indicate additional dependencies
+ * that must be done when creating a new directory. MKDIR_BODY is
+ * cleared when the directory data block containing the "." and ".."
+ * entries has been written. MKDIR_PARENT is cleared when the parent
+ * inode with the increased link count for ".." has been written. When
+ * both MKDIR flags have been cleared, the DEPCOMPLETE flag is set to
+ * indicate that the directory dependencies have been completed. The
+ * writing of the directory inode itself sets the COMPLETE flag which
+ * then allows the directory entry for the new directory to be written
+ * to disk.
The RMDIR flag marks a dirrem structure as representing + * the removal of a directory rather than a file. When the removal + * dependencies are completed, additional work needs to be done + * (truncation of the "." and ".." entries, an additional decrement + * of the associated inode, and a decrement of the parent inode). The + * DIRCHG flag marks a diradd structure as representing the changing + * of an existing entry rather than the addition of a new one. When + * the update is complete the dirrem associated with the inode for + * the old name must be added to the worklist to do the necessary + * reference count decrement. The GOINGAWAY flag indicates that the + * data structure is frozen from further change until its dependencies + * have been completed and its resources freed after which it will be + * discarded. The IOSTARTED flag prevents multiple calls to the I/O + * start routine from doing multiple rollbacks. The SPACECOUNTED flag + * says that the files space has been accounted to the pending free + * space count. The NEWBLOCK flag marks pagedep structures that have + * just been allocated, so must be claimed by the inode before all + * dependencies are complete. The INPROGRESS flag marks worklist + * structures that are still on the worklist, but are being considered + * for action by some process. The UFS1FMT flag indicates that the + * inode being processed is a ufs1 format. The EXTDATA flag indicates + * that the allocdirect describes an extended-attributes dependency. + * The ONWORKLIST flag shows whether the structure is currently linked + * onto a worklist. + */ +#define ATTACHED 0x0001 +#define UNDONE 0x0002 +#define COMPLETE 0x0004 +#define DEPCOMPLETE 0x0008 +#define MKDIR_PARENT 0x0010 /* diradd & mkdir only */ +#define MKDIR_BODY 0x0020 /* diradd & mkdir only */ +#define RMDIR 0x0040 /* dirrem only */ +#define DIRCHG 0x0080 /* diradd & dirrem only */ +#define GOINGAWAY 0x0100 /* indirdep only */ +#define IOSTARTED 0x0200 /* inodedep & pagedep only */ +#define SPACECOUNTED 0x0400 /* inodedep only */ +#define NEWBLOCK 0x0800 /* pagedep only */ +#define INPROGRESS 0x1000 /* dirrem, freeblks, freefrag, freefile only */ +#define UFS1FMT 0x2000 /* indirdep only */ +#define EXTDATA 0x4000 /* allocdirect only */ +#define ONWORKLIST 0x8000 + +#define ALLCOMPLETE (ATTACHED | COMPLETE | DEPCOMPLETE) + +/* + * The workitem queue. + * + * It is sometimes useful and/or necessary to clean up certain dependencies + * in the background rather than during execution of an application process + * or interrupt service routine. To realize this, we append dependency + * structures corresponding to such tasks to a "workitem" queue. In a soft + * updates implementation, most pending workitems should not wait for more + * than a couple of seconds, so the filesystem syncer process awakens once + * per second to process the items on the queue. + */ + +/* LIST_HEAD(workhead, worklist); -- declared in buf.h */ + +/* + * Each request can be linked onto a work queue through its worklist structure. + * To avoid the need for a pointer to the structure itself, this structure + * MUST be declared FIRST in each type in which it appears! If more than one + * worklist is needed in the structure, then a wk_data field must be added + * and the macros below changed to use it. 
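+ *
+ * Because the worklist is always first, an item taken off a work queue
+ * can be recovered from its worklist pointer with a simple cast once its
+ * wk_type has been checked, e.g. (a sketch; "head" stands for any
+ * workhead list):
+ *
+ *	struct worklist *wk = LIST_FIRST(&head);
+ *	struct pagedep *pagedep = WK_PAGEDEP(wk);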
+ */ +struct worklist { + LIST_ENTRY(worklist) wk_list; /* list of work requests */ + unsigned short wk_type; /* type of request */ + unsigned short wk_state; /* state flags */ +}; +#define WK_DATA(wk) ((void *)(wk)) +#define WK_PAGEDEP(wk) ((struct pagedep *)(wk)) +#define WK_INODEDEP(wk) ((struct inodedep *)(wk)) +#define WK_NEWBLK(wk) ((struct newblk *)(wk)) +#define WK_BMSAFEMAP(wk) ((struct bmsafemap *)(wk)) +#define WK_ALLOCDIRECT(wk) ((struct allocdirect *)(wk)) +#define WK_INDIRDEP(wk) ((struct indirdep *)(wk)) +#define WK_ALLOCINDIR(wk) ((struct allocindir *)(wk)) +#define WK_FREEFRAG(wk) ((struct freefrag *)(wk)) +#define WK_FREEBLKS(wk) ((struct freeblks *)(wk)) +#define WK_FREEFILE(wk) ((struct freefile *)(wk)) +#define WK_DIRADD(wk) ((struct diradd *)(wk)) +#define WK_MKDIR(wk) ((struct mkdir *)(wk)) +#define WK_DIRREM(wk) ((struct dirrem *)(wk)) +#define WK_NEWDIRBLK(wk) ((struct newdirblk *)(wk)) + +/* + * Various types of lists + */ +LIST_HEAD(dirremhd, dirrem); +LIST_HEAD(diraddhd, diradd); +LIST_HEAD(newblkhd, newblk); +LIST_HEAD(inodedephd, inodedep); +LIST_HEAD(allocindirhd, allocindir); +LIST_HEAD(allocdirecthd, allocdirect); +TAILQ_HEAD(allocdirectlst, allocdirect); + +/* + * The "pagedep" structure tracks the various dependencies related to + * a particular directory page. If a directory page has any dependencies, + * it will have a pagedep linked to its associated buffer. The + * pd_dirremhd list holds the list of dirrem requests which decrement + * inode reference counts. These requests are processed after the + * directory page with the corresponding zero'ed entries has been + * written. The pd_diraddhd list maintains the list of diradd requests + * which cannot be committed until their corresponding inode has been + * written to disk. Because a directory may have many new entries + * being created, several lists are maintained hashed on bits of the + * offset of the entry into the directory page to keep the lists from + * getting too long. Once a new directory entry has been cleared to + * be written, it is moved to the pd_pendinghd list. After the new + * entry has been written to disk it is removed from the pd_pendinghd + * list, any removed operations are done, and the dependency structure + * is freed. + */ +#define DAHASHSZ 5 +#define DIRADDHASH(offset) (((offset) >> 2) % DAHASHSZ) +struct pagedep { + struct worklist pd_list; /* page buffer */ +# define pd_state pd_list.wk_state /* check for multiple I/O starts */ + LIST_ENTRY(pagedep) pd_hash; /* hashed lookup */ + struct mount *pd_mnt; /* associated mount point */ + ino_t pd_ino; /* associated file */ + ufs_lbn_t pd_lbn; /* block within file */ + struct dirremhd pd_dirremhd; /* dirrem's waiting for page */ + struct diraddhd pd_diraddhd[DAHASHSZ]; /* diradd dir entry updates */ + struct diraddhd pd_pendinghd; /* directory entries awaiting write */ +}; + +/* + * The "inodedep" structure tracks the set of dependencies associated + * with an inode. One task that it must manage is delayed operations + * (i.e., work requests that must be held until the inodedep's associated + * inode has been written to disk). Getting an inode from its incore + * state to the disk requires two steps to be taken by the filesystem + * in this order: first the inode must be copied to its disk buffer by + * the VOP_UPDATE operation; second the inode's buffer must be written + * to disk. To ensure that both operations have happened in the required + * order, the inodedep maintains two lists. 
Delayed operations are + * placed on the id_inowait list. When the VOP_UPDATE is done, all + * operations on the id_inowait list are moved to the id_bufwait list. + * When the buffer is written, the items on the id_bufwait list can be + * safely moved to the work queue to be processed. A second task of the + * inodedep structure is to track the status of block allocation within + * the inode. Each block that is allocated is represented by an + * "allocdirect" structure (see below). It is linked onto the id_newinoupdt + * list until both its contents and its allocation in the cylinder + * group map have been written to disk. Once these dependencies have been + * satisfied, it is removed from the id_newinoupdt list and any followup + * actions such as releasing the previous block or fragment are placed + * on the id_inowait list. When an inode is updated (a VOP_UPDATE is + * done), the "inodedep" structure is linked onto the buffer through + * its worklist. Thus, it will be notified when the buffer is about + * to be written and when it is done. At the update time, all the + * elements on the id_newinoupdt list are moved to the id_inoupdt list + * since those changes are now relevant to the copy of the inode in the + * buffer. Also at update time, the tasks on the id_inowait list are + * moved to the id_bufwait list so that they will be executed when + * the updated inode has been written to disk. When the buffer containing + * the inode is written to disk, any updates listed on the id_inoupdt + * list are rolled back as they are not yet safe. Following the write, + * the changes are once again rolled forward and any actions on the + * id_bufwait list are processed (since those actions are now safe). + * The entries on the id_inoupdt and id_newinoupdt lists must be kept + * sorted by logical block number to speed the calculation of the size + * of the rolled back inode (see explanation in initiate_write_inodeblock). + * When a directory entry is created, it is represented by a diradd. + * The diradd is added to the id_inowait list as it cannot be safely + * written to disk until the inode that it represents is on disk. After + * the inode is written, the id_bufwait list is processed and the diradd + * entries are moved to the id_pendinghd list where they remain until + * the directory block containing the name has been written to disk. + * The purpose of keeping the entries on the id_pendinghd list is so that + * the softdep_fsync function can find and push the inode's directory + * name(s) as part of the fsync operation for that file. 
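+ *
+ * In short (a summary of the common case described above): a delayed
+ * operation starts on id_inowait, moves to id_bufwait when the inode is
+ * copied to its buffer by VOP_UPDATE, and is handed to the work queue
+ * once that buffer has been written to disk.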
+ */
+struct inodedep {
+ struct worklist id_list; /* buffer holding inode block */
+# define id_state id_list.wk_state /* inode dependency state */
+ LIST_ENTRY(inodedep) id_hash; /* hashed lookup */
+ struct fs *id_fs; /* associated filesystem */
+ ino_t id_ino; /* dependent inode */
+ nlink_t id_nlinkdelta; /* saved effective link count */
+ LIST_ENTRY(inodedep) id_deps; /* bmsafemap's list of inodedep's */
+ struct buf *id_buf; /* related bmsafemap (if pending) */
+ long id_savedextsize; /* ext size saved during rollback */
+ off_t id_savedsize; /* file size saved during rollback */
+ struct workhead id_pendinghd; /* entries awaiting directory write */
+ struct workhead id_bufwait; /* operations after inode written */
+ struct workhead id_inowait; /* operations waiting inode update */
+ struct allocdirectlst id_inoupdt; /* updates before inode written */
+ struct allocdirectlst id_newinoupdt; /* updates when inode written */
+ struct allocdirectlst id_extupdt; /* extdata updates pre-inode write */
+ struct allocdirectlst id_newextupdt; /* extdata updates at ino write */
+ union {
+ struct ufs1_dinode *idu_savedino1; /* saved ufs1_dinode contents */
+ struct ufs2_dinode *idu_savedino2; /* saved ufs2_dinode contents */
+ } id_un;
+};
+#define id_savedino1 id_un.idu_savedino1
+#define id_savedino2 id_un.idu_savedino2
+
+/*
+ * A "newblk" structure is attached to a bmsafemap structure when a block
+ * or fragment is allocated from a cylinder group. Its state is set to
+ * DEPCOMPLETE when its cylinder group map is written. It is consumed by
+ * an associated allocdirect or allocindir allocation which will attach
+ * itself to the bmsafemap structure if the newblk's DEPCOMPLETE flag
+ * is not set (i.e., its cylinder group map has not been written).
+ */
+struct newblk {
+ LIST_ENTRY(newblk) nb_hash; /* hashed lookup */
+ struct fs *nb_fs; /* associated filesystem */
+ int nb_state; /* state of bitmap dependency */
+ ufs2_daddr_t nb_newblkno; /* allocated block number */
+ LIST_ENTRY(newblk) nb_deps; /* bmsafemap's list of newblk's */
+ struct bmsafemap *nb_bmsafemap; /* associated bmsafemap */
+};
+
+/*
+ * A "bmsafemap" structure maintains a list of dependency structures
+ * that depend on the update of a particular cylinder group map.
+ * It has lists for newblks, allocdirects, allocindirs, and inodedeps.
+ * It is attached to the buffer of a cylinder group block when any of
+ * these things are allocated from the cylinder group. It is freed
+ * after the cylinder group map is written and the states of its
+ * dependencies are updated with DEPCOMPLETE to indicate that it has
+ * been processed.
+ */
+struct bmsafemap {
+ struct worklist sm_list; /* cylgrp buffer */
+ struct buf *sm_buf; /* associated buffer */
+ struct allocdirecthd sm_allocdirecthd; /* allocdirect deps */
+ struct allocindirhd sm_allocindirhd; /* allocindir deps */
+ struct inodedephd sm_inodedephd; /* inodedep deps */
+ struct newblkhd sm_newblkhd; /* newblk deps */
+};
+
+/*
+ * An "allocdirect" structure is attached to an "inodedep" when a new block
+ * or fragment is allocated and pointed to by the inode described by
+ * "inodedep". The worklist is linked to the buffer that holds the block.
+ * When the block is first allocated, it is linked to the bmsafemap
+ * structure associated with the buffer holding the cylinder group map
+ * from which it was allocated. When the cylinder group map is written
+ * to disk, ad_state has the DEPCOMPLETE flag set. When the block itself
+ * is written, the COMPLETE flag is set. Once both the cylinder group map
+ * and the data itself have been written, it is safe to write the inode
+ * that claims the block. If there was a previous fragment that had been
+ * allocated before the file was increased in size, the old fragment may
+ * be freed once the inode claiming the new block is written to disk.
+ * This ad_freefrag request is attached to the id_inowait list of the
+ * associated inodedep (pointed to by ad_inodedep) for processing after
+ * the inode is written. When a block is allocated to a directory, an
+ * fsync of a file whose name is within that block must ensure not only
+ * that the block containing the file name has been written, but also
+ * that the on-disk inode references that block. When a new directory
+ * block is created, we allocate a newdirblk structure which is linked
+ * to the associated allocdirect (on its ad_newdirblk list). When the
+ * allocdirect has been satisfied, the newdirblk structure is moved to
+ * the inodedep id_bufwait list of its directory to await the inode
+ * being written. When the inode is written, the directory entries are
+ * fully committed and can be deleted from their pagedep->pd_pendinghd
+ * and inodedep->id_pendinghd lists.
+ */
+struct allocdirect {
+ struct worklist ad_list; /* buffer holding block */
+# define ad_state ad_list.wk_state /* block pointer state */
+ TAILQ_ENTRY(allocdirect) ad_next; /* inodedep's list of allocdirect's */
+ ufs_lbn_t ad_lbn; /* block within file */
+ ufs2_daddr_t ad_newblkno; /* new value of block pointer */
+ ufs2_daddr_t ad_oldblkno; /* old value of block pointer */
+ long ad_newsize; /* size of new block */
+ long ad_oldsize; /* size of old block */
+ LIST_ENTRY(allocdirect) ad_deps; /* bmsafemap's list of allocdirect's */
+ struct buf *ad_buf; /* cylgrp buffer (if pending) */
+ struct inodedep *ad_inodedep; /* associated inodedep */
+ struct freefrag *ad_freefrag; /* fragment to be freed (if any) */
+ struct workhead ad_newdirblk; /* dir block to notify when written */
+};
+
+/*
+ * A single "indirdep" structure manages all allocation dependencies for
+ * pointers in an indirect block. The up-to-date state of the indirect
+ * block is stored in ir_saveddata. The set of pointers that may be safely
+ * written to the disk is stored in the safe copy held by ir_savebp. The
+ * state field is used only to track whether the buffer is currently being
+ * written (in which case it is not safe to update the safe copy).
+ * Ir_deplisthd contains the list of allocindir structures, one for each
+ * block that needs to be written to disk. Once the block and its bitmap
+ * allocation have been written, the safe copy can be updated to reflect
+ * the allocation and the allocindir structure freed. If ir_state indicates
+ * that an I/O on the indirect block is in progress when the safe copy is
+ * to be updated, the update is deferred by placing the allocindir on the
+ * ir_donehd list. When the I/O on the indirect block completes, the
+ * entries on the ir_donehd list are processed by updating their
+ * corresponding pointers in the safe copy and then freeing the
+ * allocindir structure.
+ */
+struct indirdep {
+ struct worklist ir_list; /* buffer holding indirect block */
+# define ir_state ir_list.wk_state /* indirect block pointer state */
+ caddr_t ir_saveddata; /* buffer cache contents */
+ struct buf *ir_savebp; /* buffer holding safe copy */
+ struct allocindirhd ir_donehd; /* done waiting to update safecopy */
+ struct allocindirhd ir_deplisthd; /* allocindir deps for this block */
+};
+
+/*
+ * An "allocindir" structure is attached to an "indirdep" when a new block
+ * is allocated and pointed to by the indirect block described by the
+ * "indirdep". The worklist is linked to the buffer that holds the new block.
+ * When the block is first allocated, it is linked to the bmsafemap
+ * structure associated with the buffer holding the cylinder group map
+ * from which it was allocated. When the cylinder group map is written
+ * to disk, ai_state has the DEPCOMPLETE flag set. When the block itself
+ * is written, the COMPLETE flag is set. Once both the cylinder group map
+ * and the data itself have been written, it is safe to write the entry in
+ * the indirect block that claims the block; the "allocindir" dependency
+ * can then be freed as it is no longer applicable.
+ */
+struct allocindir {
+ struct worklist ai_list; /* buffer holding indirect block */
+# define ai_state ai_list.wk_state /* indirect block pointer state */
+ LIST_ENTRY(allocindir) ai_next; /* indirdep's list of allocindir's */
+ int ai_offset; /* pointer offset in indirect block */
+ ufs2_daddr_t ai_newblkno; /* new block pointer value */
+ ufs2_daddr_t ai_oldblkno; /* old block pointer value */
+ struct freefrag *ai_freefrag; /* block to be freed when complete */
+ struct indirdep *ai_indirdep; /* address of associated indirdep */
+ LIST_ENTRY(allocindir) ai_deps; /* bmsafemap's list of allocindir's */
+ struct buf *ai_buf; /* cylgrp buffer (if pending) */
+};
+
+/*
+ * A "freefrag" structure is attached to an "inodedep" when a previously
+ * allocated fragment is replaced with a larger fragment, rather than extended.
+ * The "freefrag" structure is constructed and attached when the replacement
+ * block is first allocated. It is processed after the inode claiming the
+ * bigger block that replaces it has been written to disk. Note that the
+ * ff_state field is used to store the uid, so the value may be truncated.
+ * However, the uid is used only in printing an error message, so the loss
+ * is not critical. Keeping it in a short keeps the data structure down to
+ * 32 bytes.
+ */
+struct freefrag {
+ struct worklist ff_list; /* id_inowait or delayed worklist */
+# define ff_state ff_list.wk_state /* owning user; should be uid_t */
+ struct mount *ff_mnt; /* associated mount point */
+ ufs2_daddr_t ff_blkno; /* fragment physical block number */
+ long ff_fragsize; /* size of fragment being deleted */
+ ino_t ff_inum; /* owning inode number */
+};
+
+/*
+ * A "freeblks" structure is attached to an "inodedep" when the
+ * corresponding file's length is reduced to zero. It records all
+ * the information needed to free the blocks of a file after its
+ * zero'ed inode has been written to disk.
+ */
+struct freeblks {
+ struct worklist fb_list; /* id_inowait or delayed worklist */
+ ino_t fb_previousinum; /* inode of previous owner of blocks */
+ uid_t fb_uid; /* uid of previous owner of blocks */
+ struct vnode *fb_devvp; /* filesystem device vnode */
+ struct mount *fb_mnt; /* associated mount point */
+ long fb_oldextsize; /* previous ext data size */
+ off_t fb_oldsize; /* previous file size */
+ ufs2_daddr_t fb_chkcnt; /* used to check cnt of blks released */
+ ufs2_daddr_t fb_dblks[NDADDR]; /* direct blk ptrs to deallocate */
+ ufs2_daddr_t fb_iblks[NIADDR]; /* indirect blk ptrs to deallocate */
+ ufs2_daddr_t fb_eblks[NXADDR]; /* ext attr blk ptrs to deallocate */
+};
+
+/*
+ * A "freefile" structure is attached to an inode when its
+ * link count is reduced to zero. It marks the inode as free in
+ * the cylinder group map after the zero'ed inode has been written
+ * to disk and any associated blocks and fragments have been freed.
+ */
+struct freefile {
+ struct worklist fx_list; /* id_inowait or delayed worklist */
+ mode_t fx_mode; /* mode of inode */
+ ino_t fx_oldinum; /* inum of the unlinked file */
+ struct vnode *fx_devvp; /* filesystem device vnode */
+ struct mount *fx_mnt; /* associated mount point */
+};
+
+/*
+ * A "diradd" structure is linked to an "inodedep" id_inowait list when a
+ * new directory entry is allocated that references the inode described
+ * by "inodedep". When the inode itself is written (either the initial
+ * allocation for new inodes or with the increased link count for
+ * existing inodes), the COMPLETE flag is set in da_state. If the entry
+ * is for a newly allocated inode, the "inodedep" structure is associated
+ * with a bmsafemap which prevents the inode from being written to disk
+ * until the cylinder group has been updated. Thus the da_state COMPLETE
+ * flag cannot be set until the inode bitmap dependency has been removed.
+ * When creating a new file, it is safe to write the directory entry that
+ * claims the inode once the referenced inode has been written. Since
+ * writing the inode clears the bitmap dependencies, the DEPCOMPLETE flag
+ * in the diradd can be set unconditionally when creating a file. When
+ * creating a directory, there are two additional dependencies described by
+ * mkdir structures (see their description below). When these dependencies
+ * are resolved the DEPCOMPLETE flag is set in the diradd structure.
+ * If there are multiple links created to the same inode, there will be
+ * a separate diradd structure created for each link. The diradd is
+ * linked onto the pd_diraddhd list of the pagedep for the directory
+ * page that contains the entry. When a directory page is written,
+ * the pd_diraddhd list is traversed to roll back any entries that are
+ * not yet ready to be written to disk. If a directory entry is being
+ * changed (by rename) rather than added, the DIRCHG flag is set and
+ * the da_previous entry points to the entry that will be "removed"
+ * once the new entry has been committed. During rollback, entries
+ * with da_previous are replaced with the previous inode number rather
+ * than zero.
+ *
+ * The overlaying of da_pagedep and da_previous is done to keep the
+ * structure down to 32 bytes in size on a 32-bit machine. If a
+ * da_previous entry is present, the pointer to its pagedep is available
+ * in the associated dirrem entry. If the DIRCHG flag is set, the
+ * da_previous entry is valid; if not set the da_pagedep entry is valid.
+ * The DIRCHG flag never changes; it is set when the structure is created
+ * if appropriate and is never cleared.
+ */
+struct diradd {
+ struct worklist da_list; /* id_inowait or id_pendinghd list */
+# define da_state da_list.wk_state /* state of the new directory entry */
+ LIST_ENTRY(diradd) da_pdlist; /* pagedep holding directory block */
+ doff_t da_offset; /* offset of new dir entry in dir blk */
+ ino_t da_newinum; /* inode number for the new dir entry */
+ union {
+ struct dirrem *dau_previous; /* entry being replaced in dir change */
+ struct pagedep *dau_pagedep; /* pagedep dependency for addition */
+ } da_un;
+};
+#define da_previous da_un.dau_previous
+#define da_pagedep da_un.dau_pagedep
+
+/*
+ * Two "mkdir" structures are needed to track the additional dependencies
+ * associated with creating a new directory entry. Normally a directory
+ * addition can be committed as soon as the newly referenced inode has been
+ * written to disk with its increased link count. When a directory is
+ * created there are two additional dependencies: writing the directory
+ * data block containing the "." and ".." entries (MKDIR_BODY) and writing
+ * the parent inode with the increased link count for ".." (MKDIR_PARENT).
+ * These additional dependencies are tracked by two mkdir structures that
+ * reference the associated "diradd" structure. When they have completed,
+ * they set the DEPCOMPLETE flag on the diradd so that it knows that its
+ * extra dependencies have been completed. The md_state field is used only
+ * to identify which type of dependency the mkdir structure is tracking.
+ * It is not used in the mainline code for any purpose other than consistency
+ * checking. All the mkdir structures in the system are linked together on
+ * a list. This list is needed so that a diradd can find its associated
+ * mkdir structures and deallocate them if it is prematurely freed (as for
+ * example if a mkdir is immediately followed by a rmdir of the same directory).
+ * In that case, freeing the diradd must traverse the list to find the
+ * associated mkdir structures that reference it. The deletion would be faster
+ * if the diradd structure were simply augmented to have two pointers that
+ * referenced the associated mkdir's. However, this would grow the diradd
+ * structure beyond its current 32 bytes just to speed a very infrequent
+ * operation.
+ */
+struct mkdir {
+ struct worklist md_list; /* id_inowait or buffer holding dir */
+# define md_state md_list.wk_state /* type: MKDIR_PARENT or MKDIR_BODY */
+ struct diradd *md_diradd; /* associated diradd */
+ struct buf *md_buf; /* MKDIR_BODY: buffer holding dir */
+ LIST_ENTRY(mkdir) md_mkdirs; /* list of all mkdirs */
+};
+LIST_HEAD(mkdirlist, mkdir) mkdirlisthd;
+
+/*
+ * A "dirrem" structure describes an operation to decrement the link
+ * count on an inode. The dirrem structure is attached to the pd_dirremhd
+ * list of the pagedep for the directory page that contains the entry.
+ * It is processed after the directory page with the deleted entry has
+ * been written to disk.
+ *
+ * The overlaying of dm_pagedep and dm_dirinum is done to keep the
+ * structure down to 32 bytes in size on a 32-bit machine. It works
+ * because they are never used concurrently.
+ */
+struct dirrem {
+ struct worklist dm_list; /* delayed worklist */
+# define dm_state dm_list.wk_state /* state of the old directory entry */
+ LIST_ENTRY(dirrem) dm_next; /* pagedep's list of dirrem's */
+ struct mount *dm_mnt; /* associated mount point */
+ ino_t dm_oldinum; /* inum of the removed dir entry */
+ union {
+ struct pagedep *dmu_pagedep; /* pagedep dependency for remove */
+ ino_t dmu_dirinum; /* parent inode number (for rmdir) */
+ } dm_un;
+};
+#define dm_pagedep dm_un.dmu_pagedep
+#define dm_dirinum dm_un.dmu_dirinum
+
+/*
+ * A "newdirblk" structure tracks the progress of a newly allocated
+ * directory block from its creation until it is claimed by its on-disk
+ * inode. When a block is allocated to a directory, an fsync of a file
+ * whose name is within that block must ensure not only that the block
+ * containing the file name has been written, but also that the on-disk
+ * inode references that block. When a new directory block is created,
+ * we allocate a newdirblk structure which is linked to the associated
+ * allocdirect (on its ad_newdirblk list). When the allocdirect has been
+ * satisfied, the newdirblk structure is moved to the inodedep id_bufwait
+ * list of its directory to await the inode being written. When the inode
+ * is written, the directory entries are fully committed and can be
+ * deleted from their pagedep->pd_pendinghd and inodedep->id_pendinghd
+ * lists. Note that we could track directory blocks allocated to indirect
+ * blocks using a similar scheme with the allocindir structures. Rather
+ * than adding this level of complexity, we simply write those newly
+ * allocated indirect blocks synchronously as such allocations are rare.
+ */
+struct newdirblk {
+ struct worklist db_list; /* id_inowait or pg_newdirblk */
+# define db_state db_list.wk_state /* unused */
+ struct pagedep *db_pagedep; /* associated pagedep */
+};
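
Each dependency structure above begins with a worklist, so a work item can be
passed around by its generic header and then narrowed with the WK_*() macros
once its wk_type is known. The sketch below illustrates that dispatch pattern;
the D_* type constants and handle_workitem_*() handler names follow the naming
used by ffs_softdep.c, but the routine itself is only an illustration, not a
copy of the kernel code.

/*
 * Illustrative only: dispatch a queued work item to a type-specific
 * handler.  The D_* constants and handle_workitem_*() names mirror
 * those used by ffs_softdep.c; they are assumed, not defined here.
 */
static void
handle_workitem(struct worklist *wk)
{

	switch (wk->wk_type) {
	case D_DIRREM:
		/* directory page written; decrement the inode link count */
		handle_workitem_remove(WK_DIRREM(wk));
		break;
	case D_FREEBLKS:
		/* zero'ed inode written; release the file's old blocks */
		handle_workitem_freeblocks(WK_FREEBLKS(wk));
		break;
	case D_FREEFRAG:
		/* inode claiming the bigger block written; free the fragment */
		handle_workitem_freefrag(WK_FREEFRAG(wk));
		break;
	case D_FREEFILE:
		/* zero'ed inode written; mark it free in the cylinder group */
		handle_workitem_freefile(WK_FREEFILE(wk));
		break;
	default:
		panic("handle_workitem: unknown type %d", wk->wk_type);
	}
}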
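
The pd_diraddhd[] buckets in the pagedep are selected by hashing the byte
offset of the new entry within its directory block, which is what DIRADDHASH()
computes. A small sketch of that lookup (diradd_bucket() is a hypothetical
helper, shown only to make the hashing concrete):

/*
 * Illustrative only: with DAHASHSZ == 5, an entry at byte offset 12 in
 * its directory block hashes to bucket (12 >> 2) % 5 == 3, while an
 * entry at offset 1000 hashes to (1000 >> 2) % 5 == 0.
 */
static struct diraddhd *
diradd_bucket(struct pagedep *pagedep, int offset)
{

	return (&pagedep->pd_diraddhd[DIRADDHASH(offset)]);
}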
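
The inodedep's two-stage hand-off described above (id_inowait to id_bufwait at
VOP_UPDATE time, and id_newinoupdt to id_inoupdt) can be pictured with the
following sketch. It assumes the workhead list type declared earlier in this
header; the real code in ffs_softdep.c does this item by item and keeps the
allocdirect lists sorted by logical block number, which this simplified
version glosses over.

/*
 * Illustrative only: roughly what happens to an inodedep when the
 * in-core inode is copied to its buffer (VOP_UPDATE).  The real
 * routine merges id_newinoupdt into id_inoupdt in lbn-sorted order.
 */
static void
inodedep_update_lists(struct inodedep *inodedep)
{
	struct allocdirect *adp;
	struct worklist *wk;

	/* newly recorded block allocations now apply to the buffered inode */
	while ((adp = TAILQ_FIRST(&inodedep->id_newinoupdt)) != NULL) {
		TAILQ_REMOVE(&inodedep->id_newinoupdt, adp, ad_next);
		TAILQ_INSERT_TAIL(&inodedep->id_inoupdt, adp, ad_next);
	}
	/* delayed operations now wait only for the inode buffer write */
	while ((wk = LIST_FIRST(&inodedep->id_inowait)) != NULL) {
		LIST_REMOVE(wk, wk_list);
		LIST_INSERT_HEAD(&inodedep->id_bufwait, wk, wk_list);
	}
}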