Generic binary search in Rust

I’ve been using Rust lately, and I really love it.  I wrote a little binary search algorithm that’s generic over the container type, the index type, and the element type, and I thought I’d share it.

use std::cmp::Ordering;
use std::ops::Add;
use std::ops::Div;
use std::ops::Index;
use std::ops::Shr;
use std::ops::Sub;

pub trait BinarySearch<I, T> {
  /// Returns the index of an equal key, or the index at which the key would be inserted if one does not exist.
  fn binary_search(&self, key: &T, lower: I, upper: I) -> I;
}

// Would use this, but std::num::One is unstable:
// impl<C, I, T> BinarySearch<I, T> for C
//   where
//     C: Index<I, Output = T>,
//     I: Ord + Add<Output = I> + Sub<Output = I> + Shr<I, Output = I> + One + Copy,
//     T: Ord {
//   fn binary_search(&self, key: &T, mut lower: I, mut upper: I) -> I {
//     let one = I::one();
//     loop {
//       if upper <= lower {
//         return lower;
//       }
//       let mid = lower + (upper - lower).shr(one);
//       match key.cmp(&self[mid]) {
//         Ordering::Less => upper = mid,
//         Ordering::Greater => lower = mid + one,
//         Ordering::Equal => return mid,
//       }
//     }
//   }
// }

impl<C, I, T> BinarySearch<I, T> for C
  where
    C: Index<I, Output = T>,
    I: Ord + Add<Output = I> + Sub<Output = I> + Div<Output = I> + Shr<I, Output = I> + Copy,
    T: Ord {
  fn binary_search(&self, key: &T, mut lower: I, mut upper: I) -> I {
    if upper == lower {
      return lower;
    }
    // Conjure up a one without std::num::One: x / x == 1 for any nonzero x.
    let one = (upper - lower) / (upper - lower);
    loop {
      if upper <= lower {
        return lower;
      }
      // lower + (upper - lower) >> 1 rather than (lower + upper) >> 1, to avoid overflow
      let mid = lower + (upper - lower).shr(one);
      match key.cmp(&self[mid]) {
        Ordering::Less => upper = mid,
        Ordering::Greater => lower = mid + one,
        Ordering::Equal => return mid,
      }
    }
  }
}

#[cfg(test)]
mod tests {
  use super::*;

  #[test]
  fn test_binary_search() {
    let data = vec![0, 0, 1, 2, 4, 7];
    for i in 0..data.len() {
      let ix = data.binary_search(&data[i], 0, data.len());
      assert_eq!(data[ix], data[i]); // ix may not be i because of duplicates
    }
    assert_eq!(data.binary_search(&-1, 0, data.len()), 0);
    assert_eq!(data.binary_search(&3, 0, data.len()), 4);
    assert_eq!(data.binary_search(&5, 0, data.len()), 5);
    assert_eq!(data.binary_search(&8, 0, data.len()), 6);
  }
}


TailPlot 1.4 released

TailPlot version 1.4 is now out.  The latest version displays the values of points when the mouse is near them.  Get it while it’s hot!


TailPlot

A couple of years ago I started an application called TailPlot to plot live data from a file.  The idea is that it’s like tail -f, but the output is plotted (hence the name).  I’m just now getting around to mentioning it here.  It’s intended to be easy to use, so it’s just a JAR that can be run without installation.  I encourage everyone to submit feature requests and bug reports through GitHub.

Sending special keystrokes to remote/virtual computers

I recently had a need to send Ctrl-Alt-F1 to a VM, but it appears impossible. All of the applications I’ve tried only support sending certain common special keystrokes, like Ctrl-Alt-Del. Why don’t VM interfaces and VNC clients support sending arbitrary special keystrokes?

Several applications have an option to grab “all possible” keystrokes, but this isn’t good enough. First of all, they don’t capture all keys. (Specifically, they all miss Ctrl-Alt-F1, at least on my laptop.) Secondly, the user doesn’t know whether the keystroke will be captured or not. If it’s a dangerous keystroke like Ctrl-Alt-Del or Ctrl-Alt-Backspace, the try-and-see approach has obvious drawbacks. This approach also relies on support from the host OS’s windowing system and is not portable.

A few applications support a wide range of specific keystrokes, but this only works if you can list every possible keystroke a user might want to send. This would be hard enough with only “standard” special keystrokes, but becomes practically impossible when considering that users may have customized global keyboard shortcuts.

The real solution is to provide a way to enter arbitrary keystrokes. Possibilities include virtual keyboards and textual representations, such as “Ctrl-Alt-F1” (or “C-M-”, or whatever your favorite representation is).

I’ve filed bugs with the apps I’ve tested, at least the ones with bug trackers.

(I’d like to clarify that I’m not making the point that there are no applications that support entering arbitrary keystrokes, especially since it’s impossible to test all applications. My point is that it’s a basic feature that they should all support.)

BASH prompt

I know it’s cliché, but I thought I’d post my BASH prompt just in case anyone finds it useful. Features include:

  • username, host, and directory
  • number of background jobs
  • exit status of last command
  • wall-clock time of last command
  • current date/time
  • horizontal rule to separate outputs
  • resizable


The duration of the last command is based on the actual duration, not just time between prompt displays, so it doesn’t include time spent sitting idle at the prompt.  By resizable, I mean that the width of the horizontal rule is based on the width of the terminal, so resizing the window will affect the size of new horizontal rules.

The following is the code I put in my .bashrc, ignoring checks on whether features like color are enabled.  It assumes you have bc installed, but it could theoretically be re-written to use BASH arithmetic.  I used a trick posted on superuser to run commands before the user’s command.

_ps1_preexec () {
    [ -n "$COMP_LINE" ] && return  # do nothing if completing
    [ "$BASH_COMMAND" = "$PROMPT_COMMAND" ] && return # don't cause a preexec for $PROMPT_COMMAND
    _ps1_start_ts=${_ps1_start_ts:-$(date '+scale=3; %s+%N/1000000000' | bc)}
}

trap '_ps1_preexec' DEBUG

function _ps1_display_prompt2 {
    if [ "$_ps1_start_ts" != "" ]; then
        _ps1_duration=$(date "+scale=3; %s+%N/1000000000-$_ps1_start_ts" | bc)
    else
        unset _ps1_duration
    fi
    unset _ps1_start_ts
}

PROMPT_COMMAND=_ps1_display_prompt2


function _ps1_display_prompt {
    local boxch_h="\u2500"
    local bold="\033[01m"
    local reset="\033[00m"
    local black="\033[30m"
    local red="\033[31m"
    local green="\033[32m"
    local blue="\033[34m"
    local s1="${USER}@$(hostname)"
    local s2="$1"
    local s3="$(date)"
    local status=""
    local exit_status="$2"
    local jobcount="$3"
    if [ "$exit_status" != "0" ]; then
        status="$status ${bold}${red}exit:$exit_status$reset"
    fi
    if [ "$jobcount" != "0" ]; then
        status="$status jobs:$jobcount"
    fi
    if [ "$_ps1_duration" != "" ]; then
        status="$status duration:${_ps1_duration}s"
    fi
    if [ "$status" != "" ]; then
        status="${status:1}" # strip leading space
    fi
    local status_stripped=$(echo "$status" | sed -r 's/\\033\[[^m]+m//g')

    local hrlen=$(($(tput cols) - ${#s1} - ${#s2} - 3 - ${#s3} - 2))
    local status_box=""
    if [ "$status" != "" ]; then
        status_box="[$status]"
        hrlen=$(($hrlen - ${#status_stripped} - 3))
    fi
    local hr=$(for alias_hr_i in $(seq 1 $hrlen); do echo -en $boxch_h; done)
    echo -en "[${bold}${green}${s1}${reset}:${bold}${blue}${s2}${reset}]${status_box}${hr}[${bold}${black}${s3}${reset}]\n\$ "
}

PS1='$(_ps1_display_prompt "\w" $? \j)'

Fixing Spark’s zip

I’ve been using Apache Spark recently and like it quite a lot, although it still has several rough edges.  One that I ran into is a quirk in RDD.zip.

I had two RDDs of equal length, but when I zipped them together, the zipped RDD had fewer elements than its parents.  Looking at the documentation for zip, it says:

Assumes that the two RDDs have the *same number of partitions* and the *same number of elements in each partition* (e.g. one was made through a map on the other).

So having equal-length RDDs is not sufficient.  For example, an RDD with the elements (a, b, c) partitioned as ([a, b], [c]) cannot be correctly zipped with (A, B, C) partitioned as ([A], [B, C]). I’d like to point out that this is a horrible violation of the principle of least astonishment.  Unfortunately, Spark does not provide a way to zip equal-length but unequally-partitioned RDDs. (If any Spark developers are reading this, it would be a much-appreciated feature.)

Though it is incredibly inefficient, I wrote an implementation of a general zip function based on what Spark does provide. It’s quite a mess in Java; I imagine it’s a bit nicer in Scala. I tested it by zipping two RDDs of 1000 elements, each element being a 1000-byte array. It takes about 400ms on my older, dual-core laptop. I haven’t yet had the chance to test this in a distributed setting.

The index function may be useful for other things, too, such as creating an RDD of the first k elements, rather than calling RDD.take(k), which returns the elements to the client instead of leaving them in a distributed RDD.

Again, the performance is terrible, but this may still be useful if other work dominates the computation.
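The key bookkeeping step in the index function below is converting per-partition element counts into per-partition start offsets with a running sum. Isolated from Spark (the class and method names here are mine, for illustration only), it looks like this:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PartitionOffsets {
    // Convert partition sizes into the start index of each partition.
    // E.g. sizes [2, 1] (partitions [a, b] and [c]) yield offsets [0, 2].
    static List<Integer> offsets(List<Integer> sizes) {
        List<Integer> out = new ArrayList<>();
        int offset = 0;
        for (int size : sizes) {
            out.add(offset);   // this partition starts where the previous one ended
            offset += size;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(offsets(Arrays.asList(2, 1)));  // [0, 2]
    }
}
```

Each element's global index is then its partition's start offset plus its position within the partition.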


public static <V1, V2> JavaPairRDD<V1, V2> zip(JavaRDD<V1> rdd1, JavaRDD<V2> rdd2) {
    return JavaPairRDD.fromJavaRDD(index(rdd1).join(index(rdd2)).values());
}

/**
 * Keys an RDD by the index of each element.
 */
public static <T> JavaPairRDD<Integer, T> index(JavaRDD<T> rdd) {
    // The RDD consists of a single entry, which is a list of the sizes of each partition.
    // mapPartitions emits one count per partition; coalescing to one partition and
    // glomming collects those counts, in order, into a single list.
    JavaRDD<List<Integer>> partitionSizesRDD = rdd.mapPartitions(new FlatMapFunction<Iterator<T>, Integer>() {
        public Iterable<Integer> call(Iterator<T> arg0) throws Exception {
            int count = 0;
            while(arg0.hasNext()) {
                arg0.next();
                count++;
            }
            return Collections.singleton(count);
        }
    }).coalesce(1).glom();
    // The RDD consists of a single entry, which is a list of the start indices of each partition
    JavaRDD<List<Integer>> partitionOffsetsRDD = partitionSizesRDD
            .map(new Function<List<Integer>, List<Integer>>() {
                public List<Integer> call(List<Integer> arg0) throws Exception {
                    int offset = 0;
                    List<Integer> out = new ArrayList<>();
                    for(Integer i : arg0) {
                        out.add(offset);
                        offset += i;
                    }
                    return out;
                }
            });
    ClassManifest<Tuple2<Integer, T>> cm = (ClassManifest) scala.reflect.ClassManifest$.MODULE$
            .fromClass(Tuple2.class);
    // Since partitionOffsetsRDD has a single element, this has the effect of pairing each partition with the list of all partition offsets
    JavaPairRDD<List<T>, List<Integer>> cartesian = rdd.glom().cartesian(partitionOffsetsRDD);
    // Since the RDD was already glommed, each partition has a single element.
    // We use the partition index to find the correct partition offset in the list of offsets.
    // Each individual element is then output with its index.
    JavaRDD<Tuple2<Integer, T>> result = cartesian.mapPartitionsWithIndex(
            new Function2<Object, Iterator<Tuple2<List<T>, List<Integer>>>, Iterator<Tuple2<Integer, T>>>() {
                public Iterator<Tuple2<Integer, T>> call(Object arg0, Iterator<Tuple2<List<T>, List<Integer>>> arg1)
                        throws Exception {
                    int partition = (Integer) arg0;
                    Tuple2<List<T>, List<Integer>> t = arg1.next();
                    assert !arg1.hasNext();
                    final int initialOffset = t._2().get(partition);
                    final Iterator<T> itr = t._1().iterator();
                    return new Iterator<Tuple2<Integer, T>>() {
                        int offset = initialOffset;

                        public boolean hasNext() {
                            return itr.hasNext();
                        }

                        public Tuple2<Integer, T> next() {
                            return new Tuple2<Integer, T>(offset++, itr.next());
                        }

                        public void remove() {
                            throw new UnsupportedOperationException();
                        }
                    };
                }
            }, true, cm);
    return JavaPairRDD.fromJavaRDD(result);
}

Swing dashed line performance

Drawing dashed lines in Swing can be surprisingly slow, and worse, the default implementation seems to be sub-optimal.  Of course, for most apps, this isn’t a big deal, but performance-sensitive apps that draw a lot of dashed lines (or draw them often) can suffer.

The first thing to optimize is to only draw the portion of the line within the clipping region.  For example, if the clipping rectangle is (0,0)-(100,100), then drawing the line (0,0)-(1000,0) takes much more time than drawing (0,0)-(100,0).  It seems that Swing still processes each individual dash, even if they will never be drawn.  This is slightly tricky when the beginning of the line is outside the clipping region.  For example, consider a line with a dash length of 5 and a space length of 5.  Drawing (0,0)-(100,0) is not the same as (-25,0)-(100,0) because the phase of the dash is different.  (The first will start with a dash at X=0, but the second will start with a space.)  To clip the line properly, the distance between the original start point and the clipped start point must be an integer multiple of the period, which in this case is 10.  Therefore, the clipped line would be (-5,0)-(100,0).

To really squeeze out the last bit of performance, draw the dashes manually.  In other words, instead of using a dashed stroke and drawing (0,0)-(100,0), use a plain stroke and draw (0,0)-(5,0), (10,0)-(15,0), etc.  It seems like this should actually reduce performance, but it doesn’t.
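To make the phase bookkeeping concrete, here is a sketch of the clipping computation described above (the class and method names are mine; this is not code from the benchmark):

```java
class DashClip {
    // Move the start of a horizontal line rightward toward the clip edge,
    // but only by a whole number of dash periods, so the phase of the dash
    // pattern is unchanged.
    static int clipStartPreservingPhase(int lineStartX, int clipStartX, int period) {
        if (lineStartX >= clipStartX) {
            return lineStartX;  // start is already inside the clip region
        }
        // Largest whole number of periods that fits before the clip edge
        int wholePeriods = (clipStartX - lineStartX) / period;
        return lineStartX + wholePeriods * period;
    }

    public static void main(String[] args) {
        // The example from the text: dash 5, space 5, so the period is 10.
        System.out.println(clipStartPreservingPhase(-25, 0, 10));  // -5
    }
}
```

With a dash length of 5 and space of 5, clipping (-25,0)-(100,0) against x=0 moves the start to -5, two full periods to the right, so the visible dashes land in exactly the same places.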

I wrote a micro-benchmark to time these approaches.  It uses a window that is 200 pixels wide and draws a line from -1000 to 1200.  The results on my laptop are:

Approach                   Time (microseconds)
Naive                      734.7
Clipped                    126.4
Clipped & manual dashes     30.4
Non-dashed line              3.9

As you can see, the results are quite dramatic.

As with any optimization, results may vary.  I’m running this on a Linux laptop with a Java 1.7 HotSpot JVM.

Here’s the benchmark:  (I had to fiddle with the extension.  WordPress refuses to let me attach a .java file “for security reasons”, even though I’m an admin and I enabled the “Admins can upload any type of file” option.  Go figure.)

Hybrid bitmap/vector plots with gnuplot

I have a LaTeX document that includes a plot of a large dataset generated with gnuplot.  Everything’s fine, except that the resulting PDF is huge because of the enormous number of points in the plot.  A quick fix for this would be to switch the gnuplot terminal from epslatex to png.  Unfortunately, that also affects the text in the plot, which no longer matches the text in the document.  What is really needed is a pnglatex terminal.  Since that doesn’t exist, I hacked it.

The idea is to take advantage of the fact that the epslatex terminal already splits the text from the graphics.  First, we convert the .eps file to a .png with ImageMagick:

convert -density 288 myplot.eps myplot.png

The density argument is the DPI at which to generate the PNG.  This may need to be tweaked to minimize pixelation and aliasing artifacts.  Depending on your PDF viewer, higher may not necessarily be better.  Next, we change the generated .tex file to include our .png instead of the .eps:

sed -r 's:\\includegraphics\{([^}]+)\}:\\includegraphics[scale=0.25]{\1.png}:g' < myplot.tex > myplot-png.tex

The default DPI seems to be 72, so the scale argument should be 72 divided by the density argument to ImageMagick.  Finally, include myplot-png.tex instead of myplot.tex in the main document.

Note that rasterization necessarily loses information, and the result is of lower quality than using the epslatex terminal.  The resulting PDF can be much smaller, though, so the tradeoff may be worth it.

Overhead of the modulo operator

While optimizing a number-crunching algorithm, I noticed (almost by accident) that a large amount of time was spent performing modulo/remainder operations (%) used for wrapping around the ends of arrays.  I found that changing

i = (i + 1) % n;


i++;
while(i >= n) {
  i -= n;
}
sped up my code.  I decided to investigate how fast the common integer arithmetic operators are.  I ran some tests using a microbenchmarking tool I wrote.  (It takes care of warming up the code for JIT compiling, randomizing test order, and repeating tests to calculate variability.)  For comparison, I wrote a similar tool in C++ and tested the operations with it, too.  I included three ways of incrementing with wraparound:

  • i = (i + 1) % n;
  • i++; while(i >= n) i -= n;
  • i++; if(i >= n) i -= n;
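As a concrete reference, the three variants can be written as small methods (a sketch, not the benchmark harness itself; the names are mine):

```java
class Wraparound {
    // In all three variants, i is assumed to be a valid index (0 <= i < n)
    // before the increment, so it wraps to 0 after reaching n - 1.
    static int incrementModulo(int i, int n) {
        return (i + 1) % n;
    }

    static int incrementWhile(int i, int n) {
        i++;
        while (i >= n) {
            i -= n;
        }
        return i;
    }

    static int incrementIf(int i, int n) {
        i++;
        if (i >= n) {
            i -= n;
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(incrementModulo(5, 6));  // 0
        System.out.println(incrementWhile(5, 6));   // 0
        System.out.println(incrementIf(5, 6));      // 0
    }
}
```

All three are equivalent under that precondition; they differ only in the instructions they compile down to.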

Here are the results.  Times are the time to perform a single operation:

Operation              Java                    C++                     C++ (optimization enabled)
Addition               0.3993 +/- 0.0005 ns    2.5030 +/- 0.0020 ns    0.1530 +/- 0.0000 ns
Multiplication         1.2647 +/- 0.0075 ns    3.1280 +/- 0.0070 ns    1.2880 +/- 0.0020 ns
Division               11.1591 +/- 0.0807 ns   13.2260 +/- 0.0050 ns   11.3220 +/- 0.0020 ns
Modulo                 12.6602 +/- 0.2044 ns   14.9750 +/- 0.2280 ns   12.0850 +/- 0.2150 ns
Increment with modulo  13.4542 +/- 0.0289 ns   15.3370 +/- 0.0150 ns   13.0320 +/- 0.0040 ns
Increment with while   0.7853 +/- 0.0553 ns    3.5040 +/- 0.3080 ns    1.4430 +/- 0.0810 ns
Increment with if      1.1490 +/- 0.0611 ns    3.2240 +/- 0.2540 ns    1.7230 +/- 0.0000 ns

Why are there differences between the Java and C++ results?  I’m not sure, but they are reproducible and consistent.  Almost certainly, it boils down to the exact native opcodes the code is (just-in-time) compiled to.

Also surprising is that incrementing with a “while” check is faster than incrementing with an “if” check.  I expected the “while” to be slower, because it requires an additional comparison after i is decremented.  Whatever the cause is, it is reproducible and fairly consistent across Java and C++.

Of course, results will vary by compiler, JVM, and architecture, so this should all be taken with a grain of salt.

Logging file names and line numbers with Boost::Log

I recently started using Boost::Log, and in general, I like it.  One glaring deficiency, though, is that it doesn’t log file names and line numbers.  This is really irritating, especially since it’s such basic functionality, and it’s the one thing all logging frameworks do.  I wrote a bit of C++ that fixes this by attaching the file name and line number as custom attributes to the log record.  I’ve attached a sample that shows how to use it.

Happy logging!