We develop benchmarks that evaluate how well large language models edit code, since existing benchmarks proved insufficient for this task. We also fine-tune models specifically for code editing.