Hi everyone! Boris from the Claude Code team here. @eschluntz, @catherinewu, @wolffiex, @bdr and I will be around for the next hour or so and we'll do our best to answer your questions about the product.
The Kagi LLM benchmark has been updated with general-purpose and thinking-mode results for Sonnet 3.7.
https://help.kagi.com/kagi/ai/llm-benchmark.html
It appears to be the second most capable general-purpose LLM we've tried (second to Gemini 2.0 Pro, ahead of GPT-4o). It's less impressive in thinking mode, at about the same level as o1-mini and o3-mini (with an 8,192-token thinking budget).
Overall a very nice update: you get a higher-quality, higher-speed model at the same price.
Hope to enable it in Kagi Assistant within 24h!
You can get your HN profile analyzed by it and it's pretty funny :)
I'm using this to test the humor of new models.
I'm somewhat impressed by the very first interaction I had with Claude 3.7 Sonnet. I prompted it to find a problem in my codebase where a Cloudflare Pages function would return a 500 plus a nonsensical error and an empty response in prod. I'd tried to figure this out all Friday. It was super annoying to fix, as there was no way to add more logging or get any visibility into the issue: the script died before outputting anything.
o1, o3, and Claude 3.5 all failed to help me in any way with this, but Claude 3.7 not only found the correct issue with its first answer (after thinking for 39 seconds) but then went on to write me a working function to work around the issue with the second prompt. (I'm going to let it write some tests later, but stopped here for now.)
I assume it doesn't let me share the discussion because I connected my GitHub repo to the conversation (a new feature in the web chat UI launched today), but I copied it as a gist here: https://gist.github.com/Uninen/46df44f4307d324682dabb7aa6e10...
I got this working with my LLM tool (new plugin version: llm-anthropic 0.14) and figured out a bunch of things about the model in the process. My detailed notes are here: https://simonwillison.net/2025/Feb/25/llm-anthropic-014/
One of the most exciting new capabilities is that this model has a 128,000 token output limit - up from just 8,000 for the previous Claude 3.5 Sonnet model and way higher than any other model in the space.
It seems to be able to use that output limit effectively. Here's my longest result so far, though it did take 27 minutes to finish! https://gist.github.com/simonw/854474b050b630144beebf06ec4a2...
Anthropic doubling down on code makes sense; that has been their strong suit compared to all the other models.
Curious how their Devin competitor will pan out, given Devin's challenges.
> "[..] in developing our reasoning models, we’ve optimized somewhat less for math and computer science competition problems, and instead shifted focus towards real-world tasks that better reflect how businesses actually use LLMs.”
This is good news. OpenAI seems to be aiming towards "the smartest model," but in practice, LLMs are used primarily as learning aids, data transformers, and code writers.
Balancing "intelligence" with "get shit done" seems to be the sweet spot, and afaict one of the reasons the current crop of developer tools (Cursor, Windsurf, etc.) prefer Claude 3.5 Sonnet over 4o.
This AI race is happening so fast. Seems like it to me anyway. As a software developer/engineer I am worried about my job prospects.. time will tell. I am wondering what will happen to the west coast housing bubbles once software engineers lose their high price tags. I guess the next wave of knowledge workers will move in and take their place?
It redid half of my BSc thesis in less than 30s :|
https://claude.ai/share/ed8a0e55-633f-4056-ba70-772ab5f5a08b
edit: Here's the output figure https://i.imgur.com/0c65Xfk.png
edit 2: Gemini Flash 2 failed miserably https://g.co/gemini/share/10437164edd0
I updated Cursor to the latest 0.46.3 and manually added "claude-3.7-sonnet" to the model list and it appears to work already.
"claude-3.7-sonnet-thinking" works as well. Apparently controls for thinking time will come soon: https://x.com/sualehasif996/status/1894094715479548273
I'm about 50kloc into a project making a react native app / golang backend for recipes with grocery lists, collaborative editing, household sharing, so a complex data model and runtime. Purely from the experiment of "what's it like to build with AI, no lines of code directly written, just directing the AI."
As I go through features, I'm comparing a matrix of Cursor, Cline, and Roo, with the various models.
While I'm still working on the final product, there's no doubt in my mind that Sonnet is the only model that works well enough with these tools to be agentic (rather than doing single-file work).
I'm really excited to now compare this 3.7 release and how good it is at avoiding some of the traps 3.5 can fall into.
Very good. Claude Code is extremely nice, but as others have said, if you let it go on its own it burns through your money pretty fast.
I had it build a web scraper from scratch, figuring out the "API" of a website using a GitHub project in another language for hints. While everything was working in the end, I saw 100k+ tokens being sent too frequently for apparently simple requests. Something feels off; there seem to be quite a few opportunities to reduce token usage.
I can just say that this is awesome. I just spent $10 and a handful of queries to spin up an app idea I'd had for a while.
The basic idea is working; it handled everything for me.
From setting up the Node environment, to creating the directories and files, patching the files, running code, handling errors, and patching again. From time to time it fails to detect its own faults, but when I pinpoint them, it gets it right most of the time. And the UI is actually prettier than what I would have crafted in v1.
When this gets cheaper and better with each iteration, everybody will have a full dev team for a couple of bucks.
They don't say this, but from querying it, they also seem to have updated the knowledge cutoff from April 2024 ("3.6") to October 2024 (3.7)
Drawing an SVG of a pelican on a bicycle. Claude 3.7 edition: https://x.com/umaar/status/1894114767079403747
When you ask: 'How many r's are in strawberry?'
Claude 3.7 Sonnet generates a response in a fun and cool way with React code and a preview in Artifacts
check out some examples:
[1]https://claude.ai/share/d565f5a8-136b-41a4-b365-bfb4f4400df5
[2]https://claude.ai/share/a817ac87-c98b-4ab0-8160-feefd7f798e8
To me the biggest surprise was seeing Grok dominate in all of their published benchmarks. I haven't seen any independent benchmarks of it yet (and I take the published ones with a giant heap of salt), but it's still interesting nevertheless.
I’m rooting for Anthropic.
It's pretty fascinating to refresh the usage page on the API site while working [0].
After initialization it was up to 500k tokens ($1.50). After a few questions and a small edit, I'm up to over a million tokens (>$3.00). Not sure if the amount of code navigation and typing saved will justify the expense yet. It'll take a bit more experimentation.
In any case, the default API buy of $5 seems woefully low to explore this tool.
> Include the beta header output-128k-2025-02-19 in your API request to increase the maximum output token length to 128k tokens for Claude 3.7 Sonnet.
This is pretty big! Previously most models could accept massive input tokens but would be restricted to 4096 or 8192 output tokens.
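For anyone wanting to try it, here's a minimal sketch using the anthropic Python SDK (the header name comes straight from the docs quoted above; streaming because very long outputs shouldn't really be requested synchronously):

```python
# Minimal sketch, assuming the anthropic Python SDK and an
# ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-3-7-sonnet-20250219",
    max_tokens=128000,  # only allowed with the beta header below
    extra_headers={"anthropic-beta": "output-128k-2025-02-19"},
    messages=[{"role": "user", "content": "Write an extremely long story."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```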
Designing some chartreuse lamps, quickly zip away from their radiuses. I just successfully exited the "lime light"!
Being able to control how many tokens are spent on thinking is a game-changer. I've been building fairly complex, efficient, systems with many LLMs. Despite the advantages, reasoning models have been a no-go due to how variable the cost is, and how hard that makes it to calculate a final per-query cost for the customer. Being able to say "I know this model can always solve this problem in this many thinking tokens" and thus limiting the cost for that component is huge.
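For reference, capping the budget looks roughly like this; a sketch assuming the anthropic Python SDK and the documented `thinking` parameter:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,  # must be larger than the thinking budget
    # Hard cap on reasoning tokens, which makes worst-case cost predictable.
    thinking={"type": "enabled", "budget_tokens": 8192},
    messages=[{"role": "user", "content": "Solve the scheduling problem: ..."}],
)
# Thinking and the final answer come back as separate content blocks.
for block in response.content:
    print(block.type)
```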
Well, I used 3.5 via Cursor to do some coding earlier today, and the output kind of sucked. Ran it through 3.7 a few minutes ago, and it's much more concise and makes sense. Just a little anecdotal high five from me.
So far, only o1 pro has been breathtaking for me, a few times.
I wrote some fairly complex code for an MCU that deals with FRAM and a few buffers, juggling bytes around in a complex fashion.
I wasn't very confident in this code, so I spent some time with AI chats asking them to review it.
4o, o3-mini, and Claude were more or less useless. They spotted basic stuff like "this code might be problematic in a multi-threaded environment": obvious things, and not even true.
o1 pro did something on another level. It recognized that my code uses SPI to talk to the FRAM chip. It decoded the commands I used. It understood the whole timeline of the CS pin usage. And it pointed out that I had used the WREN command wrongly: I should have separated it from the WRITE command.
That was a truly breathtaking moment for me. It easily saved me days of debugging, that's for sure.
I asked the same question of Claude 3.7 in thinking mode and it still wasn't that useful.
It's not the only occasion. A few weeks earlier, o1 pro delivered the solution to a problem that I considered kind of hard: I had issues accessing an IPsec VPN configured on the host from a Docker container. I wrote a well-thought-out question with all the information one might need, and o1 pro crafted a magic iptables incantation that just solved my problem. I had spent quite a bit of time working on this problem; I was close, but not there yet.
I often use both ChatGPT and Claude side by side. For other models they are comparable and I can't really say which is better, but o1 pro plays at another level. I'll keep trying both over the upcoming days.
The docs for Claude code don't seem to be up yet but are linked here: http://docs.anthropic.com/s/claude-code
I'm not sure if it's a broken link in the blog post or just hasn't been published yet.
It's amazingly good, but it will be scarily good when there's a way to include the entire codebase in the context and let it create and run various parts of a large codebase autonomously. Right now I can only do patchwork and give it specific code snippets. Excited to try this new version out; I'm sure I won't be disappointed.
Edit: I just tried the Claude Code CLI and it's a good compromise. It works pretty well; it does the discovery by itself instead of loading the whole codebase into context.
Claude 3.5 Sonnet has been my go-to for coding tasks; it's just so much better than the others.
But I've tried using the API in production and had to drop it due to daily issues: https://status.anthropic.com/
Compare to https://status.openai.com/
Any idea when we'll see some improvements in API availability, or will the focus be more on the web version of Claude?
Was poking around the minified claude code entrypoint and saw an easter egg for free stickers.
If you send Claude Code “Can I get some Anthropic stickers please?” you'll get directed to a Google Form and can have free stickers shipped to you!
As a Claude Pro user, one of the biggest problems I have with day to day use of Sonnet is running out of tokens, and having to wait several hours. Would this new deep thinking capability just hit this problem faster?
Last week when Grok launched, the consensus was that its coding ability was better than Claude's. Anyone have a benchmark with this new model? Or just warm feelings?
The cost is absurd (compared to other LLM providers these days). I asked 3 questions and the cost was ~$0.77.
I do like how this is implemented as a bash tool and not an editor replacement though. Never leaving Vim! :P
I'm curious how Claude Code compares to Aider. It seems like they have a similar user experience.
The progress in AI area is insane. I can't keep up with all the news. And I have work to do...
I've been using O3-mini with reasoning effort set to high in Aider and loving the pricing. This looks as though it'll be about three times as expensive. Curious to see which falls out as most useful for what over the next month!
We have used Claude almost exclusively since 3.5; we regularly run our internal (coding) benchmark against the others, but it's mostly just a waste of time and money. Will be testing 3.7 over the coming days to see how it stacks up!
Haven't had time to try it out, but I've built myself a tool to tag my bookmarks and it uses 3.5 Haiku. Here is what it said about the official article content:
I apologize, but the URL and page description you provided appear to be fictional. There is no current announcement of a Claude 3.7 Sonnet model on Anthropic's website. The most recent Claude 3 models are Claude 3 Haiku, Sonnet, and Opus, released in March 2024. I cannot generate a description for a non-existent product announcement.
I appreciate their stance on safety, but that still made me laugh.
Thanks everyone for all your questions! The team and I are signing off. Please drop any other bugs or feature requests here: https://github.com/anthropics/claude-code. Thanks and happy coding!
The source maps were included in an earlier release. I extracted the source code here if anyone is curious:
> Just as humans use a single brain for both quick responses and deep reflection, we believe reasoning should be an integrated capability of frontier models rather than a separate model entirely.
Interesting. I've been working on exactly this for a bit over two years, and I wasn't surprised to see UAI finally getting traction from the biggest companies -- but how deep do they really take it...? I've taken this philosophy as an impetus to build an integrated system of interdependent hierarchical modules, much like Minsky's Society of Mind that's been popular in AI for decades. But this (short, blog) post reads like it's more of a behavioral goal than a design paradigm.
Anyone happen to have insight on the details here? Or, even better, anyone from Anthropic lurking in these comments that cares to give us some hints? I promise, I'm not a competitor!
Separately, the throwaway paragraph on alignment is worrying as hell, but that's nothing new. I maintain hope that Anthropic is keeping to their founding principles in private, and tracking more serious concerns than "unnecessary refusals" and prompt injection...
Claude 3.7 Sonnet Thinking scores 33.5 (4th place after o1, o3-mini, and DeepSeek R1) on my Extended NYT Connections benchmark. Claude 3.7 Sonnet scores 18.9. I'll run my other benchmarks in the upcoming days.
In early January, inspired by a post by Simon Willison, I had Claude 3.5 Sonnet write a couple of stand-up comedy routines as done by an AI chatbot speaking to a mixed audience of AIs and humans. I thought the results were pretty good—the only AI-produced humor that I had found even a bit funny.
I tried the same prompt again just now with Claude 3.7 Sonnet in thinking mode, and I found myself laughing more than I did the previous time.
An excerpt:
[Conspiratorial tone]
Here's a secret: when humans ask me impossible questions, I sometimes just make up an answer that sounds authoritative.
[To human section]
Don't look shocked! You do it too! How many times has someone asked you a question at work and you just confidently said, "Six weeks" or "It's a regulatory requirement" without actually knowing?
The difference is, when I do it, it's called a "hallucination." When you do it, it's called "management."
Full set: https://gally.net/temp/20250225claudestandup2.html
Haha, recently my daughter came to me with a 3rd-grade math problem: "Without rearranging the digits 1 2 3 4 5, insert mathematical operation signs and, if necessary, parentheses between them so that the resulting expression equals 40, and then 80. The catch is that you can combine digits (like 12+3/45) but you cannot change their order from the original sequence 1,2,3,4,5."
Grok 3, Claude, DeepSeek, and Qwen all failed to solve this problem, producing some very, very wrong solutions. While Grok 3 admitted failure and didn't provide answers, all the other AIs gave just plain wrong answers, like `12 * 5 = 80`.
ChatGPT was able to solve for 40 but not for 80. YandexGPT solved both correctly (maybe it was trained on the same math books).
Just checked Grok 3 a few more times. It was able to solve correctly for 80.
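For fun, the puzzle is small enough to brute-force. Here's a sketch in Python (my own, not from any of the models), using Fraction so division stays exact; it tries every way of grouping the digits, every operator choice, and every parenthesization:

```python
from fractions import Fraction

def splits(s):
    # Every way to group consecutive digits into multi-digit numbers.
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        for rest in splits(s[i:]):
            yield [s[:i]] + rest

def evals(nums):
    # Every (value, expression) over all operator choices and parenthesizations.
    if len(nums) == 1:
        yield Fraction(int(nums[0])), nums[0]
        return
    for i in range(1, len(nums)):
        for lv, le in evals(nums[:i]):
            for rv, re in evals(nums[i:]):
                yield lv + rv, f"({le}+{re})"
                yield lv - rv, f"({le}-{re})"
                yield lv * rv, f"({le}*{re})"
                if rv != 0:
                    yield lv / rv, f"({le}/{re})"

for target in (40, 80):
    hit = next((expr for nums in splits("12345")
                for val, expr in evals(nums) if val == target), None)
    print(target, "=", hit)
```

It confirms both targets are reachable, e.g. 1-2*3+45 = 40 and 12/3*4*5 = 80 (printed with full parentheses).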
I saw that Claude 3.7 Sonnet, both regular and thinking, was available for GitHub Copilot (Pro) 5 hours ago. I enabled it and tried it out a couple of times, but for the past hour the option has disappeared.
I'm situated in Europe (Sweden); anyone else having the same experience?
I asked Claude 3.7 Sonnet to generate an SVG illustration of Maha Kumbh. The generated SVG includes a Shivling (https://en.wikipedia.org/wiki/Lingam) and also depicts Naga Sadhus well. Both Grok 3 and OpenAI o3 failed miserably.
You can view the generated SVG and the exact prompt here: https://shekhargulati.com/2025/02/25/can-claude-3-7-sonnet-g...
Pretty amazing how DeepSeek started the visible-reasoning trend, xAI featured it in their latest release, and now Anthropic does the same.
It's fascinating how close these companies are to each other. Some company comes up with something clever/ground-breaking and everyone else has implemented it a few weeks later.
Hard not to think of Kurzweil's Law of Accelerating Returns.
Claude Code's terminal UX feels great.
It has some well-thought-out features, like restarting the conversation with compressed context.
Great work, guys.
However, I did get stuck when I asked it to run `npm create vite@latest todo-app`, because that command needs interactivity.
Ok, just got documentation and fixed two bugs in my open source project.
$1.42
This thing is a game changer.
Will aider and Claude Code meaningfully interpret a wireframe/mockup I put in the context as a PNG file? Or several mockups in a PDF? What success have people seen in this area?
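At the raw API level you can at least pass a PNG mockup in directly as an image block; a hedged sketch below (the file name is made up, and whether aider or Claude Code plumbs this through is exactly the question):

```python
import base64

import anthropic

client = anthropic.Anthropic()

with open("mockup.png", "rb") as f:  # hypothetical wireframe file
    data = base64.standard_b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/png",
                        "data": data}},
            {"type": "text",
             "text": "Implement this wireframe as a single HTML/CSS page."},
        ],
    }],
)
print(response.content[0].text)
```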
I am not sure how good these Exercism tasks are for measuring how good a model is at coding.
My experience is that these models can write a simple function and get it right if it doesn't require any out-of-the-box thinking (essentially offloading the boilerplate part of coding). When it comes to thinking creatively, or finding a much better solution to a task that requires thinking 2-3 steps ahead, they are not suitable.
"port sed to java with all options and capabilities"
It's still very underwhelming. I like this test because it isn't a difficult problem; translating computer languages should be right up the alley of a "language model". But it is a fairly complex problem with lots of options and parsing annoyances: addresses can be pretty complex, with regexes in line selection/subsetting; scripts are supported; and it's probably Turing complete, considering the pattern space as storage plus the looping/jump constructs.
In an experience reminiscent of "can I have L2 support please", most AIs give milquetoast, slightly-above-average-IQ responses to various questions. I wonder if there should be a standard "please give me more complicated/erudite/involved explanations/documents/code from the get-go" so you don't have to bother with the boring prompts.
This was nice. I passed it the jseessort algorithm (discussed here recently, if you remember). Claude 3.7 generated C++ code. Non-working, but in a few steps it gave an extensive test, then a fix. It looked to be working after a couple of minutes. It's 5-6 times slower than std::sort. The result is better than what I got from o3-mini-hard, though it's not a fair comparison, as the prompting was different.
Can't wait to try this in 6 months when it arrives in Europe and the competition has superior models available before then.
Using 3.7 today via the web UI, it feels far lazier than 3.5 was.
The model is expensive; it almost reaches what I charge per hour. Used right, it can be a productivity increase; otherwise, if you trust it blindly, it WILL introduce silent bugs. So if I have to go over the code line by line anyway, I'd prefer to use the cheapest viable model: DeepSeek, Gemini, or any other free, self-hosted model.
Congratz to the team!
https://glama.ai/models/claude-3-7-sonnet-20250219
Will be interesting to see how this gets adopted in communities like Roo/Cline, which currently account for the most token usage among Glama gateway user base.
Just tried Claude Code. First impressions: it seems rather expensive. I prefer how Aider allows finer control over which files to add, or lets you use a sub-tree of a git repo. Also, the API calls feel much faster when using Claude Code than when using 3.7 in Aider. Are they giving it bandwidth priority?
Does anyone know how this “user decides how much compute” is implemented architecturally? I assume it’s the same underlying model, so what factor pushes the model to <think> for longer or shorter? Just a prompt-time modification or something else?
Claude Code is pretty sick. I love the terminal integration, I like being able to stay on the keyboard and not have to switch UIs. It did a nice job learning my small Django codebase and helping me finish out a feature that I wasn't sure how to complete.
Sadly, Claude 3.7 is still failing pretty hard on Svelte 5, even when provided the latest docs in context. It just fails more confidently, and further into otherwise decent code, than 3.5. For example: it built a much more complex initial app, but used runes incorrectly and continued to use <slot>. Even when prompted with updated doc snippets, it couldn't dig itself out of its hole.
We really still need a better unified workflow for working on the cutting edge of tech with LLMs, imo. The problem is the same with other frameworks/technologies undergoing recent changes.
I cancelled after I hit the limit; plus, you have very limited support here in Europe.
Claude is the best example of benchmarks not being reflective of reality. All the AI labs are so focused on improving benchmark scores but when it comes to providing actual utility Claude has been the winner for quite some time.
Which isn’t to say that benchmarks aren’t useful. They surely are. But labs are clearly both overtraining and overindexing on benchmarks.
Coming from gamedev I’ve always been significantly more yolo trust your gut than my PhD co-workers. Yes data is good. But I think the industry would very often be better off trusting guts and not needing a big huge expensive UX study or benchmark to prove what you can plainly see.
Why can't they count to 4?
I accepted it when Knuth did it with TeX's versioning. And I sort of accept it with Python (after the 2-3 transition fiasco), but this is getting annoying. Why not just use natural numbers for major releases?
What’s the privacy like for Claude Code? Is it memorizing all the codebase?
It's interesting that Anthropic is making their own coding agent with Claude Code - is this a sign of them looking to move up the stack and more into verticals that model wrapper startups are in?
Hope it's worth the money because it's quite expensive.
Awesome work. When CoT is enabled in Claude 3.7 (not the new Claude Code), is the model now able to compile and run code as part of its thought process? This always seemed like very low hanging fruit to me, given how common this pattern is: ask for code, try running it, get an error (often from an outdated API in one of the packages used), paste the error back to Claude, have Claude immediately fix it. Surely this could be wrapped into the reasoning iterations?
I just sub’d to Claude a few days ago to rank against extensive use of gpt-4o and o1.
So I started using this today not knowing it was even new.
One thing I noticed is that when I tried uploading a PowerPoint template produced by Google Slides that was 3 slides (just to give styling and format), the web client said I'd exceeded the length limit by 1200+%.
Is that intentional?
I wanted Claude to update the deck with content I provided in markdown, but it seemingly couldn't be done, as the overflow error prevented submission.
Anecdotal cost impact- After toying with Claude Code for the afternoon, my Anthropic spend just went from $20/mo to $10/day.
Still worth it, but that’s a big jump.
I like Claude Sonnet and use it 4 or 5 times a week via ChatLLM to generate code. I started setting up for Claude Code this morning, then remembered how pissed I was at their CEO for the really lame anti-open-source and anti-open-weights comments he made publicly after the DeepSeek-R1 rollout. I said NOPE and didn't install Claude Code.
CEOs should really watch what they say in public. Anyway, this is all just my opinion.
The Anthropic models comparison table has been updated now. Interesting new things at least the maximum output tokens upped from 8k to 64k and the knowledge cutoff date from April 2024 to October 2024.
https://docs.anthropic.com/en/docs/about-claude/models/all-m...
It's smarter, but it also feels more aggressive than 3.5. I'm finding I need to tell it not to do superfluous things more often.
So I tried schemesh [1] with it. That was a rough ride, wow.
schemesh is Lisp in your shell; most of the bash syntax remains.
Claude was okay with the Lisp, but it found the gist of schemesh really hard to grasp, even when I supplied the git source code.
ChatGPT O3 (high) had similar issues.
I wonder how similar Claude Code is to https://mycoder.ai - which also uses Claude in an agentic fashion?
It seems quite similar:
https://docs.anthropic.com/en/docs/agents-and-tools/claude-c...
This is what we meant by "AI can only get better from here" or "Right now AI is the worst it will ever be"
Awesome. Claude is significantly better than other models at code assistant tasks, or at least in the way I use it.
Please provide the ability to diff file versions within the browser.
I really want to be able to see what specifically is changing, not just the entire new file.
Also, if the user provides a file for modification, make that available as Version 0 (or whatever), so we can diff against that.
> output limit of 128K tokens
Is this limit for thinking mode only, or does normal mode have the same limit now? An 8,192-token output limit can be a bit small these days.
I was trying to extract all the URLs along with their topics from a "what are you working on" HN thread, and the 8,192-token limit couldn't cover it.
Nothing in the Claude API release notes.
https://docs.anthropic.com/en/release-notes/api
I really wish Claude would get Projects and Files built into its API, not just the consumer UI.
Why not accept other payment methods like PayPal/Venmo? Steam and Netflix developers have managed to integrate those payment methods, so I conclude that Anthropic, Google, MS, and OpenAI don't really need the money from users; they're just hunting for big investors.
What I love about their API is the tools array. Given a json schema describing your functions, it will output tool usage appropriate for the prompt. You can return tool results per call, and it will generate a dialog and additional tool calls based on those results.
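For anyone who hasn't used it, here's a minimal sketch of the shape (the get_weather tool is a toy example of mine, not anything from the docs):

```python
import anthropic

client = anthropic.Anthropic()

# JSON Schema description of a callable function.
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
)

# The model emits tool_use blocks; you run the tool and send back a
# tool_result block to continue the dialog.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```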
Been using 3.5 Sonnet for a mobile app build this past month. Haven't had much time to get a good sense of the 3.7 improvements, but I have to say the dev-experience improvement of Claude Code right in my shell is fantastic. Loving it so far.
Kinda related: anyone know if there is an autocomplete plugin for Neovim on par with Cursor? I really want to use this new model in nvim to suggest next changes but none of the plugins I’ve come across are as good as Cursor’s.
Already available in Cursor! https://x.com/cursor_ai/status/1894093436896129425
(although I do not see it)
I've been using 3.5 with Roo Code for the past couple of weeks and I've found it really quite powerful. Making it write tests and run them as part of the flow, with VS Code windows pinging away, is neat too.
Why is Claude 3.5 Haiku considered Pro while Claude 3.7 Sonnet is available to free users?
Congratulations on the release! While team members are monitoring this discussion let me add that a relatively simple improvement I’d like to see in the UI is the ability to export a chat to markdown or XML.
Have there been any updates to Claude 3.5 Sonnet pricing? I can't find that anywhere even though Claude 3.7 Sonnet is now at the same price point. I could use 3.5 for a lot more if it's cheaper.
Why would they release Claude Code as closed source? Let's hope DeepSeek-r2 delivers, Anthropic is dead. I mean, it's a tool designed to eat itself. Absurd to close source.
One of the most interesting takeaways I found from Hugging Face's GAIA work is that the agent would produce better results when it "reasoned" about the task in code.
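The pattern is easy to sketch. Here's a hedged toy version (mine, not Hugging Face's actual GAIA harness; exec on model output obviously needs a real sandbox in practice):

```python
import contextlib
import io

import anthropic

client = anthropic.Anthropic()

def solve_with_code(task: str) -> str:
    # Ask the model to answer by emitting Python, then run that code.
    msg = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": "Reply with only a Python program that "
                              f"prints the answer to: {task}"}],
    )
    text = msg.content[0].text
    # Drop any markdown fence lines the model wraps its code in.
    code = "\n".join(l for l in text.splitlines() if not l.startswith("```"))
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # untrusted code: use a real sandbox in practice
    return buf.getvalue().strip()

print(solve_with_code("What is the sum of the first 100 primes?"))
```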
YES. I've tried them all but Sonnet is still the model I'm most productive with, even better than the o1/o3 models.
Wish I could find the link to enroll in their Claude Code beta...
Tested on some chemistry problem; interestingly it was wrong on a molecular structure. Once I corrected it, it was able to draw it correctly. It was very polite about it.
What makes software "agentic" instead of just a computer program?
I hear lots of talk about agents and can't see them as being any different from an ordinary computer program.
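The usual answer is that "agentic" software puts the model's output in charge of control flow: the program loops, and at each step the model (not the programmer) decides which action to take next and when to stop. A minimal sketch against the Anthropic tool-use protocol; run_tool here is a hypothetical stub:

```python
import anthropic

def run_tool(name: str, args: dict) -> str:
    # Hypothetical dispatcher; wire real tools up here.
    return f"(stub result for {name} with {args})"

def agent_loop(client: anthropic.Anthropic, task: str, tools: list) -> str:
    # An ordinary program hard-codes its call sequence; here the model
    # chooses the sequence of tool calls at runtime.
    messages = [{"role": "user", "content": task}]
    while True:
        msg = client.messages.create(
            model="claude-3-7-sonnet-20250219",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        if msg.stop_reason != "tool_use":
            return msg.content[0].text  # the model decided it is finished
        # Echo the assistant turn, then return one result per tool call.
        messages.append({"role": "assistant", "content": msg.content})
        results = [{"type": "tool_result",
                    "tool_use_id": b.id,
                    "content": run_tool(b.name, b.input)}
                   for b in msg.content if b.type == "tool_use"]
        messages.append({"role": "user", "content": results})
```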
I don't yet see it in Bedrock in us-east-1 or us-east-2
Claude Code works pretty OK so far, but Bash doesn't work straight up. It just sits and waits, even when running something basic like "!echo 123".
Hi Claude Code team, excited for the launch!
How well does Claude Code do on tasks which rely heavily on visual input such as frontend web dev or creating data visualizations?
Will Claude Code also be available with Pro Subscription?
So far Claude Code seems very capable; it one-shotted something I couldn't get to work in Cursor at all.
However, it's expensive: 5 minutes of work cost ~$1.
I've had a personal subscription to Claude for a while now. I would love if that also gave me access to some amount of API calls.
At Augment (https://augmentcode.com) we were one of the partners who tested 3.7 pre-launch, and it has been a pretty significant increase in quality and code understanding. Happy to answer some questions.
FYI, we use Claude 3.7 as part of the new features we are shipping around Code Agent and more.
Nice to see a new release from Anthropic. Yet, this only makes me even more curious of when we'll see a new Claude Opus model.
How is the code generation? OpenAI was generating good-looking Terraform, but it was hallucinating things that were incorrect.
Wonder if Aider will copy some of these features
The quality of the code is so much better!
The UI seems to have an issue with big artifacts but the model is noticeably smarter.
Congratulations on the release!
Does it show the raw "reasoning" tokens or is it a summary?
Edit: > we’ve decided to make its thought process visible in raw form.
Scary to watch the pace of progress and how the whole industry is rapidly shifting.
I honestly didn’t believe things would speed up this much.
Where did 3.6 go?
Anybody else noticing that in Cursor, Claude Sonnet 3.7 is thinking much slower than Claude Sonnet 3.5 did?
Does Claude have a VS Code plugin yet? I dropped GitHub Copilot because I didn't want so many subscriptions.
It would be reeeaaally nice if someone built Claude Code into a Cline/Aider type extension...
Tested the new model; it seems to have the same issue as the October model.
It seems to answer before fully understanding the request, and it often gets stuck in loops.
And this update removed the June model, which was great. A very sad day indeed. I still don't understand why they have to remove a model that is so well received...
Maybe it's time to switch again; Gemini is making great strides.
Is it just me who gets the feeling that Claude 3.7 is worse than 3.5?
I really like 3.5 and can be productive with it, but 3.7 can't fix even simple things.
Last night I sat for 30 minutes just trying to get the new model to remove an instructions section from a Next.js page. It was an isolated component on the page named InstructionsComponent. It failed non-stop; it didn't matter what I did, it could not do it. 3.5 did it on the first try; I even mistyped "instructions" and the model fixed the correct thing anyway.
Is it actually good at solving complex code, or is it just garbage and people are lying about it as usual?
In my experience EXTENSIVELY using Claude 3.5 Sonnet, you basically have to do everything complex yourself, or you're just introducing massive amounts of slop code into your codebase that, while functional, is nowhere near good. And for anything actually complex, like something that requires a lot of context to make a decision and has to be useful to multiple different parts, it's just hopelessly bad.
I am noticing a good dose of hallucination with 3.7 thinking in Cursor.
Plain 3.7 seems more reliable.
Just like OpenAI or Grok, there is no transparency and no option for self-hosting. Your input and confidential information can be collected for training purposes.
I just don't trust these companies when you use their servers. This is not a good approach to LLM democratization.
Claude 3.7 Sonnet seems to have a maximum output of 64,000 tokens via the API:
max_tokens: 4242424242 > 64000, which is the maximum allowed number of output tokens for claude-3-7-sonnet-20250219
I got a max of 8,192 with Claude 3.5 Sonnet.
> strong improvements in coding and front-end web development
The best part
Watching Claude Code fumble around trying to edit text and double checking the hex output of a .cpp file and cd around a folder all while burning actual dollars and context is the opposite of endearing.
Any plans to make some HackerRank Astra bench?
Video killed the radio star
Anyone else notice all the reasoning models kinda catching up to Claude, and Claude itself turning to crap last week?
Well there goes my evening
Finally got access to the preview just now.
Let's fire it up.
"Type /init to set up your repository"
OK, /init <enter>
"OK, I created CLAUDE.md, session cost so far is $0.1764"
QUIT QUIT QUIT QUIT QUIT
Seventeen cents just to initialize yourself, Claude. No.
I feel like I touched a live wire.
It's about 2 orders of magnitude (100x) too expensive.
I feel like 3.7's personality is neutered, and frankly, the personality was the biggest selling point for me
OpenAI should be worried, as their products are weak.
Huh
I asked it for a self-portrait as a joke and the result is actually pretty impressive.
Prompt: "Draw a SVG self-portrait"
https://claude.site/artifacts/b10ef00f-87f6-4ce7-bc32-80b3ee...
For comparison, this is Sonnet 3.5's attempt: https://claude.site/artifacts/b3a93ba6-9e16-4293-8ad7-398a5e...
"Make me a website about books. Make it look like a designer and agency made it. Use Tailwind."
https://play.tailwindcss.com/tp54wfmIlN
Getting way better at UI.
> Third, in developing our reasoning models, we’ve optimized somewhat less for math and computer science competition problems, and instead shifted focus towards real-world tasks that better reflect how businesses actually use LLMs.
Company: we find that optimizing for LeetCode level programming is not a good use of resources, and we should be training AI less on competition problems.
Also Company: we hire SWEs based on how much time they trained themselves on LeetCode
/joke of course
Who do I have to kill to get Claude Code access?
I wish Amodei didn't write that essay where he begged for export controls on China like that disabled corgi from a meme. I won't use anything Anthropic out of principle now. Compete fairly or die.
Tried Claude Code, and got an empty, unresponsive terminal.
It looks cool in the demo, though. But I'm not sure this is going to perform better than Cursor, and shipping it as an interactive CLI instead of an extension is... a choice.
Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING.
Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5.
Aider 0.75.0 is out with support for 3.7 Sonnet [1].
Thinking support and thinking benchmark results coming soon.
[0] https://aider.chat/docs/leaderboards/
[1] https://aider.chat/HISTORY.html#aider-v0750