How to split a string in C++?

How do you write a C program to split a string by a delimiter?

  • I think it uses ktochar, but exact code would be appreciated.

  • Answer:

    You said you wanted the exact code. See if this is what you want. A...

Bilesh Ganguly at Quora


Other answers

strtok (http://www.cplusplus.com/reference/cstring/strtok/) is usually used for string 'tokenizing' (splitting into tokens); repeated calls to strtok give you a new token each time. This is perfectly fine if you write single-threaded programs and don't need the source string to remain intact. However, thread safety is important, and I would advise against using it; its use is usually considered a bug. The function is not even reentrant, meaning that two calls from separate threads, even ones that don't work on the same data, can give unexpected results. A reentrant version is strtok_r. It only partially fixes the problem, however, because the design of the function is bad: it modifies the input data, it loses the separator itself (you can't know what it was, because strtok replaces it with \0), and it's generally a mess. Unfortunately, the best answer is to 'write your own'. Not even the STL in C++ offers you anything better. The preferred way is rolling your own, using strchr (http://www.cplusplus.com/reference/cstring/strchr/) and friends.

Dorin Lazăr
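
As a rough illustration of the 'roll your own' advice in the answer above, here is a minimal sketch of a non-destructive, reentrant splitter built on strchr. The function name next_field and its cursor-style interface are made up for this example; it assumes a single-character delimiter and keeps empty fields.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the start of the next field and stores its
 * length in *len. *cursor is advanced past the delimiter, or set to NULL
 * after the last field. Returns NULL once *cursor is NULL.
 * The input string is never modified. */
const char *next_field(const char **cursor, char delim, size_t *len)
{
    assert(cursor != NULL && len != NULL);

    const char *start = *cursor;
    if (start == NULL)
        return NULL;

    /* strchr(s, '\0') would match the terminator, so treat a NUL
     * delimiter as "no delimiter" and return the whole string once. */
    const char *hit = (delim != '\0') ? strchr(start, delim) : NULL;
    if (hit != NULL) {
        *len = (size_t)(hit - start);
        *cursor = hit + 1;
    } else {
        *len = strlen(start);
        *cursor = NULL;
    }
    return start;
}
```

Each call returns a pointer into the original string plus a length, so a caller would print a field with something like printf("%.*s\n", (int)len, field). All state lives in the caller's cursor, which is what makes the sketch safe to use from several threads on different strings.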

Executive Summary: Use strtok_r for natural languages and strsep for everything else.

I think the answers so far illustrate beautifully how confusing this issue can be. :p There are many ways to do this, but most of the ones you'll hear are often... less than ideal.

Some people mentioned strtok, which was the Old Way. It uses global state variables which are hidden from you, making it a poor choice for multi-threaded code, or even for sufficiently complicated code that requires you to use it in more than one function; it's easy to accidentally start tokenizing one string using another's context, making your program behave unpredictably. Outside of simple, quick & dirty programs, you should avoid it.

Only one person (as of this writing) suggested strtok_r. This is a fine suggestion. It works by putting that hidden global state variable into an opaque "context" variable you must pass for each invocation. This makes it effective but, in my opinion, ugly.

Someone else suggested the use of strcspn, which only returns the length of the next token and therefore requires that you do your own string splitting. It's not a bad idea, but there are easier ways that do that work for you.

One person reinvented the wheel that is the lexer (http://en.wikipedia.org/wiki/Lexical_analysis), but in an inefficient and highly insecure way, and still another gave you C++ when you asked for C. Those are known risks when asking the Internet to help you with C, I'm afraid.

My preferred method of tokenizing a string is to use strsep. Like strtok_r (and strtok before it), it overwrites delimiter characters in the string with null bytes, so if you want to keep the original, untokenized string, you had best make a copy (strdup is the easiest way to do this). strsep has a few differences from strtok_r that will be made apparent in the example code below.
The gist of it is that strtok_r is better for tokenizing natural languages (like English) and strsep is better for more well-defined input like CSV.

Example strtok_r code:

The following is a simple program that reads in one line at a time, tokenizes each line using strtok_r, and then prints all of its tokens. I split on whitespace as well as commas to illustrate an important difference in the way strtok_r and strsep behave regarding starting and ending delimiters as well as repeated delimiters.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    char **strsplit(const char* str, const char* delim, size_t* numtokens) {
        // copy the original string so that we don't overwrite parts of it
        // (don't do this if you don't need to keep the old line,
        // as this is less efficient)
        char *s = strdup(str);

        // these three variables are part of a very common idiom to
        // implement a dynamically-growing array
        size_t tokens_alloc = 1;
        size_t tokens_used = 0;
        char **tokens = calloc(tokens_alloc, sizeof(char*));

        char *token, *strtok_ctx;
        for (token = strtok_r(s, delim, &strtok_ctx);
                token != NULL;
                token = strtok_r(NULL, delim, &strtok_ctx)) {
            // check if we need to allocate more space for tokens
            if (tokens_used == tokens_alloc) {
                tokens_alloc *= 2;
                tokens = realloc(tokens, tokens_alloc * sizeof(char*));
            }
            tokens[tokens_used++] = strdup(token);
        }

        // cleanup
        if (tokens_used == 0) {
            free(tokens);
            tokens = NULL;
        } else {
            tokens = realloc(tokens, tokens_used * sizeof(char*));
        }
        *numtokens = tokens_used;
        free(s);

        return tokens;
    }

    int main(void) {
        char *line = NULL;
        size_t linelen;

        char **tokens;
        size_t numtokens;

        while (getline(&line, &linelen, stdin) != -1) {
            tokens = strsplit(line, ", \t\n", &numtokens);
            for (size_t i = 0; i < numtokens; i++) {
                printf(" token: \"%s\"\n", tokens[i]);
                free(tokens[i]);
            }
            if (tokens != NULL)
                free(tokens);
        }

        if (line != NULL)
            free(line);

        return EXIT_SUCCESS;
    }

In my opinion, the need for that context variable (strtok_ctx in my example, above) is ugly, and the fact that strtok_r requires a different calling convention for tokens after the first makes it annoying to write loops for.

Example run:

    $ gcc -Wall -Wextra -Werror -g -o split split.c
    $ ./split
    foo,bar,baz
     token: "foo"
     token: "bar"
     token: "baz"
    ,,foo,,bar,,baz,,
     token: "foo"
     token: "bar"
     token: "baz"
    ^D
    $

Note how, in the second line, any delimiters at the start and end are skipped, and several consecutive occurrences of delimiters are treated as one delimiter. This makes strtok_r great for tokenizing natural languages because you'd be filtering out all of those blanks anyway, but less viable for, say, CSV files, where you definitely want to keep the empty values. That is, if ,,foo,,bar,,baz,, were in a CSV file, you would be expecting 9 values, the 3rd, 5th, and 7th of which are non-empty strings, but strtok_r would only give you the non-empty values with no way of knowing the positions in which they appear.

Example strsep code:

I'll just rewrite the strsplit function from the earlier example.

    char **strsplit(const char* str, const char* delim, size_t* numtokens) {
        char *s = strdup(str);

        size_t tokens_alloc = 1;
        size_t tokens_used = 0;
        char **tokens = calloc(tokens_alloc, sizeof(char*));

        char *token, *rest = s;
        while ((token = strsep(&rest, delim)) != NULL) {
            if (tokens_used == tokens_alloc) {
                tokens_alloc *= 2;
                tokens = realloc(tokens, tokens_alloc * sizeof(char*));
            }
            tokens[tokens_used++] = strdup(token);
        }

        if (tokens_used == 0) {
            free(tokens);
            tokens = NULL;
        } else {
            tokens = realloc(tokens, tokens_used * sizeof(char*));
        }
        *numtokens = tokens_used;
        free(s);

        return tokens;
    }

The primary difference to notice in the code is that the huge, multi-line for statement can be replaced with a much nicer while loop. And while strsep does need an extra char* variable just like strtok_r does, you actually know what it will be used for: storing the rest of the string after the token.
This makes strsep more convenient if your program wants to partially tokenize a string (e.g. if your program takes command lines and dispatches to other functions based on the command).

Example run:

    $ gcc -Wall -Wextra -Werror -g -o split split.c
    $ ./split
    foo bar baz
     token: "foo"
     token: "bar"
     token: "baz"
     token: ""
    ,,foo,,bar,,baz,,
     token: ""
     token: ""
     token: "foo"
     token: ""
     token: "bar"
     token: ""
     token: "baz"
     token: ""
     token: ""
     token: ""
    ^D
    $

The main difference is that each and every delimiter is split on; start and end delimiters are not skipped and consecutive delimiters are not lumped together as a single delimiter.

The other important thing to notice is where that final empty token comes from in each line: that's the result of splitting on the newline character, '\n'. Because strsep does not ignore delimiters at the end of the string, it splits on the newline character as well (when it's one of the delimiter characters), producing an extra, empty token. Depending upon your use case, you may want to check for a terminating newline and remove it before tokenizing the line to avoid getting this extra token.

It deserves mention, however, that for sufficiently complicated tokenization rules (e.g. as used by compilers when reading in your source code), it is best to use a lexer. A lexer is a developer tool that takes regular expressions describing tokens and automatically generates efficient functions for you to use to tokenize input according to those rules. The most popular lexer for C is flex (http://flex.sourceforge.net/). Only the sorriest excuses of Linux package managers would fail to include flex in their repositories; you should have no trouble in getting a hold of it.

Costya Perepelitsa
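
The newline advice in the answer above might look like this in practice. This is only a sketch: chomp and split_fields are hypothetical helper names, the fixed-size output array is a simplification of the answer's growing array, and strsep is a BSD/glibc extension rather than ISO C, so the feature-test macro is an assumption about the platform.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* assumption: glibc, where this exposes strsep */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Truncates line at the first '\r' or '\n', if there is one. */
void chomp(char *line)
{
    line[strcspn(line, "\r\n")] = '\0';
}

/* Strips the line ending, then splits line in place on any character in
 * delim. Stores up to max token pointers in out and returns how many were
 * stored. The pointers refer into line, so line must outlive them. */
size_t split_fields(char *line, const char *delim, char **out, size_t max)
{
    assert(line != NULL && delim != NULL && out != NULL);

    size_t n = 0;
    char *rest = line;
    char *token;

    chomp(line);
    while (n < max && (token = strsep(&rest, delim)) != NULL)
        out[n++] = token;
    return n;
}
```

With the line ending removed first, an input of ",,foo,,bar,,baz,," followed by a newline and split on "," yields the 9 fields the answer says a CSV reader would expect, with no extra empty token at the end.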

I prefer using the strcspn[1] function. strtok is also an alternative, but it has two problems associated with it:

(1) It modifies the string being parsed in place, which in many cases is unacceptable because you may want to refer to the same string somewhere down the line.

(2) It is not thread safe, so in a multithreaded program you should not use it.

strtok_r overcomes the problem of thread safety, but (1) still remains. I am not providing any code because it sounds like a homework assignment.

[1]: http://netbsd.gw.com/cgi-bin/man-cgi?strcspn++NetBSD-current

Abhinav Upadhyay
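
The answer above deliberately gives no code for the strcspn approach, so here is one possible shape for it, offered as a sketch rather than the author's intended solution. next_token is a hypothetical name. It pairs strcspn with strspn so that runs of delimiters are skipped the way strtok skips them, but without writing to the input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: finds the next non-empty token at or after *cursor.
 * Returns its start and stores its length in *len, or returns NULL when
 * only delimiters (or nothing) remain. The input is never modified. */
const char *next_token(const char **cursor, const char *delims, size_t *len)
{
    assert(cursor != NULL && *cursor != NULL);
    assert(delims != NULL && len != NULL);

    /* strspn: length of the leading run made up only of delimiters */
    const char *start = *cursor + strspn(*cursor, delims);
    if (*start == '\0') {
        *cursor = start;
        return NULL;
    }

    /* strcspn: length of the leading run containing no delimiter */
    *len = strcspn(start, delims);
    *cursor = start + *len;
    return start;
}
```

Because the token is reported as a pointer and a length, the caller decides whether to copy it, print it with "%.*s", or compare it in place. That addresses problem (1) from the answer, and the caller-owned cursor avoids the hidden state behind problem (2).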

It's something I did just this semester in my CS200 class. We would call it the string split function: it takes the string and the delimiter, which is assumed to be a single character, as arguments and returns a vector of strings. I'm writing from my phone, so excuse the non-code format.

EDIT: This method might seem a bit inefficient or impractical for some situations, but it was meant to be used in a checkers program to split a series of board positions such as C6-D5-C7, which can be further processed to play a checkers move.

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    std::vector<std::string> string_split(const std::string& arg, char delim)
    {
        std::string cont;              // token accumulated so far
        std::vector<std::string> ans;  // finished tokens
        for (char a : arg) {
            if (a == delim) {
                ans.push_back(cont);
                cont = ""; // forgot to add this earlier
            } else {
                cont = cont + a;
            }
        }
        ans.push_back(cont);           // the last token has no trailing delimiter
        return ans;
    }

Hammad Mazhar

It seems obvious to use the standard C string library in almost all applications to solve such a problem. I up-voted the great summary by Costya.

However, when resources are extremely limited, when speed is extremely critical, or in a bare-metal environment where not even a standard string library is available, a "reinvent the wheel" approach is not a bad idea for a case as simple as this one.

Assume the delimiter is just one character. In many cases, it is. For example, we often use path1:path2:path3 as the input parameter for a list of file paths.

So, here are two approaches I use to split a null (0) terminated string with a single-character delimiter.

1. Use a while loop in an inline function

    static inline void str_split(char *str, char delim)
    {
        char *p = str;
        while (p[0]) {
            if (p[0] == delim)
                p[0] = 0;   /* overwrite the delimiter with a terminator */
            p++;
        }
    }

2. Use #define and a while loop in just one line

    #define str_split(str, delim) do { char *p_ = (str); while (p_[0]) { if (p_[0] == (delim)) p_[0] = 0; p_++; } } while (0)

The good things about the above are:

  • It does not use any library function, so there is no need to build the image or executable with a string library in cases where you want to avoid one.

  • It is extremely fast, and the compiler can inline the whole function for the best optimization.

  • It is thread safe.

In case you are not sure the input string is good, you can simply add a boundary check in the code to avoid potential memory corruption.

Henry Zhu
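
One thing the answer above leaves implicit is how to read the pieces back once the delimiters have been overwritten. A possible sketch, with hypothetical names split_in_place and next_piece: count the pieces while splitting, then step from one terminator to the next. No string-library calls are used, in keeping with the bare-metal premise; assert.h is included only for the precondition check and could be dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Overwrites every delim in str with '\0' and returns the number of
 * pieces, which is the number of delimiters found plus one. */
size_t split_in_place(char *str, char delim)
{
    assert(str != NULL);

    size_t pieces = 1;
    for (char *p = str; *p != '\0'; p++) {
        if (*p == delim) {
            *p = '\0';
            pieces++;
        }
    }
    return pieces;
}

/* Returns the piece that follows p. Only valid while pieces remain,
 * so the caller must count calls against split_in_place's result. */
const char *next_piece(const char *p)
{
    while (*p != '\0')
        p++;
    return p + 1;
}
```

For "usr:bin::lib" split on ':', split_in_place reports 4 pieces, and walking them with next_piece visits "usr", "bin", an empty piece, and "lib". The count is what stops the caller from walking past the end of the buffer.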

Responses that suggest strtok_r() rely on additional library routines and not on C itself, so they are not the right answer. What the question is asking for is the code to manipulate a string, not which external library calls are available to do it instead. And a clear reason to know how to write it yourself is exactly that all these library routines have the side effects listed in the conversations. An example could be something like:

    #include <string.h>

    // return the string after the last occurrence of the delimiter,
    // or NULL if the delimiter is not found
    char *splitString( char *src, char delimiter )
    {
        char *ret = NULL;
        size_t len = strlen( src );
        for( size_t idx = 0; idx < len; idx++ )
        {
            if( src[idx] == delimiter )
            {
                ret = &src[idx+1];
            }
        } // end for
        return ret;
    } // end splitString()

Now obviously this is unfinished, because it does not check whether the arguments passed in are valid, and it does not handle length limits, Unicode, the delimiter being the last char, etc. But this is a much better suggestion than strtok_r(). This is like a homework or interview problem to show C++ competency, and strtok_r() would not do.

Kirk Augustin

There can be many positions to split a string at, depending on its length. Here is how to do it for one. Compile x.c:

    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        setlocale(LC_ALL, "");
        printf("%'Id\n", 2014);
    }

Running ./x prints:

    2,014

Sameer Gupta
